Survey Data Engineering: Normalizing and Automating Semi-Structured Data Pipelines
DOI:
https://doi.org/10.22399/ijcesen.4895
Keywords:
Survey Data Engineering, Semi-Structured Data, Data Pipeline Automation, Data Normalization, Schema Drift
Abstract
The sheer volume of data collected through web-based surveys makes building scalable and reliable data pipelines a significant engineering problem. Survey data is typically semi-structured: it combines heterogeneous formats, nested forms, language variations, and changing schemas. When responses are produced by different tools and in different languages, they must be normalized into a standard structure before they can be analyzed for actionable information. This paper explains the scope and approach of survey data engineering, treating schema reconciliation, metadata parsing, and linguistic standardization as the most critical areas of normalization. It also describes how to automate end-to-end pipelines using modular architectures, real-time orchestration, and cloud-native technologies. Challenges such as schema drift, multilingual inconsistency, data quality problems, and compliance requirements are addressed as well. The paper concludes with best practices for developing resilient, repeatable, and scalable survey data workflows, arguing that automation can be strategic in modern data-driven environments.
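As an illustration of the normalization and schema-drift ideas named in the abstract, the following is a minimal sketch, not the paper's method. It assumes survey responses arrive as nested JSON-like dictionaries; the helper names `flatten` and `schema_drift`, the `EXPECTED_COLUMNS` set, and the sample response are all hypothetical.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted column names; other values are kept as-is."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections such as an "answers" block.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def schema_drift(expected: set[str], record: dict[str, Any]) -> dict[str, set[str]]:
    """Compare a flattened record's columns with the expected column set."""
    seen = set(record)
    return {"missing": expected - seen, "unexpected": seen - expected}


# Hypothetical expected schema and response, for illustration only.
EXPECTED_COLUMNS = {"respondent_id", "locale", "answers.q1", "answers.q2"}

response = {
    "respondent_id": "r-001",
    "locale": "de-DE",
    "answers": {"q1": 4, "q3": "sehr gut"},
}

flat = flatten(response)
drift = schema_drift(EXPECTED_COLUMNS, flat)
print(flat)
print(drift)
```

In this sketch a question that disappears from a survey version shows up under `missing` and a newly added one under `unexpected`, so a pipeline could route such records for review rather than silently dropping columns.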
License
Copyright (c) 2023 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.