Semantic-Aware Neural Query Optimization: Bridging the Gap Between Distributed Frameworks and Large Language Models at Terabyte Scale
DOI:
https://doi.org/10.22399/ijcesen.4943
Keywords:
Distributed Query Optimization, Large Language Models, Data Skew Mitigation, Learned Query Optimization, Terabyte-Scale Processing
Abstract
Processing and managing terabyte-scale datasets is now commonplace in business environments, and distributed SQL engines such as Apache Spark SQL and Trino have proliferated to meet that need. Despite their advanced Cost-Based Optimizers (CBOs) and adaptive execution plans, the statistical heuristics underlying these engines become less effective in the presence of large-scale datasets, high-dimensional data, and data skew. On the architectural side, this article considers the NP-hard join ordering problem, the cost of shuffles, and the fragility of the pipelined execution model in massively parallel processing (MPP) systems. Learned optimizers documented in the literature, such as Neo, Bao, and LITHE, along with the learned optimization strategies currently used in industrial systems, are observed to lack the semantic reasoning capabilities needed to optimize queries through rewriting and physical planning hints. To address this, the Semantic-Aware Neural Query Optimization (SANQO) framework is proposed. SANQO integrates Large Language Models (LLMs) as a supervisory agent within the database kernel, using a novel retrieve-then-rewrite paradigm to inject precise execution hints (e.g., broadcast thresholds, skew salting) and to perform structural rewrites that elude static evaluation. Through a rigorous comparative evaluation against related work, this article demonstrates that SANQO offers a plausible, strictly superior pathway for optimizing distributed workloads, transforming the database from a passive execution engine into an active reasoning system.
References
1. DataSturdy, "Apache Spark vs. Trino". [Online]. Available: https://datasturdy.com/apache-spark-vs-trino/
2. Nikhil Joshi, "Trino vs Spark: A Practical Comparison for Data Processing Needs", Snic Solutions, Jul. 2025. [Online]. Available: https://snicsolutions.com/compare/trino-vs-spark
3. Shuu, "Reducing Peak Memory Usage in Trino: A SQL-First Approach", Medium, May 2025. [Online]. Available: https://medium.com/@shuu1203/reducing-peak-memory-usage-in-trino-a-sql-first-approach-fc687f07d617
4. Peter Bailis et al., "Infrastructure for Usable Machine Learning: The Stanford DAWN Project", arXiv, 2017. [Online]. Available: https://arxiv.org/pdf/1705.07538
5. Ryan Marcus et al., "Neo: A Learned Query Optimizer", VLDB Endowment. [Online]. Available: https://www.vldb.org/pvldb/vol12/p1705-marcus.pdf
6. Ryan Marcus et al., "Bao: Making Learned Query Optimization Practical", SIGMOD ’21-ACM, 2021. [Online]. Available: https://15799.courses.cs.cmu.edu/spring2022/papers/17-queryopt1/marcus-sigmod2021.pdf
7. Claude Lehmann et al., "Is Your Learned Query Optimizer Behaving As You Expect?", arXiv, 2024. [Online]. Available: https://arxiv.org/html/2309.01551v2
8. Tomer Ben David, "Trino versus Apache Spark", Medium, 2024. [Online]. Available: https://medium.com/@Tom1212121/trino-versus-apache-spark-a013ca8c6906
9. Trino, "Spill to disk". [Online]. Available: https://trino.io/docs/current/admin/spill.html
10. Trino, "Cost-based optimizations". [Online]. Available: https://trino.io/docs/current/optimizer/cost-based-optimizations.html
11. AWS, "Optimize shuffles". [Online]. Available: https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/optimize-shuffles.html
12. Spark, "Tuning Spark". [Online]. Available: https://spark.apache.org/docs/latest/tuning.html
13. Ram Avasarala, "What I Learned About Spark Shuffles After 8 Years of Writing Production Jobs", Medium, Oct. 2025. [Online]. Available: https://medium.com/@sairam94.a/what-i-learned-about-spark-shuffles-after-8-years-of-writing-production-jobs-33d454c92150
14. Carson Wang et al., "Spark SQL* Adaptive Execution at 100 TB", Intel, 2018. [Online]. Available: https://www.intel.com/content/www/us/en/developer/articles/technical/spark-sql-adaptive-execution-at-100-tb.html
15. Spark, "Performance Tuning". [Online]. Available: https://spark.apache.org/docs/latest/sql-performance-tuning.html
16. Stanford Dawn, "A Five-Year Research Project to Democratize AI". [Online]. Available: https://dawn.cs.stanford.edu/
17. Daniel Kang et al., "BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics", VLDB Endowment. [Online]. Available: https://people.eecs.berkeley.edu/~matei/papers/2020/vldb_blazeit.pdf
18. Sriram Dharwada et al., "Query Rewriting via LLMs", arXiv, Sep. 2025. [Online]. Available: https://arxiv.org/abs/2502.12918
19. Zhaoyan Sun et al., "R-Bot: An LLM-based Query Rewrite System", VLDB Endowment. [Online]. Available: https://www.vldb.org/pvldb/vol18/p5031-li.pdf
20. Liana Patel et al., "Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS", VLDB Endowment. [Online]. Available: https://www.vldb.org/pvldb/vol18/p4171-patel.pdf
21. Trino, "Dynamic filtering". [Online]. Available: https://trino.io/docs/current/admin/dynamic-filtering.html
22. Trino, "General properties". [Online]. Available: https://trino.io/docs/current/admin/properties-general.html
23. Databricks, "Hints", Dec. 2025. [Online]. Available: https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-hints
24. Ajay Gupta, "Five Tips to Fasten Skewed Joins in Apache Spark", Medium, 2022. [Online]. Available: https://medium.com/data-science/five-tips-to-fasten-your-skewed-joins-in-apache-spark-420f558b219e
25. Sergei Petrunia, "Lessons for the optimizer from TPC-DS benchmark", MariaDB. [Online]. Available: https://mariadb.org/wp-content/uploads/2019/03/lessons-from-tpcds-mariadb-unconf2018.pdf
26. Peter Akioyamen et al., "The Unreasonable Effectiveness of LLMs for Query Optimization", NeurIPS - Penn Engineering. [Online]. Available: https://neurips.cc/media/neurips-2024/Slides/103605.pdf
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.