Semantic-Aware Neural Query Optimization: Bridging the Gap Between Distributed Frameworks and Large Language Models at Terabyte Scale

Authors

  • Vaibhav Sudhanshu Naik

DOI:

https://doi.org/10.22399/ijcesen.4943

Keywords:

Distributed Query Optimization, Large Language Models, Data Skew Mitigation, Learned Query Optimization, Terabyte-Scale Processing

Abstract

As processing and managing terabyte-scale datasets has become commonplace in business environments, distributed SQL engines such as Apache Spark SQL and Trino have proliferated. Despite their advanced Cost-Based Optimizers (CBOs) and adaptive execution plans, the statistical heuristics underlying these engines become less effective on large-scale, high-dimensional, and skewed data. On the architectural side, this article examines the NP-hard join-ordering problem, the cost of shuffles, and the fragility of the pipelined execution model in MPP systems. It observes that learned optimizers from the literature, such as Neo, Bao, and LITHE, along with the learned optimization strategies currently used in industrial systems, lack the semantic reasoning capabilities needed to optimize queries through rewriting and physical planning hints. To address this, the Semantic-Aware Neural Query Optimization (SANQO) framework is proposed. SANQO integrates Large Language Models (LLMs) as a supervisory agent within the database kernel, using a novel retrieve-then-rewrite paradigm to inject precise execution hints (e.g., broadcast thresholds, skew salting) and to perform structural rewrites that elude static evaluation. Through a rigorous comparative evaluation against related work, this article demonstrates that SANQO offers a plausible and strictly superior pathway for optimizing distributed workloads, transforming the database from a passive execution engine into an active, reasoning system.
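
As an illustration of one hint type named in the abstract, skew salting, the following is a minimal sketch in plain Python. It is not SANQO's implementation: the function names, the partition and salt counts, and the byte-sum partitioner (a deterministic stand-in for an engine's hash partitioner) are all assumptions made for this example. It shows the general technique, in which a random salt is appended to hot keys on the large side of a join, and the matching rows on the small side are replicated once per salt so that the join result is unchanged.

```python
import random
from collections import Counter

NUM_PARTITIONS = 8  # assumed partition count, for illustration only
SALT_BUCKETS = 8    # assumed number of salt values per hot key


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Deterministic stand-in for a hash partitioner (hypothetical)."""
    # Summing bytes avoids Python's per-process string hash randomization.
    return sum(str(key).encode()) % num_partitions


def salt_large_side(rows, hot_keys, salt_buckets=SALT_BUCKETS, seed=0):
    """Attach a random salt to rows with a hot key; other rows get salt 0."""
    rng = random.Random(seed)
    return [
        ((k, rng.randrange(salt_buckets) if k in hot_keys else 0), v)
        for k, v in rows
    ]


def replicate_small_side(rows, hot_keys, salt_buckets=SALT_BUCKETS):
    """Copy each hot-key row once per salt so every salted key finds a match."""
    out = []
    for k, v in rows:
        salts = range(salt_buckets) if k in hot_keys else (0,)
        out.extend(((k, s), v) for s in salts)
    return out


def max_partition_load(keyed_rows):
    """Row count of the most heavily loaded partition."""
    loads = Counter(partition_of(k) for k, _ in keyed_rows)
    return max(loads.values())


def join_count(large, small):
    """Number of rows an equi-join on the key would produce."""
    small_index = Counter(k for k, _ in small)
    return sum(small_index[k] for k, _ in large)


def demo():
    """Build a skewed join input and compare plain versus salted execution."""
    hot_keys = {"hot"}
    # 80% of the large side shares a single key: a classic skewed join.
    large = [("hot", i) for i in range(8000)]
    large += [(f"k{i}", i) for i in range(2000)]
    small = [("hot", "dim")] + [(f"k{i}", "dim") for i in range(2000)]

    salted_large = salt_large_side(large, hot_keys)
    salted_small = replicate_small_side(small, hot_keys)

    return {
        "plain_max_load": max_partition_load(large),
        "salted_max_load": max_partition_load(salted_large),
        "plain_join_rows": join_count(large, small),
        "salted_join_rows": join_count(salted_large, salted_small),
    }
```

Calling `demo()` returns the heaviest partition's row count and the join cardinality with and without salting. In this sketch the salted variant lowers the peak partition load while leaving the join cardinality unchanged, at the cost of replicating the hot-key rows on the small side.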

References

1. DataSturdy, "Apache Spark vs. Trino". [Online]. Available: https://datasturdy.com/apache-spark-vs-trino/

2. Nikhil Joshi, "Trino vs Spark: A Practical Comparison for Data Processing Needs", Snic Solutions, Jul. 2025. [Online]. Available: https://snicsolutions.com/compare/trino-vs-spark

3. Shuu, "Reducing Peak Memory Usage in Trino: A SQL-First Approach", Medium, May 2025. [Online]. Available: https://medium.com/@shuu1203/reducing-peak-memory-usage-in-trino-a-sql-first-approach-fc687f07d617

4. Peter Bailis et al., "Infrastructure for Usable Machine Learning: The Stanford DAWN Project", arXiv, 2017. [Online]. Available: https://arxiv.org/pdf/1705.07538

5. Ryan Marcus et al., "Neo: A Learned Query Optimizer", VLDB Endowment. [Online]. Available: https://www.vldb.org/pvldb/vol12/p1705-marcus.pdf

6. Ryan Marcus et al., "Bao: Making Learned Query Optimization Practical", SIGMOD ’21, ACM, 2021. [Online]. Available: https://15799.courses.cs.cmu.edu/spring2022/papers/17-queryopt1/marcus-sigmod2021.pdf

7. Claude Lehmann et al., "Is Your Learned Query Optimizer Behaving As You Expect?", arXiv, 2024. [Online]. Available: https://arxiv.org/html/2309.01551v2

8. Tomer Ben David, "Trino versus Apache Spark", Medium, 2024. [Online]. Available: https://medium.com/@Tom1212121/trino-versus-apache-spark-a013ca8c6906

9. Trino, "Spill to disk". [Online]. Available: https://trino.io/docs/current/admin/spill.html

10. Trino, "Cost-based optimizations". [Online]. Available: https://trino.io/docs/current/optimizer/cost-based-optimizations.html

11. AWS, "Optimize shuffles". [Online]. Available: https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/optimize-shuffles.html

12. Spark, "Tuning Spark". [Online]. Available: https://spark.apache.org/docs/latest/tuning.html

13. Ram Avasarala, "What I Learned About Spark Shuffles After 8 Years of Writing Production Jobs", Medium, Oct. 2025. [Online]. Available: https://medium.com/@sairam94.a/what-i-learned-about-spark-shuffles-after-8-years-of-writing-production-jobs-33d454c92150

14. Carson Wang et al., "Spark SQL* Adaptive Execution at 100 TB", Intel, 2018. [Online]. Available: https://www.intel.com/content/www/us/en/developer/articles/technical/spark-sql-adaptive-execution-at-100-tb.html

15. Spark, "Performance Tuning". [Online]. Available: https://spark.apache.org/docs/latest/sql-performance-tuning.html

16. Stanford Dawn, "A Five-Year Research Project to Democratize AI". [Online]. Available: https://dawn.cs.stanford.edu/

17. Daniel Kang et al., "BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics", VLDB Endowment. [Online]. Available: https://people.eecs.berkeley.edu/~matei/papers/2020/vldb_blazeit.pdf

18. Sriram Dharwada et al., "Query Rewriting via LLMs", arXiv, Sep. 2025. [Online]. Available: https://arxiv.org/abs/2502.12918

19. Zhaoyan Sun et al., "R-Bot: An LLM-based Query Rewrite System", VLDB Endowment. [Online]. Available: https://www.vldb.org/pvldb/vol18/p5031-li.pdf

20. Liana Patel et al., "Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS", VLDB Endowment. [Online]. Available: https://www.vldb.org/pvldb/vol18/p4171-patel.pdf

21. Trino, "Dynamic filtering". [Online]. Available: https://trino.io/docs/current/admin/dynamic-filtering.html

22. Trino, "General properties". [Online]. Available: https://trino.io/docs/current/admin/properties-general.html

23. Databricks, "Hints", Dec. 2025. [Online]. Available: https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-hints

24. Ajay Gupta, "Five Tips to Fasten Skewed Joins in Apache Spark", Medium, 2022. [Online]. Available: https://medium.com/data-science/five-tips-to-fasten-your-skewed-joins-in-apache-spark-420f558b219e

25. Sergei Petrunia, "Lessons for the optimizer from TPC-DS benchmark", MariaDB. [Online]. Available: https://mariadb.org/wp-content/uploads/2019/03/lessons-from-tpcds-mariadb-unconf2018.pdf

26. Peter Akioyamen et al., "The Unreasonable Effectiveness of LLMs for Query Optimization", NeurIPS - Penn Engineering. [Online]. Available: https://neurips.cc/media/neurips-2024/Slides/103605.pdf

Published

2026-02-21

How to Cite

Vaibhav Sudhanshu Naik. (2026). Semantic-Aware Neural Query Optimization: Bridging the Gap Between Distributed Frameworks and Large Language Models at Terabyte Scale. International Journal of Computational and Experimental Science and Engineering, 12(1). https://doi.org/10.22399/ijcesen.4943

Section

Research Article