Real-Time Messaging with Hybrid-State Architectures: Optimizing Latency and Cost in Generative AI-Enhanced Real-Time Messaging at the Edge
DOI: https://doi.org/10.22399/ijcesen.5000

Keywords: Edge Intelligence, Speculative Decoding, Hybrid-State Architecture, Small Language Models, Confidential Computing

Abstract
Real-time messaging applications are rapidly integrating generative artificial intelligence capabilities, yet the prevailing cloud-only inference paradigm presents significant challenges in cost, latency, and user privacy. As messaging platforms scale to serve millions of concurrent users, routing every user utterance through large cloud-hosted language models creates a thundering-herd effect on centralized GPU clusters, inflating token costs and degrading time-to-first-token performance. Simultaneously, transmitting all raw user text to remote servers amplifies privacy exposure, particularly for sensitive data categories such as personally identifiable information and criminal justice information. This article proposes a hybrid-state architecture built upon a novel distributed inference protocol termed "speculative edge decoding." In this architecture, lightweight Small Language Models deployed on client devices handle simple tasks and generate draft token sequences, while cloud-hosted Large Language Models perform verification passes rather than full generative inference. The system further incorporates a complexity router that automatically classifies incoming prompts and dispatches them to the appropriate inference tier, a dynamic low-rank adapter loading mechanism for task-specific specialization at the edge, and trusted execution environments for secure cloud-side inference. Experimental evaluation in simulated high-load environments demonstrates substantial reductions in cloud infrastructure costs alongside markedly faster end-to-end response times, without meaningful degradation in response quality as measured by standard automated metrics. The article concludes that the future of scalable AI-enhanced messaging lies not in exclusive cloud dependence but in the intelligent orchestration of cloud and edge resources.
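The core accept/verify loop behind the "speculative edge decoding" protocol the abstract describes can be sketched in a few lines. The sketch below is illustrative only: `edge_slm` and `cloud_llm` are toy deterministic stand-ins (not the authors' models), and the sequential verification loop stands in for what would, in practice, be a single batched forward pass on the cloud verifier.

```python
def edge_slm(context):
    # Toy small drafter: agrees with the verifier except when the
    # context length is a multiple of 4, to simulate occasional drift.
    tok = (sum(context) * 31 + len(context)) % 50
    return tok if len(context) % 4 else tok + 1

def cloud_llm(context):
    # Toy large verifier, treated as ground truth for acceptance.
    return (sum(context) * 31 + len(context)) % 50

def speculative_step(context, k):
    """Draft k tokens on the edge, then verify them against the cloud model.

    Returns the accepted tokens: the longest draft prefix the verifier
    agrees with, followed by one verifier-supplied token (the standard
    speculative-decoding correction/bonus token).
    """
    # 1. Edge device drafts k tokens autoregressively with the small model.
    ctx, draft = list(context), []
    for _ in range(k):
        t = edge_slm(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Cloud verifies the draft; first mismatch is replaced by the
    #    verifier's own token and drafting resumes from there.
    ctx, accepted = list(context), []
    for t in draft:
        v = cloud_llm(ctx)
        if v != t:
            accepted.append(v)   # correction token from the verifier
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(cloud_llm(ctx))  # all drafts accepted: bonus token
    return accepted
```

Because accepted tokens always match what the large model would have emitted, output quality is that of the cloud LLM, while the cloud's per-turn work shrinks from k generation steps to one verification pass, which is the source of the cost and latency savings claimed above.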
License
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.