
YU News

Students to Present Voice-Cloning System That Reinvents How Computers Speak

Sahil Kumar, a Katz School Ph.D. candidate in mathematics, is one of several researchers in the Department of Graduate Computer Science and Engineering who will present a new AI system that turns written text into spoken word at the Fourteenth International Conference on Learning Representations, one of the top international conferences in artificial intelligence.

By Dave DeFusco

When people hear a computer-generated voice that sounds natural, expressive and even emotional, it can feel almost magical. Behind that voice, however, is a complex system that turns written text into spoken word. A paper, “MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control,” by researchers in the Katz School’s Department of Graduate Computer Science and Engineering, explores how that process can be made faster, more reliable and easier to use in real-world applications. 

They will present their work in April in Brazil at the Fourteenth International Conference on Learning Representations (ICLR 2026), one of the top international conferences in artificial intelligence. The paper was selected from more than 19,000 submissions, with an overall acceptance rate of 28 percent.

Text-to-speech, often called TTS, is the technology that allows computers to read text aloud. It powers digital assistants, audiobooks, navigation apps and accessibility tools for people with visual impairments. While recent systems can sound very realistic, they often rely on heavy computational machinery that is slow, memory-hungry and difficult to deploy on everyday devices.

The researchers behind MambaVoiceCloning, or MVC, asked a simple but important question: can we redesign the “thinking” part of a text-to-speech system so it runs more smoothly, without sacrificing voice quality?

“Our goal was to rethink how the system processes text, rhythm and speaking style,” said Sahil Kumar, a Ph.D. student in mathematics in the Department of Graduate Computer Science and Engineering at the Katz School. “Instead of using complex attention mechanisms that constantly look back and forth over the entire sentence, we wanted a cleaner, more streamlined approach.”

Most modern text-to-speech systems rely on a technique called attention, which helps the model decide which parts of the text to focus on when generating speech. While effective, attention becomes inefficient as sentences get longer, using more memory and slowing things down. This can cause problems for long passages, such as audiobooks or multi-minute readings.
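The memory cost the article describes comes from the attention score matrix, which compares every token with every other token. A minimal sketch (a toy single-head attention with no learned projections, not the paper's model) makes the quadratic growth concrete:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Toy single-head self-attention over token embeddings x of
    shape (seq_len, d). It materializes a full (seq_len, seq_len)
    score matrix, so memory grows quadratically with length."""
    scores = x @ x.T / np.sqrt(x.shape[1])              # (L, L) matrix
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return weights @ x                                  # (L, d) output

rng = np.random.default_rng(0)
out = self_attention(rng.standard_normal((8, 4)))
print(out.shape)  # (8, 4) — but an 8x8 score matrix was built to get it
```

Doubling the sentence length quadruples that intermediate matrix, which is why long audiobook passages become expensive for attention-based models.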

MVC replaces attention with a different mathematical framework known as state-space models. These models process information step by step in a predictable, linear way. The research team used a specific type called Mamba, which is designed to be fast and stable even for long sequences of text.
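The "step by step, linear" processing can be sketched as the classic state-space recurrence that Mamba-style models build on: a fixed-size hidden state is updated once per token, so cost per step stays constant regardless of sequence length. The matrices below are arbitrary toy values, not anything from the paper:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence:
        h_t = A h_{t-1} + B x_t,   y_t = C h_t.
    Only one fixed-size state vector h is kept, so memory does
    not grow with the length of the input sequence."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                    # single left-to-right pass
        h = A @ h + B * x_t          # fold the new input into the state
        ys.append(C @ h)             # read out the current value
    return np.array(ys)

A = np.array([[0.9, 0.0], [0.1, 0.8]])  # toy state-transition matrix
B = np.array([1.0, 0.5])                # toy input projection
C = np.array([1.0, -1.0])               # toy output projection
y = ssm_scan(np.ones(5), A, B, C)
print(y.shape)  # (5,)
```

Contrast this with the attention sketch: here nothing the size of L×L is ever built, which is the efficiency property the researchers exploit.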

“State-space models let us keep track of what’s important without constantly revisiting everything that came before,” said Namrateben Patel, a Ph.D. student in mathematics in the Department of Graduate Computer Science and Engineering. “That means the system uses less memory and behaves more consistently, especially for long-form speech.”

The MVC system is built from three main components, all based on this Mamba approach. One part reads and understands the text, another handles timing and rhythm, and a third shapes the expressiveness and tone of the voice. During training, the system briefly uses a helper module to learn how text aligns with speech, but that helper is removed when the system is actually used. This makes MVC simpler and faster during real-world operation.
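The three-module flow described above, with the training-only alignment helper dropped at inference, can be sketched as follows. Every function body here is a stand-in (hypothetical names and arithmetic, not the paper's architecture); the point is the structure: text encoding, duration prediction, style conditioning, and an aligner that exists only behind the `training` flag:

```python
def text_encoder(tokens):
    """Stand-in for the module that reads and encodes the text."""
    return [sum(map(ord, t)) % 97 for t in tokens]

def duration_predictor(feats):
    """Stand-in for timing/rhythm: predicted frames per token."""
    return [2 + f % 3 for f in feats]

def style_module(feats):
    """Stand-in for the expressiveness/tone conditioning."""
    return [f * 0.01 for f in feats]

def synthesize(tokens, training=False, aligner=None):
    feats = text_encoder(tokens)
    durs = duration_predictor(feats)
    style = style_module(feats)
    if training and aligner is not None:
        aligner(feats, durs)  # helper used only during training
    # Expand each token's conditioning by its predicted duration,
    # yielding frame-level features for the unchanged vocoder backend.
    return [s for s, d in zip(style, durs) for _ in range(d)]

frames = synthesize(["hello", "world"])
print(len(frames))
```

At inference the `aligner` branch is never taken, mirroring how MVC sheds its training-time helper and stays lean in deployment.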

The researchers did not change the part of the system that actually generates sound waves. Instead, they kept a well-known speech-generation backbone and focused only on improving the “conditioning” side—the part that prepares information for speech. This allowed for fair comparisons with existing systems.

When tested on widely used public speech datasets, MVC performed slightly but consistently better than popular models such as StyleTTS2 and VITS. Listeners rated the voices as more natural and accurate, and the system made fewer pitch and pronunciation errors. At the same time, MVC reduced the number of parameters in its encoder and increased processing speed by about 60 percent.

“These gains may look modest on paper, but they’re meaningful in practice,” said Honggang Wang, chair of the Department of Graduate Computer Science and Engineering. “Improving efficiency while maintaining quality is exactly what’s needed to move advanced text-to-speech systems out of the lab and into real products.”

Another key advantage of MVC is stability. Some speech systems struggle when reading long passages, slowly drifting in tone or rhythm. MVC showed strong performance on multi-minute texts from public-domain books, maintaining consistent pacing and natural-sounding prosody throughout. The researchers also tested MVC on new speakers and even other languages, such as Spanish, German and French. Despite being trained primarily on English data, the system handled these changes well, suggesting that its underlying design is robust and adaptable.

While MVC does not aim to replace massive, industry-scale voice systems trained on proprietary data, it offers a carefully controlled and open alternative. By focusing on efficiency, transparency and fairness in comparison, the work provides valuable insights into how future voice technologies can be built.

“Our contribution is showing that you don’t need ever-larger models to make progress,” said Kumar. “With the right architecture, you can get better performance, better stability and better efficiency all at once.”