Tensoic Unveils Kan-LLaMA: Revolutionizing Tech Landscape for Kannada Language

By: Rafia Tasleem
Published: January 13, 2024 at 2:47 am EST

Mumbai-based software development company Tensoic has unveiled Kan-LLaMA, a language model built specifically for the Kannada language. Based on the 7-billion-parameter Llama-2 model, it aims to reshape the tech landscape for Kannada speakers.

Enhancing Linguistic Capabilities

Kan-LLaMA was built using LoRA pre-training, a low-rank adaptation method that trains small adapter matrices instead of updating all of the base model's weights. Pre-training used approximately 600 million Kannada tokens sourced from the CulturaX dataset. This amounted to an 11GB text corpus drawn from CulturaX, itself a compilation of multilingual data from resources such as mC4 and OSCAR. The result is a stronger base capability for understanding and processing Kannada text.
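
To illustrate why a LoRA-style approach keeps training costs down, here is a minimal sketch in plain Python. It is not Tensoic's training code. The rank of 8 and the scale of 0.5 are hypothetical values chosen for illustration, while 4096 is the hidden size of Llama-2 7B. The idea is that a frozen weight matrix W receives a trainable low-rank update B·A, so only the two small factors need to be trained.

```python
# Minimal LoRA sketch (illustrative only; not Tensoic's code).

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, rank):
    """Number of entries in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


HIDDEN = 4096  # hidden size of Llama-2 7B
RANK = 8       # hypothetical adapter rank, chosen for illustration

full = HIDDEN * HIDDEN                               # one square projection matrix
lora = lora_trainable_params(HIDDEN, HIDDEN, RANK)   # its low-rank adapter
ratio = lora / full                                  # fraction that is trained

# Tiny worked example with a 2x2 identity as the frozen weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # rank 1: shape (1, 2)
B = [[1.0], [2.0]]    # shape (2, 1)
y = lora_forward(W, A, B, [1.0, 2.0], scale=0.5)
```

Under these assumed values, the adapter for one 4096 x 4096 matrix holds 65,536 trainable values against 16,777,216 in the full matrix, or about 0.4 percent.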

Bridging the Vocabulary Gap

Tensoic also expanded the Llama-2 model's existing 32K-token vocabulary to 48K tokens. To do so, it trained a new tokenizer with a 20K-token vocabulary on Kannada text and merged it with the existing Llama-2 tokenizer. The larger vocabulary should let the model represent Kannada text in fewer tokens, improving its handling of a range of Kannada language tasks.
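
A vocabulary merge of this kind can be sketched as follows. The article does not describe the actual merge procedure, so this is an assumption about how such a merge commonly works: the base token ids are kept unchanged and pieces from the new tokenizer are appended only if they are not already present. The toy vocabularies and the function name `merge_vocab` are invented for illustration. One plausible reading of the 32K, 20K, and 48K figures is that pieces shared by both vocabularies are counted once, though the article does not confirm this.

```python
# Toy tokenizer-vocabulary merge (illustrative only; not Tensoic's code).

def merge_vocab(base_vocab, new_vocab):
    """Append pieces from new_vocab that base_vocab lacks, keeping base ids stable."""
    merged = list(base_vocab)
    seen = set(base_vocab)
    for piece in new_vocab:
        if piece not in seen:
            merged.append(piece)
            seen.add(piece)
    return merged


# Hypothetical miniature vocabularies standing in for the 32K and 20K ones.
base_vocab = ["<s>", "</s>", "a", "b", "the"]
kannada_vocab = ["a", "ಕ", "ನ", "the", "ಡ"]

merged = merge_vocab(base_vocab, kannada_vocab)
token_to_id = {piece: index for index, piece in enumerate(merged)}
```

Here two of the five new pieces already exist in the base vocabulary, so the merged vocabulary has eight entries instead of ten. In a real model, the embedding table would also need to grow to cover the appended ids.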

Optimizing Conversational Capabilities

Tensoic then fine-tuned the model on datasets geared toward chat and translation, with the aim of improving Kan-LLaMA's conversational abilities for Kannada speakers. The fine-tuning was run with Axolotl, a tool that drives training from configuration files, further strengthening the model's generation abilities.
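
As a rough idea of what a translation example for fine-tuning might look like, here is a small sketch. The article does not say which record format or prompt template Tensoic used. The instruction/input/output field names follow the widely used Alpaca-style layout, and the template text and helper names are hypothetical.

```python
# Hypothetical fine-tuning record for translation (not Tensoic's actual format).

def make_translation_record(english, kannada):
    """Build an instruction-style record from an English/Kannada sentence pair."""
    return {
        "instruction": "Translate the following text from English to Kannada.",
        "input": english,
        "output": kannada,
    }


def render_prompt(record):
    """Render the part of the training text that precedes the model's answer."""
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        "### Response:\n"
    )


record = make_translation_record("Hello", "ನಮಸ್ಕಾರ")
prompt = render_prompt(record)
training_text = prompt + record["output"]  # prompt followed by the target text
```

A configuration-driven tool would typically be pointed at a file of such records, and the model would learn to produce the text that follows the response marker.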

With community engagement in mind, Tensoic plans to release the models, code, datasets, and associated research paper under open licenses such as CC-BY-4.0 and Apache 2.0. The release is likely to spur innovation and open new avenues for the Kannada-speaking tech community.


