Tensoic Unveils Kan-LLaMA: Revolutionizing Tech Landscape for Kannada Language

Mumbai-based software development company Tensoic has unveiled Kan-LLaMA, a language model designed specifically for the Kannada language. Built on the 7-billion-parameter Llama-2 model, it aims to improve language technology for Kannada speakers.

Enhancing Linguistic Capabilities

Kan-LLaMA was pre-trained with LoRA (low-rank adaptation), a method that trains small added weight matrices rather than updating every parameter of the base model. Approximately 600 million Kannada tokens, sourced from the CulturaX dataset, served as the foundation for pre-training. This amounted to an 11GB text corpus; CulturaX is itself a compilation of multilingual data from resources such as mC4 and OSCAR. The result? Stronger base linguistic capabilities for Kan-LLaMA, marking a significant stride in understanding and processing Kannada text.
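To give a sense of why LoRA makes this kind of pre-training affordable, here is a minimal sketch that compares trainable parameter counts for a single weight matrix. The 4096 x 4096 shape matches the hidden size of Llama-2 7B; the rank of 8 is an illustrative assumption, since the article does not say which rank Tensoic used, and `lora_param_counts` is a hypothetical helper, not part of any library.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # updating the weight matrix W directly
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora


# Assumed values: 4096 is the Llama-2 7B hidden size, rank 8 is illustrative.
full, lora = lora_param_counts(4096, 4096, rank=8)
print(full, lora, full // lora)  # 16777216 65536 256
```

Under these assumptions the low-rank update trains 256 times fewer parameters for that matrix than full fine-tuning would.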

Bridging the Vocabulary Gap

Tensoic took a step further by expanding the Llama-2 model’s existing 32K-token vocabulary to 48K. To do so, it trained a new tokenizer with a 20K-token vocabulary on Kannada text and merged it with the existing Llama-2 tokenizer. The expanded vocabulary lets the model represent Kannada text with fewer tokens, improving its handling of diverse Kannada language tasks.
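The merging step can be pictured with a small sketch. This is not Tensoic's code: `merge_vocab` is a hypothetical helper and the vocabularies are toy examples. It only illustrates the general idea that new tokens are appended after the base vocabulary, so existing token IDs stay unchanged and tokens present in both lists are not duplicated.

```python
def merge_vocab(base: list[str], new: list[str]) -> list[str]:
    """Append tokens from `new` that are missing from `base`, keeping base IDs stable."""
    seen = set(base)
    merged = list(base)
    for tok in new:
        if tok not in seen:
            seen.add(tok)
            merged.append(tok)
    return merged


# Toy vocabularies (hypothetical, not taken from the real tokenizers).
base_vocab = ["<s>", "</s>", "▁the", "▁a", "ing"]
kannada_vocab = ["<s>", "</s>", "▁ಕನ್ನಡ", "▁ನಾನು", "ದ"]

merged = merge_vocab(base_vocab, kannada_vocab)
print(len(merged))  # 8: five base tokens plus three new Kannada tokens
```

By the same arithmetic, growing from 32K to 48K with a 20K-token tokenizer would imply roughly 4K shared tokens, assuming the reported sizes are approximate.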

Optimizing Conversational Capabilities

But Tensoic didn’t stop there. The model was then fine-tuned on datasets prepared specifically for chat and translation. The aim? To enhance the conversational capabilities of Kan-LLaMA, making it a practical assistant for Kannada speakers. Axolotl, a configuration-driven fine-tuning tool, was used for this step, further strengthening the model’s generation abilities.

Keeping community engagement at heart, Tensoic plans to release the models, code, datasets, and associated research paper under open licenses such as CC-BY-4.0 and Apache 2.0. This move is likely to spur innovation and open new avenues for the Kannada-speaking tech community.