Lithuanian Lemmatization: A Practical Guide
What is a Lemma?
A lemma is the canonical (dictionary) form of a word.
Where is it Used?
Lemmatization is needed in systems that extract, store, or query information based on terms. Words appear in different grammatical forms, so you need to normalize them to search and compare effectively.
Examples:
English: "person" (singular), "persons" (plural), "person's" (possessive)
Lithuanian: "gyvūnas" (singular), "gyvūnai" (plural), "gyvūno" (genitive). Lithuanian has many more variations of the same word due to its rich case system.
To use these words in search queries or build a terms dictionary, you need the canonical form: "person" or "gyvūnas". This base form is the lemma, and the process of converting words to their lemmas is called lemmatization.
Use Case: Legal Document Concept Graph
I use this to build a concept graph for Lithuanian legal documents. The idea is to extract legal concepts and link multiple law documents through shared concepts.
For example:
Extract terms like "sutartis" (contract), "atsakomybė" (liability), "įsipareigojimas" (obligation)
Convert all their grammatical variants to base forms
Build relationships between different legal documents that reference the same concepts
Search across the legal corpus by concept
Without lemmatization, you miss connections between documents that use different grammatical forms of the same legal concept.
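Jumping ahead to the spaCy pipeline introduced in the next section, here is a rough sketch of the linking idea. The two document texts are invented examples, and the index layout is just one possible design:

```python
# Rough sketch: index documents by the lemmas of their words, then treat
# documents that share a lemma as linked by that concept.
from collections import defaultdict

import spacy

nlp = spacy.load("lt_core_news_md")

# Invented two-document corpus; both use forms of "sutartis" (contract).
documents = {
    "doc1": "Sutarties šalys prisiima įsipareigojimus",
    "doc2": "Atsakomybė kyla iš sudarytų sutarčių",
}

concept_index = defaultdict(set)
for doc_id, text in documents.items():
    for token in nlp(text):
        concept_index[token.lemma_].add(doc_id)

# Ideally both documents end up under the lemma "sutartis":
print(concept_index.get("sutartis"))
```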
How to Implement Lemmatization?
For Lithuanian, use the spaCy model lt_core_news_md, where _md indicates the model size. Other sizes are _sm (small) and _lg (large).
Model Size Comparison
lt_core_news_sm (~20MB): Fast processing, lower accuracy. Good for prototyping or when speed matters more than precision.
lt_core_news_md (~50MB): Balanced speed and accuracy. Works well for most production use cases.
lt_core_news_lg (~150MB): Best accuracy, slower processing. Use this for legal or medical text where precision is critical.
Simple Python Implementation
Here's a basic FastAPI service for lemmatization:
```python
from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("lt_core_news_md")

@app.post("/lemmatize")
def lemmatize(text: str):
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    return {"lemmas": lemmas, "original_text": text}
```
Once you start the application on localhost:8000, you can:
Access the OpenAPI documentation at /docs
Use the lemmatizer via the /lemmatize endpoint
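To sanity-check the service, you can call the endpoint with nothing but the Python standard library. The example sentence and expected lemmas below are the same ones used in the TypeScript client example later in this guide:

```python
# Minimal smoke test against the running service on localhost:8000.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"text": "gyvūnai bėga greitai"})
request = urllib.request.Request(
    f"http://localhost:8000/lemmatize?{params}", method="POST"
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))
    # Expected: {"lemmas": ["gyvūnas", "bėgti", "greitai"], ...}
```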
Performance Considerations
Processing Speed
Processing Lithuanian text with spaCy is slow for large datasets: loading a model takes time, and so does running every text through the full pipeline.
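One easy win when you control the batch: spaCy's nlp.pipe() processes texts in batches and is usually much faster than calling nlp() once per text. A brief sketch, reusing the nlp object loaded earlier (the batch size is an illustrative choice):

```python
# Stream texts through the pipeline in batches instead of one nlp() call
# per text; nlp.pipe() yields Doc objects in input order.
texts = ["gyvūnai bėga greitai", "sutartis buvo sudaryta"]
for doc in nlp.pipe(texts, batch_size=64):
    print([token.lemma_ for token in doc])
```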
Caching Strategy
For production systems processing large amounts of data, implement phrase-level caching:
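A minimal sketch of what this can look like, assuming a simple design: each phrase's lemmas are stored as a JSON file named after a hash of the text, in the `lemma_cache/` directory that also appears in the project structure below:

```python
# A sketch of phrase-level caching: store each phrase's lemmas as a JSON
# file keyed by a hash of the text. The file layout is an assumption.
import hashlib
import json
from pathlib import Path

import spacy

CACHE_DIR = Path("lemma_cache")
CACHE_DIR.mkdir(exist_ok=True)

nlp = spacy.load("lt_core_news_md")

def lemmatize_cached(text):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        # Cache hit: skip the (slow) spaCy pipeline entirely.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    lemmas = [token.lemma_ for token in nlp(text)]
    cache_file.write_text(json.dumps(lemmas, ensure_ascii=False), encoding="utf-8")
    return lemmas
```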
Benefits:
Pipeline evaluations run much faster the second time
Unit tests execute quickly without re-processing
Lower computational costs for repeated text analysis
Limitations
While spaCy's Lithuanian models work well, they have some issues:
Context dependency: Accuracy depends on context. Ambiguous words may be lemmatized incorrectly if context is unclear.
Domain-specific terminology: Legal, medical, or technical terms may not be recognized if they were rare in the training data. Legal Lithuanian has specific terminology that general models might handle poorly. In my experience, even some generic words are processed poorly by the 50MB (_md) model, for example: "tachografo", "Laivo", "numatytu".
Rare words and neologisms: New words or rare terms may be lemmatized incorrectly or returned unchanged.
Proper nouns: Names of people, places, and organizations may be incorrectly lemmatized. For example, "Maximos", a store name in the genitive case, is not lemmatized.
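A pragmatic (if unglamorous) workaround for the last three of these limitations is a manual override table consulted before falling back to the model. The entries below are illustrative guesses, not verified model output, and the code reuses the nlp pipeline loaded earlier:

```python
# Hypothetical override table checked before trusting the model's lemma.
# Both entries are illustrative; extend with terms your corpus gets wrong.
OVERRIDES = {
    "maximos": "Maxima",          # store name, genitive case
    "tachografo": "tachografas",  # "tachograph", genitive case
}

def lemma_for(token):
    return OVERRIDES.get(token.text.lower(), token.lemma_)

doc = nlp("Maximos tachografo duomenys")
print([lemma_for(token) for token in doc])
```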
Alternative Approaches
TypeScript/JavaScript Libraries
For JavaScript/TypeScript applications, you might consider:
1. Compromise.js with custom Lithuanian plugin
```javascript
import nlp from 'compromise'

const text = "gyvūnai bėga greitai"
const doc = nlp(text)
// Note: Requires a custom Lithuanian plugin (not available by default)
```
2. Natural.js
```javascript
import natural from 'natural'
// Limited Lithuanian support
```
3. Remote API approach (TypeScript client)
```typescript
async function lemmatize(text: string): Promise<string[]> {
  const response = await fetch(
    `http://localhost:8000/lemmatize?text=${encodeURIComponent(text)}`,
    { method: 'POST' }
  )
  const data = await response.json()
  return data.lemmas
}

// Usage
const lemmas = await lemmatize("gyvūnai bėga greitai")
console.log(lemmas) // ["gyvūnas", "bėgti", "greitai"]
```
Tradeoffs for Lithuanian
Python + spaCy (Recommended):
✅ Mature Lithuanian models with good accuracy
✅ Handles Lithuanian morphology well
✅ Active community and updates
✅ Good documentation
❌ Requires Python runtime
❌ Slower processing
TypeScript/JavaScript libraries:
✅ Works natively in Node.js and browsers
✅ Easier deployment in JS projects
✅ Faster for simple tasks
❌ Poor or no Lithuanian support
❌ Limited morphological analysis for Lithuanian
❌ Requires custom rule-based implementation
Recommendation: For Lithuanian, there are no good JavaScript libraries. Python + spaCy is the clear choice. If you need TypeScript integration, run a Python lemmatization server and call it via HTTP from your TypeScript application.
Complete Setup Guide
Prerequisites
Python 3.8 or higher
Node.js 18+ (for TypeScript client)
pip (Python package manager)
Installation Steps
1. Install Python
Download and install Python from python.org.

Verify installation:

```bash
python --version
```
2. Create Project Directory
```bash
mkdir lemmatizer-service
cd lemmatizer-service
```
3. Create Virtual Environment
```bash
python -m venv venv
```
Activate it:
Linux/Mac:

```bash
source venv/bin/activate
```

Windows:

```bash
venv\Scripts\activate
```
4. Install Dependencies
```bash
pip install fastapi uvicorn spacy
python -m spacy download lt_core_news_md
```
5. Create Application File
Create main.py combining the FastAPI service and the phrase-level caching shown above.
6. Run the Server
```bash
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```
TypeScript Client Example
1. Initialize TypeScript Project
```bash
mkdir lemmatizer-client
cd lemmatizer-client
npm init -y
npm install typescript @types/node tsx
npx tsc --init
```
2. Create Client
Create client.ts:
```typescript
interface LemmatizeResponse {
  lemmas: string[]
  original_text: string
}

class LemmatizerClient {
  private baseUrl: string

  constructor(baseUrl: string = 'http://localhost:8000') {
    this.baseUrl = baseUrl
  }

  async lemmatize(text: string): Promise<string[]> {
    const response = await fetch(
      `${this.baseUrl}/lemmatize?text=${encodeURIComponent(text)}`,
      { method: 'POST' }
    )
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`)
    }
    const data: LemmatizeResponse = await response.json()
    return data.lemmas
  }

  async lemmatizeBatch(texts: string[]): Promise<string[][]> {
    return Promise.all(texts.map(text => this.lemmatize(text)))
  }
}

// Example usage
async function main() {
  const client = new LemmatizerClient()
  try {
    // Single text lemmatization
    const lemmas = await client.lemmatize("Lietuvos teisės aktai reglamentuoja sutarčių sudarymo tvarką")
    console.log('Lemmas:', lemmas)

    // Batch processing
    const texts = [
      "gyvūnai bėga greitai",
      "sutartis buvo sudaryta",
      "įstatymai nustato atsakomybę"
    ]
    const batchResults = await client.lemmatizeBatch(texts)
    console.log('Batch results:', batchResults)
  } catch (error) {
    console.error('Error:', error)
  }
}

main()
```
3. Run the Client
```bash
npx tsx client.ts
```
Expected Output
```
Lemmas: ['Lietuva', 'teisė', 'aktas', 'reglamentuoti', 'sutartis', 'sudarymas', 'tvarka']
Batch results: [
  ['gyvūnas', 'bėgti', 'greitai'],
  ['sutartis', 'būti', 'sudaryti'],
  ['įstatymas', 'nustatyti', 'atsakomybė']
]
```
Project Structure
```
lemmatizer-service/
├── main.py
├── requirements.txt
├── lemma_cache/       # Created automatically
├── venv/              # Virtual environment
└── README.md

lemmatizer-client/
├── client.ts
├── package.json
├── tsconfig.json
└── README.md
```
requirements.txt
```
fastapi==0.104.1
uvicorn==0.24.0
spacy==3.7.2
```
README.md (Python Service)
```markdown
# Lithuanian Lemmatizer Service

A FastAPI service for Lithuanian text lemmatization using spaCy.

## Features

- Fast lemmatization for Lithuanian text
- Phrase-level caching for improved performance
- REST API with OpenAPI documentation

## Quick Start

1. Install dependencies: `pip install -r requirements.txt`
2. Download model: `python -m spacy download lt_core_news_md`
3. Run server: `uvicorn main:app --reload`
4. Visit: http://localhost:8000/docs

## API Endpoint

POST /lemmatize?text=your_text_here

## Performance

Caching is enabled by default. The cache is stored in the `./lemma_cache/` directory.
```
README.md (TypeScript Client)
````markdown
# Lemmatizer TypeScript Client

TypeScript client for the Lithuanian Lemmatizer Service.

## Installation

```bash
npm install
```

## Usage

```bash
npx tsx client.ts
```

## Requirements

- Node.js 18+
- Running lemmatizer service on localhost:8000
````
Conclusion
For Lithuanian lemmatization, Python's spaCy library is the best option because of its mature language models and good accuracy. If you need TypeScript integration, use a microservice: run the Python lemmatizer as a server and call it from your TypeScript code.
For legal document analysis and concept graph building, the caching strategy is necessary to keep processing speeds reasonable across large document collections.