Indic Languages Normalizer

This code is from: https://github.com/anoopkunchukuttan/indic_nlp_library

It also uses the Indic number-to-words libraries: https://github.com/raj-sutariya/indic-num2words, https://github.com/AI4Bharat/indic-numtowords

This code has been modified by Kurian to follow the Whisper-normalizer coding style, and the Malayalam normalization logic has been extended beyond the Indic NLP library by Dr Kavya.

NormalizerI

 NormalizerI ()

The normalizer classes do the following:

* Some characters have multiple Unicode codepoints. The normalizer chooses a single standard representation
* Some control characters are deleted
* While typing using the Latin keyboard, certain typical mistakes occur, which are corrected by the module

Base class for normalizers. Performs some common normalization, which includes:

* Byte order mark, word joiner, etc. removal
* ZERO_WIDTH_NON_JOINER and ZERO_WIDTH_JOINER removal
* ZERO_WIDTH_SPACE and NO_BREAK_SPACE replaced by spaces

Script-specific normalizers should derive from this class and override the normalize() method. They can call the superclass's normalize() method to get the common normalization.
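The common normalization listed above can be sketched in a few lines of plain Python. This is an illustration only, not the library's implementation: the function name `common_normalize` and the table name are hypothetical, and only the code points named in the description are handled (the "etc." is left out).

```python
# Code points named in the description of the common normalization.
BYTE_ORDER_MARK = "\ufeff"
WORD_JOINER = "\u2060"
ZERO_WIDTH_NON_JOINER = "\u200c"
ZERO_WIDTH_JOINER = "\u200d"
ZERO_WIDTH_SPACE = "\u200b"
NO_BREAK_SPACE = "\u00a0"

# Hypothetical translation table: None deletes the character,
# a string replaces it.
_COMMON_TABLE = str.maketrans({
    BYTE_ORDER_MARK: None,
    WORD_JOINER: None,
    ZERO_WIDTH_NON_JOINER: None,
    ZERO_WIDTH_JOINER: None,
    ZERO_WIDTH_SPACE: " ",
    NO_BREAK_SPACE: " ",
})


def common_normalize(text: str) -> str:
    """Apply the script-independent clean-up in a single pass."""
    return text.translate(_COMMON_TABLE)


sample = "\ufeff\u0915\u200d\u0916\u200b\u0917\u00a0\u0918"
print(repr(common_normalize(sample)))
```

A script-specific normalizer would run a step like this first and then apply its own replacements.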

BaseNormalizer

 BaseNormalizer (lang, remove_nuktas=False, nasals_mode='do_nothing',
                 do_normalize_chandras=False,
                 do_normalize_vowel_ending=False)

Common base class; the normalizers for most Indic languages inherit from it.

DevanagariNormalizer

 DevanagariNormalizer (lang='hi', remove_nuktas=False,
                       nasals_mode='do_nothing',
                       do_normalize_chandras=False,
                       do_normalize_vowel_ending=False)

Normalizer for the Devanagari script. In addition to basic normalization by the superclass, it:

* replaces the composite characters containing nuktas by their decomposed form
* replaces the pipe character ‘|’ by the poorna virama character
* replaces colon ‘:’ by visarga if the colon follows a character in this script
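The three Devanagari-specific rules can be illustrated with a small standalone sketch. The function name `devanagari_rules` and the mapping name are hypothetical, and the order in which the real class applies the rules is an assumption; the code points themselves are the standard Unicode ones for Devanagari.

```python
import re

NUKTA = "\u093c"
POORNA_VIRAMA = "\u0964"  # danda
VISARGA = "\u0903"

# Precomposed nukta letters mapped to their base consonants.
NUKTA_COMPOSITES = {
    "\u0929": "\u0928",
    "\u0931": "\u0930",
    "\u0934": "\u0933",
    "\u0958": "\u0915",
    "\u0959": "\u0916",
    "\u095a": "\u0917",
    "\u095b": "\u091c",
    "\u095c": "\u0921",
    "\u095d": "\u0922",
    "\u095e": "\u092b",
    "\u095f": "\u092f",
}


def devanagari_rules(text: str) -> str:
    # 1. Decompose nukta composites into base consonant + nukta sign.
    for composite, base in NUKTA_COMPOSITES.items():
        text = text.replace(composite, base + NUKTA)
    # 2. A pipe typed in place of the danda becomes the poorna virama.
    text = text.replace("|", POORNA_VIRAMA)
    # 3. A colon directly after a Devanagari character becomes visarga.
    text = re.sub(r"([\u0900-\u097f]):", r"\1" + VISARGA, text)
    return text


print(devanagari_rules("\u0926\u0941:\u0916|"))
```

A colon after a Latin character, as in `a:b`, is left untouched because the pattern requires a preceding character from the Devanagari block.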

norm = DevanagariNormalizer()
TEST_RESULT = "चीर धाराएं समुद्र तट पर लगकर लौटने वाली लहरों का प्रवाह होता है अक्सर एक चट्टान या इसी तरह के पदार्थों पर"
hi_text = "चीर धाराएं समुद्र तट पर लगकर लौटने वाली लहरों का प्रवाह होता है अक्सर एक चट्टान या इसी तरह के पदार्थों पर"
norm(hi_text)

'चीर धाराएं समुद्र तट पर लगकर लौटने वाली लहरों का प्रवाह होता है अक्सर एक चट्टान या इसी तरह के पदार्थों पर'

assert norm(hi_text) == TEST_RESULT

sample_text = "भारत का सकल घरेलू उत्पाद 1.5 ट्रिलियन अमेरिकी डॉलर है।"
norm(sample_text)

'भारत का सकल घरेलू उत्पाद एक.पाँच ट्रिलियन अमेरिकी डॉलर है।'

HindiNormalizer

 HindiNormalizer (lang='hi', remove_nuktas=False,
                  nasals_mode='do_nothing', do_normalize_chandras=False,
                  do_normalize_vowel_ending=False, tts_mode=False)

Fork of the Devanagari normalizer, with additional changes for Hindi and tts_mode.

norm = HindiNormalizer(tts_mode=True)
TEST_RESULT = "चीर धाराएं समुद्र तट पर लगकर लौटने वाली लहरों का प्रवाह होता है अक्सर एक चट्टान या इसी तरह के पदार्थों पर"
hi_text = "चीर धाराएं समुद्र तट पर लगकर लौटने वाली लहरों का प्रवाह होता है अक्सर एक चट्टान या इसी तरह के पदार्थों पर $134.5"
norm(hi_text)

'चीर धाराएं समुद्र तट पर लगकर लौटने वाली लहरों का प्रवाह होता है अक्सर एक चट्टान या इसी तरह के पदार्थों पर डॉलर एक सौ चौंतीस पॉइंट पाँच'

normalizer = HindiNormalizer(tts_mode=True)
output = normalizer("He spent Rs. 500 and $20 on groceries.")
print(output)

he spent रुपये पाँच सौ and डॉलर बीस on groceries.

normalizer = HindiNormalizer(tts_mode=True)
output = normalizer(
    "Visit https://example.com or mail us at help@support.in & get 20% off!"
)
print(output)

visit एच टी टी पी एस कोलन स्लैश स्लैश e x a m p l e डॉट c o m or mail us at help at s u p p o r t डॉट i n  or  get बीस percent  off!

text = "आज की तारीख में भारत में ₹100 अमेरिका में $1.1855 के बराबर है alexerws@gmail.com"
normalizer(text)

'आज की तारीख में भारत में रुपये एक सौ अमेरिका में डॉलर एक पॉइंट एक आठ पाँच पाँच के बराबर है alexerws at g m a i l डॉट c o m'

text = "प्रश्न 8 चे उत्तर 17.825 आहे."
normalizer(text)

'प्रश्न आठ चे उत्तर सत्रह पॉइंट आठ दो पाँच आहे.'

sample_text = "भारत का सकल घरेलू उत्पाद 1.5 ट्रिलियन अमेरिकी डॉलर है।"
normalizer(sample_text)

'भारत का सकल घरेलू उत्पाद एक पॉइंट पाँच ट्रिलियन अमेरिकी डॉलर है।'
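The tts_mode outputs above suggest that a decimal number is read as the integer part in words, the word "पॉइंट", and then the fractional digits one by one. The sketch below reproduces only that pattern. Everything in it is an assumption for illustration: the function names are hypothetical, the digit spellings are one common choice, and `toy_int_to_words` is a stand-in for a real number-to-words library.

```python
import re

# Digit words for reading the fractional part one digit at a time.
HINDI_DIGITS = ["शून्य", "एक", "दो", "तीन", "चार", "पाँच", "छह", "सात", "आठ", "नौ"]
POINT_WORD = "पॉइंट"


def toy_int_to_words(n: int) -> str:
    """Hypothetical stand-in for a number-to-words library."""
    known = {1: "एक", 17: "सत्रह"}
    return known.get(n, str(n))  # unknown values are left as digits


def expand_decimals(text: str, int_to_words=toy_int_to_words) -> str:
    """Spell out every 'digits.digits' run found in the text."""

    def spell(match: re.Match) -> str:
        whole, frac = match.group(1), match.group(2)
        digits = " ".join(HINDI_DIGITS[int(d)] for d in frac)
        return f"{int_to_words(int(whole))} {POINT_WORD} {digits}"

    return re.sub(r"([0-9]+)\.([0-9]+)", spell, text)


print(expand_decimals("17.825"))
print(expand_decimals("1.1855"))
```

Passing the integer converter as a parameter keeps the digit-by-digit logic separate from whichever number-to-words package supplies the integer part.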

PunjabiNormalizer

 PunjabiNormalizer (lang='pa', remove_nuktas=False,
                    nasals_mode='do_nothing', do_normalize_chandras=False,
                    do_normalize_vowel_ending=False,
                    do_canonicalize_addak=False,
                    do_canonicalize_tippi=False,
                    do_replace_vowel_bases=False, tts_mode=False)

Normalizer for the Gurmukhi script. In addition to basic normalization by the superclass, it:

* replaces the composite characters containing nuktas by their decomposed form
* replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
* replaces the pipe character ‘|’ by the poorna virama character
* replaces colon ‘:’ by visarga if the colon follows a character in this script

normalizer = PunjabiNormalizer(tts_mode=True)
punjabi_text = "ਪੰਜੀਵੀ ਜੀਡੀਪੀ 1.5 ਟ੍ਰੀਲੀਅਨ ਅਮਰੀਕਾ ਡੋਲਰ ਛੇ।"
normalizer(punjabi_text)

'ਪੰਜੀਵੀ ਜੀਡੀਪੀ ਇੱਕ ਪੌਇੰਟ ਪੰਜ ਟ੍ਰੀਲੀਅਨ ਅਮਰੀਕਾ ਡੋਲਰ ਛੇ।'

punjabi_text = (
    "ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ ਦੁਨੀਆਂ। 4 ਮਈ, 2025 ਨੂੰ ਰਿਲੀਜ਼ ਹੋਏ ਪੰਜਾਬੀ ਨੋਰਮਲਾਈਜ਼ਰ ਵਿੱਚ ਤੁਹਾਡਾ ਸਵਾਗਤ ਹੈ।"
)
normalizer(punjabi_text)

'ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ ਦੁਨੀਆਂ। ਚਾਰ ਮਈ, ਦੋ ਹਜ਼ਾਰ ਪੱਚੀ ਨੂੰ ਰਿਲੀਜ਼ ਹੋਏ ਪੰਜਾਬੀ ਨੋਰਮਲਾਈਜ਼ਰ ਵਿੱਚ ਤੁਹਾਡਾ ਸਵਾਗਤ ਹੈ।'

TeluguNormalizer

 TeluguNormalizer (lang='te', remove_nuktas=False,
                   nasals_mode='do_nothing', do_normalize_chandras=False,
                   do_normalize_vowel_ending=False, tts_mode=False)

Normalizer for the Telugu script. In addition to basic normalization by the superclass, it:

* replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
* canonicalizes two-part dependent vowel signs
* replaces colon ‘:’ by visarga if the colon follows a character in this script

norm = TeluguNormalizer()
te_text = "భారతదేశ జీడీపీ 1.5 ట్రిలియన్ అమెరికా డాలర్లు."
norm(te_text)

'భారతదేశ జీడీపీ ఒకటి.ఐదు ట్రిలియన్ అమెరికా డాలర్లు.'

norm = TeluguNormalizer(tts_mode=True)
norm(te_text)

'భారతదేశ జీడీపీ ఒకటి పాయింట్ ఐదు ట్రిలియన్ అమెరికా డాలర్లు.'

GujaratiNormalizer

 GujaratiNormalizer (lang='gu', remove_nuktas=False,
                     nasals_mode='do_nothing',
                     do_normalize_chandras=False,
                     do_normalize_vowel_ending=False, tts_mode=False)

Normalizer for the Gujarati script. In addition to basic normalization by the superclass, it:

* replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
* replaces colon ‘:’ by visarga if the colon follows a character in this script
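The "reserved character for poorna virama" refers to the positions in each script's Unicode block that mirror the Devanagari danda (U+0964) and double danda (U+0965), that is, offsets 0x64 and 0x65 from the start of the block. A minimal sketch of the replacement follows; the function name and the table of block starts are hypothetical helpers, not part of the library.

```python
DANDA = "\u0964"         # generic Indic poorna virama
DOUBLE_DANDA = "\u0965"  # generic Indic deergha virama

# First code point of each script's Unicode block (hypothetical helper table).
SCRIPT_BLOCK_START = {
    "bn": 0x0980,
    "pa": 0x0A00,
    "gu": 0x0A80,
    "or": 0x0B00,
    "ta": 0x0B80,
    "te": 0x0C00,
    "kn": 0x0C80,
    "ml": 0x0D00,
}


def replace_reserved_danda(text: str, lang: str) -> str:
    """Map the script-local danda positions to the generic ones."""
    base = SCRIPT_BLOCK_START[lang]
    table = {base + 0x64: DANDA, base + 0x65: DOUBLE_DANDA}
    return text.translate(table)


# Gujarati letter KA followed by the reserved code point U+0AE4.
print(repr(replace_reserved_danda("\u0a95\u0ae4", "gu")))
```

The same function covers the other scripts whose normalizers list this rule, since only the block start changes.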

norm = GujaratiNormalizer(tts_mode=True)
gujarati_text = "ભારતનો જીડીપી 1.5 ટ્રિલિયન અમેરિકન ડોલર છે."
norm(gujarati_text)

'ભારતનો જીડીપી એક પોઈન્ટ પાંચ ટ્રિલિયન અમેરિકન ડોલર છે.'

OdiaNormalizer

 OdiaNormalizer (lang='or', remove_nuktas=False, nasals_mode='do_nothing',
                 do_normalize_chandras=False,
                 do_normalize_vowel_ending=False, do_remap_wa=False,
                 tts_mode=False)

Normalizer for the Oriya script. In addition to basic normalization by the superclass, it:

* replaces the composite characters containing nuktas by their decomposed form
* replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
* canonicalizes two-part dependent vowels
* replaces ‘va’ with ‘ba’
* replaces the pipe character ‘|’ by the poorna virama character
* replaces colon ‘:’ by visarga if the colon follows a character in this script

norm = OdiaNormalizer(tts_mode=True)
odia_text = "ଭାରତର ଜିଡିପି 1.5 ଟ୍ରିଲିଅନ୍ ଆମେରିକୀୟ ଡଲାର ଅଟେ।"
norm(odia_text)

'ଭାରତର ଜିଡିପି ଏକ ପଏଣ୍ଟ ପାଞ୍ଚ ଟ୍ରିଲିଅନ୍ ଆମେରିକୀୟ ଡଲାର ଅଟେ।'

BengaliNormalizer

 BengaliNormalizer (lang='bn', remove_nuktas=False,
                    nasals_mode='do_nothing', do_normalize_chandras=False,
                    do_normalize_vowel_ending=False,
                    do_remap_assamese_chars=False, tts_mode=False)

Normalizer for the Bengali script. In addition to basic normalization by the superclass, it:

* replaces the composite characters containing nuktas by their decomposed form
* replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
* canonicalizes two-part dependent vowels
* replaces the pipe character ‘|’ by the poorna virama character
* replaces colon ‘:’ by visarga if the colon follows a character in this script

norm = BengaliNormalizer(tts_mode=True)

TEST_RESULT = "ভারতের জিডিপি ১.৫ ট্রিলিয়ন মার্কিন ডলার।"  # Claude generated output
bn_text = "ভারতের জিডিপি 1.5 ট্রিলিয়ন মার্কিন ডলার।"
norm(bn_text)

'ভারতের জিডিপি এক পয়েন্ট পাঁচ ট্রিলিয়ন মার্কিন ডলার।'

TamilNormalizer

 TamilNormalizer (lang='ta', remove_nuktas=False,
                  nasals_mode='do_nothing', do_normalize_chandras=False,
                  do_normalize_vowel_ending=False, tts_mode=False)

Normalizer for the Tamil script. In addition to basic normalization by the superclass, it:

* replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
* canonicalizes two-part dependent vowel signs
* replaces colon ‘:’ by visarga if the colon follows a character in this script
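"Canonicalizing two-part dependent vowel signs" means that a vowel sign typed as two separate code points is replaced by the single code point for the same sign. A minimal sketch for Tamil, using the pairs that Unicode defines as the decompositions of U+0BCA, U+0BCB and U+0BCC; the function and table names are hypothetical and the real class may cover more cases.

```python
# (two-part sequence, single code point) for Tamil dependent vowels.
TWO_PART_VOWELS = (
    ("\u0bc6\u0bbe", "\u0bca"),  # sign E  + sign AA      -> sign O
    ("\u0bc7\u0bbe", "\u0bcb"),  # sign EE + sign AA      -> sign OO
    ("\u0bc6\u0bd7", "\u0bcc"),  # sign E  + AU length mark -> sign AU
)


def canonicalize_two_part_vowels(text: str) -> str:
    """Replace each two-code-point vowel sign with its single form."""
    for parts, single in TWO_PART_VOWELS:
        text = text.replace(parts, single)
    return text


# KA + sign EE + sign AA becomes KA + sign OO.
before = "\u0b95\u0bc7\u0bbe"
after = canonicalize_two_part_vowels(before)
print(len(before), len(after))
```

Both forms render identically, so without this step two visually equal strings can fail an equality check or a word-error-rate comparison.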

norm = TamilNormalizer(tts_mode=True)
ta_text = "இந்தியாவின் மொத்த உள்நாட்டு உற்பத்தி 1.5 டிரில்லியன் அமெரிக்க டாலர்."
norm(ta_text)

'இந்தியாவின் மொத்த உள்நாட்டு உற்பத்தி ஒன்று பாயிண்ட் ஐந்து டிரில்லியன் அமெரிக்க டாலர்.'

norm = TamilNormalizer(tts_mode=True)
ta_text = "அவர் Rs. 500 மற்றும் $20 க்கு உணவுப்பொருட்கள் வாங்கினார். இணையதளம்: www.amazon.in."
norm(ta_text)

'அவர் ரூபாய் ஐநூறு மற்றும் டாலர் இருபது க்கு உணவுப்பொருட்கள் வாங்கினார். இணையதளம்ஃ w w w டாட் a m a z o n டாட் i n.'

KannadaNormalizer

 KannadaNormalizer (lang='kn', remove_nuktas=False,
                    nasals_mode='do_nothing', do_normalize_chandras=False,
                    do_normalize_vowel_ending=False, tts_mode=False)

Normalizer for the Kannada script. In addition to basic normalization by the superclass, it:

* replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
* canonicalizes two-part dependent vowel signs
* replaces colon ‘:’ by visarga if the colon follows a character in this script

norm = KannadaNormalizer(tts_mode=True)
kannada_text = "ಭಾರತದ ಜಿಡಿಪಿ 1.5 ಟ್ರಿಲಿಯನ್ ಅಮೇರಿಕನ್ ಡಾಲರ್ ಆಗಿದೆ."
norm(kannada_text)

'ಭಾರತದ ಜಿಡಿಪಿ ಒಂದು ಪಾಯಿಂಟ್ ಐದು ಟ್ರಿಲಿಯನ್ ಅಮೇರಿಕನ್ ಡಾಲರ್ ಆಗಿದೆ.'

MalayalamNormalizer

 MalayalamNormalizer (lang='ml', remove_nuktas=False,
                      nasals_mode='do_nothing',
                      do_normalize_chandras=False,
                      do_normalize_vowel_ending=False,
                      do_canonicalize_chillus=False,
                      do_correct_geminated_T=False, tts_mode=False)

Normalizer for the Malayalam script. In addition to basic normalization by the superclass, it:

* replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
* canonicalizes two-part dependent vowel signs
* changes from the old encoding of chillus (till Unicode 5.0) to the new encoding
* replaces colon ‘:’ by visarga if the colon follows a character in this script
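The chillu rule can be made concrete. In the old encoding a chillu is written as consonant + virama + ZERO WIDTH JOINER; the new encoding uses one atomic code point per chillu (U+0D7A to U+0D7E). The sketch below shows that mapping for the five common chillus. The function name is hypothetical, and how this step relates to the `do_canonicalize_chillus` flag in the class is not shown here.

```python
VIRAMA = "\u0d4d"
ZWJ = "\u200d"

# Base consonant -> atomic chillu code point.
CHILLU_MAP = {
    "\u0d23": "\u0d7a",  # NNA -> chillu NN
    "\u0d28": "\u0d7b",  # NA  -> chillu N
    "\u0d30": "\u0d7c",  # RA  -> chillu RR
    "\u0d32": "\u0d7d",  # LA  -> chillu L
    "\u0d33": "\u0d7e",  # LLA -> chillu LL
}


def canonicalize_chillus(text: str) -> str:
    """Rewrite consonant + virama + ZWJ as the atomic chillu."""
    for base, chillu in CHILLU_MAP.items():
        text = text.replace(base + VIRAMA + ZWJ, chillu)
    return text


old_style = "\u0d05\u0d35\u0d28\u0d4d\u200d"  # old-encoding spelling
print(repr(canonicalize_chillus(old_style)))
```

A step like this has to run before any ZERO_WIDTH_JOINER removal; once the joiner is stripped, the old-style chillu can no longer be told apart from a plain consonant + virama.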

Tests

normalizer = MalayalamNormalizer()

TEST_RESULT = "എന്റെ കമ്പ്യൂട്ടറിന് എന്റെ ഭാഷ."
text_result = normalizer("എന്റെ കമ്പ്യൂട്ടറിനു് എന്റെ ഭാഷ.")

assert text_result == TEST_RESULT

TESTCASE_RESULT = "യുപിഎ ഭരണകാലത്തെ സാമ്പത്തിക വീഴ്ച; ധവളപത്രം ഇറക്കാൻ കേന്ദ്രസർക്കാർ.\n\nയുപിഎ സർക്കാരിന്റെ കാലത്തെ ധനവിനിയോഗത്തിലെ വീഴ്ചകൾ വ്യക്തമാക്കുന്ന ധവളപത്രം ഇറക്കാൻ കേന്ദ്രസർക്കാർ തീരുമാനം. ബജറ്റ് സമ്മേളനം ഇതിനായി ഒരു ദിവസം കൂടി നീട്ടും. വിഹിതങ്ങൾ എപ്രകാരം തെറ്റായി വിനിയോഗിക്കപ്പെട്ടു എന്നതുൾപ്പെടെയുള്ള കാര്യങ്ങൾ വിശദീകരിക്കാനാണ് കേന്ദ്രസർക്കാർ നീക്കം.\n\n"
text_result = normalizer(
    """യുപിഎ ഭരണകാലത്തെ സാമ്പത്തിക വീഴ്ച; ധവളപത്രം ഇറക്കാൻ കേന്ദ്രസർക്കാർ.

യുപിഎ സർക്കാരിന്റെ കാലത്തെ ധനവിനിയോഗത്തിലെ വീഴ്ചകൾ വ്യക്തമാക്കുന്ന ധവളപത്രം ഇറക്കാൻ കേന്ദ്രസർക്കാർ തീരുമാനം. ബജറ്റ് സമ്മേളനം ഇതിനായി ഒരു ദിവസം കൂടി നീട്ടും. വിഹിതങ്ങൾ എപ്രകാരം തെറ്റായി വിനിയോഗിക്കപ്പെട്ടു എന്നതുൾപ്പെടെയുള്ള കാര്യങ്ങൾ വിശദീകരിക്കാനാണ് കേന്ദ്രസർക്കാർ നീക്കം.

"""
)
text_result

'യുപിഎ ഭരണകാലത്തെ സാമ്പത്തിക വീഴ്ച; ധവളപത്രം ഇറക്കാൻ കേന്ദ്രസർക്കാർ.\n\nയുപിഎ സർക്കാരിന്റെ കാലത്തെ ധനവിനിയോഗത്തിലെ വീഴ്ചകൾ വ്യക്തമാക്കുന്ന ധവളപത്രം ഇറക്കാൻ കേന്ദ്രസർക്കാർ തീരുമാനം. ബജറ്റ് സമ്മേളനം ഇതിനായി ഒരു ദിവസം കൂടി നീട്ടും. വിഹിതങ്ങൾ എപ്രകാരം തെറ്റായി വിനിയോഗിക്കപ്പെട്ടു എന്നതുൾപ്പെടെയുള്ള കാര്യങ്ങൾ വിശദീകരിക്കാനാണ് കേന്ദ്രസർക്കാർ നീക്കം.\n\n'

assert text_result == TESTCASE_RESULT

normalizer = MalayalamNormalizer(tts_mode=True)
malayalam_text = "ഇന്ത്യയുടെ ജിഡിപി 1.5 ട്രില്യൺ യുഎസ് ഡോളറാണ്."
TESTCASE_RESULT = "ഇന്ത്യയുടെ ജിഡിപി ഒന്ന് പോയിന്റ് അഞ്ച് ട്രില്യൺ യുഎസ് ഡോളറാണ്."
text_result = normalizer(malayalam_text)
text_result

'ഇന്ത്യയുടെ ജിഡിപി ഒന്ന് പോയിന്റ് അഞ്ച് ട്രില്യൺ യുഎസ് ഡോളറാണ്.'

assert text_result == TESTCASE_RESULT

TESTCASE_RESULT = "ദുഃഖം"
text_result = normalizer("ദു:ഖം")
text_result

'ദുഃഖം'

assert text_result == TESTCASE_RESULT

TESTCASE_RESULT = "എന്റെ"
text_result = normalizer("എൻറെ")
text_result

# Still fails

'എൻറെ'

normalizer(
    "1000 രൂപ കൊടുത്തു. അയാൾ 500 ഡോളറിനും 20 ഡോളറിനും പലചരക്ക് സാധനങ്ങൾ വാങ്ങി. വെബ്സൈറ്റ്: www.amazon.in."
)

'ആയിരം രൂപ കൊടുത്തു. അയാൾ അഞ്ഞൂറ് ഡോളറിനും ഇരുപത് ഡോളറിനും പലചരക്ക് സാധനങ്ങൾ വാങ്ങി. വെബ്സൈറ്റ്ഃ w w w ഡോട്ട് a m a z o n ഡോട്ട് i n.'

# TESTCASE_RESULT  = "കാണ്മാനില്ല"
# text_result = normalizer("കാണ്മാനില്ല")
# text_result

# Still fails