WhisperNormalizer Base Module

OpenAI’s non-english basic text normalization module

What does this module do?

As per the text normalization/standardization approach mentioned in Appendix Section C pp.21 in the paper Robust Speech Recognition via Large-Scale Weak Supervision. The BasicTextNormalizer does the following functionality:

  1. Remove any phrases between matching brackets ([, ]).
  2. Remove any phrases between matching parentheses ((, )).
  3. Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each character in the NFKC-normalized string starts with M, S, or P.
  4. make the text lowercase.
  5. replace any successive whitespace characters with a space

source

remove_symbols

 remove_symbols (s:str)

Replace any other markers, symbols, punctuations with a space, keeping diacritics


source

remove_symbols_and_diacritics

 remove_symbols_and_diacritics (s:str, keep='')

Replace any other markers, symbols, and punctuations with a space, and drop any diacritics (category ‘Mn’ and some manual mappings)

# #| export
# add_docs(BasicTextNormalizer, "Initialize BasicTextNormalizer",
#          # remove_diacritics="Replace any other markers, symbols, and punctuations with a space and drop any diacritics",
#          # split_letters="It uses a regular expression \X to find all Unicode graphemes (extended grapheme clusters) in the string s and join them together by space",
#          __call__="Call string s and apply normalizer with `BasicTextNormalizer`"
#         )

Testing Basic Normalizer

normalizer = BasicTextNormalizer()
normalizer("എന്റെ കമ്പ്യൂട്ടറിനു് എന്റെ ഭാഷ")
'എന റ കമ പ യ ട ടറ ന എന റ ഭ ഷ'
article_text = """Language is like a map that we use to navigate the world, but it’s also like a prison that keeps us from seeing what’s beyond the walls.

But what if there was a way to break out of this prison, to expand our map, to explore new worlds with new words? This is the possibility and the challenge offered by instruction tuned language models like GPT 4, a cutting-edge technology that uses artificial neural networks to generate natural language texts based on user inputs.

GPT 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want. It can even write things that don’t exist yet, things that no human has ever thought of or said before.

As Wittgenstein’s quote suggests, language is a source of limitation and liberation. GPT 4 pushes this idea to the extreme by giving us access to unlimited language.

This could be the most significant new technology in modern history because it has the potential to change many domains and industries. From education to entertainment, from journalism to justice, from science to art, these models could enable new forms of learning, storytelling, reporting, reasoning, discovery, and creation.

They could also create new ethical, social, and cultural challenges that require careful reflection and regulation. How we use this technology will depend on how we recognize its implications for ourselves and others.

This technology is a form of “Artificial Intelligence”. The word “intelligence” derives from inter- (“between”) and legere (“to choose, pick out, read”). To be intelligent, then, is to be able to choose between things, to pick out what matters, to read what is written. Intelligence is not just a quantity or a quality; it is an activity, a process, a practice. It is something that we do with our minds and our words.

But when we let GPT 4 do this for us, are we not abdicating our intelligence? Are we not letting go of our ability to choose, to pick out, to read? Are we not becoming passive consumers of language instead of active producers?
"""
normalizer(article_text)
'language is like a map that we use to navigate the world but it s also like a prison that keeps us from seeing what s beyond the walls but what if there was a way to break out of this prison to expand our map to explore new worlds with new words this is the possibility and the challenge offered by instruction tuned language models like gpt 4 a cutting edge technology that uses artificial neural networks to generate natural language texts based on user inputs gpt 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want it can even write things that don t exist yet things that no human has ever thought of or said before as wittgenstein s quote suggests language is a source of limitation and liberation gpt 4 pushes this idea to the extreme by giving us access to unlimited language this could be the most significant new technology in modern history because it has the potential to change many domains and industries from education to entertainment from journalism to justice from science to art these models could enable new forms of learning storytelling reporting reasoning discovery and creation they could also create new ethical social and cultural challenges that require careful reflection and regulation how we use this technology will depend on how we recognize its implications for ourselves and others this technology is a form of artificial intelligence the word intelligence derives from inter and legere to be intelligent then is to be able to choose between things to pick out what matters to read what is written intelligence is not just a quantity or a quality it is an activity a process a practice it is something that we do with our minds and our words but when we let gpt 4 do this for us are we not abdicating our intelligence are we not letting go of our ability to choose to pick out to read are we not becoming passive consumers of language instead of active producers '
normalizer = BasicTextNormalizer(remove_diacritics=True)

article_text = """Language is like a map that we use to navigate the world, but it’s also like a prison that keeps us from seeing what’s beyond the walls.

But what if there was a way to break out of this prison, to expand our map, to explore new worlds with new words? This is the possibility and the challenge offered by instruction tuned language models like GPT 4, a cutting-edge technology that uses artificial neural networks to generate natural language texts based on user inputs.

GPT 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want. It can even write things that don’t exist yet, things that no human has ever thought of or said before.

As Wittgenstein’s quote suggests, language is a source of limitation and liberation. GPT 4 pushes this idea to the extreme by giving us access to unlimited language.

This could be the most significant new technology in modern history because it has the potential to change many domains and industries. From education to entertainment, from journalism to justice, from science to art, these models could enable new forms of learning, storytelling, reporting, reasoning, discovery, and creation.

They could also create new ethical, social, and cultural challenges that require careful reflection and regulation. How we use this technology will depend on how we recognize its implications for ourselves and others.

This technology is a form of “Artificial Intelligence”. The word “intelligence” derives from inter- (“between”) and legere (“to choose, pick out, read”). To be intelligent, then, is to be able to choose between things, to pick out what matters, to read what is written. Intelligence is not just a quantity or a quality; it is an activity, a process, a practice. It is something that we do with our minds and our words.

But when we let GPT 4 do this for us, are we not abdicating our intelligence? Are we not letting go of our ability to choose, to pick out, to read? Are we not becoming passive consumers of language instead of active producers?
"""
normalizer(article_text)
'language is like a map that we use to navigate the world but it s also like a prison that keeps us from seeing what s beyond the walls but what if there was a way to break out of this prison to expand our map to explore new worlds with new words this is the possibility and the challenge offered by instruction tuned language models like gpt 4 a cutting edge technology that uses artificial neural networks to generate natural language texts based on user inputs gpt 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want it can even write things that don t exist yet things that no human has ever thought of or said before as wittgenstein s quote suggests language is a source of limitation and liberation gpt 4 pushes this idea to the extreme by giving us access to unlimited language this could be the most significant new technology in modern history because it has the potential to change many domains and industries from education to entertainment from journalism to justice from science to art these models could enable new forms of learning storytelling reporting reasoning discovery and creation they could also create new ethical social and cultural challenges that require careful reflection and regulation how we use this technology will depend on how we recognize its implications for ourselves and others this technology is a form of artificial intelligence the word intelligence derives from inter and legere to be intelligent then is to be able to choose between things to pick out what matters to read what is written intelligence is not just a quantity or a quality it is an activity a process a practice it is something that we do with our minds and our words but when we let gpt 4 do this for us are we not abdicating our intelligence are we not letting go of our ability to choose to pick out to read are we not becoming passive consumers of language instead of active producers '