WhisperNormalizer English Module

OpenAI’s English text standardisation module

What does this module do?

This module follows the text normalization (standardization) approach described in Appendix C (p. 21) of the paper Robust Speech Recognition via Large-Scale Weak Supervision. The EnglishTextNormalizer performs the following steps:

  1. Remove any phrases between matching brackets ([, ]).
  2. Remove any phrases between matching parentheses ((, )).
  3. Remove any of the following words: hmm, mm, mhm, mmm, uh, um
  4. Remove whitespace characters that come before an apostrophe ’
  5. Convert standard or informal contracted forms of English into the original form.
  6. Remove commas (,) between digits
  7. Remove periods (.) not followed by numbers
  8. Remove symbols as well as diacritics from the text, where symbols are the characters with the Unicode category starting with M, S, or P, except period, percent, and currency symbols that may be detected in the next step.
  9. Detect any numeric expressions of numbers and currencies and replace with a form using Arabic numbers, e.g. “Ten thousand dollars” → “$10000”.
  10. Convert British spellings into American spellings.
  11. Remove remaining symbols that are not part of any numeric expressions.
  12. Replace any successive whitespace characters with a space.
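As a rough illustration of how a few of these steps compose, here is a minimal standard-library sketch covering only steps 1, 2, 3, 6 and 12. It is not the module's implementation: the function name `simple_normalize` and the regular expressions are assumptions made for this example, and lowercasing is included only because the example outputs on this page are lowercase.

```python
import re

# Filler words from step 3, copied from the list above.
FILLERS = r"\b(hmm|mm|mhm|mmm|uh|um)\b"


def simple_normalize(text: str) -> str:
    """Illustrative subset of the steps above (1, 2, 3, 6, 12); not the library code."""
    text = text.lower()                           # assumption: outputs on this page are lowercase
    text = re.sub(r"\[[^\]]*\]", "", text)        # step 1: drop [bracketed] phrases
    text = re.sub(r"\([^)]*\)", "", text)         # step 2: drop (parenthesised) phrases
    text = re.sub(FILLERS, "", text)              # step 3: drop filler words
    text = re.sub(r"(?<=\d),(?=\d)", "", text)    # step 6: drop commas between digits
    text = re.sub(r"\s+", " ", text)              # step 12: collapse runs of whitespace
    return text.strip()


print(simple_normalize("Um I paid 1,200 dollars [laughs] (roughly)  yesterday"))
# i paid 1200 dollars yesterday
```

The order matters in this sketch: whitespace is collapsed last so that gaps left by the removed phrases and filler words do not survive.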


EnglishNumberNormalizer

 EnglishNumberNormalizer ()

Converts spelled-out numbers into Arabic numerals, while also:

  • removing commas
  • keeping suffixes such as 1960s, 274th, and 32nd
  • spelling out currency symbols after the number, e.g. $20 million -> 20000000 dollars
  • spelling out one and ones
  • interpreting successive single-digit numbers as nominal, e.g. one oh one -> 101
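The last rule can be pictured with a small sketch. This is only an illustration of the idea, not the EnglishNumberNormalizer code: the table `DIGIT_WORDS` and the function `join_digit_runs` are hypothetical names, the input is assumed to be lowercase, and a lone digit word is deliberately left untouched here.

```python
# Hypothetical word table for this sketch; "oh" is read as zero.
DIGIT_WORDS = {
    "oh": "0", "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def join_digit_runs(text: str) -> str:
    """Replace runs of two or more single-digit words with the digits joined together."""
    out, run = [], []

    def flush():
        # A run of length 2+ is read as a nominal number; shorter runs are kept as words.
        if len(run) >= 2:
            out.append("".join(DIGIT_WORDS[w] for w in run))
        else:
            out.extend(run)
        run.clear()

    for word in text.split():
        if word in DIGIT_WORDS:
            run.append(word)
        else:
            flush()
            out.append(word)
    flush()
    return " ".join(out)


print(join_digit_runs("room one oh one is on floor two"))
# room 101 is on floor two
```

Requiring at least two digit words in a row is a choice made for this sketch so that an isolated "oh" or "one" is not rewritten.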


EnglishSpellingNormalizer

 EnglishSpellingNormalizer ()

Applies British-American spelling mappings as listed in [1].

[1] https://www.tysto.com/uk-us-spelling-list.html

from whisper_normalizer.english import EnglishSpellingNormalizer

n = EnglishSpellingNormalizer()
n("accessorise")
'accessorize'
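Conceptually, a spelling normalizer of this kind can be thought of as a word-by-word dictionary lookup. The sketch below assumes that model; the three-entry table `UK_TO_US` and the function `americanize` are hypothetical and stand in for the much larger mapping listed in [1].

```python
# Tiny illustrative mapping; the real table is far larger (see [1]).
UK_TO_US = {
    "accessorise": "accessorize",
    "colour": "color",
    "centre": "center",
}


def americanize(text: str) -> str:
    """Replace each whitespace-separated word that has a known American spelling."""
    return " ".join(UK_TO_US.get(word, word) for word in text.split())


print(americanize("accessorise the colour wheel"))
# accessorize the color wheel
```

Words that are not in the table pass through unchanged, which matches the behaviour one would expect for text that is already in American spelling.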


EnglishTextNormalizer

 EnglishTextNormalizer ()

Applies all the rules for normalizing English text described in the OpenAI Whisper paper (Robust Speech Recognition via Large-Scale Weak Supervision, Appendix C, p. 21). These are the twelve steps listed under “What does this module do?” above.

Testing EnglishTextNormalizer

from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
normalizer("I'm a little teapot, short and stout. Tip me over and pour me out!")
'i am a little teapot short and stout tip me over and pour me out'
article_text = """Language is like a map that we use to navigate the world, but it’s also like a prison that keeps us from seeing what’s beyond the walls.

But what if there was a way to break out of this prison, to expand our map, to explore new worlds with new words? This is the possibility and the challenge offered by instruction tuned language models like GPT 4, a cutting-edge technology that uses artificial neural networks to generate natural language texts based on user inputs.

GPT 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want. It can even write things that don’t exist yet, things that no human has ever thought of or said before.

As Wittgenstein’s quote suggests, language is a source of limitation and liberation. GPT 4 pushes this idea to the extreme by giving us access to unlimited language.

This could be the most significant new technology in modern history because it has the potential to change many domains and industries. From education to entertainment, from journalism to justice, from science to art, these models could enable new forms of learning, storytelling, reporting, reasoning, discovery, and creation.

They could also create new ethical, social, and cultural challenges that require careful reflection and regulation. How we use this technology will depend on how we recognize its implications for ourselves and others.

This technology is a form of “Artificial Intelligence”. The word “intelligence” derives from inter- (“between”) and legere (“to choose, pick out, read”). To be intelligent, then, is to be able to choose between things, to pick out what matters, to read what is written. Intelligence is not just a quantity or a quality; it is an activity, a process, a practice. It is something that we do with our minds and our words.

But when we let GPT 4 do this for us, are we not abdicating our intelligence? Are we not letting go of our ability to choose, to pick out, to read? Are we not becoming passive consumers of language instead of active producers?
"""
normalizer(article_text)
'language is like a map that we use to navigate the world but it s also like a prison that keeps us from seeing what s beyond the walls but what if there was a way to break out of this prison to expand our map to explore new worlds with new words this is the possibility and the challenge offered by instruction tuned language models like gpt 4 a cutting edge technology that uses artificial neural networks to generate natural language texts based on user inputs gpt 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want it can even write things that don t exist yet things that no human has ever thought of or said before as wittgenstein s quote suggests language is a source of limitation and liberation gpt 4 pushes this idea to the extreme by giving us access to unlimited language this could be the most significant new technology in modern history because it has the potential to change many domains and industries from education to entertainment from journalism to justice from science to art these models could enable new forms of learning storytelling reporting reasoning discovery and creation they could also create new ethical social and cultural challenges that require careful reflection and regulation how we use this technology will depend on how we recognize its implications for ourselves and others this technology is a form of artificial intelligence the word intelligence derives from inter and legere to be intelligent then is to be able to choose between things to pick out what matters to read what is written intelligence is not just a quantity or a quality it is an activity a process a practice it is something that we do with our minds and our words but when we let gpt 4 do this for us are we not abdicating our intelligence are we not letting go of our ability to choose to pick out to read are we not becoming passive consumers of language instead of active producers'
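In the output above, it’s, what’s and don’t come out as "it s", "what s" and "don t", whereas the straight-apostrophe I'm in the first test is expanded to "i am". One plausible reading is that the typographic apostrophe (’) is removed as a symbol before any contraction can be matched. If that is not the behaviour you want, a pre-processing helper along the following lines could be applied to the text before it is passed to the normalizer. The helper name `straighten_apostrophes` is hypothetical and is not part of this module.

```python
def straighten_apostrophes(text: str) -> str:
    """Map typographic single quotes (U+2019, U+2018) to the ASCII apostrophe."""
    return text.replace("\u2019", "'").replace("\u2018", "'")


print(straighten_apostrophes("it\u2019s also like a prison"))
# it's also like a prison
```

With the apostrophes straightened first, the contraction rule (step 5) at least sees the same character it handled in the teapot example.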