
NLP

Lexicon Normalization

  • Stemming
  • Lemmatization
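The difference between the two can be sketched in a few lines of plain Python. This is only an illustration, not a real stemmer or lemmatizer: stemming crudely chops suffixes by rule, while lemmatization looks a word up in a dictionary of base forms (the tiny lexicon below is hypothetical; real lemmatizers use a full dictionary such as WordNet).

```python
def naive_stem(word):
    # crude rule-based suffix stripping, in the spirit of Porter stemming
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# hypothetical mini-lexicon for illustration only
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse"}

def naive_lemmatize(word):
    # lemmatization maps a word to its dictionary base form
    return LEMMAS.get(word, word)

print(naive_stem("playing"))      # suffix stripped by rule
print(naive_lemmatize("better"))  # dictionary lookup
```

Note how the stemmer can only strip surface suffixes, while the lemmatizer can map an irregular form like "better" to "good".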

Bag-of-Words Model:

  • ignores the order of words
  • just counts the occurrence of each word

Convert text to word count vectors with CountVectorizer. It tokenizes a collection of text documents and builds a vocabulary of known words.
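A minimal sketch of the two steps CountVectorizer performs, written in plain Python (the documents here are made up for illustration): fit learns the vocabulary, transform counts each word per document.

```python
from collections import Counter

docs = ["the cat sat", "the cat ate the fish"]

# fit step: build a vocabulary of known words across all documents
vocab = sorted({w for d in docs for w in d.split()})

# transform step: one count vector per document, indexed by the vocabulary
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)    # ['ate', 'cat', 'fish', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Each document becomes a fixed-length vector over the shared vocabulary, with word order discarded, which is exactly the bag-of-words idea above.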

Convert text to word frequency vectors with TfidfVectorizer.

  • Term Frequency: This summarizes how often a given word appears within a document.
  • Inverse Document Frequency: This downscales words that appear a lot across documents.
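The two factors can be combined in a few lines. This is a simplified TF-IDF for illustration; scikit-learn's TfidfVectorizer additionally applies smoothing and L2 normalization, so its numbers will differ.

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
N = len(docs)

def tfidf(word, doc):
    tf = Counter(doc)[word] / len(doc)       # how often the word appears in this doc
    df = sum(1 for d in docs if word in d)   # how many docs contain the word
    idf = math.log(N / df)                   # downscales words common across docs
    return tf * idf

print(tfidf("the", docs[0]))  # appears in every doc, so idf = 0
print(tfidf("cat", docs[0]))  # rarer word, so it gets a positive weight
```

A word like "the" that appears in every document scores zero, while a word unique to one document keeps a positive weight.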

Convert text to unique integers with HashingVectorizer.

  • one way hash of words to convert them to integers
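The hashing trick can be sketched as follows (the bucket count and hash choice here are arbitrary assumptions for illustration): each word is mapped through a one-way hash to a fixed bucket index, so no vocabulary needs to be stored.

```python
import hashlib

N_BUCKETS = 16  # fixed vector size, chosen up front

def bucket(word):
    # one-way hash of a word to an integer bucket index
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BUCKETS

def hash_vector(doc):
    # count words directly into hash buckets, with no vocabulary dictionary
    vec = [0] * N_BUCKETS
    for word in doc.split():
        vec[bucket(word)] += 1
    return vec

print(hash_vector("the cat sat"))  # 16-length vector with three counts placed
```

The trade-off: memory is fixed and no fit step is needed, but the hash is one-way (you cannot recover the word from an index) and distinct words can collide in the same bucket.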