The Yelp data is one of the largest datasets we have ever worked with.
You can also check out the websites for details.
import pandas as pd
import nltk
from nltk.corpus import stopwords
import time
filename = 'train_data.csv'
t1 = time.time()
reviews = pd.read_csv(filename)   # time how long the load takes
time.time() - t1                  # elapsed seconds
cityname = reviews['city'].unique()   # distinct cities in the data
reviews.shape
reviews.head()
rev_200 = reviews.iloc[0:200]                 # work with the first 200 reviews
rev_200.to_csv("first200.csv", index=False)   # save the subset for later
rev_200.loc[0, ['stars', 'text']]
sentence = rev_200.loc[0, 'text']             # pick one review to experiment with
sentence
Regular Expressions!!! Check these websites!
import re
sen_let = re.sub('[^a-zA-Z]', ' ', sentence)   # keep letters only; everything else becomes a space
sen_let
The following paragraphs are from the "Ultimate Guide to Understand Language Processing Codes in Python".
Any piece of text which is not relevant to the context of the data and the end output can be considered noise.
For example – language stopwords (commonly used words of a language – is, am, the, of, in etc), URLs or links, social media entities (mentions, hashtags), punctuations and industry specific words. This step deals with removal of all types of noisy entities present in the text.
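The paragraph above also mentions URLs and social media entities, which the stopword step below does not touch. Here is a minimal sketch of stripping them with regular expressions; the patterns and the strip_social_noise name are just an illustration, not a complete cleaner.
# Minimal sketch: strip URLs, @mentions and #hashtags before the stopword step.
# The patterns and the function name are illustrative only.
import re

def strip_social_noise(text):
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs / links
    text = re.sub(r"[@#]\w+", " ", text)            # mentions and hashtags
    return text

strip_social_noise("Great brunch! http://yelp.com @friend #foodie")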
I used the stopwords from nltk.corpus; you can run nltk.download('stopwords')
to get them. Also notice that all the stopwords in "english" are lowercase, so we should also convert the original sentence to lower case for this step.
stops = set(stopwords.words("english"))
stops
sen_new = [w for w in sen_let.lower().split() if w not in stops]   # drop stopwords
print(" ".join(sen_new))
print("\noriginal sentence:\n")
print(sen_let)
Let's wrap it into one function!
def noise_remove(review, stops):
    review = re.sub("[^a-zA-Z]", " ", review)              # keep letters only
    review = review.lower().split()                        # lowercase and tokenize
    useful_review = [w for w in review if w not in stops]  # drop stopwords
    return " ".join(useful_review)
noise_remove(sen_let, stops)
Another type of textual noise is the multiple representations exhibited by a single word.
For example – “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they mean different things, contextually they are all similar. This step converts all the disparities of a word into their normalized form (also known as lemma). Normalization is a pivotal step for feature engineering with text as it converts the high dimensional features (N different features) to the low dimensional space (1 feature), which is an ideal ask for any ML model.
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()
word = "multiplying"
print(lem.lemmatize(word, "v"))   # lemmatize as a verb -> "multiply"
print(stem.stem(word))            # Porter stemmer -> "multipli"
" ".join([stem.stem(w) for w in sen_new]) # It becomes wird that I can not read it.... Not sure if you need it
A good data scientist should also do their own research and think more at this step; that insight is personal value which other people cannot easily steal.
To analyse preprocessed data, it needs to be converted into features. Depending upon the usage, text features can be constructed using assorted techniques – Syntactical Parsing, Entities / N-grams / word-based features, Statistical features, and word embeddings. Read on to understand these techniques in detail.
Syntactical parsing involves the analysis of words in the sentence for grammar and their arrangement in a manner that shows the relationships among the words. Dependency Grammar and Part of Speech tags are the important attributes of text syntactics.
See StanfordNLP for dependency parsing and POS tagging.
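If you just want quick POS tags without setting up StanfordNLP, NLTK's built-in tagger is enough for a sketch; it assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') have been run.
# Quick POS-tagging sketch with NLTK's built-in tagger (lighter than StanfordNLP).
# Assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
from nltk import pos_tag, word_tokenize
tokens = word_tokenize(sentence)
pos_tag(tokens)[:10]   # list of (word, tag) pairs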
Entities are defined as the most important chunks of a sentence – noun phrases, verb phrases or both. Entity Detection algorithms are generally ensemble models of rule based parsing, dictionary lookups, pos tagging and dependency parsing. The applicability of entity detection can be seen in the automated chat bots, content analyzers and consumer insights.
Maybe the most important part?
These links (link1, link2) will be helpful once you realize you need them...
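For a taste of entity detection without any external service, NLTK's ne_chunk can be chained after POS tagging; this is only a rough sketch and assumes nltk.download('maxent_ne_chunker') and nltk.download('words') have been run.
# Rough entity-detection sketch with NLTK's chunker.
# Assumes nltk.download('maxent_ne_chunker') and nltk.download('words').
from nltk import ne_chunk, pos_tag, word_tokenize
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
[(" ".join(tok for tok, tag in subtree), subtree.label())
 for subtree in tree if hasattr(subtree, "label")]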
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
# CountVectorizer expects an iterable of documents, so here every word in
# sen_new is treated as its own tiny document; pass [" ".join(sen_new)]
# instead to vectorize the whole sentence as a single document.
data = vec.fit_transform(sen_new)
print(vec.vocabulary_)
print(data.toarray())
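To go from one sentence to the whole subset, one possible next step is to clean every review in rev_200 with noise_remove and build TF-IDF features over unigrams and bigrams; the vectorizer settings below are just one reasonable choice, not the only one.
# Sketch: clean all 200 reviews and build TF-IDF unigram/bigram features.
# ngram_range and min_df are illustrative choices.
from sklearn.feature_extraction.text import TfidfVectorizer

clean_reviews = [noise_remove(r, stops) for r in rev_200['text']]
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
features = tfidf.fit_transform(clean_reviews)
features.shape   # (200, number of n-grams kept)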