Yelp Data Exploration

The Yelp dataset is one of the largest datasets we have ever worked with.

You can also check out the linked websites for details.

In [3]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
import time
In [4]:
filename = 'train_data.csv'
In [5]:
t1 = time.time()
reviews = pd.read_csv(filename)  # ~1.5M reviews, so the read takes a while
time.time() - t1
Out[5]:
13.957161903381348
In [6]:
cityname = reviews['city'].unique()  # distinct cities covered by the reviews
In [7]:
reviews.shape
Out[7]:
(1546379, 8)
In [8]:
reviews.head()
Out[8]:
stars name text date city longitude latitude categories
0 1 McDonald's Seriously cannot stand this McDonald's. They N... 2014-12-29 Glendale -112.205020 33.509597 ['Burgers', 'Fast Food', 'Restaurants']
1 5 Tom Colicchio's Craftsteak Amazing food, truly excellent best lobster bis... 2013-03-07 Las Vegas -115.169751 36.102918 ['Steakhouses', 'Restaurants', 'Cheesesteaks',...
2 5 Fishman Lobster Clubhouse Restaurant This was my second time here, and the seafood ... 2015-11-24 Toronto -79.300795 43.824234 ['Seafood', 'Restaurants', 'Chinese', 'Live/Ra...
3 1 Bonjour Brioche Long story short.\n\nBunch of rude, heartless,... 2016-12-20 Toronto -79.346287 43.659795 ['Breakfast & Brunch', 'French', 'Restaurants']
4 4 Dilly's Deli We grabbed some dinner here last night before ... 2010-09-28 Tempe -111.945365 33.422175 ['Caterers', 'Sandwiches', 'Event Planning & S...

Write the first 200 rows of the data frame to CSV.

In [9]:
rev_200 = reviews.iloc[0:200]  # keep the first 200 reviews for quick experiments
rev_200.to_csv("first200.csv", index=False)

Check one example

In [10]:
rev_200.loc[0, ['stars', 'text']]
Out[10]:
stars                                                    1
text     Seriously cannot stand this McDonald's. They N...
Name: 0, dtype: object
In [11]:
sentence = rev_200.loc[0, 'text']
sentence
Out[11]:
"Seriously cannot stand this McDonald's. They NEVER get my order right. Food almost always sucks! Service is sorry! The employees sure do show they hate their jobs in the way they perform at work! I used to work at McDonald's as a teen getting through high school, don't remember that McDonald's ever taking so long in a drive-thru or being bitter about the fact I worked there! Lol And to top it all off, this is the McDonald's I live right across the street from :( so I find myself here more often than not! ugh! Get it together people! You guys are terrible! I honestly don't have anything nice to say about this place... I usually wouldn't review a place just to talk mess, but I waited at the drive-thru for 15 minutes this morning to get two coffees!!! Standards back when I worked there was a minute or less for each car!"

What is the problem?

  1. Do we need symbols? In this example, maybe yes: ':(' carries negative sentiment.
  2. Will keeping them make processing harder? Probably.
  3. How do we deal with them?

Regular expressions! Check the linked websites for reference.

In [12]:
import re
In [13]:
sen_let = re.sub('[^a-zA-Z]', ' ', sentence)  # replace every non-letter character with a space
sen_let
Out[13]:
'Seriously cannot stand this McDonald s  They NEVER get my order right  Food almost always sucks  Service is sorry  The employees sure do show they hate their jobs in the way they perform at work  I used to work at McDonald s as a teen getting through high school  don t remember that McDonald s ever taking so long in a drive thru or being bitter about the fact I worked there  Lol And to top it all off  this is the McDonald s I live right across the street from    so I find myself here more often than not  ugh  Get it together people  You guys are terrible  I honestly don t have anything nice to say about this place    I usually wouldn t review a place just to talk mess  but I waited at the drive thru for    minutes this morning to get two coffees    Standards back when I worked there was a minute or less for each car '

Text Processing

  • Tokenization – the process of converting text into tokens (see the sketch after this list)
  • Tokens – the words or entities present in the text
  • Text object – a sentence, phrase, word, or article
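
A minimal tokenization sketch with NLTK (assuming the 'punkt' tokenizer models have been downloaded):

from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # one-time download of the tokenizer models
print(word_tokenize("Seriously cannot stand this McDonald's."))
# e.g. ['Seriously', 'cannot', 'stand', 'this', 'McDonald', "'s", '.']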

The following paragraphs are adapted from the "Ultimate guide to understand language processing codes in python".

Noise removal

Any piece of text that is not relevant to the context of the data and the end output can be treated as noise.

For example – language stopwords (commonly used words of a language – is, am, the, of, in, etc.), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words. This step deals with the removal of all types of noisy entities present in the text.

I used the stopwords from nltk.corpus; you can fetch them with nltk.download. Also notice that all the stopwords in the "english" list are lower case, so we should convert the original sentence to lower case before filtering.
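
If you have not fetched the corpus yet, the one-time setup is:

import nltk
nltk.download('stopwords')  # downloads the NLTK stopword lists (run once)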

In [14]:
stops = set(stopwords.words("english"))
stops
Out[14]:
{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 "shan't",
 'she',
 "she's",
 'should',
 "should've",
 'shouldn',
 "shouldn't",
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 "that'll",
 'the',
 'their',
 'theirs',
 'them',
 'themselves',
 'then',
 'there',
 'these',
 'they',
 'this',
 'those',
 'through',
 'to',
 'too',
 'under',
 'until',
 'up',
 've',
 'very',
 'was',
 'wasn',
 "wasn't",
 'we',
 'were',
 'weren',
 "weren't",
 'what',
 'when',
 'where',
 'which',
 'while',
 'who',
 'whom',
 'why',
 'will',
 'with',
 'won',
 "won't",
 'wouldn',
 "wouldn't",
 'y',
 'you',
 "you'd",
 "you'll",
 "you're",
 "you've",
 'your',
 'yours',
 'yourself',
 'yourselves'}
In [15]:
sen_new = [w for w in sen_let.lower().split() if w not in stops]  # drop stopwords
print(" ".join(sen_new))
print("\noriginal sentence:\n")
print(sen_let)
seriously cannot stand mcdonald never get order right food almost always sucks service sorry employees sure show hate jobs way perform work used work mcdonald teen getting high school remember mcdonald ever taking long drive thru bitter fact worked lol top mcdonald live right across street find often ugh get together people guys terrible honestly anything nice say place usually review place talk mess waited drive thru minutes morning get two coffees standards back worked minute less car

original sentence:

Seriously cannot stand this McDonald s  They NEVER get my order right  Food almost always sucks  Service is sorry  The employees sure do show they hate their jobs in the way they perform at work  I used to work at McDonald s as a teen getting through high school  don t remember that McDonald s ever taking so long in a drive thru or being bitter about the fact I worked there  Lol And to top it all off  this is the McDonald s I live right across the street from    so I find myself here more often than not  ugh  Get it together people  You guys are terrible  I honestly don t have anything nice to say about this place    I usually wouldn t review a place just to talk mess  but I waited at the drive thru for    minutes this morning to get two coffees    Standards back when I worked there was a minute or less for each car 

Let's wrap it into one function!

In [16]:
def noise_remove(review, stops):
    review = re.sub("[^a-zA-Z]", " ", review)  # keep letters only
    review = review.lower().split()  # lower-case and split on whitespace
    useful_review = [w for w in review if w not in stops]  # drop stopwords
    return " ".join(useful_review)
In [17]:
noise_remove(sentence, stops)  # the function handles the raw review end-to-end
Out[17]:
'seriously cannot stand mcdonald never get order right food almost always sucks service sorry employees sure show hate jobs way perform work used work mcdonald teen getting high school remember mcdonald ever taking long drive thru bitter fact worked lol top mcdonald live right across street find often ugh get together people guys terrible honestly anything nice say place usually review place talk mess waited drive thru minutes morning get two coffees standards back worked minute less car'

Lexicon Normalization

Another type of textual noise comes from the multiple representations exhibited by a single word.

For example – “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they mean different things, contextually they are all similar. This step converts all the disparities of a word into their normalized form (also known as the lemma). Normalization is a pivotal step for feature engineering with text, as it converts high-dimensional features (N different features) into a low-dimensional space (1 feature), which is ideal for any ML model.

  • Stemming: a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
  • Lemmatization: an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).
In [18]:
from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()
In [19]:
word = "multiplying" 
print(lem.lemmatize(word, "v"))
print(stem.stem(word))
multiply
multipli
In [20]:
" ".join([stem.stem(w) for w in sen_new]) # It becomes wird that I can not read it.... Not sure if you need it
Out[20]:
'serious cannot stand mcdonald never get order right food almost alway suck servic sorri employe sure show hate job way perform work use work mcdonald teen get high school rememb mcdonald ever take long drive thru bitter fact work lol top mcdonald live right across street find often ugh get togeth peopl guy terribl honestli anyth nice say place usual review place talk mess wait drive thru minut morn get two coffe standard back work minut less car'
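
For comparison, a minimal sketch that lemmatizes the same tokens; passing "v" treats every token as a verb, which is a simplification (a fuller pipeline would pick the POS tag per word):

" ".join([lem.lemmatize(w, "v") for w in sen_new])  # e.g. "taking" -> "take", "worked" -> "work"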

Feature Engineering

A good data scientist (or researcher) should think carefully here: thoughtful feature engineering is personal value that others cannot easily steal.

To analyse preprocessed data, it needs to be converted into features. Depending on the usage, text features can be constructed using assorted techniques – syntactical parsing, entities / n-grams / word-based features, statistical features, and word embeddings. Read on to understand these techniques in detail.

Syntactic Parsing

Syntactical parsing involves the analysis of words in a sentence for grammar, and their arrangement in a manner that shows the relationships among the words. Dependency grammar and part-of-speech tags are the important attributes of text syntactics.

  • Dependency Trees – StanfordNLP
  • Part-of-speech tagging: apart from the grammar relations, every word in a sentence is also associated with a part-of-speech (POS) tag (see the sketch after this list)
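
A minimal POS-tagging sketch with NLTK (assuming the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded):

import nltk

# nltk.download('averaged_perceptron_tagger')  # one-time download
tokens = nltk.word_tokenize("I waited at the drive thru for 15 minutes")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('waited', 'VBD'), ('at', 'IN'), ...]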

Entity Extraction

Entities are defined as the most important chunks of a sentence – noun phrases, verb phrases, or both. Entity detection algorithms are generally ensemble models of rule-based parsing, dictionary lookups, POS tagging, and dependency parsing. The applicability of entity detection can be seen in automated chatbots, content analyzers, and consumer insights.

  • Named Entity Recognition
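
A rough NER sketch with NLTK's chunker (assuming the 'maxent_ne_chunker' and 'words' resources have been downloaded); this is a baseline, not a production-grade system:

import nltk

# nltk.download('maxent_ne_chunker'); nltk.download('words')  # one-time downloads
tagged = nltk.pos_tag(nltk.word_tokenize("Tom Colicchio's Craftsteak is in Las Vegas"))
print(nltk.ne_chunk(tagged))  # named entities (e.g. PERSON, GPE) appear as labeled subtrees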

Statistical Features

Word Embedding (text vectors)

Maybe the most important part?

These links (link1, link2) will be helpful when you get to implementing this.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
In [22]:
vec = CountVectorizer()
data = vec.fit_transform(sen_new)  # note: sen_new is a list of words, so each token becomes its own "document"
In [23]:
print(vec.vocabulary_)
{'seriously': 43, 'cannot': 6, 'stand': 47, 'mcdonald': 26, 'never': 31, 'get': 15, 'order': 34, 'right': 40, 'food': 14, 'almost': 1, 'always': 2, 'sucks': 50, 'service': 44, 'sorry': 46, 'employees': 10, 'sure': 51, 'show': 45, 'hate': 18, 'jobs': 21, 'way': 64, 'perform': 36, 'work': 65, 'used': 61, 'teen': 54, 'getting': 16, 'high': 19, 'school': 42, 'remember': 38, 'ever': 11, 'taking': 52, 'long': 25, 'drive': 9, 'thru': 56, 'bitter': 5, 'fact': 12, 'worked': 66, 'lol': 24, 'top': 58, 'live': 23, 'across': 0, 'street': 49, 'find': 13, 'often': 33, 'ugh': 60, 'together': 57, 'people': 35, 'guys': 17, 'terrible': 55, 'honestly': 20, 'anything': 3, 'nice': 32, 'say': 41, 'place': 37, 'usually': 62, 'review': 39, 'talk': 53, 'mess': 27, 'waited': 63, 'minutes': 29, 'morning': 30, 'two': 59, 'coffees': 8, 'standards': 48, 'back': 4, 'minute': 28, 'less': 22, 'car': 7}
In [25]:
print(data.toarray())
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
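
As noted above, each token in sen_new was treated as its own tiny "document", which is why the rows are (near) one-hot. More typically you fit on whole texts; a minimal sketch with two made-up mini-reviews:

corpus = [
    "the food was terrible and the service slow",
    "amazing food and excellent service",
]
vec2 = CountVectorizer()
counts = vec2.fit_transform(corpus)  # one row per review, one column per vocabulary word
print(vec2.vocabulary_)  # word -> column index
print(counts.toarray())  # raw counts per review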

Interview Questions

  1. We play a game: I pick a number $n$ from 1 to 100. If you guess correctly, I pay you $n$ yuan, and zero otherwise. How much would you pay to play this game?
  2. Suppose you have a fair coin. You start with a dollar; if you toss a head (H), your position doubles, and if you toss a tail (T), your position halves. What is the expected value of your money if you toss the coin infinitely many times?
  3. Suppose we toss a fair coin, and let $N$ denote the number of tosses until we get a head (including the final toss). What are $E(N)$ and $Var(N)$?
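
As a quick check on question 3: $N$ follows a geometric distribution with success probability $p = 1/2$, so $E(N) = 1/p = 2$ and $Var(N) = (1-p)/p^2 = 2$.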