Discussion 2

Packages

Python:
- Pandas: similar to R's data.frame. See "10 Minutes to pandas".
- Matplotlib: plotting for Python.
- ggplot for Python. Keep in mind when it was last updated.
- Plotly: please take a look.
R: see R for Data Science (the chapter numbers below refer to this book) and the Cheat Sheet
- tidyverse: The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
  - dplyr: dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. (Chapter 5)
  - ggplot2: ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
  - tidyr: The goal of tidyr is to help you create tidy data. tidyr replaces reshape2 (2010-2014) and reshape (2005-2010). (Chapter 12)
  - readr: The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. (Chapter 11)
  - stringr: Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible. (Chapter 14)
  - forcats: R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors. (Chapter 15)
- sparklyr: connects Spark to R. May be useful for machine learning with Spark.
- Plotly/R: Plotly's R graphing library makes interactive, publication-quality graphs online. Its documentation has examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, and 3D (WebGL based) charts.

Examples

Python

Pandas

import sys
import pandas as pd
import numpy as np
print(sys.version)

3.6.3 |Anaconda, Inc.| (default, Oct  6 2017, 12:07:11) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]

df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
print(df2)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
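For readers coming from dplyr, the sketch below shows rough pandas counterparts of a few common verbs (filter, select, mutate, group_by plus summarise) on a frame with the same columns as df2. The derived column name `A_plus_D` is made up for illustration, and this is only one of several ways to write each step.

```python
import numpy as np
import pandas as pd

# Rebuild a frame with the same columns as df2 above.
df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# Column types, roughly what str() reports in R.
print(df2.dtypes)

# filter(): keep rows where E == "train".
train = df2[df2["E"] == "train"]

# select(): keep a subset of columns.
subset = df2[["A", "D", "E"]]

# mutate(): add a derived column (hypothetical name A_plus_D).
with_sum = df2.assign(A_plus_D=df2["A"] + df2["D"])

# group_by() + summarise(): sum D within each level of E.
d_by_e = df2.groupby("E", observed=True)["D"].sum()

print(train)
print(with_sum[["A", "D", "A_plus_D"]])
print(d_by_e)
```

Each step returns a new object and leaves df2 unchanged, which is close to how dplyr verbs behave.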

R

dplyr

Check out Karl Broman's website, hipsteR.

Understand the pipe operator %>%

Which style do you prefer?

x <- rnorm(10)
round(exp(diff(log(x^2))),1)

[1] 34.3  0.2  4.5  0.6  0.1 76.5  0.2  0.6  0.1

library(dplyr)
x %>% .^2 %>% ## The dot (.) stands for the value piped in from the left.
  log(.) %>%
  diff() %>%
  exp() %>%
  round(1)

[1] 34.3  0.2  4.5  0.6  0.1 76.5  0.2  0.6  0.1
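Python has no built-in %>% operator, but the same left-to-right reading can be approximated. The sketch below uses a small hypothetical helper, `pipe`, which is not part of any library; it feeds a value through a sequence of functions in order. The input values are fixed rather than random so that the two styles can be compared directly.

```python
from functools import reduce

import numpy as np


def pipe(value, *funcs):
    """Pass value through funcs from left to right (hypothetical helper)."""
    return reduce(lambda acc, f: f(acc), funcs, value)


x = np.array([0.5, -1.0, 2.0, -0.25])

# Nested style: read from the inside out.
nested = np.round(np.exp(np.diff(np.log(x ** 2))), 1)

# Piped style: read from top to bottom.
piped = pipe(
    x,
    np.square,
    np.log,
    np.diff,
    np.exp,
    lambda v: np.round(v, 1),
)

print(nested)
print(piped)
```

pandas objects also provide a `.pipe` method for chaining functions in a similar spirit.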

stringr

I personally prefer Python to R for handling strings; see the Python strings cheat sheet and the Essential Python Cheat Sheet. Answers to most questions can be found by searching Google.

Introduction to stringr

R version

library(stringr)
x <- c("abcdef")
# String length
str_length(x)

[1] 6

# The 3rd letter
str_sub(x, 3, 3)

[1] "c"

# The 2nd to 2nd-to-last character
str_sub(x, 2, -2)

[1] "bcde"

Python version

I call Python's print function here only because this is R Markdown. In an interactive Python session, you don't need print.

x = "abcdef"
print(len(x))

6

print(x[2]) # Python indexing starts at 0, so this is the 3rd character

c

print(x[1:]) # from the 2nd character to the last

bcdef

print(x[1:-1]) # from the 2nd to the 2nd-to-last character

bcde

print(x[1:-2]) # from the 2nd to the 3rd-to-last character

bcd
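The two versions differ in indexing: stringr counts from 1 and includes the end position, while Python slices count from 0 and exclude the end. The helper below, `str_sub`, is a hypothetical Python function written only for this comparison (it is not a Python library function), and it handles only the cases used above: a positive start and a positive or negative end.

```python
def str_sub(s, start, end):
    """Mimic stringr's str_sub for start >= 1 and nonzero end (hypothetical helper).

    start and end are 1-based and inclusive; a negative end counts
    from the end of the string, so -1 is the last character.
    """
    if start < 1 or end == 0:
        raise ValueError("this sketch supports start >= 1 and end != 0 only")
    # Convert the 1-based inclusive end into a 0-based exclusive stop.
    stop = end if end > 0 else len(s) + end + 1
    return s[start - 1:stop]


x = "abcdef"
print(len(x))             # string length
print(str_sub(x, 3, 3))   # the 3rd character
print(str_sub(x, 2, -2))  # 2nd to 2nd-to-last character
print(str_sub(x, 2, -1))  # 2nd to last character
```

Writing the conversion out once makes the off-by-one differences between the two languages explicit.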

Plotly

Plotly is a really good plotting library (see Plotly for Jupyter notebooks and Plotly for knitr), but I don't have much experience with it. I encourage you to learn it.

library(plotly)
p <- plot_ly(economics, x = ~date, y = ~unemploy / pop)
p

p <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length, z = ~Sepal.Width, color = ~Species, colors = "Set1")
p

RColorBrewer

RColorBrewer is an R package that uses the work from http://colorbrewer2.org/ to help you choose sensible colour schemes for figures in R.

R-bloggers post on RColorBrewer