The post Create your Machine Learning library from scratch with R ! (3/5) – KNN appeared first on Enhance Data Science.
The K-nearest neighbors (KNN) algorithm is a simple yet efficient classification and regression method. KNN assumes that an observation will be similar to its K closest neighbors. For instance, if most of the neighbors of a given point belong to a given class, it seems reasonable to assume that the point belongs to that class as well.
Now, let’s quickly derive the mathematics used for KNN regression (they are similar for classification).
Let $x_1, \dots, x_n$ be the observations of our training dataset. The points are in $\mathbb{R}^p$. We denote $y_i$ the value of the variable we seek to estimate. We know its value for the training dataset.
Let $x$ be a new point in $\mathbb{R}^p$. We do not know $y$ and will estimate it using our training dataset.
Let $k$ be a positive and non-zero integer (the number of neighbors used for estimation). We want to select the $k$ points from the dataset which are the closest to $x$. To do so, we compute the Euclidean distance $d(x, x_i) = \lVert x - x_i \rVert_2$ for each $i$. From all these distances, we can compute $r_k(x)$, the smallest radius of the ball centered on $x$ which includes exactly $k$ points from the training sample.
An estimation of $y$ is now easy to construct: it is the mean of the $y_i$ of the $k$ closest points to $x$:

$$\hat{y} = \frac{1}{k} \sum_{i \,:\, \lVert x - x_i \rVert_2 \le r_k(x)} y_i$$
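As a language-agnostic illustration of the estimator above (the function and variable names are ours, not from the post), here is a minimal NumPy sketch:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=5):
    """Estimate y at x_new as the mean y of the k nearest training points."""
    # Squared Euclidean distance from x_new to every training point
    d = np.sum((X_train - x_new) ** 2, axis=1)
    # Indices of the k smallest distances (the k nearest neighbors)
    nearest = np.argsort(d)[:k]
    return y_train[nearest].mean()
```

The R implementation below follows the same logic, but vectorized over many query points at once.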
First, we build a “my_knn_regressor” object which stores all the training points, the value of the target variable and the number of neighbors to use.
###Nearest neighbors
my_knn_regressor = function(x, y, k = 5) {
  if (!is.matrix(x)) {
    x = as.matrix(x)
  }
  if (!is.matrix(y)) {
    y = as.matrix(y)
  }
  my_knn = list()
  my_knn[['points']] = x
  my_knn[['value']] = y
  my_knn[['k']] = k
  attr(my_knn, "class") = "my_knn_regressor"
  return(my_knn)
}
The tricky part of KNN is computing the distances efficiently. We will use the function we created in our previous post on vectorization, where the function and its mathematical derivation are detailed.
compute_pairwise_distance = function(X, Y) {
  xn = rowSums(X ** 2)
  yn = rowSums(Y ** 2)
  outer(xn, yn, '+') - 2 * tcrossprod(X, Y)
}
Now we can build our predictor:
predict.my_knn_regressor = function(my_knn, x, ...) {
  if (!is.matrix(x)) {
    x = as.matrix(x)
  }
  ##Compute the pairwise distances between the new points and the training points
  dist_pair = compute_pairwise_distance(x, my_knn[['points']])
  ##Rank the training points by distance: M[i,j] = 1 if x_i is one of the
  ##k closest training points to the j-th new point
  M = apply(dist_pair, 1, rank, ties.method = "first") <= my_knn[['k']]
  ##Sum the values of the k closest points and normalise by k
  crossprod(M, my_knn[['value']]) / my_knn[['k']]
}
The last line may seem complicated. apply(dist_pair, 1, rank) ranks the training points by their distance to each new point. Comparing these ranks to my_knn[['k']] selects the k closest points: the result is a one-hot matrix M where M[i,j] = 1 if x_i is one of the k closest points to the j-th new point, and 0 otherwise. Finally, crossprod(M, my_knn[['value']]) / my_knn[['k']] sums the values of the k closest points and normalises the sum by k.

The previous code can be reused as-is for binary classification. Your outcome should be encoded as a 0/1 variable. If the estimated output is greater (resp. less) than 0.5, you can assume that your point belongs to the class encoded as one (resp. zero). We will use the classical Iris dataset and classify the setosa versus the virginica species.
iris_class = iris[iris[["Species"]] != "versicolor", ]
iris_class[["Species"]] = iris_class[["Species"]] != "setosa"
knn_class = my_knn_regressor(iris_class[, 1:2], as.numeric(iris_class[, 5]))
predict(knn_class, iris_class[, 1:2])
Since we only used two variables, we can easily plot the decision boundaries on a 2D plot.
#Build a grid
x_coord = seq(min(iris_class[, 1]) - 0.2, max(iris_class[, 1]) + 0.2, length.out = 200)
y_coord = seq(min(iris_class[, 2]) - 0.2, max(iris_class[, 2]) + 0.2, length.out = 200)
coord = expand.grid(x = x_coord, y = y_coord)
#Predict probabilities on the grid
coord[['prob']] = predict(knn_class, coord[, 1:2])

library(ggplot2)
ggplot() +
  ##Add tiles colored according to the probabilities
  geom_tile(data = coord, mapping = aes(x, y, fill = prob)) +
  scale_fill_gradient(low = "lightblue", high = "red") +
  ##Add the observations
  geom_point(data = iris_class, mapping = aes(Sepal.Length, Sepal.Width, shape = Species), size = 3) +
  ##Add the labels to the plot
  xlab('Sepal length') + ylab('Sepal width') + ggtitle('Decision boundaries of KNN') +
  ##Remove the grey border around the tiles
  scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0))
And this gives us this cool plot:
Our current KNN is basic, but you can improve and test it in several ways: for instance, by weighting the neighbors by their distance, by trying other distance metrics, or by selecting k through cross-validation.
Thanks for reading! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.
The post Create your Machine Learning library from scratch with R ! (2/5) – PCA appeared first on Enhance Data Science.
The PCA is a dimensionality reduction method which seeks the vectors that explain most of the variance in the dataset. From a mathematical standpoint, PCA is just a change of coordinates to represent the points in a more appropriate basis. Picking a few of these coordinates is enough to explain an important part of the variance in the dataset.
Let $x_1, \dots, x_n$ be the observations of our dataset; the points are in $\mathbb{R}^p$. We assume that they are centered and of unit variance. We denote $X$ the $n \times p$ matrix of observations.
Then, $X^T X$ can be diagonalized and has real and non-negative eigenvalues (it is a symmetric positive semi-definite matrix).
We denote $\lambda_1 \ge \dots \ge \lambda_p$ its ordered eigenvalues and $v_1, \dots, v_p$ the associated eigenvectors. It can be shown that $\sum_{i=1}^{k} \lambda_i$ is the cumulative variance explained by $v_1, \dots, v_k$.
It can also be shown that $(v_1, \dots, v_k)$ is the orthonormal basis of size $k$ which explains the most variance.
This is exactly what we wanted! We have a smaller basis which explains as much variance as possible!
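As a quick numerical check of the eigendecomposition view above (a sketch with variable names of our own choosing, not code from the post), the whole derivation fits in a few lines of NumPy:

```python
import numpy as np

# Standardize a random dataset, then diagonalize X^T X
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)

eigval, eigvec = np.linalg.eigh(X.T @ X)        # eigh returns ascending eigenvalues
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]  # reorder to descending

# Cumulative share of variance explained by the first components
explained = np.cumsum(eigval) / eigval.sum()
```

The cumulative ratio `explained` is exactly the quantity the R implementation below uses to pick the number of components.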
The implementation in R has three steps: normalizing the data, computing the eigendecomposition of the covariance matrix, and selecting the principal components.
###PCA
my_pca <- function(x, variance_explained = 0.9, center = T, scale = T) {
  my_pca = list()
  ##Compute the mean of each variable
  if (center) {
    my_pca[['center']] = colMeans(x)
  } else {
    ##Otherwise, we set the mean to 0
    my_pca[['center']] = rep(0, dim(x)[2])
  }
  ##Compute the standard deviation of each variable
  if (scale) {
    my_pca[['std']] = apply(x, 2, sd)
  } else {
    ##Otherwise, we set the sd to 1
    my_pca[['std']] = rep(1, dim(x)[2])
  }
  ##Normalization
  ##Centering
  x_std = sweep(x, 2, my_pca[['center']])
  ##Standardization
  x_std = x_std %*% diag(1 / my_pca[['std']])
  ##Eigendecomposition of the covariance matrix
  eigen_cov = eigen(crossprod(x_std, x_std))
  ##Computing the cumulative variance
  my_pca[['cumulative_variance']] = cumsum(eigen_cov[['values']])
  ##Number of required components
  my_pca[['n_components']] = sum((my_pca[['cumulative_variance']] / sum(eigen_cov[['values']])) < variance_explained) + 1
  ##Selection of the principal components
  my_pca[['transform']] = eigen_cov[['vectors']][, 1:my_pca[['n_components']]]
  attr(my_pca, "class") <- "my_pca"
  return(my_pca)
}
Now that we have the transformation matrix, we can perform the projection on the new basis.
predict.my_pca <- function(pca, x, ...) {
  ##Centering
  x_std = sweep(x, 2, pca[['center']])
  ##Standardization
  x_std = x_std %*% diag(1 / pca[['std']])
  return(x_std %*% pca[['transform']])
}
The function applies the change-of-basis formula, i.e. a projection on the principal components.
Using the predict function, we can now plot the projection of the observations on the two main components. As in part 1, we use the Iris dataset.
library(ggplot2)
pca1 = my_pca(as.matrix(iris[, 1:4]), 1, scale = TRUE, center = TRUE)
projected = predict(pca1, as.matrix(iris[, 1:4]))
ggplot() + geom_point(aes(x = projected[, 1], y = projected[, 2], color = iris[, 5]))
We can now compare our implementation with the standard FactoMineR implementation of Principal Component Analysis.
library(FactoMineR)
pca_stats = PCA(as.matrix(iris[, 1:4]))
projected_stats = predict(pca_stats, as.matrix(iris[, 1:4]))$coord[, 1:2]
ggplot(data = iris) +
  geom_point(aes(x = projected_stats[, 1], y = -projected_stats[, 2], color = Species)) +
  xlab('PC1') + ylab('PC2') +
  ggtitle('Iris dataset projected on the two main PCs (FactoMineR)')
When running this, you should get a plot very similar to the previous one. This ensures the sanity of our implementation.
Thanks for reading! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.
The post Machine Learning Explained: Vectorization and matrix operations appeared first on Enhance Data Science.
Let’s compare the naive way and the vectorized way of computing the sum of the elements of an array. To do so, we will create a large (100,000 elements) Numpy array and compute the sum of its elements 1,000 times with each algorithm. The overall computation times will then be compared.
import numpy as np
import time

W = np.random.normal(0, 1, 100000)
n_rep = 1000
The naive way to compute the sum iterates over all the elements of the array and stores the sum:
start_time = time.time()
for i in range(n_rep):
    loop_res = 0
    for elt in W:
        loop_res += elt
time_loop = time.time() - start_time
If $W$ is our vector of interest, the sum of its elements can be expressed as the dot product $\mathbf{1}^T W$:
start_time = time.time()
for i in range(n_rep):
    one_dot = np.ones(W.shape)
    vect_res = one_dot.T.dot(W)
time_vect = time.time() - start_time
Finally, we can check that both methods yield the same result and compare their runtimes. The vectorized version runs approximately 100 to 200 times faster than the naive loop.
print(np.abs(vect_res - loop_res) < 10e-10)
print(time_loop / time_vect)
Note: the same result can be obtained with np.sum. The numpy version has a very similar runtime to our vectorized version. Numpy being very optimized, this shows that our vectorized sum is reasonably fast.
start_time = time.time()
for i in range(n_rep):
    vect_res = np.sum(W)
time_vect_np = time.time() - start_time
The previous experiments can be replicated in R:
##Creation of the vector
W = matrix(rnorm(100000))
n_rep = 10000

#Naive way
library(tictoc)
tic('Naive computation')
for (rep in 1:n_rep) {
  res_loop = 0
  for (w_i in W) res_loop = w_i + res_loop
}
toc()

#Vectorized way
tic('Vectorized computation')
for (rep in 1:n_rep) {
  ones = rep(1, length(W))
  res_vect = crossprod(ones, W)
}
toc()

#Built-in way
tic('Built-in computation')
for (rep in 1:n_rep) {
  res_built_in = sum(W)
}
toc()
In R, the vectorized version is only an order of magnitude faster than the naive way. The built-in way achieves the best performance and is an order of magnitude faster than our vectorized way.
Vectorization divides the computation time by several orders of magnitude, and the difference with loops increases with the size of the data. Hence, if you want to deal with large amounts of data, rewriting the algorithm as matrix operations may lead to important performance gains.
Note 1: though vectorization is often faster, it requires allocating the memory for the full array. If your amount of RAM is limited or if the amount of data is large, loops may be required.
Note 2: when you deal with large arrays or computationally intensive algebra (like inversion, diagonalization or eigenvalue computations), computations on GPU are orders of magnitude faster than on CPU. To write efficient GPU code, the code needs to be composed of matrix operations. Hence, having vectorized code makes it easier to translate CPU code to GPU (or to tensor-based frameworks).
The goal of this part is to show some basic matrix operations/vectorization and to end on a more complex example to show the thought process which underlies vectorization.
The column-wise sum (and mean) can be expressed as a matrix product. Let $W$ be our $n \times p$ matrix of interest. Using the matrix multiplication formula, the column-wise sum of $W$ is $\mathbf{1}_n^T W$, where $\mathbf{1}_n$ is the vector of ones of size $n$.
Python code:
def colWiseSum(W):
    ones = np.ones((W.shape[0], 1))
    return ones.T.dot(W)
R code:
colWiseSum = function(W) {
  ones = rep(1, nrow(W))
  crossprod(W, ones)
}
Similarly, the row-wise sum is $W \mathbf{1}_p$.
Python code:
def rowWiseSum(W):
    ones = np.ones((W.shape[1], 1))
    return W.dot(ones)
R code:
rowWiseSum = function(W) {
  ones = rep(1, ncol(W))
  W %*% ones
}
The sum of all the elements of a matrix is the sum of the sums of its rows. Using the previous expressions, the sum of all the terms of $W$ is $\mathbf{1}_n^T W \mathbf{1}_p$.
Python code:
def matSum(W):
    rhs_ones = np.ones((W.shape[1], 1))
    lhs_ones = np.ones((W.shape[0], 1))
    return lhs_ones.T.dot(W).dot(rhs_ones)
R code:
matSum = function(W) {
  rhs_ones = rep(1, ncol(W))
  lhs_ones = rep(1, nrow(W))
  crossprod(lhs_ones, W) %*% rhs_ones
}
Let’s say we have a set of words and for each of these words we want to find the most similar words from a dictionary. We assume that the words have been projected in a space of dimension $p$ (using word2vec). Let $X$ (our set of words) and $Y$ (our dictionary) be two matrices resp. in $\mathbb{R}^{n_1 \times p}$ and $\mathbb{R}^{n_2 \times p}$. To compute the similarity of all the observations of $X$ and $Y$, we simply need to compute $X Y^T$.
Python code:
def gramMatrix(X, Y):
    return X.dot(Y.T)
R code:
gramMatrix = function(X, Y) {
  ## tcrossprod(X, Y) computes X %*% t(Y)
  tcrossprod(X, Y)
}
We want to compute the pairwise distances between two sets of vectors. Let $X$ and $Y$ be two matrices resp. in $\mathbb{R}^{n_1 \times p}$ and $\mathbb{R}^{n_2 \times p}$. For each vector of $X$, we need to compute the distance to all the vectors of $Y$. Hence, the output matrix should be of size $n_1 \times n_2$.
If $x$ and $y$ are two vectors, their squared distance is:

$$\lVert x - y \rVert^2 = \lVert x \rVert^2 + \lVert y \rVert^2 - 2 \langle x, y \rangle$$

To compute all pairwise distances at once, some work is required on the last equality: each of the three terms has to be expanded into an $n_1 \times n_2$ matrix, so that the distances can be read directly in the output matrix.
The first two terms $\lVert x_i \rVert^2$ and $\lVert y_j \rVert^2$ need to be computed for each $i$ and $j$. $X \odot X$ is the element-wise multiplication of $X$ with itself (its elements are $x_{ij}^2$). Hence, the $i$-th element of $(X \odot X)\mathbf{1}_p$ is the sum of the squared coordinates of the $i$-th observation, $\lVert x_i \rVert^2$.
However, this is a vector of shape $n_1 \times 1$. By replicating each of its elements $n_2$ times, we will get a vector of size $n_1 n_2$, which can then be reshaped into an $n_1 \times n_2$ matrix. The replication can be done easily if we consider the right matrix $\Delta_{n_1, n_2}$.
Let $\mathbf{1}_{n_2}$ be a vector of ones of size $n_2$. Let $\Delta_{n_1, n_2} = I_{n_1} \otimes \mathbf{1}_{n_2}$ be the matrix of size $n_1 n_2 \times n_1$ with $n_2$ repetitions of each row of the identity on the “diagonal”.
Then, our final vector is $\Delta_{n_1, n_2} (X \odot X)\mathbf{1}_p$ (the same expression holds for $Y$). We denote $\mathrm{mat}$ the reshape operator (used to transform the previous vector into an $n_1 \times n_2$ matrix). With the previous part on the similarity matrix, we get the following expression of the pairwise distance matrix:

$$D = \mathrm{mat}\left(\Delta_{n_1, n_2} (X \odot X)\mathbf{1}_p\right) + \mathrm{mat}\left(\Delta_{n_2, n_1} (Y \odot Y)\mathbf{1}_p\right)^T - 2 X Y^T$$
The previous expression may seem complex, but it will help us a lot to code the pairwise distance: we only have to translate the maths into Numpy or R.
Python code:
def L2dist(X, Y):
    n_1 = X.shape[0]
    n_2 = Y.shape[0]
    p = X.shape[1]
    ones = np.ones((p, 1))
    x_sq = (X ** 2).dot(ones)[:, 0]
    y_sq = (Y ** 2).dot(ones)[:, 0]
    delta_n1_n2 = np.repeat(np.eye(n_1), n_2, axis=0)
    delta_n2_n1 = np.repeat(np.eye(n_2), n_1, axis=0)
    return np.reshape(delta_n1_n2.dot(x_sq), (n_1, n_2)) \
        + np.reshape(delta_n2_n1.dot(y_sq), (n_2, n_1)).T \
        - 2 * gramMatrix(X, Y)
R Code:
L2dist = function(X, Y) {
  n_1 = dim(X)[1]
  n_2 = dim(Y)[1]
  p = dim(X)[2]
  ones = rep(1, p)
  x_sq = X ** 2 %*% ones
  x_sq = t(matrix(diag(n_1) %x% rep(1, n_2) %*% x_sq, n_2, n_1))
  y_sq = Y ** 2 %*% ones
  y_sq = matrix(diag(n_2) %x% rep(1, n_1) %*% y_sq, n_1, n_2)
  x_sq + y_sq - 2 * gramMatrix(X, Y)
}
Actually, the previous L2dist is not completely optimized and requires a lot of memory, since $\Delta_{n_1, n_2}$ has $n_1^2 n_2$ cells and is mostly empty. Using Numpy’s built-in functions, we can circumvent this multiplication by directly repeating the vector (which reduces the memory footprint by a factor of $n_1$):
Python code:
def L2dist_improved(X, Y):
    n_1 = X.shape[0]
    n_2 = Y.shape[0]
    p = X.shape[1]
    ones = np.ones((p, 1))
    x_sq = (X ** 2).dot(ones)[:, 0]
    y_sq = (Y ** 2).dot(ones)[:, 0]
    ##Replace the multiplication by a simple repeat
    X_rpt = np.repeat(x_sq, n_2).reshape((n_1, n_2))
    Y_rpt = np.repeat(y_sq, n_1).reshape((n_2, n_1)).T
    return X_rpt + Y_rpt - 2 * gramMatrix(X, Y)
R code:
L2dist_improved = function(X, Y) {
  n_1 = dim(X)[1]
  n_2 = dim(Y)[1]
  p = dim(X)[2]
  ones = rep(1, p)
  x_sq = X ** 2 %*% ones
  x_sq = t(matrix(rep(x_sq, each = n_2), n_2, n_1))
  y_sq = Y ** 2 %*% ones
  y_sq = matrix(rep(y_sq, each = n_1), n_1, n_2)
  x_sq + y_sq - 2 * gramMatrix(X, Y)
}
Note: actually, this code can be made even shorter and more efficient by using a custom outer product (thanks to Florian Privé for the solution):
L2dist_improved2 <- function(X, Y) {
  xn <- rowSums(X ** 2)
  yn <- rowSums(Y ** 2)
  outer(xn, yn, '+') - 2 * tcrossprod(X, Y)
}
To show the interest of our previous work, let’s compare the computation speed of the vectorized L2 distance, the naive implementation and the scikit-learn implementation. The experiments are run on different dataset sizes with 100 repetitions.
The post Create your Machine Learning library from scratch with R ! (1/5) – Linear and logistic regression appeared first on Enhance Data Science.
The goal of linear regression is to estimate a continuous variable $y$ given a matrix of observations $X$. Before dealing with the code, we need to derive the solution of the linear regression.
Given a matrix of observations $X$ and the target $y$, the goal of linear regression is to minimize the norm between $y$ and a linear estimate of $y$: $X\beta$. Hence, linear regression can be rewritten as an optimization problem: $\min_\beta \lVert y - X\beta \rVert^2$. A closed-form solution can easily be derived and the optimal $\beta$ is:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
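The closed form above can be sanity-checked numerically before moving to R (a hedged NumPy sketch; the data and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true  # noiseless target, so the closed form recovers beta exactly

# beta_hat = (X^T X)^{-1} X^T y, computed with solve() on the normal equations
# rather than forming the explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

Solving the normal equations directly is both faster and numerically safer than inverting $X^T X$ explicitly.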
Using the closed-form solution, we can easily code the linear regression. Our linear model object will have three methods, an init method where the model is fitted, a predict method to work with new data and a plot method to visualize the residuals’ distribution.
###Linear model
fit_lm <- function(x, y, intercept = TRUE) {
  ##Conversion to matrix if required
  if (!is.matrix(x)) {
    x = as.matrix(x)
  }
  if (!is.matrix(y)) {
    y = as.matrix(y)
  }
  #Add the intercept coefficient
  if (intercept) {
    x = cbind(x, 1)
  }
  my_lm = list(intercept = intercept)
  ##Compute the coefficient estimates
  my_lm[['coeffs']] = solve(t(x) %*% x) %*% t(x) %*% y
  ##Compute the estimates for the train dataset
  my_lm[['preds']] = x %*% my_lm[['coeffs']]
  my_lm[['residuals']] = my_lm[['preds']] - y
  my_lm[['mse']] = mean(my_lm[['residuals']]^2)
  attr(my_lm, "class") <- "my_lm"
  return(my_lm)
}
The fit function is simple: the first few lines transform the data to matrices and add an intercept if required. Then, the ‘my_lm’ object is created and the coefficients are computed. The solve() function is used to invert the matrix and %*% denotes matrix multiplication. At the end, the residuals and the estimates are computed and the class of the object is set to ‘my_lm’.
Now let’s implement the predict and plot methods for the my_lm class:
predict.my_lm <- function(my_lm, x, ...) {
  if (!is.matrix(x)) {
    x = as.matrix(x)
  }
  if (my_lm[["intercept"]]) {
    x = cbind(x, 1)
  }
  x %*% my_lm[["coeffs"]]
}

plot.my_lm <- function(my_lm, bins = 30, ...) {
  library(ggplot2)
  qplot(my_lm[["residuals"]], geom = "histogram", bins = bins) +
    xlab('Residuals values') +
    ggtitle('Residual distribution')
}
You can test the code on some preinstalled R datasets such as cars. The code will give you the same coefficient estimates as the lm function. For instance, on the cars dataset:
my_lm1 = fit_lm(cars[, 1], cars[, 2])
vanilla_lm = lm(dist ~ speed, cars)
print(vanilla_lm[['coefficients']])
print(my_lm1[['coeffs']])
Previously, we worked on regression and the estimation of a continuous variable. Now, with logistic regression, we try to estimate a binary outcome (for instance, ill vs healthy, pass vs failed, …). Again, let’s deal with the maths first:
The goal is to estimate a binary outcome $y$ given the observations $X$. We assume that $y$ follows a Bernoulli distribution of parameter $\sigma(x^T \beta)$, where $\sigma(t) = \frac{1}{1 + e^{-t}}$ is called the sigmoid function.
Hence, we have $P(y_i = 1 \mid x_i) = \sigma(x_i^T \beta)$.
We want to maximize the log-likelihood of the observed sample (over $\beta$, and hence over the $\sigma(x_i^T \beta)$):

$$\ell(\beta) = \sum_{i=1}^n y_i \log\left(\sigma(x_i^T \beta)\right) + (1 - y_i) \log\left(1 - \sigma(x_i^T \beta)\right)$$
This maximization will be done using Newton’s method. Newton’s method is a variant of gradient descent which uses the curvature of the function to increase the speed of convergence. If you are not familiar with Newton’s method, you can just see it as a variant of batch gradient descent. The weight updates have the following form:

$$\beta \leftarrow \beta - H^{-1} \nabla_\beta \ell$$
with the Hessian $H = -X^T S X$, where $S = \mathrm{diag}\left(\sigma(x_i^T \beta)\left(1 - \sigma(x_i^T \beta)\right)\right)$,
and the gradient $\nabla_\beta \ell = X^T (y - \nu)$, where $\nu_i = \sigma(x_i^T \beta)$.
The algorithm in R will apply this update until the termination criterion is met. Here, the termination criterion is met when the mean squared error is below the user-defined tolerance.
###Sigmoid function
sigmoid = function(x) {
  1 / (1 + exp(-x))
}

###Fit logistic regression
fit_logit = function(x, y, intercept = T, tol = 10e-5, max_it = 100) {
  ##Type conversion
  if (!is.matrix(x)) {
    x = as.matrix(x)
  }
  if (!is.matrix(y)) {
    y = as.matrix(y)
  }
  ##Add intercept if required
  if (intercept) {
    x = cbind(x, 1)
  }
  ##Algorithm initialization
  iterations = 0
  converged = F
  ##Weights are initialized to 1
  coeffs = matrix(1, dim(x)[2])
  ##Update the weights until the max number of iterations
  ##or the termination criterion is met
  while (iterations < max_it & !converged) {
    iterations = iterations + 1
    nu = sigmoid(x %*% coeffs)
    ##S = diag(nu * (1 - nu)) is the weight matrix of Newton's method
    nu_diag = diag((nu * (1 - nu))[, 1])
    ##Weights update: beta <- beta + (X'SX)^-1 X'(y - nu)
    coeffs = coeffs + solve(t(x) %*% nu_diag %*% x) %*% t(x) %*% (y - nu)
    ##Compute the mse to check termination
    mse = mean((y - sigmoid(x %*% coeffs))^2)
    ##Stop the computation if the tolerance is reached
    if (mse < tol) {
      converged = T
    }
  }
  ##Create the logit object
  my_logit = list(intercept = intercept)
  my_logit[['coeffs']] = coeffs
  my_logit[['preds']] = sigmoid(x %*% coeffs)
  my_logit[['residuals']] = my_logit[['preds']] - y
  my_logit[['mse']] = mean(my_logit[['residuals']]^2)
  my_logit[['iteration']] = iterations
  attr(my_logit, "class") <- "my_logit"
  return(my_logit)
}

##Predict the outcome on new data
predict.my_logit <- function(my_logit, x, probs = T, ...) {
  if (!is.matrix(x)) {
    x = as.matrix(x)
  }
  if (my_logit[['intercept']]) {
    x = cbind(x, 1)
  }
  if (probs) {
    sigmoid(x %*% my_logit[['coeffs']])
  } else {
    sigmoid(x %*% my_logit[['coeffs']]) > 0.5
  }
}
The code is split into two parts: the fitting function, which runs Newton’s method, and the predict method, which returns either probabilities or classes for new data.
We can now use our logistic regression to predict the class of a flower from the iris dataset:
fit_logit(iris[, 1:4], iris[, 5] == 'setosa')
As expected, the algorithm can predict efficiently if a flower is a setosa or not.
If you like this post, follow us to learn how to create your Machine Learning library from scratch with R!
The post Machine Learning Explained: Kmeans appeared first on Enhance Data Science.
We assume that we want to split the data into k groups, so we need to find and assign k centers. How do we define and find these centers?
They are the solution to the optimization problem:

$$\min_{c_1, \dots, c_k,\, a} \sum_{i=1}^{n} \sum_{j=1}^{k} a_{ij} \lVert x_i - c_j \rVert^2$$

where $a_{ij} = 1$ if the observation $i$ is assigned to the center $j$ and 0 otherwise.
Basically, this equation means that we are looking for the k centers which minimize the distance between the points of each cluster and their center. This is an optimization problem, but since the function we want to minimize is not convex and some variables are binary, it cannot be solved in classic ways with gradient descent.
The usual way to solve it (Lloyd’s algorithm) is the following: initialize the k centers, assign each point to its closest center, recompute each center as the mean of its assigned points, and repeat the last two steps until the centers stop moving.
Now that we have the algorithm in pseudocode, let’s implement kmeans from scratch in R. First, we’ll create some toy data based on five 2D Gaussian distributions.
require(MASS)
require(ggplot2)
set.seed(1234)
set1 = mvrnorm(n = 300, c(-4, 10), matrix(c(1.5, 1, 1, 1.5), 2))
set2 = mvrnorm(n = 300, c(5, 7), matrix(c(1, 2, 2, 6), 2))
set3 = mvrnorm(n = 300, c(-1, 1), matrix(c(4, 0, 0, 4), 2))
set4 = mvrnorm(n = 300, c(10, -10), matrix(c(4, 0, 0, 4), 2))
set5 = mvrnorm(n = 300, c(3, -3), matrix(c(4, 0, 0, 4), 2))
DF = data.frame(rbind(set1, set2, set3, set4, set5), cluster = as.factor(rep(1:5, each = 300)))
ggplot(DF, aes(x = X1, y = X2, color = cluster)) + geom_point()
On this dataset, Kmeans will work well since each distribution has a circular shape. Here is what the data look like:
Now that we have a dataset, let’s implement kmeans.
The initialisation of the centroids is crucial and changes how the algorithm behaves. Here, we will simply take K random points from the data.
#Centroids initialisation
centroids = data[sample.int(nrow(data), K), ]
##Stopping criterion initialisation
current_stop_crit = 10e10
##Vector where the assigned center of each point will be saved
cluster = rep(0, nrow(data))
##Has the algorithm converged?
converged = F
it = 1
At each iteration, every point is assigned to its closest centroid. To do so, the Euclidean distance between each point and each centroid is computed, and the lowest distance and the centroid for which it is reached are saved.
###Iterating over observations
for (i in 1:nrow(data)) {
  ##Setting a high minimum distance
  min_dist = 10e10
  ##Iterating over centroids
  for (centroid in 1:nrow(centroids)) {
    ##Computing the squared L2 distance
    distance_to_centroid = sum((centroids[centroid, ] - data[i, ])^2)
    ##This centroid is the closest centroid to the point
    if (distance_to_centroid <= min_dist) {
      ##The point is assigned to this centroid/cluster
      cluster[i] = centroid
      min_dist = distance_to_centroid
    }
  }
}
Once each point has been assigned to its closest centroid, the coordinates of each centroid are updated. The new coordinates are the means of the observations which belong to the cluster.
##For each centroid
for (i in 1:nrow(centroids)) {
  ##The new coordinates are the means of the points in the cluster
  centroids[i, ] = apply(data[cluster == i, ], 2, mean)
}
We do not want the algorithm to run indefinitely, hence we need a stopping criterion to stop the algorithm when we are close enough to a minimum. The criterion is simple: when the centroids stop moving, the algorithm should stop.
while (current_stop_crit >= stop_crit & converged == F) {
  it = it + 1
  if (current_stop_crit <= stop_crit) {
    converged = T
  }
  old_centroids = centroids
  ###Run the assignment and update steps from above
  ####Recompute the stopping criterion
  current_stop_crit = mean((old_centroids - centroids)^2)
}
kmeans = function(data, K = 4, stop_crit = 10e-5) {
  #Initialisation of the centroids
  centroids = data[sample.int(nrow(data), K), ]
  current_stop_crit = 1000
  cluster = rep(0, nrow(data))
  converged = F
  it = 1
  while (current_stop_crit >= stop_crit & converged == F) {
    it = it + 1
    if (current_stop_crit <= stop_crit) {
      converged = T
    }
    old_centroids = centroids
    ##Assigning each point to its closest centroid
    for (i in 1:nrow(data)) {
      min_dist = 10e10
      for (centroid in 1:nrow(centroids)) {
        distance_to_centroid = sum((centroids[centroid, ] - data[i, ])^2)
        if (distance_to_centroid <= min_dist) {
          cluster[i] = centroid
          min_dist = distance_to_centroid
        }
      }
    }
    ##Updating each centroid with the mean of its cluster
    for (i in 1:nrow(centroids)) {
      centroids[i, ] = apply(data[cluster == i, ], 2, mean)
    }
    current_stop_crit = mean((old_centroids - centroids)^2)
  }
  return(list(data = data.frame(data, cluster), centroids = centroids))
}
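The double loop in the assignment step is easy to read but slow. Using the pairwise-distance identity from our vectorization post, the whole step collapses into one matrix expression (a NumPy sketch, with names of our own choosing):

```python
import numpy as np

def assign_clusters(X, centroids):
    """Return, for each row of X, the index of its closest centroid."""
    # Pairwise squared distances: ||x||^2 + ||c||^2 - 2 <x, c>
    d = (np.sum(X ** 2, axis=1)[:, None]
         + np.sum(centroids ** 2, axis=1)[None, :]
         - 2 * X @ centroids.T)
    return np.argmin(d, axis=1)
```

The same trick applies in R with the compute_pairwise_distance function from the vectorization post.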
You can easily run the code to see your clusters:
res = kmeans(DF[1:2], K = 5)
res$centroids$cluster = 1:5
res$data$isCentroid = F
res$centroids$isCentroid = T
data_plot = rbind(res$centroids, res$data)
ggplot(data_plot, aes(x = X1, y = X2, color = as.factor(cluster), size = isCentroid, alpha = isCentroid)) + geom_point()
Now let’s try the algorithm on two different datasets. First, on the five Gaussian distributions:
The centroids move and split the data into clusters which are very close to the original ones. Kmeans is doing a great job here.
Now, instead of having nice Gaussian distributions, we will build three rings nested into one another.
##Building three sets in polar coordinates
set1 = data.frame(r = runif(300, 0.1, 0.5), theta = runif(300, 0, 360), set = '1')
set2 = data.frame(r = runif(300, 1, 1.5), theta = runif(300, 0, 360), set = '2')
set3 = data.frame(r = runif(300, 3, 5), theta = runif(300, 0, 360), set = '3')
##Transformation into rings
data_2 = rbind(set1, set2, set3)
data_2$x = data_2$r * cos(2 * pi * data_2$theta)
data_2$y = data_2$r * sin(2 * pi * data_2$theta)
Kmeans performs very poorly on these new data. Actually, the Euclidean distance is not adapted to this kind of problem, since the clusters do not have a circular shape.
So before using kmeans, you should ensure that the data has an appropriate shape; if not, you can apply transformations or change the distance used in the kmeans.
The kmeans algorithm only looks for a local minimum, which is often not a global optimum. Hence, different initialisations can lead to very different results.
We ran the kmeans algorithm over more than 60 different starting positions. As you can see, sometimes the algorithm ends up with poor centroids due to an unlucky initialization. The solution to this is simply to run kmeans several times and to keep the best set of centroids. The quality of initialization can also be improved with kmeans++: the algorithm selects starting points which are less likely to perform poorly.
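The seeding idea behind kmeans++ can be sketched as follows (a hedged sketch with names and details of our own choosing, not code from the post): each new center is drawn with probability proportional to the squared distance to the nearest center already chosen, which spreads the initial centers apart.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding sketch: spread the initial centers apart."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center proportionally to d2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Points far from all existing centers get a high sampling weight, so a second blob of data is very likely to receive its own starting center.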
Want to learn more about Machine Learning? Here is a selection of Machine Learning Explained posts:
– Dimensionality reduction
– Supervised vs unsupervised vs reinforcement learning
– Regularization in machine learning
The post Explore your McDonalds Meal with Shiny and D3partitionR appeared first on Enhance Data Science.
In addition to this, I released a new version of D3partitionR a few weeks ago and was looking for use cases. Hierarchical charts like sunbursts or treemaps are very useful to split and analyze the composition of categories and items. Hence, I decided to make a small Shiny application to analyze the composition and the nutrition value of a McDonald’s menu.
The application has four main tabs:
The menu selection is used to … select the items you want to add to your menu. Most of McDonald’s items are in there, and they are ordered according to their categories.
Far more interesting! This part shows you how the calories are spread between the different items, categories, and nutrients (carbohydrates, total fat, fibers, and proteins). The zooming makes it easy to see the precise calorie composition of each item or category.
Since calories are not the only element to take into account to assess a meal, these two tabs show the value of the different nutrients and their daily value (taken from McDonald’s website). Various nutrients are available, like saturated fat, sodium, vitamin A, … The main point of these tabs was to show a reproducible way to imitate facetting with D3partitionR (which can probably be extended to other widgets).
The charts in the application mainly rely on D3partitionR and show its main functionalities:
The application code can be found on Github.
The post Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js appeared first on Enhance Data Science.
install.packages('D3partitionR')
Here is a quick overview of the possibilities using the Titanic data:
This update is a major update from the previous version which will break code from 0.3.1
library(data.table)
library(D3partitionR)

##Reading the data
titanic_data = fread("train.csv")

##Aggregating the data to have a unique sequence for the 4 variables
var_names = c('Sex', 'Embarked', 'Pclass', 'Survived')
data_plot = titanic_data[, .N, by = var_names]
data_plot[, (var_names) := lapply(var_names, function(x) {
  paste0(x, ' ', data_plot[[x]])
})]

##Treemap
D3partitionR() %>%
  add_data(data_plot, count = 'N', steps = c('Sex', 'Embarked', 'Pclass', 'Survived')) %>%
  set_chart_type('treemap') %>%
  plot()

##Circle treemap
D3partitionR() %>%
  add_data(data_plot, count = 'N', steps = c('Sex', 'Embarked', 'Pclass', 'Survived')) %>%
  set_chart_type('circle_treemap') %>%
  plot()
Style consistency among the different types of charts: it is now easy to switch from a treemap to a circle treemap or a sunburst while keeping a consistent styling policy.
Update to d3.js V4 and modularization. Each type of chart now has its own file and function. This function draws the chart at its root level with labels and colors, and returns a zoom function. The on-click actions (such as the breadcrumb update or the legend update) and the hover actions (tooltips) are defined in a ‘global’ function.
Hence, adding new visualizations will be easy: only the drawing and zooming script needs to be adapted to this template.
Thanks to the feedback that will be collected during the next weeks, a stable release version should soon be on CRAN. I will also post more resources on D3partitionR, with use cases and examples of Shiny applications built on it.
]]>The post Machine Learning Explained: Dimensionality Reduction appeared first on Enhance Data Science.
Let’s say we own a shop and want to collect some data on our clients. We can collect their age, how frequently they come to our shop, how much they spend on average, and when they last came to our shop. Hence each of our clients can be represented by four variables (age, frequency, spending, and date of last purchase) and can be seen as a point in a four-dimensional space. From here on, variables and dimensions will have the same meaning.
Now, let’s add some complexity and some variables by using images. How many dimensions are used to represent the image below?
The image is 8 by 8 pixels, and each pixel is represented by one quantitative variable: 64 variables are needed!
Even bigger, how many variables do you need to represent the picture below?
Its resolution is 640 by 426 and it needs three color channels (red, green, and blue). So this picture requires 640 × 426 × 3 = 817,920 variables. That’s how you go from four to almost one million dimensions (and you can go well above!)
Before seeing any algorithm, everyday life provides us a great example of dimensionality reduction.
Each of these people can be represented as a point in a three-dimensional space. As a gross approximation, each person fits in a 50 × 50 × 200 (cm) box. With a resolution of 1 cm and three color channels, each person is represented by 50 × 50 × 200 × 3 = 1,500,000 variables.
On the other hand, the shadow is two-dimensional and black and white, so each shadow needs only 50 × 200 = 10,000 variables.
The number of variables was divided by more than 100! And if your goal is to detect human vs. cat, or even men vs. women, the data from the shadow may be enough.
Dimensionality reduction has several advantages from a machine learning point of view.
The most obvious way to reduce dimensionality is to remove some dimensions and keep only the variables most suitable for the problem.
Here are some ways to select variables:
While feature selection is efficient, it is a blunt tool: variables are either kept or removed entirely. Yet a removed variable may be useful on some intervals or for some modalities, while a kept variable may be redundant on others. Feature extraction (or engineering) seeks to keep only the intervals or modalities of the features which contain information.
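As an illustration, here is a minimal base-R sketch of two simple selection filters on a synthetic “shop client” dataset (the data and the thresholds are invented for the example): variables with no variance are dropped, then one variable out of every highly correlated pair is removed.

```r
# Synthetic client data: one redundant and one useless variable on purpose
set.seed(1)
X = data.frame(age       = rnorm(100, 40, 10),
               frequency = rpois(100, 3),
               spending  = rnorm(100, 50, 20))
X$spending_eur = X$spending * 1.1 + rnorm(100, 0, 0.1)  # near-duplicate variable
X$constant     = rep(1, 100)                            # zero-variance variable

# Filter 1: remove variables with (near-)zero variance
X = X[, sapply(X, var) > 1e-6]

# Filter 2: remove one variable from every highly correlated pair
cor_mat = abs(cor(X)); diag(cor_mat) = 0
to_drop = character(0)
for (v in colnames(cor_mat)) {
  kept = setdiff(colnames(cor_mat), c(to_drop, v))
  if (any(cor_mat[v, kept] > 0.95)) to_drop = c(to_drop, v)
}
X_selected = X[, setdiff(colnames(X), to_drop)]
```

After both filters, `constant` is gone and only one of the two spending variables survives.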
Principal component analysis (or PCA) is a linear transformation of the data which looks for the axes along which the data has the most variance. PCA creates new variables which are linear combinations of the original ones, and these new variables are orthogonal (i.e., their correlation is zero). PCA can be seen as a rotation of the initial space that finds more suitable axes to express the variability of the data.
Once the new variables have been created, you can select the most important ones. The threshold is up to you and depends on how much variance you want to keep. You can check the tutorial below to see a working R example:
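As a quick sketch, here is how this looks with base R’s prcomp on the iris measurements (the 95% variance threshold is an arbitrary choice for the example):

```r
# PCA on the four iris measurements, centered and scaled
data(iris)
pca = prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of the total variance carried by each principal component
var_explained = pca$sdev^2 / sum(pca$sdev^2)

# Keep the smallest number of components reaching 95% of the variance
n_keep = which(cumsum(var_explained) >= 0.95)[1]
iris_reduced = pca$x[, 1:n_keep, drop = FALSE]
```

On iris, the first two components carry about 96% of the variance, so four variables are reduced to two.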
Since PCA is linear, it mostly works on linearly separable data. Hence, if you want to perform classification on non-linearly separable data (like the donut below), linear PCA will probably make you lose a lot of information.
Example of Kernel PCA from Scikit-Learn
On the other hand, kernel PCA can work on nonlinearly separable data. The trick is simple: use a kernel to implicitly map the data into a higher-dimensional space where they become linearly separable, then perform linear PCA in that space.
Hence you can easily separate the two rings of the donut, which will improve the performance of classifiers. Kernel PCA raises two problems, you need to find the right kernel and the new variables are hard to understand for humans.
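Here is a hand-rolled sketch of the trick in base R on a “donut” dataset (for real use, the kernlab package provides kpca(); the radii and the RBF bandwidth below are arbitrary choices):

```r
# A "donut": an inner ring inside an outer ring, not linearly separable
set.seed(42)
theta = runif(200, 0, 2 * pi)
r     = c(rep(1, 100), rep(5, 100))            # inner radius 1, outer radius 5
X     = cbind(r * cos(theta), r * sin(theta)) + matrix(rnorm(400, 0, 0.1), 200, 2)

# 1) Implicit mapping to a high-dimensional space via an RBF kernel matrix
sq_dist = as.matrix(dist(X))^2
K = exp(-sq_dist / 2)

# 2) Center the kernel matrix and run ordinary PCA (an eigendecomposition) on it
n  = nrow(K)
H  = diag(n) - matrix(1 / n, n, n)
Kc = H %*% K %*% H
eig = eigen(Kc, symmetric = TRUE)

# First kernel principal component of each point
pc1 = eig$vectors[, 1] * sqrt(abs(eig$values[1]))
```

Linear PCA on X cannot tell the two rings apart, while the leading kernel components can.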
Linear discriminant analysis is similar to PCA but is supervised (while PCA does not require labels). The goal of LDA is not to maximize variance but to create linear surfaces that separate the different groups; the new features are axes on which the data, once projected, are well separated.
As with PCA, you can apply the kernel trick to use LDA on non-linearly separable data.
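A minimal LDA example with the MASS package (shipped with R); unlike PCA, the species labels drive the choice of the new axes:

```r
library(MASS)
data(iris)

# LDA using the species as labels
fit = lda(Species ~ ., data = iris)

# With 3 classes there are at most 2 discriminant axes
proj = predict(fit)$x          # the 150 flowers projected on LD1 and LD2
```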
Independent component analysis aims at creating independent variables from the original variables.
The typical example is the cocktail party problem: you are in a room where many people are having different conversations, and your goal is to separate those conversations. If you place several microphones in the room, each of them will record a linear combination of all the conversations (some kind of noise). The goal of ICA is to disentangle the conversations from this noise.
This can be seen as dimensionality reduction: if you have 200 microphones but only ten independent conversations, you should be able to represent them with ten independent variables from the ICA.
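A toy version of the cocktail party in base R (the fastICA package does this properly; the following is only a small FastICA-style fixed-point iteration on two invented signals):

```r
# Two "conversations" mixed into two "microphone" recordings
set.seed(7)
n  = 1000
s1 = sin(seq(0, 20, length.out = n))        # conversation 1: a sine wave
s2 = runif(n, -1, 1)                        # conversation 2: uniform noise
S  = cbind(s1, s2)
A  = matrix(c(0.6, 0.4, 0.45, 0.55), 2, 2)  # mixing matrix (the room)
X  = S %*% A                                # what the microphones record

# Whiten the recordings
Xc = scale(X, scale = FALSE)
e  = eigen(cov(Xc))
Z  = Xc %*% e$vectors %*% diag(1 / sqrt(e$values))

# FastICA fixed-point iterations (tanh nonlinearity, symmetric decorrelation)
W = matrix(rnorm(4), 2, 2)
for (i in 1:200) {
  WX = Z %*% t(W)
  W  = t(tanh(WX)) %*% Z / n - diag(colMeans(1 - tanh(WX)^2)) %*% W
  sw = svd(W); W = sw$u %*% t(sw$v)         # re-orthogonalize the rows
}
S_hat = Z %*% t(W)                          # estimated independent conversations
```

One of the recovered components lines up closely with the original sine wave.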
The autoencoder is a powerful method to reduce the dimensionality of data. It is a neural network (feed-forward, convolutional or recurrent; most architectures can be adapted into an autoencoder) which tries to learn its own input. For instance, an autoencoder trained on images will try to reconstruct these images.
In addition, the autoencoder has a bottleneck: the number of neurons in the hidden layers is smaller than the number of input variables. Hence, the autoencoder has to learn a compressed representation of the data, whose number of dimensions is the number of neurons in the smallest hidden layer.
The main advantages of autoencoders are their non-linearity and their flexibility.
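To make the bottleneck idea concrete, here is a toy autoencoder in base R: a single linear hidden layer of two neurons compressing four-dimensional inputs, trained by plain gradient descent (real autoencoders use frameworks such as keras or h2o, with non-linear activations; the learning rate and epoch count are arbitrary):

```r
# 4 -> 2 -> 4 linear autoencoder on the scaled iris measurements
set.seed(3)
X = scale(as.matrix(iris[, 1:4]))          # 150 x 4 inputs, each with variance 1
n_in = 4; n_hidden = 2                     # the bottleneck

W1 = matrix(rnorm(n_in * n_hidden, sd = 0.1), n_in, n_hidden)  # encoder weights
W2 = matrix(rnorm(n_hidden * n_in, sd = 0.1), n_hidden, n_in)  # decoder weights
lr = 0.01

for (epoch in 1:500) {
  H     = X %*% W1          # encode: compressed 2-d representation
  X_hat = H %*% W2          # decode: reconstruction of the input
  err   = X_hat - X
  # Gradients of the mean squared reconstruction error
  gW2 = t(H) %*% err / nrow(X)
  gW1 = t(X) %*% (err %*% t(W2)) / nrow(X)
  W1  = W1 - lr * gW1
  W2  = W2 - lr * gW2
}
mse = mean((X %*% W1 %*% W2 - X)^2)   # far below the input variance of 1
```

The two hidden neurons end up spanning roughly the same subspace as the first two principal components.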
T-SNE, or t-distributed stochastic neighbor embedding, is mainly used to embed high-dimensional data in a 2D or 3D space. That is the technique we used to fit and visualize the Twitter data in our analysis of the French election. The main idea behind T-SNE is that points that are close in the original space should be close in the low-dimensional space, while distant points should remain distant.
T-SNE is highly non-linear, originally non-parametric, dependent on the random seed, and does not preserve distances. Hence, I mainly use it to plot high-dimensional data and to visualize clusters and similarities.
NB: There is also a parametric variant (https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf), which seems less widely used than the original T-SNE.
Here ends our presentation of the most widely used dimensionality reduction techniques.
The post Machine Learning Explained: supervised learning, unsupervised learning, and reinforcement learning appeared first on Enhance Data Science.
The type of learning is defined by the problem you want to solve and is intrinsic to the goal of your analysis:
Supervised learning regroups different techniques which all share the same principles:
Some supervised learning algorithms:
Supervised learning is often used for expert systems in image recognition, speech recognition, and forecasting, and in some specific business domains (targeting, financial analysis, ...).
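A minimal supervised-learning sketch in base R (not from the original post): the labels are known on a training set, and the fitted model is judged on held-out examples.

```r
data(iris)
train_idx = seq(1, 150, by = 2)               # every other row for training
train = iris[train_idx, ]
test  = iris[-train_idx, ]

# Logistic regression: predict whether a flower is a virginica
fit = glm(I(Species == "virginica") ~ Petal.Length + Petal.Width,
          data = train, family = binomial)

# Accuracy on unseen data
pred = predict(fit, newdata = test, type = "response") > 0.5
accuracy = mean(pred == (test$Species == "virginica"))
```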
Cluster Analysis from Wikipedia
On the other hand, unsupervised learning does not use output data (at least output data that are different from the input). Unsupervised algorithms can be split into different categories:
Most of the time unsupervised learning algorithms are used to pre-process the data, during the exploratory analysis or to pre-train supervised learning algorithms.
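A minimal unsupervised-learning sketch: k-means clustering groups the iris flowers without ever seeing the species labels.

```r
data(iris)
set.seed(10)

# Cluster the four measurements into 3 groups; no labels are used
km = kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)

# Compare the found clusters with the (unused) species labels
print(table(km$cluster, iris$Species))
```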
Reinforcement learning algorithms try to find the best ways to earn the greatest reward. Rewards can be winning a game, earning more money or beating other opponents. They present state-of-the-art results on very human tasks; for instance, this paper from the University of Toronto shows how a computer can beat humans at old-school Atari video games.
Reinforcement learning algorithms follow a circular loop:
Given its own state and the environment’s state, the agent chooses the action that maximizes its reward, or explores a new possibility. The action changes the environment’s and the agent’s states, and is interpreted to grant the agent a reward. By running this loop many times, the agent improves its behavior.
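The loop above can be sketched with tabular Q-learning in base R (a toy example, not from the original post): five states in a corridor, a reward of 1 at the right end, and arbitrary constants alpha, gamma and epsilon.

```r
set.seed(1)
n_states = 5
actions  = c(-1, +1)                 # move left or move right
Q = matrix(0, n_states, length(actions))
alpha = 0.5; gamma = 0.9; epsilon = 0.2

for (episode in 1:300) {
  s = 1                                         # the agent starts on the left
  for (step in 1:100) {
    if (runif(1) < epsilon) {
      a = sample(length(actions), 1)            # explore a new possibility
    } else {
      best = which(Q[s, ] == max(Q[s, ]))       # exploit what has been learned
      a = best[sample(length(best), 1)]         # (break ties at random)
    }
    s_new  = min(max(s + actions[a], 1), n_states)  # the environment reacts
    reward = if (s_new == n_states) 1 else 0        # ...and grants a reward
    # The agent updates its estimate of the action's value
    Q[s, a] = Q[s, a] + alpha * (reward + gamma * max(Q[s_new, ]) - Q[s, a])
    s = s_new
    if (s == n_states) break                    # episode over, start again
  }
}
best_actions = apply(Q, 1, which.max)   # learned policy: "right" in states 1-4
```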
Reinforcement learning already performs well on ‘small’ dynamic systems and is definitely a field to follow in the years to come.
The post Twitter analysis using R (Semantic analysis of French elections) appeared first on Enhance Data Science.
To perform the analysis, I needed a large number of tweets and wanted to capture all the tweets about the election. The Twitter search API is limited, since it only gives access to a sample of tweets. The streaming API, on the other hand, lets you collect the data in real time and capture almost all tweets. Hence, I used the streamR package.
So, I collected tweets in 60-second batches and saved them to .json files. Using batches instead of one large file reduces RAM consumption (instead of reading and then subsetting one large file, you can subset each batch and then merge them). Here is the code to collect the data with streamR.
###Loading my twitter credentials
load("oauth.Rdata")
##Collecting data
require('streamR')
i=1
while(TRUE)
{
  i=i+1
  filterStream(file=paste0("tweet_macronleaks/tweets_rstats",i,".json"),
               track=c("#MacronLeaks"),
               timeout=60,
               oauth=my_oauth,
               language='fr')
}
The code runs an infinite loop (stopped manually); the filterStream function filters the Twitter stream according to the defined filter. Here, we only keep the tweets in French which contain #MacronLeaks.
Now that the tweets are collected, they need to be cleaned and pre-processed. A raw tweet contains links, tabulations, @, #, double spaces, … that would influence the analysis. It also contains stop words (very frequent words in the language such as ‘and’, ‘or’, ‘with’, …).
In addition, some tweets are retweeted (sometimes a lot), which may distort the word and text distributions. Enough of the retweets are kept to show that some tweets are more popular than others, but most of them are removed to keep them from standing out too much.
First, the saved tweets need to be read and merged:
require(data.table)
require(streamR)
data.tweet=NULL
i=1
##Read and merge the saved batches (the loop stops with an error
##once every file has been read)
while(TRUE)
{
  i=i+1
  print(i)
  print(paste0("tweet_macronleaks/tweets_rstats",i,".json"))
  if (is.null(data.tweet))
    data.tweet=data.table(parseTweets(paste0("tweet_macronleaks/tweets_rstats",i,".json")))
  else
    data.tweet=rbind(data.tweet,data.table(parseTweets(paste0("tweet_macronleaks/tweets_rstats",i,".json"))))
}
Then only some of the retweets are kept. The retweet count indexes the successive copies of a given tweet, so for a tweet with n retweets we keep only on the order of log(1+n) of them:
data.tweet[,min_RT:=min(retweet_count),by=text]
data.tweet[,max_RT:=max(retweet_count),by=text]
data.tweet=data.tweet[lang=='fr',]
data.tweet=data.tweet[retweet_count<=min_RT+log(max_RT-min_RT+1),]
Then, the text can be cleaned using functions from the tm package:
###Unaccent and clean the text
Unaccent <- function(x) {
  x = tolower(x)
  x = gsub("@\\w+", "", x)         ##remove mentions
  x = gsub("[[:punct:]]", " ", x)  ##remove punctuation
  x = gsub("[ |\t]{2,}", " ", x)   ##remove tabs and double spaces
  x = gsub("^ ", " ", x)
  x = gsub("http\\w+", " ", x)     ##remove links
  x = gsub('_', ' ', x, fixed = TRUE)
  x
}
require(tm)
###Remove accents
data.tweet$text=Unaccent(iconv(data.tweet$text,from="UTF-8",to="ASCII//TRANSLIT"))
##Remove stop words
data.tweet$text=removeWords(data.tweet$text,c('rt','a',stopwords('fr'),'e','co','pr'))
##Remove double whitespaces
data.tweet$text=stripWhitespace(data.tweet$text)
Now that the tweets have been cleaned, they can be tokenized. During this step, each tweet is split into tokens; here, each word corresponds to one token.
require(text2vec)
# Create iterator over tokens
tokens <- space_tokenizer(data.tweet$text)
it = itoken(tokens, progressbar = FALSE)
Now a vocabulary can be created (it is a “summary” of the words distribution) based on the corpus. Then the vocabulary is pruned (very common and rare words are removed).
vocab = create_vocabulary(it)
vocab = prune_vocabulary(vocab,
                         term_count_min = 5,
                         doc_proportion_max = 0.4,
                         doc_proportion_min = 0.0005)
vectorizer = vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 5L)
tcm = create_tcm(it, vectorizer)
Now, we can create the word embedding; in this example, I used a GloVe embedding to learn vector representations of the words. The new vector space has 200 dimensions.
glove = GlobalVectors$new(word_vectors_size = 200, vocabulary = vocab, x_max = 100)
glove$fit(tcm, n_iter = 200)
word_vectors <- glove$get_word_vectors()
Now that the words are vectors, we would like to plot them in two dimensions to show the meaning of the words in an appealing (and understandable) way. The number of dimensions needs to be reduced to two; to do so, we will use T-SNE. T-SNE is a non-parametric dimensionality reduction algorithm that tends to perform well on word embeddings. R has two packages to perform T-SNE; we will use the most recent one, Rtsne.
To avoid overcrowding the plot and to reduce computing time, only words with more than 50 appearances will be used.
require('Rtsne')
set.seed(123)
word_vectors_sne=word_vectors[which(vocab$vocab$doc_counts>50 & !rownames(word_vectors)%in%stopwords('fr')),]
tsne_out=Rtsne(word_vectors_sne,perplexity=2,initial_dims=200,dims=2)
DF_proj=data.frame(x=tsne_out$Y[,1],y=tsne_out$Y[,2],word=rownames(word_vectors_sne))
Now that the projection in two dimensions has been done, we’d like to know which contender is assigned to each word, in order to color the plot. To do so, a dictionary is created with the name and handles of each contender, and the distance from every word to each of these handles is computed.
For instance, to assign a candidate to the word ‘democracy’, the minimum distance between ‘democracy’ and ‘mlp’, ‘marine’, ‘fn’ is computed, and likewise between ‘democracy’ and ‘macron’, ‘em’, ‘enmarche’. If the first distance is the smaller one, ‘democracy’ is assigned to Marine Le Pen; otherwise, it is assigned to Emmanuel Macron.
require(ggplot2)
require(ggrepel)
require(plotly)
DF_proj=data.table(DF_proj)
##The filter must match the one used to build word_vectors_sne
DF_proj$count=vocab$vocab$doc_counts[which(vocab$vocab$doc_counts>50 & !(rownames(word_vectors)%in%stopwords('fr')))]
DF_proj=DF_proj[word!='NA']

distance_to_candidat=function(word_vectors,words_list,word_in)
{
  max(sim2(word_vectors[words_list,,drop=F],word_vectors[word_in,,drop=F]))
}

closest_candidat=function(word_vectors,word_in)
{
  mot_le_pen=c('marine','pen','lepen','fn','mlp')
  mot_macron=c('macron','emmanuel','em','enmarche','emmanuelmacron')
  dist_le_pen=distance_to_candidat(word_vectors,mot_le_pen,word_in)
  dist_macron=distance_to_candidat(word_vectors,mot_macron,word_in)
  if (dist_le_pen>dist_macron) 'Le Pen' else 'Macron'
}

DF_proj[,word:=as.character(word)]
DF_proj=DF_proj[word!=""]
DF_proj[,Candidat:=closest_candidat(word_vectors,word),by=word]

gg=ggplot(DF_proj,aes(x,y,label=word,color=Candidat))+geom_text(aes(size=sqrt(count+1)))
ggplotly(gg)
You can get our latest news on Twitter: