Everyday R Code (19) – A function example to compare loops and vectorization speed

# to compare the speed of 2 methods: loops and vectorization
n=10
# generate 100 numbers between 1 and 1000, and then make a 10 by 10 matrix
A=matrix(runif(100,1,1000),nrow=n,ncol=n)
B=matrix(runif(100,1,100),nrow=n,ncol=n)
# method 1
A%*%B
# get system.time to compare with the other method
system.time(A%*%B)
# method 2
# using a function
MultiplyMatrices=function(A,B,n){
R=matrix(data=0,nrow=n,ncol=n)
for (i in 1:n)
for (j in … Read more

Everyday R Code (18) – matrix calculation

Calculation 1: Subtract different values from multiple columns
b <- matrix(rep(1:20), nrow=4, ncol=5)
c <- c(1,2,4)
b
c
for(i in 1:nrow(b)) { b[i,3:5] <- b[i,3:5] - c }
b
Calculation 2: Subtract a matrix from a matrix across multiple columns
b <- matrix(rep(1:20), nrow=4, ncol=5)
d <- matrix(rep(2:21), nrow=4, ncol=5)
b
d
for(i in 1:nrow(b)) { b[i,3:5] … Read more

Interview: Concordant, Discordant and Tied Pairs for model validation

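What are concordant, discordant and tied pairs in model validation? A friend who interviewed with Amazon for a data-related position was asked this question. Here is a very clear explanation: http://www.listendata.com/2014/08/modeling-tips-calculating-concordant.html   The basic idea: put the observations whose actual value is 1 in one group (a of them) and those whose actual value is 0 in another group (b of them), then take the Cartesian product to get a × b pairs. For each pair, compare the predicted value of the actual-1 observation with that of the actual-0 observation. If the actual 1 has the higher prediction, it is a concordant pair; if the actual 0 has the higher prediction, it is a discordant pair; if the two are equal, it is a tied pair. A good model has more concordant pairs and fewer discordant and tied pairs; as a rule of thumb, a concordant share above 80% is considered good.   … Read more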
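The pairing procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_pairs` and the sample data are made up for this example and do not come from the linked article.

```python
from itertools import product

def count_pairs(actual, predicted):
    """Count concordant, discordant and tied pairs.

    actual: list of 0/1 outcomes; predicted: list of model scores.
    (Hypothetical helper written for this illustration.)
    """
    # Split the predicted scores by the actual outcome.
    ones = [p for y, p in zip(actual, predicted) if y == 1]
    zeros = [p for y, p in zip(actual, predicted) if y == 0]

    counts = {"concordant": 0, "discordant": 0, "tied": 0}
    # Cartesian product: every actual-1 score against every actual-0 score.
    for p1, p0 in product(ones, zeros):
        if p1 > p0:
            counts["concordant"] += 1
        elif p1 < p0:
            counts["discordant"] += 1
        else:
            counts["tied"] += 1
    return counts

# Made-up data: three actual 1s and two actual 0s give 3 x 2 = 6 pairs.
actual = [1, 1, 1, 0, 0]
predicted = [0.9, 0.6, 0.3, 0.6, 0.2]

counts = count_pairs(actual, predicted)
total = sum(counts.values())
concordant_share = counts["concordant"] / total
print(counts, round(concordant_share, 3))
```

Dividing each count by the total number of pairs gives the percentages that are usually reported.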

Everyday R Code (17) Pivot table in R to replace Excel

Pivot table in R to replace Excel
Step 1
# converting data format using data.table
library(data.table)
live<-data.table(live)
Step 2
# finding unique count for each bucket using list
uniqueData<-live[,list(Unique_user_Count=length(unique(User_ID))),by=list(Market, Company,Group)]
Step 3
# pivot the table using the dcast function in the reshape2 package
# install the package if you haven't
install.packages("reshape2")
library(reshape2)
pivot<-dcast(uniqueData, Market+Company ~ Group , value.var="Unique_user_Count", fun.aggregate=sum)
… Read more

Everyday SQL (7)

SQL functions have many interesting uses; case when, for example, is very powerful. Hope you can find something interesting or useful in the code below. It was used in a real working environment to fill a business data request. with vvvos as ( select distinct map.ccc_o_group_id os from base.dddd_o_mapping map inner join base.dddd_ccc_o_groups groups … Read more

Everyday SQL (6)

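This is something I ran into at work: I got a pile of SQL code from another team, written four or five years ago by someone who worked there at the time. It looks like an old style that I had never seen before, so I am posting it here for everyone. Q: What is the meaning of (+) in SQL queries (Oracle)?   A: It's Oracle's legacy syntax for an OUTER JOIN. Example:   SELECT * FROM a, b WHERE b.id(+) = a.id   gives the same result as   SELECT * FROM a LEFT OUTER JOIN b ON b.id = a.id   Or we can … Read more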
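To see what the LEFT OUTER JOIN form returns, here is a small sketch using Python's built-in sqlite3 module. Note the assumptions: SQLite does not accept Oracle's (+) operator, so only the ANSI form is run, and the columns and rows in tables a and b are made up for illustration.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, score INTEGER);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (1, 10), (3, 30);
""")

# ANSI form of the join: every row of a is kept,
# and b's columns are NULL where no b.id matches.
outer_rows = conn.execute(
    "SELECT a.id, a.name, b.score "
    "FROM a LEFT OUTER JOIN b ON b.id = a.id "
    "ORDER BY a.id"
).fetchall()

# For contrast, an inner join drops the unmatched row of a.
inner_rows = conn.execute(
    "SELECT a.id, a.name, b.score "
    "FROM a INNER JOIN b ON b.id = a.id "
    "ORDER BY a.id"
).fetchall()
conn.close()

print(outer_rows)
print(inner_rows)
```

The row with id 2 has no match in b, so it appears only in the outer join result, with a missing score.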

Python question list

If we have a probability for each value, how can we see the distribution of the values? We can use a histogram to see the likely overall shape of the pdf (probability density function), then use KDE (Kernel Density Estimation) to fit the pdf. Reference: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/ Do you use random forest? If so, what is entropy? To measure the quality of a split: … Read more
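As a rough illustration of the KDE idea mentioned above, here is a minimal Gaussian kernel density estimate written with the standard library only. The function name `gaussian_kde`, the fixed bandwidth of 0.5 and the sample values are all assumptions made for this sketch; in practice a library implementation with automatic bandwidth selection would normally be used.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the pdf at a point x.

    Each sample contributes a Gaussian bump of width `bandwidth`;
    the estimate is the average of those bumps.
    (Hypothetical helper written for this illustration.)
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Made-up sample with two clusters, around 2 and around 5.
samples = [1.0, 1.2, 1.9, 2.1, 2.3, 4.8, 5.0, 5.3]
density = gaussian_kde(samples, bandwidth=0.5)

# A pdf should integrate to about 1: check with a simple Riemann sum
# over a grid from -5 to 10, wide enough to cover both clusters.
step = 0.01
grid = [-5 + i * step for i in range(1501)]
area = sum(density(x) for x in grid) * step

print(round(density(2.0), 3), round(density(3.5), 3), round(area, 3))
```

The estimate is higher near the clusters than in the gap between them, which is the smooth counterpart of what a histogram of the same values would show.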