Everyday R Code (19) – A function example to compare loops and vectorization speed
#to compare speed of 2 methods: loops and vectorization
n=10
#generate 100 numbers between 1 and 1000, and then make a 10 by 10 matrix
A=matrix(runif(100,1,1000),nrow=n,ncol=n)
B=matrix(runif(100,1,100),nrow=n,ncol=n)
#method 1
A%*%B
#get system.time to compare with the other method
system.time(A%*%B)
#method 2
#using a function
MultiplyMatrices=function(A,B,n){
R=matrix(data=0,nrow=n,ncol=n)
for (i in 1:n)
for (j in … Read more
Everyday R Code (18) – matrix calculation
Calculation 1: Subtract a different value from each of several columns
b <- matrix(rep(1:20), nrow=4, ncol=5)
c <- c(1,2,4)
b
c
for(i in 1:nrow(b)) { b[i,3:5] <- b[i,3:5] - c }
b
Calculation 2: Subtract a matrix from a matrix across multiple columns
b <- matrix(rep(1:20), nrow=4, ncol=5)
d <- matrix(rep(2:21), nrow=4, ncol=5)
b
d
for(i in 1:nrow(b)) { b[i,3:5] … Read more
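Interview: Concordant, Discordant and Tied Pairs for model validation
What are Concordant, Discordant and Tied Pairs for model validation? A friend who interviewed with Amazon for a data-related position was asked this question. Here is a very clear solution: http://www.listendata.com/2014/08/modeling-tips-calculating-concordant.html The basic idea: put the observations whose actual value is 1 in one group (a of them) and those whose actual value is 0 in another group (b of them), then take the Cartesian product to get a × b pairs. For each pair, compare the predicted value of the actual-1 observation with the predicted value of the actual-0 observation. If the actual-1 observation has the higher prediction, it is a concordant pair; if the actual-0 observation has the higher prediction, it is a discordant pair; if the two predictions are equal, it is a tied pair. A good model has more concordant pairs and fewer discordant and tied pairs. As a rule of thumb, it is good when concordant pairs make up more than 80% of all pairs. … Read more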
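The pairing procedure described above can be sketched in a few lines of Python. This is only an illustration, not the code from the linked post: the function name `pair_counts` and the sample values are made up, and the brute-force Cartesian product is fine for small data but slow for large data sets.

```python
from itertools import product


def pair_counts(actual, predicted):
    """Return (concordant, discordant, tied) counts for a binary outcome."""
    # Split the predicted values by the actual class (hypothetical helper).
    ones = [p for y, p in zip(actual, predicted) if y == 1]
    zeros = [p for y, p in zip(actual, predicted) if y == 0]

    concordant = discordant = tied = 0
    # Cartesian product: every actual-1 observation paired with every actual-0 one.
    for p1, p0 in product(ones, zeros):
        if p1 > p0:
            concordant += 1
        elif p1 < p0:
            discordant += 1
        else:
            tied += 1
    return concordant, discordant, tied


# Made-up example: three actual 1s and two actual 0s give 3 x 2 = 6 pairs.
actual = [1, 1, 1, 0, 0]
predicted = [0.9, 0.6, 0.3, 0.4, 0.3]
concordant, discordant, tied = pair_counts(actual, predicted)
total = concordant + discordant + tied
print(concordant, discordant, tied, round(100 * concordant / total, 1))
```

Dividing each count by the total number of pairs gives the percentages that are compared against the 80% rule of thumb.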
Everyday R code (17) Pivot table in R to replace excel
Pivot table in R to replace excel
Step1
#converting data format using data.table
library(data.table)
live<-data.table(live)
Step2
#finding unique count for each bucket using list
uniqueData<-live[,list(Unique_user_Count=length(unique(User_ID))),by=list(Market, Company,Group)]
Step3
#pivot the table using dcast function in reshape2 package
#install the package if you haven't
install.packages("reshape2")
library(reshape2)
pivot<-dcast(uniqueData, Market+Company ~ Group , value.var="Unique_user_Count", fun.aggregate=sum) … Read more
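For readers who want to see what the two steps compute without R, here is a rough Python sketch using only the standard library. The column names follow the excerpt, but the rows are invented sample data and the dictionary layout is an assumption for illustration, not the output format of `dcast`.

```python
from collections import defaultdict

# Invented sample rows; only the column names come from the post.
rows = [
    {"Market": "US", "Company": "A", "Group": "G1", "User_ID": 1},
    {"Market": "US", "Company": "A", "Group": "G1", "User_ID": 1},  # repeat user
    {"Market": "US", "Company": "A", "Group": "G1", "User_ID": 2},
    {"Market": "US", "Company": "A", "Group": "G2", "User_ID": 3},
    {"Market": "EU", "Company": "B", "Group": "G1", "User_ID": 4},
]

# Step 2 equivalent: collect the unique users per (Market, Company, Group).
users = defaultdict(set)
for r in rows:
    users[(r["Market"], r["Company"], r["Group"])].add(r["User_ID"])

# Step 3 equivalent: one row per (Market, Company), one column per Group,
# with missing combinations filled with 0.
groups = sorted({group for (_, _, group) in users})
pivot = defaultdict(lambda: dict.fromkeys(groups, 0))
for (market, company, group), ids in users.items():
    pivot[(market, company)][group] += len(ids)

for key in sorted(pivot):
    print(key, pivot[key])
```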
Everyday SQL (7)
There are many interesting uses of SQL functions, such as CASE WHEN, which is very powerful. I hope you find something interesting or useful in the code below. It was used in a real working environment to fill a business data request. with vvvos as ( select distinct map.ccc_o_group_id os from base.dddd_o_mapping map inner join base.dddd_ccc_o_groups groups … Read more
Everyday SQL (6)
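I ran into this at work: I got a pile of SQL code from another team, written by someone who worked there about four or five years ago. It looks like an old style that I had never seen before, so I am posting it here for everyone. Q: What's the meaning of (+) in SQL queries (Oracle)? A: It's Oracle's older syntax for OUTER JOIN. Example: SELECT * FROM a, b WHERE b.id(+) = a.id gives the same result as SELECT * FROM a LEFT OUTER JOIN b ON b.id = a.id Or we can … Read more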
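To see what the LEFT OUTER JOIN form returns, here is a small sketch using Python's built-in `sqlite3` module. SQLite does not support Oracle's `(+)` syntax, so only the ANSI version is run; the extra columns `name` and `score` and all the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, score INTEGER);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (1, 10), (3, 30);
    """
)

# Every row of a is kept; columns from b are NULL (None) where no match exists.
rows = conn.execute(
    "SELECT a.id, a.name, b.score "
    "FROM a LEFT OUTER JOIN b ON b.id = a.id "
    "ORDER BY a.id"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The row with id 2 has no match in b, so it still appears in the result with an empty score. That is the behavior the `(+)` marker on the b side asks for in the Oracle query.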
Spyder did not work any more from Anaconda (02/18/2017). Here is the solution.
My Spyder stopped working from Anaconda. Here is the solution, based on this post. Open your Terminal and use the command below. (There is a file called .spyder or something similar, and it might be broken. So we can rename it first, then open Spyder from Anaconda to generate a new … Read more
latest procedure to install python and packages (Jan.27, 2017)
Happy Chinese New Year! I have to use a search tool to get information, and it's complicated. I need Python to finish this task. (I still have not figured it out, even with help from SDE and DS friends who are good at coding. I'll need my husband's help. 🙂 ) I tried on Windows and … Read more
Python question list
If we have a probability for each value, how can we see the values' distribution? We can use a histogram to get a rough view of the overall PDF (probability density function), then use KDE (Kernel Density Estimation) to fit the PDF. Reference: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/ Do you use random forest? If so, what's entropy? To measure the quality of a split: … Read more
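As a minimal sketch of the KDE idea mentioned in the first answer, the snippet below builds a Gaussian kernel density estimate by hand with the standard library. The sample values and the bandwidth of 0.5 are arbitrary assumptions; in practice a library implementation such as the ones discussed in the linked reference would normally be used.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x from the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up observations and an arbitrary bandwidth.
samples = [1.0, 1.2, 1.9, 2.1, 2.3, 3.8]
pdf = gaussian_kde(samples, bandwidth=0.5)

# Sanity check: a density should integrate to roughly 1 (simple Riemann sum).
step = 0.01
grid = [-4.0 + i * step for i in range(1201)]  # covers -4 to 8
area = sum(pdf(x) for x in grid) * step
print(round(area, 3))
```

A smaller bandwidth follows the samples more closely, while a larger one gives a smoother curve, much like choosing the bin width of a histogram.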