How to Use Scala UDF and UDAF in PySpark

2017-09-15

The Spark DataFrame API provides efficient, easy-to-use operations for analyzing distributed collections of data. Many users love the PySpark API, which they find more approachable than the Scala API. However, UDFs written in PySpark can become a performance problem. How about implementing those UDFs in Scala and calling them from PySpark? Also, in Spark 2.0 a UDAF can only be defined in Scala, so how do we use one from PySpark? Let's give it a try.

My First Commit to Spark Community

2017-08-30

I have been working on Spark-related projects for almost 2 years. Today I submitted a small patch to the Spark community. I hope to become a contributor.
https://issues.apache.org/jira/browse/SPARK-21859

Spark Streaming Exactly-Once Analysis

2017-08-27

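Recently I have been working a lot with Spark Streaming, focusing mainly on requirements around streaming accuracy. After nearly half a year of this, I can't help asking why we need to spend so much time on exactly-once. How does streaming processing logic differ from batch? I think streaming is best suited to simple filtering, logic that can finish within 100 ms, yet batch can compute that logic too, so why streaming? What users really want is speed. If batch could also meet low-latency requirements, a streaming system would not be needed. So the question is: why do we need a separate streaming system?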

Set Up Apache Storm on Mac in 10min

2017-04-10

Storm is a great real-time streaming system. My current project is built on Spark Streaming, and I want to learn Storm as well to understand streaming systems better. Okay, let's fire it up.
Today I installed a Storm cluster on my local Mac.
It was easy to install and takes about 10 minutes.

Machine Learning Logistic Regression

2017-03-25

Logistic regression is for classification problems, where the predicted value is one of a fixed set of discrete values, such as 1 for positive or 0 for negative. The essentials of logistic regression are:

the hypothesis function is the sigmoid function
the cost function: J(theta)
gradient descent and other minimization algorithms
advanced optimization with regularization to solve the overfitting problem
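
The four pieces listed above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code from the post: the toy dataset, the learning rate `alpha`, the regularization strength `lam`, and all function names are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta . x); x[0] is assumed to be the bias input 1.0."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y, lam=0.0):
    """Regularized cross-entropy cost J(theta); the bias theta[0] is not penalized."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])
    return total / m + penalty

def gradient_descent(X, y, alpha=0.1, lam=0.0, steps=1000):
    """Batch gradient descent on J(theta), starting from all zeros."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(steps):
        errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        for j in range(1, n):  # add the L2 term for every weight except the bias
            grad[j] += lam / m * theta[j]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data: one feature, negative values labeled 0, positive labeled 1.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

theta = gradient_descent(X, y, alpha=0.1, lam=0.1, steps=500)
predictions = [1 if hypothesis(theta, xi) >= 0.5 else 0 for xi in X]
```

Thresholding the hypothesis at 0.5 turns the probability into the discrete 0/1 prediction. The penalty term keeps the weight from growing without bound on this separable toy data, which is the overfitting concern the last item refers to.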

Binary Search Algorithm in Scala

2017-02-21

One day, I wanted to use binary search for a feature in my project. A friend said the algorithm was not easy to implement bug-free. I didn't believe that. I spent 10 minutes writing it.

My Booklist and Reservations for 2017

2017-01-22

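I never got around to writing a retrospective on 2016, and there is a lot to cover. 2016 was an eventful year, and I also came to think differently about technical growth: doing technical work is no longer about grinding away at a particular tool, algorithm, or bug. In essence it is about solving problems or building better products. Although what I build is not a concrete product but low-level tools and platforms, those tools also depend on a "pillar application" as their outlet, so it is worth thinking about this more.

Work was busy in 2016 and I did not read many books. Still, Zuckerberg is busier than we are and managed to take on 23 books in a year, which is really impressive. In truth my time management is not good enough: I spent most weekends lazing in bed or out shopping. Looking back at 2016, the books I read were: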

Scala Collections

2016-11-07

Scala has many fancy collections with great utilities. Here are some key notes on Scala collections that have been a great help to me.

Eight Queens Problem in Scala

2016-09-27

I have been dedicated to reading Programming in Scala for about 4 months. Work is busy, but I can't give up reading more books.
Scala is a fabulous language, both object-oriented and functional.
The eight queens problem can be expressed in Scala easily and concisely.

Machine Learning Neural Networks

2016-04-17

This week is about the mysterious neural networks. This week's courses just explain the basics of neural networks.

What is a Neural Network

It's a technique for learning from data, loosely based on how the human brain works. A simple neural network has:

input layer
hidden layer
output layer

We use neural networks for classification and regression.
We use the sigmoid function to map data from the input layer to the hidden layer and then to the output layer; this function is called the activation function.
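
To make the layer-by-layer mapping concrete, here is a small forward-pass sketch in plain Python. It is an illustration under stated assumptions rather than anything from the post: the weights are hand-picked (not trained) so that the network computes XNOR on two binary inputs, and the names `layer`, `HIDDEN`, `OUTPUT`, and `forward` are made up for the example.

```python
import math

def sigmoid(z):
    """Activation function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, inputs):
    """One fully connected layer; each row of weights is [bias, w1, w2, ...]."""
    return [
        sigmoid(row[0] + sum(w * x for w, x in zip(row[1:], inputs)))
        for row in weights
    ]

# Hand-picked weights (an assumption for illustration, not learned from data).
HIDDEN = [
    [-30.0, 20.0, 20.0],   # roughly x1 AND x2
    [10.0, -20.0, -20.0],  # roughly (NOT x1) AND (NOT x2)
]
OUTPUT = [
    [-10.0, 20.0, 20.0],   # roughly a1 OR a2
]

def forward(x1, x2):
    """Input layer -> hidden layer -> output layer, sigmoid at each step."""
    hidden = layer(HIDDEN, [x1, x2])
    return layer(OUTPUT, hidden)[0]

# Round the output activation to get a 0/1 class for every input pair.
results = {(a, b): round(forward(a, b)) for a in (0, 1) for b in (0, 1)}
```

The hidden layer is what lets the network represent XNOR, which a single sigmoid unit cannot do because the two classes are not linearly separable.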