AI and Big Data

ppmml Released Today

ppmml was released on the last day before the New Year holiday.
ppmml is a Python library for converting machine learning models to PMML files. It wraps the jpmml libraries and provides a clean interface.

What is a PMML file?

PMML (Predictive Model Markup Language) is a standard for XML documents that express trained instances of analytic models.
Various platforms have adopted PMML as a machine learning model standard, including IBM, SAS, Microsoft, Spark, KNIME, etc.

jpmml has developed PMML model libraries that support models from Spark, XGBoost, TensorFlow, sklearn, LightGBM and R. These libraries are separate from one another and are all written in Java.
ppmml wraps the jpmml libraries and provides a simple, easy-to-use API for converting models to PMML files.
Version 0.0.1 supports sklearn, TensorFlow, Spark, LightGBM, XGBoost and R models. Every model supported by jpmml is supported by ppmml. Common machine learning algorithms are covered, such as Decision Tree, Logistic Regression, GBDT, Random Forest and KMeans. Deep learning support, however, is not ready yet.

Continue reading

How to Use Scala UDF and UDAF in PySpark

The Spark DataFrame API provides efficient and easy-to-use operations for analyzing distributed collections of data. Many users love the PySpark API, which they find more approachable than the Scala API. However, UDFs written in PySpark can become a performance problem. How about implementing these UDFs in Scala and calling them from PySpark? Also, in Spark 2.0 a UDAF can only be defined in Scala, so how can we use one from PySpark? Let's give it a try.

Continue reading

Spark Streaming Exactly-Once Analysis

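Recently I have been working a lot with Spark Streaming, mostly on requirements around the correctness of streaming results. After almost half a year of this, I can't help asking why we need to spend so much time on exactly-once. How does streaming processing logic differ from batch processing logic? In my view, streaming is better suited to simple filtering, logic that can finish within 100ms, yet batch can compute that logic too, so why use streaming? What users really want is speed. If batch could also meet low-latency requirements, a streaming system would not be needed. So the question is: why do we need a separate streaming system?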

Continue reading

Set Up Apache Storm on Mac in 10min

Storm is a great real-time streaming system. My current project uses Spark Streaming, and I want to learn Storm as well to understand streaming systems better. Okay, let's fire it up.
Today I tried to install a Storm cluster on my local Mac.
It was easy to install and should take you about 10 minutes.

Continue reading

Machine Learning Logistic Regression

Logistic regression is used for classification problems, where the predicted value is one of a fixed set of discrete values, such as 1 for positive or 0 for negative. The essentials of logistic regression are:

  • the hypothesis function is the sigmoid function
  • the cost function: J(theta)
  • gradient descent and algorithms
  • advanced optimization, with regularization to solve the overfitting problem.
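
The pieces listed above can be sketched together in a few lines of plain Python. This is only a minimal illustration, not the code from the full post: the function names, the toy data set, the learning rate `alpha`, the regularization strength `lam` and the iteration count are all assumptions chosen for the example, and the bias term `theta[0]` is left unregularized by convention.

```python
import math


def sigmoid(z):
    """Sigmoid hypothesis: maps any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


def predict_proba(theta, x):
    """h_theta(x) = sigmoid(theta . x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y, lam=0.0):
    """Cross-entropy cost J(theta) with an L2 penalty on theta[1:]."""
    m = len(X)
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total -= yi * math.log(h + eps) + (1 - yi) * math.log(1 - h + eps)
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])
    return total / m + penalty


def gradient_descent(X, y, alpha=0.1, lam=0.0, iters=2000):
    """Batch gradient descent on the regularized cost."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            err = predict_proba(theta, xi) - yi
            for j in range(n):
                grad[j] += err * xi[j]
        for j in range(n):
            g = grad[j] / m
            if j > 0:  # the bias term is not regularized
                g += lam / m * theta[j]
            theta[j] -= alpha * g
    return theta


# Hypothetical toy data: the first column is the constant 1 for the bias,
# the second is a single already-centered feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
     [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y, alpha=0.1, lam=0.1)
predictions = [1 if predict_proba(theta, xi) >= 0.5 else 0 for xi in X]
```

Raising `lam` shrinks the feature weights toward zero, which is the mechanism regularization uses to counter overfitting; with `lam=0.0` the same code performs plain, unregularized logistic regression.
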
Continue reading

Binary Search Algorithm in Scala

One day, I wanted to use binary search for a feature in my project. A friend said the algorithm was not easy to implement bug-free. I didn't believe that, so I spent 10 minutes writing it.

Continue reading

My Booklist and Reservations for 2017

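I never got around to writing a review of 2016, for a number of reasons. A lot happened in 2016, and I also came to think differently about technical growth: working in technology is no longer about grinding away at a particular tool, algorithm or bug. In essence, it is about solving problems or building better products. Although what I build is not a specific product but the underlying tools and platforms, those tools still depend on a "pillar application" as their outlet, so it is worth giving this more thought.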
Continue reading