Java Performance Toolbox

I read Chapter 3 of Java Performance: The Definitive Guide this weekend; here is a brief summary of the Java performance toolbox.

System Monitoring Tools

1. CPU Usage

vmstat: reports virtual memory statistics, including information about processes, memory, paging, block I/O, traps, disks, and CPU activity.
vmstat [options] [delay [count]]
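
As a rough illustration of reading the CPU columns, here is a minimal Python sketch that parses vmstat text output. It assumes the Linux (procps) column layout, where `id` is the idle CPU percentage; the `SAMPLE` text is made-up illustrative output, and `parse_vmstat` and `cpu_busy_percent` are hypothetical helper names.

```python
# Illustrative vmstat output in the Linux (procps) layout; values are made up.
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  10240 402112    0    0     5    12  110  220 12  3 84  1  0
"""


def parse_vmstat(text):
    """Parse vmstat text output into a list of {column_name: int} dicts."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    # Line 0 groups the columns (procs, memory, ...); line 1 names them.
    header = lines[1].split()
    rows = []
    for ln in lines[2:]:
        fields = ln.split()
        # Skip the header lines that vmstat repeats during long runs.
        if len(fields) != len(header) or not fields[0].isdigit():
            continue
        rows.append(dict(zip(header, map(int, fields))))
    return rows


def cpu_busy_percent(row):
    """Everything that is not idle time: user, system, I/O wait, and steal."""
    return 100 - row["id"]


rows = parse_vmstat(SAMPLE)
print(cpu_busy_percent(rows[0]))
```

To try this on live data, the text could come from something like `subprocess.run(["vmstat", "1", "5"], capture_output=True, text=True).stdout`, which samples once per second, five times.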

Big Data And ML Learning

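As the working days go by, I can't help thinking about my plans for the future. Work is mostly about business and results, leaving little time for study, and self-improvement takes more self-motivation than it did at school. I have always worked in big data; although there is more business work now, my direction has not changed, and I have gained a lot of hands-on machine learning practice. Below are the books and key points I most want to study: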

Awesome Books for 2018

One of my 2018 resolutions is to read more books. Here I list some great books in my plan.

Machine Learning

  • Machine Learning: A Probabilistic Perspective
  • Deep Learning (Ian Goodfellow)
  • Pattern Recognition and Machine Learning (Christopher M. Bishop)
  • The Elements of Statistical Learning
  • Hands-On Machine Learning with Scikit-Learn and TensorFlow (in progress now)
  • Python Machine Learning
  • 数学之美 (The Beauty of Mathematics)
  • 统计学 (Statistics, for review)
  • 统计学习方法 (Statistical Learning Methods)
  • 机器学习 (Machine Learning)

ppmml Published Today

On the last day before the New Year holiday, ppmml was published.
ppmml is a Python library for converting machine learning models to PMML files. It wraps the jpmml libraries and provides a clean interface.

What is a PMML file?

PMML stands for “Predictive Model Markup Language”, a standard for XML documents that express trained instances of analytic models.
Various platforms adopt PMML as a machine learning model standard, including IBM, SAS, Microsoft, Spark, KNIME, etc.
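
To make "XML documents that express trained models" concrete, here is a minimal Python sketch that reads an abridged, hand-written PMML 4.3 document with the standard library. The sample model, its field names, and its coefficients are invented for illustration, and `describe_pmml` is a hypothetical helper, not part of ppmml or jpmml.

```python
import xml.etree.ElementTree as ET

# PMML 4.3 elements live in this XML namespace.
PMML_NS = {"pmml": "http://www.dmg.org/PMML-4_3"}

# Abridged, hand-written document; fields and coefficients are made up.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header description="illustrative logistic regression"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="categorical" dataType="string"/>
  </DataDictionary>
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="-1.5" targetCategory="1">
      <NumericPredictor name="x" coefficient="0.8"/>
    </RegressionTable>
    <RegressionTable intercept="0.0" targetCategory="0"/>
  </RegressionModel>
</PMML>"""


def describe_pmml(xml_text):
    """Return the PMML version, declared data fields, and model element names."""
    root = ET.fromstring(xml_text)
    fields = {
        field.get("name"): field.get("dataType")
        for field in root.findall("pmml:DataDictionary/pmml:DataField", PMML_NS)
    }
    # Tags look like "{namespace}RegressionModel"; keep only the local name.
    models = [
        child.tag.split("}", 1)[1] for child in root if child.tag.endswith("Model")
    ]
    return {"version": root.get("version"), "fields": fields, "models": models}


print(describe_pmml(SAMPLE_PMML))
```

Because the trained model is plain XML like this, any platform with a PMML consumer can load it without knowing which library produced it.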

jpmml has developed PMML conversion libraries that support Spark, XGBoost, TensorFlow, scikit-learn, LightGBM, and R models. These libraries are all separate and written in Java.
ppmml wraps the jpmml libraries and provides a simple, easy-to-use API for converting models to PMML files.
Version 0.0.1 supports scikit-learn, TensorFlow, Spark, LightGBM, XGBoost, and R models. All models supported by jpmml are supported by ppmml. Common machine learning algorithms are supported, such as Decision Tree, Logistic Regression, GBDT, Random Forest, and KMeans. However, deep learning support is not ready yet.

How to Use Scala UDF and UDAF in PySpark

The Spark DataFrame API provides efficient, easy-to-use operations for analyzing distributed collections of data. Many users love the PySpark API, which is more usable than the Scala API. However, UDFs written in PySpark can become a performance problem. How about implementing these UDFs in Scala and calling them from PySpark? Also, in Spark 2.0 a UDAF can only be defined in Scala, so how do we use one from PySpark? Let's give it a try.
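
One possible approach, sketched below, registers JVM-side functions by class name so they can be called from Spark SQL. It assumes a newer PySpark (2.3 or later), where `spark.udf.registerJavaFunction` and `spark.udf.registerJavaUDAF` are available, and it assumes the compiled Scala classes are on the classpath (for example via `--jars`). The helper `register_scala_functions` and the class names in the usage note are hypothetical.

```python
def register_scala_functions(spark, udfs=(), udafs=()):
    """Register JVM-side (Scala or Java) functions for use in Spark SQL.

    spark: an active SparkSession (PySpark 2.3+ is assumed here).
    udfs:  iterable of (sql_name, jvm_class_name, return_type) tuples; the class
           must implement one of org.apache.spark.sql.api.java.UDF1..UDF22.
    udafs: iterable of (sql_name, jvm_class_name) tuples; the class must extend
           org.apache.spark.sql.expressions.UserDefinedAggregateFunction.
    """
    for name, class_name, return_type in udfs:
        # return_type may be a DataType, a DDL type string, or None.
        spark.udf.registerJavaFunction(name, class_name, return_type)
    for name, class_name in udafs:
        spark.udf.registerJavaUDAF(name, class_name)
```

With a session in hand, a call might look like `register_scala_functions(spark, udfs=[("str_len", "com.example.udf.StrLen", "int")], udafs=[("geo_mean", "com.example.udaf.GeoMean")])`, after which `spark.sql("SELECT str_len(name) FROM people")` runs the Scala code inside the JVM without passing rows through Python workers. The table and column names here are placeholders too.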

Spark Streaming Exactly-Once Analysis

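Recently I have been working a lot with Spark Streaming, mainly on requirements around the accuracy of streaming results. After almost half a year of this, I can't help asking why exactly-once takes so much time. What is the difference between streaming and batch processing logic? I think streaming suits simple filtering, logic that can finish within 100 ms, yet batch can compute the same logic, so why use streaming? What users want most is speed. If batch could also meet low-latency requirements, a streaming system would be unnecessary. So the question is: why do we need a separate streaming system?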

Set Up Apache Storm On Mac In 10min

Storm is a great real-time streaming system. My recent project uses Spark Streaming, and I want to learn Storm too, to understand streaming systems better. Okay, let’s fire it up.
Today I installed a Storm cluster on my local Mac.
It was easy, and it will take you about 10 minutes.

Machine Learning Logistic Regression

Logistic regression is used for classification problems, where the predicted value is one of a fixed set of discrete values, such as 1 for positive or 0 for negative. The essentials of logistic regression are:

  • hypothesis function is sigmoid function
  • cost function: J(theta)
  • gradient descent and algorithms
  • advanced optimization, with regularization to address the overfitting problem.
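
The points above can be sketched in plain Python: a sigmoid hypothesis, the cross-entropy cost J(theta), and batch gradient descent with an optional L2 penalty. This is a minimal illustration under stated assumptions, not the post's own implementation: the function names, the learning rate `alpha`, the penalty weight `lam`, and the toy data are all hypothetical, and the bias term `theta[0]` is left unregularized by convention.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(theta, x):
    """Hypothesis h(x) = sigmoid(theta[0] + theta[1:] . x)."""
    return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))


def cost(theta, X, y, lam=0.0):
    """Cross-entropy cost J(theta) plus an L2 penalty on the non-bias weights."""
    m, eps = len(X), 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        h = min(max(predict_proba(theta, xi), eps), 1 - eps)  # keep log() finite
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / m + lam / (2 * m) * sum(t * t for t in theta[1:])


def fit(X, y, alpha=0.1, lam=0.0, iters=2000):
    """Batch gradient descent starting from theta = 0."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iters):
        errors = [predict_proba(theta, xi) - yi for xi, yi in zip(X, y)]
        grad = [sum(errors) / m]  # bias gradient, not regularized
        for j in range(n):
            g = sum(e * xi[j] for e, xi in zip(errors, X)) / m
            grad.append(g + lam / m * theta[j + 1])
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy one-feature data: small x is class 0, large x is class 1.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
theta = fit(X, y, lam=0.01)
print(theta, cost(theta, X, y, lam=0.01))
```

A prediction is then made by thresholding `predict_proba(theta, x)` at 0.5. Raising `lam` shrinks the weights, which is the regularization lever against overfitting mentioned in the last bullet.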

© 2018 Cyanny Liang