AI and Big Data

An Introduction to Bayesian Networks

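On a clear winter afternoon I finished reading this paper and came away with a systematic understanding of what a Bayesian network is. The paper is the one referenced by the causalnex tool: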
Stephenson, Todd Andrew. An introduction to Bayesian network theory and usage. No. REP_WORK. IDIAP, 2000.

The paper covers the following main points:

  • What is a Bayesian network
  • Inference in Bayesian networks: the junction tree algorithm
  • Learning Bayesian networks
  • Applications
    • Automatic Speech Recognition: Dynamic Bayesian Network
    • Computer troubleshooting
    • Medical diagnosis
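
To make the inference topic in the list above concrete, here is a minimal sketch of a posterior query on a tiny Bayesian network. It uses brute-force enumeration over all assignments, not the junction tree algorithm the paper covers; enumeration is only practical for very small networks. The network (rain, sprinkler, wet), its probabilities, and the helper names `joint` and `query` are all made up for illustration.

```python
from itertools import product

# Hypothetical toy network. Each node maps to (parents, CPT), where the CPT
# maps a tuple of parent values to P(node = True). All numbers are invented.
network = {
    "rain": ((), {(): 0.2}),
    "sprinkler": ((), {(): 0.1}),
    "wet": (("rain", "sprinkler"), {
        (True, True): 0.99, (True, False): 0.9,
        (False, True): 0.8, (False, False): 0.0,
    }),
}

def joint(assignment):
    """Joint probability of a full assignment: product of P(node | parents)."""
    p = 1.0
    for node, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

def query(target, evidence):
    """P(target = True | evidence), summing the joint over all assignments."""
    names = list(network)
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(names)):
        a = dict(zip(names, values))
        if all(a[k] == v for k, v in evidence.items()):
            totals[a[target]] += joint(a)
    return totals[True] / (totals[True] + totals[False])

# Observing wet grass should raise the belief in rain above its prior of 0.2.
posterior = query("rain", {"wet": True})
print(posterior)
```

The cost of enumeration grows exponentially with the number of variables, which is the motivation for structured algorithms such as the junction tree.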

Structure Learning Algorithm NOTEARS

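Over the past three years I have dived into the AI field and studied many algorithms, and only recently have I started to look at AI from a higher vantage point. AI is more than machine learning: it also includes state-based models, variable-based models, and logic programming. In the last six months I read The Book of Why, which inspired me deeply and changed the way I look at the world. I have also come to believe that causal inference is a field well worth studying; even though it has few production use cases today, I expect it to have a bright future.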

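Today I sat down and carefully read about NOTEARS, the structure learning algorithm used in the CausalNex library. The paper was published at NIPS 2018; the approach is ingenious and the solution concise. Here are some of my notes: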

Paper: Zheng, Xun, et al. “DAGs with NO TEARS: Continuous optimization for structure learning.” Advances in Neural Information Processing Systems 31 (2018): 9472-9483.
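
As a rough illustration of the paper's central idea: NOTEARS replaces the combinatorial "graph must be acyclic" constraint with a smooth equality, h(W) = tr(exp(W ∘ W)) − d = 0, where W is the d×d weighted adjacency matrix and ∘ is the element-wise product. The sketch below evaluates h in pure Python with a truncated Taylor series for the matrix exponential. The truncation and the helper names `mat_mul` and `notears_h` are my own simplifications for illustration, not code from the paper or from CausalNex; a practical implementation would more likely use a library routine for the matrix exponential.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def notears_h(w, terms=30):
    """Approximate h(W) = tr(exp(W ∘ W)) - d using the first `terms` Taylor terms."""
    d = len(w)
    squared = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]  # W ∘ W
    term = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # M^0 / 0!
    trace_exp = 0.0
    for k in range(terms):
        trace_exp += sum(term[i][i] for i in range(d))
        scaled = mat_mul(term, squared)
        term = [[x / (k + 1) for x in row] for row in scaled]  # M^(k+1) / (k+1)!
    return trace_exp - d

# Chain 0 -> 1 -> 2 is acyclic, so h should be zero.
dag = [[0.0, 1.5, 0.0], [0.0, 0.0, -2.0], [0.0, 0.0, 0.0]]
# Edges 0 -> 1 and 1 -> 0 form a cycle, so h should be strictly positive.
cyclic = [[0.0, 1.0], [1.0, 0.0]]

h_dag = notears_h(dag)
h_cyclic = notears_h(cyclic)
print(h_dag, h_cyclic)
```

Because h is differentiable in W, structure learning can be posed as a continuous optimization problem with h(W) = 0 as an equality constraint, instead of a search over discrete graphs.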


Akka HTTP Notes

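In roughly three years of Scala project work, Akka is one of the higher-quality Scala libraries I have come across. Its core abstraction is an Actor-based programming model, and on top of that abstraction it provides a set of tool libraries. You only need to write business logic as Actors, and the framework handles the underlying message passing, high concurrency, and I/O for you. Akka is very practical in industrial settings. For example, when you have many microservices with differing performance characteristics and need to integrate them to deliver services such as ad serving, online recommendation, or incident detection, Akka's abstractions are very useful.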

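Recently I studied Akka HTTP systematically. I particularly like how the library applies meta-programming: it implements something as well-trodden as an HTTP library very elegantly, and its design and abstractions reward close study. Time was limited, so I only spent a week on it. Below is a summary of what has helped me most recently; I will keep improving it when I have time.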

1. Advantages of Akka HTTP

Positioning: a library for handling complex business logic, not an MVC framework (such as Play)

  • DSL with convenient pathMatchers
  • Streaming: streaming transfer, rate limiting
  • Easy interaction with actors

Java Performance Toolbox

I read chapter 3 of Java Performance: The Definitive Guide this weekend; here is a brief summary of the Java performance toolbox.

System Monitoring Tools

1. CPU Usage

vmstat: reports virtual memory statistics, covering processes, memory, paging, block I/O, traps, disks, and CPU activity.
vmstat [options] [delay [count]]


Big Data and ML Learning

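As my working days go by, I cannot help thinking about my plans for the future. Work is mostly about business needs and results, leaving little time to study, and improving yourself takes far more self-motivation than it did at school. I have always worked in the big data field; although my work is now more business-oriented, the direction has not changed, and I have gained a lot of hands-on machine learning practice as well. Below are the books and topics I most want to study: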


Awesome Books for 2018

One of my 2018 resolutions is to read more books. Here are some great books on my list.

Machine Learning

  • Machine Learning: A Probabilistic Perspective
  • Deep Learning (Ian Goodfellow)
  • Pattern Recognition and Machine Learning (Christopher M. Bishop)
  • The Elements of Statistical Learning
  • Hands-On Machine Learning with Scikit-Learn and TensorFlow (in progress now)
  • Python Machine Learning
  • 数学之美 (The Beauty of Mathematics)
  • 统计学 (Statistics, for review)
  • 统计学习方法 (Statistical Learning Methods)
  • 机器学习 (Machine Learning)

ppmml Published Today

On the last day before the New Year holiday, ppmml was published.
ppmml is a Python library for converting machine learning models to PMML files. It wraps the jpmml libraries and provides a clean interface.

What is a PMML file?

PMML, the Predictive Model Markup Language, is a standard for XML documents that express trained instances of analytic models.
Various platforms have adopted PMML as a machine learning model standard, including IBM, SAS, Microsoft, Spark, and KNIME (see pmml-platforms).

jpmml has developed PMML model libraries that support Spark, XGBoost, TensorFlow, sklearn, LightGBM, and R models. These libraries are separate projects, all written in Java.
ppmml wraps the jpmml libraries and provides a simple, easy-to-use API for converting models to PMML files.
Version 0.0.1 supports sklearn, TensorFlow, Spark, LightGBM, XGBoost, and R models; every model supported by jpmml is supported by ppmml. Common machine learning algorithms are covered, such as Decision Tree, Logistic Regression, GBDT, Random Forest, and KMeans. Deep learning support, however, is not ready yet.


How to Use Scala UDF and UDAF in PySpark

The Spark DataFrame API provides efficient, easy-to-use operations for analyzing distributed collections of data. Many users love the PySpark API, which they find more approachable than the Scala API. When we use UDFs in PySpark, however, performance can become a problem. How about implementing these UDFs in Scala and calling them from PySpark? Also, in Spark 2.0 a UDAF can only be defined in Scala, so how do we use one from PySpark? Let's give it a try.


Spark Streaming Exactly-Once Analysis

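I have been working with Spark Streaming a lot recently, focusing mainly on correctness requirements. After nearly half a year of this, I cannot help asking why we need to spend so much time on exactly-once. How does streaming processing logic differ from batch? I think streaming is better suited to simple filtering, logic that can finish within 100 ms. Yet batch can compute that logic too, so why use streaming? What users really want is speed. If batch could also meet low-latency requirements, a streaming system would not be needed. So the real question is: why do we need a separate streaming system?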
