Johan Kok Zhi Kang

Data Scientist & Incoming PhD graduate

About Me

Hello world, my name is Johan Kok. I am a Data Scientist at Grab and a full-time PhD student at the National University of Singapore.

I have a wealth of experience architecting big data stacks, and I have developed fast pipelines and state-of-the-art machine learning models. I have also published papers in top CS conferences and filed one patent.

My personal interest is in translating cutting-edge research into real-world impact. Research often moves orthogonally to what industry needs, and I hope to help bridge this gap.

I am proficient in: Python, Scala, Java, GoLang

I have worked with: Spark, Spark Streaming, Airflow, SQL & NoSQL, TensorFlow & Torch, Docker

I am curious about: Fintech, Blockchain, Real-time systems design, Deep learning, Parallel programming

I have published papers in: ACM SIGMOD, ACM SIGKDD, VLDB

Projects

Crypto Arbitrage Bot

Spot-perp trading bot for crypto assets. We did make a fair bit when FTX crashed.

I collaborated with an ex-JPMorgan senior trader to design a spot-perp arbitrage bot on multiple exchanges.

The bot watches for slippage between spot and perpetual prices and executes a trade when the gap grows too large. In theory this sounds easy, but the implementation was a steep challenge.
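A minimal sketch of the trigger idea described above, not the bot's actual logic. The `Quote` type, the function names, and the threshold and fee values are all hypothetical, chosen only to illustrate the idea of trading when the spot-perp gap exceeds a cost-adjusted threshold.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    spot_price: float  # price on the spot market
    perp_price: float  # price on the perpetual futures market


def gap_bps(quote: Quote) -> float:
    """Gap between perp and spot, in basis points of the spot price."""
    return (quote.perp_price - quote.spot_price) / quote.spot_price * 10_000


def decide(quote: Quote, threshold_bps: float = 30.0, fees_bps: float = 10.0) -> str:
    """Signal a trade only when the gap exceeds the threshold plus assumed fees."""
    gap = gap_bps(quote)
    limit = threshold_bps + fees_bps  # hypothetical values, not tuned
    if gap > limit:
        return "buy_spot_sell_perp"  # perp is expensive relative to spot
    if gap < -limit:
        return "sell_spot_buy_perp"  # perp is cheap relative to spot
    return "hold"
```

Folding the fees into the limit is one simple way to avoid trades that look profitable on price alone but lose money after costs.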

I designed and optimized the trading logic so that most trades are profitable. I also built supporting infrastructure to automate the bot’s deployment for multiple spot-perp pairs.

AdaptiveStream

In Progress

Python library for building capacity-scaled mixture-of-experts (MoE) models.

AdaptiveStream is a library that enables users to customize Rules defining when experts should be created and Policies specifying how these experts are created.
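The split between Rules and Policies can be pictured with a toy example. This is an illustrative sketch only and does not reflect AdaptiveStream's real API: the class names `ErrorThresholdRule`, `MeanExpertPolicy`, and `ExpertPool`, and their methods, are hypothetical.

```python
from typing import Callable, List

Expert = Callable[[float], float]


class ErrorThresholdRule:
    """Hypothetical rule: request a new expert when mean recent error is too high."""

    def __init__(self, max_mean_error: float) -> None:
        self.max_mean_error = max_mean_error

    def should_create(self, recent_errors: List[float]) -> bool:
        if not recent_errors:
            return False
        return sum(recent_errors) / len(recent_errors) > self.max_mean_error


class MeanExpertPolicy:
    """Hypothetical policy: build an expert that predicts the mean of recent targets."""

    def create(self, recent_targets: List[float]) -> Expert:
        mean = sum(recent_targets) / len(recent_targets)
        return lambda _x: mean


class ExpertPool:
    """Holds the experts and consults the rule and policy on each update."""

    def __init__(self, rule: ErrorThresholdRule, policy: MeanExpertPolicy) -> None:
        self.rule = rule
        self.policy = policy
        self.experts: List[Expert] = []

    def update(self, recent_errors: List[float], recent_targets: List[float]) -> bool:
        """Add one expert if the rule fires; report whether it did."""
        if self.rule.should_create(recent_errors):
            self.experts.append(self.policy.create(recent_targets))
            return True
        return False
```

Keeping the "when" (rule) separate from the "how" (policy) means either can be swapped without touching the other.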

AdaptiveStream enables fast scaling and inference over MoE models, and aims to outperform traditional continual learning frameworks that focus on incrementally training a single model.

The Indexed Router

Published in SIGKDD 2023

Indexing machine learning models into a hierarchy for super fast knowledge access

Continual learning models must balance adapting to concept drift against catastrophic forgetting.

We explore how to index trained models for fast knowledge retrieval and robustness to catastrophic forgetting. We developed a novel indexing scheme for the trained models and a routing protocol that selects an expert in logarithmic time. More details can be found in our paper:

Johan Kok Zhi Kang, Sien Yi Tan, Bingsheng He and Zhen Zhang. Real Time Index and Search Across Large Quantities of GNN Experts for Low Latency Online Learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023)
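To give a feel for logarithmic-time expert selection, here is a minimal sketch that is not the indexing scheme from the paper. It assumes each expert can be summarised by a single numeric key (a hypothetical simplification) and uses binary search over the sorted keys; `SortedExpertIndex` and `route` are illustrative names.

```python
import bisect
from typing import List, Tuple


class SortedExpertIndex:
    """Toy index: experts sorted by a scalar key, looked up by binary search."""

    def __init__(self, experts: List[Tuple[float, str]]) -> None:
        # Assumes a non-empty list of (key, expert_name) pairs.
        ordered = sorted(experts)
        self.keys = [key for key, _ in ordered]
        self.names = [name for _, name in ordered]

    def route(self, query_key: float) -> str:
        """Return the expert whose key is nearest to query_key in O(log n)."""
        i = bisect.bisect_left(self.keys, query_key)
        if i == 0:
            return self.names[0]
        if i == len(self.keys):
            return self.names[-1]
        before, after = self.keys[i - 1], self.keys[i]
        # Ties go to the expert with the smaller key.
        if after - query_key < query_key - before:
            return self.names[i]
        return self.names[i - 1]
```

A flat scan over all experts would cost linear time per query; keeping the keys sorted is what makes the logarithmic lookup possible in this toy setting.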

Deep Dynamic Graph Partitioning

Published in SIGKDD 2022

Graph partitioning algorithm refined using closed-loop feedback from model training performance

Most graph partitioning algorithms are deterministic. They partition a large graph based on topology.

We argue that it is more sensible to partition the graph with the objective of maximizing the training performance of machine learning models. We developed a novel algorithm that iteratively refines partition boundaries so that the final sub-graphs are optimized for model learning. More details can be found in our paper:

Johan Kok Zhi Kang, Suwei Yang, Suriya Venkatesan, Sien Yi Tan, Feng Cheng, and Bingsheng He. Dynamic Graph Segmentation for Deep Graph Neural Networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2022)

Prestroid

Published in SIGMOD 2021

Deep learning, SQL language parsing for resource estimation

Prestroid helps developers figure out how many resources their SQL queries need. No need to worry about under-provisioning resources, which causes queries to fail, or over-provisioning, which wastes money.

We analysed thousands of SQL queries and their resource consumption profiles at Grab. We then improved on tree-convolutional models to develop state-of-the-art resource consumption forecasts for new SQL queries. More details can be found in our paper:

Johan Kok Zhi Kang, Gaurav, Sien Yi Tan, Feng Cheng, Shixuan Sun, Bingsheng He. Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query Workload. Proceedings of the 2021 International Conference on Management of Data (SIGMOD 2021)

Experience

Grab

Data Scientist

Aug 2019 - Present

Grab website

Geo data science team / Grab-NUS AI Lab

I build novel machine learning algorithms for a wide range of applications at Grab. My work is grounded in solving problems with near-term impact, such as building better forecasting models for traffic or POI prediction.

I have worked with: Tree convolution, GraphSAGE, Graph attention, Spatio-temporal models

Grab

Data Engineer

Jan 2018 - July 2019

Grab website

I didn't know what big data was until I saw the datalake

I am proficient with the big data ecosystem and have built pipelines that ingest data into petabyte-scale data lakes. I am also familiar with architecting fast data pipelines, and I have worked through the full software development cycle: POC development, unit testing, releasing to production and troubleshooting bugs in the code.

Key achievements include spearheading the development of real-time data pipelines on Spark Streaming, working alongside engineers senior to me, to reduce costs and improve service quality for Grab’s internal users.

Tutor & Adjunct Lecturer

Freelance tutoring / Adjunct @ NUS ISS Stack-up

Jan 2018 - Present

NUS ISS website

Data structures & algorithms / Object oriented programming / Data science / Python

I have tutored nine bachelor’s and master’s level students and conducted programming lectures for classes of more than 40 students.

I also help out with tech events organized by geekshacking.

Education

National University of Singapore

PhD, Computer Science

2019 - 2023

My work is in building mixture-of-experts models that are robust to concept drift and catastrophic forgetting.

My thesis is titled: “Capacity Scaling Algorithms for Mixture of Expert Models Subjected to Learning Under Spatial and Temporal Concept Drift”

Doing a PhD is one of the hardest things I’ve done in my life. It is not for the faint-hearted.

Nanyang Technological University

Masters, Technology Management & Bachelors, Mechanical Engineering

2013 - 2017

Graduated with first-class honors for my bachelor’s degree.

Did a third-year exchange at the University of California, Berkeley, where I realized my interest was in computer science. After returning to Singapore, I juggled self-studying CS courses with my university coursework.

A Little More About Me

My other hobbies include:

  • Exploring countries
  • Trying out weird food
  • Using stats to out-game claw machine setups