Recently I have been looking at different alternatives for MLOps at my job. We ended up choosing mlflow, and with this article I want to summarise my findings.


By far the best way to learn about it is to watch these three videos from Databricks:

And here you have the documentation:


  1. What is mlflow?
  2. Server mode
    2.1. Experiments tracking
    2.2. Model tracking
  3. Client mode
    3.1. Tracking experiments
    3.2. Running models
  4. What else can you do with mlflow?

1. What is mlflow?

Basically it’s a library and through the library you can do a ton of things. It can be used as a client…

Here we are with another paper implementation. Veeeery slowly, but I’m on my way to implementing the TabNet paper. I think this will be the last component I need for it.

Paper: From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Another GitHub repo implementing sparsemax:

What can we learn from implementing this paper?

  1. A new layer capable of producing sparse outputs while being differentiable everywhere.
  2. How to implement the backward step.
  3. How to check that the gradient is correct.
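A common way to check a hand-written backward step is to compare it against finite differences. The sketch below is only an illustration of that idea, not the code from the post: the function `f`, its gradient `f_grad`, and the helper names are hypothetical stand-ins for whatever layer is being tested.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Approximate df/dx_i with central differences, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad


def f(x):
    # Hypothetical scalar function standing in for a layer followed by a loss.
    return sum(v * v for v in x)


def f_grad(x):
    # Hand-written gradient of f, the thing we want to verify.
    return [2 * v for v in x]


def gradients_match(f, grad_f, x, tol=1e-5):
    """Return True when the analytic and numerical gradients agree within tol."""
    numeric = numerical_gradient(f, x)
    analytic = grad_f(x)
    return all(abs(n - a) <= tol for n, a in zip(numeric, analytic))


print(gradients_match(f, f_grad, [1.0, -2.0, 0.5]))
```

If the backward step has a bug, the two gradients disagree and the check returns `False`.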

1) Sparsemax, a new layer

From the abstract:

We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. […] We obtain promising empirical…
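To make the idea of sparse probabilities concrete, here is a minimal pure-Python sketch of the sparsemax forward pass for a single score vector, following the sort-and-threshold procedure described in the paper. It is an illustration under that assumption, not the implementation from the post.

```python
def sparsemax(z):
    """Project the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumulative += z_k
        # The k-th largest score is in the support while
        # 1 + k * z_(k) > sum of the k largest scores.
        if 1 + k * z_k > cumulative:
            tau = (cumulative - 1) / k
    # Scores below the threshold tau become exactly zero.
    return [max(z_i - tau, 0.0) for z_i in z]


print(sparsemax([1.0, 2.0, 3.0]))
print(sparsemax([0.5, 0.5]))
```

Unlike softmax, entries that fall below the threshold come out as exact zeros, while the remaining entries still sum to 1.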

From paper to code

Related papers:


This paper aims to solve the issue of the “generalization gap”. It seems neural networks tend to do worse on unseen data when trained with large batch sizes.

One of the proposed fixes is to change how BatchNorm layers calculate their statistics (remember that BatchNorm layers change the input data to make it “normally distributed”, 0 mean and 1 standard deviation): instead of using the whole batch, they use small parts of the batch, what we can…
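As a rough illustration of computing the statistics on small parts of the batch, the sketch below normalizes a single feature chunk by chunk. It is a simplification under several assumptions: training-time behaviour only, no learnable scale and shift, no running statistics, and the names `ghost_batch_norm` and `virtual_batch_size` are chosen here for illustration.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale values to roughly 0 mean and 1 standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def ghost_batch_norm(batch, virtual_batch_size):
    """Normalize each chunk of the batch with its own mean and variance."""
    out = []
    for start in range(0, len(batch), virtual_batch_size):
        chunk = batch[start:start + virtual_batch_size]
        out.extend(normalize(chunk))
    return out


batch = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
print(ghost_batch_norm(batch, virtual_batch_size=4))
```

Each chunk of four values ends up centered on zero, regardless of what the rest of the batch looks like.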

From paper to code

Photo by Dima Pechurin on Unsplash

I have started looking into an architecture called TabNet that aims at interpretability for tabular problems. As part of it I’ll do a couple of posts about some of its components, in this case about the GLU activation (gated linear units). The next one will be about Ghost BatchNorm.

Gated Linear Unit

Related papers:
* TabNet:
* Language modeling with Gated Convolutional Networks:

The idea is simple. I want to allow the network to decide how much information should flow through a given path, like a logical gate, hence the name. How?

  • If we multiply X by 0, nothing passes.
  • If…
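The gating idea can be sketched in a few lines of plain Python. This is a simplified version that assumes the input vector is split in two halves, one carrying the signal and one producing the gate through a sigmoid; in a real network both halves would come from learned linear layers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glu(values):
    """Gated linear unit: first half is the signal, second half is the gate."""
    if len(values) % 2 != 0:
        raise ValueError("GLU needs an even number of values")
    half = len(values) // 2
    signal, gate = values[:half], values[half:]
    # sigmoid(gate) lies in (0, 1): close to 0 blocks the signal, close to 1 lets it pass.
    return [s * sigmoid(g) for s, g in zip(signal, gate)]


print(glu([3.0, -100.0]))  # gate almost closed
print(glu([3.0, 100.0]))   # gate almost open
print(glu([2.0, 0.0]))     # gate half open
```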

Photo by Markus Spiske on Unsplash

Mean absolute value

Useful when you have outliers but don’t want them to have a big influence.
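A small numeric comparison shows the effect, assuming the heading refers to the mean absolute error used as a loss. The data below is made up for illustration, and the mean squared error is included only as a point of comparison.

```python
def mean_absolute_error(targets, predictions):
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def mean_squared_error(targets, predictions):
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


targets = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 2.5, 3.5]          # every prediction is off by 0.5
with_outlier = [1.5, 2.5, 2.5, 14.0]  # the last prediction is off by 10

print(mean_absolute_error(targets, clean), mean_absolute_error(targets, with_outlier))
print(mean_squared_error(targets, clean), mean_squared_error(targets, with_outlier))
```

With a single outlier the absolute error grows from 0.5 to 2.875, while the squared error grows from 0.25 to 25.1875, so the outlier dominates the squared loss far more.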

A unified view

Photo by Jan Antonin Kolar on Unsplash

Classification & Probability

Classification & Probability? Why mix both? Well, because classification can indeed be thought of as fitting a probability distribution. Bear with me: I’ll comment a bit on how to fit distributions and then relate it to classification.


The loss used for fitting a probability distribution is the negative log likelihood; I explain it in more detail here. Given a distribution defined by some parameters (obtained from the output of a NN in this case), it asks: what is the probability of observing the target / label Y? …
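As a minimal sketch of that idea, here is the negative log likelihood for two common cases: a Gaussian, where the parameters are a mean and a standard deviation, and a categorical distribution, where the parameters are class probabilities. The function names are chosen here for illustration.

```python
import math


def gaussian_nll(y, mean, std):
    """Negative log density of y under a normal distribution N(mean, std^2)."""
    var = std ** 2
    return 0.5 * math.log(2 * math.pi * var) + (y - mean) ** 2 / (2 * var)


def categorical_nll(probabilities, label):
    """Negative log probability assigned to the observed class label."""
    return -math.log(probabilities[label])


# The loss is lower when the observed target is likely under the distribution.
print(gaussian_nll(0.0, mean=0.0, std=1.0), gaussian_nll(3.0, mean=0.0, std=1.0))
print(categorical_nll([0.25, 0.75], label=1), categorical_nll([0.25, 0.75], label=0))
```

The categorical case is the usual cross-entropy loss for classification, which is where the link between the two topics comes from.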

Today I’m going to be reviewing a deep learning model architecture for forecasting.

Photo by Johannes Plenio on Unsplash

Until the M4 competition everyone thought that traditional statistical models were superior for forecasting, as happened in many other fields, I would say. Today it seems deep learning is almost always the best solution… at least when there is enough data. Traditional forecasting techniques like ARIMA still seem to do better on small datasets.

What about huge datasets where we have many different time series? Like Amazon, for example, or any retail business. …

Photo by Chris Liverani on Unsplash

Both aim to represent the “center” of some data. A proper way to define them might be that they define the location of a distribution, where location means its middle point, or the position at which most of the data is located.

What’s the mean?

The first thing to say is that there are multiple kinds of means. The most common is the arithmetic mean, but there are others: the geometric mean, weighted mean, harmonic mean or even the root mean square… all trying to summarize some sense of the “middle” of the data, considering the whole data, in different ways.
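The different means are easy to compare on a tiny example. The sketch below uses Python’s `statistics` module where a function exists and computes the weighted mean and root mean square by hand; the data and weights are made up for illustration.

```python
import math
import statistics

data = [1.0, 2.0, 4.0]
weights = [1.0, 1.0, 2.0]  # hypothetical weights, only for the weighted mean

arithmetic = statistics.mean(data)
geometric = statistics.geometric_mean(data)
harmonic = statistics.harmonic_mean(data)
weighted = sum(w * x for w, x in zip(weights, data)) / sum(weights)
root_mean_square = math.sqrt(sum(x * x for x in data) / len(data))

print(harmonic, geometric, arithmetic, root_mean_square, weighted)
```

For positive data the harmonic, geometric, arithmetic and root mean square means always come out in that increasing order, which is one way to see that each one summarizes the “middle” differently.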


The first time I heard about this (well, actually the first time I read about it…) I didn’t have any idea what it was, but hey! It sounds pretty cool!

Very likely if you are reading this you already have some idea about this topic: it’s a way to use backpropagation when using random sampling.

That’s the thing: you can’t backpropagate when you pick random numbers. Randomness on a computer usually means reading the voltage of some component, using the internal clock… some interaction with the world. There is no way to calculate the derivative of that.

This happens if we…
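A minimal sketch of the trick, assuming a Gaussian sample: instead of sampling z directly, sample the noise `eps` from a fixed N(0, 1) and build z as a deterministic function of `mu` and `sigma`. The randomness is then an input, and the derivatives with respect to the parameters are well defined. The names and values here are illustrative only.

```python
import random


def sample_reparameterized(mu, sigma, eps):
    """z = mu + sigma * eps, with eps drawn from N(0, 1) outside this function."""
    return mu + sigma * eps


rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # the only random step, independent of mu and sigma
mu, sigma = 1.5, 0.5
z = sample_reparameterized(mu, sigma, eps)

# With eps held fixed, the derivatives are simple: dz/dmu = 1 and dz/dsigma = eps.
h = 1e-6
fd_mu = (sample_reparameterized(mu + h, sigma, eps)
         - sample_reparameterized(mu - h, sigma, eps)) / (2 * h)
fd_sigma = (sample_reparameterized(mu, sigma + h, eps)
            - sample_reparameterized(mu, sigma - h, eps)) / (2 * h)

print(z, fd_mu, fd_sigma, eps)
```

The finite differences agree with the analytic derivatives, which is what lets backpropagation flow through the sample into `mu` and `sigma`.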

Back with another simple implementation for reinforcement learning. I’m far from being an expert, but I hope you can learn something new. This time I wanted to keep it as simple as possible so I could concentrate on learning PPO basics. Here I’m solving the CartPole-v1 environment with TD(0), also called one-step TD.

You can read Reinforcement Learning: An Introduction for a better explanation of this topic, but basically: take one step in the environment, execute the algorithm, take one step in the environment, execute the algorithm… so there is no need to collect batches and then calculate the discounted reward. Just play with…
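The one-step update itself is small enough to sketch. The version below is a tabular TD(0) value update on a dictionary of state values, which is a simplification: the post works with CartPole-v1 and PPO, where the value function would be a neural network. The function name, state labels and hyperparameters are assumptions made for illustration.

```python
def td0_update(values, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Update values[state] after a single environment step and return the TD error."""
    # Terminal states have no future value to bootstrap from.
    bootstrap = 0.0 if done else values[next_state]
    td_target = reward + gamma * bootstrap
    td_error = td_target - values[state]
    values[state] += alpha * td_error
    return td_error


values = {"a": 0.0, "b": 1.0}
error = td0_update(values, "a", reward=1.0, next_state="b", done=False)
print(error, values["a"])
```

Because the target only needs the reward and the value of the next state, the update can run after every step without storing a whole episode.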

Alvaro Durán Tovar
