Recently I have been looking at different MLOps alternatives at my job. We ended up choosing MLflow, and with this article I want to summarise my findings.
By far the best way to learn about it is watching these three videos from Databricks:
And here you have the documentation: https://www.mlflow.org/docs/latest/quickstart.html
Basically it’s a library, and through it you can do a ton of things. It can be used as a client…
Here we are with another paper implementation. Veeeery slowly, but I’m on my way to implementing the TabNet paper. I think this will be the last component I need for it.
Another GitHub repo implementing sparsemax:
From the abstract:
We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. […] We obtain promising empirical…
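To make the abstract concrete, here is a minimal NumPy sketch of sparsemax for a single vector, following the closed-form projection from the paper (the function name and this particular vectorized formulation are my own, not from the linked repo):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: project the scores z onto the probability simplex.
    Unlike softmax, it can assign exactly zero probability to some entries."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    # support size: the largest k such that 1 + k * z_(k) > sum of top-k scores
    k_max = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max    # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)          # everything below tau becomes exactly 0
```

For example, `sparsemax([1.0, 2.0, 3.0])` puts all the mass on the last entry, while softmax would still give the others small nonzero probabilities.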
This paper aims to solve the issue of the “generalization gap”: neural networks tend to perform worse on unseen data when trained with large batch sizes.
One of the proposed fixes is to change BatchNorm layers (remember, BatchNorm layers transform the input data to make it “normally distributed”, with mean 0 and standard deviation 1) so that instead of computing statistics over the whole batch, they use small parts of the batch, what we can…
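A minimal NumPy sketch of that idea, splitting the batch into small “virtual” sub-batches and normalizing each one with its own statistics (training-mode only, no learned scale/shift; the function name and `virtual_batch_size` parameter are my own naming):

```python
import numpy as np

def ghost_batch_norm(x, virtual_batch_size, eps=1e-5):
    """Normalize each virtual sub-batch of x with its own mean and
    variance, instead of using statistics from the whole batch."""
    out = np.empty_like(x, dtype=float)
    for start in range(0, len(x), virtual_batch_size):
        chunk = x[start:start + virtual_batch_size]
        mu = chunk.mean(axis=0)
        var = chunk.var(axis=0)
        out[start:start + virtual_batch_size] = (chunk - mu) / np.sqrt(var + eps)
    return out
```

Each sub-batch ends up with roughly zero mean and unit variance, which injects a bit more noise into training than full-batch statistics would.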
From paper to code
I have started looking into an architecture called TabNet that aims at interpretability for tabular problems. As part of it I’ll do a couple of posts about some of its components, in this case about the GLU activation (gated linear units). The next one will be about Ghost BatchNorm.
The idea is simple: allow the network to decide how much information should flow through a given path, like a logical gate, hence the name. How?
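One way to see the gating mechanism is a tiny NumPy sketch: split the features in half, use one half as values and the sigmoid of the other half as gates between 0 and 1 (this omits the linear projections a full GLU layer would learn):

```python
import numpy as np

def glu(x):
    """Gated Linear Unit on the last axis: a * sigmoid(b), where a and b
    are the two halves of x. sigmoid(b) acts as a soft gate in [0, 1]."""
    a, b = np.split(np.asarray(x, dtype=float), 2, axis=-1)
    return a / (1.0 + np.exp(-b))  # equivalent to a * sigmoid(b)
```

When the gate half is very negative the sigmoid is near 0 and almost nothing flows through; when it is very positive, the value half passes almost unchanged.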
A unified view
Classification & probability? Why mix both? Well, because classification can indeed be thought of as fitting a probability distribution; bear with me. I’ll comment a bit on how to fit distributions and then relate it to classification.
The loss used for fitting a probability distribution is the negative log likelihood, which I explain in more detail here. Given a distribution defined by some parameters (obtained from the output of a NN in this case), it asks: what is the probability of observing the target / label Y? …
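A small NumPy sketch of the connection: the same negative log likelihood recipe gives you a regression-style loss for a Gaussian, and exactly the familiar cross-entropy loss when the distribution is categorical (function names are my own):

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Negative log likelihood of targets y under N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (y - mu) ** 2 / (2 * sigma**2))

def categorical_nll(y_idx, probs):
    """NLL of integer class labels under predicted class probabilities.
    This is precisely the cross-entropy loss used in classification."""
    return -np.mean(np.log(probs[np.arange(len(y_idx)), y_idx]))
```

So a classifier’s softmax output defines a categorical distribution, and minimizing cross-entropy is just maximizing the likelihood of the observed labels.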
Today I’m going to be reviewing a deep learning model architecture for forecasting.
Until the M4 competition everyone thought that traditional statistical models were superior for forecasting, as has happened in many other fields, I would say. Today it seems deep learning is almost always the best solution… at least given enough data. Traditional forecasting techniques like ARIMA still seem to do better on small datasets.
What about huge datasets where we have many different time series? Like Amazon, for example, or any retail business. …
Both aim to represent the “center” of some data. A proper way to define them might be that they define the location of a distribution: its middle point, or the position where most of the data is located.
The first thing to say is that there are multiple kinds of means. The most common is the arithmetic mean, but there are others: the geometric mean, the weighted mean, the harmonic mean, or even the root mean square… all trying to summarize some sense of the “middle” of the data, considering the whole data, in different ways.
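Here is a quick NumPy comparison of those means on a small made-up sample (the weights in the weighted mean are arbitrary, just for illustration):

```python
import numpy as np

data = np.array([1.0, 2.0, 4.0, 8.0])

arithmetic = data.mean()                       # sum / n
geometric  = np.exp(np.log(data).mean())       # n-th root of the product
harmonic   = len(data) / np.sum(1.0 / data)    # reciprocal of the mean reciprocal
rms        = np.sqrt(np.mean(data ** 2))       # root mean square
weighted   = np.average(data, weights=[4, 3, 2, 1])  # arbitrary example weights
```

For positive data these always order as harmonic ≤ geometric ≤ arithmetic ≤ RMS, which is a handy sanity check: each one emphasizes small or large values differently.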
The first time I heard about this (well, actually the first time I read it…) I didn’t have any idea what it was, but hey! It sounds pretty cool!
Very likely, if you are reading this you already have some idea about the topic: it’s a way to use backpropagation when random sampling is involved.
That’s the thing: you can’t backpropagate when you pick random numbers. Randomness on a computer usually means reading the voltage of some component, using the internal clock… some interaction with the world. There is no way to calculate the derivative of that.
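The reparameterization trick works around this by moving the randomness out of the path you differentiate through: instead of sampling z directly from N(mu, sigma²), you sample fixed noise from N(0, 1) and shift and scale it. A minimal NumPy sketch (the function name is my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_reparameterized(mu, sigma):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, 1).
    The randomness lives only in eps, so z is a deterministic,
    differentiable function of mu and sigma."""
    eps = rng.standard_normal()   # the non-differentiable part, isolated
    return mu + sigma * eps
```

Gradients with respect to mu and sigma are now well defined (dz/dmu = 1, dz/dsigma = eps), which is exactly what a VAE needs to train its encoder.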
This happens if we…
Back with another simple reinforcement learning implementation. I’m far from being an expert, but I hope you can learn something new. This time I wanted to keep it as simple as possible so I could concentrate on learning the PPO basics. Here I’m solving the CartPole-v1 environment with TD(0), also called one-step TD.
You can read Reinforcement Learning: An Introduction for a better explanation of this topic, but basically: take one step in the environment, execute the algorithm, take one step in the environment, execute the algorithm… so there’s no need to collect batches and then calculate the discounted reward. Just play with…
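The one-step loop described above boils down to a single update per environment step. A minimal sketch for a tabular value function (function name and parameters are my own; a CartPole solution would use a neural network instead of a table):

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One-step TD update applied after a single environment step:
    V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    No batches and no full-episode discounted returns are needed."""
    target = r if done else r + gamma * V[s_next]  # bootstrap from V(s')
    V[s] += alpha * (target - V[s])                # move V(s) toward the target
    return V
```

The key point is the bootstrap: instead of waiting for the episode to finish, the current estimate of the next state’s value stands in for the rest of the return.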