
The use of deep learning has grown rapidly over the past decade, thanks to the adoption of cloud-based technology and the use of deep learning systems in big data, according to Emergen Research, which expects deep learning to grow into a $93 billion market by 2028.

But what exactly is deep learning and how does it work?

Deep learning is a subset of machine learning that uses neural networks to perform learning and prediction. Deep learning has shown remarkable performance on a variety of tasks, whether text, time series or computer vision. Its success comes mainly from the availability of big data and computing power. However, there is more to it than that, and it is this "more" that makes deep learning much better than classic machine learning algorithms.

## Deep learning: neural networks and functions

A neural network is an interconnected network of neurons, each of which is a limited function approximator. Taken together, neural networks are considered universal function approximators. If you remember high school math, a function is a mapping from an input space to an output space. A simple sin(x) function maps from angular space (-180° to 180°, or 0° to 360°) to the space of real numbers (-1 to 1).
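This mapping view takes only a few lines of Python to make concrete (the function name `sin_deg` is ours, purely for illustration):

```python
import math

# A function is a mapping from an input space to an output space.
# sin maps angles (here in degrees) to real numbers in [-1, 1].
def sin_deg(angle_degrees):
    return math.sin(math.radians(angle_degrees))

print(sin_deg(0))    # ≈ 0.0
print(sin_deg(90))   # ≈ 1.0
print(sin_deg(-90))  # ≈ -1.0
```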

Let’s see why neural networks are considered universal function approximators. Each neuron learns a limited function: f(.) = g(W*X), where W is the weight vector to learn, X is the input vector and g(.) is a nonlinear transformation. W*X can be visualized as a line (being trained) in high-dimensional space (a hyperplane), and g(.) can be any nonlinear differentiable function such as sigmoid, tanh or ReLU (commonly used in the deep learning community). Learning in neural networks is nothing more than finding the optimal weight vector W. For example, in y = mx+c, we have two weights: m and c. Given the distribution of points in 2D space, we find the optimal values of m and c that satisfy a certain criterion: the difference between the predicted y and the actual points is minimal across all data points.
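As a sketch of this idea, the two weights of y = mx + c can be learned by gradient descent on toy data (the data, learning rate and iteration count below are made up for illustration):

```python
# Toy data generated from y = 2x + 1 (the "true" m and c we hope to recover).
data = [(x, 2 * x + 1) for x in [-2, -1, 0, 1, 2, 3]]

m, c = 0.0, 0.0          # the two weights to learn
lr = 0.01                # learning rate

for _ in range(5000):
    # Gradients of mean squared error with respect to m and c.
    grad_m = sum(2 * (m * x + c - y) * x for x, y in data) / len(data)
    grad_c = sum(2 * (m * x + c - y) for x, y in data) / len(data)
    m -= lr * grad_m
    c -= lr * grad_c

print(round(m, 2), round(c, 2))  # converges close to 2.0 and 1.0
```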

## The layer effect

Since each neuron is a nonlinear function, we stack several such neurons into a “layer,” where each neuron receives the same set of inputs but learns different weights W. Therefore, each layer has a set of learned functions: [f1, f2, …, fn], called hidden layer values. These values are combined again in the next layer: h(f1, f2, …, fn), and so on. This way, each layer is composed of functions from the previous layer (something like h(f(g(x)))). It has been shown that, thanks to this composition, we can learn any complex nonlinear function.
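The layer stacking described above can be sketched in a few lines of NumPy (the layer sizes and random, untrained weights are our choices for illustration; a real network would learn the weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# A "layer" is many neurons sharing the same input but holding different
# weights: layer(X) = g(W @ X + b), a vector of functions [f1, ..., fn].
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer: 3 inputs -> 4 neurons
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # next layer: 4 hidden values -> 2 outputs

x = np.array([0.5, -1.0, 2.0])                 # one input vector

hidden = relu(W1 @ x + b1)   # [f1(x), ..., f4(x)]
output = W2 @ hidden + b2    # h(f1, ..., f4): a composition of layers

print(hidden.shape, output.shape)  # (4,) (2,)
```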

Deep learning is a neural network with many hidden layers (usually more than two). Effectively, deep learning is a complex composition of functions from layer to layer, which together define a mapping from input to output. For example, if the input is an image of a lion and the output is the classification of that image into the class of lions, then deep learning learns a function that maps image vectors to classes. Similarly, if the input is a sequence of words and the output is whether the sentence expresses positive, neutral or negative sentiment, then deep learning learns a mapping from the input text to the output classes: neutral, positive or negative.

## Deep learning as interpolation

From a biological standpoint, humans process images of the world hierarchically, step by step, from low-level features like edges and contours to high-level features like objects and scenes. Feature composition in neural networks is consistent with this: each composed layer learns increasingly complex features of an image. The most common neural network architecture used for images is the convolutional neural network (CNN), which learns these features hierarchically; a fully connected neural network then classifies the image features into their classes.
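A minimal convolution sketch shows the kind of low-level feature a CNN’s early layers learn (the hand-picked edge kernel and the toy image are our assumptions; a real CNN learns its kernels from data):

```python
import numpy as np

# A minimal 2D convolution (cross-correlation), the core operation a CNN layer
# repeats with learned kernels.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny image: dark left half, bright right half.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

edge_kernel = np.array([[-1.0, 1.0]])  # responds where brightness jumps left-to-right

response = conv2d(image, edge_kernel)
print(response)  # strongest where dark meets bright: a vertical edge detector
```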

Using high school math again, given a set of 2D data points, we try to fit a curve by interpolation that somehow represents the function underlying those data points. The more complex the function we fit (in polynomial interpolation, for example, the higher the polynomial degree), the better it fits the data; however, the worse it generalizes to a new data point. This is where deep learning faces challenges, in what is commonly referred to as the overfitting problem: fitting the data as closely as possible while compromising generalization. Almost all deep learning architectures have to deal with this important factor in order to learn a general function that performs equally well on unseen data.
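The overfitting effect is easy to reproduce with polynomial fitting (the data, noise level and polynomial degrees below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples from a simple underlying function.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test)

train_err, test_err = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit polynomial of this degree
    train_err[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err[degree], test_err[degree])

# Higher degree always lowers the training error, but on unseen points the
# high-degree fit chases the noise: the overfitting problem.
```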

Deep learning pioneer Yann LeCun (creator of the convolutional neural network and winner of the ACM Turing award) posted on Twitter, in response to an article: “Deep learning is not as impressive as you think because it is mere interpolation resulting from glorified curve fitting. But in high dimensions, there is no interpolation. In high dimensions, everything is extrapolation.” So in the context of function learning, deep learning does nothing but interpolation or, in some cases, extrapolation. That’s all!

## The learning component

So how do we learn this complex function? It depends entirely on the problem at hand, and that is what determines the neural network architecture. If we are interested in image classification, we use a CNN. If we are interested in time-dependent predictions or text, we use RNNs or transformers, and if we have a dynamic environment (like driving a car), we use reinforcement learning. Apart from that, learning involves dealing with several challenges:

• Ensuring that the model learns a general function rather than merely fitting the training data; this is handled using regularization
• Choosing the loss function appropriate to the problem at hand; basically, the loss function is an error function between what we want (the true value) and what we currently have (the current prediction)
• Gradient descent is the algorithm used to converge to an optimal function; choosing the learning rate is difficult because far from the optimum we want to move toward it quickly, while near the optimum we want to move slowly to make sure we converge to the global minimum
• A large number of hidden layers brings the vanishing gradient problem; architectural changes such as skip connections and appropriate nonlinear activation functions help solve it
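The pieces above (a loss function, gradient descent with a learning rate, and L2 regularization) can be combined in a toy training loop; this is a sketch on a single linear layer, not a full deep network, with made-up data and hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression data: y = 3*x1 - 2*x2 + noise.
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=100)

w = np.zeros(2)   # weights to learn
lr = 0.1          # learning rate: too large diverges, too small is slow
lam = 0.01        # L2 regularization strength, discourages fitting the noise

for _ in range(500):
    pred = X @ w
    # Loss = mean squared error + L2 penalty; this is its gradient w.r.t. w.
    grad = 2 * X.T @ (pred - y) / len(y) + 2 * lam * w
    w -= lr * grad

print(np.round(w, 2))  # close to the true weights [3, -2]
```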

## Computational challenges

Now that we know that deep learning is simply a complex learning function, it poses other computational challenges:

• To learn a complex function, we need a large amount of data
• To process big data, we need fast computing environments
• We need an infrastructure that supports such environments

Parallel processing with CPUs is not enough to compute millions or billions of weights (also called DL parameters). Neural networks require learning weights through vector (or tensor) multiplications, and this is where GPUs come in handy, as they can perform parallel vector multiplications very quickly. Depending on the deep learning architecture, the size of the data and the task at hand, we sometimes need one GPU and sometimes several; a data scientist has to make that decision based on the known literature or by measuring performance on a single GPU.
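To see why the number of weights grows so quickly, count the parameters of even a modest fully connected network (the layer sizes below are hypothetical, chosen only for illustration):

```python
# Each fully connected layer with n_in inputs and n_out neurons learns
# an n_out x n_in weight matrix plus n_out biases.
def layer_params(n_in, n_out):
    return n_out * n_in + n_out

# A modest network: 1000-dim input -> 512 -> 256 -> 10 classes.
sizes = [1000, 512, 256, 10]
total = sum(layer_params(a, b) for a, b in zip(sizes, sizes[1:]))
print(total)  # 646410 weights, for a network that is small by modern standards
```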

With the use of an appropriate neural network architecture (number of layers, number of neurons, nonlinear function, etc.) along with large enough data, a deep learning network can learn any mapping of a vector space to another vector space. This is what makes deep learning such a powerful tool for any machine learning task.

Abhishek Gupta is the Principal Data Scientist at Talentica software.