Understanding L2 regularization, Weight decay and AdamW
A post explaining L2 regularization, weight decay and the AdamW optimizer, as described in the paper Decoupled Weight Decay Regularization. We will also go over how to implement these using TensorFlow 2.x.
In simple words, regularization helps reduce over-fitting on the data. There are many regularization strategies.
The major regularization techniques used in practice are:
- L2 Regularization
- L1 Regularization
- Data Augmentation
- Dropout
- Early Stopping
In L2 regularization, an extra term, often referred to as the regularization term, is added to the loss function of the network.
Consider the following cross-entropy loss function (without regularization):
$$loss = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left(y^{(i)}\log\left(\hat{y}^{(i)}\right) + (1-y^{(i)})\log\left(1-\hat{y}^{(i)}\right)\right)$$
To apply L2 regularization, we add the following term to the loss function above :
$$\frac{\lambda}{2m}\sum\limits_{w}w^{2} $$
where $\lambda$ is the regularization parameter, a hyperparameter of the model. Being a hyperparameter, $\lambda$ is not learned during training but is tuned manually by the user.
After applying the regularization term to our original loss function :
$$finalLoss = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left(y^{(i)}\log\left(\hat{y}^{(i)}\right) + (1-y^{(i)})\log\left(1-\hat{y}^{(i)}\right)\right) + \frac{\lambda}{2m}\sum\limits_{w}w^{2}$$
or, $$finalLoss = loss + \frac{\lambda}{2m}\sum\limits_{w}w^{2}$$
or, in simple code :
# the 1/m factor is folded into lam here to keep the code simple
final_loss = loss_fn(y, y_hat) + lam * np.sum(np.power(weights, 2)) / 2
# or equivalently, with l2_reg_term = np.sum(np.power(weights, 2)) / 2
final_loss = loss_fn(y, y_hat) + lam * l2_reg_term
Consequently, the weight update step for vanilla SGD is going to look something like this:
w = w - learning_rate * grad_w - learning_rate * lam * grad(l2_reg_term, w)
# and since grad(l2_reg_term, w) = w :
w = w - learning_rate * grad_w - learning_rate * lam * w
In major deep-learning libraries, L2 regularization is implemented by adding lam * w to the gradients, rather than by actually changing the loss function.
# compute the gradients to update w
# grad_w is the gradient of the loss w.r.t. w
gradients = grad_w + lam * w
# update step
w = w - learning_rate * gradients
In weight decay we do not modify the loss function; the loss stays the same and instead we modify the update step.
The loss remains the same :
final_loss = loss_fn(y, y_hat)
During the parameter update :
w = w - learning_rate * grad_w - learning_rate * lam * w
Unlike L2 regularization, weight decay does not modify the gradients; instead it subtracts learning_rate * lam * w directly from the weights in the update step.
A weight decay update is going to look like this :
# compute the gradients to update w
# grad_w is the gradient of the loss w.r.t. w
gradients = grad_w
# update step
w = w - learning_rate * gradients - learning_rate * lam * w
In this update we can see that a small fraction of the weight is subtracted at each step, hence the name decay.
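To see this more explicitly, we can group the terms that multiply the weight (this is just a rearrangement of the update above):
$$w \leftarrow (1 - learning\_rate \cdot \lambda)\, w - learning\_rate \cdot grad\_w$$
so at every step the weight is first shrunk by the constant factor $(1 - learning\_rate \cdot \lambda)$ and only then moved along the gradient.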
For vanilla SGD the two updates above end up identical, so the difference only becomes apparent with more sophisticated optimizers. To see this, let's first take a look at SGD with momentum.
In SGD with momentum the gradients are not directly subtracted from the weights in the update step.
- First, we calculate a moving average of the gradients.
- Then, we subtract the moving average from the weights.
For L2 regularization the steps will be :
# compute gradients
gradients = grad_w + lam * w
# compute the moving average
Vdw = beta * Vdw + (1-beta) * (gradients)
# update the weights of the model
w = w - learning_rate * Vdw
Now, weight decay’s update will look like
# compute gradients
gradients = grad_w
# compute the moving average
Vdw = beta * Vdw + (1-beta) * (gradients)
# update the weights of the model
w = w - learning_rate * Vdw - learning_rate * lam * w
Here, beta is a hyperparameter (the momentum coefficient).
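To make the difference concrete, here is a tiny toy example (a scalar weight and a constant gradient, with made-up values, not from the post) that runs both variants for a few steps; the two weights already diverge after the first update:
lam, learning_rate, beta = 0.01, 0.1, 0.9
grad_w = 1.0                    # pretend the loss gradient is constant

w_l2, V_l2 = 1.0, 0.0           # SGD with momentum + L2 regularization
w_wd, V_wd = 1.0, 0.0           # SGD with momentum + weight decay

for step in range(5):
    # L2 regularization: lam * w goes through the moving average
    V_l2 = beta * V_l2 + (1 - beta) * (grad_w + lam * w_l2)
    w_l2 = w_l2 - learning_rate * V_l2
    # weight decay: lam * w is subtracted from the weights directly
    V_wd = beta * V_wd + (1 - beta) * grad_w
    w_wd = w_wd - learning_rate * V_wd - learning_rate * lam * w_wd

print(w_l2, w_wd)               # the two weights are no longer equal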
This difference is much more visible when using the Adam optimizer. Adam computes adaptive learning rates for each parameter: it keeps a moving average of past gradients, $Vdw$, and a moving average of past squared gradients, $Sdw$, computed as follows:
Vdw = beta1 * Vdw + (1-beta1) * (gradients)
Sdw = beta2 * Sdw + (1-beta2) * np.square(gradients)
beta1 and beta2 are hyperparameters.
and the update step is computed as :
w = w - learning_rate * ( Vdw/(np.sqrt(Sdw) + eps) )
eps is a hyperparameter added for numerical stability. Commonly, $eps = 1e-08$.
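For completeness: the full Adam algorithm additionally applies a bias correction to these moving averages before the update. It is omitted in the snippets above and below to keep the L2 vs. weight-decay comparison easy to read, but a rough sketch of the bias-corrected step (where t is the current step number, starting at 1) looks like:
# bias-corrected estimates used by the full Adam algorithm
Vdw_hat = Vdw / (1 - beta1**t)
Sdw_hat = Sdw / (1 - beta2**t)
w = w - learning_rate * ( Vdw_hat/(np.sqrt(Sdw_hat) + eps) )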
For L2 regularization the steps will be :
# compute gradients and moving_avg
gradients = grad_w + lam * w
Vdw = beta1 * Vdw + (1-beta1) * (gradients)
Sdw = beta2 * Sdw + (1-beta2) * np.square(gradients)
# update the parameters
w = w - learning_rate * ( Vdw/(np.sqrt(Sdw) + eps) )
For weight-decay the steps will be :
# compute gradients and moving_avg
gradients = grad_w
Vdw = beta1 * Vdw + (1-beta1) * (gradients)
Sdw = beta2 * Sdw + (1-beta2) * np.square(gradients)
# update the parameters
w = w - learning_rate * ( Vdw/(np.sqrt(Sdw) + eps) ) - learning_rate * lam * w
The difference between L2 regularization and weight decay is clearly visible now.
With L2 regularization, lam * w is added to the gradients first, so it flows into the moving averages of the gradients and their squares that are used for the update.
With weight decay, the moving averages are computed from the raw gradients only, and learning_rate * lam * w is subtracted from the weights separately in the update step.
After much experimentation, Ilya Loshchilov and Frank Hutter suggest in their paper Decoupled Weight Decay Regularization that we should use weight decay with Adam, and not the L2 regularization that classic deep learning libraries implement. This is what gave rise to AdamW.
In simple terms, AdamW is simply the Adam optimizer used with weight decay instead of classic L2 regularization.
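Putting the pieces together, here is a minimal NumPy sketch of a single update for one parameter array, with a flag to switch between classic L2 regularization and decoupled weight decay. The function name adam_update is our own, bias correction is again omitted; the real implementations used later in this post come from tensorflow_addons.
import numpy as np

def adam_update(w, grad_w, Vdw, Sdw, decoupled=True, learning_rate=3e-04,
                beta1=0.9, beta2=0.999, eps=1e-08, lam=0.001):
    if not decoupled:
        # classic L2 regularization: lam * w goes into the gradients,
        # and therefore into both moving averages
        grad_w = grad_w + lam * w
    Vdw = beta1 * Vdw + (1 - beta1) * grad_w
    Sdw = beta2 * Sdw + (1 - beta2) * np.square(grad_w)
    w = w - learning_rate * (Vdw / (np.sqrt(Sdw) + eps))
    if decoupled:
        # AdamW: the weight decay term is applied directly to the weights
        w = w - learning_rate * lam * w
    return w, Vdw, Sdw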
Now that we have got the boring theory part out of the way, let's look at how L2 regularization, weight decay and AdamW can be implemented in TensorFlow 2.x.
For this part we are going to use these libraries :
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
AUTOTUNE = tf.data.experimental.AUTOTUNE
In this example we are going to use the tf_flowers dataset, available in TensorFlow Datasets.
# train dataset
train_ds = tfds.load("tf_flowers",
split="train[:80%]",
as_supervised=True,
with_info=False
)
# validation dataset
valid_ds = tfds.load("tf_flowers",
split="train[80%:]",
as_supervised=True,
with_info=False)
print("NUM EXMAPLES IN TRAIN DATASET : ", len(train_ds))
print("NUM EXAMPLES IN VALIDATION DATASET: ", len(valid_ds))
IMAGE_SIZE = 224
def process_image(image, label, img_size=IMAGE_SIZE):
"""
Fn converts the images data types, scales to image to have pixel
values betwwen [0, 1]
This functions also resizes the image to given `img_size`.
Args:
image : An image
label : target label associated with the image
img_size: size of the image after resize
"""
# cast and normalize image
image = tf.image.convert_image_dtype(image, tf.float32)
image = tf.image.resize(image,[img_size, img_size])
return image, label
# prepare the datasets: preprocess, batch and prefetch
train_ds = train_ds.map(process_image, num_parallel_calls=AUTOTUNE).batch(30).prefetch(AUTOTUNE)
valid_ds = valid_ds.map(process_image, num_parallel_calls=AUTOTUNE).batch(32).prefetch(AUTOTUNE)
View images from the dataset :
def view_images(ds):
"""
Diplays images from the given dataset
Args:
ds: A TensorFlow Dataset
"""
image, label = next(iter(ds)) # extract 1 batch from the dataset
image = image.numpy()
label = label.numpy()
fig = plt.figure(figsize=(10,10))
for i in range(16):
ax = fig.add_subplot(4, 4, i+1, xticks=[], yticks=[])
ax.imshow(image[i])
ax.set_title(f"Label: {label[i]}")
Train dataset :
# view example images from the train dataset
view_images(train_ds)
Validation dataset :
# view example images from the valid dataset
view_images(valid_ds)
The model we are going to use is a simple CNN, defined below. For the loss we are going to use tf.keras.losses.SparseCategoricalCrossentropy, which is the cross-entropy loss. Since we are not using an activation in the output layer, we need to set tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True). To compute the accuracy of our model we will use tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy").
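As a quick sanity check (dummy_logits and dummy_label are made-up tensors, not part of the training pipeline), the loss configured with from_logits=True can be called directly on raw logits, no softmax needed:
# one example with five classes, raw (unnormalized) logits
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
dummy_logits = tf.constant([[2.0, 0.5, 0.1, -1.0, 0.3]])
dummy_label = tf.constant([0])
print(loss_fn(dummy_label, dummy_logits).numpy())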
Build a model with L2 regularization
# we also need to set the value of lambda; it is passed as the
# l2 argument to tf.keras.regularizers.l2
l2_reg = tf.keras.regularizers.l2(l2=0.001)
def get_l2_model():
"""
Returns a tf.keras.Model instance with L2 regularization .
"""
model = tf.keras.Sequential([
tf.keras.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3)),
tf.keras.layers.Conv2D(64, 3, kernel_regularizer=l2_reg, padding="same"),
tf.keras.layers.MaxPooling2D(2),
tf.keras.layers.ReLU(),
tf.keras.layers.Dropout(0.2,),
tf.keras.layers.Conv2D(64, 3, kernel_regularizer=l2_reg, padding="same"),
tf.keras.layers.MaxPooling2D(2),
tf.keras.layers.ReLU(),
tf.keras.layers.Dropout(0.2,),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(120, activation="relu", kernel_regularizer=l2_reg, ),
# since our data has five distinct classes
tf.keras.layers.Dense(5, kernel_regularizer=l2_reg),
])
return model
L2 regularization with SGD and momentum :
OPTIMIZER = tf.keras.optimizers.SGD(learning_rate=1e-03, momentum=0.9)
LOSS_FN = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
METRICS = [tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy")]
model = get_l2_model()
model.compile(optimizer=OPTIMIZER, loss=LOSS_FN, metrics=METRICS)
# Fit model on the train data
l2_sgd_hist = model.fit(train_ds, validation_data=valid_ds, epochs=10)
L2 regularization with Adam :
OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=3e-04,)
model = get_l2_model()
model.compile(optimizer=OPTIMIZER, loss=LOSS_FN, metrics=METRICS)
# Fit model on the train data
l2_adam_hist = model.fit(train_ds, validation_data=valid_ds, epochs=10)
Weight Decay :
To use SGD with momentum along with weight decay we need to use the class tfa.optimizers.SGDW.
This class implements the SGDW optimizer described in Decoupled Weight Decay Regularization by Loshchilov & Hutter.
def get_wd_model():
"""
Returns a tf.keras.Model instance without L2 regularization.
"""
model = tf.keras.Sequential([
tf.keras.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3)),
tf.keras.layers.Conv2D(64, 3, padding="same"),
tf.keras.layers.MaxPooling2D(2),
tf.keras.layers.ReLU(),
tf.keras.layers.Dropout(0.2,),
tf.keras.layers.Conv2D(64, 3, padding="same"),
tf.keras.layers.MaxPooling2D(2),
tf.keras.layers.ReLU(),
tf.keras.layers.Dropout(0.2,),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(120, activation="relu",),
# since our data has five distinct classes
tf.keras.layers.Dense(5,),
])
return model
model = get_wd_model()
# instantiate the SGDW optimizer (SGD with momentum + weight decay)
OPTIMIZER = tfa.optimizers.SGDW(weight_decay=0.001, learning_rate=1e-03, momentum=0.9)
model.compile(optimizer=OPTIMIZER, loss=LOSS_FN, metrics=METRICS)
# Fit model on the train data
wd_sgd_hist = model.fit(train_ds, validation_data=valid_ds, epochs=10)
To use the AdamW optimizer we need to use the class tfa.optimizers.AdamW.
This is an implementation of the AdamW optimizer described in Decoupled Weight Decay Regularization by Loshchilov & Hutter.
model = get_wd_model()
# instantiate the AdamW optimizer
OPTIMIZER = tfa.optimizers.AdamW(weight_decay=0.001, learning_rate=3e-04)
model.compile(optimizer=OPTIMIZER, loss=LOSS_FN, metrics=METRICS)
# Fit model on the train data
adamW_hist = model.fit(train_ds, validation_data=valid_ds, epochs=10)
Loss and Accuracy Curves :
plt.style.use("ggplot")
plt.figure(figsize=(10,6))
plt.title("Losses")
plt.plot(l2_sgd_hist.history["loss"], label="sgd with l2")
plt.plot(l2_adam_hist.history["loss"], label="adam with l2")
plt.plot(wd_sgd_hist.history["loss"], label="sgd with wd")
plt.plot(adamW_hist.history["loss"], label="adamW")
plt.plot(l2_sgd_hist.history["val_loss"], label="valid sgd with l2", linestyle='dashed')
plt.plot(l2_adam_hist.history["val_loss"], label="valid adam with l2", linestyle='dashed')
plt.plot(wd_sgd_hist.history["val_loss"], label="valid sgd with wd", linestyle='dashed')
plt.plot(adamW_hist.history["val_loss"], label="valid adamW", linestyle='dashed')
plt.xlabel("# epochs")
plt.ylabel("loss")
plt.legend();
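Since we tracked accuracy with name="accuracy", the same history objects also contain "accuracy" and "val_accuracy" keys, so an analogous plot can be made for the accuracy curves:
plt.figure(figsize=(10,6))
plt.title("Accuracy")
plt.plot(l2_sgd_hist.history["accuracy"], label="sgd with l2")
plt.plot(l2_adam_hist.history["accuracy"], label="adam with l2")
plt.plot(wd_sgd_hist.history["accuracy"], label="sgd with wd")
plt.plot(adamW_hist.history["accuracy"], label="adamW")
plt.plot(l2_sgd_hist.history["val_accuracy"], label="valid sgd with l2", linestyle='dashed')
plt.plot(l2_adam_hist.history["val_accuracy"], label="valid adam with l2", linestyle='dashed')
plt.plot(wd_sgd_hist.history["val_accuracy"], label="valid sgd with wd", linestyle='dashed')
plt.plot(adamW_hist.history["val_accuracy"], label="valid adamW", linestyle='dashed')
plt.xlabel("# epochs")
plt.ylabel("accuracy")
plt.legend();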
After all this, which should we use: L2 regularization or weight decay?
In the words of Jeremy Howard from Fast.ai:
So, weight decay is always better than L2 regularization with Adam then?
We haven’t found a situation where it’s significantly worse, but for either a transfer-learning problem (e.g. fine-tuning Resnet50 on Stanford cars) or RNNs, it didn’t give better results.
Also, in the models we trained above, using weight decay led to a lower loss than L2 regularization for both the sgd and adam optimizers.