Image Embeddings explained

Picsellia Team · 8 min read


Introduction

Today, data is ubiquitous and growing exponentially, yet it is still challenging for computers to understand and handle visual data the way humans do. In the past, computer vision techniques relied on brittle edge detection methods, color profiles, and a slew of manually coded processes that also required high-quality image annotations. Even so, they didn't provide the most effective means for computers to understand the semantics of the data. Advances in machine learning techniques presented an opportunity for computers to leverage big data and efficiently execute computer vision tasks.

Machine learning (ML) embedding has become an essential technique for handling various forms and types of data effectively and efficiently for ML tasks like computer vision, natural language processing (NLP), speech recognition, recommender systems, fraud detection, etc. This article discusses the fundamentals of applying the embedding technique to computer vision by breaking down the concept of image embedding using convolutional neural networks (CNNs).

What is Image Embedding?

In a nutshell, embedding is a dimensionality reduction technique: a lower-dimensional vector representation of high-dimensional feature vectors (i.e., raw input data) like words or images. Technically, the concept entails creating dense clusters of similarities or relationships within the data, which serve as semantic features encoded in a vector space. These encoded features are unique identifying vectors for a particular data class. The lower-dimensional vectors make it possible to efficiently manage the data features that stand out, enabling the machine learning model to understand a data class better.

Generally, image embedding algorithms extract distinct features from an image and represent them as dense vectors (i.e., unique numerical identifiers) in a lower-dimensional space. The generated dense vectors can then be compared against the vectors of other images to measure similarity. Think of it as representing only the most distinct features of a 3-D object in 2-D and comparing how well those features match in 2-D.
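As an illustration, embedding similarity is commonly measured with cosine similarity. The sketch below uses hypothetical 4-dimensional vectors for illustration only; real image embeddings typically have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings for three images.
cat_1 = np.array([0.9, 0.1, 0.3, 0.8])
cat_2 = np.array([0.8, 0.2, 0.4, 0.7])  # another cat: points in a similar direction
car   = np.array([0.1, 0.9, 0.7, 0.1])  # a car: points elsewhere in the space

print(cosine_similarity(cat_1, cat_2))  # close to 1.0
print(cosine_similarity(cat_1, car))    # noticeably lower
```

Images of the same class produce embeddings that point in similar directions in the vector space, so their cosine similarity is high.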

Methods for generating image embeddings have evolved and become more advanced with the rise of deep learning (DL). Techniques such as Bag of Visual Words (BoVW), Convolutional Neural Networks (CNNs), and Vision Transformers (ViT) have been developed. Deep learning techniques use models that learn to generate embeddings from images directly, learning the embedding weights during training rather than manually extracting embeddings (features) from images as a separate pre-processing step. They enabled the development of computer vision solutions for many datasets and use cases.

**CNN Image Embeddings**

At the time of this writing, CNNs are the de facto standard in the CV field, with many practical production use cases. However, they are computationally expensive and require a lot of data.

A CNN is a deep neural network architecture containing two sets of blocks: a convolutional block and a classification block. These blocks are the phases involved in generating image embeddings with a CNN. Each block plays a specific role in extracting embeddings so the computer can understand the images, as we will dive into below. Although there are different CNN architectures, like LeNet-5, AlexNet, VGGNet, ResNet, etc., the fundamental process for extracting embeddings is the same.
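To make the two blocks concrete, here is a minimal sketch assuming PyTorch, hypothetical 3-channel 32×32 inputs, and 10 made-up classes; it is not any of the named architectures:

```python
import torch
import torch.nn as nn

# A minimal two-block CNN: convolutional block, then classification block.
model = nn.Sequential(
    # --- convolutional block: extracts the embedding ---
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level features (edges, textures)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # more abstract features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    # --- classification block: maps the embedding to class scores ---
    nn.Flatten(),                                 # 32 * 8 * 8 = 2048-dim embedding
    nn.Linear(32 * 8 * 8, 10),                    # scores for 10 hypothetical classes
)

x = torch.randn(1, 3, 32, 32)  # one dummy RGB image
print(model(x).shape)          # torch.Size([1, 10])
```

The output of the `Flatten` layer is the image embedding; the final `Linear` layer is the classification block described later in this article.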

[Figure: Typical architecture of a CNN]

Convolutional Block

This block is responsible for extracting features from images and mapping the extracted features to image embeddings. As the name suggests, it consists of convolutional layers. Each layer contains a filter and an activation function; in between, other optional layers are commonly used, including pooling and normalization layers. These provide additional benefits, such as regularization and improved training dynamics.

The convolutional layers can extract abstract features and ideas within an image and encode them as embeddings. Several convolutional layers are stacked on top of one another: the early layers recognize simple features like edges, shapes, and textures. As the network gets deeper, it can capture more abstract and distinctive traits, which the model eventually uses to identify a particular object's concept in an image.

Pixels with one or more color channels make up an image. The computer sees the pixels and color channels as an array of values (a matrix) ranging from 0 (no color) to 255 (maximum intensity). These values encode the edges, shapes, and textures of the different features in an image.
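As a tiny illustration, a grayscale image is just such a matrix (the pixel values below are made up):

```python
import numpy as np

# A 4x4 grayscale "image": one channel, values from 0 (black) to 255 (white).
image = np.array([
    [  0,   0, 255, 255],
    [  0,   0, 255, 255],
    [255, 255,   0,   0],
    [255, 255,   0,   0],
], dtype=np.uint8)

print(image.shape)  # (4, 4) -- an RGB image would instead be (height, width, 3)
```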


The convolutional layer's filter reduces the image matrix to a lower-dimensional representation, effectively compressing the image. Filters are small matrices (the window size) initialized with random values. The filter matrix is multiplied element-wise with the pixel values in each window and summed, returning a single value (a scalar product) that represents that portion of the image. As the filter slides across each image window, it generates a complete feature map (i.e., a lower-dimensional matrix embedding) of the image.
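A minimal NumPy sketch of this sliding-window scalar product; the image and filter values are made up for illustration:

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide `kernel` over `image` (no padding); each window yields one scalar product."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride : i * stride + kh, j * stride : j * stride + kw]
            feature_map[i, j] = np.sum(window * kernel)  # one value per window
    return feature_map

image = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_filter = np.array([[1, -1],
                        [1, -1]], dtype=float)  # responds to vertical edges

print(convolve2d(image, edge_filter))
# 4x4 image, 2x2 filter, stride 1 -> 3x3 feature map; the nonzero
# entries mark where the filter found a vertical edge.
```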


This process suppresses noise in the image, producing a smaller, smoother copy that maps the most prominent image features detected in the convolutional layer. These mappings are the extracted embeddings, containing the abstract qualities of the image; in this sense, the CNN generates the embedding by compressing the image. Consider converting a video from 1080p to 360p: although the resolution is blurry, you can still identify objects within a frame because of their distinct shapes, colors, etc.

Since the extracted image embeddings in the lower dimension are smoother copies of the input image, it is essential to be mindful of excessive image compression to avoid losing vital feature information in the embedding. There are a couple of ways to control the amount of image compression. 

Modifying the filter size and stride (i.e., the number of pixels a filter moves per step) is one way of controlling compression. Increasing the stride causes the filter to traverse the entire image in fewer steps, yielding fewer values and a more compressed feature map, and vice versa.
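The resulting feature-map size follows a simple formula, sketched below assuming square inputs and filters:

```python
def output_size(n: int, f: int, stride: int, padding: int = 0) -> int:
    """Feature-map side length for an n x n input, f x f filter, given stride and padding."""
    return (n + 2 * padding - f) // stride + 1

print(output_size(6, 3, stride=1))  # 4: a 6x6 image with a 3x3 filter -> 4x4 map
print(output_size(6, 3, stride=2))  # 2: doubling the stride compresses more
```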


Using padding layers can also limit compression. A padding layer adds zero-valued pixels around the edge of the image matrix; as a result, the filter has more pixel values and image windows to aggregate. Padding is a particularly effective remedy for smaller images and is typically applied before the convolutional layer.
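With NumPy, zero padding can be sketched as follows (the 3×3 input is made up):

```python
import numpy as np

image = np.arange(9, dtype=float).reshape(3, 3)       # a tiny 3x3 "image"
padded = np.pad(image, pad_width=1, mode="constant")  # ring of zero-valued pixels

print(padded.shape)  # (5, 5): a 3x3 filter now produces a 3x3 map instead of 1x1
```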


The pooling layer ensures robust embedding extraction, providing stability when identifying the compressed information in an extracted embedding. It downsamples the embeddings, reducing the size of the feature maps by taking the maximum or average value of a group of neighboring pixels. This is helpful in cases where the pixels of the embedded features shift slightly out of place due to compression, and identifying the object becomes harder because of the slight deviation in shapes, edges, etc.
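A sketch of non-overlapping max pooling on a made-up feature map:

```python
import numpy as np

def max_pool(feature_map: np.ndarray, size: int = 2) -> np.ndarray:
    """Non-overlapping max pooling: keep the strongest response in each window."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

fm = np.array([[1, 3, 0, 2],
               [4, 2, 1, 0],
               [0, 1, 5, 6],
               [2, 0, 7, 1]], dtype=float)

print(max_pool(fm))
# [[4. 2.]
#  [2. 7.]]
```

Because only the strongest response in each window survives, a feature that shifts by a pixel or two usually still produces the same pooled output.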


Before passing the embedding to the next layer within the convolutional block, the activation function in each convolutional layer introduces non-linearity into the model, which allows the layer to learn the complex relationships between the image and the extracted embeddings. The rectified linear unit (ReLU), exponential linear unit (ELU), sigmoid, and tanh functions are among the most common activation functions used in a convolutional block.
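Sketches of these activation functions in NumPy (ELU omitted for brevity):

```python
import numpy as np

def relu(x):    return np.maximum(0, x)      # Rectified Linear Unit: zeroes negatives
def sigmoid(x): return 1 / (1 + np.exp(-x))  # squashes values into (0, 1)
# tanh squashes values into (-1, 1); NumPy ships it as np.tanh.

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))     # [0. 0. 3.] -- the kink at zero is what makes it non-linear
print(sigmoid(x))  # roughly [0.119 0.5 0.953]
print(np.tanh(x))  # roughly [-0.964 0. 0.995]
```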

The depth of the convolutional layers is a critical component contributing to the extraction of useful embeddings. At every successive layer within this block, the embedding gains a more abstract understanding of the peculiar features of the objects in the initial image. For example, take an image of an iPhone 14 Pro Max. With the embeddings extracted in the first few convolutional layers, the network recognizes there is a mobile phone in the image; by the intermediate layers, it pulls more embedding information and can tell it's an iPhone; and with more embeddings by the last layer, it is able to identify it as an iPhone 14 Pro Max.


Classification Block

This part of the CNN is the fully connected linear layer, typically located after the convolutional block. It takes the embeddings from the convolutional layers and calculates the probability of the feature embedding belonging to an object class.

This layer transforms the vector embeddings into scalar data points, one score per class. These data points give a more precise numeric representation of the abstract features as clusters, making it easier to identify an object's class. The clusters represent the different features of an object class.
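In practice, the class scores produced by the fully connected layer are typically converted to probabilities with a softmax function. A sketch with hypothetical scores:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw class scores into probabilities that sum to 1."""
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

# Hypothetical scores from the fully connected layer for three classes.
logits = np.array([2.0, 1.0, 0.1])  # e.g. [cat, dog, car]
probs = softmax(logits)

print(probs)           # highest probability for the first class
print(probs.argmax())  # 0 -> the predicted class index
```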


**Conclusion**

Image embeddings have revolutionized the field of computer vision by providing a compact and meaningful representation of images. With their ability to capture rich visual information, image embeddings have opened doors to numerous applications and paved the way for advancements in image analysis, interpretation, and generation.

However, the techniques for generating image embeddings are associated with typical challenges of sensitivity to image variations, computational complexity, and the need for large datasets for training. Nonetheless, ongoing research and innovation continue to address these limitations and improve the effectiveness and efficiency of image embedding techniques. As research and development in this field progress, image embeddings will undoubtedly continue to shape the future of computer vision.
