Learning the ASL fingerspelling alphabet with MediaPipe

MediaPipe [1] is an amazing library of ready-to-use deep learning models for common tasks in various domains. My previous post highlights how you can use it to easily detect facial landmarks, and there are many other solutions available to explore. In this post, however, I want to take a look at another feature: the MediaPipe Model Maker. Model Maker allows you to extend the functionality of some MediaPipe solutions by customizing models to your specific use case. With only a few lines of code, you can fine-tune models and rewire their internals to accommodate new targets.

In this particular example, we will customize the hand gesture recognition task to build a model for reading the American Sign Language (ASL) fingerspelling alphabet. We are going to see how a basic solution can be created with little effort, but we will also see that there are limitations to the (current) functionality of the MediaPipe Model Maker.

/posts/ai/asl-detector-with-mediapipe-wsl/example-output.jpg

Preview release
Please be aware that MediaPipe is still in an early preview release and changes might occur at any time.
Compatibility issues on Windows
I was unable to make the MediaPipe Model Maker work on Windows directly. Instead, I ran the code for this post on WSL with Ubuntu 22.04. It may or may not work on other operating systems/distributions. Unfortunately, there is little information in the documentation about compatibility.

TL;DR

Here is a quick overview of the required steps to customize the gesture recognition task for a new problem. The approach is the same for the more general image classification task.

  1. Get or create some data. Image files for each class must be provided in a separate subfolder; the folder names are used as class labels.
  2. Preprocess the images by calling Dataset.from_folder.
  3. Set hyper-parameters with HParams and ModelOptions.
  4. Fine tune the model with GestureRecognizer.create.
  5. Evaluate performance with model.evaluate.
  6. Export the model with model.export_model.

Beyond these basic steps, we are going to take a look at the hand gesture embeddings that MediaPipe uses under the hood, and we will also perform a more thorough evaluation of the model’s accuracy.
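For orientation, here is a minimal end-to-end sketch of these steps, condensed from the full walkthrough below (the dataset path is a placeholder and assumes the folder layout described in step 1):

from mediapipe_model_maker.python.vision import gesture_recognizer

# 1./2. Ingest the image folders (subfolder names become the class labels)
data = gesture_recognizer.Dataset.from_folder("./data/SigNN_train100")
train_data, validation_data = data.split(0.8)

# 3. Hyper-parameters for training and the model architecture
options = gesture_recognizer.GestureRecognizerOptions(
    hparams=gesture_recognizer.HParams(export_dir="exported_model", epochs=30),
    model_options=gesture_recognizer.ModelOptions(dropout_rate=0.05),
)

# 4. Fine-tune the classification head on the hand gesture embeddings
model = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data, validation_data=validation_data, options=options
)

# 5./6. Evaluate and export the customized model
loss, acc = model.evaluate(validation_data, batch_size=32)
model.export_model("asl_recognizer.task")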

Prerequisites

Before getting started with the main code, we need to take care of some external resources. First, the Model Maker does not come with the default MediaPipe installation, so we need to install it separately. Next, we also need to get some data to train the ASL alphabet detector. MediaPipe will work directly from the files on disk, so some minor preprocessing of the dataset is also needed.

Installing MediaPipe Model Maker

Installing is straightforward with pip. Note that the installation might not succeed on Windows. If you are on Windows, you can use the Windows Subsystem for Linux (WSL) instead.

pip install mediapipe mediapipe-model-maker

Getting the data

For this example, I am going to use the SigNN Character Database from Kaggle. It contains 8442 images showing 24 characters of the English alphabet. The dataset excludes J and Z, because these two letters are distinguished from the other characters through motion (see the image at the bottom of the post). The dataset was originally created to build a mobile ASL alphabet translator - which basically does what I am creating in this post, only better. The dataset creators have a detailed description of their solution, so definitely check it out and star their page.

To download the dataset yourself, you need a Kaggle account (which is free). At 1.8 GB, it is fairly manageable. Before getting started, we will need to do some minor preprocessing to make it work seamlessly with the MediaPipe Model Maker.

Data preparation

According to the Model Maker documentation, only a small number of training examples is required to retrain/fine-tune the models; approximately 100 examples per class should be sufficient. The easiest way to provide the data to the Model Maker is through the from_folder method. It scans the given folder, interprets each subdirectory as a target class and the (image) files it contains as instances of that class. The SigNN dataset is already provided in this format.

In addition, MediaPipe requires a "none" class for training, which should include examples that do not show any of the target labels. Training will not run without it. We can create an empty folder called "none" within the training directory. This will allow us to run the training, although it would probably be better to provide actual negative examples.
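A minimal sketch for this step, assuming the sampled training split "SigNN_train100" that we generate below:

import pathlib

# MediaPipe Model Maker insists on a "none" class; an empty folder is enough
# to make training run, although real negative examples would be preferable.
(pathlib.Path("./data/SigNN_train100") / "none").mkdir(parents=True, exist_ok=True)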

We could start training with this dataset immediately. But data processing with the Model Maker is quite slow, so I do not want to work with the full dataset. Unfortunately, the MediaPipe interface has no straightforward way to control how much data is used. The easiest way I have found is to simply copy the desired number of samples to a separate folder on disk. The following script allows us to extract multiple non-overlapping subsets, including a held-out test set, from the original dataset.

"""Extract a sample from the ASL Alphabet dataset."""

import pathlib
import shutil
import typing

import click
import numpy as np
import tqdm


def process_split_sizes(splits: typing.Sequence[str]) -> typing.Dict[str, int]:
    return {s[0]: int(s[1]) for s in map(lambda x: x.split(":"), splits)}


@click.command()
@click.argument(
    "input_root",
    type=click.Path(
        exists=True,
        file_okay=False,
        path_type=pathlib.Path,
    ),
)
@click.argument(
    "output_root",
    type=click.Path(
        file_okay=False,
        path_type=pathlib.Path,
    ),
)
@click.option(
    "--split", "splits", multiple=True, type=str, default=["train:100"]
)
@click.option("--seed", default=None, type=int)
def main(
    input_root: pathlib.Path,
    output_root: pathlib.Path,
    splits: typing.Sequence[str],
    seed: typing.Union[int, None],
):
    split_sizes = process_split_sizes(splits)
    pbar = tqdm.tqdm(sorted(input_root.iterdir()))
    for labelpath in pbar:
        pbar.set_description(labelpath.name)
        files = sorted(labelpath.iterdir())
        rng = np.random.default_rng(seed=seed)
        indices = rng.permutation(len(files))

        offset = 0
        for name, size in split_sizes.items():
            output_name = output_root.name + "_" + name
            outpath = output_root.with_name(output_name) / labelpath.name
            outpath.mkdir(parents=True, exist_ok=True)
            for i in indices[offset : offset + size]:
                shutil.copy(files[i], outpath / files[i].name)
            offset += size


if __name__ == "__main__":
    main()
We use it to obtain two training sets with different sizes as well as a testing set:

python generate_data_samples.py \
    ./data/SigNN\ Character\ Database/ ./data/SigNN \
    --split "train100:100" --split "train50:50" --split "test:50"

Customizing the gesture recognition task

With all the requirements installed and the data finalized in the necessary format, we can now tackle the fine tuning process with MediaPipe Model Maker.

Some utility functions

We can define a couple of utility functions, which make the code easier to read and avoid duplicated lines. The helper functions are gathered in a local module named utils.py:

"""Some utility functions for working with the ASL detector dataset/model."""

import pathlib

import cv2
import matplotlib.pyplot as plt
import mediapipe as mp
import numpy as np
from mediapipe.framework.formats import landmark_pb2


def find_images(root: str | pathlib.Path) -> list[pathlib.Path]:
    """Find all JPG and PNG images under the given root directory."""
    return list(pathlib.Path(root).glob("**/*.[jpJP][npNP][egEG]*"))


def _read_img(filename, resize=(224, 224)):
    img = cv2.cvtColor(cv2.imread(filename), cv2.COLOR_BGR2RGB)
    return cv2.resize(img, resize)


def plot_image_files(filenames, ncols=5, resize=(224, 224)):
    img_arrays = [_read_img(str(f), resize) for f in filenames]
    fig, axarr = _plot_image_array(img_arrays, ncols)
    for i, filename in enumerate(filenames):
        axarr[i].set_title(pathlib.Path(filename).parent.name, size="smaller")
    return fig, axarr


def plot_recognizer_predictions(
    filenames, recognizer, ncols=5, landmarks=True, resize=(224, 224)
):
    img_array = [_read_img(str(f), resize) for f in filenames]

    preds = []
    for arr in img_array:
        img = mp.Image(image_format=mp.ImageFormat.SRGB, data=np.asarray(arr))
        result = recognizer.recognize(img)
        if len(result.gestures) > 0 and len(result.gestures[0]) > 0:
            preds.append(result.gestures[0][0].category_name or "N/A")
            if landmarks:
                draw_hand_landmarks(arr, result.hand_landmarks[0])
        else:
            preds.append("empty")

    fig, axarr = _plot_image_array(img_array, ncols)
    for i, (fname, pred) in enumerate(zip(filenames, preds)):
        axarr[i].set_title(
            f"{pred} (True: {pathlib.Path(fname).parent.name})", size="smaller"
        )
    return fig, axarr


def _plot_image_array(arrays, ncols):
    nrows = int(np.ceil(len(arrays) / ncols))
    fig, axarr = plt.subplots(nrows, ncols, figsize=(ncols * 2, nrows * 2))
    axarr = np.reshape(axarr, (-1,))
    for ax, img in zip(axarr, arrays):
        ax.imshow(img)
        ax.axis("off")
    return fig, axarr


def draw_hand_landmarks(img, landmarks):
    # slightly modified from here:
    # https://colab.research.google.com/github/googlesamples/mediapipe/blob/main/examples/gesture_recognizer/python/gesture_recognizer.ipynb
    proto = landmark_pb2.NormalizedLandmarkList()
    proto.landmark.extend(  # type: ignore
        [
            landmark_pb2.NormalizedLandmark(  # type: ignore
                x=landmark.x, y=landmark.y, z=landmark.z
            )
            for landmark in landmarks
        ]
    )
    connections = mp.solutions.hands.HAND_CONNECTIONS  # type: ignore
    lm_style = mp.solutions.drawing_styles.get_default_hand_landmarks_style()  # type: ignore
    c_style = mp.solutions.drawing_styles.get_default_hand_connections_style()  # type: ignore
    mp.solutions.drawing_utils.draw_landmarks(  # type: ignore
        img, proto, connections, lm_style, c_style
    )

Visualizing the data

Before training the model, we can visualize the raw data. With the utility functions from above, this only takes a few lines.

import pathlib
import numpy as np

import utils

data_root = pathlib.Path("./data")
dataset_train = data_root / "SigNN_train100"
trainfiles = utils.find_images(dataset_train)

sample_files = np.random.choice(np.asarray(trainfiles), 10)
fig, axarr = utils.plot_image_files(sample_files, ncols=5)

/posts/ai/asl-detector-with-mediapipe-wsl/training-examples.jpg

We can see that some images are of fairly low quality. It will be interesting to see how well the customized model performs.

Ingesting the data

Following the guide provided in the documentation, we can create a gesture_recognizer.Dataset object directly from the dataset folder. First, we set the preprocessing parameters with the HandDataPreprocessingParams class. This is not strictly necessary, as sensible default values are used otherwise. Then, we call from_folder to initialize the dataset. Finally, we can split the dataset into a training and a validation part. The held-out testing data is loaded in the same way.

from mediapipe_model_maker.python.vision import gesture_recognizer

handparams = gesture_recognizer.HandDataPreprocessingParams(
    min_detection_confidence=0.5
)
data = gesture_recognizer.Dataset.from_folder(str(dataset_train), handparams)
train_data, validation_data = data.split(0.8)

dataset_test = data_root / "SigNN_test"
test_data = gesture_recognizer.Dataset.from_folder(
    str(dataset_test), handparams
)

Depending on the size of the dataset, calling from_folder can take quite some time, because some heavy processing is already applied at this point. All images in the dataset folder are read and processed: a default hand landmarker model is applied to find hands in the images, and the landmark coordinates are then passed to an embedding module, which produces a meaningful representation of the gesture. We are therefore not fine-tuning the model on images directly. Rather, it learns to differentiate the embeddings obtained for each gesture.
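A quick sanity check after ingestion can be useful here, since the resulting dataset may contain fewer examples than there are image files if no hand was detected in some of them. The attributes used below also appear later in this post:

print("training examples:", train_data.size)
print("validation examples:", validation_data.size)
print("class labels:", train_data.label_names)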

Fine-tuning the gesture recognizer

We can set (and play with) a number of hyper-parameters for the model and the training process.

hparams = gesture_recognizer.HParams(
    export_dir="exported_model",
    batch_size=32,
    epochs=30,
    shuffle=True,
    learning_rate=0.005,
    lr_decay=0.999,
)
moptions = gesture_recognizer.ModelOptions(dropout_rate=0.05)
options = gesture_recognizer.GestureRecognizerOptions(
    hparams=hparams, model_options=moptions
)

I found that increasing the batch size from the default of 2 helps a lot to improve model performance. For this particular task, we can also get away with a slightly higher learning rate and slower LR decay compared to the default settings. In the ModelOptions, we could increase the model size by adding additional layers through the layer_widths parameter. A bigger model could increase the performance further, though I am already quite happy with the results.
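For illustration, a deeper classification head could be configured like this (the widths are hypothetical values, not something I tuned for this post):

# Hypothetical variant: add two hidden layers on top of the default head
moptions_big = gesture_recognizer.ModelOptions(
    dropout_rate=0.05, layer_widths=[128, 64]
)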

Finally, we train the model by calling the create function. Even on CPU, this should not take too long, as the model is just a single Dense layer with BatchNorm, ReLU and Dropout - 3737 parameters.

model = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data, validation_data=validation_data, options=options
)
Repeated calls to `create`
Beware that without changing export_dir, calling create repeatedly will not retrain from scratch but will reuse the existing weights from the last run. When testing out different hyper-parameters, you should rename or delete the output directory.
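One way to guarantee a clean retrain is to remove the export directory before calling create again (a small sketch, not part of the original workflow):

import shutil

# Delete the previous run's outputs so that training starts from scratch
shutil.rmtree(hparams.export_dir, ignore_errors=True)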

With around 95% accuracy on the validation set, the model does a good job so far. We can also evaluate the performance on the additional testing data. It appears generalization is also good.

loss, acc = model.evaluate(test_data, batch_size=32)
print(f"Test loss: {loss:.4f}, Test accuracy: {acc:.2%}")
# -> Test loss: 0.1064, Test accuracy: 95.83%

Exporting the customized model

We can now export the customized model. With one final line, the fine-tuned ASL alphabet detector is written to disk. It can then be used in any MediaPipe-based application through the gesture recognition task.

model.export_model("asl_recognizer.task")

Applying the customized model

Similar to the face landmarker described in the previous post, we can then load and apply the ASL recognizer through the MediaPipe Tasks API. Note that this does not necessarily have to be done in Python, but could also be integrated into web or mobile apps.

import mediapipe as mp
from mediapipe.tasks.python.vision.gesture_recognizer import GestureRecognizer

base_options = mp.tasks.BaseOptions(
    model_asset_path=hparams.export_dir + "/asl_recognizer.task"
)
options = mp.tasks.vision.GestureRecognizerOptions(
    base_options=base_options, running_mode=mp.tasks.vision.RunningMode.IMAGE
)

with GestureRecognizer.create_from_options(options) as recognizer:
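    # "filename" is assumed to point to one of the image files on disk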
    mp_image = mp.Image.create_from_file(str(filename))
    result = recognizer.recognize(mp_image)

We can also use the GestureRecognizer task to visualize some test examples. The utils.py file provides a helper function for this.

# Collect the held-out test images with the same helper used for the training data
testfiles = utils.find_images(data_root / "SigNN_test")
test_samples = np.random.choice(np.asarray(testfiles), 10)

with GestureRecognizer.create_from_options(options) as recognizer:
    fig, axarr = utils.plot_recognizer_predictions(test_samples, recognizer, 5)

/posts/ai/asl-detector-with-mediapipe-wsl/example-output.jpg

Detailed performance evaluation

Using the provided model.evaluate method, we can see that the model generally performs quite well. Unfortunately, all we get are the global loss and overall accuracy. In a multiclass classification problem, it would however also be interesting to evaluate the accuracy for each class separately.

I have not found a way to do this more granular analysis through the Model Maker interface. Instead, I am using the exported model through MediaPipe Tasks to build a custom evaluation scheme.

To do this, I am iterating through the list of test files, reading the images and running them (one by one) through the recognizer. Without any batching, this is quite an inefficient and comparatively slow solution – but it gets the job done. The original filename, corresponding label and predicted class are stored in a Pandas data frame for further analysis.

import pandas as pd
import tqdm

test_results = []
with mp.tasks.vision.GestureRecognizer.create_from_options(
    options
) as recognizer:
    for filename in tqdm.tqdm(testfiles):
        mp_image = mp.Image.create_from_file(str(filename))
        result = recognizer.recognize(mp_image)
        if len(result.gestures) > 0:
            pred = result.gestures[0][0].category_name or "n/a"
        else:
            pred = "empty"
        test_results.append((filename, filename.parent.name, pred))

results_df = pd.DataFrame(test_results, columns=["filename", "label", "pred"])

Sometimes, the model produces an empty string as the class output, even though a hand was successfully detected in the image. I am replacing these with “n/a” for better readability.

Confusion matrix

First up, we can look at the confusion matrix. For most characters, performance is very strong, but for some, the sensitivity drops to 70%.

import sklearn.metrics

classes = sorted(test_data.label_names + ["n/a", "empty"])
cm = sklearn.metrics.confusion_matrix(
    results_df["label"], results_df["pred"], labels=classes, normalize="true"
)
sklearn.metrics.ConfusionMatrixDisplay(cm * 100, display_labels=classes).plot(
    include_values=False
)

/posts/ai/asl-detector-with-mediapipe-wsl/cm-test.png

Per-class accuracy

We can also visualize the per-class accuracy while taking into account whether a hand was detected at all. We find that many of the errors come from empty predictions (~4.4%), which is not too bad compared to actually incorrect characters (~3.1%).

import seaborn as sns

results_df["result"] = np.where(
    results_df.pred == results_df.label,
    "correct",
    np.where(results_df.pred.isin(["n/a", "empty"]), "not found", "incorrect"),
)
print(results_df.result.value_counts(normalize=True))
sns.histplot(
    data=results_df, x="label", hue="result", multiple="stack", stat="count"
)

/posts/ai/asl-detector-with-mediapipe-wsl/accuracy-per-class.png

The truly incorrect classifications mostly come from M, N, T, S and E. This makes intuitive sense, since these gestures are very similar, differing only in slight variations of the thumb position (see the image further down).

results_df.query("result == 'incorrect'").groupby(
    "label"
).pred.value_counts().sort_values(ascending=False)

# label  pred
# M      N       12
# T      S        4
# N      S        3
# E      S        2
# R      U        2
# C      O        2

Visualizing gesture embeddings

Last but not least, I also wanted to take a look at the underlying embeddings. You will see that even without any training, there is already quite a good separation between most classes. To extract the calculated embeddings, we can access the TensorFlow data pipeline, which is usually handled under the hood:

train_ds = train_data.gen_tf_dataset(batch_size=train_data.size)
xy = train_ds.take(1).get_single_element()

embeddings, classes_onehot = xy[0].numpy(), xy[1].numpy()  # type: ignore
class_indices = np.argmax(classes_onehot, axis=1)

print(embeddings.shape, class_indices.shape)
# -> (1861, 128) (1861,)

The gesture embeddings have 128 dimensions, which is quite difficult to visualize directly. I am therefore using t-SNE [2] to further reduce the dimensionality:

import sklearn.manifold

tsne = sklearn.manifold.TSNE()
emb = tsne.fit_transform(embeddings)

After embedding the embeddings (😉), we can visualize the structure within the training data.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

embdf = pd.DataFrame(emb, columns=["X1", "X2"]).assign(label=class_indices)
sns.scatterplot(
    data=embdf, x="X1", y="X2", hue="label", palette="Spectral", legend=False
)
for i, c in enumerate(train_data.label_names):
    if np.all(class_indices != i):
        continue
    center = emb[class_indices == i].mean(axis=0)
    plt.annotate(c, center, center - 6)

/posts/ai/asl-detector-with-mediapipe-wsl/gesture-embeddings.png

Note that aside from coloring the plot, class labels do not influence any of the processing steps leading to this image. t-SNE is an unsupervised method, which just helps to highlight similarity between the samples. It’s not surprising that our model was able to pick up on these structures as well.

Conclusion

This post showcases how MediaPipe Model Maker can be used to quickly build a working prototype for a custom problem. All you need is a handful of data and a few lines of code. Although definitely not perfect, the solution presented here performs surprisingly well (>95% accuracy) and can work as an easily achievable baseline for further optimization.

The Model Maker’s ease of use comes at the cost of flexibility. There are only limited options for data handling and model evaluation. We had to prepare the desired amount of data beforehand on disk. After the training process, we had to process test examples one by one to obtain a more granular understanding of the model performance. Hopefully, there will be more flexibility in future releases of the framework.

All code for this post is accessible on GitHub.

ASL fingerspelling alphabet (original)

References


  1. C. Lugaresi et al., “MediaPipe: A Framework for Building Perception Pipelines,” 2019, arXiv:1906.08172v1.

  2. L. van der Maaten and G. Hinton, “Visualizing Data using t-SNE,” Journal of Machine Learning Research, 9(86), 2579–2605, 2008. http://jmlr.org/papers/v9/vandermaaten08a.html