Preparing Otolith Data for Age Estimation
A process note by Farnaz on how a large otolith dataset was cleaned, matched, and prepared for training.
Otoliths (literally “ear stones”) are calcium carbonate structures found in the inner ear of vertebrates, including fish and humans. The goal of our project is to estimate the age of fish by counting the rings visible in the otoliths.
To achieve this, we worked with a dataset of nearly 150,000 otolith images from cod species, along with a datasheet containing features such as weight, sex, length, and other attributes.
We decided to train the model using both the images and the tabular data. However, before doing that, we first needed to clean and prepare the dataset. For this, we followed these steps:
1. Creating the Image Names
We started with a guidance file that explains how the images are named. According to this file, the images come from three different sources: DEM, SURVEY, and MINI. Each source follows a slightly different naming format.
The image names are created by combining several attributes, such as year, month, trip number, subdivision, and an identification number. For example, a sample from a SURVEY in the year 2024, with subdivision 22 and identification number 1, would have a name like: S833_2024_22_001.
In the datasheet, there is also a column called schema, which indicates whether a sample belongs to DEM, SURVEY, or MINI. Using this column, along with the related feature columns, we wrote a Python script to generate a new column called NAMES, which contains the correct image name for each entry.
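The name-building step can be sketched roughly as follows. This is a minimal illustration, not the actual script: the format string and the column names `schema`, `JAHR`, `GEBIET`, and `ID` are assumptions (only the SURVEY pattern is shown, reconstructed from the example `S833_2024_22_001`; DEM and MINI would get their own branches per the guidance file).

```python
import pandas as pd

def build_name(row):
    """Assemble the image name from feature columns, per source schema.

    Hypothetical sketch: only the SURVEY format is reconstructed here,
    from the example S833_2024_22_001 (year, subdivision, zero-padded id).
    """
    if row["schema"] == "SURVEY":
        return f"S833_{row['JAHR']}_{row['GEBIET']}_{int(row['ID']):03d}"
    # DEM and MINI follow their own formats -- fill in from the guidance file
    return None

df = pd.DataFrame({
    "schema": ["SURVEY"],
    "JAHR": [2024],
    "GEBIET": [22],  # hypothetical column name for the subdivision
    "ID": [1],
})
df["NAMES"] = df.apply(build_name, axis=1)
print(df["NAMES"].iloc[0])  # S833_2024_22_001
```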
2. Matching the Images with Their Feature Rows
At this stage, we had the images, which were already named correctly, and the feature table, which also included the NAMES column. The next challenge was to match them and identify what was missing on each side.
In other words, we needed to find images without corresponding feature rows, as well as feature rows without matching images.
There were also duplicate entries in both the image files and the datasheet, so these had to be identified and removed as well. In practice, this step was mainly done manually by comparing the number of images with the number of rows.
The images were stored in multiple folders based on their year, type, and sometimes trip number, which made this process much easier and more manageable.
3. Dealing with Missing Values in the Datasheet
We noticed that many columns in the datasheet contained invalid or missing values. Simply removing every row with such values would have caused a significant loss of data, leaving fewer samples for training. Instead, we followed two main steps:
Selecting the useful columns
Not all columns were relevant for our model. Missing values in less important columns were not a concern, so we focused only on the features that actually contribute to training.
Based on our understanding of the data, and supported by feature correlation analysis, we selected the following columns:
- GBFANGB: the geographic latitude (Breite) of the catch location, in a source-specific format
- GLFANGB: the geographic longitude (Länge) of the catch location, in a source-specific format
- JAHR: the year of the catch
- MONAT: the month of the catch
- LAENGE: the length of the fish
- SLGEW: the weight of the fish
- REIFE: the maturity stage of the fish
We then limited our cleaning process to these selected features.
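Restricting the cleaning to the selected features can look like the following sketch. The tiny demo frame and the `UNUSED` column are invented for illustration; the point is that a missing value in an unused column never costs a sample.

```python
import pandas as pd

SELECTED = ["GBFANGB", "GLFANGB", "JAHR", "MONAT", "LAENGE", "SLGEW", "REIFE"]

# Tiny hypothetical datasheet; the real one has ~150,000 rows and more columns.
df = pd.DataFrame({
    "NAMES":   ["S833_2024_22_001", "S833_2024_22_002"],
    "GBFANGB": ["542200N", None],     # second row is missing a selected value
    "GLFANGB": ["0101800E", "548.896"],
    "JAHR":    [2024, 2024],
    "MONAT":   [3, 3],
    "LAENGE":  [45.0, 52.0],
    "SLGEW":   [980, 1200],
    "REIFE":   [3, 4],
    "UNUSED":  [None, None],          # missing here is irrelevant
})

df = df[["NAMES"] + SELECTED]         # drop irrelevant columns first
df = df.dropna(subset=SELECTED)       # then drop rows missing a *selected* value
print(len(df))  # 1
```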
Fixing or removing invalid values
The columns GBFANGB and GLFANGB contained mixed geographic coordinate formats, such as compact strings like 542200N and decimal-minute values like 548.896. We standardized all of them into a consistent DDMM.mmm format.
In addition, there were many invalid entries such as -9, commas, and other non-numeric values. These were either cleaned or removed to ensure the quality and consistency of the dataset.
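A possible shape of the standardization helper is sketched below. The interpretation of the compact format as degrees-minutes-seconds plus hemisphere, and `-9` as a missing-value sentinel, are assumptions inferred from the examples above; the real conversion rules should be checked against the data source.

```python
import re

def to_ddmm_mmm(value):
    """Standardize mixed coordinate formats to DDMM.mmm (sketch).

    Assumed cases:
      - compact strings like '542200N' (DDMMSS or DDDMMSS plus hemisphere)
      - decimal-minute values like 548.896 (already DDMM.mmm)
      - '-9' and entries containing commas treated as invalid
    """
    s = str(value).strip()
    if s in {"-9", ""} or "," in s:
        return None  # sentinel or malformed entry -> drop later
    m = re.fullmatch(r"(\d{2,3})(\d{2})(\d{2})([NSEW])", s)
    if m:
        deg, minutes, seconds = (int(m.group(i)) for i in (1, 2, 3))
        return round(deg * 100 + minutes + seconds / 60, 3)
    try:
        return round(float(s), 3)
    except ValueError:
        return None

print(to_ddmm_mmm("542200N"))  # 5422.0
print(to_ddmm_mmm("548.896"))  # 548.896
print(to_ddmm_mmm("-9"))       # None
```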
4. Cleaning the Image Dataset
At this stage, we had a clean datasheet, but the images still needed processing. To reduce noise during training, it was important that the otoliths in the images were centered.
We also considered removing the background, but since the color of the rings is very similar to the background, this step was not straightforward. We therefore postponed it, to be revisited if the model results are not satisfactory.
This step was done semi-manually using a combination of Python and VIA (the VGG Image Annotator). The process was as follows:
Selecting image batches
We processed the images in groups of around 2,000 at a time to keep the task manageable.
Identifying problematic images
We ran a script to display the images one by one and manually inspected them. During this step, we marked problematic cases:
- broken images
- completely black images
- images containing multiple otoliths
- images that needed cropping to be centered
The first two categories were removed from the dataset, while the last two were kept for further processing.
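The bookkeeping behind this review loop can be sketched as below. This is not the actual script: the verdict keys, the CSV layout, and the injected `ask` callable are all invented for illustration (in practice `ask` would open the image with matplotlib or a viewer and read a keypress).

```python
import csv
from pathlib import Path

# Hypothetical verdict keys a reviewer might type for each image.
CATEGORIES = {"k": "keep", "b": "broken", "d": "dark", "m": "multiple", "c": "crop"}

def review_images(paths, ask, out_csv="review.csv"):
    """Record one verdict per image in a CSV for later filtering.

    `ask(path)` displays the image and returns a CATEGORIES key; it is
    injected here so the bookkeeping runs without a display.
    """
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "verdict"])
        for p in paths:
            verdict = CATEGORIES.get(ask(p), "keep")
            writer.writerow([Path(p).name, verdict])

# Demo with a canned reviewer that marks everything as broken:
review_images(["a.jpg", "b.jpg"], ask=lambda p: "b", out_csv="review.csv")
```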
Annotating the region of interest
We then used VIA (the VGG Image Annotator) to mark the correct area of each image. We came across this tool in the article “An interactive AI-driven platform for fish age reading”, where it was used for segmentation tasks. However, we found it very efficient for our purpose, which was cropping.
Cropping the images
Finally, we used a Python script to crop each image based on the annotated regions, resulting in a cleaner and more consistent image dataset.
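A cropping sketch along these lines is shown below, assuming rectangular regions in VIA's JSON export (each region carrying `shape_attributes` with `x`, `y`, `width`, `height`); the demo image and file paths are invented, and Pillow is assumed for the image handling.

```python
import json
import os
from PIL import Image

def crop_from_via(annotations_path, image_dir, out_dir):
    """Crop each image to its annotated rectangle (VIA-style JSON assumed)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(annotations_path) as f:
        via = json.load(f)
    for entry in via.values():
        filename = entry["filename"]
        for region in entry["regions"]:
            s = region["shape_attributes"]
            box = (s["x"], s["y"], s["x"] + s["width"], s["y"] + s["height"])
            img = Image.open(os.path.join(image_dir, filename))
            img.crop(box).save(os.path.join(out_dir, filename))

# Tiny self-contained demo: one grey 200x200 image, one 50x80 region.
os.makedirs("raw", exist_ok=True)
Image.new("RGB", (200, 200), "grey").save("raw/demo.png")
with open("via.json", "w") as f:
    json.dump({"demo.png": {"filename": "demo.png", "regions": [
        {"shape_attributes": {"x": 10, "y": 20, "width": 50, "height": 80}}
    ]}}, f)
crop_from_via("via.json", "raw", "cropped")
```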
With these four steps, we ensured that the datasheet and the images were properly matched, free of invalid values and noise, and ready to be used for training the model.
From the project
Otolith-based age determination for Baltic cod. A research project on machine learning–supported age determination of Baltic cod from otolith microscopy images. The work focuses on data consolidation, quality assessment, and the development of supervised models to support expert-based age reading.
Note on authorship: This text was developed with the support of AI tools, used for drafting and refinement. Responsibility for content, structure, and conclusions remains with the author.