Preparing Otolith Data for Age Estimation
A process note by Farnaz on how a large otolith dataset was cleaned, matched, and prepared for training.
Otoliths (literally “ear stones”) are calcium carbonate structures found in the inner ear of vertebrates, including fish and humans. The goal of our project is to estimate the age of fish by counting the rings visible in the otoliths.
To achieve this, we worked with a dataset of nearly 150,000 otolith images from cod species, along with a datasheet containing features such as weight, sex, length, and other attributes.
We decided to train the model using both the images and the tabular data. However, before doing that, we first needed to clean and prepare the dataset. For this, we followed these steps:
1. Creating the Image Names
We started with a guidance file that explains how the images are named. According to this file, the images come from three different sources: DEM, SURVEY, and MINI. Each source follows a slightly different naming format.
The image names are created by combining several attributes, such as year, month, trip number, subdivision, and an identification number. For example, a sample from a SURVEY in the year 2024, with subdivision 22 and identification number 1, would have a name like: S833_2024_22_001.
In the datasheet, there is also a column called schema, which indicates whether a sample belongs to DEM, SURVEY, or MINI. Using this column, along with the related feature columns, we wrote a Python script to generate a new column called NAMES, which contains the correct image name for each entry.
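The name-building step can be sketched roughly as follows. This is a minimal illustration, not the actual script: the format string and the column names `schema`, `JAHR`, `GEBIET`, and `ID` are assumptions (only the SURVEY pattern is shown, reconstructed from the example `S833_2024_22_001`; DEM and MINI would get their own branches per the guidance file).

```python
import pandas as pd

def build_name(row):
    """Assemble the image name from feature columns, per source schema.

    Hypothetical sketch: only the SURVEY format is reconstructed here,
    from the example S833_2024_22_001 (year, subdivision, zero-padded id).
    """
    if row["schema"] == "SURVEY":
        return f"S833_{row['JAHR']}_{row['GEBIET']}_{int(row['ID']):03d}"
    # DEM and MINI follow their own formats -- fill in from the guidance file
    return None

df = pd.DataFrame({
    "schema": ["SURVEY"],
    "JAHR": [2024],
    "GEBIET": [22],  # hypothetical column name for the subdivision
    "ID": [1],
})
df["NAMES"] = df.apply(build_name, axis=1)
print(df["NAMES"].iloc[0])  # S833_2024_22_001
```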
2. Matching the Images with Their Feature Rows
At this stage, we had the images, which were already named correctly, and the feature table, which also included the NAMES column. The next challenge was to match them and identify what was missing on each side.
In other words, we needed to find images without corresponding feature rows, as well as feature rows without matching images.
There were also duplicate entries in both the image files and the datasheet, so these had to be identified and removed as well. In practice, this step was mainly done manually by comparing the number of images with the number of rows.
The images were stored in multiple folders based on their year, type, and sometimes trip number, which made this process much easier and more manageable.
3. Dealing with Missing Values in the Datasheet
We noticed that many columns in the datasheet contained invalid or missing values. Simply removing every row with such values would have caused a significant loss of data, leaving fewer samples for training. Instead, we followed two main steps:
Selecting the useful columns
Not all columns were relevant for our model. Missing values in less important columns were not a concern, so we focused only on the features that actually contribute to training.
Based on our understanding of the data, and supported by feature correlation analysis, we selected the following columns:
- GBFANGB: the geographic latitude (Breite) of the catch location, in a source-specific format
- GLFANGB: the geographic longitude (Länge) of the catch location, in a source-specific format
- JAHR: the year of the catch
- MONAT: the month of the catch
- LAENGE: the length of the fish
- SLGEW: the weight of the fish
- REIFE: the maturity stage of the fish
We then limited our cleaning process to these selected features.
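Restricting the cleaning to the selected features can look like the following sketch. The tiny demo frame and the `UNUSED` column are invented for illustration; the point is that a missing value in an unused column never costs a sample.

```python
import pandas as pd

SELECTED = ["GBFANGB", "GLFANGB", "JAHR", "MONAT", "LAENGE", "SLGEW", "REIFE"]

# Tiny hypothetical datasheet; the real one has ~150,000 rows and more columns.
df = pd.DataFrame({
    "NAMES":   ["S833_2024_22_001", "S833_2024_22_002"],
    "GBFANGB": ["542200N", None],     # second row is missing a selected value
    "GLFANGB": ["0101800E", "548.896"],
    "JAHR":    [2024, 2024],
    "MONAT":   [3, 3],
    "LAENGE":  [45.0, 52.0],
    "SLGEW":   [980, 1200],
    "REIFE":   [3, 4],
    "UNUSED":  [None, None],          # missing here is irrelevant
})

df = df[["NAMES"] + SELECTED]         # drop irrelevant columns first
df = df.dropna(subset=SELECTED)       # then drop rows missing a *selected* value
print(len(df))  # 1
```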
Fixing or removing invalid values
The columns GBFANGB and GLFANGB contained mixed geographic coordinate formats, such as compact strings like 542200N and decimal-minute values like 548.896. We standardized all of them into a consistent DDMM.mmm format.
In addition, there were many invalid entries such as -9, commas, and other non-numeric values. These were either cleaned or removed to ensure the quality and consistency of the dataset.
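A possible shape of the standardization helper is sketched below. The interpretation of the compact format as degrees-minutes-seconds plus hemisphere, and `-9` as a missing-value sentinel, are assumptions inferred from the examples above; the real conversion rules should be checked against the data source.

```python
import re

def to_ddmm_mmm(value):
    """Standardize mixed coordinate formats to DDMM.mmm (sketch).

    Assumed cases:
      - compact strings like '542200N' (DDMMSS or DDDMMSS plus hemisphere)
      - decimal-minute values like 548.896 (already DDMM.mmm)
      - '-9' and entries containing commas treated as invalid
    """
    s = str(value).strip()
    if s in {"-9", ""} or "," in s:
        return None  # sentinel or malformed entry -> drop later
    m = re.fullmatch(r"(\d{2,3})(\d{2})(\d{2})([NSEW])", s)
    if m:
        deg, minutes, seconds = (int(m.group(i)) for i in (1, 2, 3))
        return round(deg * 100 + minutes + seconds / 60, 3)
    try:
        return round(float(s), 3)
    except ValueError:
        return None

print(to_ddmm_mmm("542200N"))  # 5422.0
print(to_ddmm_mmm("548.896"))  # 548.896
print(to_ddmm_mmm("-9"))       # None
```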
4. Cleaning the Image Dataset
At this stage, we had a clean datasheet, but the images still needed processing. To reduce noise during training, it was important that the otoliths in the images were centered.
We also considered removing the background, but since the color of the rings is very similar to the background, this step was not straightforward. We therefore postponed it, to be revisited if the model results are not satisfactory.
This step was done semi-manually using a combination of Python and VIA (the VGG Image Annotator). The process was as follows:
Selecting image batches
We processed the images in groups of around 2,000 at a time to keep the task manageable.
Identifying problematic images
We ran a script to display the images one by one and manually inspected them. During this step, we marked problematic cases:
- broken images
- completely black images
- images containing multiple otoliths
- images that needed cropping to be centered
The first two categories were removed from the dataset, while the last two were kept for further processing.
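The bookkeeping behind this review loop can be sketched as below. This is not the actual script: the verdict keys, the CSV layout, and the injected `ask` callable are all invented for illustration (in practice `ask` would open the image with matplotlib or a viewer and read a keypress).

```python
import csv
from pathlib import Path

# Hypothetical verdict keys a reviewer might type for each image.
CATEGORIES = {"k": "keep", "b": "broken", "d": "dark", "m": "multiple", "c": "crop"}

def review_images(paths, ask, out_csv="review.csv"):
    """Record one verdict per image in a CSV for later filtering.

    `ask(path)` displays the image and returns a CATEGORIES key; it is
    injected here so the bookkeeping runs without a display.
    """
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "verdict"])
        for p in paths:
            verdict = CATEGORIES.get(ask(p), "keep")
            writer.writerow([Path(p).name, verdict])

# Demo with a canned reviewer that marks everything as broken:
review_images(["a.jpg", "b.jpg"], ask=lambda p: "b", out_csv="review.csv")
```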
Annotating the region of interest
We then used VIA (the VGG Image Annotator) to mark the correct area of each image. We came across this tool in the article “An interactive AI-driven platform for fish age reading”, where it was used for segmentation tasks. However, we found it very efficient for our purpose, which was cropping.
Cropping the images
Finally, we used a Python script to crop each image based on the annotated regions, resulting in a cleaner and more consistent image dataset.
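A cropping sketch along these lines is shown below, assuming rectangular regions in VIA's JSON export (each region carrying `shape_attributes` with `x`, `y`, `width`, `height`); the demo image and file paths are invented, and Pillow is assumed for the image handling.

```python
import json
import os
from PIL import Image

def crop_from_via(annotations_path, image_dir, out_dir):
    """Crop each image to its annotated rectangle (VIA-style JSON assumed)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(annotations_path) as f:
        via = json.load(f)
    for entry in via.values():
        filename = entry["filename"]
        for region in entry["regions"]:
            s = region["shape_attributes"]
            box = (s["x"], s["y"], s["x"] + s["width"], s["y"] + s["height"])
            img = Image.open(os.path.join(image_dir, filename))
            img.crop(box).save(os.path.join(out_dir, filename))

# Tiny self-contained demo: one grey 200x200 image, one 50x80 region.
os.makedirs("raw", exist_ok=True)
Image.new("RGB", (200, 200), "grey").save("raw/demo.png")
with open("via.json", "w") as f:
    json.dump({"demo.png": {"filename": "demo.png", "regions": [
        {"shape_attributes": {"x": 10, "y": 20, "width": 50, "height": 80}}
    ]}}, f)
crop_from_via("via.json", "raw", "cropped")
```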
With these four steps, we ensured that the datasheet and the images were properly matched, free of invalid values and noise, and ready to be used for training the model.
From the project
Otolith-based age determination for Baltic cod. A research project on machine learning–supported age determination of Baltic cod from otolith microscopy images. The work focuses on data consolidation, quality assessment, and the development of supervised models to support expert-based age reading.
Note on authorship: This text was developed with the support of AI tools, used for drafting and refinement. Responsibility for content, structure, and conclusions remains with the author.