In a recent article published in the British Ecological Society journal, the Pl@ntNet team, building on the PhD work of Tanguy Lefort, proposes a significant advancement in optimizing artificial intelligence models for plant species identification.
The Challenge of Annotation: A Key Element of Pl@ntNet
Deep learning models applied to plant identification require large annotated datasets. The Pl@ntNet system plays a central role by allowing users worldwide to generate, submit, and annotate botanical observations. However, this approach inherently leads to variability in label quality, as users’ expertise levels differ, creating discrepancies in the annotations. Aggregating these labels thus becomes a crucial challenge for training AI models.
Traditional approaches suffer from one of two major issues: either they retain all observations, introducing significant noise into the data, or they keep only annotations that have received enough votes, discarding valuable information, particularly for rare species.
In this publication, the research team proposes an alternative label aggregation method based on estimating user competence via a trust score, which measures how reliably a user identifies plant species from crowdsourced data. Unlike traditional methods, this approach leverages botanical experts’ knowledge without penalizing their lower annotation frequency, and it removes unreliable observations while preserving those with a limited number of trusted annotations.
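To make the idea concrete, here is a minimal sketch of trust-weighted label aggregation. It is not the paper’s actual algorithm: the function name, the way trust scores are supplied, and the `keep_threshold` cutoff are all illustrative assumptions, chosen only to show how a single trusted vote can outweigh several unreliable ones, and how an observation with too little trusted support can be discarded.

```python
from collections import defaultdict

def aggregate_labels(votes, trust, keep_threshold=0.5):
    """Aggregate crowdsourced species votes for one observation,
    weighting each user's vote by an estimated trust score.

    votes: list of (user_id, species) pairs
    trust: dict mapping user_id -> trust score in [0, 1]
    keep_threshold: illustrative minimum trusted weight needed
                    to keep the observation at all

    Returns the winning species, or None if the observation is
    discarded as unreliable.
    """
    weight = defaultdict(float)
    for user, species in votes:
        weight[species] += trust.get(user, 0.0)
    if not weight:
        return None
    best = max(weight, key=weight.get)
    # An observation with few votes is kept as long as the trusted
    # weight behind the winning label is high enough.
    return best if weight[best] >= keep_threshold else None
```

Under this scheme, a single vote from a highly trusted expert (trust 0.95) beats two votes from low-trust users (trust 0.2 each), while a lone low-trust vote leads to the observation being dropped rather than mislabeled.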
Large-Scale Experimentation
The researchers applied this strategy to a large subset of the Pl@ntNet database focused on European flora, which currently includes over 6 million observations and approximately 800,000 anonymized users.
The results demonstrate that evaluating users’ skills based on the diversity of their expertise significantly improves label quality. By integrating AI-generated votes alongside human annotations, label aggregation becomes more robust and enables the detection of unreliable observations, even when they have received few votes.
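One way to picture the integration of AI-generated votes is to treat the model as one extra annotator whose vote is weighted by a base trust score scaled by its prediction confidence. The sketch below is an assumption-laden illustration of that idea, not the method from the article: the `ai_trust` weight, the threshold, and the function interface are all hypothetical.

```python
def aggregate_with_ai(human_votes, trust, ai_probs,
                      ai_trust=0.7, keep_threshold=0.5):
    """Combine human votes with one AI vote for a single observation.

    human_votes: list of (user_id, species) pairs
    trust: dict mapping user_id -> trust score in [0, 1]
    ai_probs: dict mapping species -> model probability
    ai_trust: illustrative base trust assigned to the model

    Returns the winning species, or None if total trusted support
    stays below keep_threshold.
    """
    weight = {}
    for user, species in human_votes:
        weight[species] = weight.get(species, 0.0) + trust.get(user, 0.0)
    # The model's top prediction counts as one additional vote,
    # weighted by its base trust scaled by its confidence.
    species, prob = max(ai_probs.items(), key=lambda kv: kv[1])
    weight[species] = weight.get(species, 0.0) + ai_trust * prob
    best = max(weight, key=weight.get)
    return best if weight[best] >= keep_threshold else None
```

With this setup, an observation carrying only one low-trust human vote can still be kept when a confident model prediction agrees with it, which mirrors the robustness to sparsely voted observations described above.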
The team’s conclusions highlight the importance of synergy between human annotations and data-driven filtering to optimize AI model training. This approach opens promising perspectives for further refining training datasets and enhancing the reliability of botanical identification systems.
If you’re interested in the full article, find it here!