Comparison of manual, machine learning, and hybrid methods for video annotation to extract parental care data
Abstract
Measuring parental care behaviour in the wild is central to the study of animal ecology and evolution, but it is often labour- and time-intensive. Efficient open-source tools have recently emerged that allow animal behaviour to be quantified from videos using machine learning and computer vision techniques, but there has been limited appraisal of how these tools perform compared with traditional methods. To gain insight into how different methods perform in extracting data from videos taken in the field, we compared estimates of the parental provisioning rate of wild house sparrows Passer domesticus from video recordings. We compared four methods: manual annotation by experts, crowd-sourcing, automatic detection based on the open-source software DeepMeerkat, and a hybrid annotation method. We found that the data collected by the automatic method correlated with expert annotation (r = 0.62) and were biologically meaningful, as they predicted brood survival. However, the automatic method produced substantially inflated estimates because it also detected non-visitation events, whereas crowd-sourcing and hybrid annotation produced estimates equivalent to expert annotation. The hybrid annotation method required approximately 20% of the annotation time of manual annotation, making it a more cost-effective way to collect data from videos. We provide a case study of how different approaches can be adopted and evaluated on a pre-existing dataset, to support informed decisions on the best way to process video datasets. If pre-existing frameworks produce biased estimates, we encourage researchers to adopt a hybrid approach of first using machine learning frameworks to preprocess videos and then manually annotating the resulting clips, to save annotation time. As open-source machine learning tools become more accessible, we encourage biologists to use them to cut annotation time while obtaining equally accurate results, without needing to develop novel algorithms from scratch.
Introduction
Parental care behaviour is a life history trait that is commonly studied in a wide range of animals (Royle et al. 2012). Parental care is defined as any behaviour that increases the fitness of offspring (Clutton-Brock 1991, Royle et al. 2012) at a cost to the parents' own survival probability (Trivers 1972), presenting a life history trade-off (Stearns 1992). While there are many forms of parental care (e.g. nest building, predator defence, incubation, and feeding; Royle et al. 2012), the feeding of young is considered particularly costly for parents because of the immense time and energy investments (Winkler and Wilkinson 1988, Owens and Bennett 1994). A large body of literature, with a focus on birds, where 90% of species engage in parental care (Cockburn 2006), describes how the frequency of nest visits to dependent young is associated with aspects of an animal's life history. For example, studies of the life history trade-off between parent and offspring fitness (Schroeder et al. 2013), parental coordination (Wojczulanis-Jakubas et al. 2018, Ihle et al. 2019), parent–offspring conflict (Estramil et al. 2013), or ageing (Wilcoxen et al. 2010) have all used the frequency of parental visits to nests (provisioning rate) as a proxy of parental investment.
However, measuring provisioning rates in wild birds is labour-intensive and time-consuming. Data are traditionally collected by direct observation (Dunn and Cockburn 1996), which can be invasive by disturbing animals in the vicinity of their nest (Rose 2009). Less invasive methods include video recording (Nakagawa et al. 2007, García-Navas and Sanz 2010), radio tracking (Mitchell et al. 2012), and the use of radio-frequency tags and antennas at the nest (RFID; Ringsby et al. 2009, Mariette et al. 2011, Sánchez-Tójar et al. 2017). While radio tags allow visitation rates to be quantified over long periods of time, the technology is prone to missed detections (up to 20%; Mariette et al. 2011). Video analysis is more flexible and allows for other behaviours to be quantified, such as nest defence, copulations, or feeding load (Lendvai et al. 2015), but manual annotation of video data is time-consuming (Tuyttens et al. 2014). Another alternative is crowd-sourcing (Desell et al. 2015), where students or citizen scientists are recruited to collect data. Crowd-sourcing is efficient in collecting data, while also being educational for students (Voss and Cooper 2010, Unger 2022), and increases engagement with the public. However, time has to be invested in training volunteers and in designing suitable software for citizen scientists (Desell et al. 2015, Root-Gutteridge et al. 2021). As the amount of video data collected by ecologists continues to increase, there is a need for more effective ways to extract data from video recordings (Weinstein 2018a).
Recent advances in deep learning (Borowiec et al. 2021) and computer vision (Weinstein 2018a) provide a solution to this problem, by allowing quick and consistent extraction of information from field data (Valletta et al. 2017). For example, machine learning methods have been successfully applied to solve problems with species identification (Wäldchen and Mäder 2018), bird song quantification (Pearse et al. 2018, Priyadarshani et al. 2018), social behaviour measurement (Robie et al. 2017), and individual identification (Körschens et al. 2018, Bogucki et al. 2019, Schofield et al. 2019, Ferreira et al. 2020). Since computing resources are cheaper than human labour, such approaches have the potential to reduce the financial and time costs of data collection, evidenced by a recent increase in popularity for ecological applications (Borowiec et al. 2021, Tuia et al. 2022).
Recently, several open-source machine learning software packages have been developed to make methods accessible to biologists. Examples include software for tracking animals and behaviours in captive settings (Harmer and Thomas 2019, Pennington et al. 2019, Sridhar et al. 2019, Walter and Couzin 2021), pose estimation (Graving et al. 2019, Pereira et al. 2019, Lauer et al. 2022), or analysing timelapse videos (Weinstein 2015, 2018b). However, few studies have investigated whether these tools can be readily applied to pre-existing datasets, especially in field studies and long-term datasets. While much open-source software can be useful for processing videos collected in the field, biologists often do not know the appropriate software to use for their specific project, or are unsure whether investing time in constructing novel software for their study system will be worth the effort.
Here, we field tested alternative data collection methods – manual, crowd-sourcing, automatic (machine learning), and a manual/automatic hybrid method – in a model system with annotated parental care videos of wild house sparrows Passer domesticus on Lundy Island, UK (Nakagawa et al. 2007). We used DeepMeerkat (Weinstein 2018b), a popular machine learning-based framework that uses convolutional neural networks (CNNs) to detect movement events from wildlife monitoring videos. Despite the name, the software was initially designed for use in a hummingbird population (Weinstein 2018b, Marcot et al. 2019), and has been adapted for use in marine (Sheehan et al. 2020) and insect (Pegoraro et al. 2020, Mertens et al. 2021) systems. To the best of our knowledge, DeepMeerkat and its predecessor MotionMeerkat (Weinstein 2015) are the only open-source software tools designed specifically for counting occurrences of animals in ecological timelapse videos (Weinstein 2018a), hence we chose DeepMeerkat since it best fit the nature of our dataset. While Marcot et al. (2019) did a comparison of the efficacy of MotionMeerkat (Weinstein 2015) in the hummingbird system, there has been limited appraisal of the accuracy and effort trade-offs that researchers need to consider when applying these novel tools in their own systems. For this reason we set out to compare four data collection methods for measuring parental care visitation rates: 1) manual, 2) crowd-sourcing, 3) automatic (using machine learning), and 4) a hybrid of manual and automatic. Using a large multi-year dataset, we determined how well the tool performed in extracting parental care data, then compared the accuracy and time investment trade-off for each data collection method, to provide insight into how open-source tools can be readily incorporated into pre-established workflows to aid data collection.
Material and methods
We first describe the study system and the parental care video dataset used, then introduce a framework showing how a researcher can approach the use of machine learning tools with similarly large datasets. We then describe each of the data collection methods in detail, followed by evaluation metrics used to compare the methods. We then test a simple biological hypothesis of effort–fitness trade-offs, and finally quantitatively compare the accuracies of the four methods.
Study system
Data were collected from a population of house sparrows P. domesticus on Lundy Island (51°10′N, 4°40′W) in the Bristol Channel, UK. This population is part of a long-term study and has been monitored systematically since 2000, with > 99% of individuals each marked with a unique combination of colour rings, a metal ring with a unique number from the British Trust for Ornithology (Cleasby et al. 2011), and a unique passive-integrated transponder (Schroeder et al. 2011). Since house sparrows rarely fly over large bodies of water (Magnussen and Jensen 2017) very little immigration or emigration has taken place in the population (Schroeder et al. 2015). As a result, the population has high recapture rates with no trapping bias (Simons et al. 2015), and we have reliable life history data available for every individual (Schroeder et al. 2015).
Parental care data
The Lundy sparrow population is situated within an area of 0.2 km² around a small village, the only viable habitat on the island (Schroeder et al. 2011). Nest boxes for the sparrows are checked systematically to detect all breeding attempts throughout the summer breeding season (Cleasby et al. 2011). After eggs are found and the identities of the parents confirmed from their colour–ring combinations, 90-min videos (720 × 394, 25 fps) are recorded on Day 7 and Day 11 after egg hatching, using video cameras placed 2–5 m away from the nest box and with a field of view of at least 30 cm radius around the nest entrance to measure parental visitations (see Nakagawa et al. 2007a for the detailed procedure).
Data collection framework
Here, we describe a framework for formulating the approaches that researchers can use when incorporating machine learning tools for video analysis (Table 1). We categorised data collection methods into four types: 1) manual: the traditional approach of data collection by expert researchers watching videos and transcribing behaviours manually, 2) crowd-sourcing: data collection manually by a large group of people such as students or the general public, 3) automatic: data collection by a machine learning tool, without human intervention, and 4) hybrid: a combination of both manual and automatic. Since the parental provisioning videos were taken in the field, there is no reliable ground truth for visitation rates because of differing video conditions, observer bias, or subjectivity of behaviours (Tuyttens et al. 2014). For this reason, when comparing among methods, we are limited to assuming that manual annotation by experts is a suitable baseline. Finally, we also distinguish two main types of data that can be extracted with these methods: 1) bird presence, where birds are present within the video frame, and 2) bird visitation, a subset of the bird presence data, where birds enter the nest box or feed their young from the outside.
Table 1. The four data collection methods compared in this study and the number of videos processed with each.

| Method | Definition | Sample size (number of videos) |
|---|---|---|
| (1) Manual | Annotation by expert researchers | 2112 |
| (2) Crowd-sourcing | Annotation by undergraduate students | 18 |
| (3) Automatic | Processing using machine learning pipeline without human intervention | 2629 |
| (4) Hybrid | Annotation of clips extracted from automatic pipeline by expert researcher | 18 |
Method 1: manual
Manual annotation is based on an established protocol for parental care data for the study system (Nakagawa et al. 2007). During video annotation, birds were considered as feeding their young when entering and exiting the nest box and, in rare cases, when feeding behaviour could be observed through the nest box entrance with the parents not fully entering the nest box. We recorded when birds were perched outside the nest box as a separate behaviour that did not count towards visitation rate estimates. Visitation rates were calculated using the time period from the first visit of either parent until the end of the video, or until 90 min had elapsed from the first visit, whichever came first. We started counting from the first visit, and not from the beginning of the video, to allow time for habituation. The resulting time during which visits were scored was termed the effective observation time (Nakagawa et al. 2007a). The total number of visits by both parents was divided by the effective observational time to obtain the visitation rate (visits/h) as a measure of parental provisioning.
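To make the protocol concrete, the visitation-rate calculation can be expressed as a short function. The sketch below (in R, with hypothetical input names) assumes a vector of visit times in minutes from the start of the recording; it is an illustration of the rule described above, not the original annotation script.

```r
# Minimal sketch of the visitation-rate calculation described above.
# 'visit_times_min' is a hypothetical vector of visit times (minutes from the
# start of the recording) by either parent; videos are ~90 min long.
visitation_rate <- function(visit_times_min, video_length_min = 90) {
  if (length(visit_times_min) == 0) return(NA_real_)
  first_visit <- min(visit_times_min)
  # Effective observation time: from the first visit until the end of the
  # video, or until 90 min after the first visit, whichever comes first
  obs_end <- min(video_length_min, first_visit + 90)
  effective_hours <- (obs_end - first_visit) / 60
  n_visits <- sum(visit_times_min <= obs_end)
  n_visits / effective_hours  # visits per hour
}

visitation_rate(c(5, 20, 41, 77))  # example: four visits in a 90-min video
```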
Between 2004 and 2015, videos (n = 2112) were manually annotated by graduate students and researchers working on the Lundy house sparrow project to obtain visitation rates. The dataset has contributed to multiple publications on the evolution of parental care (Nakagawa et al. 2007, Ihle et al. 2019, Schroeder et al. 2012, 2013, 2016, 2019), and we considered these data to be the manual expert (Table 1) dataset.
Method 2: crowd-sourcing
We also used a crowd-sourcing method where videos (n = 18) were provided to a cohort of 36 second-year undergraduate students, who annotated full-length videos as part of their study. Students were first briefed on the annotation protocol as above. Each student was then assigned two videos to annotate. Since students are not trained experts, more than one student was assigned to every video to ensure no bias was introduced by individual observers. Finally, the average visitation rates for each video were calculated across observers. Each student also measured the approximate time it took for them to fully annotate each video.
Method 3: automatic
We processed videos collected between 2011 and 2019 (n = 2629) with the open-source software DeepMeerkat (Weinstein 2018b), which detects image changes between frames and identifies moving objects in wildlife monitoring videos. While we considered other approaches such as object detectors (e.g. YOLOv8; Jocher et al. 2023) or background subtraction from the OpenCV library (Bradski 2000), we decided DeepMeerkat was the most appropriate since it is designed for wildlife videos, has a user-friendly interface, and has models pre-trained on similar tasks (Weinstein 2018b). Since DeepMeerkat only detects motion, events that are detected may not always accurately correspond to visitations. This method processes the data without human intervention, and has a 1:1 processing time on a single CPU (e.g. a 1.5 h video takes approximately 1.5 h; Marcot et al. 2019).
We first processed each video with DeepMeerkat, then merged movement events detected fewer than 40 frames apart (25 fps videos, i.e. 1.6 s) and more than two frames in length as the same bird presence event. We then further grouped the events into 7 s video clips, to allow downstream annotation (Method 4 below). We observed that most bird visitations can be captured well within the 7 s threshold. We tallied the number of events for each video, then divided the tally by the effective observation time (above) to obtain the automatic presence rate (in events/h; Fig. 1). Since most visitation events by Lundy sparrows involved parents entering the nest box, we broadly assumed each visitation event corresponded to two presence events (entering and exiting the nest box). We therefore divided the automatic presence rate by two. Finally, since certain videos produced over-inflated presence rate measures due to the filming environment (e.g. camera shaking, continuous background movement), we removed any videos with an automatic presence rate of 72.7 events/h or more (1.31% of videos removed), since that was the maximum visitation rate we have ever recorded through expert manual annotation.
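The post-processing rules above can be sketched as follows. This is an illustrative example in R, written under the assumption that DeepMeerkat's per-video output can be reduced to a sorted vector of frame numbers at which motion was flagged; the function and column names are hypothetical and the exact script we used is provided in the data repository.

```r
# Group raw motion detections into presence events and derive a presence rate,
# following the rules described above (merge gaps < 40 frames, discard events
# of two frames or fewer, assign each event to a 7 s clip for later annotation).
frames_to_events <- function(frames, fps = 25, merge_gap = 40, min_len_frames = 2) {
  if (length(frames) == 0) return(data.frame())
  frames <- sort(frames)
  event_id <- cumsum(c(1, diff(frames) >= merge_gap))  # new event at large gaps
  events <- data.frame(
    start_frame = as.numeric(tapply(frames, event_id, min)),
    end_frame   = as.numeric(tapply(frames, event_id, max))
  )
  events <- events[(events$end_frame - events$start_frame + 1) > min_len_frames, ]
  events$clip_start_s <- events$start_frame / fps   # 7 s clip per event
  events$clip_end_s   <- events$clip_start_s + 7
  events
}

# Presence rate in events/h, halved because one visit is assumed to produce two
# presence events (entering and exiting the nest box); videos with a presence
# rate of 72.7 events/h or more were excluded from the analysis.
presence_rate <- function(n_events, effective_hours) (n_events / effective_hours) / 2
```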
Method 4: hybrid
In the hybrid method, the same videos used in the crowd-sourcing method (n = 18) were first processed by the automatic method (Method 3), and the resulting video clips were then manually annotated by experts to obtain visitation rates. This step filtered out events in which birds were present but not attending the nest, thereby converting the presence rate obtained from automatic processing (Method 3) into a visitation rate. The time it took to annotate each video was recorded for comparison with the other methods.
Comparing manual and automatic methods
To examine the reliability of the methods, we first compared the data obtained by manual annotation by experts with the automatic method (manual, n = 2112; automatic, n = 2629; Table 1). We used a Pearson's correlation test to determine whether both measures were correlated on videos where both methods were used (n = 781), then used both measures in a case study to test the hypothesis that parental care was beneficial to offspring and parental fitness (Trivers 1972). We tested whether broods whose parents had higher presence or visitation rates had a higher fledging and recruitment success, with ‘fledging' referring to a sparrow chick being successfully fledged from the nest, and recruits being fledglings that produced at least one genetic offspring in their lifetime. While we expected visitation rates by expert manual annotation to predict higher fitness from previous studies (Schroeder et al. 2013), we were also able to determine whether bird presence rates collected by automatic processing could similarly be a proxy for parental care behaviour.
We fitted two generalised linear mixed models with the number of fledglings and the number of recruits for each brood as the respective response variables, and the visitation rates and presence rates as explanatory variables, using a Poisson link function. We z-transformed both measures of parental provisioning so that effect sizes from the manually and automatically collected data could be compared. We used the absolute counts of fledglings and recruits as proxies of fitness because we were interested in how absolute differences in provisioning rates at the brood level affect the fitness outcome of the whole brood. Moreover, there is no way of knowing which offspring the parents are feeding within a brood, so we assumed that parental investment scales proportionately to the number of offspring (Schroeder et al. 2013). Clutch size in Lundy sparrows also varies little (4.2 ± 0.8; Westneat et al. 2014). To account for other effects that might affect fitness and parental care behaviour, we specified the models as follows. Since videos were collected on Day 7 and Day 11 after hatching, we averaged the values collected on both days to match the single fitness estimate available for each brood. However, since visitation rates increase with brood age (Schroeder et al. 2019), we also ran the models separately using the rates on Day 7 and Day 11 to ensure results were consistent. We added the ages of the mother and father (Wiebe 2018) and hatch date (days after 1 April) as fixed effects to control for their effects on fitness outcomes. Since breeding success usually correlates with peak food abundance (Lack 1968, Cresswell and Mccleery 2003), we added a quadratic fixed-effect term for hatch date, assuming peak food abundance in the middle of the breeding season. The population has also undergone routine cross-fostering, which is associated with slightly increased chick survival (Winney et al. 2015), hence we added a fixed factor for fostered status (yes/no) in all models. Note that the parental provisioning rates we used were always those of the parents that actually visited, who were not always the genetic parents of the young (Lattore et al. 2019). We added the social parent IDs and the year as random effects to control for environmental effects (Rose et al. 1998) and for repeatable visitation rates by individual parents (Nakagawa et al. 2007). Finally, the location of the nest box was added as a random effect to control for environmental effects (Schroeder et al. 2012).
We ran all models using the R package ‘MCMCglmm' (Hadfield 2010) in R ver. 3.6.1 (www.r-project.org). The posterior distributions and autocorrelations were checked following Hadfield (2014) to ensure all fixed and random effects converged without violating any model assumptions. We defined a parameter estimate as statistically significant if the 95% credible interval did not overlap with zero.
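As an illustration of the analysis described above, the sketch below shows how the correlation test and one of the fitness models could be specified with MCMCglmm. The data frame 'd', its column names, and the MCMC settings are hypothetical placeholders rather than the exact analysis script.

```r
library(MCMCglmm)

# Pearson correlation between the two provisioning measures on videos scored
# by both methods (hypothetical column names)
cor.test(d$visit_rate_manual, d$presence_rate_auto)

# z-transform both provisioning measures so effect sizes are comparable
d$visit_rate_z    <- as.numeric(scale(d$visit_rate_manual))
d$presence_rate_z <- as.numeric(scale(d$presence_rate_auto))

# Fledgling model with the manual visitation rate; the recruit model and the
# models using the automatic presence rate follow the same structure
m_fledglings <- MCMCglmm(
  n_fledglings ~ visit_rate_z + mother_age + father_age +
    hatch_date + I(hatch_date^2) + fostered,
  random = ~ mother_id + father_id + year + nestbox,
  family = "poisson",
  data   = d,
  nitt = 110000, burnin = 10000, thin = 100  # placeholder settings; convergence
)                                            # checked following Hadfield (2014)
summary(m_fledglings)  # an effect is 'significant' if its 95% CrI excludes zero
```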
Comparing annotation methods
Next, we quantitatively compared the crowd-sourcing, automatic, and hybrid approaches for collecting data from parental care videos, using expert manual annotation as a baseline. We standardised the measures by dividing the number of events detected by each method (presence events for automatic, visit events for crowd-sourcing and hybrid) by the number in the baseline, to obtain a proportion of events detected relative to the baseline. Since a proportion of 1 would indicate that a method detected the same number of events as the baseline, we carried out one-sample t-tests to determine whether each method differed significantly from the baseline (by setting the theoretical mean (µ) to 1). We then compared all methods with each other using pairwise t-tests.
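A minimal sketch of these tests in R, assuming the per-video proportions (events detected relative to the expert baseline) are stored in hypothetical vectors for each method:

```r
# One-sample t-tests against a theoretical mean of 1 (i.e. the same number of
# events as the expert manual annotation baseline)
t.test(prop_crowd,  mu = 1)
t.test(prop_auto,   mu = 1)
t.test(prop_hybrid, mu = 1)

# Pairwise comparisons among the three methods
props  <- c(prop_crowd, prop_auto, prop_hybrid)
method <- rep(c("crowd", "automatic", "hybrid"),
              times = c(length(prop_crowd), length(prop_auto), length(prop_hybrid)))
pairwise.t.test(props, method)
```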
Finally, to assess the performance and sources of error of the automatic data collection method, we further analysed the data collected by hybrid expert annotation of the video clips (Method 4; n = 1650 bird presence events across 18 videos). We first computed false positive rates by comparing the automatic results with the expert annotations, calculating the proportion of events in which a bird visitation was falsely detected, either because no bird was present within the frame or because a bird was present but not visiting the nest. We then computed false negative and true positive rates as the proportions of bird visitation events (based on expert manual annotation; Method 1) that were captured (true positive) or missed (false negative) by the automatic method. All proportions were calculated for each video and then averaged across videos.
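These error rates could be computed along the following lines. This is an illustrative sketch with hypothetical data frames: 'clips' holds one row per automatically detected clip with an expert label, and 'manual_visits' holds one row per manually scored visit flagged as captured or missed by the automatic pipeline.

```r
library(dplyr)

# False positive components: per video, the proportion of automatically
# detected clips that were not visits, split by whether a bird was present
fp_by_video <- clips |>
  group_by(video_id) |>
  summarise(
    false_pos           = mean(label != "visit"),
    false_pos_no_bird   = mean(label == "no_bird"),       # no bird in frame
    false_pos_non_visit = mean(label == "presence_only")  # bird present, not visiting
  )

# True positive / false negative: per video, the proportion of manually scored
# visits that the automatic pipeline captured or missed
tp_fn_by_video <- manual_visits |>
  group_by(video_id) |>
  summarise(
    true_pos  = mean(captured_by_automatic),
    false_neg = mean(!captured_by_automatic)
  )

# Averaging across videos gives the values reported in Table 3
colMeans(fp_by_video[-1]); colMeans(tp_fn_by_video[-1])
```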
Results
Automatic method is comparable to manual annotation
We found a significant positive correlation between expert manual annotation and the automatic method (r = 0.62, 95% CI: 0.58–0.66, p < 0.001; Fig. 2a). We also found that both proxies of parental care behaviour significantly predicted an increase in both the number of fledglings and recruits in our statistical analysis (Fig. 3). None of the other fixed or random effects predicted fitness outcomes, except for a significant negative quadratic effect of hatch date in most models, showing that fitness is maximised in the mid-breeding season (Supporting information). To test for the influence of outliers, we also ran the same models after removing data from nests with five recruits or fledglings, which yielded similar results (Supporting information).
Comparing data collection methods
Based on the proportion of detected events relative to the baseline, the automatic method produced significantly inflated estimates (Table 2, Fig. 2b). The proportions for crowd-sourcing and hybrid annotation were not significantly different from the baseline (Table 2, Fig. 2b), showing that the data obtained were consistent with expert annotation. The estimates obtained by the crowd-sourcing undergraduate students were consistent, with the majority of students recording a feeding rate within 10 events of the baseline value (relative to a mean of ~58.5 events per video; Supporting information). On average, crowd-sourcing data extraction by undergraduate students took 65.4 min per video (min 25 min, max 100 min), whereas hybrid expert annotation took an average of 12.0 min per video (min 4.6 min, max 31.0 min). The hybrid pipeline therefore saved an average of 53 min of annotation time per video, but was more time-consuming overall once computer processing is included (~102 min per video: 12 min of annotation plus ~90 min of processing time).
Table 2. Proportion of events detected by each method relative to the expert manual annotation baseline, with one-sample t-tests against a theoretical mean of 1.

| Method | Mean | 95% Confidence interval | t | p-value |
|---|---|---|---|---|
| (1) Crowd-sourcing annotation | 1.01 | 0.97–1.06 | 1.13 | 0.52 |
| (2) Automatic | 1.39 | 1.17–1.63 | 3.68 | 0.002 |
| (3) Hybrid annotation | 0.94 | 0.88–1.01 | −1.78 | 0.10 |
To gain further insight into the sources of error and performance of the automatic method, we computed a confusion matrix using data from hybrid manual annotation (Table 3). We show that the automatic pipeline had a false negative rate of 0.104. However, it had a higher false positive rate of 0.260, because many non-visitation events were captured. Further analysis showed that only 0.044 of detections were true false detections in which no bird was present, and that the majority (0.216) were non-visitation events, in which a bird was correctly detected by DeepMeerkat but was not visiting the nest.
Table 3. Confusion matrix for the automatic method, based on expert annotation of the video clips from the hybrid method.

| | Actual: positive | Actual: negative |
|---|---|---|
| Predicted: positive | True positive: 0.896 (0.11) | False positive: 0.260 (0.16); no presence: 0.044 (0.07), no visitation: 0.216 (0.14) |
| Predicted: negative | False negative: 0.104 (0.11) | True negative: NA |
Discussion
In this study, we set out to compare four different methods for collecting parental provisioning data from a pre-existing video dataset of house sparrows, to determine whether available open-source software can aid biologists in laborious data collection tasks. Using an automatic data collection method based on DeepMeerkat (Weinstein 2018b), we extracted visitation rates from parental provisioning videos of house sparrows and found that the results correlated positively with expert manual annotation. Our analysis also showed that the automatic method led to comparable predictions of increased fitness for parents that expressed more parental care behaviour, indicating that this new metric can be used as a proxy of parental investment (Trivers 1972, Schroeder et al. 2013). However, further analysis showed that the automatic method has high false positive and false negative error rates, since DeepMeerkat was designed to detect the presence of animals (Weinstein 2018b), not the specific behaviour of feeding the young. We highlight that a hybrid approach, in which raw videos are first processed through DeepMeerkat to obtain clips and then manually annotated, can produce estimates equivalent to expert manual annotation in approximately 20% of the time. With the rapid development of open-source tools, we encourage researchers with large video datasets to evaluate currently available software and, if required, to apply a hybrid annotation method, which can considerably reduce the time investment needed to process large backlogs of video data without a trade-off in accuracy.
Using the open-source software DeepMeerkat, we estimated an automatic presence rate that broadly correlated with expert manual annotation (r = 0.62) and predicted an increase in fledglings and recruits per brood. However, while a fully automatic method appears able to extract biologically meaningful presence rates for the Lundy sparrows, the effect size is quite low and should be interpreted with caution. This result may also not generalise well, because the current study system had a large sample size and a broad correlation between feeding bouts and the frequency of parents being in proximity to the nest (a weaker, alternative measure of parental care). The placement of the camera also contributed to the effectiveness of DeepMeerkat: it was pointed directly at the nest box, and most movements in the frame were caused by parents visiting the nest rather than by movements in the environment or by chicks. As such, DeepMeerkat was an appropriate tool for the current dataset, but may be less applicable to open nest systems (Reif and Tornberg 2006) or to cameras placed inside nest boxes (Zárybnická et al. 2016), which can dramatically increase the number of false detections from the movement of chicks or the background. This was also evident from the small subset of videos (1.31%) removed from the current dataset due to over-detection. An automatic pipeline that reduces these biases would require alternative, custom machine learning algorithms designed to recognise nest visits rather than just bird presence, using behavioural classification techniques (Conway et al. 2021, Ditria et al. 2021). However, this was largely outside the scope of the current study and represents an exciting future development.
While we show that DeepMeerkat was appropriate for the current study system, further analysis of the data collected by the hybrid annotation method revealed several sources of error in the automatic pipeline. The pipeline produced biased results, with a false negative rate averaging 10%, showing that the software can miss visitation events. However, this under-detection was more than offset by a high false positive rate (26%), mainly caused by the over-detection and misclassification of non-visitation events (22%) alongside false detections (4%), so that the overall estimate was inflated. The high false positive rate can be attributed to the original design of DeepMeerkat, which is intended to detect movement between frames rather than feeding behaviour. Fine-tuning DeepMeerkat with more frames from the current study system might therefore have reduced the false negative and false detection rates, but would not have corrected the inflated estimates, because these were mainly due to the misclassification of non-visitation events (22%). Another source of inflation is the fixed 7 s window used to define each event: when parents stay in frame for longer periods, a single visit can be split into several events. Better clustering algorithms could be applied to group events without relying on simple thresholds. Overall, DeepMeerkat was an appropriate choice of software for the current system, even though it produces biased estimates. While alternative approaches such as training an object detection model (Jocher et al. 2023) might reduce false detections, they would still yield similarly inflated measures unless a model is explicitly trained to detect behaviours.
Using a smaller subset of test videos, we show that crowd-sourced annotation and hybrid expert annotation were both comparable with the baseline set by expert manual annotation. Of all the methods, crowd-sourced annotation appeared to be the most accurate, albeit with the proportion of events relative to the baseline being slightly larger than one, suggesting that the undergraduate students may have identified some visitation events that were missed by expert manual annotation. This is possibly due to the multi-observer effect (Guay et al. 2013). We also show that the parental visitation estimates of individual annotators were consistent, falling within ± 10 visitation events of the baseline, against a mean of ~58.5 events per video (Supporting information). However, between-annotator variation of 10 visits can still introduce bias into provisioning estimates, hence we recommend that future researchers adopting a crowd-sourcing approach assign multiple observers to each video.
Even though crowd-sourcing appears to be the most accurate method, hybrid annotation by the investigator takes only approximately 20% of the annotation time of crowd-sourcing while still producing comparable and accurate visitation rates. While it takes more time overall once computer processing is considered (average ~102 min compared with 65.4 min), computational time is much cheaper than human labour, and videos can be processed non-stop (e.g. overnight). For example, 100 videos of 90 min each from a single field season could realistically be processed on one laptop in approximately 9000 min, i.e. roughly 6.25 days of non-stop processing, just under a week. We also show that the under-detection reported above (false negative rate: 10%) is negligible, such that after manual annotation in the hybrid method the estimates of provisioning rates remain comparable to the baseline. A weakness of the approach is that retrieving these missed detections requires reviewing the whole video from scratch, so researchers should be aware that visit rates estimated by this method may be biased. Researchers adopting this workflow in their own study systems should therefore first evaluate it and ensure that the false negative rate is negligible, as in the current case study. Nevertheless, we highlight the value of first using an open-source machine learning tool to pre-process videos for annotation, which reduces not only the annotation time itself but also the time needed to train observers in the appropriate experimental protocol.
For field protocols that involve video recordings, we encourage researchers to explore available open-source machine learning tools and to evaluate the methods against traditional protocols. If the automatic tools produce comparable estimates, the method can be used directly to automate data collection, but if the pipeline produces biased estimates, researchers can consider hybrid approaches to save annotation time. For researchers with similarly large time-lapse video datasets in the field, we also recommend the use of DeepMeerkat, instead of custom algorithms that require more domain-specific knowledge. DeepMeerkat not only has a user-friendly graphical interface but also requires limited investment in terms of machine learning knowledge or expensive hardware (e.g. GPUs).
Conclusion
Machine learning and computer vision approaches are becoming widely used in ecology (Borowiec et al. 2021, Couzin and Heins 2022), but adoption of these frameworks by ornithologists is still slow and limited. Here we present a case study of using open-source software to pre-process long-duration videos before annotation, considerably reducing annotation time without a significant reduction in accuracy. With the increase in available open-source tools to reduce manual annotation effort (Van Horn et al. 2015, Weinstein 2018b, Lauer et al. 2022, Walter and Couzin 2021), and the increasing computing literacy of ecology graduates (Farrell and Carey 2018), we encourage researchers to evaluate the applicability of these tools and, if needed, make use of hybrid approaches. This would not only remove the major time bottleneck that leaves data unanalysed and potentially wasted, but also allow more interesting biological hypotheses to be tested.
Acknowledgements
– We would like to thank the Lundy Landmark Trust and the Lundy Field Society for their ongoing support for our fieldwork.
Funding
– This research was supported by Imperial College London. JS was awarded a fellowship from the Volkswagen Foundation; a grant from the German Research Foundation: Deutsche Forschungsgemeinschaft; and grant no. CIG PCIG12-GA-2012-333096 from the European Research Council. TB was awarded grant no. NE/J024597/1. WDP and the Pearse lab were funded by NSF grant no. ABI-1759965 and by the UKRI/NERC NE/V009710/1. AHHC was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC 2117 – grant no. 422037984.
Permits
– Fieldwork was carried out, including tissue collection for DNA extraction, with the permission of the Lundy Company and Field Society and under permits from the UK Home Office (PP7009092 and PP5873078) and BTO (S:6308).
Author contributions
Alex Hoi Hang Chan: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review and editing (equal). Jingqi Liu: Data curation (equal); Methodology (equal); Visualization (equal); Writing – review and editing (equal). Terry Burke: Conceptualization (equal); Funding acquisition (equal); Project administration (equal); Writing – review and editing (equal). William D. Pearse: Conceptualization (equal); Funding acquisition (equal); Methodology (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Writing – review and editing (equal). Julia Schroeder: Conceptualization (equal); Data curation (equal); Funding acquisition (equal); Investigation (equal); Project administration (equal); Resources (equal); Supervision (equal); Writing – review and editing (equal).
Open Research
Transparent peer review
The peer review history for this article is available at https://publons.com/publon/10.1111/jav.03167.
Data availability statement
Code and data are available from the Zenodo Repository: https://doi.org/10.5281/zenodo.6411359 (Chan et al. 2023). The automatic processing pipeline is available from the Zenodo Digital Repository: https://zenodo.org/records/10190356. We provide a sample script in R to post-process outputs of DeepMeerkat into short video clips.