Effectiveness of Using Artificial Intelligence for Early Child Development Screening

Abstract
This study presents a novel approach to recognizing emotions in infants using machine learning models. To address the lack of infant-specific datasets, a custom dataset of infants' faces was created by extracting images from the AffectNet dataset. The dataset was then used to train various machine learning models with different parameters. The best-performing model was evaluated on the City Infant Faces dataset. The proposed deep learning model achieved an accuracy of 94.63% in recognizing positive, negative, and neutral facial expressions. These results provide a benchmark for the performance of machine learning models in infant emotion recognition and suggest potential applications in developing emotion-sensitive technologies for infants. This study fills a gap in the literature on emotion recognition, which has largely focused on adults or children, and highlights the importance of developing infant-specific datasets and evaluating different parameters to achieve accurate results.


Introduction
The early detection of developmental and behavioral disabilities in children is crucial for effective intervention and improved outcomes. According to the Centers for Disease Control and Prevention (CDC), one in six children aged three to 17 years has a behavioral or developmental disability, including autism, attention deficit hyperactivity disorder (ADHD), dyslexia, and mental health issues. Early intervention is key in reducing the adverse effects of these neurodevelopmental disorders [1]. To support the early detection of developmental issues, the American Academy of Pediatrics recommends screening children for developmental and behavioral issues at 9, 18, and 30 months [2]. However, self-report measures have limitations in screening young children's behaviors and development, including potential bias and limited coverage of behaviors and contexts [3]. Therefore, objective assessments or multiple data sources should be considered.
Computer vision (CV) and artificial intelligence (AI) can potentially improve medical diagnoses and treatment by analyzing past patient cases to predict future care. AI algorithms, such as neural networks and evolutionary algorithms, are used in various professions to tackle complex problems [4−6]. Specifically, computer vision, a subfield of AI, uses computer models to simulate human visual perception for tasks like object recognition and classification, with OpenCV being widely used in CV research [7]. Convolutional neural networks (CNNs) require less preparation than conventional techniques and can be implemented in various ways, making them well suited to medical diagnosis and assessment. Recognizing emotions in infants is a crucial aspect of early development essential for adaptive social functioning. Recent advances in deep learning algorithms have significantly improved facial emotion recognition performance, an active topic in CV and AI [8,9]. However, recognizing emotions in infants presents additional challenges, including age-specific data collection and developing classifiers tailored to infant facial expressions [10−12]. Therefore, careful consideration of information extraction and classifier selection is required to achieve accurate and reliable results.
Several studies have focused on recognizing child facial expressions using the Child Affective Facial Expression (CAFE) dataset. One study achieved 56% accuracy using a Deep Boltzmann Machine with supervision [13]. Another study utilized transfer learning and a CNN network trained on the CK+ dataset to achieve 76% validation accuracy on a subset of the CAFE dataset [14]. A third study used a MobileNetV3 pre-trained network and achieved 65.78% accuracy [15]. Finally, a fourth study utilized multiple datasets to train the model and achieved a balanced accuracy of 66.9% on the CAFE dataset [16]. However, it is important to note that children's facial expressions differ from those of infants [17]. This study aimed to use facial expression detection algorithms on mobile technology for early child development screening in Asian cultures. The study evaluated the algorithm's accuracy and established a baseline for machine learning models' recognition of infant emotions. The study results provide further information on developing culturally appropriate and effective screening tools and drive future research for improved approaches. Additionally, it contributes to the creation of infant-friendly, emotion-sensitive technologies to support young children's emotional well-being and development.

Machine learning algorithm.
Artificial intelligence (AI) is the oldest and broadest branch of computer science, encompassing all aspects of simulating cognitive capabilities for real-world problem-solving and developing systems that learn and think like humans. Therefore, it is often referred to as machine intelligence to differentiate it from human intelligence [18]. Recent research emphasizes that AI systems should not only address pattern recognition issues but also create causal models of the environment that allow explanation and comprehension [19]. Support Vector Machines (SVMs), which categorize data, are widely used in facial emotion recognition [20−22]. Research shows that SVMs can handle high-dimensional data and resist noise, making them suitable for facial expression recognition [23]. SVMs are based on finding the decision boundary that maximizes the margin, which is the distance between the boundary and the data points from different classes closest to it [24]. Several studies have used SVMs to examine different feature descriptors, including facial landmarks and the center of gravity [25], as well as SVM, KNN, LDA, Gabor, Haar, and LBP features for classification [26]. Other studies have used face geometry-based feature extraction to examine logistic regression, LDA, KNN, classification and regression trees, naive Bayes, and SVM [10]. Overall, SVMs have performed well in these tasks, particularly when using face geometry-based feature extraction. However, the choice of feature descriptors and classification methods may depend on the specific application and dataset being used.
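The max-margin principle described above can be illustrated with a small numeric sketch; the weight vector and bias below are arbitrary values chosen for illustration, not parameters of a trained SVM.

```python
import numpy as np

# Illustrative linear decision boundary w.x + b = 0 (hypothetical values,
# not a trained model). An SVM chooses w and b so that the margin between
# the boundary and the closest points of each class is as wide as possible.
w = np.array([2.0, 1.0])   # hypothetical weight vector
b = -1.0                   # hypothetical bias

def signed_distance(x):
    """Signed distance of a feature vector x from the hyperplane w.x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Support vectors satisfy |w.x + b| = 1, so the margin width is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(w)
```

A point lying exactly on the boundary, such as (1, −1) here, has a signed distance of zero; maximizing `margin_width` is what makes the classifier robust to noise in high-dimensional feature spaces.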

Deep learning algorithm.
Convolutional neural networks (CNNs) are a widely used subclass of artificial neural networks, especially in computer vision tasks [21]. Since their remarkable success in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2012, CNNs have become the most established deep learning algorithm [12]. One of the main advantages of CNNs is their ability to learn automatically without human supervision, making them very popular [8]. However, their performance can be limited when training data is scarce or exclusive, especially with one-dimensional signals. Furthermore, deep CNNs face challenges such as training instability and feature duplication in facial expression recognition, especially when running on low-end devices [9,27]. Nonetheless, researchers are developing new methods to enhance deep CNNs for facial expression recognition.
Most facial emotion recognition research has focused on adults, with CNNs yielding better results than customized methods. Several studies have achieved high accuracy rates for facial expression recognition using CNNs. For instance, Shaees et al. achieved 99.3% accuracy on thermal images using the Natural Visible and Infrared Expression (NVE) database and 98.3% accuracy on graphic illustrations using the Cohn-Kanade+ (CK+) database [28]. Similarly, Webb et al. achieved a state-of-the-art classification rate of 99.52% on a combined corpus of datasets [29]. Hassouneh et al. used virtual markers to gather six facial emotions from 55 college students and achieved a maximum recognition rate of 99.81% using CNN for emotion detection with facial landmarks [30].
Few studies have been conducted on infant facial expression recognition due to limited tagged datasets. However, some studies have proposed CNN-based techniques for infant facial expression recognition, achieving promising results. For example, one study achieved an accuracy of 87.16% and an F1-score of 62.54% using a CNN-based technique on the RCLA & NBH Smile Dataset. Another study proposed a multi-label CNN for detecting infant facial action units, achieving kappa values of 0.69 to 0.93 against human coding [31].
Several methods can improve CNNs' emotion recognition performance. Weight initialization and data augmentation can improve CNNs' performance in emotion recognition [29,32]. Ruiz-Garcia et al. used the stacked convolutional auto-encoder (SCAE) technique to pre-train each network layer as an unsupervised auto-encoder, and the results of this approach were compared with other methods [29]. Porcu et al. found that combining artificial images with other methods gave the best results [32]. Using a pre-trained visual object recognition network to recognize facial emotions and fine-tuning a pre-trained model with a cross-dataset strategy can also improve CNNs' performance [33,34].
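As a minimal illustration of the data-augmentation idea mentioned above (not the specific augmentation pipelines used in the cited studies), mirrored and shifted variants of a face image can be generated as follows:

```python
import numpy as np

def augment(image):
    """Generate simple augmented variants of an image (H x W array): the
    original, a horizontal mirror, and small horizontal shifts. This is a
    minimal sketch of the augmentation idea; library pipelines offer far
    richer transforms (rotation, zoom, brightness, etc.)."""
    variants = [image, np.fliplr(image)]
    for shift in (-2, 2):
        variants.append(np.roll(image, shift, axis=1))  # horizontal shift
    return variants

# A 4 x 4 toy "image" yields four augmented variants.
samples = augment(np.arange(16).reshape(4, 4))
```

Each augmented variant is presented to the network as an additional training example, which reduces overfitting when labeled infant images are scarce.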

System overview.
The proposed system aimed to use machine learning models to recognize emotions in infants. Figure 1 presents an overview of the system, which involves a custom dataset of infant faces extracted from the AffectNet dataset, categorized into positive, neutral, and negative facial expressions. The dataset was then used to train a deep learning model that could recognize infant emotions. Additionally, a mobile app was developed to collect facial image data of infants for this study. Once the model was trained, it was integrated into the mobile app for infant development screening. This app can be used by caregivers, pediatricians, or other professionals to screen a child's socio-emotional development using emotion recognition in infants.

Mobile application, Spark@Grow.
In this study, a mobile application called Spark@Grow was developed for the purpose of early child development screening. The Xamarin Forms framework was used to develop the app, enabling its publication on both the Android and iOS platforms by the first two authors, MLG and HYT. The psychologists (PJW and SWW) and pediatricians (THT and GLT) in the study team provided the early child development screening items, which were implemented in the app. The app was initially released in two versions, a public version and a data-collecting version. The public version was published on the Google Play Store and the Apple App Store, while the data-collecting version was shared with recruited parents and volunteers to collect image data for building the locally collected infant image dataset.

AffectNet dataset.
This dataset contains over 1,000,000 facial photos gathered from the Internet by searching three major search engines with 1,250 emotion-related keywords in six languages. Approximately half of the photos were manually tagged for seven distinct facial expressions and the intensity of valence and arousal. Furthermore, a random subset of 30,000 images was used to prevent out-of-memory errors during training.
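Drawing such a fixed-size random subset can be sketched as follows; the corpus size matches the one stated above, while the seed is an illustrative choice for reproducibility:

```python
import numpy as np

# Sketch of sampling a fixed-size random subset of image indices so the
# whole corpus never has to be held in memory at once. Each index would
# map to one image file in the (hypothetical) AffectNet directory listing.
rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility
subset = rng.choice(1_000_000, size=30_000, replace=False)
```

Sampling without replacement guarantees that no image is duplicated in the reduced training set.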

City infant faces database.
The City Infant dataset is a collection of photographs that depict infants' faces displaying positive, negative, and neutral facial expressions. The dataset was developed and validated by collecting photographs from parents of infants aged between 0 and 12 months and rating the images based on facial expression, intensity, clarity, genuineness, and valence. The final database comprises 154 images with rating agreements of at least 75%, including 60 photographs of positive infant faces, 54 of negative infant faces, and 40 of neutral infant faces. The images exhibit high criterion validity and good test-retest reliability, making the City Infant dataset a useful and valid tool for researchers studying the emotional expressions of infants [35].

Non-infant dataset.
The following datasets were used to build the custom dataset for classifying non-infant facial images: the Japanese Female Facial Expression (JAFFE) dataset, which consists of 213 images of different facial expressions from ten Japanese female subjects [36]; the Extended Cohn-Kanade (CK+) dataset, which is a collection of 593 video sequences from 123 subjects of various ages, genders, and heritages [37]; and the Child Affective Facial Expression (CAFE) dataset, which is a collection of images of children between the ages of two and eight [38]. Sample images from the CAFE dataset are not displayed in this paper because the dataset is copyright-protected.

Locally collected dataset.
Figure 2 displays the sample images that were gathered from local communities through various online platforms and communication channels, such as Facebook, Instagram, WhatsApp, and Telegram. Parents were invited to complete a Google Form and upload an image of their child's facial expression to support this study, with the understanding that the child's emotional images would only be used to evaluate the classification algorithm and not be disclosed to anyone else. A total of 88 facial emotion images were categorized into positive, neutral, and negative categories. The images were classified based on the emotion that the participants perceived; if the parent believed that the image showed the child expressing joy, it was placed in the positive category, while images depicting sadness, crying, or anger were categorized as negative. The neutral category comprised images that did not display any intense emotion, such as joy, sadness, or anger.
Figure 3 depicts the process of creating the custom infant dataset by extracting infant faces from the AffectNet dataset. A classifier was developed using transfer learning with a pre-trained network; images from the JAFFE, CK+, and CAFE datasets served as the non-infant faces class, while the City Infant dataset served as the infant faces class. Around 6,000 infant face images were extracted, and five volunteers manually removed non-infant faces. The remaining images were categorized into positive, neutral, and negative emotion expression categories, similar to the City Infant dataset. Finally, the custom dataset was utilized to train the machine learning models, and their performance was evaluated on the City Infant dataset for comparison purposes.
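The filtering step described above can be sketched as follows; `infant_probability` is a hypothetical stand-in for the transfer-learned infant/non-infant classifier, and the candidate records are invented for illustration:

```python
# Sketch of the filtering step: a binary infant / non-infant classifier is
# applied to candidate face crops, and only confident infant predictions
# are kept for subsequent manual review by the volunteers.
def infant_probability(face):
    # Hypothetical stand-in: a real implementation would run the
    # pre-trained transfer-learning network on the crop and return
    # the predicted probability that it depicts an infant face.
    return face["score"]

def filter_infant_faces(candidates, threshold=0.5):
    """Keep only candidates the classifier rates as likely infant faces."""
    return [c for c in candidates if infant_probability(c) >= threshold]

# Invented candidate records for demonstration only.
faces = [{"id": 1, "score": 0.91}, {"id": 2, "score": 0.20}, {"id": 3, "score": 0.77}]
kept = filter_infant_faces(faces)
```

Automated filtering followed by manual cleanup trades some recall for a dramatic reduction in labeling effort on a million-image corpus.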

Hardware and software.
The deep learning model was developed and trained on an Nvidia RTX 2060 graphics card with 6GB VRAM. The model was built using the Keras API of the TensorFlow software library, which is widely used in machine learning and deep learning research. Python, NumPy, and OpenCV were also used to facilitate the training and evaluation of the model, including preprocessing the dataset, feature extraction, and performing data augmentation. OpenCV provides image and video processing tools, while NumPy supports large, multi-dimensional arrays and matrices. These tools helped to improve the model's performance in image classification tasks such as facial expression recognition.

Pre-process the datasets.
To pre-process the images, OpenCV was used to detect facial landmarks such as the eyes, nose, and mouth, which were used to align and center the facial images. The background was then cropped out, leaving only the aligned face, which was resized to a consistent 224 x 224 pixels. This step aimed to improve the machine learning models' performance by ensuring that the relevant facial features were located and scaled consistently across all images. Resizing to a consistent size also reduced variability in the input data, potentially making it easier for the models to learn from the images.
Figure 4 shows the flowchart of the Infant Emotion Classification CNN model used in this study, designed specifically to recognize emotions in infant faces. The model had four blocks of convolutional layers, each followed by max pooling and dropout layers. The first block consisted of two convolutional layers with 32 filters and a kernel size of 3 x 3, which extracted features from the input image; the subsequent blocks increased the number of filters to 64, 128, and 256, respectively. The output was then passed through two fully connected (dense) layers with 64 units each and dropout layers, followed by a dense layer with as many units as classes and a softmax activation function. Finally, the model was trained using the Nadam optimization algorithm and evaluated on the validation data using the evaluate function.
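The crop-and-resize step above can be sketched without any imaging library; in the actual pipeline OpenCV performed landmark detection, alignment, and resizing, so this dependency-free nearest-neighbour version is only an approximation of that step, and the bounding box here is a made-up example:

```python
import numpy as np

def crop_and_resize(image, box, size=224):
    """Crop a detected face region from `image` (an H x W x C array) and
    resize it to size x size using nearest-neighbour sampling. A real
    pipeline would use OpenCV (e.g. cv2.resize) after landmark-based
    alignment; this sketch shows only the crop-and-resize idea."""
    x, y, w, h = box
    face = image[y:y + h, x:x + w]
    rows = np.arange(size) * face.shape[0] // size  # source row per output row
    cols = np.arange(size) * face.shape[1] // size  # source col per output col
    return face[rows][:, cols]

# Hypothetical 640 x 480 frame with an assumed face bounding box.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
aligned = crop_and_resize(frame, box=(200, 100, 150, 150))
```

Whatever the original face size, the output is always 224 x 224, which is what gives the downstream network a fixed input shape.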
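Based on the description above, the architecture could be sketched in Keras as follows; the dropout rates, padding mode, and the number of convolutional layers in the later blocks are assumptions, since the text does not specify them:

```python
# Sketch of the Infant Emotion Classification CNN as described in the text:
# four convolutional blocks (32, 64, 128, 256 filters, 3 x 3 kernels), each
# followed by max pooling and dropout, then two 64-unit dense layers and a
# softmax over the emotion classes, trained with the Nadam optimizer.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_classes=3):
    model = models.Sequential()
    model.add(tf.keras.Input(shape=(224, 224, 3)))
    for filters in (32, 64, 128, 256):
        # Two conv layers per block is stated for the first block and
        # assumed for the rest; "same" padding is also an assumption.
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.25))  # assumed rate
    model.add(layers.Flatten())
    for _ in range(2):
        model.add(layers.Dense(64, activation="relu"))
        model.add(layers.Dropout(0.5))   # assumed rate
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="nadam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
```

Training would then proceed with `model.fit(...)` on the custom dataset and `model.evaluate(...)` on the City Infant validation set, matching the workflow described in the text.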

Comparison between algorithms and architectures.
To compare the effectiveness of different methods in recognizing emotions in infants' faces, nine methods were evaluated: the Infant Emotion Classification CNN, ResNet50, MobileNetV2, MobileNetV3, VGG19, VGG16, Multiclass Support Vector Machine, Teachable Machine, and CustomVision.AI. The Infant Emotion Classification CNN is a model specifically designed for recognizing emotions in infants, while ResNet50 is a general-purpose CNN that has achieved success in various tasks. MobileNetV2 and MobileNetV3 are lightweight CNNs designed for efficient performance on devices with limited resources, and VGG19 and VGG16 are deeper CNNs that have achieved success in various tasks. The Multiclass Support Vector Machine is a traditional machine learning algorithm that does not require large datasets for training, while Teachable Machine is a tool for creating machine learning models with a simple graphical interface. CustomVision.AI is a cloud-based deep learning service that allows users to train and deploy models using a web interface.
To compare these methods, a custom dataset of infant faces extracted from the AffectNet dataset was used to train and evaluate each model. The City Infant dataset was used as the validation set. The accuracy of each method was measured on the validation set, and the time required to train each method on the training set was recorded. The CNN, ResNet50, MobileNetV2, MobileNetV3, VGG19, and VGG16 models were trained using the TensorFlow library and Keras API. The Multiclass Support Vector Machine was trained using the scikit-learn library with its default training procedure. The Teachable Machine model was trained by following the instructions on the Teachable Machine website, which involved uploading, labeling, and training the model using the provided interface. Similarly, the CustomVision.AI model was trained by following the instructions on the CustomVision.AI website, which involved creating a project, uploading, labeling, and training the model using the provided interface.
The accuracy of each method was measured by calculating the percentage of correctly classified samples in the validation set, and the training time for each method was measured by timing the execution of the training procedure. The performance of the methods was compared based on the accuracy and training time. Additionally, the relative simplicity and accessibility of each method were considered, as well as any additional resources or technical expertise required to use it. Based on the results, the most suitable method for implementing infant emotion recognition in a mobile application was selected.
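The two measurements described above can be sketched as follows; `train` is a placeholder for any of the nine training procedures, and the labels are invented for demonstration:

```python
import time
import numpy as np

def validation_accuracy(predicted, actual):
    """Percentage of correctly classified samples in the validation set."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return 100.0 * np.mean(predicted == actual)

def timed_training(train, *args):
    """Run a training procedure and return its result plus elapsed seconds."""
    start = time.perf_counter()
    result = train(*args)
    return result, time.perf_counter() - start

# Invented predictions vs. ground truth: 3 of 4 correct -> 75%.
acc = validation_accuracy([0, 1, 2, 1], [0, 1, 2, 2])
```

Reporting both numbers together is what allows the accuracy/training-time trade-off between the nine methods to be compared directly.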

Results and Discussion
Nine different models were evaluated on the dataset of infant facial expressions: the Infant Emotion Classification CNN, ResNet50, MobileNetV2, MobileNetV3, VGG19, VGG16, Multiclass Support Vector Machine (SVM), Teachable Machine, and CustomVision.AI. The results are shown in Table 1. The lower accuracy of VGG19 and VGG16 may reflect their lesser suitability for the task given the limited training data [41]. In terms of training speed, these models have relatively simple architectures and are therefore easier to train. Nevertheless, their lower accuracy compared with the ResNet models suggests that they may not be as effective at learning complex features from the data. The Infant Emotion Classification CNN and ResNet50 models took the longest to reach their highest accuracy, requiring 582 seconds for 94.63% accuracy and 4,979 seconds for 93.3% accuracy, respectively. Although the accuracy difference was only 1.33%, the Infant Emotion Classification CNN reached its highest accuracy with roughly 88% less training time. These two models have more complex architectures and are more computationally intensive to train; however, their higher accuracy compared to the other models suggests that they can learn more complex features from the data and generalize better to unseen examples, making them more effective at capturing the nuances of infant facial expressions.

Validation accuracy and training speed.
The results of the experiment are significant for the use of machine learning in the field of infant facial emotion recognition. The high accuracy of the Infant Emotion Classification CNN and ResNet50 models indicates that these models could be valuable for accurately detecting and classifying infants' emotions based on their facial expressions. Conversely, the lower performance of the VGG19 and VGG16 models and the intermediate performance of the MobileNetV2 and MobileNetV3 models suggest that these models might not be as suitable for this task. These findings underscore the importance of carefully selecting an appropriate machine learning model for the task at hand, considering both accuracy and training speed. For infant facial emotion recognition, the Infant Emotion Classification CNN and ResNet50 models seem to be the most effective, with high accuracy and relatively fast training speed.
Table 2 demonstrates that the classifier had the highest accuracy in detecting neutral emotion, with 82.3% of the images correctly classified. This was followed by positive emotion, with 78.8% accuracy, and negative emotion, with 69.8% accuracy. Overall, the classifier performed relatively well in detecting positive and neutral emotions, with accuracy rates above 70%. However, the accuracy for detecting negative emotion was lower, with almost a third of the images being misclassified. This might be due to the subtle nature of negative expressions, which could be more difficult to distinguish than more intense positive and neutral expressions. Finally, it is worth noting that the accuracy rates for positive and neutral emotions were similar even though the neutral category contained fewer images, which highlights the importance of having sufficient examples in each category to obtain reliable results.
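Per-class rates like those in Table 2 come from row-normalizing a confusion matrix; the counts below are invented for illustration (only the category sizes follow the City Infant dataset) and are not the study's actual confusion matrix:

```python
import numpy as np

# Illustrative confusion matrix (rows = true class, cols = predicted class).
# Class order: positive, neutral, negative. Counts are made up for
# demonstration; row totals match the City Infant category sizes (60/40/54).
confusion = np.array([
    [47,  8,  5],   # true positive-emotion images
    [ 4, 33,  3],   # true neutral images
    [ 9,  7, 38],   # true negative-emotion images
])

# Per-class accuracy = correct detections / total images of that class.
per_class = 100.0 * np.diag(confusion) / confusion.sum(axis=1)
```

Because each rate is normalized by its own row total, a class with fewer images (neutral here) can post a higher percentage from fewer correct detections, which is why category sizes matter when interpreting the rates.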

Conclusion
The proposed Infant Emotion Classification CNN model achieved the highest accuracy and quickest training time, demonstrating the effectiveness of machine learning techniques in identifying infant emotions. The model's high accuracy was due to the unique dataset created by extracting infant photos from the AffectNet dataset. The study's methodology involved testing several parameters; the Nadam optimizer, with an initial filter count of 32 and a dense-layer size of 64, produced the best results. These findings offer a useful benchmark for future research on identifying infant emotions and suggest potential applications for developing emotion-sensitive devices for infants. However, there is still room for improvement in the model's performance. Further research could explore using more extensive and diverse datasets and advanced machine learning models. Machine learning for infant emotion recognition has the potential to support the emotional well-being of infants by enabling the development of emotion-sensitive technologies. These results are helpful for other researchers working on similar tasks and provide a benchmark for evaluating various machine learning models' performance on infant emotion recognition. Further research can build upon these findings and explore the use of machine learning in other applications related to infant development and care.