Automatic Speech Recognition (ASR) faced challenges in accuracy and noise robustness, particularly in Bahasa Indonesia. This research addressed the limitations of single feature extraction methods, namely Mel-Frequency Cepstral Coefficients (MFCC), which were sensitive to noise, and Relative Spectral Transform – Perceptual Linear Prediction (RASTA-PLP), which was less effective at representing spectral detail, by proposing a hybrid approach that combined both techniques with Long Short-Term Memory (LSTM) models. MFCC contributed spectral accuracy, while RASTA-PLP improved noise robustness, yielding a more adaptive and informative acoustic representation. The evaluation demonstrated that the hybrid method outperformed both single-feature and non-extraction approaches, achieving a Character Error Rate (CER) of 0.5245 on clean data and 0.8811 on noisy data, as well as a Word Error Rate (WER) of 0.9229 on clean data and 1.0015 on noisy data. Although the hybrid approach required longer training times and higher memory usage, it remained stable and effective in reducing transcription errors. These findings suggested that the hybrid method was an optimal solution for Indonesian speech recognition across varied acoustic conditions.
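The CER and WER figures reported above are edit-distance ratios: the Levenshtein distance between the reference and hypothesis transcripts (over characters for CER, over words for WER), normalized by the reference length. A minimal, standard-library-only Python sketch (function names are illustrative, not from the paper's code) shows how such metrics are typically computed; note that a value above 1.0, like the noisy-data WER of 1.0015, simply means the edit distance exceeded the reference length:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or word lists)."""
    # dp[j] holds the distance between ref[:i] and hyp[:j] for the current row i
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i  # prev = dp[i-1][j-1] from the previous row
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,         # deletion
                dp[j - 1] + 1,     # insertion
                prev + (r != h),   # substitution (free if symbols match)
            )
    return dp[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("halo dunia ini", "halo dunia itu")` gives 1/3, since one of the three reference words must be substituted.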
SUBMITTED: 17 February 2025
ACCEPTED: 07 April 2025
PUBLISHED: 13 April 2025
SUBMITTED to ACCEPTED: 49 days
DOI: https://doi.org/10.53623/gisa.v5i1.605