Abstract:
Multimodal approaches draw on heterogeneous sources of data to improve perception, inference, and decision-making in intelligent systems. Rather than relying on a single channel, such as text, audio, or facial appearance, a multimodal framework combines these information streams to reach more reliable and context-sensitive predictions. The present study develops an advanced music recommendation system that incorporates real-time multimodal emotion recognition and age-based filtering to deliver highly personalized song recommendations in the Indian/Bollywood genre. The system uses a hybrid machine learning design that integrates Face-API.js for facial expression analysis, achieving up to 87 percent recognition accuracy, with a DistilRoBERTa transformer-based sentiment classifier for text-based inputs, achieving 85–90 percent accuracy. A curated dataset of 600 Bollywood music tracks is enriched with psychoacoustic features (valence, energy, tempo, and danceability) and matched to detected emotional states through a weighted affect-to-audio scoring model. This mapping narrows the gap between perceived emotional states and musical qualities, making recommendations more relevant in dynamic real-world contexts. Experimental measurements confirm strong system performance, with end-to-end response times held under 2 seconds and face-processing throughput of 25–30 FPS. In a two-week user study with approximately 50 participants, the system achieved a mean satisfaction score of 8.5/10, an 83 percent recommendation-intention rate, and 72 percent repeat engagement. These results indicate that hybrid multimodal emotion analysis, demographic adaptation, and cultural relevance are key to advancing next-generation music recommender systems that provide reliable, affect-sensitive, and human-centered interaction experiences.
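To make the weighted affect-to-audio scoring concrete, the sketch below shows one plausible form such a model could take: each detected emotion is assumed to map to a target profile over valence, energy, tempo, and danceability, and tracks are ranked by their weighted distance to that profile. This is a minimal illustration only; the emotion labels, feature weights, and target values are assumptions, not the parameters used in the study.

```typescript
// Minimal sketch of a weighted affect-to-audio scoring model.
// All names, weights, and target profiles are illustrative assumptions,
// not the study's actual parameters.

interface Track {
  title: string;
  valence: number;      // 0..1, musical positivity
  energy: number;       // 0..1
  tempo: number;        // BPM, normalized inside score()
  danceability: number; // 0..1
}

// Hypothetical mapping from a detected emotion to a target audio profile.
const emotionProfiles: Record<string, { valence: number; energy: number; tempo: number; danceability: number }> = {
  happy: { valence: 0.9, energy: 0.8, tempo: 0.7, danceability: 0.8 },
  sad:   { valence: 0.2, energy: 0.3, tempo: 0.3, danceability: 0.3 },
  angry: { valence: 0.3, energy: 0.9, tempo: 0.8, danceability: 0.5 },
};

// Assumed feature weights (sum to 1); the paper's model may differ.
const weights = { valence: 0.4, energy: 0.3, tempo: 0.15, danceability: 0.15 };

function score(track: Track, emotion: string, maxTempo = 200): number {
  const target = emotionProfiles[emotion];
  if (!target) return 0;
  const tempoNorm = Math.min(track.tempo / maxTempo, 1);
  // Higher score = smaller weighted distance between track and target profile.
  return (
    1 -
    (weights.valence * Math.abs(track.valence - target.valence) +
      weights.energy * Math.abs(track.energy - target.energy) +
      weights.tempo * Math.abs(tempoNorm - target.tempo) +
      weights.danceability * Math.abs(track.danceability - target.danceability))
  );
}

// Rank a catalog for a detected emotion and return the top-k recommendations.
function recommend(tracks: Track[], emotion: string, k = 10): Track[] {
  return [...tracks].sort((a, b) => score(b, emotion) - score(a, emotion)).slice(0, k);
}
```

In this sketch, age filtering would simply pre-filter the track list before `recommend` is called; the distance-based score is one common choice, and the actual system may combine modalities or weight features differently.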