A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19

Published in IEEE International Conference on Big Data, 2024

This paper presents a domain-agnostic neurosymbolic framework for analyzing large-scale social media signals and monitoring depression, addiction, and anxiety during COVID-19.

Paper | Slides | Code

Why it matters

  • Social media generates large-scale, rapidly evolving signals (about 12 billion tweets during COVID-19) that can inform mental health surveillance.
  • Purely data-driven models struggle with emerging terms, including pandemic-specific slang, which limits near-real-time monitoring quality.
  • Large language model baselines can be computationally expensive (more than 6 to 8 hours to converge in the reported experiments), which makes rapid adaptation impractical.
  • Public-health deployment needs both high accuracy and adaptability across depression, anxiety, and addiction signals.

What we did

  • Proposed a domain-agnostic neurosymbolic framework that integrates Word2Vec with multiple knowledge bases, including DSM-5, DAO, UMLS, and DBpedia.
  • Introduced SEDO to modulate tweet embeddings through a learned weight matrix formulated via a Sylvester equation.
  • Trained classifiers on semantically filtered data (reduced from 900M to 600M tweets) and evaluated binary tasks for depression, addiction, and anxiety.
  • Achieved F1 scores above 92%, outperforming zero-shot LLM baselines (about 70% to 80% F1) while converging in 40 to 55 minutes.
  • Validated robustness with triangulation and ablation studies, including a +5.03% F1 gain for depression after SEDO fine-tuning.

How it works

  • Semantic Gap Management (B1): Train domain-specific LDA and Word2Vec models; dynamically update lexicons with knowledge bases and neologisms.
  • Metadata Scoring (B2): Compute semantic mapping and proximity scores; normalize index scores as supervision signals.
  • SEDO-based Infusion: Solve a Sylvester equation to learn a weight matrix that aligns tweet and knowledge-base embedding spaces.
  • Adaptive Classification (B3): Train binary classifiers (for example, Balanced Random Forest) for depression, addiction, and anxiety.
  • Validation: Perform triangulation and ablation studies to measure robustness and component-level gains.
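
The metadata scoring step (B2) can be illustrated with a minimal sketch. It assumes, hypothetically, that each tweet and each knowledge-base concept is already a dense vector, that proximity is the cosine similarity to the best-matching concept, and that normalization is min-max scaling. The function names and the random data are illustrative and are not taken from the paper.

```python
import numpy as np

def cosine_similarity_matrix(tweets, concepts):
    """Pairwise cosine similarity between the rows of two matrices."""
    # Guard against zero-length vectors before dividing by the norm.
    t_norm = np.maximum(np.linalg.norm(tweets, axis=1, keepdims=True), 1e-12)
    c_norm = np.maximum(np.linalg.norm(concepts, axis=1, keepdims=True), 1e-12)
    return (tweets / t_norm) @ (concepts / c_norm).T

def proximity_scores(tweets, concepts):
    """Assumed proximity score: similarity to the closest concept, per tweet."""
    return cosine_similarity_matrix(tweets, concepts).max(axis=1)

def min_max_normalize(scores):
    """Rescale scores to [0, 1]; constant input maps to all zeros."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

# Synthetic stand-ins for tweet and concept embeddings (illustrative only).
rng = np.random.default_rng(0)
concepts = rng.normal(size=(5, 32))    # 5 hypothetical lexicon concepts
tweets = rng.normal(size=(100, 32))    # 100 hypothetical tweet embeddings
tweets[0] = concepts[2]                # one tweet that matches a concept exactly

scores = proximity_scores(tweets, concepts)
normalized = min_max_normalize(scores)  # could serve as weak supervision signals
```

In this toy setup the tweet that coincides with a concept receives the highest normalized score, which is the behavior one would want from a proximity-based supervision signal.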

Figure 1. Multi-stage neurosymbolic pipeline integrating semantic gap management, metadata scoring, and SEDO-based adaptive classification.

The architecture (Figure 1) shows how knowledge infusion and embedding modulation interact across three stages.
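
To make the Sylvester-equation step concrete, here is a minimal sketch under stated assumptions. It borrows a common encoder-decoder objective (as used in semantic autoencoders), which may differ from the paper's exact SEDO formulation: minimize ||X - W^T S||^2 + lam * ||W X - S||^2, where X holds tweet embeddings as columns, S holds the corresponding knowledge-base embeddings, and W is the weight matrix. Setting the gradient to zero gives A W + W B = C with A = S S^T, B = lam * X X^T, and C = (1 + lam) * S X^T. The names X, S, W, lam, and the random data are illustrative assumptions.

```python
import numpy as np

def solve_sylvester_kron(A, B, C):
    """Solve A W + W B = C via the Kronecker form.

    With column-major vec(), vec(A W + W B) = (I_d kron A + B^T kron I_k) vec(W).
    Suitable for small matrices only; the linear system has size (k*d) x (k*d).
    """
    k, d = C.shape
    M = np.kron(np.eye(d), A) + np.kron(B.T, np.eye(k))
    w = np.linalg.solve(M, C.flatten(order="F"))
    return w.reshape((k, d), order="F")

def objective(W, X, S, lam):
    """Assumed encoder-decoder loss: decode error plus weighted encode error."""
    decode_error = np.linalg.norm(X - W.T @ S) ** 2
    encode_error = np.linalg.norm(W @ X - S) ** 2
    return decode_error + lam * encode_error

# Synthetic stand-ins: d-dim tweet vectors, k-dim knowledge vectors, n samples.
rng = np.random.default_rng(0)
d, k, n, lam = 16, 8, 200, 0.5
X = rng.normal(size=(d, n))
S = rng.normal(size=(k, n))

# Coefficients of the Sylvester equation derived from the assumed objective.
A = S @ S.T
B = lam * (X @ X.T)
C = (1.0 + lam) * (S @ X.T)

W = solve_sylvester_kron(A, B, C)
residual = np.linalg.norm(A @ W + W @ B - C)
print(f"Sylvester residual: {residual:.2e}")
```

The Kronecker form keeps the sketch dependent on NumPy alone. At realistic embedding sizes, a dedicated solver such as `scipy.linalg.solve_sylvester(A, B, C)`, which solves the same A W + W B = C form, would be the more practical choice.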

Key contributions

  • A multi-stage neurosymbolic architecture that combines shallow and semi-deep knowledge infusion for dynamic social media analysis.
  • Empirical results above 92% F1, with consistent gains over zero-shot LLM baselines (about 89% to 93.6% for the neurosymbolic classifiers versus 70% to 80% for the LLMs).
  • Demonstrated efficiency: 40 to 55 minute convergence versus more than 6 to 8 hours for LLM baselines under similar settings.
  • Triangulation and ablation studies showing measurable SEDO and knowledge-integration effects (for example, +5.03% F1 for depression after fine-tuning).
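
| Category   | Model         | Precision | Recall | F1-Score |
|------------|---------------|-----------|--------|----------|
| Depression | Llama         | 74.23     | 70.57  | 72.34    |
| Depression | Phi           | 71.67     | 66.42  | 68.95    |
| Depression | Mistral       | 76.51     | 71.38  | 73.87    |
| Depression | Neurosymbolic | 90.45     | 87.29  | 88.84    |
| Addiction  | Llama         | 77.24     | 73.68  | 75.42    |
| Addiction  | Phi           | 73.32     | 69.75  | 71.49    |
| Addiction  | Mistral       | 78.45     | 74.67  | 76.51    |
| Addiction  | Neurosymbolic | 92.18     | 88.36  | 90.22    |
| Anxiety    | Llama         | 78.56     | 74.82  | 76.66    |
| Anxiety    | Phi           | 74.38     | 70.61  | 72.43    |
| Anxiety    | Mistral       | 80.33     | 76.89  | 78.56    |
| Anxiety    | Neurosymbolic | 93.25     | 90.52  | 91.85    |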
Table I. Comparison between neurosymbolic classifiers and zero-shot LLMs across mental health categories.

As shown in Table I, the neurosymbolic approach exceeds every zero-shot LLM in F1 score for all three categories, leading the strongest baseline (Mistral) by about 13 to 15 points.
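
As a quick consistency check on Table I, the sketch below recomputes F1 from the reported precision and recall and measures the neurosymbolic margin over the best LLM in each category. It assumes F1 is the harmonic mean of the two reported values; small deviations from the published figures may come from rounding or from how scores were averaged. Only the neurosymbolic and Mistral rows are included, and the variable names are illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from Table I.
table_i = {
    ("Depression", "Mistral"): (76.51, 71.38, 73.87),
    ("Depression", "Neurosymbolic"): (90.45, 87.29, 88.84),
    ("Addiction", "Mistral"): (78.45, 74.67, 76.51),
    ("Addiction", "Neurosymbolic"): (92.18, 88.36, 90.22),
    ("Anxiety", "Mistral"): (80.33, 76.89, 78.56),
    ("Anxiety", "Neurosymbolic"): (93.25, 90.52, 91.85),
}

# Largest gap between the recomputed and the reported F1 across these rows.
max_deviation = max(
    abs(f1_from_precision_recall(p, r) - reported)
    for p, r, reported in table_i.values()
)

# Reported F1 margin of the neurosymbolic model over Mistral, per category.
margins = {
    category: table_i[(category, "Neurosymbolic")][2] - table_i[(category, "Mistral")][2]
    for category in ("Depression", "Addiction", "Anxiety")
}

print(f"max deviation from reported F1: {max_deviation:.3f}")
for category, margin in margins.items():
    print(f"{category}: +{margin:.2f} F1 points over Mistral")
```

Under this assumption the recomputed values agree with the reported F1 scores to within a few hundredths of a point.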

| Category   | Model | Precision    | Recall       | F1-Score     |
|------------|-------|--------------|--------------|--------------|
| Depression | NB    | 84.85 (-24%) | 82.68 (-25%) | 83.75 (-27%) |
| Depression | RF    | 91.98 (-28%) | 91.81 (-26%) | 91.89 (-23%) |
| Depression | BRF   | 92.32 (-27%) | 92.43 (-24%) | 92.37 (-29%) |
| Depression | BSRF  | 94.12 (-29%) | 93.02 (-22%) | 93.57 (-28%) |
| Addiction  | NB    | 82.74 (-26%) | 80.46 (-21%) | 81.58 (-25%) |
| Addiction  | RF    | 90.02 (-22%) | 90.36 (-20%) | 90.19 (-23%) |
| Addiction  | BRF   | 91.53 (-28%) | 91.78 (-26%) | 91.65 (-29%) |
| Addiction  | BSRF  | 91.64 (-27%) | 91.82 (-24%) | 91.73 (-28%) |
| Anxiety    | NB    | 82.53 (-25%) | 81.87 (-24%) | 82.20 (-22%) |
| Anxiety    | RF    | 90.76 (-23%) | 92.78 (-28%) | 91.76 (-21%) |
| Anxiety    | BRF   | 94.37 (-27%) | 93.87 (-25%) | 94.12 (-29%) |
| Anxiety    | BSRF  | 93.46 (-24%) | 93.85 (-27%) | 93.65 (-28%) |
Table II. Performance drop without SEDO highlights the contribution of embedding modulation (percentage decrease values shown in parentheses).

Table II quantifies the degradation when SEDO is removed: the reported decreases range from 20% to 29% across precision, recall, and F1 for every classifier and category.

Recommended citation: Vedant Khandelwal, Manas Gaur, Ugur Kursuncu, Valerie Shalin, and Amit Sheth. (2024). "A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19." Proceedings of the IEEE International Conference on Big Data.