A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19

Published in IEEE International Conference on Big Data, 2024

This paper presents a domain-agnostic neurosymbolic framework for analyzing large-scale social media signals and monitoring depression, addiction, and anxiety during COVID-19.

Paper | Slides | Code

Why it matters

Social media generates large-scale, rapidly evolving signals (about 12 billion tweets during COVID-19) that can inform mental health surveillance.
Purely data-driven models struggle with emerging terms, including pandemic-specific slang, which limits near-real-time monitoring quality.
Large language model baselines can require high computational cost (more than 6 to 8 hours to converge in reported experiments), reducing practical rapid adaptation.
Public-health deployment needs both high accuracy and adaptability across depression, anxiety, and addiction signals.

What we did

Proposed a domain-agnostic neurosymbolic framework that integrates Word2Vec with multiple knowledge bases, including DSM-5, DAO, UMLS, and DBpedia.
Introduced SEDO to modulate tweet embeddings through a learned weight matrix formulated via a Sylvester equation.
Trained classifiers on semantically filtered data (900M to 600M tweets) and evaluated binary tasks for depression, addiction, and anxiety.
Achieved F1 scores above 92%, outperforming zero-shot LLM baselines (about 70% to 80% F1) while converging in 40 to 55 minutes.
Validated robustness with triangulation and ablation studies, including a +5.03% F1 gain for depression after SEDO fine-tuning.

How it works

Semantic Gap Management (B1): Train domain-specific LDA and Word2Vec models; dynamically update lexicons with knowledge bases and neologisms.
Metadata Scoring (B2): Compute semantic mapping and proximity scores; normalize index scores as supervision signals.
SEDO-based Infusion: Solve a Sylvester equation to learn a weight matrix that aligns tweet and knowledge-base embedding spaces.
Adaptive Classification (B3): Train binary classifiers (for example, Balanced Random Forest) for depression, addiction, and anxiety.
Validation: Perform triangulation and ablation studies to measure robustness and component-level gains.

Multi-stage architecture showing semantic gap management, metadata scoring, and SEDO-based adaptive classification. — Figure 1. Multi-stage neurosymbolic pipeline integrating semantic gap management, metadata scoring, and SEDO-based adaptive classification.

The architecture (Figure 1) shows how knowledge infusion and embedding modulation interact across three stages.

Key contributions

A multi-stage neurosymbolic architecture that combines shallow and semi-deep knowledge infusion for dynamic social media analysis.
Empirical results above 92% F1, with consistent gains over zero-shot LLM baselines (about 89% to 93.6% versus 70% to 80%).
Demonstrated efficiency: 40 to 55 minute convergence versus more than 6 to 8 hours for LLM baselines under similar settings.
Triangulation and ablation studies showing measurable SEDO and knowledge-integration effects (for example, +5.03% F1 for depression after fine-tuning).

Category	Model	Precision	Recall	F1-Score
Depression	LLama	74.23	70.57	72.34
Depression	Phi	71.67	66.42	68.95
Depression	Mistral	76.51	71.38	73.87
Depression	Neurosymbolic	90.45	87.29	88.84
Addiction	LLama	77.24	73.68	75.42
Addiction	Phi	73.32	69.75	71.49
Addiction	Mistral	78.45	74.67	76.51
Addiction	Neurosymbolic	92.18	88.36	90.22
Anxiety	LLama	78.56	74.82	76.66
Anxiety	Phi	74.38	70.61	72.43
Anxiety	Mistral	80.33	76.89	78.56
Anxiety	Neurosymbolic	93.25	90.52	91.85

Table I. Comparison between neurosymbolic classifiers and zero-shot LLMs across mental health categories.

As shown in Table I, the neurosymbolic approach consistently exceeds LLM performance in F1 score.

Category	Model	Precision	Recall	F1-Score
Depression	NB	84.85 (-24%)	82.68 (-25%)	83.75 (-27%)
Depression	RF	91.98 (-28%)	91.81 (-26%)	91.89 (-23%)
Depression	BRF	92.32 (-27%)	92.43 (-24%)	92.37 (-29%)
Depression	BSRF	94.12 (-29%)	93.02 (-22%)	93.57 (-28%)
Addiction	NB	82.74 (-26%)	80.46 (-21%)	81.58 (-25%)
Addiction	RF	90.02 (-22%)	90.36 (-20%)	90.19 (-23%)
Addiction	BRF	91.53 (-28%)	91.78 (-26%)	91.65 (-29%)
Addiction	BSRF	91.64 (-27%)	91.82 (-24%)	91.73 (-28%)
Anxiety	NB	82.53 (-25%)	81.87 (-24%)	82.20 (-22%)
Anxiety	RF	90.76 (-23%)	92.78 (-28%)	91.76 (-21%)
Anxiety	BRF	94.37 (-27%)	93.87 (-25%)	94.12 (-29%)
Anxiety	BSRF	93.46 (-24%)	93.85 (-27%)	93.65 (-28%)

Table II. Performance drop without SEDO highlights the contribution of embedding modulation (percentage decrease values shown in red).

Table II quantifies the degradation when SEDO is removed, underscoring its impact.

Recommended citation: Vedant Khandelwal, Manas Gaur, Ugur Kursuncu, Valerie Shalin, and Amit Sheth. (2024). "A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19." Proceedings of the IEEE International Conference on Big Data.
Download Paper | Download Slides