Evaluating xLSTM for structural coherence and efficient recurrent inference

A research project exploring whether modern recurrent architectures can generate long-form MIDI music with human-like structure, while avoiding the heavy decoding cost of Transformer-based approaches.

Structure + Efficiency

MOS: 4.04 (highest perceptual score)
A/B vs Museformer: 76.9% preference win rate
A/B vs Lookback RNN: 87.2% preference win rate
Speedup: 25× (recurrent-state inference)

Paper abstract

Abstract

Long-form symbolic music generation requires models to maintain both local musical plausibility and large-scale structure over long sequences. While Transformer-based systems such as Museformer have shown strong long-context modeling ability, they also incur substantial decoding cost. In this work, we investigate the Extended Long Short-Term Memory (xLSTM) architecture as an alternative for long-form symbolic music generation. We train xLSTM on the Lakh MIDI Dataset using REMIGEN encoding, implement recurrent-state inference for efficient autoregressive decoding, and evaluate the model both comparatively and through xLSTM-specific long-sequence analysis. Our comparative framework combines musical quality metrics, human listening tests, and three self-similarity-matrix-based structural coherence metrics: Block Coherence, Repetition Density, and Off-diagonal Similarity. We also introduce Decode Memory Growth Rate (DMGR), a rate-based metric for comparing memory growth during autoregressive decoding across architectures. The results show that xLSTM is the closest model to human music on all three structural coherence metrics, achieves the highest Mean Opinion Score, wins pairwise preference tests against Museformer and Lookback RNN, and is most often mistaken for human-composed music in the Turing test. On the efficiency side, recurrent-state decoding yields up to 25× speedup over the original parallel-style generation loop, and xLSTM maintains near-constant decoding memory growth of 0.49 MB per 1k generated tokens, compared with substantially higher memory growth for Museformer. Additional analysis shows that xLSTM extrapolates beyond its 4,096-token training context, although grammar errors rise gradually at longer lengths. Overall, the study shows that xLSTM is a viable architecture for long-form symbolic music generation and offers a strong practical trade-off between structural quality and decoding efficiency.

Approach

Methodology

MIDI Dataset

The Lakh MIDI Dataset was cleaned and converted into symbolic event sequences.

REMIGEN Encoding

Music was represented using event tokens such as bars, tempo, instruments, pitch, duration, and velocity.
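As a rough illustration of the event-token idea (the token names below are hypothetical stand-ins, not the actual REMIGEN vocabulary), one bar of notes can be flattened into a token sequence like this:

```python
# Illustrative sketch of event-token encoding. Token names are made up
# for this example; the real REMIGEN spec defines its own token set.
from dataclasses import dataclass

@dataclass(frozen=True)
class Note:
    pitch: int      # MIDI pitch, 0-127
    duration: int   # length in time-grid steps
    velocity: int   # MIDI velocity, 0-127

def encode_bar(notes, tempo=120, instrument=0):
    """Flatten one bar of notes into a sequence of event tokens."""
    tokens = ["Bar", f"Tempo_{tempo}", f"Instrument_{instrument}"]
    for n in notes:
        tokens += [f"Pitch_{n.pitch}", f"Duration_{n.duration}", f"Velocity_{n.velocity}"]
    return tokens

print(encode_bar([Note(60, 4, 90), Note(64, 4, 90)]))
```

The model then learns next-token prediction over sequences of such tokens, so bars, tempo changes, and notes all live in one vocabulary.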

xLSTM Training

The model was trained as a next-token predictor using the Helibrunna training framework.

Recurrent Inference

A recurrent-state generator was implemented to reduce decoding cost and enable long-form generation.
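The key property of recurrent-state decoding is that a fixed-size hidden state is carried forward across steps, so each new token costs constant time and memory rather than re-attending over the whole prefix. A minimal sketch of the loop, with a toy `step` function standing in for the real xLSTM cell:

```python
VOCAB = 16  # toy vocabulary size

def step(token, state):
    """Toy stand-in for one xLSTM cell step: returns (logits, new_state)."""
    new_state = (state * 31 + token + 1) % 97          # fixed-size state update
    logits = [(new_state * (i + 3)) % 11 for i in range(VOCAB)]
    return logits, new_state

def generate(prompt, n_new):
    state, logits = 0, None
    for tok in prompt:                    # ingest the prompt, carrying state
        logits, state = step(tok, state)
    out = list(prompt)
    for _ in range(n_new):
        # Greedy pick; a real decoder would sample from softmax(logits).
        tok = max(range(VOCAB), key=lambda i: logits[i])
        out.append(tok)
        logits, state = step(tok, state)  # O(1) work per generated token
    return out

print(len(generate([1, 2, 3], 5)))  # 8
```

Because only `state` is kept between steps, decoding memory stays essentially flat as the sequence grows, which is what the DMGR results below measure.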

Experiments

Experiment Setup and Implementation

Comparative Models

The proposed xLSTM model was compared with Museformer and Lookback RNN. Museformer represents a Transformer-based long-context baseline, while Lookback RNN represents a recurrent baseline designed to capture repetition.

Evaluation Design

Evaluation combined modeling metrics, musical quality metrics, SSM-based structural coherence metrics, memory and generation-time analysis, and human listening tests including MOS, pairwise A/B preference, and Turing-style discrimination.
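As a sketch of how the SSM-based metrics work (the actual feature extraction and metric definitions follow the paper), a self-similarity matrix holds pairwise similarity between segments of a piece, and Off-diagonal Similarity summarizes how much material recurs away from the main diagonal:

```python
# Hedged sketch of an SSM-based structure metric on toy per-segment
# feature vectors; real features would be derived from the music itself.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def self_similarity(features):
    """Pairwise cosine self-similarity matrix of per-segment features."""
    return [[cosine(a, b) for b in features] for a in features]

def off_diagonal_similarity(S):
    """Mean similarity over all i != j entries."""
    n = len(S)
    vals = [S[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(vals) / len(vals)

# Two identical segments plus one contrasting segment.
S = self_similarity([[1, 0], [1, 0], [0, 1]])
print(round(off_diagonal_similarity(S), 3))  # 0.333
```

Block Coherence and Repetition Density are computed from the same matrix but look at block structure and repeated diagonal stripes, respectively.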

Findings

Results and Analysis

Mean Opinion Score

13 participants · 153 stimulus evaluations · 5-point Likert scale · α = 0.05

Model Struct. Coherence Musical Flow Overall Quality Motivic Consist. Harmonic Coh. MOS (overall)
xLSTM ★ 4.18 ± 0.23 4.00 ± 0.25 4.00 ± 0.22 4.29 ± 0.32 4.12 ± 0.42 4.04 ± 0.21
Museformer 3.67 ± 0.29 3.67 ± 0.28 3.67 ± 0.28 3.62 ± 0.43 3.71 ± 0.38 3.67 ± 0.25
Lookback RNN 2.67 ± 0.32 2.51 ± 0.32 2.45 ± 0.30 2.58 ± 0.46 2.50 ± 0.43 2.56 ± 0.29

Kruskal-Wallis H = 49.62, p < 0.0001. All pairwise Mann-Whitney U tests significant (p < 0.05).

xLSTM scored highest on every criterion, most markedly on motivic consistency (4.29) — suggesting its long-range recurrent memory supports more coherent development of musical ideas over time.

Pairwise A/B Preference

39 comparisons per pair · 117 total · binomial significance test

Pair                         Total  Preferred   Win %  p-value  Significant
xLSTM vs Museformer          39     xLSTM       76.9%  0.001    Yes
xLSTM vs Lookback RNN        39     xLSTM       87.2%  < 0.001  Yes
Museformer vs Lookback RNN   39     Museformer  84.6%  < 0.001  Yes
xLSTM's preference margins are well above chance (50%) in both of its pairings, and binomial tests confirm that none of the three outcomes is attributable to chance.
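The two-sided binomial test used here can be reproduced with the Python standard library: under the null hypothesis of no preference, wins follow Binomial(n, 0.5), which is symmetric, so the two-sided p-value is twice the tail probability.

```python
# Exact two-sided binomial test for a symmetric null (p = 0.5).
from math import comb

def binom_two_sided(wins, n):
    tail = max(wins, n - wins)
    p_one = sum(comb(n, k) for k in range(tail, n + 1)) / 2**n
    return min(1.0, 2 * p_one)

# 30 of 39 listeners preferred xLSTM over Museformer (76.9%).
print(round(binom_two_sided(30, 39), 4))  # 0.0011
```

This matches the reported p = 0.001 for the xLSTM vs Museformer pairing (30/39 ≈ 76.9%).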

Turing Test

13 participants · 10 clips each (4 AI + 6 human) · labelled Human or AI

Category           N   Rated Human  Rated AI  Human Rate  Correct %  p-value
AI – xLSTM         26  20           6         76.9%       23.1%      0.009
AI – Museformer    26  8            18        30.8%       69.2%      0.076
AI – Lookback RNN  26  11           15        42.3%       57.7%      0.557
Human Clips        52  27           25        51.9%       51.9%      0.890
xLSTM's detection rate of 23.1% is significantly below chance (p = 0.009) — listeners were systematically more likely to label xLSTM output as human than as AI, a level of perceptual realism that even exceeds the human reference clips.

Efficiency Analysis

Decode Memory Growth Rate (DMGR) · recurrent-state inference vs parallel loop

Inference Speedup: 25× (recurrent-state decoding over the original parallel-style generation loop)
xLSTM DMGR: 0.49 MB per 1k generated tokens (near-constant memory growth)
Context Extrapolation: 12k tokens tested beyond the 4,096-token training context
Training Context: 4,096 tokens (sequence length used during xLSTM training)
Recurrent-state inference replaces the quadratic-in-length attention cost of Transformer decoding with a constant-size state update per token, giving xLSTM a 25× speedup and near-constant memory growth. Museformer's memory growth is substantially higher at equivalent sequence lengths.
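The DMGR idea can be sketched as the least-squares slope of decoder memory against the number of generated tokens, reported in MB per 1k tokens. The checkpoint numbers below are synthetic, purely for illustration:

```python
# Sketch of the Decode Memory Growth Rate (DMGR) computation: record
# memory at checkpoints during generation, then fit a linear slope.
def dmgr(tokens, mem_mb):
    """Least-squares slope of memory (MB) per 1,000 generated tokens."""
    n = len(tokens)
    mx, my = sum(tokens) / n, sum(mem_mb) / n
    num = sum((x - mx) * (y - my) for x, y in zip(tokens, mem_mb))
    den = sum((x - mx) ** 2 for x in tokens)
    return 1000 * num / den

# Synthetic checkpoints showing ~0.5 MB growth per 1k tokens.
checkpoints = [1000, 2000, 3000, 4000]
memory = [10.5, 11.0, 11.5, 12.0]
print(round(dmgr(checkpoints, memory), 2))  # 0.5
```

A rate-based slope rather than a raw total makes the metric comparable across architectures and sequence lengths, which is how the 0.49 MB/1k figure for xLSTM is contrasted with Museformer.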

Generated samples

Demos

xLSTM Demo 01

Long-form generated symbolic music rendered as audio with a visual piano-roll / playback video.


Summary

Conclusion

This project shows that xLSTM is a viable architecture for long-form symbolic music generation. Its recurrent-state inference enables efficient long-sequence decoding, while its generated music demonstrates strong structural coherence and favorable human-listening results. Future work can explore longer training contexts, alternative xLSTM configurations, constrained decoding, and larger listening studies.

People

Team and Supervisors

Research Team

Haritha Bandara
Yohan Senanayake
Chamodi Senaratne

Supervisors

Dr. Isuru Nawinne
Ms. Isuri Devindi