Evaluating xLSTM for structural coherence and efficient recurrent inference

A research project exploring whether modern recurrent architectures can generate long-form MIDI music with human-like structure, while avoiding the heavy decoding cost of Transformer-based approaches.

Structure + Efficiency

MOS: 4.04 (highest perceptual score)
A/B vs Museformer: 76.9% preference win rate
A/B vs Lookback RNN: 87.2% preference win rate
Speedup: 25× (recurrent-state inference)

Paper abstract

Abstract

Long-form symbolic music generation requires models to maintain both local musical plausibility and large-scale structure over long sequences. While Transformer-based systems such as Museformer have shown strong long-context modeling ability, they also incur substantial decoding cost. In this work, we investigate the Extended Long Short-Term Memory (xLSTM) architecture as an alternative for long-form symbolic music generation. We train xLSTM on the Lakh MIDI Dataset using REMIGEN encoding, implement recurrent-state inference for efficient autoregressive decoding, and evaluate the model both comparatively and through xLSTM-specific long-sequence analysis. Our comparative framework combines musical quality metrics, human listening tests, and three self-similarity-matrix-based structural coherence metrics: Block Coherence, Repetition Density, and Off-diagonal Similarity. We also introduce Decode Memory Growth Rate (DMGR), a rate-based metric for comparing memory growth during autoregressive decoding across architectures. The results show that xLSTM is the closest model to human music on all three structural coherence metrics, achieves the highest Mean Opinion Score, wins pairwise preference tests against Museformer and Lookback RNN, and is most often mistaken for human-composed music in the Turing test. On the efficiency side, recurrent-state decoding yields up to 25× speedup over the original parallel-style generation loop, and xLSTM maintains near-constant decoding memory growth of 0.49 MB per 1k generated tokens, compared with substantially higher memory growth for Museformer. Additional analysis shows that xLSTM extrapolates beyond its 4,096-token training context, although grammar errors rise gradually at longer lengths. Overall, the study shows that xLSTM is a viable architecture for long-form symbolic music generation and offers a strong practical trade-off between structural quality and decoding efficiency.

Approach

Methodology

MIDI Dataset

The Lakh MIDI Dataset was cleaned and converted into symbolic event sequences.

REMIGEN Encoding

Music was represented using event tokens such as bars, tempo, instruments, pitch, duration, and velocity.
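As a rough illustration of the event-token idea (the token names below are hypothetical stand-ins, not the actual REMIGEN vocabulary), one bar of notes can be flattened into a token sequence like this:

```python
# Illustrative sketch of event-token encoding. Token names are made up
# for this example; the real REMIGEN spec defines its own token set.
from dataclasses import dataclass

@dataclass(frozen=True)
class Note:
    pitch: int      # MIDI pitch, 0-127
    duration: int   # length in time-grid steps
    velocity: int   # MIDI velocity, 0-127

def encode_bar(notes, tempo=120, instrument=0):
    """Flatten one bar of notes into a sequence of event tokens."""
    tokens = ["Bar", f"Tempo_{tempo}", f"Instrument_{instrument}"]
    for n in notes:
        tokens += [f"Pitch_{n.pitch}", f"Duration_{n.duration}", f"Velocity_{n.velocity}"]
    return tokens

print(encode_bar([Note(60, 4, 90), Note(64, 4, 90)]))
```

The model then learns next-token prediction over sequences of such tokens, so bars, tempo changes, and notes all live in one vocabulary.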

xLSTM Training

The model was trained as a next-token predictor using the Helibrunna training framework.

Recurrent Inference

A recurrent-state generator was implemented to reduce decoding cost and enable long-form generation.
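The key property of recurrent-state decoding is that a fixed-size hidden state is carried forward across steps, so each new token costs constant time and memory rather than re-attending over the whole prefix. A minimal sketch of the loop, with a toy `step` function standing in for the real xLSTM cell:

```python
VOCAB = 16  # toy vocabulary size

def step(token, state):
    """Toy stand-in for one xLSTM cell step: returns (logits, new_state)."""
    new_state = (state * 31 + token + 1) % 97          # fixed-size state update
    logits = [(new_state * (i + 3)) % 11 for i in range(VOCAB)]
    return logits, new_state

def generate(prompt, n_new):
    state, logits = 0, None
    for tok in prompt:                    # ingest the prompt, carrying state
        logits, state = step(tok, state)
    out = list(prompt)
    for _ in range(n_new):
        # Greedy pick; a real decoder would sample from softmax(logits).
        tok = max(range(VOCAB), key=lambda i: logits[i])
        out.append(tok)
        logits, state = step(tok, state)  # O(1) work per generated token
    return out

print(len(generate([1, 2, 3], 5)))  # 8
```

Because only `state` is kept between steps, decoding memory stays essentially flat as the sequence grows, which is what the DMGR results below measure.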

Experiments

Experiment Setup and Implementation

Comparative Models

The proposed xLSTM model was compared with Museformer and Lookback RNN. Museformer represents a Transformer-based long-context baseline, while Lookback RNN represents a recurrent baseline designed to capture repetition.

Evaluation Design

Evaluation combined modeling metrics, musical quality metrics, SSM-based structural coherence metrics, memory and generation-time analysis, and human listening tests including MOS, pairwise A/B preference, and Turing-style discrimination.
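As a sketch of how the SSM-based metrics work (the actual feature extraction and metric definitions follow the paper), a self-similarity matrix holds pairwise similarity between segments of a piece, and Off-diagonal Similarity summarizes how much material recurs away from the main diagonal:

```python
# Hedged sketch of an SSM-based structure metric on toy per-segment
# feature vectors; real features would be derived from the music itself.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def self_similarity(features):
    """Pairwise cosine self-similarity matrix of per-segment features."""
    return [[cosine(a, b) for b in features] for a in features]

def off_diagonal_similarity(S):
    """Mean similarity over all i != j entries."""
    n = len(S)
    vals = [S[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(vals) / len(vals)

# Two identical segments plus one contrasting segment.
S = self_similarity([[1, 0], [1, 0], [0, 1]])
print(round(off_diagonal_similarity(S), 3))  # 0.333
```

Block Coherence and Repetition Density are computed from the same matrix but look at block structure and repeated diagonal stripes, respectively.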

Findings

Results and Analysis

Mean Opinion Score

13 participants · 153 stimulus evaluations · 5-point Likert scale · α = 0.05

Model Struct. Coherence Musical Flow Overall Quality Motivic Consist. Harmonic Coh. MOS (overall)
xLSTM ★ 4.18 ± 0.23 4.00 ± 0.25 4.00 ± 0.22 4.29 ± 0.32 4.12 ± 0.42 4.04 ± 0.21
Museformer 3.67 ± 0.29 3.67 ± 0.28 3.67 ± 0.28 3.62 ± 0.43 3.71 ± 0.38 3.67 ± 0.25
Lookback RNN 2.67 ± 0.32 2.51 ± 0.32 2.45 ± 0.30 2.58 ± 0.46 2.50 ± 0.43 2.56 ± 0.29

Kruskal-Wallis H = 49.62, p < 0.0001. All pairwise Mann-Whitney U tests significant (p < 0.05).

xLSTM scored highest on every criterion, most markedly on motivic consistency (4.29) — suggesting its long-range recurrent memory supports more coherent development of musical ideas over time.

Pairwise A/B Preference

39 comparisons per pair · 117 total · binomial significance test

Pair                         Total  Preferred   Win %  p-value  Significant
xLSTM vs Museformer          39     xLSTM       76.9%  0.001    Yes
xLSTM vs Lookback RNN        39     xLSTM       87.2%  < 0.001  Yes
Museformer vs Lookback RNN   39     Museformer  84.6%  < 0.001  Yes
xLSTM's preference margins are well above chance (50%) in both of its pairings, and binomial tests confirm that none of the three outcomes is attributable to chance.
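The two-sided binomial test used here can be reproduced with the Python standard library: under the null hypothesis of no preference, wins follow Binomial(n, 0.5), which is symmetric, so the two-sided p-value is twice the tail probability.

```python
# Exact two-sided binomial test for a symmetric null (p = 0.5).
from math import comb

def binom_two_sided(wins, n):
    tail = max(wins, n - wins)
    p_one = sum(comb(n, k) for k in range(tail, n + 1)) / 2**n
    return min(1.0, 2 * p_one)

# 30 of 39 listeners preferred xLSTM over Museformer (76.9%).
print(round(binom_two_sided(30, 39), 4))  # 0.0011
```

This matches the reported p = 0.001 for the xLSTM vs Museformer pairing (30/39 ≈ 76.9%).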

Turing Test

13 participants · 10 clips each (4 AI + 6 human) · labelled Human or AI

Category           N   Rated Human  Rated AI  Human Rate  Correct %  p-value
AI – xLSTM         26  20           6         76.9%       23.1%      0.009
AI – Museformer    26  8            18        30.8%       69.2%      0.076
AI – Lookback RNN  26  11           15        42.3%       57.7%      0.557
Human Clips        52  27           25        51.9%       51.9%      0.890
xLSTM's detection rate of 23.1% is significantly below chance (p = 0.009) — listeners were systematically more likely to label xLSTM output as human than as AI, a level of perceptual realism that even exceeds the human reference clips.

Efficiency Analysis

Decode Memory Growth Rate (DMGR) · recurrent-state inference vs parallel loop

Inference Speedup: 25× (recurrent-state decoding over the original parallel-style generation loop)
xLSTM DMGR: 0.49 MB per 1k generated tokens (near-constant memory growth)
Context Extrapolation: 12k tokens tested beyond the 4,096-token training context
Training Context: 4,096 tokens (sequence length used during xLSTM training)
Recurrent-state inference replaces the quadratic-in-length attention cost of Transformer decoding with a constant-size state update per token, giving xLSTM a 25× speedup and near-constant memory growth. Museformer's memory growth is substantially higher at equivalent sequence lengths.
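The DMGR idea can be sketched as the least-squares slope of decoder memory against the number of generated tokens, reported in MB per 1k tokens. The checkpoint numbers below are synthetic, purely for illustration:

```python
# Sketch of the Decode Memory Growth Rate (DMGR) computation: record
# memory at checkpoints during generation, then fit a linear slope.
def dmgr(tokens, mem_mb):
    """Least-squares slope of memory (MB) per 1,000 generated tokens."""
    n = len(tokens)
    mx, my = sum(tokens) / n, sum(mem_mb) / n
    num = sum((x - mx) * (y - my) for x, y in zip(tokens, mem_mb))
    den = sum((x - mx) ** 2 for x in tokens)
    return 1000 * num / den

# Synthetic checkpoints showing ~0.5 MB growth per 1k tokens.
checkpoints = [1000, 2000, 3000, 4000]
memory = [10.5, 11.0, 11.5, 12.0]
print(round(dmgr(checkpoints, memory), 2))  # 0.5
```

A rate-based slope rather than a raw total makes the metric comparable across architectures and sequence lengths, which is how the 0.49 MB/1k figure for xLSTM is contrasted with Museformer.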

Generated samples

Demos

xLSTM Demo 01

Long-form generated symbolic music rendered as audio with a visual piano-roll / playback video.


Summary

Conclusion

This project shows that xLSTM is a viable architecture for long-form symbolic music generation. Its recurrent-state inference enables efficient long-sequence decoding, while its generated music demonstrates strong structural coherence and favorable human-listening results. Future work can explore longer training contexts, alternative xLSTM configurations, constrained decoding, and larger listening studies.

People

Team and Supervisors

Research Team

Haritha Bandara
Yohan Senanayake
Chamodi Senaratne

Supervisors

Dr. Isuru Nawinne
Ms. Isuri Devindi