A research project exploring whether modern recurrent architectures can generate long-form MIDI music with human-like structure, while avoiding the heavy decoding cost of Transformer-based approaches.
## Approach
- The Lakh MIDI Dataset was cleaned and converted into symbolic event sequences.
- Music was represented with event tokens for bars, tempo, instruments, pitch, duration, and velocity.
- The model was trained as a next-token predictor using the Helibrunna training framework.
- A recurrent-state generator was implemented to reduce decoding cost and enable long-form generation.
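The event representation above can be sketched as follows. This is a hypothetical illustration of the token families named in the list (the project's exact vocabulary and quantization are not specified here); the function `tokenize_note` and its field names are assumptions for the example.

```python
# Hypothetical sketch of an event-token vocabulary with the families
# listed above (bar, tempo, instrument, pitch, duration, velocity);
# not the project's actual token set.

def tokenize_note(bar, tempo_bpm, instrument, pitch, duration, velocity):
    """Map one note onto symbolic event tokens."""
    return [
        f"BAR_{bar}",
        f"TEMPO_{tempo_bpm}",
        f"INST_{instrument}",
        f"PITCH_{pitch}",      # MIDI pitch, 0-127
        f"DUR_{duration}",     # duration in quantized ticks (assumed unit)
        f"VEL_{velocity}",     # MIDI velocity, 0-127
    ]

tokens = tokenize_note(bar=0, tempo_bpm=120, instrument=0,
                       pitch=60, duration=4, velocity=90)
```

A whole piece then becomes one flat token sequence that the next-token predictor is trained on.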
## Experiments
The proposed xLSTM model was compared with Museformer and Lookback RNN. Museformer represents a Transformer-based long-context baseline, while Lookback RNN represents a recurrent baseline designed to capture repetition.
Evaluation combined modeling metrics, musical quality metrics, SSM-based structural coherence metrics, memory and generation-time analysis, and human listening tests including MOS, pairwise A/B preference, and Turing-style discrimination.
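The SSM-based structural-coherence idea can be illustrated with a minimal self-similarity matrix over per-bar feature vectors. This is a generic sketch, not the project's exact metric; the feature extraction and any downstream coherence score are assumptions.

```python
import numpy as np

# Illustrative sketch (not the project's exact metric): a self-similarity
# matrix (SSM) over per-bar feature vectors. Repeated bars show up as
# high off-diagonal entries, which structure metrics can then score.

def self_similarity_matrix(bar_features):
    """bar_features: (num_bars, dim) array; returns the cosine SSM."""
    x = np.asarray(bar_features, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-9, None)   # unit-normalize each bar
    return x @ x.T                       # cosine similarity matrix

bars = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # bar 3 repeats bar 1
ssm = self_similarity_matrix(bars)
```

In the toy input, `ssm[0, 2]` is 1.0 (exact repetition) while `ssm[0, 1]` is 0.0, which is the kind of pattern a structure metric rewards.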
## Findings
### Mean opinion scores (MOS)

13 participants · 153 stimulus evaluations · 5-point Likert scale · α = 0.05
| Model | Struct. Coherence | Musical Flow | Overall Quality | Motivic Consist. | Harmonic Coh. | MOS (overall) |
|---|---|---|---|---|---|---|
| xLSTM ★ | 4.18 ± 0.23 | 4.00 ± 0.25 | 4.00 ± 0.22 | 4.29 ± 0.32 | 4.12 ± 0.42 | 4.04 ± 0.21 |
| Museformer | 3.67 ± 0.29 | 3.67 ± 0.28 | 3.67 ± 0.28 | 3.62 ± 0.43 | 3.71 ± 0.38 | 3.67 ± 0.25 |
| Lookback RNN | 2.67 ± 0.32 | 2.51 ± 0.32 | 2.45 ± 0.30 | 2.58 ± 0.46 | 2.50 ± 0.43 | 2.56 ± 0.29 |
Kruskal-Wallis H = 49.62, p < 0.0001. All pairwise Mann-Whitney U tests significant (p < 0.05).
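The rank-sum machinery behind the pairwise Mann-Whitney U tests above can be sketched in a few lines. The rating vectors below are hypothetical Likert scores for illustration only; the study's raw per-participant data are not reproduced here.

```python
# Minimal sketch of the Mann-Whitney U statistic (ties counted as 0.5),
# the basis of the pairwise tests reported above. Inputs are hypothetical
# 5-point Likert ratings, not the study's actual data.

def mann_whitney_u(a, b):
    """U statistic for sample `a` against sample `b`."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5   # ties contribute half a win
    return u

u = mann_whitney_u([5, 4, 4, 5], [3, 3, 4, 2])  # large U → first sample ranks higher
```

A significance test then compares U against its null distribution (in practice via `scipy.stats.mannwhitneyu` rather than this loop).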
### Pairwise A/B preference

39 comparisons per pair · 117 total · binomial significance test
| Pair | Total | Preferred | Win % | p-value | Significant |
|---|---|---|---|---|---|
| xLSTM vs Museformer | 39 | xLSTM | 76.9% | 0.001 | Yes |
| xLSTM vs Lookback RNN | 39 | xLSTM | 87.2% | < 0.001 | Yes |
| Museformer vs Lookback RNN | 39 | Museformer | 84.6% | < 0.001 | Yes |
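The binomial significance test behind the table above can be reproduced exactly with the standard library. The win count of 30/39 for xLSTM vs Museformer is inferred from the reported 76.9%; the function itself is a generic exact two-sided test, not project code.

```python
from math import comb

# Exact two-sided binomial test (null: each model preferred with p = 0.5),
# as used for the pairwise-preference table above. The 30/39 win count is
# inferred from the reported 76.9% rate.

def binomial_two_sided_p(wins, n, p=0.5):
    """Sum the probabilities of all outcomes no more likely than `wins`."""
    probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    observed = probs[wins]
    return min(1.0, sum(q for q in probs if q <= observed + 1e-12))

p_val = binomial_two_sided_p(30, 39)  # ≈ 0.001, matching the table
```

The same test applied to the Turing-style table below explains why 20/26 "rated human" verdicts for xLSTM are significant (p = 0.009) while the human clips sit at chance.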
### Turing-style discrimination

13 participants · 10 clips each (4 AI + 6 human) · labelled Human or AI
| Category | N | Rated Human | Rated AI | Rated Human % | Correct % | p-value |
|---|---|---|---|---|---|---|
| AI – xLSTM | 26 | 20 | 6 | 76.9% | 23.1% | 0.009 |
| AI – Museformer | 26 | 8 | 18 | 30.8% | 69.2% | 0.076 |
| AI – Lookback RNN | 26 | 11 | 15 | 42.3% | 57.7% | 0.557 |
| Human Clips | 52 | 27 | 25 | 51.9% | 51.9% | 0.890 |
### Memory and generation time

Decode Memory Growth Rate (DMGR) · recurrent-state inference vs parallel loop
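The efficiency contrast behind the DMGR analysis can be sketched as a back-of-the-envelope memory model: a Transformer's KV cache grows with every decoded token, while a recurrent state stays fixed. The layer and dimension counts below are hypothetical element counts, not measurements from the project.

```python
# Illustrative memory model (hypothetical sizes, not project measurements):
# KV-cache decoding grows linearly with sequence length, while an
# xLSTM-style recurrent state is constant, so its decode-memory growth
# rate (DMGR) is zero.

def kv_cache_elems(step, n_layers=12, n_heads=8, d_head=64):
    """Elements in a Transformer KV cache after `step` decoded tokens."""
    return 2 * n_layers * n_heads * d_head * step  # 2 = keys + values

def recurrent_state_elems(step, n_layers=12, d_state=1024):
    """Elements in a fixed recurrent state; independent of `step`."""
    return n_layers * d_state

growth_kv = kv_cache_elems(2000) - kv_cache_elems(1000)    # linear growth
growth_rnn = recurrent_state_elems(2000) - recurrent_state_elems(1000)  # 0
```

Under this model the recurrent generator's memory footprint is flat over arbitrarily long pieces, which is what enables long-form decoding.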
## Generated samples
Long-form generated symbolic music rendered as audio with a visual piano-roll / playback video.
## Summary
This project shows that xLSTM is a viable architecture for long-form symbolic music generation. Its recurrent-state inference enables efficient long-sequence decoding, while its generated music demonstrates strong structural coherence and favorable human-listening results. Future work can explore longer training contexts, alternative xLSTM configurations, constrained decoding, and larger listening studies.
## People