computerComposer

Intro:

I am training a machine learning model on the classical piano compositional works of Catherine Rollin with the goal of having it create music in her writing style. This research will be used in the presentation of the lecture Human Imagination or AI? at the 2024 Music Teachers National Association conference. 

Step 0: Setup

The important libraries in this project are TensorFlow, Keras, and music21. Getting TensorFlow/Keras to run on an M1 chip in Python was a very tedious process. Only certain versions will work with the M1, and even then, some versions of TensorFlow will not work with certain versions of Python. It will save you a million headaches to use Python 3.8 and TensorFlow 2.13.

Finally, even with the correct versions of everything, you still need to set up a virtual environment using conda for everything to work properly on the M1. This video will help navigate that task.

I began with a foundation of code based on this GitHub repo. However, in my experience Skuldur’s model could not learn without tweaks: the loss function does not decrease, accuracy does not rise, and so the resulting model predicts the same note over and over. So I had to play with changing the layers/dropout/batch normalization/etc. This blog helped me understand what to tweak, and Softology adds MATLAB plotting, which makes everything a lot more readable. I still did things differently in setting up my model, but these sources were super useful for getting started. More details on my methods below!
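For orientation before the steps below, here is a minimal sketch of what this style of pipeline boils down to: pull note/chord tokens out of MIDI with music21 and slice them into fixed-length input windows (the window size is discussed in Step 2). Paths and helper names here are illustrative, not the repo’s exact code:

# Minimal sketch of the note/chord extraction and fixed-length windowing
# (paths and helper names are illustrative, not the repo's exact code).
import glob
import numpy as np
from music21 import converter, note, chord

SEQUENCE_LENGTH = 100  # window size discussed in Step 2

def get_note_tokens(midi_dir="training_midi/*.mid"):
    """Flatten every MIDI file into note/chord tokens like 'C4' or 'C4.E4.G4'."""
    tokens = []
    for path in glob.glob(midi_dir):
        score = converter.parse(path)
        for element in score.flatten().notes:
            if isinstance(element, note.Note):
                tokens.append(element.nameWithOctave)                               # e.g. 'C4'
            elif isinstance(element, chord.Chord):
                tokens.append(".".join(p.nameWithOctave for p in element.pitches))  # e.g. 'C4.E4.G4'
    return tokens

def make_training_windows(tokens):
    """Slide a SEQUENCE_LENGTH window over the token stream: input is the window, target is the next token."""
    vocab = sorted(set(tokens))
    token_to_int = {t: i for i, t in enumerate(vocab)}
    xs, ys = [], []
    for i in range(len(tokens) - SEQUENCE_LENGTH):
        window = tokens[i:i + SEQUENCE_LENGTH]
        xs.append([token_to_int[t] for t in window])
        ys.append(token_to_int[tokens[i + SEQUENCE_LENGTH]])
    x = np.reshape(xs, (len(xs), SEQUENCE_LENGTH, 1)) / float(len(vocab))  # normalized input for the LSTM
    return x, np.array(ys), vocab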

Step 1: Organize the data

There were 22 pieces already written in Finale, which could then be easily exported to MIDI. I transposed these pieces all to the key of C major or A minor (depending on whether they were major or minor to start with). This seemed like a logical organizational step so that the model learns from broader musical characteristics rather than key signature. The first runs involved just those pieces, but it became clear that overfitting would be an issue without a larger data set.

I then obtained 187 PDFs of other pieces by Catherine Rollin courtesy of Alfred Publishing. Using PlayScore 2, I translated those into MIDI and then transposed them. With 209 examples in the training data set, I began running tests.

Step 2: Structuring the Model

My initial goal was simply to get the model to learn. This involved removing the batch normalization layers, adding a dropout layer, and removing an activation layer (some of these might get added back in later). The “lstm” and “predict” scripts in this GitHub repo are the first versions of the model that appear to learn. One thing to note is that the Skuldur base code considers only one parameter for each instance in the sequence: note/chord. This means that the model does not consider velocity, rhythm, meter, ADSR, resonance, portamento, etc. Later in testing I will try adding parameters. Another thing to note is that Skuldur’s model takes training data in 100-note sequences. This means that longer musical works are broken into more training sequences, and therefore the longer pieces have an outsized effect on the training data. First I began experimenting with different batch sizes, using a window of 200 epochs as a starting point. Here are the results for batch sizes of 64, 128, and 256:

Fig1: 1 Parameter (Notes/Chords), 64 Batch, 200 Epoch *convergence around Epoch 76 (trained to 77) with 86% accuracy*

Listen to three examples of this model HERE.

Fig2: 1 Parameter (Notes/Chords), 128 Batch, 200 Epoch *convergence around Epoch 84 (trained to 92) with 92% accuracy*

Listen to three examples of this model HERE.

Fig3: 1 Parameter (Notes/Chords), 256 Batch, 200 Epoch *convergence around Epoch 156 (trained to 189) with 97% accuracy*

Listen to three examples of this model HERE.

I want to use a model with the epoch count corresponding to when the graph first hits its limit. This is what I am calling the epoch of “convergence” (as opposed to when it finished training).

Step 3: Adding a parameter

Adding velocity seems like the most straightforward way to squeeze a little more “musicality” out of the machine. If the model is learning correctly, this parameter should mean we begin to hear “phrasing,” the way a musician shapes a sequence of notes in a passage with expressive articulation.

The model currently takes in an instance of a note as note+octave, or an instance of an interval/chord as note+octave.note+octave. This means a middle C would be C4, and a major chord built on it would be C4.E4.G4. So to add velocity I decided to append it with an underscore. This would look like C4_60 or C4_60.E4_80.G4_70.

Upon the first run the loss function did not decrease at all. Investigating further, I realized that this encoding creates a possible outcome class for every single note+velocity combination. That means 128 possible notes (not even including chords) combining with 128 possible velocities. Big number! Too big for this little GPU, I fear, so I experimented with quantizing the velocities of the training data to the nearest 20. This leaves 6 possible velocities (20, 40, 60, 80, 100, 120). This code can be found under “lstmTake2” and “predictTake2”.
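A minimal sketch of the quantization and token format I mean (helper names are illustrative, not the exact lstmTake2 code):

# Hedged sketch of velocity quantization plus the underscore token format.
def quantize_velocity(velocity, step=20):
    """Round a MIDI velocity (1-127) to the nearest multiple of `step`, keeping it non-zero."""
    q = int(round(velocity / step)) * step
    return max(step, min(q, 120))   # with step=20 this yields 20, 40, 60, 80, 100, 120

def encode_event(pitches, velocities, step=20):
    """Build the token, e.g. ['C4','E4','G4'] + [62, 81, 103] -> 'C4_60.E4_80.G4_100'."""
    return ".".join(f"{p}_{quantize_velocity(v, step)}" for p, v in zip(pitches, velocities))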

Fig4: 2 Parameters (note+velocity) 64Batch 200Epoch Quant20 *never converged, unusable*
Fig5: 2 Parameters (note+velocity) 128Batch 200Epoch Quant20 *never converged, unusable*
Fig6: 2 Parameters (note+velocity) 256Batch 200Epoch Quant20 *convergence around Epoch 157 (trained to 164) with 67% accuracy*

Listen to three examples of this model HERE. 

The first two batch sizes still don’t work, but we begin to see convergence at the 256 batch! However, this model only reaches 67% accuracy; before velocity was added, the 256-batch model had 97% accuracy. To try to increase accuracy, I ran a version quantizing velocity to the nearest 40, leaving 3 possible velocities (40, 80, 120). This brings accuracy up to 75%; however, in my opinion the output of the model is noticeably less musical with only three shades of volume. Example 2 in this set is especially repetitive, too.

Fig7: 2 Parameters (note+velocity) 256Batch 200Epoch Quant40 *convergence around Epoch 129 (trained to 137) with 75% accuracy*

Listen to three examples of this model HERE.

Step 4: Experimenting with batch normalization

Next I tried adding several batch normalization layers back in, but in different places than they were originally located. The model appears to train faster and at a higher accuracy, but the output is definitely the most meandering, and I wonder if it has become too generalized. I tried this with both the 40 and 20 velocity quantizations.
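For orientation, a stripped-down stack of the kind I am describing looks roughly like this (layer sizes are placeholders, not the exact values in my scripts):

# Rough illustration of where the added batch normalization layers sit
# (sizes and layer counts are placeholders, not my exact model).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, BatchNormalization, Dropout, Dense, Activation

def build_model(n_vocab, seq_length=100):
    model = Sequential([
        LSTM(512, input_shape=(seq_length, 1), return_sequences=True),
        BatchNormalization(),          # first re-added normalization layer
        Dropout(0.3),
        LSTM(512),
        BatchNormalization(),          # second re-added normalization layer
        Dense(256),
        Dropout(0.3),
        Dense(n_vocab),
        Activation("softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
    return model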

Fig8: 2 Parameters (note+velocity) 256Batch 200Epoch Quant40 added 2 Layers of Batch Normalization *convergence around Epoch 141 (trained to 150) with 83% accuracy*

Listen to three examples of this model HERE.

Fig9: 2 Parameters (note+velocity) 256Batch 200Epoch Quant20 2 Layers of Batch Normalization *convergence around Epoch 135 (trained to 137) with 74% accuracy*

Listen to three examples of this model HERE.

Batch normalization is bringing training times down but creating over-generalized results. I tried decreasing the amount of normalization happening on those layers with this line, setting the momentum (originally at 0.9) lower:

model.add(BatchNormalization(momentum=0.5, epsilon=1e-5))

I want to see if this version of the model is able to train with a 128 batch size (which did not work before in Fig 5). It also does not converge:

Fig10: 2 Parameters (note+velocity) 128Batch 200Epoch Quant20 2 Layers of Batch Normalization *never converged, unusable*

Step 5: Experimenting with length of sequence

I tried reducing the length of input sequences from 100 to 50. This decreased training time and worked for 128 and 256 batch sizes, though I had to comment out the batch normalization for it to train. Compare the 128 batch size in Fig 11 to Fig 5, and the 256 batch size in Fig 12 to Fig 6, to see the difference sequence length makes (all other aspects of these models are identical). The 128 batch seems way too overgeneralized; it is probably the least musical in terms of note choice of any of the models. The 256 batch is a little better but is still pretty meandering.

Fig11: 2 Parameters (note+velocity) 128 Batch 200 Epoch Quant20 50SeqLength *convergence around 141 (trained to 150) with 66% accuracy*

Listen to three examples of this model HERE.

Fig12: 2 Parameters (note+velocity) 256Batch 200Epoch Quant20 50SeqLength *convergence around 190 (trained to 200) with 74% accuracy*

Listen to three examples of this model HERE.

****Interesting thoughts I am having****

How could we use the thinking style of mid-side processing and apply it to compositional decision making? Say we added 209 pieces by Rachmaninov to the training data; we would get:

Rollin + Rachmaninov = stronger signal of compositional similarities

In the mid-side analogy, this is the “center” channel. The interesting question becomes: how would we get the side channel? How can someone phase-flip compositional decisions? We would need an “inverted_Rollin” signal. For every note or chord event in each of the Rollin MIDI pieces, there is a set of notes that were not chosen. For each of these instances we would load all of the choices that were not made. So if a chord event was C3.E3.G3, then inverted_Rollin would be every single note possibility except those three.

inverted_Rollin + Rachmaninov = stronger signal of compositional differences

This would sound chaotic, like free jazz. However, what makes mid-side processing useful is that a variable amount of side can be summed with the center. So, we could take 10 pieces created by the model that reinforces compositional differences and add them to the training data of the Rollin + Rachmaninov model. Additionally, we could effectively create a Rollin – Rachmaninov model by adding Rollin + (inverted_Rollin + Rachmaninov). Could this lead to interesting outcomes?
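Purely as a sketch of the idea (I have not built this yet), the complement set for each event could be generated something like this; enharmonic spellings would need care:

# Speculative sketch: for every note/chord event, keep the pitches that were NOT chosen.
from music21 import pitch

def midi_to_name(midi_number):
    p = pitch.Pitch()
    p.midi = midi_number
    return p.nameWithOctave                 # e.g. 60 -> 'C4'

ALL_PITCHES = [midi_to_name(m) for m in range(21, 109)]   # piano range, A0 through C8

def invert_event(event_token):
    """'C3.E3.G3' -> every piano pitch name except C3, E3, G3."""
    chosen = set(event_token.split("."))
    return [p for p in ALL_PITCHES if p not in chosen]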

Step 6: Separate parameters (lstmTake3/predictTake3)

So far we had been combining pitch and velocity into one string with an underscore. Now I tried to separate them into two axes. For quick initial testing I used a sequence length of 50, velocity quantization of 20, and 100 epochs. Now we get two different accuracies: one for pitch and one for velocity. At epoch 100 we got .75 on pitch and .98 on velocity (with the expectation that pitch could go further, since training stopped itself at 100!). It seemed to train so well that I got excited and did a full 200-epoch train with no velocity quantization at all and a 100 sequence length (see Fig13). This stopped training at around epoch 74 with .57 pitch accuracy and .92 velocity accuracy. While .57 isn’t terrible for pitch, it isn’t nearly as good as before. I am uncertain whether velocity quantization affects pitch training in the way the parameters are divided now. It could also have been the change in sequence length (from 50 to 100) that changed its effectiveness, so we have to experiment.

Fig13: 2 Param 2 axes, 128Batch 200Epoch, Quant1, 100 Seq length, trained until Epoch 74 with .57 pitch accuracy and .92 velocity accuracy

The first layer of the model is shared before it splits into separate dense layers for pitch and velocity, which means the two do still have some effect on each other. With 6 velocity options instead of 128, the pitch accuracy was able to train to 72% instead of 57%. Still, it is important to note that they are not as tied together as before, when they were one parameter (with underscore separation); in that case, more than 6 velocity options made pitch accuracy unable to train at all! So this is definitely a good improvement (and in a way, the fact that they are still tied in the initial layer is good, because in a musical sense there certainly is a connection between velocity patterns and melody).
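A rough sketch of this shared-layer, two-head shape (sizes and layer counts are placeholders, not the exact lstmTake3 code):

# Hedged sketch of a shared LSTM trunk that splits into separate pitch and velocity heads.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dropout, Dense

def build_two_head_model(n_pitch_classes, n_velocity_classes, seq_length=100, n_features=2):
    inputs = Input(shape=(seq_length, n_features))        # pitch and velocity as two input axes
    shared = LSTM(512, return_sequences=True)(inputs)     # shared layer: the heads still influence each other here
    shared = Dropout(0.3)(shared)
    shared = LSTM(256)(shared)

    pitch_head = Dense(128, activation="relu")(shared)
    pitch_out = Dense(n_pitch_classes, activation="softmax", name="pitch")(pitch_head)

    velocity_head = Dense(64, activation="relu")(shared)
    velocity_out = Dense(n_velocity_classes, activation="softmax", name="velocity")(velocity_head)

    model = Model(inputs, [pitch_out, velocity_out])
    model.compile(
        optimizer="rmsprop",
        loss={"pitch": "sparse_categorical_crossentropy", "velocity": "sparse_categorical_crossentropy"},
        metrics=["accuracy"],   # reported per head, i.e. separate pitch and velocity accuracies
    )
    return model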

Fig14: 2 Param 2 axes, 128Batch, Quant20, 100 Seq length trained until Epoch 79 with 72% pitch accuracy and 97% velocity accuracy

Listen to three examples of this model HERE.

Step 7: Three separate parameters (lstmTake4/predictTake4)

Adding rhythm via duration, and something magical is happening: it seems to be training faster with this added parameter. The compositions are lacking melodically and harmonically in a way they weren’t before, BUT it is exciting to hear rhythms emerge from the machine!
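A sketch of how the third parameter can be pulled out of the MIDI alongside pitch and velocity with music21 (the exact extraction in lstmTake4 may differ):

# Hedged sketch of extracting pitch, velocity, and duration per event.
from music21 import converter, note, chord

def extract_events(midi_path):
    """Return parallel lists of pitch tokens, velocities, and durations (in quarter lengths)."""
    pitches, velocities, durations = [], [], []
    for element in converter.parse(midi_path).flatten().notes:
        if isinstance(element, note.Note):
            pitches.append(element.nameWithOctave)
        elif isinstance(element, chord.Chord):
            pitches.append(".".join(p.nameWithOctave for p in element.pitches))
        velocities.append(element.volume.velocity or 64)   # fall back if velocity is missing
        durations.append(float(element.quarterLength))     # e.g. 1.0 = quarter note, 0.5 = eighth
    return pitches, velocities, durations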

Fig15: 3 Param 3 axes, 128Batch, Quant20, 100 Seq length trained until Epoch 79 with 73% pitch accuracy, 95% velocity accuracy, and 94% duration accuracy

Listen to three examples of this model HERE.

Step 8: Three separate parameters, no chords, just notes! (lstmTake5/predictTake5)

Okay, so one of the complications making pitch struggle to train throughout this project has been that there are not just 128 note options but 6,000+. This is because each chord combination of notes present in the composer’s training samples gets logged as a separate pitch event, even though it is a combination of the core 128 notes. If we are extracting rhythm by getting the duration as an offset between notes, can’t the notes in a chord (other than the first one) just have a duration offset of 0? But could this have some negative implications for harmony learning?
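A sketch of what that flattening could look like, where chord members after the first get an offset of 0 (illustrative, not the exact lstmTake5 code):

# Hedged sketch: split chords into individual notes; members after the first get offset 0.
from music21 import converter, note, chord

def flatten_to_notes(midi_path):
    """Return (pitch, velocity, duration, offset_from_previous_event) tuples with chords split up."""
    events = []
    prev_onset = 0.0
    for element in converter.parse(midi_path).flatten().notes:
        onset = float(element.offset)
        gap = onset - prev_onset                 # time since the previous event, in quarter lengths
        prev_onset = onset
        if isinstance(element, note.Note):
            pitch_names = [element.nameWithOctave]
        else:                                    # chord.Chord
            pitch_names = [p.nameWithOctave for p in element.pitches]
        for i, name in enumerate(pitch_names):
            events.append((
                name,
                element.volume.velocity or 64,
                float(element.quarterLength),
                gap if i == 0 else 0.0,          # chord members share the first note's onset
            ))
    return events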

So, I try this concept and get what appear to be great training results numbers-wise. But, upon listening to the model I am immediately disappointed because it sounds very unmusical! (listen below)

Fig16: 3 Param 3 axes, 128Batch, Quant5, 100 Seq length trained until Epoch 139 with 94% pitch accuracy, 97% velocity accuracy, and 97% duration accuracy

Listen to three examples of this model HERE.

Imagine my disappointment! After finally giving it 64 velocity options, it only uses a few. Every pitch combination possible, and it chooses a few repeating notes. The illusion of choice!

So, I try taking away some velocity possibilities (quantizing by 20 so that it only has 6 options) and reducing the input sequence size to 50 to see if smaller chunks give more pitch variance.

Listen to three examples of this model HERE.

I would consider this better, but still pretty unmusical. So, I tried adding a final concatenate layer that forces the parameters to cross-compare. This merges the dense layers’ outputs into a single tensor, aggregating the learned features from pitch, velocity, and duration, and ultimately outputs a probability distribution for each of them.
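A stripped-down sketch of that extra Concatenate step (sizes are placeholders, not my exact layer stack):

# Hedged sketch: per-parameter dense branches merged back into one tensor before the softmax heads.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Concatenate

def build_concat_model(n_pitch, n_velocity, n_duration, seq_length=50, n_features=3):
    inputs = Input(shape=(seq_length, n_features))
    shared = LSTM(512)(inputs)

    pitch_branch = Dense(128, activation="relu")(shared)
    velocity_branch = Dense(64, activation="relu")(shared)
    duration_branch = Dense(64, activation="relu")(shared)

    # Force the branches to cross-compare: every output head now sees all three feature sets.
    merged = Concatenate()([pitch_branch, velocity_branch, duration_branch])

    pitch_out = Dense(n_pitch, activation="softmax", name="pitch")(merged)
    velocity_out = Dense(n_velocity, activation="softmax", name="velocity")(merged)
    duration_out = Dense(n_duration, activation="softmax", name="duration")(merged)

    model = Model(inputs, [pitch_out, velocity_out, duration_out])
    model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model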

Listen to three examples of this model HERE.

I think we are on the right track with this because it sounds more musical! Now let’s pause and attempt to quantify these rather loose terms, “musical” and “unmusical.” Visually inspecting both of these MIDI sequences, we can see repetition, a pattern forming over and over.

Ex1: 128Batch20Quant100Seq
Ex3: 128ExtraLayerBatch10Quant50Seq

Listening to these snippets back to back, we can hear that the first has no velocity differences, a melody that doesn’t “go anywhere,” and a steady tempo. In the second we hear many velocity differences, a melody that “builds,” and micro tempo changes (even a ritardando at one point). I’m noting this because it is the emergence of a higher-level concept that in music we call “phrasing.”

Okay, so I had previously been stopping training early. I let this model, which showed some promise, run until 96% pitch accuracy, 98% duration accuracy, and 98% velocity accuracy. I think this made the most compelling music so far.

Listen to five examples of this model HERE.

lstmTake6/predictTake6 was an experiment that failed. I added a parameter to reinforce interval relationships, and this resulted in more “legato” sounds but overall less solid generative compositions.

———————————————————————————————

Training an AI on My Own Musical Compositions: Weeks 1–10 of Mirroring

What happens when you teach an AI to compose music in your own style—down to the phrasing, instrumentation, and sonic texture?

That’s the question at the heart of Mirroring, a project I’ve been working on this spring. After years of composing, producing, and performing my own music under the name Summer Like The Season, I decided to train a generative model not on a generic dataset—but on myself. This is a brief look at the first ten weeks of that journey: from dataset creation to early model architectures and breakthroughs in musical coherence.


Week 1–2: From Piano to Full Ensemble

Early experiments in this space were piano-only—single-instrument LSTM models that could generate surprisingly convincing monophonic sequences. But my own music isn’t monophonic. It’s layered and textured, drawing from vocals, drums, synths, and guitars. So the first step was to expand the input: not just a melody line, but 10 separate instrument roles.

I started translating audio stems from my discography—albums like Hum, Aggregator, Thin Today, and Friend Of The Monster—into MIDI. I used Spotify’s Basic Pitch for melodic instruments and Logic’s Drum Replacement for percussive parts. Each track was assigned to a consistent role: lead vocals, harmony vocals, auxiliary vocals, guitar, synth, bass, kick, snare, toms, cymbals.

Around this time, I also started thinking about the model architecture that would eventually guide the project: what I began calling the “Mixture of Musicians” model. Inspired by the Mixture of Experts framework in machine learning, the idea was to treat each instrument as its own specialist “expert” with a dedicated model branch—rather than forcing all instruments into a single flattened sequence. Each track (vocals, synth, drums, etc.) would be modeled by its own LSTM or encoder branch, allowing it to learn its own phrasing, rhythmic behavior, and expressive tendencies. These outputs would then be combined using shared attention layers or temporal context modules to produce a cohesive ensemble output. The goal was to reflect how real music is arranged: not as a monolithic sequence, but as the interaction of many distinct voices listening and responding to each other.
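As a rough PyTorch sketch of that shape (dimensions and details are placeholders, not the architecture I ultimately settled on):

# Hedged sketch of the "Mixture of Musicians" idea: one LSTM branch per instrument role,
# combined by a shared attention layer over the branch outputs.
import torch
import torch.nn as nn

class MixtureOfMusicians(nn.Module):
    def __init__(self, n_instruments=10, n_features=4, hidden=128, n_heads=4, vocab_size=256):
        super().__init__()
        # Each instrument gets its own "expert" branch so it can learn its own phrasing.
        self.branches = nn.ModuleList(
            [nn.LSTM(n_features, hidden, batch_first=True) for _ in range(n_instruments)]
        )
        # Shared attention lets the branches attend to each other at every time step.
        self.cross_attention = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.heads = nn.ModuleList([nn.Linear(hidden, vocab_size) for _ in range(n_instruments)])

    def forward(self, x):
        # x: (batch, n_instruments, seq_len, n_features)
        branch_out = [branch(x[:, i])[0] for i, branch in enumerate(self.branches)]  # each (B, T, H)
        stacked = torch.stack(branch_out, dim=2)                # (B, T, n_instruments, H)
        B, T, N, H = stacked.shape
        tokens = stacked.reshape(B * T, N, H)                   # attend across instruments at each time step
        mixed, _ = self.cross_attention(tokens, tokens, tokens)
        mixed = mixed.reshape(B, T, N, H)
        # One prediction head per instrument, over that instrument's event vocabulary.
        return [head(mixed[:, :, i]) for i, head in enumerate(self.heads)]

# usage sketch: 8 sequences, 10 instruments, 64 time steps, 4 features per event
# outputs = MixtureOfMusicians()(torch.randn(8, 10, 64, 4))   # list of 10 tensors, each (8, 64, 256)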


Week 3–4: The One-Instrument Takeover Problem (and a Pitch-0 Trap)

Early in training the multi-instrument model, I ran into a strange failure mode: the outputs would start strong, with rich textures across all 10 instruments—but then quickly collapse into a single instrument continuing while everything else fell silent.

At first, I thought it was a loss function issue or a sign of overfitting. But after digging deeper, I realized the problem was how I represented silence in the data. I had used pitch = 0 to indicate that an instrument wasn’t playing at a given time step. It seemed like a neutral placeholder—but the model learned to over-predict it. Because pitch = 0 appeared so frequently during training, it became a kind of safe fallback. Eventually, the model defaulted to silence for most tracks, letting just one instrument carry on alone.

I didn’t eliminate pitch = 0, but I changed how the model handled it:

  • I added a monophony penalty (lambda_mono = 0.15) to discourage the model from favoring just one active track.
  • I tweaked the sampling strategy to reduce the tendency to predict pitch = 0 by default.
  • I also increased the training step size (from 1 to 10) to reduce overlap between input sequences, which had been reinforcing early silences.

These changes didn’t alter the underlying representation of silence—but they helped the model resist collapsing into it. After these updates, the model’s outputs became much more balanced, with all instruments staying active and contributing meaningfully to the generated texture.
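For the curious, the monophony penalty works along these lines; this is a hedged sketch, and my actual formulation differs in its details:

# Hedged sketch of a monophony penalty: punish batches where activity concentrates in one track.
import math
import torch

LAMBDA_MONO = 0.15

def monophony_penalty(non_silence_probs):
    """
    non_silence_probs: (batch, n_instruments, seq_len) tensor of the model's probability that
    each instrument is playing (i.e. NOT predicting the pitch-0 silence token) at each step.
    """
    activity = non_silence_probs.mean(dim=2)                    # (batch, n_instruments): how active each track is
    share = activity / (activity.sum(dim=1, keepdim=True) + 1e-8)
    # Entropy of the activity distribution is maximal when all instruments contribute equally
    # and minimal when one track takes over; penalize low entropy.
    entropy = -(share * (share + 1e-8).log()).sum(dim=1)        # (batch,)
    max_entropy = math.log(non_silence_probs.shape[1])
    return LAMBDA_MONO * (1.0 - entropy / max_entropy).mean()

# total_loss = prediction_loss + monophony_penalty(non_silence_probs)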


Week 5–6: Cleaning the Data, Stressing the Duration

After training on the initial auto-transcribed dataset, I started to notice rhythmic weirdness—phrasing that felt unnatural, and patterns that repeated in overly rigid ways. It didn’t sound like my music anymore.

So I dove back into the data and manually cleaned the MIDI for 11 of the original 15 songs, checking transcriptions and adjusting phrasing to better reflect the original recordings. I also realized that my duration bins were poorly distributed, especially in the sub-beat range. Short notes were getting lumped together too easily, which flattened the rhythmic nuance. I added more granularity to capture things like 16th notes, dotted values, and triplets, while keeping longer durations coarser to avoid sparsity.
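A sketch of the reworked binning, with illustrative bin edges in quarter-note units rather than my exact values:

# Hedged sketch of duration binning: fine resolution under one beat, coarse above it.
import numpy as np

DURATION_BINS = np.array([
    0.125, 0.1667, 0.25, 0.3333, 0.375, 0.5, 0.6667, 0.75, 1.0,   # 32nds up to a quarter, incl. triplets and dotted values
    1.5, 2.0, 3.0, 4.0, 8.0,                                      # coarser bins for longer notes to avoid sparsity
])

def bin_duration(quarter_length):
    """Snap a duration (in quarter lengths) to the nearest bin value."""
    idx = int(np.argmin(np.abs(DURATION_BINS - quarter_length)))
    return float(DURATION_BINS[idx])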

At the same time, I started testing a bar-aligned input format in lstm_torch3.py, thinking it might help the model learn structural phrasing more naturally. But that version didn’t converge well. The rigid slicing by bar created data alignment problems and made the model brittle during training.

What ended up working better was going back to event-based sequencing—but this time with an important upgrade: I added IOI (inter-onset interval) as a feature. This gave the model a clearer sense of rhythmic spacing without needing to force fixed time steps. It also helped minimize my reliance on padding inactive instruments with pitch = 0, since IOI could indicate silence implicitly through time gaps between events.
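Computing IOI is simple; a sketch, assuming events are sorted by onset time:

# Hedged sketch: IOI (inter-onset interval) is the time gap between consecutive event onsets,
# which lets silence show up implicitly instead of needing a pitch-0 placeholder.
def add_ioi(events):
    """events: list of dicts with an 'onset' time (e.g. in quarter lengths), sorted by onset."""
    out = []
    prev_onset = None
    for ev in events:
        ioi = 0.0 if prev_onset is None else ev["onset"] - prev_onset
        out.append({**ev, "ioi": ioi})
        prev_onset = ev["onset"]
    return out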


Week 7–8: Breaking the LSTM and the Free Jazz Phase

Even after solving the one-instrument collapse, something still wasn’t right. The model was now generating notes for all 10 instruments—but they weren’t really playing together. Each track sounded like it had its own internal logic, but the ensemble as a whole was disjointed. It reminded me of a chaotic free jazz session—ten musicians soloing at once without listening to each other. Interesting, but not what I was after.

The problem was clearly cross-instrument temporal coherence. My current model setup—separate LSTM branches per instrument—was decent at modeling individual lines, but it had no real mechanism for learning the interplay between instruments over time. The musician attention layer helped align notes at individual time steps, but it wasn’t enough to shape the broader arc of a composition.

I tried adding a BiLSTM refinement layer after the attention step to give the model more global temporal awareness. It helped a little, but convergence was slow and unpredictable. I also experimented with a Transformer, hoping that self-attention across the entire token sequence might solve the issue—but with my relatively small dataset, the model overfit almost immediately.

That’s when I turned to Temporal Convolutional Networks (TCNs). TCNs use dilated convolutions instead of recurrence, which makes it possible to capture long-range dependencies in time while being much more stable and parallelizable. Unlike the LSTM, which models time sequentially and struggles with saturation, the TCN can see all instruments and all past time steps at once—without needing to memorize them.
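A minimal sketch of the dilated, causal convolution idea (channel sizes and depth are placeholders, not my actual TCN):

# Hedged sketch of a tiny TCN: stacked causal 1-D convolutions with doubling dilation,
# so the receptive field covers long spans of the sequence without recurrence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left-pad so no future time steps leak in
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class TinyTCN(nn.Module):
    def __init__(self, n_features=40, hidden=128, n_out=256, n_layers=5):
        super().__init__()
        layers, in_ch = [], n_features
        for i in range(n_layers):
            layers += [CausalConv1d(in_ch, hidden, dilation=2 ** i), nn.ReLU()]
            in_ch = hidden
        self.net = nn.Sequential(*layers)
        self.out = nn.Conv1d(hidden, n_out, kernel_size=1)

    def forward(self, x):                                # x: (batch, n_features, time), all instruments stacked into the feature axis
        return self.out(self.net(x))

# model = TinyTCN(); y = model(torch.randn(8, 40, 256))   # -> (8, 256, 256)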

By this point, I also dropped the lambda_mono penalty entirely. I had originally introduced it to discourage the model from muting every track except one. But with the TCN’s more global, instrument-aware architecture, I no longer needed to enforce polyphony manually. The model held balance naturally.

It wasn’t just a better fit technically—it felt like the model was finally starting to listen to itself. Instruments began playing in ways that felt coordinated, sometimes even conversational. The free jazz phase was over.

Soohyun Kim explains the DSP-based math behind this realization. While the RNN (LSTM) style model seems similar, performing gradient descent on that type of equation is more difficult because there are lots of local minima. The CNN (TCN) polynomial series seems much more complicated to solve, but gradient descent on it is actually way simpler because the landscape is “convex shaped.”

Week 9–10: Composing with the Machine

With a functioning model architecture in place, I shifted toward compositional iteration.

Here’s how it works:

  1. The model generates MIDI multitrack output across all 10 instruments: pitch, velocity, duration, and IOI.
  2. I assign sound samples to each track (keeping the model’s timing exactly as-is).
  3. Then I build arrangements, compositions, and textures—using this machine-generated skeleton as raw musical material.

At this point, I expanded the dataset from 15 to 22 songs. I hoped the added material would help the model generalize more effectively—and, more specifically, help resolve the strange gap I was seeing in the TCN: validation loss was low, but training loss remained high, even though the outputs sounded coherent. I thought perhaps adding more real compositions would help the model learn the structure better and bring those losses into closer alignment.

But it didn’t. The gap persisted, and the model kept defaulting to familiar rhythmic territory. That’s when I realized the problem wasn’t that the data needed to be “cleaner” or more carefully edited—it was that I simply didn’t have enough of it. Refining one new song at a time wasn’t going to move the needle at this stage.

So I pivoted from manual refinement to data augmentation. I built out a pipeline that could 10x the dataset automatically (see the sketch after this list), applying:

  • Pitch shifts of ±1 and ±2 semitones
  • Time-stretching of ±5%
  • Velocity offsets of ±5 units
  • IOI perturbations of ±10%
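
A hedged sketch of that augmentation pass (my actual pipeline differs in implementation details, and silence or drum events would need special handling for the pitch shift):

# Hedged sketch of one augmentation pass over a song's event list.
import random

def augment(events):
    """events: list of dicts with 'pitch' (MIDI number), 'velocity', 'duration', 'ioi'."""
    semitones = random.choice([-2, -1, 1, 2])            # pitch shift of +/-1 or +/-2 semitones
    stretch = 1.0 + random.uniform(-0.05, 0.05)          # time-stretch of +/-5%
    vel_offset = random.randint(-5, 5)                   # velocity offset of +/-5 units
    out = []
    for ev in events:
        out.append({
            "pitch": min(127, max(0, ev["pitch"] + semitones)),
            "velocity": min(127, max(1, ev["velocity"] + vel_offset)),
            "duration": ev["duration"] * stretch,
            "ioi": ev["ioi"] * stretch * (1.0 + random.uniform(-0.10, 0.10)),  # IOI perturbation of +/-10%
        })
    return out

# augmented_dataset = [augment(song) for song in dataset for _ in range(10)]   # roughly 10x the data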

This gave the model a much broader field to learn from, while keeping the musical style anchored in my own compositions. And this time, it worked—the training and validation loss started to come into closer alignment.

Week 10+: Future Challenges — Repetitive Drum Patterns and Drum Segmentation

Even with the TCN stabilizing the overall musical structure, some patterns have started to feel too familiar—particularly in the drum tracks. The generations aren’t static or identical, and there’s definitely variation in which drum instrument plays on which beat. But despite that surface-level variation, the core rhythmic structure stays almost the same across generations. It’s as if the model has learned one “safe” groove and keeps finding ways to rephrase it slightly, without diverging too far from that center.

I suspect this behavior stems from how I’ve been representing the drums in the model. Early on, I split the drumkit into separate instrument tracks: one for kick, one for snare, one for toms, one for cymbals. This made the modeling cleaner and gave each element a defined role. I also gave each drum sub-instrument its own pitch vocabulary:

  • Kick might only have a single pitch.
  • Snare has a few—center, rimshot, maybe a flam.
  • Toms and cymbals are similarly limited.

This design choice helped early LSTM models, which had trouble staying within the drumkit’s pitch range when given access to the full vocabulary. But I haven’t revisited this structure since switching to the TCN.

What may be happening now is that this per-piece separation is actually limiting rhythmic diversity at a higher level. Since each drum instrument only knows its own tightly defined role and pitch space, the TCN might be settling into familiar configurations that feel “correct” within those boundaries—even though, musically, I’d prefer it to explore more distinct patterns from generation to generation.

I haven’t yet tested merging the drum tracks back into a single unified “drums” stream, or letting them share a common pitch vocabulary. That’s likely the next step. It’s possible the TCN is now robust enough to handle this less constrained setup—and might produce more varied grooves as a result.

I am also planning to expand training—first with more of my own material (aiming for 71+ songs). Later, I’d like to use the Lakh MIDI dataset to train a massive model refined on my music, and compare the results with my single-composer model.