daring robin Jul 15, 2025, 5:21 PM

#

Predicts a huge drop in price after the last known datapoint
(Green is real price, blue is predicted price)

daring robin Jul 15, 2025, 6:47 PM

#

Sharp change in price, after which it just continues as if nothing has happened
The model makes all the predictions for the following days at once, without knowing what its predictions for the other days are.

#

(Same problem happens with both Transformer and LSTM architectures)

#

===

Training and validation data split seems correct:
Green is training data, blue is evaluation input, red is expected eval output (real price)

daring robin Jul 15, 2025, 7:05 PM

#

Please ping me if you might know a possible reason for this issue

paper solstice Jul 15, 2025, 11:17 PM

#

daring robin Please ping me if you might know a possible reason for this issue

bro you need to provide more info...

daring robin Jul 16, 2025, 10:54 AM

#

paper solstice bro you need to provide more info...

Transformer model that takes 30 days * 7 features as input and predicts price changes for the following 30 days at once.
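A minimal sketch of how samples with this shape could be sliced, assuming one ticker's rows are already in chronological order. `make_windows` and its parameter names are hypothetical and not taken from the actual project.

```python
def make_windows(rows, targets, window=30, horizon=30):
    """Slice one ticker's history into (input, target) pairs.

    rows:    list of per-day feature lists, oldest first
    targets: list of per-day price changes, aligned with rows
    """
    samples = []
    last_start = len(rows) - window - horizon
    for start in range(last_start + 1):
        x = rows[start:start + window]                        # 30 days of features
        y = targets[start + window:start + window + horizon]  # the following 30 days
        samples.append((x, y))
    return samples


# Toy data: 100 days, 7 features per day, target equal to the day index.
rows = [[float(day)] * 7 for day in range(100)]
targets = [float(day) for day in range(100)]
samples = make_windows(rows, targets)
```

Each target block starts on the day right after its input window ends, so no target day is ever part of its own input.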

#

Transformer model was Encoder-only, now I changed it to be Decoder-only with causal mask and RoPE.

#

Same issue persists even with LSTM implementation

#

(I've been working on this issue for a couple of days now, and each day I update my CSV dataset files, so the last known data point also moves. So it's not like there was some huge unseen spike in the input data on that date)

paper solstice Jul 16, 2025, 11:53 AM

#

daring robin (I've been working on this issue for a couple of days now, and each day I update my...

Is there just that one spike in the validation data, or are all validation predictions bad?

#

And how do your training graphs look?

#

And also I don't think that the decoder-only model with causal masks makes sense for predicting in chunks.
You are basically predicting timesteps that are 30 days away all the time

gusty hazel Jul 16, 2025, 12:24 PM

#

@daring robin And on how many examples did u train it on?

daring robin Jul 16, 2025, 4:30 PM

#

paper solstice Is there just that one spike in the validation data, or are all validation predi...

daring robin Jul 16, 2025, 4:31 PM

#

paper solstice And how do your training graphs look?

Loss graph looks okay

daring robin Jul 16, 2025, 4:33 PM

#

paper solstice And also I don't think that the decoder-only model with causal masks makes sense...

So what could be a good choice in this situation?

daring robin Jul 16, 2025, 4:34 PM

#

gusty hazel @daring robin And on how many examples did u train it on?

Data from around 2000 till today. Sometimes for a few stocks, sometimes up to 30 stocks in the same field (tech for ex.). Batches shuffled each epoch

paper solstice Jul 16, 2025, 4:35 PM

#

daring robin Loss graph looks okay

Even validation loss? This looks like overfitting to me

daring robin Jul 16, 2025, 4:41 PM

#

This is supposed to be validation prediction

paper solstice Jul 16, 2025, 4:42 PM

#

daring robin So what could be a good choice in this situation?

I'm not sure what's best.
But I think you should do full self-attention if you are predicting whole 30-day blocks at a time.
When you use a causal mask, it hides the embeddings that are in the future for each token.
Normally, you would use it to predict the direct next output.
input1 sees [input1] and predicts input2
input2 sees [input1, input2] and predicts input3
...

But in your case
input1 sees [input1] and predicts input31
input2 sees [input1, input2] and predicts input32
...
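The difference described above can be made concrete without any model code. This is only an illustrative sketch: `visible_positions` is a hypothetical helper that lists which input days each output position may attend to, with and without a causal mask.

```python
def visible_positions(seq_len, causal):
    """For each position, list the positions it may attend to (0-based)."""
    if causal:
        return [list(range(i + 1)) for i in range(seq_len)]
    return [list(range(seq_len)) for _ in range(seq_len)]


seq_len = 30
causal = visible_positions(seq_len, causal=True)
full = visible_positions(seq_len, causal=False)

# Output position i is trained to predict day seq_len + i of the next block.
# With a causal mask the first output sees a single input day,
# without the mask every output sees the whole 30-day window.
print(len(causal[0]), len(full[0]))    # 1 30
print(len(causal[29]), len(full[29]))  # 30 30
```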

paper solstice Jul 16, 2025, 4:48 PM

#

daring robin This is supposed to be validation prediction

do you track validation loss during training?

daring robin Jul 16, 2025, 7:16 PM

#

paper solstice do you track validation loss during training?

#

It seems like the validation data might leak into training or something, because the validation predictions are too good...

paper solstice Jul 16, 2025, 7:26 PM

#

daring robin It seems like the validation data might leak into training or something, because...

yeah, looks like it

daring robin Jul 16, 2025, 7:27 PM

#

paper solstice yeah, looks like it

Found the issue, fixing it rn

daring robin Jul 16, 2025, 9:01 PM

#

@paper solstice Fixed. This looks more like it. RMSE is % deviation from real data, but those numbers are from one of the training batches, so not really a real-world scenario, fixing that as well now.

#

It follows the real price quite well (since it was trained on it), but once it has to predict for the next day it just drops like crazy (-97.08% in this case)

#

And this is really weird, because the model just predicts all 30 days at once, so it doesn't accumulate errors.

#

You can also see "ghost" predictions, that start a little earlier, they also drop in the same exact place. This shows that the issue is not in one of the model outputs, but persists even with offset.

paper solstice Jul 16, 2025, 9:07 PM

#

daring robin And this is really weird, because the model just predicts all 30 days at once, s...

Looking at your validation loss, it's not that weird

daring robin Jul 16, 2025, 9:08 PM

#

2.0 loss is pretty bad

paper solstice Jul 16, 2025, 9:08 PM

#

it didn't lower at all, so the model isn't learning any generalization

daring robin Jul 16, 2025, 9:08 PM

#

Ah

#

Uh..

#

Why?

paper solstice Jul 16, 2025, 9:09 PM

#

And what architecture do you have now?

daring robin Jul 16, 2025, 9:10 PM

#

paper solstice And what architecture do you have now?


```py
import math

import torch
import torch.nn as nn


def apply_rope(x):
    # x: (batch, seq_len, dim)
    # RoPE expects even dim
    batch, seq_len, dim = x.shape
    assert dim % 2 == 0, "Model dim must be even for RoPE"
    half_dim = dim // 2
    pos = torch.arange(seq_len, device=x.device).unsqueeze(1)  # (seq_len, 1)
    freq = torch.exp(-math.log(10000) * torch.arange(0, half_dim, device=x.device) / half_dim)  # (half_dim,)
    angles = pos * freq  # (seq_len, half_dim)
    cos = torch.cos(angles)
    sin = torch.sin(angles)
    x1, x2 = x[..., :half_dim], x[..., half_dim:]
    x_rope = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return x_rope

# Transformer model
class StockTransformer(nn.Module):
    def __init__(self, input_dim, model_dim, num_heads, num_layers, dropout):
        super().__init__()
        self.input_dim = input_dim
        self.model_dim = model_dim
        self.embedding = nn.Linear(input_dim, model_dim)
        self.decoder_layers = nn.ModuleList([
            nn.TransformerDecoderLayer(
                d_model=model_dim, nhead=num_heads, dropout=dropout, batch_first=True
            ) for _ in range(num_layers)
        ])
        self.decoder_norm = nn.LayerNorm(model_dim)
        self.output = nn.Linear(model_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        x = self.embedding(x)  # (batch, seq_len, model_dim)
        x = apply_rope(x)      # (batch, seq_len, model_dim)
        tgt = x
        memory = torch.zeros_like(x)  # dummy, not used

        seq_len = x.size(1)
        # Causal mask: (seq_len, seq_len), True means masked
        mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        for layer in self.decoder_layers:
            tgt = layer(tgt, memory, tgt_mask=mask)
        tgt = self.decoder_norm(tgt)
        out = self.output(tgt)  # (batch, seq_len, 1)
        return out.squeeze(-1)
```

#

@paper solstice Is that okay?

#

Or should I use full self-attention (no causal mask), so each output position can attend to all positions in the input block?

paper solstice Jul 16, 2025, 9:22 PM

#

daring robin ```py def apply_rope(x): # x: (batch, seq_len, dim) # RoPE expects even...

I think the model itself looks good

#

But what are you using as inputs and targets?

daring robin Jul 16, 2025, 9:29 PM

#

paper solstice I think the model itself looks good

I reran training on only GOOGL and MSFT, and it did manage to generalize to them

#

Not good, but just a bit of movement in the right direction

paper solstice Jul 16, 2025, 9:33 PM

#

daring robin I reran training on only GOOGL and MSFT, and it did manage to generalize to them

And are you normalizing the data?

daring robin Jul 16, 2025, 9:35 PM

#

paper solstice And are you normalizing the data?

Yes:

```py
# Fit scaler only on train, then transform both
self.scaler = StandardScaler()
feature_cols = ["Pct_Change", "High_pct", "Low_pct", "Volume_pct"] + [col for col, _ in ext_ticker_cols]
train_features = self.train_df[feature_cols]
self.scaler.fit(train_features)
self.train_df[feature_cols] = self.scaler.transform(train_features)
self.val_df[feature_cols] = self.scaler.transform(self.val_df[feature_cols])
self.test_df[feature_cols] = self.scaler.transform(self.test_df[feature_cols])
```

#

.
Also removed causal mask

```py
def forward(self, x):
    # x: (batch, seq_len, input_dim)
    x = self.embedding(x)  # (batch, seq_len, model_dim)
    x = apply_rope(x)      # (batch, seq_len, model_dim)
    tgt = x
    memory = torch.zeros_like(x)  # dummy, not used

    # No causal mask for block prediction
    for layer in self.decoder_layers:
        tgt = layer(tgt, memory, tgt_mask=None)
    tgt = self.decoder_norm(tgt)
    out = self.output(tgt)  # (batch, seq_len, 1)
    return out.squeeze(-1)
```

And the result is somewhat better now:

paper solstice Jul 16, 2025, 9:41 PM

#

daring robin . Also removed causal mask ```py def forward(self, x): # x: (batch, seq_len,...

yeah, that makes sense

#

but I thought even with that the model would be able to learn at least something

#

you could also try to predict just the next value

#

and leave the causal mask

daring robin Jul 16, 2025, 9:45 PM

#

paper solstice yeah, that makes sense

But it still breaks around the last data point seen during training

#

#

What's strange is that sometimes it predicts just fine.
For example here you can see it making predictions for INTC (Intel stock which it didn't see during training at all):

#

There is still a drop, but not as big

#

Testing it on some other stocks, it also has that "drop" on many, but here is NVDA for example, where it has no crazy fluctuations:

daring robin Jul 16, 2025, 9:54 PM

#

paper solstice you could also try to predict just the next value

The problem with that is that I give the model many inputs from the real world, which I suppose it would also need to predict in order to be autoregressive, so it could "walk" its way into the future. But that makes it more complicated

paper solstice Jul 16, 2025, 9:56 PM

#

daring robin

it's weird

daring robin Jul 16, 2025, 9:58 PM

#

paper solstice it's weird

I agree

#

It pisses me off 😂

#

Just trained a new model with different hyperparameters, and fewer inputs.
Now it goes to different price levels, so its pct_change for that day kinda depends on where it starts from. But it still happens at that one date:

#

(It could seem like it drops at different dates now, but I think it's because of the working days, and that sometimes it starts to predict from weekends, stuff like that)

paper solstice Jul 16, 2025, 10:03 PM

#

and are you scaling the targets properly?
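One way to sanity-check target scaling is a round trip: standardize the targets with statistics from the training set only, then undo the transform and compare. The sketch below is an assumption about how such a check could look, not the project's code. It uses the population standard deviation, which is what scikit-learn's `StandardScaler` uses as well; `fit_scaler`, `scale` and `unscale` are hypothetical names.

```python
import statistics


def fit_scaler(train_values):
    """Return (mean, std) computed on the training targets only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for a constant column
    return mean, std


def scale(values, mean, std):
    return [(v - mean) / std for v in values]


def unscale(values, mean, std):
    return [v * std + mean for v in values]


train_targets = [0.5, -1.0, 2.0, -0.5, 1.5]  # daily % changes (toy numbers)
mean, std = fit_scaler(train_targets)

scaled = scale(train_targets, mean, std)
restored = unscale(scaled, mean, std)
```

If the model is trained on scaled targets, its raw outputs have to go through `unscale` with the same `mean` and `std` before they are treated as percentage changes.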

paper solstice Jul 16, 2025, 10:06 PM

#

daring robin The problem with that, is that I give the model many inputs from real world, whi...

oh right

#

you could try encoder-decoder, but I have no idea if it'd perform better

daring robin Jul 16, 2025, 10:09 PM

#

paper solstice you could try encoder-decoder, but I have no idea if it'd perform better

That could improve accuracy, but will not fix the problem we are facing...

paper solstice Jul 16, 2025, 10:13 PM

#

daring robin Just trained a new model with different hyperparameters, and fewer inputs. Now i...

And the blue line is the final prediction?

daring robin Jul 16, 2025, 10:15 PM

#

paper solstice And the blue line is the final prediction?

Yes, the blue line with dots is the last prediction

paper solstice Jul 16, 2025, 10:17 PM

#

daring robin Yes, the blue line with dots is the last prediction

what features are you using?

daring robin Jul 16, 2025, 10:19 PM

#

paper solstice what features are you using?

Input: 30 days * ["Weekday", "Month", "Pct_Change", "High_pct", "Low_pct", "Volume_pct"]
Output: 30 days * "Pct_change"

#

Pretty basic for now

paper solstice Jul 16, 2025, 10:20 PM

#

daring robin Input: 30 days * ["Weekday", "Month", "Pct_Change", "High_pct", "Low_pct", "Volu...

can you try to train again without weekday and month?

#

I'm curious if it is overfitting on those features

daring robin Jul 16, 2025, 10:23 PM

#

paper solstice I'm curious if it is overfitting on those features

Could be, I will try

#

But it trains on like 25 years of data, months and weekdays repeat and overlap, so I guess that shouldn't be a thing to overfit on

#

I am curious to try tho

#

While training the new model I was testing the old one, apparently it can also do this:

#

Trained. This is a rather strange-looking graph

#

Guess what

#

```
Biggest predicted price jump: -98.99 on 2025-07-14
Input window (all days, all features):
Pct_Change | High_pct | Low_pct | Volume_pct |
Day 1: -0.075 | -0.768 | 0.100 | 0.243 |
...
Day 30: 0.688 | 0.736 | 0.007 | 0.317 |

Top input factors for biggest predicted price jump (Day 11):
SI=F_pct: 2.859
HG=F_pct: 2.601
GC=F_pct: 2.519
```

paper solstice Jul 16, 2025, 10:42 PM

#

daring robin ```Biggest predicted price jump: -98.99 on 2025-07-14 Input window (all days, al...

what does this mean?

daring robin Jul 16, 2025, 11:00 PM

#

paper solstice what does this mean?

Some information, like what the inputs look like, and which inputs contributed the most to the spike in price

#

This is so weird, I don't even know what to do

#

Maybe start completely from scratch

paper solstice Jul 16, 2025, 11:01 PM

#

daring robin Some information, like what the inputs look like, and what inputs exactly contri...

But if I understand it correctly, you are also passing commodities as inputs?

#

aren't SI, HG and GC silver, copper and gold?

daring robin Jul 16, 2025, 11:31 PM

#

paper solstice aren't SI, HG and GC silver, copper and gold?

Correct

daring robin Jul 16, 2025, 11:32 PM

#

paper solstice But if I understand it correctly, you are also passing commodities as inputs?

Sometimes I do, sometimes I train it simpler, without them, but it has no effect on that weird glitch

paper solstice Jul 16, 2025, 11:33 PM

#

daring robin Sometimes I do, sometimes I train it simpler, without them, but it has no effect...

I mean it says that those values have the largest effect on the jump

#

Did you try to train it super clean?
Like only ["Pct_Change", "High_pct", "Low_pct", "Volume_pct"]

daring robin Jul 16, 2025, 11:36 PM

#

paper solstice Did you try to train it super clean? Like only ["Pct_Change", "High_pct", "Low_p...

Will do now

daring robin Jul 16, 2025, 11:40 PM

#

paper solstice Did you try to train it super clean? Like only ["Pct_Change", "High_pct", "Low_p...

Okay, so it produces quite a strange result. It clearly underperforms, and lacks accuracy. Also the validation loss behaves strangely

#

Still does The Drop

#

Some information if useful:

```
Input window (all days, all features):
Pct_Change | High_pct | Low_pct | Volume_pct
Day 1: 0.446 | -0.259 | 0.075 | 0.385
Day 2: 0.033 | -0.066 | 0.783 | -0.693
Day 3: -0.594 | -0.096 | -0.490 | 0.828
Day 4: 0.960 | 2.561 | 0.846 | 0.909
Day 5: 0.450 | 0.854 | 0.464 | -0.120
Day 6: -0.547 | -0.514 | 0.445 | -1.137
Day 7: 0.902 | 0.212 | 0.837 | -0.010
Day 8: -0.155 | -0.111 | 0.426 | -0.313
Day 9: -0.147 | -0.684 | -0.370 | -0.445
Day 10: -0.068 | -0.532 | -0.596 | 1.310
Day 11: -0.609 | -0.116 | 0.723 | -0.650
Day 12: -0.648 | -0.479 | 0.035 | 0.154
Day 13: 0.362 | -0.305 | 0.749 | -0.903
Day 14: -0.008 | -0.623 | -0.159 | 0.496
Day 15: 1.125 | 0.434 | 0.898 | -0.189
Day 16: 0.500 | -0.177 | 0.835 | -0.509
Day 17: 0.469 | 0.809 | 0.419 | 1.944
Day 18: -0.294 | -0.630 | -0.201 | -1.059
Day 19: -0.377 | -0.646 | 0.367 | -0.783
Day 20: -0.253 | 0.770 | 0.879 | 0.438
Day 21: 0.389 | -0.084 | 0.868 | -0.175
Day 22: -0.209 | -0.272 | 0.481 | -0.320
Day 23: -0.579 | -0.643 | -0.147 | 0.120
Day 24: -1.426 | -0.694 | -2.297 | 2.872
Day 25: -0.355 | -0.448 | -0.783 | -0.597
Day 26: 0.301 | -0.369 | 0.589 | -0.707
Day 27: 0.799 | 0.829 | 0.867 | -0.387
Day 28: 0.559 | -0.397 | -0.047 | -0.349
Day 29: 0.990 | 0.912 | 0.216 | 4.289
Day 30: -0.505 | -0.680 | -1.347 | -0.923
Top input factors for biggest predicted price jump (Day 11):
Low_pct: 0.723
Volume_pct: -0.650
Pct_Change: -0.609
```

paper solstice Jul 17, 2025, 12:05 AM

#

daring robin Still does **The Drop**

that's so weird

#

I don't really know anymore

#

And is the drop there for all predictions that have 2025-07-14 in their prediction window?

#

or only last x < 30

#

I can't really tell from the picture exactly how many lines with drop are there

daring robin Jul 17, 2025, 8:57 AM

#

paper solstice I can't really tell from the picture exactly how many lines with drop are there

So it's 10 lines with offset -1 day for each. Last day prediction being the most visible dashed blue line with dots

daring robin Jul 17, 2025, 8:58 AM

#

paper solstice And is the drop there for all predictions that have 2025-07-14 in their predicti...

Basically yes, but 07-14 is close to the last data point that the model saw during training

#

I've had this issue for quite a few days now, and each time I update my training dataset with new data and train a new model, the date where the model drops shifts by +1 day. So it's not about that specific date, it's that the model has never seen anything after it. Which is logical, but doesn't explain the sudden instability, and why it's only for that one date. If it was overfitting or some other kind of issue I would expect the whole prediction line to be choppy and all over the place, but it's quite okay except for that one day.

daring robin Jul 17, 2025, 9:28 AM

#

Validation being lower at first makes sense, since here I trained it with 0.3 dropout

#

I took a model at epoch 14 (low validation loss) and it still resulted in that drop

#

drop drop drop drop I am going crazy

#

Uh, oh. Hmm. If I start prediction after that day, it's all fine actually

#

To say that I am confused is a huge understatement

daring robin Jul 17, 2025, 9:37 AM

#

daring robin Uh, oh. Hmm. If I start prediction after that day, it's all fine actually

You know what @paper solstice, look there. The pattern after the drop is exactly the same as in the last most visible prediction, same movement a bit higher and then down and a bit up again. So the model works exactly the same, and predictions for each day are generalized pretty okay. If I just replace that drop with 0 pct_change for that day, it will look almost identical

#

I am thinking that this could be the Scaler/Normalization Issue: the inverse transform (when converting predicted pct_change back to price) can do such weird things with numbers.
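The observation that the curve keeps its shape after the drop follows from how percentage changes compound into prices. A small sketch, assuming predictions are daily changes in percent and using the -97.08% figure mentioned earlier; `to_prices` is a hypothetical helper.

```python
def to_prices(last_price, pct_changes):
    """Compound daily percentage changes into a price path."""
    prices = []
    price = last_price
    for pct in pct_changes:
        price *= 1.0 + pct / 100.0
        prices.append(price)
    return prices


normal = [0.0, 1.0, -2.0, 0.5]
glitched = [-97.08, 1.0, -2.0, 0.5]  # same changes, except the first day

a = to_prices(100.0, normal)
b = to_prices(100.0, glitched)

# After day 1 the two paths differ only by a constant factor,
# so the curve keeps its shape at a much lower price level.
ratios = [y / x for x, y in zip(a, b)]
```

Every entry of `ratios` is about 0.0292, so a single bad value on the first predicted day is enough to explain a plot like this, whatever produced that value.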

daring robin Jul 17, 2025, 10:08 AM

#

daring robin I am thinking that this could be the Scaler/Normalization Issue: the inverse tra...

No bro

#

I just manually cut the last 60 days for each ticker in the training data, and went back in time to predict price changes in April. There is that drop! I caught it in 4K

#

And if I go to the current date it's fine, no drop, because now we are so much in the future

#

# So, this issue does not come from:

- Scaler/Normalization
- Some spike in the input data right before 07-17
- Bad model architecture or hyperparameters
- Overfitting on the data

# It looks like:

- The price drops in prediction right after exiting the training dataframe

paper solstice Jul 17, 2025, 3:13 PM

#

daring robin Validation being lower at first makes sense, since here I trained it with 0.3 dr...

and how big a validation split are you using for this evaluation? It looks super unstable, so I guess it's kinda small

paper solstice Jul 17, 2025, 3:17 PM

#

daring robin # So, this issue does not come from: - Scaler/Normalization - Some spike in the ...

Maybe some error in the data preparation/prediction code? Like accidentally including padding or something like that
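A quick way to test the padding idea is to scan the rows at the end of the prepared data for NaN values or rows that consist only of a fill value. This is a sketch under the assumption that each row is a plain list of floats; `suspicious_rows` and `pad_value` are hypothetical names.

```python
import math


def suspicious_rows(rows, pad_value=0.0):
    """Return indices of rows containing NaN or consisting only of pad_value."""
    flagged = []
    for i, row in enumerate(rows):
        has_nan = any(math.isnan(v) for v in row)
        all_pad = all(v == pad_value for v in row)
        if has_nan or all_pad:
            flagged.append(i)
    return flagged


rows = [
    [0.4, -0.2, 0.1, 0.3],
    [0.0, 0.0, 0.0, 0.0],           # looks like padding
    [0.9, 0.2, float("nan"), 0.1],  # e.g. a pct change with no previous day
]
flagged = suspicious_rows(rows)
```

Running the same check on the first row of the prediction window would show whether the step right after the training data is built from a padded or incomplete row.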

daring robin Jul 18, 2025, 8:28 PM

#

paper solstice and how big validation split are you using for this evaluation? It looks super u...

test_size = window_size + predict_size
So it's only one prediction per each ticker. Yeah, very small

#

Working on it now

#

Crazy atm

#

Added +10 predictions with offset for each ticker during training, looks much more smooth:

#

Validation loss is a bit smaller at the start cuz of the dropout. But yeah, we can clearly see that it doesn't generalize at all, only gets worse as the training goes. Overfitting?

paper solstice Jul 18, 2025, 9:13 PM

#

daring robin test_size = window_size + predict_size So it's only one prediction per each tick...

wait, you only have one validation sample?

#

I'd say use like 80-20 split

#

Especially if you're experimenting with different approaches
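For time series an 80-20 split is usually done by date rather than at random, so that no validation day appears in any training window. A minimal sketch, assuming the 30-day input and 30-day output windows mentioned earlier; `chronological_split` is a hypothetical helper.

```python
def chronological_split(n_days, window=30, horizon=30, train_fraction=0.8):
    """Return (train_starts, val_starts) as lists of window start indices.

    A window starting at s covers days s .. s + window + horizon - 1.
    Training windows must end before the cutoff day, validation windows
    must start at or after it, so no day is shared between the two sets.
    """
    span = window + horizon
    cutoff = int(n_days * train_fraction)
    last_start = n_days - span
    train_starts = [s for s in range(last_start + 1) if s + span <= cutoff]
    val_starts = list(range(cutoff, last_start + 1))
    return train_starts, val_starts


train_starts, val_starts = chronological_split(n_days=1000)
```

With 1000 days per ticker this gives 741 training windows and 141 validation windows, far more than a single prediction per ticker.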

hearty stratus Jul 19, 2025, 6:45 AM

#

daring robin Still does **The Drop**

I think it’s because you give the model too much room to predict. Assuming that the price won’t drop more than 10% in a day, you can set that as the “largest” prediction it can make.

Depending on how you want to structure your code you can make something like this, so for example if the current price is at $20, it couldn’t go under $18 (you know better how to structure your code)

[l_prediction = current_price * 0.9]

Let me know if it worked!
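The clipping idea could be sketched like this, assuming the model outputs a daily change in percent. The 10% limit on the downside comes from the suggestion above. The matching limit on the upside and the name `clip_pct_change` are assumptions added for illustration.

```python
def clip_pct_change(pct, max_drop=10.0, max_rise=10.0):
    """Clamp a predicted daily % change to [-max_drop, +max_rise]."""
    return max(-max_drop, min(max_rise, pct))


current_price = 20.0
predicted_pct = -97.08
clipped = clip_pct_change(predicted_pct)
lowest_price = current_price * (1.0 + clipped / 100.0)
```

Here `lowest_price` comes out at 18.0, the floor from the $20 example.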

daring robin Jul 19, 2025, 4:43 PM

#

hearty stratus I think its because you give the model too much room to predict. Assuming that t...

That's a way to just clip this instability, but it will still be there and could potentially ruin the prediction even with that custom clipping. Thanks for the advice tho

daring robin Jul 19, 2025, 4:44 PM

#

paper solstice I'd say use like 80-20 split

Now I have 11 validations on the last known data for each ticker. So let's say I train it on 10 tickers, that would result in 110 entries for validation

hearty stratus Jul 19, 2025, 5:12 PM

#

daring robin That's a way to just clip this instability, but it will still be there and could...

Try it and see if it still drops, also how does the model predict the future prices? There might be a problem there

daring robin Jul 19, 2025, 7:17 PM

#

hearty stratus Try it and see if it still drops, also how does the model predict the future pri...

Logically thinking there will be the same drop on the same date, but now smaller, clipped to be only -10%.
That's why I say it won't fix the issue

hearty stratus Jul 19, 2025, 7:41 PM

#

daring robin Logically thinking there will be the same drop on the same date, but now smaller...

Doesn’t need to be true, in your example it also doesn’t go straight to zero, you also don’t know how the AI learns the predictions

#

Depending on how you structured the AI it could also be a problem with the math, if you calculate the derivative on the accuracy

daring robin Jul 20, 2025, 4:38 PM

#

@paper solstice I think I might have found the issue, or at least some part of it

#

The gold data does not exist till 2001 or so

#

But I think I've tried to train a model without external tickers like gold and still got the issue
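If some external series start later than the stock data, one option is to find the first row where every column has a value and drop everything before it. A minimal sketch, assuming each column is a list of floats with NaN or None for missing days; `first_valid_index` and `common_start` are hypothetical helpers and the numbers are toy data.

```python
import math


def first_valid_index(column):
    """Index of the first non-missing value, or None if there is none."""
    for i, value in enumerate(column):
        if value is not None and not math.isnan(value):
            return i
    return None


def common_start(columns):
    """Earliest row index at which every column has data."""
    starts = [first_valid_index(col) for col in columns.values()]
    if any(s is None for s in starts):
        raise ValueError("at least one column has no data at all")
    return max(starts)


nan = float("nan")
columns = {
    "Pct_Change": [0.1, -0.2, 0.3, 0.0, 0.5],
    "GC=F_pct": [nan, nan, 0.4, 0.1, -0.3],  # gold history starts later
}
start = common_start(columns)
trimmed = {name: col[start:] for name, col in columns.items()}
```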

#Custom model for stock prices
