I'm trying to do intent classification with DistilBERT on my spam/normal email dataset (30,000 emails). However, BERT-style models only accept up to 512 tokens per input, and email bodies commonly exceed that.
I tried splitting each email body into chunks of fewer than 512 tokens, getting the intent for each chunk, and then aggregating the per-chunk intents. However, it's still running after several hours 😅.
I also tried summarizing the body content first and then classifying the summary, but after summarization the main intent is either cut out or altered.
So I'm curious whether you have any techniques or strategies for applying this to longer texts and for reducing the execution time.
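
For reference, here is a minimal sketch of the split-and-aggregate approach described above, not my exact code. It assumes the email is already tokenized into a list of token ids. The window size of 510 (leaving room for the two special tokens), the overlap of 128, and averaging as the aggregation rule are all assumptions. `classify_batch` is a hypothetical stand-in for the DistilBERT forward pass: it takes a list of windows and returns one probability list per window, so all chunks of an email go through the model in a single batched call rather than one call per chunk.

```python
from typing import Callable, List, Sequence


def split_into_windows(
    token_ids: Sequence[int], max_len: int = 510, stride: int = 128
) -> List[List[int]]:
    """Split token ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that text near a
    boundary appears in full in at least one window.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window reached the end of the text
        start += step
    return windows


def aggregate_mean(chunk_probs: List[List[float]]) -> List[float]:
    """Average the class probabilities over all chunks of one email."""
    n = len(chunk_probs)
    return [sum(column) / n for column in zip(*chunk_probs)]


def classify_long(
    token_ids: Sequence[int],
    classify_batch: Callable[[List[List[int]]], List[List[float]]],
    max_len: int = 510,
    stride: int = 128,
) -> List[float]:
    """Classify one long email: split, classify all windows at once, average."""
    windows = split_into_windows(token_ids, max_len, stride)
    chunk_probs = classify_batch(windows)  # hypothetical model call, batched
    return aggregate_mean(chunk_probs)
```

Averaging is only one possible aggregation; taking the maximum spam probability across chunks is another option if a single spammy chunk should be enough to flag the whole email.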