#llm-detect-ai-generated-text
Hi
hehe
The competition rules state:
"To the extent your Submission makes use of generally commercially available software not owned by you that you used to generate your submission, but that can be procured by the Competition Sponsor without undue expense, you do not grant the license in the preceding sentence to that software."
Does this prohibit the (extensive) use of a (fine-tuned) closed-source foundation model, such as GPT-4, which is pay-per-use? Could I use such models for generating training data or for making predictions?
hey guys beginner question here...can someone explain how submissions work in this competition?
thanks
so, to be able to submit, we need to make 2 notebooks, where the second one loads the trained model and tokenizer (if applicable) from the kaggle working space (or from hard drive?), and then infer from those on the test data?
if someone would have any pointers, much appreciated
or can I first download the model and tokenizer and then use those? (in my case distilbert-base-uncased-finetuned-sst-2-english from hf)
You gotta download them with internet on first (then download to local and upload as a dataset), then load from the custom dataset directory
Don't forget to turn internet off when submitting the notebook
ok thanks, will try that
I need a team for this project can anyone invite me
Hi, I have some experience in ML/DL and am interested in LLMs, but I don't have much experience in kaggle competitions (attended a few). Does anyone want to team up with me for this project?
Hello! If you still haven't found a team, I'd love to join up!
I'm looking for a team as well, I'd love to join up. I have some interesting text features to share, not based on word counts, or tf idf.
wanna work together. I have a decent background in ML. I know a variety of ML methods like neural networks, logistic regression, etc
Can someone help me out understanding the difference between the training data, and the data used during the submission scoring?
I'm using daigt_external_dataset.csv and train_essays_RDizzl3_seven_v1.csv as training data.
I've done a write up to better explain my problem / observation: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/455056
Identify which essay was written by a large language model
Why did the competition organizers not supply a representative test set?
Is this because they don't have enough human written essays, and generated ChatGPT essays?
I find it really silly they can't supply 50 each as test.csv and supply a much bigger test.csv during the submission scoring.
I find it equally silly the process by which this "test.csv used during scoring" is created, is not public knowledge.
Hey guys I'm thinking of using langchain/hf to make a dataset of llm essays (using the prompts from train_dataset) does anyone want to team up?
Anyone willing to share some ideas on how to improve?
My best code has 0.91 auc on the leaderboard, but it's really hard to go up from here.
Anyone willing to do a code review for feedback?
On a related note, can I see what code others used? Or is this hidden?
hi
Hi, I'm looking for a teammate. I have decent experience in NLP and training machine learning/neural network based models.
Pls DM
hey
I am new to kaggle competitions
I trained my model and saved checkpoint to /kaggle/working directory, but I cannot load it during test time
isn't /kaggle/working folder preserved during testing?
Should be fine to load it
I am also saving it and then loaded it again
Works fine for me
I have made a guide on how to train a tokenizer from scratch
interesting, I tried a lot but couldn't load from /working folder. Now I am training in one notebook and saving the weights, then uploading them to the submission notebook as a dataset..
omg...
I'm currently at place 6 but imo haven't done something special
Anyone who has used something other than tf-idf
With a decent score?
Has anyone been using ChatGPT APIs?
I'm trying something with transformers, but I don't know if it's really useful with the 512 token limit?
Hey guys check out my notebook:
https://www.kaggle.com/code/pranshubahadur/detect-llm-generated-essays-using-retention/notebook
Would appreciate any insights!
I'm not sure about this due to nature of tfidf and positional embedding... Be interesting to see though!
Interesting notebook. Of all the data sources you use, are there any with texts that are not related to the suggested 7 prompts?
Tf-idf and the war of available memory.
Yeah try the one generated using PaLM
You can also try to generate your own using llama cpp
Owh nice, yes I'm just scared to lose a lot of positions by focusing on those 7 prompts.
Have you tried this only using the dataset that is popular among the Best public score notebooks?
I used all of these datasets, made tfidf features but the score was much worse than what you did.
Using Retention?
So i don't think tf_idf works with retention / transformers since these architectures are dependent on word positions
could you link this?
No no I did not feed tfidf features to the retention algo. I used the same data as you, concatenated all the different sources. But finally had a score way worse than you have in that notebook.
It's impossible to create all the tfidf features using the same logic as found in the top Public notebooks. So I shrunk the data using truncated SVD, this enabled me to use all available data. But the score was something in the range of 0.715
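A rough sketch of the shrinking step described above, assuming scikit-learn's `TruncatedSVD` on a sparse TF-IDF matrix. The toy texts, the `ngram_range` and the `n_components` value are made up for illustration and are not the settings anyone in the chat reported:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the concatenated essay datasets (hypothetical data).
texts = [
    "cars should be limited in cities to reduce pollution",
    "the electoral college should be kept as it is",
    "limiting car usage has many advantages for citizens",
    "students benefit from working on summer projects",
    "driverless cars are coming and they raise safety concerns",
    "the face on mars is just a natural landform",
]

# Word 1-2 grams keep the toy example small; the public notebooks use 3-5 grams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X_sparse = vectorizer.fit_transform(texts)

# TruncatedSVD works on the sparse matrix directly, so it is never densified.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print(X_sparse.shape, "->", X_reduced.shape)
```

The reduced matrix is dense, so memory then scales with rows times `n_components` instead of rows times vocabulary size. Whether the lost detail explains the 0.715 score is not something this sketch can show.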
Would that make such a big difference, but I could have a look into it.
Since I doubt that would increase my score from 0.7 ish to 0.9ish I didn't go further with that
Hi everyone, since I found that many people are saying the leaderboard is highly overfitted, does anyone have a good idea how to design a good validation set for this competition?
hey stupid question everyone but how do you access the hidden test file in the submission (python) notebook?
I tried
test_samples = pd.read_csv("../input/test-essays/test_essays.csv",sep=',')
but it failed
test_set = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
thanks!
and for people who are using ngrams in the (3,5) range with the big datasets (> 40k rows) how are you dealing with the 30GB RAM limit?
I suggest having a good look at the scikit-learn API, there are some interesting parameters to tune.
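The message does not say which parameters are meant. As a guess, these are the `TfidfVectorizer` arguments that most directly affect memory: `min_df`, `max_features` and `dtype`. The values and the toy texts here are illustrative only:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts that share a few word 3-5 grams (hypothetical data).
texts = [
    "limiting car usage reduces pollution in many cities",
    "many people say limiting car usage reduces pollution",
    "the electoral college should be kept as it is",
    "some argue the electoral college should be kept",
]

# Default settings: every 3-5 gram becomes a float64 feature.
default_vec = TfidfVectorizer(ngram_range=(3, 5))
X_default = default_vec.fit_transform(texts)

# Leaner settings: drop n-grams seen in fewer than 2 documents, cap the
# vocabulary size, and store values as float32 instead of float64.
lean_vec = TfidfVectorizer(
    ngram_range=(3, 5),
    min_df=2,
    max_features=50,
    dtype=np.float32,
    sublinear_tf=True,
)
X_lean = lean_vec.fit_transform(texts)

print("default:", X_default.shape, X_default.dtype, X_default.data.nbytes, "bytes")
print("lean:   ", X_lean.shape, X_lean.dtype, X_lean.data.nbytes, "bytes")
```

Pruning with `min_df` usually removes far more features than it costs in score on large corpora, but that trade-off would need checking against your own validation set.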
hey beginner question here...can someone explain how submissions work in this competition?
I am trying to submit but it always fails
I'm doing it by loading the model and tokenizer from the kaggle input directory. During submission my notebook runs successfully but scoring fails.... Exception error comes
Please suggest me a solution someone
Why is it failing, what is the error message?
This is the error, any suggestions so that it can be resolved and I can submit
Click on it and scroll down until you find the error.
There is no error in notebook
I checked log too
how u submitted?? Give suggestions
One of the public notebooks had an if statement that looks at the length of the test file and therefore executes different code in the submission.
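A minimal sketch of that kind of branch, assuming the test file has `id` and `text` columns and the submission needs `id` and `generated`. The threshold of 5 rows and the `predict_fn` callable are hypothetical stand-ins:

```python
import pandas as pd

def make_submission(test: pd.DataFrame, predict_fn) -> pd.DataFrame:
    """Build a submission frame, skipping the heavy pipeline on the dummy file."""
    if len(test) <= 5:
        # Assumption: only the public dummy test file is this small.
        scores = [0.5] * len(test)
    else:
        # On the hidden test set the real model runs (predict_fn is hypothetical).
        scores = predict_fn(test["text"])
    return pd.DataFrame({"id": test["id"].to_numpy(), "generated": list(scores)})

# Dummy file shaped like the public test_essays.csv (3 rows, 3 columns).
dummy = pd.DataFrame({"id": ["a", "b", "c"], "prompt_id": [0, 1, 2], "text": ["x", "y", "z"]})
print(make_submission(dummy, predict_fn=lambda texts: [1.0] * len(texts)))
```

The catch is that the `else` branch only ever runs during scoring, so a bug there shows up as a failed submission with a clean notebook run.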
Did you copy a notebook?
No
So how to solve it
I do not know since I do not have your code. If you publish it on Kaggle I could have a look since private sharing of code is not allowed.
Okay will try to do it.
How u submitted??? During submission what's inside notebook anything to see!?
by the way, you're using windows, you do realise you can send images of your screen by pressing the print screen button rather than sending a photo of your computer right
like that
yes
During submission i need to make another notebook ????? in which no code for training will be there . is it like this????
???
can anyone solve my problem of submission not happening
https://www.kaggle.com/code/nidhipriya123/chexk
here is my code please check and give a solution if u can really help me in submission. i will be very thankful
it says it ran successfully, but theres no score. what happened when you submitted
exception error
notebook ran but score failed
what happens when you run it on your own test data
It gives me the output as wanted
opt/conda/lib/python3.10/site-packages/traitlets/traitlets.py:2930: FutureWarning: --Exporter.preprocessors=["remove_papermill_header.RemovePapermillHeader"] for containers is deprecated in traitlets 5.0. You can pass --Exporter.preprocessors item ... multiple times to add items to a list.
29.9s 9 warn(
29.9s 10 [NbConvertApp] WARNING | Config option kernel_spec_manager_class not recognized by NbConvertApp.
29.9s 11 [NbConvertApp] Converting notebook notebook.ipynb to notebook
30.3s 12 [NbConvertApp] Writing 10481 bytes to notebook.ipynb
31.9s 13 /opt/conda/lib/python3.10/site-packages/traitlets/traitlets.py:2930: FutureWarning: --Exporter.preprocessors=["nbconvert.preprocessors.ExtractOutputPreprocessor"] for containers is deprecated in traitlets 5.0. You can pass --Exporter.preprocessors item ... multiple times to add items to a list.
31.9s 14 warn(
31.9s 15 [NbConvertApp] WARNING | Config option kernel_spec_manager_class not recognized by NbConvertApp.
32.0s 16 [NbConvertApp] Converting notebook notebook.ipynb to html
32.9s 17 [NbConvertApp] Writing 298426 bytes to results.html
in log this is coming
my notebook ran successfully but after that submission failed~~!!!!
any solution from experts??? or all are just noob???
notebook ran successfully after that error in submission.
What's the error message? (click the ⚠️)
hey guys started generating a dataset using mistral:
https://www.kaggle.com/code/pranshubahadur/llm-generate-essay-mistral-hf-langchain
like this error coming help me
notebook runs fine after that during submision it happens
It's either an error in the notebook (check the notebook not the logs) or it's an error with the submission part of your code (try index=False when you do to_csv)
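To see what `index=False` changes, here is a small sketch with a made-up submission frame. The `id` and `generated` column names are assumed from the competition format:

```python
import pandas as pd

# Hypothetical predictions for three essays.
sub = pd.DataFrame({"id": ["a1", "b2", "c3"], "generated": [0.1, 0.9, 0.5]})

# Without a path, to_csv returns the CSV text, which makes the difference visible.
with_index = sub.to_csv()               # adds an unnamed leading column: 0, 1, 2
without_index = sub.to_csv(index=False)

print(with_index.splitlines()[0])       # ,id,generated
print(without_index.splitlines()[0])    # id,generated

# In the notebook this would be: sub.to_csv("submission.csv", index=False)
```

An extra unnamed column is one plausible reason for a scoring error after a clean run, though the chat does not confirm it was the cause here.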
Is there an explanation why the test_essays.csv file has only 3 examples, and why those examples are basically gibberish? I couldn't find an explanation on the competition page.
Hello, i hope someone can help me, I can not submit on the competition, my notebook works fine, my model and tokenizer are well loaded but i have constantly an error.
this is my error : OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like distilbertuncases_model is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
Please, help me ^^ what have i done wrong ?
Anyone ? *
Is this channel also meant for finding teams? Just started in the challenge, bringing some transformer and machine learning experience, as well as a 4090 to the table. Got 3 more weeks of vacation and looking for motivated and/or experienced team partners to exchange ideas. Feel free to reach out
4090 🫨
The competition doesn't let you use the internet.
Everything I trained so far for the competition didn't go above 14GB RAM, so it hardly matters so far. But maybe some teammate has some ideas how to put it to use
- make sure the path is correct (you can check for example with os.listdir). For example, for me the path was mymodel/mymodel
- Make sure you have the model offline (usable with Internet = Off). It's probably best to train it outside your submission notebook (because you need internet to first use it), save it locally on your device and upload it as a dataset on kaggle. That's how I am doing it and never had problems
- Make sure the model folder includes both the tokenizer and the model.
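The checklist above can be turned into a quick pre-flight check before calling `from_pretrained`. This is only a sketch: `check_model_dir` is a hypothetical helper, and the list of required files is an assumption based on the `config.json` mentioned in the error message:

```python
import os

def check_model_dir(path, required=("config.json",)):
    """Return the files found in `path`; raise if a required file is missing."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"{path!r} is not a directory")
    found = sorted(os.listdir(path))
    missing = [name for name in required if name not in found]
    if missing:
        raise FileNotFoundError(f"{path!r} is missing {missing}; found {found}")
    return found

# Usage in a notebook (the path is hypothetical; note the doubled folder name
# mentioned in the first bullet):
#   model_dir = "/kaggle/input/mymodel/mymodel"
#   print(check_model_dir(model_dir))
#   tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
#   model = AutoModelForSequenceClassification.from_pretrained(model_dir, local_files_only=True)
```

Passing a directory that does not exist makes transformers treat the string as a Hub model id, which is why the error talks about connecting to huggingface.co.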
Thank you for your advice, I will do it locally first. Thank you
the actual test set is secret. That's just a dummy file to use to run your code
why is nobody else using deberta? I got to top 25 using the existing deberta model + like 10 more lines of code
Im fairly confident I can go about 0.05 further by just retraining it as well
i think people are fine tuning the huggingface pretrained models for classification
Wait what?
What is your username?
^
I see disqualified at school
0.963
Tomorrow I'll have a look into Deberta, I abandoned the idea because I did not see any good notebooks using it.
So instead I optimized my own pipeline.
I was sick the last couple of days but nobody caught me
i used deberta, fine tuned it (without much effort etc), got a .78
again, what a weird competition where you are top 1k at .95 and top 35 at .963
I'm first in the competition
I thought you mean the deberta score is .963
Yea they are .964 - my point was just that .964 is top 21 and .960 is top 777
Im fairly confident I can go about 0.05 further by just retraining it as well
Waiting for your 0.968 score tomorrow then.
oh i completely forgot i was training that thanks for sending me the notification
... and my ssh key seems to have been deleted ...
cool
can you tell us your team name? Really looking forward to seeing the sharing of your team's deberta solution after the competition ends
Linguistic Ninjas
Has anyone tried running catboost on GPU? I get kernel crashes when trying to run a CatBoost model on GPU https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/463218
I also tried but got the kernel crash
I've tried catboost on the normal CPU cores. Performance was terrible. An earlier poster got PyCaret to work, but I haven't been able to install the library in the notebook
Hey guys I'm new and lets see if I got this correct so the simplest way of explaining this task is to say that we will build a model that inputs a text(text) and outputs a label(generated). right ?
Yeah, you're right
Duhhh I'm literally irritated with this OOM error while running catboost on GPU
thanks
Use less data. Use 2 T4 instead of a P100.
I keep getting an error that my submission is throwing an exception. I feel like my solution is the simplest there is, just hardcoded values for each line of test_essays.csv; I'd appreciate any help. Would I be ok posting 7 lines of code?
when you submit a new test dataset is provided. You have to compute predictions for that new test data. You cannot precompute predictions given you can't see this data before hitting the submit button.
Also, "for x in test_essays" is not iterating over rows (which is done by: "for idx, row in test_essays.iterrows()"), but over the columns. So you will always create 3 values, since the test essay file has 3 columns. Coincidentally, this works for the fake test data which we can access, because it also has 3 rows.
TLDR; your code has a bug - on submission you try to fill a column with ~4.5k rows with a list of 3 values
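The difference is easy to reproduce on a small frame. The data here is made up, with 4 rows so the row count no longer happens to match the 3 columns:

```python
import pandas as pd

test_essays = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "prompt_id": [0, 1, 0, 1],
    "text": ["essay one", "essay two", "essay three", "essay four"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
cols = [x for x in test_essays]
print(cols)   # ['id', 'prompt_id', 'text']

# iterrows() yields (index, row) pairs, one per row.
ids = [row["id"] for _, row in test_essays.iterrows()]
print(ids)    # ['a', 'b', 'c', 'd']
```

With 4 rows, filling a column from `cols` would fail with a length mismatch, which is the same kind of exception the hidden test set triggers.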
Hi, just entered this competition.
Seems like the best public results (disregarding efficiency) are achieved by training a few classical ML models on top of TF-IDF encodings of tokens. Not BERT, or any other deep learning model. Is that right? It doesnt seem right.
I would expect a pre-trained and fine-tuned BERT, or a decoder model (llama, mistral, etc.), to perform much better than an ensemble of classical models trained on top of TF-IDF.
this has been part of many discussions so far (even a couple messages earlier in here)
- decoder models are not fit for classification tasks, at least not to the degrees of encoder-only transformers or classic ML classifiers
- the competition used A LOT of LLMs - but only for generating data. we are at a point where we went from a few human-generated training rows to over 60k AI rows and over 25k human rows (congrats to the community for that)
- the main question: why does classic ML outperform transformers? no real clue. on paper (at least in my opinion), the transformer embeddings should outperform tf-idf. However, it is worth noting that the used tf-idf embeddings are based on n-grams (3-grams to 5-grams), which means we are closer to positional meaning than pure word-based tf-idf. I strongly believe the reason to be that the n-grams tell more about the GenAI behavior (as in which texts are generated) than the pre-trained transformer embeddings + attention etc.
3b) ensembles are in most situations increasing performance - if you are using transformers, maybe use a transformer ensemble too (e.g., finetune BERT + finetune DeBERTa + finetune RoBERTa + finetune Electra + maybe finetune an existing finetuned model for essays / scientific articles etc. )
You are obviously more than welcome to prove all current top leaderboard submissions wrong with the use of transformers - but for me even a 2 layer NN has been outperforming a fine-tuned deberta3 large by almost .12
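For readers who have not seen the classic-ML baseline being discussed, a minimal sketch of word 3-5 gram TF-IDF feeding a Multinomial Naive Bayes classifier could look like this. The toy texts, labels and `alpha` value are assumptions for illustration, not the settings of any leaderboard notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: label 1 = generated, 0 = human-written.
train_texts = [
    "i think cars are bad because my town has too much traffic",
    "my mom says we should walk more and i think she is right",
    "in conclusion, limiting car usage offers numerous significant benefits",
    "overall, it is important to note that limiting car usage offers numerous advantages",
]
train_labels = [0, 0, 1, 1]

# Word 3-5 grams capture short phrases rather than single-word counts.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(3, 5), sublinear_tf=True),
    MultinomialNB(alpha=0.02),  # alpha is a hypothetical smoothing value
)
model.fit(train_texts, train_labels)

test_texts = ["it is important to note that limiting car usage offers numerous benefits"]
probs = model.predict_proba(test_texts)[:, 1]
print(probs)
```

On real data the n-gram vocabulary gets very large, which is where the memory problems discussed elsewhere in the channel come from.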
@ruby chasm Thanks for all the info! Super helpful!
What do you mean "existing deberta model"? Are you referring to a public kernel or just the base pre-trained deberta from huggingface?
The one on kaggle
That is 0.833
I have since changed approach quite heavily
Thanks for the info @ocean flame
Can someone link to the dataset used in this kernel: https://www.kaggle.com/code/carrot1500/distilbertclassifier-from-scratch-with
/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv
Also are there better datasets/versions people are using?
Edit: found it https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset
What is the meaning of the bool column "RDizzl3_seven" in the daigt-v2 train dataset?
Thank you! That solved my problem!
hey shatz - sorry for replying somewhat late, we had some holidays. as i see nobody else replied: this will give you some insights:
https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/453410
@ruby chasm Thanks! Still am a bit confused. So if a row in the daigt-v2 train dataset has RDizzl3_seven=True it means that it came from this "7 prompts" dataset? Meaning the 7 prompts dataset is a subset of the daigt-v2 train dataset
correct?
the daigt v2 is comprised of multiple data sets, the rdizzl3 set is a separate set. that is included in daigt v2 iirc
gotcha! Thanks!!
Has anyone experimented in max_features for TfidfVectorizer? Does limiting the number of features impact score significantly?
I think I experimented with that once, but not much changed
experimenting with it right now. it definitely impacts the runtime of the notebook a lot
Wow .982 in the leaderboard 💯 @noble tendon found some secret sauce indeed!!
Damn
yup, faster runtime and also lower score. Got .875 when going to a mere 1000 features.
with the .961 baseline?
How are you guys balancing the OOM error? Its my first time turning towards the .961 baseline and whenever i add more data than the daigt v2 dataset im going OOM
I had the same reaction when you passed me the first time.
I went from a one stage pipeline to a 2 stage pipeline. I'm trying to combine multiple methods now but they're hardly increasing my score.
With multiple I mean more than 2 btw.
yup
I am having problems submitting my notebooks. The submission remains as "scoring" for several hours and then it does not submit. Does anyone have the same problem?
just want to share a funny AI-generated essay on "phones and driving"
There are many people out in this world that get along just fine and there are some that are totally unhappy and unlucky in love, well that is not the case in this story because I am not unhappy at all in love. I have been married to my high school sweetheart for ten years. My love and I have four children that are all above the age of six. I am at home with the kids while my wife is working. However, not all is peaceful at the Green household. Someone always has an attitude and is mad at another person and that's not me but the kids. There are so many things that they get on each other's nerves and love to pick on each other. When they get on each other's nerves, they turn on each other and just can't seem to get over the things that they say.
One day in the afternoon, I had to take my oldest son on his annual Physical for his football team. The physical was at four o'clock in the afternoon. The quarterback is my son and he has to pass this physical or he will not be able to play football this season. When I dropped my oldest son off to be checked out by the doctor, I went to pick up my wife from work and then we were going to go on to the doctor's office together. My daughter and daughter-in-law were working for the day so it was going to be a long drive. The doctor's office was about ten minutes away from where we worked. When I got home, I told my wife and oldest son about the doctor's appointment and told them to get ready and we were heading out.
I drove for about fifteen minutes trying to get us to the doctor's office. On the way there, my cell phone started to ring. When I looked to see who it was, it was my wife who I just said to meet me at the doctor's office. I stopped in the middle of the road because I was in my mind and not looking at the road. When I turned around, I notice that I was in the middle of the road and there was a SUV right behind me. If I would have looked up a few more seconds, I would have been rammed and possibly killed. I would have killed someone with my own stupidity. To this day, I look back to that day and wonder if my life could have changed by using my cell phone less.
As you can see, there are many reasons why one should not drive while talking on the phone. You can run into things and get distracted and crash into something or somebody. Also, it is against the law in Iowa to talk on the phone and text while driving. For these reasons, people should not be allowed to talk on their phone while driving.
^^ it's educational
When trying to incorporate public notebooks into the pipeline
I tried most things
Omitting n_jobs
Deleting all variables
GC.collect()
Can someone enlighten me about some psychology behind publishing solutions?
I understand why competitors in the top 20 or so do not publish their solutions. They are clearly there to win.
But, considering the competitors with a score of 0.963. Would they not benefit more from publishing their solution and getting a ton of upvotes than trying to actually win the competition?
^ hope I dont sound demeaning or something btw. Totally cool to not publish solutions haha. Just curious behind the thought process of those with 0.963.
How about you achieve that score and see what you would do then?
Well, if you have a unique idea and are scoring 0.963 and believe you have room for improvement, I wouldn't share the code.
Besides that, it is unknown whether these people have used public notebooks as well in their pipeline
So there is still a possibility they're in for the gold.
@lofty mirage i just made a post on oom which might help with your problems
@lofty mirage makes sense thanks! I guess it depends on the situation. Haven't been in it myself yet haha
I see thank you for the tips.
Does anyone have insight as to how the VotingClassifier's estimators were figured out in the .961 and .962 kernels?
Is it just trial and error or does the author have a sophisticated hyperparameter optimization pipeline?
probably both
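For reference, the trial-and-error variant can be as simple as looping over a few weight tuples for a soft-voting ensemble and keeping the best validation AUC. This sketch uses synthetic count data and three scikit-learn estimators chosen for illustration. They are assumptions, not the estimators or weights of the .961/.962 kernels:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features; the first 5 depend on the class.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
lam = np.ones((400, 20))
lam[:, :5] += y[:, None]
X = rng.poisson(lam)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

def build(weights):
    """Soft-voting ensemble; every estimator must expose predict_proba."""
    return VotingClassifier(
        estimators=[
            ("mnb", MultinomialNB(alpha=0.02)),
            ("sgd", SGDClassifier(loss="modified_huber", max_iter=2000, random_state=0)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
        weights=weights,
    )

# Plain trial and error over a handful of hypothetical weightings.
results = {}
for weights in [(1, 1, 1), (2, 1, 1), (1, 2, 1), (1, 1, 2)]:
    clf = build(weights).fit(X_tr, y_tr)
    results[weights] = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])

best = max(results, key=results.get)
print(best, round(results[best], 4))
```

Picking weights this way against the public leaderboard instead of a held-out split is one route to the overfitting people mention in this channel.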
The one with 0.963 with deberta clearly thinks he can get a better score. And if you look at LB, you don't need to improve a lot to be in gold zone. Ofc this assumes little to no shakeup.
Sharing the exact code has one problem. It will lead to lots of forkmittings, where many people just directly copy and submit without making any modifications. Sharing the approach only gives ideas but forces people to actually put in the work to implement
and there's only 3 weeks left in the comp
Also they had quite the operation going. There is no way we are able to replicate this within the last 17 days. They did not share code but just how transformers may work - which I think is really good and what kaggle is for
I guess youre referring to the deberta with 0.963. Can you link the reference? I didnt see this
btw, have you tried more of this? im currently struggling to get the catboost going. for some reason its even going oom when copying it from notebooks that already scored on lb..
Have not tried anything in between 1000 features and whatever the default is (I guess unlimited?). If i do ill update.
Yep, unlimited atm
Strange to get OOM when running stuff that already scored
Maybe unlimited is just barely on the edge of max memory allotment
its very frustrating. since it also takes forever..
thanks for the quick reply tho
Sure, thanks to u too haha
So I am not the only one getting 9 hour runtimes on the submission stage lol
which is wild, because if I remove boost from my planned pipeline, it takes less than 30minutes
catboost takes too long to train but I can't even use gpu without getting an oom :(
has anyone successfully used catboost with gpu?
trying it rn
I just found this: https://github.com/catboost/catboost/issues/2192
Problem:When i use catboost to train my dataset on GPU.It reports the error:"out of memory. requested 1795MB; free 595 MB".So i want to use two GPUs to solve this error by set parameters ...
according to that calculation, the ~9200 samples have ~8M 3-5ngrams (features) and we have training set of ~45k rows. That means 432_000_000_000 Bytes. Thats approx. 402 GB RAM
just ran an xgb model taking 3hrs
@ruby chasm 10k feats results in 0.901
on the whole ensemble or just boost?
oh wait nvm, I also removed MultinomialNB
i did a 42k multinomial that scored .909
yea, just multinomial
any tips for local validation? I gotta stop using kaggle submissions for validation
was thinking just to hold out the official training data prompts for validation set:
val_df = train[train['prompt_name'].isin(["Car-free cities", "Does the electoral college work?"])]
train_df = train[~train['prompt_name'].isin(["Car-free cities", "Does the electoral college work?"])]
we are fairly certain those prompts are not in the test set. so i cant tell you how good that idea works
i have been just doing normal CV (which is not .999 for me atm) - selecting folds in a way to get close to 9k test samples... this probably aint a good / the best way, but i dont have a good validation set
My intuition says that at least those prompts should be within the same "distribution" as the test set (assuming they are a subset of the original test dataset). That being said, they are just too easy haha. My local validation and LB are way off
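One way to generalise the prompt hold-out idea is to group the cross-validation folds by prompt, so no prompt appears on both sides of a split. A sketch with scikit-learn's `GroupKFold`, assuming a frame with `text`, `label` and `prompt_name` columns. "Prompt C" and "Prompt D" are placeholder names:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Made-up training frame with four prompts of three essays each.
train = pd.DataFrame({
    "text": [f"essay {i}" for i in range(12)],
    "label": [0, 1] * 6,
    "prompt_name": (
        ["Car-free cities"] * 3
        + ["Does the electoral college work?"] * 3
        + ["Prompt C"] * 3
        + ["Prompt D"] * 3
    ),
})

gkf = GroupKFold(n_splits=4)
folds = []
for tr_idx, va_idx in gkf.split(train, train["label"], groups=train["prompt_name"]):
    tr_prompts = set(train.iloc[tr_idx]["prompt_name"])
    va_prompts = set(train.iloc[va_idx]["prompt_name"])
    folds.append((tr_prompts, va_prompts))
    print("validate on:", sorted(va_prompts))
```

This measures how well a model transfers to unseen prompts, which may or may not match how the hidden test set was built.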
How is anybody able to submit on catboost? Im even going OOM on it with max_features=1M
If you have N samples in your training data and M features then catboost will use N x M x 4 bytes (I assume it uses FP32). Then compare this with your memory to get the number of features you can afford.
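That back-of-the-envelope rule is easy to script. The sketch assumes a dense matrix with 4 bytes per value, as the message does. CatBoost's real footprint may differ, so treat the numbers as a rough guide:

```python
GIB = 1024 ** 3

def dense_bytes(n_rows, n_features, bytes_per_value=4):
    """Memory for a dense n_rows x n_features matrix (FP32 by default)."""
    return n_rows * n_features * bytes_per_value

def max_features_for_budget(n_rows, budget_bytes, bytes_per_value=4):
    """Largest feature count whose dense matrix still fits in the budget."""
    return budget_bytes // (n_rows * bytes_per_value)

# The 40000 x 7300000 matrix mentioned in the chat, in GiB.
print(dense_bytes(40_000, 7_300_000) / GIB)

# Features affordable for ~45k rows if the whole 30 GiB went to this matrix.
print(max_features_for_budget(45_000, 30 * GIB))
```

Under these assumptions roughly 179k features fit for 45k rows, before counting anything else the notebook keeps in memory.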
@velvet vortex am i correct to assume i have the same Hardware limitations during submission? I even copied the model from a notebook that was already scored on LB, but when I try it locally it keeps on failing. this is logically beyond my understanding. I have tried it with less features AND less data than the Notebook I copied it from. only reasoning for me would be having more than 30GB RAM during submission
Are you executing it in the same environment?
I am copying the notebook 1:1 but wanted to test it inside editor first, where it crashes. I never submitted it since I dont want to waste submissions.
What I did: fake a test set, reduce train size (smaller than in the notebook i copied from), set max_features. I cant see how this would go oom
Are you guys referring to max num features for tf-idf? Or is there a parameter for catboost that limits the number of features it makes?
nope max feature of tfidf
Im surprised it makes 1M+ features
did you check that?
we should run it on the train set and see that it even makes that many. Maybe your max is greater than what tfidf needs
when I create a fake test set with ~9k subsamples from the train set, it created ~7M tfidf features (i.e., my training matrix is 40000x7300000). I am not sure how many features cat creates, can never check, its always oom
on train set i think it was like 60-80M features from tfidf
oh damn i stand corrected haha
catboost has this param: used_ram_limit=None
interesting 🧐
also: per_float_feature_quantization=None
used ram limit didnt help
this sounds interesting, didnt see it on performance thingies
Another interesting direction: KBinsDiscretizer
https://scikit-learn.org/stable/auto_examples/cluster/plot_face_compress.html
The example they show doesnt actually use less memory, but at the end they do say that it would use less memory if it started out as float64 (their starting point is int8)
not sure if sklearn pipeline would take advantage of it anyway. They could cast to float64 immediately upon training
interesting, thanks for sharing
also I looked at the docs for catboost per_float_feature_quantization. have absolutely no idea what the hell it does
seems like you need to understand catboost
Oh wait I understand something
border_count: The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively. Default: 254 on cpu, 128 on gpu
per_float_feature_quantization changes the amount of "borders" which is splits for a feature
Basically we would need to make the per_float_feature_quantization less than 254 for every feature
and we might save some RAM
i despise catboost in this competition
something like: per_float_feature_quantization=[f'{i}:border_count=50' for i in range(num_feats)]
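Spelled out, the idea could look like this. It only builds the parameter values and does not train anything. The value 50 comes from the message above, and the note about the global `border_count` is an assumption based on the docs quoted earlier:

```python
def quantization_specs(num_feats, border_count=50):
    """One 'index:border_count=N' entry per float feature."""
    return [f"{i}:border_count={border_count}" for i in range(num_feats)]

specs = quantization_specs(3)
print(specs)  # ['0:border_count=50', '1:border_count=50', '2:border_count=50']

# Per-feature form, as proposed in the chat.
params_per_feature = {"per_float_feature_quantization": specs}

# Assumption: when every feature gets the same value, the global border_count
# parameter expresses the same thing without a list of millions of strings.
params_global = {"border_count": 50}

# Either dict could then be unpacked into the model (not run here):
#   CatBoostClassifier(**params_global)
```

Whether fewer borders actually avoids the OOM was not confirmed by anyone in the channel.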
I can see messages talking about scores of more than 0.95. I was just wondering if these guys have generated artificial data via LLMs, as I can see only 3 data points for this class in the original data
I think so yes https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset, or discussed in https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/465882 this author even created a dataset of 500k data points
Check out public notebooks shared in the code part of the competition. Most notebooks use this amazing dataset, which seems to work great out of the box https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset
Woaaha, now I'm getting 0.960 without catboost also
And it just takes 2 hours to run
cool
Okay so I have one question
So I collected my own data as well
Will it be good to fit my vectorizer on my own data
Or just fit it on the publicly available data of drcat
would be nice if you share approach. I cant beat .95 without catboost
the .962 Baseline that was shared without CatBoost is .955 in about 2 hrs
Hi, I'm a newbie and I just wanted to give some of my thoughts about why classic ML performs so well, or even better than LLMs. If there's any point that is wrong, please correct me
About point 3 from Valentin's discussion, "on paper (at least in my opinion), the transformer embeddings should outperform tf-idf": I think we should clarify on which task transformer embeddings outperform tf-idf. If the task is language modeling, then yes, transformer embeddings should be better because they are specifically trained to learn human knowledge, while tf-idf features are just frequency statistics. But I don't think this should hold for the task of LLM-generated text detection. In my opinion, classic ML performs better than LLM-based models because the tf-idf features capture WRITING STYLE, which I believe is what sets LLM-generated text apart from student essays. ChatGPT will likely give an essay that sounds similar to a news article, which can sound more formal, with a richer vocabulary than student essays. This can be captured efficiently by tf-idf features, while LLM embeddings will be affected by the model's human knowledge, which is not the focus of this task and will make the LLM perform worse (unless there is much more data, as in the BERT solution with 500k augmented data points from https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/465882).
Another point I want to mention is that using LLMs to detect LLM-generated text might not be a sustainable solution in the long run, as LLMs are trained to generate human-like text. It's just a matter of time before LLMs are able to generate fully human-like text, and eventually LLM-based detection will no longer work
Guys
Is there a way to load large pickle files
I dumped it from joblib
Its size is 5.5 GB
But when I load it on kaggle my notebook crashes
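One thing that may help with the large joblib file, as a sketch only: `joblib.load` accepts a `mmap_mode` argument, which memory-maps numpy arrays from an uncompressed dump instead of copying them into RAM. The file name and the dict below are made up, and this only helps if the object is mostly numpy arrays and was dumped without compression.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical stand-in for the big object: a dict holding one large array.
big = {"weights": np.arange(1_000_000, dtype=np.float32), "name": "demo"}

path = os.path.join(tempfile.mkdtemp(), "model.joblib")  # made-up file name
joblib.dump(big, path)  # no compress= argument, so arrays can be memory-mapped

# mmap_mode="r" maps the arrays from disk instead of reading them into memory.
loaded = joblib.load(path, mmap_mode="r")
weights = loaded["weights"]
print(type(weights).__name__, weights.shape)
```

If the pickle is mostly Python objects (for example a vectorizer with a huge vocabulary dict), memory-mapping will not reduce memory use, and shrinking the object itself is probably the only option.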
I think tf-idf is just highly overfitted
heya, how's it going with deberta? I have been using distilbert and deberta-v3-small (I don't have the resources for deberta-v3-base or deberta-v3-large), as well as my own hand-rolled 2 or 3 layer LSTM/FF NN with attention. I'm getting around 0.8 for distilbert and my hand-rolled NN, and less than 0.8 for deberta. I trained deberta and distilbert for 1 or 2 epochs with lr in the 1e-6 to 1e-5 range
from what i've been doing, the LLM/NN approaches are outperforming tfidf on my own train/test split from the train_v3_drcat_02.csv that's publicly available on kaggle, but tfidf outperforms LLM/NN on the competition test data
arbitrary text statistics (potentially as a result of unfortunate data collection) > text semantics & intricacies
i still don't think tfidf is overfitting in the context of the Kaggle use case, if you define "new data" as the private LB.
But I also believe that the .963 Transformer solution group will climb some ranks on the private LB. Probably top 5-10
my best performing submission so far is tfidf + multinomial naive bayes (0.918), using distilbert I get like 0.8 something
but i get 0.98 something ROC AUC with distilbert on my own 80/20 train test split (use 3 of the prompts for validation and 12 for training)
if i can't get the NN approach to work in the next few days I'll just focus on getting higher with tfidf using ensemble methods
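A prompt-level hold-out like the one described above (3 prompts for validation, 12 for training) could be sketched with scikit-learn's `GroupShuffleSplit`. The toy arrays and prompt ids are made up; only the 12/3 split comes from the message.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in: 15 hypothetical prompts, 4 essays each.
prompt_ids = np.repeat(np.arange(15), 4)
texts = np.array([f"essay {i}" for i in range(len(prompt_ids))])
labels = np.tile([0, 1], len(prompt_ids) // 2)

# Hold out 3 whole prompts so no prompt appears on both sides of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=3, random_state=0)
train_idx, valid_idx = next(splitter.split(texts, labels, groups=prompt_ids))

train_prompts = set(prompt_ids[train_idx])
valid_prompts = set(prompt_ids[valid_idx])
print(len(train_prompts), "train prompts,", len(valid_prompts), "validation prompts")
```

Even with prompts kept apart, the validation essays still come from the same generators and collection process as the training essays, which may be one reason a 0.98 local AUC does not carry over to the leaderboard.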
have you looked into the confidence of predictions? I have only trained 1 model so far for the competition, but it was extremely overconfident. All predictions were either 0 or 1. If you find a way to reduce this, your model is probably going to perform better (my deberta was .86-ish)
I will also give it another try soon
damn..whats with the dislikes?..
your score is .86-ish when you submit to the kaggle comp? did you evaluate it on your own test data?
i don't know what the prediction probabilities are for the competition test data but based on the ROC AUC it should be close to 1 or 0 for most of the data and closer to .5 for the misclassified
exactly. Same with SGD. It's like 25-30 min iirc and a bit better
yes on own data it was like .99 or such - by far the most overconfident model with the highest CV LB gap
for me there were no preds between 0.05 and 0.95 I think. the model was very overconfident
I just found something weird
Just split my train set into two (test size is 9000)
To replicate the submission environment. (The code works properly without OOM when submitted)
But for some reason I get TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. (basically OOM) when running my code (not submitting)
I don't really know where... but I remember someone saying that when we submit, the model gets run on both the public and private test sets?
it works perfectly when i set the test size to 4500
Guess the competition only runs the subset for public scoring?
it runs on the whole test set, as always
hmm
then I guess my code was wrong
What was your highest public score for the deberta? Mine was less than .8 and for distilbert it was .83 run with less than 3 epochs. It seems like the LLMs are strongly fitting to my train data, generalising well to my validation data (but much worse than the training) and generalising less well to the competition data
i've also been getting a weird problem where I edit the notebook and submit several times but the score is the same for all the submissions after editing
like when i look at the notebooks for the submissions they are clearly different versions but they all have the same score
I tried concatting spaCy embeddings with the TFIDF features, feel free to check it out here: https://www.kaggle.com/code/danshatzz/spacy-embeddings-tfidf-ensemble
^ normalizing the embeddings similar to tfidf features improved the score a bit. But I had to remove Catboost due to out of time error, so it doesn't beat baseline
I also tried the same with BERT, got worse score
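For anyone wanting to try the same idea, a minimal sketch of concatenating normalized dense embeddings with TF-IDF features. This is not the code from the linked notebook: the embeddings are random stand-ins, and L2 normalization is an assumption about what "normalizing similar to tfidf" means.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

texts = ["the cat sat on the mat", "dogs chase the ball"]  # made-up essays

# Stand-in for dense document embeddings (e.g. from spaCy); random here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(texts), 8))

tfidf = TfidfVectorizer().fit_transform(texts)  # rows are L2-normalized by default
emb_unit = normalize(embeddings, norm="l2")  # put embeddings on the same scale

# Stack sparse TF-IDF columns and dense embedding columns side by side.
features = sparse.hstack([tfidf, sparse.csr_matrix(emb_unit)]).tocsr()
print(features.shape)
```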
yeah with catboost you can't do (3,5) ngrams if you train the vectorizer on the full training data. I saw some notebooks where they train the vectorizer on the hidden test data instead
Someone please clarify, do we have to predict labels (like 0 or 1) or probabilities (like 0.8, 0.4, etc.)?
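A rough sketch of the "fit the vectorizer on the test texts" trick mentioned above, using scikit-learn's `TfidfVectorizer`. The texts are made up, and word-level n-grams with `sublinear_tf=True` are assumptions, not settings confirmed by any particular notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the training essays and the hidden test essays.
train_texts = [
    "the cat sat on the mat",
    "dogs chase the ball in the park",
    "students write essays at school",
]
test_texts = ["the cat chased the ball", "essays about the park"]

# Fit only on the test texts, so the vocabulary is limited to n-grams found there.
vectorizer = TfidfVectorizer(ngram_range=(3, 5), sublinear_tf=True)
X_test = vectorizer.fit_transform(test_texts)
X_train = vectorizer.transform(train_texts)  # train-only n-grams are dropped

# For comparison: vocabulary size when fitting on everything.
full_vocab = TfidfVectorizer(ngram_range=(3, 5)).fit(train_texts + test_texts)
print(X_train.shape, X_test.shape, len(full_vocab.vocabulary_))
```

The smaller vocabulary is what keeps the feature matrix, and so the CatBoost training time, manageable. The trade-off is that the features are tailored to this one test set.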
Probabilities
you can predict either, but auc is very kind in terms of probabilities.
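To illustrate why probabilities are the safer choice under AUC, here is a small pure-Python sketch with made-up predictions. AUC depends only on how the scores rank positives against negatives, so a monotonic rescaling leaves it unchanged, while thresholding to 0/1 creates ties that usually cost some score.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.30, 0.60, 0.55, 0.80, 0.95]  # made-up predictions

hard = [1 if p >= 0.5 else 0 for p in probs]  # thresholded 0/1 labels
squashed = [p ** 3 for p in probs]  # monotonic rescaling, same ordering

auc_probs = roc_auc(labels, probs)
auc_hard = roc_auc(labels, hard)
auc_squashed = roc_auc(labels, squashed)
print(auc_probs, auc_hard, auc_squashed)
```

In this toy case the probabilities and their rescaled version score the same, and the hard labels score lower.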
How much score are you getting? BERT gives me 0.825
Hey, Does anyone know how to solve
CatBoostError: /src/catboost/catboost/libs/data/quantization.cpp:2416: All features are either constant or ignored.
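Going only by the message text, the error means every feature column passed to CatBoost is constant (or ignored). A quick check before fitting, sketched with scipy on a made-up matrix; `non_constant_columns` is a hypothetical helper, not part of CatBoost.

```python
import numpy as np
from scipy import sparse


def non_constant_columns(X):
    """Return indices of columns that take more than one value across rows."""
    X = sparse.csc_matrix(X)
    col_max = X.max(axis=0).toarray().ravel()
    col_min = X.min(axis=0).toarray().ravel()
    return np.flatnonzero(col_max != col_min)


# Made-up feature matrix: column 0 is all zeros, column 1 is all ones, column 2 varies.
X = sparse.csr_matrix(np.array([[0.0, 1.0, 0.2],
                                [0.0, 1.0, 0.0],
                                [0.0, 1.0, 0.7]]))
keep = non_constant_columns(X)
print("usable columns:", keep)
```

If this comes back empty for the training matrix, one possible cause (a guess) is a vectorizer vocabulary that shares nothing with the training texts, for example when it was fitted on a tiny placeholder test file.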
Will the results be disclosed at midnight UTC sharp? Or will there be a period of gestation?
it is immediate at 23:59 UTC
but these will be provisional results. Kaggle may (and probably will) remove some teams for cheating.
Removal happens in the next few days, it is not immediate.
yessir! at least top3 is somewhat expected..
I am top 350 instead of top 100 because of bad selection.. i guess its a learning experience
like, if you keep everything constant, and just replace your CAT model with an LGB model, your score goes from 0.91 to 0.86
I haven't seen that much diff between CAT and LGB before
Not sure why cat would overfit more than other boosters tho
We had one sub with only CAT and one sub with only LGB (with MNB and SGD) and the difference was massive
Even though the public score was almost the same...
gg everyone! see yall in the next LLM-ish competition!
Has anyone here done transfer learning with Google Gemini as the base model?
what can you tell me about the work?
would it be possible to share a link to the notebook??