#cafa-6-protein-function-prediction
Hi, the test superset is a large set of proteins that we ask participants to predict GO terms for. The actual test set (or evaluation set) is unknown until the Final Evaluation Date (June 1, 2026), because it consists of the proteins that accumulate new GO terms between the submission deadline (Feb 1, 2026) and evaluation time (June 1, 2026). The leaderboard is evaluated on a small set of proteins whose GO terms are provided to us by the UniProtKB team but are not available in UniProtKB or other public databases.
I was wondering whether changes to the MLP architecture will have a big impact on the results.
I am using protein embeddings followed by an MLP with layer sizes (embedding_size, 864, 782, num_classes). Got a score of 0.186,
but
Train - Loss: 0.0021 | ROC-AUC: 0.9567 | AUPRC: 0.3414
Val - ROC-AUC: 0.9484 | AUPRC: 0.3136
Should the focus be on postprocessing or something else?
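For readers following along, here is a minimal, dependency-free sketch of the architecture described above, with layer sizes (embedding_size, 864, 782, num_classes). The message does not give embedding_size or num_classes, nor the activations or initialisation, so the ReLU hidden layers, sigmoid output, He-style scaling, and the toy input and output sizes below are all assumptions. In practice this would be written in a deep learning framework; plain Python is used only to keep the sketch self-contained.

```python
import math
import random


def init_mlp(layer_sizes, seed=0):
    """Build one (weights, biases) pair per consecutive pair of layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        scale = math.sqrt(2.0 / n_in)  # He-style scaling (an assumption)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def forward(layers, x):
    """ReLU on hidden layers, sigmoid on the output: one score per GO term."""
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        x = [sigmoid(v) for v in z] if i == last else [max(0.0, v) for v in z]
    return x


def count_parameters(layer_sizes):
    """Weights plus biases across all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Toy stand-ins for embedding_size (16) and num_classes (10);
# the hidden sizes 864 and 782 are the ones from the message.
sizes = [16, 864, 782, 10]
model = init_mlp(sizes)
scores = forward(model, [0.1] * sizes[0])
print(len(scores), count_parameters(sizes))
```

The sigmoid output treats each GO term as an independent binary label, which is one common way to frame multi-label prediction, not necessarily what the poster used.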
In CAFA5 and previous rounds, top-performing methods often integrated multiple information sources (structures, textual descriptions, literature text mining), so you can try integrating this kind of information to improve your method: https://www.kaggle.com/competitions/cafa-6-protein-function-prediction/discussion/612582
Hello! Quick question: Is the full source code to the CAFA5 winner's solution (GoCurator, NetGO4.0) available somewhere? I cannot find it.
Do they not have to publish the full source code because they make the tool freely available to run via a webserver? Or am I missing something?
Winners of CAFA5 were required to provide source code to Kaggle, but were not required to open-source their solutions. Only their solution write-ups and explanation videos are public.
Ah, in that case I misunderstood that rule. Thank you for clearing it up!
Hi
Research Collaboration Opportunity – Clinical Proteomics × LLM Reliability
We are conducting a study titled “Hallucination Risks of Large Language Models in Clinical Proteomics.”
The project systematically evaluates how models such as GPT-4, Claude, and Gemini perform when interpreting clinical proteomics data, focusing on hallucination frequency, error patterns, and reliability assessment.
Our results indicate that even frontier models generate 27–35% factual errors, rising to over 50% for complex or rare-protein queries.
These findings highlight the significant reliability and safety challenges of applying LLMs in biomedical contexts.
We are seeking a collaborator who:
• Has experience working with proteomics or mass-spectrometry datasets
• Understands LLM architectures, evaluation frameworks, or AI safety
If this aligns with your expertise or interests, feel free to contact me or reply here.
Code, datasets, and the full evaluation pipeline are available on GitHub:
https://github.com/olaflaitinen/llm-proteomics-hallucination
Tags: #AIResearch #Bioinformatics #LLM #ClinicalData #ResearchCollab
Hi
Hi, how do you usually store ESM embeddings? When I save them to Google Drive, loading them takes forever.
Hello everyone, not sure if this has been answered already or if this is a stupid question: after running the CAFA evaluator script, is the final score just an average of the f_micro values?
The CAFA evaluator script returns a few result files, one of which is "evaluation_best_f_micro_w.tsv". In this file you will find the best F-score in each of the three subontologies (biological process, molecular function, cellular component), and the score of the method is the average of these 3 F-scores. In addition, the CAFA evaluator is run 3 times, once for each subset of proteins (no-knowledge, limited-knowledge, partial-knowledge); each subset has a different ground truth file and produces a separate result file. The final score is a 9-way average of the 9 best F-scores (across 3 subontologies and 3 subsets of proteins).
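The 9-way average described above can be sketched roughly as follows. This assumes you have already read the best F-score for each subontology out of each run's "evaluation_best_f_micro_w.tsv"; the key names, the `final_score` helper, and the numbers are hypothetical placeholders, not real results or part of the evaluator.

```python
from statistics import mean

# Labels are illustrative; the evaluator's own column values may differ.
SUBONTOLOGIES = ("biological_process", "molecular_function", "cellular_component")
SUBSETS = ("no_knowledge", "limited_knowledge", "partial_knowledge")


def final_score(best_f):
    """Average the best F-score over every (subset, subontology) pair.

    best_f maps (subset, subontology) -> best F-score taken from that
    subset's evaluation_best_f_micro_w.tsv.
    """
    pairs = [(s, o) for s in SUBSETS for o in SUBONTOLOGIES]
    missing = [p for p in pairs if p not in best_f]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return mean(best_f[p] for p in pairs)


# Made-up numbers purely to show the call.
example = {(s, o): 0.30 for s in SUBSETS for o in SUBONTOLOGIES}
example[("no_knowledge", "molecular_function")] = 0.48
print(round(final_score(example), 4))
```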
Thank you for the info, @scenic haven
hello. I have a question about this competition, are we supposed to form teams?
It's not necessary, and that applies to every competition on Kaggle.
First time popping by to see if anyone is active here
I've been using an approach focused on engineering useful features from the data.
E.g., here's a breakdown of the length-6 subsequences most identifiable with GO:0005515:
```
rank  subseq  count   rel_freq
1     QQQQQQ  198437  4.982011
2     EEEEEE  128135  3.216991
3     PPPPPP  127984  3.213200
4     AAAAAA  98647   2.476657
5     SSSSSS  77090   1.935442
6     GGGGGG  44055   1.106056
7     HTGEKP  36445   0.914998
8     NNNNNN  22862   0.573979
9     TGEKPY  22347   0.561050
10    LLLLLL  12581   0.315862
11    EEEEED  11286   0.283349
12    IHTGEK  9619    0.241497
13    DEEEEE  9443    0.237078
14    DDDDDD  8181    0.205394
15    RIHTGE  7537    0.189226
16    ECGKAF  6996    0.175643
17    HRDLKP  6042    0.151692
18    RTHTGE  5491    0.137858
19    EEDEEE  5454    0.136930
20    DFGLAR  5371    0.134846
```
Some of these are not strong indicators, because they are also identifiable with other GO terms.
For comparison, note that QQQQQQ is not a common identifiable subsequence for GO:0005516 (however, GO:0005516 needs more data collection):
```
rank  subseq  count  rel_freq
1     NNNNNN  62     0.533104
2     EEEEEE  52     0.447120
3     ESGAGK  17     0.146174
4     GESGAG  17     0.146174
5     NSSRFG  17     0.146174
6     SGAGKT  17     0.146174
7     SSRFGK  17     0.146174
8     LLEKSR  16     0.137575
9     YLLEKS  16     0.137575
10    EAFGNA  15     0.128977
11    RCIKPN  15     0.128977
12    HRDLKP  15     0.128977
13    SGESGA  15     0.128977
14    FGNAKT  14     0.120378
15    AFGNAK  14     0.120378
16    GAGKTE  14     0.120378
17    NSFEQF  13     0.111780
18    LEAFGN  13     0.111780
19    SSSSSS  12     0.103181
20    EQFCIN  11     0.094583
```
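A table like the ones above could be produced with something along these lines. This is only a sketch: the poster did not share their code, so the `kmer_table` helper is hypothetical, and treating rel_freq as the percentage of all counted 6-mers is an assumption about how that column was computed. The input would be the sequences of proteins annotated with the GO term of interest.

```python
from collections import Counter


def kmer_table(sequences, k=6, top=20):
    """Rank the most frequent length-k subsequences across the given sequences.

    Returns rows of (rank, subseq, count, rel_freq), where rel_freq is the
    percentage of all counted k-mers (an assumed definition).
    """
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k over the sequence, overlapping windows included.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return [
        (rank, subseq, n, 100.0 * n / total)
        for rank, (subseq, n) in enumerate(counts.most_common(top), start=1)
    ]


# Toy sequences, not competition data.
for row in kmer_table(["MQQQQQQQK", "AQQQQQQHTGEKPY"], k=6, top=3):
    print("{:<5}{:<8}{:<6}{:.6f}".format(*row))
```

Comparing the same k-mer's rel_freq between a GO term's proteins and a background set would be one way to filter out low-complexity repeats that show up for many terms.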
@scenic haven Is there a way to be granted permission to share images in this competition channel? I have graphed data I've collected (based on the data I collected above) that could help encourage other participants to see this information in another light.
They're just pyplot images
I am not sure about Discord, since we do get a lot of spam images. I think it is easier to show your work in a post on the Kaggle discussion forum instead.
I have maxed out my Google Colab GPU quota. Does anyone know how to get more computing power at zero cost?
Hi guys, I'm super new to this. When formatting the submission file, how do you decide which terms to keep? I'm assuming you probably don't want a probability for every term for each protein, because that would be huge. Do you use some threshold on the probability? Or just keep the top k terms for each protein? Sorry if this seems like a really dumb question; this is my first time doing something like this.
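The two options mentioned in the question (a probability threshold and a per-protein top-k cut) can also be combined. Below is a rough sketch of that idea. The helper names, the default threshold and top_k values, and the tab-separated "protein, term, score" line layout are assumptions for illustration; check the competition's submission rules for the actual format and limits.

```python
def select_terms(scores, threshold=0.01, top_k=500):
    """Keep terms scoring at least `threshold`, then cap at the `top_k` highest.

    scores maps GO term -> predicted probability for a single protein.
    Both cutoffs are arbitrary placeholders to tune on validation data.
    """
    kept = [(term, p) for term, p in scores.items() if p >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


def submission_lines(protein_id, scores, threshold=0.01, top_k=500):
    """Format the kept terms as tab-separated lines (assumed layout)."""
    return [
        f"{protein_id}\t{term}\t{p:.3f}"
        for term, p in select_terms(scores, threshold, top_k)
    ]


# Hypothetical protein ID and scores, just to show the call.
demo = {"GO:0005515": 0.91, "GO:0005516": 0.004, "GO:0008150": 0.40, "GO:0003674": 0.22}
for line in submission_lines("P12345", demo, threshold=0.01, top_k=2):
    print(line)
```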
Kaggle GPUs, I guess
Which one?
T4 x2? Or P100?
Trying with TPU.....
Okie
Thank you so much. It worked. Used up the TPU quota. Went from 0.16 to 0.217.

Does anyone have any idea how to run a homology search (using MMseqs2 or another tool) with GPUs in a notebook? I spent my whole day yesterday trying to debug the installation, but I get a CUDA error...
MMseqs2 was compiled without CUDA support
Instead of compiling from source (which is where most CUDA errors come from), try downloading the prebuilt AVX2 + CUDA build directly.
// BASH in Terminal
wget https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gz
tar xvf mmseqs-linux-gpu.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
//
This requires a Linux environment (or WSL2 on Windows)
Hey guys, I don't know why, but the "Submit Prediction" button on the CAFA page is inactive.
I don't understand what to do.
Oh, it says submissions are now closed, but it also says there are 3 months to go.
Submissions have already closed. Now we wait for new annotations to accumulate, and the final evaluation will be based on those.