#cafa-6-protein-function-prediction | Kaggle | Page 1

thin terrace Oct 17, 2025, 9:53 AM

#

I'm a little confused about the test superset. Does it contain the same proteins as the training set? The majority also come from SwissProt. Is that a deliberate design to hide the real test cases? But if the test set overlaps the training set, aren't the testing and the leaderboard useless?

scenic haven Oct 17, 2025, 2:38 PM

#

Hi, the test superset is a large set of proteins that we ask participants to predict GO terms for. The actual test set (or evaluation set) is unknown until the Final Evaluation Date (June 1, 2026), as it consists of proteins that accumulate new GO terms between the submission deadline (Feb 1, 2026) and evaluation time (June 1, 2026). The leaderboard is evaluated on a small set of proteins with GO terms that are provided to us by the UniProtKB team but are not available in UniProtKB or other public databases.

austere jasper Oct 21, 2025, 10:05 AM

#

I was wondering whether changes to the MLP architecture would have a big impact on the results.
For example, I am using protein embeddings followed by an MLP with layer sizes (embedding_size, 864, 782, num_classes), and got a score of 0.186,
but
Train - Loss: 0.0021 | ROC-AUC: 0.9567 | AUPRC: 0.3414 Val - ROC-AUC: 0.9484 | AUPRC: 0.3136
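
For a rough sense of scale, the parameter count of a fully connected stack like (embedding_size, 864, 782, num_classes) can be sketched as below. The values of `embedding_size` and `num_classes` are assumptions picked only for illustration; they are not stated in the message, and the helper name is hypothetical.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected stack with the given layer widths."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector of one layer
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


embedding_size = 1280  # assumed for illustration, not from the message
num_classes = 1500     # assumed for illustration, not from the message
sizes = (embedding_size, 864, 782, num_classes)

# Per-layer breakdown, then the total.
for n_in, n_out in zip(sizes, sizes[1:]):
    print(f"{n_in} -> {n_out}: {n_in * n_out + n_out} parameters")
print("total:", mlp_param_count(sizes))
```

With a large label space, the last layer tends to hold a big share of the parameters, so changing the hidden widths alone may move the total less than expected.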

#

Should the focus be on post-processing or something else?

scenic haven Oct 22, 2025, 3:19 PM

#

In CAFA5 and previous rounds, top-performing methods often integrated multiple information sources (structures, textual descriptions, literature text mining), so you can try integrating this information to improve your method: https://www.kaggle.com/competitions/cafa-6-protein-function-prediction/discussion/612582

grizzled dragon Oct 28, 2025, 12:27 PM

#

Hello! Quick question: Is the full source code to the CAFA5 winner's solution (GoCurator, NetGO4.0) available somewhere? I cannot find it.
Do they not have to publish the full source code because they make the tool freely available to run via a webserver? Or am I missing something?

scenic haven Oct 28, 2025, 3:00 PM

#

grizzled dragon Hello! Quick question: Is the full source code to the CAFA5 winner's solution (G...

Winners of CAFA5 were required to provide source code to Kaggle, but were not required to open-source their solution. Only their solution write-up and their explanation video are public.

grizzled dragon Oct 28, 2025, 3:01 PM

#

Ah, in that case I misunderstood that rule. Thank you for clearing it up!

livid gyro Nov 8, 2025, 5:02 PM

#

Hi

cold lantern Nov 10, 2025, 7:50 AM

#

Research Collaboration Opportunity – Clinical Proteomics × LLM Reliability

We are conducting a study titled “Hallucination Risks of Large Language Models in Clinical Proteomics.”
The project systematically evaluates how models such as GPT-4, Claude, and Gemini perform when interpreting clinical proteomics data, focusing on hallucination frequency, error patterns, and reliability assessment.

Our results indicate that even frontier models generate 27–35% factual errors, rising to over 50% for complex or rare-protein queries.
These findings highlight the significant reliability and safety challenges of applying LLMs in biomedical contexts.

We are seeking a collaborator who:
• Has experience working with proteomics or mass-spectrometry datasets
• Understands LLM architectures, evaluation frameworks, or AI safety

If this aligns with your expertise or interests, feel free to contact me or reply here.
Code, datasets, and the full evaluation pipeline are available on GitHub:
https://github.com/olaflaitinen/llm-proteomics-hallucination

Tags: #AIResearch #Bioinformatics #LLM #ClinicalData #ResearchCollab

hot vigil Nov 12, 2025, 3:27 AM

#

Hi

marble compass Nov 26, 2025, 7:45 PM

#

Hi, how do you usually store ESM embeddings? When I save them to Google Drive, loading them takes forever.

shy canyon Dec 1, 2025, 5:29 PM

#

https://media.discordapp.net/attachments/1444971360047726605/1445085758598938824/image1.gif?ex=692f107d&is=692dbefd&hm=94f18cd6e7350e7cc612826beb5d11a9fd125485a58ee1e39a16a03b6f9e2426&=&width=237&height=315
https://media.discordapp.net/attachments/1444971360047726605/1445085766937088000/image2.gif?ex=692f107f&is=692dbeff&hm=51e8429e6818b166e21485a613e8f0c706d64c765aefc93f65a7bcefa10907c2&=&width=864&height=1152
https://media.discordapp.net/attachments/1444971360047726605/1445085774562197535/image3.gif?ex=692f1081&is=692dbf01&hm=e520e8e4edd4eea02e82168a7059a868ea59c19d9b90c7c34402f7bb3616c76f&=&width=864&height=1152
https://media.discordapp.net/attachments/1444971360047726605/1445085781801566319/image4.gif?ex=692f1082&is=692dbf02&hm=bdc0715977fdcda4b7804916e5bfb36af1d3132f535d1b4327894a067fbfc769&=&width=725&height=907

kind pawn Dec 2, 2025, 5:21 PM

#

https://imgur.com/TC6h8P4 https://imgur.com/iiKXKB5 https://imgur.com/JAkE28j https://imgur.com/keASgw9

robust glacier Dec 8, 2025, 3:47 AM

#

Hello everyone, not sure if this has been answered or if this is a stupid question: after running the CAFA evaluator script, is the final score just an average of the f_micro values?

scenic haven Dec 8, 2025, 4:01 AM

#

The CAFA evaluator script returns a few result files, one of which is "evaluation_best_f_micro_w.tsv". In this file you will find the best F-score in each of the three subontologies (biological process, molecular function, cellular component), and the score of the method is the average of these 3 F-scores. In addition, the CAFA evaluator is run 3 times, once for each subset of proteins (no-knowledge, limited-knowledge, partial-knowledge). Each subset has a different ground truth file and produces a separate result file. The final score is a 9-way average of the 9 best F-scores (across 3 subontologies and 3 subsets of proteins).
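
The 9-way average described above can be sketched as follows. This is a minimal illustration, not the evaluator's own code: the dictionary layout, the subset and aspect keys, and all the numbers are made up, and in practice the nine values would be read from the three "evaluation_best_f_micro_w.tsv" result files.

```python
from statistics import mean

# Hypothetical keys chosen for this sketch; the evaluator's own labels may differ.
SUBSETS = ("no_knowledge", "limited_knowledge", "partial_knowledge")
ASPECTS = ("biological_process", "molecular_function", "cellular_component")


def final_score(best_f):
    """Average the nine best F-scores (3 protein subsets x 3 subontologies)."""
    values = [best_f[subset][aspect] for subset in SUBSETS for aspect in ASPECTS]
    return mean(values)


# Made-up best F-scores, purely for illustration.
example = {
    "no_knowledge": {
        "biological_process": 0.30, "molecular_function": 0.50, "cellular_component": 0.60,
    },
    "limited_knowledge": {
        "biological_process": 0.20, "molecular_function": 0.40, "cellular_component": 0.50,
    },
    "partial_knowledge": {
        "biological_process": 0.10, "molecular_function": 0.30, "cellular_component": 0.70,
    },
}

print(round(final_score(example), 4))
```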

robust glacier Dec 8, 2025, 4:22 AM

#

Thank you for the info, @scenic haven

obsidian onyx Dec 19, 2025, 9:09 AM

#

hello. I have a question about this competition, are we supposed to form teams?

static coyote Dec 19, 2025, 1:49 PM

#

obsidian onyx hello. I have a question about this competition, are we supposed to form teams?

It's not necessary, and that applies to every competition on Kaggle.

pearl ferry Dec 31, 2025, 10:00 PM

#

First time popping by to see if anyone is active here

#

I've been using an approach focused on engineering useful features from the data.

#

E.g., here's a breakdown of the length-6 subsequences most identifiable with GO:0005515:

#

```
 rank subseq  count  rel_freq
    1 QQQQQQ 198437  4.982011
    2 EEEEEE 128135  3.216991
    3 PPPPPP 127984  3.213200
    4 AAAAAA  98647  2.476657
    5 SSSSSS  77090  1.935442
    6 GGGGGG  44055  1.106056
    7 HTGEKP  36445  0.914998
    8 NNNNNN  22862  0.573979
    9 TGEKPY  22347  0.561050
   10 LLLLLL  12581  0.315862
   11 EEEEED  11286  0.283349
   12 IHTGEK   9619  0.241497
   13 DEEEEE   9443  0.237078
   14 DDDDDD   8181  0.205394
   15 RIHTGE   7537  0.189226
   16 ECGKAF   6996  0.175643
   17 HRDLKP   6042  0.151692
   18 RTHTGE   5491  0.137858
   19 EEDEEE   5454  0.136930
   20 DFGLAR   5371  0.134846
```
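
A table of this shape could be produced along the following lines. This is only a sketch under stated assumptions: it counts every overlapping length-6 window, and it takes `rel_freq` to be the percentage of all counted windows, which may not match how the numbers above were computed. The function name and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_table(sequences, k=6, top_n=20):
    """Rank the most frequent length-k subsequences across the given sequences.

    Returns (rank, subseq, count, rel_freq) rows, where rel_freq is the
    percentage of all counted windows (an assumed definition).
    """
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k over the sequence, one residue at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return [
        (rank, subseq, count, 100.0 * count / total)
        for rank, (subseq, count) in enumerate(counts.most_common(top_n), start=1)
    ]


# Toy input; in practice this would be the sequences annotated with one GO term.
for rank, subseq, count, rel_freq in kmer_table(["QQQQQQQ", "AAAAAA"]):
    print(f"{rank:5d} {subseq} {count:6d}  {rel_freq:.6f}")
```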

#

Some of these are not strong indicators because they are also identifiable with other GO terms.

#

For comparison, note that QQQQQQ is not among the common identifiable subsequences for GO:0005516 (however, GO:0005516 needs more data collection):

#

```
 rank subseq  count  rel_freq
    1 NNNNNN     62  0.533104
    2 EEEEEE     52  0.447120
    3 ESGAGK     17  0.146174
    4 GESGAG     17  0.146174
    5 NSSRFG     17  0.146174
    6 SGAGKT     17  0.146174
    7 SSRFGK     17  0.146174
    8 LLEKSR     16  0.137575
    9 YLLEKS     16  0.137575
   10 EAFGNA     15  0.128977
   11 RCIKPN     15  0.128977
   12 HRDLKP     15  0.128977
   13 SGESGA     15  0.128977
   14 FGNAKT     14  0.120378
   15 AFGNAK     14  0.120378
   16 GAGKTE     14  0.120378
   17 NSFEQF     13  0.111780
   18 LEAFGN     13  0.111780
   19 SSSSSS     12  0.103181
   20 EQFCIN     11  0.094583
```

#

@scenic haven Is there a way to be granted permission to share images in this competition channel? I have graphs of the data I collected above that could help other participants see this information in another light.

#

They're just pyplot images

scenic haven Dec 31, 2025, 10:20 PM

#

I am not sure about Discord, since we do get a lot of spam images. I think it is easier to show your work in a post on the Kaggle discussion forum instead.

glass mica Jan 25, 2026, 1:19 AM

#

I have used up my Google colab GPU to max. Does anyone have any idea to use higher computing power at zero cost?

azure wren Jan 25, 2026, 1:38 AM

#

Hi guys, I'm super new to this. When formatting the submission file, how do you decide which terms to keep? I'm assuming you probably don't want a probability for every term for each protein, because that would be huge. Do you use some threshold on the probability? Or just keep the top k terms for each protein? Sorry if this seems like a really dumb question; this is my first time doing something like this.
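
The two options mentioned in the question (a probability threshold and a top-k cap) can be combined, roughly as sketched below. The default threshold, the function names, the protein ID, and the tab-separated "protein, term, score" line layout are all assumptions for illustration; check the competition's submission rules for the actual format and limits.

```python
def select_terms(scores, threshold=0.01, top_k=None):
    """Keep (term, score) pairs at or above threshold, highest score first.

    If top_k is given, keep at most that many terms per protein.
    """
    kept = [(term, score) for term, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


def submission_lines(predictions, threshold=0.01, top_k=None):
    """Yield one tab-separated line per kept (protein, term) pair (assumed layout)."""
    for protein_id, scores in predictions.items():
        for term, score in select_terms(scores, threshold, top_k):
            yield f"{protein_id}\t{term}\t{score:.3f}"


# Made-up predictions for a hypothetical protein ID.
predictions = {
    "P12345": {"GO:0005515": 0.90, "GO:0005516": 0.005, "GO:0003674": 0.40},
}
for line in submission_lines(predictions, threshold=0.01, top_k=2):
    print(line)
```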

vestal hazel Jan 25, 2026, 3:11 PM

#

glass mica I have used up my Google colab GPU to max. Does anyone have any idea to use hig...

kaggle gpus ig

glass mica Jan 25, 2026, 4:42 PM

#

vestal hazel kaggle gpus ig

Which is?

vestal hazel Jan 25, 2026, 10:20 PM

#

glass mica Which is?

t4 x2? or p100?

glass mica Jan 25, 2026, 11:20 PM

#

vestal hazel t4 x2? or p100?

Trying with TPU.....

vestal hazel Jan 26, 2026, 7:37 AM

#

glass mica Trying with TPU.....

Okie

glass mica Jan 26, 2026, 8:15 AM

#

vestal hazel Okie

Thank you so much. It worked. Used up the TPU. Ended with a 0.217 from 0.16.

vestal hazel Jan 26, 2026, 10:16 AM

#

glass mica Thank you so much. It worked. Used up the TPU. Ended with a 0.217 from 0.16.

woohoo woohoo

weak patrol Jan 30, 2026, 4:15 AM

#

Does anyone have any idea on how to run a homology search (using MMSeqs2 or other) with GPUs on a notebook? I spent my whole day trying to debug the installation yesterday but I get a CUDA error...

MMseqs2 was compiled without CUDA support

glass mica Jan 30, 2026, 10:21 AM

#

weak patrol Does anyone have any idea on how to run a homology search (using MMSeqs2 or othe...

Instead of trying to compile from source (which is where most CUDA errors happen), try grabbing the precompiled AVX2 + CUDA build directly.

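```bash
# Download the precompiled Linux build with GPU support
wget https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gz
# Unpack it into ./mmseqs
tar xvf mmseqs-linux-gpu.tar.gz
# Put the mmseqs binary on PATH for this shell session
export PATH=$(pwd)/mmseqs/bin/:$PATH
```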

This requires a Linux environment (or WSL2 on Windows)

alpine forge Mar 7, 2026, 3:15 AM

#

Hey guys, I don't know why, but the Submit Prediction button on the CAFA page is inactive.

#

I don't understand what to do.

#

Oh, it says submissions are now closed, but it also says there are 3 months to go.

scenic haven Mar 7, 2026, 3:18 AM

#

Submissions have already closed. Now we wait for new annotations to accumulate, and the final evaluation will be based on those.

alpine forge Mar 7, 2026, 3:18 AM

#

Why is it closed if it has 3 months to go?

#

This is the first time I've seen this in a competition.

#

Oh, it had 2 timeline points.

#

I didn't notice at all.

#

mh wasted time ahhh sad