#cafa-6-protein-function-prediction


thin terrace
#

I'm a little bit confused about the testing superset. Does it contain the same proteins as the training set? The majority come from SwissProt. Is it a deliberate design to hide the real test cases? But if the test set overlaps the training set, aren't the testing and the leaderboard useless then?

scenic haven
#

Hi, the test superset is a large set of proteins that we ask participants to predict GO terms for. The actual test set (or evaluation set) is unknown until the Final Evaluation Date (June 1, 2026), as these are proteins that accumulate new GO terms between the submission deadline (Feb 1, 2026) and evaluation time (June 1, 2026). The leaderboard is evaluated on a small set of proteins with GO terms that are provided to us by the UniProtKB team but are not available in UniProtKB or other public databases.

austere jasper
#

I was wondering whether changes in the MLP architecture will have a huge impact on the results.
For example, I am using protein embeddings followed by an MLP with layer sizes (embedding_size, 864, 782, num_classes). I got a score of 0.186,
but
Train - Loss: 0.0021 | ROC-AUC: 0.9567 | AUPRC: 0.3414
Val - ROC-AUC: 0.9484 | AUPRC: 0.3136

#

Should the focus be on postprocessing or something else?

scenic haven
grizzled dragon
#

Hello! Quick question: Is the full source code to the CAFA5 winner's solution (GoCurator, NetGO4.0) available somewhere? I cannot find it.
Do they not have to publish the full source code because they make the tool freely available to run via a webserver? Or am I missing something?

scenic haven
grizzled dragon
#

Ah, in that case I misunderstood that rule. Thank you for clearing it up!

livid gyro
#

Hi

cold lantern
#

Research Collaboration Opportunity – Clinical Proteomics × LLM Reliability

We are conducting a study titled “Hallucination Risks of Large Language Models in Clinical Proteomics.”
The project systematically evaluates how models such as GPT-4, Claude, and Gemini perform when interpreting clinical proteomics data, focusing on hallucination frequency, error patterns, and reliability assessment.

Our results indicate that even frontier models generate 27–35% factual errors, rising to over 50% for complex or rare-protein queries.
These findings highlight the significant reliability and safety challenges of applying LLMs in biomedical contexts.

We are seeking a collaborator who:
• Has experience working with proteomics or mass-spectrometry datasets
• Understands LLM architectures, evaluation frameworks, or AI safety

If this aligns with your expertise or interests, feel free to contact me or reply here.
Code, datasets, and the full evaluation pipeline are available on GitHub:
https://github.com/olaflaitinen/llm-proteomics-hallucination

Tags: #AIResearch #Bioinformatics #LLM #ClinicalData #ResearchCollab

hot vigil
#

Hi

marble compass
#

Hi, how do you usually store ESM embeddings? When I save them to Google Drive, loading them takes forever.
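
One option, offered as a rough sketch and not the only way to do it: stack the per-protein embeddings into a single float16 NumPy matrix, keep the protein IDs in a sidecar text file, copy both files to the notebook's local disk once, and memory-map the matrix when loading. The file prefix, the toy protein IDs, and the 1280-dimensional vectors below are illustrative assumptions, not details from this thread.

```python
import os
import tempfile

import numpy as np


def save_embeddings(ids, vectors, prefix):
    """Write one .npy matrix plus a sidecar ID list (one ID per line)."""
    matrix = np.asarray(vectors, dtype=np.float16)  # half the size of float32
    np.save(prefix + ".npy", matrix)
    with open(prefix + ".ids.txt", "w") as fh:
        fh.write("\n".join(ids))


def load_embeddings(prefix):
    """Memory-map the matrix so rows are read from disk only when accessed."""
    matrix = np.load(prefix + ".npy", mmap_mode="r")
    with open(prefix + ".ids.txt") as fh:
        ids = fh.read().split("\n")
    return ids, matrix


# Toy data standing in for real ESM outputs (hypothetical IDs, random vectors).
demo_ids = ["P00001", "P00002", "P00003"]
demo_vecs = np.random.default_rng(0).normal(size=(3, 1280)).astype(np.float32)
demo_prefix = os.path.join(tempfile.mkdtemp(), "esm_demo")

save_embeddings(demo_ids, demo_vecs, demo_prefix)
loaded_ids, loaded = load_embeddings(demo_prefix)
```

Reading one large file from local disk is usually much faster than opening many small files on a mounted drive, and float16 trades a small amount of precision for half the storage.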

kind pawn
#

https://imgur.com/TC6h8P4 https://imgur.com/iiKXKB5 https://imgur.com/JAkE28j https://imgur.com/keASgw9

robust glacier
#

Hello everyone, not sure if this has been answered or if this is a stupid question: after running the CAFA evaluator script, is the final score just an average of the f_micro values?

scenic haven
#

The CAFA evaluator script will return a few result files, one of which is the "evaluation_best_f_micro_w.tsv" file. In this file you will find the best F-score in each of the three subontologies (biological process, molecular function, cellular component), and the score of the method is the average of these 3 F-scores. In addition, the CAFA evaluator is run 3 times, once for each subset of proteins (no-knowledge, limited-knowledge, partial-knowledge); each subset has a different ground truth file and produces a separate result file. The final score is a 9-way average of the 9 best F-scores (3 subontologies × 3 subsets of proteins).
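
As a small illustration of the 9-way average described above, here is a minimal sketch. It assumes the nine best F-scores have already been read out of the three result files into a nested dict; the subontology labels (`BPO`, `MFO`, `CCO`), the function name, and all the numbers are made up for the example and are not taken from the evaluator's output.

```python
from statistics import mean

# Labels are assumptions for this sketch, not the evaluator's own names.
SUBONTOLOGIES = ("BPO", "MFO", "CCO")
SUBSETS = ("no-knowledge", "limited-knowledge", "partial-knowledge")


def final_score(best_f):
    """Average the best F-score over every (subset, subontology) pair."""
    values = [best_f[subset][ont] for subset in SUBSETS for ont in SUBONTOLOGIES]
    return mean(values)


# Made-up numbers, for illustration only.
example = {
    "no-knowledge": {"BPO": 0.30, "MFO": 0.50, "CCO": 0.60},
    "limited-knowledge": {"BPO": 0.40, "MFO": 0.60, "CCO": 0.70},
    "partial-knowledge": {"BPO": 0.20, "MFO": 0.40, "CCO": 0.50},
}
score = final_score(example)
```

Because every subset contributes exactly three scores, this is the same as averaging the three per-subset averages.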

robust glacier
#

Thank you for the info, @scenic haven

obsidian onyx
#

Hello. I have a question about this competition: are we supposed to form teams?

static coyote
pearl ferry
#

First time popping by to see if anyone is active here

#

I've been using an approach focused on engineering useful features from the data

#

E.g., here's a breakdown of the length-6 subsequences most identifiable to GO:0005515

#
```
 rank subseq  count  rel_freq
    1 QQQQQQ 198437  4.982011
    2 EEEEEE 128135  3.216991
    3 PPPPPP 127984  3.213200
    4 AAAAAA  98647  2.476657
    5 SSSSSS  77090  1.935442
    6 GGGGGG  44055  1.106056
    7 HTGEKP  36445  0.914998
    8 NNNNNN  22862  0.573979
    9 TGEKPY  22347  0.561050
   10 LLLLLL  12581  0.315862
   11 EEEEED  11286  0.283349
   12 IHTGEK   9619  0.241497
   13 DEEEEE   9443  0.237078
   14 DDDDDD   8181  0.205394
   15 RIHTGE   7537  0.189226
   16 ECGKAF   6996  0.175643
   17 HRDLKP   6042  0.151692
   18 RTHTGE   5491  0.137858
   19 EEDEEE   5454  0.136930
   20 DFGLAR   5371  0.134846
```
#

Some of these are not strong indicators, since they are identifiable to other GO terms as well

#

For comparison, note that QQQQQQ is not a commonly identifiable subsequence for 5516 (however, 5516 needs more data collection)

#
```
 rank subseq  count  rel_freq
    1 NNNNNN     62  0.533104
    2 EEEEEE     52  0.447120
    3 ESGAGK     17  0.146174
    4 GESGAG     17  0.146174
    5 NSSRFG     17  0.146174
    6 SGAGKT     17  0.146174
    7 SSRFGK     17  0.146174
    8 LLEKSR     16  0.137575
    9 YLLEKS     16  0.137575
   10 EAFGNA     15  0.128977
   11 RCIKPN     15  0.128977
   12 HRDLKP     15  0.128977
   13 SGESGA     15  0.128977
   14 FGNAKT     14  0.120378
   15 AFGNAK     14  0.120378
   16 GAGKTE     14  0.120378
   17 NSFEQF     13  0.111780
   18 LEAFGN     13  0.111780
   19 SSSSSS     12  0.103181
   20 EQFCIN     11  0.094583
```
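
For anyone who wants to build a similar breakdown, here is a minimal sketch of counting length-6 subsequences over the proteins annotated with one GO term. It is not the code that produced the tables above. In particular, `rel_freq` is computed here as the subsequence's share of all counted subsequences, in percent, which is an assumption about what that column means; the function name and toy sequences are also made up.

```python
from collections import Counter


def kmer_table(sequences, k=6, top=20):
    """Return (rank, kmer, count, rel_freq) rows for the most frequent k-mers.

    rel_freq is the k-mer's share of all k-mer occurrences, in percent.
    That definition is an assumption made for this sketch.
    """
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k over the sequence, one residue at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    rows = []
    for rank, (kmer, count) in enumerate(counts.most_common(top), start=1):
        rows.append((rank, kmer, count, 100.0 * count / total))
    return rows


# Toy sequences standing in for the proteins annotated with one GO term.
toy = ["QQQQQQQQ", "MKHTGEKPY", "QQQQQQA"]
rows = kmer_table(toy, k=6, top=3)
```

Running the same function on a background set of proteins without the term gives a baseline to compare against, which helps separate term-specific subsequences from ones that are simply common everywhere.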
#

@scenic haven Is there a way to be granted permission to share images in this competition channel? I have graphs of the data I collected above that could help other participants see this information in another light.

#

They're just pyplot images

scenic haven
#

I am not sure about Discord, since we do get a lot of spam images. I think it is easier to show your work in a post on the Kaggle discussion forum instead

glass mica
#

I have maxed out my Google Colab GPU quota. Does anyone know of a way to get more computing power at zero cost?

azure wren
#

Hi guys, I'm super new to this. When formatting the submission file, how do you decide which terms to keep? I'm assuming you probably don't want a probability for every term for each protein, because that would be huge. Do you use some threshold on the probability? Or just keep the top k terms for each protein? Sorry if this seems like a really dumb question; this is my first time doing something like this
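
The two options in this question can be combined. Here is a minimal sketch: drop terms below a score threshold, then keep at most the top k of what remains. The threshold, the value of k, the function names, and the tab-separated "protein, term, score" row layout are all assumptions for illustration; the competition rules are the place to check the exact file format and any per-protein limit.

```python
def select_terms(scores, threshold=0.01, top_k=500):
    """Keep terms scoring at least `threshold`, then at most the `top_k` best."""
    kept = [(term, s) for term, s in scores.items() if s >= threshold]
    # Highest score first; ties broken by term ID so the output is deterministic.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:top_k]


def to_rows(protein_id, scores, **kwargs):
    """Format the selected terms as tab-separated 'protein, term, score' lines."""
    return [
        f"{protein_id}\t{term}\t{score:.3f}"
        for term, score in select_terms(scores, **kwargs)
    ]


# Hypothetical protein ID and made-up scores.
toy_scores = {
    "GO:0005515": 0.91,
    "GO:0005516": 0.004,
    "GO:0003674": 0.35,
    "GO:0008150": 0.12,
}
rows = to_rows("P00001", toy_scores, threshold=0.01, top_k=2)
```

A low threshold with a generous k keeps recall high while still bounding file size; both values are worth tuning on a validation split.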

glass mica
vestal hazel
glass mica
vestal hazel
glass mica
weak patrol
#

Does anyone have any idea how to run a homology search (using MMseqs2 or another tool) with GPUs in a notebook? I spent my whole day yesterday trying to debug the installation, but I get a CUDA error...

MMseqs2 was compiled without CUDA support
glass mica
alpine forge
#

Hey guys, I don't know why, but the Submit Prediction button on the CAFA page is inactive

#

I don't understand what to do

#

Oh, it says submissions are now closed, but it also says there are 3 months to go

scenic haven
#

Submissions are already closed. Now we wait for new annotations to accumulate, and the final evaluation will be based on those

alpine forge
#

Why is it closed if it has 3 months to go?

#

This is the first time I've seen this in a competition

#

oh it had 2 timeline points

#

i didn't notice at all

#

mh wasted time ahhh sad