Conflicting class labels with Naive Bayes text classification | Smarter Dev | Page 1

tawny raft Apr 30, 2023, 2:30 AM

#

Doing a multiclass text classification problem on scientific abstracts, each belonging to one of four classes: Archaea, Bacteria, Eukaryota, or Virus.

Originally, the data looks like this, with pretty much only one feature and one target.

#

I found these rows where the abstracts are the same but the labels are different (B vs E)

#

#

So I should replace the B with E right? cause 3 entries are E and only 1 is B

#

@coral vale

coral vale Apr 30, 2023, 9:35 PM

#

I'm not sure if I get the second image. It looks like you performed a pandas merge, but I'm kinda lost.

Regardless, the correct approach would be to investigate the reason for this inconsistency in the dataset. Is it possible it was annotated automatically and the system assigned both classes? If you read said abstract, can you identify the "correct" class? Like I've said before, some datasets have multiple annotators and therefore different entries. But assuming that is the reason and just going with the majority can cause you problems in the long run.

#

you gotta do a bit of detective work. Tbh, anything I advise you to do without knowing why you have conflicting entries in the dataset would be irresponsible
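As a starting point for that detective work, here is a minimal sketch of one way to list the abstracts that carry more than one label, with the label counts side by side for manual review. It assumes a pandas DataFrame with hypothetical columns `abstract` and `label`, and the rows are toy data, not the real dataset.

```python
import pandas as pd

# Toy data; the column names "abstract" and "label" are assumptions.
df = pd.DataFrame({
    "abstract": ["text a", "text a", "text a", "text a", "text b", "text c", "text c"],
    "label":    ["E", "E", "B", "E", "A", "V", "V"],
})

# One row per distinct abstract, one column per label, cells hold row counts.
counts = pd.crosstab(df["abstract"], df["label"])

# Keep only abstracts that appear with more than one distinct label.
conflicting = counts[(counts > 0).sum(axis=1) > 1]
print(conflicting)
```

This only surfaces the conflicts. It does not decide which label is right, which still depends on why the duplicates exist.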

tawny raft Apr 30, 2023, 10:16 PM

#

Fair enough

#

Thanks

tawny raft May 1, 2023, 4:14 AM

#

coral vale I'm not sure if I get the second image. It looks like you performed a pandas mer...

Apologies for not explaining the second image well enough. What I did was number the rows with the same abstract into clusters (cluster_index), so all rows with the same cluster index have the same abstract. Then I joined the table with itself so that I could find the pairwise combinations where the cluster is the same but the class labels are different.
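Roughly, the steps described above could be sketched like this. It is a reconstruction under assumptions, not the exact code used: the column names `abstract` and `label`, the helper column `row`, and the toy rows are all hypothetical. Only `cluster_index` comes from the description.

```python
import pandas as pd

# Toy data; "abstract" and "label" are assumed column names.
df = pd.DataFrame({
    "abstract": ["text a", "text a", "text a", "text a", "text b", "text c", "text c"],
    "label":    ["E", "E", "B", "E", "A", "V", "V"],
})

# Number identical abstracts into clusters, in order of first appearance.
df["cluster_index"] = df.groupby("abstract", sort=False).ngroup()

# Keep the original row position in a column so pairs can be told apart.
left = df.reset_index().rename(columns={"index": "row"})

# Self-join on cluster_index: every pairing of rows sharing an abstract.
pairs = left.merge(left, on="cluster_index", suffixes=("_x", "_y"))

# Keep each unordered pair once, and only where the labels disagree.
conflicts = pairs[
    (pairs["row_x"] < pairs["row_y"]) & (pairs["label_x"] != pairs["label_y"])
]
print(conflicts[["cluster_index", "row_x", "label_x", "row_y", "label_y"]])
```

The `row_x < row_y` filter drops self-pairs and mirrored duplicates, so each conflicting pair shows up once.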
