#Conflicting class labels with Naive Bayes text classification

tawny raft
#

Doing a multiclass text classification problem on scientific abstracts, each belonging to one of four classes: Archaea, Bacteria, Eukaryota, or Virus.

Originally, the data looks like this, with pretty much only one feature and one target.

#

I found these rows where the abstracts are the same but the labels are different (B vs E)
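For reference, a minimal sketch of how rows like these could be surfaced with pandas. The column names `abstract` and `label` and the toy rows are assumptions for illustration, not the actual dataset:

```python
import pandas as pd

# Hypothetical toy data; "abstract" and "label" are assumed column names.
df = pd.DataFrame({
    "abstract": ["text a", "text a", "text a", "text a", "text b", "text b", "text c"],
    "label":    ["E",      "E",      "B",      "E",      "V",      "V",      "A"],
})

# For every row, count how many distinct labels its abstract has been given.
n_labels = df.groupby("abstract")["label"].transform("nunique")

# Keep only rows whose abstract appears with more than one label.
conflicts = df[n_labels > 1]
print(conflicts)
```

Exact duplicates with the same label (like "text b" here) are not flagged, only abstracts whose labels disagree.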

#

So I should replace the B with E, right? Because 3 entries are E and only 1 is B

#

@coral vale

coral vale
#

I'm not sure if I get the second image. It looks like you performed a pandas merge, but I'm kinda lost.

Regardless, the correct approach would be to investigate the reason for this inconsistency in the dataset. Is it possible it was annotated automatically and the system assigned both classes? If you read said abstract, can you identify the "correct" class? Like I've said before, some datasets have multiple annotators and therefore different entries. But assuming that's the reason here and just going with the majority can cause you problems in the long run.

#

you gotta do a bit of detective work. Tbh, anything I advise you to do without knowing why you have conflicting entries in the dataset would be irresponsible
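One way to support that detective work, as a rough sketch only: tabulate how often each label occurs per conflicting abstract and flag those rows for manual reading, without overwriting any label. The column names `abstract` and `label`, the `needs_review` flag, and the toy rows are all assumptions here:

```python
import pandas as pd

# Hypothetical toy data; "abstract" and "label" are assumed column names.
df = pd.DataFrame({
    "abstract": ["text a", "text a", "text a", "text a", "text b", "text b", "text c"],
    "label":    ["E",      "E",      "B",      "E",      "V",      "V",      "A"],
})

# One row per abstract, one column per label, cells hold occurrence counts.
counts = df.groupby(["abstract", "label"]).size().unstack(fill_value=0)

# An abstract is conflicting if more than one label column is non-zero.
conflicting = counts[(counts > 0).sum(axis=1) > 1]
print(conflicting)

# Flag the affected rows for manual review instead of relabeling them.
df["needs_review"] = df["abstract"].isin(conflicting.index)
```

The table shows the split (for example 3 vs 1), but it says nothing about which label is right; that still needs reading the abstract or checking how the data was annotated.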

tawny raft
#

Fair enough

#

Thanks

tawny raft