I'm in the process of labeling a dataset, and I'm trying to build up some intuition about how CNNs generalize, so that I can get a sense of what types of images I should focus on, and what sort of performance I should expect.
The end goal is to use this dataset to train a CNN to classify images into two classes: 'room within a house' and 'not a room in a house'. I have a dataset of tens of millions of unlabeled images that needs to be filtered down to just images of residential room interiors, and I'm trying to automate this process.
This task is easy for most pictures, but has tricky edge cases. For example, any image taken outdoors can be thrown out right away, and a bedroom or a closet full of clothes is usually very recognizable as a room in a home. But a room with a desk and an office chair is harder, even for a human, because it could be a home office or a room in an office building. The dataset is extremely varied, with rooms containing all sorts of things, some that provide obvious clues and some that do not.
I manually labeled about 10k images and used them to train a ResNet18, which achieves about 80% validation accuracy. There are tricky cases, but I'm nevertheless pretty sure a human could achieve accuracy in the high 90s on this dataset, so I'm hoping to improve on that.
Currently, I suspect the issue is a kind of hidden class imbalance. My dataset has an equal number of 'residential' and 'nonresidential' images, but there are finer-grained subclasses within each that I could consider balancing. For example, I could try to collect an equal number of images of each type of room. The problem is that there's an almost infinite long tail of increasingly unique rooms, making this a daunting task! So I'm wondering: as my dataset grows, should I expect my CNN to learn more generalizable features automatically, or do I need to fastidiously identify and balance these subtle subclasses?
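One way to test this suspicion would be to break validation accuracy down by a coarse room-type tag. Below is a minimal sketch, assuming each validation image carries a hypothetical subtype tag alongside its true and predicted labels. The tag names, the made-up records, and the function name `accuracy_by_subtype` are all illustrative assumptions, not part of my actual dataset or pipeline.

```python
from collections import defaultdict


def accuracy_by_subtype(records):
    """Return {subtype: (accuracy, count)}.

    `records` is an iterable of (subtype, true_label, predicted_label)
    tuples. The subtype tags are a hypothetical extra annotation.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subtype, true_label, predicted_label in records:
        total[subtype] += 1
        if true_label == predicted_label:
            correct[subtype] += 1
    return {s: (correct[s] / total[s], total[s]) for s in total}


# Made-up records for illustration only: 1 = residential, 0 = nonresidential.
records = [
    ("bedroom", 1, 1),
    ("bedroom", 1, 1),
    ("outdoor", 0, 0),
    ("outdoor", 0, 0),
    ("desk_and_chair", 1, 0),
    ("desk_and_chair", 0, 0),
    ("desk_and_chair", 0, 1),
    ("desk_and_chair", 1, 1),
]

report = accuracy_by_subtype(records)

# Print the weakest subtypes first.
for subtype, (acc, n) in sorted(report.items(), key=lambda kv: kv[1][0]):
    print(f"{subtype:15s} accuracy={acc:.2f} n={n}")
```

If the errors cluster in a few subtypes, that would suggest labeling more of those specific cases; if accuracy is uniformly mediocre across subtypes, simply growing the dataset may matter more. Either reading is only as reliable as the number of examples per subtype.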