Q1. For classification tasks, when we use contrastive learning as the self-supervised objective, the uniformity of the representations is optimized, but how is linear separability naturally optimized along with uniformity?
A1. In my opinion, the reason that class concentration happens is ultimately the inductive bias of neural networks. After all, with an unrestricted function class, there are definitely encoders that are very aligned and uniform but give useless features, at least in terms of linear classification. If you believe the inductive bias of NNs tends to lead to “smooth” solutions, then intuitively class concentration happens. It is certainly difficult to argue this formally, though.
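To make the "aligned and uniform but useless" point concrete, here is a minimal sketch in plain Python. It assumes the usual forms of the two losses (mean positive-pair distance to the power alpha for alignment, and the log of the mean Gaussian potential over pairs for uniformity), with alpha=2 and t=2 as assumed defaults. All function names and the toy 2D features are hypothetical, chosen only for illustration. The key observation is that the uniformity loss never sees the labels, so the same perfectly uniform features can be linearly separable under one labeling and not under another.

```python
import math

def align_loss(xs, ys, alpha=2.0):
    """Mean distance**alpha between positive pairs (lower = more aligned)."""
    return sum(math.dist(x, y) ** alpha for x, y in zip(xs, ys)) / len(xs)

def uniform_loss(xs, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower = more uniform)."""
    n = len(xs)
    vals = [math.exp(-t * math.dist(xs[i], xs[j]) ** 2)
            for i in range(n) for j in range(i + 1, n)]
    return math.log(sum(vals) / len(vals))

def circle_points(n, spread=2 * math.pi):
    """n unit-norm 2D features spaced evenly over an arc of length `spread`."""
    return [(math.cos(spread * k / n), math.sin(spread * k / n)) for k in range(n)]

def best_linear_accuracy(feats, labels, n_angles=360):
    """Brute-force the best accuracy of a 2D linear classifier (direction + threshold)."""
    best = 0.0
    n = len(feats)
    for k in range(n_angles):
        theta = 2 * math.pi * k / n_angles
        w = (math.cos(theta), math.sin(theta))
        proj = [w[0] * x + w[1] * y for x, y in feats]
        srt = sorted(proj)
        # Candidate thresholds: midpoints between consecutive projections.
        for b in ((lo + hi) / 2 for lo, hi in zip(srt, srt[1:])):
            correct = sum((p > b) == bool(lab) for p, lab in zip(proj, labels))
            best = max(best, correct / n)
    return best

# Toy features: one evenly spread set, one squeezed into a short arc.
uniform_feats = circle_points(16)
clustered_feats = circle_points(16, spread=0.5)

# Two labelings of the SAME uniform features.
contiguous_labels = [0] * 8 + [1] * 8          # each class occupies one half-circle
interleaved_labels = [k % 2 for k in range(16)]  # classes alternate around the circle

contiguous_acc = best_linear_accuracy(uniform_feats, contiguous_labels)
interleaved_acc = best_linear_accuracy(uniform_feats, interleaved_labels)

print("uniformity (spread):   ", uniform_loss(uniform_feats))
print("uniformity (clustered):", uniform_loss(clustered_feats))
print("linear acc, contiguous labels: ", contiguous_acc)
print("linear acc, interleaved labels:", interleaved_acc)
```

Both labelings share exactly the same uniformity value because the features are identical; only the contiguous one is linearly separable. Nothing in the losses themselves rules out the interleaved arrangement, which is why I attribute class concentration to the inductive bias of the network rather than to the objective.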
Q2. For tasks requiring structured output (e.g. reconstruction), I can understand that uniformity is desired, since we want the representations to be as different as possible so that different samples are not confused when constructing the output. I wonder, then, whether the contrastive method is better than the autoencoder-based method in this scenario. My assumption is “no”, since the contrastive method also pushes similar samples away from each other, which goes against the intuition that similar inputs should have similar representations.
A2. I’d say it depends on how you choose positive pairs in contrastive learning. Surely this could be true if you ask two random crops of the same image to have the same features.