Abstract: CLIP is a powerful spatial feature extractor trained on a large dataset of image-text pairs. It exhibits strong generalization when extended to other domains and modalities. However, its ...