News

Given that existing feature fusion methods have not fully explored the relationship between fine- and coarse-grained features, direct feature fusion can result in poor correlation between them, ...
In this paper, we propose CFSum, a transformer-based multi-modal video summarization framework with coarse-fine fusion. CFSum takes video, text, and audio features as input and incorporates a ...