23 days ago · Hugging Face Daily Papers · UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
23 days ago · Hugging Face Daily Papers · Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
23 days ago · Hugging Face Daily Papers · DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
23 days ago · Hugging Face Daily Papers · Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration
23 days ago · Hugging Face Daily Papers · Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
23 days ago · Hugging Face Daily Papers · CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
23 days ago · Hugging Face Daily Papers · R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
23 days ago · Hugging Face Daily Papers · Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
23 days ago · Hugging Face Daily Papers · HoliTom: Holistic Token Merging for Fast Video Large Language Models
23 days ago · Hugging Face Daily Papers · MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
23 days ago · Hugging Face Daily Papers · MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
23 days ago · Hugging Face Daily Papers · rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset