About
I'm a PhD student in Artificial Intelligence at MICC, University of Florence, advised by Prof. Andrew D. Bagdanov and Prof. Marco Bertini.
Currently, I'm an Applied Scientist Intern at Amazon (RufusX Team, London), where I develop large-scale multimodal generative models for the Amazon Rufus initiative, impacting millions of users worldwide.
Research Interests
My research focuses on Multimodal Vision-Language Models (VLMs) like CLIP and their practical applications. I work on fundamental problems in:
- Multimodal Representation Learning: Understanding and bridging modality gaps in vision-language models
- Prompt Learning & Knowledge Distillation: Improving zero-shot generalization without labeled data
- Continual Learning: Enabling models to learn incrementally while preventing catastrophic forgetting
- Few-Shot & Test-Time Adaptation: Efficient model adaptation with minimal supervision
Key Contributions
I have published three first-author papers at top-tier AI venues:
- ICLR 2025 (main conference) — Exposing intra-modal misalignment in CLIP via modality inversion
- ECCV 2024 (main conference) — Unsupervised prompt learning via knowledge distillation
- NeurIPS 2023 (workshop) — Incremental fine-tuning for biomedical vision-language models
My work advances the state of the art in vision-language understanding and has practical applications in medical imaging, open-vocabulary recognition, and continual learning systems.
News
- Jul 2025: Started my internship at the Amazon RufusX team in London.
- Jan 2025: A paper on multimodal VLM representations was accepted at ICLR 2025.
- Sep 2024: Presented a paper on prompt learning at ECCV 2024.
- Aug 2024: The KDPL source code is finally available!
- Dec 2023: Presented a paper on continual learning at NeurIPS 2023 (workshop).
Recent Publications
- Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
  Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov
- KDPL: Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
  Marco Mistretta, Alberto Baldrati, Marco Bertini, Andrew D. Bagdanov
- RE-tune: Incremental Fine Tuning of Biomedical Vision-Language Models for Multi-label Chest X-ray Classification
  Marco Mistretta, Andrew D. Bagdanov
Work Experience
- Applied Scientist Intern, Amazon (RufusX Team, London)
  July 2025 — December 2025
  Working on Generative AI and Multimodal Large Language Models (MLLMs) as part of the Amazon Rufus initiative: fine-tuning, evaluating, and deploying large-scale multimodal models that reach millions of customers, advancing multimodal reasoning and generation at scale.
Education
- PhD student in Artificial Intelligence
  Nov 2023 — Present, University of Florence, Florence, Italy
  Topic: Multimodal Vision-Language Models, Incremental Learning, Prompt Learning.
- M.S. in Artificial Intelligence
  Sep 2021 — Jul 2023, University of Florence, Florence, Italy
  Thesis: "RE-Tune: Incremental Fine-Tuning of Biomedical Vision-Language Models"
- B.S. in Computer Science and Engineering
  Sep 2018 — Sep 2021, University of Florence, Florence, Italy
  Thesis: "Scarlatti-Gen: AI-Driven Sonata Generation Using Weighted Graphs and CNNs"
Teaching and Mentoring
- Teaching Assistant, University of Florence
  Jan 2024, Jan 2025
  Delivering interactive lessons on C/C++ and Python to over 200 bachelor's students.
- Thesis Co-Supervisor, University of Florence
  Apr 2024, Sep 2024
  "Mitigating Catastrophic Zero-shot Forgetting in CLIP via Distillation of Low-Rank Adapters from Learned Prompts": proposed a novel method for efficient few-shot fine-tuning of CLIP models that mitigates catastrophic forgetting and preserves zero-shot capabilities by distilling learned prompts into LoRA adapters.
- Student Ambassador, University of Florence
  Jan 2020 — Dec 2020
  Mentoring students on exams, projects, internships, and career development.