About

I'm a PhD student in Artificial Intelligence at MICC, University of Florence, advised by Prof. Andrew D. Bagdanov and Prof. Marco Bertini.

Currently, I'm an Applied Scientist Intern at Amazon (RufusX Team, London), developing large-scale multimodal generative models for the Amazon Rufus initiative, impacting millions of users worldwide.

Research Interests

My research focuses on Multimodal Vision-Language Models (VLMs) like CLIP and their practical applications. I work on fundamental problems in:

  • Multimodal Representation Learning: Understanding and bridging modality gaps in vision-language models
  • Prompt Learning & Knowledge Distillation: Improving zero-shot generalization without labeled data
  • Continual Learning: Enabling models to learn incrementally while preventing catastrophic forgetting
  • Few-Shot & Test-Time Adaptation: Efficient model adaptation with minimal supervision

Key Contributions

I have published 3 first-author papers at top-tier AI conferences:

  • ICLR 2025 (main conference) — Exposing intra-modal misalignment in CLIP via modality inversion
  • ECCV 2024 (main conference) — Unsupervised prompt learning via knowledge distillation
  • NeurIPS 2023 (workshop) — Incremental fine-tuning for biomedical vision-language models

My work advances the state-of-the-art in vision-language understanding and has practical applications in medical imaging, open-vocabulary recognition, and continual learning systems.

News

  1. Jul, 2025

    Started my internship at the Amazon RufusX team in London.

  2. Jan, 2025

    A paper on multimodal VLM representations was accepted at ICLR 2025.

  3. Sep, 2024

    Presented a paper on prompt learning at ECCV 2024.

  4. Aug, 2024

    KDPL source code is finally available!

  5. Dec, 2023

    Presented a paper on continual learning at NeurIPS 2023 (workshop).

Work Experience

  1. Applied Scientist Intern – Amazon (RufusX Team, London)

    July 2025 — December 2025

    Worked on Generative AI and Multimodal Large Language Models (MLLMs) as part of the Amazon Rufus initiative. Fine-tuned, evaluated, and deployed large-scale multimodal models impacting millions of customers, advancing multimodal reasoning and generation at scale.

Education

  1. PhD student in Artificial Intelligence

    Nov, 2023 — Present

    University of Florence, Florence, Italy

    Topic: Multimodal Vision-Language Models, Incremental Learning, Prompt Learning.

  2. M.S. in Artificial Intelligence

    Sep, 2021 — Jul, 2023

    University of Florence, Florence, Italy

    Thesis: "RE-Tune - Incremental Fine-Tuning of Biomedical Vision-Language Models"

  3. B.S. in Computer Science and Engineering

    Sep, 2018 — Sep, 2021

    University of Florence, Florence, Italy

    Thesis: "Scarlatti-Gen - AI-Driven Sonata Generation Using Weighted Graphs and CNNs"

Teaching and Mentoring

  1. Teaching Assistant, University of Florence

    Jan 2024, Jan 2025

    Delivered interactive lessons on C/C++ and Python to over 200 undergraduate students.
  2. Thesis Co-Supervisor, University of Florence

    Apr 2024, Sep 2024

    "Mitigating Catastrophic Zero-shot Forgetting in CLIP via Distillation of Low-Rank Adapters from Learned Prompts". Proposed a novel method for efficient few-shot fine-tuning of CLIP that mitigates catastrophic forgetting and preserves zero-shot capabilities by distilling learned prompts into LoRA adapters.

  3. Student Ambassador, University of Florence

    Jan 2020, Dec 2020

    Mentored students on exams, projects, internships, and career development.

Publications

My research focuses on advancing Multimodal Vision-Language Models (VLMs) like CLIP, with applications in continual learning, prompt learning, and representation learning. Below are my published works at top-tier AI conferences.

Contact

  • Institutional Email

    marco.mistretta@unifi.it
  • LinkedIn

    Marco Mistretta
  • Google Scholar

    Marco Mistretta
  • Instagram

    marcomistre99
  • Twitter

    mistretta_marco
