Building an End-to-End Video Summarizer with Phidata and Google Gemini
In this tutorial, we'll build an advanced AI application that analyzes and summarizes video content using Google's Gemini model and the Phidata agent framework. The resulting multi-modal agent can understand video content, answer questions about it, and perform web searches for additional context.
Prerequisites
- Python environment
- Google AI Studio API key
- Basic understanding of Streamlit
- Familiarity with the Phidata framework
Project Setup
Required Libraries
# requirements.txt
streamlit
phidata
google-generativeai
duckduckgo-search
python-dotenv
Implementation
1. Initial Imports and Setup
import streamlit as st
from phi.agent import Agent
from phi.model.google import Gemini
from phi.tools.duckduckgo import DuckDuckGo
from google.generativeai import upload_file, get_file
import google.generativeai as genai
import time
from pathlib import Path
from dotenv import load_dotenv
import os
load_dotenv()
2. API Configuration
api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=api_key)
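A small guard is worth adding here (a sketch, not part of the original flow): a missing key typically surfaces only at the first API call, so failing fast gives a clearer message.

if not api_key:
    # Fail fast with a clear message instead of an opaque API error later
    raise ValueError("GOOGLE_API_KEY is not set; add it to your .env file")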
3. Streamlit Page Configuration
st.set_page_config(
page_title="Multimodal AI Agent - Video Summarizer",
page_icon="🎥",
layout="wide"
)
st.title("Multimodal Video AI Summarizer Agent")
st.caption("Powered by Gemini 2.0 Flash Experimental")
4. Agent Initialization
@st.cache_resource
def initialize_agent():
return Agent(
name="Video AI Summarizer",
        model=Gemini(
            id="gemini-2.0-flash-exp"
        ),
tools=[DuckDuckGo()],
markdown=True
)
multimodal_agent = initialize_agent()
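The @st.cache_resource decorator ensures the agent is constructed once and reused across Streamlit's script reruns, rather than being rebuilt on every interaction.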
5. Video Upload and Processing
video_file = st.file_uploader(
"Upload a video file",
type=["mp4", "mov"],
help="Upload a video for AI analysis"
)
if video_file:
    # Persist the upload to disk so it can be handed to the Gemini Files API
    temp_file = Path("temp_video.mp4")
    temp_file.write_bytes(video_file.read())
    st.video(str(temp_file))
user_query = st.text_input(
    "Ask anything about the video:",
    placeholder="The agent will analyze the video and gather additional context from the web"
)
analyze_button = st.button("Analyze Video")
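A fixed filename like temp_video.mp4 works for a single user, but concurrent sessions would overwrite each other's uploads. As an alternative sketch using the standard library's tempfile module (if you adopt it, pass video_path in place of temp_file in step 6):

import tempfile

# Write the upload to a uniquely named temporary file instead
with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tmp:
    tmp.write(video_file.read())
    video_path = Path(tmp.name)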
6. Analysis Logic
if analyze_button and video_file:
    with st.spinner("Processing video..."):
        # Upload the video to the Gemini Files API and poll until it is ready
        processed_video = upload_file(str(temp_file))
        while processed_video.state.name == "PROCESSING":
            time.sleep(1)
            processed_video = get_file(processed_video.name)

        prompt = f"""
        Analyze the uploaded video for content and context.
        Respond to the following query using video insights
        and supplementary web search: {user_query}
        """
        response = multimodal_agent.run(
            prompt,
            videos=[processed_video]
        )
    st.markdown(response.content)
Key Features
- Video Analysis: Process and understand video content
- Web Search Integration: Supplement video analysis with web information (a standalone sketch follows this list)
- Interactive Interface: User-friendly Streamlit frontend
- Multi-modal Processing: Handles both video and text inputs
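To try the web-search tool in isolation, here is a minimal text-only sketch; it reuses the same agent setup as above, and print_response streams the answer to the terminal:

from phi.agent import Agent
from phi.model.google import Gemini
from phi.tools.duckduckgo import DuckDuckGo

# Text-only agent to confirm the DuckDuckGo tool is wired up correctly
search_agent = Agent(
    model=Gemini(id="gemini-2.0-flash-exp"),
    tools=[DuckDuckGo()],
    markdown=True
)
search_agent.print_response("What is the Phidata agent framework?")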
Limitations and Solutions
- File Size Limit: Streamlit caps uploads at 200 MB by default (the server.maxUploadSize setting); a validation sketch follows this list
- Solution for Larger Files: Upload to cloud storage such as AWS S3 and hand the agent a reference instead (see the example below)
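Before falling back to cloud storage, the upload size can be checked directly; st.file_uploader returns an UploadedFile whose size attribute is in bytes. A minimal sketch:

MAX_BYTES = 200 * 1024 * 1024  # Streamlit's default upload cap

if video_file and video_file.size > MAX_BYTES:
    st.error("This file exceeds the 200 MB limit; use the cloud route below.")
    st.stop()  # halt this script run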
Cloud Integration Example
A sketch of this route, assuming boto3 is installed and AWS credentials are configured; the bucket name and key are placeholders, not part of Phidata's API:

import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are configured
file_size = video_file.size  # bytes

# For videos > 200 MB, route around the local temp-file path
if file_size > 200_000_000:
    video_file.seek(0)  # rewind in case the stream was read earlier
    # Upload the raw bytes to S3 (bucket and key are placeholders)
    s3.upload_fileobj(video_file, "my-video-bucket", "uploads/temp_video.mp4")
    # A presigned URL gives time-limited access to the object
    video_url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-video-bucket", "Key": "uploads/temp_video.mp4"},
        ExpiresIn=3600,
    )
    # Note: whether agent.run accepts a remote URL here depends on the
    # Phidata/Gemini version; re-uploading via the Files API may be needed
    response = multimodal_agent.run(prompt, videos=[video_url])
Running the Application
# Install requirements
pip install -r requirements.txt
# Run the application
streamlit run app.py
Best Practices
- Always validate video file size before processing (a sketch appears under Limitations above)
- Implement proper error handling for API calls (see the sketch after this list)
- Use caching for better performance
- Consider implementing progress bars for long operations
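As a minimal sketch of the error-handling and cleanup points above, the agent call from step 6 can be wrapped so failures surface in the UI and the temporary file is always removed (the message text is illustrative):

try:
    response = multimodal_agent.run(prompt, videos=[processed_video])
    st.markdown(response.content)
except Exception as error:
    # Surface API or processing failures in the UI instead of a stack trace
    st.error(f"An error occurred during analysis: {error}")
finally:
    # Always remove the temporary video file
    temp_file.unlink(missing_ok=True)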
Future Enhancements
- Add support for more video formats
- Implement batch processing
- Add video segment analysis
- Integrate with different LLM providers
Conclusion
This project demonstrates the power of combining Google's Gemini model with Phidata's agent framework to create a sophisticated video analysis tool. While the local version has limitations, the architecture can be easily extended to handle larger files through cloud integration.
Next Steps
- Implement cloud storage integration
- Add more analysis capabilities
- Improve response formatting
- Add export functionality for analysis results