
Building an End-to-End Video Summarizer with Phidata and Google Gemini

In this tutorial, we'll create an advanced AI application that can analyze and summarize video content using Google's Gemini model and the Phidata agent framework. This multi-modal agent can understand video content, answer questions about it, and even perform web searches for additional context.

Prerequisites

  • Python environment
  • Google AI Studio API key
  • Basic understanding of Streamlit
  • Phidata framework knowledge

Project Setup

Required Libraries


# requirements.txt
streamlit
phidata
google-generativeai
duckduckgo-search
python-dotenv
    

Implementation

1. Initial Imports and Setup


import streamlit as st
from phi.agent import Agent
from phi.model.google import Gemini
from phi.tools.duckduckgo import DuckDuckGo
from google.generativeai import upload_file, get_file
import google.generativeai as genai
import time
from pathlib import Path
from dotenv import load_dotenv
import os

load_dotenv()
    

2. API Configuration


api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=api_key)
    
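The load_dotenv() call reads the key from a .env file placed next to app.py. A minimal example (the value is a placeholder for your own Google AI Studio key):

# .env
GOOGLE_API_KEY=your_google_ai_studio_api_key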

3. Streamlit Page Configuration


st.set_page_config(
    page_title="Multimodal AI Agent - Video Summarizer",
    page_icon="🎥",
    layout="wide"
)

st.title("Multimodal Video AI Summarizer Agent")
st.caption("Powered by Gemini 2.0 Flash Exp")
    

4. Agent Initialization


@st.cache_resource
def initialize_agent():
    return Agent(
        name="Video AI Summarizer",
        model=Gemini(
            model="gemini-2.0-flash-exp"
        ),
        tools=[DuckDuckGo()],
        markdown=True
    )

multimodal_agent = initialize_agent()
    

5. Video Upload and Processing


video_file = st.file_uploader(
    "Upload a video file",
    type=["mp4", "mov"],
    help="Upload a video for AI analysis"
)

if video_file:
    temp_file = Path("temp_video.mp4")
    temp_file.write_bytes(video_file.read())
    
    st.video(str(temp_file))
    
    user_query = st.text_input(
        "Ask anything about the video:",
        placeholder="What would you like to know about this video?",
        help="The AI agent will analyze the video and gather additional information if needed"
    )
    
    analyze_button = st.button("Analyze Video")
    

6. Analysis Logic


if analyze_button and video_file:
    with st.spinner("Processing video..."):
        # Upload the video to the Gemini Files API and wait until processing finishes
        processed_video = upload_file(str(temp_file))
        while processed_video.state.name == "PROCESSING":
            time.sleep(1)
            processed_video = get_file(processed_video.name)

        prompt = f"""
        Analyze the uploaded video for content and context.
        Respond to the following query using video insights
        and supplementary web search: {user_query}
        """

        response = multimodal_agent.run(
            prompt,
            videos=[processed_video]
        )

        st.markdown(response.content)
    

Key Features

  • Video Analysis: Process and understand video content
  • Web Search Integration: Supplement video analysis with web information
  • Interactive Interface: User-friendly Streamlit frontend
  • Multi-modal Processing: Handles both video and text inputs

Limitations and Solutions

  • File Size Limit: Streamlit's default upload limit caps files at 200MB (a size check is sketched below)
  • Solution for Larger Files: Upload to cloud storage (e.g., AWS S3) and pass a URL instead of a local file
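
The 200MB cap comes from Streamlit's default upload limit (server.maxUploadSize, which can be raised in .streamlit/config.toml if needed). A minimal guard before writing the upload to disk might look like the sketch below; the threshold simply mirrors that default:

# Reject uploads that exceed the local processing limit
MAX_SIZE_BYTES = 200 * 1024 * 1024  # mirrors Streamlit's default upload limit

if video_file and video_file.size > MAX_SIZE_BYTES:
    st.error("Video exceeds 200MB - consider the cloud storage approach below.")
    st.stop()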

Cloud Integration Example


# For videos larger than the 200MB upload limit
if file_size > 200_000_000:
    # upload_to_s3 and get_s3_url are placeholder helpers --
    # one possible boto3-based sketch is shown below
    s3_path = upload_to_s3(video_file)
    # Process from the S3 URL instead of a local file
    video_url = get_s3_url(s3_path)
    response = multimodal_agent.run(
        prompt,
        videos=[video_url]
    )
    
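The upload_to_s3 and get_s3_url helpers above are placeholders rather than part of Phidata or Streamlit. One possible sketch using boto3 (which you would also add to requirements.txt; the bucket name is a placeholder):

# Hypothetical S3 helpers built on boto3; BUCKET is a placeholder bucket name
import boto3

BUCKET = "my-video-analysis-bucket"
s3 = boto3.client("s3")

def upload_to_s3(video_file, key="uploads/temp_video.mp4"):
    # video_file is Streamlit's UploadedFile, which behaves like a file object
    s3.upload_fileobj(video_file, BUCKET, key)
    return key

def get_s3_url(key, expires=3600):
    # A presigned URL grants temporary read access without making the bucket public
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=expires,
    )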

Running the Application


# Install requirements
pip install -r requirements.txt

# Run the application
streamlit run app.py
    

Best Practices

  • Always validate video file size before processing
  • Implement proper error handling for API calls (a sketch follows this list)
  • Use caching for better performance
  • Consider implementing progress bars for long operations
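
A rough sketch of the error-handling bullet above, with temporary-file cleanup added as a closely related step (variable names follow the analysis section; the messages are illustrative):

# Wrap the agent call so API failures surface cleanly, and always
# remove the temporary video file afterwards
try:
    response = multimodal_agent.run(prompt, videos=[processed_video])
    st.markdown(response.content)
except Exception as error:
    st.error(f"An error occurred during analysis: {error}")
finally:
    Path("temp_video.mp4").unlink(missing_ok=True)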

Future Enhancements

  • Add support for more video formats
  • Implement batch processing
  • Add video segment analysis
  • Integrate with different LLM providers

Conclusion

This project demonstrates the power of combining Google's Gemini model with Phidata's agent framework to create a sophisticated video analysis tool. While the local version has limitations, the architecture can be easily extended to handle larger files through cloud integration.

Next Steps

  • Implement cloud storage integration
  • Add more analysis capabilities
  • Improve response formatting
  • Add export functionality for analysis results