Building an End-to-End Video Summarizer with Phidata and Google Gemini
In this tutorial, we'll build an advanced AI application that analyzes and summarizes video content using Google's Gemini model and the Phidata agent framework. The resulting multi-modal agent can understand video content, answer questions about it, and perform web searches for additional context.
Prerequisites
- Python environment
- Google AI Studio API key
- Basic understanding of Streamlit
- Familiarity with the Phidata framework
Project Setup
Required Libraries
# requirements.txt
streamlit
phidata
google-generativeai
duckduckgo-search
python-dotenv
Implementation
1. Initial Imports and Setup
import streamlit as st
from phi.agent import Agent
from phi.model.google import Gemini
from phi.tools.duckduckgo import DuckDuckGo
from google.generativeai import upload_file, get_file
import google.generativeai as genai
import time
from pathlib import Path
from dotenv import load_dotenv
import os
load_dotenv()
2. API Configuration
api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=api_key)
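A small guard is worth adding here (a sketch, not part of the original flow): a missing key typically surfaces only at the first API call, so failing fast gives a clearer message.

if not api_key:
    # Fail fast with a clear message instead of an opaque API error later
    raise ValueError("GOOGLE_API_KEY is not set; add it to your .env file")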
3. Streamlit Page Configuration
st.set_page_config(
page_title="Multimodal AI Agent - Video Summarizer",
page_icon="🎥",
layout="wide"
)
st.title("Multimodal Video AI Summarizer Agent")
st.caption("Powered by Gemini 2.0 Flash Experimental")
4. Agent Initialization
@st.cache_resource
def initialize_agent():
return Agent(
name="Video AI Summarizer",
        model=Gemini(
            id="gemini-2.0-flash-exp"
        ),
tools=[DuckDuckGo()],
markdown=True
)
multimodal_agent = initialize_agent()
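The @st.cache_resource decorator ensures the agent is constructed once and reused across Streamlit's script reruns, rather than being rebuilt on every interaction.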
5. Video Upload and Processing
video_file = st.file_uploader(
"Upload a video file",
type=["mp4", "mov"],
help="Upload a video for AI analysis"
)
if video_file:
    # Persist the upload to disk so it can be handed to the Gemini Files API
    temp_file = Path("temp_video.mp4")
    temp_file.write_bytes(video_file.read())
    st.video(str(temp_file))
user_query = st.text_input(
    "Ask anything about the video:",
    placeholder="The agent will analyze the video and gather additional context from the web"
)
analyze_button = st.button("Analyze Video")
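A fixed filename like temp_video.mp4 works for a single user, but concurrent sessions would overwrite each other's uploads. As an alternative sketch using the standard library's tempfile module (if you adopt it, pass video_path in place of temp_file in step 6):

import tempfile

# Write the upload to a uniquely named temporary file instead
with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tmp:
    tmp.write(video_file.read())
    video_path = Path(tmp.name)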
6. Analysis Logic
if analyze_button and video_file:
    with st.spinner("Processing video..."):
        # Upload the video to the Gemini Files API and poll until it is ready
        processed_video = upload_file(str(temp_file))
        while processed_video.state.name == "PROCESSING":
            time.sleep(1)
            processed_video = get_file(processed_video.name)

        prompt = f"""
        Analyze the uploaded video for content and context.
        Respond to the following query using video insights
        and supplementary web search: {user_query}
        """
        response = multimodal_agent.run(
            prompt,
            videos=[processed_video]
        )
    st.markdown(response.content)
Key Features
- Video Analysis: Process and understand video content
- Web Search Integration: Supplement video analysis with web information (a standalone sketch follows this list)
- Interactive Interface: User-friendly Streamlit frontend
- Multi-modal Processing: Handles both video and text inputs
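To try the web-search tool in isolation, here is a minimal text-only sketch; it reuses the same agent setup as above, and print_response streams the answer to the terminal:

from phi.agent import Agent
from phi.model.google import Gemini
from phi.tools.duckduckgo import DuckDuckGo

# Text-only agent to confirm the DuckDuckGo tool is wired up correctly
search_agent = Agent(
    model=Gemini(id="gemini-2.0-flash-exp"),
    tools=[DuckDuckGo()],
    markdown=True
)
search_agent.print_response("What is the Phidata agent framework?")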
Limitations and Solutions
- File Size Limit: Streamlit caps uploads at 200 MB by default (the server.maxUploadSize setting); a validation sketch follows this list
- Solution for Larger Files: Upload to cloud storage such as AWS S3 and hand the agent a reference instead (see the example below)
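Before falling back to cloud storage, the upload size can be checked directly; st.file_uploader returns an UploadedFile whose size attribute is in bytes. A minimal sketch:

MAX_BYTES = 200 * 1024 * 1024  # Streamlit's default upload cap

if video_file and video_file.size > MAX_BYTES:
    st.error("This file exceeds the 200 MB limit; use the cloud route below.")
    st.stop()  # halt this script run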
Cloud Integration Example
A sketch of this route, assuming boto3 is installed and AWS credentials are configured; the bucket name and key are placeholders, not part of Phidata's API:

import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are configured
file_size = video_file.size  # bytes

# For videos > 200 MB, route around the local temp-file path
if file_size > 200_000_000:
    video_file.seek(0)  # rewind in case the stream was read earlier
    # Upload the raw bytes to S3 (bucket and key are placeholders)
    s3.upload_fileobj(video_file, "my-video-bucket", "uploads/temp_video.mp4")
    # A presigned URL gives time-limited access to the object
    video_url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-video-bucket", "Key": "uploads/temp_video.mp4"},
        ExpiresIn=3600,
    )
    # Note: whether agent.run accepts a remote URL here depends on the
    # Phidata/Gemini version; re-uploading via the Files API may be needed
    response = multimodal_agent.run(prompt, videos=[video_url])
Running the Application
# Install requirements
pip install -r requirements.txt
# Run the application
streamlit run app.py
Best Practices
- Always validate video file size before processing (a sketch appears under Limitations above)
- Implement proper error handling for API calls (see the sketch after this list)
- Use caching for better performance
- Consider implementing progress bars for long operations
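As a minimal sketch of the error-handling and cleanup points above, the agent call from step 6 can be wrapped so failures surface in the UI and the temporary file is always removed (the message text is illustrative):

try:
    response = multimodal_agent.run(prompt, videos=[processed_video])
    st.markdown(response.content)
except Exception as error:
    # Surface API or processing failures in the UI instead of a stack trace
    st.error(f"An error occurred during analysis: {error}")
finally:
    # Always remove the temporary video file
    temp_file.unlink(missing_ok=True)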
Future Enhancements
- Add support for more video formats
- Implement batch processing
- Add video segment analysis
- Integrate with different LLM providers
Conclusion
This project demonstrates the power of combining Google's Gemini model with Phidata's agent framework to create a sophisticated video analysis tool. While the local version has limitations, the architecture can be easily extended to handle larger files through cloud integration.
Next Steps
- Implement cloud storage integration
- Add more analysis capabilities
- Improve response formatting
- Add export functionality for analysis results