How to Build Your Own AI Meeting Note Taker with the OpenAI API

You know that sinking feeling. You just spent 45 minutes in a meeting hashing out an architecture decision, and now you're staring at your screen trying to reconstruct who said what. The official notes say "discussed microservices tradeoffs" — four words for a conversation that shaped your roadmap. Somebody probably recorded it. Nobody will rewatch it.
AI meeting note takers like Fireflies and Otter solve this. They join your calls, transcribe everything, and hand you a clean summary with action items. But they cost $10–30 per user per month, they upload your most sensitive strategy conversations to the cloud, and you get exactly zero control over how the summaries work.
What if you built your own?
In this guide, I'll walk you through two complete approaches: a cloud-first stack that gives you production-grade accuracy with minimal setup, and a local-first open-source alternative that keeps every byte of audio on your machine. By the end, you'll have working code for both — and a clear picture of which one fits your use case.
The Full Pipeline: Bot, Transcript, Summary, Action Items
Every AI meeting note taker — whether it's a $30/month SaaS tool or your own weekend project — runs the same four-stage pipeline:
Bot joins call → Audio captured → Speech-to-text → LLM summarization → Structured output
Stage 1 — Bot joins the call. Something needs to get into the meeting room and listen. That "something" is either a bot participant (visible in the participant list, joins like another person) or a local capture agent (runs on your machine, invisible to others).
Stage 2 — Audio is captured. The meeting platform streams audio to your bot, or your local agent grabs the system audio output. This is the raw PCM audio that speech-to-text engines consume.
Stage 3 — Speech-to-text (STT). The raw audio becomes text. This is where you choose between cloud APIs (Deepgram, AssemblyAI, OpenAI Whisper) or local models (Whisper.cpp, OWhisper). Accuracy, latency, and cost all diverge sharply here.
Stage 4 — LLM summarization. The transcript — often thousands of words for an hour-long meeting — is fed to an LLM with a prompt that extracts decisions, action items, and key points. This is where you have the most creative control.
The "last mile" is output: structured JSON pushed to your CRM, a Markdown file saved to Notion, or a Slack message to the channel. That part is straightforward once the pipeline is solid.
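To make the output stage concrete, here is a minimal sketch that renders a summary dictionary as Markdown. The field names (summary, decisions, action_items) mirror the prompt used later in this guide; the function name `render_markdown` and the page layout are illustrative assumptions, not part of any tool's API.

```python
def render_markdown(notes: dict) -> str:
    """Render a structured summary as a Markdown document (illustrative layout)."""
    lines = ["# Meeting Notes", "", notes.get("summary", ""), ""]

    lines.append("## Decisions")
    for decision in notes.get("decisions", []):
        lines.append(f"- {decision}")
    lines.append("")

    lines.append("## Action Items")
    for item in notes.get("action_items", []):
        # Missing assignees and deadlines are made visible rather than dropped
        assignee = item.get("assignee") or "unassigned"
        deadline = item.get("deadline") or "no deadline"
        lines.append(f"- [ ] {item.get('task', '')} ({assignee}, {deadline})")

    return "\n".join(lines)


example = {
    "summary": "The team compared a monolith with microservices.",
    "decisions": ["Keep the monolith for Q3"],
    "action_items": [
        {"task": "Draft migration plan", "assignee": "Sarah", "deadline": None}
    ],
}
print(render_markdown(example))
```

The same dictionary could just as easily be posted to Slack or a CRM; only the renderer changes.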
Let's build it.
Approach 1: Cloud-First Stack — Recall.ai + OpenAI + Webhooks
The fastest path to production quality is a cloud pipeline. Here's the stack:
Recall.ai — universal meeting bot API (handles Zoom, Meet, Teams, Webex with one integration)
Deepgram or AssemblyAI — streaming transcription with speaker diarization
OpenAI GPT-4o — transcript summarization and action item extraction
A simple webhook server — receives transcripts, triggers summarization, stores results
Step 1: Deploy a Bot to Join Meetings
Recall.ai abstracts away every video platform's API. Instead of building separate Zoom, Google Meet, and Teams integrations, you make one API call:
```python
# bot_create.py: send a bot to a meeting
import requests

RECALL_API = "https://us-east-1.recall.ai/api/v1/bot/"
RECALL_TOKEN = "your_recall_api_key"

response = requests.post(
    RECALL_API,
    json={
        "meeting_url": "https://zoom.us/j/123456789",
        "bot_name": "Meeting Notes Bot",
        "recording_config": {
            "transcription": {
                "provider": "deepgram",  # or assembly_ai, gladia_v2
                "language": "en",
            }
        },
        "automatic_leave": {
            "waiting_room_timeout": 300,
            "noone_joined_timeout": 300,
            "everyone_left_timeout": 60,
        },
    },
    headers={
        "Authorization": f"Token {RECALL_TOKEN}",
        "Content-Type": "application/json",
    },
    timeout=30,
)
response.raise_for_status()  # fail loudly on a bad token or payload
bot = response.json()
print(f"Bot deployed: {bot['id']}")
```
The bot shows up as a named participant. It captures audio, streams it to your chosen transcription provider, and fires a webhook when done.
Step 2: Receive the Transcript via Webhook
Recall.ai POSTs a status_change event to your server when the bot finishes. You then pull the transcript:
```python
# webhook_handler.py: FastAPI endpoint
import requests
from fastapi import FastAPI

from summarize import summarize_transcript  # defined in Step 3

RECALL_TOKEN = "your_recall_api_key"
app = FastAPI()


@app.post("/recall-webhook")
def handle_webhook(payload: dict):
    # A plain `def` endpoint runs in FastAPI's threadpool, so the blocking
    # requests call below does not stall the event loop.
    data = payload.get("data", {})
    if data.get("status", {}).get("code") == "done":
        bot_id = data["bot_id"]
        # Fetch the transcript
        transcript_resp = requests.get(
            f"https://us-east-1.recall.ai/api/v1/bot/{bot_id}/transcript",
            headers={"Authorization": f"Token {RECALL_TOKEN}"},
            timeout=30,
        )
        transcript_resp.raise_for_status()
        transcript = transcript_resp.json()
        # Summarize it (Step 3)
        summary = summarize_transcript(transcript)
        # Store it (save_meeting_notes is your own persistence function)
        save_meeting_notes(bot_id, summary)
    return {"status": "ok"}
```
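Webhook senders commonly retry deliveries, so the same "done" event may arrive more than once and trigger a second, billable summarization. A minimal sketch of a duplicate guard follows. The in-memory set and the function name `should_process` are assumptions for illustration; a real deployment would keep this state in a database so it survives restarts.

```python
# Hypothetical in-memory store of bots already handled; use a database in production
_processed_bots: set[str] = set()


def should_process(payload: dict) -> bool:
    """Return True only the first time a 'done' event is seen for a given bot."""
    data = payload.get("data", {})
    if data.get("status", {}).get("code") != "done":
        return False
    bot_id = data.get("bot_id")
    if bot_id is None or bot_id in _processed_bots:
        return False
    _processed_bots.add(bot_id)
    return True


event = {"data": {"bot_id": "bot-123", "status": {"code": "done"}}}
print(should_process(event))  # first delivery
print(should_process(event))  # retry of the same event
```

Calling this at the top of the webhook handler and returning early when it yields False keeps retries cheap.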
Step 3: Summarize with OpenAI GPT-4o
This is where your pipeline gets smart. You control the prompt — so you control exactly what the output looks like:
```python
# summarize.py: extract decisions, action items, and key points
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_transcript(transcript: list[dict]) -> dict:
    # Build a speaker-labeled transcript string
    lines = []
    for entry in transcript:
        speaker = entry.get("speaker", "Unknown")
        text = entry.get("text", "")
        lines.append(f"{speaker}: {text}")
    full_text = "\n".join(lines)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a precise meeting note taker. Analyze the transcript "
                    "and return a JSON object with these fields:\n"
                    "- summary: 3-4 sentence overview of what was discussed\n"
                    "- decisions: array of concrete decisions made (with context)\n"
                    "- action_items: array of {task, assignee, deadline} objects\n"
                    "- key_quotes: 2-3 notable quotes with speaker attribution\n"
                    "- topics: array of main topics discussed\n\n"
                    "Be specific. If no decision was made on a topic, say so. "
                    "If an action item has no clear assignee, note that too."
                ),
            },
            {
                "role": "user",
                "content": f"Meeting transcript:\n\n{full_text}",
            },
        ],
        response_format={"type": "json_object"},
        temperature=0.3,
    )
    return json.loads(response.choices[0].message.content)
```
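Setting response_format to json_object makes the model return syntactically valid JSON, so no parsing gymnastics are needed. It does not guarantee that every field named in the prompt is present or correctly typed, so check the keys before you store the result. Temperature 0.3 keeps things factual while leaving a little room for natural phrasing.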
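Because JSON mode constrains syntax rather than schema, a light validation step is worth having. The sketch below checks the five fields requested in the prompt; the names `REQUIRED_FIELDS` and `validate_notes` are hypothetical, and a schema library would be a reasonable substitute.

```python
import json

# Fields requested in the system prompt, with the JSON type each should have
REQUIRED_FIELDS = {
    "summary": str,
    "decisions": list,
    "action_items": list,
    "key_quotes": list,
    "topics": list,
}


def validate_notes(raw: str) -> dict:
    """Parse the model's reply and check that each expected field has the right type."""
    notes = json.loads(raw)
    if not isinstance(notes, dict):
        raise ValueError("expected a JSON object at the top level")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in notes:
            raise ValueError(f"missing field: {field}")
        if not isinstance(notes[field], expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
    return notes


reply = '{"summary": "ok", "decisions": [], "action_items": [], "key_quotes": [], "topics": []}'
print(validate_notes(reply)["summary"])
```

On a ValueError you might retry the request once before giving up, since a second sample often comes back complete.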
Cloud Stack: What It Costs
Per one-hour meeting:
| Service | Unit Cost | Per Meeting |
|---|---|---|
| Recall.ai (recording) | $0.50/hr | $0.50 |
| Deepgram (transcription) | $0.46/hr | $0.46 |
| GPT-4o (summarization) | ~$5/1M input tokens | ~$0.05 |
| Total per meeting | | ~$1.01 |
That's roughly $20/month for a team running 20 meetings. Compare that to $10–30 per user for Fireflies or Otter — the math tilts fast once more than 2 people need coverage.
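The arithmetic behind those figures can be written as a small calculator so you can plug in your own prices. This is a sketch using the unit costs from the table as defaults; it assumes the LLM cost is a flat amount per meeting, which is a simplification, and the function names are illustrative.

```python
def cost_per_meeting(
    hours: float,
    recording_per_hr: float = 0.50,
    stt_per_hr: float = 0.46,
    llm_per_meeting: float = 0.05,
) -> float:
    """Estimated cloud cost of one meeting, using the unit prices from the table."""
    return hours * (recording_per_hr + stt_per_hr) + llm_per_meeting


def monthly_cost(meetings: int, hours_each: float = 1.0) -> float:
    """Estimated monthly cost for a number of equally long meetings."""
    return meetings * cost_per_meeting(hours_each)


print(f"${cost_per_meeting(1.0):.2f} per one-hour meeting")
print(f"${monthly_cost(20):.2f} per month for 20 one-hour meetings")
```

Recording and transcription scale with meeting length, so a team whose meetings run 30 minutes pays roughly half as much per meeting.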
But there's a tradeoff, and it's a big one: every word of your meeting audio flows through third-party servers. For strategy sessions, funding discussions, or anything involving client confidentiality, that's a non-starter.
Which brings us to approach two.
Approach 2: Local-First Open-Source Stack — OWhisper + Ollama
This approach keeps everything on your machine. No API keys. No cloud. The tradeoff is accuracy — local models are less refined than Deepgram or AssemblyAI — but the privacy guarantee is absolute.
The stack:
OWhisper — local speech-to-text server (think: "Ollama for STT")
Whisper.cpp or Parakeet TDT — on-device transcription models
Ollama + Llama 3 — local LLM for summarization
A Python script — ties it all together
Step 1: Set Up OWhisper
Install and pull a model:
```bash
# macOS
brew tap fastrepl/hyprnote && brew install owhisper

# Linux: download the binary, then put it on your PATH
curl -L https://owhisper.hyprnote.com/download/latest/linux-x86_64 -o owhisper
chmod +x owhisper

# Pull a model (base is a good balance of speed and accuracy)
owhisper pull whisper-cpp-base-q8-en

# Start the server (Deepgram-compatible API on port 8080)
owhisper run whisper-cpp-base-q8-en
```
OWhisper exposes a Deepgram-compatible REST API, so any code written for Deepgram's SDK works against it — just pointed at localhost:8080.
Step 2: Capture System Audio and Transcribe
On macOS, BlackHole creates a virtual audio device that routes system audio to your script. On Linux, PulseAudio or PipeWire can create a loopback:
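```python
# local_transcribe.py: capture system audio, send to OWhisper
import wave

import pyaudio
import requests

OWHISPER_URL = "http://localhost:8080/v1/listen"


def find_blackhole_device(audio: pyaudio.PyAudio) -> int:
    """Return the index of the first input device whose name contains 'BlackHole'."""
    for i in range(audio.get_device_count()):
        info = audio.get_device_info_by_index(i)
        if "blackhole" in info["name"].lower() and info["maxInputChannels"] > 0:
            return i
    raise RuntimeError("BlackHole input device not found")


def record_and_transcribe(duration_minutes: int = 60) -> str:
    CHUNK = 4096
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000

    audio = pyaudio.PyAudio()
    sample_width = audio.get_sample_size(FORMAT)  # read before terminate()
    stream = audio.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        input_device_index=find_blackhole_device(audio),
        frames_per_buffer=CHUNK,
    )
    frames = []
    for _ in range(int(RATE / CHUNK * duration_minutes * 60)):
        frames.append(stream.read(CHUNK, exception_on_overflow=False))
    stream.stop_stream()
    stream.close()
    audio.terminate()

    # Save to WAV
    wav_path = "/tmp/meeting_audio.wav"
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(sample_width)
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))

    # Send to OWhisper. Deepgram's /v1/listen takes the raw audio as the request
    # body and nests the text under results.channels[].alternatives[]; this
    # assumes OWhisper mirrors that shape, as its compatibility claim implies.
    with open(wav_path, "rb") as f:
        resp = requests.post(
            OWHISPER_URL,
            params={"model": "whisper-cpp-base-q8-en"},
            headers={"Content-Type": "audio/wav"},
            data=f,
        )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
```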
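The recording settings in that script (16 kHz, mono, 16-bit samples) determine how much audio the frames list holds in memory and how large the WAV file gets. A quick worked calculation, with `pcm_bytes` as a hypothetical helper name:

```python
def pcm_bytes(rate_hz: int, channels: int, sample_width_bytes: int, seconds: float) -> int:
    """Size of uncompressed PCM audio in bytes."""
    return int(rate_hz * channels * sample_width_bytes * seconds)


# 16,000 samples/s x 1 channel x 2 bytes/sample x 3,600 s
one_hour = pcm_bytes(rate_hz=16000, channels=1, sample_width_bytes=2, seconds=3600)
print(f"{one_hour / 1_000_000:.1f} MB")  # 115.2 MB
```

That fits comfortably in RAM on a laptop, but for multi-hour sessions it may be worth writing frames to the WAV file as they arrive instead of buffering them all.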
Step 3: Summarize Locally with Ollama
```python
# local_summarize.py: run summarization on your own hardware
import json

import ollama


def summarize_local(transcript: str) -> dict:
    prompt = f"""Analyze this meeting transcript and return a JSON object with:
- summary: 3-4 sentence overview
- decisions: array of concrete decisions
- action_items: array of {{task, assignee}} objects
- topics: array of main discussion topics

Transcript:
{transcript}

Return ONLY valid JSON, no other text."""
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
        format="json",  # constrain the reply to valid JSON
        options={"temperature": 0.3},
    )
    return json.loads(response["message"]["content"])
```
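Local models often run with a context window much smaller than an hour-long transcript, and text past the window may be dropped without warning. One common workaround is to summarize in chunks and then merge the partial results. The splitter below is a minimal sketch of the first half of that idea; the name `chunk_transcript` and the 1,500-word default are assumptions to tune for your model.

```python
def chunk_transcript(transcript: str, max_words: int = 1500) -> list[str]:
    """Split a transcript into chunks of roughly max_words words.

    Breaks only on line boundaries so speaker turns stay intact; a single
    line longer than max_words is kept whole in its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for line in transcript.splitlines():
        words = len(line.split())
        if current and count + words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks


# Ten speaker turns of 12 words each, split into chunks of at most 30 words
sample = "\n".join(f"Speaker {i % 2}: " + "word " * 10 for i in range(10))
chunks = chunk_transcript(sample, max_words=30)
print(len(chunks))
```

Each chunk can then be passed to summarize_local, with a final call that merges the partial summaries into one.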
Local Stack: The Privacy Math
| What | Cloud Stack | Local Stack |
|---|---|---|
| Audio leaves your machine | Yes (Recall.ai + STT provider) | No |
| Transcript processed remotely | Yes (OpenAI servers) | No (your GPU/CPU) |
| Monthly cost (20 meetings) | ~$20 | $0 (your hardware) |
| Setup time | ~1 hour | ~2-3 hours |
| Accuracy (clean audio) | ~95%+ | ~85-90% |
| Accuracy (accents, crosstalk) | ~80-85% | ~65-75% |
The local stack wins on privacy and cost. The cloud stack wins on accuracy and setup speed. There's no universally correct answer — it depends on whether your meetings contain information you'd be uncomfortable uploading to a third-party server.
Cost Analysis: Build vs Buy
Here's how the numbers shake out for a 10-person team running 40 hours of meetings per week:
| Approach | Monthly Cost | Annual Cost |
|---|---|---|
| Fireflies Business ($19/user) | $190 | $2,280 |
| Otter Business ($20/user) | $200 | $2,400 |
| Cloud DIY (Recall.ai + Deepgram + OpenAI) | ~$160 | ~$1,920 |
| Local DIY (OWhisper + Ollama) | $0 + electricity | ~$50 + electricity |
The cloud DIY approach saves roughly 15-20% over SaaS tools for a 10-person team at this volume. Note which way the scaling runs: DIY costs grow with meeting hours while SaaS pricing grows with seats, so the savings widen as headcount rises and narrow as the hours of meetings per person rise. The local approach costs essentially nothing beyond the machine you already own. But both approaches require maintenance: API version updates, model upgrades, and the occasional "why didn't the bot join?" debugging session.
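The build-versus-buy comparison reduces to a break-even point: the number of meeting hours per month at which usage-priced DIY costs the same as a per-seat plan. The sketch below assumes the ~$1.01 per meeting-hour figure from the cloud stack and one-hour meetings; `break_even_hours` is an illustrative name.

```python
def break_even_hours(seats: int, saas_per_seat: float, diy_per_hour: float = 1.01) -> float:
    """Monthly meeting hours at which cloud DIY costs the same as a per-seat plan."""
    return seats * saas_per_seat / diy_per_hour


hours = break_even_hours(seats=10, saas_per_seat=19.0)
print(f"Break-even at about {hours:.0f} meeting hours per month")
```

For the 10-person team at $19 per seat this comes to about 188 hours a month, so the 160 hours assumed in the table sits below the break-even point, which is why DIY comes out cheaper there.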
Pro tip: Start with the cloud stack to validate the pipeline works for your team's meeting patterns. Once you've run it for a month and know what you actually need, decide whether to optimize for cost (local) or accuracy (cloud).
Production Hardening: Accuracy, Diarization, and Privacy
The demo code above works on a clean laptop. Real meetings are messier. Here's what you'll hit in production and how to handle it.
Speaker Diarization — Who Said What?
This is the hardest problem in meeting transcription. When two people talk at once, or voices sound similar, diarization fails.
Cloud fix: Use gpt-4o-transcribe-diarize (OpenAI's speech-to-text model with built-in diarization) or AssemblyAI's speaker labels feature. Both label speakers and track them across the transcript:
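```python
# OpenAI diarization
from openai import OpenAI

client = OpenAI()

with open("meeting_audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        chunking_strategy="auto",  # required for audio longer than 30 seconds
    )
# Returns segments with speaker labels: "A", "B", "C"
```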
Local fix: This is the weak spot. Open-source diarization (pyannote.audio, SpeechBrain) works but requires significant tuning. If diarization accuracy matters, the cloud APIs are currently 2-3 years ahead of open-source.
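Whichever diarizer you use, its output usually needs flattening before it reaches the summarization prompt, because one speaker's turn is often split across several short segments. Here is a minimal sketch that merges consecutive segments from the same speaker. It assumes each segment is a dict with "speaker" and "text" keys, which is an illustrative shape; adapt the key names to your provider's actual response.

```python
def merge_segments(segments: list[dict]) -> list[str]:
    """Collapse consecutive segments from the same speaker into one labeled line."""
    merged: list[tuple[str, str]] = []
    for seg in segments:
        speaker, text = seg["speaker"], seg["text"].strip()
        if merged and merged[-1][0] == speaker:
            # Same speaker is still talking: extend the previous turn
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return [f"{speaker}: {text}" for speaker, text in merged]


segments = [
    {"speaker": "A", "text": "Let's split the billing service."},
    {"speaker": "A", "text": "It changes the most."},
    {"speaker": "B", "text": "Agreed, but not this quarter."},
]
print("\n".join(merge_segments(segments)))
```

Fewer, longer turns also make it easier for the LLM to attribute decisions to the right person.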
Accuracy with Accents and Technical Jargon
Custom vocabulary is your friend. Both Deepgram and AssemblyAI let you pass a list of domain-specific terms that the model should prioritize:
```python
# Deepgram with custom vocabulary
options = {
    "model": "nova-3",
    "keyterm": ["Kubernetes", "gRPC", "DLQ", "idempotency", "SLO"],
}
```
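If you call Deepgram's HTTP API directly rather than through an SDK, options like these travel as query parameters. To my understanding each key term is sent as its own repeated `keyterm` parameter, but treat that as an assumption and confirm it against Deepgram's API reference. The helper name `build_listen_url` is hypothetical.

```python
from urllib.parse import urlencode

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_listen_url(options: dict) -> str:
    """Encode options as a query string; list values become repeated parameters."""
    return f"{DEEPGRAM_URL}?{urlencode(options, doseq=True)}"


url = build_listen_url({"model": "nova-3", "keyterm": ["Kubernetes", "gRPC"]})
print(url)
```

The `doseq=True` flag is what expands the list into separate parameters instead of encoding the Python list literally.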
For the local stack, Whisper.cpp accepts a prompt string that biases recognition toward specific terms — not as reliable as cloud keyterm boosting, but it helps.
Prompt Engineering for Better Summaries
The prompt in Approach 1 is a starting point. After running it on real meetings, you'll discover edge cases. Here's what I've learned:
Meetings that go in circles: Add "If the same topic was discussed multiple times without resolution, note that explicitly instead of pretending a decision was reached."
Action items people "might" do: Add "Distinguish between firm commitments ('I will') and vague intentions ('I might,' 'we should'). Only create action items for firm commitments."
Speaker attribution in summaries: Include the speaker label in the transcript string so the LLM can say "Sarah proposed..." instead of "it was proposed..."
Privacy: What About Botless Capture?
If you want cloud accuracy without a visible bot participant, you have options:
Fathom and Granola run locally and capture without a bot — but they're SaaS tools, not APIs you can build on.
Recall.ai Desktop SDK captures meeting audio on your machine without a bot joining — audio still goes to their servers for transcription.
Meetily (MIT licensed, open source) captures locally and is fully bot-free — but it's a consumer app, not a programmable pipeline.
The local stack in Approach 2 is effectively a DIY version of what Granola and Meetily do, with the added benefit that you own the entire pipeline.
Which Stack Should You Start With?
After building both, my recommendation is simple:
Start with the cloud stack if you need production-grade accuracy fast, and your meetings don't contain legally sensitive information. A weekend of work gets you a working pipeline that handles most real-world meetings.
Start with the local stack if privacy is non-negotiable. Accept the ~10-15% accuracy tradeoff and budget extra time to tune Whisper and diarization settings.
Mix them — local Whisper for routine standups and internal syncs, cloud Recall.ai + Deepgram for client calls where accuracy matters. It's not all-or-nothing.
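The mixed setup amounts to a routing decision per meeting. A minimal sketch of that rule follows; the two boolean inputs and the function name `choose_stack` are assumptions, and in practice you might derive them from calendar metadata such as attendee email domains.

```python
def choose_stack(has_external_attendees: bool, privacy_required: bool) -> str:
    """Pick a pipeline per meeting; a privacy requirement always wins over accuracy."""
    if privacy_required:
        return "local"
    if has_external_attendees:
        return "cloud"  # client calls, where accuracy matters most
    return "local"  # routine standups and internal syncs


print(choose_stack(has_external_attendees=True, privacy_required=False))
print(choose_stack(has_external_attendees=True, privacy_required=True))
```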
The code in this guide is a starting point. The real value comes from iterating on the prompt, tuning the diarization, and building the integrations that make meeting notes actually useful: the CRM sync, the Slack notification, the Notion page that writes itself. That part is uniquely yours. If you're curious about the broader agentic AI landscape that's driving tools like these, check out our guide to AI agent orchestration tools.
