Gemini + Python: Smart Info Extraction with LangExtract (Full Dockerized Example)

If you’ve ever wanted to extract structured data from unstructured text — think resumes, invoices, or customer support emails — LangExtract is a Python library that makes it dead simple.

Backed by Google’s Gemini models, LangExtract lets you define what you want to extract (a schema or prompt), and it handles the rest. This post focuses on how to use LangExtract effectively with two real examples: a job application email and a service invoice.

Project Overview (Very Brief)

Before we dive in, here’s the basic layout of the demo project:

langextract-demo/
├── langextract_demo/
│   ├── __main__.py
│   ├── extractor_job.py
│   └── extractor_invoice.py
├── data/
│   ├── job_application_email.txt
│   └── invoice.txt
├── .env

Now, let’s explore how to use the library itself.

How LangExtract Works

The core idea is simple: LangExtract takes a prompt_description, a few examples, and your input text, then sends them to a Gemini model through a single lx.extract() call.

You get back typed, structured Extraction results with class labels and attributes. Think of it like “few-shot prompting as code.”

Define the Prompt

invoice_prompt = """Extract invoice number, date, recipient, itemized services,
total amount due, and due date.
Use exact text spans; do not infer or reword."""

Provide Few-Shot Examples

Few-shot examples help steer the model. You give it one or more annotated examples like this:

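import langextract as lx

# One annotated example; it is passed to lx.extract() as invoice_example.
invoice_example = lx.data.ExampleData(
    text="Invoice #INV-10123\nDate: August 5, 2025\n...",
    extractions=[
        lx.data.Extraction(extraction_class="invoice_number", extraction_text="INV-10123"),
        lx.data.Extraction(
            extraction_class="date",
            extraction_text="August 5, 2025",
            attributes={"type": "issue"},
        ),
        ...  # remaining extractions elided
    ]
)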
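Because the prompt asks for exact text spans, it helps if every extraction in a few-shot example is itself a verbatim substring of that example's text. Here is a minimal sketch of such a check. It is not part of LangExtract: find_unanchored is a hypothetical helper, and it works on plain (class, text) pairs so it runs without the library or an API key.

```python
def find_unanchored(example_text, extractions):
    """Return the (class, text) pairs whose text is not a verbatim substring."""
    return [
        (cls, span)
        for cls, span in extractions
        if span not in example_text
    ]


example_text = "Invoice #INV-10123\nDate: August 5, 2025\n"
pairs = [
    ("invoice_number", "INV-10123"),
    ("date", "August 5, 2025"),
    ("date", "5 August 2025"),  # reworded, so it should be flagged
]

problems = find_unanchored(example_text, pairs)
print(problems)
```

An empty list means every example span can be located in the example text. Anything it returns is a candidate for rewording before you spend an API call.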

Load Your Input Text

with open("data/invoice.txt", encoding="utf-8") as f:
    invoice_text = f.read()

Run the Extraction

This is where the magic happens:

import os

result = lx.extract(
    text_or_documents=invoice_text,
    prompt_description=invoice_prompt,
    examples=[invoice_example],
    model_id="gemini-2.5-pro",
    api_key=os.getenv("GOOGLE_API_KEY"),  # must already be set in the environment
)
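One thing to watch: os.getenv returns None when the variable is unset, so a missing key may only surface later, inside the library call. A small guard can fail earlier with a clearer message. This is a sketch under assumptions: require_api_key is a hypothetical helper (not a LangExtract function), and the env parameter exists only so the check can be exercised without touching the real environment.

```python
import os


def require_api_key(name="GOOGLE_API_KEY", env=None):
    """Return the key stored under `name`, or raise a clear error if it is unset."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or load it from .env first")
    return key
```

You could then pass api_key=require_api_key() in place of the os.getenv call.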

And now you can iterate over structured results:

for item in result.extractions:
    print(f"- Class: {item.extraction_class}")
    print(f"  Text: {item.extraction_text}")
    print(f"  Attributes: {item.attributes}")
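Printing is fine for a first look, but downstream code usually wants the results keyed by class. The sketch below groups extractions into a plain dict. group_by_class is a hypothetical helper, and the sample uses SimpleNamespace stand-ins that only mimic the three attribute names from the loop above, so it runs without calling Gemini.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_class(extractions):
    """Map each extraction_class to the list of extraction_text values found."""
    grouped = defaultdict(list)
    for item in extractions:
        grouped[item.extraction_class].append(item.extraction_text)
    return dict(grouped)


# Stand-in objects with the same attribute names used in the loop above.
sample = [
    SimpleNamespace(extraction_class="invoice_number",
                    extraction_text="INV-10123", attributes=None),
    SimpleNamespace(extraction_class="date",
                    extraction_text="August 5, 2025", attributes={"type": "issue"}),
]
grouped = group_by_class(sample)
print(grouped)
```

With a real result you would call group_by_class(result.extractions). Keeping values as lists means repeated classes, such as several line items, are not overwritten.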

Job Application Example

You can do the same for emails or resumes:

email_schema_prompt = """Extract candidate name, email, phone,
experience, skills, expected salary, and availability."""

Just swap the prompt, example, and input text — the rest stays the same.
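Since only three inputs change between use cases, one way to capture that is a thin wrapper around lx.extract. This is a sketch, not how the demo repo is necessarily organized: run_extraction and its extract_fn parameter are hypothetical, and extract_fn exists so the wrapper can be tried with a fake function instead of a live Gemini call.

```python
import os


def run_extraction(text, prompt, examples, extract_fn=None, model_id="gemini-2.5-pro"):
    """Run one extraction; only text, prompt, and examples vary per use case."""
    if extract_fn is None:
        # Imported lazily so the helper can be exercised without the library.
        import langextract as lx
        extract_fn = lx.extract
    return extract_fn(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
        api_key=os.getenv("GOOGLE_API_KEY"),
    )
```

The job application case then becomes run_extraction(email_text, email_schema_prompt, [email_example]), where email_text and email_example are placeholder names for your loaded email and its annotated example.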

When to Use LangExtract

LangExtract is ideal for:

  • Parsing job applications or CVs
  • Extracting line items from invoices
  • Extracting key details from customer support emails
  • Extracting metadata from legal or medical text

Get the Full Example Project

The full demo project with Docker and Makefile support is on GitHub:

github.com/sprusov/langextract-demo

What Will You Build?

LangExtract makes structured extraction feel like cheating. What domains would you apply this to — HR, finance, legal, healthcare?

Share your ideas or fork the repo and experiment!
