S3 Bulk Delivery

For enterprise clients who need programmatic access to the full FinancialReports catalog — raw filings, processed markdown, rendered HTML, and a queryable parquet index — we offer direct S3 read access via cross-account IAM role grants.

This is the recommended integration path for RAG pipelines, backtesting engines, data lakes, and any workflow that needs to pull filings at scale without paginating through the REST API.

Getting Started

Prerequisites

An active Enterprise contract with FinancialReports

An AWS account with an IAM role your team will use to access the bucket

Setup

Create an IAM role in your AWS account that your applications will assume.

Send the role ARN to your FinancialReports account manager. Format:

arn:aws:iam::<account-id>:role/<role-name>

We attach a read-only grant scoped to the data prefixes listed below. You'll receive a confirmation email once access is live.

Configure your S3 client to access the bucket in eu-central-1:

Bucket:  financialreports-bucket
Region:  eu-central-1

No API keys or signed URLs are involved — access is controlled entirely via your IAM role and our bucket policy.
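As a rough illustration of the last two setup steps, the sketch below checks that a role ARN has the shape shown above before you send it, and creates an S3 client pinned to the bucket's region. The helper names (`parse_role_arn`, `make_s3_client`) are hypothetical, and the client is assumed to pick up your granted role through boto3's default credential chain.

```python
import re

BUCKET = "financialreports-bucket"
REGION = "eu-central-1"

# Shape from the setup step: arn:aws:iam::<account-id>:role/<role-name>
# AWS account IDs are 12 digits; the role-name character set is kept permissive.
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::(\d{12}):role/([\w+=,.@/-]+)$")


def parse_role_arn(arn):
    """Return (account_id, role_name), or raise ValueError if malformed."""
    match = ROLE_ARN_RE.match(arn)
    if match is None:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    return match.group(1), match.group(2)


def make_s3_client():
    """Create an S3 client for the bucket's region.

    Assumption: the default boto3 credential chain resolves to the IAM role
    that was granted access. No API keys are passed in code.
    """
    import boto3  # third-party, imported lazily so parse_role_arn works without it

    return boto3.client("s3", region_name=REGION)
```

Calling `make_s3_client().list_objects_v2(Bucket=BUCKET, Prefix="manifests/", MaxKeys=5)` is one quick way to confirm that the grant is live.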

Recommended Infrastructure

For lowest latency and zero data-transfer cost, we strongly recommend:

Region: Host your reader infrastructure in eu-central-1 (Frankfurt), co-located with our bucket. Cross-region or out-of-AWS reads incur AWS data-transfer charges.

VPC Gateway Endpoint for S3: Attach an S3 Gateway Endpoint to the VPC your reader runs in. It is free to create, keeps traffic on the AWS backbone, and is required for zero-cost access when reading from a VPC.

Incremental reads: Use manifests/incremental/YYYY-MM-DD.parquet instead of re-scanning full year partitions nightly. Reduces GETs by 100–1000× for daily-or-more-frequent pipelines.

Filter to processing_status = 'COMPLETED' unless you explicitly need pending, failed, or skipped filings.
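The Gateway Endpoint recommended above can be created in the console, with infrastructure-as-code, or programmatically. The following is a minimal boto3 sketch; the function names and the VPC and route table IDs are placeholders, not values from this documentation.

```python
REGION = "eu-central-1"


def gateway_endpoint_params(vpc_id, route_table_ids):
    """Build the arguments for EC2's create_vpc_endpoint call."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # Regional S3 service name used by gateway endpoints.
        "ServiceName": f"com.amazonaws.{REGION}.s3",
        # S3 routes are added to these route tables.
        "RouteTableIds": list(route_table_ids),
    }


def create_gateway_endpoint(vpc_id, route_table_ids):
    """Create the endpoint in your own AWS account (needs EC2 permissions)."""
    import boto3  # third-party, imported lazily

    ec2 = boto3.client("ec2", region_name=REGION)
    return ec2.create_vpc_endpoint(**gateway_endpoint_params(vpc_id, route_table_ids))
```

For example, `create_gateway_endpoint("vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"])`, with your own IDs substituted.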

Data Transfer Costs

Requests from AWS infrastructure in eu-central-1 with a VPC Gateway Endpoint for S3 incur zero data-transfer cost to either party. All other access paths — cross-region AWS, outside AWS, or AWS without a VPC endpoint — route over the public internet and incur standard AWS egress charges.

For contracts starting after May 1, 2026, data-transfer costs for non-colocated access are billed directly to the requester via S3 Requester Pays. Your S3 client must include the x-amz-request-payer: requester header (or RequestPayer='requester' in boto3, --request-payer requester in the AWS CLI). Pipelines configured per the Recommended Infrastructure section above are unaffected.

Existing contracts are grandfathered until renewal.
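For contracts that fall under Requester Pays, the boto3 parameter mentioned above could be wired in as follows. This is a sketch only: `get_object_kwargs` and `fetch` are hypothetical helper names, and whether you need the flag depends on your contract.

```python
BUCKET = "financialreports-bucket"


def get_object_kwargs(key, requester_pays):
    """Build keyword arguments for boto3's S3 get_object call."""
    kwargs = {"Bucket": BUCKET, "Key": key}
    if requester_pays:
        # boto3 turns this into the x-amz-request-payer: requester header.
        kwargs["RequestPayer"] = "requester"
    return kwargs


def fetch(key, requester_pays=True):
    """Download one object and return its bytes."""
    import boto3  # third-party, imported lazily

    s3 = boto3.client("s3", region_name="eu-central-1")
    response = s3.get_object(**get_object_kwargs(key, requester_pays))
    return response["Body"].read()
```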

Bucket Layout

s3://financialreports-bucket/
├── financialreports/media/filings/          Raw filing documents (PDF, HTML, ZIP, XBRL)
├── processed/markdown/{shard}/{id}.md       Processed markdown (one file per filing)
├── processed/html/{shard}/{id}.html         Pre-rendered HTML (one file per filing)
└── manifests/                               Parquet catalog (your entry point)
    ├── README.md
    ├── schema.json
    ├── filings/year=YYYY/part-00000.parquet
    ├── incremental/YYYY-MM-DD.parquet
    ├── companies.parquet
    └── reference/
        ├── sources.parquet
        ├── filing_types.parquet
        ├── filing_categories.parquet
        └── languages.parquet

Parquet Catalog

The manifests/ prefix is the queryable index of the entire platform. Start here — use it to discover filings, then pull the raw or processed content you need via the S3 keys in each row.

Unless you have a reason not to, filter WHERE processing_status = 'COMPLETED' in every query. The other states (PENDING, FAILED, SKIPPED) do not have usable markdown_s3_key or html_s3_key values.

Filings Table

Path: manifests/filings/year=YYYY/part-00000.parquet

Hive-partitioned by the filing's release year. One row per filing. This is the primary fact table.

Column	Type	Description
`filing_id`	int64	Unique filing identifier
`company_id`	int64	FinancialReports company ID (join to `companies.parquet`)
`company_name`	string	Legal name of the company
`company_tagline`	string	Short company description
`ticker`	string	Primary stock ticker
`is_listed`	bool	Whether the company is currently listed
`main_stock_exchange`	string	Primary exchange URL
`lei`	string	Legal Entity Identifier (20-char)
`country_code`	string	ISO 3166-1 alpha-2 country code
`sector_code`	string	ISIC Section code
`industry_group_code`	string	ISIC Division code
`industry_code`	string	ISIC Group code
`sub_industry_code`	string	ISIC Class code
`filing_type_id`	int64	Filing type ID (join to `filing_types.parquet`)
`filing_type_code`	string	Short code (e.g. `10-K`, `AR`, `RNS`)
`filing_type_name`	string	Full name (e.g. `Annual Report`)
`filing_category`	string	Top-level disclosure category
`source_id`	int64	Regulatory source ID (join to `sources.parquet`)
`source_name`	string	Regulatory source name (e.g. `SEC`, `BaFin`)
`language_id`	int64	Language ID (join to `languages.parquet`)
`language_code`	string	ISO 639-1 language code
`title`	string	Filing title (if available)
`fiscal_year`	int16	Fiscal year the filing covers
`fiscal_period`	string	Period: `FY`, `Q1`, `Q2`, `Q3`, `Q4`, `H1`, `H2`, `9M`
`period_ending_date`	date	End date of the reported period
`release_datetime`	timestamp (UTC)	When the filing was published by the source
`dissemination_datetime`	timestamp (UTC)	When the filing was disseminated to the public
`added_to_platform`	timestamp (UTC)	When FinancialReports ingested the filing
`updated_date`	timestamp (UTC)	Last modification timestamp
`file_extension`	string	Original file format (`PDF`, `HTML`, `ZIP`, etc.)
`file_size`	int64	File size in bytes
`processing_status`	string	`COMPLETED`, `PENDING`, `FAILED`, `SKIPPED`
`filing_type_confidence`	float64	AI classification confidence (0.0–1.0)
`language_confidence`	float64	Language detection confidence (0.0–1.0)
`raw_document_s3_key`	string	S3 key for the original document
`markdown_s3_key`	string	S3 key for the processed markdown
`html_s3_key`	string	S3 key for the pre-rendered HTML

To look up ISINs for a filing's company, join to companies.parquet via company_id.

Incremental Files

Path: manifests/incremental/YYYY-MM-DD.parquet

Same schema as the filings table. Contains every filing that was added or updated on that UTC day. Rolling 30-day window — files older than 30 days are automatically pruned.

Use these if your pipeline runs daily or more frequently and you want to avoid re-reading full year partitions.
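One way to work out which incremental files a run should read is sketched below. It assumes your pipeline records the UTC date of its last successful sync; `incremental_keys` is a hypothetical helper, and the cutoff is deliberately conservative because the exact pruning boundary is not specified beyond "30 days".

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # rolling window stated for manifests/incremental/


def incremental_keys(last_synced, today):
    """Keys of the daily files from last_synced through today, inclusive.

    The last_synced day is re-read on purpose, since that day's file may have
    been rewritten after the previous run.
    """
    gap = (today - last_synced).days
    if gap < 0:
        raise ValueError("last_synced is after today")
    if gap >= RETENTION_DAYS:
        # Older files may already be pruned: backfill from manifests/filings/.
        raise ValueError("gap exceeds the incremental window")
    days = (last_synced + timedelta(days=i) for i in range(gap + 1))
    return [f"manifests/incremental/{d.isoformat()}.parquet" for d in days]
```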

Companies Table

Path: manifests/companies.parquet

One row per company. Full company profiles including identifiers, classification, and metadata.

Column	Type	Description
`company_id`	int64	Unique company identifier
`name`	string	Legal name
`slug`	string	URL-safe identifier
`tagline`	string	Short description
`ticker`	string	Primary stock ticker
`is_listed`	bool	Currently listed
`main_stock_exchange`	string	Primary exchange URL
`lei`	string	Legal Entity Identifier
`isins`	list<string>	All ISINs associated with the company
`country_code`	string	ISO 3166-1 alpha-2
`sector_code`	string	ISIC Section
`industry_group_code`	string	ISIC Division
`industry_code`	string	ISIC Group
`sub_industry_code`	string	ISIC Class
`homepage_link`	string	Company website
`ir_link`	string	Investor relations page
`headcount`	int64	Number of employees
`shares_outstanding`	int64	Total shares outstanding
`date_ipo`	date	IPO date
`date_public`	date	Date public on FinancialReports
`year_founded`	date	Founding date
`legal_status`	string	`ACTIVE`, `INACTIVE`, `MERGED`, etc.

Reference Tables

Small lookup tables for joining and filtering.

File	Rows	Key columns
`reference/sources.parquet`	~43	`id`, `name`, `url`, `description`
`reference/filing_types.parquet`	~38	`id`, `code`, `name`, `category_name`
`reference/filing_categories.parquet`	~11	`id`, `name`
`reference/languages.parquet`	~186	`id`, `code`, `name`, `alpha3_code`

Schema Metadata

Path: manifests/schema.json

Machine-readable column definitions for all tables. Includes manifest_version for tracking breaking changes.

Pulling Efficiently

The raw filings corpus is ~14 TB and grows daily. A naïve aws s3 sync transfers the full corpus on its first run and re-lists every object on each later run, so it is never the right approach.

Use the incremental files: for pipelines running daily or more frequently, read the relevant manifests/incremental/YYYY-MM-DD.parquet files, then GetObject only the S3 keys for rows you don't already have cached locally.

Cache by filing_id: store previously fetched filings keyed by filing_id and compare against updated_date. Refetch only when updated_date exceeds your cache timestamp.

Partition-prune the year index: when scanning historical data, always filter on year to get partition pruning. Without a year = predicate, the query scans every partition.
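The caching rule above can be sketched as a small filter over catalog rows. The cache is modelled here as a plain dict for illustration; in practice it might be a database table or a local manifest. `select_stale` is a hypothetical name.

```python
from datetime import datetime


def select_stale(rows, cache):
    """Return the catalog rows whose content must be fetched or refetched.

    rows:  iterable of (filing_id, updated_date, s3_key) tuples
    cache: dict mapping filing_id to the updated_date seen at the last fetch
    """
    stale = []
    for filing_id, updated_date, key in rows:
        seen = cache.get(filing_id)
        # Fetch when the filing is new, or when the catalog is newer than the cache.
        if seen is None or updated_date > seen:
            stale.append((filing_id, updated_date, key))
    return stale
```

After each successful download, store the row's updated_date under its filing_id so the next run skips it.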

Refresh Cadence

The parquet catalog is regenerated nightly at approximately 04:15 CEST (02:15 UTC).

The current year partition is fully rewritten each night.

Historical year partitions are immutable snapshots (rewritten only when corrections or reprocessing occur).

Incremental files for the previous day and current day are written each night.

Companies and reference tables are refreshed nightly.

New filings typically appear in the parquet catalog within 24 hours of ingestion. For lower latency, use our Webhooks or REST API.

Example Queries

All examples use DuckDB, which reads parquet directly from S3. Athena, Spark, Pandas, and Polars work equivalently.

Configure S3 access (DuckDB)
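A possible setup using DuckDB's Python client is sketched below. It assumes a DuckDB version that supports `CREATE SECRET` (0.10 or newer) and that the AWS credential chain on the machine resolves to your granted role. The secret name `fr_s3` and the `connect` helper are arbitrary choices for this example.

```python
# SQL statements run once per DuckDB connection.
SETUP_STATEMENTS = [
    "INSTALL httpfs",
    "LOAD httpfs",
    "INSTALL aws",
    "LOAD aws",
    # credential_chain reuses the standard AWS lookup (environment, shared
    # config, instance role), so no keys appear in code.
    "CREATE OR REPLACE SECRET fr_s3 ("
    "TYPE s3, PROVIDER credential_chain, REGION 'eu-central-1')",
]


def connect():
    """Open an in-memory DuckDB connection with S3 access configured."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()
    for statement in SETUP_STATEMENTS:
        con.execute(statement)
    return con
```

The returned connection can then run any catalog query with `con.execute(sql)`.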

All filings from 2025
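A sketch that reads a single year partition directly, using the path pattern from the Filings Table section. The selected columns are an arbitrary subset, and `con` is assumed to be a DuckDB connection with S3 access already configured.

```python
BUCKET_URI = "s3://financialreports-bucket"


def filings_for_year_sql(year):
    """SQL selecting completed filings from one year partition."""
    path = f"{BUCKET_URI}/manifests/filings/year={int(year)}/part-00000.parquet"
    return (
        "SELECT filing_id, company_name, filing_type_code, "
        "release_datetime, markdown_s3_key "
        f"FROM read_parquet('{path}') "
        "WHERE processing_status = 'COMPLETED'"
    )


def run(con, sql):
    """Execute a query on a configured DuckDB connection and fetch all rows."""
    return con.execute(sql).fetchall()
```

Usage: `rows = run(con, filings_for_year_sql(2025))`.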

German annual reports with partition pruning
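The sketch below queries the whole filings glob while keeping a `year =` predicate so that only one partition is read. Two assumptions are baked in and may need adjusting: "German" is interpreted as the company's country (`country_code = 'DE'`), and annual reports are matched on the example name `Annual Report` from the column description.

```python
FILINGS_GLOB = "s3://financialreports-bucket/manifests/filings/**/*.parquet"


def german_annual_reports_sql(year):
    """SQL for completed annual reports of German companies in one year."""
    return f"""
        SELECT filing_id, company_name, title, fiscal_year, markdown_s3_key
        FROM read_parquet('{FILINGS_GLOB}', hive_partitioning = true)
        WHERE year = {int(year)}                 -- partition predicate
          AND country_code = 'DE'                -- assumption: country, not language
          AND filing_type_name = 'Annual Report' -- assumption: exact stored name
          AND processing_status = 'COMPLETED'
        ORDER BY release_datetime DESC
    """
```

Pass the result to `con.execute(...)` on a DuckDB connection with S3 access configured.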

Filings updated in the last 24 hours
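Because incremental files are cut per UTC day, a 24-hour window always spans two of them. The sketch below reads both and filters on `updated_date`. It assumes both daily files already exist and that `updated_date` compares correctly against a UTC timestamp literal; `last_24h_sql` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

INCREMENTAL_URI = "s3://financialreports-bucket/manifests/incremental"


def last_24h_sql(now):
    """SQL for completed filings updated in the 24 hours before `now`.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    cutoff = now - timedelta(hours=24)
    # The window starts on the previous UTC day and ends on the current one.
    days = [cutoff.date(), now.date()]
    files = ", ".join(f"'{INCREMENTAL_URI}/{d.isoformat()}.parquet'" for d in days)
    return (
        "SELECT filing_id, updated_date, markdown_s3_key "
        f"FROM read_parquet([{files}]) "
        f"WHERE updated_date >= TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S}' "
        "AND processing_status = 'COMPLETED'"
    )
```

Usage: `con.execute(last_24h_sql(datetime.now(timezone.utc)))`.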

Filings updated in the last 30 days
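Since the incremental prefix holds a rolling 30-day window, globbing it covers this case. The sketch assumes a filing updated on several days appears in each of those daily files, so it keeps only the newest row per `filing_id`.

```python
LAST_30_DAYS_SQL = """
    SELECT *
    FROM read_parquet('s3://financialreports-bucket/manifests/incremental/*.parquet')
    WHERE processing_status = 'COMPLETED'
    -- keep the most recent row when a filing shows up in several daily files
    QUALIFY row_number() OVER (PARTITION BY filing_id ORDER BY updated_date DESC) = 1
"""


def fetch_last_30_days(con):
    """Run the query on a DuckDB connection with S3 access configured."""
    return con.execute(LAST_30_DAYS_SQL).fetchall()
```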

Join filings to company profiles
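A sketch of the join on `company_id`, pulling a few profile columns that exist only in the companies table. The column selection is arbitrary and `filings_with_profiles_sql` is a hypothetical name.

```python
MANIFESTS_URI = "s3://financialreports-bucket/manifests"


def filings_with_profiles_sql(year):
    """SQL joining one year of completed filings to company profiles."""
    filings = f"{MANIFESTS_URI}/filings/year={int(year)}/part-00000.parquet"
    companies = f"{MANIFESTS_URI}/companies.parquet"
    return f"""
        SELECT f.filing_id, f.title, f.release_datetime,
               c.name, c.isins, c.ir_link, c.legal_status
        FROM read_parquet('{filings}') AS f
        JOIN read_parquet('{companies}') AS c
          ON f.company_id = c.company_id
        WHERE f.processing_status = 'COMPLETED'
    """
```

Pass the result to `con.execute(...)` on a configured DuckDB connection.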

Filter filings by ISIN
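ISINs live only in the companies table, so this sketch joins on `company_id` and applies `list_contains` to the `isins` list column. The ISIN is validated against the standard 12-character shape before being placed in the SQL text. The function name and the sample ISIN in the usage line are illustrative only.

```python
import re

MANIFESTS_URI = "s3://financialreports-bucket/manifests"
# Two letters, nine alphanumerics, one check digit.
ISIN_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def filings_by_isin_sql(isin, year):
    """SQL for completed filings in `year` from the company holding `isin`."""
    if not ISIN_RE.match(isin):
        raise ValueError(f"not a well-formed ISIN: {isin!r}")
    return f"""
        SELECT f.filing_id, f.title, f.release_datetime, f.markdown_s3_key
        FROM read_parquet('{MANIFESTS_URI}/filings/**/*.parquet',
                          hive_partitioning = true) AS f
        JOIN read_parquet('{MANIFESTS_URI}/companies.parquet') AS c
          ON f.company_id = c.company_id
        WHERE f.year = {int(year)}
          AND list_contains(c.isins, '{isin}')
          AND f.processing_status = 'COMPLETED'
    """
```

Usage: `con.execute(filings_by_isin_sql("US0378331005", 2025))`.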

Count filings by country and year
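An aggregate over several partitions might look like the sketch below. The list of years is passed explicitly so that the query always carries a predicate on `year`; `counts_by_country_sql` is a hypothetical helper.

```python
FILINGS_GLOB = "s3://financialreports-bucket/manifests/filings/**/*.parquet"


def counts_by_country_sql(years):
    """SQL counting completed filings per country and year."""
    if not years:
        raise ValueError("pass at least one year to keep the scan bounded")
    year_list = ", ".join(str(int(y)) for y in years)
    return f"""
        SELECT country_code, year, count(*) AS filings
        FROM read_parquet('{FILINGS_GLOB}', hive_partitioning = true)
        WHERE year IN ({year_list})
          AND processing_status = 'COMPLETED'
        GROUP BY country_code, year
        ORDER BY country_code, year
    """
```

Usage: `con.execute(counts_by_country_sql([2023, 2024, 2025]))`.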

Download a specific filing's markdown

Once you've identified a filing via the parquet catalog, pull its content directly from S3 using the key in its row.
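In Python, a download could look like the boto3 sketch below. The local directory layout, which mirrors the S3 key, and the helper names are choices made for this example. Add `ExtraArgs={"RequestPayer": "requester"}` to the `download_file` call if your contract uses Requester Pays.

```python
from pathlib import Path

BUCKET = "financialreports-bucket"


def local_path_for(key, root):
    """Mirror the S3 key under a local root directory."""
    return Path(root) / key


def download(key, root="filings"):
    """Download one object, taking `key` from markdown_s3_key in the catalog."""
    import boto3  # third-party, imported lazily

    dest = local_path_for(key, root)
    dest.parent.mkdir(parents=True, exist_ok=True)
    s3 = boto3.client("s3", region_name="eu-central-1")
    s3.download_file(BUCKET, key, str(dest))
    return dest
```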

S3 Key Construction

The raw_document_s3_key, markdown_s3_key, and html_s3_key columns in the filings parquet are paths relative to the bucket root. To access a file:

s3://financialreports-bucket/{key_from_parquet}

Markdown and HTML keys follow a sharded layout:

processed/markdown/{filing_id // 10000}/{filing_id}.md
processed/html/{filing_id // 10000}/{filing_id}.html

The shard prefix groups 10,000 consecutive filing IDs, so any single S3 prefix holds at most 10,000 objects and stays fast to list.
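The layout above translates into a few lines of Python. Keys from the parquet catalog remain the source of truth; helpers like these (the names are hypothetical) are mainly useful for sanity checks or for addressing a filing whose `filing_id` you already know.

```python
BUCKET = "financialreports-bucket"
SHARD_SIZE = 10_000  # filing_id // 10000 selects the shard directory


def markdown_key(filing_id):
    """Key of the processed markdown for a filing."""
    return f"processed/markdown/{filing_id // SHARD_SIZE}/{filing_id}.md"


def html_key(filing_id):
    """Key of the pre-rendered HTML for a filing."""
    return f"processed/html/{filing_id // SHARD_SIZE}/{filing_id}.html"


def s3_uri(key):
    """Full S3 URI for a key taken from the catalog or built above."""
    return f"s3://{BUCKET}/{key}"
```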

Versioning

The manifest_version field is embedded in each parquet file's metadata and in schema.json. The current version is 1.1.

Changelog

1.1 — Removed isins column from filings parquet. ISINs are available in companies.parquet and should be joined via company_id. This eliminates redundant data for companies with large numbers of ISINs (e.g. structured product issuers).

1.0 — Initial release.

Breaking changes (column removal, type changes) will bump the version number. Additive changes (new columns) will bump the minor version. We will notify you before any breaking change.
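Since the changelog shows a column removal between 1.0 and 1.1, a pipeline may prefer to pin the exact version it was tested against and stop when it changes. The sketch below does that with the contents of manifests/schema.json. It assumes `manifest_version` is a top-level key in that file, which this page does not confirm.

```python
import json

TESTED_VERSION = "1.1"  # the manifest version this pipeline was validated against


def read_manifest_version(schema_json_text):
    """Extract manifest_version from the text of manifests/schema.json.

    Assumption: the field is a top-level key of the JSON document.
    """
    return str(json.loads(schema_json_text)["manifest_version"])


def assert_tested_version(schema_json_text):
    """Raise if the catalog version differs from the one this code was tested on."""
    found = read_manifest_version(schema_json_text)
    if found != TESTED_VERSION:
        raise RuntimeError(
            f"manifest_version is {found}, expected {TESTED_VERSION}; "
            "review the changelog before continuing"
        )
```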

Limitations

S3 bulk delivery provides the full catalog. Per-country or per-source filtering is not applied at the S3 level — use the parquet columns to filter client-side.

Markdown and HTML content is not embedded in the parquet files. The parquet contains S3 keys that point to the content files. Pull them separately as needed.

To filter filings by ISIN, join to companies.parquet via company_id and use list_contains(c.isins, 'YOUR_ISIN').

Production Checklist

Before putting a pipeline into production against this bucket:

Reader runs in AWS eu-central-1

VPC Gateway Endpoint for S3 attached to the reader's VPC

Local cache of fetched filings keyed by filing_id and compared against updated_date

Daily-or-faster pipelines read from manifests/incremental/ rather than full-year partitions

Queries filter WHERE processing_status = 'COMPLETED' unless other states are explicitly required

Partition pruning verified: the filings/**/*.parquet glob is always used with a year = predicate

Support

Technical questions: api@financialreports.eu

Account and billing: sales@financialreports.eu
