IAM role ARN format: `arn:aws:iam::<account-id>:role/<role-name>`

Bucket details (`eu-central-1`):

- Bucket: `financialreports-bucket`
- Region: `eu-central-1`

Recommended Infrastructure

- Run compute in `eu-central-1` (Frankfurt), co-located with our bucket. Cross-region or out-of-AWS reads incur AWS data-transfer charges.
- Read `manifests/incremental/YYYY-MM-DD.parquet` instead of re-scanning full year partitions nightly. This reduces GETs by 100–1000× for pipelines that run daily or more often.
- Filter on `processing_status = 'COMPLETED'` unless you explicitly need pending, failed, or skipped filings.

Reads from `eu-central-1` with a VPC Gateway Endpoint for S3 incur zero data-transfer cost to either party. All other access paths (cross-region AWS, outside AWS, or AWS without a VPC endpoint) route over the public internet and incur standard AWS egress charges.

Requester-pays requests must include the `x-amz-request-payer: requester` header (or `RequestPayer='requester'` in boto3, `--request-payer requester` in the AWS CLI). Pipelines configured per the Recommended Infrastructure section above are unaffected.

s3://financialreports-bucket/
├── financialreports/media/filings/           Raw filing documents (PDF, HTML, ZIP, XBRL)
├── processed/markdown/{shard}/{id}.md        Processed markdown (one file per filing)
├── processed/html/{shard}/{id}.html          Pre-rendered HTML (one file per filing)
└── manifests/                                Parquet catalog (your entry point)
    ├── README.md
    ├── schema.json
    ├── filings/year=YYYY/part-00000.parquet
    ├── incremental/YYYY-MM-DD.parquet
    ├── companies.parquet
    └── reference/
        ├── sources.parquet
        ├── filing_types.parquet
        ├── filing_categories.parquet
        └── languages.parquet

The `manifests/` prefix is the queryable index of the entire platform. Start here: use it to discover filings, then pull the raw or processed content you need via the S3 keys in each row.

Include `WHERE processing_status = 'COMPLETED'` in every query. The other states (`PENDING`, `FAILED`, `SKIPPED`) do not have usable `markdown_s3_key` or `html_s3_key` values.

manifests/filings/year=YYYY/part-00000.parquet

| Column | Type | Description |
|---|---|---|
| filing_id | int64 | Unique filing identifier |
| company_id | int64 | FinancialReports company ID (join to companies.parquet) |
| company_name | string | Legal name of the company |
| company_tagline | string | Short company description |
| ticker | string | Primary stock ticker |
| is_listed | bool | Whether the company is currently listed |
| main_stock_exchange | string | Primary exchange URL |
| lei | string | Legal Entity Identifier (20-char) |
| country_code | string | ISO 3166-1 alpha-2 country code |
| sector_code | string | ISIC Section code |
| industry_group_code | string | ISIC Division code |
| industry_code | string | ISIC Group code |
| sub_industry_code | string | ISIC Class code |
| filing_type_id | int64 | Filing type ID (join to filing_types.parquet) |
| filing_type_code | string | Short code (e.g. 10-K, AR, RNS) |
| filing_type_name | string | Full name (e.g. Annual Report) |
| filing_category | string | Top-level disclosure category |
| source_id | int64 | Regulatory source ID (join to sources.parquet) |
| source_name | string | Regulatory source name (e.g. SEC, BaFin) |
| language_id | int64 | Language ID (join to languages.parquet) |
| language_code | string | ISO 639-1 language code |
| title | string | Filing title (if available) |
| fiscal_year | int16 | Fiscal year the filing covers |
| fiscal_period | string | Period: FY, Q1, Q2, Q3, Q4, H1, H2, 9M |
| period_ending_date | date | End date of the reported period |
| release_datetime | timestamp (UTC) | When the filing was published by the source |
| dissemination_datetime | timestamp (UTC) | When the filing was disseminated to the public |
| added_to_platform | timestamp (UTC) | When FinancialReports ingested the filing |
| updated_date | timestamp (UTC) | Last modification timestamp |
| file_extension | string | Original file format (PDF, HTML, ZIP, etc.) |
| file_size | int64 | File size in bytes |
| processing_status | string | COMPLETED, PENDING, FAILED, SKIPPED |
| filing_type_confidence | float64 | AI classification confidence (0.0–1.0) |
| language_confidence | float64 | Language detection confidence (0.0–1.0) |
| raw_document_s3_key | string | S3 key for the original document |
| markdown_s3_key | string | S3 key for the processed markdown |
| html_s3_key | string | S3 key for the pre-rendered HTML |
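
A minimal sketch of a query over the filings manifest that keeps both the `year` predicate and the `COMPLETED` filter. It assumes DuckDB as the query engine, which this document only implies through `list_contains` and the `filings/**/*.parquet` glob. The helper name `build_filings_query` and the selected columns are illustrative, and the function only builds the SQL text; running the query would need DuckDB with S3 access configured.

```python
BUCKET = "financialreports-bucket"


def build_filings_query(year: int, completed_only: bool = True) -> str:
    """Build a DuckDB query over the year-partitioned filings manifest."""
    year = int(year)
    if not 1900 <= year <= 2100:
        raise ValueError(f"implausible year: {year!r}")
    glob = f"s3://{BUCKET}/manifests/filings/**/*.parquet"
    # The year predicate lets the engine skip every other year=YYYY partition.
    predicates = [f"year = {year}"]
    if completed_only:
        # Other states have no usable markdown_s3_key or html_s3_key.
        predicates.append("processing_status = 'COMPLETED'")
    return (
        "SELECT filing_id, company_id, updated_date, markdown_s3_key\n"
        f"FROM read_parquet('{glob}', hive_partitioning = true)\n"
        "WHERE " + " AND ".join(predicates)
    )


print(build_filings_query(2024))
```

The returned string could be passed to a DuckDB connection as-is; dropping `completed_only` is only appropriate when pending, failed, or skipped filings are explicitly needed.
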

Join to `companies.parquet` via `company_id`.

manifests/incremental/YYYY-MM-DD.parquet

manifests/companies.parquet

| Column | Type | Description |
|---|---|---|
| company_id | int64 | Unique company identifier |
| name | string | Legal name |
| slug | string | URL-safe identifier |
| tagline | string | Short description |
| ticker | string | Primary stock ticker |
| is_listed | bool | Currently listed |
| main_stock_exchange | string | Primary exchange URL |
| lei | string | Legal Entity Identifier |
| isins | list<string> | All ISINs associated with the company |
| country_code | string | ISO 3166-1 alpha-2 |
| sector_code | string | ISIC Section |
| industry_group_code | string | ISIC Division |
| industry_code | string | ISIC Group |
| sub_industry_code | string | ISIC Class |
| homepage_link | string | Company website |
| ir_link | string | Investor relations page |
| headcount | int64 | Number of employees |
| shares_outstanding | int64 | Total shares outstanding |
| date_ipo | date | IPO date |
| date_public | date | Date public on FinancialReports |
| year_founded | date | Founding date |
| legal_status | string | ACTIVE, INACTIVE, MERGED, etc. |
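
Because `isins` is a list column in `companies.parquet`, finding filings by ISIN means joining on `company_id` and testing list membership. The sketch below builds such a query as text, again assuming DuckDB (`read_parquet`, `list_contains`). The helper name `build_isin_query` and the selected columns are hypothetical, and the 12-character alphanumeric check is a simple guard so the value can be inlined safely, not a full ISIN checksum validation.

```python
BUCKET = "financialreports-bucket"


def build_isin_query(isin: str, year: int) -> str:
    """Build a DuckDB query for one company's COMPLETED filings, looked up by ISIN."""
    isin = isin.strip().upper()
    # ISINs are 12 alphanumeric characters; rejecting anything else also
    # keeps quotes and other SQL metacharacters out of the literal.
    if len(isin) != 12 or not isin.isalnum():
        raise ValueError(f"not a plausible ISIN: {isin!r}")
    filings = f"s3://{BUCKET}/manifests/filings/**/*.parquet"
    companies = f"s3://{BUCKET}/manifests/companies.parquet"
    return (
        "SELECT f.filing_id, f.title, f.markdown_s3_key\n"
        f"FROM read_parquet('{filings}', hive_partitioning = true) AS f\n"
        f"JOIN read_parquet('{companies}') AS c ON f.company_id = c.company_id\n"
        f"WHERE f.year = {int(year)}\n"
        "  AND f.processing_status = 'COMPLETED'\n"
        f"  AND list_contains(c.isins, '{isin}')"
    )


print(build_isin_query("US0378331005", 2024))
```
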

manifests/reference/

| File | Rows | Key columns |
|---|---|---|
| reference/sources.parquet | ~43 | id, name, url, description |
| reference/filing_types.parquet | ~38 | id, code, name, category_name |
| reference/filing_categories.parquet | ~11 | id, name |
| reference/languages.parquet | ~186 | id, code, name, alpha3_code |

`manifests/schema.json` carries a `manifest_version` for tracking breaking changes.

- `aws s3 sync` will transfer the full corpus every time and is never the right approach.
- `GetObject` only the S3 keys for rows you don't already have cached locally.
- Cache by `filing_id`: store previously fetched filings keyed by `filing_id` and compare against `updated_date`. Refetch only when `updated_date` exceeds your cache timestamp.
- Filter on `year` to get partition pruning. Without a `year =` predicate, the query scans every partition.

The `raw_document_s3_key`, `markdown_s3_key`, and `html_s3_key` columns in the filings parquet are paths relative to the bucket root. To access a file:

s3://financialreports-bucket/{key_from_parquet}

Processed keys follow this pattern:

processed/markdown/{filing_id // 10000}/{filing_id}.md
processed/html/{filing_id // 10000}/{filing_id}.html

The `manifest_version` field is embedded in each parquet file's metadata and in `schema.json`. The current version is 1.1.

- Removed the `isins` column from the filings parquet. ISINs are available in `companies.parquet` and should be joined via `company_id`. This eliminates redundant data for companies with large numbers of ISINs (e.g. structured product issuers).
- To filter by ISIN, join to `companies.parquet` via `company_id` and use `list_contains(c.isins, 'YOUR_ISIN')`.

Checklist:

- Compute runs in `eu-central-1`
- The cache is keyed by `filing_id` and compared against `updated_date`
- Daily runs read `manifests/incremental/` rather than full-year partitions
- Queries include `WHERE processing_status = 'COMPLETED'` unless other states are explicitly required
- The `filings/**/*.parquet` glob is always used with a `year =` predicate
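
The caching and incremental-read practices above can be sketched in plain Python. This is an illustration under stated assumptions, not the platform's own tooling: the function names are hypothetical, the cache is modelled as a dict from `filing_id` to the UTC time the filing was last fetched, and `incremental_keys` assumes one incremental file per calendar day, so a caller should tolerate a missing day.

```python
from datetime import date, datetime, timedelta, timezone


def incremental_keys(last_synced: date, today: date) -> list[str]:
    """Incremental manifest keys for each day after last_synced, up to and including today."""
    days = (today - last_synced).days
    return [
        f"manifests/incremental/{(last_synced + timedelta(days=n)).isoformat()}.parquet"
        for n in range(1, days + 1)
    ]


def needs_fetch(row: dict, cache: dict[int, datetime]) -> bool:
    """True when a filing is not cached yet or was updated after it was cached."""
    cached_at = cache.get(row["filing_id"])
    return cached_at is None or row["updated_date"] > cached_at


def markdown_key(filing_id: int) -> str:
    """Documented layout: processed/markdown/{filing_id // 10000}/{filing_id}.md"""
    return f"processed/markdown/{filing_id // 10000}/{filing_id}.md"


# Example: one filing already cached, one updated since, one never seen.
cache = {101: datetime(2024, 5, 2, tzinfo=timezone.utc),
         102: datetime(2024, 5, 2, tzinfo=timezone.utc)}
rows = [
    {"filing_id": 101, "updated_date": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"filing_id": 102, "updated_date": datetime(2024, 5, 3, tzinfo=timezone.utc)},
    {"filing_id": 1234567, "updated_date": datetime(2024, 5, 3, tzinfo=timezone.utc)},
]
to_fetch = [markdown_key(r["filing_id"]) for r in rows if needs_fetch(r, cache)]
print(incremental_keys(date(2024, 5, 1), date(2024, 5, 3)))
print(to_fetch)
```

In a real pipeline the `markdown_s3_key` value from each manifest row is the safer source for the object key; `markdown_key` only shows how the shard prefix is derived from `filing_id`.
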