Web Harvesting Services - Library of Congress
Qualification Details
Fit reasons
- NAICS alignment with historical contract wins in similar service areas.
- Scope strongly matches core technical capabilities and delivery model.
Risks
- Past performance thresholds may require one additional teaming partner.
- Potential clarification needed on staffing minimums before bid/no-bid.
Next steps
Validate eligibility requirements, assign capture owner, and schedule partner outreach to confirm teaming strategy before submission planning.
Quick Summary
The Library of Congress has issued a Combined Synopsis/Solicitation (RFP 030ADV26R0010) for Web Harvesting Services. This Unrestricted opportunity seeks contract support for systematic, at-scale web content harvesting, including temporary access, crawl reports, and content transfer for preservation and public access. The contract is an Indefinite-Delivery, Indefinite-Quantity (IDIQ) with firm-fixed-price Task Orders, with an estimated value between $300,000 and $15,000,000. Proposals are due February 24, 2026, at 5:00 PM EST.
Purpose & Scope
The Library of Congress requires services to systematically harvest web content based on staff instructions, provide temporary access to content and crawl reports for quality review, and enable content transfer for preservation and public access. The scope includes capturing an estimated 350-700 Terabytes (TB) of data through various crawl types: weekly (1,700 seeds), monthly (4,000 seeds), extended (up to 10,000 seeds), and specific weekly crawls for the US Election 2026 (up to 1,500 seeds). Crawls generally ignore robots.txt and require deduplication.
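The deduplication requirement can be illustrated with a minimal sketch. It assumes duplicates are detected by comparing payload digests, which is one common approach in web archiving; the summary above does not say which method the Library requires. The function name `dedupe_captures` and the sample URLs are hypothetical.

```python
import hashlib


def dedupe_captures(captures):
    """Flag captures whose payload was already seen earlier in the crawl.

    `captures` is an iterable of (url, payload_bytes) pairs.
    Returns a list of (url, sha1_hex, original_url) tuples, where
    `original_url` is None for the first capture of a payload and the URL
    of the first capture for any later duplicate.
    Assumption: digest-based matching; not confirmed by the solicitation.
    """
    first_seen = {}  # digest -> URL of the first capture with that payload
    results = []
    for url, payload in captures:
        digest = hashlib.sha1(payload).hexdigest()
        original = first_seen.get(digest)
        if original is None:
            first_seen[digest] = url
        results.append((url, digest, original))
    return results


if __name__ == "__main__":
    sample = [
        ("http://a.example/", b"same body"),
        ("http://b.example/", b"same body"),
        ("http://c.example/", b"different body"),
    ]
    for url, digest, original in dedupe_captures(sample):
        status = "new" if original is None else f"duplicate of {original}"
        print(url, digest[:8], status)
```

In a real crawl, a duplicate would typically be written as a WARC `revisit` record pointing to the original capture rather than storing the payload again.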
Key Requirements
Contractors must perform web content harvesting and package captured content into valid WARC (Web ARChive) files (ISO 28500:2017) with 11-field CDX indexes for transfer to the Library's S3 bucket over HTTPS. A single BagIt bag should not exceed 1 TB, and WARC files should target roughly 1 GB each. Services include providing an access tool for quality review, generating detailed reports (ASCII text and XML) within 5 days of crawl completion, and developing a Quality Control Program (QCP). Infrastructure must use US-based servers with reliable and secure data storage. Strict information security policies apply, including restrictions on Generative AI use and mandatory IT Security Training. Key personnel (Program Manager, Crawl Engineer, Quality Assurance Lead) with specified experience are required.
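The 1 TB bag limit and 1 GB WARC target imply a grouping step before transfer. The sketch below shows one simple way to do it. It assumes decimal units (1 TB = 10^12 bytes), counts only WARC payload sizes (not manifests or CDX files), and keeps files in crawl order; none of these choices come from the solicitation. `pack_into_bags` is a hypothetical helper.

```python
TB = 10**12  # assumption: decimal terabyte
GB = 10**9   # assumption: decimal gigabyte


def pack_into_bags(warc_files, max_bag_bytes=TB):
    """Group (name, size_in_bytes) pairs into bags no larger than the limit.

    Files are kept in their original order; a new bag is started whenever
    adding the next file would push the current bag over `max_bag_bytes`.
    """
    bags = []
    current, current_size = [], 0
    for name, size in warc_files:
        if size > max_bag_bytes:
            raise ValueError(f"{name} alone exceeds the bag size limit")
        if current and current_size + size > max_bag_bytes:
            bags.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bags.append(current)
    return bags


if __name__ == "__main__":
    # 2,500 hypothetical WARC files of exactly 1 GB each
    files = [(f"crawl-{i:05d}.warc.gz", GB) for i in range(2500)]
    for index, bag in enumerate(pack_into_bags(files)):
        print(f"bag {index}: {len(bag)} files")
```

Under the same assumptions, the estimated 350-700 TB of captured data corresponds to roughly 350,000-700,000 WARC files and at least 350-700 bags.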
Contract Details
- Contract Type: Indefinite-Delivery, Indefinite-Quantity (IDIQ) with firm-fixed-price Task Orders.
- Period of Performance: Base period from June 1, 2026, to May 31, 2031.
- Estimated Value: Minimum order of $300,000.00; Maximum order of $15,000,000.00.
- Place of Performance: Contractor's own facilities.
- Set-Aside: Unrestricted.
- Product Service Code: DK10 - Cloud Solutions Delivered As A Service.
Submission Requirements & Deadlines
- Proposals Due: February 24, 2026, at 5:00 PM EST.
- Questions Due: February 2, 2026, at 12:00 PM EST.
- Sample Web Crawl Size Notification Due: February 17, 2026, by 5:00 PM EST (via email to cdaly@loc.gov and jzwa@loc.gov).
- Past Performance Questionnaires (PPQs) Due: February 24, 2026, by noon EST (submitted directly by references to the Contracting Team).
- Proposal Content: Must include four volumes: Technical Approach (including a sample web crawl), Corporate Experience and Capabilities, Past Performance (using Attachment J3), and Price (using Attachment J4).
- Submission Method: Electronically via email to cdaly@loc.gov and jzwa@loc.gov. Total email attachment size must not exceed 20 MB. Proposals must be valid through June 6, 2026.
Evaluation Criteria
Award will be based on a Best-Value Trade-Off (BVTO) approach. Evaluation factors, in descending order of importance, are: Technical Approach, Corporate Experience and Capabilities, Past Performance, and Price. Non-price factors, when combined, are significantly more important than, or equally important to, price. The Library may award without discussions.
Technical Environment
The Library's existing environment uses Digiboard for seed management, provides seed lists in SURT format, and supports open-source tools like Heritrix, Brozzler, OpenWayback, and pywb. Harvested content is stored in WARC format, and data transfer occurs via AWS S3. Detailed crawl reports are required, including specific metrics on hosts, documents, MIME types, HTTP codes, and data sizes.
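For readers unfamiliar with SURT (Sort-friendly URI Reordering Transform), the sketch below shows the basic idea: the host labels are reversed and comma-separated so that sorting groups URLs from the same domain. This is a simplified illustration that handles only scheme, host, path, and query; it drops ports and user info and applies no canonicalization, so it may not match the exact form used in the Library's seed lists. `to_surt` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def to_surt(url):
    """Return a simplified SURT form of `url`.

    Example shape: http://www.example.com/a -> http://(com,example,www,)/a
    Simplifications (assumptions of this sketch): ports and user info are
    dropped, and no URL canonicalization is applied beyond lowercasing
    the host.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    reversed_host = ",".join(reversed(host.split("."))) + ","
    surt = f"{parts.scheme}://({reversed_host}){parts.path}"
    if parts.query:
        surt += "?" + parts.query
    return surt


if __name__ == "__main__":
    seeds = [
        "http://b.example.org/",
        "http://a.example.com/news",
        "http://a.example.org/",
    ]
    # Sorting SURT forms places hosts of the same domain next to each other.
    for line in sorted(to_surt(seed) for seed in seeds):
        print(line)
```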
Contact Information
- Primary: Colleen Daly (cdaly@loc.gov)
- Secondary: Jennifer Zwahlen (jzwa@loc.gov)