Web Harvesting Services - Library of Congress
Qualification Details
Fit reasons
- NAICS alignment with historical contract wins in similar service areas.
- Scope strongly matches core technical capabilities and delivery model.
Risks
- Past performance thresholds may require one additional teaming partner.
- Potential clarification needed on staffing minimums before bid/no-bid.
Next steps
Validate eligibility requirements, assign a capture owner, and schedule partner outreach to confirm the teaming strategy before submission planning begins.
Quick Summary
The Library of Congress is soliciting proposals for Web Harvesting Services under an Unrestricted Indefinite-Delivery, Indefinite-Quantity (IDIQ) contract. This opportunity seeks support for systematic, at-scale web content harvesting, temporary access, crawl reports, and content transfer for preservation and public access. Proposals are due by March 3, 2026, at 5:00 PM EST.
Purpose and Scope
The Library of Congress requires contract support to enable the systematic harvesting of web content at scale, based on instructions from Library staff. This includes providing temporary access to the content, generating required crawl reports for quality review, and facilitating the transfer of content to the Library for preservation and public access. The goal is to enrich the Library's digital collections.
Key requirements include:
- Performing web crawls based on Library specifications, seed lists, and scoping instructions.
- Capturing all digital objects (HTML, images, PDFs, multimedia) needed to accurately replicate webpages.
- Packaging captured content in valid WARC (Web ARChive) files with 11-field CDX indexes for transfer to the Library's AWS S3 bucket.
- Providing an access tool for Library staff to review crawl results prior to transfer.
- Generating detailed crawl reports (ASCII text and XML) within five days of crawl completion.
- Utilizing US-based servers for crawling and maintaining secure data storage.
- Adhering to strict information security policies, including restrictions on Generative AI use.
- Key Personnel: Program Manager/Alternate, Crawl Engineer, and Quality Assurance Lead.
- Estimated annual crawl volume: 300 to 700 terabytes.
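The 11-field CDX index named in the requirements above can be sanity-checked with a few lines of Python. This is a minimal sketch, assuming the commonly used Wayback-style field order (header ` CDX N b a m s k r M S V g`); the solicitation may specify a different layout, and the sample line, field names, and function name here are illustrative only.

```python
from typing import NamedTuple

# Assumed header for an 11-field, space-delimited CDX file.
# The solicitation's own attachments take precedence over this layout.
CDX11_HEADER = " CDX N b a m s k r M S V g"


class Cdx11Record(NamedTuple):
    urlkey: str           # N: canonicalized URL used as the sort key
    timestamp: str        # b: capture time, YYYYMMDDhhmmss
    original_url: str     # a: URL as crawled
    mime_type: str        # m: content type of the response
    status: str           # s: HTTP response code
    digest: str           # k: payload checksum
    redirect: str         # r: redirect target, or "-"
    meta_tags: str        # M: robots meta tags, or "-"
    compressed_size: str  # S: compressed record length in bytes
    offset: str           # V: byte offset of the record in the WARC file
    filename: str         # g: name of the WARC file holding the record


def parse_cdx11_line(line: str) -> Cdx11Record:
    """Split one CDX line and insist on exactly 11 fields."""
    fields = line.rstrip("\r\n").split(" ")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    return Cdx11Record(*fields)


# Hypothetical index line, made up for illustration.
SAMPLE_LINE = (
    "gov,loc)/ 20260217120000 https://www.loc.gov/ text/html 200 "
    "EXAMPLEDIGEST - - 1234 5678 sample-00000.warc.gz"
)

record = parse_cdx11_line(SAMPLE_LINE)
print(record.original_url, record.offset, record.filename)
```

A check like this catches truncated or mis-delimited index lines before transfer; it does not confirm that the offsets actually point at valid WARC records.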
Contract Details
- Contract Type: Indefinite-Delivery, Indefinite-Quantity (IDIQ) with firm-fixed-price Task Orders.
- Set-Aside: Unrestricted.
- Product Service Code: DK10 (Cloud Solutions Delivered As A Service).
- Period of Performance: A base period from June 1, 2026, to May 31, 2031.
- Estimated Value: Minimum order of $300,000.00; Maximum order of $15,000,000.00.
- Place of Performance: Contractor's own facilities.
Submission and Evaluation
- Proposal Due Date: March 3, 2026, 5:00 PM EST.
- Past Performance Questionnaires (PPQs) Due Date: March 3, 2026, noon ET. PPQs must be sent directly from the past performance reference.
- Sample Web Crawl Transfer Information Due Date: February 17, 2026, 5:00 PM ET.
- Submission Method: Electronically via email to Jennifer Zwahlen (jzwa@loc.gov) and Colleen Daly (cdaly@loc.gov). Total email attachment size must not exceed 20MB, and no zipped files are permitted.
- Proposal Content: Must include four volumes: Technical Approach (including a sample web crawl), Corporate Experience and Capabilities (including Key Personnel resumes), Past Performance (using Attachment J3), and Price (using Attachment J4).
- Evaluation Criteria: Best-Value Trade-Off (BVTO) approach. Factors in descending order of importance are Technical Approach, Corporate Experience and Capabilities, Past Performance, and Price. When combined, the non-price factors are either significantly more important than price or equal in importance to it.
- SAM Registration: Offerors must be registered in SAM to be considered for award.
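The email submission limits above (a 20MB total attachment size and no zipped files) lend themselves to a quick pre-send check. The sketch below is an assumption-laden illustration: it reads "20MB" as 20,000,000 bytes (the stricter interpretation), treats a short list of file extensions as "zipped", and uses hypothetical file names.

```python
# Assumption: "20MB" read as decimal megabytes, the stricter interpretation.
MAX_TOTAL_BYTES = 20 * 1000 * 1000

# Assumption: extensions treated as "zipped"; extend as needed.
ARCHIVE_SUFFIXES = (".zip", ".7z", ".rar")


def check_attachments(sizes_by_name: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    for name in sizes_by_name:
        if name.lower().endswith(ARCHIVE_SUFFIXES):
            problems.append(f"{name}: zipped files are not permitted")
    total = sum(sizes_by_name.values())
    if total > MAX_TOTAL_BYTES:
        problems.append(
            f"total size {total} bytes exceeds limit of {MAX_TOTAL_BYTES} bytes"
        )
    return problems


# Hypothetical proposal volumes with sizes in bytes.
attachments = {
    "Volume1_Technical.pdf": 6_500_000,
    "Volume2_Corporate.pdf": 4_200_000,
    "Volume3_PastPerformance.pdf": 2_100_000,
    "Volume4_Price.xlsx": 900_000,
}
print(check_attachments(attachments))
```

File sizes can be gathered with `pathlib.Path(...).stat().st_size`. Email encoding typically inflates attachments by about a third, so it may be worth confirming whether the limit applies to the raw or the encoded size.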
Technical Requirements
The Library's technical environment utilizes tools like Digiboard, Heritrix, Brozzler, OpenWayback, pywb, and OutbackCDX. Harvested content is stored in WARC format, and data transfer occurs via AWS S3. The required sample web crawl must be completed within 48 hours and should not respect robots.txt; its results (WARC files, CDX indexes, and reports) must be delivered via SFTP.
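Before delivering sample crawl results, an offeror might run a basic structural check on each WARC file. The following is a minimal sketch using only the Python standard library, assuming gzip-compressed WARC files; it reads the headers of the first record only and is a smoke test, not a substitute for full WARC validation. The function name is hypothetical.

```python
import gzip


def read_first_warc_headers(path: str) -> dict[str, str]:
    """Return the version line and named headers of the first WARC record."""
    with gzip.open(path, "rb") as fh:
        version = fh.readline().decode("utf-8").strip()
        if not version.startswith("WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")
        headers = {"version": version}
        for raw in fh:
            line = raw.decode("utf-8").rstrip("\r\n")
            if not line:
                # A blank line ends the header block of the record.
                break
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
    return headers
```

Calling `read_first_warc_headers("sample-00000.warc.gz")` on a well-formed file would typically return a dictionary whose `WARC-Type` is `warcinfo` for the opening record, although that depends on how the crawler writes its files.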