Closed Bug 1596537 Opened 5 years ago Closed 4 years ago

Deployment of CRLite Production Environment

Categories

(Cloud Services :: Operations: CRLite, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jcj, Assigned: sven)

References

(Blocks 1 open bug)

Details

(Whiteboard: [Target: Q1 2020])

CRLite is a WebPKI-wide certificate revocation system, to be distributed via Remote Settings for all Firefox users, replacing OCSP. We're experimenting with it now using a pre-production CRLite instance and manual inspection and submission of CRLite filter files to Remote Settings.

This bug is to formally deploy CRLite and hand over control of the production instance to CloudOps.

As of this writing, CRLite consists of several components, taken from https://github.com/mozilla/crlite/wiki/Overview:

Google Firestore

Bulk storage of all unexpired certificates in the Web PKI, as well as CT log metadata. They are organized in a hierarchy:

logs
    /<url>
ct
    /<expiration date string>
               /issuer
                       /<issuer SPKI string>
                                /certs
                                       /<certificate SPKI string>
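
As a rough illustration of the hierarchy above, the following sketch joins its levels into a single slash-separated path. The helper name `cert_document_path` and the sample values are hypothetical; the bug does not specify the exact format of the expiration date string or the SPKI strings, so simple placeholders are used here. It does not call the Firestore client at all.

```python
def cert_document_path(expiration: str, issuer_spki: str, cert_spki: str) -> str:
    """Join the hierarchy levels listed above into one slash-separated path.

    Hypothetical helper: the segment formats are assumptions, not taken
    from the CRLite code.
    """
    parts = ["ct", expiration, "issuer", issuer_spki, "certs", cert_spki]
    for part in parts:
        # A segment containing "/" would silently add an extra level.
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return "/".join(parts)


# Placeholder values only; real SPKI strings would be some encoded digest.
example = cert_document_path("2020-03-01", "issuerSPKI", "certSPKI")
print(example)  # ct/2020-03-01/issuer/issuerSPKI/certs/certSPKI
```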

Google Memorystore (Redis)

Fast lists of all unexpired certificate serial numbers, their issuers, and metadata (such as CRL distribution URLs).

A container, crlite-fetch

https://github.com/mozilla/crlite/tree/master/containers/crlite-fetch

This uses the ct-fetch tool from ct-mapreduce to download from all CT logs, placing the certificates into Firestore and the Memorystore/Redis cache. This container runs as an always-on Kubernetes deployment.

A container, crlite-generate

https://github.com/mozilla/crlite/tree/master/containers/crlite-generate

This run-to-completion Kubernetes cronjob uses several tools to construct a CRLite filter, and publish it, ultimately to Remote Settings.

A container, crlite-rebuild

https://github.com/mozilla/crlite/tree/master/containers/crlite-rebuild

This run-to-completion Kubernetes job is used when the Memorystore/Redis cache is invalid in some way. It reads all unexpired entries from the Google Firestore and rebuilds the Memorystore data.
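
The rebuild step can be pictured with a small in-memory sketch. This is not the crlite-rebuild code: the record fields, the `rebuild_cache` name, and the cache key layout are all assumptions made for illustration, and plain Python lists and dicts stand in for Firestore and Memorystore/Redis.

```python
from collections import defaultdict
from datetime import date


def rebuild_cache(records, today):
    """Rebuild a serial-number cache from bulk certificate records.

    `records` stands in for the Firestore contents and the returned dict
    stands in for the Redis cache. Both shapes are hypothetical.
    """
    cache = defaultdict(set)
    for rec in records:
        if rec["expiration"] < today:
            continue  # skip expired certificates
        key = (rec["expiration"].isoformat(), rec["issuer"])
        cache[key].add(rec["serial"])
    return dict(cache)


# Made-up sample data.
records = [
    {"expiration": date(2020, 3, 1), "issuer": "issuerA", "serial": "01"},
    {"expiration": date(2020, 3, 1), "issuer": "issuerA", "serial": "02"},
    {"expiration": date(2019, 1, 1), "issuer": "issuerB", "serial": "03"},
]
cache = rebuild_cache(records, today=date(2020, 1, 1))
print(cache)
```

Because the cache is derived entirely from the bulk store, discarding it and running the rebuild again is safe: the same input always yields the same result.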

Google Stackdriver

Metrics are published to Stackdriver for overall system health, as are logs. Errors and warnings are generally of two categories:

  1. Problems with infrastructure performance, which are still being addressed via adjustments to how operations are performed
  2. Problems with the WebPKI, which might well be used by the Mozilla CA Root Program for enforcement

Environments

As of this writing, jcj is still actively developing CRLite, and needs the full dataset for development. So if a magic wand were used today to make the deployments, I would need a stage environment to work in as a sandbox -- which could certainly be the existing environment I am using.

NOTE: The prod environment would probably want to start from a clone of my current environment's Firestore data, as it takes multiple calendar-months to synchronize CT data from the original sources.

(Supersedes bug 1429802)

Is https://github.com/mozilla/crlite a private repo on purpose? (I get a 404)

As CRLite is a very important service, I think the code should be open-source and released before a production environment is set up.

It is. We have to do a scrub and potentially a reinitialization. That said, the vast majority of the code is actually in https://github.com/jcjones/ct-mapreduce; what's in crlite is the Kubernetes mechanisms and the filter generator that matches https://pypi.org/project/filtercascade/ and https://github.com/mozilla/rust-cascade.

I will definitely get it released. I completely agree with opening it up; I just have to ensure it's clean and then get a review pass.

Assignee: nobody → sven
Component: Operations → Operations:CRLite
QA Contact: sven
Depends on: 1647841
Status: NEW → ASSIGNED
Depends on: 1665107
Depends on: 1672418
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED