github.com/filecoin-project/bacalhau@v0.3.23-0.20230228154132-45c989550ace/ROADMAP.md

# Bacalhau Master Plan Roadmap

## MAY

1. *(basic)* Build a system for unreliably running a single deterministic program where a single 10MB piece of data is on IPFS, assuming everyone participating is trustworthy and that there are only 10 nodes in the network. *Example:* Run cloud detection on a single Landsat image file. Get the result back. Verify it by eye.
    1. **Status:** Complete.

## JUNE

1. ***(reliability)* Extend that system to work 99% of the time. Submit 10,000 jobs and show that at most 100 of them fail. It might take several minutes to resolve each job.**
    1. **Example:** By the end of this phase, job execution will be significantly more reliable. We’ll generally be able to submit 50 jobs and have them all succeed.
    2. **Status:** Final benchmarking cluster in flight.
2. *(scale-1)* Extend that system to work when 100 nodes participate and have access to 100TB of data. The error rate might still be high.
    1. **Example:** 90 more nodes join the network and at first things break, but by the end of the milestone the network is working well again, albeit slowly and with some error rate.

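
The reliability target above is simple arithmetic: 100 failures out of 10,000 jobs is a 1% failure rate. A minimal sketch of such a check, assuming a hypothetical list of per-job outcomes (the function name and input shape are illustrative and not part of Bacalhau):

```python
def meets_reliability_target(outcomes, max_failures_per_100=1):
    """Return True if the share of failed jobs is within the target.

    `outcomes` is a hypothetical list of booleans, one per submitted job
    (True means the job succeeded). The default of 1 failure per 100 jobs
    mirrors "at most 100 of 10,000 jobs fail".
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("no jobs were submitted")
    failures = sum(1 for ok in outcomes if not ok)
    # Integer comparison avoids floating-point rounding at the boundary.
    return failures * 100 <= max_failures_per_100 * total
```

With 10,000 outcomes, exactly 100 failures passes the check and 101 does not.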
## JULY

1. *(multi-file)* Extend that system to work when jobs consist of many (thousands of) files, rather than a single file, and we want to distribute the work across the network and run it in parallel.
    1. **Example:** A user can submit a cloud detection job on 10,000 Landsat images at once and have the work parallelised automatically on the network, still according to data locality where possible.
2. *(scale-2)* Extend the system to work when 1000 nodes are participating, over 1PB of data. Resolving jobs may now take a very long time (tens of minutes).
    1. **Example:** Many users can process Landsat data in parallel, alongside use cases on public biomedical images and 9 other use cases, without the network failing.

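
To make "parallelised according to data locality where possible" concrete, here is a minimal sketch of splitting a multi-file job into per-node batches. It assumes a hypothetical mapping from file ID to the nodes that already hold that file. The function name, the input shapes, and the least-loaded tie-breaking rule are all illustrative assumptions, not Bacalhau's scheduler.

```python
def shard_by_locality(file_locations, nodes):
    """Assign each file to one node, preferring nodes that already hold it.

    file_locations: dict mapping file id -> list of node ids holding the file
    nodes: list of all node ids available for work
    Returns a dict mapping every node id -> list of file ids assigned to it.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    assignment = {node: [] for node in nodes}
    for file_id in sorted(file_locations):
        # Prefer nodes that already store the file (data locality).
        holders = [n for n in file_locations[file_id] if n in assignment]
        # Fall back to any node when no available node holds the file.
        candidates = holders or nodes
        # Among the candidates, pick the one with the least work so far.
        target = min(candidates, key=lambda n: len(assignment[n]))
        assignment[target].append(file_id)
    return assignment
```

Each node can then process its batch independently, which is what lets the job run in parallel.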
## AUGUST

1. *(performance-1)* Get the resolution of jobs down to seconds, even in large networks where thousands of nodes are participating with hundreds of job submissions per second.
    1. **Example:** The network is dealing with a multitude of use cases (Landsat, biomedical, SETI@home and protein folding have migrated over to Bacalhau, etc.) and is processing hundreds of job executions per second, so it has started to slow down a lot. This phase is all about getting it speedy again.
2. *(filecoin)* Add support for reading datasets from Filecoin so that data in that network becomes accessible to IPCS workloads.
    1. **Example:** A big data provider has put petabytes of public data onto Filecoin. Bacalhau users can consume it by attaching a Filecoin wallet to their Bacalhau node and giving it a spending limit.

## SEPTEMBER

1. *(byzantine-1)* Extend that system to work when up to 10% of the nodes are malicious.
    1. **Example:** Even when a small minority of nodes are trying to mess up the results, a user can still run cloud detection on 10,000 files in IPFS with no errors or incorrect results.

1. *(dag)* Extend that system to support jobs that are described in terms of pipelines: the output of one job feeding into the input of the next.
    1. **Example:** Cloud removal in the Landsat job is actually a pipeline which first detects images with clouds and then, only for those images, forwards them to a pipeline which removes the clouds.

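
The pipeline idea in the *(dag)* item can be sketched as a small dependency graph in which each job receives the outputs of its upstream jobs. This is an illustration only: `run_dag`, the job functions, and the `cloudy` field are hypothetical stand-ins, and the ordering uses Python's standard `graphlib` module (Python 3.9+) rather than anything from Bacalhau.

```python
from graphlib import TopologicalSorter


def run_dag(jobs, deps, source):
    """Run jobs in dependency order.

    jobs: dict mapping job name -> function taking a list of upstream outputs
    deps: dict mapping job name -> list of upstream job names
    source: initial input handed to jobs that have no upstream job
    Returns a dict mapping job name -> that job's output.
    """
    results = {}
    # static_order() yields each job only after all of its upstream jobs.
    for name in TopologicalSorter(deps).static_order():
        upstream = deps.get(name, [])
        inputs = [results[u] for u in upstream] if upstream else [source]
        results[name] = jobs[name](inputs)
    return results


def detect_clouds(inputs):
    # Stand-in detector: keep only the images flagged as cloudy.
    (images,) = inputs
    return [img for img in images if img["cloudy"]]


def remove_clouds(inputs):
    # Stand-in removal step: mark each forwarded image as cloud-free.
    (cloudy_images,) = inputs
    return [{**img, "cloudy": False} for img in cloudy_images]
```

Running it with `deps = {"detect": [], "remove": ["detect"]}` forwards only the cloudy images to the removal step, matching the Landsat example.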
## OCTOBER (TBD)

1. *(byzantine-2)* Extend that system to work when up to 33% of the nodes are malicious.
    1. **Example:** A larger attack happens on the network (>10%, <33%). Before this phase, this attack would bring down the network. After this phase, the network would carry on operating (although potentially degraded, with higher latencies, etc.).
1. *(nondeterminism)* Extend that system to work with execution runtimes that are non-deterministic, e.g. arbitrary user-provided container images, to support workloads such as ML training. In particular, this would prove out that the system is pluggable in terms of verification strategies, which lays the groundwork for future support for other strategies in the triad of trustless compute, such as cryptographic verifiability, secrecy and optimistic verifiability.
    1. **Example:** Nondeterministic workloads, or ones that can’t be expressed as deterministic WASM binaries, can now be run on the network, for example training ML models.

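
For deterministic jobs, one generic way to tolerate a malicious minority is to run the same job on several nodes and accept only a result that a strict majority agrees on. The sketch below illustrates that idea under stated assumptions: `majority_result` and the string result digests are hypothetical, and this is not necessarily the verification strategy Bacalhau uses.

```python
from collections import Counter


def majority_result(results):
    """Return the result reported by a strict majority of nodes, else None.

    results: list of hashable result digests, one per node that ran the job
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # Require more than half of the reports to agree before trusting a result.
    return value if count * 2 > len(results) else None
```

This kind of comparison only works when honest nodes produce identical outputs, which is one reason nondeterministic workloads would need a different, pluggable verification strategy.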
To be continued in [master plan part two](https://hackmd.io/i-UdANDVSwycXtVacIPgEg).