Filecoin RetroPGF
Round 2 / JZFS
Category: Tooling
Drips Project: GitDataAI/jzfs
GitHub URL: https://github.com/GitDataAI/jzfs
Funding amount: 0 FIL
Data sources fall into three categories: structured, semi-structured, and unstructured. Unstructured data account for approximately 80% of all global data, whereas structured data account for only 20%.
As models have become more sophisticated and push-button, AI teams have realized that iterating on data is at least as important as iterating on models for developing and deploying high-accuracy models successfully and efficiently. ML models have become increasingly complicated and opaque in recent years, and they require significantly larger amounts of training data. In addition, data have evolved into a useful interface for working with subject matter experts and turning their expertise into software. Finally, data-centric AI enables a higher level of model accuracy than was previously feasible using only model-centric techniques.
Datasets are dynamic. New files and new versions of existing files enter a dataset at the ingestion stage. Additionally, extractors can evolve over time and generate new versions of raw data. As a result, dataset versioning is a cross-cutting concern across all stages of a dataset's life cycle. Vanilla distributed file systems are not adequate for versioning-related operations. For example, simply storing every version in full may be too costly for large datasets, and without a good version manager, using filenames alone to track versions is error-prone. Because a dataset usually has many users, it is even more important to clearly track which versions each user is working with and how those versions evolve. Furthermore, as the number of versions increases, efficient and cost-effective storage and retrieval of versions becomes an important feature of a successful dataset system.
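The cost and naming problems described above can be illustrated with a minimal sketch of content-addressed versioning. This is not JZFS's implementation or API: the `VersionStore` class and its `commit` and `read` methods are hypothetical names chosen for illustration, and the store lives entirely in memory. The idea is that each file is stored under the hash of its contents, so an unchanged file is stored only once across versions, and a version is just a mapping from paths to hashes rather than a set of renamed files.

```python
import hashlib


class VersionStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blobs = {}     # content hash -> file bytes
        self.versions = []  # one snapshot per version: {path: content hash}

    def commit(self, files):
        """Record a snapshot of {path: bytes} and return its version number."""
        snapshot = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            # Identical contents hash to the same key, so they are stored once.
            self.blobs.setdefault(digest, data)
            snapshot[path] = digest
        self.versions.append(snapshot)
        return len(self.versions) - 1

    def read(self, version, path):
        """Return the bytes of `path` as they were in `version`."""
        return self.blobs[self.versions[version][path]]


store = VersionStore()
v0 = store.commit({"a.csv": b"1,2,3", "b.csv": b"x,y"})
v1 = store.commit({"a.csv": b"1,2,3,4", "b.csv": b"x,y"})  # only a.csv changed
```

After these two commits the store holds three blobs rather than four, because `b.csv` is unchanged between versions, and both versions of `a.csv` remain readable by version number instead of by filename convention.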
JZFS was created to address these problems.
JZFS is an industry-leading Data-Centric Version Control File System that helps ensure Responsible AI Engineering by improving Data Versioning, Provenance, and Reproducibility.
In production systems with machine learning components, updates and experiments are frequent. New updates to models (data products) may be released every day or every few minutes, and different users may see the results of different models as part of A/B experiments or canary releases.
- Version Everything: Data scientists are often criticized for being less disciplined about versioning their experiments (data, pipeline, code, and models), especially when using computational notebooks.
- Track Data Provenance: This applies to all processing steps in an AI/ML pipeline, including data collection/acquisition, data merging, data cleaning, feature extraction, learning, or deployment.
- Reproducibility: A final question in AI/ML, often relevant for debugging, audits, and science more broadly, is to what degree data, models, and decisions can be reproduced.
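The provenance and reproducibility points above can be sketched in a few lines. This is an illustrative assumption, not how JZFS records provenance: `fingerprint`, `run_step`, and `drop_negatives` are hypothetical helpers, and the data are small JSON-serializable lists. Each pipeline step records a hash of its input, its parameters, and a hash of its output, so a later rerun can be checked against the recorded output hash.

```python
import hashlib
import json


def fingerprint(obj):
    """Return a stable SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(name, func, inputs, params):
    """Run one pipeline step and return (output, provenance record)."""
    output = func(inputs, **params)
    record = {
        "step": name,
        "input_hash": fingerprint(inputs),
        "params": params,
        "output_hash": fingerprint(output),
    }
    return output, record


def drop_negatives(rows, threshold=0):
    """Example cleaning step (hypothetical): keep values >= threshold."""
    return [r for r in rows if r >= threshold]


raw = [3, -1, 4, -5, 9]
cleaned, record = run_step("clean", drop_negatives, raw, {"threshold": 0})

# Reproducibility check: the same input and parameters should yield
# the same output hash as the one recorded the first time.
_, rerun_record = run_step("clean", drop_negatives, raw, {"threshold": 0})
reproducible = rerun_record["output_hash"] == record["output_hash"]
```

Hashing a canonical JSON encoding only works for small, deterministic, serializable data; a real system would hash file contents and also pin the code and environment versions.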