When the code-hosting platform GitHub launched in 2008, it provided developers everywhere with a way to store, share and manage all the code needed to collaborate on software projects regardless of their team’s geographic location. However, source code isn’t the only essential facet of building a functional program, especially in the machine learning space.
The data that trains machine learning systems is vital to ensuring those systems behave as intended. Though GitHub has been helping manage code for over a decade, a similar solution for managing data recently emerged from stealth. XetHub is a data management platform that serves hybrid and remote development teams with a place to store their data alongside code in the same Git repository.
“XetHub is a data storage and data management platform that enables teams to treat their data like they treated their code for the last 15 years,” Rajat Arya, XetHub co-founder, told Built In. “Put another way, we scale source control to hold your data as well. Or even simpler: we’re like GitHub, but for data.”
Launched out of stealth in January, XetHub’s platform lets engineers work with growing data files while using tools they’re already familiar with. It also makes it easier for developers to see how data has changed at any point during a project. Its solution helps individual engineers trace exactly how they achieved certain results and subsequently reproduce those outcomes.
“With XetHub, your data, your files and your code can live together in the same system, and this ensures reproducibility,” Arya said. “If I’ve put all of my code in one place and all of my data in another place, someone might go change the data and now the code stops working, or someone might change the code and now the data is incompatible with the code and that incompatibility damages your reproducibility.”
XetHub provides visibility into why elements of a project were changed and who’s responsible for those changes. This lets distributed teams coordinate their builds more efficiently than they could by relying on disparate communication methods like Slack or internal wiki pages. Organizations can use XetHub to keep track of all their data and keep up with data compliance laws.
XetHub is the brainchild of three co-founders, each with extensive tech backgrounds. Arya spent most of his career in software development before founding XetHub alongside Yucheng Low, the company’s CEO, and Ajit Banerjee.
An expert in working with distributed systems, Arya has worked at Microsoft, AWS and a startup called AddThis. Arya also worked at Low’s machine learning startup Turi before it was acquired by Apple in 2016. Banerjee, with experience as an entrepreneur and working at tech companies like Amazon, joined the crew sometime after Turi’s acquisition.
“For machine learning teams prior to our arrival [at Apple], each [team] had to spin up their machine learning infrastructure on their own,” Arya said. “Our charter was to build a platform that all of Apple could leverage for machine learning. … Yucheng and I got to build a service together and that service took off like wildfire, and today it’s the most widely used service in the ML platform at Apple. That service was about data management.”
When the trio left Apple in 2021, they built a data management solution of their own and founded XetHub in February 2022. The company has since grown steadily. XetHub is currently in public beta, where users can sign up for free and get 20 gigabytes of storage on the platform, according to Arya.
The company also recently announced a $7.5 million seed funding round led by Madrona to further develop its platform. Having recently added Windows to its list of compatible operating systems alongside macOS and Linux, XetHub has a range of product developments in the works with its fresh capital.
In the coming months, XetHub plans to increase its focus on enabling large data collaboration. The company already offers solutions for automating data summarization and custom visualization but plans to incorporate a tool for understanding data drift. XetHub also plans to hire new teammates in engineering and product design.
A major issue within the ML space is reproducibility, a conundrum that also affects other branches of the tech world. While XetHub’s initial focus has been on the machine learning sector, Arya foresees the company’s product branching out to serve other industries including hardware development, biotechnology and video gaming.
“One of the challenges machine learning teams face is they need to be able to reproduce all of their results, but they also need to be able to easily experiment. And these are requirements that are in conflict with each other,” Arya said. “I believe that what we’re building with XetHub can enable technical teams to experiment fearlessly and work and innovate quickly while still maintaining full history of what they’ve done.”