How Assurance Builds Data Pipelines That Last

Assurance’s platform is built to crunch an overwhelming amount of data. Here’s how the team prepares for continued growth (and yes, more data). 

Written by Adrienne Teeley
Published on Jul. 28, 2021

When a company’s CEO is a former data scientist, you know it has to have top-notch data pipelines, hygiene and transparency. 

At least, that’s how things are at Assurance, according to Senior Data Engineer Andrew Kellberg. Founded in 2016, the company is out to improve the personal insurance industry through its platform, which blends its team’s insurance knowledge with powerful technology that helps users find coverage that uniquely fits their needs and budgets. 

At the heart of the platform’s success is its ability to crunch data. A lot of it. Because the company has grown significantly over the last five years, including an acquisition by Prudential, scalability and iteration have been driving forces for the data engineering team. Using tools like AWS, Apache Spark and Databricks Delta, Kellberg said, his team has been able to accommodate the influx of data and continue powering Assurance’s ability to provide custom recommendations to users. 

“This scalability has made it possible for pipeline jobs created over two years ago to continue working with today’s volume of data, needing only minimal configuration changes to increase hardware or system resources,” Kellberg said.

To learn more about the data tools and processes Assurance leans on, Built In Seattle connected with Kellberg. He walked us through the steps his team takes in order to impress their users (and their CEO). 

 

Andrew Kellberg
Senior Data Engineer • Assurance

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies specifically?

On the data engineering team, the core technology we use in our data pipelines is Apache Spark. We selected Spark because it enables us to quickly create scalable jobs in our pipelines. Spark abstracts away most of the parallel processing details needed for scalability, which allows us to spend our time focusing on the data and business logic. This scalability has made it possible for pipeline jobs created over two years ago to continue working with today’s volume of data, needing only minimal configuration changes to increase hardware or system resources. 
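
As a rough illustration of the pattern Kellberg describes, here is a minimal sketch of a config-driven PySpark job. The job name, S3 paths, columns and resource values are placeholders for illustration, not Assurance’s actual pipeline code.

```python
# Minimal sketch of a config-driven PySpark job (illustrative names and paths,
# not Assurance's actual pipeline code).
from pyspark.sql import SparkSession, functions as F

# Resource settings are plain configuration, so handling a larger data volume
# typically means raising these values rather than rewriting the job.
spark = (
    SparkSession.builder
    .appName("daily-policy-events")              # hypothetical job name
    .config("spark.executor.instances", "8")     # scale up as data grows
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

# Spark handles partitioning and parallel execution behind the scenes;
# the job itself only expresses the data and business logic.
events = spark.read.parquet("s3://example-bucket/events/")   # hypothetical input
daily = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "product")
    .agg(F.count("*").alias("event_count"))
)
daily.write.mode("overwrite").parquet("s3://example-bucket/reports/daily_events/")
```

Scaling a job like this to more data mostly comes down to adjusting the executor settings, which is the kind of “minimal configuration change” Kellberg points to.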

Another important technology we use in our pipelines is Amazon Web Services (AWS). We chose AWS because we can easily create ephemeral computing resources on-demand that are needed to power our data pipelines. We have hundreds of pipeline jobs that run throughout the day at various times, and maintaining a cluster powerful enough to run everything would require a lot of time and effort. By using AWS, we can create ephemeral resources that require no maintenance and allow jobs to run independently. This hardware is also perfect for working with Spark, as it can be scaled with simple configuration changes.
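
The article doesn’t name the specific AWS services involved, but one common way to get ephemeral, self-terminating Spark compute is to launch an EMR cluster per job with boto3. The sketch below assumes that pattern; the service choice, job names and paths are assumptions for illustration only.

```python
# Hedged sketch: launching an ephemeral EMR cluster for a single Spark job with
# boto3. EMR and every name/path below are assumptions for illustration only.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="daily-policy-events",                 # hypothetical job name
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        # The cluster tears itself down once its steps finish, so there is
        # no long-lived cluster to maintain.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "run-spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/daily_policy_events.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched ephemeral cluster:", response["JobFlowId"])
```

Because each job gets its own short-lived cluster, jobs run independently and the instance types and counts can be tuned per job through configuration alone.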

 

As your company — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

As Assurance has grown and the volume of data has increased, one problem we faced was that our PostgreSQL databases were running into performance and data-size limits. We frequently need to update data after the initial write, so our other standard storage solution, Parquet on S3, was not an option. To solve this problem, we took the step of implementing Databricks Delta tables. Delta tables combine the ACID transactions of PostgreSQL with the scalability of storing Parquet on S3. This change has allowed us to migrate PostgreSQL tables to Delta tables as data volumes grow, without needing to redesign our systems.
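
As a rough illustration of the ACID-update property Kellberg describes, here is a small sketch that upserts changed records into a Delta table stored on S3 using the open-source delta-spark package. The table path, keys and session setup are assumptions, not Assurance’s actual schema or code.

```python
# Hedged sketch: upserting into a Delta table on S3 with delta-spark.
# Paths, column names and keys are hypothetical.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("quotes-upsert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

table_path = "s3://example-bucket/delta/quotes/"                  # hypothetical location
updates = spark.read.parquet("s3://example-bucket/staging/quote_updates/")

# MERGE provides the PostgreSQL-style transactional update, while the data
# itself stays as Parquet files on S3 and scales like any other S3 dataset.
(
    DeltaTable.forPath(spark, table_path).alias("t")
    .merge(updates.alias("u"), "t.quote_id = u.quote_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

The key point is that updates after the initial write, which plain Parquet on S3 doesn’t support, become a single transactional operation on the Delta table.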

Serving data from our data pipelines to the rest of the business is just as important as generating it. As the business scaled, we started running into performance issues serving data. The solution we found for this problem was PrestoDB. With Presto, we can select an appropriate data store for each data set based on its size, shape and access requirements, while still having the ability to combine data from different data stores. Moving to Presto made it possible to seamlessly integrate a new data store, like a Delta table, into our Assurance ecosystem.
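
To show what that federation can look like in practice, here is a hedged sketch of a single Presto query joining a table in one catalog with a table in a PostgreSQL catalog. The catalog, schema and table names are placeholders, and the client library (presto-python-client) is an assumption rather than Assurance’s stated tooling.

```python
# Hedged sketch: one Presto query spanning two data stores.
# Catalog, schema and table names are placeholders; the presto-python-client
# library is an assumption, not something the article specifies.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",    # hypothetical coordinator
    port=8080,
    user="analytics",
    catalog="delta",                   # e.g. a Delta/Hive-backed catalog
    schema="default",
)
cur = conn.cursor()

# Presto lets one query span catalogs, so a Delta-backed table and a
# PostgreSQL table can be combined without copying data between systems.
cur.execute("""
    SELECT q.quote_id, q.premium, a.agent_name
    FROM delta.default.quotes AS q
    JOIN postgresql.public.agents AS a
      ON q.agent_id = a.agent_id
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```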