Apache Hudi

Data Infrastructure and Analytics

San Francisco, CA 8,692 followers

Open source pioneer of the lakehouse, reimagining batch processing with an incremental framework for low-latency analytics

About us

Open source pioneer of the lakehouse, reimagining old-school batch processing with a powerful new incremental framework for low-latency analytics. Hudi brings database and data warehouse capabilities to the data lake, making it possible to create a unified data lakehouse for ETL, analytics, AI/ML, and more. Apache Hudi is battle-tested at scale, powering some of the largest data lakes on the planet. It provides an open foundation that seamlessly connects to other popular open source tools such as Spark, Presto, Trino, Flink, Hive, and many more. Being an open source table format is not enough; Apache Hudi is also a comprehensive platform of the open services and tools needed to operate your data lakehouse in production at scale. Most importantly, Apache Hudi is a community built by a diverse group of engineers from around the globe! Hudi is a friendly and inviting open source community that is growing every day. Join the community on GitHub: https://github.com/apache/hudi or find links to the mailing lists and Slack channels on the Hudi website: https://hudi.apache.org/

Website
https://hudi.apache.org/
Industry
Data Infrastructure and Analytics
Company size
201-500 employees
Headquarters
San Francisco, CA
Type
Nonprofit
Founded
2016
Specialties
ApacheHudi, DataEngineering, ApacheSpark, ApacheFlink, TrinoDB, Presto, DataAnalytics, DataLakehouse, AWS, GCP, Azure, ChangeDataCapture, and StreamProcessing


Updates

  • Apache Hudi

    How can you quickly build analytical apps (in Python) using Apache Hudi? Here is a simple example that shows how easily and quickly one can build analytical apps using data directly from open lakehouse platforms like Hudi. This is made possible by "Hudi-rs", a native Rust library with Python bindings. You can use pure #Python with libraries like Streamlit, without needing to set up Spark, Java, or additional dependencies. This means:
    ❌ Fewer data copies - to serve such use cases on the data lake today, teams often export a subset of data from a proprietary warehouse and make copies on top of copies.
    ❌ Fewer data hops (ETL), saving time and leading to fresher insights.
    ✅ Analysts and scientists get quicker (and broader) access to data for better insights and more robust ML models - the wait time for these stakeholders is usually quite long (see the sketch below).
    👨🏻💻 Code example: https://lnkd.in/dmbiU4F3
    📙 Blog: https://lnkd.in/gc6XPxEh
    🌟 Hudi-rs project: https://lnkd.in/dyvyd4cK
    #dataengineering #softwareengineering
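
    For illustration only, here is a minimal sketch of what such an app can look like in pure Python. It assumes the hudi PyPI package (the Python bindings of Hudi-rs) exposes a HudiTable with a read_snapshot() method returning Arrow record batches, as in the project's early releases; the table path, data, and Streamlit layout are placeholders rather than anything from the linked example.

    import pyarrow as pa
    import streamlit as st
    from hudi import HudiTable  # pip install hudi streamlit pyarrow

    # Point at a Hudi table's base path (local path or cloud storage URI).
    # "file:///tmp/trips_table" is a placeholder for this sketch.
    table = HudiTable("file:///tmp/trips_table")

    # Read the latest snapshot as Arrow record batches -- no Spark or JVM needed.
    batches = table.read_snapshot()
    df = pa.Table.from_batches(batches).to_pandas()

    # A tiny Streamlit app on top of the snapshot.
    st.title("Analytics on a Hudi table")
    st.write(f"{len(df)} records in the latest snapshot")
    st.dataframe(df.head(100))

    Run it with `streamlit run app.py` against any Hudi table you have access to.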

  • Apache Hudi reposted this

    Soumil S.

    Sr. Software Engineer at Zeta Global (NYSE: ZETA) | Big Data & AWS Expert | Apache Hudi Specialist | Spark & AWS Glue Professional | YouTuber

    I had a fantastic Saturday working on a proof of concept with Apache Hudi Streamer! 🎉 I streamed data into Hudi using its Streamer feature and leveraged the Hudi extension to build metadata for Iceberg, Delta, and Hudi. This allowed me to sync data seamlessly with the Glue Hive Metastore. I then integrated this with Snowflake using the ext_volume (external volume) and created an Iceberg external table in Snowflake backed by the Glue catalog integration. The result? I was able to query the same data across multiple engines: Athena, Spark, Snowflake, and Redshift Spectrum via Glue Catalog Mount. And the best part? I didn't need to create multiple copies of the data! Thanks to Apache Hudi, data ingestion was incredibly fast, and the Hudi Streamer extension made syncing after each commit effortless. The multi-modal indexing also enabled faster upserts and efficient indexing. Saturday well spent, and I had a blast! 🚀
    #ApacheHudi Apache XTable (Incubating) #DataEngineering #BigData #DataLake #Snowflake #Athena #Spark #Redshift #Glue Onehouse #DataIntegration

  • Apache Hudi

    What is the layout of a Hudi table on the file system?
    ✅ Hudi organizes tables within a specific directory hierarchy, located under a base path on a distributed file storage system.
    ✅ Tables are divided into distinct partitions.
    ✅ Inside each partition, files are grouped into file groups, each marked by a unique file ID.
    ✅ These file groups are further broken down into multiple file slices.
    ✅ File slices are composed of base files (#parquet/orc), created at a specific commit/compaction moment, and log files. The log files hold the changes applied to the base file after its creation. A small sketch that walks this layout follows below.
    Read more here: https://lnkd.in/gwirNtVr
    ⭐️ Hudi GitHub repo: https://lnkd.in/gmM8yjVS
    💬 Hudi Slack: https://lnkd.in/gZSZzdX4
    #dataengineering #softwareengineering
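
    To make the hierarchy concrete, here is a small, hedged Python sketch (not from the linked docs) that walks a table's base path and groups data files into file groups by file ID. It assumes the common naming convention where base files look like <fileId>_<writeToken>_<instantTime>.parquet and log files look like .<fileId>_<baseInstant>.log.<version>_<writeToken>; the path is a placeholder.

    import os
    from collections import defaultdict

    def describe_hudi_layout(base_path):
        """Print partitions, file groups, and base/log file counts for a Hudi table."""
        for partition, _, files in os.walk(base_path):
            if ".hoodie" in partition:  # skip table metadata under the base path
                continue
            groups = defaultdict(lambda: {"base": 0, "log": 0})
            for name in files:
                if name.endswith(".parquet"):
                    groups[name.split("_")[0]]["base"] += 1       # base file of a file slice
                elif ".log." in name:
                    groups[name.lstrip(".").split("_")[0]]["log"] += 1  # log file with changes
            for file_id, counts in groups.items():
                rel = os.path.relpath(partition, base_path)
                print(f"partition={rel} file_group={file_id} "
                      f"base_files={counts['base']} log_files={counts['log']}")

    describe_hudi_layout("/tmp/hudi/trips")  # placeholder base path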

  • Apache Hudi

    In this talk, Peloton's Data Platform Team will share their experience in building a Hudi-powered data lake tailored for analytics, using change data capture from relational databases. They'll delve into how they implemented an auto-heal pipeline to ensure data quality, along with additional pipelines to maintain data freshness. The session will also address the challenges they faced and the innovative solutions they developed for seamlessly propagating data downstream to support machine learning use cases.

    Modernizing Data Infrastructure at Peloton using Apache Hudi


  • Apache Hudi

    Metica achieved a ~6x reduction in storage and a ~2x performance improvement by using Apache Hudi in their data platform. This was all made possible by the Clustering table service in Hudi.
    ✅ Storage costs: They initially started with vanilla #Parquet files (without Hudi), which led to too many small files. Hudi's ability to easily tune clustering parameters to pack Parquet files up to a max size helped here. Result -> 6x storage cost reduction.
    ✅ Read performance: They enabled clustering with sorting on columns that are frequently used as query predicates. Result -> at least a 2x improvement in read performance. A hedged config sketch follows below.
    Hear more about Metica's data architecture & Hudi journey in this talk by Subash P.
    Link: https://lnkd.in/dYaxZNYC
    #dataengineering #softwareengineering #lakehouse
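
    As a rough illustration of the kind of setup involved (not Metica's actual configuration), here is a hedged PySpark sketch using Hudi's inline clustering configs to pack small files up to a target size and sort by query-predicate columns; the table name, columns, sizes, and paths are placeholders.

    from pyspark.sql import SparkSession

    # Assumes a Spark session with the Hudi Spark bundle on the classpath.
    spark = SparkSession.builder.appName("hudi-clustering-sketch").getOrCreate()

    df = spark.read.parquet("/tmp/raw_events")  # placeholder input data

    hudi_options = {
        "hoodie.table.name": "events",
        "hoodie.datasource.write.recordkey.field": "event_id",  # illustrative columns
        "hoodie.datasource.write.precombine.field": "ts",
        # Inline clustering: periodically rewrite small files into larger, sorted ones.
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": "4",
        # Files under ~300 MB are candidates; pack them into files up to ~1 GB.
        "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
        # Sort by columns frequently used as query predicates to speed up reads.
        "hoodie.clustering.plan.strategy.sort.columns": "event_type,ts",
    }

    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("/tmp/hudi/events"))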

  • Apache Hudi reposted this

    Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Small files = slow queries in the lakehouse. Addressing the small file issue is critical for optimizing query performance on data lakes. The problem occurs when writing data in small chunks. For example, stream processing engines like #ApacheFlink ingest continuous data streams into data lake table formats like Apache Hudi.
    ⭐️ This involves writing continuous, high-frequency small batches of data to the data lake, leading to the creation of many small files if not managed properly.
    ⭐️ The presence of many small files can hurt read performance, because there is a cost to open, read, and close each file, and this is generally less efficient than reading a smaller number of larger files.
    So if you don't size files appropriately, you can slow down both queries and pipelines! One of the key design principles of Hudi is the prevention of small file generation, ensuring it automatically writes files of optimal size. It takes care of file sizing in two ways (see the config sketch below):
    ✅ Automatic file sizing: During data ingestion, Hudi automatically adjusts file sizes.
    ✅ Post-write clustering: Hudi can consolidate small files into larger ones after the data has been written (during clustering).
    Benefits?
    - Optimized query performance: Hudi reduces the need to scan through numerous small files, improving query speed and efficiency.
    - Enhanced pipeline efficiency: By managing file sizes effectively, Hudi decreases scheduling overhead & memory requirements in Spark/Flink jobs.
    - Improved storage utilization: Without a proactive file management strategy, storage costs can increase.
    Detailed link in comments.
    #dataengineering #softwareengineering
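
    A hedged sketch of the first lever, automatic file sizing at write time (config key names per the Hudi write configs; columns, sizes, and paths are placeholders, and defaults vary by version):

    from pyspark.sql import SparkSession

    # Assumes a Spark session with the Hudi Spark bundle on the classpath.
    spark = SparkSession.builder.appName("hudi-file-sizing-sketch").getOrCreate()
    df = spark.read.parquet("/tmp/incoming_batch")  # placeholder micro-batch

    file_sizing_options = {
        "hoodie.table.name": "events",
        "hoodie.datasource.write.recordkey.field": "event_id",
        "hoodie.datasource.write.precombine.field": "ts",
        # Base files below ~100 MB are considered "small"; new inserts are routed
        # into them until they approach the ~120 MB target max size.
        "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
        "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    }

    (df.write.format("hudi")
        .options(**file_sizing_options)
        .mode("append")
        .save("/tmp/hudi/events"))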

  • Apache Hudi reposted this

    Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Indexing in lakehouse table formats. Let's touch on one of the capabilities that is unique to Apache Hudi: indexes. One of the main design principles of Hudi is faster UPSERTs on the #datalake. To achieve this, we need a sense of where a particular record lives. If we don't have one, we have to scan through all the records and then make updates, which can be extremely expensive (especially at scale). That's where indexing comes in: it helps locate the records. How Hudi and its index work:
    ✅ The very first time a record comes into Hudi, it is assigned to a file group (a logical grouping of files), and that assignment never changes.
    ✅ Hudi maintains a mapping between an incoming record's key and its file group.
    ✅ The index is responsible for finding the record based on this file-group mapping.
    Therefore, any indexing technique can tell whether a record exists based on this mapping. If it exists, the index tells exactly which file group it belongs to, and that is how we achieve faster UPSERTs.
    Takeaways:
    👉 This design enables Hudi to limit the number of records that must be merged with each individual base file.
    👉 So a particular base file only needs to be merged with updates related to the records it contains.
    👉 Table formats lacking an indexing component may require merging all base files with every incoming update or deletion record, which is much less efficient.
    Here is a simple illustration that explains the total merge cost for four (4) 100 MB #Parquet files with and without indexing; a hedged upsert sketch follows below.
    Detailed reading in comments.
    #dataengineering #softwareengineering
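
    A hedged sketch of what an index-assisted upsert looks like from the write path (table name, key/precombine columns, and paths are placeholders; BLOOM is just one of several index types Hudi supports):

    from pyspark.sql import SparkSession

    # Assumes a Spark session with the Hudi Spark bundle on the classpath.
    spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()
    updates = spark.read.parquet("/tmp/incoming_updates")  # placeholder update batch

    upsert_options = {
        "hoodie.table.name": "trips",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "trip_id",
        "hoodie.datasource.write.precombine.field": "ts",
        # The index maps record keys to file groups, so only affected file groups
        # are merged. With four 100 MB base files, an update whose keys all live in
        # one file group merges ~100 MB instead of the ~400 MB a format without an
        # index may have to rewrite.
        "hoodie.index.type": "BLOOM",
    }

    (updates.write.format("hudi")
        .options(**upsert_options)
        .mode("append")
        .save("/tmp/hudi/trips"))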

  • Apache Hudi

    Catch Peloton's Data Platform Team at the upcoming Hudi Community Sync. Learn:
    ✅ how Peloton built a Hudi-powered data lake optimized for analytics, using change data capture (CDC) from relational databases.
    ✅ about their implementation of an auto-healing pipeline to ensure data quality, as well as additional pipelines to maintain data freshness.
    ✅ about the lessons learned & the creative solutions they developed to seamlessly propagate data downstream, supporting machine learning use cases.
    🗓️ 12th September, 9 AM PT | 12 PM ET
    🔗 https://lnkd.in/ejNBhwtC
    #dataengineering #softwareengineering

  • Apache Hudi

    Join us on September 12th, 2024, for an insightful Hudi Community Sync where Peloton's data platform team will dive into their journey of modernizing their data infrastructure using Apache Hudi. In this session, we'll explore how Peloton built a robust Hudi lakehouse for analytics by using change data capture (CDC) from relational databases. The talk will also highlight the implementation of an auto-heal pipeline to boost data quality, the additional pipelines that ensure data freshness, and the challenges of propagating this data downstream for machine learning applications.
    🗓️ Date/Time: September 12th, 9 AM PT | 12 PM ET
    🔗 Link: https://lnkd.in/ejNBhwtC
    #DataEngineering #Lakehouse

    Modernizing Data Infrastructure at Peloton using Apache Hudi
