Apache Hudi

Data Infrastructure and Analytics

San Francisco, CA 8,692 followers

Open source pioneer of the lakehouse, reimagining batch processing with an incremental framework for low-latency analytics

About us

Open source pioneer of the lakehouse, reimagining old-school batch processing with a powerful new incremental framework for low-latency analytics. Hudi brings database and data warehouse capabilities to the data lake, making it possible to create a unified data lakehouse for ETL, analytics, AI/ML, and more. Apache Hudi is battle-tested at scale, powering some of the largest data lakes on the planet. It provides an open foundation that seamlessly connects to other popular open source tools such as Spark, Presto, Trino, Flink, Hive, and many more. Being an open source table format is not enough; Apache Hudi is also a comprehensive platform of the open services and tools needed to operate your data lakehouse in production at scale. Most importantly, Apache Hudi is a community built by a diverse group of engineers from around the globe! Hudi is a friendly and inviting open source community that is growing every day. Join the community on GitHub: https://github.com/apache/hudi or find links to the mailing lists and Slack channels on the Hudi website: https://hudi.apache.org/

Website
https://hudi.apache.org/
Industry
Data Infrastructure and Analytics
Company size
201-500 employees
Headquarters
San Francisco, CA
Type
Nonprofit
Founded
2016
Specialties
ApacheHudi, DataEngineering, ApacheSpark, ApacheFlink, TrinoDB, Presto, DataAnalytics, DataLakehouse, AWS, GCP, Azure, ChangeDataCapture, and StreamProcessing


Updates

  • Apache Hudi

    How can you quickly build analytical apps (in Python) using Apache Hudi? Here is a simple example that shows how easily and quickly one can build analytical apps using data directly from open lakehouse platforms like Hudi. This is made possible by "Hudi-rs", a native Rust library with Python bindings. You can use pure #Python with libraries like Streamlit, without needing to set up Spark, Java, or additional dependencies. This means:
    ❌ Fewer data copies - to serve such use cases on the data lake today, teams often export a subset of data from a proprietary warehouse and make copies on top of copies.
    ❌ Fewer data hops (ETL), saving time and leading to fresher insights.
    ✅ Analysts and scientists get quicker (and broader) access to data for better insights and more robust ML models - the wait time for these stakeholders is usually quite long (see the sketch below).
    👨🏻💻 Code example: https://lnkd.in/dmbiU4F3
    📙 Blog: https://lnkd.in/gc6XPxEh
    🌟 Hudi-rs project: https://lnkd.in/dyvyd4cK
    #dataengineering #softwareengineering
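
    For illustration only, here is a minimal sketch of what such an app can look like in pure Python. It assumes the hudi PyPI package (the Python bindings of Hudi-rs) exposes a HudiTable with a read_snapshot() method returning Arrow record batches, as in the project's early releases; the table path, data, and Streamlit layout are placeholders rather than anything from the linked example.

    import pyarrow as pa
    import streamlit as st
    from hudi import HudiTable  # pip install hudi streamlit pyarrow

    # Point at a Hudi table's base path (local path or cloud storage URI).
    # "file:///tmp/trips_table" is a placeholder for this sketch.
    table = HudiTable("file:///tmp/trips_table")

    # Read the latest snapshot as Arrow record batches -- no Spark or JVM needed.
    batches = table.read_snapshot()
    df = pa.Table.from_batches(batches).to_pandas()

    # A tiny Streamlit app on top of the snapshot.
    st.title("Analytics on a Hudi table")
    st.write(f"{len(df)} records in the latest snapshot")
    st.dataframe(df.head(100))

    Run it with `streamlit run app.py` against any Hudi table you have access to.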

  • Apache Hudi reposted this

    Soumil S.

    Sr. Software Engineer at Zeta Global (NYSE: ZETA) | Big Data & AWS Expert | Apache Hudi Specialist | Spark & AWS Glue Professional | YouTuber

    I had a fantastic Saturday working on a proof of concept with Apache Hudi Streamer! 🎉 I streamed data into Hudi using its Streamer feature and leveraged the Hudi extension to build metadata for Iceberg, Delta, and Hudi. This allowed me to sync data seamlessly with the Glue Hive Metastore. I then integrated this with Snowflake using the ext_volume (external volume) and created an Iceberg external table in Snowflake backed by the Glue catalog integration. The result? I was able to query the same data across multiple engines: Athena, Spark, Snowflake, and Redshift Spectrum via Glue Catalog Mount. And the best part? I didn't need to create multiple copies of the data! Thanks to Apache Hudi, data ingestion was incredibly fast, and the Hudi Streamer extension made syncing after each commit effortless. The multi-modal indexing also enabled faster upserts and efficient indexing. Saturday well spent, and I had a blast! 🚀
    #ApacheHudi Apache XTable (Incubating) #DataEngineering #BigData #DataLake #Snowflake #Athena #Spark #Redshift #Glue Onehouse #DataIntegration

  • Apache Hudi

    What is the layout of a Hudi table on the file system?
    ✅ Hudi organizes tables within a specific directory hierarchy, located under a base path on a distributed file storage system.
    ✅ Tables are divided into distinct partitions.
    ✅ Inside each partition, files are grouped into file groups, each marked by a unique file ID.
    ✅ These file groups are further broken down into multiple file slices.
    ✅ File slices are composed of base files (#parquet/orc), created at a specific commit/compaction moment, and log files. The log files hold the changes applied to the base file after its creation. A small sketch that walks this layout follows below.
    Read more here: https://lnkd.in/gwirNtVr
    ⭐️ Hudi GitHub repo: https://lnkd.in/gmM8yjVS
    💬 Hudi Slack: https://lnkd.in/gZSZzdX4
    #dataengineering #softwareengineering
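
    To make the hierarchy concrete, here is a small, hedged Python sketch (not from the linked docs) that walks a table's base path and groups data files into file groups by file ID. It assumes the common naming convention where base files look like <fileId>_<writeToken>_<instantTime>.parquet and log files look like .<fileId>_<baseInstant>.log.<version>_<writeToken>; the path is a placeholder.

    import os
    from collections import defaultdict

    def describe_hudi_layout(base_path):
        """Print partitions, file groups, and base/log file counts for a Hudi table."""
        for partition, _, files in os.walk(base_path):
            if ".hoodie" in partition:  # skip table metadata under the base path
                continue
            groups = defaultdict(lambda: {"base": 0, "log": 0})
            for name in files:
                if name.endswith(".parquet"):
                    groups[name.split("_")[0]]["base"] += 1       # base file of a file slice
                elif ".log." in name:
                    groups[name.lstrip(".").split("_")[0]]["log"] += 1  # log file with changes
            for file_id, counts in groups.items():
                rel = os.path.relpath(partition, base_path)
                print(f"partition={rel} file_group={file_id} "
                      f"base_files={counts['base']} log_files={counts['log']}")

    describe_hudi_layout("/tmp/hudi/trips")  # placeholder base path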

  • Apache Hudi

    In this talk, Peloton's Data Platform Team will share their experience in building a Hudi-powered data lake tailored for analytics, using change data capture from relational databases. They'll delve into how they implemented an auto-heal pipeline to ensure data quality, along with additional pipelines to maintain data freshness. The session will also address the challenges they faced and the innovative solutions they developed for seamlessly propagating data downstream to support machine learning use cases.

    Modernizing Data Infrastructure at Peloton using Apache Hudi


  • Apache Hudi

    Metica achieved a ~6x reduction in storage and a ~2x performance improvement by using Apache Hudi in their data platform. This was all made possible by the Clustering table service in Hudi.
    ✅ Storage costs: They initially started with vanilla #Parquet files (without Hudi), which led to too many small files. Hudi's ability to easily tune clustering parameters to pack Parquet files up to a max size helped here. Result -> 6x storage cost reduction.
    ✅ Read performance: They enabled clustering with sorting on columns that are frequently used as query predicates. Result -> at least a 2x improvement in read performance. A hedged config sketch follows below.
    Hear more about Metica's data architecture & Hudi journey in this talk by Subash P.
    Link: https://lnkd.in/dYaxZNYC
    #dataengineering #softwareengineering #lakehouse
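
    As a rough illustration of the kind of setup involved (not Metica's actual configuration), here is a hedged PySpark sketch using Hudi's inline clustering configs to pack small files up to a target size and sort by query-predicate columns; the table name, columns, sizes, and paths are placeholders.

    from pyspark.sql import SparkSession

    # Assumes a Spark session with the Hudi Spark bundle on the classpath.
    spark = SparkSession.builder.appName("hudi-clustering-sketch").getOrCreate()

    df = spark.read.parquet("/tmp/raw_events")  # placeholder input data

    hudi_options = {
        "hoodie.table.name": "events",
        "hoodie.datasource.write.recordkey.field": "event_id",  # illustrative columns
        "hoodie.datasource.write.precombine.field": "ts",
        # Inline clustering: periodically rewrite small files into larger, sorted ones.
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": "4",
        # Files under ~300 MB are candidates; pack them into files up to ~1 GB.
        "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
        # Sort by columns frequently used as query predicates to speed up reads.
        "hoodie.clustering.plan.strategy.sort.columns": "event_type,ts",
    }

    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("/tmp/hudi/events"))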

  • Apache Hudi reposted this

    Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Small files = slow queries in the lakehouse. Addressing the small file issue is critical for optimizing query performance on data lakes. The problem occurs when writing data in small chunks. For example, stream processing engines like #ApacheFlink ingest continuous data streams into data lake table formats like Apache Hudi.
    ⭐️ This involves writing continuous, high-frequency small batches of data to the data lake, leading to the creation of many small files if not managed properly.
    ⭐️ The presence of many small files can hurt read performance, because there is a cost to open, read, and close each file, and this is generally less efficient than reading a smaller number of larger files.
    So if you don't size files appropriately, you can slow down both queries and pipelines! One of the key design principles of Hudi is the prevention of small file generation, ensuring it automatically writes files of optimal size. It takes care of file sizing in two ways (see the config sketch below):
    ✅ Automatic file sizing: During data ingestion, Hudi automatically adjusts file sizes.
    ✅ Post-write clustering: Hudi can consolidate small files into larger ones after the data has been written (during clustering).
    Benefits?
    - Optimized query performance: Hudi reduces the need to scan through numerous small files, improving query speed and efficiency.
    - Enhanced pipeline efficiency: By managing file sizes effectively, Hudi decreases scheduling overhead & memory requirements in Spark/Flink jobs.
    - Improved storage utilization: Without a proactive file management strategy, storage costs can increase.
    Detailed link in comments.
    #dataengineering #softwareengineering
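
    A hedged sketch of the first lever, automatic file sizing at write time (config key names per the Hudi write configs; columns, sizes, and paths are placeholders, and defaults vary by version):

    from pyspark.sql import SparkSession

    # Assumes a Spark session with the Hudi Spark bundle on the classpath.
    spark = SparkSession.builder.appName("hudi-file-sizing-sketch").getOrCreate()
    df = spark.read.parquet("/tmp/incoming_batch")  # placeholder micro-batch

    file_sizing_options = {
        "hoodie.table.name": "events",
        "hoodie.datasource.write.recordkey.field": "event_id",
        "hoodie.datasource.write.precombine.field": "ts",
        # Base files below ~100 MB are considered "small"; new inserts are routed
        # into them until they approach the ~120 MB target max size.
        "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
        "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    }

    (df.write.format("hudi")
        .options(**file_sizing_options)
        .mode("append")
        .save("/tmp/hudi/events"))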

  • Apache Hudi reposted this

    Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Indexing in lakehouse table formats. Let's touch on one of the capabilities that is unique to Apache Hudi: indexes. One of the main design principles of Hudi is faster UPSERTs on the #datalake. To achieve this, we need a sense of where a particular record lives. If we don't have one, we have to scan through all the records and then make updates, which can be extremely expensive (especially at scale). That's where indexing comes in: it helps locate the records. How Hudi and its index work:
    ✅ The very first time a record comes into Hudi, it is assigned to a file group (a logical grouping of files), and that assignment never changes.
    ✅ Hudi maintains a mapping between an incoming record's key and its file group.
    ✅ The index is responsible for finding the record based on this file-group mapping.
    Therefore, any indexing technique can tell whether a record exists based on this mapping. If it exists, the index tells exactly which file group it belongs to, and that is how we achieve faster UPSERTs.
    Takeaways:
    👉 This design enables Hudi to limit the number of records that must be merged with each individual base file.
    👉 So a particular base file only needs to be merged with updates related to the records it contains.
    👉 Table formats lacking an indexing component may require merging all base files with every incoming update or deletion record, which is much less efficient.
    Here is a simple illustration that explains the total merge cost for four (4) 100 MB #Parquet files with and without indexing; a hedged upsert sketch follows below.
    Detailed reading in comments.
    #dataengineering #softwareengineering
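
    A hedged sketch of what an index-assisted upsert looks like from the write path (table name, key/precombine columns, and paths are placeholders; BLOOM is just one of several index types Hudi supports):

    from pyspark.sql import SparkSession

    # Assumes a Spark session with the Hudi Spark bundle on the classpath.
    spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()
    updates = spark.read.parquet("/tmp/incoming_updates")  # placeholder update batch

    upsert_options = {
        "hoodie.table.name": "trips",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "trip_id",
        "hoodie.datasource.write.precombine.field": "ts",
        # The index maps record keys to file groups, so only affected file groups
        # are merged. With four 100 MB base files, an update whose keys all live in
        # one file group merges ~100 MB instead of the ~400 MB a format without an
        # index may have to rewrite.
        "hoodie.index.type": "BLOOM",
    }

    (updates.write.format("hudi")
        .options(**upsert_options)
        .mode("append")
        .save("/tmp/hudi/trips"))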

  • Apache Hudi

    Catch Peloton's Data Platform Team at the upcoming Hudi Community Sync. Learn:
    ✅ how Peloton built a Hudi-powered data lake optimized for analytics, using change data capture (CDC) from relational databases.
    ✅ about their implementation of an auto-healing pipeline to ensure data quality, as well as additional pipelines to maintain data freshness.
    ✅ about the lessons learned & the creative solutions they developed to seamlessly propagate data downstream, supporting machine learning use cases.
    🗓️ 12th September, 9 AM PT | 12 PM ET
    🔗 https://lnkd.in/ejNBhwtC
    #dataengineering #softwareengineering

  • Apache Hudi

    Join us on September 12th, 2024, for an insightful Hudi Community Sync where Peloton's data platform team will dive into their journey of modernizing their data infrastructure using Apache Hudi. In this session, we'll explore how Peloton built a robust Hudi lakehouse for analytics by using change data capture (CDC) from relational databases. The talk will also highlight the implementation of an auto-heal pipeline to boost data quality, the additional pipelines that ensure data freshness, and the challenges of propagating this data downstream for machine learning applications.
    🗓️ Date/Time: September 12th, 9 AM PT | 12 PM ET
    🔗 Link: https://lnkd.in/ejNBhwtC
    #DataEngineering #Lakehouse

    Modernizing Data Infrastructure at Peloton using Apache Hudi
