09-23-2021 03:06 PM
Hi All,
I have a daily Spark job that reads and joins 3-4 source tables and writes the resulting dataframe in Parquet format. The dataframe has 100+ columns. Since the job runs daily, our deduplication logic identifies the latest record from each source table, joins them, and then overwrites the existing Parquet output in full.
The question is: is there a way to write incrementally, so we only touch records that are new or whose values have changed, instead of overwriting the whole file every day?
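For context, a simplified sketch of what the job does today (the table names, the id key, and the updated_at column here are placeholders, not our real schema):

# Simplified sketch of the current daily job; names are placeholders.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def latest_records(df, key_col="id", ts_col="updated_at"):
    """Keep only the most recent record per key (the dedup step)."""
    w = Window.partitionBy(key_col).orderBy(F.col(ts_col).desc())
    return (df.withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))

src_a = latest_records(spark.table("source_a"))
src_b = latest_records(spark.table("source_b"))
src_c = latest_records(spark.table("source_c"))

joined = src_a.join(src_b, "id").join(src_c, "id")  # 100+ columns after the joins

# Full overwrite of the existing Parquet output every day.
joined.write.mode("overwrite").parquet("/mnt/datalake/target_table")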
09-24-2021 11:19 AM
Thanks, appreciate the quick response.
09-27-2021 04:09 AM
The MERGE functionality of Delta Lake is what you are looking for.
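Once the target is stored as a Delta table rather than plain Parquet (an existing Parquet directory can be converted with CONVERT TO DELTA), a MERGE upsert looks roughly like this. This is a minimal PySpark sketch; the path, the id key, and the staging table name are placeholders for your own names:

# Sketch of a Delta Lake MERGE (upsert) into the target table.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Daily batch of deduplicated, joined source records (same 100+ column schema).
updates_df = spark.table("staging.daily_joined_updates")

target = DeltaTable.forPath(spark, "/mnt/datalake/target_table")

(target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()      # update existing records whose key matches
    .whenNotMatchedInsertAll()   # insert brand-new records
    .execute())

If you want to skip rewrites of rows whose values did not actually change, whenMatchedUpdateAll also accepts a condition argument where you can compare the relevant columns.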
09-27-2021 02:55 PM
Thanks werners