Thank you for taking the time to read this article. We have done our best to help you choose a file format for the Databricks Spark data ecosystem with confidence. The views, thoughts, and opinions expressed in this blog belong solely to the author(s), and not necessarily to the authors’ employer, organization, committee, or any other individual or group.

Introduction:

The most important question to ask before stepping into the world of Big Data is: ‘What is a file format in Apache Spark and Databricks Spark?’

To answer in layman’s terms, a file format is simply a way of defining how information is stored in a file system (for example, HDFS or DBFS). The choice is usually driven by the use case or by the processing algorithms of a specific domain. A file format should be well defined and expressive, and it should be able to handle a variety of data structures: structs, records, maps, and arrays, along with primitives such as strings and numbers.

1. CSV

CSV (Comma Separated Values) is the simplest file format, where fields are separated using a comma as the delimiter. It is human-readable, compressible, and, most importantly, platform-independent. Hence, CSV is a widely accepted choice for transferring data between platforms or programs with otherwise incompatible formats. For example, you can export database results to CSV and then import the file into an Excel spreadsheet without losing any information.
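
To make this concrete, here is a minimal PySpark sketch of reading and writing CSV. The file paths and options are illustrative placeholders, and in a Databricks notebook the spark session is already defined.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # pre-defined as "spark" in Databricks notebooks

    # Read a CSV file, treating the first line as a header and inferring column types
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/tmp/input/people.csv"))

    # Write the DataFrame back out as CSV with a header row
    df.write.option("header", "true").mode("overwrite").csv("/tmp/output/people_csv")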

2. JSON

JSON (JavaScript Object Notation) is an open-standard file and data-interchange format that uses human-readable text to store and transmit data objects consisting of key-value pairs. JSON is a self-describing, language-independent data format. It is largely used for data transmission between a server and a web application.
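
As a rough sketch, reading and writing JSON in PySpark looks like the following; the paths are placeholders, and Spark expects newline-delimited JSON by default.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read newline-delimited JSON; Spark infers the schema from the keys and values
    df = spark.read.json("/tmp/input/events.json")
    df.printSchema()

    # Write the DataFrame back out as JSON
    df.write.mode("overwrite").json("/tmp/output/events_json")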

3. AVRO

Avro is a row-based storage format for Hadoop that is widely used as a serialization platform. Avro stores the data definition (schema) in JSON format, making it easy for any program to read and interpret, while the data itself is stored in a compact and efficient binary format. It is language-independent, splittable, and robust.
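
A minimal PySpark sketch of writing and reading Avro is shown below. Avro is a built-in source in Databricks Runtime and recent Spark releases, but on a plain Spark installation you may need to add the external spark-avro package; the paths and column names here are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Write a small DataFrame as Avro; the schema travels with the data
    df = spark.range(5).withColumnRenamed("id", "user_id")
    df.write.format("avro").mode("overwrite").save("/tmp/output/users_avro")

    # Read it back without supplying a schema
    spark.read.format("avro").load("/tmp/output/users_avro").show()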

4. ORC 

ORC (Optimized Row Columnar) is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. An ORC file stores row data in groups called stripes, along with a file footer. Using ORC files can improve the performance of reading, writing, and processing data by up to 3x. It is compressible, splittable, has lightweight indexes, and allows concurrent reads of the same file using separate RecordReaders.
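
For illustration, here is a minimal PySpark sketch of writing and reading ORC; the data and paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Write a small DataFrame as ORC; rows are grouped into stripes with lightweight indexes
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    df.write.format("orc").mode("overwrite").save("/tmp/output/people_orc")

    # Read the ORC data back
    spark.read.format("orc").load("/tmp/output/people_orc").show()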

5. Parquet

Parquet is an open-source columnar storage format that is highly compressible and allows massive storage optimization. In Parquet, metadata including the schema and structure is embedded within each file, making it a self-describing file format. Since the data is stored column by column, different encoding and compression techniques can be applied to different columns.
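
A short PySpark sketch follows, using placeholder paths. The compression option shown is one of several codecs Parquet supports, and selecting a single column illustrates how the columnar layout lets Spark skip the columns it does not need.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Write a small DataFrame as Parquet with snappy compression (Spark's default codec)
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    df.write.option("compression", "snappy").mode("overwrite").parquet("/tmp/output/people_parquet")

    # Column pruning: only the "name" column needs to be read from storage
    spark.read.parquet("/tmp/output/people_parquet").select("name").show()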

6. Delta

Delta Lake is an open-source storage format that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS. A Delta table stores its data in Apache Parquet format, so Delta Lake inherits Parquet's capabilities, but it adds the capabilities Parquet lacks: ACID transactions, concurrency control, schema enforcement and evolution, unified batch and streaming data processing, and MERGE command support. It also supports time travel, letting you roll back and examine previous versions of your data in the lakehouse. The Delta Lake transaction log, also called the Delta log, keeps an ordered record of every change ever made to a Delta table since its inception, giving you a single source of truth for the system of records.
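
As a minimal sketch, the following PySpark snippet writes a Delta table and then reads an earlier version of it via time travel. Delta is available out of the box on Databricks; on open-source Spark you would need the delta-spark package and the corresponding Spark session configuration. Paths and data are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Write a Delta table: Parquet data files plus a _delta_log transaction log directory
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    df.write.format("delta").mode("overwrite").save("/tmp/output/people_delta")

    # Time travel: read the table as of an earlier version recorded in the Delta log
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/output/people_delta").show()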

Conclusion:

CSV and JSON are simple, human-readable choices for exchanging data; Avro is a compact row-based format that carries its schema with the data; ORC and Parquet are columnar formats built for analytical workloads; and Delta builds on Parquet to add ACID transactions, schema enforcement and evolution, and time travel. The right choice depends on your use case and on how the data will be processed.

References:

Cwiki – ORC

Data-Flair blogs – AVRO

Learning Journal – Parquet vs Delta

The Apache Spark File Format Ecosystem – Databricks
