A very common use case when working with Hadoop is to store and query simple text files (CSV, TSV, ...), and then, to get better performance and more efficient storage, convert these files into a more efficient format such as Apache Parquet.
Parquet is a columnar format: data is stored column by column rather than row by row, which enables better compression and lets queries read only the columns they actually need.
Let's take a concrete example: many interesting Open Data sources distribute data as CSV files (or an equivalent format). You can store these files in your distributed file system and use them in your applications, jobs, and analytics queries. However, this is not the most efficient approach, especially when we know that the data won't change very often. So instead of simply storing the CSV, let's copy this information into Parquet.
You can write code to achieve this, as you can see in the ConvertUtils sample/test class, but there is a simpler way with Apache Drill: Drill allows you to save the result of a query as Parquet files.
The following steps show you how to convert a simple CSV file into a Parquet file using Drill. You will need:
- Apache Drill: standalone Apache Drill, or the Apache Drill Sandbox from MapR
- Some CSV files: for example the Passenger Dataset from SFO Air Traffic Statistics
Let's execute a basic query:
```sql
SELECT *
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/MonthlyPassengerData_200507_to_201503.csv`
LIMIT 5;
```

```
["200507","ATA Airlines","TZ","ATA Airlines","TZ","Domestic","US","Deplaned","Low Fare","Terminal 1","B","27271\r"]
...
```
As you can see, by default Drill processes each line as an array of columns, with all values typed as simple strings. So if you need to do some operations with these values (a projection or a WHERE clause), you must reference each value by its index, using the `columns[n]` syntax, and cast it to the proper type. You can see a simple example below:
```sql
SELECT
  columns[0] AS `DATE`,
  columns[1] AS `AIRLINE`,
  CAST(columns[11] AS DOUBLE) AS `PASSENGER_COUNT`
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv`
WHERE CAST(columns[11] AS DOUBLE) < 5;
```

```
+---------+-----------------------------------+------------------+
|  DATE   |              AIRLINE              | PASSENGER_COUNT  |
+---------+-----------------------------------+------------------+
| 200610  | United Airlines - Pre 07/01/2013  | 2.0              |
...
```
We are now ready to create our Parquet files using a "Create Table As Select" (aka CTAS) statement:
```sql
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`/stats/airport_data/` AS
SELECT
  CAST(SUBSTR(columns[0], 1, 4) AS INT) `YEAR`,
  CAST(SUBSTR(columns[0], 5, 2) AS INT) `MONTH`,
  columns[1] AS `AIRLINE`,
  columns[2] AS `IATA_CODE`,
  columns[3] AS `AIRLINE_2`,
  columns[4] AS `IATA_CODE_2`,
  columns[5] AS `GEO_SUMMARY`,
  columns[6] AS `GEO_REGION`,
  columns[7] AS `ACTIVITY_CODE`,
  columns[8] AS `PRICE_CODE`,
  columns[9] AS `TERMINAL`,
  columns[10] AS `BOARDING_AREA`,
  CAST(columns[11] AS DOUBLE) AS `PASSENGER_COUNT`
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv`;
```
That's it! You now have a Parquet file, a single file in our case since our dataset is really small. Apache Drill will create multiple files for the table depending on the size of the data and the configuration of your environment.
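If you want to influence how Drill sizes the Parquet files it writes, you can adjust the Parquet block (row group) size before running the CTAS statement. The sketch below uses the `store.parquet.block-size` session option; the value shown is simply the usual 512 MB default, adjust it for your environment:

```sql
-- Set the Parquet block (row group) size, in bytes, before running CTAS.
-- 536870912 bytes = 512 MB, the usual default value.
ALTER SESSION SET `store.parquet.block-size` = 536870912;
```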
I invite you to read the Parquet chapter of the Apache Drill documentation to learn more about Drill and Parquet.
Now that you have created your Parquet files, you can use them in any of your Hadoop processes, but you can also query them directly in Drill, as follows:
```sql
SELECT *
FROM dfs.tmp.`/stats/airport_data/*`;
```
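One nice side effect of the conversion: the Parquet table has named, typed columns, so you no longer need the `columns[n]` indexes or the CAST calls. For example, a hypothetical aggregation (not from the original article) over the `YEAR`, `AIRLINE`, and `PASSENGER_COUNT` columns created by the CTAS above could look like this:

```sql
-- Columns are now named and typed, so no index access or casting is needed.
SELECT `AIRLINE`, SUM(`PASSENGER_COUNT`) AS `TOTAL_PASSENGERS`
FROM dfs.tmp.`/stats/airport_data/`
WHERE `YEAR` = 2014
GROUP BY `AIRLINE`;
```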
In this article you have learned how to convert CSV files into Parquet using an Apache Drill query.
You can do the same with any source supported by Drill, for example converting JSON to Parquet, or even saving the result of a complex join query between multiple data sources. You can also choose a different output format, for example JSON or CSV.
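As a sketch of these variations, here is a hypothetical example (the file path `/opendata/users.json`, the field names, and the output paths are made up for illustration) showing a JSON-to-Parquet conversion and a CSV export of the airport table, using the same CTAS pattern with a different `store.format`:

```sql
-- Hypothetical: convert a JSON file to Parquet with the same CTAS pattern.
-- JSON sources already have named fields, so no columns[n] indexing is needed.
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.tmp.`/stats/users_parquet/` AS
SELECT t.`id`, t.`name`
FROM dfs.`/opendata/users.json` t;

-- Hypothetical: export the airport table as CSV by switching the output format.
ALTER SESSION SET `store.format` = 'csv';
CREATE TABLE dfs.tmp.`/stats/airport_csv/` AS
SELECT * FROM dfs.tmp.`/stats/airport_data/`;
```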