Get an introduction to streaming analytics, which allows you real-time insight from captured events and big data. There are applications across industries, from finance to wine making, though there are two primary challenges to be addressed.
Did you know that a plane flying from Texas to London can generate 30 million data points per flight? As Jim Daily of GE Aviation notes, that equals 10 billion data points in one year. And we’re talking about one plane alone. So you can understand why another top GE executive recently told Ericsson Business Review that “Cloud is the future of IT,” with a focus on supporting challenging applications in industries such as aviation and energy.
Apache Spark can use various cluster manager to execute application (Stand Alone, YARN, Apache Mesos). When you install Apache Spark on MapR you can submit application in a Stand Alone mode or using YARN.
This article focuses on YARN and Dynamic Allocation, a feature that lets Spark add or remove executors dynamically based on the workload. You can find more information about this feature in this presentation from Databricks:
MapR Streams is a new distributed messaging system for streaming event data at scale, and it’s integrated into the MapR converged platform.
MapR Streams uses the Apache Kafka API, so if you’re already familiar with Kafka, you’ll find it particularly easy to get started with MapR Streams.
A very common use case when working with Hadoop is to store and query simple files (CSV, TSV, …); then to get better performance and efficient storage convert these files into more efficient format, for example Apache Parquet.
Let’s take a concrete example, you can find many interesting Open Data sources that distribute data as CSV files- or equivalent format-. So you can store them into your distributed file system and use them in your applications/jobs/analytics queries. This is not the most efficient way especially when we know that these data won’t move that often. So instead of simply storing the CSV let’s copy this information into Parquet.
How to convert CSV files into Parquet files?
You can use code to achieve this, as you can see in the ConvertUtils sample/test class. You can use a simpler way with Apache Drill. Drill allows you save the result of a query as Parquet files.
The following steps will show you how to do convert a simple CSV into a Parquet file using Drill.
Apache Drill allows users to explore any type of data using ANSI SQL. This is great, but Drill goes even further than that and allows you to create custom functions to extend the query engine. These custom functions have all the performance of any of the Drill primitive operations, but allowing that performance makes writing these functions a little trickier than you might expect.
Let’s create a new function that allows you to mask some characters in a string, and let’s make it very simple. The new function will allow user to hide x number of characters from the start and replace then by any characters of their choice. This will look like:
As you know, you have many differences between relational and document databases. The biggest, for the developer, is probably the data model: Row versus Document. This is particularly true when we talk about “relations” versus “embedded documents (or values)”. Let’s look at some examples, then see what are the various operations provided by MongoDB to help you to deal with this.