Tug’s Blog

My journey in Big Data, Hadoop, NoSQL and MapR

Convert a CSV File to Apache Parquet With Drill

A very common use case when working with Hadoop is to store and query plain files (CSV, TSV, …) and then, to get better performance and more efficient storage, to convert these files into a more efficient format such as Apache Parquet.

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. It has the following characteristics:

  • Self-describing
  • Columnar format
  • Language-independent

Let’s take a concrete example: you can find many interesting Open Data sources that distribute data as CSV files (or an equivalent format). You can store them in your distributed file system and use them in your applications, jobs, and analytics queries. This is not the most efficient approach, however, especially when we know that this data won’t change very often. So instead of simply storing the CSV, let’s copy the information into Parquet.

How to convert CSV files into Parquet files?

You can write code to achieve this, as you can see in the ConvertUtils sample/test class, but there is a simpler way with Apache Drill: Drill allows you to save the result of a query as Parquet files.

The following steps show you how to convert a simple CSV file into a Parquet file using Drill.
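In essence, the conversion is a single CTAS (CREATE TABLE … AS SELECT) statement. Here is a minimal sketch run through Drill’s JDBC driver, assuming a Drillbit on localhost; the file path, table name, and column layout are all illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CsvToParquet {
        public static void main(String[] args) throws Exception {
            // connect to a Drillbit running on localhost (illustrative)
            Class.forName("org.apache.drill.jdbc.Driver");
            try (Connection conn =
                     DriverManager.getConnection("jdbc:drill:drillbit=localhost");
                 Statement st = conn.createStatement()) {

                // make CTAS write Parquet (it is the default, set explicitly for clarity)
                st.execute("ALTER SESSION SET `store.format` = 'parquet'");

                // Drill exposes each CSV line as a `columns` array:
                // cast and name the fields you want to keep
                st.execute(
                    "CREATE TABLE dfs.tmp.`sample_parquet` AS "
                    + "SELECT CAST(columns[0] AS INT) AS id, columns[1] AS name "
                    + "FROM dfs.`/data/sample.csv`");
            }
        }
    }

The same two statements can be typed directly in sqlline or the Drill Web UI; the JDBC wrapper is only there to make the sketch self-contained.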

Apache Drill: How to Create a New Function?

Apache Drill allows users to explore any type of data using ANSI SQL. This is great, but Drill goes even further and allows you to create custom functions to extend the query engine. These custom functions have all the performance of Drill’s primitive operations, but achieving that performance makes writing them a little trickier than you might expect.

In this article, I’ll explain step by step how to create and deploy a new function using a very basic example. Note that you can find a lot of information about Drill custom functions in the documentation.

Let’s create a new function that allows you to mask some characters in a string, and let’s keep it very simple. The new function will allow users to hide a given number of characters from the start of a string and replace them with a character of their choice. It will look like this:

MASK('PASSWORD', '#', 4) => ####WORD

You can find the full project in the following GitHub repository.

As mentioned before, we could imagine many advanced features for this function, but my goal is to focus on the steps for writing a custom function, not so much on what the function does.
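To give you a feel for what is coming, here is a minimal, simplified sketch of such a Drill simple function; the class name, field names, and helper usage are illustrative, and the complete, tested implementation lives in the GitHub repository above. Note that Drill inlines the body of eval() into generated code, which is why it references classes by fully qualified names instead of imports:

    import io.netty.buffer.DrillBuf;
    import javax.inject.Inject;
    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.IntHolder;
    import org.apache.drill.exec.expr.holders.VarCharHolder;

    @FunctionTemplate(
        name = "mask",
        scope = FunctionTemplate.FunctionScope.SIMPLE,
        nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
    public class SimpleMaskFunc implements DrillSimpleFunc {

        @Param VarCharHolder input;      // the string to mask
        @Param VarCharHolder maskChar;   // the replacement character
        @Param IntHolder numToMask;      // how many characters to hide
        @Output VarCharHolder out;
        @Inject DrillBuf buffer;

        public void setup() {
            // nothing to initialize for this function
        }

        public void eval() {
            // Drill copies this body into generated code, so classes outside
            // java.lang must be referenced by their fully qualified names
            String value = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
                .toStringFromUTF8(input.start, input.end, input.buffer);
            String mask = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
                .toStringFromUTF8(maskChar.start, maskChar.end, maskChar.buffer);

            int n = Math.min(numToMask.value, value.length());
            StringBuilder masked = new StringBuilder();
            for (int i = 0; i < n; i++) {
                masked.append(mask);
            }
            masked.append(value.substring(n));

            // write the result into a Drill-managed buffer
            byte[] bytes = masked.toString()
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);
            buffer = buffer.reallocIfNeeded(bytes.length);
            buffer.setBytes(0, bytes);
            out.buffer = buffer;
            out.start = 0;
            out.end = bytes.length;
        }
    }

To be picked up by Drill, the class must be packaged in a jar together with its sources and a drill-module.conf file, and dropped into Drill’s jars/3rdparty directory; the article walks through those steps.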

MongoDB: Playing With Arrays

As you know, there are many differences between relational and document databases. The biggest one, for the developer, is probably the data model: row versus document. This is particularly true when we talk about “relations” versus “embedded documents (or values)”. Let’s look at some examples, then see the various operations MongoDB provides to help you deal with this.
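As a preview, here is a minimal sketch of the kind of array operations the article covers, using the MongoDB Java driver; the collection, field, and value names are illustrative:

    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.util.Arrays;
    import static com.mongodb.client.model.Filters.eq;
    import static com.mongodb.client.model.Updates.addToSet;
    import static com.mongodb.client.model.Updates.pull;
    import static com.mongodb.client.model.Updates.push;

    public class ArraySketch {
        public static void main(String[] args) {
            MongoClient client = new MongoClient("localhost", 27017);
            MongoCollection<Document> users =
                client.getDatabase("test").getCollection("users");

            // embed an array of values directly in the document
            users.insertOne(new Document("_id", 1)
                .append("name", "Tug")
                .append("interests", Arrays.asList("nosql")));

            // $push appends a value, even if it is already present
            users.updateOne(eq("_id", 1), push("interests", "bigdata"));

            // $addToSet appends only when the value is not there yet
            users.updateOne(eq("_id", 1), addToSet("interests", "bigdata"));

            // $pull removes every occurrence of a value
            users.updateOne(eq("_id", 1), pull("interests", "nosql"));

            client.close();
        }
    }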

Introduction to MongoDB Security

Last week at the Paris MUG, I had a quick chat about security and MongoDB, and I decided to write this post to explain how to configure the out-of-the-box security available in MongoDB.

You can find all the information about MongoDB security in the Security chapter of the documentation.

In this post, I won’t go into the details of how to deploy your database in a secured environment (DMZ/Network/IP/Location/…).

I will focus on authentication and authorization, and give you the steps to secure access to your database and data.

I have to mention that, by default, security is not enabled when you install and start MongoDB; this simply makes it easier to get started.

The first part of security is authentication; you can choose between multiple mechanisms, documented here. Let’s focus on the “MONGODB-CR” mechanism.

The second part is authorization, which controls what a user can and cannot do once connected to the database. The documentation about authorization is available here.

Let’s now look at how to:

  1. Create an Administrator User
  2. Create Application Users

For each type of user, I will show how to grant specific permissions, as in the sketch below.
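Here is a minimal sketch of both steps using the MongoDB Java driver; the host, user names, passwords, and database names are illustrative, and the server must be started with authentication enabled for the grants to be enforced:

    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;
    import java.util.Arrays;

    public class CreateUsersSketch {
        public static void main(String[] args) {
            MongoClient client = new MongoClient("localhost", 27017);

            // 1. the administrator user lives in the "admin" database
            MongoDatabase admin = client.getDatabase("admin");
            admin.runCommand(new Document("createUser", "tugAdmin")
                .append("pwd", "adminSecret")
                .append("roles", Arrays.asList(
                    new Document("role", "userAdminAnyDatabase")
                        .append("db", "admin"))));

            // 2. an application user limited to read/write on its own database
            MongoDatabase appDb = client.getDatabase("myapp");
            appDb.runCommand(new Document("createUser", "appUser")
                .append("pwd", "appSecret")
                .append("roles", Arrays.asList(
                    new Document("role", "readWrite")
                        .append("db", "myapp"))));

            client.close();
        }
    }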

Everybody Says “Hackathon”!

TL;DR:

  • MongoDB & Sage organized an internal hackathon
  • We used the new X3 platform, based on MongoDB, Node.js, and HTML, to add cool features to the ERP
  • This shows that “any” enterprise can (and should) do the same to:
    • look differently at software development
    • build strong team spirit
    • have fun!

Introduction

Like many of you, I have participated in multiple hackathons where developers, designers, and entrepreneurs work together to build applications in a few hours or days. As you probably know, more and more companies are running such events internally; this is the case, for example, at Facebook and Google, but also at ING (banking), AXA (insurance), and many more.

Last week, I participated in the first Sage Hackathon!

In case you do not know it, Sage is a 30+ year old ERP vendor. I have to say that I could not have imagined this coming from such a company… Let me tell you more about it.

How to Create a Pub/Sub Application With MongoDB? Introduction

In this article, we will see how to create a pub/sub application (messaging, chat, notification) built entirely on MongoDB, without any message broker such as RabbitMQ, JMS, …

So, what needs to be done to achieve such a thing?

  • An application “publishes” a message. In our case, it simply saves a document into MongoDB.
  • Another application, or thread, subscribes to these events and receives messages automatically. In our case, this means the application should automatically receive each newly created document out of MongoDB.

All this is possible thanks to some very cool MongoDB features: capped collections and tailable cursors, sketched below.
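As a first taste, here is a minimal sketch using the MongoDB Java driver; the database, collection, and sizes are illustrative, and the publisher and subscriber would normally live in separate processes:

    import com.mongodb.CursorType;
    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoCursor;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.CreateCollectionOptions;
    import org.bson.Document;

    public class PubSubSketch {
        public static void main(String[] args) {
            MongoClient client = new MongoClient("localhost", 27017);
            MongoDatabase db = client.getDatabase("pubsub");

            // tailable cursors only work on capped collections;
            // this call fails if the collection already exists
            db.createCollection("messages",
                new CreateCollectionOptions().capped(true).sizeInBytes(1024 * 1024));
            MongoCollection<Document> messages = db.getCollection("messages");

            // "publish": simply insert a document
            messages.insertOne(new Document("channel", "chat").append("text", "hello"));

            // "subscribe": a tailable-await cursor blocks until new documents arrive
            MongoCursor<Document> cursor = messages.find()
                .cursorType(CursorType.TailableAwait)
                .iterator();
            while (cursor.hasNext()) {
                System.out.println("received: " + cursor.next().toJson());
            }
        }
    }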

Big Data… Is Hadoop a Good Way to Start?

In the past two years, I have met many developers and architects working on “big data” projects. This sounds amazing, but quite often the truth is not that amazing.

TL;DR

So you believe you have a big data project?

  • Do not start with the installation of a Hadoop cluster – the “how”
  • Start by talking to business people to understand their problem – the “why”
  • Understand the data you must process
  • Look at the volume – very often it is not “that” big
  • Then implement it, taking a simple approach; for example, start with MongoDB + Apache Spark