Apache Pig is used to analyze large data sets by representing them as data flows. It was designed to provide an abstraction over MapReduce, reducing the complexity of writing MapReduce tasks in Java.
Programmers who are not proficient in Java often struggled to work with Hadoop, especially when writing MapReduce tasks. Pig is a boon for such programmers because it spares them from writing MapReduce code in Java.
Criteria | Pig | Hive |
Language | Pig Latin | SQL-like |
Application | Programming purposes | Report creation |
Operation | Client Side | Server side |
Data support | Semi-structured | Structured |
Connectivity | Can be called by other applications | JDBC & BI tool integration |
We can use Pig in three categories: ETL data pipelines, research on raw data, and iterative processing.
Apache Pig programs are written in a language called Pig Latin, which is analogous to SQL. To carry out a query, we need an execution engine. The Pig engine converts all queries into MapReduce tasks, so MapReduce acts as the execution engine that runs the programs.
BloomMapFile is a class that extends the MapFile class. It is generally used in the HBase table format to provide a fast membership test for keys, using dynamic Bloom filters.
A collection of tuples is known as a bag in Apache Pig.
The FOREACH operation in Apache Pig applies a transformation to each element in a data bag, performing the specified action to generate new data items.
Apache Pig supports three complex data types: tuple, bag, and map.
Sometimes data in a tuple or bag needs to have one level of nesting removed. In those cases FLATTEN, a modifier built into Pig, is used. FLATTEN un-nests bags and tuples. For a tuple, it substitutes the fields of the tuple in place of the tuple, whereas un-nesting a bag is more complex because it requires creating new tuples.
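The bag case can be pictured with a small plain-Java sketch. It is not Pig's implementation; the class name FlattenSketch and the method flattenBag are hypothetical names chosen for illustration, and tuples are modelled as simple lists of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: models what un-nesting a bag produces, not how Pig does it.
final class FlattenSketch {
    // (key, {v1, v2}) becomes the new tuples (key, v1) and (key, v2).
    static List<List<String>> flattenBag(String key, List<String> bag) {
        List<List<String>> out = new ArrayList<>();
        for (String value : bag) {
            out.add(Arrays.asList(key, value)); // one new tuple per bag element
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(flattenBag("a", Arrays.asList("b", "c"))); // prints [[a, b], [a, c]]
    }
}
```

Each element of the bag yields its own output tuple, which is why un-nesting a bag creates new tuples while un-nesting a tuple does not.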
EXPLAIN and DESCRIBE are important utilities for debugging in Apache Pig.
DESCRIBE is helpful to developers writing Pig scripts because it displays the schema of a relation in the script. Developers who are new to Apache Pig can use it to understand how each operator modifies the data. A Pig script can contain multiple DESCRIBE statements.
EXPLAIN is extremely helpful to Hadoop developers when they are trying to optimize Pig Latin scripts or debug errors. It can be applied to a specific alias in a script or to the entire script in the interactive Grunt shell. EXPLAIN produces several text-based graphs, which can be printed to files.
Users interact with HDFS or the local file system through Grunt, which is Apache Pig's interactive shell. To start Grunt, users invoke Apache Pig without any command or script.
Running Pig scripts on large data sets is generally time-consuming. That is why developers run scripts on sample data, but the selected sample may not exercise the script properly. For example, if the script contains a join operator, the sample must contain at least a few records with the same key, or else the join may return no results. To manage these issues, developers use ILLUSTRATE, which takes a sample of the data and, whenever it encounters operators such as FILTER or JOIN that remove data, ensures that some records pass through and some do not, modifying records where necessary so that they meet the condition. ILLUSTRATE shows the output of every step but does not run MapReduce jobs.
Pig is partly case sensitive and partly case insensitive. User-defined functions, field names, and relation names are case sensitive: the function COUNT is not the same as count, and X = load 'foo' is not the same as x = load 'foo'. Keywords in Pig, on the other hand, are case insensitive: LOAD is the same as load.
Logical and physical plans are generated while executing a Pig script. Pig scripts are checked by an interpreter. The logical plan is generated after parsing and semantic checking, and no data is processed while it is generated. A logical plan consists of a collection of operators but does not contain the edges between the operators. After the logical plan is generated, script execution moves to the physical plan, which describes the physical operators Pig will use to execute the script. A physical plan is more or less a series of MapReduce jobs, but it contains no reference to how it will be executed in MapReduce. During generation of the physical plan, the logical operator COGROUP is converted into three physical operators: Local Rearrange, Global Rearrange, and Package.
COGROUP groups data sets. When there is more than one data set, COGROUP groups each data set and then joins them on a common field. That is why we can say that COGROUP is a group of more than one data set.
Pig is used mainly for iterative processing, traditional ETL data pipelines, and research on raw data. Pig works in situations where the schema is unknown, incomplete, or inconsistent, and it is used by developers who want to work with the data before it is loaded into the data warehouse. For example, a website can use it to build behavior prediction models that capture how visitors respond to different images, ads, articles, and so on.
In a strongly typed language, the user must declare the type of every variable up front. In Pig, when you describe the schema of the data, it expects the data to arrive in that format. If the schema is unknown, the script adapts to the actual data types at runtime. That is why Pig Latin can be said to be strongly typed in most cases but gently typed in others: it keeps working with data that does not match its expectations.
The GROUP and COGROUP operators are identical and work with one or more relations. GROUP is usually used when grouping the data in a single relation, for better readability, while COGROUP is used when grouping two or more relations. COGROUP can be thought of as a mix of GROUP and JOIN: it groups the tables on a column and then joins them on the grouped columns. COGROUP can take up to 127 relations at a time.
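To make the "group, then join on the key" idea concrete, here is a minimal plain-Java sketch of co-grouping two relations. It is an illustration under stated assumptions, not Pig's implementation: the class CogroupSketch, the method cogroup, and the sample rows are all hypothetical, and each row is modelled as a two-element array of key and value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: every key maps to one bag per input relation.
final class CogroupSketch {
    // Bag 0 collects values from the left relation, bag 1 from the right relation.
    static Map<String, List<List<String>>> cogroup(List<String[]> left, List<String[]> right) {
        Map<String, List<List<String>>> out = new TreeMap<>();
        addAll(out, left, 0);
        addAll(out, right, 1);
        return out;
    }

    private static void addAll(Map<String, List<List<String>>> out, List<String[]> rows, int side) {
        for (String[] row : rows) {
            List<List<String>> bags = out.get(row[0]);
            if (bags == null) {
                // First time this key is seen: create an empty bag for each relation.
                bags = new ArrayList<List<String>>();
                bags.add(new ArrayList<String>());
                bags.add(new ArrayList<String>());
                out.put(row[0], bags);
            }
            bags.get(side).add(row[1]);
        }
    }

    public static void main(String[] args) {
        List<String[]> owners = Arrays.asList(new String[]{"k1", "alice"}, new String[]{"k2", "bob"});
        List<String[]> pets = Arrays.asList(new String[]{"k1", "cat"}, new String[]{"k1", "dog"});
        System.out.println(cogroup(owners, pets)); // prints {k1=[[alice], [cat, dog]], k2=[[bob], []]}
    }
}
```

A key that appears in only one relation still shows up in the result, with an empty bag for the other relation.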
An outer bag is simply a relation in Pig, whereas any relation inside another bag is known as an inner bag.
The COUNT_STAR function includes NULL values when counting, whereas the COUNT function does not include NULL values when counting the number of elements in a bag.
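The difference can be mimicked in a few lines of plain Java. This is only a sketch of the counting rule described above, not Pig code; the class NullCountSketch and its two methods are hypothetical names, and a bag is modelled as a list of single values.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: mirrors the NULL-handling rule, not Pig's implementation.
final class NullCountSketch {
    // Like COUNT: NULL values are skipped.
    static long count(List<String> bag) {
        long n = 0;
        for (String value : bag) {
            if (value != null) {
                n++;
            }
        }
        return n;
    }

    // Like COUNT_STAR: every element is counted, NULL or not.
    static long countStar(List<String> bag) {
        return bag.size();
    }

    public static void main(String[] args) {
        List<String> bag = Arrays.asList("a", null, "b", null);
        System.out.println(count(bag));     // prints 2
        System.out.println(countStar(bag)); // prints 4
    }
}
```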
Pig supports both single-line and multi-line commands. A single-line command executes on the data but does not store the result in the file system, whereas multi-line commands store the data in HDFS.
The TOP function returns the top N tuples from a relation or a bag of tuples. N is passed as a parameter to TOP, along with the column whose values are to be compared and the relation R.
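As a rough picture of what a top-N selection by column does, the following plain-Java sketch sorts a bag by one column and keeps the first N rows. It is an assumption-laden illustration rather than Pig's algorithm: the class TopSketch, the method top, and the use of a 0-based column index on int arrays are choices made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: picks the n rows with the largest value in the given column.
final class TopSketch {
    static List<int[]> top(int n, int column, List<int[]> bag) {
        List<int[]> sorted = new ArrayList<>(bag);                     // leave the input untouched
        sorted.sort((a, b) -> Integer.compare(b[column], a[column])); // descending by column
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(new int[]{1, 50}, new int[]{2, 90}, new int[]{3, 70});
        for (int[] row : top(2, 1, rows)) {
            System.out.println(Arrays.toString(row)); // prints [2, 90] then [3, 70]
        }
    }
}
```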
The operation can be easily done by using the SPLIT and UNION operators.
The types of user-defined functions supported in Pig are Eval, Algebraic, and Filter functions.
Pig Latin and HiveQL both convert commands into MapReduce jobs and cannot be used for OLAP transactions, because executing low-latency queries with them is extremely difficult.
First we need to load the file employee.txt with the relation name employee. Then we can pull the first ten records from the employee relation by using the LIMIT operator: Result = limit employee 10;
Following are some of the limitations of Apache Pig:
Like the WHERE clause in SQL, Apache Pig has FILTER for extracting records based on a predicate or specified condition. Records pass through the pipeline if the condition evaluates to true. The predicate can contain operators such as ==, <=, !=, and >=. For instance: X = load 'inputs' as (name, address); Y = filter X by name matches 'Mr.*';
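Pig's matches operator takes a Java-style regular expression, so the effect of a pattern such as 'Mr.*' can be tried out in plain Java. The sketch below is illustrative only: the class FilterSketch, the method filterMatches, and the sample names are hypothetical, and it assumes the whole string must match the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: keeps the values for which the predicate is true.
final class FilterSketch {
    static List<String> filterMatches(List<String> names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            // matches() requires the entire string to match the pattern.
            if (name != null && pattern.matcher(name).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Mr. Smith", "Ms. Jones", "Mrs. Lee");
        System.out.println(filterMatches(names, "Mr.*")); // prints [Mr. Smith, Mrs. Lee]
    }
}
```

Because the dot in the pattern stands for any character, "Mrs. Lee" passes the filter as well; escaping the dot would restrict the match to names that literally start with "Mr.".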
If the built-in operators do not provide some functionality, developers can implement it by writing user-defined functions (UDFs) in programming languages such as Java, Python, or Ruby. The UDFs are then embedded into the Pig Latin script.
UDFs can be developed by extending the EvalFunc class and overriding the exec method.
Example: This UDF replaces a given string with another string
package kelly.training.pig.udf;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.UDFContext;

public class Transform extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Nothing to transform for an empty input tuple.
        if (input == null || input.size() == 0) {
            return null;
        }
        // The search and replacement strings come from the job configuration.
        Configuration conf = UDFContext.getUDFContext().getJobConf();
        String from = conf.get("replace.string");
        if (from == null) {
            throw new IOException("replace.string should not be null");
        }
        String to = conf.get("replace.by.string");
        if (to == null) {
            throw new IOException("replace.by.string should not be null");
        }
        try {
            // Replace every occurrence of 'from' in the first field.
            String str = (String) input.get(0);
            return str.replace(from, to);
        } catch (Exception e) {
            throw new IOException("caught exception processing input row", e);
        }
    }
}
The Grunt shell is an interactive shell. This means we get the output then and there, whether the command succeeds or fails.
PigStorage loads or stores relations using a field-delimited text format.
Each line is broken into fields using a configurable field delimiter (defaults to a tab character) to be stored in the tuple's fields. It is the default storage when none is specified.
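The delimiter-based splitting described above can be pictured with a short plain-Java sketch. It is not the storage function's actual code; the class DelimitedLineSketch, the method toFields, and the sample line are hypothetical and only show how one line of text becomes the fields of one tuple.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: one input line becomes the fields of one tuple.
final class DelimitedLineSketch {
    static List<String> toFields(String line, char delimiter) {
        // Quote the delimiter so it is treated literally; -1 keeps trailing empty fields.
        String regex = Pattern.quote(String.valueOf(delimiter));
        return Arrays.asList(line.split(regex, -1));
    }

    public static void main(String[] args) {
        System.out.println(toFields("alice\t30\tNY", '\t')); // prints [alice, 30, NY]
        System.out.println(toFields("bob,,LA", ','));        // prints [bob, , LA]
    }
}
```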
Pig | SQL |
Pig is procedural | SQL is declarative |
Nested relational data model | Flat relational data model |
Schema is optional | Schema is required |
Scan-centric analytic workloads | OLTP + OLAP workloads |
Limited query optimization | Significant opportunity for query optimization |
MapReduce | Pig |
MapReduce requires programming language skills for writing the business logic. | Pig does not require much programming skill, because the whole logic is written using Pig transformations (operations). |
Any change to a MapReduce program means going through the entire process again: compiling the program, executing it, packaging it, and deploying it to the cluster environment. | In Pig, simple scripting is enough and that whole process is avoided: 5% of the MapReduce code, 5% of the MapReduce development time, and 25% of the MapReduce execution time, which increases programmer productivity. |
As a rule of thumb, a Hadoop MapReduce program takes about 200 lines of code. | The same kind of program can be written in Pig in about 10 lines of code. |
MapReduce requires multiple stages, leading to long development life cycles. | Rapid prototyping increases productivity. Pig supports log analysis and ad hoc queries across various large data sets. |
The COUNT function does not include NULL values when counting the number of elements in a bag, whereas the COUNT_STAR function includes NULL values while counting.
This can be accomplished using the UNION and SPLIT operators.
Algebraic, Eval, and Filter functions are the types of UDFs supported in Pig.
int, long, float, double, chararray, and bytearray are the scalar data types available in Apache Pig.