Feb 18, 2016

How to work with Apache Pig Eval Functions?

In this post I am focusing on the following Eval functions. The examples shown were developed with Pig v0.15.0, the latest binary available at the time.

Eval Functions



Data:

Department file:
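The department file was shown as an image in the original post. Based on the LOAD schema used later (deptno, dname, loc), a comma-delimited layout along these lines is assumed; only the last row below appears in the post's own results, the others are illustrative:

10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON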


Employees file:
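The employees file was also shown as an image. It is assumed to be comma-delimited and to match the LOAD schema used below (empno, ename, job, mgr, hiredate, sal, comm, deptno); the two rows here are illustrative only, pieced together from results later in the post, with hypothetical hiredate and comm values:

7369,SMITH,CLERK,7902,17-12-1980,800.0,,20
7499,ALLEN,SALESMAN,7698,20-02-1981,1600.0,300.0,30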

______________________________________________________________________________

EVAL Function Examples:

AVG:


The function computes the average of the numeric values in a single-column bag.


a) The basic AVG function example:

/*Average salary per department*/ 
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int); 
gg = GROUP emp by deptno; 
ag = FOREACH gg GENERATE group, (float)AVG(emp.sal); 
dump ag;


Results:
(10,2916.6667) 
(20,2175.0) 
(30,1566.6666) 

b) The example below uses a nested FOREACH:


emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
avggrp1 = FOREACH (GROUP emp by deptno) {
    unqdept = DISTINCT emp.deptno;
    GENERATE FLATTEN(unqdept), (int)SUM(emp.sal), (int)AVG(emp.sal);
};
dump avggrp1;



Results:
(10,8750,2916)
(20,10875,2175)
(30,9400,1566)
______________________________________________________________________________

BagToString:

Concatenates the elements of a bag into a chararray string, placing an optional delimiter between each value.

a) The basic BagToString function example:
/* we have used the same example as above */
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
gg = GROUP emp by deptno;
ag = FOREACH gg GENERATE group, (float)AVG(emp.sal), BagToString(emp.(deptno, sal), ' |');

dump ag;

Results:
The third field in the results below shows the bag converted into a string; it comprises the individual deptno and sal values over which the AVG function in the earlier example was computed.

(10,2916.6667,10 |1300.0 |10 |5000.0 |10 |2450.0)
(20,2175.0,20 |1100.0 |20 |3000.0 |20 |800.0 |20 |2975.0 |20 |3000.0)
(30,1566.6666,30 |1500.0 |30 |1600.0 |30 |2850.0 |30 |1250.0 |30 |1250.0 |30 |950.0)

______________________________________________________________________________

CONCAT:
Concatenates two or more expressions of identical type.

a) The basic CONCAT function example:
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
ee = FOREACH emp GENERATE CONCAT((chararray)empno,'|',ename,'~', job,'~',(chararray)mgr,'|',(chararray)deptno) as big_col;
dump ee;


Results:
Please note that I have used different field separators ('|' and '~') as an example. The row with no values can be filtered out using FILTER, as shown further below.

(7369|SMITH~CLERK~7902|20)
(7499|ALLEN~SALESMAN~7698|30)
(7521|WARD~SALESMAN~7698|30)
(7566|JONES~MANAGER~7839|20)
(7654|MARTIN~SALESMAN~7698|30)
(7698|BLAKE~MANAGER~7839|30)
(7782|CLARK~MANAGER~7839|10)
(7788|SCOTT~ANALYST~7566|20)
()
(7844|TURNER~SALESMAN~7698|30)
(7876|ADAMS~CLERK~7788|20)
(7900|JAMES~CLERK~7698|30)
(7902|FORD~ANALYST~7566|20)
(7934|MILLER~CLERK~7782|10)


The above example with a filter applied:

ee = FOREACH emp GENERATE CONCAT((chararray)empno,'|',ename,'~', job,'~',(chararray)mgr,'|',(chararray)deptno) as big_col;
ff = FILTER ee BY $0 != '';
dump ff;

Results:

(7369|SMITH~CLERK|7902~20)
(7499|ALLEN~SALESMAN|7698~30)
(7521|WARD~SALESMAN|7698~30)
(7566|JONES~MANAGER|7839~20)
(7654|MARTIN~SALESMAN|7698~30)
(7698|BLAKE~MANAGER|7839~30)
(7782|CLARK~MANAGER|7839~10)
(7788|SCOTT~ANALYST|7566~20)
(7844|TURNER~SALESMAN|7698~30)
(7876|ADAMS~CLERK|7788~20)
(7900|JAMES~CLERK|7698~30)
(7902|FORD~ANALYST|7566~20)
(7934|MILLER~CLERK|7782~10)
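Since CONCAT returns null when any of its inputs is null, the empty row can equivalently be dropped with a null check (a sketch using the same aliases as above):

ff = FILTER ee BY big_col IS NOT NULL;
dump ff;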
______________________________________________________________________________

COUNT:

Computes the number of elements in a bag. This function does not consider NULL values or records.

a) The basic COUNT function example:
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
grp1 = FOREACH (GROUP emp by mgr) { GENERATE emp.mgr, (int)COUNT(emp.mgr); };
dump grp1;

Results:

({(7566),(7566)},2)
({(7698),(7698),(7698),(7698),(7698)},5)
({(7782)},1)
({(7788)},1)
({(7839),(7839),(7839)},3)
({(7902)},1)
({()},0)


Please note that the record with the null mgr (the last line above) is not counted by the COUNT function, so the count comes out as 0 instead of 1. This can be fixed by using the COUNT_STAR function.
______________________________________________________________________________

COUNT_STAR:

Computes the number of elements in a bag, including NULL values.


a) The basic COUNT_STAR function example:
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
grp1 = FOREACH (GROUP emp by mgr) { GENERATE emp.mgr, (int)COUNT_STAR(emp.mgr); };
dump grp1;

Results:
Please note that the record with the null mgr is counted by the COUNT_STAR function, so the count is produced as 1 instead of 0 (when compared with the plain COUNT function).


({(7566),(7566)},2)
({(7698),(7698),(7698),(7698),(7698)},5)
({(7782)},1)
({(7788)},1)
({(7839),(7839),(7839)},3)
({(7902)},1)
({()},1)
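To see the difference between the two functions directly, both can be generated side by side in one statement (a sketch on the same emp relation):

grp2 = FOREACH (GROUP emp by mgr) GENERATE group, (int)COUNT(emp.mgr), (int)COUNT_STAR(emp.mgr);
dump grp2;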
______________________________________________________________________________

DIFF

Compares two fields in a tuple.

a) The basic DIFF function example:
Please note that I had to create a relation with two different bags from the same dataset to be able to show how the DIFF function works and how it finds the exact differences between two datasets.


emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
/*Made a copy of emp dataset */
emp2 = FOREACH emp GENERATE *;
/* made 2 sets with different manager for testing DIFF function*/
ee1 = FOREACH (FILTER emp by mgr == 7839) GENERATE empno, ename, job, mgr, sal, comm, deptno;
ff1 = FOREACH (FILTER emp2 by mgr == 7698) GENERATE empno, ename, job, mgr, sal, comm, deptno;
/* converting columns from different datasets into bag as Diff function works on bag*/
enm1 = FOREACH (COGROUP ee1 by (empno), ff1 by (empno)) GENERATE TOBAG(ee1.empno, ee1.empno), TOBAG(ff1.empno, ff1.mgr);
dump enm1;
/* Results before diff function*/
({({}),({})},{({}),({})})
({({}),({})},{({}),({})})
({({(7839)}),({(7566)})},{({(7566)}),({(7566)})})
({({}),({})},{({}),({})})
({({(7839)}),({(7698)})},{({(7698)}),({(7698)})})
({({(7839)}),({(7782)})},{({(7782)}),({(7782)})})
({({}),({})},{({}),({})})
({({}),({})},{({}),({})})

/* applying the DIFF function on our relation that has the 2 bags */

enm1 = FOREACH (COGROUP ee1 by (empno), ff1 by (empno)) GENERATE DIFF(TOBAG(ee1.empno, ee1.empno), TOBAG(ff1.empno, ff1.mgr));
dump enm1;

/* Results after diff function*/
({})
({})
({({(7839)})})
({})
({({(7839)})})
({({(7839)})})
({})
({})
______________________________________________________________________________

IsEmpty

Checks if a bag or map is empty.

a) The basic ISEMPTY function example:

emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
dept = LOAD '/Documents/tbl_DEPT.txt' USING PigStorage(',') as (deptno:int, dname:chararray, loc:chararray); /* department file; the path and file name shown here are assumed */
ee = FOREACH emp GENERATE empno, ename, job, mgr, sal, comm, deptno;
dd = FOREACH dept GENERATE deptno, dname, loc;

/* another way to do join, allows us to do left or right joins and checks */
cogrp1 = cogroup ee by (deptno), dd by (deptno);

/* filter out records from dept that are not in emp */
fltr_jn1 = filter cogrp1 by IsEmpty(ee) == true;
rslt = foreach fltr_jn1 generate flatten(dd);
dump rslt;

Results: Please note that the IsEmpty function is applied on the ee bag.

(40,OPERATIONS,BOSTON)
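The reverse check works the same way: to list employees whose deptno has no matching department, the filter can be applied on the dd bag instead (a sketch using the same cogroup relation):

fltr_jn2 = filter cogrp1 by IsEmpty(dd);
rslt2 = foreach fltr_jn2 generate flatten(ee);
dump rslt2;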

______________________________________________________________________________

Max/Min

Computes the maximum or minimum of the numeric values or chararrays in a single-column bag. MAX requires a preceding GROUP ALL statement for global maximums and a GROUP BY statement for group maximums.

a) The basic Max and Min function example:
/* Department-wise minimum and maximum salary */
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
/* FOREACH here works on the inline GROUP operator */
mm = FOREACH (GROUP emp by deptno) GENERATE MIN(emp.sal), MAX(emp.sal); 
dump mm;

Results:
(1300.0,5000.0)
(800.0,3000.0)
(950.0,2850.0)
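For the global minimum and maximum across the whole table, the same functions follow a GROUP ALL instead (a sketch on the same emp relation):

allgrp = GROUP emp ALL;
gmm = FOREACH allgrp GENERATE MIN(emp.sal), MAX(emp.sal);
dump gmm;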

______________________________________________________________________________



Thank you!

References:
  • http://pig.apache.org/docs/r0.15.0/





Apache Pig Grunt shell Help Commands




The following commands are based on Apache Pig 0.15.0, the latest version at the time. Please make sure to review my earlier posts in relation to this.

These commands can be viewed once you have entered the Grunt shell, in either local or MapReduce mode.

grunt> help

Commands:

<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig
File system commands:
    fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html
Diagnostic commands:
    describe <alias>[::<alias] - Show the schema for the alias. Inner aliases can be described as A::B.
    explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<param_value>]
        [-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.
        -script - Explain the entire script.
        -out - Store the output into directory rather than print to stdout.
        -brief - Don't expand nested plans (presenting a smaller graph for overview).
        -dot - Generate the output in .dot format. Default is text format.
        -xml - Generate the output in .xml format. Default is text format.
        -param <param_name - See parameter substitution for details.
        -param_file <file_name> - See parameter substitution for details.
        alias - Alias to explain.
    dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands:

    exec [-param <param_name>=param_value] [-param_file <file_name>] <script> - 
        Execute the script with access to grunt environment including aliases.
        -param <param_name - See parameter substitution for details.
        -param_file <file_name> - See parameter substitution for details.
        script - Script to be executed.
    run [-param <param_name>=param_value] [-param_file <file_name>] <script> - 
        Execute the script with access to grunt environment. 
        -param <param_name - See parameter substitution for details.
        -param_file <file_name> - See parameter substitution for details.
        script - Script to be executed.
    sh  <shell command> - Invoke a shell command.
    kill <job_id> - Kill the hadoop job specified by the hadoop job id.
    set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
        The following keys are supported: 
        default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.
        debug - Set debug on or off. Default is off.
        job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
        job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal
        stream.skippath - String that contains the path. This is used by streaming.
        any hadoop property.
    help - Display this message.
    history [-n] - Display the list statements in cache.
        -n Hide line numbers. 
    quit - Quit the grunt shell.
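As a quick illustration, a typical Grunt session might combine a few of these commands (a sketch reusing the emp relation and file path from the Eval Functions post above):

grunt> emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
grunt> describe emp;
grunt> explain -brief emp;
grunt> dump emp;
grunt> quit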

Feb 17, 2016

Apache Pig Command line Options




The following options are based on Apache Pig 0.15.0, the latest version at the time. Please make sure to review my earlier post in relation to this.

Link:
Ways of invoking Pig or Pig script from command line


Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35

USAGE: 


Pig [options] [-] : Run interactively in grunt shell.
Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
Pig [options] [-f[ile]] file : Run cmds found in file.

options include:
    -4, -log4jconf - Log4j configuration file, overrides log conf
    -b, -brief - Brief logging (no timestamps)
    -c, -check - Syntax check
    -d, -debug - Debug level, INFO is default
    -e, -execute - Commands to execute (within quotes)
    -f, -file - Path to the script to execute
    -g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
    -h, -help - Display this message. You can specify topic to get help for that topic.
        properties is the only topic currently supported: -h properties.
    -i, -version - Display version information
    -l, -logfile - Path to client side log file; default is current working directory.
    -m, -param_file - Path to the parameter file
    -p, -param - Key value pair of the form param=val
    -r, -dryrun - Produces script with substituted parameters. Script is not executed.
    -t, -optimizer_off - Turn optimizations off. The following values are supported:
      ConstantCalculator - Calculate constants at compile time
      SplitFilter - Split filter conditions
      PushUpFilter - Filter as early as possible
      MergeFilter - Merge filter conditions
      PushDownForeachFlatten - Join or explode as late as possible
      LimitOptimizer - Limit as early as possible
      ColumnMapKeyPrune - Remove unused data
      AddForEach - Add ForEach to remove unneeded columns
      MergeForEach - Merge adjacent ForEach
      GroupByConstParallelSetter - Force parallel 1 for "group all" statement
      PartitionFilterOptimizer - Pushdown partition filter conditions to loader implementing LoadMetaData
      PredicatePushdownOptimizer - Pushdown filter predicates to loader implementing LoadPredicatePushDown
      All - Disable all optimizations
      All optimizations listed here are enabled by default. Optimization values are case insensitive.
    -v, -verbose - Print all error messages to screen
    -w, -warning - Turn warning logging on; also turns warning aggregation off
    -x, -exectype - Set execution mode: local|mapreduce|tez, default is mapreduce.
    -F, -stop_on_failure - Aborts execution on the first failed job; default is off
    -M, -no_multiquery - Turn multiquery optimization off; default is on
    -N, -no_fetch - Turn fetch optimization off; default is on
    -P, -propertyFile - Path to property file
    -printCmdDebug - Overrides anything else and prints the actual command used to run Pig, including any environment variables that are set by the pig command.
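
For example, a script could first be syntax-checked and then run locally with a parameter (the script name and parameter are hypothetical):

pig -x local -c myscript.pig
pig -x local -p INPUT=/Documents/tbl_EMP.txt -f myscript.pig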

Thank you!

References: http://pig.apache.org/

Feb 15, 2016

Apache Pig Exercises: 22. List total sales per departmental store wise and consumer wise



In this post the sample Apache Pig scripts display the total sales with respect to departmental stores and to consumers.

The examples and exercise scripts were created using Apache Pig r0.14.0, the current version at the time.

@ Test data structure and sample data:

person,  dstore, spent
A, S, 3.3
A, S, 4.7
B, S, 1.2
B, T, 3.4
C, Z, 1.1
C, T, 5.5
D, R, 1.1

@ Apache Pig Script:

a) List total sales per department stores:

grunt> 
data = LOAD 'Documents/store_transactions.txt' using PigStorage(',') as (person:chararray, dstores:chararray, spent:float);

grp = FOREACH (GROUP data BY dstores) {
/* dereferencing is required to access the columns of the original relation */
GENERATE group, COUNT(data.person) AS visitors, (float)SUM(data.spent) AS revenue;
};

dump grp;

@Apache Pig Output on Grunt Shell: The output has dept. store, customer count, and total sales.

( R,1,1.1)
( S,3,9.2)
( T,2,8.9)
( Z,1,1.1)


@ Apache Pig Script:

b) List total sales per customer:

data = LOAD 'Documents/store_transactions.txt' using PigStorage(',') as (person:chararray, dstores:chararray, spent:float);

grp = FOREACH (GROUP data BY person) {
GENERATE group, COUNT(data.person) AS visitors, (float)SUM(data.spent) AS revenue; -- dereferencing is required to access the columns of the original relation
};

dump grp;

@Apache Pig Output on Grunt Shell: The output has customer id, total transactions per customer, and total sales.

(A,2,8.0)
(B,2,4.6000004)
(C,2,6.6)
(D,1,1.1)

_________________
Thank you!

Feb 13, 2016

What is Dereference in Apache Pig?




What is a Dereference?

Many times it is necessary to reference a field in a tuple or a bag that is outside the current operator's scope. Here is the complete Pig script for your review, which we will use to discuss dereferencing:

data = load 'books.txt' using PigStorage(',') as (f1:int, f2:chararray, f3:chararray, f4:int);
aaa = group data by f1;
bbb = FOREACH aaa GENERATE group, data.f2, data.f3;
dump bbb;
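
The books.txt file itself is not shown in the post; for the script above to run it only needs four comma-separated fields matching the declared schema, for example (hypothetical rows):

1,Programming Pig,Reference,2011
1,Pig Design Patterns,Cookbook,2014
2,Hadoop Operations,Reference,2012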



The dereferencing can be done in the following manners.


a) Dereferencing fields created in a tuple or a bag:
    Dereferencing fields this way can be observed with Pig's FOREACH operator:

    bbb = FOREACH aaa GENERATE group, data.f2, data.f3;

    Notice that in the above line of code the fields f2 and f3 are not part of the
    relation aaa (please refer to the complete Pig example script shown above).

    Thus, in order to reference them they have to be qualified by the tuple or bag they
    belong to. Since the fields f2 and f3 are defined in the relation data, we can use
    them (as data.f2 and data.f3) to create subsequent relations.


b) Dereferencing fields by their positions:
    We can use the same example to dereference the fields by their positions in the relation
    in which they were created. This line dereferences the same fields as described above
    (please refer to the complete Pig example script shown above):

    bbb = FOREACH aaa GENERATE group, data.$1, data.$2;

Thanks!


Feb 6, 2016

Cannot locate pig-core-h2.jar. do 'ant -Dhadoopversion=23 jar', and try again


Error:

While running Pig (in local mode from the command line) we may come across the following error:

Cannot locate pig-core-h2.jar. do 'ant -Dhadoopversion=23 jar', and try again.


Solution:

Please make sure that in your .bash_profile or .bashrc you have the following environment variables:

export PIG_HOME="/Library/apache-pig-0.15.0"

export PATH="/Library/apache-pig-0.15.0/bin:${PATH}"

Then restart the machine or restart the Unix terminal.
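
Alternatively, the changes can be picked up without a full restart by sourcing the profile and then verifying the setup (adjust the profile file to the one you actually use):

source ~/.bash_profile
echo $PIG_HOME
pig -version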

Thanks!