How to work with Apache Pig Eval Functions?
In this post I am focussing on the following Eval Functions. The examples shown are developed Pig v.0.15.0 as the latest binary available.
Eval Functions
Data:
Department file:
Employees file:
______________________________________________________________________________
EVAL Function Examples:
AVG:
The function computes the average of the numeric values in a single-column bag.
a) The basic AVG function example:
/*Average salary per department*/
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
gg = GROUP emp by deptno;
ag = FOREACH gg GENERATE group, (float)AVG(emp.sal);
dump ag;
Results:
(10,2916.6667)
(20,2175.0)
(30,1566.6666)
b) The below example is with the nested FOREACH use:
emp = LOAD '/Documents/tbl_EMP
.txt' USING PigStorage
(',') as (empno
:int, ename
:chararray
, job:chararray
, mgr:int, hiredate
:chararray
, sal
:float, comm:float, deptno
:int);
avggrp1 = FOREACH (GROUP emp by deptno) {
unqdept = DISTINCT emp.deptno;
GENERATE FLATTEN(unqdept), (int)SUM(emp.sal), (int)AVG(emp.sal);
};
dump ag;
Results:
(10,8750,2916)
(20,10875,2175)
(30,9400,1566)
______________________________________________________________________________
BagToString:
Concatenate the elements of a Bag into a chararray string, placing an optional delimiter between each value.
a) The basic BagToString function example:
/* we have used the same example as above */
emp = LOAD '/Documents/tbl_EMP
.txt' USING PigStorage
(',') as (empno
:int, ename
:chararray
, job:chararray
, mgr:int, hiredate
:chararray
, sal
:float, comm:float, deptno
:int);
gg = GROUP emp by deptno;
ag = FOREACH gg GENERATE group, (float)AVG(emp.sal), BagToString(emp.(deptno, sal), ' |');
dump ag;
Results:
The below results in blue font shows bag details converted into string. It comprised of the individual deptno and individual salary that was used in average function show up in one of the examples.
(10,2916.6667,10 |1300.0 |10 |5000.0 |10 |2450.0)
(20,2175.0,20 |1100.0 |20 |3000.0 |20 |800.0 |20 |2975.0 |20 |3000.0)
(30,1566.6666,30 |1500.0 |30 |1600.0 |30 |2850.0 |30 |1250.0 |30 |1250.0 |30 |950.0)
______________________________________________________________________________
CONCAT:
Concatenates two or more expressions of identical type.
a) The basic CONCAT function example:
emp = LOAD '/Documents/tbl_EMP
.txt' USING PigStorage
(',') as (empno
:int, ename
:chararray
, job:chararray
, mgr:int, hiredate
:chararray
, sal
:float, comm:float, deptno
:int);
ee = FOREACH emp GENERATE CONCAT((chararray)empno,'|',ename,'~', job,'~',(chararray)mgr,'|',(chararray)deptno) as big_col;
dump ee;
Results:
Please note that I have used a different field separator as an example. The row with no values can be filtered using filter option.
(7369|SMITH~CLERK~7902|20)
(7499|ALLEN~SALESMAN~7698|30)
(7521|WARD~SALESMAN~7698|30)
(7566|JONES~MANAGER~7839|20)
(7654|MARTIN~SALESMAN~7698|30)
(7698|BLAKE~MANAGER~7839|30)
(7782|CLARK~MANAGER~7839|10)
(7788|SCOTT~ANALYST~7566|20)
()
(7844|TURNER~SALESMAN~7698|30)
(7876|ADAMS~CLERK~7788|20)
(7900|JAMES~CLERK~7698|30)
(7902|FORD~ANALYST~7566|20)
(7934|MILLER~CLERK~7782|10)
(7934|MILLER~CLERK|7782~10)
above example using filter:
ee = FOREACH emp GENERATE CONCAT((chararray)empno,'|',ename,'~', job,'~',(chararray)mgr,'|',(chararray)deptno) as big_col;
ff = FILTER ee $0 by != '';
dump ff;
Results:
(7369|SMITH~CLERK|7902~20)
(7499|ALLEN~SALESMAN|7698~30)
(7521|WARD~SALESMAN|7698~30)
(7566|JONES~MANAGER|7839~20)
(7654|MARTIN~SALESMAN|7698~30)
(7698|BLAKE~MANAGER|7839~30)
(7782|CLARK~MANAGER|7839~10)
(7788|SCOTT~ANALYST|7566~20)
(7844|TURNER~SALESMAN|7698~30)
(7876|ADAMS~CLERK|7788~20)
(7900|JAMES~CLERK|7698~30)
(7902|FORD~ANALYST|7566~20)
(7934|MILLER~CLERK|7782~10)
______________________________________________________________________________
COUNT:
Computes the number of elements in a bag. This function does not consider NULL values or records.
a) The basic COUNT function example:
emp = LOAD '/Documents/tbl_EMP
.txt' USING PigStorage
(',') as (empno
:int, ename
:chararray
, job:chararray
, mgr:int, hiredate
:chararray
, sal
:float, comm:float, deptno
:int);
grp1 = FOREACH (GROUP emp by mgr) { GENERATE emp.mgr, (INT)COUNT(emp.mgr); };
dump grp1;
Results:
({(7566),(7566)},2)
({(7698),(7698),(7698),(7698),(7698)},5)
({(7782)},1)
({(7788)},1)
({(7839),(7839),(7839)},3)
({(7902)},1)
({()},0)
Please note the blue highlighted record is not considered by COUNT function and the actual count produced as 0 instead of 1. This can be fixed by using COUNT_STAR function.
______________________________________________________________________________
COUNT_STAR:
Computes the number of elements in a bag.
a) The basic COUNT_STAR function example:
emp = LOAD '/Documents/tbl_EMP
.txt' USING PigStorage
(',') as (empno
:int, ename
:chararray
, job:chararray
, mgr:int, hiredate
:chararray
, sal
:float, comm:float, deptno
:int);
grp1 = FOREACH (GROUP emp by mgr) { GENERATE emp.mgr, (INT)COUNT_ALL(emp.mgr); };
dump grp1;
Results:
Please note the blue highlighted record is considered by COUNT_ALL function and the count produced as 1 instead of 0 (when compared to just COUNT function).
({(7566),(7566)},2)
({(7698),(7698),(7698),(7698),(7698)},5)
({(7782)},1)
({(7788)},1)
({(7839),(7839),(7839)},3)
({(7902)},1)
({()},1)
______________________________________________________________________________
DIFF
Compares two fields in a tuple.
a) The basic DIFF function example:
Please note that I had to create relation with two different bags from same dataset to be able to show how DIFF function works and can find exact differences between tow datasets.
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
/*Made a copy of emp dataset */
emp2 = FOREACH emp GENERATE *;
/* made 2 sets with different manager for testing DIFF function*/
ee1 = FOREACH (FILTER emp by mgr == 7839) GENERATE empno, ename, job, mgr, sal, comm, deptno;
ff1 = FOREACH (FILTER emp2 by mgr == 7698) GENERATE empno, ename, job, mgr, sal, comm, deptno;
/* converting columns from different datasets into bag as Diff function works on bag*/
enm1 = FOREACH (COGROUP ee1 by (empno), ff1 by (empno)) GENERATE TOBAG(ee1.empno, ee1.empno), TOBAG(ff1.empno, ff1.mgr);
dump enm1;
/* Results before diff function*/
({({}),({})},{({}),({})})
({({}),({})},{({}),({})})
({({(7839)}),({(7566)})},{({(7566)}),({(7566)})})
({({}),({})},{({}),({})})
({({(7839)}),({(7698)})},{({(7698)}),({(7698)})})
({({(7839)}),({(7782)})},{({(7782)}),({(7782)})})
({({}),({})},{({}),({})})
({({}),({})},{({}),({})})
/* applying diff function on the our relation that had 2 bags */
enm1 = FOREACH (COGROUP ee1 by (empno), ff1 by (empno)) GENERATE DIFF(TOBAG(ee1.empno, ee1.empno), TOBAG(ff1.empno, ff1.mgr));
dump enm1;
/* Results after diff function*/
({})
({})
({({(7839)})})
({})
({({(7839)})})
({({(7839)})})
({})
({})
______________________________________________________________________________
IsEmpty
Checks if a bag or map is empty.
a) The basic ISEMPTY function example:
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
dept = LOAD 'tbl_EMP.txt' USING PigStorage(',') as (deptno:int, dname:chararray, loc:chararray);
ee = FOREACH emp GENERATE empno, ename, job, mgr, sal, comm, deptno;
dd = FOREACH dept GENERATE deptno, dname, loc;
/* another way to do join, allows us to do left or right joins and checks */
cogrp1 = cogroup ee by (deptno), dd by (deptno);
/* filter out records from dept that are not in emp */
fltr_jn1 = filter cogrp1 by IsEmpty(ee) == true;
rslt = foreach fltr_jn1 generate flatten(dd);
dump rslt;
Results: Please note the blue highlighted ISEMPTY function is applied on ee bag.
(40,OPERATIONS,BOSTON)
______________________________________________________________________________
Max/Min
Computes the maximum or minimum of the numeric values or chararrays in a single-column bag. MAX requires a preceding GROUP ALL statement for global maximums and a GROUP BY statement for group maximums.
a) The basic Max and Min function example:
/* Dept wise average salary*/
emp = LOAD '/Documents/tbl_EMP.txt' USING PigStorage(',') as (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, comm:float, deptno:int);
/* Foreach works on inner group operator */
mm = FOREACH (GROUP emp by deptno) GENERATE MIN(emp.sal), MAX(emp.sal);
dump mm;
Results:
(1300.0,5000.0)
(800.0,3000.0)
(950.0,2850.0)
______________________________________________________________________________
Thank you!
References:
- http://pig.apache.org/docs/r0.15.0/