Feb 13, 2016

What is Dereference in Apache Pig?




What is a Dereference?

It is often necessary to reference a field inside a tuple or a bag that lies outside the current operator's scope. Here is the complete Pig script that the rest of this post uses to discuss dereferencing:

data = load 'books.txt' using PigStorage(',') as (f1:int, f2:chararray, f3:chararray, f4:int);
aaa = group data by f1;
bbb = FOREACH aaa GENERATE group, data.f2, data.f3;
dump bbb;



Dereferencing can be done in the following ways.


a) Dereferencing fields inside a tuple or bag:
    Dereferencing fields this way can be seen with Pig's FOREACH operator:

    bbb = FOREACH aaa GENERATE group, data.f2, data.f3;

    Notice that in the line above, the fields f2 and f3 are not top-level fields of the
    relation aaa (please refer to the complete Pig script shown above). After the GROUP,
    aaa has only two fields: group and a bag named data.

    Thus, in order to reference them, they have to be qualified with the name of the tuple
    or bag that contains them. The fields f2 and f3 are defined in the relation data, so
    writing data.f2 and data.f3 lets us use them to create subsequent relations.
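
    To see why the qualification is needed, it can help to inspect the schema of the grouped relation. The sketch below assumes books.txt has four comma-separated columns named f1 through f4 (the names and types are assumptions for illustration) and uses DESCRIBE to print the schema Pig has inferred at each step.

```pig
-- Assumed schema: four columns f1..f4 (illustrative names and types).
data = LOAD 'books.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:chararray, f4:int);

-- GROUP produces one tuple per distinct f1 value with two fields:
--   group : the grouping key (the f1 value)
--   data  : a bag holding all original tuples that share that key
aaa = GROUP data BY f1;
DESCRIBE aaa;

-- f2 is not a top-level field of aaa, so it is reached through the bag.
-- data.f2 yields a bag of single-field tuples, one per grouped row.
bbb = FOREACH aaa GENERATE group, data.f2;
DESCRIBE bbb;
```

    A bare f2 in the FOREACH would be expected to fail to resolve, since aaa itself only exposes group and data.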


b) Dereferencing fields by their positions:
    We can use the same example to dereference the fields by their positions in the relation
    where they were created. Positions start at $0, so $1 and $2 are the second and third
    fields. This example dereferences the same fields as in (a). Please refer to the
    complete Pig script shown above.

    bbb = FOREACH aaa GENERATE group, data.$1, data.$2;
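
    The same dot notation covers a couple of related cases. The sketch below, again assuming a four-column schema named f1 through f4 (illustrative only), contrasts projecting fields into separate bags with projecting them into one bag. It also shows a tuple being dereferenced: when grouping on more than one field, group itself becomes a tuple. The aliases two_bags, one_bag, ccc and ddd are hypothetical names chosen for this example.

```pig
-- Assumed schema: four columns f1..f4 (illustrative names and types).
data = LOAD 'books.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:chararray, f4:int);
aaa = GROUP data BY f1;

-- Two separate bags, one per projected field (positions are zero-based).
two_bags = FOREACH aaa GENERATE group, data.$1, data.$2;

-- One bag whose tuples keep both fields together.
one_bag = FOREACH aaa GENERATE group, data.($1, $2);

-- Grouping on several fields makes group a tuple, which can be
-- dereferenced by name or by position just like a bag.
ccc = GROUP data BY (f1, f4);
ddd = FOREACH ccc GENERATE group.f1, group.$1, COUNT(data);
```

    Positional references keep working when no schema is declared, while names tend to be easier to read and survive column reordering, so which to use depends on how stable the input layout is.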

Thanks!

