Learning ECL Programming language

Posted on 06-03-2023 Edited on 08-12-2023 In HPCC Systems Comments: Views:

Learn the basics of ECL, the powerful programming language built for big data analytics.

File type

There are two kinds of operations in ECL, definition (Definitions) and execution (Actions). After using EXPORT to execute the definition operation, it will no longer be able to perform operations in this file.

Similarly, ECL has two kinds of files. Their suffixes are .ecl. Definitions file and build execution (BWR, Builder Window Runnable) file. The difference is

The definition file always contains EXPORT and SHARED definitions and contains no execution operations. Therefore, the file cannot be executed through Submit.
The BWR file contains at least one execution operation and no EXPORT and SHARED definition operations.

Define variables

Variable names cannot have spaces and end with a semicolon. Format for defining variables:

1	[Scope] [ValueType] Name [ (parms) ] := Expression [ :WorkflowService] ;

For example:

My_First_Definition1 := 5; // valid name
My First Definition := 5;  // INVALID name, spaces not allowed
MyFirstDefinition := 5; //defined as 5
MySecondDefinition := MyFirstDefinition + 5; //this is 10

Variable name cannot start with UNICODE_ , UTF8_, VARUNICODE_

Basic variable type

Boolean

Boolean can be defined by expression, TRUE or FALSE, for example:

1
2
3

IsBoolTrue  := TRUE;
IsFloridian := Person.per_st = 'FL';
IsOldPerson := Person.Age >= 65;

Value

Value can be defined by an expression, and the result must be an arithmetic value or a string:

1
2
3

ValueTrue      := 1;
FloridianCount := COUNT(Person(Person.per_st = 'FL'));
OldAgeSum     := SUM(Person(Person.Age >= 65),Person.Age);

INTEGER

1 2	[IntType] [UNSIGNED] INTEGER[n] [IntType] UNSIGNEDn

Among them, n indicates that this integer occupies number of bytes, which can be 1~8. The default is 8.

IntType describes whether the high bit of the number is at the low address or the low bit is at the low address. Can take either BIG_ENDIAN or LITTLE_ENDIAN. Default is LITTLE_ENDIAN.

UNSIGNED, used to describe whether it is signed or not, the default is signed.

REAL[** n **] represents a floating point number, n can be 4 (7 significant figures) or 8 (15 significant figures)

Set

All elements must be of the same type.

Example:

1
2
3

SetInts  := [1,2,3,4,5]; // an INTEGER set with 5 elements
SetExp := [1,2+3,45,SomeIntegerDefinition,7*3];
SetSomeField := SET(SomeFile, SomeField);

SET can be accessed by subscript, subscript starts from 1

1
2
3

MySet := [5,4,3,2,1];
ReverseNum := MySet[2]; //indexing to MySet's element number 2,
                       //so ReverseNum contains the value 4

Strings are treated as SET with multiple 1-character elements, so they can also be accessed by subscript

1 2	MyString := 'ABCDE'; MySubString := MyString[2]; // MySubString is 'B'

Strings support range access:

MyString := 'ABCDE';
MySubString1 := MyString[2..4]; // MySubString1 is 'BCD'
MySubString2 := MyString[ ..4]; // MySubString2 is 'ABCD'
MySubString3 := MyString[2.. ]; // MySubString3 is 'BCDE'

The data type in the Set can be specified:

SET OF INTEGER1 SetValues := [5,10,15,20];

IsInSetFunction(SET OF INTEGER1 x=SetValues,y) := y IN x;

OUTPUT(IsInSetFunction([1,2,3,4],5)); //false
OUTPUT(IsInSetFunction(,5)); // true

Keywords

EXPORT

How to use: EXPORT [ VIRTUAL ] definition, only one EXPORT Module is allowed in each file, and the name of this Module must be the same as the file name.

VIRTUAL is optional. If specified, the definition is only valid within the Module. Allows usage as Module.Definition from other files.

EXPORT allows nesting. If you want to access a value in Module from another file, this value must also be modified by EXPORT.

Example, file1:

EXPORT file1 := MODULE
  Value1 := 5;
  EXPORT Value2 :=6;
END;

file2:

IMPORT MyTest;
myVar := MyTest.Value2;
//myVar := MyTest.Value1; // 报错，Value1不可见
OUTPUT(myVar);    //输出6

Data structure

ENUM

Enums can be useful when you want to represent a limited set of possible values for a variable or a parameter. For example:

Color := ENUM(RED=1, GREEN=2, BLUE=3);
myColor := Color.RED;

OUTPUT(myColor);  # 1

RECORD

A RECORD in ECL represents the structure or format of a dataset. It is similar to the concept of a "table" in a SQL database, where each field in the record is similar to a column in the table. It defines the data types and names of fields.

For example:

ChildRec := RECORD
    UNSIGNED4 person_id;
    STRING20 per_surname;
    STRING20 per_forename;
END;

rec := RECORD
  STRING20 name;
  INTEGER4 age;
  BOOLEAN isEmployed;
END;

usually used with DATASET.

DATASET

It represents a set of data. A dataset is a group of records with the same record layout. A record layout is defined using the RECORD structure, which contains a set of fields, each with a name and a type.

How to use:

attr := DATASET( file, struct, filetype [,LOOKUP]);

attr := DATASET( dataset, file, filetype [,LOOKUP]);

attr := DATASET( WORKUNIT( [ wuid , ] namedoutput ), struct );

[ attr := ] DATASET( recordset [, recstruct ] );

DATASET( row )

DATASET( childstruct [, COUNT( count ) | LENGTH( size ) ] [, CHOOSEN( maxrecs ) ] )

[GROUPED] [LINKCOUNTED] [STREAMED] DATASET( struct )

DATASET( dict )

DATASET( count, transform [, DISTRIBUTED | LOCAL ] )

Example:

rec := RECORD
  STRING20 name;
  INTEGER4 age;
  BOOLEAN isEmployed;
END;

myData := DATASET([{ 'John', 30, TRUE }, { 'Jane', 25, FALSE }], rec);

OUTPUT(myData)

Construct the dataset:

Use DATASET( count, transform [, DISTRIBUTED | LOCAL ] ), for example:

RAND_MAX := POWER(2,32) -1;
trainCount := 1000;
featureCount := 5;
trainRec := RECORD
  UNSIGNED8 id;
  SET OF REAL x;
  REAL4 y;
END;

REAL4 targetFunc(REAL4 x1, REAL4 x2, REAL4 x3, REAL4 x4, REAL4 x5) := FUNCTION
  rslt := .5 * POWER(x1, 4) - .4 * POWER(x2, 3) + .3 * POWER(x3,2) - .2 * x4 + .1 * x5;
  RETURN rslt;
END;

train0 := DATASET(trainCount, TRANSFORM(trainRec,
                      SELF.id := COUNTER,
                      SELF.x := [(RANDOM() % RAND_MAX) / RAND_MAX -.5,
                                  (RANDOM() % RAND_MAX) / RAND_MAX -.5,
                                  (RANDOM() % RAND_MAX) / RAND_MAX -.5,
                                  (RANDOM() % RAND_MAX) / RAND_MAX -.5,
                                  (RANDOM() % RAND_MAX) / RAND_MAX -.5],
                      SELF.y := targetFunc(SELF.x[1], SELF.x[2], SELF.x[3], SELF.x[4], SELF.x[5]))
                      );
OUTPUT(train0, NAMED('trainData'));

// select the first record
record1 := CHOOSEN(train0, 1);

// extract the x field
xField := PROJECT(record1, TRANSFORM({SET OF REAL x}, SELF.x := LEFT.x));

// output x field
OUTPUT(xField, NAMED('xFieldOfRecord1'));

Built-in functions and operations

OUTPUT

This function is for output values.

[attr := ] OUTPUT(recordset [, [ format ] [,file [thorfileoptions ] ] [, NOXPATH ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] );

[attr := ] OUTPUT(recordset, [ format ] ,file , CSV [ (csvoptions) ] [csvfileoptions ] [, NOXPATH ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] );

[attr := ] OUTPUT(recordset, [ format ] , file , XML [ (xmloptions) ] [xmlfileoptions ] [, NOXPATH ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] );

[attr := ] OUTPUT(recordset, [ format ] , file , JSON [ (jsonoptions) ] [jsonfileoptions ] [, NOXPATH ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] );

[attr := ] OUTPUT(recordset, [ format ] ,PIPE( pipeoptions [, NOXPATH ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] );

[attr := ] OUTPUT(recordset [, format ] , NAMED( name ) [,EXTEND] [,ALL] [, NOXPATH ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] );

[attr := ] OUTPUT( expression [, NAMED( name ) ] [, NOXPATH ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] );

[attr := ] OUTPUT( recordset , THOR [, NOXPATH ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] );

For example, to output 111 to the test panel, you can write:

1	OUTPUT(1111, NAMED('test'));

TABLE

Used to create a new dataset (Dataset). The TABLE function takes a set of records (each defined by a RECORD structure) and an optional filter condition, and returns a new dataset.

person := RECORD
  STRING20 name;
  UNSIGNED1 age;
END;

people := DATASET
  (
    [ 
      {'John', 30 },
      {'Jane', 25 },
      {'Tom', 35 }
    ], 
    person
  );

adults := TABLE
  (
    people, 
    {
      STRING20 name := name;
      UNSIGNED1 age := age;
      BOOLEAN isAdult := age >= 18;
    }
  );

OUTPUT(adults);

PROJECT

Execute the TRANSFORM operation on each item in the DATASET.

TRANSFORM

Convert one DATASET to another DATASET according to the rules.

IMPORT STD;

Layout_Person := RECORD
  STRING1 num;
  STRING4 name;
END;

Layout_Output := RECORD
  Layout_Person;
  STRING4 upper_name;
END;

ds_Persons := DATASET([{'1', 'one'}, {'2', 'two'}, {'3', 'three'}], Layout_Person);

ds_Output := PROJECT(ds_Persons,
                    TRANSFORM(Layout_Output,
                              SELF.num := LEFT.num,
                              SELF.name := LEFT.name,
                              SELF.upper_name := STD.Str.ToUpperCase(LEFT.name)));

OUTPUT(ds_Output);

result:

1	one 	ONE 
2	two 	TWO 
3	thre	THRE

NORMALIZE

In PROJECT, the TRANSFORM operation is performed on each piece of data, and then a new data set is obtained, which is one-to-one in quantity. And NORMALIZE is to expand a single piece of data into multiple pieces of data. In some cases, you may have a field that contains repeated data, and you may wish to split each repetition into a separate record. In this case, you can use the NORMALIZE function. The NORMALIZE function is used by receiving a dataset and a TRANSFORM function and returning a new dataset. In the conversion function, you need to define how to split the original records into new records.

Layout := RECORD
  STRING10 name;
  INTEGER4 times;
END;

ds := DATASET([{'John', 3}, {'Jane', 2}], Layout);

Layout_Output := RECORD
  STRING10 name;
END;

ds_Output := NORMALIZE(ds, LEFT.times,
                       TRANSFORM(Layout_Output,
                                 SELF.name := LEFT.name));

OUTPUT(ds_Output);

ds.times indicates the number of times to repeat, and the TRANSFORM function defines how to convert the original record into a new record. In the output dataset, John and Jane will appear 3 and 2 times, respectively. Note that in the NORMALIZE function, for the current data, use LEFT reference.

JOIN

Merge the two DATASETs.

JOIN( leftrecset, rightrecset, joincondition [** , transform **] [** , jointype **] [** , joinflags **] )

Layout := RECORD
  STRING1 id;
  STRING10 name;
END;

Layout2 := RECORD
  STRING1 id;
  STRING10 job;
END;

OutputLayout := RECORD
  STRING1 id;
  STRING20 name;  
END;

ds1 := DATASET([{'1', 'John'}, {'2', 'Jane'}], Layout);
ds2 := DATASET([{'1', 'Engineer'}, {'2', 'Doctor'}], Layout2);

ds_join := JOIN(ds1, ds2, LEFT.id = RIGHT.id,
                TRANSFORM(OutputLayout,
                          SELF.id := LEFT.id,
                          SELF.name := (LEFT.name + ', ' + RIGHT.job)));

OUTPUT(ds_join);

EMBED

Embedding code in other languages.

EMBED(language)
    embed code.....
ENDEMBED;

Example:

IMPORT Python3 AS Python;

INTEGER4 MySquare(INTEGER4 val) := EMBED(Python)
    return val * val
ENDEMBED;

OUTPUT(MySquare(5));    // 25

LOCAL

The LOCAL keyword is mainly used to limit the scope of data or functions, or to control how data is distributed in the cluster.

If a definition (for example, a dataset or function) is declared LOCAL, then this definition is only visible in the ECL statement in which it is declared. This is similar to local variables in other programming languages. For example:

ECLCopy codemyFunction := FUNCTION
  LOCAL myLocalValue := 5;  // Only available in this function
  RETURN myLocalValue * 2;
END;

Additionally, when the LOCAL keyword is used on a dataset, it indicates that the dataset should be computed locally on each node, rather than across the entire cluster. This can be useful for reducing network communication and speeding up calculations. For example, the following ECL statement creates a dataset that is computed locally on each node:

1 2	ECLCopy code myLocalDataset := DATASET([{1, 'one'}, {2, 'two'}, {3, 'three'}], myRecordDef, LOCAL);

In this example, myRecordDef is a record definition that describes the structure of each record in the dataset. Each node will process a portion of this dataset, not the entire dataset.

ASSERT

It is often used to judge whether a function has obtained the expected result.

ASSERT( condition [** , message **] [** , **FAIL ] [** , **CONST ])

ASSERT( recset, condition [** , message **] [** , **FAIL ] [** , **CONST ] [, UNORDERED | ORDERED(** bool **) ] [, STABLE | UNSTABLE ] [, PARALLEL [ (** numthreads **) ] ] [, ALGORITHM(** name **) ] )

ASSERT( recset, assertlist [, UNORDERED | ORDERED(** bool **) ] [, STABLE | UNSTABLE ] [, PARALLEL [ (** numthreads **) ] ] [, ALGORITHM(** name **) ] )

recset can be a DATASET, and ASSERT judges the results one by one.

IMPORT Python3 AS Python;
IMPORT Std.System.Thorlib;

nNodes := Thorlib.nodes();
nodeId := Thorlib.node();
testRec := RECORD
  UNSIGNED node;
  STRING tf_version;
  STRING result := '';
END;
DATASET(testRec) DoTest := FUNCTION
  DATASET(testRec) testTF(DATASET(testRec) test) := EMBED(Python: activity)
    import traceback as tb
    nodeId = 999
    v = 0
    try:
      for tr in test:
        # Should only be one.
        nodeId, unused, result = tr
        try:
          import tensorflow.compat.v1 as tf # V2.x
          tf.disable_v2_behavior()
        except:
          import tensorflow as tf # V 1.x
        s = tf.Session()
        v = tf.__version__
        return [(nodeId, v, 'SUCCESS')]
    except:
      return [(nodeId, 'unknown', tb.format_exc())]
  ENDEMBED;
  inp := DATASET([{nodeId, '', ''}], testRec, LOCAL);
  outp := testTF(inp);
  RETURN ASSERT(outp, result = 'SUCCESS', 'Error on node ' + node + ': ' + result);
END;

OUTPUT(DoTest, NAMED('Result'));

SORT

The SORT function is used to sort the records in the data set (DATASET) according to the specified field. Its basic usage syntax is as follows:

1	SortedDataset := SORT(Dataset, Field);

Here, Dataset is the dataset you want to sort by, and Field is the field you want to sort by.

A dataset that contains employee information and wants to sort by the Salary field:

Employee := RECORD
  STRING20 Name;
  STRING20 Occupation;
  UNSIGNED Salary;
END;

Employees := DATASET([
  {'John', 'Doctor', 80000},
  {'Alice', 'Engineer', 70000},
  {'Bob', 'Teacher', 60000}], Employee);

SortedEmployees := SORT(Employees, Salary);

// descending order
SortedEmployees := SORT(Employees, -Salary);
// sort by multiple fields
SortedEmployees := SORT(Employees, Occupation, -Salary);

OUTPUT(SortedEmployees);

MAX

Get the maximum value:

// create data set
peopleRec := RECORD
  STRING5 name;
  UNSIGNED2 age;
  STRING10 job;
END;

people := DATASET([
  {'John', 25, 'Teacher'},
  {'Mary', 30, 'Engineer'},
  {'Bob', 35, 'Engineer'},
  {'Lisa', 40, 'Doctor'},
  {'Mike', 28, 'Teacher'},
  {'Anna', 32, 'Doctor'}
], peopleRec);

// Group the dataset by the job field and find the maximum age in each group
test := TABLE(people, {job, maxAge := MAX(GROUP, age)}, job);

// Find the maximum age among all people without grouping the dataset by the job field
test2 := TABLE(people, {job, maxAge := MAX(GROUP, age)});

// output result
OUTPUT(test, NAMED('MaxAgePerJob'));
OUTPUT(test2, NAMED('MaxAge'));

COUNT

Calculate the size of the data.

Layout := RECORD
    STRING20 Name;
    UNSIGNED2 Age;
    STRING20 Occupation;
END;

ds := DATASET([{'John', 30, 'Doctor'},
               {'Alice', 25, 'Engineer'},
               {'Bob', 35, 'Teacher'}], Layout);

dsCount := COUNT(ds);

OUTPUT(dsCount);