Learning ECL Programming language
Learn the basics of ECL, the powerful programming language built for big data analytics.
File type
ECL has two kinds of statements: definitions and actions. Once a file uses EXPORT to export a definition, that file can no longer contain actions.
Similarly, ECL has two kinds of files, both with the suffix .ecl: definition files and Builder Window Runnable (BWR) files. The differences are:
- A definition file contains EXPORT or SHARED definitions and no actions. Therefore, the file cannot be executed through Submit.
- A BWR file contains at least one action and no EXPORT or SHARED definitions.
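As a minimal sketch of a BWR file (the name Greeting is purely illustrative), the following contains one local definition and one action, with no EXPORT or SHARED:

```ecl
// A BWR file: a local definition followed by at least one action.
Greeting := 'Hello, ECL';   // definition: nothing is executed yet
OUTPUT(Greeting);           // action: this is what Submit runs
```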
Define variables
Variable names cannot contain spaces, and each definition ends with a semicolon. The format for defining variables is:
[Scope] [ValueType] Name [ (parms) ] := Expression [ :WorkflowService] ;
For example:
My_First_Definition1 := 5; // valid name
A variable name cannot start with UNICODE_, UTF8_, or VARUNICODE_.
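The optional parts of the format can be illustrated with a short sketch. The names Answer, Greeting, and AddOne are hypothetical and chosen only for this example:

```ecl
// [ValueType] Name := Expression ;
INTEGER4 Answer := 42;
STRING Greeting := 'hello';

// [ValueType] Name (parms) := Expression ;  a parameterized definition
INTEGER AddOne(INTEGER x) := x + 1;

OUTPUT(AddOne(Answer));   // 43
OUTPUT(Greeting);
```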
Basic variable type
Boolean
A Boolean can be defined by any expression that evaluates to TRUE or FALSE, for example:
IsBoolTrue := TRUE;
Value
Value can be defined by an expression, and the result must be an arithmetic value or a string:
ValueTrue := 1;
INTEGER
[IntType] [UNSIGNED] INTEGER[n]
Here, n is the number of bytes the integer occupies, from 1 to 8. The default is 8.
IntType specifies the byte order: BIG_ENDIAN stores the most significant byte at the lowest address, and LITTLE_ENDIAN stores the least significant byte at the lowest address. The default is LITTLE_ENDIAN.
UNSIGNED specifies that the integer is unsigned; the default is signed.
REAL[n] represents a floating-point number; n can be 4 (7 significant figures) or 8 (15 significant figures).
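A few declarations may make the size and sign options more concrete. This is only a sketch; all names and values are illustrative:

```ecl
UNSIGNED INTEGER1 SmallCount := 200;   // 1 byte, unsigned: range 0..255
INTEGER4 Offset := -12345;             // 4 bytes, signed
INTEGER BigValue := 9000000000;        // no size given: defaults to 8 bytes
REAL4 Approx := 3.14159;               // about 7 significant figures
REAL8 Precise := 3.141592653589793;    // about 15 significant figures

OUTPUT(SmallCount);
OUTPUT(Offset);
OUTPUT(BigValue);
OUTPUT(Approx);
OUTPUT(Precise);
```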
Set
All elements must be of the same type.
Example:
SetInts := [1,2,3,4,5]; // an INTEGER set with 5 elements
A set can be accessed by subscript; subscripts start from 1:
MySet := [5,4,3,2,1];
A string is treated as a set of 1-character elements, so it can also be accessed by subscript:
MyString := 'ABCDE';
Strings also support range access:
MyString := 'ABCDE';
The element type of a set can be specified:
SET OF INTEGER1 SetValues := [5,10,15,20];
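Subscript and range access on the sets and strings defined above can be sketched as follows; the OUTPUT calls are added here only to show the results:

```ecl
MySet := [5,4,3,2,1];
OUTPUT(MySet[2]);          // 4, because subscripts start at 1

MyString := 'ABCDE';
OUTPUT(MyString[2]);       // 'B'
OUTPUT(MyString[2..4]);    // 'BCD'
OUTPUT(MyString[..2]);     // 'AB'
OUTPUT(MyString[4..]);     // 'DE'

SET OF INTEGER1 SetValues := [5,10,15,20];
OUTPUT(10 IN SetValues);   // TRUE
```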
Keywords
EXPORT
How to use: EXPORT [ VIRTUAL ] definition. Only one EXPORT definition (typically a MODULE) is allowed in each file, and its name must be the same as the file name.
VIRTUAL is optional and can be used only inside a MODULE structure. An exported definition can be used from other files as Module.Definition.
EXPORT can be nested. If you want to access a value inside a MODULE from another file, that value must itself be marked with EXPORT.
Example, file1:
EXPORT file1 := MODULE
file2:
IMPORT MyTest;
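As a rough sketch of how two such files might fit together, assume a folder named MyTest that contains file1.ecl. The module contents (Greeting, Prefix, Labelled) are hypothetical:

```ecl
// File MyTest/file1.ecl (hypothetical contents)
EXPORT file1 := MODULE
  EXPORT Greeting := 'Hello';            // usable from other files
  SHARED Prefix := 'internal: ';         // usable only inside this module
  EXPORT Labelled := Prefix + Greeting;  // exported definitions may use SHARED ones
END;

// File MyTest/file2.ecl, a BWR file that uses the module:
//   IMPORT MyTest;
//   OUTPUT(MyTest.file1.Greeting);
//   OUTPUT(MyTest.file1.Labelled);
```

Prefix is not reachable as MyTest.file1.Prefix from file2, because it is SHARED rather than EXPORT.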
Data structure
ENUM
Enums can be useful when you want to represent a limited set of possible values for a variable or a parameter. For example:
Color := ENUM(RED=1, GREEN=2, BLUE=3);
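Enum members are referenced with dot notation. A brief sketch, where Favorite is a hypothetical name:

```ecl
Color := ENUM(RED=1, GREEN=2, BLUE=3);

OUTPUT(Color.GREEN);              // 2
UNSIGNED1 Favorite := Color.BLUE;
OUTPUT(Favorite = Color.BLUE);    // TRUE
```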
RECORD
A RECORD in ECL represents the structure or format of a dataset. It is similar to a table definition in a SQL database, where each field in the record corresponds to a column in the table. It defines the names and data types of the fields.
For example:
ChildRec := RECORD
A RECORD is usually used with DATASET.
DATASET
A DATASET represents a set of data: a group of records with the same record layout. A record layout is defined using the RECORD structure, which contains a set of fields, each with a name and a type.
How to use:
attr := DATASET( file, struct, filetype [,LOOKUP]);
Example:
rec := RECORD
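A minimal sketch of a RECORD used together with a DATASET follows. It uses the inline form of DATASET so that it runs without any file on the cluster; PersonRec, its fields, and the logical file name in the comment are all assumptions made for illustration:

```ecl
PersonRec := RECORD
  UNSIGNED4 id;
  STRING20  name;
  UNSIGNED1 age;
END;

// Inline form: DATASET(list of rows, record layout)
People := DATASET([{1, 'Alice', 34},
                   {2, 'Bob',   27}], PersonRec);
OUTPUT(People);

// File form, as in the syntax above; '~tutorial::people' is a hypothetical logical file:
// PeopleOnDisk := DATASET('~tutorial::people', PersonRec, THOR);
```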
Construct the dataset:
Use DATASET( count, transform [, DISTRIBUTED | LOCAL ] ), for example:
RAND_MAX := POWER(2,32) -1;
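The count-and-transform form can be sketched as below. NumRec and MakeNum are hypothetical names; the TRANSFORM is called once per generated record, with COUNTER passed in as the record number:

```ecl
NumRec := RECORD
  UNSIGNED4 id;
  UNSIGNED4 squared;
  UNSIGNED4 rnd;
END;

NumRec MakeNum(UNSIGNED4 c) := TRANSFORM
  SELF.id      := c;
  SELF.squared := c * c;
  SELF.rnd     := RANDOM() % 100;   // pseudo-random value in 0..99
END;

// Generates 5 records; COUNTER runs from 1 to 5.
Nums := DATASET(5, MakeNum(COUNTER));
OUTPUT(Nums);
```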
Built-in functions and operations
OUTPUT
OUTPUT writes a value or a recordset to the workunit results or to a file.
[attr := ] OUTPUT(recordset [, [ format ] [,file [thorfileoptions ] ] [, NOXPATH ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] );
For example, to output 1111 as a result named test, you can write:
OUTPUT(1111, NAMED('test'));
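OUTPUT accepts recordsets as well as scalar values, optionally with a format that selects fields. A small sketch, in which the dataset and the result names are made up for the example:

```ecl
ds := DATASET([{1, 'one'}, {2, 'two'}], {UNSIGNED1 id, STRING5 word});

OUTPUT(1111, NAMED('test'));                 // scalar result named 'test'
OUTPUT(ds, NAMED('all_rows'));               // whole recordset
OUTPUT(ds, {ds.word}, NAMED('words_only'));  // only the word field
```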
TABLE
TABLE is used to create a new dataset. It takes a recordset, a RECORD structure that defines the format of the result, and an optional grouping expression, and returns a new dataset.
person := RECORD
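One common use of TABLE is a grouped summary. The sketch below assumes a hypothetical sales dataset and produces one row per region:

```ecl
SaleRec := RECORD
  STRING10  region;
  UNSIGNED4 amount;
END;
Sales := DATASET([{'north', 100}, {'south', 250}, {'north', 50}], SaleRec);

// Format of the result: the grouping field plus two aggregates per group.
SummaryRec := RECORD
  Sales.region;
  UNSIGNED4 saleCount := COUNT(GROUP);
  UNSIGNED4 total     := SUM(GROUP, Sales.amount);
END;

Summary := TABLE(Sales, SummaryRec, region);   // third argument is the grouping field
OUTPUT(Summary);
```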
PROJECT
PROJECT executes a TRANSFORM operation on each record in a DATASET.
TRANSFORM
A TRANSFORM defines the rules for converting the records of one DATASET into the records of another.
IMPORT STD;
Result:
1 one ONE
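A sketch of PROJECT with a TRANSFORM that would produce rows such as "1 one ONE" is shown below. The layouts and the name MakeUpper are assumptions made for illustration; STD.Str.ToUpperCase comes from the ECL standard library:

```ecl
IMPORT STD;

InRec := RECORD
  UNSIGNED1 id;
  STRING    word;
END;

OutRec := RECORD
  UNSIGNED1 id;
  STRING    word;
  STRING    upperWord;
END;

Words := DATASET([{1, 'one'}, {2, 'two'}, {3, 'three'}], InRec);

// Called once per input record; L is the current input record.
OutRec MakeUpper(InRec L) := TRANSFORM
  SELF.upperWord := STD.Str.ToUpperCase(L.word);
  SELF := L;   // copy the remaining fields (id, word) unchanged
END;

Result := PROJECT(Words, MakeUpper(LEFT));
OUTPUT(Result);
```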
NORMALIZE
In PROJECT, the TRANSFORM operation is performed on each record, and the resulting dataset has a one-to-one correspondence with the input. NORMALIZE, by contrast, expands a single record into multiple records. In some cases, you may have a field that contains repeated data, and you may wish to split each repetition into a separate record. In this case, you can use the NORMALIZE function. NORMALIZE takes a dataset, an expression giving the number of times to call the TRANSFORM for each record, and a TRANSFORM function, and returns a new dataset. The TRANSFORM function defines how to split the original records into new records.
Layout := RECORD
Here, ds.times indicates the number of times to repeat, and the TRANSFORM function defines how to convert the original record into a new record. In the output dataset, John and Jane will appear 3 and 2 times, respectively. Note that in the NORMALIZE function, the current record is referenced with LEFT.
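The behavior described above, with John repeated 3 times and Jane 2 times, can be sketched as follows. The field names, OutLayout, and Expand are hypothetical; the repeat count is read from the current record through LEFT:

```ecl
Layout := RECORD
  STRING10  name;
  UNSIGNED1 times;
END;
ds := DATASET([{'John', 3}, {'Jane', 2}], Layout);

OutLayout := RECORD
  STRING10  name;
  UNSIGNED1 seq;
END;

// L is the current input record; c counts from 1 up to L.times.
OutLayout Expand(Layout L, UNSIGNED c) := TRANSFORM
  SELF.name := L.name;
  SELF.seq  := c;
END;

Expanded := NORMALIZE(ds, LEFT.times, Expand(LEFT, COUNTER));
OUTPUT(Expanded);   // 5 records: John x3, Jane x2
```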
JOIN
JOIN merges two DATASETs by combining the records that satisfy a join condition.
JOIN( leftrecset, rightrecset, joincondition [, transform ] [, jointype ] [, joinflags ] )
Layout := RECORD
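A minimal sketch of an inner JOIN on a shared key, assuming two hypothetical datasets of people and orders:

```ecl
PersonRec := RECORD
  UNSIGNED4 id;
  STRING    name;
END;
OrderRec := RECORD
  UNSIGNED4 personId;
  STRING    item;
END;
OutRec := RECORD
  STRING name;
  STRING item;
END;

People := DATASET([{1, 'Alice'}, {2, 'Bob'}], PersonRec);
Orders := DATASET([{1, 'book'}, {1, 'pen'}, {3, 'lamp'}], OrderRec);

// Builds one output record from each matching pair of records.
OutRec Combine(PersonRec L, OrderRec R) := TRANSFORM
  SELF.name := L.name;
  SELF.item := R.item;
END;

// The default join type is INNER, so only Alice's two orders match.
Joined := JOIN(People, Orders, LEFT.id = RIGHT.personId, Combine(LEFT, RIGHT));
OUTPUT(Joined);
```

Passing LEFT OUTER as the jointype would also keep Bob, with an empty item.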
EMBED
EMBED embeds code written in other languages.
EMBED(language)
Example:
IMPORT Python3 AS Python;
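A small sketch of an embedded Python function is shown below. It assumes the Python3 embed plugin is installed on the cluster, and the function name addOne is hypothetical:

```ecl
IMPORT Python3 AS Python;

// The ECL signature declares the parameter and return types;
// the body between EMBED and ENDEMBED is Python.
INTEGER addOne(INTEGER x) := EMBED(Python)
  return x + 1
ENDEMBED;

OUTPUT(addOne(41));   // 42
```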
LOCAL
The LOCAL keyword is mainly used to limit the scope of data or functions, or to control how data is distributed in the cluster.
If a definition (for example, a dataset or function) is declared LOCAL, then it is visible only in the ECL statement in which it is declared. This is similar to local variables in other programming languages. For example:
myFunction := FUNCTION
Additionally, when the LOCAL keyword is used on a dataset operation, it indicates that the operation should be performed locally on each node, rather than across the entire cluster. This can be useful for reducing network communication and speeding up calculations. For example, the following ECL statement creates a dataset that is computed locally on each node:
In this example, myRecordDef is a record definition that describes the structure of each record in the dataset. Each node will process a portion of this dataset, not the entire dataset.
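A minimal sketch of a node-local operation, assuming a hypothetical myRecordDef layout and inline data. The records are first distributed by category, so that each node holds complete categories, and then sorted with the LOCAL option:

```ecl
myRecordDef := RECORD
  UNSIGNED4 id;
  STRING10  category;
END;

ds := DATASET([{3, 'b'}, {1, 'a'}, {2, 'b'}, {4, 'a'}], myRecordDef);

// Send all records with the same category to the same node.
Distributed := DISTRIBUTE(ds, HASH32(category));

// LOCAL: each node sorts only the records it holds, with no global merge.
LocalSorted := SORT(Distributed, category, id, LOCAL);
OUTPUT(LocalSorted);
```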
ASSERT
ASSERT is often used to check whether a function has returned the expected result.
ASSERT( condition [, message ] [, FAIL ] [, CONST ])
ASSERT( recset, condition [, message ] [, FAIL ] [, CONST ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] )
ASSERT( recset, assertlist [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] )
When recset is a DATASET, ASSERT checks the condition against each record in turn.
IMPORT Python3 AS Python;
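Both the scalar form and the recordset form can be sketched in plain ECL. Twice and Vals are hypothetical names used only for this example:

```ecl
INTEGER Twice(INTEGER x) := x * 2;

// Scalar form: an action that checks a single condition.
ASSERT(Twice(21) = 42, 'Twice(21) should be 42');

// Recordset form: returns the recordset and checks every record as it passes through.
Vals := DATASET([{1}, {2}, {3}], {INTEGER val});
Checked := ASSERT(Vals, val > 0, 'val must be positive');
OUTPUT(Checked);
```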
SORT
The SORT function sorts the records in a dataset by the specified field. Its basic syntax is as follows:
SortedDataset := SORT(Dataset, Field);
Here, Dataset is the dataset you want to sort, and Field is the field you want to sort by.
For example, given a dataset that contains employee information, to sort it by the Salary field:
Employee := RECORD
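A sketch of sorting by Salary in both directions, with made-up employee data; a leading minus sign on the field requests descending order:

```ecl
Employee := RECORD
  STRING20  Name;
  UNSIGNED4 Salary;
END;

Employees := DATASET([{'Ann', 5000}, {'Bob', 4200}, {'Cid', 6100}], Employee);

BySalary     := SORT(Employees, Salary);    // ascending: Bob, Ann, Cid
BySalaryDesc := SORT(Employees, -Salary);   // descending: Cid, Ann, Bob

OUTPUT(BySalary);
OUTPUT(BySalaryDesc);
```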
MAX
MAX returns the maximum value:
// create data set
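MAX can be applied to a field of a dataset or to a list of values. A short sketch with hypothetical data:

```ecl
// create data set
Employees := DATASET([{'Ann', 5000}, {'Bob', 4200}, {'Cid', 6100}],
                     {STRING20 Name, UNSIGNED4 Salary});

OUTPUT(MAX(Employees, Salary));   // 6100: largest Salary in the dataset
OUTPUT(MAX(3, 9, 4));             // 9: largest value in the list
```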
COUNT
COUNT returns the number of records in a dataset.
Layout := RECORD
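A brief sketch of COUNT on a dataset, on a filtered dataset, and on a set. The layout and values are assumptions made for illustration:

```ecl
Layout := RECORD
  STRING10  name;
  UNSIGNED1 age;
END;
ds := DATASET([{'John', 31}, {'Jane', 28}, {'Joe', 45}], Layout);

OUTPUT(COUNT(ds));              // 3: all records
OUTPUT(COUNT(ds(age >= 30)));   // 2: only records that pass the filter
OUTPUT(COUNT([1, 2, 3, 4]));    // 4: elements in a set
```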