How to Write Scalar Functions
What's scalar functions
Scalar functions (sometimes referred to as User-Defined Functions / UDFs) return a single value as a return value for each row, not as a result set, and can be used in most places within a query or SET statement, except for the FROM clause.
┌─────┐ ┌──────┐
│ a │ │ x │
├─────┤ ├──────┤
│ b │ │ y │
├─────┤ ScalarFunction ├──────┤
│ c │ │ z │
├─────┼────────────────────►──────┤
│ d │ Exec │ u │
├─────┤ ├──────┤
│ e │ │ v │
├─────┤ ├──────┤
│ f │ │ w │
└─────┘ └──────┘
Knowledge before writing the eval function
Logical datatypes and physical datatypes.
Logical datatypes are the datatypes that we use in Databend, and physical datatypes are the datatypes that we use in the execution/compute engine.
Such as Date
, it's a logical data type, but its physical is Int32
, so its column is represented by Buffer<i32>
.
Arrow's memory layout
Databend's memory layout is based on the Arrow system, you can find Arrow's memory layout [here] (https://arrow.apache.org/docs/format/Columnar.html#format-columnar).
For example a primitive array of int32s:
[1, null, 2, 4, 8] Would look like this:
* Length: 5, Null count: 1
* Validity bitmap buffer:
|Byte 0 (validity bitmap) | Bytes 1-63 |
|-------------------------|-----------------------|
| 00011101 | 0 (padding) |
* Value Buffer:
|Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|------------|-------------|-------------|-------------|-------------|-------------|
| 1 | unspecified | 2 | 4 | 8 | unspecified |
In most cases, we can ignore null for simd operation, and add the null mask to the result after the operation. This is very common optimization and widely used in arrow's compute system.
Special column
Constant column
Sometimes column is constant in the block, such as:
SELECT 3 from table
, the column 3 is always 3, so we can use a constant column to represent it. This is useful to save the memory space during computation.Nullable column
By default, columns are not nullable. If we want a nullable column, we can use this to represent it.
Function Registration
The FunctionRegistry
is responsible for registering functions.
#[derive(Default)]
pub struct FunctionRegistry {
pub funcs: HashMap<&'static str, Vec<Arc<Function>>>,
#[allow(clippy::type_complexity)]
pub factories: HashMap<
&'static str,
Vec<Box<dyn Fn(&[usize], &[DataType]) -> Option<Arc<Function>> + 'static>>,
>,
pub aliases: HashMap<&'static str, &'static str>,
}
It contains three HashMaps: funcs
, factories
, and aliases
.
funcs
and factories
store registered functions. The former registering functions with a fixed number of parameters (currently the minimum number of arguments is 0
and the maximum number of arguments is 5
), register_0_arg
, register_1_arg
and so on. The latter registering functions with variable-length parameter (such as concat) and call the register_function_factory function
.
aliases
is used to store aliases for functions. A function may have more than one alias (e.g. minus
has subtract
and neg
aliases), its key is the alias of a function, value is the name of the currently existing function, and the register_aliases
function will be called.
In addition, there are different levels of register api depending on the function required.
Auto Vectorization | Access Output Column Builder | Auto Null Passthrough | Auto Combine Null | Auto Downcast | Throw Runtime Error | Varidic | Tuple | |
---|---|---|---|---|---|---|---|---|
register_n_arg | ✔️ | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ |
register_passthrough_nullable_n_arg | ❌ | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ❌ | ❌ |
register_combine_nullable_n_arg | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ |
register_n_arg_core | ❌ | ✔️ | ❌ | ❌ | ✔️ | ✔️ | ❌ | ❌ |
register_function_factory | ❌ | ✔️ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ |
Function Composition
Since the values of funcs
are the body of the function, let's see how Function
is constructed in Databend.
pub struct Function {
pub signature: FunctionSignature,
#[allow(clippy::type_complexity)]
pub calc_domain: Box<dyn Fn(&[Domain]) -> Option<Domain>>,
#[allow(clippy::type_complexity)]
pub eval: Box<dyn Fn(&[ValueRef<AnyType>], FunctionContext) -> Result<Value<AnyType>, String>>,
}
Functions are represented by the Function
struct, which includes a function signature
, a calculation domain (cal_domain
), and an evaluation function (eval
).
The signature includes the function name, the parameters type, the return type and the function properties (which are not currently available and are reserved for use with functions). Note in particular that the function name needs to be lowercase when registering. Some tokens are transformed via src/query/ast/src/parser/token.rs
.
#[allow(non_camel_case_types)]
#[derive(Logos, Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum TokenKind {
...
#[token("+")]
Plus,
...
}
As an example, let's consider the addition function used in the query select 1+2
. The +
token is converted to Plus
, and the function name needs to be lowercase. Therefore, the function name used in registration is plus
.
with_number_mapped_type!(|NUM_TYPE| match left {
NumberDataType::NUM_TYPE => {
registry.register_1_arg::<NumberType<NUM_TYPE>, NumberType<NUM_TYPE>, _, _>(
"plus",
FunctionProperty::default(),
|lhs| Some(lhs.clone()),
|a, _| a,
);
}
});
cal_domain
is used to calculate the input value set for the output value. This is described by a mathematical formula such as y = f(x)
where the domain is the set of values x
that can be used as arguments to f
to generate values y
. This allows us to easily filter out values that are not in the domain when indexing data, greatly improving response efficiency.
eval
can be understood as the concrete implementation of a function, which takes characters or numbers as input, parses them into expressions, and converts them into another set of values.
Example
There are several categories of functions, including arithmetic, array, boolean, control, comparison, datetime, math, and string.
length
function
The length function takes a String
parameter and returns a Number
. It is named length
, with no domain restrictions since any string has a length. The last argument is a closure function that serves as the implementation of length
.
registry.register_1_arg::<StringType, NumberType<u64>, _, _>(
"length",
FunctionProperty::default(),
|_| None,
|val, _| val.len() as u64,
);
In the implementation of register_1_arg
, we see that the called function is register_passthrough_nullable_1_arg
, whose name contains nullable. eval
is called by vectorize_1_arg
.
It's worth noting that the register.rs in src/query/expression/src should not be manually modified as it is generated by src/query/codegen/src/writes/register.rs.
pub fn register_1_arg<I1: ArgType, O: ArgType, F, G>(
&mut self,
name: &'static str,
property: FunctionProperty,
calc_domain: F,
func: G,
) where
F: Fn(&I1::Domain) -> Option<O::Domain> + 'static + Clone + Copy,
G: Fn(I1::ScalarRef<'_>, FunctionContext) -> O::Scalar + 'static + Clone + Copy,
{
self.register_passthrough_nullable_1_arg::<I1, O, _, _>(
name,
property,
calc_domain,
vectorize_1_arg(func),
)
}
In practical scenarios, eval
accepts not only strings or numbers, but also null or other various types. null
is undoubtedly the most special one. The parameter we receive may also be a column or a value. For example, in the following SQL queries, length is called with a null value or a column:
select length(null);
+--------------+
| length(null) |
+--------------+
| NULL |
+--------------+
select length(id) from t;
+------------+
| length(id) |
+------------+
| 2 |
| 3 |
+------------+
Therefore, if we don't need to handle null
values in the function, we can simply use register_x_arg
. If we need to handle null
values, we can refer to the implementation of try_to_timestamp.
For functions that require specialization in vectorize, register_passthrough_nullable_x_arg
should be used to perform specific vectorization optimization.
For example, the implementation of the regexp
function takes two String
parameters and returns a Bool
. In order to further optimize and reduce the repeated parsing of regular expressions, a HashMap
structure is introduced in vectorized execution. Therefore, vectorize_regexp
is separately implemented to handle this optimization.
registry.register_passthrough_nullable_2_arg::<StringType, StringType, BooleanType, _, _>(
"regexp",
FunctionProperty::default(),
|_, _| None,
vectorize_regexp(|str, pat, map, _| {
let pattern = if let Some(pattern) = map.get(pat) {
pattern
} else {
let re = regexp::build_regexp_from_pattern("regexp", pat, None)?;
map.insert(pat.to_vec(), re);
map.get(pat).unwrap()
};
Ok(pattern.is_match(str))
}),
);
Testing
To be a good engineer, don't forget to test your codes, please add unit tests and stateless tests after you finish the new scalar functions.
Unit Test
The unit tests for scalar functions are located in scalars.
Logic Test
The logic tests for functions are located in 02_function.