vinum
Vinum is a SQL query processor for Python, designed for data analysis workflows and in-memory analytics.
Vinum runs inside the host Python process and can execute any function available to the interpreter as a UDF. If you are doing data analysis or running ETL in Python, Vinum lets you execute efficient SQL queries with the ability to call native Python UDFs.
Vinum has a hybrid query execution model: whenever possible it prefers native compiled versions of operators, and executes interpreted Python code only where strictly necessary (i.e. for native Python UDFs).
Allows using functions available within the host Python interpreter as UDFs, including native Python, NumPy, etc.
Vinum’s execution model doesn’t require input datasets to fit into memory, as it operates on a stream of record batches. However, the final result is fully materialized in memory.
Written in a mix of C++ and Python, and built from the ground up on top of Apache Arrow, which provides the foundation for moving data and enables minimal overhead for transferring data to and from NumPy and Pandas.
Vinum uses the PostgreSQL parser provided by the pglast project.
The query planner and executor are implemented in Python, while all the physical operators are either implemented in C++ or use compiled vectorized kernels from Arrow or NumPy. The only exception is native Python UDFs, which run within the interpreter.
The query execution model is based on the vectorized model described in the seminal paper: P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-Pipelining Query Execution. In CIDR, 2005.
Install Vinum with pip:

pip install vinum
Query a CSV file and convert the result to a Pandas DataFrame:

>>> import vinum as vn
>>> tbl = vn.read_csv('test.csv')
>>> res_tbl = tbl.sql('SELECT * FROM t WHERE fare > 5 LIMIT 3')
>>> res_tbl.to_pandas()
   id                            ts        lat        lng  fare
0   1   2010-01-05 16:52:16.0000002  40.711303 -74.016048  16.9
1   2  2011-08-18 00:35:00.00000049  40.761270 -73.982738   5.7
2   3   2012-04-21 04:30:42.0000001  40.733143 -73.987130   7.7
Create a Table from a Python dict and return the result of the query as a Pandas DataFrame:
>>> import vinum as vn
>>> data = {'value': [300.1, 2.8, 880], 'mode': ['air', 'bus', 'air']}
>>> tbl = vn.Table.from_pydict(data)
>>> tbl.sql_pd("SELECT value, np.log(value) FROM t WHERE mode='air'")
   value    np.log
0  300.1  5.704116
1  880.0  6.779922
Run a query on a Pandas DataFrame:

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql_pd('SELECT * FROM t WHERE col2 > 10 ORDER BY col1 DESC')
   col1  col2
0     3    17
1     2    13
For larger datasets, or datasets that won’t fit into memory, stream_csv() is the recommended way to execute a query. Compressed files are also supported and can be streamed without prior extraction.
>>> import vinum as vn
>>> query = 'select passenger_count pc, count(*) from t group by pc'
>>> vn.stream_csv('taxi.csv.bz2').sql(query).to_pandas()
   pc  count
0   0    165
1   5   3453
...
>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('SELECT key, fare_amount, passenger_count FROM t '
...                   'WHERE fare_amount > 5 LIMIT 3')
>>> res_tbl.to_pandas()
                            key  fare_amount  passenger_count
0   2010-01-05 16:52:16.0000002         16.9                1
1  2011-08-18 00:35:00.00000049          5.7                2
2   2012-04-21 04:30:42.0000001          7.7                1
Use any NumPy function via the ‘np.*’ namespace.
>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, np.sqrt(np.square(x) + np.square(y)) dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068
Register a UDF performing vectorized operations on NumPy arrays.
>>> import numpy as np
>>> import vinum as vn
>>> vn.register_numpy('distance',
...                   lambda x, y: np.sqrt(np.square(x) + np.square(y)))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, distance(x, y) AS dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068
Register a Python lambda function as a UDF.
>>> import math
>>> import vinum as vn
>>> vn.register_python('distance', lambda x, y: math.sqrt(x**2 + y**2))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT x, y, distance(x, y) AS dist FROM t')
   x   y       dist
0  1   7   7.071068
1  2  13  13.152946
2  3  17  17.262677
>>> import numpy as np
>>> import vinum as vn
>>> def z_score(x: np.ndarray):
...     """Compute Standard Score"""
...     mean = np.mean(x)
...     std = np.std(x)
...     return (x - mean) / std
...
>>> vn.register_numpy('score', z_score)
>>> tbl = vn.read_csv('test.csv')
>>> tbl.sql_pd('select to_int(score(fare)) AS bucket, avg(fare), count(*) '
...            'FROM t GROUP BY bucket ORDER BY bucket')
   bucket        avg  count
0       0   8.111630     92
1       1  19.380000      3
2       2  27.433333      3
3       3  34.670000      1
4       6  58.000000      1
Currently, only the SELECT statement is supported.
Vinum uses the PostgreSQL parser and strives to follow the PostgreSQL SQL dialect.
Notes:
Subqueries and JOINs are currently not supported.
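However, a subquery can often be emulated by running a subsequent query on the result of a previous one. A minimal sketch (the data values here are made up):

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3, 4], 'y': [7, 13, 17, 19]})
>>> inner = tbl.sql('select x, y from t where y > 10')
>>> inner.sql_pd('select x from t where x < 4')
   x
0  2
1  3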
Supported operators:

-arg – unary minus (negation)
~arg – bitwise NOT
+ – addition
- – subtraction
* – multiplication
/ – division
% – modulo
& – bitwise AND
| – bitwise OR
AND – logical AND
OR – logical OR
NOT – logical NOT
=, == – equals
!=, <> – not equals
> – greater than
>= – greater than or equal
< – less than
<= – less than or equal
IS NULL, IS NOT NULL – null checks
IN, NOT IN – membership tests
BETWEEN, NOT BETWEEN – range tests
LIKE, NOT LIKE – pattern matching
|| – string concatenation
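A few of these operators in action (a sketch; the table and values are made up, and BETWEEN bounds are inclusive as in PostgreSQL):

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'name': ['air', 'bus', 'rail'],
...                             'fare': [47.0, 2.75, 12.5]})
>>> tbl.sql_pd("select name, fare from t "
...            "where fare between 2 and 20 and name like '%s'")
  name  fare
0  bus  2.75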
List of Built-in SQL functions.
np.* - All the functions from the numpy package are supported by default via the np.* namespace.
select np.log(total) from passengers
select np.power(np.min(size), 3) as cubed from measurements
to_bool(arg) - Casts argument to bool type.
to_float(arg) - Casts argument to float type.
to_int(arg) - Casts argument to int type.
to_str(arg) - Casts argument to str type.
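For example, a float column can be cast to integers with to_int (a minimal sketch; the data and alias are made up):

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'fare': [4.0, 16.0, 5.0]})
>>> tbl.sql_pd('select to_int(fare) fare_int from t')
   fare_int
0         4
1        16
2         5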
abs(arg) - returns the absolute value of the numerical argument.
See: numpy.absolute
sqrt(arg) - returns the square root of the numerical argument.
See: numpy.sqrt
cos(arg) - returns the cosine of the argument.
See: numpy.cos
sin(arg) - returns the sine of the argument.
See: numpy.sin
tan(arg) - returns the tangent of the argument.
See: numpy.tan
power(arg, power) - returns the argument(s) raised to the power.
See: numpy.power
log(arg) - returns the natural log of the argument.
See: numpy.log
log2(arg) - returns the log base 2 of the argument.
See: numpy.log2
log10(arg) - returns the log base 10 of the argument.
See: numpy.log10
e() - returns the e constant.
pi() - returns the pi constant.
concat(arg1, arg2, …) - concatenates string arguments.
If an argument is not a string type, it is converted to a string.
See: numpy.char.add
upper(arg) - converts a string to uppercase.
See: numpy.char.upper
lower(arg) - converts a string to lowercase.
See: numpy.char.lower
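A quick sketch of the string functions (the data and aliases are made up):

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'city': ['Berlin', 'London']})
>>> tbl.sql_pd("select upper(city) up, concat(city, '!') ex from t")
       up       ex
0  BERLIN  Berlin!
1  LONDON  London!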
now() - returns the current datetime, with seconds resolution.
date(arg) - converts the argument to date type.
Input is either a string in ISO8601 format or integer timestamp.
Use date(‘now’) for current date.
See: numpy.datetime64
datetime(arg, unit) - converts the argument to datetime type.
Supported units are: [‘D’, ‘s’, ‘ms’, ‘us’, ‘ns’]
Use datetime(‘now’) for current datetime.
from_timestamp(arg, unit) - converts an integer timestamp to datetime type. The argument represents the integer value of the timestamp, i.e. the number of seconds (or milliseconds, etc.) since epoch.
Supported units are : [‘s’, ‘ms’, ‘us’, ‘ns’]
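For example, integer epoch seconds can be converted to datetimes (a minimal sketch; the timestamp values are made up):

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'ts': [1262332336, 1313627700]})
>>> tbl.sql_pd("select from_timestamp(ts, 's') dt from t")
                   dt
0 2010-01-01 07:52:16
1 2011-08-18 00:35:00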
timedelta(arg, unit) - returns the timedelta type. Argument represents the duration.
Supported units are : [‘Y’, ‘M’, ‘W’, ‘D’, ‘h’, ‘m’, ‘s’, ‘ms’, ‘us’, ‘ns’]
See: numpy.timedelta64
is_busday(arg) - returns True if the argument is a ‘business’ day.
See: numpy.is_busday
min(expr | column) - returns the minimum value in the group.
max(expr | column) - returns the maximum value in the group.
sum(expr | column) - returns the sum of the values in the group.
avg(expr | column) - returns the arithmetic mean of the values in the group.
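All four aggregates can appear in a single query (a sketch; the data and aliases are made up):

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'fare': [4.5, 16.9, 5.7]})
>>> tbl.sql_pd('select min(fare) mn, max(fare) mx, '
...            'sum(fare) sm, avg(fare) av from t')
    mn    mx    sm        av
0  4.5  16.9  27.1  9.033333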
Functions to define UDFs.
vinum.register_numpy(function_name, function)

Register a NumPy function as a User Defined Function (UDF). The UDF can perform vectorized operations on the arrays passed as arguments.
function_name (str) – Name of the User Defined Function.
function (callable) – Function to be used as a UDF. The function has to operate on vectorized numpy arrays: numpy arrays are passed as input arguments, and it should return a numpy array.
See also
register_python
Register Python function as a User Defined Function.
Notes
Numpy package is imported under np namespace. You can invoke any function from the np.* namespace.
Arguments of the function are numpy arrays of the provided columns. The UDF can perform vectorized operations on the arrays passed as arguments. The function is called only once.
Function names are case insensitive.
Examples
Define a function operating on NumPy arrays. NumPy functions perform vectorized operations on the input arrays.
>>> import numpy as np
>>> import vinum as vn
>>> vn.register_numpy('cube', lambda x: np.power(x, 3))
>>> tbl = vn.Table.from_pydict({'len': [1, 2, 3], 'size': [7, 13, 17]})
>>> tbl.sql_pd('SELECT cube(size) from t ORDER BY cube(size) DESC')
   cube
0  4913
1  2197
2   343
>>> import numpy as np
>>> import vinum as vn
>>> vn.register_numpy('distance',
...                   lambda x, y: np.sqrt(np.square(x) + np.square(y)))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('select x, y, distance(x, y) as dist from t')
   x   y       dist
0  1   7   7.071068
1  2  13  13.152946
2  3  17  17.262677
Please note that the x and y arguments are numpy arrays. In both cases, the function performs vectorized operations on the input arrays.
>>> import numpy as np
>>> import vinum as vn
>>> def z_score(x: np.ndarray):
...     """Compute Standard Score"""
...     mean = np.mean(x)
...     std = np.std(x)
...     return (x - mean) / std
...
>>> vn.register_numpy('score', z_score)
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('select x, score(x), y, score(y) from t')
   x     score   y   score_1
0  1 -1.224745   7 -1.297771
1  2  0.000000  13  0.162221
2  3  1.224745  17  1.135550
Please note that the x argument is a numpy array.
vinum.register_python(function_name, function)

Register a Python function as a User Defined Function (UDF).

function_name (str) – Name of the User Defined Function.
function (callable) – Python function to be used as a UDF.
register_numpy
Register Numpy function as a User Defined Function.
Python functions are “vectorized” before use, via numpy.vectorize. For better performance, please try to use numpy UDFs, operating in terms of numpy arrays. See vinum.register_numpy().
The function is invoked for individual rows of the Table.
Any Python packages used inside the function should be imported before invocation.
Using lambda as a UDF:
>>> import vinum as vn
>>> vn.register_python('cube', lambda x: x**3)
>>> tbl = vn.Table.from_pydict({'len': [1, 2, 3], 'size': [7, 13, 17]})
>>> tbl.sql_pd('SELECT cube(size) from t ORDER BY cube(size) DESC')
   cube
0  4913
1  2197
2   343
>>> import math
>>> import vinum as vn
>>> vn.register_python('distance', lambda x, y: math.sqrt(x**2 + y**2))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('select x, y, distance(x, y) as dist from t')
   x   y       dist
0  1   7   7.071068
1  2  13  13.152946
2  3  17  17.262677
Using a regular Python function:
>>> import vinum as vn
>>> def sin_taylor(x):
...     "Taylor series approximation of the sine trig function around 0."
...     return x - x**3/6 + x**5/120 - x**7/5040
...
>>> vn.register_python('sin', sin_taylor)
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('select sin(x) as sin_x, sin(y) as sin_y from t '
...            'order by sin_y')
      sin_x     sin_y
0  0.141120 -0.961397
1  0.909297  0.420167
2  0.841471  0.656987
Vinum API reference.
Vinum Table class.
Table represents a tabular dataset and provides an SQL SELECT interface for data manipulation. It consists of a set of named columns of equal length. Essentially, it is the same abstraction as a table in the relational database world.
Provides minimal-overhead transfer to and from a Pandas DataFrame as well as an Arrow Table, powered by the Apache Arrow framework.
A Table can be created by invoking one of the Table.from_* factory methods, or by the reader functions: vinum.read_csv(), vinum.read_parquet(), vinum.read_json().
By default, all the Numpy functions are available via ‘np.*’ namespace.
User Defined Functions can be registered via vinum.register_python() or vinum.register_numpy().
arrow_table (pyarrow.Table) – Arrow Table containing the dataset
>>> import pyarrow as pa
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> arrow_table = pa.Table.from_pydict(data)
>>> tbl = vn.Table(arrow_table)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17
>>> import pandas as pd
>>> import vinum as vn
>>> pdf = pd.DataFrame(data={'col1': [1, 2, 3], 'col2': [7, 13, 17]})
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql('select * from t')
<vinum.core.table.Table object at 0x114cff7f0>
Notice that vinum.Table.sql() returns a vinum.Table object.
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17
Notice that vinum.Table.sql_pd() returns a pandas.DataFrame.
To register a Numpy UDF:
>>> import vinum as vn
>>> vn.register_numpy('product', lambda x, y: x*y)
>>> tbl.sql_pd('select product(col1, col2) from t')
   product
0        7
1       26
2       51
The ‘product’ UDF defined above performs vectorized multiplication on the arrays represented by columns ‘col1’ and ‘col2’.
Attributes

schema – Return the schema of the table (pyarrow.Schema).
Methods
explain(query[, print_query_tree]) – Returns a query plan as a string.
from_arrow(arrow_table) – Constructs a Table from a pyarrow.Table.
from_pandas(data_frame) – Constructs a Table from a pandas.DataFrame.
from_pydict(pydict) – Constructs a Table from a dict.
head(n) – Returns the first n rows of a table as a Pandas DataFrame.
sql(query) – Executes an SQL SELECT query on a Table and returns the result as a Table.
sql_pd(query) – Executes an SQL SELECT query on a Table and returns the result as a Pandas DataFrame.
to_arrow() – Converts the Table to a pyarrow.Table.
to_pandas() – Converts the Table to a pandas.DataFrame.
to_string() – Returns a string representation of the Table.
query (str) – SQL SELECT query.
print_query_tree (bool, optional) – Set to True to also return an AST of the query.
Returns: str – Query plan.
>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> print(tbl.explain('select to_int(fare_amount) fare, count(*) from t '
...                   'group by fare order by fare limit 3'))
Column: to_int_4304514592  Column: count_star_4556372144
  Column: to_int_4304514592
    Column: fare_amount
      Operator: TableReaderOperator
arrow_table (pyarrow.Table) – Arrow Table containing the dataset.
Returns: Vinum Table instance.
>>> import pyarrow as pa
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> arrow_table = pa.Table.from_pydict(data)
>>> tbl = vn.Table.from_arrow(arrow_table)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17
data_frame (pandas.DataFrame) – Pandas DataFrame containing the dataset.
>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17
pydict (dict) – Python dictionary containing the dataset.
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> tbl = vn.Table.from_pydict(data)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17
n (int) – Number of first rows to return.
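A minimal sketch of head() (the data values are made up):

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'col1': [1, 2, 3], 'col2': [7, 13, 17]})
>>> tbl.head(2)
   col1  col2
0     1     7
1     2    13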
Only SELECT statements are supported. For SELECT statements, JOINs and subqueries are currently not supported. However, optimizations aside, one can run a subsequent query on the result of a query, to model the behaviour of subqueries.
The table name in the ‘select * from table’ clause is ignored. The underlying DataFrame is used to run the query.
User Defined Functions can be registered via vinum.register_python() or vinum.register_numpy().
Using a Pandas DataFrame:
>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> res_tbl = tbl.sql('select * from t')
>>> res_tbl.to_pandas()
   col1  col2
0     1     7
1     2    13
2     3    17
Running queries on a CSV file:
>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('select key, fare_amount from t limit 3')
>>> res_tbl.to_pandas()
                            key  fare_amount
0   2009-06-15 17:26:21.0000001          4.5
1   2010-01-05 16:52:16.0000002         16.9
2  2011-08-18 00:35:00.00000049          5.7
>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('select to_int(fare_amount) fare, count(*) from t '
...                   'group by fare order by fare limit 3')
>>> res_tbl.to_pandas()
   fare  count_star
0    -5           1
1    -3           1
2    -2           4
This is a convenience method which runs vinum.Table.sql() and then calls vinum.Table.to_pandas() on the result. Equivalent to:
>>> res_tbl = tbl.sql('select * from t')
>>> res_tbl.to_pandas()
   col1  col2
0     1     7
1     2    13
2     3    17
Returns: Pandas DataFrame.
Only SELECT statements are supported. For SELECT statements, JOINs and subqueries are currently not supported. However, optimizations aside, one can run a subsequent query on the result of the query, to model the behaviour of subqueries.
The table name in the select * from table clause is ignored. The underlying Table object is used to run the query.
>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> tbl.sql_pd('select key, passenger_count from t limit 3')
                            key  passenger_count
0   2009-06-15 17:26:21.0000001                1
1   2010-01-05 16:52:16.0000002                1
2  2011-08-18 00:35:00.00000049                2
Vinum Stream Reader class.
StreamReader represents a stream of data which is used as an input for the query processor.
Since the input file may not fit into memory, StreamReader is the recommended way to execute queries on large files.
StreamReader instances are created by vinum.stream_* functions, for example: vinum.stream_csv().
reader (pa.RecordBatchFileReader) – Arrow Stream Reader
Executes an SQL SELECT query on an input stream and returns the result as a Table materialized in memory.
Run an aggregation query on a CSV stream:
>>> import vinum as vn
>>> query = 'select passenger_count pc, count(*) from t group by pc'
>>> vn.stream_csv('taxi.csv').sql(query).to_pandas()
   pc  count
0   0    165
1   5   3453
2   6    989
3   1  34808
4   2   7386
5   3   2183
6   4   1016
List of Input/Output functions and factory methods.
Open a streaming reader of CSV data.
Reading using this function is always single-threaded.
This function is a thin convenience wrapper around pyarrow.csv.open_csv, which returns StreamReader.
input_file (string, path or file-like object) – The location of CSV data. If a string or path, and if it ends with a recognized compressed file extension (e.g. “.gz” or “.bz2”), the data is automatically decompressed when reading.
read_options (pyarrow.csv.ReadOptions, optional) – Options for the CSV reader (see pyarrow.csv.ReadOptions constructor for defaults)
parse_options (pyarrow.csv.ParseOptions, optional) – Options for the CSV parser (see pyarrow.csv.ParseOptions constructor for defaults)
convert_options (pyarrow.csv.ConvertOptions, optional) – Options for converting CSV data (see pyarrow.csv.ConvertOptions constructor for defaults)
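As a sketch, the pyarrow options above can be passed through; the file name and column names here are hypothetical:

>>> import pyarrow.csv as pacsv
>>> import vinum as vn
>>> opts = pacsv.ReadOptions(column_names=['id', 'fare'], skip_rows=1)
>>> res_tbl = vn.stream_csv('fares.csv', read_options=opts).sql(
...     'select avg(fare) from t')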
Read a Table from a stream of CSV data.
This function is a thin convenience wrapper around pyarrow.csv.read_csv, which returns Table.
memory_pool (MemoryPool, optional) – Pool to allocate Table memory from
Read a Table from a stream of JSON data. This function is a thin convenience wrapper around pyarrow.json.read_json, which returns a Table.
input_file (string, path or file-like object) – The location of JSON data. Currently only the line-delimited JSON format is supported.
read_options (pyarrow.json.ReadOptions, optional) – Options for the JSON reader (see ReadOptions constructor for defaults)
parse_options (pyarrow.json.ParseOptions, optional) – Options for the JSON parser (see ParseOptions constructor for defaults)
>>> import vinum as vn
>>> tbl = vn.read_json('test_data.json')
>>> tbl.sql_pd('select * from t limit 3')
   id  origin    destination    fare
0   1  London  San Francisco  1348.1
1   2  Berlin         London   256.3
2   3  Munich         Malaga   421.7
Read a Table from Parquet format. This function is a thin convenience wrapper around pyarrow.parquet.read_table, which returns Table.
source (str, pyarrow.NativeFile, or file-like object) – If a string passed, can be a single file name or directory name. For file-like objects, only read a single file. Use pyarrow.BufferReader to read a file contained in a bytes or buffer-like object.
columns (list) – If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
use_threads (bool, default True) – Perform multi-threaded column reads.
metadata (FileMetaData) – If separately computed
read_dictionary (list, default None) – List of names or column paths (for nested types) to read directly as DictionaryArray. Only supported for BYTE_ARRAY storage. To read a flat column as dictionary-encoded pass the column name. For nested types, you must pass the full column “path”, which could be something like level1.level2.list.item. Refer to the Parquet file’s schema to obtain the paths.
memory_map (bool, default False) – If the source is a file path, use a memory map to read file, which can improve performance in some environments.
buffer_size (int, default 0) – If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
partitioning (Partitioning or str or list of str, default "hive") – The partitioning scheme for a partitioned dataset. The default of “hive” assumes directory names with key=value pairs like “/year=2009/month=11”. In addition, a scheme like “/2009/11” is also supported, in which case you need to specify the field names or a full schema. See the pyarrow.dataset.partitioning() function for more details.
use_pandas_metadata (bool, default False) – If True and file has custom pandas schema metadata, ensure that index columns are also loaded
use_legacy_dataset (bool, default True) – Set to False to enable the new code path (experimental, using the new Arrow Dataset API). Among other things, this allows to pass filters for all columns and not only the partition keys, enables different partitioning schemes, etc.
filesystem (FileSystem, default None) – If nothing passed, paths assumed to be found in the local on-disk filesystem.
filters (List[Tuple] or List[List[Tuple]] or None (default)) – Rows which do not match the filter predicate will be removed from scanned data. Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows. If use_legacy_dataset is True, filters can only reference partition keys and only a hive-style directory structure is supported. When setting use_legacy_dataset to False, also within-file level filtering and different partitioning schemes are supported. Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR). Predicates may also be passed as List[Tuple]. This form is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.
>>> import vinum as vn
>>> tbl = vn.read_parquet('users.parquet')
>>> tbl.sql_pd('select * from t limit 3')
    registration_dttm  id first_name  ...      usage                title
0 2016-02-03 07:55:29   1     Amanda  ...   49756.53     Internal Auditor
1 2016-02-03 17:04:03   2     Albert  ...  150280.17        Accountant IV
2 2016-02-03 01:09:31   3     Evelyn  ...  144972.51  Structural Engineer

[3 rows x 13 columns]
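As a sketch, column selection and partition filters can be passed through to pyarrow (the file, column names, and values here are hypothetical):

>>> import vinum as vn
>>> tbl = vn.read_parquet('trips.parquet',
...                       columns=['fare', 'year'],
...                       filters=[[('year', '=', 2010)]])
>>> res_tbl = tbl.sql('select avg(fare) from t')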