vinum
Vinum Table class.
Table represents a tabular dataset and provides SQL SELECT interface for data manipulation. Consists of a set of named columns of equal length. Essentially, is the same abstraction as a table in the relational databases world.
Provides minimum overhead transfer to and from Pandas DataFrame as well as Arrow Table, powered by Apache Arrow framework.
By invoking Table.from_* factory methods.
Table.from_*
vinum.read_csv(), vinum.read_parquet(), vinum.read_json().
vinum.read_csv()
vinum.read_parquet()
vinum.read_json()
By default, all the Numpy functions are available via ‘np.*’ namespace.
User Defined Function can be registered via vinum.register_python() or vinum.register_numpy().
vinum.register_python()
vinum.register_numpy()
arrow_table (pyarrow.Table) – Arrow Table containing the dataset
Examples
>>> import pyarrow as pa >>> import vinum as vn >>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]} >>> arrow_table = pa.Table.from_pydict(data) >>> tbl = vn.Table(arrow_table) >>> tbl.sql_pd('select * from t') col1 col2 0 1 7 1 2 13 2 3 17
>>> import pandas as pd >>> import vinum as vn >>> pdf = pd.DataFrame(data={'col1': [1, 2, 3], 'col2': [7, 13, 17]}) >>> tbl = vn.Table.from_pandas(pdf) >>> tbl.sql('select * from t') <vinum.core.table.Table object at 0x114cff7f0>
Notice that vinum.Table.sql() returns vinum.Table object type.
vinum.Table.sql()
vinum.Table
>>> tbl.sql_pd('select * from t') col1 col2 0 1 7 1 2 13 2 3 17
Notice that vinum.Table.sql_pd() returns pandas.DataFrame.
vinum.Table.sql_pd()
pandas.DataFrame
To register a Numpy UDF:
>>> import vinum as vn >>> vn.register_numpy('product', lambda x, y: x*y) >>> tbl.sql_pd('select product(col1, col2) from t') product 0 7 1 26 2 51
‘product’ UDF defined above, would perform vectorized multiplication on arrays, represented by columns ‘col1’ and ‘col2’.
schema
Return schema of the table.
Methods
explain(query[, print_query_tree])
explain
Returns a query plan as a string.
from_arrow(arrow_table)
from_arrow
Constructs a Table from pyarrow.Table
pyarrow.Table
from_pandas(data_frame)
from_pandas
Constructs a Table from pandas.DataFrame
from_pydict(pydict)
from_pydict
Constructs a Table from dict
dict
head(n)
head
Return first n rows of a table as Pandas DataFrame
sql(query)
sql
Executes SQL SELECT query on a Table and returns the result of the query.
sql_pd(query)
sql_pd
Executes SQL SELECT query on a Table and returns the result of the query as a Pandas DataFrame.
to_arrow()
to_arrow
Convert Table to ‘pyarrow.Table`.
to_pandas()
to_pandas
Convert Table to ‘pandas.DataFrame`.
to_string()
to_string
Return string representation of a Table.
query (str) – SQL SELECT query.
print_query_tree (bool, optional) – Set to True to also return an AST of the query.
str
Query Plan.
See also
Executes SQL SELECT query on a Table and returns the result of the query as a Pandas Table.
>>> import vinum as vn >>> tbl = vn.read_csv('taxi.csv') >>> print(tbl.explain('select to_int(fare_amount) fare, count(*) from t ' ... 'group by fare order by fare limit 3'))
Column: to_int_4304514592 Column: count_star_4556372144
Column: to_int_4304514592
Column: fare_amount
Operator: TableReaderOperator
arrow_table (pyarrow.Table object) –
Vinum Table instance.
>>> import pyarrow as pa >>> import vinum as vn >>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]} >>> arrow_table = pa.Table.from_pydict(data) >>> tbl = vn.Table.from_arrow(arrow_table) >>> tbl.sql_pd('select * from t') col1 col2 0 1 7 1 2 13 2 3 17
data_frame (pandas.DataFrame object) –
>>> import pandas as pd >>> import vinum as vn >>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]} >>> pdf = pd.DataFrame(data=data) >>> tbl = vn.Table.from_pandas(pdf) >>> tbl.sql_pd('select * from t') col1 col2 0 1 7 1 2 13 2 3 17
pydict (Python dictionary) –
>>> import vinum as vn >>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]} >>> tbl = vn.Table.from_pydict(data) >>> tbl.sql_pd('select * from t') col1 col2 0 1 7 1 2 13 2 3 17
n (int) – Number of first rows to return.
pyarrow.Schema
Notes
Only SELECT statements are supported. For SELECT statements, JOINs and subqueries are currently not supported. However, optimizations aside, one can run a subsequent query on the result of a query, to model the behaviour of subqueries.
Table name in ‘select * from table’ clause is ignored. The table of the underlying DataFrame is used to run a query.
User Defined Function can be registered via vinum.register_python() or vinum.register_numpy()
Using pandas dataframe:
>>> import pandas as pd >>> import vinum as vn >>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]} >>> pdf = pd.DataFrame(data=data) >>> tbl = vn.Table.from_pandas(pdf) >>> res_tbl = tbl.sql('select * from t') >>> res_tbl.to_pandas() col1 col2 0 1 7 1 2 13 2 3 17
Running queries on a csv file:
>>> import vinum as vn >>> tbl = vn.read_csv('taxi.csv') >>> res_tbl = tbl.sql('select key, fare_amount from t limit 3') >>> res_tbl.to_pandas() key fare_amount 0 2009-06-15 17:26:21.0000001 4.5 1 2010-01-05 16:52:16.0000002 16.9 2 2011-08-18 00:35:00.00000049 5.7
>>> import vinum as vn >>> tbl = vn.read_csv('taxi.csv') >>> res_tbl = tbl.sql('select to_int(fare_amount) fare, count(*) from t ' ... 'group by fare order by fare limit 3') >>> res_tbl.to_pandas() fare count_star 0 -5 1 1 -3 1 2 -2 4
This is a convience method which runs :method:`vinum.Table.sql` and then calls :method:`vinum.Table.to_pandas` on the result. Equivalent to:
>>> res_tbl = tbl.sql('select * from t') >>> res_tbl.to_pandas() col1 col2 0 1 7 1 2 13 2 3 17
Pandas DataFrame.
Only SELECT statements are supported. For SELECT statements, JOINs and subqueries are currently not supported. However, optimizations aside, one can run a subsequent query on the result of the query, to model the behaviour of subqueries.
Table name in select * from table clause is ignored. The table of the underlying Table object is used to run a query.
select * from table
>>> import vinum as vn >>> tbl = vn.read_csv('taxi.csv') >>> tbl.sql_pd('select key, passenger_count from t limit 3') key passenger_count 0 2009-06-15 17:26:21.0000001 1 1 2010-01-05 16:52:16.0000002 1 2 2011-08-18 00:35:00.00000049 2