Table¶

Vinum Table class.

class vinum.Table(arrow_table: pyarrow.lib.Table, reader=None)¶

Table represents a tabular dataset and provides SQL SELECT interface for data manipulation. Consists of a set of named columns of equal length. Essentially, is the same abstraction as a table in the relational databases world.

Provides minimum overhead transfer to and from Pandas DataFrame as well as Arrow Table, powered by Apache Arrow framework.

There are two major ways to instantiate Table:

By invoking Table.from_* factory methods.
By using convenice functions, such as:
vinum.read_csv(), vinum.read_parquet(), vinum.read_json().

By default, all the Numpy functions are available via ‘np.*’ namespace.

User Defined Function can be registered via vinum.register_python() or vinum.register_numpy().

Parameters: arrow_table (pyarrow.Table) – Arrow Table containing the dataset

Examples

>>> import pyarrow as pa
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> arrow_table = pa.Table.from_pydict(data)
>>> tbl = vn.Table(arrow_table)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17

>>> import pandas as pd
>>> import vinum as vn
>>> pdf = pd.DataFrame(data={'col1': [1, 2, 3], 'col2': [7, 13, 17]})
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql('select * from t')
<vinum.core.table.Table object at 0x114cff7f0>

Notice that vinum.Table.sql() returns vinum.Table object type.

>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17

Notice that vinum.Table.sql_pd() returns pandas.DataFrame.

To register a Numpy UDF:

>>> import vinum as vn
>>> vn.register_numpy('product', lambda x, y: x*y)
>>> tbl.sql_pd('select product(col1, col2) from t')
   product
0        7
1       26
2       51

‘product’ UDF defined above, would perform vectorized multiplication on arrays, represented by columns ‘col1’ and ‘col2’.

Attributes

schema: Return schema of the table.

Methods

`explain`(query[, print_query_tree])	Returns a query plan as a string.
`from_arrow`(arrow_table)	Constructs a Table from `pyarrow.Table`
`from_pandas`(data_frame)	Constructs a Table from `pandas.DataFrame`
`from_pydict`(pydict)	Constructs a Table from `dict`
`head`(n)	Return first n rows of a table as Pandas DataFrame
`sql`(query)	Executes SQL SELECT query on a Table and returns the result of the query.
`sql_pd`(query)	Executes SQL SELECT query on a Table and returns the result of the query as a Pandas DataFrame.
`to_arrow`()	Convert Table to ‘pyarrow.Table`.
`to_pandas`()	Convert Table to ‘pandas.DataFrame`.
`to_string`()	Return string representation of a Table.

explain(query: str, print_query_tree=False)¶

Returns a query plan as a string.

Parameters

query (str) – SQL SELECT query.
print_query_tree (bool, optional) – Set to True to also return an AST of the query.

Returns

str

Return type

Query Plan.

See also

sql: Executes SQL SELECT query on a Table and returns the result of the query.
sql_pd: Executes SQL SELECT query on a Table and returns the result of the query as a Pandas Table.

Examples

>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> print(tbl.explain('select to_int(fare_amount) fare, count(*) from t '
...                   'group by fare order by fare limit 3'))

Query plan:

Operator: MaterializeTableOperator

Operator: SliceOperator

Operator: ProjectOperator

Column: to_int_4304514592 Column: count_star_4556372144

Operator: SortOperator

Column: to_int_4304514592

Operator: AggregateOperator

Operator: ProjectOperator

VectorizedExpression: IntCastFunction: Column: fare_amount
Operator: ProjectOperator: Column: fare_amount

Operator: TableReaderOperator

classmethod from_arrow(arrow_table: pyarrow.lib.Table)¶

Constructs a Table from pyarrow.Table

Parameters: arrow_table (pyarrow.Table object) –
Returns: Vinum Table instance.
Return type: vinum.Table

Examples

>>> import pyarrow as pa
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> arrow_table = pa.Table.from_pydict(data)
>>> tbl = vn.Table.from_arrow(arrow_table)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17

classmethod from_pandas(data_frame)¶

Constructs a Table from pandas.DataFrame

Parameters: data_frame (pandas.DataFrame object) –
Returns: Vinum Table instance.
Return type: vinum.Table

Examples

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17

classmethod from_pydict(pydict: Dict)¶

Constructs a Table from dict

Parameters: pydict (Python dictionary) –
Returns: Vinum Table instance.
Return type: vinum.Table

Examples

>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> tbl = vn.Table.from_pydict(data)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17

head(n: int) → pd.DataFrame¶

Return first n rows of a table as Pandas DataFrame

Parameters: n (int) – Number of first rows to return.
Returns
Return type: pandas.DataFrame

property schema¶

Return schema of the table.

Returns
Return type: pyarrow.Schema

sql(query: str)¶

Executes SQL SELECT query on a Table and returns the result of the query.

Parameters: query (str) – SQL SELECT query.
Returns: Vinum Table instance.
Return type: vinum.Table

See also

sql_pd: Executes SQL SELECT query on a Table and returns the result of the query as a Pandas DataFrame.

Notes

Only SELECT statements are supported. For SELECT statements, JOINs and subqueries are currently not supported. However, optimizations aside, one can run a subsequent query on the result of a query, to model the behaviour of subqueries.

Table name in ‘select * from table’ clause is ignored. The table of the underlying DataFrame is used to run a query.

By default, all the Numpy functions are available via ‘np.*’ namespace.

User Defined Function can be registered via vinum.register_python() or vinum.register_numpy()

Examples

Using pandas dataframe:

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> res_tbl = tbl.sql('select * from t')
>>> res_tbl.to_pandas()
   col1  col2
0     1     7
1     2    13
2     3    17

Running queries on a csv file:

>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('select key, fare_amount from t limit 3')
>>> res_tbl.to_pandas()
                            key  fare_amount
0   2009-06-15 17:26:21.0000001          4.5
1   2010-01-05 16:52:16.0000002         16.9
2  2011-08-18 00:35:00.00000049          5.7

>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('select to_int(fare_amount) fare, count(*) from t '
...                   'group by fare order by fare limit 3')
>>> res_tbl.to_pandas()
   fare  count_star
0    -5           1
1    -3           1
2    -2           4

sql_pd(query: str)¶

Executes SQL SELECT query on a Table and returns the result of the query as a Pandas DataFrame.

This is a convience method which runs :method:`vinum.Table.sql` and then calls :method:`vinum.Table.to_pandas` on the result. Equivalent to:

>>> res_tbl = tbl.sql('select * from t')
>>> res_tbl.to_pandas()
   col1  col2
0     1     7
1     2    13
2     3    17

Parameters: query (str) – SQL SELECT query.
Returns: Pandas DataFrame.
Return type: pandas.DataFrame