Table

Vinum Table class.

class vinum.Table(arrow_table: pyarrow.lib.Table, reader=None)

Table represents a tabular dataset and provides SQL SELECT interface for data manipulation. Consists of a set of named columns of equal length. Essentially, is the same abstraction as a table in the relational databases world.

Provides minimum overhead transfer to and from Pandas DataFrame as well as Arrow Table, powered by Apache Arrow framework.

There are two major ways to instantiate Table:
  1. By invoking Table.from_* factory methods.

  2. By using convenice functions, such as:

    vinum.read_csv(), vinum.read_parquet(), vinum.read_json().

By default, all the Numpy functions are available via ‘np.*’ namespace.

User Defined Function can be registered via vinum.register_python() or vinum.register_numpy().

Parameters

arrow_table (pyarrow.Table) – Arrow Table containing the dataset

Examples

>>> import pyarrow as pa
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> arrow_table = pa.Table.from_pydict(data)
>>> tbl = vn.Table(arrow_table)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17
>>> import pandas as pd
>>> import vinum as vn
>>> pdf = pd.DataFrame(data={'col1': [1, 2, 3], 'col2': [7, 13, 17]})
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql('select * from t')
<vinum.core.table.Table object at 0x114cff7f0>

Notice that vinum.Table.sql() returns vinum.Table object type.

>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17

Notice that vinum.Table.sql_pd() returns pandas.DataFrame.

To register a Numpy UDF:

>>> import vinum as vn
>>> vn.register_numpy('product', lambda x, y: x*y)
>>> tbl.sql_pd('select product(col1, col2) from t')
   product
0        7
1       26
2       51

‘product’ UDF defined above, would perform vectorized multiplication on arrays, represented by columns ‘col1’ and ‘col2’.

Attributes
schema

Return schema of the table.

Methods

explain(query[, print_query_tree])

Returns a query plan as a string.

from_arrow(arrow_table)

Constructs a Table from pyarrow.Table

from_pandas(data_frame)

Constructs a Table from pandas.DataFrame

from_pydict(pydict)

Constructs a Table from dict

head(n)

Return first n rows of a table as Pandas DataFrame

sql(query)

Executes SQL SELECT query on a Table and returns the result of the query.

sql_pd(query)

Executes SQL SELECT query on a Table and returns the result of the query as a Pandas DataFrame.

to_arrow()

Convert Table to ‘pyarrow.Table`.

to_pandas()

Convert Table to ‘pandas.DataFrame`.

to_string()

Return string representation of a Table.

explain(query: str, print_query_tree=False)

Returns a query plan as a string.

Parameters
  • query (str) – SQL SELECT query.

  • print_query_tree (bool, optional) – Set to True to also return an AST of the query.

Returns

str

Return type

Query Plan.

See also

sql

Executes SQL SELECT query on a Table and returns the result of the query.

sql_pd

Executes SQL SELECT query on a Table and returns the result of the query as a Pandas Table.

Examples

>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> print(tbl.explain('select to_int(fare_amount) fare, count(*) from t '
...                   'group by fare order by fare limit 3'))
Query plan:
Operator: MaterializeTableOperator
Operator: SliceOperator
Operator: ProjectOperator

Column: to_int_4304514592 Column: count_star_4556372144

Operator: SortOperator

Column: to_int_4304514592

Operator: AggregateOperator
Operator: ProjectOperator
VectorizedExpression: IntCastFunction

Column: fare_amount

Operator: ProjectOperator

Column: fare_amount

Operator: TableReaderOperator

classmethod from_arrow(arrow_table: pyarrow.lib.Table)

Constructs a Table from pyarrow.Table

Parameters

arrow_table (pyarrow.Table object) –

Returns

Vinum Table instance.

Return type

vinum.Table

Examples

>>> import pyarrow as pa
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> arrow_table = pa.Table.from_pydict(data)
>>> tbl = vn.Table.from_arrow(arrow_table)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17
classmethod from_pandas(data_frame)

Constructs a Table from pandas.DataFrame

Parameters

data_frame (pandas.DataFrame object) –

Returns

Vinum Table instance.

Return type

vinum.Table

Examples

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17
classmethod from_pydict(pydict: Dict)

Constructs a Table from dict

Parameters

pydict (Python dictionary) –

Returns

Vinum Table instance.

Return type

vinum.Table

Examples

>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> tbl = vn.Table.from_pydict(data)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17
head(n: int)pd.DataFrame

Return first n rows of a table as Pandas DataFrame

Parameters

n (int) – Number of first rows to return.

Returns

Return type

pandas.DataFrame

property schema

Return schema of the table.

Returns

Return type

pyarrow.Schema

sql(query: str)

Executes SQL SELECT query on a Table and returns the result of the query.

Parameters

query (str) – SQL SELECT query.

Returns

Vinum Table instance.

Return type

vinum.Table

See also

sql_pd

Executes SQL SELECT query on a Table and returns the result of the query as a Pandas DataFrame.

Notes

Only SELECT statements are supported. For SELECT statements, JOINs and subqueries are currently not supported. However, optimizations aside, one can run a subsequent query on the result of a query, to model the behaviour of subqueries.

Table name in ‘select * from table’ clause is ignored. The table of the underlying DataFrame is used to run a query.

By default, all the Numpy functions are available via ‘np.*’ namespace.

User Defined Function can be registered via vinum.register_python() or vinum.register_numpy()

Examples

Using pandas dataframe:

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> res_tbl = tbl.sql('select * from t')
>>> res_tbl.to_pandas()
   col1  col2
0     1     7
1     2    13
2     3    17

Running queries on a csv file:

>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('select key, fare_amount from t limit 3')
>>> res_tbl.to_pandas()
                            key  fare_amount
0   2009-06-15 17:26:21.0000001          4.5
1   2010-01-05 16:52:16.0000002         16.9
2  2011-08-18 00:35:00.00000049          5.7
>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('select to_int(fare_amount) fare, count(*) from t '
...                   'group by fare order by fare limit 3')
>>> res_tbl.to_pandas()
   fare  count_star
0    -5           1
1    -3           1
2    -2           4
sql_pd(query: str)

Executes SQL SELECT query on a Table and returns the result of the query as a Pandas DataFrame.

This is a convience method which runs :method:`vinum.Table.sql` and then calls :method:`vinum.Table.to_pandas` on the result. Equivalent to:

>>> res_tbl = tbl.sql('select * from t')
>>> res_tbl.to_pandas()
   col1  col2
0     1     7
1     2    13
2     3    17
Parameters

query (str) – SQL SELECT query.

Returns

Pandas DataFrame.

Return type

pandas.DataFrame

See also

sql

Executes SQL SELECT query on a Table and returns the result of the query.

Notes

Only SELECT statements are supported. For SELECT statements, JOINs and subqueries are currently not supported. However, optimizations aside, one can run a subsequent query on the result of the query, to model the behaviour of subqueries.

Table name in select * from table clause is ignored. The table of the underlying Table object is used to run a query.

Examples

Using pandas dataframe:

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql_pd('select * from t')
   col1  col2
0     1     7
1     2    13
2     3    17

Running queries on a csv file:

>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> tbl.sql_pd('select key, passenger_count from t limit 3')
                            key  passenger_count
0   2009-06-15 17:26:21.0000001                1
1   2010-01-05 16:52:16.0000002                1
2  2011-08-18 00:35:00.00000049                2
>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('select to_int(fare_amount) fare, count(*) from t '
...                   'group by fare order by fare limit 3')
>>> res_tbl.to_pandas()
   fare  count_star
0    -5           1
1    -3           1
2    -2           4
to_arrow()pyarrow.lib.Table

Convert Table to ‘pyarrow.Table`.

Returns

Return type

pyarrow.Table

to_pandas()

Convert Table to ‘pandas.DataFrame`.

Returns

Return type

pandas.DataFrame

to_string()str

Return string representation of a Table.

Returns

Return type

str