Building generic data queries using Python AST

Building generic data queries

using Python AST

Paris.py meetup #7 - Paris
2015-09-16

Adrien Chauve @adrienchauve

@Serenytics

Building generic data queries: why?
Python AST to the rescue
Walking the AST to build data queries

1. Building generic data queries: Why?

Context:

You love data
You want to watch a movie
You know the MovieLens database (20M ratings on 27k movies by 138k users)

Disclaimer:

could also be bigger data (sales) but less sexy!
data could be stored on a SQL server instead of a CSV file

1. Context: Select the best movie (1/3)

Naïve sort: by Average Rating then by NbRatings

Title	Average Rating	NbRatings
Consuming Kids: The Commercialization of Childhood (2008)	5	2
Catastroika (2012)	5	2
Life On A String (Bian chang Bian Zou) (1991)	5	1
Hijacking Catastrophe: 9/11, Fear & the Selling of American Empire (2004)	5	1
Snow Queen, The (Lumikuningatar) (1986)	5	1
Al otro lado (2004)	5	1
Sierra, La (2005)	5	1
Between the Devil and the Deep Blue Sea (1995)	5	1
Schmatta: Rags to Riches to Rags (2009)	5	1
Moth, The (Cma) (1980)	5	1

1. Context: Select the best movie (2/3)

Naïve sort: by NbRatings

Title	Average Rating	NbRatings
Pulp Fiction (1994)	4.17	67310
Forrest Gump (1994)	4.03	66172
Shawshank Redemption, The (1994)	4.45	63366
Silence of the Lambs, The (1991)	4.18	63299
Jurassic Park (1993)	3.66	59715
Star Wars: Episode IV - A New Hope (1977)	4.19	54502
Braveheart (1995)	4.04	53769
Terminator 2: Judgment Day (1991)	3.93	52244
Matrix, The (1999)	4.19	51334
Schindler's List (1993)	4.31	50054

1. Context: Select the best movie (3/3)

Better sort: by custom rating (k=1000)

$$CustomRating_k = AverageRating * {NbRatings \over NbRatings + k}$$

Title	Custom Rating k=1000	Average Rating	NbRatings
Shawshank Redemption, The (1994)	4.378	4.45	63366
Godfather, The (1972)	4.262	4.36	41355
Usual Suspects, The (1995)	4.244	4.33	47006
Schindler's List (1993)	4.226	4.31	50054
Godfather: Part II, The (1974)	4.125	4.28	27398
Fight Club (1999)	4.124	4.23	40106
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)	4.124	4.22	43295
Star Wars: Episode IV - A New Hope (1977)	4.115	4.19	54502
Pulp Fiction (1994)	4.113	4.17	67310
Silence of the Lambs, The (1991)	4.112	4.18	63299

1. Need computed columns to best analyze your data

New computed column: $$CustomRating = AverageRating * {NbRatings \over NbRatings + 1000}$$

Using pandas (python):


# df is a pandas.DataFrame instance
df['CustomRating'] = df['AverageRating'] * df['NbRatings'] / (df['NbRatings'] + 1000)

In SQL:


SELECT AverageRating * NbRatings / (NbRatings + 1000) AS CustomRating FROM ...;

How to generate both pandas and SQL from a single string?

2. Python AST to the rescue

2. AST: What is it?

Abstract Syntax Tree
represents your code as a tree object


                x + 42

2. AST: What is it?

represents your code as a tree object


>>> import ast
>>> ast.dump(ast.parse("x + 42", mode="eval")

Expression(body=BinOp(left=Name(id='x', ctx=Load()),
                      op=Add(),
                      right=Num(n=42))))

2. AST: What is it?

$$CustomRating = AverageRating * NbRatings / (NbRatings + 1000)$$


>>> ast.dump(ast.parse("AverageRating * NbRatings / (NbRatings + 1000)",
                       mode="eval"))


Expression(body=BinOp(left=BinOp(left=Name(id='AverageRating', ctx=Load()),
                                 op=Mult(),
                                 right=Name(id='NbRatings', ctx=Load())),
                      op=Div(),
                      right=BinOp(left=Name(id='NbRatings', ctx=Load()),
                                  op=Add(),
                                  right=Num(n=1000))))

3. Walking the AST to build data queries

3. AST: Great, but what can we do with it?


Expression(body=BinOp(left=Name(id='x', ctx=Load()),
                      op=Add(),
                      right=Num(n=42)))


OPERATORS = {
    ast.Add: operator.add,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def eval_expr(expr):
    return _eval(ast.parse(expr, mode='eval').body)

def _eval(node):  # recursively evaluate tree nodes
    if isinstance(node, ast.Num):
        return node.n
    elif isinstance(node, ast.BinOp):
        return OPERATORS[type(node.op)](_eval(node.left), _eval(node.right))
    elif isinstance(node, ast.UnaryOp):
        return OPERATORS[type(node.op)](_eval(node.operand))
    elif isinstance(node, ast.Name):
        return ???
    raise TypeError(node)

3. AST: Building a pandas query


class PandasEvaluator(object):
    def __init__(self, dataframe):
        self._dataframe = dataframe

    def eval_expr(self, expr):
        return self._eval(ast.parse(expr, mode='eval').body)

    def _eval(self, node):  # recursively evaluate tree nodes
        if isinstance(node, ast.Num):
            return node.n
        elif isinstance(node, ast.BinOp):
            return OPERATORS[type(node.op)](self._eval(node.left),
                                            self._eval(node.right))
        elif isinstance(node, ast.UnaryOp):
            return OPERATORS[type(node.op)](self._eval(node.operand))
        elif isinstance(node, ast.Name):
            return self.dataframe[node.id]
        raise TypeError(node)

df = pandas.read_csv('ratings.csv')
formula = "AverageRating * NbRatings / (NbRatings + 1000)"
df['CustomRating'] = PandasEvaluator(df).eval_expr(formula)

3. AST: Building a SQL query using SQLAlchemy


class SQLEvaluator(object):
    def __init__(self, sql_table):
        self._sql_table = sql_table  # instance of SQLAlchemy Table class

    def eval_expr(self, expr):
        return self._eval(ast.parse(expr, mode='eval').body)

    def _eval(self, node):  # recursively evaluate tree nodes
        if isinstance(node, ast.Num):
            return node.n
        elif isinstance(node, ast.BinOp):
            return OPERATORS[type(node.op)](self._eval(node.left),
                                            self._eval(node.right))
        elif isinstance(node, ast.UnaryOp):
            return OPERATORS[type(node.op)](self._eval(node.operand))
        elif isinstance(node, ast.Name):
            return self._sql_table[node.id]
        raise TypeError(node)

session = sessionmaker(...)
sql_table = Table(...)
formula = "AverageRating * NbRatings / (NbRatings + 1000)"
custom_ratings_column = SQLEvaluator(sql_table).eval_expr(formula)
data = [row for row in session.query(custom_ratings_column)]

Building generic data queries using Python AST

What we did so far:

Enter a formula as a string
Parse it and generate the AST using python ast.parse
Use AST evaluators to build pandas and SQL new columns
In just ~20 lines of code!

Wait... there is more!

Add support for python "> < = and or not" operators
Use SqlAlchemy DSL to generate conditional queries:
```
SELECT... CASE WHEN... ELSE ... END ... ;
```
Use numpy masks to do the same with pandas dataframe

Great links

The Hacker's Guide to Python
Great Python book by Julien Danjou; has a part on AST
MovieLens database by grouplens
A detailed HowTo to add new computed columns in your dataset using Serenytics (including conditional formulas)


Module(body=[
    Print(dest=None,
          values=[Str(s='Thank you! Questions?')],
          nl=True)
])

@adrienchauve
adrien.chauve@serenytics.com

Building generic data queries

using Python AST

Contents

1. Building generic data queries: Why?

1. Building generic data queries: Why?

1. Context: Select the best movie (1/3)

1. Context: Select the best movie (2/3)

1. Context: Select the best movie (3/3)

1. Need computed columns to best analyze your data

2. Python AST to the rescue

2. AST: What is it?

2. AST: What is it?

2. AST: What is it?

3. Walking the AST to build data queries

3. AST: Great, but what can we do with it?

3. AST: Building a pandas query

3. AST: Building a SQL query using SQLAlchemy

Building generic data queries using Python AST

Great links