Paris.py meetup
#7 - Paris
2015-09-16
Adrien Chauve @adrienchauve
Context:
Disclaimer:
Naïve sort: by Average Rating then by NbRatings
Title | Average Rating | NbRatings |
---|---|---|
Consuming Kids: The Commercialization of Childhood (2008) | 5 | 2 |
Catastroika (2012) | 5 | 2 |
Life On A String (Bian chang Bian Zou) (1991) | 5 | 1 |
Hijacking Catastrophe: 9/11, Fear & the Selling of American Empire (2004) | 5 | 1 |
Snow Queen, The (Lumikuningatar) (1986) | 5 | 1 |
Al otro lado (2004) | 5 | 1 |
Sierra, La (2005) | 5 | 1 |
Between the Devil and the Deep Blue Sea (1995) | 5 | 1 |
Schmatta: Rags to Riches to Rags (2009) | 5 | 1 |
Moth, The (Cma) (1980) | 5 | 1 |
Naïve sort: by NbRatings
Title | Average Rating | NbRatings |
---|---|---|
Pulp Fiction (1994) | 4.17 | 67310 |
Forrest Gump (1994) | 4.03 | 66172 |
Shawshank Redemption, The (1994) | 4.45 | 63366 |
Silence of the Lambs, The (1991) | 4.18 | 63299 |
Jurassic Park (1993) | 3.66 | 59715 |
Star Wars: Episode IV - A New Hope (1977) | 4.19 | 54502 |
Braveheart (1995) | 4.04 | 53769 |
Terminator 2: Judgment Day (1991) | 3.93 | 52244 |
Matrix, The (1999) | 4.19 | 51334 |
Schindler's List (1993) | 4.31 | 50054 |
Better sort: by custom rating (k=1000)
$$CustomRating_k = AverageRating * {NbRatings \over NbRatings + k}$$Title | Custom Rating k=1000 | Average Rating | NbRatings |
---|---|---|---|
Shawshank Redemption, The (1994) | 4.378 | 4.45 | 63366 |
Godfather, The (1972) | 4.262 | 4.36 | 41355 |
Usual Suspects, The (1995) | 4.244 | 4.33 | 47006 |
Schindler's List (1993) | 4.226 | 4.31 | 50054 |
Godfather: Part II, The (1974) | 4.125 | 4.28 | 27398 |
Fight Club (1999) | 4.124 | 4.23 | 40106 |
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) | 4.124 | 4.22 | 43295 |
Star Wars: Episode IV - A New Hope (1977) | 4.115 | 4.19 | 54502 |
Pulp Fiction (1994) | 4.113 | 4.17 | 67310 |
Silence of the Lambs, The (1991) | 4.112 | 4.18 | 63299 |
New computed column: $$CustomRating = AverageRating * {NbRatings \over NbRatings + 1000}$$
Using pandas (python):
# df is a pandas.DataFrame instance
df['CustomRating'] = df['AverageRating'] * df['NbRatings'] / (df['NbRatings'] + 1000)
In SQL:
SELECT AverageRating * NbRatings / (NbRatings + 1000) AS CustomRating FROM ...;
How to generate both pandas and SQL from a single string?
x + 42
>>> import ast
>>> ast.dump(ast.parse("x + 42", mode="eval")
Expression(body=BinOp(left=Name(id='x', ctx=Load()),
op=Add(),
right=Num(n=42))))
$$CustomRating = AverageRating * NbRatings / (NbRatings + 1000)$$
>>> ast.dump(ast.parse("AverageRating * NbRatings / (NbRatings + 1000)",
mode="eval"))
Expression(body=BinOp(left=BinOp(left=Name(id='AverageRating', ctx=Load()),
op=Mult(),
right=Name(id='NbRatings', ctx=Load())),
op=Div(),
right=BinOp(left=Name(id='NbRatings', ctx=Load()),
op=Add(),
right=Num(n=1000))))
Expression(body=BinOp(left=Name(id='x', ctx=Load()),
op=Add(),
right=Num(n=42)))
OPERATORS = {
ast.Add: operator.add,
ast.Mult: operator.mul,
ast.Div: operator.truediv,
}
def eval_expr(expr):
return _eval(ast.parse(expr, mode='eval').body)
def _eval(node): # recursively evaluate tree nodes
if isinstance(node, ast.Num):
return node.n
elif isinstance(node, ast.BinOp):
return OPERATORS[type(node.op)](_eval(node.left), _eval(node.right))
elif isinstance(node, ast.UnaryOp):
return OPERATORS[type(node.op)](_eval(node.operand))
elif isinstance(node, ast.Name):
return ???
raise TypeError(node)
class PandasEvaluator(object):
def __init__(self, dataframe):
self._dataframe = dataframe
def eval_expr(self, expr):
return self._eval(ast.parse(expr, mode='eval').body)
def _eval(self, node): # recursively evaluate tree nodes
if isinstance(node, ast.Num):
return node.n
elif isinstance(node, ast.BinOp):
return OPERATORS[type(node.op)](self._eval(node.left),
self._eval(node.right))
elif isinstance(node, ast.UnaryOp):
return OPERATORS[type(node.op)](self._eval(node.operand))
elif isinstance(node, ast.Name):
return self.dataframe[node.id]
raise TypeError(node)
df = pandas.read_csv('ratings.csv')
formula = "AverageRating * NbRatings / (NbRatings + 1000)"
df['CustomRating'] = PandasEvaluator(df).eval_expr(formula)
class SQLEvaluator(object):
def __init__(self, sql_table):
self._sql_table = sql_table # instance of SQLAlchemy Table class
def eval_expr(self, expr):
return self._eval(ast.parse(expr, mode='eval').body)
def _eval(self, node): # recursively evaluate tree nodes
if isinstance(node, ast.Num):
return node.n
elif isinstance(node, ast.BinOp):
return OPERATORS[type(node.op)](self._eval(node.left),
self._eval(node.right))
elif isinstance(node, ast.UnaryOp):
return OPERATORS[type(node.op)](self._eval(node.operand))
elif isinstance(node, ast.Name):
return self._sql_table[node.id]
raise TypeError(node)
session = sessionmaker(...)
sql_table = Table(...)
formula = "AverageRating * NbRatings / (NbRatings + 1000)"
custom_ratings_column = SQLEvaluator(sql_table).eval_expr(formula)
data = [row for row in session.query(custom_ratings_column)]
What we did so far:
Wait... there is more!
SELECT... CASE WHEN... ELSE ... END ... ;
Module(body=[
Print(dest=None,
values=[Str(s='Thank you! Questions?')],
nl=True)
])