I have some data. 224,000 rows of it, in a SQLite database. I want to extract time series information from it to feed a data visualisation tool. Essentially, each row in the db is an event that has (among other things not strictly relevant) a time-date group in seconds since the epoch and a name responsible for it. I want to extract how many events each name has for every week in the db.
That's simple enough:
SELECT COUNT(*),
       name,
       strftime("%W:%Y", time, "unixepoch")
FROM events
GROUP BY strftime("%W:%Y", time, "unixepoch"), name
ORDER BY time
and we get about six thousand rows of data.
count  name   week:year
23     fudge  23:2009
etc...
But I don't want a row for each name in each week - I want a row for each name, and a column for each week, like this:
Name   23:2009  24:2009  25:2009
fudge  23       6        19
fish   1        0        12
etc...
Now, the monitoring process has been running for 69 weeks, and the count of unique names is 502. So clearly, I'm far from keen on any solution that involves hardcoding all the columns and still less the rows. I'm less unkeen on anything that involves iterating over the lot, say with python's executemany(), but I'm willing to accept it if necessary. SQL is meant to be set-wise, dammit.
A good approach in cases like this is not to push SQL to the point where it becomes convoluted and hard to understand and maintain. Let SQL do what it conveniently can and post-process the query results in Python.
Here's a cut-down version of a simple crosstab generator that I wrote. The full version delivers row/column/grand totals.
You'll note that it has built-in "group by" -- the original use-case was for summarising data obtained from Excel files using Python and xlrd.
The row_key and col_key that you supply don't need to be strings as in the example; they can be tuples -- e.g. (year, week) in your case -- or they could be integers -- e.g. you have a mapping of string column name to integer sort key.
import sys


class CrossTab(object):

    def __init__(self, missing=0):
        # `missing`: what to return for an empty cell. Alternatives: '', 0.0, None, 'NULL'
        self.missing = missing
        self.col_key_set = set()
        self.cell_dict = {}
        self.headings_OK = False

    def add_item(self, row_key, col_key, value):
        self.col_key_set.add(col_key)
        try:
            self.cell_dict[row_key][col_key] += value
        except KeyError:
            try:
                self.cell_dict[row_key][col_key] = value
            except KeyError:
                self.cell_dict[row_key] = {col_key: value}

    def _process_headings(self):
        if self.headings_OK:
            return
        self.row_headings = sorted(self.cell_dict.keys())
        self.col_headings = sorted(self.col_key_set)
        self.headings_OK = True

    def get_col_headings(self):
        self._process_headings()
        return self.col_headings

    def generate_row_info(self):
        self._process_headings()
        for row_key in self.row_headings:
            row_dict = self.cell_dict[row_key]
            row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
            yield row_key, row_vals

    def dump(self, f=None, header=None, footer=''):
        if f is None:
            f = sys.stdout
        if header is not None:
            print(header, file=f)
        for attr, value in sorted(self.__dict__.items()):
            print("%s: %r" % (attr, value), file=f)
        if footer is not None:
            print(footer, file=f)


if __name__ == "__main__":

    data = [
        ['Rob', 'Morn', 240],
        ['Rob', 'Aft', 300],
        ['Joe', 'Morn', 70],
        ['Joe', 'Aft', 80],
        ['Jill', 'Morn', 100],
        ['Jill', 'Aft', 150],
        ['Rob', 'Aft', 40],
        ['Rob', 'aft', 5],
        ['Dozy', 'Aft', 1],
        # Dozy doesn't show up till lunch-time
        ['Nemo', 'never', -1],
        ]
    NAME, TIME, AMOUNT = range(3)
    xlate_time = {'morn': "AM", "aft": "PM"}

    print()
    ctab = CrossTab(missing=None)
    # ctab.dump(header='=== after init ===')
    for s in data:
        ctab.add_item(
            row_key=s[NAME],
            col_key=xlate_time.get(s[TIME].lower(), "XXXX"),
            value=s[AMOUNT])
    # ctab.dump(header='=== after add_item ===')
    print(ctab.get_col_headings())
    # ctab.dump(header='=== after get_col_headings ===')
    for x in ctab.generate_row_info():
        print(x)
Output:
['AM', 'PM', 'XXXX']
('Dozy', [None, 1, None])
('Jill', [100, 150, None])
('Joe', [70, 80, None])
('Nemo', [None, None, -1])
('Rob', [240, 345, None])
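To tie this back to the question, here is a minimal sketch of how the grouped SQLite query could feed the CrossTab. The database filename is hypothetical, the table and column names come from the question, and a year-first "%Y:%W" format is used so that the column headings sort chronologically, in the spirit of the (year, week) tuple suggestion above:
import sqlite3

conn = sqlite3.connect('events.db')  # hypothetical filename
sql = """
    SELECT COUNT(*), name, strftime('%Y:%W', time, 'unixepoch')
    FROM events
    GROUP BY name, strftime('%Y:%W', time, 'unixepoch')
"""
ctab = CrossTab(missing=0)
for count, name, week in conn.execute(sql):
    ctab.add_item(row_key=name, col_key=week, value=count)

print(ctab.get_col_headings())
for name, row_vals in ctab.generate_row_info():
    print(name, row_vals)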
I would first do your query
SELECT COUNT(*),
       name,
       strftime("%W:%Y", time, "unixepoch")
FROM events
GROUP BY strftime("%W:%Y", time, "unixepoch"), name
ORDER BY time
and then do the post-processing with Python.
That way you iterate over 6,000 rows instead of 224,000. You can easily hold those 6,000 rows in memory for processing with Python; you could probably hold all 224,000 rows in memory too, but it would take quite a lot more memory.
However: newer versions of SQLite support the aggregate function group_concat. Maybe you can use that function for pivoting in SQL? I can't try it because I use an older version.
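For what it's worth, here is a rough, untested sketch of the group_concat idea (table and column names are taken from the question; 'events.db' is a hypothetical filename). It doesn't give you real columns, but it does pack each name's weekly counts onto a single row:
import sqlite3

conn = sqlite3.connect('events.db')  # hypothetical filename
sql = """
    SELECT name,
           group_concat(week || '=' || n, ', ')
    FROM (SELECT name,
                 strftime('%Y:%W', time, 'unixepoch') AS week,
                 COUNT(*) AS n
          FROM events
          GROUP BY name, strftime('%Y:%W', time, 'unixepoch'))
    GROUP BY name
"""
for name, weekly in conn.execute(sql):
    print(name, weekly)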
Related
I have a list of dictionaries read in from csv DictReader that represent rows of a csv file:
rows = [{"id":"123","date":"1/1/18","foo":"bar"},
{"id":"123","date":"2/2/18", "foo":"baz"}]
I would like to create a new dictionary where only unique IDs are stored, keeping only the row entry with the most recent date for each ID. Based on the above example, it would keep the row with date 2/2/18.
I was thinking of doing something like the code below, but I'm having trouble translating the pseudocode in the else branch into actual Python.
I can figure out how to check which of two dates is more recent, but I'm mostly stuck on how to find the dictionary in the new list that has the same id, and then retrieve its date.
Note: Unfortunately, due to resource constraints on our platform I am unable to use pandas for this project.
new_data = []
for row in rows:
    if row['id'] not in new_data:
        new_data.append(row)
    else:
        check the element in new_data with the same id as row['id']
        if that element's date value is less recent:
            replace it with the current row
        else:
            continue to next row in rows
You'll need a function to convert your date (as string) to a date (as date).
import datetime

def to_date(date_str):
    d1, m1, y1 = [int(s) for s in date_str.split('/')]
    return datetime.date(y1, m1, d1)
I assumed your date format is d/m/yy. Consider using datetime.strptime to parse your dates, as illustrated by Alex Hall's answer.
Then, the idea is to loop over your rows and store them in a new structure (here, a dict whose keys are the IDs). If a key already exists, compare its date with the current row, and take the right one. Following your pseudo-code, this leads to:
rows = [{"id":"123","date":"1/1/18","foo":"bar"},
{"id":"123","date":"2/2/18", "foo":"baz"}]
new_data = dict()
for row in rows:
existing = new_data.get(row['id'], None)
if existing is None or to_date(existing['date']) < to_date(row['date']):
new_data[row['id']] = row
If you want your new_data variable to be a list, use new_data = list(new_data.values()).
import datetime

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

def parse_date(d):
    return datetime.datetime.strptime(d, "%d/%m/%y").date()

tmp_dict = {}
for row in rows:
    if row['id'] not in tmp_dict:
        tmp_dict[row['id']] = row
    else:
        if parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']):
            tmp_dict[row['id']] = row

print(list(tmp_dict.values()))
Output:
[{'date': '2/2/18', 'foo': 'baz', 'id': '123'}]
Note: you can merge the two ifs into a single test, if row['id'] not in tmp_dict or parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']), for cleaner and shorter code.
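A minimal sketch of that merged form, reusing rows and parse_date from above:
tmp_dict = {}
for row in rows:
    # one combined test: unseen id, or a more recent date than the one already stored
    if (row['id'] not in tmp_dict
            or parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date'])):
        tmp_dict[row['id']] = row

print(list(tmp_dict.values()))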
Firstly, work with proper date objects, not strings. Here is how to parse them:
from datetime import datetime, date

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

for row in rows:
    row['date'] = datetime.strptime(row['date'], '%d/%m/%y').date()
(check if the format is correct)
Then for the actual task:
new_data = {}
for row in rows:
    new_data[row['id']] = max(new_data.get(row['id'], row),
                              row,
                              key=lambda r: r['date'])

print(new_data.values())
Alternatively:
Here are some generic utility functions that work well here which I use in many places:
from collections import defaultdict

def group_by_key_func(iterable, key_func):
    """
    Create a dictionary from an iterable such that the keys are the result of evaluating a key function on elements
    of the iterable and the values are lists of elements all of which correspond to the key.
    """
    result = defaultdict(list)
    for item in iterable:
        result[key_func(item)].append(item)
    return result

def group_by_key(iterable, key):
    return group_by_key_func(iterable, lambda x: x[key])
Then the solution can be written as:
by_id = group_by_key(rows, 'id')
for id_num, group in list(by_id.items()):
    by_id[id_num] = max(group, key=lambda r: r['date'])

print(by_id.values())
This is less efficient than the first solution because it creates lists along the way that are discarded, but I use the general principles in many places and I thought of it first, so here it is.
If you like to utilize classes as much as I do, then you could make your own class to do this:
from datetime import date

rows = [
    {"id": "123", "date": "1/1/18", "foo": "bar"},
    {"id": "123", "date": "2/2/18", "foo": "baz"},
    {"id": "456", "date": "3/3/18", "foo": "bar"},
    {"id": "456", "date": "1/1/18", "foo": "bar"}
]

class unique(dict):
    def __setitem__(self, key, value):
        # Add key if missing, or replace the entry if the new date is more recent
        if key not in self or self[key]["date"] < value["date"]:
            dict.__setitem__(self, key, value)

data = unique()  # Initialize new class based on dict
for row in rows:
    d, m, y = map(int, row["date"].split('/'))  # Split date into parts
    row["date"] = date(y, m, d)                 # Replace date string with a date object
    data[row["id"]] = row                       # Set new data; only overwrites the same id with a more recent date

print(list(data.values()))
Outputs:
[
{'date': datetime.date(18, 2, 2), 'foo': 'baz', 'id': '123'},
{'date': datetime.date(18, 3, 3), 'foo': 'bar', 'id': '456'}
]
Keep in mind that data is a dict that essentially overrides the __setitem__ method that uses IDs as keys. And the dates are date objects so they can be compared easily.
I'm looking to optimize the code below which takes ~5 seconds, which is too slow for a file of only 1000 lines.
I have a large file where each line contains valid JSON, with each JSON looking like the following (the actual data is much larger and nested, so I use this JSON snippet for illustration):
{"location":{"town":"Rome","groupe":"Advanced",
"school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
"id":"145",
"Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
"Father":{"FatherName":"Peter","FatherAge":"51"},
"Teacher":["MrCrock","MrDaniel"],"Field":"Marketing",
"season":["summer","spring"]}
I need to parse this file in order to extract only some key-values from every JSON, to obtain the resulting dataframe:
Groupe Id MotherName FatherName
Advanced 56 Laure James
Middle 11 Ann Nicolas
Advanced 6 Helen Franc
But some keys that I need in the dataframe are missing from some JSON objects, so I have to verify whether each key is present and, if not, fill the corresponding value with null (np.nan). I currently use the following method:
df = pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open(path/to/file) as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
I need to optimize the runtime over the whole 1000-row file to <= 2 seconds. In Perl the same parsing function takes < 1 second, but I need to implement it in Python.
You'll get the best performance if you can build the dataframe in a single step during initialization. DataFrame.from_records takes a sequence of tuples which you can supply from a generator that reads one record at a time. You can parse the data faster with get, which supplies a default value when the item isn't found. I created an empty dict called dummy to pass for intermediate gets so that a chained get always works.
I created a 1000-record dataset and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.
import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """Convert one JSON line into a record tuple for import."""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
df = pd.DataFrame.from_records(map(extract_data, open('file.json')),
                               columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time() - start)

#
# The original way
#
start = time.time()
df = pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time() - start)
The key part is not to append each row to the dataframe in the loop. You want to keep the collection in a list or dict container and then concatenate all of them at once. You can also simplify your if/else structure with a simple get that returns a default value (e.g. np.nan) if the item is not found in the dictionary.
with open(path/to/file) as f:
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))
df = pd.DataFrame(d)
Problem: Given the dataframe below, I'm trying to come up with the code that will apply a function to three distinct columns without having to write three separate function calls.
The code for the data:
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'days': [365, 365, 213, 318, 71],
        'spend_30day': [22, 241.5, 0, 27321.05, 345],
        'spend_90day': [22, 451.55, 64.32, 27321.05, 566.54],
        'spend_365day': [854.56, 451.55, 211.65, 27321.05, 566.54]}
df = pd.DataFrame(data)
cols = df.columns.tolist()
cols = ['name', 'days', 'spend_30day', 'spend_90day', 'spend_365day']
df = df[cols]
df
The function below will essentially annualize spend; if someone has fewer than, say, 365 days in the "days" column, the following function will tell me what the spend would have been if they had 365 days:
def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']
Then I apply the function to the particular column:
df.spend_365day = df.apply(annualize_spend_365, axis=1).round(2)
df
This works exactly as I want it to for that one column. However, I don't want to have to rewrite this for each of the three different "spend" columns (30, 90, 365). I want to be able to write code that will generalize and apply this function to multiple columns in one pass.
I thought I could create lists of the columns and their respective days, use the "zip" function, and nest the function in a for loop, but my attempt below ultimately fails:
spend_cols = [df.spend_30day, df.spend_90day, df.spend_365day]
days_list = [30, 90, 365]

for col, day in zip(spend_cols, days_list):
    def annualize_spend(row):
        if row.days/(float(day)) < 1:
            return (row.col)/((row.days)/float(day))
        else:
            return row.col
    col = df.apply(annualize_spend, axis = 1)
The error:
AttributeError: ("'Series' object has no attribute 'col'")
I'm not sure why the loop approach is failing. Regardless, I'm hoping for guidance on how to generalize function application in pandas. Thanks in advance!
Look at your two function definitions:
def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']
and
# col in [df.spend_30day, df.spend_90day, df.spend_365day]
def annualize_spend(row):
    if row.days/(float(day)) < 1:
        return (row.col)/((row.days)/float(day))
    else:
        return row.col
See the difference? In the first case you access the fields with explicit field names, and it works. In the second case you try to access row.col, which fails, because here col is your loop variable and holds whole columns (Series) of df rather than a field name. Instead, try
spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
before your loop. Also note that in the syntax row.days the field name is literally "days", whereas what you want in the second case is not a field literally called "col" but the field whose name is stored in the variable col, so use row[col] there as well. And anyway, I'm not sure how wise it is to use col as an output variable inside your loop over col.
I'm unfamiliar with pandas.DataFrame.apply, but it's probably possible to use a single function definition, which takes the number of days and the field of interest as input variables:
def annualize_spend(col, day, row):
    if row['days']/(float(day)) < 1:
        return (row[col])/((row['days'])/float(day))
    else:
        return row[col]

spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
days_list = [30, 90, 365]

for col, day in zip(spend_cols, days_list):
    df[col] = df.apply(lambda row, col=col, day=day: annualize_spend(col, day, row), axis=1)
The lambda ensures that only one input parameter of your function is left dangling free when it gets applied.
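Not part of the answer above, but worth noting: since the test and the division are plain arithmetic, a vectorized sketch without apply should also work here (reusing df, spend_cols and days_list from above, and assuming the days column has no zeros):
for col, day in zip(spend_cols, days_list):
    # fraction of the window actually observed, capped at 1 so complete windows stay unchanged
    observed = (df['days'] / float(day)).clip(upper=1)
    df[col] = (df[col] / observed).round(2)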
Could you please help with the following?
I need to extract data from a MySQL database and aggregate it.
There are two tables in the database, both with data at a different timestep.
I now need to make one new table (txt), where all data of table 1 are combined with table 2 data.
From table 2 I only need the data whose times correspond most closely to the timesteps of table 1.
For better understanding, see an example of the tables here:
https://www.dropbox.com/s/mo2q0hj72ilx05n/data%20aggregation.xlsx?dl=0
I already have a Python script which extracts the hexadecimal data and makes table 2.
I also have a script which makes table 1.
I need to combine both now.
Thank you very much for your advice!
After copying your data tables into Python lists, I had to split up the values in table 2 back into independent series. Overall you may be able to skip the step where you consolidate these values into the single table Table2.
The key to solving this is to write a simple class that implements __getitem__, taking a single key argument and returning the corresponding value. For instance, in the case of a regular Python dict, then __getitem__ returns the dict entry that exactly matches the key, or a KeyError if there is no match. In your case, I implemented __getitem__ to just return the entry with the minimum difference of the entry's timestamp from the given timestamp, in this line:
closest = min(self.data, key=lambda x: abs(x[0] - key))
(Left as an exercise to the OP - how to handle the case where the key falls exactly between two entries.) If you need to adjust the lookup logic, just change the implementation of __getitem__ - everything else in the code will remain the same.
Here is my sample implementation:
# t1 and t2 are lists of tab-delimited strings copy-pasted
# from the OP's spreadsheet
TAB = '\t'
t1data = [t.split(TAB) for t in t1]
t2data = [t.split(TAB) for t in t2]

# split each parameter into individual time,value pairs
readings = {'A': [], 'B': [], 'C': []}
for trec in t2data:
    t, a, b, c = trec
    t = int(t)
    if a: readings['A'].append((t, int(a)))
    if b: readings['B'].append((t, int(b)))
    if c: readings['C'].append((t, int(c)))

# define class for retrieving value with "closest" key if
# there is not an exact match
class LookupClosest(object):
    def __init__(self, pairs):
        self.data = pairs
    def __getitem__(self, key):
        # implement logic here to find closest matching item in series
        # TODO - what if key is exactly between two different values?
        closest = min(self.data, key=lambda x: abs(x[0] - key))
        return closest[1]

# convert each data series to LookupClosest
for key in "ABC":
    readings[key] = LookupClosest(readings[key])

# extract and display data
for vals in t1data:
    t = int(vals[0])
    gps = vals[1]
    a = readings['A'][t]
    b = readings['B'][t]
    c = readings['C'][t]
    rec = t, gps, a, b, c
    print(rec)
This prints the following (I modified the Table 1 data so that you can tell the difference from one record to the next):
( 1, 'x01', 1, 10, 44)
(10, 'x10', 2, 11, 47)
(21, 'x21', 4, 12, 45)
(30, 'x30', 3, 12, 44)
(41, 'x41', 4, 12, 47)
(52, 'x52', 2, 10, 48)
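One possible refinement, not part of the answer above: since each series is in timestamp order, the closest entry can be found with bisect instead of scanning the whole list, which matters if Table 2 grows large. A sketch (ties still go to the earlier entry, the same open question as the TODO above):
import bisect

class LookupClosestSorted(object):
    def __init__(self, pairs):
        self.data = sorted(pairs)              # (time, value) pairs in time order
        self.keys = [t for t, _ in self.data]  # separate key list for bisect
    def __getitem__(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == 0:
            return self.data[0][1]
        if i == len(self.keys):
            return self.data[-1][1]
        before, after = self.data[i - 1], self.data[i]
        # pick whichever neighbour is nearer; ties go to the earlier one
        return before[1] if key - before[0] <= after[0] - key else after[1]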
This is based on an example found here, but I guess I'm not understanding it. It works for single-column primary keys but fails on multiple ones.
This is my code
@classmethod
def column_windows(cls, q, columns, windowsize, where=None):
    """Return a series of WHERE clauses against
    a given column that break it into windows.

    Result is an iterable of tuples, consisting of
    ((start, end), whereclause), where (start, end) are the ids.

    Requires a database that supports window functions,
    i.e. Postgresql, SQL Server, Oracle.

    Enhance this yourself !  Add a "where" argument
    so that windows of just a subset of rows can
    be computed.
    """
    # Here is the thing... how to compare...
    def int_for_range(start_id, end_id):
        if end_id:
            return and_(
                columns >= start_id,
                columns < end_id
            )
        else:
            return columns >= start_id

    if isinstance(columns, Column):
        columns_k = (columns,)
    else:
        columns_k = tuple(columns)

    q2 = None
    cols = ()
    for c in columns:
        cols = cols + (c,)
        if not q2:
            q2 = q.session.query(c)
        else:
            q2 = q2.add_column(c)

    q2 = q2.add_column(func.row_number().over(order_by=columns_k).label('rownum'))
    q2 = q2.filter(q._criterion).from_self(cols)

    if windowsize > 1:
        q2 = q2.filter("rownum %% %d=1" % windowsize)

    for res in q2:
        print(res)

    intervals = [id for id, in q2]

    while intervals:
        start = intervals.pop(0)
        if intervals:
            end = intervals[0]
        else:
            end = None
        yield int_for_range(start, end)

@classmethod
def windowed_query(cls, q, columns, windowsize):
    """Break a Query into windows on a given column."""
    for whereclause in cls.column_windows(q, columns, windowsize):
        for row in q.filter(whereclause).order_by(columns):
            yield row
Now I have a problem when comparing the set of columns of the primary key. Well, I guess some kind of recursive clause-generating function should do it... Let's try it...
Well, the result is not what I expected, but I got it to work. Now it really windows any query, keeping everything in place: multi-column unique ordering and so on.
Here is my code; I hope it may be useful for someone else:
@classmethod
def window_query(cls, q, windowsize, windows=None):
    """
    q=Query object we want to window results
    windowsize=The number of elements each window has
    windows=The window, or window list, numbers: 1-based to query
    """
    windowselect = False
    if windows:
        if not isinstance(windows, list):
            windows = list(windows)
        windowselect = True

    # Appending u_columns to ordered counting subquery will ensure unique ordering
    u_columns = list([col for col in cls.getBestUniqueColumns()])
    # o_columns is the list of order by columns for the query
    o_columns = list([col for col in q._order_by])
    # we append columns from u_columns not in o_columns to ensure unique ordering but keeping the desired one
    sq_o_columns = list(o_columns)
    for col in u_columns:
        if not col in sq_o_columns:
            sq_o_columns.append(col)

    sub = None
    # we select unique columns in subquery that we'll need to join in parent query
    for col in u_columns:
        if not sub:
            sub = q.session.query(col)
        else:
            sub = sub.add_column(col)

    # Generate a tuple from the sq_o_columns list (I don't know why over() won't accept the list itself, TODO: more elegant)
    sq_o_col_tuple = ()
    for col in sq_o_columns:
        sq_o_col_tuple = sq_o_col_tuple + (col,)

    # we add row counting column, counting on generated combined ordering+unique columns tuple
    sub = sub.add_column(func.row_number().over(order_by=sq_o_col_tuple).label('rownum')).filter(q._criterion)
    # Prepare sub query to use as subquery (LOL)
    sub = sub.subquery('lacrn')

    # Prepare join ON clauses expression comparing unique columns defined by u_columns
    joinclause = expression.BooleanClauseList()
    for col in u_columns:
        joinclause = joinclause.__and__(col == sub.c[col.key])

    # Make the joining
    q = q.join(sub, joinclause)

    i = -1
    while True:
        # We try to query windows defined by windows list
        if windowselect:
            # We want selected-windows-results to be returned
            if windows:
                i = windows.pop(0) - 1
            else:
                break
        else:
            # We want all-windows-results to be returned
            i = i + 1
        res = q.filter(and_(sub.c.rownum > (i * windowsize), sub.c.rownum <= ((i + 1) * windowsize))).all()
        if not (res or windowselect):
            # We end an all-windows query because there are no more results; we must check whether this is a
            # selected-window query, because selected-window results may not exist and they are unordered.
            # EX: [1,2,9999999999999,3] : assuming the third page requested has no results, it will return pages 1, 2, and 3
            break
        for row in res:
            yield row
yield row