combine 2 tables and aggregate data - python

Could you please help with the following?
I need to extract data from a MySQL database and aggregate it.
There are two tables in the database, and both of them hold data at a different timestep.
I now need to make one new table (txt) in which all data of table 1 are combined with table 2 data.
For each timestep of table 1, I only need the table 2 data with the closest corresponding time.
For better understanding, see an example of the tables here:
https://www.dropbox.com/s/mo2q0hj72ilx05n/data%20aggregation.xlsx?dl=0
I already have Python code which extracts the hexadecimal data and builds table 2.
I also have code which builds table 1.
I now need to combine both.
Thank you very much for your advice!

After copying your data tables into Python lists, I had to split the values in table 2 back into independent series. You may well be able to skip the step where you consolidate these values into the single table Table2.
The key to solving this is to write a simple class that implements __getitem__, taking a single key argument and returning the corresponding value. In the case of a regular Python dict, __getitem__ returns the dict entry that exactly matches the key, or raises a KeyError if there is no match. In your case, I implemented __getitem__ to return the entry whose timestamp has the minimum absolute difference from the given timestamp, in this line:
closest = min(self.data, key=lambda x: abs(x[0]-key))
(Left as an exercise for the OP: how to handle the case where the key falls exactly between two entries; one possibility is sketched below.) If you need to adjust the lookup logic, just change the implementation of __getitem__ - everything else in the code stays the same.
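One possible tie-break, as a minimal sketch (assuming you want the earlier reading to win; adjust to taste): compare (distance, timestamp) tuples inside __getitem__, so that among equally distant entries the one with the smaller timestamp is preferred.
# hypothetical tie-break: among equally close entries, prefer the earlier timestamp
closest = min(self.data, key=lambda x: (abs(x[0] - key), x[0]))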
Here is my sample implementation:
# t1 and t2 are lists of tab-delimited strings copy-pasted
# from the OP's spreadsheet
TAB = '\t'
t1data = [t.split(TAB) for t in t1]
t2data = [t.split(TAB) for t in t2]

# split each parameter into individual (time, value) pairs;
# empty cells are skipped
readings = {'A': [], 'B': [], 'C': []}
for trec in t2data:
    t, a, b, c = trec
    t = int(t)
    if a: readings['A'].append((t, int(a)))
    if b: readings['B'].append((t, int(b)))
    if c: readings['C'].append((t, int(c)))
# define class for retrieving value with "closest" key if
# there is not an exact match
class LookupClosest(object):
    def __init__(self, pairs):
        self.data = pairs
    def __getitem__(self, key):
        # find the item in the series whose time is closest to the key
        # TODO - what if key is exactly between two different values?
        closest = min(self.data, key=lambda x: abs(x[0] - key))
        return closest[1]

# convert each data series to a LookupClosest
for key in "ABC":
    readings[key] = LookupClosest(readings[key])
# extract and display data
for vals in t1data:
    t = int(vals[0])
    gps = vals[1]
    a = readings['A'][t]
    b = readings['B'][t]
    c = readings['C'][t]
    rec = t, gps, a, b, c
    print(rec)
This prints (I modified the Table1 data so that you can tell the difference from one record to the next):
( 1, 'x01', 1, 10, 44)
(10, 'x10', 2, 11, 47)
(21, 'x21', 4, 12, 45)
(30, 'x30', 3, 12, 44)
(41, 'x41', 4, 12, 47)
(52, 'x52', 2, 10, 48)
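Since the goal is a new txt table, the same records can be written tab-delimited to a file instead of printed; a minimal sketch (the filename combined.txt is just a placeholder):
# sketch: write the combined records to a tab-delimited text file
with open('combined.txt', 'w') as out:
    for vals in t1data:
        t = int(vals[0])
        gps = vals[1]
        rec = (t, gps, readings['A'][t], readings['B'][t], readings['C'][t])
        out.write(TAB.join(str(v) for v in rec) + '\n')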

Related

Python Replace values in list with dict

I have two variables whose data I am trying to manipulate. The first is a list that has two items:
row = [['Toyyota', 'Cammry', '3000'], ['Foord', 'Muustang', '6000']]
And a dictionary that has submissions
submission = {
    'extracted1_1': 'Toyota', 'extracted1_2': 'Camry', 'extracted1_3': '1000',
    'extracted2_1': 'Ford', 'extracted2_2': 'Mustang', 'extracted2_3': '5000',
    'reportDate': '2022-06-01T08:30', 'reportOwner': 'John Smith'}
extracted1_1 would match up with the first value in the first item of row, extracted1_2 would be the 2nd value in the 1st item, extracted2_1 would be the 1st value in the 2nd item, and so on. I'm trying to update row with the corresponding submission values and having a hard time getting it to work properly.
Here's what I have currently:
iter_bit = iter(submission.values())
for bit in row:
    i = 0
    for bits in bit:
        bit[i] = next(iter_bit)
        i += 1
While this somewhat works, I'm looking for a more efficient way to do this by looping through the submission rather than the row. Is there an easier or more efficient way to loop through the submission and overwrite the corresponding values in row?
Iterate through submission and check if the key is in the format extractedX_Y. If it is, use X and Y as the indexes into row and assign the value there.
import re

regex = re.compile(r'^extracted(\d+)_(\d+)$')
for key, value in submission.items():
    m = regex.search(key)
    if m:
        x = int(m.group(1))
        y = int(m.group(2))
        row[x-1][y-1] = value
It seems you are trying to convert the portion of the keys after "extracted" into indices into row. To do this, first slice off the portion you don't need (i.e. "extracted"), then split what remains on "_". Then convert each of these strings to an integer, subtracting 1 because Python indices are zero-based.
for key, value in submission.items():
    # e.g. key = 'extracted1_1', value = 'Toyota'
    if not key.startswith("extracted"):
        continue
    indices = [int(i) - 1 for i in key[9:].split("_")]
    # e.g. indices = [0, 0]
    # Set the value
    row[indices[0]][indices[1]] = value
Now you have your modified row:
[['Toyota', 'Camry', '1000'], ['Ford', 'Mustang', '5000']]
No clue if it's faster, but it's a 2-liner hahaha
for n, val in zip(range(len(row) * 3), submission.values()):
    row[n//3][n%3] = val
That said, I would probably do something safer in a work environment, like parsing the key for its index.

python cut between partitioned column results

I use the code below in Spark Scala to get the partitioned columns.
scala> val part_cols= spark.sql(" describe extended work.quality_stat ").select("col_name").as[String].collect()
part_cols: Array[String] = Array(x_bar, p1, p5, p50, p90, p95, p99, x_id, y_id, # Partition Information, # col_name, x_id, y_id, "", # Detailed Table Information, Database, Table, Owner, Created Time, Last Access, Created By, Type, Provider, Table Properties, Location, Serde Library, InputFormat, OutputFormat, Storage Properties, Partition Provider)
scala> part_cols.takeWhile( x => x.length()!= 0 ).reverse.takeWhile( x => x != "# col_name" )
res20: Array[String] = Array(x_id, y_id)
and I need to get similar output in Python. I'm struggling to replicate the same array operation in Python to get [y_id, x_id].
Below is what I tried.
>>> part_cols=spark.sql(" describe extended work.quality_stat ").select("col_name").collect()
Is this possible using Python?
part_cols in the question is an array of rows, so the first step is to convert it into an array of strings.
part_cols = spark.sql(...).select("col_name").collect()
part_cols = [row['col_name'] for row in part_cols]
Now the start and end of the part of the array that you are interested in can be calculated with
start_index = part_cols.index("# col_name") + 1
end_index = part_cols.index('', start_index)
Finally, a slice can be extracted from the list with these two values as start and end:
part_cols[start_index:end_index]
This slice will contain the values
['x_id', 'y_id']
If the output really should be reversed, the slice
part_cols[end_index-1:start_index-1:-1]
will contain the values
['y_id', 'x_id']
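Putting the pieces together, an end-to-end sketch (same table name as in the question):
part_cols = spark.sql("describe extended work.quality_stat").select("col_name").collect()
part_cols = [row['col_name'] for row in part_cols]
start_index = part_cols.index('# col_name') + 1
end_index = part_cols.index('', start_index)
print(part_cols[start_index:end_index])   # ['x_id', 'y_id']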

Create Json file from Python dataframe with grouping on one col and making column name as key with unique values as a list inside the key

Create the pandas DataFrame. My data frame is like this:
import pandas as pd

data = [[6, 1, "False", "var_1"], [6, 1, "False", "var_2"], [7, 1, "False", "var_3"]]
df = pd.DataFrame(data, columns=['CONSTRAINT_ID', 'CONSTRAINT_NODE_ID', 'PRODUCT_GRAIN', 'LEFT_SIDE_TYPE'])
Expected output JSON:
I want to group by column CONSTRAINT_ID, and the keys should be natural numbers (an index). The LEFT_SIDE_TYPE column values should come in a list:
{
  "1": {"CONSTRAINT_NODE_ID": [1],
        "product_grain": False,
        "left_side_type": ["Variable_1", "Variable_2"],
       },
  "2": {"CONSTRAINT_NODE_ID": [2],
        "product_grain": False,
        "left_side_type": ["Variable_3"],
       }
}
This is likely not the most efficient solution; however, provided a df in the format specified in your original question, the function below will return a str containing a valid JSON string with the desired structure and values.
It filters the df by CONSTRAINT_ID, iterating across each unique value and building a JSON object with a key 1...n and the desired values from your original question in the response variable. This implementation uses set structures to store values during iteration, which avoids duplicates, and converts them to list instances before they are added to the response.
import json

def generate_response(df):
    response = dict()
    constraints = df['CONSTRAINT_ID'].unique()
    for i, c in enumerate(constraints):
        temp = {'CONSTRAINT_NODE_ID': set(), 'PRODUCT_GRAIN': None, 'LEFT_SIDE_TYPE': set()}
        for _, row in df[df['CONSTRAINT_ID'] == c].iterrows():
            # cast to a plain int so json.dumps can serialize it
            temp['CONSTRAINT_NODE_ID'].add(int(row['CONSTRAINT_NODE_ID']))
            temp['PRODUCT_GRAIN'] = row['PRODUCT_GRAIN']
            temp['LEFT_SIDE_TYPE'].add(row['LEFT_SIDE_TYPE'])
        temp['CONSTRAINT_NODE_ID'] = list(temp['CONSTRAINT_NODE_ID'])
        temp['LEFT_SIDE_TYPE'] = list(temp['LEFT_SIDE_TYPE'])
        response[str(i + 1)] = temp
    return json.dumps(response, indent=4)
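For example, calling it on the sample df prints a JSON string with two keys: "1" (CONSTRAINT_ID 6, with left side types var_1 and var_2) and "2" (CONSTRAINT_ID 7, with var_3). The order of elements inside the lists may vary, since sets are unordered.
print(generate_response(df))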

create a filtered list of dictionaries based on existing list of dictionaries

I have a list of dictionaries, read in from csv DictReader, that represent rows of a csv file:
rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]
I would like to create a new dictionary where only unique IDs are stored, keeping only the row entry with the most recent date. Based on the above example, it would keep the row with date 2/2/18.
I was thinking of doing something like the following, but I'm having trouble translating the pseudocode in the else statement into actual Python.
I can figure out the part about checking which of two dates is more recent, but I'm having the most trouble figuring out how to check the new list for a dictionary that contains the same id, and then retrieve the date from that row.
Note: Unfortunately, due to resource constraints on our platform I am unable to use pandas for this project.
new_data = []
for row in rows:
    if row['id'] not in new_data:
        new_data.append(row)
    else:
        check the element in new_data with the same id as row['id']
        if that element's date value is less recent:
            replace it with the current row
        else:
            continue to next row in rows
You'll need a function to convert your date (as a string) to a date (as a date object).
import datetime

def to_date(date_str):
    d1, m1, y1 = [int(s) for s in date_str.split('/')]
    # assume the two-digit year means 20xx
    return datetime.date(2000 + y1, m1, d1)
I assumed your date format is d/m/yy. Consider using datetime.strptime to parse your dates, as illustrated by Alex Hall's answer.
Then the idea is to loop over your rows and store them in a new structure (here, a dict whose keys are the IDs). If a key already exists, compare its date with the current row's, and keep the right one. Following your pseudo-code, this leads to:
rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

new_data = dict()
for row in rows:
    existing = new_data.get(row['id'], None)
    if existing is None or to_date(existing['date']) < to_date(row['date']):
        new_data[row['id']] = row
If you want your new_data variable to be a list, use new_data = list(new_data.values()).
import datetime

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

def parse_date(d):
    return datetime.datetime.strptime(d, "%d/%m/%y").date()

tmp_dict = {}
for row in rows:
    if row['id'] not in tmp_dict:
        tmp_dict[row['id']] = row
    else:
        if parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']):
            tmp_dict[row['id']] = row
print(list(tmp_dict.values()))
output
[{'date': '2/2/18', 'foo': 'baz', 'id': '123'}]
Note: you can merge the two ifs into if row['id'] not in tmp_dict or parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']) for cleaner and shorter code.
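That merged version would look like this (a sketch, equivalent to the loop above):
for row in rows:
    if row['id'] not in tmp_dict or parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']):
        tmp_dict[row['id']] = row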
Firstly, work with proper date objects, not strings. Here is how to parse them:
from datetime import datetime

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

for row in rows:
    row['date'] = datetime.strptime(row['date'], '%d/%m/%y').date()
(check if the format is correct)
Then for the actual task:
new_data = {}
for row in rows:
    new_data[row['id']] = max(new_data.get(row['id'], row), row,
                              key=lambda r: r['date'])
print(new_data.values())
Alternatively, here are some generic utility functions that work well here, which I use in many places:
from collections import defaultdict

def group_by_key_func(iterable, key_func):
    """
    Create a dictionary from an iterable such that the keys are the result
    of evaluating a key function on elements of the iterable and the values
    are lists of elements all of which correspond to the key.
    """
    result = defaultdict(list)
    for item in iterable:
        result[key_func(item)].append(item)
    return result

def group_by_key(iterable, key):
    return group_by_key_func(iterable, lambda x: x[key])
Then the solution can be written as:
by_id = group_by_key(rows, 'id')
for id_num, group in list(by_id.items()):
    by_id[id_num] = max(group, key=lambda r: r['date'])
print(by_id.values())
This is less efficient than the first solution because it creates lists along the way that are discarded, but I use the general principles in many places and I thought of it first, so here it is.
If you like to utilize classes as much as I do, then you could make your own class to do this:
from datetime import date

rows = [
    {"id": "123", "date": "1/1/18", "foo": "bar"},
    {"id": "123", "date": "2/2/18", "foo": "baz"},
    {"id": "456", "date": "3/3/18", "foo": "bar"},
    {"id": "456", "date": "1/1/18", "foo": "bar"}
]

class unique(dict):
    def __setitem__(self, key, value):
        # Add key if missing, or replace value if the new date is more recent
        if key not in self or self[key]["date"] < value["date"]:
            dict.__setitem__(self, key, value)

data = unique()  # Initialize new class based on dict
for row in rows:
    d, m, y = map(int, row["date"].split('/'))  # Split date into parts
    row["date"] = date(2000 + y, m, d)          # Replace date value (two-digit year assumed 20xx)
    data[row["id"]] = row  # Set new data. Will overwrite same ids with more recent
print(list(data.values()))
Outputs:
[
    {'date': datetime.date(2018, 2, 2), 'foo': 'baz', 'id': '123'},
    {'date': datetime.date(2018, 3, 3), 'foo': 'bar', 'id': '456'}
]
Keep in mind that data is a dict subclass that overrides the __setitem__ method and uses IDs as keys. The dates are date objects, so they can be compared easily.

Pivoting SQLite table, setwise like SQL should be

I have some data. 224,000 rows of it, in a SQLite database. I want to extract time series information from it to feed a data visualisation tool. Essentially, each row in the db is an event that has (among other things not strictly relevant) a time-date group in seconds since the epoch and a name responsible for it. I want to extract how many events each name has for every week in the db.
That's simple enough:
SELECT COUNT(*),
       name,
       strftime("%W:%Y", time, "unixepoch")
FROM events
GROUP BY strftime("%W:%Y", time, "unixepoch"), name
ORDER BY time
and we get about six thousand rows of data.
count  name   week:year
23     fudge  23:2009
etc...
But I don't want a row for each name in each week - I want a row for each name, and a column for each week, like this:
Name   23:2009  24:2009  25:2009
fudge  23       6        19
fish   1        0        12
etc...
Now, the monitoring process has been running for 69 weeks, and the count of unique names is 502. So clearly, I'm far from keen on any solution that involves hardcoding all the columns and still less the rows. I'm less unkeen on anything that involves iterating over the lot, say with python's executemany(), but I'm willing to accept it if necessary. SQL is meant to be set-wise, dammit.
A good approach in cases like this is not to push SQL to the point where it becomes convoluted and hard to understand and maintain. Let SQL do what it conveniently can and post-process the query results in Python.
Here's a cut-down version of a simple crosstab generator that I wrote. The full version delivers row/column/grand totals.
You'll note that it has built-in "group by" -- the original use-case was for summarising data obtained from Excel files using Python and xlrd.
The row_key and col_key that you supply don't need to be strings as in the example; they can be tuples -- e.g. (year, week) in your case -- or they could be integers -- e.g. you have a mapping of string column name to integer sort key.
import sys

class CrossTab(object):
    def __init__(
        self,
        missing=0,  # what to return for an empty cell. Alternatives: '', 0.0, None, 'NULL'
    ):
        self.missing = missing
        self.col_key_set = set()
        self.cell_dict = {}
        self.headings_OK = False

    def add_item(self, row_key, col_key, value):
        self.col_key_set.add(col_key)
        try:
            self.cell_dict[row_key][col_key] += value
        except KeyError:
            try:
                self.cell_dict[row_key][col_key] = value
            except KeyError:
                self.cell_dict[row_key] = {col_key: value}

    def _process_headings(self):
        if self.headings_OK:
            return
        self.row_headings = sorted(self.cell_dict.keys())
        self.col_headings = sorted(self.col_key_set)
        self.headings_OK = True

    def get_col_headings(self):
        self._process_headings()
        return self.col_headings

    def generate_row_info(self):
        self._process_headings()
        for row_key in self.row_headings:
            row_dict = self.cell_dict[row_key]
            row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
            yield row_key, row_vals

    def dump(self, f=None, header=None, footer=''):
        if f is None:
            f = sys.stdout
        alist = sorted(self.__dict__.items())
        if header is not None:
            print(header, file=f)
        for attr, value in alist:
            print("%s: %r" % (attr, value), file=f)
        if footer is not None:
            print(footer, file=f)
if __name__ == "__main__":
    data = [
        ['Rob', 'Morn', 240],
        ['Rob', 'Aft', 300],
        ['Joe', 'Morn', 70],
        ['Joe', 'Aft', 80],
        ['Jill', 'Morn', 100],
        ['Jill', 'Aft', 150],
        ['Rob', 'Aft', 40],
        ['Rob', 'aft', 5],
        ['Dozy', 'Aft', 1],
        # Dozy doesn't show up till lunch-time
        ['Nemo', 'never', -1],
    ]
    NAME, TIME, AMOUNT = range(3)
    xlate_time = {'morn': "AM", "aft": "PM"}
    print()
    ctab = CrossTab(missing=None)
    # ctab.dump(header='=== after init ===')
    for s in data:
        ctab.add_item(
            row_key=s[NAME],
            col_key=xlate_time.get(s[TIME].lower(), "XXXX"),
            value=s[AMOUNT])
    # ctab.dump(header='=== after add_item ===')
    print(ctab.get_col_headings())
    # ctab.dump(header='=== after get_col_headings ===')
    for x in ctab.generate_row_info():
        print(x)
Output:
['AM', 'PM', 'XXXX']
('Dozy', [None, 1, None])
('Jill', [100, 150, None])
('Joe', [70, 80, None])
('Nemo', [None, None, -1])
('Rob', [240, 345, None])
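To feed the CrossTab from the database in the question, a sketch along these lines should work (the database filename is an assumption; the table and columns are as in the question):
import sqlite3

conn = sqlite3.connect('events.db')  # hypothetical filename
ctab = CrossTab(missing=0)
query = """
    SELECT COUNT(*), name, strftime('%W:%Y', time, 'unixepoch')
    FROM events
    GROUP BY strftime('%W:%Y', time, 'unixepoch'), name
"""
for count, name, week in conn.execute(query):
    ctab.add_item(row_key=name, col_key=week, value=count)
print(ctab.get_col_headings())
for row in ctab.generate_row_info():
    print(row)
As noted above, passing a (year, week) tuple as col_key instead of the raw "WW:YYYY" string would make the column ordering chronological.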
I would first do your query
SELECT COUNT(*),
       name,
       strftime("%W:%Y", time, "unixepoch")
FROM events
GROUP BY strftime("%W:%Y", time, "unixepoch"), name
ORDER BY time
and then do the post-processing with Python.
That way you don't have to iterate over 224,000 rows, only over 6,000. You can easily store those 6,000 rows in memory for processing with Python. You could probably store 224,000 rows in memory too, but that takes quite a lot more memory.
However: newer versions of SQLite support the aggregate function group_concat. Maybe you can use this function for pivoting in SQL? I can't try it because I use an older version.
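As a sketch of that idea (untested, assuming a version of SQLite with group_concat): it can at least collapse each name's weekly counts into one row, although the result is a delimited string rather than true columns:
SELECT name,
       group_concat(wk || '=' || cnt) AS weekly_counts
FROM (SELECT name,
             strftime('%W:%Y', time, 'unixepoch') AS wk,
             COUNT(*) AS cnt
      FROM events
      GROUP BY wk, name)
GROUP BY name;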
