Say I have some data with timestamps, prices and amounts. This data can be quite large, and matching conditions could occur anywhere in the group. A simple example is shown below:
[{"date":1387496043,"price":19.379,"amount":1.000000}
{"date":1387496044,"price":20.20,"amount":2.00000}
{"date":1387496044,"price":10.00,"amount":0.10000}
{"date":1387496044,"price":20.20,"amount":0.300000}]
How could I sort this so I combine the amounts of any item that has the same timestamp and same price?
So the results look like (note the 2.0 and 0.3 amounts have been summed together):
[{"date":1387496043,"price":19.379,"amount":1.000000}
{"date":1387496044,"price":20.20,"amount":2.30000}
{"date":1387496044,"price":10.00,"amount":0.10000}]
I've tried a number of convoluted methods (using Python 2.7.3), but I don't know Python very well. I'm sure there's a good way to find two matching entries, update one with the combined amount, and remove the duplicate.
FYI, here is the test data:
L=[{"date":1387496043,"price":19.379,"amount":1.000000},{"date":1387496044,"price":20.20,"amount":2.00000},{"date":1387496044,"price":10.00,"amount":0.10000},{"date":1387496044,"price":20.20,"amount":0.300000}]
A defaultdict-based approach
from collections import defaultdict

d = defaultdict(float)
z = [{"date":1387496043,"price":19.379,"amount":1.000000},
     {"date":1387496044,"price":20.20,"amount":2.00000},
     {"date":1387496044,"price":10.00,"amount":0.10000},
     {"date":1387496044,"price":20.20,"amount":0.300000}]

for x in z:
    d[x["date"], x["price"]] += x["amount"]

print [{"date": k1, "price": k2, "amount": v} for (k1, k2), v in d.iteritems()]
[{'date': 1387496044, 'price': 10.0, 'amount': 0.1},
{'date': 1387496044, 'price': 20.2, 'amount': 2.3},
{'date': 1387496043, 'price': 19.379, 'amount': 1.0}]
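The question specifies Python 2.7, hence the print statement and iteritems() above; a minimal Python 3 variant of the same idea, reusing z from above, would be:

from collections import defaultdict

d = defaultdict(float)
for x in z:
    d[x["date"], x["price"]] += x["amount"]
print([{"date": k1, "price": k2, "amount": v} for (k1, k2), v in d.items()])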
Probably the best way to do this would be to build a dictionary with (date, price) tuples as keys. Whenever you encounter a key you have already seen, combine the amounts instead of adding a new entry; that keeps the keys unique.
def combine(L):
    results = {}
    for item in L:
        key = (item["date"], item["price"])
        if key in results:  # combine them
            results[key] = {"date": item["date"], "price": item["price"],
                            "amount": item["amount"] + results[key]["amount"]}
        else:  # don't need to combine them
            results[key] = item
    return results.values()
This is a slightly messy but O(n) solution for your example, and it generalizes directly to your full dataset.
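For example, called on the test data L from the question (the order of the returned items isn't guaranteed, since it follows dict ordering):

combined = combine(L)
# e.g. [{'date': 1387496043, 'price': 19.379, 'amount': 1.0},
#       {'date': 1387496044, 'price': 20.2, 'amount': 2.3},
#       {'date': 1387496044, 'price': 10.0, 'amount': 0.1}]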
FWIW you can do it using database operations:
records = [
{"date":1387496043,"price":19.379,"amount":1.000000},
{"date":1387496044,"price":20.20,"amount":2.00000},
{"date":1387496044,"price":10.00,"amount":0.10000},
{"date":1387496044,"price":20.20,"amount":0.300000},
]
import sqlite3
db = sqlite3.connect(':memory:')
db.row_factory = sqlite3.Row
db.execute('CREATE TABLE records (date int, price float, amount float)')
db.executemany('INSERT INTO records VALUES (:date, :price, :amount)', records)
sql = 'SELECT date, price, SUM(amount) AS amount FROM records GROUP BY date, price'
records = [dict(row) for row in db.execute(sql)]
print(records)
Related
I have a script that retrieves user data from a CSV (~2.5m rows) and record data from Salesforce via an API (~2m rows), and matches them based on a unique user_id.
For each user, I need the relevant record_id (if it exists). There is a one-to-one relationship between users and records, so each user_id should appear on at most one record.
To try to increase performance, both lists are sorted ascending by user_id, and I break the inner loop if record['user_id'] > user['user_id'], since that means there is no relevant record.
It works, but matching the two datasets is slow, taking ~1.5 hours. Is there a faster way to perform the matching and retrieve the relevant record_id?
Here is an example of the data, current function, and expected result:
users = [
{"user_id": 11111, "name": "Customer A", "age": 34, 'record_id': None},
{"user_id": 22222, "name": "Customer B", "age": 18, 'record_id': None},
{"user_id": 33333, "name": "Customer C", "age": 66, 'record_id': None}
]
records = [
{"user_id": 11111, "record_id": "ABC123"},
{"user_id": 33333, "record_id": "GHI789"}
]
upload = []
for user in users:
    for record in records:
        if user['user_id'] == record['user_id']:
            user['record_id'] = record['record_id']
            records.remove(record)
            break
        elif record['user_id'] > user['user_id']:
            break
    if user['record_id']:
        upload.append(user)
print(upload)
This outputs:
[
{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
{'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
Create a dictionary that maps from a user's id to its corresponding dictionary. Then, you can add the relevant record_id fields using a for loop. Finally, you can remove the entries without an assigned record_id using a list comprehension.
This doesn't require any preprocessing (e.g. sorting) to obtain speedup; the efficiency gain comes from the fact that lookups in a large dictionary are faster than searching a large list:
user_id_mapping = {entry["user_id"]: entry for entry in users}

for record in records:
    if record["user_id"] in user_id_mapping:
        user_id_mapping[record["user_id"]]["record_id"] = record["record_id"]

result = [item for item in user_id_mapping.values() if item["record_id"] is not None]
print(result)
This outputs:
[
{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
{'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
That said, if you have to run this kind of operation repeatedly, I would recommend using some sort of database rather than performing it in Python.
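As a rough sketch of that route (not part of the original approach; the table and column names are just illustrative), an in-memory SQLite join over the two lists could look like this:

import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE users (user_id int, name text, age int)')
db.execute('CREATE TABLE records (user_id int, record_id text)')
db.executemany('INSERT INTO users VALUES (?, ?, ?)',
               [(u['user_id'], u['name'], u['age']) for u in users])
db.executemany('INSERT INTO records VALUES (?, ?)',
               [(r['user_id'], r['record_id']) for r in records])

sql = ('SELECT u.user_id, u.name, u.age, r.record_id '
       'FROM users u JOIN records r ON r.user_id = u.user_id')
upload = [dict(zip(['user_id', 'name', 'age', 'record_id'], row))
          for row in db.execute(sql)]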
You could use pandas.read_csv() to read your CSV data into a dataframe, and then merge that with the records on the user_id value:
import pandas as pd
users = pd.read_csv('csv file')
records = pd.DataFrame('result of salesforce query')
result = users.drop('record_id', axis=1).merge(records, on='user_id')
If you want to keep the users which have no matching value in records, change the merge to
merge(records, on='user_id', how='left')
To output the result as a list of dictionaries, use to_dict():
result.to_dict('records')
Note - it may be possible to execute your Salesforce query directly into a dataframe. See for example this Q&A
For scalability, you can use pandas dataframes, like so:
result = pd.merge(pd.DataFrame(users), pd.DataFrame(records), on='user_id').to_dict('records')
If you want to keep the entries which do not have a record_id, you can add how="left" to the arguments of the merge call, for example:
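# same merge as above, but unmatched users are kept with record_id = NaN
result = pd.merge(pd.DataFrame(users), pd.DataFrame(records), on='user_id', how='left').to_dict('records')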
Your approach isn't unreasonable. But removing each record after it's used has a cost, and sorting the two lists ahead of time also has a cost. These costs may add up to more than you think.
One possible approach would be to NOT sort the lists, but instead build a dict of record_ids, e.g.:
rdict = {r['user_id']: r['record_id'] for r in records}

upload = []
for user in users:
    user_id = user['user_id']
    record_id = rdict.get(user_id)
    if record_id:
        user['record_id'] = record_id
        upload.append(user)
This way you pay the price of building the dict once, and every lookup after that is very efficient.
I'm new to this forum, so kindly excuse me if the question format is not very good.
I'm trying to fetch rows from a MySQL table and print them after processing the columns (one of the columns contains JSON that needs to be expanded). Below are the source data and the expected output. It would be great if someone could suggest an easier way to manage this data.
Note: I have achieved this with lots of looping and parsing, but the challenges are:
1) There is no connection between the column names and the data, so when I print the data I don't know the order of the fields in the result set, and the column titles I print don't match the data. Is there a way to keep them in sync?
2) I would like the flexibility to change the order of the columns without much rework.
What is the best way to achieve this? I have not explored the pandas library, as I wasn't sure it is really necessary.
Using Python 3.6.
Sample Data in the table
id, student_name, personal_details, university
1, Sam, {"age":"25","DOL":"2015","Address":{"country":"Poland","city":"Warsaw"},"DegreeStatus":"Granted"},UAW
2, Michael, {"age":"24","DOL":"2016","Address":{"country":"Poland","city":"Toruń"},"DegreeStatus":"Granted"},NCU
I'm querying the database using a MySQLdb.connect object; steps below:
query = "select * from student_details"
cur.execute(query)
res = cur.fetchall() # get a collection of tuples
db_fields = [z[0] for z in cur.description] # generate list of col_names
Data in variables:
>>>db_fields
['id', 'student_name', 'personal_details', 'university']
>>>res
((1, 'Sam', '{"age":"25","DOL":"2015","Address":{"country":"Poland","city":"Warsaw"},"DegreeStatus":"Granted"}','UAW'),
(2, 'Michael', '{"age":"24","DOL":"2016","Address":{"country":"Poland","city":"Toruń"},"DegreeStatus":"Granted"}','NCU'))
Desired Output:
id, student_name, age, DOL, country, city, DegreeStatus, University
1, 'Sam', 25, 2015, 'Poland', 'Warsaw', 'Granted', 'UAW'
2, 'Michael', 24, 2016, 'Poland', 'Toruń', 'Granted', 'NCU'
A not-too-pythonic but easy to understand way (and maybe you can write a more pythonic solution) might be:
import json

def unwrap_dict(_input):
    res = dict()
    for k, v in _input.items():
        # Assuming you know there's only one nested level
        if isinstance(v, dict):
            for _k, _v in v.items():
                res[_k] = _v
            continue
        res[k] = v
    return res

all_data = list()
for row in res:  # res is the result of cur.fetchall() above
    row_dict = dict()
    for field, data in zip(db_fields, row):
        # Assuming you know personal_details is the only JSON column
        if field == 'personal_details':
            data = json.loads(data)
        if isinstance(data, dict):
            extra = unwrap_dict(data)
            row_dict.update(extra)
            continue
        row_dict[field] = data
    all_data.append(row_dict)
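To address points 1) and 2) from the question, one option (the column list below is an assumption based on the desired output, so adjust it as needed) is to drive both the header and the rows from a single list of field names; the titles and data then stay in sync automatically, and reordering is a one-line change:

out_fields = ['id', 'student_name', 'age', 'DOL', 'country', 'city', 'DegreeStatus', 'university']
print(', '.join(out_fields))
for entry in all_data:
    print(', '.join(str(entry.get(f, '')) for f in out_fields))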
In a huge JSON file I have 24 columns and 700k rows; one of the columns has a dictionary inside, so I selected that column below:
dataset = pd.read_json('ecommerce-events - Copia.json', lines=True)
dataset.loc[dataset['eventType']=="transaction"]
Each transaction contains a "price"; I want to sum all prices times quantities. How do I do this with pandas?
'url': 'da7caa77e2729e12b32a9d7d1a324652ce2264a6',
'referrer': '6e03ee62984224d0c0f08d4b68b819297d7f4d14',
'order': 5545, # unique transaction id
'orderItems': [{ # list of products bought in that transaction
'product': 16493, # product id
'price': 19.9, # product unit price
'quantity': 1.0
import pandas as pd

def summation(x):
    value = x["price"] * x["qun"]
    return value

df = pd.DataFrame({"Transaction": [[{"price": 23, "qun": 2}],
                                   [{"price": 25, "qun": 2}],
                                   [{"price": 24, "qun": 2}]]})
df["summation_value"] = df["Transaction"].apply(lambda x: summation(x[0]))
I have a collection of heavily nested docs in MongoDB that I want to flatten and import into pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see the examples below for details).
I already have a function that works for smaller batches of documents. But the solution (I found it in the answer to this question) goes via json, and the problem is that the json.loads operation fails with a MemoryError on bigger selections from the collection.
I tried many solutions suggesting other json parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to keep the transformation via json, would be to chunk bigger selections into smaller groups of documents and iterate the parsing.
At this point I thought (and that is my main question here): maybe there is a smarter way to do the unnesting without the detour through json, directly in MongoDB or in pandas, or somehow combined?
This is a shortened example Doc:
{
    '_id': ObjectId('5b40fcc4affb061b8871cbc5'),
    'eventId': 2,
    'sId': 6833,
    'stage': {
        'value': 1,
        'Name': 'FirstStage'
    },
    'quality': [
        {
            'type': {
                'value': 2,
                'Name': 'Color'
            },
            'value': '124'
        },
        {
            'type': {
                'value': 7,
                'Name': 'Length'
            },
            'value': 'Short'
        },
        {
            'type': {
                'value': 15,
                'Name': 'Printed'
            }
        }
    ]
}
This is what a successful dataframe representation would look like (I skipped the columns '_id' and 'sId' for readability):
   eventId  stage.value    stage.name  q_color  q_length  q_printed
1        2            1  'FirstStage'      124   'Short'          1
My code so far (which runs into memory problems - see above):
import json
import pandas as pd
from bson import json_util
from pandas.io.json import json_normalize  # pandas.json_normalize in newer versions

# events is the MongoDB collection being queried (from the surrounding context)
def load_events(filter='sId', id=6833, all=False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' % (filter, id))
        print('Filtering...')
        cursor = events.find({filter: id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality', None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using json parsers. Just create your dataframe from the record list:
df = DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
from itertools import chain
import numpy

df = pandas.DataFrame(list(cursor))  # as above
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])
df['q_'] = df['quality'].apply(
    lambda cell: [(m['type']['Name'], m['value'] if 'value' in m.keys() else 1) for m in cell])
df['q_'] = df['q_'].apply(lambda cell: dict((k, v) for k, v in cell))
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell[key] if key in cell.keys() else numpy.nan)
df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I use three steps to unpack the nested data types. First, the names and values from 'quality' are used to create a flat list of (name, value) pairs. In the second step, a dictionary is built from those tuples, taking its keys from the first and its values from the second position of each pair. Then all existing property names are extracted once using a set. Each property gets a new column in a loop, and inside the loop the value of each pair is mapped to the respective column cell.
I currently have a question about Python pandas: I want to filter a dataframe dynamically using a URL query string.
For example:
CSV:
url: http://example.com/filter?Name=Sam&Age=21&Gender=male
Hardcoded:
filtered_data = data[
    (data['Name'] == 'Sam') &
    (data['Age'] == 21) &
    (data['Gender'] == 'male')
]
I don't want to hard-code the filter keys like this, because the CSV file can change at any time with different column headers.
Any suggestions?
The easiest way to create this filter dynamically is probably to use np.all.
For example:
import numpy as np
query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
filters = [data[k] == v for k, v in query.items()]
filter_data = data[np.all(filters, axis=0)]
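If the filter really arrives as a raw URL, one way to build that query dict is with the standard library's parse_qs; note that it returns lists of strings, so numeric values such as Age may need casting to match the dataframe's dtypes:

from urllib.parse import urlparse, parse_qs

url = 'http://example.com/filter?Name=Sam&Age=21&Gender=male'
raw = parse_qs(urlparse(url).query)        # {'Name': ['Sam'], 'Age': ['21'], 'Gender': ['male']}
query = {k: v[0] for k, v in raw.items()}  # take the first value for each key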
Use df.query. For example:
df = pd.read_csv(url)
conditions = "Name == 'Sam' and Age == 21 and Gender == 'Male'"
filtered_data = df.query(conditions)
You can build the conditions string dynamically using string formatting, like:
conditions = " and ".join("{} == {}".format(col, val)
                          for col, val in zip(df.columns, values))
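One caveat: string values need quotes inside the query expression. A small sketch using !r to add them (query here stands for whatever dict of column/value pairs you build from the request; it isn't defined in the original answer):

query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
conditions = " and ".join("{} == {!r}".format(col, val) for col, val in query.items())
filtered_data = df.query(conditions)  # e.g. "Name == 'Sam' and Age == 21 and Gender == 'male'"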
Typically, your web framework will return the arguments in a dict-like structure. Let's say your args are like this:
args = {
'Name': ['Sam'],
'Age': ['21'], # Note that Age is a string
'Gender': ['male']
}
You can filter your dataset successively like this:
for key, values in args.items():
    data = data[data[key].isin(values)]
However, this is likely not to match any data for Age, which may have been loaded as an integer. In that case, you could load the CSV file as a string via pd.read_csv(filename, dtype=object), or convert to string before comparison:
for key, values in args.items():
    data = data[data[key].astype(str).isin(values)]
Incidentally, this will also match multiple values. For example, the URL http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male leads to the structure:
args = {
'Name': ['Sam', 'Ben'], # There are 2 names
'Age': ['21'],
'Gender': ['male']
}
In this case, both Ben and Sam will be matched, since we're using .isin to match.