Group by multiple keys with itertools in Python

I have a function, group_by_transaction, that should return a new list of dictionaries, but when I run it on my example data I get:
[{'user_id': 'user3',
'transaction_category_id': '698723',
'transaction_amount_sum': 500},
{'user_id': 'user4',
'transaction_category_id': '698723',
'transaction_amount_sum': 500},
{'user_id': 'user5',
'transaction_category_id': '698723',
'transaction_amount_sum': 300}]
But I wish it was:
[{'number_of_users': 3,
'transaction_category_id': '698723',
'transaction_amount_sum': 1300}]
from itertools import groupby
from operator import itemgetter
data = [{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a9e',
'date': '2013-12-30',
'user_id': 'user3',
'is_blocked': 'false',
'transaction_amount': 200,
'transaction_category_id': '698723',
'is_active': '0'},
{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a7e',
'date': '2013-12-21',
'user_id': 'user3',
'is_blocked': 'false',
'transaction_amount': 300,
'transaction_category_id': '698723',
'is_active': '0'},
{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a9e',
'date': '2013-12-30',
'user_id': 'user4',
'is_blocked': 'false',
'transaction_amount': 200,
'transaction_category_id': '698723',
'is_active': '0'},
{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a7e',
'date': '2013-12-21',
'user_id': 'user4',
'is_blocked': 'false',
'transaction_amount': 300,
'transaction_category_id': '698723',
'is_active': '0'},
{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a7e',
'date': '2013-12-21',
'user_id': 'user5',
'is_blocked': 'false',
'transaction_amount': 300,
'transaction_category_id': '698723',
'is_active': '0'}]
def group_by_transaction(data):
    grouper = ['user_id', 'transaction_category_id']
    key = itemgetter(*grouper)
    data.sort(key=key)
    return [{**dict(zip(grouper, k)),
             'transaction_amount_sum': sum(map(itemgetter('transaction_amount'), g))}
            for k, g in groupby(data, key=key)]
group_by_transaction(data)
Can anybody help me, please?
I tried to add a new column to the calculation inside the loop, but I couldn't get it to work.

Collect data with groupby
I'm not brave enough to mutate the input inside a function, so I prefer sorted(data) over data.sort().
You have to count the number of unique users somehow in order to get the key-value pair {'number_of_users': 3}.
You're grouping by the ['user_id', 'transaction_category_id'] pair, but to get what you want, the only key to group by is 'transaction_category_id'.
With that said, here's code that I'm sure is close enough to yours to produce the desired grouping.
def group_by_transaction(data):
    category = itemgetter('transaction_category_id')
    user_amount = itemgetter('user_id', 'transaction_amount')
    return [
        {
            'transaction_category_id': cat_id,
            'number_of_users': len({*users}),
            'transaction_amount_sum': sum(amounts),
        }
        for cat_id, group in groupby(sorted(data, key=category), category)
        for users, amounts in [zip(*(user_amount(record) for record in group))]
    ]
Update
About for users, amounts in [zip(*(user_amount(record) for record in group))]:
with user_amount(record) we extract pairs of data (user_id, transaction_amount);
with zip(*(...)) we transpose the collected data;
zip returns an iterator, which in this case yields two rows, where the first holds the user_id values and the second the transaction_amount values. To get them both at once we wrap the zip object as the only item of a list. That's the meaning of [zip(...)];
when a zip result is assigned not to one but to several variables, as in users, amounts = zip(...), the zipped values are unpacked. In our case, those are the two rows mentioned above.
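Concretely, the transpose step behaves like this on its own (a standalone sketch with made-up records):

```python
from operator import itemgetter

# Made-up records, just to illustrate the zip(*) transpose.
records = [
    {'user_id': 'u1', 'transaction_amount': 10},
    {'user_id': 'u2', 'transaction_amount': 20},
]
user_amount = itemgetter('user_id', 'transaction_amount')

# Each record yields a (user_id, transaction_amount) pair...
pairs = [user_amount(r) for r in records]   # [('u1', 10), ('u2', 20)]

# ...and zip(*) turns those rows into columns, unpacked into two names.
users, amounts = zip(*pairs)                # ('u1', 'u2') and (10, 20)
print(users, amounts)
```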

Related

How to convert json into a pandas dataframe?

I'm trying to convert an API response from JSON to a DataFrame in pandas. The problem I'm having is that the data is nested in the JSON, and I am not getting the right columns in my DataFrame.
The data is collected from an API with the following format:
{'tickets': [{'url': 'https...',
'id': 1,
'external_id': None,
'via': {'channel': 'web',
'source': {'from': {}, 'to': {}, 'rel': None}},
'created_at': '2020-05-01T04:16:33Z',
'updated_at': '2020-05-23T03:02:49Z',
'type': 'incident',
'subject': 'Subject',
'raw_subject': 'Raw subject',
'description': 'Hi, this is the description',
'priority': 'normal',
'status': 'closed',
'recipient': None,
'requester_id': 409467360874,
'submitter_id': 409126461453,
'assignee_id': 409126461453,
'organization_id': None,
'group_id': 360009916453,
'collaborator_ids': [],
'follower_ids': [],
'email_cc_ids': [],
'forum_topic_id': None,
'problem_id': None,
'has_incidents': False,
'is_public': True,
'due_at': None,
'tags': ['tag_1',
'tag_2',
'tag_3',
'tag_4'],
'custom_fields': [{'id': 360042034433, 'value': 'value of the first custom field'},
{'id': 360041487874, 'value': 'value of the second custom field'},
{'id': 360041489414, 'value': 'value of the third custom field'},
{'id': 360040980053, 'value': 'correo_electrónico'},
{'id': 360040980373, 'value': 'suscribe_newsletter'},
{'id': 360042046173, 'value': None},
{'id': 360041028574, 'value': 'product'},
{'id': 360042103034, 'value': None}],
'satisfaction_rating': {'score': 'unoffered'},
'sharing_agreement_ids': [],
'comment_count': 2,
'fields': [{'id': 360042034433, 'value': 'value of the first custom field'},
{'id': 360041487874, 'value': 'value of the second custom field'},
{'id': 360041489414, 'value': 'value of the third custom field'},
{'id': 360040980053, 'value': 'correo_electrónico'},
{'id': 360040980373, 'value': 'suscribe_newsletter'},
{'id': 360042046173, 'value': None},
{'id': 360041028574, 'value': 'product'},
{'id': 360042103034, 'value': None}],
'followup_ids': [],
'ticket_form_id': 360003608013,
'deleted_ticket_form_id': 360003608013,
'brand_id': 360004571673,
'satisfaction_probability': None,
'allow_channelback': False,
'allow_attachments': True},
What I already tried is the following: I converted the JSON into a dict like this:
x = response.json()
df = pd.DataFrame(x['tickets'])
But I'm struggling with the output: I don't know how to get a correct, ordered, normalized DataFrame.
(I'm new to this :) )
Let's suppose you get your request data with r = requests.get(url, auth=auth).
Your data isn't clean yet, so let's get a DataFrame from it: data = pd.read_json(json.dumps(r.json(), ensure_ascii=False))
But you will probably get a DataFrame with one single row.
When I faced a problem like this, I wrote this function to get the full data:
listParam = []
def listDict(entry):
    if type(entry) is dict:
        listParam.append(entry)
    elif type(entry) is list:
        for ent in entry:
            listDict(ent)
Because your data is wrapped in a dict ({'tickets': ...}), you will need to extract the information like this:
listDict(data.iloc[0][0])
And then,
pd.DataFrame(listParam)
I can't show the results because you didn't post the complete data or say where to find it, but this will probably work.
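For instance, run on a small made-up nested structure (the function is repeated here so the sketch is self-contained), it collects every dict it finds:

```python
# Standalone demo of the recursive flattening idea from the answer above,
# applied to a made-up nested structure.
listParam = []

def listDict(entry):
    if type(entry) is dict:
        listParam.append(entry)
    elif type(entry) is list:
        for ent in entry:
            listDict(ent)

nested = [{'a': 1}, [{'b': 2}, [{'c': 3}]]]
listDict(nested)
print(listParam)  # [{'a': 1}, {'b': 2}, {'c': 3}]
```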
You have to convert the JSON to a dictionary first, and then convert the dictionary value for the key 'tickets' into a DataFrame.
import json
import pandas as pd

file = open('file.json').read()
ticketDictionary = json.loads(file)
df = pd.DataFrame(ticketDictionary['tickets'])
'file.json' contains your data here.
df now contains your data as a DataFrame, one row per ticket.
For the lists within the response you can build separate DataFrames if required (use a new variable name so the original df isn't overwritten on each iteration):
for field in df['fields']:
    fields_df = pd.DataFrame(field)
For each ticket's fields list it will give you:
id value
0 360042034433 value of the first custom field
1 360041487874 value of the second custom field
2 360041489414 value of the third custom field
3 360040980053 correo_electrónico
4 360040980373 suscribe_newsletter
5 360042046173 None
6 360041028574 product
7 360042103034 None
This can be one way to structure as you haven't mentioned the exact expected format.
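If the nested parts need flattening as well, pd.json_normalize can do most of the work. Here is a minimal sketch on a made-up, trimmed-down ticket in the same shape as the response above (field names are taken from that sample; everything else is an assumption):

```python
import pandas as pd

# A trimmed, made-up ticket in the same shape as the API response.
response = {'tickets': [{
    'id': 1,
    'subject': 'Subject',
    'via': {'channel': 'web'},
    'custom_fields': [{'id': 360042034433, 'value': 'first'},
                      {'id': 360041487874, 'value': 'second'}],
}]}

# Nested dicts become dotted columns; list values stay as single cells.
df = pd.json_normalize(response['tickets'])

# The list-valued custom_fields can get its own frame, one row per field,
# keeping the ticket id as metadata.
fields = pd.json_normalize(response['tickets'],
                           record_path='custom_fields',
                           meta=['id'],
                           record_prefix='field_')
print(fields)
```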

I have a list of nested dicts and need to convert it to the dict type for a JSON object

The JSON data below has 3 rules (each of type dict). I have created it as a list with some changes. Now I need to convert this list to the dict data type. The data has a lot of nested lists/dicts. I want to split this list of three items and get each one as a dictionary (dict data type).
<class 'list'>
[
{'ID': 'Glacierize bird_sporr after 2 weeks',
'Status': 'Enabled',
'Transitions': [{'Days': 14, 'StorageClass': 'GLACIER'}],
'NoncurrentVersionTransitions': [{'NoncurrentDays': 14, 'StorageClass': 'GLACIER'}],
'Prefix': 'bird_sporr'},
{'Expiration':
{'Days': 45},
'ID': 'Delete files after 45 days',
'Status': 'Enabled',
'NoncurrentVersionExpiration': {'NoncurrentDays': 45},
'Prefix': 'bird_sporr'
},
{'ID': 'PruneAbandonedMultipartUpload',
'Status': 'Enabled',
'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 30},
'Prefix': ''}
]
I need the output below with the dict data type. The API will not accept the list data type. Please help with this, and let me know if you have any questions.
<class 'dict'>
{'ID': 'Glacierize bird_sporr after 2 weeks',
'Status': 'Enabled',
'Transitions': [{'Days': 14, 'StorageClass': 'GLACIER'}],
'NoncurrentVersionTransitions': [{'NoncurrentDays': 14, 'StorageClass': 'GLACIER'}],
'Prefix': 'bird_sporr'},
{'Expiration':
{'Days': 45},
'ID': 'Delete files after 45 days',
'Status': 'Enabled',
'NoncurrentVersionExpiration': {'NoncurrentDays': 45},
'Prefix': 'bird_sporr'},
{'ID': 'PruneAbandonedMultipartUpload',
'Status': 'Enabled',
'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 30},
'Prefix': ''}
If your problem is just that your output is wrapped in a list, and you need the output without the containing list, then you should simply be able to index it:
rules[0] (assuming the list is named rules) should give you your desired dictionary.
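For illustration, a hedged sketch (variable names are made up; the rules are trimmed copies of the ones in the question). Note that if these are S3 lifecycle rules, boto3's put_bucket_lifecycle_configuration actually expects the whole list wrapped in a dict, not one rule at a time:

```python
# Trimmed-down copies of the rules from the question.
rules = [
    {'ID': 'Glacierize bird_sporr after 2 weeks', 'Status': 'Enabled',
     'Prefix': 'bird_sporr'},
    {'ID': 'Delete files after 45 days', 'Status': 'Enabled',
     'Prefix': 'bird_sporr'},
    {'ID': 'PruneAbandonedMultipartUpload', 'Status': 'Enabled', 'Prefix': ''},
]

# A single rule is just an element of the list, and it is already a dict:
first_rule = rules[0]

# If the API wants every rule at once, S3-style lifecycle APIs typically
# take the list wrapped in a dict under the 'Rules' key:
config = {'Rules': rules}
```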

How to convert a JSON list into two different columns in a CSV file?

This is my first question on this spectacular website. I need to know how to export complex information from JSON to CSV.
The problem is that, from the list that I have in that column, I need two different values.
I tried a lot of different combinations without success, so one of my last resorts is to ask the community.
My code is this:
def output(alerts):
    output = list()
    for app in alerts['applications']:
        for cat in app['categories']:
            output_alert = [list(cat.items())[0], app['confidence'], app['icon'],
                            app['name'], app['version'], app['website'],
                            alerts['language'], alerts['status']]
            output.append(output_alert)
    df = pd.DataFrame(output, columns=['Categories', 'Confidence', 'Icon', 'Name',
                                       'Version', 'Website', 'Language', 'Status'])
    df.to_csv(args.output)
    print('Scan completed, you already have your new CSV file')
    return
I attached a picture of the CSV file showing the problem in column B (I have a list there), but I actually need two columns, one for each value...
I attached the JSON response that I have from a REST API
{'applications': [{'categories': [{'59': 'JavaScript libraries'}],
'confidence': '100',
'icon': 'Lo-dash.png',
'name': 'Lodash',
'version': '4.17.15',
'website': 'http://www.lodash.com'},
{'categories': [{'12': 'JavaScript frameworks'}],
'confidence': '100',
'icon': 'RequireJS.png',
'name': 'RequireJS',
'version': '2.3.6',
'website': 'http://requirejs.org'},
{'categories': [{'13': 'Issue trackers'}],
'confidence': '100',
'icon': 'Sentry.svg',
'name': 'Sentry',
'version': '4.6.2',
'website': 'https://sentry.io/'},
{'categories': [{'1': 'CMS'},
{'6': 'Ecommerce'},
{'11': 'Blogs'}],
'confidence': '100',
'icon': 'Wix.png',
'name': 'Wix',
'version': None,
'website': 'https://www.wix.com'},
{'categories': [{'59': 'JavaScript libraries'}],
'confidence': '100',
'icon': 'Zepto.png',
'name': 'Zepto',
'version': None,
'website': 'http://zeptojs.com'},
{'categories': [{'19': 'Miscellaneous'}],
'confidence': '100',
'icon': 'webpack.svg',
'name': 'webpack',
'version': None,
'website': 'https://webpack.js.org/'},
{'categories': [{'12': 'JavaScript frameworks'}],
'confidence': '0',
'icon': 'React.png',
'name': 'React',
'version': None,
'website': 'https://reactjs.org'}], 'language': 'es', 'status': 'success'}
That last bit, [{'59': 'JavaScript libraries'}], is my big problem! Thank you for your time and help!
You could try using list(cat.keys())[0], list(cat.values())[0] in your output_alert variable to extract key and value separately.
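The tuple-vs-two-values difference looks like this (a standalone sketch using one category dict from the response):

```python
# One category entry, as it appears in the API response.
cat = {'59': 'JavaScript libraries'}

# list(cat.items())[0] yields a single tuple; this is what ends up
# crammed into one CSV column:
as_tuple = list(cat.items())[0]    # ('59', 'JavaScript libraries')

# Extracting key and value separately gives two fields, one per column:
cat_id = list(cat.keys())[0]       # '59'
cat_name = list(cat.values())[0]   # 'JavaScript libraries'
```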
You can use json_normalize to extract your columns without the for-loop, and then create two new columns with the extracted keys and values from categories:
result = pd.json_normalize(
alerts,
record_path=["applications"],
meta=["language", "status"]
).explode("categories")
result["category_labels"] = result.categories.apply(lambda x: list(x.keys())[0])
result["category_values"] = result.categories.apply(lambda x: list(x.values())[0])
The output has one row per category, with category_labels and category_values as separate columns.

Extracting value for one dictionary key in Pandas based on another in the same dictionary

This is from an R guy.
I have this mess in a Pandas column: data['crew'].
array(["[{'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production', 'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor', 'profile_path': None}, {'credit_id': '56407fa89251417055000b58', 'department': 'Sound', 'gender': 0, 'id': 6745, 'job': 'Music Editor', 'name': 'Richard Henderson', 'profile_path': None}, {'credit_id': '5789212392514135d60025fd', 'department': 'Production', 'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production', 'name': 'Jeffrey Stott', 'profile_path': None}, {'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up', 'gender': 0, 'id': 23783, 'job': 'Makeup Artist', 'name': 'Heather Plott', 'profile_path': None}
It goes on for quite some time. Each new dict starts with a credit_id field. One cell can hold several dicts in an array.
Assume I want the names of all casting directors, as shown in the first entry. I need to check the job entry in every dict and, if it's 'Casting', grab what's in the name field and store it in my data frame in data['crew'].
I tried several strategies, then backed off and went for something simple.
Running the following shut me down, so I can't even access a simple field. How can I get this done in Pandas?
for row in data.head().iterrows():
    if row['crew'].job == 'Casting':
        print(row['crew'])
EDIT: Error Message
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-138-aa6183fdf7ac> in <module>()
1 for row in data.head().iterrows():
----> 2 if row['crew'].job == 'Casting':
3 print(row['crew'])
TypeError: tuple indices must be integers or slices, not str
EDIT: Code used to get the array of dict (strings?) in the first place.
import ast

def convert_JSON(data_as_string):
    try:
        dict_representation = ast.literal_eval(data_as_string)
        return dict_representation
    except ValueError:
        return []
data["crew"] = data["crew"].map(lambda x: sorted([d['name'] if d['job'] == 'Casting' else '' for d in convert_JSON(x)])).map(lambda x: ','.join(map(str, x)))
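The ast.literal_eval step on its own can be sketched like this, with a shortened, made-up version of one cell's string content:

```python
import ast

# A shortened, made-up version of one cell's stringified list of dicts.
cell = ("[{'credit_id': 'abc', 'job': 'Casting', 'name': 'Terri Taylor'}, "
        "{'credit_id': 'def', 'job': 'Music Editor', 'name': 'Richard Henderson'}]")

# literal_eval safely parses the string into real Python objects.
crew = ast.literal_eval(cell)

# Now ordinary filtering works on the list of dicts.
casting = [d['name'] for d in crew if d['job'] == 'Casting']
print(casting)  # ['Terri Taylor']
```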
To create a DataFrame from your sample data, write:
df = pd.DataFrame(data=[
{ 'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production',
'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor',
'profile_path': None},
{ 'credit_id': '56407fa89251417055000b58', 'department': 'Sound',
'gender': 0, 'id': 6745, 'job': 'Music Editor',
'name': 'Richard Henderson', 'profile_path': None},
{ 'credit_id': '5789212392514135d60025fd', 'department': 'Production',
'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production',
'name': 'Jeffrey Stott', 'profile_path': None},
{ 'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up',
'gender': 0, 'id': 23783, 'job': 'Makeup Artist',
'name': 'Heather Plott', 'profile_path': None}])
Then you can get your data with a single instruction:
df[df.job == 'Casting'].name
The result is:
0 Terri Taylor
Name: name, dtype: object
The above result is Pandas Series object with names found.
In this case, 0 is the index value for the record found and
Terri Taylor is the name of (the only in your data) Casting Director.
Edit
If you want just a list (not Series), write:
df[df.job == 'Casting'].name.tolist()
The result is ['Terri Taylor'] - just a list.
I think both of my solutions should be quicker than an "ordinary" loop
based on iterrows().
Checking the execution time, you may try also yet another solution:
df.query("job == 'Casting'").name.tolist()
==========
And as far as your code is concerned:
iterrows() yields, for each row, a pair containing:
the index label of the current row,
a Series holding the content of this row.
So your loop should look something like:
for row in df.iterrows():
    if row[1].job == 'Casting':
        print(row[1]['name'])
You cannot write row[1].name because for a Series, .name holds the index label
(here the column name collides with a built-in attribute of the Series).
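As a side note, itertuples() sidesteps that collision, because the index lives in row.Index there; a sketch on a tiny made-up frame:

```python
import pandas as pd

# Tiny made-up frame with a 'name' column, mirroring the crew data.
df = pd.DataFrame([
    {'job': 'Casting', 'name': 'Terri Taylor'},
    {'job': 'Music Editor', 'name': 'Richard Henderson'},
])

# With itertuples the index is row.Index, so row.name is the actual column.
found = [row.name for row in df.itertuples() if row.job == 'Casting']
print(found)  # ['Terri Taylor']
```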

Filtering and rearranging very large dictionary array without pandas

I have a very large dictionary array that looks like this:
masterArray =[{'value': '-1', 'product': 'product1', 'Customer': 'customer1',
'Sensor': 'sensor1', 'Date': '20170302', 'type': 'type1', 'ID': '100'},
{'value': '20', 'product': 'product1', 'Customer': 'customer1',
'Sensor': 'sensor1','Date': '20170302', 'type': 'type2', 'ID': '100'},
{'value': '0', 'product': 'product1', 'Customer': 'customer1',
'Sensor': 'sensor1', 'Date': '20170302', 'type': 'type1', 'ID': '101'},
{'value': '-5', 'product': 'product1', 'Customer': 'customer1',
'Sensor': 'sensor1', 'Date': '20170302', 'type': 'type2', 'ID': '101'}]
I need to be able to print out individual CSVs for each day, product, sensor, and customer, with the first column holding the IDs, the types as the remaining columns, and value as the data filling the rows.
ID, type1, type2
100, -1, 20
101, 0, -5
I also created a date set and a 'combination' set to gather unique dates and combinations of product, sensor, and customer.
Unfortunately I am not allowed to have the pandas library installed, although I think what I want to do would look like this with it:
df = pd.DataFrame(masterArray)
df.head()
pivot = pd.pivot_table(df, index=['ID'], values=['value'], columns=['type'])
for date in dateset:
    # filter for date
    pqd = pivot.query('Date == date')
    for row in comboset:
        # filter for each output
        pqc = pqd.query('Customer == row[0] & product == row[1] & sensor == row[2]')
        outputName = str(row[0] + '_' + date + '_' + row[1] + '_' + row[2] + '.csv')
        filepath = os.path.join(path, outputName)
        pqc.to_csv(filepath)  # print
Currently my pandas-less idea is to change masterArray into a huge nested dictionary (I create masterArray myself from other input CSV files), but I am not sure that is the most efficient way, and I don't know how best to set up the logic for a nested dictionary that large. Please help!
You can probably try something like this:
data_dict = {}
for each in masterArray:
    if each['ID'] not in data_dict:
        data_dict[each['ID']] = []
    data_dict[each['ID']].append({each['type']: each['value']})
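Building on that idea, here is a stdlib-only sketch (made-up trimmed records; variable names are assumptions) that pivots one date/product/customer/sensor group into the ID/type1/type2 layout from the question and writes it with the csv module:

```python
import csv
import io
from collections import defaultdict

# Trimmed, made-up subset of masterArray for a single
# date/product/customer/sensor combination.
masterArray = [
    {'value': '-1', 'type': 'type1', 'ID': '100'},
    {'value': '20', 'type': 'type2', 'ID': '100'},
    {'value': '0',  'type': 'type1', 'ID': '101'},
    {'value': '-5', 'type': 'type2', 'ID': '101'},
]

# Pivot: one {type: value} mapping per ID, tracking column order.
rows = defaultdict(dict)
types = []
for rec in masterArray:
    rows[rec['ID']][rec['type']] = rec['value']
    if rec['type'] not in types:
        types.append(rec['type'])

# Write with csv.DictWriter; io.StringIO stands in for a real file here.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['ID'] + types)
writer.writeheader()
for id_, vals in sorted(rows.items()):
    writer.writerow({'ID': id_, **vals})
print(buf.getvalue())
```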
