I have loaded two json files in Python3.8, and I need to merge the two based on a condition.
Obj1 = [{'account': '223', 'colr': '#555555', 'hash': True},
{'account': '134', 'colr': '#666666', 'hash': True},
{'account': '252', 'colr': '#777777', 'hash': True}]
Obj2 = [{'sn': 38796, 'code': 'df', 'id': 199, 'desc': 'jex - #777777- gg2349.252'},
{'sn': 21949, 'code': 'se', 'id': 193, 'desc': 'jex - #555555 - gf23569'},
{'sn': 21340, 'code': 'se', 'id': 3, 'desc': 'jex - #666666 - gf635387'}]
# What I am trying to get
Obj3 = [{'sn': 38796, 'code': 'df', 'id': 199, 'desc': 'jex - #777777- gg2349.252', 'account': '252', 'colr': '#777777', 'hash': True},
{'sn': 21949, 'code': 'se', 'id': 193, 'desc': 'jex - #555555 - gf23569', 'account': '223', 'colr': '#555555', 'hash': True},
{'sn': 21340, 'code': 'se', 'id': 3, 'desc': 'jex - #666666 - gf635387', 'account': '134', 'colr': '#666666', 'hash': True}]
I have tried everything I could gather from SO (append, extend, etc.) but I fall short on the condition.
I need to append each element of Obj1 to the matching element of Obj2: if the colr of an Obj1 element appears in the desc of an Obj2 element, that whole Obj1 element should be merged into the correlated Obj2 element. Alternatively, create a new Obj3 that I can print these updated values from.
What I have tried and looked at thus far Append JSON Object, Append json objects to nested list, Appending json object to existing json object and a few others that also didn't help.
Hope this makes sense and thank you
Something simple like this would work.
for i in range(len(Obj1)):
    for j in range(len(Obj2)):
        if Obj1[i]['colr'] in Obj2[j]['desc']:
            Obj1[i].update(Obj2[j])
print(Obj1)
One approach is to first create a dictionary mapping each color to the JSON element. You can do this as
colr2elem = {elem['colr']: elem for elem in json_obj1}
Then you can see which color to append by applying a regular expression to the description, and update the json_obj2 dictionaries (merge dictionaries).
import re

for elem2 in json_obj2:
    # re.search returns None when no colour code is present, so guard the lookup
    match = re.search(r'#\d+', elem2['desc'])
    elem1 = colr2elem.get(match.group(0)) if match else None
    elem2.update(elem1 if elem1 is not None else {})
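Putting the two steps together as a runnable sketch, with the Obj1/Obj2 sample data from the question bound to json_obj1/json_obj2:

```python
import re

json_obj1 = [{'account': '223', 'colr': '#555555', 'hash': True},
             {'account': '134', 'colr': '#666666', 'hash': True},
             {'account': '252', 'colr': '#777777', 'hash': True}]
json_obj2 = [{'sn': 38796, 'code': 'df', 'id': 199, 'desc': 'jex - #777777- gg2349.252'},
             {'sn': 21949, 'code': 'se', 'id': 193, 'desc': 'jex - #555555 - gf23569'},
             {'sn': 21340, 'code': 'se', 'id': 3, 'desc': 'jex - #666666 - gf635387'}]

# step 1: one pass to build the colour -> element lookup
colr2elem = {elem['colr']: elem for elem in json_obj1}

# step 2: one pass over json_obj2; guard against descriptions with no colour code
for elem2 in json_obj2:
    match = re.search(r'#\d+', elem2['desc'])
    elem1 = colr2elem.get(match.group(0)) if match else None
    elem2.update(elem1 if elem1 is not None else {})

print(json_obj2[0]['account'])  # prints 252
```

Unlike the nested-loop answers, this is linear in the total number of elements, since each list is traversed only once.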
Obj1 = [{'account': '223', 'colr': '#555555', 'hash': True},
{'account': '134', 'colr': '#666666', 'hash': True},
{'account': '252', 'colr': '#777777', 'hash': True}]
Obj2 = [{'sn': 38796, 'code': 'df', 'id': 199, 'desc': 'jex - #777777- gg2349.252'},
{'sn': 21949, 'code': 'se', 'id': 193, 'desc': 'jex - #555555 - gf23569'},
{'sn': 21340, 'code': 'se', 'id': 3, 'desc': 'jex - #666666 - gf635387'}]
Obj3 = []
for i in Obj1:
    for j in Obj2:
        if i["colr"] == j["desc"][6:13]:  # assumes the colour always occupies desc[6:13]
            a = {**j, **i}
            Obj3.append(a)
print(Obj3)
You can use element1['colr'] in element2['desc'] to check if elements from the first and second arrays match. Now, you can iterate over the second array and for each of its elements find the corresponding element from the first array by checking this condition:
json_obj3 = []
for element2 in json_obj2:
    for element1 in json_obj1:
        if element1['colr'] in element2['desc']:
            element3 = dict(**element1, **element2)
            json_obj3.append(element3)
            break  # stop the inner loop, because the matched element is found
BTW, this can be written as a single expression using a nested list comprehension:
json_obj3 = [
    dict(**element1, **element2)
    for element1 in json_obj1
    for element2 in json_obj2
    if element1['colr'] in element2['desc']
]
I have a function, def group_by_transaction, and I want it to return a new list of dictionaries, but when I run it with my example data I get:
[{'user_id': 'user3',
'transaction_category_id': '698723',
'transaction_amount_sum': 500},
{'user_id': 'user4',
'transaction_category_id': '698723',
'transaction_amount_sum': 500},
{'user_id': 'user5',
'transaction_category_id': '698723',
'transaction_amount_sum': 300}]
But I wish it was:
[{'number_of_users': 3,
'transaction_category_id': '698723',
'transaction_amount_sum': 1300}]
from itertools import groupby
from operator import itemgetter
data = [{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a9e',
'date': '2013-12-30',
'user_id': 'user3',
'is_blocked': 'false',
'transaction_amount': 200,
'transaction_category_id': '698723',
'is_active': '0'},
{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a7e',
'date': '2013-12-21',
'user_id': 'user3',
'is_blocked': 'false',
'transaction_amount': 300,
'transaction_category_id': '698723',
'is_active': '0'},
{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a9e',
'date': '2013-12-30',
'user_id': 'user4',
'is_blocked': 'false',
'transaction_amount': 200,
'transaction_category_id': '698723',
'is_active': '0'},
{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a7e',
'date': '2013-12-21',
'user_id': 'user4',
'is_blocked': 'false',
'transaction_amount': 300,
'transaction_category_id': '698723',
'is_active': '0'},
{'transaction_id': '00004ed8-2c57-4374-9a0c-3ff1d8a94a7e',
'date': '2013-12-21',
'user_id': 'user5',
'is_blocked': 'false',
'transaction_amount': 300,
'transaction_category_id': '698723',
'is_active': '0'}]
def group_by_transaction(data):
    grouper = ['user_id', 'transaction_category_id']
    key = itemgetter(*grouper)
    data.sort(key=key)
    return [{**dict(zip(grouper, k)),
             'transaction_amount_sum': sum(map(itemgetter('transaction_amount'), g))}
            for k, g in groupby(data, key=key)]

group_by_transaction(data)
Can anybody help me, please?
I tried to add a new column to the calculation inside the loop, but I couldn't find a way to make it work.
Collect the data with groupby.
I'm not brave enough to mutate data implicitly inside a function, so I prefer sorted(data) over data.sort.
You have to somehow count the number of unique users in order to get the key-value pair {'number_of_users': 3}.
You're grouping by the ['user_id', 'transaction_category_id'] pair, but to get what you wish, the only key to group by is 'transaction_category_id'.
With that said, here's code that I'm sure is close enough to yours to produce the desired grouping.
def group_by_transaction(data):
    category = itemgetter('transaction_category_id')
    user_amount = itemgetter('user_id', 'transaction_amount')
    return [
        {
            'transaction_category_id': cat_id,
            'number_of_users': len({*users}),
            'transaction_amount_sum': sum(amounts),
        }
        for cat_id, group in groupby(sorted(data, key=category), category)
        for users, amounts in [zip(*(user_amount(record) for record in group))]
    ]
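If the groupby/zip combination feels dense, the same grouping can be sketched with a plain accumulator dict (the function name group_by_transaction_simple and the trimmed sample records are mine, for illustration; the output shape matches the desired result above):

```python
from collections import defaultdict


def group_by_transaction_simple(data):
    # per category: the set of distinct users and the running amount total
    acc = defaultdict(lambda: {'users': set(), 'total': 0})
    for rec in data:
        bucket = acc[rec['transaction_category_id']]
        bucket['users'].add(rec['user_id'])
        bucket['total'] += rec['transaction_amount']
    return [{'transaction_category_id': cat,
             'number_of_users': len(b['users']),
             'transaction_amount_sum': b['total']}
            for cat, b in acc.items()]


sample = [{'user_id': 'user3', 'transaction_category_id': '698723', 'transaction_amount': 200},
          {'user_id': 'user3', 'transaction_category_id': '698723', 'transaction_amount': 300},
          {'user_id': 'user4', 'transaction_category_id': '698723', 'transaction_amount': 500},
          {'user_id': 'user5', 'transaction_category_id': '698723', 'transaction_amount': 300}]
print(group_by_transaction_simple(sample))
# [{'transaction_category_id': '698723', 'number_of_users': 3, 'transaction_amount_sum': 1300}]
```

This variant needs no sorting, at the cost of holding all group state in memory at once.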
Update
About for users, amounts in [zip(*(user_amount(record) for record in group))]:
by user_amount(record) ... we extract the pair (user_id, transaction_amount) from each record
by zip(*(...)) we transpose the collected data
zip returns an iterator, which in this case yields two rows: the first holds the user_id values and the second the transaction_amount values. To get them both at once we wrap the zip object as the only item of a list. That's the meaning of [zip(...)]
when the zip result is assigned to several variables, as in users, amounts = zip(...), the zipped values are unpacked; in our case, into the two rows mentioned above
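The transpose step in isolation, as a small sketch with made-up pairs:

```python
pairs = [('user3', 200), ('user4', 200), ('user5', 300)]

# zip(*pairs) turns rows into columns: one tuple of users, one of amounts
users, amounts = zip(*pairs)
print(users)    # ('user3', 'user4', 'user5')
print(amounts)  # (200, 200, 300)

# the [zip(...)] wrapper in the comprehension performs the same unpacking
for u, a in [zip(*pairs)]:
    assert u == users and a == amounts
```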
I have a complex situation which I hope to solve and which might benefit us all. I collected data from my API, added pagination, and inserted the complete data package into a tuple named q1. Finally, I made a dictionary named dict_1 of that tuple, which looks like this:
dict_1 = {100: {'ID': 100, 'DKSTGFase': None, 'DK': False, 'KM': None,
                'Country': {'Name': 'GE', 'City': {'Name': 'Berlin'}},
                'Type': {'Name': '219'}, 'DKObject': {'Name': '8555', 'Object': {'Name': 'Car'}},
                'Order': {'OrderId': 101, 'CreatedOn': '2018-07-06T16:54:36.783+02:00',
                          'ModifiedOn': '2018-07-06T16:54:36.783+02:00',
                          'Name': 'Audi', 'Client': {'1'}}, 'DKComponent': {'Name': 'John'}},
          200: {'ID': 200, 'DKSTGFase': None, 'DK': False, 'KM': None,
                'Country': {'Name': 'ES', 'City': {'Name': 'Madrid'}}, 'Type': {'Name': '220'},
                'DKObject': {'Name': '8556', 'Object': {'Name': 'Car'}},
                'Order': {'OrderId': 102, 'CreatedOn': '2018-07-06T16:54:36.783+02:00',
                          'ModifiedOn': '2018-07-06T16:54:36.783+02:00',
                          'Name': 'Mercedes', 'Client': {'2'}}, 'DKComponent': {'Name': 'Sergio'}}}
Please note that in the above dictionary I have shown just 2 records. The actual dictionary has 1400 records, running up to ID 1500.
Now I want to do 2 things:
I want to change some keys for all the records: key DK has to become DK1, key Name in Country has to become Name1, and Name in Object has to become 'Name2'.
The second thing I want is to make a DataFrame of the whole bunch of data. My expected outcome is:
This is my code:
q1 = response_2.json()
next_link = q1['#odata.nextLink']
q1 = [tuple(q1.values())]

while next_link:
    new_response = requests.get(next_link, headers=headers, proxies=proxies)
    new_data = new_response.json()
    q1.append(tuple(new_data.values()))
    next_link = new_data.get('#odata.nextLink', None)

dict_1 = {
    record['ID']: record
    for tup in q1
    for record in tup[2]
}
# print(dict_1)

for x in dict_1.values():
    x['DK1'] = x['DK']
    x['Country']['Name1'] = x['Country']['Name']
    x['Object']['Name2'] = x['Object']['Name']

df = pd.DataFrame(dict_1)
When I run this I receive the following error:
Traceback (most recent call last):
File "c:\data\FF\Desktop\Python\PythongMySQL\Talky.py", line 57, in <module>
x['Country']['Name1'] = x['Country']['Name']
TypeError: 'NoneType' object is not subscriptable
Working code:
lists = []
alldict = [{100: {'ID': 100, 'DKSTGFase': None, 'DK': False, 'KM': None,
                  'Country': {'Name': 'GE', 'City': {'Name': 'Berlin'}},
                  'Type': {'Name': '219'}, 'DKObject': {'Name': '8555', 'Object': {'Name': 'Car'}},
                  'Order': {'OrderId': 101, 'CreatedOn': '2018-07-06T16:54:36.783+02:00',
                            'ModifiedOn': '2018-07-06T16:54:36.783+02:00',
                            'Name': 'Audi', 'Client': {'1'}}, 'DKComponent': {'Name': 'John'}}}]

for eachdict in alldict:
    key = list(eachdict.keys())[0]
    eachdict[key]['DK1'] = eachdict[key]['DK']
    del eachdict[key]['DK']
    eachdict[key]['Country']['Name1'] = eachdict[key]['Country']['Name']
    del eachdict[key]['Country']['Name']
    eachdict[key]['DKObject']['Object']['Name2'] = eachdict[key]['DKObject']['Object']['Name']
    del eachdict[key]['DKObject']['Object']['Name']
    lists.append([key, eachdict[key]['DK1'], eachdict[key]['KM'], eachdict[key]['Country']['Name1'],
                  eachdict[key]['Country']['City']['Name'], eachdict[key]['DKObject']['Object']['Name2'],
                  eachdict[key]['Order']['Client']])

pd.DataFrame(lists, columns=[<columnNamesHere>])
Output:
{100: {'ID': 100,
'DKSTGFase': None,
'KM': None,
'Country': {'City': {'Name': 'Berlin'}, 'Name1': 'GE'},
'Type': {'Name': '219'},
'DKObject': {'Name': '8555', 'Object': {'Name2': 'Car'}},
'Order': {'OrderId': 101,
'CreatedOn': '2018-07-06T16:54:36.783+02:00',
'ModifiedOn': '2018-07-06T16:54:36.783+02:00',
'Name': 'Audi',
'Client': {'1'}},
'DKComponent': {'Name': 'John'},
'DK1': False}}
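As an alternative to building the row lists by hand, recent pandas versions (1.0+) expose json_normalize as a top-level function that flattens nested records into dotted column names, which can then be renamed; a sketch with one trimmed record from the question:

```python
import pandas as pd

records = [{'ID': 100, 'DK': False, 'KM': None,
            'Country': {'Name': 'GE', 'City': {'Name': 'Berlin'}},
            'DKObject': {'Name': '8555', 'Object': {'Name': 'Car'}}}]

# nested dicts become columns like 'Country.Name' and 'Country.City.Name'
df = pd.json_normalize(records)
print(df.loc[0, 'Country.City.Name'])  # prints Berlin

# the key renames from the question can then be plain column renames
df = df.rename(columns={'DK': 'DK1',
                        'Country.Name': 'Name1',
                        'DKObject.Object.Name': 'Name2'})
```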
I have the below list -
[{'metric': 'sales', 'value': '100', 'units': 'dollars'},
{'metric': 'instock', 'value': '95.2', 'units': 'percent'}]
I would like to reformat it like the below in Python -
{'sales': '100', 'instock': '95.2'}
I did the below -
a = [above list]
for i in a:
    print({i['metric']: i['value']})
But it outputs like this -
{'sales': '100'}
{'instock': '95.2'}
I would like these two lines to be part of the same dictionary.
d = [{'metric': 'sales', 'value': '100', 'units': 'dollars'},
{'metric': 'instock', 'value': '95.2', 'units': 'percent'}]
new_d = {e["metric"]: e["value"] for e in d}
# output: {'sales': '100', 'instock': '95.2'}
I believe it's best to first try it yourself, and post a question only if you don't succeed. You should consider posting your attempts next time.
This is from an R guy.
I have this mess in a Pandas column: data['crew'].
array(["[{'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production', 'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor', 'profile_path': None}, {'credit_id': '56407fa89251417055000b58', 'department': 'Sound', 'gender': 0, 'id': 6745, 'job': 'Music Editor', 'name': 'Richard Henderson', 'profile_path': None}, {'credit_id': '5789212392514135d60025fd', 'department': 'Production', 'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production', 'name': 'Jeffrey Stott', 'profile_path': None}, {'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up', 'gender': 0, 'id': 23783, 'job': 'Makeup Artist', 'name': 'Heather Plott', 'profile_path': None}
It goes on for quite some time. Each new dict starts with a credit_id field. One cell can hold several dicts in an array.
Assume I want the names of all Casting directors, as shown in the first entry. I need to check the job entry in every dict and, if it's Casting, grab what's in the name field and store it in my data frame in data['crew'].
I tried several strategies, then backed off and went for something simple.
Running the following shut me down, so I can't even access a simple field. How can I get this done in Pandas?
for row in data.head().iterrows():
    if row['crew'].job == 'Casting':
        print(row['crew'])
EDIT: Error Message
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-138-aa6183fdf7ac> in <module>()
1 for row in data.head().iterrows():
----> 2 if row['crew'].job == 'Casting':
3 print(row['crew'])
TypeError: tuple indices must be integers or slices, not str
EDIT: Code used to get the array of dict (strings?) in the first place.
def convert_JSON(data_as_string):
    try:
        dict_representation = ast.literal_eval(data_as_string)
        return dict_representation
    except ValueError:
        return []

data["crew"] = (data["crew"]
                .map(lambda x: sorted([d['name'] if d['job'] == 'Casting' else ''
                                       for d in convert_JSON(x)]))
                .map(lambda x: ','.join(map(str, x))))
To create a DataFrame from your sample data, write:
df = pd.DataFrame(data=[
{ 'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production',
'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor',
'profile_path': None},
{ 'credit_id': '56407fa89251417055000b58', 'department': 'Sound',
'gender': 0, 'id': 6745, 'job': 'Music Editor',
'name': 'Richard Henderson', 'profile_path': None},
{ 'credit_id': '5789212392514135d60025fd', 'department': 'Production',
'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production',
'name': 'Jeffrey Stott', 'profile_path': None},
{ 'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up',
'gender': 0, 'id': 23783, 'job': 'Makeup Artist',
'name': 'Heather Plott', 'profile_path': None}])
Then you can get your data with a single instruction:
df[df.job == 'Casting'].name
The result is:
0 Terri Taylor
Name: name, dtype: object
The above result is a Pandas Series object with the names found.
In this case, 0 is the index value of the record found and
Terri Taylor is the name of the (only, in your data) Casting Director.
Edit
If you want just a list (not Series), write:
df[df.job == 'Casting'].name.tolist()
The result is ['Terri Taylor'] - just a list.
I think both my solutions should be quicker than an "ordinary" loop
based on iterrows().
Checking the execution time, you may try also yet another solution:
df.query("job == 'Casting'").name.tolist()
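To check that claim on your own data, a rough timeit sketch (the frame size and repeat count are arbitrary choices of mine):

```python
import timeit

import pandas as pd

df = pd.DataFrame({'job': ['Casting', 'Sound'] * 500,
                   'name': ['Terri Taylor', 'Richard Henderson'] * 500})

mask = lambda: df[df.job == 'Casting'].name.tolist()
loop = lambda: [row[1]['name'] for row in df.iterrows() if row[1]['job'] == 'Casting']

print(timeit.timeit(mask, number=20))  # vectorized boolean mask
print(timeit.timeit(loop, number=20))  # python-level iterrows loop
# both produce the same list of names; the mask version is typically much faster
```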
==========
And as far as your code is concerned:
iterrows() yields, for each row, a pair containing:
the index of the current row,
a Series holding the content of this row.
So your loop should look something like:
for row in df.iterrows():
    if row[1].job == 'Casting':
        print(row[1]['name'])
You cannot write row[1].name because it refers to the index value
(here we have a collision with the built-in .name attribute of the Series).
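The collision is easy to see in a small sketch (the two-row frame is mine, for illustration):

```python
import pandas as pd

df = pd.DataFrame({'job': ['Casting'], 'name': ['Terri Taylor']})

for idx, row in df.iterrows():
    print(row.name)     # prints 0: the index label, not the 'name' column
    print(row['name'])  # prints Terri Taylor: bracket access reaches the column
```

Bracket access is therefore the safe choice whenever a column name shadows a Series attribute.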
I have a YAML file that parses into an object, e.g.:
{'name': [{'proj_directory': '/directory/'},
{'categories': [{'quick': [{'directory': 'quick'},
{'description': None},
{'table_name': 'quick'}]},
{'intermediate': [{'directory': 'intermediate'},
{'description': None},
{'table_name': 'intermediate'}]},
{'research': [{'directory': 'research'},
{'description': None},
{'table_name': 'research'}]}]},
{'nomenclature': [{'extension': 'nc'},
{'handler': 'script'},
{'filename': [{'id': [{'type': 'VARCHAR'}]},
{'date': [{'type': 'DATE'}]},
{'v': [{'type': 'INT'}]}]},
{'data': [{'time': [{'variable_name': 'time'},
{'units': 'minutes since 1-1-1980 00:00 UTC'},
{'latitude': [{'variable_n...
I'm having trouble accessing the data in Python and regularly see the error TypeError: list indices must be integers, not str.
I want to be able to access all elements corresponding to 'name'; to retrieve each data field I imagine it would look something like:
import yaml

settings_stream = open('file.yaml', 'r')
settingsMap = yaml.safe_load(settings_stream)
yaml_stream = True
print 'loaded settings for: ',
for project in settingsMap:
    print project + ', ' + settingsMap[project]['project_directory']
and I would expect each element to be accessible via something like ['name']['categories']['quick']['directory']
and something a little deeper would just be:
['name']['nomenclature']['data']['latitude']['variable_name']
or am I completely wrong here?
The brackets, [], indicate that you have lists of dicts, not just a dict.
For example, settingsMap['name'] is a list of dicts.
Therefore, you need to select the correct dict in the list using an integer index, before you can select the key in the dict.
So, given your current data structure, you'd need to use:
settingsMap['name'][1]['categories'][0]['quick'][0]['directory']
Or, revise the underlying YAML data structure.
For example, if the data structure looked like this:
settingsMap = {
    'name': {'proj_directory': '/directory/',
             'categories': {'quick': {'directory': 'quick',
                                      'description': None,
                                      'table_name': 'quick'},
                            'intermediate': {'directory': 'intermediate',
                                             'description': None,
                                             'table_name': 'intermediate'},
                            'research': {'directory': 'research',
                                         'description': None,
                                         'table_name': 'research'}},
             'nomenclature': {'extension': 'nc',
                              'handler': 'script',
                              'filename': {'id': {'type': 'VARCHAR'},
                                           'date': {'type': 'DATE'},
                                           'v': {'type': 'INT'}},
                              'data': {'time': {'variable_name': 'time',
                                                'units': 'minutes since 1-1-1980 00:00 UTC'}}}}}
then you could access the same value as above with
settingsMap['name']['categories']['quick']['directory']
# quick
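The extra lists come from the YAML source itself: a block of "- key: value" items parses to a list of one-key dicts, while plain nested mappings parse to nested dicts. A sketch of both shapes (the two YAML snippets are mine, mimicking the structures above):

```python
import yaml

listy = yaml.safe_load("""
name:
  - proj_directory: /directory/
  - categories:
      - quick:
          - directory: quick
""")
# every level is a list of single-key dicts, so integer indices are required
print(listy['name'][1]['categories'][0]['quick'][0]['directory'])  # prints quick

nested = yaml.safe_load("""
name:
  proj_directory: /directory/
  categories:
    quick:
      directory: quick
""")
# plain mappings give plain nested dicts: string keys all the way down
print(nested['name']['categories']['quick']['directory'])  # prints quick
```

So dropping the leading dashes in the YAML file is what turns the awkward `[1]`/`[0]` indexing into the straightforward key access you expected.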