I have created Data Frame, Check below snippet.
data = {'id': [101,102],
'name': ['xyz', 'xyz'],
'value1' : [41,42],
'value2' : [42,32]
}
df = pd.DataFrame(data, columns = ['id', 'name','value1','value2'])
print(df)
Output of Dataframe
id name value1 value2
101 xyz 41 42
102 xyy 42 32
Here I just want to create nested dictionary from this Data Frame.
Expected output
{'101':{'name':'xyz'
'data' : [{'value1' : 41,'value2':42},
{'value1': 42,'value2':32}]}}
I tried to do by following code but it's won't work so please could you me solve this
#Tried Snippet
code
print({n: grp.loc[n].to_dict('index')for n, grp in df.set_index(['id','name']).group by(level='id')})
output
{101: {'xyz': {'value1': 41, 'value2': 42}}, 102: {'xyz': {'value1': 42, 'value2': 32}}}
code
print({k:f.groupby('name')['value1'].apply(list).to_dict() for k, f in df.groupby('id')})
output
{101: {'xyz': [41]}, 102: {'xyz': [42]}}
required output
{'101':{'name':'xyz'
'data' : [{'value1' : 41,'value2':42},
{'value1': 42,'value2':32},
]}}
Let say df is :
df:
id name value1 value2
0 101 xyz 41 42
1 102 xyy 42 32
2 101 xyz 46 46
3 102 xyy 40 39
df.groupby(['id', 'name'])[['value1', 'value2']] \
.apply(lambda x: x.to_dict(orient='records')).reset_index(name='data')\
.set_index('id').to_dict(orient='index')
{101: {'name': 'xyz', 'data': [{'value1': 41, 'value2': 42}, {'value1': 46, 'value2': 46}]}, 102: {'name': 'xyy', 'data': [{'value1': 42, 'value2': 32}, {'value1': 40, 'value2': 39}]}}
Related
I have a DF with the following columns and data:
I hope it could be converted to two columns, studentid and info, with the following format.
the dataset is
"""
studentid course teacher grade rank
1 math A 91 1
1 history B 79 2
2 math A 88 2
2 history B 83 1
3 math A 85 3
3 history B 76 3
and the desire output is
studentid info
1 "{""math"":[{""teacher"":""A"",""grade"":91,""rank"":1}],
""history"":[{""teacher"":""B"",""grade"":79,""rank"":2}]}"
2 "{""math"":[{""teacher"":""A"",""grade"":88,""rank"":2}],
""history"":[{""teacher"":""B"",""grade"":83,""rank"":1}]}"
3 "{""math"":[{""teacher"":""A"",""grade"":85,""rank"":3}],
""history"":[{""teacher"":""B"",""grade"":76,""rank"":3}]}"
You don't really need groupby() and the single sub-dictionaries shouldn't really be in a list, but as value's for the nested dict. After setting the columns you want as index, with df.to_dict() you can achieve the desired output:
df = df.set_index(['studentid','course'])
df.to_dict(orient='index')
Outputs:
{(1, 'math'): {'teacher': 'A', 'grade': 91, 'rank': 1},
(1, 'history'): {'teacher': 'B', 'grade': 79, 'rank': 2},
(2, 'math'): {'teacher': 'A', 'grade': 88, 'rank': 2},
(2, 'history'): {'teacher': 'B', 'grade': 83, 'rank': 1},
(3, 'math'): {'teacher': 'A', 'grade': 85, 'rank': 3},
(3, 'history'): {'teacher': 'B', 'grade': 76, 'rank': 3}}
Considering that the initial dataframe is df, there are various options, depending on the exact desired output.
If one wants the info column to be a dictionary of lists, this will do the work
df_new = df.groupby('studentid').apply(lambda x: x.drop('studentid', axis=1).to_dict(orient='list')).reset_index(name='info')
[Out]:
studentid info
0 1 {'course': ['math', 'history'], 'teacher': ['A...
1 2 {'course': ['math', 'history'], 'teacher': ['A...
2 3 {'course': ['math', 'history'], 'teacher': ['A...
If one wants a list of dictionaries, then do the following
df_new = df.groupby('studentid').apply(lambda x: x.drop('studentid', axis=1).to_dict(orient='records')).reset_index(name='info')
[Out]:
studentid info
0 1 [{'course': 'math', 'teacher': 'A', 'grade': 9...
1 2 [{'course': 'math', 'teacher': 'A', 'grade': 8...
2 3 [{'course': 'math', 'teacher': 'A', 'grade': 8...
I know there are many threads on this df.groupby to merge rows with the same value in a column. But for the following situation, given this data frame:
df = pd.DataFrame([{'timestamp': '2021-05-28 14:00:00.274', 'value1': 123, 'value2': 21},
{'timestamp': '2021-05-28 14:00:00.374', 'value1': 101, 'value2': 33},
{'timestamp': '2021-05-28 14:00:01.294', 'value1': 7, 'value2': 12},
{'timestamp': '2021-05-28 14:00:02.002', 'value1': 42, 'value2': 10},
{'timestamp': '2021-05-28 14:00:02.039', 'value1': 1, 'value2': 34},
{'timestamp': '2021-05-28 14:00:03.00', 'value1': 2, 'value2': 41}])
I want to merge rows base on timestamp. The condition is: the benchmark time is at second .0, for eg 2021-05-28 14:02:01.000 is the benchmark. Any rows that fall between 2021-05-28 14:02:00.500 and 2021-05-28 14:02:01.500 should be grouped together where the value1 and value2 are the max in the group. The timestamp upper and lower boundary can be inclusive in either of these [ ) or ( ].
For this eg., the expected output is:
df_merge = pd.DataFrame([{'timestamp': '2021-05-28 14:00:00.000', 'value1': 123, 'value2': 33},
{'timestamp': '2021-05-28 14:00:01.000', 'value1': 7, 'value2': 12},
{'timestamp': '2021-05-28 14:00:02.000', 'value1': 42, 'value2': 34},
{'timestamp': '2021-05-28 14:00:03.00', 'value1': 2, 'value2': 41}])
Here, row 0 and 1 are merged into one, row 3 and 4 are merged into one.
The values in column timestamp is of datetime64[ns] type.
What is a good way to do this?
IIUC:
try:
df['timestamp']=pd.to_datetime(df['timestamp'])
#convert 'timestamp' column to datetime
Finally make use of groupby() and pd.Grouper():
out=df.groupby(pd.Grouper(key='timestamp',freq='1s')).max().reset_index()
OR
via assign(),floor() and groupby():
out=df.assign(timestamp=df['timestamp'].dt.floor('1s')).groupby('timestamp',as_index=False).max()
OR
via set_index() and resample():
out=df.set_index('timestamp').resample('1s').max().reset_index()
output of out:
timestamp value1 value2
0 2021-05-28 14:00:00 123 33
1 2021-05-28 14:00:01 7 12
2 2021-05-28 14:00:02 42 34
3 2021-05-28 14:00:03 2 41
One of the columns of my pandas dataframe looks like this
>> df
Item
0 [{"id":A,"value":20},{"id":B,"value":30}]
1 [{"id":A,"value":20},{"id":C,"value":50}]
2 [{"id":A,"value":20},{"id":B,"value":30},{"id":C,"value":40}]
I want to expand it as
A B C
0 20 30 NaN
1 20 NaN 50
2 20 30 40
I tried
dfx = pd.DataFrame()
for i in range(df.shape[0]):
df1 = pd.DataFrame(df.item[i]).T
header = df1.iloc[0]
df1 = df1[1:]
df1 = df1.rename(columns = header)
dfx = dfx.append(df1)
But this takes a lot of time as my data is huge. What is the best way to do this?
My original json data looks like this:
{
{
'_id': '5b1284e0b840a768f5545ef6',
'device': '0035sdf121',
'customerId': '38',
'variantId': '31',
'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
'item': [{'id': A, 'value': 20},
{'id': B, 'value': 30},
{'id': C, 'value': 50}
},
{
'_id': '5b1284e0b840a768f5545ef6',
'device': '0035sdf121',
'customerId': '38',
'variantId': '31',
'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
'item': [{'id': A, 'value': 20},
{'id': B, 'value': 30},
{'id': C, 'value': 50}
},
.............
}
I agree with #JeffH, you should really look at how you are constructing the DataFrame.
Assuming you are getting this from somewhere out of your control then you can convert to the your desired DataFrame with:
In []:
pd.DataFrame(df['Item'].apply(lambda r: {d['id']: d['value'] for d in r}).values.tolist())
Out[]:
A B C
0 20 30.0 NaN
1 20 NaN 50.0
2 20 30.0 40.0
I have a list of dictionaries in "my_list" as follows:
my_list=[{'Id': '100', 'A': [val1, val2], 'B': [val3, val4], 'C': [val5,val6]},
{'Id': '200', 'A': [val7, val8], 'B': [val9, val10], 'C':
[val11,val12],
{'Id': '300', 'A': [val13, val14], 'B': [val15, val16], 'C':
[val17,val18]}]
I want to write this list into a CSV file as follows:
ID, A, AA, B, BB, C, CC
100, val1, val2, val3, val4, val5, val6
200, val7, val8, val9, val10, val11, val12
300, val13, val14, val15, val16, val17, val18
Does anyone know how can I handle it?
Tablib should do the trick
I leave here the example on their front page (which you can adapt to the .csv format) :
>>> data = tablib.Dataset(headers=['First Name', 'Last Name', 'Age'])
>>> for i in [('Kenneth', 'Reitz', 22), ('Bessie', 'Monke', 21)]:
... data.append(i)
>>> print(data.export('json'))
[{"Last Name": "Reitz", "First Name": "Kenneth", "Age": 22}, {"Last Name": "Monke", "First Name": "Bessie", "Age": 21}]
>>> print(data.export('yaml'))
- {Age: 22, First Name: Kenneth, Last Name: Reitz}
- {Age: 21, First Name: Bessie, Last Name: Monke}
>>> data.export('xlsx')
<censored binary data>
>>> data.export('df')
First Name Last Name Age
0 Kenneth Reitz 22
1 Bessie Monke 21
You could do this... (replacing print with a csv writerow as appropriate)
print(['ID', 'A', 'AA', 'B', 'BB', 'C', 'CC'])
for row in my_list:
out_row = []
out_row.append(row['Id'])
for v in row['A']:
out_row.append(v)
for v in row['B']:
out_row.append(v)
for v in row['C']:
out_row.append(v)
print(out_row)
You can use pandas to do the trick:
my_list = [{'Id': '100', 'A': [val1, val2], 'B': [val3, val4], 'C': [val5, val6]},
{'Id': '200', 'A': [val7, val8], 'B': [val9, val10], 'C': [val11, val12]},
{'Id': '300', 'A': [val13, val14], 'B': [val15, val16], 'C': [val17, val18]}]
index = ['Id', 'A', 'AA', 'B', 'BB', 'C', 'CC']
df = pd.DataFrame(data=my_list)
for letter in ['A', 'B', 'C']:
first = []
second = []
for a in df[letter].values.tolist():
first.append(a[0])
second.append(a[1])
df[letter] = first
df[letter * 2] = second
df = df.reindex_axis(index, axis=1)
df.to_csv('out.csv')
This produces the following output as dataframe:
Id A AA B BB C CC
0 100 1 2 3 4 5 6
1 200 7 8 9 10 11 12
2 300 13 14 15 16 17 18
and this is the out.csv-file:
,Id,A,AA,B,BB,C,CC
0,100,1,2,3,4,5,6
1,200,7,8,9,10,11,12
2,300,13,14,15,16,17,18
See pandas documentation about the csv-feature (csv).
Write DataFrame to a comma-separated values (csv) file
So I have 2 list of dicts which are as follows:
list1 = [
{'name':'john',
'gender':'male',
'grade': 'third'
},
{'name':'cathy',
'gender':'female',
'grade':'second'
},
]
list2 = [
{'name':'john',
'physics':95,
'chemistry':89
},
{'name':'cathy',
'physics':78,
'chemistry':69
},
]
The output list i need is as follows:
final_list = [
{'name':'john',
'gender':'male',
'grade':'third'
'marks': {'physics':95, 'chemistry': 89}
},
{'name':'cathy',
'gender':'female'
'grade':'second'
'marks': {'physics':78, 'chemistry': 69}
},
]
First i tried with iteration as follows:
final_list = []
for item1 in list1:
for item2 in list2:
if item1['name'] == item2['name']:
temp = dict(item_2)
temp.pop('name')
final_result.append(dict(name=item_1['name'], **temp))
However,this does not give me the desired result..I also tried pandas..limited experience there..
>>> import pandas as pd
>>> df1 = pd.DataFrame(list1)
>>> df2 = pd.DataFrame(list2)
>>> result = pd.merge(df1, df2, on=['name'])
However,i am clueless how to get the data back to the original format i need it in..Any help
You can first merge both dataframes
In [144]: df = pd.DataFrame(list1).merge(pd.DataFrame(list2))
Which would look like,
In [145]: df
Out[145]:
gender grade name chemistry physics
0 male third john 89 95
1 female second cathy 69 78
Then create a marks columns as a dict
In [146]: df['marks'] = df.apply(lambda x: [x[['chemistry', 'physics']].to_dict()], axis=1)
In [147]: df
Out[147]:
gender grade name chemistry physics \
0 male third john 89 95
1 female second cathy 69 78
marks
0 [{u'chemistry': 89, u'physics': 95}]
1 [{u'chemistry': 69, u'physics': 78}]
And, use to_dict(orient='records') method of selected columns of dataframe
In [148]: df[['name', 'gender', 'grade', 'marks']].to_dict(orient='records')
Out[148]:
[{'gender': 'male',
'grade': 'third',
'marks': [{'chemistry': 89L, 'physics': 95L}],
'name': 'john'},
{'gender': 'female',
'grade': 'second',
'marks': [{'chemistry': 69L, 'physics': 78L}],
'name': 'cathy'}]
Using your pandas approach, you can call
result.to_dict(orient='records')
to get it back as a list of dictionaries. It won't put marks in as a sub-field though, since there's nothing telling it to do that. physics and chemistry will just be fields on the same level as the rest.
You may also be having problems because your name is 'cathy' in the first list and 'kathy' in the second, which naturally won't get merged.
create a function that will add a marks column , this columns should contain a dictionary of physics and chemistry marks
def create_marks(df):
df['marks'] = { 'chemistry' : df['chemistry'] , 'physics' : df['physics'] }
return df
result_with_marks = result.apply( create_marks , axis = 1)
Out[19]:
gender grade name chemistry physics marks
male third john 89 95 {u'chemistry': 89, u'physics': 95}
female second cathy 69 78 {u'chemistry': 69, u'physics': 78}
then convert it to your desired result as follows
result_with_marks.drop( ['chemistry' , 'physics'], axis = 1).to_dict(orient = 'records')
Out[20]:
[{'gender': 'male',
'grade': 'third',
'marks': {'chemistry': 89L, 'physics': 95L},
'name': 'john'},
{'gender': 'female',
'grade': 'second',
'marks': {'chemistry': 69L, 'physics': 78L},
'name': 'cathy'}]
Considering you want a list of dicts as output, you can easily do what you want without pandas, use a dict to store all the info using the names as the outer keys, doing one pass over each list not like the O(n^2) double loops in your own code:
out = {d["name"]: d for d in list1}
for d in list2:
out[d.pop("name")]["marks"] = d
from pprint import pprint as pp
pp(list(out.values()))
Output:
[{'gender': 'female',
'grade': 'second',
'marks': {'chemistry': 69, 'physics': 78},
'name': 'cathy'},
{'gender': 'male',
'grade': 'third',
'marks': {'chemistry': 89, 'physics': 95},
'name': 'john'}]
That reuses the dicts in your lists, if you wanted to create new dicts:
out = {d["name"]: d.copy() for d in list1}
for d in list2:
k = d.pop("name")
out[k]["marks"] = d.copy()
from pprint import pprint as pp
pp(list(out.values()))
The output is the same:
[{'gender': 'female',
'grade': 'second',
'marks': {'chemistry': 69, 'physics': 78},
'name': 'cathy'},
{'gender': 'male',
'grade': 'third',
'marks': {'chemistry': 89, 'physics': 95},
'name': 'john'}]