One of the columns of my pandas dataframe looks like this
>> df
Item
0 [{"id":A,"value":20},{"id":B,"value":30}]
1 [{"id":A,"value":20},{"id":C,"value":50}]
2 [{"id":A,"value":20},{"id":B,"value":30},{"id":C,"value":40}]
I want to expand it as
A B C
0 20 30 NaN
1 20 NaN 50
2 20 30 40
I tried
dfx = pd.DataFrame()
for i in range(df.shape[0]):
    df1 = pd.DataFrame(df.item[i]).T
    header = df1.iloc[0]
    df1 = df1[1:]
    df1 = df1.rename(columns = header)
    dfx = dfx.append(df1)
But this takes a lot of time as my data is huge. What is the best way to do this?
My original json data looks like this:
{
{
'_id': '5b1284e0b840a768f5545ef6',
'device': '0035sdf121',
'customerId': '38',
'variantId': '31',
'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
'item': [{'id': A, 'value': 20},
{'id': B, 'value': 30},
{'id': C, 'value': 50}
},
{
'_id': '5b1284e0b840a768f5545ef6',
'device': '0035sdf121',
'customerId': '38',
'variantId': '31',
'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
'item': [{'id': A, 'value': 20},
{'id': B, 'value': 30},
{'id': C, 'value': 50}
},
.............
}
I agree with @JeffH: you should really look at how you are constructing the DataFrame.
Assuming the data comes from somewhere outside your control, you can convert it to your desired DataFrame with:
In []:
pd.DataFrame(df['Item'].apply(lambda r: {d['id']: d['value'] for d in r}).values.tolist())
Out[]:
A B C
0 20 30.0 NaN
1 20 NaN 50.0
2 20 30.0 40.0
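If the raw records are still available, the same table can probably be built directly from them, without first storing the lists in a column. A minimal sketch, assuming the records are a list of dicts shaped like the question's JSON (the `records` sample and the name `wide` are made up for illustration, and the ids are written as strings):

```python
import pandas as pd

# Hypothetical sample shaped like the question's raw data (ids written as strings)
records = [
    {"device": "0035sdf121", "item": [{"id": "A", "value": 20}, {"id": "B", "value": 30}]},
    {"device": "0035sdf121", "item": [{"id": "A", "value": 20}, {"id": "C", "value": 50}]},
    {"device": "0035sdf121", "item": [{"id": "A", "value": 20}, {"id": "B", "value": 30},
                                      {"id": "C", "value": 40}]},
]

# One {id: value} dict per record; pandas aligns the keys into columns
# and fills the missing ids with NaN
wide = pd.DataFrame([{d["id"]: d["value"] for d in rec["item"]} for rec in records])
print(wide)
```

This skips the per-row apply entirely, since the list comprehension runs over plain Python objects.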
I try to clean the data with this code
empty = {}
mess = lophoc_clean.query("lop_diemquatrinh.notnull()")[['lop_id', 'lop_diemquatrinh']]
keys = []
values = []
for index, rows in mess.iterrows():
    if len(rows['lop_diemquatrinh']) >4:
        values.append(rows['lop_diemquatrinh'])
        keys.append(rows['lop_id'])
df = pd.DataFrame(dict(zip(keys, values)), index = [0]).transpose()
df.columns = ['data']
The result is a dictionary like this
{'data': {37: '[{"date_update":"31-03-2022","diemquatrinh":"6.0"}]',
38: '[{"date_update":"11-03-2022","diemquatrinh":"6.25"}]',
44: '[{"date_update":"25-12-2021","diemquatrinh":"6.0"},{"date_update":"28-04-2022","diemquatrinh":"6.25"},{"date_update":"28-07-2022","diemquatrinh":"6.5"}]',
1095: '[{"date_update":null,"diemquatrinh":null}]'}}
However, I don't know how to make them into a DataFrame with 3 columns like this. Please help me. Thank you!
id    updated_at  diemquatrinh
38    11-03-2022  6.25
44    25-12-2021  6.0
44    28-04-2022  6.25
44    28-07-2022  6.5
1095  null        null
Here you go.
from json import loads
from pprint import pp
import pandas as pd
def get_example_data():
    return [
        dict(id=38, updated_at="2022-03-11", diemquatrinh=6.25),
        dict(id=44, updated_at="2021-12-25", diemquatrinh=6),
        dict(id=44, updated_at="2022-04-28", diemquatrinh=6.25),
        dict(id=1095, updated_at=None),
    ]
df = pd.DataFrame(get_example_data())
df["updated_at"] = pd.to_datetime(df["updated_at"])
print(df.dtypes, "\n")
pp(loads(df.to_json()))
print()
print(df, "\n")
pp(loads(df.to_json(orient="records")))
It produces this output:
id int64
updated_at datetime64[ns]
diemquatrinh float64
dtype: object
{'id': {'0': 38, '1': 44, '2': 44, '3': 1095},
'updated_at': {'0': 1646956800000,
'1': 1640390400000,
'2': 1651104000000,
'3': None},
'diemquatrinh': {'0': 6.25, '1': 6.0, '2': 6.25, '3': None}}
id updated_at diemquatrinh
0 38 2022-03-11 6.25
1 44 2021-12-25 6.00
2 44 2022-04-28 6.25
3 1095 NaT NaN
[{'id': 38, 'updated_at': 1646956800000, 'diemquatrinh': 6.25},
{'id': 44, 'updated_at': 1640390400000, 'diemquatrinh': 6.0},
{'id': 44, 'updated_at': 1651104000000, 'diemquatrinh': 6.25},
{'id': 1095, 'updated_at': None, 'diemquatrinh': None}]
Either of the JSON data structures would be acceptable input for creating a new DataFrame from scratch.
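To get from the question's JSON strings to the three-column layout, one option is to parse each string and explode the resulting lists. A minimal sketch, assuming the values are JSON strings keyed by `lop_id` as shown in the question (the names `raw`, `exploded`, and `out` are illustrative):

```python
import json
import pandas as pd

# Sample taken from the question: the index is lop_id, the values are JSON strings
raw = pd.Series({
    38: '[{"date_update":"11-03-2022","diemquatrinh":"6.25"}]',
    44: '[{"date_update":"25-12-2021","diemquatrinh":"6.0"},'
        '{"date_update":"28-04-2022","diemquatrinh":"6.25"}]',
    1095: '[{"date_update":null,"diemquatrinh":null}]',
}, name="data")

# Parse each string into a list of dicts, then give every dict its own row
exploded = raw.apply(json.loads).explode()

# Expand the dicts into columns and bring the id back as a regular column
out = pd.DataFrame(exploded.tolist(), index=exploded.index)
out = out.rename(columns={"date_update": "updated_at"})
out = out.rename_axis("id").reset_index()

# The scores arrive as strings; convert them to floats (null becomes NaN)
out["diemquatrinh"] = pd.to_numeric(out["diemquatrinh"])
print(out)
```

The dates are left as strings here; `pd.to_datetime` with `dayfirst=True` could be applied afterwards if real timestamps are needed.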
I'm trying to create a nested JSON file from a pandas dataframe. I found a similar question here but when I tried to apply the answer, the output wasn't what I really wanted. I tried to adjust the code to get the desired answer but I haven't been able to.
Let me explain the problem first, then I will show you what I have done so far.
I have the following dataframe:
Region staff_id rate dep
1 300047 77 4
1 300048 45 3
1 300049 32 7
2 299933 63 8
2 299938 86 7
Now I want the json object to look like this:
{'region': 1 :
{ 'Info': [
{'ID': 300047, 'Rate': 77, 'Dept': 4},
{'ID': 300048, 'Rate': 45, 'Dept': 3},
{'ID': 300049, 'Rate': 32, 'Dept': 7}
]
},
'region': 2 :
{ 'Info': [
{'ID': 299933, 'Rate': 63, 'Dept': 8},
{'ID': 299938, 'Rate': 86, 'Dept': 7}
]
}
}
So for every region there is a key called Info, and inside Info are all the rows of that region.
I tried this code from the previous answer:
json_output = list(df.apply(lambda row: {"region": row["Region"],"Info": [{
"ID": row["staff_id"], "Rate": row["rate"], "Dept": row["dep"]}]
},
axis=1).values)
Which will give me every row in the dataframe and not grouped by the region.
Sorry because this seems repetitive, but I have been trying to change that answer to fit mine and I would really appreciate your help.
As mentioned by Nick ODell, you can loop over the groups produced by groupby:
import json
import pandas as pd

df = pd.DataFrame({"REGION": [1, 1, 1, 2, 2],
                   "staff_id": [1, 2, 3, 4, 5],
                   "rate": [77, 45, 32, 63, 86],
                   "dep": [4, 3, 7, 8, 7]})
desired_op = []
for region, group in df.groupby("REGION"):
    # Serialize the group's rows to JSON and parse them back into plain dicts
    # (json.loads instead of eval, which fails on JSON null/true/false)
    lst_info = json.loads(group[["staff_id", "rate", "dep"]].to_json(orient="records"))
    # One dict per region: the region number plus its list of rows
    desired_op.append({"REGION": int(region), "info": lst_info})
print(desired_op)
[{'REGION': 1,
'info': [{'staff_id': 1, 'rate': 77, 'dep': 4},
{'staff_id': 2, 'rate': 45, 'dep': 3},
{'staff_id': 3, 'rate': 32, 'dep': 7}]},
{'REGION': 2,
'info': [{'staff_id': 4, 'rate': 63, 'dep': 8},
{'staff_id': 5, 'rate': 86, 'dep': 7}]}]
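If the output should be keyed by region and use the renamed fields from the question, a dict comprehension over the groups is one way to get there. A minimal sketch, assuming the question's column names (`Region`, `staff_id`, `rate`, `dep`); because a dict cannot repeat the key 'region', each region number is used as the key instead, and the names `renamed` and `nested` are illustrative:

```python
import json
import pandas as pd

df = pd.DataFrame({"Region": [1, 1, 1, 2, 2],
                   "staff_id": [300047, 300048, 300049, 299933, 299938],
                   "rate": [77, 45, 32, 63, 86],
                   "dep": [4, 3, 7, 8, 7]})

# Rename the detail columns to the keys wanted in the output
renamed = df.rename(columns={"staff_id": "ID", "rate": "Rate", "dep": "Dept"})

# Build {region: {"Info": [row dicts]}}; the to_json round trip yields plain Python ints
nested = {
    int(region): {"Info": json.loads(group[["ID", "Rate", "Dept"]].to_json(orient="records"))}
    for region, group in renamed.groupby("Region")
}
print(json.dumps(nested, indent=2))
```

Note that `json.dumps` writes the integer region keys as strings, since JSON object keys are always strings.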
I have some data from an API that I am trying to convert to a Pandas dataframe.
I am struggling to extract the 'station_xyz__cr' id number from the list in a nested dict (where a list can be empty as in the middle dataset).
output = {'data': [{'abc_serial_number__c': 'ABC2020-07571',
'id': 'V48000000000F79',
'modified_date__v': '2020-06-15T05:13:14.000Z',
'name__v': 'VVV-001039',
'station_xyz__cr': {'data': [{'id': 'V5J000000000B86'}],
'responseDetails': {'limit': 250,
'offset': 0,
'size': 1,
'total': 1}}},
{'abc_serial_number__c': 'ABC2020-09952',
'id': 'V48000000001B94',
'modified_date__v': '2020-06-24T11:30:40.000Z',
'name__v': 'VVV-004040',
'station_xyz__cr': {'data': [],
'responseDetails': {'limit': 250,
'offset': 0,
'size': 1,
'total': 1}}},
{'abc_serial_number__c': 'ABC2020-09196',
'id': 'V48000000001B95',
'modified_date__v': '2020-06-23T09:38:18.000Z',
'name__v': 'VVV-004041',
'station_xyz__cr': {'data': [{'id': 'V5J000000000Z10'}],
'responseDetails': {'limit': 250,
'offset': 0,
'size': 1,
'total': 1}}}],
'responseDetails': {'limit': 1000, 'offset': 0, 'size': 3, 'total': 3},
'responseStatus': 'SUCCESS'}
I'm trying to get the nested id data into a column in the dataframe something like this:
station_xyz__cr.data.id
0 V5J000000000B86
1 None
2 V5J000000000Z10
I've tried converting to a dataframe with json_normalize (dropping the columns I don't need):
df = pd.json_normalize(output['data'])
df = df.loc[:, ~df.columns.str.startswith('station_xyz__cr.responseDetails')]
print(df)
abc_serial_number__c id modified_date__v name__v \
0 ABC2020-07571 V48000000000F79 2020-06-15T05:13:14.000Z VVV-001039
1 ABC2020-09952 V48000000001B94 2020-06-24T11:30:40.000Z VVV-004040
2 ABC2020-09196 V48000000001B95 2020-06-23T09:38:18.000Z VVV-004041
station_xyz__cr.data
0 [{'id': 'V5J000000000B86'}]
1 []
2 [{'id': 'V5J000000000Z10'}]
but I'm struggling to convert the 'station_xyz__cr.data' list of dicts to a simple dataframe of the ids:
df2 = pd.DataFrame(df['station_xyz__cr.data'].tolist(), index= df.index)
df2 = df2.rename(columns = {0:'station_xyz__cr.data'})
df2
station_xyz__cr.data
0 {'id': 'V5J000000000B86'}
1 None
2 {'id': 'V5J000000000Z10'}
The 'None' is causing me problems when I tried to extract further.
I tried replacing the None - but I could only replace with 0:
df.fillna(0, inplace=True)
Build a boolean mask of the rows that hold None, then use that mask to set those cells to a default value with the same structure as the other entries, so the next stage of the data flow sees consistent values:
isna_idx = pd.isnull(df2['station_xyz__cr.data'])
df2.loc[isna_idx, ['station_xyz__cr.data']] = {'id': '...'}
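Another option is to pull the id straight out of each list and fall back to None when the list is empty, which avoids the intermediate column of dicts. A minimal sketch, assuming each list holds at most one dict as in the sample; the helper `first_id` and the small `df` below are illustrative, not part of the original code:

```python
import pandas as pd

# Small frame shaped like the json_normalize result in the question
df = pd.DataFrame({
    "id": ["V48000000000F79", "V48000000001B94", "V48000000001B95"],
    "station_xyz__cr.data": [[{"id": "V5J000000000B86"}], [], [{"id": "V5J000000000Z10"}]],
})

def first_id(items):
    """Return the id of the first dict in the list, or None for an empty list."""
    return items[0]["id"] if items else None

# New column with just the nested id (missing where the list was empty)
df["station_xyz__cr.data.id"] = df["station_xyz__cr.data"].apply(first_id)
print(df[["id", "station_xyz__cr.data.id"]])
```

If a list could contain several dicts, `explode` followed by the same lookup would be needed instead.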
I have an array of dictionaries in a pandas DataFrame:
0 [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
1 [{'id': 12, 'name': 'Adventure'}, {'id': 88, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
2 [{'id': 10749, 'name': 'Romance'}, {'id': 77, 'name': 'Horror'}]
I am trying to get all the names from a single row into a simple list of Strings, like: "Horror, family, drama" etc for each row in the dataset.
I tried this code but I am getting the error: string indices must be integers
for y in df:
    names = [x['name'] for x in y]
Any help is appreciated
Iterating over a DataFrame iterates over its column names:
In [15]: df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
In [16]: df
Out[16]:
a b
0 1 4
1 2 5
2 3 6
In [17]: for x in df:
    ...:     print(x)
    ...:
a
b
It is like a dict, which iterates over its keys.
You need something like:
df['your_column'].apply(lambda x: [d['name'] for d in x])
IIUC, each element is a dict, not a list, so you should use .get:
[[y.get('name') for y in x ]for x in df['your columns']]
Out[578]:
[['Animation', 'Comedy', 'Family'],
['Adventure', 'Fantasy', 'Family'],
['Romance', 'Horror']]
If the values are stored as strings, convert them to lists first:
import ast
df.a=df.a.apply(ast.literal_eval)
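The two steps can be combined to get one comma-separated string per row, which is what the question asks for. A minimal sketch, assuming the column is called `genres` (a made-up name) and that a cell may hold either a real list or its string representation; the helper `join_names` is illustrative:

```python
import ast
import pandas as pd

df = pd.DataFrame({"genres": [
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",  # stored as a string
    [{"id": 10749, "name": "Romance"}, {"id": 77, "name": "Horror"}],   # stored as a list
]})

def join_names(cell):
    """Parse the cell if it is a string, then join the 'name' values."""
    items = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return ", ".join(d["name"] for d in items)

df["genre_names"] = df["genres"].apply(join_names)
print(df["genre_names"].tolist())
```

Checking the type inside the helper means the same code works whether or not the column has already been converted.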
I want to convert the below pandas data frame
data = pd.DataFrame([[1,2], [5,6]], columns=['10+', '20+'], index=['A', 'B'])
data.index.name = 'City'
data.columns.name= 'Age Group'
print(data)
Age Group 10+ 20+
City
A 1 2
B 5 6
into an array of dictionaries, like
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
I am able to get the above expected result using the following loops
result = []
cols_name = data.columns.name
index_names = data.index.name
for index in data.index:
    for col in data.columns:
        result.append({cols_name: col, index_names: index, 'count': data.loc[index, col]})
Is there a better way of doing this? My original data will have a large number of records, so the for loops will take a long time.
I think you can use stack with reset_index to reshape the data, and then to_dict as the last step:
print (data.stack().reset_index(name='count'))
City Age Group count
0 A 10+ 1
1 A 20+ 2
2 B 10+ 5
3 B 20+ 6
print (data.stack().reset_index(name='count').to_dict(orient='records'))
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]