I have a .csv file with a mix of columns, some of which contain entries in (nested) JSON syntax. I want to extract the relevant data from these columns to obtain a more data-rich dataframe for further analysis. I've checked this tutorial on Kaggle but failed to obtain the desired result.
In order to better explain my problem I've prepared a dummy version of a database below.
raw = {"team":["Team_1","Team_2"],
"who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}
df = pd.DataFrame(raw)
I'd like to generate the following columns (or equivalent):
team    name_1  name_2  age_1  age_2
Team_1  Andy    Rick    22     30
Team_2  Oli     Joe     19     21
I've tried the following.
Code 1:
test_norm = json_normalize(data=df)
AttributeError: 'str' object has no attribute 'values'
Code 2:
test_norm = json_normalize(data=df, record_path='who')
TypeError: string indices must be integers
Code 3:
test_norm = json_normalize(data=df, record_path='who', meta=['team'])
TypeError: string indices must be integers
Is there any way to do this effectively? I've looked for a solution in other Stack Overflow topics and cannot find a working approach with json_normalize.
I also had trouble using json_normalize on the lists of dicts contained in the who column. My workaround was to reformat each row into a dict with unique keys (name_1, age_1, name_2, etc.) for each team member's name/age. After that, creating a dataframe with your desired structure was trivial.
Here are my steps. Beginning with your example:
raw = {"team":["Team_1","Team_2"],
"who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}
df = pd.DataFrame(raw)
df
team who
0 Team_1 [{'name': 'Andy', 'age': 22}, {'name': 'Rick',...
1 Team_2 [{'name': 'Oli', 'age': 19}, {'name': 'Joe', '...
Write a function to reformat each list as a flat dict, and apply it to each row in the who column:
def reformat(x):
    # build one flat dict with numbered keys per team member
    res = {}
    for i, item in enumerate(x):
        res['name_' + str(i + 1)] = item['name']
        res['age_' + str(i + 1)] = item['age']
    return res
df['who'] = df['who'].apply(reformat)
df
team who
0 Team_1 {'name_1': 'Andy', 'age_1': 22, 'name_2': 'Ric...
1 Team_2 {'name_1': 'Oli', 'age_1': 19, 'name_2': 'Joe'...
Use json_normalize on the who column, then reorder the columns of the normalized dataframe into the desired order:
import pandas as pd
from pandas.io.json import json_normalize  # in pandas >= 1.0, use pd.json_normalize instead
n = json_normalize(data=df['who'])
n = n.reindex(sorted(n.columns, reverse=True, key=len), axis=1)  # longer 'name_*' columns sort before 'age_*'
n
name_1 name_2 age_1 age_2
0 Andy Rick 22 30
1 Oli Joe 19 21
Join the dataframe created by json_normalize back to the original df, and drop the who column:
df = df.join(n).drop('who', axis=1)
df
team name_1 name_2 age_1 age_2
0 Team_1 Andy Rick 22 30
1 Team_2 Oli Joe 19 21
If your real .csv file has a large number of rows, my solution may be too expensive, since it iterates over each row and then over each entry in the list contained in that row. If (hopefully) this isn't the case, this approach should be good enough.
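If performance does become an issue, here is a mostly vectorized sketch (my addition, assuming pandas >= 1.0 for pd.json_normalize and >= 0.25 for DataFrame.explode), starting again from the original df with lists in the who column:
# one row per person, with the team carried along
e = df.explode('who').reset_index(drop=True)
# flatten the per-person dicts into columns next to 'team'
flat = pd.concat([e[['team']], pd.json_normalize(e['who'].tolist())], axis=1)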
One option would be to unpack the dictionary yourself. Like so:
import pandas as pd
from pandas.io.json import json_normalize  # pd.json_normalize in pandas >= 1.0

raw = {"team": ["Team_1", "Team_2"],
       "who": [[{"name": "Andy", "age": 22}, {"name": "Rick", "age": 30}],
               [{"name": "Oli", "age": 19}, {"name": "Joe", "age": 21}]]}
# add the corresponding team to the dictionary containing the person information
for idx, list_of_people in enumerate(raw['who']):
    for person in list_of_people:
        person['team'] = raw['team'][idx]
# flatten the nested lists of people into a single list of dicts
list_of_dicts = [dct for list_of_people in raw['who'] for dct in list_of_people]
# normalize to dataframe
json_normalize(list_of_dicts)
# since each dict is now flat, this gives the same result as
pd.DataFrame(list_of_dicts)
This output is shaped a little differently from what you asked for, but a long format like this is often more convenient for further analysis.
Output:
age name team
22 Andy Team_1
30 Rick Team_1
19 Oli Team_2
21 Joe Team_2
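If you do need the wide, one-row-per-team layout from the question, here is a sketch of reshaping this long output (the 'member' helper column is illustrative, not part of the original answer):
# number the members within each team, then unstack to one row per team
long_df = pd.DataFrame(list_of_dicts)
long_df['member'] = long_df.groupby('team').cumcount() + 1
wide = long_df.set_index(['team', 'member']).unstack('member')
wide.columns = ['{}_{}'.format(col, i) for col, i in wide.columns]
wide = wide.reset_index()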
You can iterate through each element in raw['who'] separately, but then the resulting dataframe will have the two team members in separate rows.
Example:
json_normalize(raw['who'][0])
Output:
age name
22 Andy
30 Rick
You can flatten these out into a single row and then concatenate all the rows to get your final output.
def flatten(df_temp):
    # make the positional index part of the column names after unstacking
    df_temp.index = df_temp.index.astype(str)
    flattened_df = df_temp.unstack().to_frame().sort_index(level=1).T
    flattened_df.columns = flattened_df.columns.map('_'.join)
    return flattened_df
# json_normalize already returns a DataFrame, so no extra wrapping is needed
df = pd.concat([flatten(json_normalize(x)) for x in raw['who']])
df['team'] = raw['team']
Output:
age_0 name_0 age_1 name_1 team
22 Andy 30 Rick Team_1
19 Oli 21 Joe Team_2
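If you want the 1-based suffixes from the question instead of the 0-based ones, a small follow-up sketch (assuming single-digit member counts, so the last character of each column name is always the index):
# shift each trailing digit up by one: age_0 -> age_1, name_1 -> name_2, ...
df = df.rename(columns=lambda c: c if c == 'team' else c[:-1] + str(int(c[-1]) + 1))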
Related
Say I have a Pandas multi-index data frame with 3 indices:
import pandas as pd
import numpy as np
arrays = [['UK', 'UK', 'US', 'FR'], ['Firm1', 'Firm1', 'Firm2', 'Firm1'], ['Andy', 'Peter', 'Peter', 'Andy']]
idx = pd.MultiIndex.from_arrays(arrays, names = ('Country', 'Firm', 'Responsible'))
df_3idx = pd.DataFrame(np.random.randn(4,3), index = idx)
df_3idx
0 1 2
Country Firm Responsible
UK Firm1 Andy 0.237655 2.049636 0.480805
Peter 1.135344 0.745616 -0.577377
US Firm2 Peter 0.034786 -0.278936 0.877142
FR Firm1 Andy 0.048224 1.763329 -1.597279
I furthermore have another pd.DataFrame consisting of the unique combinations of index levels 1 and 2 from the data above:
arrays = [['UK', 'US', 'FR'], ['Firm1', 'Firm2', 'Firm1']]
idx = pd.MultiIndex.from_arrays(arrays, names = ('Country', 'Firm'))
df_2idx = pd.DataFrame(np.random.randn(3,1), index = idx)
df_2idx
0
Country Firm
UK Firm1 -0.103828
US Firm2 0.096192
FR Firm1 -0.686631
I want to subtract from the values of df_3idx the corresponding value in df_2idx. For instance, from every value of the first two rows I want to subtract -0.103828, since index levels 1 and 2 of both dataframes match there.
Does anybody know how to do this? I figured I could simply unstack the first dataframe and then subtract, but I get an error message.
df_3idx.unstack('Responsible').sub(df_2idx, axis=0)
ValueError: cannot join with no overlapping index names
Unstacking may not be a preferable solution anyway, as my data is very big and unstacking might take a long time.
I would appreciate any help. Many thanks in advance!
This is a related question, though not focused on MultiIndex. However, that distinction doesn't really matter here: the sub method will align on the matching index levels.
Use pd.DataFrame.sub with the parameter axis=0:
df_3idx.sub(df_2idx[0], axis=0)
0 1 2
Country Firm Responsible
FR Firm1 Andy 0.027800 3.316148 0.804833
UK Firm1 Andy -2.009797 -1.830799 -0.417737
Peter -1.174544 0.644006 -1.150073
US Firm2 Peter -2.211121 -3.825443 -4.391965
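As a sanity check of that alignment (a sketch using the frames defined above): df_2idx[0] is a Series indexed by (Country, Firm), and sub broadcasts it across the remaining Responsible level:
# verify one cell: the (Country, Firm) value is subtracted from every
# row of df_3idx that shares those two index levels
res = df_3idx.sub(df_2idx[0], axis=0)
expected = df_3idx.loc[('UK', 'Firm1', 'Andy'), 0] - df_2idx.loc[('UK', 'Firm1'), 0]
assert res.loc[('UK', 'Firm1', 'Andy'), 0] == expected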
I have the URL below, which returns a JSON response. I need to read this JSON into a pandas dataframe and perform operations on top of it. This is a case of nested JSON consisting of multiple lists and dicts within dicts.
URL: 'http://api.nobelprize.org/v1/laureate.json'
I have tried below code:
import json, pandas as pd, requests
resp=requests.get('http://api.nobelprize.org/v1/laureate.json')
df=pd.json_normalize(json.loads(resp.content),record_path =['laureates'])
print(df.head(5))
Output:
id firstname surname born died \
0 1 Wilhelm Conrad Röntgen 1845-03-27 1923-02-10
1 2 Hendrik A. Lorentz 1853-07-18 1928-02-04
2 3 Pieter Zeeman 1865-05-25 1943-10-09
3 4 Henri Becquerel 1852-12-15 1908-08-25
4 5 Pierre Curie 1859-05-15 1906-04-19
bornCountry bornCountryCode bornCity \
0 Prussia (now Germany) DE Lennep (now Remscheid)
1 the Netherlands NL Arnhem
2 the Netherlands NL Zonnemaire
3 France FR Paris
4 France FR Paris
diedCountry diedCountryCode diedCity gender \
0 Germany DE Munich male
1 the Netherlands NL NaN male
2 the Netherlands NL Amsterdam male
3 France FR NaN male
4 France FR Paris male
prizes
0 [{'year': '1901', 'category': 'physics', 'shar...
1 [{'year': '1902', 'category': 'physics', 'shar...
2 [{'year': '1902', 'category': 'physics', 'shar...
3 [{'year': '1903', 'category': 'physics', 'shar...
4 [{'year': '1903', 'category': 'physics', 'shar...
But here prizes comes through as a list. If I create a separate dataframe for prizes, it has affiliations as a list. I want all fields to come out as separate columns. Some entries may or may not have prizes, so that case needs to be handled as well.
I went through this article: https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd. It looks like we'll have to use meta and errors='ignore' here, but I'm not able to get it working. I'd appreciate your inputs. Thanks.
You would have to do this in a few steps.
The first step is to extract the top-level records with record_path = ['laureates'].
The second is record_path = ['laureates', 'prizes'] for the nested JSON records, with the meta path set to the id from the parent record.
Then combine the two datasets by joining on the id column.
Finally, drop the unnecessary columns and store the result.
import json, pandas as pd, requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
data = json.loads(resp.content)  # parse once, reuse for both normalizations
# laureate-level records
df0 = pd.json_normalize(data, record_path=['laureates'])
# prize-level records, carrying the parent laureate id along as 'laureates.id'
df1 = pd.json_normalize(data, record_path=['laureates', 'prizes'], meta=[['laureates', 'id']])
# join each prize row back to its laureate and drop the now-redundant columns
output = pd.merge(df0, df1, left_on='id', right_on='laureates.id').drop(['prizes', 'laureates.id'], axis=1)
print('Shape of data ->',output.shape)
print('Columns ->',output.columns)
Shape of data -> (975, 18)
Columns -> Index(['id', 'firstname', 'surname', 'born', 'died', 'bornCountry',
'bornCountryCode', 'bornCity', 'diedCountry', 'diedCountryCode',
'diedCity', 'gender', 'year', 'category', 'share', 'motivation',
'affiliations', 'overallMotivation'],
dtype='object')
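Note that merge defaults to an inner join, so a laureate with no prizes would be dropped from output. If that case actually occurs in your data (the question suggests it might), a hedged variant that keeps such rows:
# keep laureates without any prizes by left-joining instead (sketch)
output_all = pd.merge(df0, df1, left_on='id', right_on='laureates.id',
                      how='left').drop(['prizes', 'laureates.id'], axis=1)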
Found an alternate solution as well, with less code. This works:
import pandas as pd
from flatten_json import flatten

winners = json.loads(resp.content)  # reusing the response fetched above
data = winners['laureates']
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df.shape)
(968, 43)
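For orientation, flatten_json encodes list positions into the keys, so the first prize's fields should appear as columns like 'prizes.0.year' (an assumption about this dataset, not verified here):
# inspect the flattened prize columns (column names assumed)
print([c for c in df.columns if c.startswith('prizes.0')])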
For example, there is a column in a dataframe, 'ID'.
One of the entries is, say, '13245993, 3004992', and I only want to get '13245993'.
The same applies to every row in column 'ID'.
How do I change the data in each row of column 'ID'?
You can try it like this: apply slicing on the ID column to get the required result. I am using 3 characters as the number of characters to keep here.
import pandas as pd
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'ID':[90877, 10909, 12223, 12334]}
df=pd.DataFrame(data)
print('Before change')
print(df)
df["ID"]=df["ID"].apply(lambda x: (str(x)[:3]))
print('After change')
print(df)
Output:
Before change
Name ID
0 Tom 90877
1 nick 10909
2 krish 12223
3 jack 12334
After change
Name ID
0 Tom 908
1 nick 109
2 krish 122
3 jack 123
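If the entries are comma-separated strings like '13245993, 3004992' in the question, a sketch that keeps only the part before the first comma may be closer to what you want:
# split on the comma, keep the first piece, and trim any whitespace
df['ID'] = df['ID'].astype(str).str.split(',').str[0].str.strip()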
You could do something like
data[data['ID'] == '13245993']
This will give you the rows where ID is '13245993'.
I hope this answers your question; if not, please let me know. With best regards.
Is there a way to use a combination of a column names and a values dictionary to filter a pandas dataframe?
Example dataframe:
df = pd.DataFrame({
"name": ["Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
"city": ["Chicago", "Prague", "Shanghai", "Manchester", "Chicago", "Osaka"],
"age": [28, 33, 34, 38, 31, 37],
"score": [79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
})
column_dict = {0:"city", 1:"score"}
value_dict = {0:"Chicago", 1:61}
The goal would be to use the matching keys column and value dictionaries to filter the dataframe.
In this example, the city would be filtered to Chicago and the score would be filtered to 61, with the filtered dataframe being:
name city age score
4 Amal Chicago 31 61.0
# start with an all-True mask and AND in one condition per dict entry
keep_rows = pd.Series(True, index=df.index)
for k, col in column_dict.items():
    value = value_dict[k]
    keep_rows &= df[col] == value

>>> df[keep_rows]
name city age score
4 Amal Chicago 31 61.0
It's a bit funny to use two different dicts to store keys and values. You're better off with something like this:
filter_dict = {"city":"Chicago", "score":61}
df_filt = df
for k, v in filter_dict.items():
    df_filt = df_filt[df_filt[k] == v]
Output:
name city age score
4 Amal Chicago 31 61.0
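For reference, the same filter can be written as a single vectorized comparison (a sketch, equivalent to the loop above):
# compare the selected columns against a Series built from the dict,
# then keep only the rows where every column matches
mask = (df[list(filter_dict)] == pd.Series(filter_dict)).all(axis=1)
df_filt = df[mask]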
Use merge:
# create filter DataFrame from column_dict and value_dict
df_filter = pd.DataFrame({value: [value_dict[key]] for key, value in column_dict.items()})
# use merge with df_filter
res = df.merge(df_filter, on=['city', 'score'])
print(res)
Output
name city age score
0 Amal Chicago 31 61.0
I have a dataset with several columns.
Now what I want is to basically calculate a similarity score based on a particular column ("fName") but grouped on the "_id" column.
_id fName lName age
0 ABCD Andrew Schulz
1 ABCD Andreww 23
2 DEFG John boy
3 DEFG Johnn boy 14
4 CDGH Bob TANNA 13
5 ABCD. Peter Parker 45
6 DEFGH Clark Kent 25
So what I am looking for is whether, for the same id, I am getting similar entries, so I can remove those entries based on a threshold score value. For example, if I run it for column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
5 ABCD Peter Parker 45
6 DEFG Clark Kent 25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the record linkage library for this, and I have ended up with a dataframe containing pairs of indexes that are similar, called 'matches'. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1 index2 fName
0 1 1.0
2 3 1.0
I need a suggestion for a way to combine the similar rows so that the merged row takes data from both of the matched rows.
I just wanted to clear up some doubts regarding your question; I couldn't ask in the comments due to low reputation.
For example, if I run it for column "fName", I should be able to reduce
this dataframe, based on a score threshold, to:
So basically your function would return the DataFrame containing the first row in each group (by ID)? That would result in the following DataFrame:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
I hope this code answers your question:
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '',       '23']
r2 = ['DEFG', 'John',   'boy',    '']
r3 = ['DEFG', 'John',   'boy',    '14']
r4 = ['CDGH', 'Bob',    'TANNA',  '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()

# keep one merged row per id, filling in missing lName/age values
merged = dict()
for row in Rx:
    if row[0] in merged:
        if row[2] != '':
            merged[row[0]][2] = row[2]
        if row[3] != '':
            merged[row[0]][3] = row[3]
    else:
        merged[row[0]] = row

Rx[:] = merged.values()
print(Rx)
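Tracing the loop by hand (and relying on dicts preserving insertion order in Python 3.7+), the two prints should show:
[['ABCD', 'Andrew', 'Schulz', ''], ['ABCD', 'Andrew', '', '23'], ['DEFG', 'John', 'boy', ''], ['DEFG', 'John', 'boy', '14'], ['CDGH', 'Bob', 'TANNA', '13']]

[['ABCD', 'Andrew', 'Schulz', '23'], ['DEFG', 'John', 'boy', '14'], ['CDGH', 'Bob', 'TANNA', '13']]
Note the in-place Rx[:] assignment works because merged stores references to the original row lists, which the loop mutates.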
I am lost on the 'score' part of your question, but if what you need is to fill the gaps in the data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data. Then drop duplicates, keeping the first occurrence of each Id. fillna fills each gap from the next value found in the column, which may belong to another Id, but since the duplicated rows are discarded afterwards, drop_duplicates with keep='first' will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.