My data is a list of dictionaries, each containing a nested list of per-subject dictionaries, like this:
scores = [{"Student":"Adam","Subjects":[{"Name":"Math","Score":85},{"Name":"Science","Score":90}]},
{"Student":"Bec","Subjects":[{"Name":"Math","Score":70},{"Name":"English","Score":100}]}]
If I use pd.DataFrame directly on this list, I get something like the following:
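(Sketch, assuming pandas is imported as pd; the Subjects column just holds the nested lists and the repr is abbreviated.)
pd.DataFrame(scores)
#   Student                                                  Subjects
# 0    Adam  [{'Name': 'Math', 'Score': 85}, {'Name': 'Science', ...
# 1     Bec  [{'Name': 'Math', 'Score': 70}, {'Name': 'English', ...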
What should I do in order to get a data frame that looks like this?
Student Subject.Name Subject.Score
Adam Math 85
Adam Science 90
Bec Math 70
Bec English 100
Thanks very much!
Use json_normalize, then rename the columns:
df = (pd.json_normalize(scores, 'Subjects', 'Student')
        .rename(columns={'Name': 'Subject.Name', 'Score': 'Subject.Score'}))
print (df)
Subject.Name Subject.Score Student
0 Math 85 Adam
1 Science 90 Adam
2 Math 70 Bec
3 English 100 Bec
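A small variant (assuming pandas >= 1.0, where json_normalize is a top-level function) lets json_normalize add the prefix itself via record_prefix, so no rename is needed; only the column order is adjusted afterwards:
df = pd.json_normalize(scores, record_path='Subjects', meta='Student',
                       record_prefix='Subject.')
df = df[['Student', 'Subject.Name', 'Subject.Score']]  # move Student to the front
print(df)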
Or use a list comprehension with dictionary unpacking and the DataFrame constructor:
df = pd.DataFrame([{**x, **{f'Subject.{k}': v for k, v in y.items()}}
                   for x in scores for y in x.pop('Subjects')])
print (df)
Student Subject.Name Subject.Score
0 Adam Math 85
1 Adam Science 90
2 Bec Math 70
3 Bec English 100
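Note that x.pop('Subjects') removes the nested lists from the original scores entries as a side effect. A non-mutating sketch of the same idea, if scores needs to stay intact:
df = pd.DataFrame([{'Student': x['Student'],
                    **{f'Subject.{k}': v for k, v in y.items()}}
                   for x in scores for y in x['Subjects']])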
Related
I have a nested dictionary as below:
stud_data_dict = {'s1': {'Course 1': {'Course Name': 'Maths',
                                      'Marks': 95,
                                      'Grade': 'A+'},
                         'Course 2': {'Course Name': 'Science',
                                      'Marks': 75,
                                      'Grade': 'B-'}},
                  's2': {'Course 1': {'Course Name': 'English',
                                      'Marks': 82,
                                      'Grade': 'B'},
                         'Course 2': {'Course Name': 'Maths',
                                      'Marks': 90,
                                      'Grade': 'A'}}}
I need to convert it into a dataframe like the one below:
Student  Course 1                    Course 2
         Course Name  Marks  Grade   Course Name  Marks  Grade
s1       Maths        95     A+      Science      75     B-
s2       English      82     B       Maths        90     A
I have tried the following code from this answer:
stud_df = pandas.DataFrame.from_dict(stud_data_dict, orient="index").stack().to_frame()
final_df = pandas.DataFrame(stud_df[0].values.tolist(), index=stud_df.index)
I am getting a dataframe like the one below:
             Course Name  Marks  Grade
s1 Course 1  Maths        95     A+
   Course 2  Science      75     B-
s2 Course 1  English      82     B
   Course 2  Maths        90     A
This is the closest I got to the desired output. What changes do I need to make to the code to get the desired dataframe?
Reformat the dictionary first, then pass it to Series and reshape with Series.unstack:
#reformat nested dict
#https://stackoverflow.com/a/39807565/2901002
d = {(level1_key, level2_key, level3_key): values
     for level1_key, level2_dict in stud_data_dict.items()
     for level2_key, level3_dict in level2_dict.items()
     for level3_key, values in level3_dict.items()}
stud_df = pd.Series(d).unstack([1,2])
print (stud_df)
    Course 1                    Course 2
    Course Name  Marks  Grade   Course Name  Marks  Grade
s1  Maths        95     A+      Science      75     B-
s2  English      82     B       Maths        90     A
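Optionally, naming the row index makes Student show up in the printed output as well:
stud_df = stud_df.rename_axis('Student')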
Another idea is to create a dictionary with tuple keys using defaultdict:
from collections import defaultdict
d = defaultdict(dict)
for k, v in stud_data_dict.items():
    for k1, v1 in v.items():
        for k2, v2 in v1.items():
            d[(k1, k2)].update({k: v2})
df = pd.DataFrame(d)
print(df)
    Course 1                    Course 2
    Course Name  Marks  Grade   Course Name  Marks  Grade
s1  Maths        95     A+      Science      75     B-
s2  English      82     B       Maths        90     A
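For reference, the intermediate d maps (course, field) tuples to one small per-student dict, which is why pd.DataFrame(d) produces MultiIndex columns, e.g.:
# ('Course 1', 'Course Name') -> {'s1': 'Maths', 's2': 'English'}
# ('Course 1', 'Marks')       -> {'s1': 95, 's2': 82}
# ('Course 2', 'Grade')       -> {'s1': 'B-', 's2': 'A'}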
One option is to create data frames from the inner dictionaries, concatenate them into a single frame, reshape, and clean up:
out = {key: pd.DataFrame.from_dict(value, orient='index')
       for key, value in stud_data_dict.items()}
(pd
.concat(out)
.unstack()
.swaplevel(axis = 1)
.sort_index(axis = 1)
.rename_axis('Student')
.reset_index()
)
  Student  Course 1                    Course 2
           Course Name  Grade  Marks   Course Name  Grade  Marks
0  s1      Maths        A+     95      Science      B-     75
1  s2      English      B      82      Maths        A      90
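Because sort_index orders the inner column level alphabetically, Grade lands before Marks. If the original Course Name/Marks/Grade order matters, a sketch that selects the columns explicitly instead of sorting (course names hard-coded here purely for illustration):
fields = ['Course Name', 'Marks', 'Grade']
wide = pd.concat(out).unstack().swaplevel(axis=1)
wide = wide[[(course, f) for course in ['Course 1', 'Course 2'] for f in fields]]
wide = wide.rename_axis('Student').reset_index()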
You should get more performance if you can do all the initial wrangling in vanilla Python or NumPy before creating the final dataframe:
out = []; outer = []; bottom = []; index = []
for key, value in stud_data_dict.items():
    out = []
    for k, v in value.items():
        out.extend(v.values())       # row values for this student
    outer.append(out)
    index.append(key)
    bottom.extend(v.keys())          # inner column labels, repeated per course
top = np.repeat([*value.keys()], len(v))   # outer column labels (Course 1, Course 2)

pd.DataFrame(outer,
             columns=[top, bottom],
             index=index)
    Course 1                    Course 2
    Course Name  Marks  Grade   Course Name  Marks  Grade
s1  Maths        95     A+      Science      75     B-
s2  English      82     B       Maths        90     A
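For reference, the two header lists built above end up as (printed for illustration):
print(list(top))   # ['Course 1', 'Course 1', 'Course 1', 'Course 2', 'Course 2', 'Course 2']
print(bottom)      # ['Course Name', 'Marks', 'Grade', 'Course Name', 'Marks', 'Grade']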
This question already has an answer here:
convert dict of lists of tuples to dataframe
(1 answer)
Closed 10 months ago.
I am a beginner in Python.
I know similar questions have been asked, and I have read through the answers for the past two hours, but I can't seem to get my code to work. I would appreciate your advice on where I might have gone wrong.
I have a dictionary as such:
{'Tom': [('Math', 98),
         ('English', 75)],
 'Betty': [('Science', 42),
           ('Humanities', 15)]}
What is the most efficient way to convert this to the following pandas DataFrame?
Tom Math 98
Tom English 75
Betty Science 42
Betty Humanities 15
I have tried the following method, which throws a TypeError: cannot unpack non-iterable int object:
df = pd.DataFrame(columns=['Name', 'Subject', 'Score'])
i = 0
for name in enumerate(data):
    for subject, score in name:
        df.loc[i] = [name, subject, score]
        i += 1
Thanks a million!
You can loop and construct a list of lists that pandas can consume.
d = {'Tom': [('Math', 98),
             ('English', 75)],
     'Betty': [('Science', 42),
               ('Humanities', 15)]}
data = [[k, *v] for k, lst in d.items() for v in lst]
df = pd.DataFrame(data, columns=['Name','Subject','Score'])
Name Subject Score
0 Tom Math 98
1 Tom English 75
2 Betty Science 42
3 Betty Humanities 15
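A comparable sketch using Series.explode (available in pandas >= 0.25) builds the same frame from d without unpacking the tuples by hand:
s = pd.Series(d).explode()
df = pd.DataFrame(s.tolist(), index=s.index, columns=['Subject', 'Score'])
df = df.rename_axis('Name').reset_index()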
Alternatively, melt the DataFrame and then split the tuples into their own columns:
df = pd.DataFrame(data).melt(var_name = "Name", value_name = "Data")
new_df = pd.DataFrame(df["Data"].tolist(), columns = ["Subject", "Marks"])
new_df.insert(loc = 0, column = "Name", value = df["Name"])
Output:
    Name     Subject  Marks
0    Tom        Math     98
1    Tom     English     75
2  Betty     Science     42
3  Betty  Humanities     15
I have a df with these values:
name algo accuracy
tom 1 88
tommy 2 87
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88
How do I randomly pick 4 records from df with the condition that at least one record is picked for each unique value of the algo column? Here, the algo column has only 3 unique values (1, 2, 3).
Sample output 1:
name algo accuracy
tom 1 88
tommy 2 87
stuart 3 100
lincoln 1 88
Sample output 2:
name algo accuracy
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88
One way: sample one record per algo, then top up with rows that were not already selected.
num_sample, num_algo = 4, 3
# sample one row for each algo (GroupBy.sample requires pandas >= 1.1)
out = df.groupby('algo').sample(n=num_sample // num_algo)
# add the remaining rows from those that didn't get selected
# (DataFrame.append was removed in pandas 2.0, so concatenate instead)
out = pd.concat([out, df.drop(out.index).sample(n=num_sample - num_algo)])
Another way is to shuffle the whole data, enumerate the rows within each algo, sort by that enumeration and take the required number of samples. This is slightly more code than the first approach, but is cheaper and produces more balanced algo counts:
# shuffle the algo column; the original row index travels with the values
df_random = df['algo'].sample(frac=1)
# enumerate rows that share the same algo
enums = df_random.groupby(df_random).cumcount()
# sort by that enumeration so the first occurrence of each algo comes first
enums = enums.sort_values()
# pick the first num_sample indices
# these will be indices of the samples
# so we can use `loc`
out = df.loc[enums.iloc[:num_sample].index]
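A quick sanity check that works for either approach (illustrative):
assert len(out) == num_sample                 # exactly 4 rows sampled
assert set(out['algo']) == set(df['algo'])    # every algo value is represented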
I have the following df:
country sport score
0 ita swim 15
1 fr run 25
2 ger golf 37
3 ita run 17
4 fr golf 58
5 fr run 35
I am interested in only some elements of each category:
ctr = ['ita','fr']
sprt = ['run','golf']
I was hoping something like this would extract them:
df[(df['country']== x for x in ctr)&(df['sport']== x for x in sprt)]
but while it doesn't throw any error, it returns an empty result. Any suggestions?
I also tried:
df[(df['country']== {x for x in ctr})&(df['sport']== {x for x in sprt})]
EDIT:
The reason I want to use a loop is that I am actually interested in the top 3 scores of each combination, which I hoped to concat:
df1 = pd.concat(df[(df['country']== x for x in ctr)&(df['sport']== x for x in sprt)].sort_values(by=['score'],ascending=False).head(3))
Use Series.isin twice to check membership:
df1 = df[(df['country'].isin(ctr))&(df['sport'].isin(sprt))]
print (df1)
country sport score
1 fr run 25
3 ita run 17
4 fr golf 58
5 fr run 35
df2 = df1.sort_values('score', ascending=False).groupby(['country','sport']).head(3)
print (df2)
country sport score
4 fr golf 58
5 fr run 35
1 fr run 25
3 ita run 17
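The same top-3-per-group idea can also be written with nlargest; a sketch (group_keys=False keeps the original row index):
df2 = (df1.groupby(['country', 'sport'], group_keys=False)
          .apply(lambda g: g.nlargest(3, 'score')))
print(df2)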
I have student data with IDs and some values, and I need a pivot table with the count of IDs.
Here's an example of data:
id name maths science
0 B001 john 50 60
1 B021 Kenny 89 77
2 B041 Jessi 100 89
3 B121 Annie 91 73
4 B456 Mark 45 33
pivot table:
count of ID
5
There are lots of different ways to approach this; I would use either shape or nunique(), as Sandeep suggested.
data = {'id': ['0', '1', '2', '3', '4'],
        'name': ['john', 'kenny', 'jessi', 'Annie', 'Mark'],
        'math': [50, 89, 100, 91, 45],
        'science': [60, 77, 89, 73, 33]}
df = pd.DataFrame(data)
print(df)
id name math science
0 0 john 50 60
1 1 kenny 89 77
2 2 jessi 100 89
3 3 Annie 91 73
4 4 Mark 45 33
then use either of the following:
df.shape[0], which gives you the number of rows in the data frame (shape is an attribute, not a method), or:
In : df['id'].nunique()
Out: 5
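For reference, a minimal sketch using the df built above that prints both counts:
print(len(df))              # 5 -> total number of rows
print(df['id'].nunique())   # 5 -> number of distinct ids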