I have a nested dictionary as below:
stud_data_dict = {'s1': {'Course 1': {'Course Name': 'Maths',
                                      'Marks': 95,
                                      'Grade': 'A+'},
                         'Course 2': {'Course Name': 'Science',
                                      'Marks': 75,
                                      'Grade': 'B-'}},
                  's2': {'Course 1': {'Course Name': 'English',
                                      'Marks': 82,
                                      'Grade': 'B'},
                         'Course 2': {'Course Name': 'Maths',
                                      'Marks': 90,
                                      'Grade': 'A'}}}
I need to convert it into a dataframe like the one below:
Student Course 1 Course 2
Course Name Marks Grade Course Name Marks Grade
s1 Maths 95 A+ Science 75 B-
s2 English 82 B Maths 90 A
I have tried the following code from this answer:
stud_df = pandas.DataFrame.from_dict(stud_data_dict, orient="index").stack().to_frame()
final_df = pandas.DataFrame(stud_df[0].values.tolist(), index=stud_df.index)
I am getting the dataframe like below
Course Name Marks Grade
s1 Course 1 Maths 95 A+
Course 2 Science 75 B-
s2 Course 1 English 82 B
Course 2 Maths 90 A
This is the closest I got to the desired output. What changes do I need to make to the code to get the desired dataframe?
First flatten the nested dictionary into one with tuple keys, then pass it to Series and reshape with Series.unstack:
import pandas as pd

# reformat nested dict
# https://stackoverflow.com/a/39807565/2901002
d = {(level1_key, level2_key, level3_key): values
     for level1_key, level2_dict in stud_data_dict.items()
     for level2_key, level3_dict in level2_dict.items()
     for level3_key, values in level3_dict.items()}

stud_df = pd.Series(d).unstack([1, 2])
print(stud_df)
Course 1 Course 2
Course Name Marks Grade Course Name Marks Grade
s1 Maths 95 A+ Science 75 B-
s2 English 82 B Maths 90 A
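The desired output also shows a "Student" label over the index. A self-contained variant of the approach above adds that by naming the index; the rename_axis step is my addition, not part of the original answer:

```python
import pandas as pd

stud_data_dict = {
    's1': {'Course 1': {'Course Name': 'Maths', 'Marks': 95, 'Grade': 'A+'},
           'Course 2': {'Course Name': 'Science', 'Marks': 75, 'Grade': 'B-'}},
    's2': {'Course 1': {'Course Name': 'English', 'Marks': 82, 'Grade': 'B'},
           'Course 2': {'Course Name': 'Maths', 'Marks': 90, 'Grade': 'A'}},
}

# flatten the nesting into tuple keys, as in the answer above
d = {(k1, k2, k3): v
     for k1, courses in stud_data_dict.items()
     for k2, fields in courses.items()
     for k3, v in fields.items()}

# unstack the course and field levels into columns, then name the index
stud_df = pd.Series(d).unstack([1, 2]).rename_axis('Student')
print(stud_df)
```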
Another idea is to create a dictionary with tuples as keys using defaultdict:
from collections import defaultdict

d = defaultdict(dict)
for k, v in stud_data_dict.items():
    for k1, v1 in v.items():
        for k2, v2 in v1.items():
            d[(k1, k2)].update({k: v2})

df = pd.DataFrame(d)
print(df)
Course 1 Course 2
Course Name Marks Grade Course Name Marks Grade
s1 Maths 95 A+ Science 75 B-
s2 English 82 B Maths 90 A
One option is to create data frames from the inner dictionaries, concatenate them into a single frame, reshape, and clean up:
out = {key: pd.DataFrame.from_dict(value, orient='index')
       for key, value in stud_data_dict.items()}

(pd
 .concat(out)
 .unstack()
 .swaplevel(axis=1)
 .sort_index(axis=1)
 .rename_axis('Student')
 .reset_index()
)
Student Course 1 Course 2
Course Name Grade Marks Course Name Grade Marks
0 s1 Maths A+ 95 Science B- 75
1 s2 English B 82 Maths A 90
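Note that sort_index(axis=1) orders the second column level alphabetically (Course Name, Grade, Marks). If you want to keep the original field order instead, one option is to reindex the columns explicitly; a self-contained sketch where the reindex step is my addition to the answer:

```python
import pandas as pd

stud_data_dict = {
    's1': {'Course 1': {'Course Name': 'Maths', 'Marks': 95, 'Grade': 'A+'},
           'Course 2': {'Course Name': 'Science', 'Marks': 75, 'Grade': 'B-'}},
    's2': {'Course 1': {'Course Name': 'English', 'Marks': 82, 'Grade': 'B'},
           'Course 2': {'Course Name': 'Maths', 'Marks': 90, 'Grade': 'A'}},
}

out = {key: pd.DataFrame.from_dict(value, orient='index')
       for key, value in stud_data_dict.items()}

# same concat/unstack/swaplevel chain, but reindex the columns
# instead of sorting them alphabetically
cols = pd.MultiIndex.from_product([['Course 1', 'Course 2'],
                                   ['Course Name', 'Marks', 'Grade']])
res = (pd.concat(out)
         .unstack()
         .swaplevel(axis=1)
         .reindex(columns=cols)
         .rename_axis('Student'))
print(res)
```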
You should get more performance if you do all the initial wrangling in vanilla Python or NumPy before creating the final dataframe:
import numpy as np
import pandas as pd

outer = []
index = []
for key, value in stud_data_dict.items():
    row = []
    for k, v in value.items():
        row.extend(v.values())
    outer.append(row)
    index.append(key)

# build the column labels once, from the last student's courses
# (extending them inside the loop would only line up when the number
# of students happens to equal the number of courses)
bottom = [field for course in value.values() for field in course]
top = np.repeat([*value.keys()], len(v))

pd.DataFrame(outer, columns=[top, bottom], index=index)
Course 1 Course 2
Course Name Marks Grade Course Name Marks Grade
s1 Maths 95 A+ Science 75 B-
s2 English 82 B Maths 90 A
My list of dictionaries is nested, with a list of subject entries inside each item, like this:
scores = [{"Student": "Adam", "Subjects": [{"Name": "Math", "Score": 85},
                                           {"Name": "Science", "Score": 90}]},
          {"Student": "Bec", "Subjects": [{"Name": "Math", "Score": 70},
                                          {"Name": "English", "Score": 100}]}]
If I use pd.DataFrame directly on the list, the Subjects column just holds the raw lists of dicts.
What should I do in order to get a data frame that looks like this:
Student Subject.Name Subject.Score
Adam Math 85
Adam Science 90
Bec Math 70
Bec English 100
?
Thanks very much
Use json_normalize with rename:
df = (pd.json_normalize(scores, 'Subjects', 'Student')
        .rename(columns={'Name': 'Subject.Name', 'Score': 'Subject.Score'}))
print(df)
Subject.Name Subject.Score Student
0 Math 85 Adam
1 Science 90 Adam
2 Math 70 Bec
3 English 100 Bec
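The Student column ends up last because json_normalize appends meta columns after the record columns. If you want Student first, as in the desired output, the columns can simply be reordered; the reordering step here is my addition to the answer:

```python
import pandas as pd

scores = [{"Student": "Adam", "Subjects": [{"Name": "Math", "Score": 85},
                                           {"Name": "Science", "Score": 90}]},
          {"Student": "Bec", "Subjects": [{"Name": "Math", "Score": 70},
                                          {"Name": "English", "Score": 100}]}]

df = (pd.json_normalize(scores, 'Subjects', 'Student')
        .rename(columns={'Name': 'Subject.Name', 'Score': 'Subject.Score'})
        [['Student', 'Subject.Name', 'Subject.Score']])  # put Student first
print(df)
```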
Or use a list comprehension with dict unpacking and the DataFrame constructor:
df = pd.DataFrame([{**x, **{f'Subject.{k}': v for k, v in y.items()}}
                   for x in scores for y in x.pop('Subjects')])
print(df)
Student Subject.Name Subject.Score
0 Adam Math 85
1 Adam Science 90
2 Bec Math 70
3 Bec English 100
I am a beginner in Python.
I know similar questions have been posed, and I have read through the answers for the past 2 hours, but I can't seem to get my code to work. I would appreciate your advice on where I might have gone wrong.
I have a dictionary as such:
{'Tom': [('Math', 98),
         ('English', 75)],
 'Betty': [('Science', 42),
           ('Humanities', 15)]}
What is the most efficient way to convert to the following Pandas Dataframe?
Tom Math 98
Tom English 75
Betty Science 42
Betty Humanities 15
I have tried the following method which is throwing up a TypeError: cannot unpack non-iterable int object:
df = pd.DataFrame(columns=['Name', 'Subject', 'Score'])
i = 0
for name in enumerate(data):
    for subject, score in name:
        df.loc[i] = [name, subject, score]
        i += 1
Thanks a million!
You can loop and construct a list of lists that Pandas can consume.
d = {'Tom': [('Math', 98),
             ('English', 75)],
     'Betty': [('Science', 42),
               ('Humanities', 15)]}
data = [[k, *v] for k, lst in d.items() for v in lst]
df = pd.DataFrame(data, columns=['Name','Subject','Score'])
Name Subject Score
0 Tom Math 98
1 Tom English 75
2 Betty Science 42
3 Betty Humanities 15
Do this,
df = pd.DataFrame(data).melt(var_name = "Name", value_name = "Data")
new_df = pd.DataFrame(df["Data"].tolist(), columns = ["Subject", "Marks"])
new_df.insert(loc = 0, column = "Name", value = df["Name"])
Output -
    Name     Subject  Marks
0    Tom        Math     98
1    Tom     English     75
2  Betty     Science     42
3  Betty  Humanities     15
I am new to Python; kindly help me.
I have two sets of CSV files. I need to compare them and output the differences: changed data, deleted data, and added data. Here's my example.
file 1:
Sn Name Subject Marks
1 Ram Maths 85
2 sita Engilsh 66
3 vishnu science 50
4 balaji social 60
file 2:
Sn Name Subject Marks
1 Ram computer 85    # subject name has changed
2 sita Engilsh 66
3 vishnu science 90  # marks have changed
4 balaji social 60
5 kishor chem 99     # new line added
Output - I need to get something like this:
Changed Items:
1 Ram computer 85
3 vishnu science 90
Added item:
5 kishor chem 99
Deleted item:
.................
I imported csv and did the comparison via a for loop with redlines, but I am not getting the desired output. Flagging the added and deleted items between file 1 and file 2 is confusing me a lot. Please suggest an effective approach, folks.
The idea here is to flatten your dataframe with melt to compare each value:
import numpy as np
import pandas as pd

# Load your csv files
df1 = pd.read_csv('file1.csv', ...)
df2 = pd.read_csv('file2.csv', ...)

# Select columns (not mandatory, it depends on your 'Sn' column)
cols = ['Name', 'Subject', 'Marks']

# Flatten your dataframes
out1 = df1[cols].melt('Name', var_name='Item', value_name='Old')
out2 = df2[cols].melt('Name', var_name='Item', value_name='New')
out = pd.merge(out1, out2, on=['Name', 'Item'], how='outer')
# Flag the state of each item: check for missing values first,
# otherwise added/deleted rows would be caught by the inequality test
condlist = [out['Old'].isna(),
            out['New'].isna(),
            out['Old'] != out['New']]
out['State'] = np.select(condlist, choicelist=['added', 'deleted', 'changed'],
                         default='unchanged')
Output:
>>> out
     Name     Item      Old       New      State
0     Ram  Subject    Maths  computer    changed
1    sita  Subject  Engilsh   Engilsh  unchanged
2  vishnu  Subject  science   science  unchanged
3  balaji  Subject   social    social  unchanged
4     Ram    Marks       85        85  unchanged
5    sita    Marks       66        66  unchanged
6  vishnu    Marks       50        90    changed
7  balaji    Marks       60        60  unchanged
8  kishor  Subject      NaN      chem      added
9  kishor    Marks      NaN        99      added
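To turn the flagged frame into the kind of report the question asks for, the non-unchanged rows can be grouped by state and printed section by section. A self-contained sketch that builds the two frames inline instead of reading CSVs; the inline sample data and the report loop are my additions:

```python
import numpy as np
import pandas as pd

# sample data reconstructed from the two files in the question
df1 = pd.DataFrame({'Name': ['Ram', 'sita', 'vishnu', 'balaji'],
                    'Subject': ['Maths', 'Engilsh', 'science', 'social'],
                    'Marks': [85, 66, 50, 60]})
df2 = pd.DataFrame({'Name': ['Ram', 'sita', 'vishnu', 'balaji', 'kishor'],
                    'Subject': ['computer', 'Engilsh', 'science', 'social', 'chem'],
                    'Marks': [85, 66, 90, 60, 99]})

cols = ['Name', 'Subject', 'Marks']
out1 = df1[cols].melt('Name', var_name='Item', value_name='Old')
out2 = df2[cols].melt('Name', var_name='Item', value_name='New')
out = pd.merge(out1, out2, on=['Name', 'Item'], how='outer')

# missing-value checks go first so added/deleted rows are not
# swallowed by the inequality test
condlist = [out['Old'].isna(), out['New'].isna(), out['Old'] != out['New']]
out['State'] = np.select(condlist, ['added', 'deleted', 'changed'],
                         default='unchanged')

# print each state as its own section, similar to the requested report
for state, grp in out[out['State'] != 'unchanged'].groupby('State'):
    print(f'{state.capitalize()} items:')
    print(grp[['Name', 'Item', 'Old', 'New']].to_string(index=False))
```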
count, flag = 0, 1
for i, j in zip(df1.values, df2.values):
    if sum(i == j) != 4:
        if flag:
            print("Changed Items:")
            flag = 0
        print(j)
    count += 1
if count != len(df2):
    print("Newly added:")
    print(*df2.iloc[count:, :].values)
For example, choose the persons whose math scores are greater than 80. Please note what this means: if a person has several math courses, then all of their math scores must be > 80. I don't mean "keep only the rows where math > 80".
I came up with the following code:
import pandas as pd
df = pd.DataFrame(
    [
        ['Mike', 'math 1', 30],
        ['Mike', 'math 2', 85],
        ['Mike', 'English writing', 70],
        ['Mike', 'English reading', 60],
        ['Mike', 'Java programming', 80],
        ['John', 'math 1', 85],
        ['John', 'math 2', 90],
        ['John', 'Python programming', 60],
        ['Einstein', 'math 1', 90],
        ['Einstein', 'math 3', 95],
        ['Einstein', 'C programming', 90],
    ],
    columns=['name', 'course', 'score'],
)
lstDfResult = []
for name in set(df.name):
    dfTmp = df.query(f'name == "{name}"')
    dfTmpCourse = dfTmp[dfTmp['course'].str.contains('math')]
    if len(dfTmpCourse[dfTmpCourse.score > 80]) == len(dfTmpCourse):
        lstDfResult.append(dfTmp)
dfResult = pd.concat(lstDfResult)
print(dfResult)
If the condition gets more complex, for example "choose the persons whose math scores are greater than 80 AND whose English scores are greater than 60", the code gets even longer.
Is there any terse but fast way to do this in pandas? Thanks
the original df:
name      course              score
Mike      math 1                 30
Mike      math 2                 85
Mike      English writing        70
Mike      English reading        60
Mike      Java programming       80
John      math 1                 85
John      math 2                 90
John      Python programming     60
Einstein  math 1                 90
Einstein  math 3                 95
Einstein  C programming          90
the result for only one condition (math > 80):
name      course              score
Einstein  math 1                 90
Einstein  math 3                 95
Einstein  C programming          90
John      math 1                 85
John      math 2                 90
John      Python programming     60
Use concat to build a mask where each requirement is checked independently, then select the rows whose name satisfies all requirements.
This solution was greatly improved by @jezrael!
requirements = [('math', 80), ('programming', 70)]

mask = pd.concat([df.loc[df['course'].str.contains(course), 'score']
                    .gt(score).groupby(df['name']).all().rename(course)
                  for course, score in requirements], axis=1)
out = df[df['name'].isin(mask.index[mask.all(axis=1)])]
>>> mask
math programming
name
Einstein True True
John True False
Mike False True
>>> out
name course score
8 Einstein math 1 90
9 Einstein math 3 95
10 Einstein C programming 90
Note: according to your comment, no one satisfies the requirements:
math > 80
English > 60
Simpler solution:
m = pd.concat([df.loc[df['course'].str.contains(course), 'score']
                 .gt(score).groupby(df['name']).all()
               for course, score in requirements], axis=1)
names = m.index[m.all(axis=1)]
df = df[df['name'].isin(names)]
With filter functions:
def make_filter(crs, scr):
    """Factory function that returns a filtering function."""
    def f(gr):
        # first filter the group
        course_scores = gr.loc[gr["course"].str.contains(crs, case=False), "score"]
        filtered = course_scores > scr
        # it can be empty; e.g., the person doesn't have an English course
        if filtered.empty:
            return False
        # if not empty, return whether all related course scores are okay
        return filtered.all()
    # returning the inner function
    return f

# requirements
courses = ["math", "programming"]
scores = [80, 70]
filters = [*map(make_filter, courses, scores)]

# GroupBy.filter selects those names that satisfy the requirements
result = df.groupby("name").filter(lambda gr: all(f(gr) for f in filters))
We first define a function that produces a filtering function given a course/threshold-score pair. Above, for example, we require the Math and Programming course scores to be greater than 80 and 70, respectively. map helps make a filter for each pair. Lastly, GroupBy.filter looks at each name and applies each filter; we check whether all the filters return True to decide whether to keep the group.
>>> result
name course score
8 Einstein math 1 90
9 Einstein math 3 95
10 Einstein C programming 90
Only Einstein had all Math scores > 80 and Programming > 70.
I want to merge rows in my df so I have one unique row per ID/Name with other values either summed (revenue) or concatenated (subject and product). However, where I am concatenating, I do not want duplicates to appear.
My df is similar to this:
ID Name Revenue Subject Product
123 John 125 Maths A
123 John 75 English B
246 Mary 32 History B
312 Peter 67 Maths A
312 Peter 39 Science A
I am using the following code to aggregate rows in my data frame
def f(x):
    return ' '.join(list(x))

df.groupby(['ID', 'Name']).agg(
    {'Revenue': 'sum', 'Subject': f, 'Product': f}
)
This results in output like this:
ID Name Revenue Subject Product
123 John 200 Maths English A B
246 Mary 32 History B
312 Peter 106 Maths Science A A
How can I amend my code so that duplicates are removed in the concatenation? In the example above, the last row should read A in Product, not A A.
You are very close. Apply set to the items before joining them; this keeps only the unique items:
def f(x):
    return ' '.join(set(x))

df.groupby(['ID', 'Name']).agg(
    {'Revenue': 'sum', 'Subject': f, 'Product': f}
)
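One thing to be aware of: set does not preserve the original order of the values, so 'Maths English' could come out as 'English Maths'. If order matters, dict.fromkeys deduplicates while keeping first-seen order. A self-contained sketch of that variation; the inline sample data is my reconstruction of the frame shown above:

```python
import pandas as pd

df = pd.DataFrame({'ID': [123, 123, 246, 312, 312],
                   'Name': ['John', 'John', 'Mary', 'Peter', 'Peter'],
                   'Revenue': [125, 75, 32, 67, 39],
                   'Subject': ['Maths', 'English', 'History', 'Maths', 'Science'],
                   'Product': ['A', 'B', 'B', 'A', 'A']})

def f(x):
    # dict keys are unique and keep first-seen order (Python 3.7+)
    return ' '.join(dict.fromkeys(x))

out = df.groupby(['ID', 'Name']).agg(
    {'Revenue': 'sum', 'Subject': f, 'Product': f}
)
print(out)
```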