Lookup values from one DataFrame to create a dict from another - python

I am very new to Python and came across a problem that I could not solve.
I have two DataFrames; only the extracted columns shown below need to be considered. For example,
df1
Student ID Subjects
0 S1 Maths, Physics, Chemistry, Biology
1 S2 Maths, Chemistry, Computing
2 S3 Maths, Chemistry, Computing
3 S4 Biology, Chemistry, Maths
4 S5 English Literature, History, French
5 S6 Economics, Maths, Geography
6 S7 Further Mathematics, Maths, Physics
7 S8 Arts, Film Studies, Psychology
8 S9 English Literature, English Language, Classical
9 S10 Business, Computing, Maths
df2
Subject ID Subjects
58 Che13 Chemistry
59 Bio13 Biology
60 Mat13 Maths
61 FMat13 Further Mathematics
62 Phy13 Physics
63 Eco13 Economics
64 Geo13 Geography
65 His13 History
66 EngLang13 English Langauge
67 EngLit13 English Literature
How can I check, for every subject in df2, whether any student is taking that subject, and build a dictionary with key "Subject ID" and values the list of "Student ID"s?
The desired output will be something like:
Che13:[S1, S2, S3, ...]
Bio13:[S1,S4,...]

Use explode and map, then you can do a little grouping to get your output:
(df1.set_index('Student ID')['Subjects']
    .str.split(', ')
    .explode()
    .map(df2.set_index('Subjects')['Subject ID'])
    .reset_index()
    .groupby('Subjects')['Student ID']
    .agg(list))
Subjects
Bio13 [S1, S4]
Che13 [S1, S2, S3, S4]
Eco13 [S6]
EngLit13 [S5, S9]
FMat13 [S7]
Geo13 [S6]
His13 [S5]
Mat13 [S1, S2, S3, S4, S6, S7, S10]
Phy13 [S1, S7]
Name: Student ID, dtype: object
From here, call .to_dict() if you want the result in a dictionary.
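As a runnable sketch on a trimmed-down version of the sample data (subject names and IDs taken from the question; assumes pandas >= 0.25 for Series.explode):

```python
import pandas as pd

# Trimmed-down version of the question's data
df1 = pd.DataFrame({
    'Student ID': ['S1', 'S2', 'S4'],
    'Subjects': ['Maths, Physics, Chemistry',
                 'Maths, Chemistry, Computing',
                 'Biology, Chemistry, Maths'],
})
df2 = pd.DataFrame({
    'Subject ID': ['Che13', 'Bio13', 'Mat13', 'Phy13'],
    'Subjects': ['Chemistry', 'Biology', 'Maths', 'Physics'],
})

result = (df1.set_index('Student ID')['Subjects']
             .str.split(', ')
             .explode()                                     # one row per (student, subject)
             .map(df2.set_index('Subjects')['Subject ID'])  # subject name -> subject ID
             .reset_index()
             .groupby('Subjects')['Student ID']
             .agg(list)
             .to_dict())
print(result)
# {'Bio13': ['S4'], 'Che13': ['S1', 'S2', 'S4'], 'Mat13': ['S1', 'S2', 'S4'], 'Phy13': ['S1']}
```

Note that subjects with no entry in df2 (here 'Computing') map to NaN and are silently dropped by the groupby, which is why EngLang13 is absent from the output above.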

Not pythonic, but simple:
{row['Subject ID']:
     df1[df1.Subjects.str.contains(row['Subjects'])]['Student ID'].to_list()
 for _, row in df2.iterrows()}
What are we doing:
Iterate over all the subjects in df2 and check whether the subject string appears in the subjects taken by each student. If so, collect that student's ID.

Related

Get Subject and grade from string

Given this string
result = '''Check here to visit our corporate website
Results
Candidate Information
Examination Number
986542346
Candidate Name
JOHN DOE JAMES
Examination
MFFG FOR SCHOOL CANDIDATES 2021
Centre
LORDYARD
Subject Grades
DATA PROCESSING
B3
ECONOMICS
B3
CIVIC EDUCATION
B3
ENGLISH LANGUAGE
A1
MATHEMATICS
B3
AGRICULTURAL SCIENCE
OUTSTANDING
BIOLOGY
A1
CHEMISTRY
B2
PHYSICS
C5
C Information
Card Use
1 of 5'''
How can I extract the NAME (JOHN DOE JAMES), the SUBJECTS, and the GRADES into different lists?
I have tried this for the subjects and grades, but it is not giving me the desired results. Firstly, where a subject name is more than one word it only returns the last word, e.g. instead of DATA PROCESSING I am getting PROCESSING. Secondly, it is skipping AGRICULTURAL SCIENCE (subject) and OUTSTANDING (grade).
Please note that I am new to using regex. Thanks in advance.
pattern = re.compile(r'[A-Z]+\n{1}[A-Z][0-9]')
searches = pattern.findall(result)
if searches:
    print(searches)
for search in searches:
    print(search)
OUTPUT FOR THE FIRST PRINT STATEMENT:
['PROCESSING\nB3', 'ECONOMICS\nB3', 'EDUCATION\nB3', 'LANGUAGE\nA1', 'MATHEMATICS\nB3', 'BIOLOGY\nA1', 'CHEMISTRY\nB2', 'PHYSICS\nC5']
SECOND PRINT STATEMENT
PROCESSING
B3
ECONOMICS
B3
EDUCATION
B3
LANGUAGE
A1
MATHEMATICS
B3
BIOLOGY
A1
CHEMISTRY
B2
PHYSICS
C5
Here's a way to do this without using regexes. Note that I am assuming "OUTSTANDING" is intended to be a grade. That takes special processing.
result = '''Check here to visit our corporate website Results Candidate Information Examination Number 986542346 Candidate Name JOHN DOE JAMES Examination MFFG FOR SCHOOL CANDIDATES 2021 Centre LORDYARD Subject Grades DATA PROCESSING B3 ECONOMICS B3 CIVIC EDUCATION B3 ENGLISH LANGUAGE A1 MATHEMATICS B3 AGRICULTURAL SCIENCE OUTSTANDING BIOLOGY A1 CHEMISTRY B2 PHYSICS C5 C Information Card Use 1 of 5'''
i = result.find('Name')
j = result.find('Examination',i)
k = result.find('Centre')
l = result.find('Subject Grades')
m = result.find('Information Card')
name = result[i+5:j-1]
exam = result[j+12:k-1]
grades = result[l+15:m].split()
print("Name:", name)
print("Exam:", exam)
print("Grades:")
subject = []
for word in grades:
    if len(word) == 2 or word == 'OUTSTANDING':
        print(' '.join(subject), "......", word)
        subject = []
    else:
        subject.append(word)
Output:
Name: JOHN DOE JAMES
Exam: MFFG FOR SCHOOL CANDIDATES 2021
Grades:
DATA PROCESSING ...... B3
ECONOMICS ...... B3
CIVIC EDUCATION ...... B3
ENGLISH LANGUAGE ...... A1
MATHEMATICS ...... B3
AGRICULTURAL SCIENCE ...... OUTSTANDING
BIOLOGY ...... A1
CHEMISTRY ...... B2
PHYSICS ...... C5
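For completeness, the original regex can also be repaired. The sketch below assumes a grade is a letter A-F followed by a digit, or the literal word OUTSTANDING, and that a subject is a run of uppercase words on the preceding line (shown on a trimmed version of the input):

```python
import re

result = '''Subject Grades
DATA PROCESSING
B3
ECONOMICS
B3
CIVIC EDUCATION
B3
ENGLISH LANGUAGE
A1
MATHEMATICS
B3
AGRICULTURAL SCIENCE
OUTSTANDING
BIOLOGY
A1
CHEMISTRY
B2
PHYSICS
C5
C Information'''

# [A-Z][A-Z ]+ keeps multi-word subjects together; the alternation
# accepts either a letter-digit grade or the word OUTSTANDING.
pattern = re.compile(r'([A-Z][A-Z ]+)\n([A-F][1-9]|OUTSTANDING)')
pairs = pattern.findall(result)
subjects = [s for s, g in pairs]
grades = [g for s, g in pairs]
print(subjects)
print(grades)
```

The grade alternatives ([A-F][1-9] and OUTSTANDING) are inferred from the sample; adjust them if the real data uses other grade codes.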

Combine rows in pandas df as per given condition

I have pandas df as shown
Name Subject Score
Rakesh Math 65
Mukesh Science 76
Bhavesh French 87
Rakesh Science 88
Rakesh Hindi 76
Sanjay English 66
Mukesh English 98
Mukesh Marathi 77
I have to make another df including only students who took two or more subjects, with their subjects combined and their scores totalled.
Hence the resultant df will be as shown:
In pandas, there is a method explode that will take a column that contains lists and break it apart. We can do a sort of opposite of that by making lists out of your Subjects column. I pulled the idea from another question.
In [1]: df = df.groupby('Name').agg({'Subject': lambda x: x.tolist(), 'Score':'sum'})
In [2]: df
Out[2]:
Subject Score
Name
Bhavesh [French] 87
Mukesh [Science, English, Marathi] 251
Rakesh [Math, Science, Hindi] 229
Sanjay [English] 66
We can then filter on the Subject column for any row where the list has more than one item. This method I lifted from another SO question.
In [3]: df[df['Subject'].str.len() > 1]
Out[3]:
Subject Score
Name
Mukesh [Science, English, Marathi] 251
Rakesh [Math, Science, Hindi] 229
If you want the Subject column to be a string instead of a list, you can utilize another answer from SO.
df['Subject'] = df['Subject'].apply(lambda x: ", ".join(x))
Using groupby, filter and agg we can do it in one line:
(df.groupby('Name')
   .filter(lambda g: len(g) > 1)
   .groupby('Name')
   .agg({'Subject': ', '.join, 'Score': 'sum'}))
Output:
Subject Score
Name
Mukesh Science, English, Marathi 251
Rakesh Math, Science, Hindi 229
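As a self-contained check, the one-liner can be run against the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Name': ['Rakesh', 'Mukesh', 'Bhavesh', 'Rakesh', 'Rakesh',
             'Sanjay', 'Mukesh', 'Mukesh'],
    'Subject': ['Math', 'Science', 'French', 'Science', 'Hindi',
                'English', 'English', 'Marathi'],
    'Score': [65, 76, 87, 88, 76, 66, 98, 77],
})

out = (df.groupby('Name')
         .filter(lambda g: len(g) > 1)  # keep only students with 2+ subjects
         .groupby('Name')
         .agg({'Subject': ', '.join, 'Score': 'sum'}))
print(out)
```

Bhavesh and Sanjay each have a single row, so filter drops them before the second groupby aggregates.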

Combining three datasets removing duplicates

I've three datasets:
dataset 1
Customer1 Customer2 Exposures + other columns
Nick McKenzie Christopher Mill 23450
Nick McKenzie Stephen Green 23450
Johnny Craston Mary Shane 12
Johnny Craston Stephen Green 12
Molly John Casey Step 1000021
dataset2 (unique Customers: Customer 1 + Customer 2)
Customer Age
Nick McKenzie 53
Johnny Craston 75
Molly John 34
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
dataset 3
Customer1 Customer2 Exposures + other columns
Mick Sale Johnny Craston
Mick Sale Stephen Green
Exposures refers to Customer 1 only.
There are other columns omitted for brevity. Dataset 2 is built by getting the unique values of Customer 1 and Customer 2: there are no duplicates in that dataset. Dataset 3 has the same columns as dataset 1.
I'd like to add the information from dataset 1 into dataset 2 to have
Final dataset
Customer Age Exposures + other columns
Nick McKenzie 53 23450
Johnny Craston 75 12
Molly John 34 1000021
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
The final dataset should have all Customer1 and Customer 2 from both dataset 1 and dataset 3, with no duplicates.
I have tried to combine them as follows:
result = pd.concat([df2,df1,df3], axis=1)
but the result is not the one I'd expect.
Something is wrong in my way of concatenating the datasets, and I'd appreciate it if you could let me know what it is.
After concatenating the dataframes df1 and df3 (they have the same columns), we can remove the duplicates using drop_duplicates(subset=['Customer1']), and then join with df2 like this:
df1.set_index('Customer1').join(df2.set_index('Customer'))
If the dataframes had different columns, we could still join on the primary key with the command above and then join again with the age table.
This gives the desired result: concatenate dataset 1 and dataset 3 because they have the same columns, then run this operation, joining on the respective keys.
Note: though not directly related to the question, for the concatenation one can use pd.concat([df1, df3], ignore_index=True) (here we are ignoring the index column).
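Putting those pieces together on the sample data (a sketch; dataset 3 has no Exposures values in the question, so they are left as missing here):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Customer1': ['Nick McKenzie', 'Nick McKenzie', 'Johnny Craston',
                  'Johnny Craston', 'Molly John'],
    'Customer2': ['Christopher Mill', 'Stephen Green', 'Mary Shane',
                  'Stephen Green', 'Casey Step'],
    'Exposures': [23450, 23450, 12, 12, 1000021],
})
df3 = pd.DataFrame({
    'Customer1': ['Mick Sale', 'Mick Sale'],
    'Customer2': ['Johnny Craston', 'Stephen Green'],
    'Exposures': [None, None],   # blank in the question
})
df2 = pd.DataFrame({
    'Customer': ['Nick McKenzie', 'Johnny Craston', 'Molly John',
                 'Christopher Mill', 'Stephen Green', 'Mary Shane',
                 'Casey Step', 'Mick Sale'],
    'Age': [53, 75, 34, 63, 65, 54, 34, None],
})

# Concatenate dataset 1 and dataset 3 (same columns), keep one row per
# Customer1, then attach the exposures to dataset 2 with a left join.
exposures = (pd.concat([df1, df3], ignore_index=True)
               .drop_duplicates(subset=['Customer1'])
               .set_index('Customer1')['Exposures'])
final = df2.set_index('Customer').join(exposures).reset_index()
print(final)
```

Customers that only ever appear in the Customer2 column (e.g. Christopher Mill) get a missing Exposures value, matching the desired final dataset.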

Pandas: Getting multiple columns based on condition

I have a data-frame df like this:
Date Student_id Subject Subject_Scores
11/30/2020 1000101 Math 70
11/25/2020 1000101 Physics 75
12/02/2020 1000101 Biology 60
11/25/2020 1000101 Chemistry 49
11/25/2020 1000101 English 80
12/02/2020 1000101 Biology 60
11/25/2020 1000101 Chemistry 49
11/25/2020 1000101 English 80
12/02/2020 1000101 Sociology 50
11/25/2020 1000102 Physics 80
11/25/2020 1000102 Math 90
12/15/2020 1000102 Chemistry 63
12/15/2020 1000103 English 71
case:1
If I use df[df['Student_id']=='1000102']['Date'], this gives the dates for that particular Student_id.
How can I get the same for multiple columns with a single condition?
I want to get multiple columns based on a condition; how can I get an output df something like this for Student_id = 1000102:
Date Subject
11/25/2020 Physics
11/25/2020 Math
12/15/2020 Chemistry
I have tried this, but I am getting an error:
df[df['Student_id']=='1000102']['Date', 'Subject']
And
df[df['Student_id']=='1000102']['Date']['Subject']
case:2
How can I use unique() in the above scenario (for multiple columns)?
df[df['Student_id']=='1000102']['Date', 'Subject'].unique() #this gives an error
How could this possibly be achieved?
You can pass a list to DataFrame.loc:
df1 = df.loc[df['Student_id']=='1000102', ['Date', 'Subject']]
print (df1)
Date Subject
9 11/25/2020 Physics
10 11/25/2020 Math
11 12/15/2020 Chemistry
If need unique values add DataFrame.drop_duplicates:
df2 = df.loc[df['Student_id']=='1000102', ['Date', 'Subject']].drop_duplicates()
print (df2)
Date Subject
9 11/25/2020 Physics
10 11/25/2020 Math
11 12/15/2020 Chemistry
If need Series.unique for each column separately:
df3 = df.loc[df['Student_id']=='1000102', ['Date', 'Subject']].apply(lambda x: x.unique())
print (df3)
Date [11/25/2020, 12/15/2020]
Subject [Physics, Math, Chemistry]
dtype: object

Find non-matching pairs in 2 dataframes and make new missing dataframe Python

I have two uneven dataframes that have all the same variables except for a pair of ID values that vary from one to the other.
For example one of the dataframes, df1, looks like this:
Name Name ID State Gen ID Unit ID
Nikki 9 AZ 1-1 1
Nikki 9 AZ 1-2 2
Nikki 9 AZ 1-3 3
Mondip 101 NY 1A 1A
Mondip 101 NY 1B 1B
James 11 CA 12-1 12
James 11 CA 13-1 13
Sandra 88 NJ 1 1
.
.
.
The other dataframe df2 looks like this:
Name Name ID State Unit ID
Monte 97 PA 4-1
Monte 97 PA 4-2
Nikki Ltd 9 AZ 1
Nikki Ltd 9 AZ 2
Mondip 101 NY 1A
Mondip 101 NY 1B
James 11 CA 12-1
James 11 CA 13-1
.
.
.
As you can see the Gen ID column and the Unit ID column are somehow connected. Sometimes the Unit ID in df2 can be either the Gen ID or the Unit ID in df1.
What I want to do is to create a new dataframe or list of each set of Name, Name ID, and State that does not appear in both df1 and df2. Sometimes the names only roughly match (Nikki vs Nikki Ltd), so I need to take care of this using the Name ID.
For example the new dataframe output df_missing would be:
Name Name ID State Gen ID Unit ID
Monte 97 PA 4-1
Monte 97 PA 4-2
Sandra 88 NJ 1 1
Is there an easy way to do this?
If we assume that you can identify names that are close enough, then the first step is to replace instances of 'Nikki Ltd' with 'Nikki'. Once you do that, it's a simple matter to identify the names that are not common to both dataframes. These names are:
# normalize the near-duplicate names first, e.g. 'Nikki Ltd' -> 'Nikki'
df2['Name'] = df2['Name'].replace({'Nikki Ltd': 'Nikki'})
merged_df = pd.concat([df1, df2])
s1 = set(df1['Name'].unique())
s2 = set(df2['Name'].unique())
# read as: everyone in s1 that's not in s2, and everyone in s2 that's not in s1
mutually_distinct_names = list((s1 - s2).union(s2 - s1))
missing_df = merged_df[merged_df['Name'].isin(mutually_distinct_names)]
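A runnable sketch on a trimmed version of the sample data. Instead of hand-written replacements, it reconciles spellings like 'Nikki Ltd' vs 'Nikki' through the shared Name ID (that ID-based mapping is my own assumption, not part of the original answer):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Nikki', 'Mondip', 'James', 'Sandra'],
    'Name ID': [9, 101, 11, 88],
    'State': ['AZ', 'NY', 'CA', 'NJ'],
})
df2 = pd.DataFrame({
    'Name': ['Monte', 'Nikki Ltd', 'Mondip', 'James'],
    'Name ID': [97, 9, 101, 11],
    'State': ['PA', 'AZ', 'NY', 'CA'],
})

# Resolve near-duplicate names via the shared Name ID: any name in df2
# whose Name ID also appears in df1 is rewritten to the df1 spelling.
id_to_name = df1.set_index('Name ID')['Name'].to_dict()
df2['Name'] = df2['Name ID'].map(id_to_name).fillna(df2['Name'])

merged_df = pd.concat([df1, df2])
s1, s2 = set(df1['Name']), set(df2['Name'])
missing_df = merged_df[merged_df['Name'].isin(s1.symmetric_difference(s2))]
print(missing_df)
```

symmetric_difference is equivalent to (s1 - s2).union(s2 - s1); here it leaves Monte and Sandra, the two customers present in only one dataframe.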
