Get Subject and grade from string - python

Given this string
result = '''Check here to visit our corporate website
Results
Candidate Information
Examination Number
986542346
Candidate Name
JOHN DOE JAMES
Examination
MFFG FOR SCHOOL CANDIDATES 2021
Centre
LORDYARD
Subject Grades
DATA PROCESSING
B3
ECONOMICS
B3
CIVIC EDUCATION
B3
ENGLISH LANGUAGE
A1
MATHEMATICS
B3
AGRICULTURAL SCIENCE
OUTSTANDING
BIOLOGY
A1
CHEMISTRY
B2
PHYSICS
C5
C Information
Card Use
1 of 5'''
How can I extract the name (JOHN DOE JAMES), the subjects, and the grades into separate lists?
I tried the code below for the subjects and grades, but it does not give the desired result. First, when a subject name has more than one word, only the last word is returned: instead of DATA PROCESSING I get PROCESSING. Second, it skips AGRICULTURAL SCIENCE (subject) and OUTSTANDING (grade).
Please note that I am new to regex. Thanks in advance.
import re

pattern = re.compile(r'[A-Z]+\n{1}[A-Z][0-9]')
searches = pattern.findall(result)
if searches:
    print(searches)
for search in searches:
    print(search)
OUTPUT FOR THE FIRST PRINT STATEMENT:
['PROCESSING\nB3', 'ECONOMICS\nB3', 'EDUCATION\nB3', 'LANGUAGE\nA1', 'MATHEMATICS\nB3', 'BIOLOGY\nA1', 'CHEMISTRY\nB2', 'PHYSICS\nC5']
SECOND PRINT STATEMENT
PROCESSING
B3
ECONOMICS
B3
EDUCATION
B3
LANGUAGE
A1
MATHEMATICS
B3
BIOLOGY
A1
CHEMISTRY
B2
PHYSICS
C5

Here's a way to do this without using regexes. Note that I am assuming "OUTSTANDING" is intended to be a grade. That takes special processing.
result = '''Check here to visit our corporate website Results Candidate Information Examination Number 986542346 Candidate Name JOHN DOE JAMES Examination MFFG FOR SCHOOL CANDIDATES 2021 Centre LORDYARD Subject Grades DATA PROCESSING B3 ECONOMICS B3 CIVIC EDUCATION B3 ENGLISH LANGUAGE A1 MATHEMATICS B3 AGRICULTURAL SCIENCE OUTSTANDING BIOLOGY A1 CHEMISTRY B2 PHYSICS C5 C Information Card Use 1 of 5'''
i = result.find('Name')
j = result.find('Examination',i)
k = result.find('Centre')
l = result.find('Subject Grades')
m = result.find('Information Card')
name = result[i+5:j-1]
exam = result[j+12:k-1]
grades = result[l+15:m].split()
print("Name:", name)
print("Exam:", exam)
print("Grades:")
subject = []
for word in grades:
    # A two-character token (e.g. B3) or the word OUTSTANDING ends a subject name
    if len(word) == 2 or word == 'OUTSTANDING':
        print(' '.join(subject), "......", word)
        subject = []
    else:
        subject.append(word)
Output:
Name: JOHN DOE JAMES
Exam: MFFG FOR SCHOOL CANDIDATES 2021
Grades:
DATA PROCESSING ...... B3
ECONOMICS ...... B3
CIVIC EDUCATION ...... B3
ENGLISH LANGUAGE ...... A1
MATHEMATICS ...... B3
AGRICULTURAL SCIENCE ...... OUTSTANDING
BIOLOGY ...... A1
CHEMISTRY ...... B2
PHYSICS ...... C5
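
For completeness, here is a regex-based sketch, since that is what the question asked about. The pattern in the question fails for two reasons: `[A-Z]+` does not allow spaces, so only the last word of a multi-word subject matches, and `[A-Z][0-9]` cannot match the grade OUTSTANDING. The version below assumes the string still has one item per line, as in the question, that subject names contain only capital letters and spaces, and that a grade is either a letter followed by a digit or the literal word OUTSTANDING. The sample string is shortened for illustration.

```python
import re

# Shortened sample in the same line-per-item layout as the question
result = """Candidate Name
JOHN DOE JAMES
Examination
MFFG FOR SCHOOL CANDIDATES 2021
Centre
LORDYARD
Subject Grades
DATA PROCESSING
B3
AGRICULTURAL SCIENCE
OUTSTANDING
PHYSICS
C5
C Information
Card Use
1 of 5"""

# The name is the line right after the "Candidate Name" label
name = re.search(r"^Candidate Name\n(.+)$", result, re.M).group(1)

# A subject line (capitals and spaces) followed by a grade line
pairs = re.findall(r"^([A-Z ]+)\n([A-Z][0-9]|OUTSTANDING)$", result, re.M)
subjects = [subject for subject, _ in pairs]
grades = [grade for _, grade in pairs]

print(name)
print(subjects)
print(grades)
```

With `re.M`, `^` and `$` match at the start and end of every line, which keeps lines such as LORDYARD from being picked up as subjects because the line after them is not a grade.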

Related

Data Cleaning How to split Pandas column

It has been some time since I last worked in Python.
I have the data frame below, with too many columns to name them all.
last/first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
Rogers Dave Toronto A4 HR
How do I remove the capital letters in the last/first column and also split that column on " "?
Goal:
last first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
rogers dave Toronto A4 HR
IIUC, you could use str.lower and str.split:
df[['last', 'first']] = (df.pop('last/first')
                           .str.lower()
                           .str.split(n=1, expand=True)
                         )
output:
location job department last first
0 Vancouver A1 servers smith john
1 Toronto A2 eng rogers steve
2 Toronto A4 HR rogers dave
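
If the new columns should come first, as in the stated goal, a small variation on the same idea is sketched below. The sample frame is rebuilt from the question; the variable name `names` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "last/first": ["smith john", "rogers steve", "Rogers Dave"],
    "location": ["Vancouver", "Toronto", "Toronto"],
    "job": ["A1", "A2", "A4"],
    "department": ["servers", "eng", "HR"],
})

# pop() removes the combined column from df and returns it as a Series
names = df.pop("last/first").str.lower().str.split(n=1, expand=True)
names.columns = ["last", "first"]

# Put the two new columns in front of the remaining ones
df = pd.concat([names, df], axis=1)
print(df)
```

`n=1` limits the split to the first run of whitespace, so a first name containing a space would stay in one piece.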

How to create a transition matrix out of a Pandas dataframe

I have customers at specific locations. A customer may change location from year to year. I would like to create a transition matrix that shows which customers moved from one location to another.
Tidy dataframe:
year cust loc
2019 C1 LA
2019 C2 LA
2019 C3 LB
2019 C4 LC
2019 C5 LA
2019 C6 LA
2020 C1 LB
2020 C2 LA
2020 C4 LC
2020 C5 LC
2020 C6 LC
2020 C7 LD
Desired output:
      LA   LB   LC      LD   drop
LA         C1   C5,C6
LB                           C3
LC
LD
I am looking for an elegant way to achieve that in pandas. Any clever idea where extensive nested looping is not needed?
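
One possible sketch without explicit loops: pivot to one row per customer, then group by the (from, to) pair. The column names `from` and `to` and the label `drop` for customers missing in 2020 are assumptions made for this example. Unlike the table in the question, this version also lists customers who stayed at the same location.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019] * 6 + [2020] * 6,
    "cust": ["C1", "C2", "C3", "C4", "C5", "C6",
             "C1", "C2", "C4", "C5", "C6", "C7"],
    "loc": ["LA", "LA", "LB", "LC", "LA", "LA",
            "LB", "LA", "LC", "LC", "LC", "LD"],
})

# One row per customer, one column per year; NaN marks a missing year
wide = df.pivot(index="cust", columns="year", values="loc")

moves = (
    wide.rename(columns={2019: "from", 2020: "to"})
    .dropna(subset=["from"])   # ignore customers that only appear in 2020
    .fillna({"to": "drop"})    # customers that disappeared in 2020
    .reset_index()
)

# Join the customer names for every (from, to) pair and spread "to" into columns
matrix = (
    moves.groupby(["from", "to"])["cust"]
    .agg(",".join)
    .unstack(fill_value="")
)
print(matrix)
```

Locations that nobody started from (LD here) do not get a row; `matrix.reindex(...)` with the full list of locations could add the empty rows and columns if they are needed.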

Lookup values from one DataFrame to create a dict from another

I am very new to Python and came across a problem that I could not solve.
I have two DataFrames; only the columns extracted below need to be considered, for example:
df1
Student ID Subjects
0 S1 Maths, Physics, Chemistry, Biology
1 S2 Maths, Chemistry, Computing
2 S3 Maths, Chemistry, Computing
3 S4 Biology, Chemistry, Maths
4 S5 English Literature, History, French
5 S6 Economics, Maths, Geography
6 S7 Further Mathematics, Maths, Physics
7 S8 Arts, Film Studies, Psychology
8 S9 English Literature, English Language, Classical
9 S10 Business, Computing, Maths
df2
Subject ID Subjects
58 Che13 Chemistry
59 Bio13 Biology
60 Mat13 Maths
61 FMat13 Further Mathematics
62 Phy13 Physics
63 Eco13 Economics
64 Geo13 Geography
65 His13 History
66 EngLang13 English Langauge
67 EngLit13 English Literature
For every subject in df2, how can I check whether a student is taking that subject and build a dictionary with "Subject ID" as the key and the matching "Student ID" values as the value?
The desired output is something like:
Che13:[S1, S2, S3, ...]
Bio13:[S1,S4,...]
Use explode and map, then you can do a little grouping to get your output:
(df1.set_index('Student ID')['Subjects']
    .str.split(', ')
    .explode()
    .map(df2.set_index('Subjects')['Subject ID'])
    .reset_index()
    .groupby('Subjects')['Student ID']
    .agg(list))
Subjects
Bio13 [S1, S4]
Che13 [S1, S2, S3, S4]
Eco13 [S6]
EngLit13 [S5, S9]
FMat13 [S7]
Geo13 [S6]
His13 [S5]
Mat13 [S1, S2, S3, S4, S6, S7, S10]
Phy13 [S1, S7]
Name: Student ID, dtype: object
From here, call .to_dict() if you want the result in a dictionary.
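
Put together as a self-contained sketch, using a reduced version of the two frames from the question (the names `subject_to_id` and `students_by_subject` are only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Student ID": ["S1", "S2", "S4", "S5"],
    "Subjects": [
        "Maths, Physics, Chemistry, Biology",
        "Maths, Chemistry, Computing",
        "Biology, Chemistry, Maths",
        "English Literature, History, French",
    ],
})
df2 = pd.DataFrame({
    "Subject ID": ["Che13", "Bio13", "Mat13", "EngLit13"],
    "Subjects": ["Chemistry", "Biology", "Maths", "English Literature"],
})

# Lookup Series: subject name -> subject ID
subject_to_id = df2.set_index("Subjects")["Subject ID"]

students_by_subject = (
    df1.set_index("Student ID")["Subjects"]
    .str.split(", ")
    .explode()              # one row per (student, subject)
    .map(subject_to_id)     # subjects missing from df2 become NaN
    .dropna()
    .reset_index()
    .groupby("Subjects")["Student ID"]
    .agg(list)
    .to_dict()
)
print(students_by_subject)
```

The lookup is an exact match on the subject name, so a spelling difference between the two frames leaves that subject out of the result.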
Not pythonic but simple
{row['Subject ID']:
     df1[df1.Subjects.str.contains(row['Subjects'])]['Student ID'].to_list()
 for _, row in df2.iterrows()}
What this does:
Iterate over all the subjects in df2 and check whether each subject string occurs in the subjects taken by a student. If so, collect that student's ID.

Python Program to split a new file from a master file

I have a master file which has 4 columns.
Name Parent Child Property
A1 World USA 1
A2 USA Texas 2
A3 Texas Houston 3
A4 USA Austin 4
A5 World USA 5
A6 World Canada 6
A7 Canada Toronto 7
I need to create a new file containing the records that lie between the occurrences of the keyword (USA) in column 3.
The output file should be :
Name Parent Child Property
A1 World USA 1
A2 USA Texas 2
A3 Texas Houston 3
A4 USA Austin 4
A5 World USA 5
Here is my sample code, which runs fine on my test box:
#!/usr/bin/env python3
import re

# old.txt: source file with all contents; new.txt: new file to write the output to
with open("old.txt") as oldfile, open("new.txt", "w") as newfile:
    for line in oldfile:
        # Keep every line that contains USA anywhere
        if re.match("(.*)USA(.*)", line):
            newfile.write(line)
Output file:
cat new.txt
A1 World USA 1
A2 USA Texas 2
A4 USA Austin 4
A5 World USA 5
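
The regex above keeps only lines that contain USA somewhere, which is why A3 is missing. A minimal sketch of one reading of the requirement follows: keep everything from the first to the last row whose third column equals the keyword. The function name `rows_between` and its parameters are hypothetical, and the rows are assumed to be whitespace-separated.

```python
def rows_between(lines, keyword="USA", column=2):
    """Return the header plus all rows from the first to the last keyword hit."""
    header, rows = lines[0], lines[1:]
    # Indexes of rows whose given column (0-based) equals the keyword
    hits = [
        i for i, row in enumerate(rows)
        if len(row.split()) > column and row.split()[column] == keyword
    ]
    if not hits:
        return [header]
    return [header] + rows[hits[0]:hits[-1] + 1]


master = [
    "Name Parent Child Property",
    "A1 World USA 1",
    "A2 USA Texas 2",
    "A3 Texas Houston 3",
    "A4 USA Austin 4",
    "A5 World USA 5",
    "A6 World Canada 6",
    "A7 Canada Toronto 7",
]

extracted = rows_between(master)
print("\n".join(extracted))
```

For real files, the list could come from `open("old.txt").read().splitlines()` and the result could be written to new.txt one line at a time.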

How can I split a pandas DataFrame column so that each split value gets its own column?

For example:
Input Data frame
Name Subjects
Ramesh Maths,Science
Rakesh MAths,Science,Social Studies
John Social Science, Lietrature
Output Data frame
Name Subject1 Subject2 Subjects3
Ramesh Maths Science NaN
Rakesh MAths Science Social Studies
John Social Science Literature Nan
You can create a new df from the result of str.split and then concat them:
In [66]:
subjects = df['Subjects'].str.split(',', expand=True)
subjects
Out[66]:
0 1 2
0 Maths Science None
1 MAths Science Social Studies
2 Social Science Lietrature None
In [71]:
subjects.columns = ['Subject ' + str(x + 1) for x in range(len(subjects.columns))]
subjects
Out[71]:
Subject 1 Subject 2 Subject 3
0 Maths Science None
1 MAths Science Social Studies
2 Social Science Lietrature None
In [74]:
concatenated = pd.concat([df,subjects], axis=1)
concatenated.drop('Subjects',axis=1,inplace=True)
concatenated
Out[74]:
Name Subject 1 Subject 2 Subject 3
0 Ramesh Maths Science None
1 Rakesh MAths Science Social Studies
2 John Social Science Lietrature None
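
Note that splitting on ',' alone keeps the space after the comma in John's row. A compact variant that also strips the surrounding whitespace is sketched below; the column names follow the Subject1, Subject2 pattern from the question, and the spelling of the sample data is kept as posted.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ramesh", "Rakesh", "John"],
    "Subjects": [
        "Maths,Science",
        "MAths,Science,Social Studies",
        "Social Science, Lietrature",
    ],
})

# Split into one column per value, then strip stray spaces in each column
subjects = df["Subjects"].str.split(",", expand=True)
subjects = subjects.apply(lambda col: col.str.strip())
subjects.columns = [f"Subject{i + 1}" for i in range(subjects.shape[1])]

out = pd.concat([df[["Name"]], subjects], axis=1)
print(out)
```

Rows with fewer subjects than the longest row get missing values in the trailing columns.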
