Python Program to split a new file from a master file

I have a master file which has 4 columns.
Name Parent Child Property
A1 World USA 1
A2 USA Texas 2
A3 Texas Houston 3
A4 USA Austin 4
A5 World USA 5
A6 World Canada 6
A7 Canada Toronto 7
I need to create a new file containing the records that lie between the first and last occurrence of the keyword (USA) in column 3.
The output file should be :
Name Parent Child Property
A1 World USA 1
A2 USA Texas 2
A3 Texas Houston 3
A4 USA Austin 4
A5 World USA 5

Here is my sample code; it runs on my test box, but its output is missing the rows that fall between the USA records:
#!/usr/bin/python
import re

oldfile = open("old.txt", "r")   # old.txt - source file with all contents
newfile = open("new.txt", "w")   # new.txt - new file to write the output
for line in oldfile:
    if re.match("(.*)USA(.*)", line):
        print >> newfile, line,
Output file:
cat new.txt
A1 World USA 1
A2 USA Texas 2
A4 USA Austin 4
A5 World USA 5
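One way to get the desired output is to remember the first and last line whose third column is USA and write that whole slice. A minimal sketch (assuming whitespace-separated columns, that the header row should always be kept, and Python 3 style):
with open("old.txt") as oldfile:
    lines = oldfile.readlines()

# indices of lines whose third column (Child) is exactly USA
hits = [i for i, line in enumerate(lines)
        if len(line.split()) >= 3 and line.split()[2] == "USA"]

with open("new.txt", "w") as newfile:
    newfile.write(lines[0])                      # header row
    if hits:
        newfile.writelines(lines[hits[0]:hits[-1] + 1])
On the sample data this writes the header plus rows A1 through A5, matching the desired output above.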

Related

Concatenate strings in multiple csv files into one dataframe along x and y axes (in pandas)

I have a folder with many csv files. They all look similar: they have the same column and row names, and every cell value is a string. I want to concatenate them along columns AND rows so that all the strings are concatenated into their respective cells.
Example:
file1.csv
0   1      2      3      4
b1  peter  house  ash    plane
b2  carl   horse  paul   knife
b3  mary   apple  linda  carrot
b4  hank   car    herb   beer
file2.csv
0   1     2       3       4
b1  mark  green   hello   band
b2  no    phone   spoon   goodbye
b3  red   cherry  charly  hammer
b4  good  yes     ok      simon
What I want is this result, where the string values are joined (space-separated) in their respective cells:
concatenated.csv
0   1           2             3             4
b1  peter mark  house green   ash hello     plane band
b2  carl no     horse phone   paul spoon    knife goodbye
b3  mary red    apple cherry  linda charly  carrot hammer
b4  hank good   car yes       herb ok       beer simon
I don't know how to do this in pandas in a Jupyter notebook.
I have tried a couple of things, but all of them kept either a separate set of rows or a separate set of columns.
If these are your dataframes:
import pandas as pd

df1_data = {
    1: ['peter', 'carl', 'mary', 'hank'],
    2: ['house', 'horse', 'apple', 'car']
}
df1 = pd.DataFrame(df1_data)
df2_data = {
    1: ['mark', 'no', 'red', 'good'],
    2: ['green', 'phone', 'cherry', 'yes']
}
df2 = pd.DataFrame(df2_data)
df1:
1 2
0 peter house
1 carl horse
2 mary apple
3 hank car
df2:
1 2
0 mark green
1 no phone
2 red cherry
3 good yes
You can reach your requested dataframe like this:
df = pd.DataFrame()
df[1] = df1[1] + ' ' + df2[1]
df[2] = df1[2] + ' ' + df2[2]
print(df)
Output:
1 2
0 peter mark house green
1 carl no horse phone
2 mary red apple cherry
3 hank good car yes
Loop for csv files:
Now, if you have a lot of csv files with names like file1.csv, file2.csv and so on, you can read them all into a dict d like this:
d = {}
for i in range(1, N + 1):   # N is the number of csv files
    d[i] = pd.read_csv('.../file' + str(i) + '.csv')
And build the dataframe you want like this:
concatenated_df = pd.DataFrame()
for i in range(1, M + 1):   # M is the number of value columns
    concatenated_df[i] = d[1].iloc[:, i] + ' ' + d[2].iloc[:, i] + ...
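To absorb the trailing "+ ..." for any number of files, here is a generalized sketch (assuming every file in d has the same shape, with the row labels b1..b4 in the first column):
concatenated_df = d[1].copy()
for i in range(2, N + 1):
    for col in concatenated_df.columns[1:]:   # skip the label column
        concatenated_df[col] = concatenated_df[col] + ' ' + d[i][col]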
If performance is not an issue, you can use pandas.DataFrame.applymap with pandas.Series.add:
out = df1[[0]].join(df1.iloc[:, 1:].applymap(lambda v: f"{v} ").add(df2.iloc[:, 1:]))
Or, for a large dataset, you can use pandas.concat with a listcomp:
out = (
    df1[[0]]
    .join(pd.concat([df1.merge(df2, on=0)
                        .filter(regex=rf"{p}_\w")
                        .agg(" ".join, axis=1)
                        .rename(idx)
                     for idx, p in enumerate(range(1, len(df1.columns)), start=1)],
                    axis=1))
)
Output:
print(out)
0 1 2 3 4
0 b1 peter mark house green ash hello plane band
1 b2 carl no horse phone paul spoon knife goodbye
2 b3 mary red apple cherry linda charly carrot hammer
3 b4 hank good car yes herb ok beer simon
Reading many csv files into a single DataFrame is a pretty common task, and is the first part of your question. You can find a suitable answer here.
After that, in an effort to allow you to perform this on all of the files at the same time, you can melt and pivot with a custom agg function like so:
import glob
import pandas as pd

# See the linked answer if you need help finding csv files in a different directory
all_files = glob.glob('*.csv')
df = pd.concat(pd.read_csv(f) for f in all_files)
output = (df.melt(id_vars='0')
            .pivot_table(index='0',
                         columns='variable',
                         values='value',
                         aggfunc=lambda x: ' '.join(x)))
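A quick sanity check of the melt/pivot step on two tiny frames shaped like the example (assuming the label column is named '0', as it is when these csv files are read):
import pandas as pd

df1 = pd.DataFrame({'0': ['b1', 'b2'], '1': ['peter', 'carl'], '2': ['house', 'horse']})
df2 = pd.DataFrame({'0': ['b1', 'b2'], '1': ['mark', 'no'], '2': ['green', 'phone']})
df = pd.concat([df1, df2])
out = (df.melt(id_vars='0')
         .pivot_table(index='0', columns='variable',
                      values='value', aggfunc=' '.join))
print(out.loc['b1', '1'])   # peter mark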

Get Subject and grade from string

Given this string
result = '''Check here to visit our corporate website
Results
Candidate Information
Examination Number
986542346
Candidate Name
JOHN DOE JAMES
Examination
MFFG FOR SCHOOL CANDIDATES 2021
Centre
LORDYARD
Subject Grades
DATA PROCESSING
B3
ECONOMICS
B3
CIVIC EDUCATION
B3
ENGLISH LANGUAGE
A1
MATHEMATICS
B3
AGRICULTURAL SCIENCE
OUTSTANDING
BIOLOGY
A1
CHEMISTRY
B2
PHYSICS
C5
C Information
Card Use
1 of 5'''
How can I extract the NAME (JOHN DOE JAMES), the SUBJECTS, and the GRADES into different lists?
I have tried this for the subjects and grades, but it is not giving me the desired results. Firstly, where a subject name is more than one word, it only returns the last word, e.g. instead of DATA PROCESSING I am getting PROCESSING. Secondly, it skips AGRICULTURAL SCIENCE (subject) and OUTSTANDING (grade).
Please note that I am new to using regex. Thanks in advance.
import re

pattern = re.compile(r'[A-Z]+\n{1}[A-Z][0-9]')
searches = pattern.findall(result)
if searches:
    print(searches)
for search in searches:
    print(search)
OUTPUT FOR THE FIRST PRINT STATEMENT:
['PROCESSING\nB3', 'ECONOMICS\nB3', 'EDUCATION\nB3', 'LANGUAGE\nA1', 'MATHEMATICS\nB3', 'BIOLOGY\nA1', 'CHEMISTRY\nB2', 'PHYSICS\nC5']
OUTPUT FOR THE SECOND PRINT STATEMENT:
PROCESSING
B3
ECONOMICS
B3
EDUCATION
B3
LANGUAGE
A1
MATHEMATICS
B3
BIOLOGY
A1
CHEMISTRY
B2
PHYSICS
C5
Here's a way to do this without using regexes. Note that I am assuming "OUTSTANDING" is intended to be a grade. That takes special processing.
result = '''Check here to visit our corporate website Results Candidate Information Examination Number 986542346 Candidate Name JOHN DOE JAMES Examination MFFG FOR SCHOOL CANDIDATES 2021 Centre LORDYARD Subject Grades DATA PROCESSING B3 ECONOMICS B3 CIVIC EDUCATION B3 ENGLISH LANGUAGE A1 MATHEMATICS B3 AGRICULTURAL SCIENCE OUTSTANDING BIOLOGY A1 CHEMISTRY B2 PHYSICS C5 C Information Card Use 1 of 5'''
i = result.find('Name')
j = result.find('Examination',i)
k = result.find('Centre')
l = result.find('Subject Grades')
m = result.find('Information Card')
name = result[i+5:j-1]
exam = result[j+12:k-1]
grades = result[l+15:m].split()
print("Name:", name)
print("Exam:", exam)
print("Grades:")
subject = []
for word in grades:
    if len(word) == 2 or word == 'OUTSTANDING':
        print(' '.join(subject), "......", word)
        subject = []
    else:
        subject.append(word)
Output:
Name: JOHN DOE JAMES
Exam: MFFG FOR SCHOOL CANDIDATES 2021
Grades:
DATA PROCESSING ...... B3
ECONOMICS ...... B3
CIVIC EDUCATION ...... B3
ENGLISH LANGUAGE ...... A1
MATHEMATICS ...... B3
AGRICULTURAL SCIENCE ...... OUTSTANDING
BIOLOGY ...... A1
CHEMISTRY ...... B2
PHYSICS ...... C5
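For comparison, a regex-based sketch that fixes both issues in the original attempt, applied to the multi-line result string from the question: spaces are allowed inside the subject group, and OUTSTANDING is accepted as a grade. The grade alternation ([A-F][1-9]|OUTSTANDING) is an assumption about the grading scheme:
import re

name = re.search(r'Candidate Name\n(.+)', result).group(1)
pairs = re.findall(r'([A-Z][A-Z ]+)\n([A-F][1-9]|OUTSTANDING)', result)
subjects = [s for s, g in pairs]   # ['DATA PROCESSING', 'ECONOMICS', ...]
grades = [g for s, g in pairs]     # ['B3', 'B3', ..., 'OUTSTANDING', ...]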

Data Cleaning: How to split a Pandas column

It has been some time since I last worked in Python.
I have the below data frame, with too many columns to name:
last/first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
Rogers Dave Toronto A4 HR
How do I remove caps in the last/first column and also split the last/first column on " "?
Goal:
last first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
rogers dave Toronto A4 HR
IIUC, you could use str.lower and str.split:
df[['last', 'first']] = (df.pop('last/first')
                           .str.lower()
                           .str.split(n=1, expand=True)
                         )
output:
location job department last first
0 Vancouver A1 servers smith john
1 Toronto A2 eng rogers steve
2 Toronto A4 HR rogers dave
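If you also want the columns back in the order shown in the goal, a small follow-up (assuming these five are all the columns):
df = df[['last', 'first', 'location', 'job', 'department']]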

Pandas DataFrame: How to extract the last two string type numbers from a column which doesn't always end with the two numbers

Sorry for the possible confusion in the title; here's what I'm trying to do:
I'm trying to merge my Parcels data frame with my Municipality Code look up table. The Parcels dataframe:
df1.head()
PARID OWNER1
0 B10 2 1 0131 WILSON ROBERT JR
1 B10 2 18B 0131 COMUNALE MICHAEL J & MARY ANN
2 B10 2 18D 0131 COMUNALE MICHAEL J & MARY ANN
3 B10 2 19F 0131 MONROE & JEFFERSON HOLDINGS LLC
4 B10 4 11 0131 NOEL JAMES H
The Municipality Code dataframe:
df_LU.head()
PARID Municipality
0 01 Allen Twp.
1 02 Bangor
2 03 Bath
3 04 Bethlehem
4 05 Bethlehem Twp.
The last two numbers in the first column of df1 ('31' in 'B10 2 1 0131') are the Municipality Code that I need to merge with the Municipality Code DataFrame. But among my 30,000 or so records, there are about 200 records that end with letters, as shown below:
PARID OWNER1
299 D11 10 10 0131F HOWARD THEODORE P & CLAUDIA S
1007 F10 4 3 0134F KNEEBONE JUDY ANN
1011 F10 5 2 0134F KNEEBONE JUDY ANN
1114 F8 18 10 0626F KNITTER WILBERT D JR & AMY J
1115 F8 18 8 0626F KNITTER DONALD
For these rows, the two numbers before the last letter are the Code that I need to extract out (like '31' in 'D11 10 10 0131F')
If I just use
pd.DataFrame(df1['PARID'].str[-2:])
This will give me:
PARID
...
299 1F
...
While what I need is:
PARID
...
299 31
...
My code for accomplishing this is pretty lengthy and pretty much involves:
1. Join all the rows that end with 2 numbers.
2. Find out the index of the rows that end with a letter in the 'PARID' field.
3. Join the results from step 2 again with the Municipality look up dataframe.
The code is here:
#Do the extraction and merge for the rows that end with numbers
df_2015= df1[['PARID','OWNER1']]
df_2015['PARID'] = df_2015['PARID'].str[-2:]
df_15r =pd.merge(df_2015, df_LU, how = 'left', on = 'PARID')
df_15r
#The pivot result for rows generated from above.
Result15_First = df_15r.groupby('Municipality').count()
Result15_First.to_clipboard()
#Check the ID field for rows that end with letters
check15 = df_2015['PARID'].unique()
check15
C = pd.DataFrame({'ID':check15})
NC = C.dropna()
LNC = NC[NC['ID'].str.endswith('F')]
MNC = NC[NC['ID'].str.endswith('A')]
F = [LNC, MNC]
NNC = pd.concat(F, axis = 0)
s = NNC['ID'].tolist()
s
# Identify the records in s
df_p15 = df_2015.loc[df_2015['PARID'].isin(s)]
df_p15
# Separate out a dataframe with just the rows that end with a letter
df15= df1[['PARID','OWNER1']]
df15c = df15[df15.index.isin(df_p15.index)]
df15c
#This step is to create the look up field from the new data frame, the two numbers before the ending letter.
df15c['PARID1'] = df15c['PARID'].str[-3:-1]
df15c
#Then I will join the look up table
df_15t =df15c.merge(df_LU.set_index('PARID'), left_on = 'PARID1', right_index = True)
df_15b = df_15t.groupby('Municipality').count()
df_15b
It wasn't until I finished that I realized how lengthy my code was for a seemingly simple task. If there is a better way to achieve this, which is surely the case, please let me know. Thanks.
You can use pandas string methods to extract the last two numbers:
df1['PARID'].str.extract(r'.*(\d{2})', expand=False)
You get
0 31
1 31
2 13
3 13
4 31
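To finish the task, a hedged follow-up that attaches the extracted code and merges with the lookup table (the Muni_Code column name is illustrative, and this assumes the codes in df_LU are two-character strings):
df1['Muni_Code'] = df1['PARID'].str.extract(r'.*(\d{2})', expand=False)
merged = df1.merge(df_LU, left_on='Muni_Code', right_on='PARID', how='left')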
You can use str.replace to remove all non-digits. After that, you should be able to use .str[-2:].
import pandas as pd

df1 = pd.DataFrame({'PARID': pd.Series(["M3N6V2 B7 13A 0131", "M3N6V2 B7 13B 0131",
                                        "Y2 7 B13 0213", "Y2 7 B14 0213", "M5 N4 12 0231A"]),
                    'Owner': pd.Series(["Tom", "Jerry", "Jack", "Chris", "Alex"])})
# pass regex=True explicitly; newer pandas no longer treats the pattern as a regex by default
df1['PARID'].str.replace(r'\D+', '', regex=True).str[-2:]
import pandas as pd
df = pd.DataFrame([['M3N6V2 B7 13A 0131','M3N6V2 B7 13B 0131','Y2 7 B13 0213', 'Y2 7 B14 0213', 'M5 N4 12 0231A' ], ['Tom', 'Jerry', 'Jack', 'Chris', 'Alex']])
df = df.T
df.columns = ['PARID', 'Owner']
print(df)
prints your left DataFrame
PARID Owner
0 M3N6V2 B7 13A 0131 Tom
1 M3N6V2 B7 13B 0131 Jerry
2 Y2 7 B13 0213 Jack
3 Y2 7 B14 0213 Chris
4 M5 N4 12 0231A Alex
and now for your right DataFrame
import numpy as np

df['IDPART'] = None
for row in df.index:
    if df.at[row, 'PARID'][-1].isalpha():
        df.at[row, 'IDPART'] = df.at[row, 'PARID'][-3:-1]
    else:
        df.at[row, 'IDPART'] = df.at[row, 'PARID'][-2:]
df['IDPART'] = df['IDPART'].apply(int)   # converting the column to be joined to an integer column
print(df)
gives:
PARID Owner IDPART
0 M3N6V2 B7 13A 0131 Tom 31
1 M3N6V2 B7 13B 0131 Jerry 31
2 Y2 7 B13 0213 Jack 13
3 Y2 7 B14 0213 Chris 13
4 M5 N4 12 0231A Alex 31
and then merge
merged = pd.merge(df, otherdf, how = 'left', left_on = 'IDPART', right_on = 'PARID', left_index=False, right_index=False)
print(merged)
gives:
PARID_x Owner IDPART PARID_y Municipality
0 M3N6V2 B7 13A 0131 Tom 31 31 Tatamy
1 M3N6V2 B7 13B 0131 Jerry 31 31 Tatamy
2 Y2 7 B13 0213 Jack 13 13 Allentown
3 Y2 7 B14 0213 Chris 13 13 Allentown
4 M5 N4 12 0231A Alex 31 31 Tatamy
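The loop can also be replaced with a single vectorized expression; a sketch assuming the code is always the last two digits, optionally followed by one trailing letter:
df['IDPART'] = df['PARID'].str.extract(r'(\d{2})[A-Z]?$', expand=False).astype(int)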

Ignore backslash when reading tsv file in python

I have a large sep="|" tsv with an address field containing a bunch of values like the following:
...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...
This ends up as:
line1) ...xxx|yyy|Level 1 2 xxx Street\
line2) (MYCompany)|...
I tried running quote=2 in read_table with Pandas to turn non-numeric values into strings, but it still treats the backslash as a newline. What is an efficient way to handle rows whose field values contain a backslash that escapes the newline? Is there a way to ignore the newline after \?
Ideally it will prepare the data file so it can be read into a dataframe in pandas.
Update: showing 5 physical lines, with the breakage in the 3rd record:
1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie
Here is another solution using regex:
import pandas as pd
import re

f = open('input.tsv')
fl = f.read()
f.close()
# Replace backslash-newline with a plain backslash using regex
fl = re.sub(r'\\\n', r'\\', fl)
o = open('input_fix.tsv', 'w')
o.write(fl)
o.close()
cols = range(1, 17)
# Prime the number of columns by specifying names for each column;
# this takes care of the issue of a variable number of columns.
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)
This produces a correctly parsed DataFrame.
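A simpler variant of the same fix, without regex (assuming the only backslashes in the file are the ones escaping line breaks, so they can be dropped entirely rather than kept):
with open('input.tsv') as f:
    fixed = f.read().replace('\\\n', '')   # join broken lines and drop the backslash
with open('input_fix.tsv', 'w') as o:
    o.write(fixed)
Dropping the backslash matches the expected output later in this thread, where the address reads 'Level XXX Margaret Street(My Company)'.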
I think you can first try read_csv with a sep that is NOT in the values; it seems to read correctly:
import pandas as pd
import io
temp=u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print df
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
Then you can create a new file with to_csv and read it back with read_csv and sep="|":
df.to_csv('myfile.csv', header=False, index=False)
print pd.read_csv('myfile.csv', sep="|", header=None)
0 1
0 49 XXX Ave Australia
1 u7 38-46 South Street Australia
2 XXX Margaret StreetNew South Wales Australia
3 Po box ZZZ Australia
The next solution does not create a new file; it writes to the variable output and then uses read_csv with io.StringIO:
import pandas as pd
import io
temp=u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print df
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
output = df.to_csv(header=False, index=False)
print output
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia
print pd.read_csv(io.StringIO(u""+output), sep="|", header=None)
0 1
0 49 XXX Ave Australia
1 u7 38-46 South Street Australia
2 XXX Margaret StreetNew South Wales Australia
3 Po box ZZZ Australia
If I test it on your data, it seems that rows 1 and 2 have 14 fields while the next two have 15 fields.
So I removed the last item from rows 3 and 4; maybe this is only a typo (I hope so):
import pandas as pd
import io
temp=u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print df
0
0 1788768|1831171|208434489|2014-08-14 13:40:02|...
1 1788772|1831177|202234489|2014-08-14 13:41:37|...
2 1788776|1831182|205234489|2014-08-14 13:42:41|...
3 1788780|1831186|202634489|2014-08-14 13:43:46|...
output = df.to_csv(header=False, index=False)
print pd.read_csv(io.StringIO(u""+output), sep="|", header=None)
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13
0 3025 Melbourne
1 2116 Sydney
2 2000 Sydney
3 2444 NSW Other
But if the data are correct, add the parameter names=range(15) to read_csv:
print pd.read_csv(io.StringIO(u""+output), sep="|", names=range(15))
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13 14
0 3025 Melbourne NaN
1 2116 Sydney NaN
2 2000 Sydney Sydney
3 2444 NSW Other Port Macquarie
