Import .txt to Pandas Dataframe With Multiple Delimiters - python

I would like to import .txt file into a Pandas Dataframe, my .txt file:
Ann Gosh 1234567892008-12-15Irvine CA45678A9Z5Steve Ryan
Yosh Dave 9876543212009-04-18St. Elf NY12345P8G0Brad Tuck
Clair Simon 3245674572008-12-29New Jersey NJ56789R9B3Dan John
The dataframe should look like this:
FirstN LastN SID Birth City States Postal TeacherFirstN TeacherLastN
Ann Gosh 123456789 2008-12-15 Irvine CA A9Z5 Steve Ryan
Yosh Dave 987654321 2009-04-18 St. Elf NY P8G0 Brad Tuck
Clair Simon 324567457 2008-12-29 New Jersey NJ R9B3 Dan John
I tried multiple ways including this:
df = pd.read_csv('student.txt', sep=r'\s+', engine='python', header=None, index_col=False)
to import the raw file into the dataframe, planning to clean the data for each column afterwards, but it's too complicated. Could you please help me? (The Postal here is just the 4 characters before TeacherFirstN.)

You can start by setting names on your existing columns, and then apply a regex to the data while creating the new columns.
To fix the "single space delimiter" issue in your output, you can define "at least 2 space characters", e.g. r'\s{2,}', as the delimiter, which fixes the issue for St. Elf in the City names.
An example:
import pandas as pd
import re

df = pd.read_csv(
    'test.txt',
    sep=r'\s{2,}',
    engine='python',
    header=None,
    index_col=False,
    names=["FirstN", "LastN", "FULLSID", "TeacherData", "TeacherLastN"]
)

sid_pattern = re.compile(r'(\d{9})(\d+-\d+-\d+)(.*)', re.IGNORECASE)
df['SID'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(1), axis=1)
df['Birth'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(2), axis=1)
df['City'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(3), axis=1)

teacherdata_pattern = re.compile(r'(.{2})([\dA-Z]+\d)(.*)', re.IGNORECASE)
df['States'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(1), axis=1)
df['Postal'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(2)[-4:], axis=1)
df['TeacherFirstN'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(3), axis=1)

del df['FULLSID']
del df['TeacherData']
print(df)
Output:
FirstN LastN TeacherLastN SID Birth City States Postal TeacherFirstN
0 Ann Gosh Ryan 123456789 2008-12-15 Irvine CA A9Z5 Steve
1 Yosh Dave Tuck 987654321 2009-04-18 St. Elf NY P8G0 Brad
2 Clair Simon John 324567457 2008-12-29 New Jersey NJ R9B3 Dan
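As a side note, the apply calls above run each regex three times per row; str.extract can pull all groups in one vectorized pass. A minimal sketch of the same extraction (written against the FULLSID and TeacherData columns from the read_csv step above, with the group patterns unchanged):
# vectorized alternative: extract all groups at once instead of apply() per group
sid_parts = df['FULLSID'].str.extract(r'(?P<SID>\d{9})(?P<Birth>\d+-\d+-\d+)(?P<City>.*)')
teacher_parts = df['TeacherData'].str.extract(r'(?P<States>.{2})(?P<Postal>[\dA-Z]+\d)(?P<TeacherFirstN>.*)')
teacher_parts['Postal'] = teacher_parts['Postal'].str[-4:]   # keep only the last 4 characters
df = pd.concat([df.drop(columns=['FULLSID', 'TeacherData']), sid_parts, teacher_parts], axis=1)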

Related

Text to columns in pandas dataframe

I have a pandas dataset like below:
import pandas as pd

data = {'id': ['001', '002', '003'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
                    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]}
df = pd.DataFrame(data)
print(df)
I need to split the address column on \n and create new columns like name, address line 1, City, State, Zipcode, Country, like below:
id Name addressline1 City State Zipcode Country
1 William J. Clare 290 Valley Dr. Casper WY 82604 United States
2 null 1180 Shelard Tower Minneapolis MN 55426 United States
3 William N. Barnard 145 S. Durbin Casper WY 82601 United States
I am learning Python and have been working on this since morning. Any help will be greatly appreciated.
Thanks,
Right now, pandas returns the table with 2 columns. If you look at the value in the second column, the essential information is separated by commas. Therefore, if you saved your dataframe to df, you can do the following:
df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
df['country'] = df['address'].apply(lambda x: x.split(',')[2])
Now, you have additional three columns in your dataframe, the last one contains the full information about the country already. Now from the first two columns that you have created you can extract the info you need in a similar way.
df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])
Now you should have all the columns you need. You can remove the excess columns with:
df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)
Of course, it all can be done faster and with fewer lines of code, but I think it is the clearest way of doing it, which I hope you will find useful. If you don't understand what I did there, check the documentation for split and join methods, and also for apply method, native to pandas.
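If you prefer a single pass, str.extract with named groups can do the whole split at once. A minimal sketch, assuming every address ends with "USA, <Country>" and contains a two-letter state plus a 5-digit zipcode (the name line is optional, so the second row comes out as NaN, matching the null in your desired output):
pattern = (r'(?:(?P<Name>[^\n]+)\n)?'                                    # optional name line
           r'(?P<addressline1>[^\n]+)\n'                                 # street address
           r'(?P<City>[^,]+), (?P<State>[A-Z]{2}) (?P<Zipcode>\d{5})\n'  # city, state, zip
           r'USA, (?P<Country>.+)')                                      # country
out = pd.concat([df[['id']], df['address'].str.extract(pattern)], axis=1)
print(out)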

String literal matching between words in two different dataframe (dfs) and generate a new dataframe

I have two dataframes df1 and df2
df1 =
University    School          Student first name    last name    nick name
AAA           Law             John                  Mckenzie     Stevie
BBB           Business        Steve                 Savannah     JO
CCC           Engineering     Mark                  Justice      Fre
DDD           Arts            Stuart                Little       Rah
EEE           Life science    Adam                  Johnson      meh
120 rows X 5 columns
df2 =
Statement
Stuart had a headache last nigh which was due to th……
Rah basically found a new found friend which lead to the……
Gerome got a brand new watch which was……….
Adam was found chilling all through out his life……
Savannah is such a common name that……..
3000 rows X 1 column
The aim is to form df3: match the string literals from every cell in the columns "Student first name", "last name" and "nick name" against each Statement, producing the table below.
Df3 =
Statement                                                Matching    University    School
Stuart had a headache last nigh which was due to th…    Stuart      DDD           Arts
Rah basically found a new found friend which lead to    Rah         DDD           Arts
Gerome got a brand new watch which was……….              NA          NA            NA
Adam was found chilling all through out his life……      Adam        EEE           Life science
Savannah is such a common name that……..                 Savannah    BBB           Business
3000 rows X 4 columns
You can melt and merge:
import re

df1_melt = df1.melt(['University', 'School'], value_name='Match')
regex = '|'.join(map(re.escape, df1_melt['Match']))

out = df2.join(
    df1_melt[['Match', 'University', 'School']]
    .merge(df2['Statement']
           .str.extract(f'({regex})', expand=False)
           .rename('Match'),
           how='right', on='Match')
)
output:
Statement Match University School
0 Stuart had a headache last nigh which was due to the Stuart DDD Arts
1 Rah basically found a new found friend which lead to the Rah DDD Arts
2 Gerome got a brand new watch which was NaN NaN NaN
3 Adam was found chilling all through out his life Adam EEE Life science
4 Savannah is such a common name that Savannah BBB Business
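For reference, the intermediate frame produced by the melt keeps University and School as identifier columns and stacks the three name columns into a single Match column, which is what allows one merge instead of three (first rows sketched with the sample data above):
df1_melt = df1.melt(['University', 'School'], value_name='Match')
print(df1_melt.head())
#   University        School            variable   Match
# 0        AAA           Law  Student first name    John
# 1        BBB      Business  Student first name   Steve
# 2        CCC   Engineering  Student first name    Mark
# 3        DDD          Arts  Student first name  Stuart
# 4        EEE  Life science  Student first name    Adam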
Naïve approach, loop columns to find matches then loop to merge on matches:
import re
import pandas as pd

columns_to_match = ["Student first name", "last name", "nick name"]

dfs = []
for column in columns_to_match:
    search_strings = df1[column].unique().tolist()
    regex = "|".join(map(re.escape, search_strings))
    df2["Matching"] = df2["Statement"].str.extract(f"({regex})")
    dfs.append(df2.dropna())
matched_df = pd.concat(dfs).reset_index(drop=True)

dfs = []
for column in columns_to_match:
    final_df = df1.merge(matched_df, how="inner", left_on=column, right_on="Matching")
    dfs.append(final_df)
final_df = pd.concat(dfs).reset_index(drop=True).drop(columns=columns_to_match)
My answer makes the following assumptions:
The index on df1 serves as the student ID and is unique.
That you only want to fill the first student found. A statement like "John and Steve are friends" will be assigned to John.
import re
import pandas as pd

assigned = pd.Series([False] * len(df2))
df3 = df2.copy()

# Loop through each student, taking their first, last and nick names
for idx, names in df1[["Student first name", "last name", "nick name"]].iterrows():
    # If all statements have been assigned, terminate the loop
    if assigned.all():
        break
    # Combine the student's first, last and nick names into a regex pattern
    pattern = f"({'|'.join(names.map(re.escape))})"
    # Find the pattern in each UNASSIGNED statement. We only search unassigned
    # statements to lower the number of searches.
    match = df3.loc[~assigned, "Statement"].str.extract(pattern, expand=False)
    # Mark the statement as assigned
    cond = ~assigned & match.notna()
    assigned[cond] = True
    # Fill in the student's info
    df3.loc[cond, "Match"] = match[cond]
    df3.loc[cond, "University"] = df1.loc[idx, "University"]
    df3.loc[cond, "School"] = df1.loc[idx, "School"]
Rather than iterating through each cell, you could create three dataframes (merging with all three columns separately) and concatenate the results into one dataframe.
df2['Matching'] = df2['Statement'].str.split().str[0]

dfs = []
for col in ['Student first name', 'last name', 'nick name']:
    df_temp = pd.merge(df2, df1[[col, 'University', 'School']].rename(columns={col: 'Matching'}), how='left')
    dfs.append(df_temp)
df3 = pd.concat(dfs).drop_duplicates()
Note that this assumes the name to match is always the first word of the Statement.

Python dataframe from 2 text files (different number of columns)

I need to make a dataframe from two txt files.
The first txt file looks like this: Street_name space id.
The second txt file looks like this: City_name space id.
Example:
text file 1:
Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567
text file 2:
Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567
I need to make one dataframe out of this. Sometimes there is just one word for Street_name, and sometimes more. The same goes for City_name.
I get an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 3 because I'm trying to put both words of the street name into the same column, but I don't know how to do it. I want one column for the street name (no matter if it consists of one or more words), one for the city name, and one for the id.
I want a df with 3 rows and 3 cols.
Thanks!
Edit: both text files are huge (each 50 mil rows +), so I need this code not to break and to be optimised for large files.
It is NOT correct CSV, so you may need to read it on your own.
You can use a normal open() and read(), then split on newlines to create a list of lines. Later you can use a for-loop with line.rsplit(" ", 1) to split each line on its last space.
Minimal working example:
I use io to simulate a file in memory, so everyone can simply copy and test it, but you should use open().
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''

import io

#with open('filename') as fh:
with io.StringIO(text) as fh:
    lines = fh.read().splitlines()
print(lines)

lines = [line.rsplit(" ", 1) for line in lines]
print(lines)

import pandas as pd

df = pd.DataFrame(lines, columns=['name', 'number'])
print(df)
Result:
['Roseberry st 1234', 'Brooklyn st 4321', 'Wolseley 1234567']
[['Roseberry st', '1234'], ['Brooklyn st', '4321'], ['Wolseley', '1234567']]
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
EDIT:
read_csv can use a regex to define the separator (i.e. sep=r"\s+" for many spaces), and it can even use a lookahead/lookbehind ((?=...)/(?<=...)) to check if there is a digit after the space without catching it as part of the separator.
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
import pandas as pd
#df = pd.read_csv('filename', names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
df = pd.read_csv(io.StringIO(text), names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
print(df)
Result:
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
And later you can connect both dataframes using .join() or .merge() with the parameter on=, like in an SQL query.
text1 = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
text2 = '''Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567'''
import io
import pandas as pd
df1 = pd.read_csv(io.StringIO(text1), names=['street name', 'id'], sep=r'\s(?=\d)', engine='python')
df2 = pd.read_csv(io.StringIO(text2), names=['city name', 'id'], sep=r'\s(?=\d)', engine='python')
print(df1)
print(df2)
df = df1.merge(df2, on='id')
print(df)
Result:
street name id
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
city name id
0 Winnipeg 4321
1 Winnipeg 1234
2 Ste Anne 1234567
street name id city name
0 Roseberry st 1234 Winnipeg
1 Brooklyn st 4321 Winnipeg
2 Wolseley 1234567 Ste Anne
Pandas doc: Merge, join, concatenate and compare
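Since the question's edit mentions files of 50M+ rows, reading in chunks is one way to keep memory bounded; a sketch, under the assumption that the same lookahead separator fits your data (the chunk size is an arbitrary choice):
import pandas as pd

# stream the file in chunks instead of loading it all at once
chunks = pd.read_csv('text1.txt', names=['street name', 'id'],
                     sep=r'\s(?=\d)', engine='python', chunksize=1_000_000)
df1 = pd.concat(chunks, ignore_index=True)
If even the concatenated result is too large for memory, process each chunk separately instead, e.g. merge each chunk against the smaller table and append the results to an output file.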
There's nothing that I'm aware of in pandas that does this automatically.
Below, I built a script that merges those addresses (addy + st) into a single column, then merges the two data frames into one based on the "id".
I assume your actual text files are significantly larger, so assuming they follow the pattern set in the two examples, this script should work fine.
Basically, this code turns each line of text in the file into a list, then combines lists of length 3 into length 2 by combining the first two list items.
After that, it turns the "list of lists" into a dataframe and merges those dataframes on column "id".
Couple caveats:
Make sure you set the correct text file paths
Make sure the first line of each text file contains two single-word column headers (i.e. "address id" or "city id")
Make sure each text file id column header is named "id"
import pandas as pd

# set both text file paths (you may need the full path, i.e. C:\Users\Name\bla\bla\bla\text1.txt)
text_path_1 = r'text1.txt'
text_path_2 = r'text2.txt'

# reads first text file
with open(text_path_1) as f1:
    text_file_1 = f1.readlines()

# reads second text file
with open(text_path_2) as f2:
    text_file_2 = f2.readlines()

# function that massages data into two columns (to put "st" into same column as address name)
def data_massager(text_file_lines):
    data_list = []
    for item in text_file_lines:
        stripped_item = item.strip('\n')
        split_stripped_item = stripped_item.split(' ')
        if len(split_stripped_item) == 3:
            split_stripped_item[0:2] = [' '.join(split_stripped_item[0:2])]
        data_list.append(split_stripped_item)
    return data_list

# runs function on both text files
data_list_1 = data_massager(text_file_1)
data_list_2 = data_massager(text_file_2)

# creates dataframes from both text files
df1 = pd.DataFrame(data_list_1[1:], columns=data_list_1[0])
df2 = pd.DataFrame(data_list_2[1:], columns=data_list_2[0])

# merges data based on id (make sure both text files' id column is named "id")
merged_df = df1.merge(df2, how='left', on='id')

# prints dataframe (assuming you're using something like jupyter-lab)
merged_df
pandas has strong support for strings. You can make the lines of each file into a Series and then use a regular expression to separate the fields into separate columns. I assume that "id" is the common value that links the two datasets, so it can become the dataframe index and the columns can just be added together.
import pandas as pd

with open("text1.txt") as f:
    street_series = pd.Series([line.strip() for line in f])
street_df = street_series.str.extract(r"(.*?) (\d+)$")
del street_series
street_df.rename({0: "street", 1: "id"}, axis=1, inplace=True)
street_df.set_index("id", inplace=True)
print(street_df)

with open("text2.txt") as f:
    city_series = pd.Series([line.strip() for line in f])
city_df = city_series.str.extract(r"(.*?) (\d+)$")
del city_series
city_df.rename({0: "city", 1: "id"}, axis=1, inplace=True)
city_df.set_index("id", inplace=True)
print(city_df)

street_df["city"] = city_df["city"]
print(street_df)

Applying a custom function to each row in a column in a dataframe

I have a bit of code which pulls the latitude and longitude for a location. It is here:
import urllib.parse
import requests

address = 'New York University'
url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) + '?format=json'
response = requests.get(url).json()
print(response[0]["lat"])
print(response[0]["lon"])
I'm wanting to apply this as a function to a long column of "address".
I've seen loads of questions about 'apply' and 'map', but they're almost all simple math examples.
Here is what I tried last night:
def locate(address):
    response = requests.get(url).json()
    print(response[0]["lat"])
    print(response[0]["lon"])
    return

df['lat'] = df['lat'].map(locate)
df['lon'] = df['lon'].map(locate)
This ended up just applying the first row lat / lon to the entire csv.
What is the best method to turn the code into a custom function and apply it to each row?
Thanks in advance.
EDIT: Thank you @PacketLoss for your assistance. I'm getting an IndexError: list index out of range, but it does work on his sample dataframe.
Here is the read_csv I used to pull in the data:
df = pd.read_csv('C:\\Users\\CIHAnalyst1\\Desktop\\InstitutionLocations.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode', encoding = "utf-8", warn_bad_lines=False)
Here is a text copy of the rows from the dataframe:
address
0 GRAND CANYON UNIVERSITY
1 SOUTHERN NEW HAMPSHIRE UNIVERSITY
2 WESTERN GOVERNORS UNIVERSITY
3 FLORIDA INTERNATIONAL UNIVERSITY - UNIVERSITY ...
4 PENN STATE UNIVERSITY UNIVERSITY PARK
... ...
4292 THE ART INSTITUTES INTERNATIONAL LLC
4293 INTERCOAST - ONLINE
4294 CAROLINAS COLLEGE OF HEALTH SCIENCES
4295 DYERSBURG STATE COMMUNITY COLLEGE COVINGTON
4296 ULTIMATE MEDICAL ACADEMY - NY
You need to return your values from your function, or nothing will happen.
We can use apply here and pass the address from the df as well. Note the if response: guard below; the IndexError from your edit happens when Nominatim returns an empty list for an address, and checking the response before indexing avoids it.
import urllib.parse
import requests
import pandas as pd

data = {'address': ['New York University', 'Sydney Opera House', 'Paris', 'SupeRduperFakeAddress']}
df = pd.DataFrame(data)

def locate(row):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(row['address']) + '?format=json'
    response = requests.get(url).json()
    if response:
        row['lat'] = response[0]['lat']
        row['lon'] = response[0]['lon']
    return row

df = df.apply(locate, axis=1)
Outputs
address lat lon
0 New York University 40.72925325 -73.99625393609625
1 Sydney Opera House -33.85719805 151.21512338473752
2 Paris 48.8566969 2.3514616
3 SupeRduperFakeAddress NaN NaN

How to check if a word is in each row of a pandas dataframe

I have a pandas dataframe with a column designated to town names. After each town name I am adding the word "NSW" (e.g. "Sydney" will become "Sydney NSW"). However, this means that even when a town already has NSW written, the script will add it again (e.g. "Narara NSW" will become "Narara NSW NSW"). How can I check if the name already has NSW and only add the string if it is not present? Here is my code so far:
#Adds "NSW" to the end of each town in the dataframe and then adds these changes to the csv
df['FullAddress'] = df['FullAddress'] + ' NSW'
print(df)
df.to_csv('latLongTest.csv', index=False)
Use pandas.Series.where with pandas.Series.str.endswith:
s = pd.Series(["Sydney", "Narara NSW"])
s.where(s.str.endswith("NSW"), lambda x: x + " NSW")
Output:
0 Sydney NSW
1 Narara NSW
dtype: object
My personal preference is to usually use np.where() in a situation like this:
df['FullAddress'] = np.where((df['FullAddress'].str.endswith(' NSW')), df['FullAddress'], df['FullAddress'] + ' NSW')
It is vectorized and similar to an excel if statement IF(CONDITION, THEN, ELSE).
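A self-contained version of the same idea (the sample frame here is made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'FullAddress': ['Sydney', 'Narara NSW']})
# keep rows that already end with ' NSW', append it everywhere else
df['FullAddress'] = np.where(df['FullAddress'].str.endswith(' NSW'),
                             df['FullAddress'],
                             df['FullAddress'] + ' NSW')
print(df)
#   FullAddress
# 0  Sydney NSW
# 1  Narara NSW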
import pandas as pd
df = pd.DataFrame({'FullAddress': ['Sydney', 'Sydney NSW', 'Narara NSW', 'Narara']})
df['FullAddress'] = df.apply(lambda x: x.FullAddress if x.FullAddress.endswith(' NSW') else x.FullAddress + ' NSW', axis=1)
print(df)
Output:
FullAddress
0 Sydney NSW
1 Sydney NSW
2 Narara NSW
3 Narara NSW
