Newer programmer here, deeply appreciate any help this knowledgeable community is willing to provide.
I have a column of 140,000 text strings (company names) in a pandas DataFrame. I want to strip all whitespace in and around the strings, remove all punctuation, substitute specific substrings, and uniformly convert everything to lowercase. I then want to take the first 10 characters of each string and store them in a new DataFrame column.
Here is a reproducible example.
import string
import pandas as pd

data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech#NYCityScape"]
df = pd.DataFrame(data, columns=['co_name'])

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# applying the remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)
# this next step replaces 'Saint' with 'st' to standardize;
# I may want to make other substitutions, but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')
# remove whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')
# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()
# select the first 10 characters of each string
df['co_name_transform'] = df.co_name_transform.str[0:10]
print(df)
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
How can I put all these steps into a single function like this?
def clean_text(df[col]):
    for co in co_name:
        do_all_the_steps
    return df[new_col]
Thank you
You don't need a function to do this. Try the following one-liner.
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
The final output will be:
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
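That said, if you do want the steps wrapped in a reusable function as your question sketches, a minimal version could look like this (a sketch using the same regex and 'Saint' rule as above; the function, column, and parameter names are just placeholders):

```python
import pandas as pd

def clean_text(df, col, new_col, n=10):
    # same steps as the one-liner: drop non-alphanumerics,
    # standardize 'Saint', lowercase, keep the first n characters
    df[new_col] = (
        df[col]
        .str.replace('[^A-Za-z0-9-]+', '', regex=True)
        .str.replace('Saint', 'st')
        .str.lower()
        .str[:n]
    )
    return df

df = pd.DataFrame({'co_name': ["West Georgia Co", "LRHS Saint Jose's Grocery"]})
df = clean_text(df, 'co_name', 'co_name_transform')
```

This keeps the whole chain in one place, so adding another substitution later only means adding one more .str.replace link.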
You can do all the steps in the function you pass to the apply method:
import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
Another solution, similar to the previous one, but with the "to_replace" patterns in one dictionary, so you can add more items to replace. Also, unlike the previous solution, this one selects only the first 10 rows.
data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery"] + ["Optitech#NYCityScape"] * 9
df = pd.DataFrame(data, columns=['co_name'])

to_replace = {'[^A-Za-z0-9-]+': '', 'Saint': 'st'}
for i in to_replace:
    # note: after the first pass the text is already lowercase,
    # so 'Saint' no longer matches (see 'saint' in row 3 below)
    df['co_name'] = df['co_name'].str.replace(i, to_replace[i], regex=True).str.lower()
df['co_name'][0:10]
Result:
0 westgeorgiaco
1 wbcarellclockmakers
2 spineorthopedicllc
3 lrhssaintjosesgrocery
4 optitechnycityscape
5 optitechnycityscape
6 optitechnycityscape
7 optitechnycityscape
8 optitechnycityscape
9 optitechnycityscape
Name: co_name, dtype: object
Previous solution (doesn't select only the first 10 rows):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
Result:
0 westgeorgi
1 wbcarellcl
2 spineortho
3 lrhssaintj
4 optitechny
5 optitechny
6 optitechny
7 optitechny
8 optitechny
9 optitechny
10 optitechny
11 optitechny
12 optitechny
Name: co_name_transform, dtype: object
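A related option (a sketch, not part of the answer above): Series.replace accepts the whole dictionary in a single call with regex=True, which avoids the explicit loop. Lowercasing is done once at the end, so the 'Saint' pattern still matches its original casing:

```python
import pandas as pd

df = pd.DataFrame({'co_name': ["West Georgia Co", "LRHS Saint Jose's Grocery"]})

# both patterns are applied by one replace call when regex=True
to_replace = {'Saint': 'st', '[^A-Za-z0-9-]+': ''}
df['co_name_transform'] = (
    df['co_name'].replace(to_replace, regex=True).str.lower().str[:10]
)
```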
Related
I have this dataframe, and I want to extract the cities into a separate column. As you can see, the format is not consistent, and the city can be anywhere in the row. How can I extract only the cities into a new column?
Hint: we are talking about German cities here. Maybe I could find a dictionary that lists all German cities and somehow compare it with my dataset?
Here is a dictionary of German cities: https://gist.github.com/embayer/772c442419999fa52ca1
Dataframe
Adresse
0 Karlstr 10, 10 B, 30,; 04916 Hamburg
1 München Dorfstr. 28-55, 22555
2 Marnstraße. Berlin 12, 45666 Berlin
3 Musterstr, 24855 Dresden
... ...
850 Muster Hausweg 11, Hannover, 56668
851 Mariestr. 4, 48669 Nürnberg
852 Hilden Weederstr 33-55, 56889
853 Pt-gaanen-Str. 2, 45883 Potsdam
Output
Cities
0 Hamburg
1 München
2 Berlin
3 Dresden
... ...
850 Hannover
851 Nürnberg
852 Hilden
853 Potsdam
You could extract into a list all the cities from the dictionary you provided (I assume it's the 'stadt' key), and then use str.findall on your column:
cities_ = [c['stadt'] for c in cities]
df.Adresse.str.findall(r'|'.join(cities_))
>>>
0 [Karlstr, Hamburg]
1 []
2 []
3 []
4 []
5 []
6 []
7 []
8 []
Name: Adresse, dtype: object
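One caveat with a plain alternation like the above: a short city name can match inside a longer word (note 'Karlstr' sneaking into the first row). Escaping each name and anchoring the pattern with word boundaries avoids that, and str.extract keeps just the first hit per row. A sketch with an inlined stand-in for the city list from the gist:

```python
import re
import pandas as pd

df = pd.DataFrame({'Adresse': [
    'Karlstr 10, 10 B, 30,; 04916 Hamburg',
    'München Dorfstr. 28-55, 22555',
    'Musterstr, 24855 Dresden',
]})
# tiny stand-in for the list built from the gist; 'Karl' shows the boundary at work
cities_ = ['Hamburg', 'München', 'Dresden', 'Karl']

# \b keeps 'Karl' from matching inside 'Karlstr'; re.escape guards special chars
pattern = r'\b(' + '|'.join(map(re.escape, cities_)) + r')\b'
df['Cities'] = df['Adresse'].str.extract(pattern, expand=False)
```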
You can simply use str.extract, since all the names are between pairs of stars.
df["cities"] = df["Adress"].str.extract(r'\*\*(\w+)\*\*')
Since it seems the stars are not present in your file, you can do it differently.
Use the dictionary of cities, called cities, from the file you linked, but keep only a unique collection (called a set) of the city names.
german_cities = set(map(lambda x: x['stadt'], cities))
Then, we'll split the address string for each row and lookup in the German cities dictionary.
Since the first argument of apply is the series itself, we just need to tell it to have a look at the set of German cities.
def lookup_cities(string, cities):
    splits = string.replace(",", "").split(" ")
    for s in splits:
        if s in cities:
            return s
    return "NaN"
df["Adress"].apply(lookup_cities, args=(german_cities,))
Now, if you find any "NaN", then either a city in your document has a typo or it has several spellings; you'll have to investigate yourself.
P.S.: I had to remove all the spaces in the cities file, otherwise the names wouldn't match. It was just a matter of using find-and-replace-all in my editor.
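As a self-contained sketch of the approach above (inlining a tiny stand-in for the cities list from the gist, and a hypothetical Adress column):

```python
import pandas as pd

# stand-in for the gist data; the real list has one dict per city
cities = [{'stadt': 'Hamburg'}, {'stadt': 'Dresden'}, {'stadt': 'Hannover'}]
german_cities = set(map(lambda x: x['stadt'], cities))

def lookup_cities(string, cities):
    # drop commas, split on spaces, return the first token that is a known city
    splits = string.replace(",", "").split(" ")
    for s in splits:
        if s in cities:
            return s
    return "NaN"

df = pd.DataFrame({'Adress': ['Karlstr 10, 04916 Hamburg',
                              'Musterstr, 24855 Dresden']})
df['Cities'] = df['Adress'].apply(lookup_cities, args=(german_cities,))
```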
You can use regular expression to extract the city names, as they are indicated by **:
import re
import pandas as pd
df = pd.DataFrame({"Adresse": ["Karlstr 10, 10 B, 30,; 04916 **Hamburg**", "**München** Dorfstr. 28-55, 22555", "Marnstraße. Berlin 12, 45666 **Berlin**", "Musterstr, 24855 **Dresden**"]})
df['Cities'] = [re.findall(r".*\*\*(.*)\*\*", address)[0] for address in df['Adresse']]
This results in:
df
Adresse Cities
0 Karlstr 10, 10 B, 30,; 04916 **Hamburg** Hamburg
1 **München** Dorfstr. 28-55, 22555 München
2 Marnstraße. Berlin 12, 45666 **Berlin** Berlin
3 Musterstr, 24855 **Dresden** Dresden
I'm trying to clean a column which contains ID numbers starting with S followed by 7 digits, e.g. 'S1234567', and save only this number into a new column. I started with this column, named Remarks; here is an example of the data inside:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138 Shippment UK
4 S0260425 Shippment US
I've managed to separate those rows which have the format S1234567 + text using the code below:
merged_out['Remarks'] = merged_out['Remarks'].replace("\t", "\r")
merged_out['Remarks'] = merged_out['Remarks'].replace("\n", "\r")
s = merged_out['Remarks'].str.split("\r").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
merged_out[['Number','Remarks']] = merged_out.Remarks.str.split(" ", 1, expand=True)
After creating the data frame I found a lot of mistakes in that column, because the data there are entered manually. Here are some examples of the wrong records:
Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31 cok
5. S0246732-
6. 2
7. ok
8. nan
And this is only the wrong data in the Number column. I need to clean this and keep only the entries with a correct number. If there is something like S0246732/S0246736/S0246738, then I need a separate row for each number, with the same data as the original record. Of the others, I need to keep those which contain a number; the rest should get a null value.
Here is a regex approach that will do what I think your question asks:
import pandas as pd
merged_out = pd.DataFrame({
'Remarks':[
'S0252508 Shippment UK',
'S0255111 Shippment UK',
'S0256352 Shippment UK',
'S0259138/S0259139 Shippment UK',
'S12345678 Shippment UK',
'S0260425 Shippment US']
})
pat = r'(?:(\bS\d{7})/)*(\bS\d{7}\b)'
df = merged_out.Remarks.str.extractall(pat)
df = (
    pd.concat([
        pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1),
                     columns=['Number']),
        merged_out
    ], axis=1)
    .explode('Number')
)
df.Remarks = df.Remarks.str.replace(pat + r'\s*', '', regex=True)
Input:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138/S0259139 Shippment UK
4 S12345678 Shippment UK
5 S0260425 Shippment US
Output:
Number Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138 Shippment UK
3 S0259139 Shippment UK
5 S0260425 Shippment US
4 NaN S12345678 Shippment UK
Explanation:
with Series.str.extractall(), use a pattern to obtain 0 or more occurrences of a word boundary \b followed by S followed by 7 digits, and one occurrence of S followed by 7 digits (flanked by word boundaries \b)
use unstack() to eliminate multiple index levels
use apply() with dropna() and tolist() to create a new dataframe with a Number column containing a list of numbers for each row
use explode() to add new rows for lists with more than one Number item
with Series.str.replace(), filter out the number matches using the previous pattern, plus r'\s*' to match trailing whitespace characters, to obtain the residual Remarks
Notes:
all rows in the sample input contain one valid Number except that one row contains multiple Number values separated by / delimiters, and another row contains no valid Number (it has S followed by 8 digits, more than the 7 that make a valid Number)
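If only the Number column is needed (without also rewriting Remarks), a shorter sketch uses the same \bS\d{7}\b token with findall plus explode:

```python
import pandas as pd

merged_out = pd.DataFrame({'Remarks': [
    'S0252508 Shippment UK',
    'S0259138/S0259139 Shippment UK',
    'S12345678 Shippment UK',   # invalid: 8 digits
]})

# findall collects every valid token per row; explode turns the lists
# into rows, and a row with no match (empty list) becomes NaN
numbers = merged_out['Remarks'].str.findall(r'\bS\d{7}\b').explode()
```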
I think the easiest solution is to use regular expressions and a list comprehension:
import re
import pandas as pd
merged_out['Remarks'] = [re.split(r'\s', i)[0] for i in merged_out['Remarks']]
Explanation:
This regular expression splits each string wherever there is a whitespace character, producing a list for row i of the Remarks column. With the 0, I select the first element of that list; in this case, it is the number.
The list comprehension iterates through the whole column, so you obtain the corresponding number for each row in the new Remarks column.
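One caveat (a sketch, not part of the answer above): taking the first token keeps junk like 'Pallets:' too. Anchoring str.extract on the S-number shape leaves NaN for such rows instead:

```python
import pandas as pd

merged_out = pd.DataFrame({'Remarks': ['S0252508 Shippment UK',
                                       'Pallets: something',
                                       'ok']})

# ^(S\d{7})\b only matches when the row starts with a valid number
merged_out['Number'] = merged_out['Remarks'].str.extract(r'^(S\d{7})\b',
                                                         expand=False)
```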
I have a csv like this:
userlabel|country
SZ5GZTD_[56][13631808]|russia
YZ5GZTC-3_[51][13680735]|uk
XZ5GZTA_12-[51][13574893]|usa
testYZ5GZWC_11-[51][13632101]|cuba
I use pandas to read this csv. I'd like to add a new column ci whose value comes from userlabel, where the following conditions must be met:
convert values to lowercase
start with 'yz' or 'testyz'
the code is like this :
(df['userlabel'].str.lower()).str.extract(r"(test)?([a-z]+).*", expand=True)[1]
When it matches, ci is the number between the first "-" or "_" and the second "-" or "_" in userlabel.
The pseudo-code is like this:
ci = (userlabel, r'.*(\_|\-)(\d+)(\_|\-).*', 2)
Finally, the result should look like this:
userlabel ci country
SZ5GZTD_[56][13631808] russia
YZ5GZTC-3_[51][13680735] 3 uk
XZ5GZTA_12-[51][13574893] usa
testYZ5GZWC_11-[51][13632101] 11 cuba
You can use
import pandas as pd
df = pd.DataFrame({'userlabel':['SZ5GZTD_[56][13631808]','YZ5GZTC-3_[51][13680735]','XZ5GZTA_12-[51][13574893]','testYZ5GZWC_11-[51][13632101]'], 'country':['russia','uk','usa','cuba']})
df['ci'] = df['userlabel'].str.extract(r"(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]", expand=True)
>>> df['ci']
0 NaN
1 3
2 NaN
3 11
Name: ci, dtype: object
# To rearrange columns, add the following line:
df = df[['userlabel', 'ci', 'country']]
>>> df
userlabel ci country
0 SZ5GZTD_[56][13631808] NaN russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] NaN usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba
See the regex demo.
Regex details:
(?i) - make the pattern case insensitive (no need using str.lower())
^ - start of string
(?:yz|testyz) - a non-capturing group matching either yz or testyz
[^_-]* - zero or more chars other than _ and -
[_-] - the first _ or -
(\d+) - Group 1 (the Series.str.extract requires a capturing group since it only returns this captured substring): one or more digits
[-_] - a - or _.
import re

def get_val(s):
    l = re.findall(r'^(YZ|testYZ).*[_-](\d+)[_-].*', s)
    return None if len(l) == 0 else l[0][1]

df['ci'] = df['userlabel'].apply(get_val)
df = df[['userlabel', 'ci', 'country']]
df = df[['userlabel', 'ci', 'country']]
userlabel ci country
0 SZ5GZTD_[56][13631808] None russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] None usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba
I have a really big data frame, and it contains a specific column "city" with the same cities repeating in different cases, for example:
***City***
Gurgaon
GURGAON
gurgaon
Chennai
CHENNAI
Banglore
Hydrabad
BANGLORE
HYDRABAD
.
Is there a way to replace all the occurrences of the same city in different cases with a single name?
There are 3k rows in total, so it's not possible manually.
Edit -
The city column of the DF also contains cities like
'Gurgaon'
'GURGAON'
'gurgaon ' #there is a white space at the end
I want something so that they all change to the same name and the trailing whitespace is also removed, so that the output is:
'Gurgaon'
'Gurgaon'
'Gurgaon' #no white space at the end
Thanks
Here is how you can use str.strip() to remove trailing whitespace, and then use str.title():
import pandas as pd
df = pd.DataFrame({'City':["Gurgaon",
"GURGAON",
"gurgaon",
"Chennai",
"CHENNAI",
"Banglore",
"Hydrabad",
"BANGLORE",
"HYDRABAD"]})
df['City'] = df['City'].str.strip()
df['City'] = df['City'].str.title()
print(df)
Output:
City
0 Gurgaon
1 Gurgaon
2 Gurgaon
3 Chennai
4 Chennai
5 Banglore
6 Hydrabad
7 Banglore
8 Hydrabad
First, change the cities to the same format:
df.city = df.city.apply(lambda x: x.capitalize())
Then remove duplicates (note that drop_duplicates returns a new frame, so assign the result):
df = df.drop_duplicates()
(I assume the rest of the columns are equal.)
I have a dataframe df containing information on car brands. For instance,
df['Car_Brand'][1]
'HYUNDAI '
where the length of each entry is the same: len(df['Car_Brand'][1]) == 30. I can also have entries with only white spaces.
df['Car_Brand']
0 TOYOTA
1 HYUNDAI
2
3
4
5 OPEL
6
7 JAGUAR
where
df['Car_Brand'][2]
' '
I would like to drop from the dataframe all the entries containing only white spaces, and trim the trailing spaces from the others. Finally:
df['Car_Brand'][1]
'HYUNDAI '
becomes
df['Car_Brand'][1]
'HYUNDAI'
I started to remove the white spaces in this way:
tmp = df['Car_Brand'].str.replace(" ", "")
Use str.strip, then convert to bool to filter out the empty ones:
df['Car_Brand'] = df['Car_Brand'].str.strip()
df[df['Car_Brand'].astype(bool)]
It seems you need:
s = df['Car_Brand']
s1 = s[s != ''].reset_index(drop=True)
#if multiple whitespaces
#s1 = s[s.str.strip() != ''].reset_index(drop=True)
print (s1)
0 TOYOTA
1 HYUNDAI
2 OPEL
3 JAGUAR
Name: Car_Brand, dtype: object
If there are multiple whitespaces:
s = df[~df['Car_Brand'].str.contains(r'^\s+$')]
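Putting both steps together, trim first and then drop the rows that are left empty, in one pass (a minimal sketch on sample brands like the ones above):

```python
import pandas as pd

df = pd.DataFrame({'Car_Brand': ['TOYOTA   ', 'HYUNDAI  ', '     ', 'OPEL ']})

# strip the padding, then keep only rows that still contain something
stripped = df['Car_Brand'].str.strip()
df = df.assign(Car_Brand=stripped)[stripped != ''].reset_index(drop=True)
```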