I have a series of texts that has either one word or a combination of words. I need to delete the last word its greater than 1, if not leave the last word.
Have tried the following regex:
df["first_middle_name"] = df["full_name"].replace("\s+\S+$", "")
from this solution: Removing last words in each row in pandas dataframe
It deletes certain words keeps others.
Some examples of strings in my df['Municipio']:
Zacapa
San Luis, **Jalapa**
Antigua Guatemala **Sacatepéquez**
Guatemala
Mixco
Sacapulas, **Jutiapa**
Puerto Barrios, **Izabal**
Petén **Petén**
San Martin Jil, **Chimaltenango**
What I need for example is if it finds one word keeps that word, if it is a combination of more words (2 or more) and there is a comma or space delete the last word. See bold words.
Thank you!
You can apply a function to check if , in string first, then check space in string.
df['Municipio'] = df['Municipio'].apply(lambda x: ', '.join(x.split(',')[:-1]) if ',' in x
else (' '.join(x.split(' ')[:-1]) if ' ' in x else x))
print(df)
Municipio
0 Zacapa
1 San Luis
2 Antigua Guatemala
3 Guatemala
4 Mixco
5 Sacapulas
6 Puerto Barrios
7 Petén
8 San Martin Jil
If you want to keep the last comma and space
df['Municipio'] = df['Municipio'].apply(lambda x: ', '.join(x.split(',')[:-1]+['']) if ',' in x
else (' '.join(x.split(' ')[:-1]+['']) if ' ' in x else x))
print(df)
Municipio
0 Zacapa
1 San Luis,
2 Antigua Guatemala
3 Guatemala
4 Mixco
5 Sacapulas,
6 Puerto Barrios,
7 Petén
8 San Martin Jil,
Related
Have this data:
region gdp_per_capita
0 Coasts of USA 71 546
1 USA: New York, New Jersey 81 615
2 USA: California 74 205
3 USA: New England 74 000
Wanna get this:
region gdp_per_capita
0 Coasts of USA 71546
1 USA: New York, New Jersey 81615
2 USA: California 74205
3 USA: New England 74000
Tried to use df.columns = df.columns.str.replace(' ', ''), but it did not work
Just this should do:
df['gdp_per_capita'] = df['gdp_per_capita'].astype(str).str.replace('\s+', '').replace('nan', np.nan)
df['gdp_per_capita'] = pd.to_numeric(df['gdp_per_capita'])
print(df)
region gdp_per_capita
0 Coasts of USA 71546
1 USA: New York, New Jersey 81615
2 USA: California 74205
3 USA: New England 74000
Looks like you want to work with numbers rather than strings.
Hence, replacing ' ' with '' and using pd.to_numeric seems like an easy and solid approach.
Let me suggest another one which might or might not be good (it depends on your dataset).
If the thousands in your dataset are separated by a whitespace (' '), you can just read your df as
df = pd.read_csv(file, thousands = ' ')
and all your columns with 74 109 would be read as 74109 and dtype integer or float.
import re
df['gdp_per_capita'] = df['gdp_per_capita'].apply(lambda x: re.sub("[^0-9]", "", str(x))).astype(int)
I am not quite sure it will work or not but try the following:
Trim leading space of column in pandas – lstrip()
Trim trailing space of column in pandas – rstrip()
Trim Both leading and trailing space of column in pandas – strip()
Strip all the white space of column in pandas.
Let me know if it works :)
Newer programmer here, deeply appreciate any help this knowledgeable community is willing to provide.
I have a column of 140,000 text strings (company names) in a pandas dataframe on which I want to strip all whitespace everywhere in/around the strings, remove all punctuation, substitute specific substrings, and uniformly transform to lowercase. I want to then take the first 0:10 elements in the strings and store them in a new dataframe column.
Here is a reproducible example.
import string
import pandas as pd
data = ["West Georgia Co",
"W.B. Carell Clockmakers",
"Spine & Orthopedic LLC",
"LRHS Saint Jose's Grocery",
"Optitech#NYCityScape"]
df = pd.DataFrame(data, columns = ['co_name'])
def remove_punctuations(text):
for punctuation in string.punctuation:
text = text.replace(punctuation, '')
return text
# applying remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)
# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')
# replace whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')
# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()
# select first 0:10 of strings
df['co_name_transform'] = df.co_name_transform.str[0:10]
print(df)
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
How can I put all these steps into a single function like this?
def clean_text(df[col]):
for co in co_name:
do_all_the_steps
return df[new_col]
Thank you
You don't need a function to do this. Try the following one-liner.
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '').str.replace('Saint', 'st').str.lower().str[0:10]
Final output will be.
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
You can do all the steps in the function you pass to the apply method:
import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
Another solution, similar to the previous one, but with the list of "to_replace" in one dictionary, so you can add more items to replace. Also, the previous solution won't give the first 10.
data = ["West Georgia Co",
"W.B. Carell Clockmakers",
"Spine & Orthopedic LLC",
"LRHS Saint Jose's Grocery",
"Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape"]
df = pd.DataFrame(data, columns = ['co_name'])
to_replace = {'[^A-Za-z0-9-]+':'','Saint':'st'}
for i in to_replace :
df['co_name'] = df['co_name'].str.replace(i,to_replace[i]).str.lower()
df['co_name'][0:10]
Result :
0 westgeorgiaco
1 wbcarellclockmakers
2 spineorthopedicllc
3 lrhssaintjosesgrocery
4 optitechnycityscape
5 optitechnycityscape
6 optitechnycityscape
7 optitechnycityscape
8 optitechnycityscape
9 optitechnycityscape
Name: co_name, dtype: object
Previous solution ( won't show the first 10)
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '').str.replace('Saint', 'st').str.lower().str[0:10]
Result :
0 westgeorgi
1 wbcarellcl
2 spineortho
3 lrhssaintj
4 optitechny
5 optitechny
6 optitechny
7 optitechny
8 optitechny
9 optitechny
10 optitechny
11 optitechny
12 optitechny
Name: co_name_transform, dtype: object
Initial Data (String datatype)
Los Gatos 50K
Las Palmas Canary Islands 25K
Roland Garros
Seoul 25K
Rome
Desired Result
Los Gatos
Las Palmas Canary Islands
Roland Garros
Seoul
Rome
I am looking for a way to remove any string pattern that is 2 digits and then a K. But it needs to be able to handle any 2 values before the K. I haven't seen any answers that use a wildcard for the part of the replace. It should be something like this (I know this is not valid) -
data.replace("**K", '')
Side note - This string will be a column in a dataframe so if there is an easy solution that works with that would be ideal. If not I can iterate through each row and transform it that way.
Try
df = df.replace('\d{2}K', '', regex = True)
0
0 Los Gatos
1 Las Palmas Canary Islands
2 Roland Garros
3 Seoul
4 Rome
I have a DataFrame with thousands of rows and two columns like so:
string state
0 the best new york cheesecake rochester ny ny
1 the best dallas bbq houston tx random str tx
2 la jolla fish shop of san diego san diego ca ca
3 nothing here dc
For each state, I have a regular expression of all city names (in lower case) structured like (city1|city2|city3|...) where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).
I want to find out what the last appearing city in the string is (for observations 1, 2, 3, 4, I'd want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).
I started off with str.extract and was trying to think of things like reversing the string but have reached an impasse.
Thanks so much for any help!
You can use str.findall, but if no match get empty list, so need apply. Last select last item of string by [-1]:
cities = r"new york|dallas|rochester|houston|san diego"
print (df['string'].str.findall(cities)
.apply(lambda x: x if len(x) >= 1 else ['no match val'])
.str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
(Corrected >= 1 to > 1.)
Another solution is a bit hack - add no match string to start of each string by radd and add this string to cities too:
a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a
print (df['string'].radd(a).str.findall(cities).str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
cities = r"new york|dallas|..."
def last_match(s):
found = re.findall(cities, s)
return found[-1] if found else ""
df['string'].apply(last_match)
#0 rochester
#1 houston
#2 san diego
#3
I've got a series of addresses and would like a series with just the street name. The only catch is some of the addresses don't have a house number, and some do.
So if I have a series that looks like:
Idx
0 11000 SOUTH PARK
1 20314 BRAKER LANE
2 203 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
What function would I write to get
Idx
0 SOUTH PARK
1 BRAKER LANE
2 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
where any 'words' made entirely of numeric characters at the beginning of the string have been removed? As you can see above, I would like to retain the 3 that '3RD STREET' starts with. I'm thinking a regular expression but this is beyond me. Thanks!
You can use str.replace with regex ^\d+\s+ to remove leading digits:
s.str.replace('^\d+\s+', '')
Out[491]:
0 SOUTH PARK
1 BRAKER LANE
2 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
Name: Idx, dtype: object
str.replace('\d+\s', '') is what I came up with:
df = pd.DataFrame({'IDx': ['11000 SOUTH PARK',
'20314 BRAKER LANE',
'203 3RD ST',
'BIRMINGHAM PARK',
'E 12TH']})
df
Out[126]:
IDx
0 11000 SOUTH PARK
1 20314 BRAKER LANE
2 203 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
df.IDx = df.IDx.str.replace('\d+\s', '')
df
Out[128]:
IDx
0 SOUTH PARK
1 BRAKER LANE
2 3RD ST
3 BIRMINGHAM PARK
4 E 12TH