I have a DataFrame with thousands of rows and two columns like so:
string state
0 the best new york cheesecake rochester ny ny
1 the best dallas bbq houston tx random str tx
2 la jolla fish shop of san diego san diego ca ca
3 nothing here dc
For each state, I have a regular expression of all city names (in lower case) structured like (city1|city2|city3|...) where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).
I want to find out what the last appearing city in the string is (for rows 0, 1, 2, and 3 above, I'd want 'rochester', 'houston', 'san diego', and NaN (or any placeholder), respectively).
I started off with str.extract and was trying to think of things like reversing the string but have reached an impasse.
Thanks so much for any help!
You can use str.findall, but when there is no match it returns an empty list, so an apply is needed to substitute a placeholder. Then select the last item of each list with .str[-1]:
cities = r"new york|dallas|rochester|houston|san diego"
print (df['string'].str.findall(cities)
                   .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                   .str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
(Corrected > 1 to >= 1 so that rows with a single match keep it.)
Another solution is a bit of a hack: prepend the no-match string to every row with radd and add it as an alternative to the cities pattern too, so findall always returns at least one item and .str[-1] never hits an empty list:
a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a
print (df['string'].radd(a).str.findall(cities).str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
cities = r"new york|dallas|..."
def last_match(s):
found = re.findall(cities, s)
return found[-1] if found else ""
df['string'].apply(last_match)
#0 rochester
#1 houston
#2 san diego
#3
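Since the question mentions starting from str.extract: a greedy .* before the capture group also grabs the last occurrence, because the engine consumes as much of the string as possible and only backtracks far enough to allow a match, and rows with no match come back as NaN for free. A minimal sketch (the cities pattern is illustrative):
cities = r"new york|dallas|rochester|houston|san diego|la jolla"
# The greedy .* forces the capture group onto the last occurrence;
# rows without any city yield NaN.
print (df['string'].str.extract('.*(' + cities + ')', expand=False))
0 rochester
1 houston
2 san diego
3 NaN
Name: string, dtype: object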
I have some address data like:
Address
Buffalo, NY, 14201
Stackoverflow Street, New York, NY, 99999
I'd like to split these into columns like:
Street City State Zip
NaN Buffalo NY 14201
StackOverflow Street New York NY 99999
Essentially, I'd like to shift my strings over by one in each column in the result.
With Pandas I know I can split columns like:
import pandas as pd
df = pd.DataFrame(
    data={'Address': ['Buffalo, NY, 14201', 'Stackoverflow Street, New York, NY, 99999']}
)
df[['Street','City','State','Zip']] = (
    df['Address']
    .str.split(',', expand=True)
    .applymap(lambda col: col.strip() if col else col)
)
but need to figure out how to conditionally shift columns when my result is only 3 columns.
First, create a function that splits each row and reverses the resulting list. With a normal split, the padding NaN would land in the last column (Zip); after reversing, Zip, State, and City fill the first columns and the NaN pads the last one, which is exactly where Street belongs.
Then, apply it to all rows.
Then, rename the columns because they will be integers.
Finally, set them in the right order.
fn = lambda x: pd.Series([i.strip() for i in reversed(x.split(','))])
pad = df['Address'].apply(fn)
pad looks like this right now,
0 1 2 3
0 14201 NY Buffalo NaN
1 99999 NY New York Stackoverflow Street
Just need to rename the columns and flip the order back.
pad.rename(columns={0:'Zip',1:'State',2:'City',3:'Street'},inplace=True)
df = pad[['Street','City','State','Zip']]
Output:
Street City State Zip
0 NaN Buffalo NY 14201
1 Stackoverflow Street New York NY 99999
Use a bit of numpy magic to reorder the columns with None on the left. A stable argsort of the notna mask moves the False (None) cells to the front of each row while keeping the non-null values in their original order:
import numpy as np

df2 = df['Address'].str.split(',', expand=True)
df[['Street','City','State','Zip']] = df2.to_numpy()[
    np.arange(len(df))[:, None], np.argsort(df2.notna().to_numpy(), kind='stable')
]
Output:
Address Street City State Zip
0 Buffalo, NY, 14201 None Buffalo NY 14201
1 Stackoverflow Street, New York, NY, 99999 Stackoverflow Street New York NY 99999
Another idea: prepend as many commas as needed so that every row has n-1 separators (here 3) before splitting:
df[['Street','City','State','Zip']] = (
    df['Address'].str.count(',')
    .rsub(4-1)                    # 3 - (commas present) = commas missing
    .map(lambda x: ',' * x)       # build the padding
    .add(df['Address'])           # prepend it to the address
    .str.split(',', expand=True)
)
Output:
Address Street City State Zip
0 Buffalo, NY, 14201 Buffalo NY 14201
1 Stackoverflow Street, New York, NY, 99999 Stackoverflow Street New York NY 99999
Well, I found a solution, but I'm not sure if there is something more performant out there. Open to other ideas.
def split_shift(s: str) -> list[str]:
    split_str: list[str] = s.split(',')
    # If the split has only 3 items, shift things over by inserting a NA in front
    if len(split_str) == 3:
        split_str.insert(0, pd.NA)
    return split_str

df[['Street','City','State','Zip']] = pd.DataFrame(df['Address'].apply(split_shift).tolist())
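One more idea in that direction (a sketch, not benchmarked): make the Street field optional inside a single anchored regex, so str.extract leaves NaN for three-field rows and no shifting is needed at all:
# The optional non-capturing group only matches when four
# comma-separated fields are present; otherwise Street stays NaN.
pattern = r'^(?:(?P<Street>[^,]+),\s*)?(?P<City>[^,]+),\s*(?P<State>[^,]+),\s*(?P<Zip>[^,]+)$'
df[['Street','City','State','Zip']] = df['Address'].str.extract(pattern)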
I have a series of texts that contain either one word or a combination of words. I need to delete the last word when there is more than one word; otherwise keep the single word.
I have tried the following regex:
df["first_middle_name"] = df["full_name"].replace("\s+\S+$", "")
from this solution: Removing last words in each row in pandas dataframe
It deletes some of the words but keeps others.
Some examples of strings in my df['Municipio']:
Zacapa
San Luis, **Jalapa**
Antigua Guatemala **Sacatepéquez**
Guatemala
Mixco
Sacapulas, **Jutiapa**
Puerto Barrios, **Izabal**
Petén **Petén**
San Martin Jil, **Chimaltenango**
What I need, for example: if there is only one word, keep it; if there is a combination of two or more words (separated by a comma or a space), delete the last word. See the bold words.
Thank you!
You can apply a function that first checks for a comma in the string, and then for a space:
df['Municipio'] = df['Municipio'].apply(lambda x: ', '.join(x.split(',')[:-1]) if ',' in x
                                        else (' '.join(x.split(' ')[:-1]) if ' ' in x else x))
print(df)
Municipio
0 Zacapa
1 San Luis
2 Antigua Guatemala
3 Guatemala
4 Mixco
5 Sacapulas
6 Puerto Barrios
7 Petén
8 San Martin Jil
If you want to keep the trailing comma or space:
df['Municipio'] = df['Municipio'].apply(lambda x: ', '.join(x.split(',')[:-1]+['']) if ',' in x
                                        else (' '.join(x.split(' ')[:-1]+['']) if ' ' in x else x))
print(df)
Municipio
0 Zacapa
1 San Luis,
2 Antigua Guatemala
3 Guatemala
4 Mixco
5 Sacapulas,
6 Puerto Barrios,
7 Petén
8 San Martin Jil,
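A regex alternative (a sketch, not part of the answer above): only multi-word strings contain a separator, so a pattern anchored at the end can strip the separator together with the final word while leaving single words untouched:
# Removes a comma/space run plus the last word; single words have
# no separator, so they don't match and are kept as-is.
df['Municipio'] = df['Municipio'].str.replace(r'[,\s]+\S+$', '', regex=True)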
I have two dataframes:
df1:
locality
0 Chicago, IL
1 San Francisco, CA
2 Chic, TN
df2:
City County
0 San Francisco San Francisco County
1 Chic Dyer County
2 Chicago Cook County
I want to find all values (or their corresponding indices) in locality that start with each value in City so that I can eventually merge the two dataframes. The answer to this post is close to what I want to achieve, but it is greedy and only extracts one value; I want all matches, even if some are somewhat incorrect.
For example, I would like to know that "Chic" in df2 matches both "Chicago, IL" and "Chic, TN" in df1 (and/or that "Chicago, IL" matches both "Chic" and "Chicago")
So far, I've accomplished this by using pandas apply and a custom function:
def getMatches(row):
    matches = df1[df1['locality'].str.startswith(row['City'])]
    return matches.index.tolist(), matches['locality']

df2.apply(getMatches, axis=1)
0    ([1], [San Francisco, CA])
1    ([0, 2], [Chicago, IL, Chic, TN])
2    ([0], [Chicago, IL])
This works fine until both df1 and df2 are large (100,000+ rows), where I run into time and memory issues (even using Dask's parallelized apply). Is there a better alternative?
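One scalable idea (a sketch, not a posted answer): sort the localities once, then each prefix lookup becomes a pair of binary searches, because every string that starts with a prefix p lies in the contiguous sorted range [p, p + chr(0x10FFFF)). That is roughly O((n + m) log n) instead of the O(n * m) scan that apply does. The 'matches' column name is illustrative:
import numpy as np

# Sort the localities once, remembering the original row positions.
order = df1['locality'].to_numpy().argsort()
loc_sorted = df1['locality'].to_numpy()[order]

# For each City, binary-search the block of localities starting with it;
# '\U0010FFFF' (the largest code point) caps the upper bound of the block.
lo = np.searchsorted(loc_sorted, df2['City'].to_numpy())
hi = np.searchsorted(loc_sorted, (df2['City'] + '\U0010FFFF').to_numpy())
df2['matches'] = [df1.index[order[a:b]].tolist() for a, b in zip(lo, hi)]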
This is my df:
df = pd.DataFrame([
    "STD, City State",
    "33, BUffalo New York",
    "44, Charleston North Carolina "
], columns=['row'])
My intention is to split them on a comma followed by a space, or just a space, like this:
33 Buffalo New York
44 Charleston North Carolina
My command is as follows:
df["row"].str.split("[,\s|\s]", n = 2, expand = True)
The output I'm after is:
0 STD City State
1 33 BUffalo New York
2 44 Charleston North Carolina
As explained in the pandas docs, your split command does what it should if you just remove the square brackets: inside brackets the pattern is a character class that splits on each single character (comma, whitespace, or a literal |), not on the two alternatives. This command works:
new_df = df["row"].str.split(r",\s|\s", n=2, expand=True)
Note: if your cities have spaces in them, then this will fail. It works when the state has a space in it, because n=2 ensures that at most 3 columns result.
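To see the difference (a quick sketch with the plain re module): the character class treats the comma and the following space as two separate split points and produces an empty field, while the alternation consumes ', ' as one separator:
import re

s = "33, BUffalo New York"
print (re.split(r"[,\s|\s]", s, maxsplit=2))  # ['33', '', 'BUffalo New York']
print (re.split(r",\s|\s", s, maxsplit=2))    # ['33', 'BUffalo', 'New York']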
The only part that you are missing is to set the first row as the header. As answered here, you can use pandas' iloc command:
new_df.columns = new_df.iloc[0]
new_df = new_df[1:]
print (new_df)
# 0 STD City State
# 1 33 BUffalo New York
# 2 44 Charleston North Carolina
I have a column, market_area that I want to abbreviate by keeping only the part of the string to the left of the hyphen.
For example, my data is like this:
import pandas as pd
tmp = pd.DataFrame({'market_area': ['San Francisco-Oakland-San Jose',
                                    None,
                                    'Dallas-Fort Worth',
                                    'Los Angeles-Riverside-Orange County'],
                    'val': [1, 2, 3, 4]})
My desired output would be:
['San Francisco', None, 'Dallas', 'Los Angeles']
I am able to split based on the hyphen:
tmp['market_area'].str.split('-')
But how do I extract only the part to the left of the hyphen?
You can extract the first element of the split list using .str[0]:
tmp.market_area.str.split('-').str[0]
Out[3]:
0 San Francisco
1 None
2 Dallas
3 Los Angeles
Name: market_area, dtype: object
Or use the str.extract method with the regex ^([^-]*).*, which captures everything up to the first -:
tmp.market_area.str.extract('^([^-]*).*', expand=False)
Out[5]:
0 San Francisco
1 NaN
2 Dallas
3 Los Angeles
Name: market_area, dtype: object
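A small variation on the split approach (a sketch): passing n=1 stops splitting at the first hyphen, which avoids needlessly splitting the remainder of longer strings:
tmp['market_area'].str.split('-', n=1).str[0]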