I have many address records, such as:
123 1st Ave Apt501, Flushing, New York, 00000, USA
234 West 20th Street 1A, New York, New York, 11111, USA
345 North 100st Street Apt. 110, New York, New York, 22222, USA
I would like to get just the street information. So, I am wondering how I can delete the apartment information that comes after "Ave" and "Street"?
So, the addresses will be cleaned as:
123 1st Ave, Flushing, New York, 00000, USA
234 West 20th Street, New York, New York, 11111, USA
345 North 100st Street, New York, New York, 22222, USA
Or the data can be cleaned as:
123 1st Ave
234 West 20th Street
345 North 100st Street
This is the code I tried. However, it only removes apartment information that contains "Apt"; it misses formats like "1A" that don't include "Apt".
conditions = [df.address.str.contains('Apt')]
choices = [df.address.apply(lambda x: x[x.find('Apt'):])]
df['apt'] = np.select(conditions, choices, default = '')
choices2 = [df.address.apply(lambda x: x[:x.find('Apt')])]
df['address'] = np.select(conditions, choices2, default = df.address)
I think you should put all the addresses in a list and split each one at the commas, so you can access the street information at index 0.
addresses = ['123 1st Ave, Flushing, New York, 00000, USA', '234 West 20th Street, New York, New York, 11111, USA',
'345 North 100st Street, New York, New York, 22222, USA']
for s in addresses:
    print(s.split(',')[0])
Output
123 1st Ave
234 West 20th Street
345 North 100st Street
To get the second option, I'd split at comma first and then process the first item with a regular expression.
df['street'] = (df.address
                  .str.split(',')   # split at the comma
                  .str[0]           # take the first element
                  .str.replace(r'(Ave|Street)\s+(?:Apt[.\s]*)?\d+\w?$',
                               r'\1', regex=True)
               )
The regular expression matches
Ave or Street (captured, so the replacement \1 can put it back),
optionally Apt followed by dots or whitespace,
one or more digits,
an optional letter,
and all of that at the end of the string ($).
The pattern might need some tweaking but gives the right result for the example.
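As a quick sanity check, the same idea can be exercised with plain re on the three sample street strings; the Ave/Street alternation here is an assumption based on the sample data, and the backreference \1 keeps the captured street-type word:

```python
import re

# Matches "Ave"/"Street", then an optional "Apt" marker, then the unit number,
# anchored at the end of the string; \1 restores the street-type word.
pattern = r'(Ave|Street)\s+(?:Apt[.\s]*)?\d+\w?$'

addresses = [
    '123 1st Ave Apt501',
    '234 West 20th Street 1A',
    '345 North 100st Street Apt. 110',
]
for a in addresses:
    print(re.sub(pattern, r'\1', a))
# -> 123 1st Ave
# -> 234 West 20th Street
# -> 345 North 100st Street
```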
Related
I have a dataset for which I'm trying to fix the address situation. The data in the Address column comes with the name of the person who ordered prepended by default, so the column looks something like this:
**Address**
John Snow 333 East Road 123 MA United States
Mary Scott 123 South Road 321 MA United States
What I need is a way to split the column when the first number appears, so that I end up with a "Name" column and a "New Address" column. It is important to keep everything after the first number and not to split at every number that appears in the Address column.
How would I go about doing this?
Thanks!
Given df:
address
0 John Snow 333 East Road 123 MA United States
1 Mary Scott 123 South Road 321 MA United States
Doing:
df[['name', 'new_address']] = df.address.str.extract(r'(.+?)\s(\d+.+)')
print(df)
Output:
address name new_address
0 John Snow 333 East Road 123 MA United States John Snow 333 East Road 123 MA United States
1 Mary Scott 123 South Road 321 MA United States Mary Scott 123 South Road 321 MA United States
Regexr Explanation
Using regex:
import re

text = "John Snow 333 East Road 123 MA United States"
res = re.search(r"\d+", text)
# -> <re.Match object; span=(10, 13), match='333'>
start, end = res.span()
before = text[:start].strip()  # strip spaces which might or might not be there
after = text[end:].strip()
print(before, "|", after)
# btw. here you can get the number via res[0]
# -> John Snow | East Road 123 MA United States
But that doesn't seem like a logical and useful split. An improvement would be:
res = re.match(r"^(.+?)(\d+)(.+?)(\d+)(.+)", text)
print(res.groups())
# -> ('John Snow ', '333', ' East Road ', '123', ' MA United States')
Adjust the groups to your liking, but be aware that this will produce problems if more than two number blocks occur!
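The split-at-the-first-number idea can also be packaged as a small helper that keeps everything from the first digit onward in the new address, which is what the question asks for (a sketch; the function name is illustrative):

```python
import re

def split_at_first_number(text):
    """Split into (name, new_address) at the first digit, keeping the digit."""
    m = re.search(r"\d", text)
    if m is None:                      # no number at all: nothing to split off
        return text.strip(), ""
    i = m.start()
    return text[:i].strip(), text[i:].strip()

print(split_at_first_number("John Snow 333 East Road 123 MA United States"))
# -> ('John Snow', '333 East Road 123 MA United States')
```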
I have a dataframe column 'address' with values like this in each row:
3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)
Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)
I need only to keep the value Bronx / Queens / Manhattan / Staten Island from each row.
Is there any way to do this?
Thanks in advance.
One option, assuming the values are always in the same position, is to use .split(', ')[2]:
"3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)".split(', ')[2]
If the source file is a CSV (Comma-separated values), I would have a look at pandas and pandas.read_csv('filename.csv') and leverage all the nice features that are in pandas.
If the values are not at the same position and you only need to know whether each value is in a set of values:
import pandas as pd
df = pd.DataFrame(["The Bronx", "Queens", "Man"])
df.isin(["Queens", "The Bronx"])
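If the position varies, another option is to pull the borough name straight out of the text with str.extract and an alternation pattern (a sketch, assuming the column is named address; rows with no borough name come back as NaN):

```python
import pandas as pd

df = pd.DataFrame({'address': [
    "3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States",
    "Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States",
]})

# expand=False returns a Series; the group captures the first borough found.
pattern = r'(Bronx|Queens|Manhattan|Staten Island)'
df['district'] = df['address'].str.extract(pattern, expand=False)
print(df['district'].tolist())
# -> ['Bronx', 'Queens']
```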
You could add a column, let's call it 'district' and then populate it like this.
import pandas as pd
df = pd.DataFrame({'address':["3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
"Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)"]})
districts = ['Bronx','Queens','Manhattan', 'Staten Island']
df['district'] = ''
for district in districts:
    df.loc[df['address'].str.contains(district), 'district'] = district
print(df)
I have a data frame that looks like below:
City State Country
Chicago IL United States
Boston
San Diego CA United States
Los Angeles CA United States
San Francisco
Sacramento
Vancouver BC Canada
Toronto
And I have 3 lists of values that are ready to fill in the None cells:
city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']
The order of the elements in these lists corresponds across the lists: the first items of all three lists match each other, and so forth. How can I fill in the empty cells and produce a result like below?
City State Country
Chicago IL United States
Boston MA United States
San Diego CA United States
Los Angeles CA United States
San Francisco CA United States
Sacramento CA United States
Vancouver BC Canada
Toronto ON Canada
My code gives me an error and I'm stuck.
if df.loc[df['City'] == 'Boston']:
'State' = 'MA'
Any solution is welcome. Thank you.
Create two mappings, one for city → state and another for city → country.
city_map = dict(zip(city, state))
country_map = dict(zip(city, country))
Then use map to look up each city's value and fillna so that only the missing cells are filled (existing values, like Chicago's IL, are kept):
df['State'] = df['State'].fillna(df['City'].map(city_map))
df['Country'] = df['Country'].fillna(df['City'].map(country_map))
No index manipulation is needed; map works directly on the City column.
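Putting it together on the sample frame (a sketch; it assumes the blank cells are NaN, which is what fillna looks for):

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Chicago', 'Boston', 'San Diego', 'Los Angeles',
             'San Francisco', 'Sacramento', 'Vancouver', 'Toronto'],
    'State': ['IL', None, 'CA', 'CA', None, None, 'BC', None],
    'Country': ['United States', None, 'United States', 'United States',
                None, None, 'Canada', None],
})

city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']

# Fill only the missing cells; existing values are left untouched.
df['State'] = df['State'].fillna(df['City'].map(dict(zip(city, state))))
df['Country'] = df['Country'].fillna(df['City'].map(dict(zip(city, country))))
print(df)
```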
I have a DataFrame with thousands of rows and two columns like so:
string state
0 the best new york cheesecake rochester ny ny
1 the best dallas bbq houston tx random str tx
2 la jolla fish shop of san diego san diego ca ca
3 nothing here dc
For each state, I have a regular expression of all city names (in lower case) structured like (city1|city2|city3|...) where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).
I want to find out what the last appearing city in the string is (for observations 1, 2, 3, 4, I'd want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).
I started off with str.extract and was trying to think of things like reversing the string but have reached an impasse.
Thanks so much for any help!
You can use str.findall, but if there is no match you get an empty list, so apply is needed to substitute a fallback. Then select the last item with .str[-1]:
cities = r"new york|dallas|rochester|houston|san diego"
print(df['string'].str.findall(cities)
                  .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                  .str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
Another solution is a bit of a hack: add a no-match string to the start of each string with radd, and add this string to cities too:
a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a
print (df['string'].radd(a).str.findall(cities).str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
import re

cities = r"new york|dallas|..."

def last_match(s):
    found = re.findall(cities, s)
    return found[-1] if found else ""

df['string'].apply(last_match)
#0 rochester
#1 houston
#2 san diego
#3
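A vectorized alternative for the last match: prefix the alternation with a greedy .*, which consumes as much as possible and so forces the capture group onto the last occurrence (a sketch using the same sample data; rows with no match come back as NaN):

```python
import pandas as pd

df = pd.DataFrame({'string': [
    'the best new york cheesecake rochester ny',
    'the best dallas bbq houston tx random str',
    'la jolla fish shop of san diego san diego ca',
    'nothing here',
]})

# The greedy .* pushes the capture group to the last city in each string.
cities = r'.*(new york|dallas|rochester|houston|san diego)'
print(df['string'].str.extract(cities, expand=False).tolist())
# -> ['rochester', 'houston', 'san diego', nan]
```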
Here's the code I have:
def boba_recs(lat, lng):
    f1 = pd.read_csv("./boba_final.csv")
    user_loc = Point(lng, lat)  # converts user lat/long to point object
    # makes dataframe of distances between each boba place and the user loc
    f1['Distance'] = [user_loc.distance(Point(xy)) for xy in zip(f1.Longitude, f1.Lat)]
    # grabs the three smallest distances
    boba = f1.nsmallest(3, 'Distance').set_index('Name')  # sets index to name
    return ": " + boba['Address']
This returns:
Name
Coco Bubble Tea : 129 E 45th St New York, NY 10017
Gong Cha : 75 W 38th St, New York, NY 10018
Smoocha Tea & Juice Bar : 315 5th Ave New York, NY 10016
Name: Address, dtype: object
Almost perfect except I want to get rid of the "Name: Address, dtype: object" row. I've tried a few things and it just won't go away without messing up the format of everything else.
'Address' is the name of the pd.Series and 'Name' is the name of the index
Try:
s.rename_axis(None).rename(None)
Coco Bubble Tea : 129 E 45th St New York, NY 10017
Gong Cha : 75 W 38th St, New York, NY 10018
Smoocha Tea & Juice Bar : 315 5th Ave New York, NY 10016
dtype: object
Or I'd rewrite your function:
def boba_recs(lat, lng):
    f1 = pd.read_csv("./boba_final.csv")
    user_loc = Point(lng, lat)  # converts user lat/long to point object
    # makes dataframe of distances between each boba place and the user loc
    f1['Distance'] = [user_loc.distance(Point(xy)) for xy in zip(f1.Longitude, f1.Lat)]
    # grabs the three smallest distances
    boba = f1.nsmallest(3, 'Distance').set_index('Name')  # sets index to name
    return (": " + boba['Address']).rename_axis(None).rename(None)
If you want a string:
def boba_recs(lat, lng):
    f1 = pd.read_csv("./boba_final.csv")
    user_loc = Point(lng, lat)  # converts user lat/long to point object
    # makes dataframe of distances between each boba place and the user loc
    f1['Distance'] = [user_loc.distance(Point(xy)) for xy in zip(f1.Longitude, f1.Lat)]
    # grabs the three smallest distances
    boba = f1.nsmallest(3, 'Distance').set_index('Name')  # sets index to name
    temp = (": " + boba['Address']).rename_axis(None).__repr__()
    return temp.rsplit('\n', 1)[0]
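A simpler way to get the same string: Series.to_string() renders the values without the "Name: ..., dtype: ..." footer by default, so the __repr__/rsplit step isn't needed. Combine it with rename_axis(None) if the index name should go too (a sketch on a stand-in series):

```python
import pandas as pd

s = pd.Series(
    [': 129 E 45th St New York, NY 10017',
     ': 75 W 38th St, New York, NY 10018'],
    index=['Coco Bubble Tea', 'Gong Cha'],
    name='Address',
)

# to_string() omits the name/dtype footer that repr() appends.
print(s.rename_axis(None).to_string())
```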
That isn't a row. It's a description of the data you are printing. Try printing just the .values and you will see.
[Coco Bubble Tea : 129 E 45th St New York, NY 10017
Gong Cha : 75 W 38th St, New York, NY 10018
Smoocha Tea & Juice Bar : 315 5th Ave New York, NY 10016]
Update Based on your comment:
print(pd.DataFrame(your_series))
Name
Coco Bubble Tea : 129 E 45th St New York, NY 10017
Gong Cha : 75 W 38th St, New York, NY 10018
Smoocha Tea & Juice Bar : 315 5th Ave New York, NY 10016