I have a dataset for which I'm trying to fix the address situation. By default, the data in the Address column includes the name of the person who ordered, so the column looks something like this:
**Address**
John Snow 333 East Road 123 MA United States
Mary Scott 123 South Road 321 MA United States
What I need is a way to split the column when the first number appears, so that I end up with a "Name" column and a "New Address" column. It is important to keep everything after the first number and not to split at every number that appears in the Address column.
How would I go about doing this?
Thanks!
Given df:
address
0 John Snow 333 East Road 123 MA United States
1 Mary Scott 123 South Road 321 MA United States
Doing:
df[['name', 'new_address']] = df.address.str.extract(r'(.+?)\s(\d+.+)')
print(df)
Output:
address name new_address
0 John Snow 333 East Road 123 MA United States John Snow 333 East Road 123 MA United States
1 Mary Scott 123 South Road 321 MA United States Mary Scott 123 South Road 321 MA United States
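To see why the pattern splits only at the first number, it can help to run it with Python's re module directly. This is a minimal sketch, assuming that str.extract takes the first match of the pattern in each string (which is what re.search returns); the helper name split_row is made up for illustration.

```python
import re

# Same pattern as in the str.extract call above, written as a raw string.
PATTERN = re.compile(r"(.+?)\s(\d+.+)")


def split_row(address):
    """Return (name, new_address), or (None, None) if the pattern does not match."""
    match = PATTERN.search(address)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


rows = [
    "John Snow 333 East Road 123 MA United States",
    "Mary Scott 123 South Road 321 MA United States",
]
for row in rows:
    print(split_row(row))
```

The first group (.+?) is lazy, so it stops at the earliest whitespace that is followed by a digit; the second group (\d+.+) then takes the rest of the string, including any later numbers.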
Using regex
import re
text = "John Snow 333 East Road 123 MA United States"
res = re.search(r"(\d+)", text)
# -> <re.Match object; span=(10, 13), match='333'>
start, end = res.span()
before = text[:start].strip()  # strip spaces, which may or may not be there
after = text[end:].strip()
print(before, "|", after)
# By the way, the matched number itself is available as res[0]
# -> John Snow | East Road 123 MA United States
But that does not seem like a logical or useful split, because the number itself is dropped.
Some improvement would be
res = re.match(r"^(.+?)(\d+)(.+?)(\d+)(.+)", text)
print(res.groups())
# -> ('John Snow ', '333', ' East Road ', '123', ' MA United States')
Adjust the groups to your liking. But be aware that this will cause problems if more than two number blocks occur!
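The question asks to keep everything from the first number onward, including the number itself. A minimal sketch of that, using re.search only to locate the first digit and then slicing the string; the function name split_at_first_number is hypothetical, and returning an empty address when no digit is found is an assumption about the desired behaviour.

```python
import re


def split_at_first_number(text):
    """Split text into (name, address) at the first digit; the digit stays in the address."""
    match = re.search(r"\d", text)
    if match is None:
        # Assumption: with no number at all, treat the whole text as the name.
        return text.strip(), ""
    start = match.start()
    return text[:start].strip(), text[start:].strip()


for row in (
    "John Snow 333 East Road 123 MA United States",
    "Mary Scott 123 South Road 321 MA United States",
):
    print(split_at_first_number(row))
```

Because only the first digit is located, later numbers such as 123 or 321 stay inside the address part no matter how many there are.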
I have a lot of address information, such as:
123 1st Ave Apt501, Flushing, New York, 00000, USA
234 West 20th Street 1A, New York, New York, 11111, USA
345 North 100st Street Apt. 110, New York, New York, 22222, USA
I would like to get the street information. How can I delete the apartment information after "Ave" and "Street"?
So, the addresses will be cleaned as:
123 1st Ave, Flushing, New York, 00000, USA
234 West 20th Street, New York, New York, 11111, USA
345 North 100st Street, New York, New York, 22222, USA
Or the data can be cleaned as:
123 1st Ave
234 West 20th Street
345 North 100st Street
This is the code I tried. However, it cannot remove apartment information that does not include "Apt".
conditions = [df.address.str.contains('Apt')]
choices = [df.address.apply(lambda x: x[x.find('Apt'):])]
df['apt'] = np.select(conditions, choices, default = '')
choices2 = [df.address.apply(lambda x: x[:x.find('Apt')])]
df['address'] = np.select(conditions, choices2, default = df.address)
I think you should wrap all the addresses in a list and use a split to separate each element in the address so you can access street information by index 0.
addresses = ['123 1st Ave, Flushing, New York, 00000, USA',
             '234 West 20th Street, New York, New York, 11111, USA',
             '345 North 100st Street, New York, New York, 22222, USA']
for s in addresses:
    print(s.split(',')[0])
Output
123 1st Ave
234 West 20th Street
345 North 100st Street
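Note that the list above already contains the cleaned addresses. A quick check on the raw strings from the question (copied here as raw_addresses, a name chosen for illustration) suggests that splitting at the comma alone still leaves the apartment part in the first field:

```python
raw_addresses = [
    "123 1st Ave Apt501, Flushing, New York, 00000, USA",
    "234 West 20th Street 1A, New York, New York, 11111, USA",
    "345 North 100st Street Apt. 110, New York, New York, 22222, USA",
]

# Keep only the text before the first comma of each address.
first_fields = [address.split(",")[0] for address in raw_addresses]
for field in first_fields:
    print(field)
```

So the comma split is only the first step; the apartment suffix still has to be removed from each first field.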
To get the second option, I'd split at comma first and then process the first item with a regular expression.
df['street'] = (df.address
                .str.split(',')   # split at ,
                .str[0]           # get the first element
                .str.replace(r'\s*(?:Apt[.\s]*|(?<=Street)\s+)\d+\w?$',
                             '', regex=True)
               )
The regular expression matches
optional leading whitespace, then either
Apt followed by zero or more dots or whitespace OR
whitespace that directly follows Street (the lookbehind keeps the word Street itself out of the match)
one or more digits
an optional letter
and all that at the end of the string ($).
The pattern might need some tweaking but gives the right result for the example.
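One way to check a pattern like this without a dataframe is to run it with re.sub on the three sample addresses. The sketch below uses a lookbehind variant of the pattern so that the word Street stays in the result; the names APARTMENT_SUFFIX and street_only are made up for illustration, and the pattern is only tuned to these three examples.

```python
import re

# Apartment suffix at the end of the first field:
#   "Apt" + optional dots/whitespace + digits + optional letter, or
#   whitespace right after "Street" + digits + optional letter.
APARTMENT_SUFFIX = re.compile(r"\s*(?:Apt[.\s]*|(?<=Street)\s+)\d+\w?$")


def street_only(address):
    """Return the first comma-separated field with the apartment suffix removed."""
    first_field = address.split(",")[0]
    return APARTMENT_SUFFIX.sub("", first_field)


samples = [
    "123 1st Ave Apt501, Flushing, New York, 00000, USA",
    "234 West 20th Street 1A, New York, New York, 11111, USA",
    "345 North 100st Street Apt. 110, New York, New York, 22222, USA",
]
for sample in samples:
    print(street_only(sample))
```

Other street types (for example a unit number directly after "Ave") are not covered and would need their own alternative in the pattern.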
I have a dataframe column 'address' with values like this in each row:
3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)
Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)
I need only to keep the value Bronx / Queens / Manhattan / Staten Island from each row.
Is there any way to do this?
Thanks in advance.
One option is this, assuming the values are always in the same place. Using .split(', ')[2]
"3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)".split(', ')[2]
If the source file is a CSV (Comma-separated values), I would have a look at pandas and pandas.read_csv('filename.csv') and leverage all the nice features that are in pandas.
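As a small sketch of the .split(', ')[2] option applied to both sample rows: the helper name field_at is hypothetical, and returning an empty string for rows with too few fields is an assumption.

```python
def field_at(address, index=2):
    """Return the field at position index after splitting on ', ', or "" if the row is too short."""
    fields = address.split(", ")
    return fields[index] if len(fields) > index else ""


rows = [
    "3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
    "Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)",
]
for row in rows:
    print(field_at(row))
```

The first row yields "The Bronx" rather than "Bronx", so a fixed position may still need some cleanup afterwards.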
If the values are not always at the same position and you only need to know whether a value is in a given set of values:
import pandas as pd
df = pd.DataFrame(["The Bronx", "Queens", "Man"])
df.isin(["Queens", "The Bronx"])
You could add a column, let's call it 'district' and then populate it like this.
import pandas as pd
df = pd.DataFrame({'address':["3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
"Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)"]})
districts = ['Bronx','Queens','Manhattan', 'Staten Island']
df['district'] = ''
for district in districts:
    df.loc[df['address'].str.contains(district), 'district'] = district
print(df)
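The same lookup can be sketched on a single string with one regular expression built from the district list. This is an illustration in plain Python rather than the loop above; DISTRICT_RE and find_district are made-up names.

```python
import re

DISTRICTS = ["Bronx", "Queens", "Manhattan", "Staten Island"]

# One alternation pattern; re.escape guards against special characters in names.
DISTRICT_RE = re.compile("|".join(re.escape(name) for name in DISTRICTS))


def find_district(address):
    """Return the first district name found in address, or "" if none is present."""
    match = DISTRICT_RE.search(address)
    return match.group(0) if match else ""


addresses = [
    "3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
    "Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)",
]
for address in addresses:
    print(find_district(address))
```

One difference from the loop: if an address mentions several districts, the loop keeps the one that comes last in the districts list, while this sketch returns the one that appears first in the text.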
If I have the following dataframe 'countries':
country info
england london-europe
scotland edinburgh-europe
china beijing-asia
unitedstates washington-north_america
I would like to take the info field and remove everything after the '-', so that it becomes:
country info
england london
scotland edinburgh
china beijing
unitedstates washington
How do I do this?
Try:
countries['info'] = countries['info'].str.split('-').str[0]
Output:
country info
0 england london
1 scotland edinburgh
2 china beijing
3 unitedstates washington
You just need to keep the first part of the string after a split on the dash character:
countries['info'] = countries['info'].str.split('-').str[0]
Or, equivalently, you can use
countries['info'] = countries['info'].str.split('-').map(lambda x: x[0])
You can also use str.extract with pattern r"(\w+)(?=\-)"
Ex:
print(df['info'].str.extract(r"(\w+)(?=\-)"))
Output:
info
0 london
1 edinburgh
2 beijing
3 washington
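The split and the lookahead pattern agree on the sample data but can differ on other inputs. A small sketch on plain strings, with hypothetical helper names before_dash_split and before_dash_regex, shows two such cases: a value without a dash, and a value containing a space.

```python
import re


def before_dash_split(value):
    """Everything before the first '-', or the whole value if there is no dash."""
    return value.split("-", 1)[0]


def before_dash_regex(value):
    """Word characters directly before a '-', or None if the pattern does not match."""
    match = re.search(r"(\w+)(?=\-)", value)
    return match.group(1) if match else None


for value in ("london-europe", "london", "new york-north_america"):
    print(value, "|", before_dash_split(value), "|", before_dash_regex(value))
```

Without a dash the split keeps the value while the pattern finds nothing, and because \w does not match a space, the pattern returns only the last word before the dash.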
I'm trying to extract data from a Wikipedia table (https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award) about the MVP winners over NBA history.
This is my code:
wik_req = requests.get("https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award")
wik_webpage = wik_req.content
soup = BeautifulSoup(wik_webpage, "html.parser")
my_table = soup('table', {"class":"wikitable plainrowheaders sortable"})[0].find_all('a')
print(my_table)
for x in my_table:
    test = x.get("title")
    print(test)
However, this code prints all HTML title tags of the table as in the following (short version):
'1955–56 NBA season
Bob Pettit
Power Forward (basketball)
United States
St. Louis Hawks
1956–57 NBA season
Bob Cousy
Point guard
Boston Celtics'
Eventually, I want to create a pandas dataframe in which I store all the season years in one column, all the player names in another, and so on. What code does the trick to print only one kind of HTML tag title (e.g. only the NBA season years)? I can then store those in a column to set up my dataframe and do the same with player, position, nationality and team.
All you should need for that dataframe is:
import pandas as pd
url = "https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award"
df=pd.read_html(url)[5]
Output:
print(df)
Season Player ... Nationality Team
0 1955–56 Bob Pettit* ... United States St. Louis Hawks
1 1956–57 Bob Cousy* ... United States Boston Celtics
2 1957–58 Bill Russell* ... United States Boston Celtics (2)
3 1958–59 Bob Pettit* (2) ... United States St. Louis Hawks (2)
4 1959–60 Wilt Chamberlain* ... United States Philadelphia Warriors
.. ... ... ... ... ...
59 2014–15 Stephen Curry^ ... United States Golden State Warriors (2)
60 2015–16 Stephen Curry^ (2) ... United States Golden State Warriors (3)
61 2016–17 Russell Westbrook^ ... United States Oklahoma City Thunder (2)
62 2017–18 James Harden^ ... United States Houston Rockets (4)
63 2018–19 Giannis Antetokounmpo^ ... Greece Milwaukee Bucks (4)
[64 rows x 5 columns]
If you really want to stick with BeautifulSoup, here's an example to get you started:
my_table = soup('table', {"class":"wikitable plainrowheaders sortable"})[0]
season_col = []
for row in my_table.find_all('tr')[1:]:
    # the first direct child of each row is the season cell
    season = row.findChildren(recursive=False)[0]
    season_col.append(season.text.strip())
I expect there may be some differences between columns, but as you indicated you want to get familiar with BeautifulSoup, that's for you to explore :)
I have a dataframe like this:
Cause_of_death famous_for name nationality
suicide by hanging African jazz XYZ South
unknown Korean president ABC South
heart attack businessman EFG American
heart failure Prime Minister LMN Indian
heart problems African writer PQR South
The dataframe is very large. What I want to do is change the nationality column. You can see that for nationality = South, the strings in the famous_for column contain Korea or Africa. So I want to change the nationality to South Africa if famous_for contains Africa, and to South Korea if famous_for contains Korea.
What I had tried is:
for i in deaths['nationality']:
    if (deaths['nationality']=='South'):
        if deaths['famous_for'].contains('Korea'):
            deaths['nationality']='South Korea'
        elif deaths['famous_for'].contains('Korea'):
            deaths['nationality']='South Africa'
    else:
        pass
You can use str.contains() to check whether the famous_for column includes Korea or Africa and set nationality accordingly.
df.loc[df.famous_for.str.contains('Korean'), 'nationality']='South Korea'
df.loc[df.famous_for.str.contains('Africa'), 'nationality']='South Africa'
df
Out[783]:
Cause_of_death famous_for name nationality
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korea
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa
Or you can do this in one line using:
df.nationality = (
    df.nationality.str.cat(df.famous_for.str.extract('(Africa|Korea)', expand=False),
                           sep=' ', na_rep='')
    .str.strip())  # drop the trailing space left on rows where nothing was extracted
df
Out[801]:
Cause_of_death famous_for name nationality
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korea
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa
If there are many conditions, use a custom function with DataFrame.apply and axis=1 to process the data row by row:
def f(x):
    if x['nationality'] == 'South':
        if 'Korea' in x['famous_for']:
            return 'South Korea'
        elif 'Africa' in x['famous_for']:
            return 'South Africa'
    # fall back to the original value when no rule applies
    return x['nationality']
deaths['nationality'] = deaths.apply(f, axis=1)
print (deaths)
Cause_of_death famous_for name nationality
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korea
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa
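When the number of keyword rules grows, a lookup table may be easier to maintain than nested if statements. A minimal sketch of the row logic in plain Python, where KEYWORD_TO_NATIONALITY and fix_nationality are hypothetical names and the table holds only the two rules from the question:

```python
# Hypothetical lookup table: keyword to look for in famous_for -> full nationality.
KEYWORD_TO_NATIONALITY = {
    "Korea": "South Korea",
    "Africa": "South Africa",
}


def fix_nationality(nationality, famous_for):
    """Complete a truncated 'South' nationality using keywords found in famous_for."""
    if nationality != "South":
        return nationality  # only truncated values are touched
    for keyword, full_name in KEYWORD_TO_NATIONALITY.items():
        if keyword in famous_for:
            return full_name
    return nationality  # no keyword matched, keep the original value


print(fix_nationality("South", "African jazz"))
print(fix_nationality("South", "Korean president"))
print(fix_nationality("American", "businessman"))
```

With the dataframe from the question, this could be applied as deaths.apply(lambda row: fix_nationality(row['nationality'], row['famous_for']), axis=1).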
But if there are only a few conditions, use str.contains with DataFrame.loc:
mask1 = deaths['nationality'] == 'South'
mask2 = deaths['famous_for'].str.contains('Korean')
mask3 = deaths['famous_for'].str.contains('Africa')
deaths.loc[mask1 & mask2, 'nationality']='South Korea'
deaths.loc[mask1 & mask3, 'nationality']='South Africa'
print (deaths)
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korea
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa
Another solution with mask:
mask1 = deaths['nationality'] == 'South'
mask2 = deaths['famous_for'].str.contains('Korean')
mask3 = deaths['famous_for'].str.contains('Africa')
deaths['nationality'] = deaths['nationality'].mask(mask1 & mask2, 'South Korea')
deaths['nationality'] = deaths['nationality'].mask(mask1 & mask3,'South Africa')
print (deaths)
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korea
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa