Pandas - remove numbers from start of string in series

I've got a series of addresses and would like a series with just the street name. The only catch is some of the addresses don't have a house number, and some do.
So if I have a series that looks like:
Idx
0 11000 SOUTH PARK
1 20314 BRAKER LANE
2 203 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
What function would I write to get
Idx
0 SOUTH PARK
1 BRAKER LANE
2 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
where any 'words' made entirely of numeric characters at the beginning of the string have been removed? As you can see above, I would like to retain the 3 that '3RD STREET' starts with. I'm thinking a regular expression but this is beyond me. Thanks!

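You can use str.replace with the regex ^\d+\s+ to remove leading digits (pass regex=True, since pandas 2.0 and later treat the pattern as a literal string by default):
s.str.replace(r'^\d+\s+', '', regex=True)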
Out[491]:
0 SOUTH PARK
1 BRAKER LANE
2 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
Name: Idx, dtype: object
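A minimal, self-contained sketch of the anchored replacement, assuming the addresses live in a Series named s as in the question. The variable name streets is made up for illustration.

```python
import pandas as pd

s = pd.Series(
    ["11000 SOUTH PARK", "20314 BRAKER LANE", "203 3RD ST",
     "BIRMINGHAM PARK", "E 12TH"],
    name="Idx",
)

# ^ ties the match to the start of the string, so only a leading
# all-digit token (and the whitespace after it) is removed.
# "3RD" survives because its digit is not followed by whitespace.
streets = s.str.replace(r"^\d+\s+", "", regex=True)
print(streets.tolist())
```

Addresses without a house number contain no match at position 0, so they pass through unchanged.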

str.replace(r'\d+\s', '', regex=True) is what I came up with (without a leading ^ it also removes any later run of digits followed by whitespace, which does not matter for this data):
df = pd.DataFrame({'IDx': ['11000 SOUTH PARK',
                           '20314 BRAKER LANE',
                           '203 3RD ST',
                           'BIRMINGHAM PARK',
                           'E 12TH']})
df
Out[126]:
IDx
0 11000 SOUTH PARK
1 20314 BRAKER LANE
2 203 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
df.IDx = df.IDx.str.replace(r'\d+\s', '', regex=True)
df
Out[128]:
IDx
0 SOUTH PARK
1 BRAKER LANE
2 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
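The two patterns agree on the sample data but can differ on other addresses. The sketch below compares them. The second address is made up for illustration and does not come from the question.

```python
import pandas as pd

# "HIGHWAY 71 NORTH" is a hypothetical address with a number in the middle.
s = pd.Series(["203 3RD ST", "HIGHWAY 71 NORTH"])

# Unanchored: every run of digits followed by whitespace is removed.
unanchored = s.str.replace(r"\d+\s", "", regex=True)

# Anchored: only a run of digits at the very start is removed.
anchored = s.str.replace(r"^\d+\s+", "", regex=True)

print(unanchored.tolist())
print(anchored.tolist())
```

If street names may legitimately contain standalone numbers, the anchored form is probably the safer choice.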

Related

Extracting Specific Text From column in dataframe in pandas

I have a pandas DataFrame whose State column contains a distance with a unit (ft, mi, FT, or MI). I need to extract that value and its unit from the State column with a regular expression and store it in another column.
df1 = pd.DataFrame({
    'State': ['Arizona 4.47ft', 'Georgia 1023mi', 'Newyork 2022 NY 74.6 FT', 'Indiana 747MI(In)', 'Florida 453mi FL']})
Expected output
State Distance
0 Arizona 4.47ft 4.47ft
1 Georgia 1023mi 1023mi
2 Newyork NY 74.6ft 74.6ft
3 Indiana 747MI(In) 747MI
4 Florida 453mi FL 453mi
Would anyone please help?

Build a regex pattern from the list l, then use str.extract to pull the first occurrence of that pattern from the State column. The pattern expects the unit to follow the number directly, with no space in between (as in 74.6FT):
l = ['ft','mi','FT','MI']
df1['Distance'] = df1['State'].str.extract(r'(\S+(?:%s))\b' % '|'.join(l))
State Distance
0 Arizona 4.47ft 4.47ft
1 Georgia 1023mi 1023mi
2 Newyork 2022 NY 74.6FT 74.6FT
3 Indiana 747MI(In) 747MI
4 Florida 453mi FL 453mi
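A variation on the same idea, as a sketch under a few assumptions that are not in the original answer: the number is matched explicitly as digits with an optional decimal part, optional whitespace is allowed before the unit, and re.IGNORECASE replaces listing both upper- and lower-case units. The names units and pattern are made up for illustration.

```python
import re
import pandas as pd

df1 = pd.DataFrame({"State": [
    "Arizona 4.47ft", "Georgia 1023mi", "Newyork 2022 NY 74.6 FT",
    "Indiana 747MI(In)", "Florida 453mi FL",
]})

units = ["ft", "mi"]
# number (optional decimals), optional whitespace, a unit, then a word boundary
pattern = r"(\d+(?:\.\d+)?\s*(?:%s))\b" % "|".join(map(re.escape, units))

# expand=False returns a Series because the pattern has one capture group
df1["Distance"] = df1["State"].str.extract(
    pattern, flags=re.IGNORECASE, expand=False
)
print(df1)
```

Here "2022 NY" is skipped because NY is not one of the units, while "74.6 FT" matches even though a space separates the number from the unit.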

Constructing a dataframe with multiple columns based on str conditions using a loop - python

I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:
2 Crockett, Houston County, Texas, 75835, USA
3 NYC, New York, USA
4 Warszawa, mazowieckie, RP
5 Texas, USA
6 Virginia Beach, Virginia, 23451, USA
7 Louisville, Jefferson County, Kentucky, USA
I would like to construct state dummies for all USA states by using a loop.
I have managed to extract users from the USA using
location_usa = location_df['location'].str.contains('usa', case = False)
However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings.
Also, I am unable to use
pd.Series.str.get_dummies()
as there are different locations within the same state and each entry is a whole sentence.
I would like the output to look something like this:
Alabama Alaska Arizona
1 0 0 1
2 0 1 0
3 1 0 0
4 0 0 0
Or the same with Boolean values.

Use .str.extract to get a Series of the states, and then use pd.get_dummies on that Series. You will need to define a list of all 50 states (only four are shown here):
import pandas as pd
states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(','))
Kentucky New York Texas Virginia
0 0 0 1 0
1 0 1 0 0
2 0 0 0 0
3 0 0 1 0
4 0 0 0 1
5 1 0 0 0
Note that I matched on states followed by a ',' since that seems to be the pattern. It avoids false matches such as 'Virginia' in 'Virginia Beach', or more problematic ones like 'Washington County, Minnesota'.
If you expect multiple states to match on a single line, use .extractall instead and sum across the 0th index level (groupby(level=0).sum() is used here because sum(level=0) was removed in pandas 2.0):
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(',')).groupby(level=0).sum().clip(upper=1)
Edit:
Perhaps there are better ways, but as suggested by @BradSolomon it is a bit safer to match on 'State,( optional 5 digit Zip,) USA':
states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x + r',?(\s\d{5},)?\sUSA' for x in states) + ')'
s = df.col1.str.extract(pat)[0].str.split(',').str[0]
Output: s
0 Texas
1 New York
2 NaN
3 Texas
4 Virginia
5 Kentucky
6 Pennsylvania
Name: 0, dtype: object
from Input
col1
0 Crockett, Houston County, Texas, 75835, USA
1 NYC, New York, USA
2 Warszawa, mazowieckie, RP
3 Texas, USA
4 Virginia Beach, Virginia, 23451, USA
5 Louisville, Jefferson County, Kentucky, USA
6 California, Pennsylvania, USA
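Putting the pieces together, here is a sketch of going from the location strings to a 0/1 table. It assumes the column is named col1 as in the answer, and that a state is always followed by ", USA" or by a 5-digit ZIP and ", USA". The names alternation, state, and dummies are made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"col1": [
    "Crockett, Houston County, Texas, 75835, USA",
    "NYC, New York, USA",
    "Warszawa, mazowieckie, RP",
    "Texas, USA",
    "Virginia Beach, Virginia, 23451, USA",
    "Louisville, Jefferson County, Kentucky, USA",
]})
states = ["Texas", "New York", "Kentucky", "Virginia"]

# One capture group for the state; the ZIP part is non-capturing.
alternation = "|".join(re.escape(name) for name in states)
pat = rf"({alternation}),(?:\s\d{{5}},)?\sUSA"

state = df["col1"].str.extract(pat, expand=False)  # NaN where nothing matches

# Cast to int for 0/1 output, then reindex so every state gets a column
# even if it never occurs in the data.
dummies = (
    pd.get_dummies(state)
    .astype(int)
    .reindex(columns=sorted(states), fill_value=0)
)
print(dummies)
```

Rows with no recognised state, such as the Warsaw entry, come out as all zeros.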

Pandas groupby using function variable

I have this dataframe:
iata airport city state country lat \
0 00M Thigpen Bay Springs MS USA 31.953765
1 00R Livingston Municipal Livingston TX USA 30.685861
2 00V Meadow Lake Colorado Springs CO USA 38.945749
3 01G Perry-Warsaw Perry NY USA 42.741347
4 01J Hilliard Airpark Hilliard FL USA 30.688012
I am trying to get the number of airports per state. For example, if I have the function:
def f(dataframe, state):
    return result
where state would be a state abbreviation, such as 'MA'. I am trying to group the dataframe by the input variable, such as state ('MA'), to then get the number of airports per state.
When I use:
df.groupby(state)['airport'].value_counts()
or
df.groupby(state)['airport'].value_counts()/df['airport'].count()
df.groupby(['state'] == state)['airport'].value_counts()/df['airport'].count()
The last two are regarding the conditional probability a selected airport will be in that state.
It throws KeyError: 'MA', which I think is because the input variable is not a column name but a value in the column.
Is there a way to get the number of airports per state?

I would use Pandas's nunique to get the number of airports per state. The code is easier to read and remember.
To illustrate my point, I modified the dataset as follows, adding two fictional airports so that Florida has three:
iata airport city state country lat
0 00M Thigpen Bay Springs MS USA 31.953765
1 00R Livingston Municipal Livingston TX USA 30.685861
2 00V Meadow Lake Springs CO USA 38.945749
3 01G Perry-Warsaw Perry NY USA 42.741347
4 01J Hilliard Airpark Hilliard FL USA 30.688012
5 f234 Weirdviller Chilliard FL USA 30.788012
6 23r2 Johnson Billiard FL USA 30.888012
Then, we write:
df.groupby('state').iata.nunique()
to get the following results:
state
CO 1
FL 3
MS 1
NY 1
TX 1
Name: iata, dtype: int64
Hope this helps.

Assuming each record is an airport throughout, you can just count the records for each state / country combination:
df.groupby(['country','state']).size()

You can rewrite this as an explicit groupby apply:
In [11]: df.groupby("state")["airport"].apply(lambda x: x.value_counts() / len(x))
Out[11]:
state
CO Meadow Lake 1.0
FL Hilliard Airpark 1.0
MS Thigpen 1.0
NY Perry-Warsaw 1.0
TX Livingston Municipal 1.0
Name: airport, dtype: float64
or store the groupby and reuse it (probably this is faster):
In [21]: g = df.groupby("state")["airport"]
In [22]: g.value_counts() / g.size()
Out[22]:
state airport
CO Meadow Lake 1.0
FL Hilliard Airpark 1.0
MS Thigpen 1.0
NY Perry-Warsaw 1.0
TX Livingston Municipal 1.0
Name: airport, dtype: float64

This worked the way I intended, with all your help. state is the input, a state abbreviation such as 'MA'. The function returns the probability that a randomly selected airport belongs to that state.
def f(df, state):
    a = df.groupby('state').iata.nunique()
    s = a.sum()
    result = a[state] / s
    return result
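For reference, the KeyError in the question arises because groupby expects a column label (or an array of keys), so passing the value 'MA' makes pandas look for a column named 'MA'. A compact alternative sketch, assuming one row per airport: value_counts(normalize=True) gives each state's share directly. The function name state_share is made up, and the frame below keeps only a few of the question's columns.

```python
import pandas as pd

df = pd.DataFrame({
    "iata": ["00M", "00R", "00V", "01G", "01J"],
    "airport": ["Thigpen", "Livingston Municipal", "Meadow Lake",
                "Perry-Warsaw", "Hilliard Airpark"],
    "state": ["MS", "TX", "CO", "NY", "FL"],
})

def state_share(frame, state):
    """Fraction of rows (airports) whose state column equals `state`."""
    shares = frame["state"].value_counts(normalize=True)
    # .get avoids a KeyError for states that never appear in the data
    return float(shares.get(state, 0.0))

print(state_share(df, "FL"))
print(state_share(df, "MA"))
```

Returning 0.0 for an absent state is a design choice in this sketch. Raising an error instead may be preferable if a missing state indicates bad input.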

str.extract starting from the back in pandas DataFrame

I have a DataFrame with thousands of rows and two columns like so:
string state
0 the best new york cheesecake rochester ny ny
1 the best dallas bbq houston tx random str tx
2 la jolla fish shop of san diego san diego ca ca
3 nothing here dc
For each state, I have a regular expression of all city names (in lower case) structured like (city1|city2|city3|...) where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).
I want to find out what the last appearing city in the string is (for observations 1, 2, 3, 4, I'd want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).
I started off with str.extract and was trying to think of things like reversing the string but have reached an impasse.
Thanks so much for any help!

You can use str.findall, but it returns an empty list when nothing matches, so apply is needed to substitute a placeholder. Then select the last item of each list with str[-1]:
cities = r"new york|dallas|rochester|houston|san diego"
print(df['string'].str.findall(cities)
                  .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                  .str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
(The condition must be >= 1 rather than > 1, so that rows with exactly one match keep that match.)
Another solution is a bit of a hack: prepend the no-match string to each string with radd, and add that string to cities too:
a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a
print (df['string'].radd(a).str.findall(cities).str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object

import re
cities = r"new york|dallas|..."
def last_match(s):
    found = re.findall(cities, s)
    return found[-1] if found else ""
df['string'].apply(last_match)
#0 rochester
#1 houston
#2 san diego
#3
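Another option, offered as a sketch rather than something from the answers above, stays with str.extract: a greedy .* in front of the capture group consumes as much of the string as possible, so the group ends up on the last occurrence. The name last_city is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"string": [
    "the best new york cheesecake rochester ny",
    "the best dallas bbq houston tx random str",
    "la jolla fish shop of san diego san diego ca",
    "nothing here",
]})

cities = r"new york|dallas|rochester|houston|san diego"

# Greedy .* backtracks from the end of the string, so the first
# successful match of the group is the last city in the string.
last_city = df["string"].str.extract(rf".*({cities})", expand=False)
print(last_city)
```

Rows with no city come back as NaN, which can be replaced afterwards with fillna if a placeholder is preferred.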

Appending or Concatenating DataFrame via for loop to existing DataFrame

As the output below shows, this code takes the Location column (a Series) and places it in a DataFrame. The nested for loop then takes the first index of each column and creates a DataFrame to add to the first one. What I have been trying to do is loop through, going up one index on each pass, and add a new DataFrame of repetitive data each time. However, when I print the result, it contains only the first DataFrame and the last repetitive DataFrame that the loop produced. I am trying to build one large DataFrame that attaches a repetitive-index DataFrame for each index from 0 to 17. I have updated this to show the repetitiveness that I am looking for, in truncated form. I hope this helps. Thanks!
Here is the input
for j in range(0,18,1):
    for i in range(0,18,1):
        df['Rep Loc'] = str(df['Location'][j:j+1])
        df['Rep Lat'] = float(df['Latitude'][j:j+1])
        df['Rep Long'] = float(df['Longitude'][j:j+1])
        break
    print(df)
Here is the output
Location Latitude Longitude
0 Letsholathebe II Rd, Maun, North-West District... -19.989491 23.397709
1 North-West District, Botswana -19.389353 23.267951
2 Silobela, Kwekwe, Midlands Province, Zimbabwe -18.993930 29.147992
3 Mosi-Oa-Tunya, Livingstone, Southern Province,... -17.910147 25.861904
4 Parkway Drive, Victoria Falls, Matabeleland No... -17.909231 25.827019
5 A33, Kasane, North-West District, Botswana -17.795057 25.197270
6 T1, Southern Province, Zambia -17.040664 26.608454
7 Sikoongo Road, Siavonga, Southern Province, Za... -16.536204 28.708753
8 New Kasama, Lusaka Province, Zambia -15.471934 28.398588
9 Simon Mwansa Kapwepwe Avenue, Avondale, Lusaka... -15.386244 28.397111
10 Lusaka, Lusaka Province, 1010, Zambia -15.416697 28.281381
11 Chigwirizano Road, Rhodes Park, Lusaka, Lusaka... -15.401848 28.302248
12 T2, Kabwe, Central Province, Zambia -14.420744 28.462169
13 Kabushi Road, Ndola, Copperbelt Province, Zambia -12.997968 28.608536
14 Dr Aggrey Avenue, Mishenshi, Kitwe, Copperbelt... -12.797684 28.199061
15 President Avenue, Kalulushi, Copperbelt Provin... -12.833375 28.108370
16 Eglise Methodiste Unie, Avenue Mantola, Mawawa... -11.699407 27.500234
17 Avenue Babemba, Kolwezi, Lwalaba, Katanga, Lua... -10.698109 25.503816
Rep Loc Rep Lat Rep Long
0 0 Letsholathebe II Rd, Maun, North-West Dis... -19.989491 23.397709
1 0 Letsholathebe II Rd, Maun, North-West Dis... -19.989491 23.397709
2 0 Letsholathebe II Rd, Maun, North-West Dis... -19.989491 23.397709
Rep Loc Rep Lat Rep Long
0 1 North-West District, Botswana\nName: Loca... -19.389353 23.267951
1 1 North-West District, Botswana\nName: Loca... -19.389353 23.267951
2 1 North-West District, Botswana\nName: Loca... -19.389353 23.267951
Rep Loc Rep Lat Rep Long
0 2 Silobela, Kwekwe, Midlands Province, Zimb... -18.99393 29.147992
1 2 Silobela, Kwekwe, Midlands Province, Zimb... -18.99393 29.147992
Rep Loc Rep Lat Rep Long
0 3 Mosi-Oa-Tunya, Livingstone, Southern Prov... -17.910147 25.861904
1 3 Mosi-Oa-Tunya, Livingstone, Southern Prov... -17.910147 25.861904
2 3 Mosi-Oa-Tunya, Livingstone, Southern Prov... -17.910147 25.861904
Rep Loc Rep Lat Rep Long
0 4 Parkway Drive, Victoria Falls, Matabelela... -17.909231 25.827019
1 4 Parkway Drive, Victoria Falls, Matabelela... -17.909231 25.827019
2 4 Parkway Drive, Victoria Falls, Matabelela... -17.909231 25.827019
Rep Loc Rep Lat Rep Long
0 5 A33, Kasane, North-West District, Botswan... -17.795057 25.19727
1 5 A33, Kasane, North-West District, Botswan... -17.795057 25.19727
2 5 A33, Kasane, North-West District, Botswan... -17.795057 25.19727

It is good practice when asking a question to provide an example of what you want your output to look like. That said, this is my best guess at what you want (d here is your DataFrame):
pd.concat({i: d.shift(i) for i in range(18)}, axis=1)
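If the goal is instead one tall DataFrame in which every original row is paired with the values of row j, for each j in turn, a common pattern is to collect the pieces in a list and concatenate once at the end. This is only a sketch of one reading of the question. The three-row frame and its values are made up, and pieces and combined are illustrative names. Assigning to df['Rep Loc'] inside the loop, as in the question, overwrites the same columns on every pass, which would explain why only the last iteration survives.

```python
import pandas as pd

# Small made-up stand-in for the 18-row frame in the question.
df = pd.DataFrame({
    "Location": ["A", "B", "C"],
    "Latitude": [-19.9, -19.3, -18.9],
    "Longitude": [23.3, 23.2, 29.1],
})

pieces = []
for j in range(len(df)):
    piece = df.copy()
    # Scalars broadcast down the column, repeating row j's values.
    piece["Rep Loc"] = df["Location"].iloc[j]
    piece["Rep Lat"] = df["Latitude"].iloc[j]
    piece["Rep Long"] = df["Longitude"].iloc[j]
    pieces.append(piece)

# One concat at the end instead of growing a frame inside the loop.
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Using .iloc[j] returns the scalar value itself, which avoids the "\nName: Loca..." text that str() of a one-row slice produces in the output above.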
