Extracting specific text from a column in a pandas dataframe - python

I have a pandas dataframe with a State column, from which I need to extract the token ending in one of [ft, mi, FT, MI] using a regular expression and store it in another column.
import pandas as pd

df1 = pd.DataFrame({
    'State': ['Arizona 4.47ft', 'Georgia 1023mi', 'Newyork 2022 NY 74.6FT',
              'Indiana 747MI(In)', 'Florida 453mi FL']})
Expected output
State Distance
0 Arizona 4.47ft 4.47ft
1 Georgia 1023mi 1023mi
2 Newyork 2022 NY 74.6FT 74.6FT
3 Indiana 747MI(In) 747MI
4 Florida 453mi FL 453mi
Would anyone please help?

Build a regex pattern from the list l, then use str.extract to pull the first occurrence of that pattern out of the State column:
l = ['ft','mi','FT','MI']
df1['Distance'] = df1['State'].str.extract(r'(\S+(?:%s))\b' % '|'.join(l))
State Distance
0 Arizona 4.47ft 4.47ft
1 Georgia 1023mi 1023mi
2 Newyork 2022 NY 74.6FT 74.6FT
3 Indiana 747MI(In) 747MI
4 Florida 453mi FL 453mi
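If the list exists only to cover case variants of the units, a shorter route is a case-insensitive flag; a minimal sketch, assuming the df1 built above:

import re

# Same idea, but let re.IGNORECASE cover ft/FT and mi/MI in one pattern
df1['Distance'] = df1['State'].str.extract(r'(\S+(?:ft|mi))\b', flags=re.IGNORECASE)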

Related

How to conditionally filter a Pandas dataframe

I have a Pandas dataframe that looks like this:
import pandas as pd
df = pd.DataFrame({
    'city': ['New York', 'New York', 'New York', 'Los Angeles', 'Los Angeles',
             'Houston', 'Houston', 'Houston'],
    'airport': ['LGA', 'EWR', 'JFK', 'LAX', 'BUR', 'IAH', 'HOU', 'EFD'],
    'distance': [38, 50, 32, 8, 50, 90, 78, 120]
})
df
city airport distance
0 New York LGA 38
1 New York EWR 50
2 New York JFK 32
3 Los Angeles LAX 8
4 Los Angeles BUR 50
5 Houston IAH 90
6 Houston HOU 78
7 Houston EFD 120
I would like to output a separate dataframe based on the following logic:
if the distance between a given city and an associated airport is 40 or less, then keep the row
if, within a given city, there is no distance of 40 or less, then keep only the row with the shortest (lowest) distance
The desired dataframe would look like this:
city airport distance
0 New York LGA 38
1 New York JFK 32
3 Los Angeles LAX 8
4 Houston HOU 78 <-- this is returned, even though it's more than 40
How would I do this?
Thanks!
In your case, sort by distance, use drop_duplicates to keep each city's shortest row, then combine_first with the rows at or under 40:
out = df.sort_values('distance').drop_duplicates('city').combine_first(df.loc[df['distance']<=40])
Out[228]:
city airport distance
0 New York LGA 38
2 New York JFK 32
3 Los Angeles LAX 8
6 Houston HOU 78
Another possible solution, which is based on the following ideas:
Create a dataframe that only contains rows where distance is less than or equal to 40.
Create another dataframe whose rows correspond to the minimum of distance per group of cities.
Concatenate the above two dataframes.
Remove the duplicates.
(pd.concat([df.loc[df.distance.le(40)],
            df.loc[df.groupby('city')['distance'].idxmin()]])
 .drop_duplicates()
)
Output:
city airport distance
0 New York LGA 38
2 New York JFK 32
3 Los Angeles LAX 8
6 Houston HOU 78
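A more explicit variant, in case you prefer plain boolean masks; a sketch assuming the df above (the names near/far/fallback are just illustrative):

# Rows already within 40 of their city
near = df[df['distance'].le(40)]
# For cities with no row within 40, fall back to the shortest distance
far = df[~df['city'].isin(near['city'])]
fallback = far.loc[far.groupby('city')['distance'].idxmin()]
out = pd.concat([near, fallback]).sort_index()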

How to split a column into two on a comma delimiter, putting a value without a comma in the second column rather than the first?

I have a column in a df that I want to split into two columns on a comma delimiter. If a value does not contain a comma, I want to put it into the second column instead of the first.
Origin
New York, USA
England
Russia
London, England
California, USA
USA
I want the result to be:
     Location  Country
0    New York      USA
1         NaN  England
2         NaN   Russia
3      London  England
4  California      USA
5         NaN      USA
I used this code, but it does not move comma-less values into the second column:
df['Location'], df['Country'] = df['Origin'].str.split(',', 1)
We can try using str.extract here:
df["Location"] = df["Origin"].str.extract(r'(.*),')
df["Country"] = df["Origin"].str.extract(r'(\w+(?: \w+)*)$')
Here is a way using str.extract() and named groups:
df['Origin'].str.extract(r'(?P<Location>[A-Za-z ]+(?=,))?(?:, )?(?P<Country>\w+)')
Output:
Location Country
0 New York USA
1 NaN England
2 NaN Russia
3 London England
4 California USA
5 NaN USA
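If you'd rather stay with str.split, here is a hedged sketch that fixes the original attempt by shifting comma-less values into Country, assuming df has the Origin column above:

parts = df['Origin'].str.split(',', n=1, expand=True)
has_comma = parts[1].notna()
# Location only exists when there was a comma; otherwise the whole value is the country
df['Location'] = parts[0].where(has_comma)
df['Country'] = parts[1].where(has_comma, parts[0]).str.strip()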

Group by one column and take count of multiple categories in Pandas

I have a dataset, df, where I would like to group by one column and then take the count of each category within a second column:
name location sku
svc1 ny hey1
svc2 ny hey1
svc3 ny hey1
svc4 ny hey1
lo1 ny ok1
lo2 ny ok1
fab1 ny hi
fab2 ny hi
fab3 ny hi
hello ca no
hello ca no
desired
location sku count
ny hey1 4
ny ok1 2
ny hi 3
ca no 2
doing
df2 = pd.DataFrame()
df2['sku'] = df.groupby('location')['sku'].nth(0)
df2['count'] = df.groupby('sku').count()
However, I am getting NaN for count, and I am not getting all of the data listed under sku.
Any suggestion is appreciated.
You are looking to group by two columns:
df.groupby(['location','sku']).size().reset_index(name='count')
Or groupby one column and value_counts the other:
# this should be slightly faster
(df.groupby('location')['sku'].value_counts()
.reset_index(name='count'))
Output:
location sku count
0 ca no 2
1 ny hey1 4
2 ny hi 3
3 ny ok1 2
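pd.crosstab is another way to get the same counts; a sketch assuming the same df, stacked back into the long format shown above:

out = (pd.crosstab(df['location'], df['sku'])
         .stack()
         .rename('count')
         .reset_index())
out = out[out['count'] > 0]  # crosstab also emits zero-count pairs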

How to categorize data in pandas using contained keywords

Let df be the dataframe as follows:
date text
0 2019-6-7 London is good.
1 2019-5-8 I am going to Paris.
2 2019-4-4 Do you want to go to London?
3 2019-3-7 I love Paris!
I would like to add a column city, which indicates the city contained in text, that is,
date text city
0 2019-6-7 London is good. London
1 2019-5-8 I am going to Paris. Paris
2 2019-4-4 Do you want to go to London? London
3 2019-3-7 I love Paris! Paris
How to do it without using lambda?
You can first make sure you have the list of cities, then use str.findall:
df.text.str.findall('London|Paris').str[0]
Out[320]:
0 London
1 Paris
2 London
3 Paris
Name: text, dtype: object
df['city'] = df.text.str.findall('London|Paris').str[0]
Adding to @WenYoBen's method: if each text contains only one of Paris or London, then str.extract is better:
regex = '(London|Paris)'
df['city'] = df.text.str.extract(regex)
df
date text city
0 2019-6-7 London is good. London
1 2019-5-8 I am going to Paris. Paris
2 2019-4-4 Do you want to go to London? London
3 2019-3-7 I love Paris! Paris
And if you want all the cities from your regex that appear in a text, then str.extractall is an option too:
df['city'] = df.text.str.extractall(regex).values
df
date text city
0 2019-6-7 London is good. London
1 2019-5-8 I am going to Paris. Paris
2 2019-4-4 Do you want to go to London? London
3 2019-3-7 I love Paris! Paris
Note that if a row contains multiple matches, extractall returns one row per match (under a MultiIndex), so the assignment above will no longer align; aggregate the matches first, as sketched below.
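A minimal sketch, assuming the df and regex above, that joins every match per row into one string before assigning:

# extractall yields one row per match; group on the original row label and join
df['city'] = (df.text.str.extractall(regex)[0]
                .groupby(level=0)
                .agg(', '.join))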

Constructing a dataframe with multiple columns based on str conditions using a loop - python

I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:
2 Crockett, Houston County, Texas, 75835, USA
3 NYC, New York, USA
4 Warszawa, mazowieckie, RP
5 Texas, USA
6 Virginia Beach, Virginia, 23451, USA
7 Louisville, Jefferson County, Kentucky, USA
I would like to construct state dummies for all USA states by using a loop.
I have managed to extract users from the USA using
location_usa = location_df['location'].str.contains('usa', case = False)
However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings.
Also I am unable to use
pd.Series.str.get_dummies()
as there are different locations within the same state and each entry is a whole sentence.
I would like the output to look something like this:
Alabama Alaska Arizona
1 0 0 1
2 0 1 0
3 1 0 0
4 0 0 0
Or the same with Boolean values.
Use .str.extract to get a Series of the states, and then use pd.get_dummies on that Series. You will need to define a list of all 50 states:
import pandas as pd
states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(','))
Kentucky New York Texas Virginia
0 0 0 1 0
1 0 1 0 0
2 0 0 0 0
3 0 0 1 0
4 0 0 0 1
5 1 0 0 0
Note that I matched on states followed by a ',', as that seems to be the pattern; it avoids false matches like 'Virginia Beach' for 'Virginia', or more problematic things like 'Washington County, Minnesota'.
If you expect multiple states to match on a single line, this becomes .extractall, summing across the 0th level:
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x + ',' for x in states) + ')')[0].str.strip(',')).groupby(level=0).sum().clip(upper=1)
Edit:
Perhaps there are better ways, but this can be a bit safer, as suggested by @BradSolomon, allowing matches on 'State,( optional 5-digit Zip,) USA':
states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x + r',?(?:\s\d{5},)?\sUSA' for x in states) + ')'
s = df.col1.str.extract(pat)[0].str.split(',').str[0]
Output of s:
0 Texas
1 New York
2 NaN
3 Texas
4 Virginia
5 Kentucky
6 Pennsylvania
Name: 0, dtype: object
given the input:
col1
0 Crockett, Houston County, Texas, 75835, USA
1 NYC, New York, USA
2 Warszawa, mazowieckie, RP
3 Texas, USA
4 Virginia Beach, Virginia, 23451, USA
5 Louisville, Jefferson County, Kentucky, USA
6 California, Pennsylvania, USA
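To also keep rows with no USA match (like Warszawa) as all-zero rows, and to guarantee a column for every state, you can reindex the dummies; a sketch assuming the states list, pat, and s from the edit above:

# NaN rows (no match) become all-zero rows; reindex adds unmatched states as zero columns
dummies = pd.get_dummies(s, dtype=int).reindex(columns=states, fill_value=0)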
