Python - Extract multiple values from string in pandas df - python

I've searched for an answer for the following question but haven't found the answer yet. I have a large dataset like this small example:
df =
A B
1 I bought 3 apples in 2013
3 I went to the store in 2020 and got milk
1 In 2015 and 2019 I went on holiday to Spain
2 When I was 17, in 2014 I got a new car
3 I got my present in 2018 and it broke down in 2019
What I would like is to extract all the values greater than 1950 and have this as an end result:
A B C
1 I bought 3 apples in 2013 2013
3 I went to the store in 2020 and got milk 2020
1 In 2015 and 2019 I went on holiday to Spain 2015_2019
2 When I was 17, in 2014 I got a new car 2014
3 I got my present in 2018 and it broke down in 2019 2018_2019
I tried to extract values first, but didn't get further than:
df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())
But all I get are error messages (I've only started python and working with texts a few weeks ago..). Could someone help me?

With single regex pattern (considering your comment "need the year it took place"):
In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})\b')
In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))
In [270]: df
Out[270]:
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019

Here's one way using str.findall and joining those items from the resulting lists that are greater than 1950:
s = df["B"].str.findall(r'\d+')
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i)> 1950))
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
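For reference, the findall approach can be put together as a self-contained sketch. The small frame below reproduces the question's data; the helper name years_after and its threshold parameter are hypothetical, introduced only for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "A": [1, 3, 1, 2, 3],
    "B": [
        "I bought 3 apples in 2013",
        "I went to the store in 2020 and got milk",
        "In 2015 and 2019 I went on holiday to Spain",
        "When I was 17, in 2014 I got a new car",
        "I got my present in 2018 and it broke down in 2019",
    ],
})

def years_after(text, threshold=1950):
    # Hypothetical helper: collect every run of digits,
    # then keep only those whose integer value exceeds the threshold.
    numbers = re.findall(r"\d+", text)
    return "_".join(n for n in numbers if int(n) > threshold)

df["C"] = df["B"].apply(years_after)
print(df[["A", "C"]])
```

Small numbers such as 3 and 17 are dropped by the integer comparison rather than by the pattern, which keeps the regex simple at the cost of one extra pass over each list.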

Related

I want to filter rows from data frame where the year is 2020 and 2021 using re.search and re.match functions

Data Frame:
Unnamed: 0 date target insult tweet year
0 1 2014-10-09 thomas-frieden fool Can you believe this fool, Dr. Thomas Frieden ... 2014
1 2 2014-10-09 thomas-frieden DOPE Can you believe this fool, Dr. Thomas Frieden ... 2014
2 3 2015-06-16 politicians all talk and no action Big time in U.S. today - MAKE AMERICA GREAT AG... 2015
3 4 2015-06-24 ben-cardin It's politicians like Cardin that have destroy... Politician #SenatorCardin didn't like that I s... 2015
4 5 2015-06-24 neil-young total hypocrite For the nonbeliever, here is a photo of #Neily... 2015
I want a data frame that contains only the rows where the year is 2020 or 2021, using the search and match methods.
df_filtered = df.loc[df.year.str.contains('2014|2015', regex=True) == True]
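One possible way to do this with re.search and re.match is sketched below. It assumes the year column holds integers (the .str accessor only works on string data), so the values are converted to strings first. The sample frame is hypothetical and much smaller than the real one.

```python
import re
import pandas as pd

# Hypothetical sample; the real frame has more columns and rows.
df = pd.DataFrame({
    "target": ["a", "b", "c", "d"],
    "year": [2014, 2020, 2021, 2015],
})

# Regex functions need strings, so convert the year column once.
years = df["year"].astype(str)

# re.search scans the whole string, so anchor both ends explicitly.
mask_search = years.apply(lambda s: re.search(r"^(?:2020|2021)$", s) is not None)

# re.match already anchors at the start, so only the end anchor is needed.
mask_match = years.apply(lambda s: re.match(r"(?:2020|2021)$", s) is not None)

df_filtered = df.loc[mask_search]
print(df_filtered)
```

If the regex functions are not a hard requirement, df[df["year"].isin([2020, 2021])] expresses the same filter without any string conversion.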

How to fill missing values in a dataframe based on group value counts?

I have a pandas DataFrame with 2 columns: Year(int) and Condition(string). In column Condition I have a nan value and I want to replace it based on information from groupby operation.
import pandas as pd
import numpy as np
year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", 'excellent','excellent', np.nan, 'good','excellent', 'good']
X = pd.DataFrame({'year': year, 'condition': cond})
stat = X.groupby('year')['condition'].value_counts()
It gives:
print(X)
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 NaN
7 2016 good
8 2015 excellent
9 2015 good
print(stat)
year condition
2015 good 2
excellent 1
2016 good 3
excellent 1
2017 excellent 2
The nan value in the row with index 6 has year = 2015, and from stat I can see that the most frequent condition for 2015 is 'good', so I want to replace this nan value with 'good'.
I have tried with fillna and .transform method but it does not work :(
I would be grateful for any help.
I did a little extra transformation to get stat as a dictionary mapping the year to its highest frequency name (credit to this answer):
In[0]:
fill_dict = stat.unstack().idxmax(axis=1).to_dict()
fill_dict
Out[0]:
{2015: 'good', 2016: 'good', 2017: 'excellent'}
Then use fillna with map based on this dictionary (credit to this answer):
In[0]:
X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
X
Out[0]:
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 good
7 2016 good
8 2015 excellent
9 2015 good
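Since the question mentions trying .transform, here is a minimal sketch of that route as an alternative to the dictionary approach. The helper fill_with_mode is a hypothetical name, and picking the first mode on ties is an assumption of this sketch (mode() returns tied values in sorted order).

```python
import numpy as np
import pandas as pd

year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", "excellent",
        "excellent", np.nan, "good", "excellent", "good"]
X = pd.DataFrame({"year": year, "condition": cond})

def fill_with_mode(group):
    # Hypothetical helper: fill NaN with the group's most frequent value.
    modes = group.mode()  # mode() ignores NaN by default
    if modes.empty:
        return group  # nothing to fill with when the whole group is NaN
    return group.fillna(modes.iloc[0])

# transform applies the helper per year and returns a Series aligned with X.
X["condition"] = X.groupby("year")["condition"].transform(fill_with_mode)
print(X)
```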

Select by Column Values

I know this is possible in arcpy. I am trying to find out whether it can be done in pandas.
I have the following:
data= {'Species':[ 'P.PIN','P.PIN','V.FOG', 'V.KOP', 'E.MON', 'E.CLA', 'E.KLI', 'D.FGH','W.ERT','S.MIX','P.PIN'],
'FY':[ '2002','2016','2018','2010','2009','2019','2017','2016','2018','2018','2016']}
I need to select all the P.PIN, P.RAD, and any other species starting with E that have a FY equal to or older than 2016, and put them into a new dataframe.
How can I get this done? I am able to select P.PIN and P.RAD, but I have trouble adding all the other species starting with E:
df3 =df[(df['FY']>=2016)&(df1['LastSpecies'].isin(['P.PIN','P.RAD']))]
Your help will be highly appreciated.
Here is a step-by-step way. You could also combine the logic inside a single np.where(); I just want to show how each condition is applied.
Start by casting the df['FY'] values to int so we can use numeric comparison operators on them.
>>> df['FY'] = df['FY'].astype(int)
>>> df['flag'] = np.where(df['Species'].isin(['P.PIN', 'P.RAD']), ['Take'], ['Remove'])
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Remove
6 E.KLI 2017 Remove
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> df['flag'] = np.where((df['FY'] >= 2016) & (df['Species'].str.startswith('E')), ['Take'], df['flag'])
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Take
6 E.KLI 2017 Take
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> new_df = df[df['flag'].isin(['Take'])][['Species', 'FY']]
>>> new_df
Species FY
0 P.PIN 2002
1 P.PIN 2016
5 E.CLA 2019
6 E.KLI 2017
10 P.PIN 2016
Hope this helps :D
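The same selection can also be written as a single boolean mask without the intermediate flag column. This is only a sketch: it assumes 2016 itself should be included for the E species (as in the question's own attempt with >= 2016) and, like the steps above, applies no year filter to P.PIN and P.RAD.

```python
import pandas as pd

data = {
    "Species": ["P.PIN", "P.PIN", "V.FOG", "V.KOP", "E.MON", "E.CLA",
                "E.KLI", "D.FGH", "W.ERT", "S.MIX", "P.PIN"],
    "FY": ["2002", "2016", "2018", "2010", "2009", "2019",
           "2017", "2016", "2018", "2018", "2016"],
}
df = pd.DataFrame(data)
df["FY"] = df["FY"].astype(int)  # strings cannot be compared with 2016

# Condition 1: the explicitly named species, regardless of year.
is_named = df["Species"].isin(["P.PIN", "P.RAD"])

# Condition 2: species starting with "E" from 2016 onwards (assumed inclusive).
is_recent_e = df["Species"].str.startswith("E") & (df["FY"] >= 2016)

new_df = df[is_named | is_recent_e]
print(new_df)
```

Each comparison needs its own parentheses when combined with & or |, because those operators bind more tightly than >= in Python.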

iterate over pandas dataframe and create another dataframe with repetitive records

I have a dataframe act with columns ['ids','start-yr','end-yr'].
I want to create another dataframe timeline with columns ['ids','years'] from the act df. So if act has these rows:
ids start-yr end-yr
--------------------------------
'IAs728-ahe83j' 2014 2016
'J8273nbajsu-193h' 2012 2018
I want the timeline df to be populated like this:
ids years
------------------------
'IAs728-ahe83j' 2014
'IAs728-ahe83j' 2015
'IAs728-ahe83j' 2016
'J8273nbajsu-193h' 2012
'J8273nbajsu-193h' 2013
'J8273nbajsu-193h' 2014
'J8273nbajsu-193h' 2015
'J8273nbajsu-193h' 2016
'J8273nbajsu-193h' 2017
'J8273nbajsu-193h' 2018
My attempt so far:
timeline = pd.DataFrame(columns=['ids','years'])
cnt = 0
for ix, row in act.iterrows():
    for yr in range(int(row['start-yr']), int(row['end-yr'])+1, 1):
        timeline[cnt, 'ids'] = row['ids']
        timeline[cnt, 'years'] = yr
        cnt += 1
But this is a very costly and time-consuming operation (which is obvious, I know). So what would be the best pythonic approach to populate a pandas df in a situation like this?
Any help is appreciated, thanks.
Use a list comprehension with range to build a list of tuples, then pass it to the DataFrame constructor (here df is the act frame from the question):
a = [(i, x) for i, a, b in df.values for x in range(a, b + 1)]
df = pd.DataFrame(a, columns=['ids','years'])
print (df)
ids years
0 'IAs728-ahe83j' 2014
1 'IAs728-ahe83j' 2015
2 'IAs728-ahe83j' 2016
3 'J8273nbajsu-193h' 2012
4 'J8273nbajsu-193h' 2013
5 'J8273nbajsu-193h' 2014
6 'J8273nbajsu-193h' 2015
7 'J8273nbajsu-193h' 2016
8 'J8273nbajsu-193h' 2017
9 'J8273nbajsu-193h' 2018
If the DataFrame may contain additional columns, select the three needed ones with a list first:
c = ['ids','start-yr','end-yr']
a = [(i, x) for i, a, b in df[c].values for x in range(a, b + 1)]
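As a variation on the list comprehension, the expansion can also be sketched with DataFrame.explode (available in pandas 0.25 and later). The frame below reproduces the question's two rows without the literal quote characters; this is one possible formulation, not necessarily faster than the comprehension above.

```python
import pandas as pd

act = pd.DataFrame({
    "ids": ["IAs728-ahe83j", "J8273nbajsu-193h"],
    "start-yr": [2014, 2012],
    "end-yr": [2016, 2018],
})

# One list of years per row, stored in an object column aligned with act.
year_lists = pd.Series(
    [list(range(s, e + 1)) for s, e in zip(act["start-yr"], act["end-yr"])],
    index=act.index,
)

timeline = (
    act.assign(years=year_lists)
       .explode("years")            # one output row per year in each list
       .astype({"years": int})      # explode leaves an object column
       [["ids", "years"]]
       .reset_index(drop=True)
)
print(timeline)
```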

Sum of a Column in a Dictionary of Dataframes

How can I work with a dictionary of dataframes please? Or, is there a better way to get an overview of my data? If I have for example:
Fruit Qty Year
Apple 2 2016
Orange 1 2017
Mango 2 2016
Apple 9 2016
Orange 8 2015
Mango 7 2016
Apple 6 2016
Orange 5 2017
Mango 4 2015
Then I am trying to find out how many in total I get per year, for example:
2015 2016 2017
Apple 0 17 0
Orange 8 0 6
Mango 4 9 0
I have written some code but it might not be useful:
import pandas as pd
# Fruit Data
df_1 = pd.DataFrame({'Fruit':['Apple','Orange','Mango','Apple','Orange','Mango','Apple','Orange','Mango'], 'Qty': [2,1,2,9,8,7,6,5,4], 'Year': [2016,2017,2016,2016,2015,2016,2016,2017,2015]})
# Create a list of Fruits
Fruits = df_1.Fruit.unique()
# Break down the dataframe by Year
df_2015 = df_1[df_1['Year'] == 2015]
df_2016 = df_1[df_1['Year'] == 2016]
df_2017 = df_1[df_1['Year'] == 2017]
# Create a dataframe dictionary of Fruits
Dict_2015 = {elem : pd.DataFrame for elem in Fruits}
Dict_2016 = {elem : pd.DataFrame for elem in Fruits}
Dict_2017 = {elem : pd.DataFrame for elem in Fruits}
# Store the Qty for each Fruit x each Year
for Fruit in Dict_2015.keys():
    Dict_2015[Fruit] = df_2015[:][df_2015.Fruit == Fruit]
for Fruit in Dict_2016.keys():
    Dict_2016[Fruit] = df_2016[:][df_2016.Fruit == Fruit]
for Fruit in Dict_2017.keys():
    Dict_2017[Fruit] = df_2017[:][df_2017.Fruit == Fruit]
You can use pandas.pivot_table.
res = df.pivot_table(index='Fruit', columns=['Year'], values='Qty',
                     aggfunc='sum', fill_value=0)
print(res)
Year 2015 2016 2017
Fruit
Apple 0 17 0
Mango 4 9 0
Orange 8 0 6
For guidance on usage, see How to pivot a dataframe.
jpp has already posted an answer in the format you wanted. However, since your question suggested you are open to other approaches, I thought I would share another way. It is not exactly the format you posted, but this is how I usually do it.
df = df.groupby(['Fruit', 'Year']).agg({'Qty': 'sum'}).reset_index()
This will look something like the following (only the Fruit/Year combinations that occur in the data appear, so there are no zero rows):
Fruit Year Qty
0 Apple 2016 17
1 Mango 2015 4
2 Mango 2016 9
3 Orange 2015 8
4 Orange 2017 6
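A groupby result can also be reshaped into the wide layout from the question by unstacking the Year level. The sketch below rebuilds the question's data and is one possible alternative to pivot_table; the variable names long_totals and wide_totals are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Orange", "Mango", "Apple", "Orange",
              "Mango", "Apple", "Orange", "Mango"],
    "Qty": [2, 1, 2, 9, 8, 7, 6, 5, 4],
    "Year": [2016, 2017, 2016, 2016, 2015, 2016, 2016, 2017, 2015],
})

# Long format: one total per (Fruit, Year) pair that occurs in the data.
long_totals = df.groupby(["Fruit", "Year"])["Qty"].sum()

# Wide format: years become columns; missing combinations are filled with 0.
wide_totals = long_totals.unstack(fill_value=0)
print(wide_totals)
```

This avoids building a separate dictionary of dataframes per year, since the grouping already keeps every fruit and year combination addressable via wide_totals.loc[fruit, year].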
