I have a pandas DataFrame with mixed formatting in a specific column. It contains the quarter and the year, and I'm hoping to split this column into separate columns. The problem is that the formatting has either a space or a second dash between the quarter and the year.
I'm hoping to write something that splits the column on a blank space or on a second dash.
df = pd.DataFrame({
'Qtr' : ['APR-JUN 2019','JAN-MAR 2019','JAN-MAR 2015','JUL-SEP-2020','OCT-DEC 2014','JUL-SEP-2015'],
})
out:
Qtr
0 APR-JUN 2019 # blank
1 JAN-MAR 2019 # blank
2 JAN-MAR 2015 # blank
3 JUL-SEP-2020 # second dash
4 OCT-DEC 2014 # blank
5 JUL-SEP-2015 # second dash
Split by blank space:
df[['Qtr', 'Year']] = df['Qtr'].str.split(' ', n=1, expand=True)
Split by second dash:
df[['Qtr', 'Year']] = df['Qtr'].str.split('-', n=1, expand=True)
intended output:
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
You can use a regular expression with the extract function of the string accessor.
df[['Qtr', 'Year']] = df['Qtr'].str.extract(r'(\w{3}-\w{3}).(\d{4})')
print(df)
Result
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
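If you want a stricter pattern than the `.` wildcard between the two groups, a variant of the same idea (a sketch, assuming the quarter is always two three-letter uppercase month codes) matches the space or dash separator explicitly:
df[['Qtr', 'Year']] = df['Qtr'].str.extract(r'([A-Z]{3}-[A-Z]{3})[ -](\d{4})')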
You can split with a regex that uses a positive lookbehind and a non-capturing group (?:..), then filter out the empty values and apply a pandas Series to the values:
>>> (df.Qtr.str.split(r'\s|(.+(?<=-).+)(?:-)')
       .apply(lambda x: [i for i in x if i])
       .apply(lambda x: pd.Series(x, index=['Qtr', 'Year']))
    )
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
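An equivalent and arguably simpler sketch normalizes the final separator to a space first (assuming the year is always the trailing four digits), then does a plain split:
df['Qtr'] = df['Qtr'].str.replace(r'-(?=\d{4}$)', ' ', regex=True)
df[['Qtr', 'Year']] = df['Qtr'].str.split(' ', expand=True)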
If, and only if, the data is always in the posted fixed-width format, you could use string slicing.
import pandas as pd
df = pd.DataFrame(
{
"Qtr": [
"APR-JUN 2019",
"JAN-MAR 2019",
"JAN-MAR 2015",
"JUL-SEP-2020",
"OCT-DEC 2014",
"JUL-SEP-2015",
],
}
)
df[['Qtr', 'Year']] = [(x[:7], x[8:12]) for x in df['Qtr']]  # chars 0-6 are the quarter, 8-11 the year
print(df)
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
Related
I have a pandas DataFrame with 2 columns: Year (int) and Condition (string). In the Condition column I have a NaN value, and I want to replace it based on information from a groupby operation.
import pandas as pd
import numpy as np
year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", 'excellent','excellent', np.nan, 'good','excellent', 'good']
X = pd.DataFrame({'year': year, 'condition': cond})
stat = X.groupby('year')['condition'].value_counts()
It gives:
print(X)
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 NaN
7 2016 good
8 2015 excellent
9 2015 good
print(stat)
year condition
2015 good 2
excellent 1
2016 good 3
excellent 1
2017 excellent 2
The NaN value in row 6 has year = 2015, and from stat I can see that for 2015 the most frequent condition is 'good', so I want to replace that NaN value with 'good'.
I have tried fillna and the .transform method but it does not work :(
I would be grateful for any help.
I did a little extra transformation to get stat as a dictionary mapping the year to its highest frequency name (credit to this answer):
In[0]:
fill_dict = stat.unstack().idxmax(axis=1).to_dict()
fill_dict
Out[0]:
{2015: 'good', 2016: 'good', 2017: 'excellent'}
Then use fillna with map based on this dictionary (credit to this answer):
In[0]:
X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
X
Out[0]:
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 good
7 2016 good
8 2015 excellent
9 2015 good
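As an aside, the transform approach from the question can also be made to work; a minimal sketch, assuming every year has at least one non-null condition:
X['condition'] = X['condition'].fillna(
    X.groupby('year')['condition'].transform(lambda s: s.mode()[0]))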
I know it is possible in arcpy. I'm trying to find out if it can be done in pandas.
I have the following:
data= {'Species':[ 'P.PIN','P.PIN','V.FOG', 'V.KOP', 'E.MON', 'E.CLA', 'E.KLI', 'D.FGH','W.ERT','S.MIX','P.PIN'],
'FY':[ '2002','2016','2018','2010','2009','2019','2017','2016','2018','2018','2016']}
I need to select all the P.PIN, P.RAD and any other species starting with E that have an FY equal to or greater than 2016, and put them into a new dataframe.
How can I get this done? All I am able to do is select P.PIN and P.RAD, but I'm having trouble adding in all the others starting with E:
df3 = df[(df['FY'] >= 2016) & (df['Species'].isin(['P.PIN', 'P.RAD']))]
Your help will be highly appreciated.
Here is a step-by-step way. You can also combine the logic inside a single np.where(); I just want to show how each condition is applied.
Start by typecasting your df['FY'] values as int so we can use the greater-than (>) operator.
>>> import pandas as pd, numpy as np
>>> df = pd.DataFrame(data)
>>> df['FY'] = df['FY'].astype(int)
>>> df['flag'] = np.where(df['Species'].isin(['P.PIN', 'P.RAD']), 'Take', 'Remove')
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Remove
6 E.KLI 2017 Remove
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> df['flag'] = np.where((df['FY'] > 2016) & (df['Species'].str.startswith('E')), 'Take', df['flag'])
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Take
6 E.KLI 2017 Take
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> new_df = df[df['flag'].isin(['Take'])][['Species', 'FY']]
>>> new_df
Species FY
0 P.PIN 2002
1 P.PIN 2016
5 E.CLA 2019
6 E.KLI 2017
10 P.PIN 2016
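For reference, the same selection can be written as a single boolean mask without the helper flag column (a sketch, again assuming df['FY'] has already been cast to int):
>>> mask = df['Species'].isin(['P.PIN', 'P.RAD']) | ((df['FY'] > 2016) & df['Species'].str.startswith('E'))
>>> new_df = df[mask][['Species', 'FY']]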
Hope this helps :D
I'm downloading data from FRED. I'm summing to get annual numbers, but I don't want incomplete years, so I need to sum only when the count of observations in a year is 12 (the series is monthly).
import pandas_datareader.data as web
mnemonic = 'RSFSXMV'
df = web.DataReader(mnemonic, 'fred', 2000, 2020)
df['year'] = df.index.year
new_df = df.groupby(["year"])[mnemonic].sum().reset_index()
print(new_df)
I don't want 2019 to show up.
In your case, use transform with nunique to make sure each year has 12 unique months; if not, we drop those rows before doing the groupby sum:
df['Month']=df.index.month
m=df.groupby('year').Month.transform('nunique')==12
new_df = df.loc[m].groupby(["year"])[mnemonic].sum().reset_index()
Or, using isin:
df['Month']=df.index.month
m=df.groupby('year').Month.nunique()
new_df = df.loc[df.year.isin(m.index[m == 12])].groupby(["year"])[mnemonic].sum().reset_index()
You could use the aggregate function count while doing the groupby:
df['year'] = df.index.year
df = df.groupby('year').agg({'RSFSXMV': 'sum', 'year': 'count'})
which will give you:
RSFSXMV year
year
2000 2487790 12
2001 2563218 12
2002 2641870 12
2003 2770397 12
2004 2969282 12
2005 3196141 12
2006 3397323 12
2007 3531906 12
2008 3601512 12
2009 3393753 12
2010 3541327 12
2011 3784014 12
2012 3934506 12
2013 4043037 12
2014 4191342 12
2015 4252113 12
2016 4357528 12
2017 4561833 12
2018 4810502 12
2019 2042147 5
Then simply drop those rows with a year count of less than 12.
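A minimal sketch of that final filter, continuing from the aggregation above (the year column now holds the month count):
df = df[df['year'] == 12]       # keep only complete years (12 monthly observations)
df = df.drop(columns='year')    # drop the helper count column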
I have a dataframe act with columns ['ids', 'start-yr', 'end-yr'].
I want to create another dataframe timeline with columns ['ids', 'years'],
using the act df. So if act has fields like:
ids start-yr end-yr
--------------------------------
'IAs728-ahe83j' 2014 2016
'J8273nbajsu-193h' 2012 2018
I want the timeline df to be populated like this:
ids years
------------------------
'IAs728-ahe83j' 2014
'IAs728-ahe83j' 2015
'IAs728-ahe83j' 2016
'J8273nbajsu-193h' 2012
'J8273nbajsu-193h' 2013
'J8273nbajsu-193h' 2014
'J8273nbajsu-193h' 2015
'J8273nbajsu-193h' 2016
'J8273nbajsu-193h' 2017
'J8273nbajsu-193h' 2018
My attempt so far:
timeline = pd.DataFrame(columns=['ids', 'years'])
cnt = 0
for ix, row in act.iterrows():
    for yr in range(int(row['start-yr']), int(row['end-yr']) + 1):
        timeline.loc[cnt, 'ids'] = row['ids']   # row-by-row assignment is slow
        timeline.loc[cnt, 'years'] = yr
        cnt += 1
But this is a very costly operation and far too time-consuming (which is obvious, I know). So what would be the best pythonic approach to populate a pandas df in a situation like this?
Any help is appreciated, thanks.
Use a list comprehension with range to build a list of tuples, then pass it to the DataFrame constructor:
a = [(i, x) for i, a, b in df.values for x in range(a, b + 1)]
df = pd.DataFrame(a, columns=['ids','years'])
print (df)
ids years
0 'IAs728-ahe83j' 2014
1 'IAs728-ahe83j' 2015
2 'IAs728-ahe83j' 2016
3 'J8273nbajsu-193h' 2012
4 'J8273nbajsu-193h' 2013
5 'J8273nbajsu-193h' 2014
6 'J8273nbajsu-193h' 2015
7 'J8273nbajsu-193h' 2016
8 'J8273nbajsu-193h' 2017
9 'J8273nbajsu-193h' 2018
If the DataFrame possibly has more columns, first filter the relevant ones by list:
c = ['ids','start-yr','end-yr']
a = [(i, x) for i, a, b in df[c].values for x in range(a, b + 1)]
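If you are on pandas 0.25 or later, an explode-based sketch gives the same result (assuming start-yr and end-yr are already ints):
act['years'] = [list(range(a, b + 1)) for a, b in zip(act['start-yr'], act['end-yr'])]
timeline = act[['ids', 'years']].explode('years').reset_index(drop=True)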
You can use this to create the dataframe:
xyz = pd.DataFrame({'release' : ['7 June 2013', '2012', '31 January 2013',
'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save it into 3 columns named "day", "month" and "year", using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
The resulting dataframe (shown as a screenshot in the original post) works perfectly when a row has 3 strings, but whenever there are fewer than 3 strings, it saves the data in the wrong place.
I have tried split and rsplit; both give the same result.
Any solution to get the data into the right place?
The last token is the year and it is present in every row, so it should be saved first; then the month, if present; and likewise the day.
You could reverse the split tokens so the year always comes first:
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
release year month day
0 7 June 2013 2013 June 7
1 2012 2012 NaN NaN
2 31 January 2013 2013 January 31
3 February 2008 2008 February NaN
4 17 June 2014 2014 June 17
5 2013 2013 NaN NaN
Try reversing the split tokens (note that a DataFrame has no .reverse() method, so .rsplit(expand=True).reverse() will not work):
tokens = dataframe['release'].str.split()
dataframe[['year', 'month', 'day']] = pd.DataFrame(
    [t[::-1] for t in tokens], index=dataframe.index)