How to save split data in pandas in reverse order? - python

You can use this to create the dataframe:
xyz = pd.DataFrame({'release' : ['7 June 2013', '2012', '31 January 2013',
'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save it into 3 columns named "day, month and year", using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
The resulting dataframe shows that it works perfectly when a row has 3 tokens, but whenever a row has fewer than 3 tokens, the values are saved in the wrong columns.
I have tried split and rsplit, both are giving the same result.
Any solution to get the data at the right place?
The year is the last token and is present in every row, so it should be saved first; then the month if it is present, otherwise nothing; and the day should be stored the same way.

You could reverse each row's token list:
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
release year month day
0 7 June 2013 2013 June 7
1 2012 2012 NaN NaN
2 31 January 2013 2013 January 31
3 February 2008 2008 February NaN
4 17 June 2014 2014 June 17
5 2013 2013 NaN NaN
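The row-wise apply above builds a pd.Series per row. One equivalent construction (a sketch of the same idea) reverses each token list up front and lets pd.DataFrame pad the shorter rows with missing values:

```python
import pandas as pd

dataframe = pd.DataFrame({'release': ['7 June 2013', '2012', '31 January 2013',
                                      'February 2008', '17 June 2014', '2013']})

# Reverse each token list so the year (always present) lands first;
# pd.DataFrame pads the shorter rows with NaN on the right.
parts = [s.split()[::-1] for s in dataframe['release']]
rev = pd.DataFrame(parts, index=dataframe.index, columns=['year', 'month', 'day'])
dataframe = dataframe.join(rev)
```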

Try reversing the result of the split. Note that a DataFrame has no reverse() method, so reverse each row's token list instead:
dataframe[['year','month','day']] = dataframe['release'].str.split().apply(lambda x: pd.Series(x[::-1]))

Related

Split columns by space or dash - python

I have a pandas df with mixed formatting for a specific column. It contains the qtr and year. I'm hoping to split this column into separate columns. But the formatting contains a space or a second dash between qtr and year.
I'm hoping to include a function that splits the column by a blank space or a second dash.
df = pd.DataFrame({
'Qtr' : ['APR-JUN 2019','JAN-MAR 2019','JAN-MAR 2015','JUL-SEP-2020','OCT-DEC 2014','JUL-SEP-2015'],
})
out:
Qtr
0 APR-JUN 2019 # blank
1 JAN-MAR 2019 # blank
2 JAN-MAR 2015 # blank
3 JUL-SEP-2020 # second dash
4 OCT-DEC 2014 # blank
5 JUL-SEP-2015 # second dash
split by blank
df[['Qtr', 'Year']] = df['Qtr'].str.split(' ', n=1, expand=True)
split by second dash
df[['Qtr', 'Year']] = df['Qtr'].str.split('-', n=1, expand=True)
intended output:
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
You can use a regular expression with the extract function of the string accessor; the . between the two groups matches the separator, whether it is a space or a dash.
df[['Qtr', 'Year']] = df['Qtr'].str.extract(r'(\w{3}-\w{3}).(\d{4})')
print(df)
Result
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
You can split with a regex using a lookbehind and a non-capturing group (?:..), then filter out the empty values, and apply a pandas Series on the values:
>>> (df.Qtr.str.split(r'\s|(.+(?<=-).+)(?:-)')
.apply(lambda x: [i for i in x if i])
.apply(lambda x: pd.Series(x, index=['Qtr', 'Year']))
)
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
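Another way to phrase the same idea (a sketch): split on the single space or dash that immediately precedes the trailing 4-digit year, using a lookahead so only that last separator is consumed:

```python
import pandas as pd

df = pd.DataFrame({'Qtr': ['APR-JUN 2019', 'JAN-MAR 2019', 'JAN-MAR 2015',
                           'JUL-SEP-2020', 'OCT-DEC 2014', 'JUL-SEP-2015']})

# Split on a space or dash only when it is followed by the final 4-digit year,
# so the dash inside the quarter label is never touched.
df[['Qtr', 'Year']] = df['Qtr'].str.split(r'[ -](?=\d{4}$)', regex=True, expand=True)
```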
If, and only if, the data is in the posted format you could use list slicing.
import pandas as pd
df = pd.DataFrame(
{
"Qtr": [
"APR-JUN 2019",
"JAN-MAR 2019",
"JAN-MAR 2015",
"JUL-SEP-2020",
"OCT-DEC 2014",
"JUL-SEP-2015",
],
}
)
df[['Qtr', 'Year']] = [(x[:7], x[8:12]) for x in df['Qtr']]
print(df)
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015

How to add a new column based on different conditions on other columns pandas

This is my dataframe:
Date Month
04/21/2019 April
07/03/2019 July
01/05/2018 January
09/23/2019 September
I want to add a column called fiscal year. A new fiscal year starts on 1st of July every year and ends on the last day of June. So for example if the year is 2019 and month is April, it is still fiscal year 2019. However, if the year is 2019 but month is anything after June, it will be fiscal year 2020. The resulting data frame should look like this:
Date Month FY
04/21/2019 April FY19
07/03/2019 July FY20
01/05/2019 January FY19
09/23/2019 September FY20
How do I achieve this?
One way using pandas.DateOffset: shifting every date forward six months rolls July-onward dates into the next calendar year, and "%y" gives the two-digit form the question asks for.
df["FY"] = (pd.to_datetime(df["Date"])
+ pd.DateOffset(months=6)).dt.strftime("FY%y")
print(df)
Output:
Date Month FY
0 04/21/2019 April FY19
1 07/03/2019 July FY20
2 01/05/2019 January FY19
3 09/23/2019 September FY20
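The same fiscal-year rule can be written as plain arithmetic, without any date offset (a sketch): months July through December belong to the next fiscal year, so add one to the year when the month is 7 or later.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['04/21/2019', '07/03/2019', '01/05/2019', '09/23/2019']})
d = pd.to_datetime(df['Date'])

# (d.dt.month >= 7) is a boolean Series; adding it bumps July-onward
# dates into the next fiscal year.
fy = d.dt.year + (d.dt.month >= 7)
df['FY'] = 'FY' + fy.astype(str).str[-2:]
```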
try via pd.PeriodIndex()+pd.to_datetime():
df['Date']=pd.to_datetime(df['Date'])
df['FY']=pd.PeriodIndex(df['Date'],freq='A-JUN').strftime("FY%y")
output:
Date Month FY
0 2019-04-21 April FY19
1 2019-07-03 July FY20
2 2019-01-05 January FY19
3 2019-09-23 September FY20
Note: I suggest you convert your 'Date' column to datetime first and then do any operation on it. If you don't want to convert the 'Date' column, use the above code in a single step:
df['FY']=pd.PeriodIndex(pd.to_datetime(df['Date']),freq='A-JUN').strftime("FY%y")
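freq='A-JUN' defines annual periods ending in June, and each period is labeled by its ending year, which is exactly the fiscal-year convention here. A quick boundary check (a sketch):

```python
import pandas as pd

# 2019-06-30 is the last day of the period ending June 2019;
# 2019-07-01 already falls in the period ending June 2020.
p = pd.PeriodIndex(pd.to_datetime(['2019-06-30', '2019-07-01']), freq='A-JUN')
labels = p.strftime('FY%y')
```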

Select the corresponding column value for the max value of a separate column (from a specific range of columns) of a pandas data frame

year month quantity
DateNew
2005-01 2005 January 49550
2005-02 2005 February 96088
2005-03 2005 March 28874
2005-04 2005 April 66917
2005-05 2005 May 24070
... ... ... ...
2018-08 2018 August 132629
2018-09 2018 September 104394
2018-10 2018 October 121305
2018-11 2018 November 121049
2018-12 2018 December 174984
This is the data frame that I have. I want to select the maximum quantity for each year and return the corresponding month for it.
I have tried this so far
df.groupby('year').max()
But that takes the maximum of every column independently, so the month column simply returns the alphabetically greatest month, which is why I get September for every year.
I have no clue how to approach the actual solution.
I think you want idxmax:
df.loc[df.groupby('year')['quantity'].idxmax()]
Output:
year month quantity
DateNew
2005-02 2005 February 96088
2018-12 2018 December 174984
Or just for the months:
df.loc[df.groupby('year')['quantity'].idxmax(), 'month']
Output:
DateNew
2005-02 February
2018-12 December
Name: month, dtype: object
Also, you can use sort_values followed by duplicated:
df.loc[~df.sort_values('quantity').duplicated('year', keep='last'), 'month']
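The sort-then-deduplicate idea can also be written with groupby(...).tail(1) (a sketch): after sorting by quantity, the last row within each year is that year's maximum, and the whole row is kept.

```python
import pandas as pd

# Small stand-in for the frame in the question.
df = pd.DataFrame(
    {'year': [2005, 2005, 2018, 2018],
     'month': ['January', 'February', 'November', 'December'],
     'quantity': [49550, 96088, 121049, 174984]},
    index=pd.Index(['2005-01', '2005-02', '2018-11', '2018-12'], name='DateNew'),
)

# After an ascending sort, each group's last row holds its maximum quantity.
result = df.sort_values('quantity').groupby('year').tail(1)
```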

Doing a pandas left merge with duplicate column names (want to delete left and keep right) [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
So let's say I have df_1
Day Month Amt
--------------- --------- ---------
Monday Jan 10
Tuesday Feb 20
Wednesday Feb 30
Thursday April 40
Friday April 50
and df_2
Month Amt
--------------- ---------
Jan 999
Feb 1000000
April 123456
I want to get the following result when I do a left merge:
Day Month Amt
--------------- --------- ---------
Monday Jan 999
Tuesday Feb 1000000
Wednesday Feb 1000000
Thursday April 123456
Friday April 123456
So basically the 'Amt' values from the right table replace the 'Amt' values from the left table where applicable.
When I try
df_1.merge(df_2,how = 'left',on = 'Month')
I get:
Day Month Amt_X Amt_Y
--------------- --------- --------- -------
Monday Jan 10 999
Tuesday Feb 20 1000000
Wednesday Feb 30 1000000
Thursday April 40 123456
Friday April 50 123456
Anyone know of a simple and efficient fix? Thanks!
This answer is purely supplemental to the duplicate target. That is a much more comprehensive answer than this.
Strategy #1
There are two components to this problem.
Use df_2 to create a mapping.
The intuitive way to do this is
mapping = df_2.set_index('Month')['Amt']
which creates a series object that can be passed to pd.Series.map
However, I'm partial to
mapping = dict(zip(df_2.Month, df_2.Amt))
Or even more obtuse
mapping = dict(zip(*map(df_2.get, df_2)))
Use pandas.Series.map
df_1.Month.map(mapping)
0 999
1 1000000
2 1000000
3 123456
4 123456
Name: Month, dtype: int64
Finally, you want to put that into the existing dataframe.
Create a copy
df_1.assign(Amt=df_1.Month.map(mapping))
Day Month Amt
0 Monday Jan 999
1 Tuesday Feb 1000000
2 Wednesday Feb 1000000
3 Thursday April 123456
4 Friday April 123456
Overwrite existing data
df_1['Amt'] = df_1.Month.map(mapping)
Strategy #2
To use merge most succinctly, drop the column that is to be replaced.
df_1.drop('Amt', axis=1).merge(df_2)
Day Month Amt
0 Monday Jan 999
1 Tuesday Feb 1000000
2 Wednesday Feb 1000000
3 Thursday April 123456
4 Friday April 123456
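If you'd rather keep the original merge call intact, the suffixes parameter can tag only the left-hand copy of the clashing column so it is easy to drop afterwards (a sketch):

```python
import pandas as pd

df_1 = pd.DataFrame({'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
                     'Month': ['Jan', 'Feb', 'Feb', 'April', 'April'],
                     'Amt': [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({'Month': ['Jan', 'Feb', 'April'],
                     'Amt': [999, 1000000, 123456]})

# Suffix only the left copy of 'Amt', keep the right one unsuffixed, then drop.
out = (df_1.merge(df_2, on='Month', how='left', suffixes=('_old', ''))
            .drop(columns='Amt_old'))
```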

How to fill missing values in a dataframe based on group value counts?

I have a pandas DataFrame with 2 columns: Year(int) and Condition(string). In column Condition I have a nan value and I want to replace it based on information from groupby operation.
import pandas as pd
import numpy as np
year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", 'excellent','excellent', np.nan, 'good','excellent', 'good']
X = pd.DataFrame({'year': year, 'condition': cond})
stat = X.groupby('year')['condition'].value_counts()
It gives:
print(X)
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 NaN
7 2016 good
8 2015 excellent
9 2015 good
print(stat)
year condition
2015 good 2
excellent 1
2016 good 3
excellent 1
2017 excellent 2
The NaN in row 6 has year = 2015, and from stat I can see that the most frequent condition for 2015 is 'good', so I want to replace that NaN with 'good'.
I have tried fillna and the .transform method, but I can't make it work :(
I would be grateful for any help.
I did a little extra transformation to get stat as a dictionary mapping the year to its highest frequency name (credit to this answer):
In[0]:
fill_dict = stat.unstack().idxmax(axis=1).to_dict()
fill_dict
Out[0]:
{2015: 'good', 2016: 'good', 2017: 'excellent'}
Then use fillna with map based on this dictionary (credit to this answer):
In[0]:
X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
X
Out[0]:
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 good
7 2016 good
8 2015 excellent
9 2015 good
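The .transform attempt from the question can be made to work too (a sketch): compute each year's most frequent condition with mode() inside a per-group transform, and feed the result into fillna.

```python
import pandas as pd
import numpy as np

year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ['good', 'good', 'excellent', 'good', 'excellent', 'excellent',
        np.nan, 'good', 'excellent', 'good']
X = pd.DataFrame({'year': year, 'condition': cond})

# mode() ignores NaN, so each group's most frequent label is broadcast
# back over the group and used to fill that group's own gaps.
most_frequent = X.groupby('year')['condition'].transform(lambda s: s.mode().iloc[0])
X['condition'] = X['condition'].fillna(most_frequent)
```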
