Dataframe: Rowwise calculations - python

I want to import an excel file into a dataframe. My dataframe (without the excel calculations) looks like this:
Jan Feb Mar Apr KPI (IF(SUM(JAN:APR)<0;0);SUM(JAN:APR))
5 -25 -20 5 0
15 24 11 -20 30
What is the best way to calculate the "KPI" column rowwise?

total = df.loc[:, "Jan": "Apr"].sum(axis=1)
df["KPI"] = total.where(total > 0, other=0)
First, get the total across the needed columns.
Then keep the total as is where it is > 0, and put 0 everywhere else.
or
df["KPI"] = np.where(total > 0, total, 0)
or
df["KPI"] = total * (total > 0)
to get
In [162]: df
Out[162]:
Jan Feb Mar Apr KPI
0 5 -25 -20 5 0
1 15 24 11 -20 30
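As a self-contained check of the three variants above, here is a minimal sketch that rebuilds the example frame by hand (the construction of df and the names kpi_where, kpi_np and kpi_mul are assumptions for illustration, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question (illustrative construction)
df = pd.DataFrame(
    {"Jan": [5, 15], "Feb": [-25, 24], "Mar": [-20, 11], "Apr": [5, -20]}
)

# Row-wise total over the month columns only
total = df.loc[:, "Jan":"Apr"].sum(axis=1)

# Variant 1: keep the total where it is positive, otherwise use 0
kpi_where = total.where(total > 0, other=0)

# Variant 2: the same rule with NumPy; wrap the array back into a Series
kpi_np = pd.Series(np.where(total > 0, total, 0), index=df.index)

# Variant 3: multiply by the boolean mask (False acts as 0, True as 1)
kpi_mul = total * (total > 0)

df["KPI"] = kpi_where
print(df)
```

All three should agree on this data; which one to prefer is mostly a matter of readability.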

Calculate the row-wise sum, then call mask with a lambda that flags negative totals (x < 0) so that they are replaced by zero.
>>> df['KPI']=df.sum(1).mask(lambda x: x<0, 0)
Jan Feb Mar Apr KPI
0 5 -25 -20 5 0.0
1 15 24 11 -20 30.0
Better solution: call sum then Series.clip:
df['KPI']=df.sum(1).clip(0)
Jan Feb Mar Apr KPI
0 5 -25 -20 5 0.0
1 15 24 11 -20 30.0
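One thing to watch with df.sum(1) is that it adds up every column, so running the assignment a second time would also include the existing KPI column. A hedged sketch that restricts the sum to the month columns first (the month_cols list is an assumption introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"Jan": [5, 15], "Feb": [-25, 24], "Mar": [-20, 11], "Apr": [5, -20]}
)

# Name the columns that feed the KPI so other columns are never summed
month_cols = ["Jan", "Feb", "Mar", "Apr"]

# clip(lower=0) raises every negative total to 0 and leaves the rest alone
df["KPI"] = df[month_cols].sum(axis=1).clip(lower=0)

# Re-running is safe: KPI itself is not part of month_cols
df["KPI"] = df[month_cols].sum(axis=1).clip(lower=0)
print(df)
```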

Related

How to find cumulative sum of specific column in CSV file

I have a csv file in the format:
20 05 2019 12:00:00, 100
21 05 2019 12:00:00, 200
22 05 2019 12:00:00, 480
I want to access the second variable. I've tried a variety of alterations, but none have worked.
Initially I tried:
import pandas as pd
import numpy as np
col = [i for i in range(2)]
col[1] = "Power"
data = pd.read_csv('FILENAME.csv', names=col)
df1 = data.sum(data, axis=1)
df2 = np.cumsum(df1)
print(df2)
You can use cumsum function:
data['Power'].cumsum()
Output:
0 100
1 300
2 780
Name: Power, dtype: int64
Use df.cumsum:
In [1820]: df = pd.read_csv('FILENAME.csv', names=col)
In [1821]: df
Out[1821]:
0 Power
0 20 05 2019 12:00:00 100
1 21 05 2019 12:00:00 200
2 22 05 2019 12:00:00 480
In [1823]: df['cumulative sum'] = df['Power'].cumsum()
In [1824]: df
Out[1824]:
0 Power cumulative sum
0 20 05 2019 12:00:00 100 100
1 21 05 2019 12:00:00 200 300
2 22 05 2019 12:00:00 480 780
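For a version that runs without a file on disk, the sketch below feeds the three sample rows through io.StringIO. The column names Timestamp and Power, and the timestamp format string, are assumptions based on the sample data in the question:

```python
import io

import pandas as pd

# The sample rows from the question, inlined instead of read from FILENAME.csv
csv_text = (
    "20 05 2019 12:00:00, 100\n"
    "21 05 2019 12:00:00, 200\n"
    "22 05 2019 12:00:00, 480\n"
)

# The file has no header row, so supply the column names explicitly
data = pd.read_csv(
    io.StringIO(csv_text),
    names=["Timestamp", "Power"],
    skipinitialspace=True,  # drop the blank after each comma
)

# Optional: parse the first column as datetimes (assumed day month year order)
data["Timestamp"] = pd.to_datetime(data["Timestamp"], format="%d %m %Y %H:%M:%S")

# Running total of the second column
data["Cumulative"] = data["Power"].cumsum()
print(data)
```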

How do I convert a pandas dataframe from wide to long while keeping the index?

I'd like to transform a dataframe from wide to long, going from many columns to two columns, while keeping the index. I've tried melt, as shown below. Please let me know what I'm missing.
N.B. The actual dataframe will have hundreds of columns, so I can't list them in the code.
create dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(3, 3)), columns=list('ABC'),index = ['jan','feb','mar'])
output:
A B C
jan 76 7 72
feb 29 15 69
mar 4 24 9
melt dataframe:
df2 = pd.melt(df.reset_index())
output:
variable value
0 index jan
1 index feb
2 index mar
3 A 76
4 A 29
5 A 4
6 B 7
7 B 15
8 B 24
9 C 72
10 C 69
11 C 9
desired output:
variable value
jan A 76
feb A 29
mar A 4
jan B 7
feb B 15
mar B 24
jan C 72
feb C 69
mar C 9
With df.melt, first reset the index (which creates a column named 'index'), melt with 'index' as the id variable, then set the 'index' column back as the index and drop its name:
df.reset_index().melt('index').set_index('index').rename_axis(None)
The same result is also possible with df.stack:
(df.stack().rename_axis([None,'variable']).reset_index(-1,name='value')
.sort_values('variable'))
variable value
jan A 76
feb A 29
mar A 4
jan B 7
feb B 15
mar B 24
jan C 72
feb C 69
mar C 9
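Another option, assuming pandas 1.1 or newer, is the ignore_index parameter of melt, which keeps the original index without the reset_index/set_index round trip. A minimal sketch using the numbers from the question in place of random data:

```python
import pandas as pd

# Fixed values from the question's printed frame (instead of np.random)
df = pd.DataFrame(
    {"A": [76, 29, 4], "B": [7, 15, 24], "C": [72, 69, 9]},
    index=["jan", "feb", "mar"],
)

# ignore_index=False repeats the original index for every melted column
long_df = df.melt(ignore_index=False)
print(long_df)
```

No column names are listed anywhere, so this should scale to hundreds of columns unchanged.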

Copy values in one column to other rows in that column after partitioning data

The table below shows a value column that is populated only in the first row of each id. I need to copy the value 100 to all other rows with id = 1, and the value 200 to all rows with id = 2.
id month value
1 jan 100
1 feb 0
1 mar 0
1 apr 0
2 jan 200
2 feb 0
2 mar 0
desired output:
id month value
1 jan 100
1 feb 100
1 mar 100
1 apr 100
2 jan 200
2 feb 200
2 mar 200
# Assumes each id's value is id * 100, as in the example data
for i in df['id'].unique():
    df.loc[df['id'] == i, 'value'] = i * 100
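If the values do not follow a formula, a more general sketch is to broadcast each group's first value with groupby and transform. This assumes the value to copy always sits in the first row of each id, as in the example table:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2, 2],
        "month": ["jan", "feb", "mar", "apr", "jan", "feb", "mar"],
        "value": [100, 0, 0, 0, 200, 0, 0],
    }
)

# transform("first") returns, for every row, the first value within its id group
df["value"] = df.groupby("id")["value"].transform("first")
print(df)
```

If the populated row is not guaranteed to come first but the other rows are always 0, transform("max") would be a reasonable alternative under that assumption.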

pandas: conditionally return a column's value

I am trying to make a new column called 'wage_rate' that fills in the appropriate wage rate for the employee based on the year of the observation.
In other words, my list looks something like this:
eecode year w2011 w2012 w2013
1 2012 7 8 9
1 2013 7 8 9
2 2011 20 25 25
2 2012 20 25 25
2 2013 20 25 25
And I want to return, in a new column, 8 for the first row, 9 for the second, then 20, 25, 25.
One way would be to use apply by constructing column name for each row based on year like 'w' + str(x.year).
In [41]: df.apply(lambda x: x['w' + str(x.year)], axis=1)
Out[41]:
0 8
1 9
2 20
3 25
4 25
dtype: int64
Details:
In [42]: df
Out[42]:
eecode year w2011 w2012 w2013
0 1 2012 7 8 9
1 1 2013 7 8 9
2 2 2011 20 25 25
3 2 2012 20 25 25
4 2 2013 20 25 25
In [43]: df['wage_rate'] = df.apply(lambda x: x['w' + str(x.year)], axis=1)
In [44]: df
Out[44]:
eecode year w2011 w2012 w2013 wage_rate
0 1 2012 7 8 9 8
1 1 2013 7 8 9 9
2 2 2011 20 25 25 20
3 2 2012 20 25 25 25
4 2 2013 20 25 25 25
values = [ row['w%s'% row['year']] for key, row in df.iterrows() ]
df['wage_rate'] = values # create the new columns
This solution is using an explicit loop, thus is likely slower than other pure-pandas solutions, but on the other hand it is simple and readable.
You can rename the wage columns so that their names match the values in the year column, using re.sub:
In [70]:
import re
df.columns = [re.sub(r'w(?=\d{4}$)', '', column) for column in df.columns]
In [80]:
df.columns
Out[80]:
Index([u'eecode', u'year', u'2011', u'2012', u'2013', u'wage_rate'], dtype='object')
then get the value using the following
df['wage_rate'] = df.apply(lambda x : x[str(x.year)] , axis = 1)
Out[79]:
eecode year 2011 2012 2013 wage_rate
1 2012 7 8 9 8
1 2013 7 8 9 9
2 2011 20 25 25 20
2 2012 20 25 25 25
2 2013 20 25 25 25
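The approaches above all evaluate a Python function once per row. For larger frames, a vectorized lookup with NumPy integer indexing may be faster; the sketch below is one way to do it on the original w-prefixed columns, and the names col_names and col_pos are introduced here for illustration only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "eecode": [1, 1, 2, 2, 2],
        "year": [2012, 2013, 2011, 2012, 2013],
        "w2011": [7, 7, 20, 20, 20],
        "w2012": [8, 8, 25, 25, 25],
        "w2013": [9, 9, 25, 25, 25],
    }
)

# Build the wanted column name for each row, e.g. "w2012"
col_names = "w" + df["year"].astype(str)

# Translate names to integer column positions (-1 would mean "not found")
col_pos = df.columns.get_indexer(col_names)
assert (col_pos >= 0).all(), "some year has no matching w<year> column"

# Pick one cell per row: row i, column col_pos[i]
df["wage_rate"] = df.to_numpy()[np.arange(len(df)), col_pos]
print(df)
```

The explicit check on col_pos matters because a position of -1 would otherwise silently select the last column.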

Pandas groupby month and year

I have the following dataframe:
Date abc xyz
01-Jun-13 100 200
03-Jun-13 -20 50
15-Aug-13 40 -5
20-Jan-14 25 15
21-Feb-14 60 80
I need to group the data by year and month. I.e., Group by Jan 2013, Feb 2013, Mar 2013, etc...
I will be using the newly grouped data to create a plot showing abc vs xyz per year/month.
I've tried various combinations of groupby and sum, but I just can't seem to get anything to work. How can I do it?
You can use either resample or Grouper (which resamples under the hood).
First make sure that the datetime column is actually of datetimes (hit it with pd.to_datetime). It's easier if it's a DatetimeIndex:
In [11]: df1
Out[11]:
abc xyz
Date
2013-06-01 100 200
2013-06-03 -20 50
2013-08-15 40 -5
2014-01-20 25 15
2014-02-21 60 80
In [12]: g = df1.groupby(pd.Grouper(freq="M")) # DataFrameGroupBy (grouped by Month)
In [13]: g.sum()
Out[13]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
In [14]: df1.resample("M").sum() # the same
Out[14]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
Note: Previously pd.Grouper(freq="M") was written as pd.TimeGrouper("M"). The latter is now deprecated since 0.21.
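On recent pandas releases two details differ from the transcript above: sum() over an empty month returns 0 unless min_count is set, and the month-end alias "M" has been deprecated in favour of "ME" since pandas 2.2. The sketch below sidesteps the alias question by using the month-start alias "MS", so the labels are the first day of each month rather than the last; that substitution is a choice made here, not the original answer's:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"abc": [100, -20, 40, 25, 60], "xyz": [200, 50, -5, 15, 80]},
    index=pd.to_datetime(
        ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"]
    ),
)
df1.index.name = "Date"

# Default behaviour: months without any rows sum to 0
zeros = df1.resample("MS").sum()

# min_count=1 requires at least one value, so empty months become NaN
monthly = df1.resample("MS").sum(min_count=1)
print(monthly)
```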
I had thought the following would work, but it doesn't (due to as_index not being respected? I'm not sure.). I'm including this for interest's sake.
If it's a column (it has to be a datetime64 column! as I say, hit it with to_datetime), you can use the PeriodIndex:
In [21]: df
Out[21]:
Date abc xyz
0 2013-06-01 100 200
1 2013-06-03 -20 50
2 2013-08-15 40 -5
3 2014-01-20 25 15
4 2014-02-21 60 80
In [22]: pd.DatetimeIndex(df.Date).to_period("M") # old way
Out[22]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-06, ..., 2014-02]
Length: 5, Freq: M
In [23]: per = df.Date.dt.to_period("M") # new way to get the same
In [24]: g = df.groupby(per)
In [25]: g.sum() # dang not quite what we want (doesn't fill in the gaps)
Out[25]:
abc xyz
2013-06 80 250
2013-08 40 -5
2014-01 25 15
2014-02 60 80
To get the desired result we have to reindex...
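One possible way to do that reindexing, sketched under the assumption that every month between the first and last observation should appear, is to build the full range with pd.period_range (the names monthly, full and filled are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"]
        ),
        "abc": [100, -20, 40, 25, 60],
        "xyz": [200, 50, -5, 15, 80],
    }
)

# Group by monthly period and sum only the numeric columns
per = df["Date"].dt.to_period("M")
monthly = df.groupby(per)[["abc", "xyz"]].sum()

# Every month from the first to the last observed period
full = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")

# Months with no data show up as NaN rows
filled = monthly.reindex(full)
print(filled)
```

Passing fill_value=0 to reindex would give zeros instead of NaN for the missing months.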
Keep it simple:
GB = DF.groupby([(DF.index.year), (DF.index.month)]).sum()
giving you,
print(GB)
abc xyz
2013 6 80 250
8 40 -5
2014 1 25 15
2 60 80
and then you can plot like asked using,
GB.plot('abc', 'xyz', kind='scatter')
There are different ways to do that.
I created the data frame to showcase the different techniques to filter your data.
df = pd.DataFrame({'Date': ['01-Jun-13', '03-Jun-13', '15-Aug-13', '20-Jan-14', '21-Feb-14'],
'abc': [100, -20, 40, 25, 60], 'xyz': [200, 50,-5, 15, 80] })
I separated months/year/day and separated month-year as you explained.
def getMonth(s):
    return s.split("-")[1]
def getDay(s):
    return s.split("-")[0]
def getYear(s):
    return s.split("-")[2]
def getYearMonth(s):
    return s.split("-")[1] + "-" + s.split("-")[2]
I created new columns: year, month, day and 'YearMonth'. In your case, you need only one of the two options: you can group using the two columns 'year' and 'month', or using the single column 'YearMonth'.
df['year'] = df['Date'].apply(lambda x: getYear(x))
df['month'] = df['Date'].apply(lambda x: getMonth(x))
df['day'] = df['Date'].apply(lambda x: getDay(x))
df['YearMonth'] = df['Date'].apply(lambda x: getYearMonth(x))
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
3 20-Jan-14 25 15 14 Jan 20 Jan-14
4 21-Feb-14 60 80 14 Feb 21 Feb-14
You can go through the different groups in groupby(..) items.
In this case, we are grouping by two columns:
# sort=False keeps the groups in order of first appearance
for key, g in df.groupby(['year', 'month'], sort=False):
    print(key, g)
Output:
('13', 'Jun') Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
('13', 'Aug') Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
('14', 'Jan') Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
('14', 'Feb') Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In this case, we are grouping by one column:
# sort=False keeps the groups in order of first appearance
for key, g in df.groupby('YearMonth', sort=False):
    print(key, g)
Output:
Jun-13 Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
Aug-13 Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
Jan-14 Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
Feb-14 Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In case you want to access a specific item, you can use get_group
print(df.groupby('YearMonth').get_group('Jun-13'))
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
As an alternative to get_group, you can filter with a boolean mask, which gives the same result:
print(df[df['YearMonth'] == 'Jun-13'])
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
You can select list of abc or xyz values during Jun-13
print(df[df['YearMonth'] == 'Jun-13'].abc.values)
print(df[df['YearMonth'] == 'Jun-13'].xyz.values)
Output:
[100 -20] #abc values
[200 50] #xyz values
You can use this to go through the dates that you have classified as "year-month" and apply criteria on it to get related data.
for x in set(df.YearMonth):
    print(df[df['YearMonth'] == x].abc.values)
    print(df[df['YearMonth'] == x].xyz.values)
You can also do it by creating a string column with the year and month as follows:
df['date'] = df.index
df['year-month'] = df['date'].apply(lambda x: str(x.year) + ' ' + str(x.month))
grouped = df.groupby('year-month')
However, this does not preserve chronological order when you loop over the groups, because the string keys sort lexicographically, e.g.
for name, group in grouped:
    print(name)
will give:
2007 11
2007 12
2008 1
2008 10
2008 11
2008 12
2008 2
2008 3
2008 4
2008 5
2008 6
2008 7
2008 8
2008 9
2009 1
2009 10
So, if you want to preserve the chronological order, group by the numeric year and month as suggested by @Q-man above:
grouped = df.groupby([df.index.year, df.index.month])
This will preserve the order in the above loop:
(2007, 11)
(2007, 12)
(2008, 1)
(2008, 2)
(2008, 3)
(2008, 4)
(2008, 5)
(2008, 6)
(2008, 7)
(2008, 8)
(2008, 9)
(2008, 10)
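If a single string key is still preferred, zero-padding the month makes lexicographic order coincide with chronological order. A small sketch with made-up dates (the frame and the column v are hypothetical):

```python
import pandas as pd

# Hypothetical data, deliberately out of chronological order
idx = pd.to_datetime(["2008-10-05", "2007-11-01", "2008-02-14", "2008-01-20"])
df = pd.DataFrame({"v": [1, 2, 3, 4]}, index=idx)

# "%Y-%m" yields keys such as "2008-02", so "2008-02" sorts before "2008-10"
keys = df.index.strftime("%Y-%m")

names = [name for name, group in df.groupby(keys)]
print(names)
```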
Some of the answers are using Date as an index instead of a column (and there's nothing wrong with doing that).
However, for anyone who has the dates stored as a column (instead of an index), remember to access the column's dt attribute. That is:
# First make sure `Date` is a datetime column
df['Date'] = pd.to_datetime(
    arg=df['Date'],
    format='%d-%b-%y'  # Assuming dd-Mon-yy format
)
# Group by year and month, summing only the numeric columns
df.groupby(
    [
        df['Date'].dt.year,
        df['Date'].dt.month
    ]
)[['abc', 'xyz']].sum()
