I have a pandas DataFrame df that looks like this:
item  year  value
A     2010     20
A     2011     25
A     2012     32
B     2016     20
B     2019     40
B     2018     50
My goal is, for each item, to calculate the difference in value between the earliest and latest year. For example, for item A I want 12 (32 - 20, because the max year is 2012 and the min year is 2010), and for item B I want 20 (40 - 20, because the max year is 2019 and the min year is 2016).
I use the following code to get, for each item, the min and max year:
df.groupby("item").agg({'year':[np.min, np.max]})
This gives me the min and max year for each item. However, I'm stuck on how to get the corresponding value difference from there.
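For reference, the sample frame can be reconstructed like this (values copied from the table above; the answers below assume it is named df):

import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    'item': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2010, 2011, 2012, 2016, 2019, 2018],
    'value': [20, 25, 32, 20, 40, 50],
})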
Try sort_values by year, then you can groupby and select first for min and last for max:
g = df.sort_values('year').groupby('item')
out = g['value'].last() - g['value'].first()
Output:
item
A 12
B 20
Name: value, dtype: int64
Use:
def fun(x):
    return x[x.index.max()] - x[x.index.min()]
res = df.set_index("year").groupby("item").agg(fun)
print(res)
Output
value
item
A 12
B 20
Use the iloc accessor inside agg to calculate the value difference; you can also concatenate the first and last year for each item to give a clear indication of the range.
df.sort_values(by=['item', 'year']).groupby('item').agg(
    year=('year', lambda x: str(x.iloc[0]) + '-' + str(x.iloc[-1])),
    value=('value', lambda x: x.iloc[-1] - x.iloc[0]),
)
year value
item
A 2010-2012 12
B 2016-2019 20
I have 2 dataframes:
df_dec_light and df_rally.
df_dec_light.head():
log_return month year
1970-12-01 0.003092 12 1970
1970-12-02 0.011481 12 1970
1970-12-03 0.004736 12 1970
1970-12-04 0.006279 12 1970
1970-12-07 0.005351 12 1970
1970-12-08 -0.005239 12 1970
1970-12-09 0.000782 12 1970
1970-12-10 0.004235 12 1970
1970-12-11 0.003774 12 1970
1970-12-14 -0.005109 12 1970
df_rally.head():
rally_start rally_end
0 1970-12-18 1970-12-31
1 1971-12-17 1971-12-31
2 1972-12-15 1972-12-29
3 1973-12-21 1973-12-31
4 1974-12-20 1974-12-31
I need to filter df_dec_light based on the condition that df_dec_light.index lies between the values of the columns df_rally['rally_start'] and df_rally['rally_end'].
I've tried something like this:
df_dec_light[(df_dec_light.index >= df_rally['rally_start']) & (df_dec_light.index <= df_rally['rally_end'])]
I was expecting to receive a filtered df_dec_light dataframe whose indexes fall within the intervals between df_rally['rally_start'] and df_rally['rally_end'].
Something like this:
log_return month year
1970-12-18 0.001997 12 1970
1970-12-21 -0.003108 12 1970
1970-12-22 0.001111 12 1970
1970-12-23 0.000666 12 1970
1970-12-24 0.005644 12 1970
1970-12-28 0.005283 12 1970
1970-12-29 0.010810 12 1970
1970-12-30 0.002061 12 1970
1970-12-31 -0.001301 12 1970
Would really appreciate any help. Thanks!
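For anyone who wants to run the answers below, here is a minimal sketch that reconstructs abbreviated versions of the two frames (rows copied from the excerpts above; the frame names match the question):

import pandas as pd

# Abbreviated df_dec_light: a few rows taken from the head and the expected output above
df_dec_light = pd.DataFrame(
    {'log_return': [0.003092, 0.011481, 0.004736, 0.001997, -0.001301],
     'month': 12,
     'year': 1970},
    index=pd.to_datetime(['1970-12-01', '1970-12-02', '1970-12-03',
                          '1970-12-18', '1970-12-31']))

# Abbreviated df_rally: the first two rally periods shown above
df_rally = pd.DataFrame({
    'rally_start': pd.to_datetime(['1970-12-18', '1971-12-17']),
    'rally_end': pd.to_datetime(['1970-12-31', '1971-12-31'])})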
Let's create an IntervalIndex from the start and end column values in the df_rally dataframe, then map the intervals onto the index of df_dec_light and use notna to check whether each index value is contained in any interval:
ix = pd.IntervalIndex.from_arrays(df_rally.rally_start, df_rally.rally_end, closed='both')
mask = df_dec_light.index.map(ix.to_series()).notna()
Then use the mask to filter the dataframe:
df_dec_light[mask]
To solve this we can first turn each range in df_rally into a DatetimeIndex by calling pd.date_range on each row.
Since we later want to check whether the index of df_dec_light lies in any of the ranges, we combine all of these ranges. This is done with union.
We assert that the newly created Series index_list is not empty and select its first element; this is the DatetimeIndex on which we then call union with all the other DatetimeIndex objects.
We can now use pd.Index.isin to create a boolean array indicating whether each index date is found in the combined set of dates.
Applying this mask to df_dec_light returns only the entries that fall within one of the specified ranges of df_rally.
# Build a DatetimeIndex for every rally period
index_list = df_rally.apply(
    lambda x: pd.date_range(x['rally_start'], x['rally_end']), axis=1)

assert not index_list.empty

# Union all per-rally ranges into a single DatetimeIndex
all_ranges = index_list.iloc[0]
for rng in index_list:
    all_ranges = all_ranges.union(rng)
print(all_ranges)

# Keep only the rows whose index falls inside one of the rally ranges
mask = df_dec_light.index.isin(all_ranges)
print(df_dec_light[mask])
I have a table, and I'm trying to get the second largest "percent" value by column "Day".
I can get the second largest value, but the 'Hour' column is not the right one.
Table: df

name       Day  Hour   percent
000_RJ_S1   26    10  0.908494
000_RJ_S1   26    11  0.831482
000_RJ_S1   26    12  0.843846
000_RJ_S1   26    13  0.877238
000_RJ_S1   26    17  0.163908
000_RJ_S1   26    18  0.230296
000_RJ_S1   26    19  0.359440
000_RJ_S1   26    20  0.379988
Script Used:
df = df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent').min())
Output:

name       Day  Hour   percent
000_RJ_S1   26    10  0.877238

As you can see, the "Hour" value is wrong. It should be "13" and not "10". The second largest value is right.
It should be:
name       Day  Hour   percent
000_RJ_S1   26    13  0.877238
I can't figure out what's wrong. Could you guys help me with this issue?
Thanks a lot
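For reference, the table above can be rebuilt with a small snippet like this (the answers below assume it is named df):

import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    'name': ['000_RJ_S1'] * 8,
    'Day': [26] * 8,
    'Hour': [10, 11, 12, 13, 17, 18, 19, 20],
    'percent': [0.908494, 0.831482, 0.843846, 0.877238,
                0.163908, 0.230296, 0.359440, 0.379988],
})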
Sort the percent column before grouping, and use the nth function instead (nth(1) selects the second row of each sorted group):
(df.sort_values('percent', ascending=False)
   .groupby(['name', 'Day'], sort=False, as_index=False)
   .nth(1)
)
name Day Hour percent
3 000_RJ_S1 26 13 0.877238
The reason you got 10 is the min() function.
The nlargest() in the lambda returns the two rows with the largest percent values; when you then apply min(), it selects the minimum of each column separately, which is what produced that output.
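To see this concretely, here is a small sketch using the data above (a single group, name 000_RJ_S1 and Day 26):

top2 = df.nlargest(2, columns='percent')[['Hour', 'percent']]
print(top2)
#    Hour   percent
# 0    10  0.908494
# 3    13  0.877238

# min() is applied per column, so Hour and percent come from different rows
print(top2.min())
# Hour       10.000000
# percent     0.877238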
You can use iloc[1] instead of min() to get the desired result
Here's the code using iloc:
df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent').iloc[1])
One solution is to use a double groupby:
cols = ['name','Day']
# get the top 2 per group
s = df.groupby(cols)['percent'].nlargest(2)
# get the index of min per group
idx = s.droplevel(cols).groupby(s.droplevel(-1).index).idxmin()
# slice original data with those indexes
df2 = df.loc[idx.values]
Output:
name Day Hour percent
3 000_RJ_S1 26 13 0.877238
I have data with 3 columns: date, id, sales.
My first task was to filter sales above 100, which I have done.
The second task is to group each id by consecutive days.
index  date        id  sales
0      01/01/2018  03    101
1      01/01/2018  07    178
2      02/01/2018  03    120
3      03/01/2018  03    150
4      05/01/2018  07    205
The result should be:

index  id  count
0      03      3
1      07      1
2      07      1
I need to do this task without using pandas/dataframes, but right now I can't imagine from which side to attack this problem.
Just as an attempt, I tried the suggestion from "count consecutive days python dataframe",
but the ids were not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that new_frame has a "count" column, because afterwards I need to count ids by ranges of those consecutive-day counts (e.g. the number of ids within 0-7 days, 7-12 days, etc.), but that is not part of my question.
Thank you a lot
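For reference, the sample data can be reconstructed like this (values copied from the table above; the answer below assumes the frame is named df):

import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    'date': ['01/01/2018', '01/01/2018', '02/01/2018', '03/01/2018', '05/01/2018'],
    'id': ['03', '07', '03', '03', '07'],
    'sales': [101, 178, 120, 150, 205],
})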
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be regarded as 2018-02-01 instead of 2018-01-02 as expected and the day diff with adjacent entries will be around 30 as opposed to 1.
We added a sort step to sort by columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby(), the code reset_index(level=0, drop=True) should be dropping level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we do an extra .reset_index(name='count') to turn the resulting Series back into a dataframe and name the new column count.
I'm trying to finish my work project, but I'm getting stuck at a certain point.
Part of the dataframe I have is this:
year_month  year  month
2007-01     2007      1
2009-07     2009      7
2010-03     2010      3
However, I want to add a "season" column. The data illustrates soccer seasons, and the season column needs to show which season the player plays in. So if the month is equal to or smaller than 3, the "season" column should correspond to (year-1)/year, and if it is larger, to year/(year+1).
The table should look like this:
year_month  year  month  season
2007-01     2007      1  2006/2007
2009-07     2009      7  2009/2010
2010-03     2010      3  2009/2010
Hopefully someone can help me with this problem.
Here is the code to create the first table:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'year_month': ["2007-01", "2009-07", "2010-03"],
                   'year': [2007, 2009, 2010],
                   'month': [1, 7, 3]})

# convert the 'year_month' column to datetime format
df['year_month'] = pd.to_datetime(df['year_month'])
Thanks in advance!
You can use np.where() to specify the condition and get corresponding strings according to True / False of the condition, as follows:
import numpy as np

df['season'] = np.where(df['month'] <= 3,
                        (df['year'] - 1).astype(str) + '/' + df['year'].astype(str),
                        df['year'].astype(str) + '/' + (df['year'] + 1).astype(str))
Result:
year_month year month season
0 2007-01-01 2007 1 2006/2007
1 2009-07-01 2009 7 2009/2010
2 2010-03-01 2010 3 2009/2010
You can use a lambda function with conditionals and axis=1 to apply it to each row. Using f-Strings reduces the code needed to transform values from the year column into strings as needed for your new season column.
df['season'] = df.apply(
    lambda x: f"{x['year'] - 1}/{x['year']}" if x['month'] <= 3
    else f"{x['year']}/{x['year'] + 1}",
    axis=1)
Output:
year_month year month season
0 2007-01 2007 1 2006/2007
1 2009-07 2009 7 2009/2010
2 2010-03 2010 3 2009/2010
I have this MultiIndex pandas dataframe:
chamber_temp
month day
1 1 0.000000
2 0.005977
3 0.001439
4 -0.000119
5 0.000514
...
12 27 0.001799
28 0.002346
29 -0.001815
30 0.001102
31 -0.004189
What I want to get is which month has the highest cumsum().
What I am trying to do: for each month there should be one value giving the cumulative sum of all the daily values in that month; that is the part I need help with.
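For reference, a minimal sketch of a comparable MultiIndex frame for trying the answers below (values abbreviated from the excerpt above):

import pandas as pd

# Abbreviated reconstruction of the MultiIndex frame shown above
idx = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (12, 30), (12, 31)],
    names=['month', 'day'])
df = pd.DataFrame(
    {'chamber_temp': [0.000000, 0.005977, 0.001439, 0.001102, -0.004189]},
    index=idx)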
You can leverage the level parameter of Series.sum when there is a MultiIndex to avoid a groupby in such cases (note that this parameter is deprecated in recent pandas versions):
df['chamber_temp'].sum(level=0).idxmax()
Please try:
df.groupby('month')['chamber_temp'].sum().idxmax()
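As a small follow-up sketch (assuming df is the MultiIndex frame above), you can also report the winning total alongside the month:

# Sum the daily values per month, then take the month with the largest total
monthly = df.groupby('month')['chamber_temp'].sum()
print(monthly.idxmax())  # month with the highest total
print(monthly.max())     # the total for that month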