What is the best way to find, for each row, the date of the first later row with a larger value? For example, I have this dataframe:
import pandas as pd
data = [[20200101, 10], [20200102, 16], [20200103, 14], [20200104, 18]]
df = pd.DataFrame(data, columns=['date', 'value'])
print(df)
date value
0 20200101 10
1 20200102 16
2 20200103 14
3 20200104 18
I need to get the date of the first larger value that comes after each row's date:
date value largest_value_date
0 20200101 10 20200102
1 20200102 16 20200104
2 20200103 14 20200104
3 20200104 18 0
Of course I tried it with a "for" loop, but on big data it is very slow:
df['largest_value_date'] = 0
for i in range(0, len(df)):
    date = df['date'].iloc[i]
    value = df['value'].iloc[i]
    largestDate = df[(df['date'] > date) & (df['value'] > value)]
    if len(largestDate) > 0:
        df['largest_value_date'].iloc[i] = largestDate['date'].iloc[0]
print(df)
date value largest_value_date
0 20200101 10 20200102
1 20200102 16 20200104
2 20200103 14 20200104
3 20200104 18 0
We can speed up the whole process with a numpy broadcast plus idxmax: build the matrix of pairwise value differences, take the index of the first later row whose value is greater than the current one, then assign the matching date back.
import numpy as np

s = df['value'].values
# upper-triangular matrix of pairwise differences: only later rows are compared
idx = pd.DataFrame(np.triu(s - s[:, None])).gt(0).idxmax(axis=1)
# idxmax returns 0 when no later value is larger; map that to a missing date
df['new'] = df['date'].reindex(idx.replace(0, -1)).values
df
Out[158]:
date value new
0 20200101 10 20200102.0
1 20200102 16 20200104.0
2 20200103 14 20200104.0
3 20200104 18 NaN
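To see what the broadcast builds for this example, here is the intermediate matrix (shown purely for illustration; the values follow directly from s = [10, 16, 14, 18]):

import numpy as np

s = np.array([10, 16, 14, 18])
np.triu(s - s[:, None])
# array([[ 0,  6,  4,  8],
#        [ 0,  0, -2,  2],
#        [ 0,  0,  0,  4],
#        [ 0,  0,  0,  0]])
# entry (i, j) holds value[j] - value[i] for j >= i; the first positive entry in
# row i marks the first later row with a larger value, and idxmax(axis=1) returns
# that column (it returns 0 when a row has no positive entry, hence the replace(0, -1)).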
I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first day and the last day of counts) is the same for all of the ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this; it works, but it is not efficient and takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
df['day_name'] = new_df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = new_df.loc[new_df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find the Saturdays' indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while j < len(df_id):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if j < len(df_id):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
id level_1 date count
0 1 35 22 9/3/22 11
1 1 36 22 9/9/22 28
2 2 35 22 9/3/22 7
3 2 36 22 9/9/22 13
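For reference, pandas can also express a Saturday-ending week directly through a resampling-style grouper; this is only a sketch (not the answer above), assuming the date column is parsed to datetime first:

out = (df.assign(date=pd.to_datetime(df['date'], format='%m/%d/%y'))
         .groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())
# each group is labelled with the Saturday that closes the week,
# which matches the desired 9/3/22 and 9/10/22 buckets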
I tried to understand as much as I can :)
Here is my process:
# reading data
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                 date
0   1    35     11  2022-09-03 00:00:00
1   1    36     28  2022-09-10 00:00:00
2   2    35      7  2022-09-03 00:00:00
3   2    36     13  2022-09-10 00:00:00
I have two dataframes, just like below.
Dataframe1:
country type start_week end_week
1       a    12         13
2       b    13         14
Dataframe2:
country type week value
1       a    12   1000
1       a    13   900
1       a    14   800
2       b    12   1000
2       b    13   900
2       b    14   800
I want to add a column to the first dataframe with the mean value from the second dataframe for each key (country + type), taken between start_week and end_week.
I want the desired output to look like this:
country type start_week end_week avg
1       a    12         13       950
2       b    13         14       850
Here is one way:
combined = df1.merge(df2 , on =['country','type'])
combined = combined.loc[(combined.start_week <= combined.week) & (combined.week <= combined.end_week)]
output = combined.groupby(['country','type','start_week','end_week'])['value'].mean().reset_index()
Output:
country type start_week end_week value
0 1 a 12 13 950.0
1 2 b 13 14 850.0
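If the mean should end up as an avg column on df1 itself, the aggregated frame can be renamed and merged back; a small follow-up sketch (not part of the answer above):

df1 = df1.merge(output.rename(columns={'value': 'avg'}),
                on=['country', 'type', 'start_week', 'end_week'],
                how='left')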
You can use pd.melt and comparison of numpy arrays.
# melt df1
melted_df1 = df1.melt(id_vars=['country','type'],value_name='week')[['country','type','week']]
# for loop to compare two dataframe arrays
result = []
for i in df2.values:
    for j in melted_df1.values:
        if (j == i[:3]).all():
            result.append(i)
            break
# Computing mean of the result dataframe
result_df = pd.DataFrame(result,columns=df2.columns).groupby('type').mean().reset_index()['value']
# Assigning result_df to df1
df1['avg'] = result_df
country type start_week end_week avg
0 1 a 12 13 950.0
1 2 b 13 14 850.0
I want to select only those rows from a dataframe where certain columns with a given suffix have values not equal to zero. The number of such columns is large, so I need a generalised solution.
For example:
import pandas as pd
data = {
    'ID': [1, 2, 3, 4, 5],
    'M_NEW': [10, 12, 14, 16, 18],
    'M_OLD': [10, 12, 14, 16, 18],
    'M_DIFF': [0, 0, 0, 0, 0],
    'CA_NEW': [10, 12, 16, 16, 18],
    'CA_OLD': [10, 12, 14, 16, 18],
    'CA_DIFF': [0, 0, 2, 0, 0],
    'BC_NEW': [10, 12, 14, 16, 18],
    'BC_OLD': [10, 12, 14, 16, 17],
    'BC_DIFF': [0, 0, 0, 0, 1]
}
df = pd.DataFrame(data)
df
The dataframe would be :
ID M_NEW M_OLD M_DIFF CA_NEW CA_OLD CA_DIFF BC_NEW BC_OLD BC_DIFF
0 1 10 10 0 10 10 0 10 10 0
1 2 12 12 0 12 12 0 12 12 0
2 3 14 14 0 16 14 2 14 14 0
3 4 16 16 0 16 16 0 16 16 0
4 5 18 18 0 18 18 0 18 17 1
The desired output is (because of the 2 in CA_DIFF and the 1 in BC_DIFF):
ID M_NEW M_OLD M_DIFF CA_NEW CA_OLD CA_DIFF BC_NEW BC_OLD BC_DIFF
0 3 14 14 0 16 14 2 14 14 0
1 5 18 18 0 18 18 0 18 17 1
This works when using multiple conditions, but what if there are many more DIFF columns, like 20? Can someone provide a general solution? Thanks.
You can do this:
...
# get all columns with X_DIFF
columns = df.columns[df.columns.str.contains('_DIFF')]
# check if any has value greater than 0
df[df[columns].transform(lambda x: x > 0).any(axis=1)]
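Since the relevant columns share a suffix, df.filter can also pick them up without listing them; a shorter equivalent sketch (not the answer above) that follows the question's "not equal to zero" wording:

diff_cols = df.filter(like='_DIFF').columns          # every column whose name contains '_DIFF'
df[df[diff_cols].ne(0).any(axis=1)].reset_index(drop=True)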
You could use the function below, combined with pipe to filter rows, based on various conditions:
In [22]: def filter_rows(df, dtype, columns, condition, any_True = True):
...: temp = df.copy()
...: if dtype:
...: temp = df.select_dtypes(dtype)
...: if columns:
...: booleans = temp.loc[:, columns].transform(condition)
...: else:
...: booleans = temp.transform(condition)
...: if any_True:
...: booleans = booleans.any(axis = 1)
...: else:
...: booleans = booleans.all(axis = 1)
...:
...: return df.loc[booleans]
In [24]: df.pipe(filter_rows,
    ...:         dtype=None,
    ...:         columns=lambda df: df.columns.str.endswith("_DIFF"),
    ...:         condition=lambda df: df.ne(0)
    ...:         )
Out[24]:
ID M_NEW M_OLD M_DIFF CA_NEW CA_OLD CA_DIFF BC_NEW BC_OLD BC_DIFF
2 3 14 14 0 16 14 2 14 14 0
4 5 18 18 0 18 18 0 18 17 1
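As a usage note, the any_True flag switches between "any column matches" and "all columns match". For instance, keeping only the rows whose _DIFF columns are all zero could look like this sketch, reusing the helper above:

df.pipe(filter_rows,
        dtype=None,
        columns=lambda df: df.columns.str.endswith("_DIFF"),
        condition=lambda df: df.eq(0),
        any_True=False)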
I have a dataframe and want to create a new column based on other rows of the dataframe. My dataframe looks like
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
Now I want to check: if the freq of a row is zero, is there another row with the same ProjektID, Jahr and Week where the freq is not 0? If so, I want a new column "other" with the value 1, and 0 otherwise.
So, the output should be
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
This time I have no approach, can anyone help?
Thanks!
The following solution tests if the required conditions are True.
import io
import pandas as pd
Data
df = pd.read_csv(io.StringIO("""
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
"""), sep=r"\s+", engine="python")
Make a column other with all values zero.
df['other'] = 0
If ProjektID, Jahr and Week are duplicated and any of the freq values is larger than zero, then the rows that are duplicated (keep=False also captures the first of the duplicated rows) and where freq is zero get other set to 1. Change any() to all() if you need all values to be larger than zero.
if (df.loc[df[['ProjektID', 'Jahr', 'Week']].duplicated(), 'freq'] > 0).any():
    df.loc[(df[['ProjektID', 'Jahr', 'Week']].duplicated(keep=False)) & (df['freq'] == 0), ['other']] = 1
else:
    print("Other stays zero")
Output:
   MitarbeiterID  ProjektID  Jahr  Monat  Week   mean  freq  last  other
0            583      83224  2020      1     2  3.875     4     0      0
1            373      17364  2020      1     3  5.000     0     4      1
2            923      19234  2020      1     4  5.000     3     3      0
3            643      17364  2020      1     3  4.000     2     2      0
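For comparison, a groupby/transform variant (a sketch, not the answer above) expresses the same test without relying on duplicated:

# flag the (ProjektID, Jahr, Week) groups that contain at least one non-zero freq
has_nonzero = df.groupby(['ProjektID', 'Jahr', 'Week'])['freq'].transform(lambda s: s.ne(0).any())
df['other'] = ((df['freq'] == 0) & has_nonzero).astype(int)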
I think the best way to solve this is not to use pandas too much :-); converting things to sets and tuples should make it fast enough.
The idea is to build a set of all the triples (ProjektID, Jahr, Week) that appear in the dataset with freq != 0 and then check, for every line with freq == 0, whether its triple belongs to this set or not. In code, I'm creating a dummy dataset with:
x = pd.DataFrame(np.random.randint(0, 2, (8, 4)), columns=['id', 'year', 'week', 'freq'])
which in my case randomly gave:
>>> x
id year week freq
0 1 0 0 0
1 0 0 0 1
2 0 1 0 1
3 0 0 1 0
4 0 1 0 0
5 1 0 0 1
6 0 0 1 1
7 0 1 1 0
Now, we want triplets only where freq != 0, so we use
x1 = x.loc[x['freq'] != 0]
triplets = {tuple(row) for row in x1[['id', 'year', 'week']].values}
Note that I'm using x1.values, which is not a pandas DataFrame but rather a numpy array, so each row in there can be converted to a tuple. This is necessary because dataframe rows, numpy arrays and lists are mutable objects and cannot be hashed into a set otherwise. Using a set instead of e.g. a list (which doesn't have this restriction) is for efficiency purposes.
Next, we define a boolean Series which is True wherever the row's triplet (id, year, week) belongs to the above set:
belongs = x[['id', 'year', 'week']].apply(lambda x: tuple(x) in triplets, axis=1)
We are basically done; this is essentially the extra column you want, except that we also need to require freq == 0:
x['other'] = np.logical_and(belongs, x['freq'] == 0).astype(int)
(the final .astype(int) is to get the values 0 and 1, as you asked, instead of False and True). Final result in my case:
>>> x
id year week freq other
0 1 0 0 0 1
1 0 0 0 1 0
2 0 1 0 1 0
3 0 0 1 0 1
4 0 1 0 0 1
5 1 0 0 1 0
6 0 0 1 1 0
7 0 1 1 0 0
Looks like I am too late ...:
df.set_index(['ProjektID', 'Jahr', 'Week'], drop=True, inplace=True)
df['other'] = 0
df.other.mask(df.freq == 0,
df.freq[df.freq == 0].index.isin(df.freq[df.freq != 0].index),
inplace=True)
df.other = df.other.astype('int')
df.reset_index(drop=False, inplace=True)
import pandas as pd
df = pd.DataFrame(data=[[1,1,10],[1,2,50],[1,3,20],[1,4,24],
[2,1,20],[2,2,10],[2,3,20],[2,4,34],[3,1,10],[3,2,50],
[3,3,20],[3,4,24],[3,5,24],[4,1,24]],columns=['day','hour','event'])
df
Out[4]:
day hour event
0 1 1 10
1 1 2 50
2 1 3 20 <- yes
3 1 4 24 <- yes
4 2 1 20 <- yes
5 2 2 10
6 2 3 20 <- yes
7 2 4 34 <- yes
8 3 1 10 <- yes
9 3 2 50
10 3 3 20 <- yes
11 3 4 24 <- yes
11 3 5 24 <- yes (here we have also an hour more)
12 4 1 24 <- yes
Now I would like to sum the number of events from hour=3 to hour=1 of the following day.
The expected result should be:
0 64
1 64
2 92
# convert columns to datetimes; subtract 2 hours so hours 1-2 belong to the previous day:
a = pd.to_datetime(df['day'].astype(str) + ':' + df['hour'].astype(str), format='%d:%H') - pd.Timedelta(2, unit='h')
# keep only shifted hours 1-23, i.e. original hours 3..23 and 1 (original hour 2 became 0)
hours = a.dt.hour.between(1, 23)
# create consecutive group ids from the runs of the boolean mask
df['a'] = hours.ne(hours.shift()).cumsum()
# filter only the expected hours
df = df[hours]
# aggregate
df = df.groupby('a')['event'].sum().reset_index(drop=True)
print(df)
0 10
1 64
2 64
3 92
Name: event, dtype: int64
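Note that the first group (10) covers only hours 1-2 of day 1. If, as in the expected result, only full hour-3-to-hour-1 windows are wanted, that leading partial group can simply be dropped afterwards; a small follow-up for this data, not part of the answer above:

df = df.iloc[1:].reset_index(drop=True)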
Another similar solution:
# create a DatetimeIndex from day and hour
df.index = pd.to_datetime(df['day'].astype(str) + ':' + df['hour'].astype(str), format='%d:%H')
# shift the index back by 2 hours
df = df.shift(-2, freq='h')
# drop shifted hour 0 (original hour 2) and the leading rows that fell back into 1899
df = df[(df.index.hour != 0) & (df.index.year != 1899)]
# aggregate by the day of the shifted index
df = df.groupby(df.index.day)['event'].sum().reset_index(drop=True)
print(df)
0 64
1 64
2 92
Name: event, dtype: int64
Another solution:
import numpy as np

# drop everything before the first hour 3, and drop all hour 2 rows
df = df[(df['hour'].eq(3).cumsum() > 0) & (df['hour'] != 2)]
# assign hours 1-2 to the previous day, then aggregate
df = df['event'].groupby(np.where(df['hour'] < 3, df['day'] - 1, df['day'])).sum()
print (df)
1 64
2 64
3 92
Name: event, dtype: int64
One option would be to just remove all entries for which hour is 2, then combine the results into groups of 3 and sum those:
# drop hour 2 rows and the very first row (day 1, hour 1) so each group starts at hour 3
v = df[df.hour != 2][1:].event
# sum consecutive groups of three rows
np.add.reduceat(v, range(0, len(v), 3))
One way is to define a grouping column via pd.DataFrame.apply with a custom function.
Then groupby this new column.
df['grouping'] = df.apply(lambda x: x['day']-2 if x['hour'] < 3 else x['day']-1, axis=1)
res = df.loc[(df['hour'] != 2) & (df['grouping'] >= 0)]\
.groupby('grouping')['event'].sum()\
.reset_index(drop=True)
Result
0 64
1 64
2 92
Name: event, dtype: int64
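As a side note, the row-wise apply above can be replaced by a vectorized np.where with the same grouping rule; a sketch assuming the same column names:

import numpy as np

df['grouping'] = np.where(df['hour'] < 3, df['day'] - 2, df['day'] - 1)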