import numpy as np
import pandas as pd
df = pd.DataFrame(data=[[1,1,10],[1,2,50],[1,3,20],[1,4,24],
[2,1,20],[2,2,10],[2,3,20],[2,4,34],[3,1,10],[3,2,50],
[3,3,20],[3,4,24],[3,5,24],[4,1,24]],columns=['day','hour','event'])
df
Out[4]:
day hour event
0 1 1 10
1 1 2 50
2 1 3 20 <- yes
3 1 4 24 <- yes
4 2 1 20 <- yes
5 2 2 10
6 2 3 20 <- yes
7 2 4 34 <- yes
8 3 1 10 <- yes
9 3 2 50
10 3 3 20 <- yes
11 3 4 24 <- yes
12 3 5 24 <- yes (here we also have one more hour)
13 4 1 24 <- yes
Now I would like to sum the number of events from hour=3 to hour=1 of the following day.
The expected result should be
0 64
1 64
2 92
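For comparison, a resample-based sketch of the same windowing. Assumptions: the day/hour pairs are first converted to timestamps the same way as in the solutions below, and hour 2 rows plus anything before the first hour 3 are dropped up front:
ts = pd.to_datetime(df['day'].astype(str) + ':' + df['hour'].astype(str), format='%d:%H')
keep = (df['hour'] != 2) & (df['hour'].eq(3).cumsum() > 0)
# 24-hour bins anchored at 03:00, i.e. one bin per hour 3 -> hour 1 window
out = df[keep].set_index(ts[keep]).resample('24h', offset='3h')['event'].sum()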
#convert day/hour columns to datetimes; subtract 2 hours so that hour 1 of the next day lands on the same day as hours 3..23:
a = pd.to_datetime(df['day'].astype(str) + ':' + df['hour'].astype(str), format='%d:%H')- pd.Timedelta(2, unit='h')
#keep only hours 1-23 after the shift, i.e. original hours 3,4,...,23 and 1 (original hour 2 becomes 0)
hours = a.dt.hour.between(1,23)
#create consecutive groups from the boolean mask
df['a'] = hours.ne(hours.shift()).cumsum()
#filter only expected hours
df = df[hours]
#aggregate
df = df.groupby('a')['event'].sum().reset_index(drop=True)
print (df)
0 10
1 64
2 64
3 92
Name: event, dtype: int64
The leading 10 comes from day 1, hour 1, which falls before the first hour 3; drop it (e.g. with iloc[1:]) if only complete hour 3 -> hour 1 windows are wanted.
Another similar solution:
#create datetimeindex
df.index = pd.to_datetime(df['day'].astype(str)+':'+df['hour'].astype(str), format='%d:%H')
#shift by 2 hours
df = df.shift(-2, freq='h')
#drop hour 0 (the original hour 2) and the leading event that falls before the first hour 3 (it lands in year 1899 after the shift)
df = df[(df.index.hour != 0) & (df.index.year != 1899)]
#aggregate
df = df.groupby(df.index.day)['event'].sum().reset_index(drop=True)
print (df)
0 64
1 64
2 92
Name: event, dtype: int64
Another solution:
#keep rows from the first hour 3 onwards and drop rows with hour == 2
df = df[(df['hour'].eq(3).cumsum() > 0) & (df['hour'] != 2)]
#subtract 1 day by condition and aggregate
df = df['event'].groupby(np.where(df['hour'] < 3, df['day'] - 1, df['day'])).sum()
print (df)
1 64
2 64
3 92
Name: event, dtype: int64
One option would be to just remove all entries for which hour is 2, then combine the remaining rows into groups of 3 and sum those:
v = df[df.hour != 2][1:].event
np.add.reduceat(v, range(0, len(v), 3))
Note this assumes every window spans exactly three rows, so the extra hour-5 row on day 3 is not picked up; the datetime-based solutions above handle that case.
One way is to define a grouping column via pd.DataFrame.apply with a custom function.
Then groupby this new column.
df['grouping'] = df.apply(lambda x: x['day']-2 if x['hour'] < 3 else x['day']-1, axis=1)
res = df.loc[(df['hour'] != 2) & (df['grouping'] >= 0)]\
.groupby('grouping')['event'].sum()\
.reset_index(drop=True)
Result
0 64
1 64
2 92
Name: event, dtype: int64
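If the row-wise apply is slow on a large frame, the same grouping column can be built with a vectorized np.where; a sketch of the equivalent expression:
import numpy as np
# hour < 3 means "hour 1 of the next day", so it belongs to the previous day's window
df['grouping'] = np.where(df['hour'] < 3, df['day'] - 2, df['day'] - 1)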
Related
I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first day and the last day of counts) is fixed for all of the ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this and it works, but it is not efficient: it takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
df['day_name'] = df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = df.loc[df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while j < len(df_id):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if j < len(df_id):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
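A much shorter route is to let pandas build the Saturday-ending weeks itself. A minimal sketch, assuming 'date' has already been parsed with pd.to_datetime; freq='W-SAT' makes each weekly bin end on, and be labelled with, its Saturday:
out = (df.groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())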
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
id level_1 date count
0 1 35 22 9/3/22 11
1 1 36 22 9/9/22 28
2 2 35 22 9/3/22 7
3 2 36 22 9/9/22 13
Note the max is taken on the raw date strings here, which is why week 36 shows 9/9/22 instead of 9/10/22; convert 'date' with pd.to_datetime first to get the true latest date.
I tried to understand as much as I can :)
Here is my process:
# reading data
from io import StringIO
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                date
0   1    35     11 2022-09-03 00:00:00
1   1    36     28 2022-09-10 00:00:00
2   2    35      7 2022-09-03 00:00:00
3   2    36     13 2022-09-10 00:00:00
I have a dataframe called Result that comes from a SQL query:
Loc ID Bank
1 23 NULL
1 24 NULL
1 25 NULL
2 23 6
2 24 7
2 25 8
I am trying to set the Bank values of the Loc == 1 rows equal to the Bank of the Loc == 2 row with the same ID, resulting in:
Loc ID Bank
1 23 6
1 24 7
1 25 8
2 23 6
2 24 7
2 25 8
Here is where I am at with the code. I know the ending is super simple, but I can't wrap my head around a solution that doesn't involve iterating over every row (~9000).
result.loc[(result['Loc'] == '1'), 'bank'] = ???
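For reference, a one-line sketch along the same lines, assuming the SQL NULLs arrive as NaN and each ID has exactly one Loc == 2 row: groupby('ID') with transform('first') broadcasts each ID's first non-null Bank back to every row of that ID.
result['Bank'] = result.groupby('ID')['Bank'].transform('first')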
You can try this. It uses map() to look up the Bank values by ID.
for_map = df.loc[df['Loc'] == 2].set_index('ID')['Bank'].squeeze().to_dict()
df.loc[df['Loc'] == 1,'Bank'] = df.loc[df['Loc'] == 1,'Bank'].fillna(df['ID'].map(for_map))
You could do a self merge on the dataframe, on ID, then filter for rows where it is equal to 2:
(
df.merge(df, on="ID")
.loc[lambda df: df.Loc_y == 2, ["Loc_x", "ID", "Bank_y"]]
.rename(columns=lambda x: x.split("_")[0] if "_" in x else x)
.astype({"Bank": "Int8"})
.sort_values("Loc", ignore_index=True)
)
Loc ID Bank
0 1 23 6
1 1 24 7
2 1 25 8
3 2 23 6
4 2 24 7
5 2 25 8
You could also stack/unstack, although this fails if you have duplicate indices:
(
df.set_index(["Loc", "ID"])
.unstack("Loc")
.bfill(axis=1)
.stack()
.reset_index()
.reindex(columns=df.columns)
)
Loc ID Bank
0 1 23 6.0
1 2 23 6.0
2 1 24 7.0
3 2 24 7.0
4 1 25 8.0
5 2 25 8.0
Why not use pandas.MultiIndex?
Commonalities
# Arguments,
_0th_level = 'Loc'
merge_key = 'ID'
value_key = 'Bank' # or a list of colnames or `slice(None)` to propagate all columns values.
src_key = '2'
dst_key = '1'
# Computed once for all,
df = result.set_index([_0th_level, merge_key])
df2 = df.xs(key=src_key, level=_0th_level, drop_level=False)
df1_ = df2.rename(level=_0th_level, index={src_key: dst_key})
First (naive) approach
df.loc[df1_.index, value_key] = df1_
# to get `result` back : df.reset_index()
Second (robust) approach
That being shown, the first approach may be illegal (since pandas version 1.0.0) if one or more labels are missing [...].
So if you must ensure that indexes exist both at source and destination, the following does the job on shared IDs only.
df1 = df.xs(key=dst_key, level=_0th_level, drop_level=False)
idx = df1.index.intersection(df1_.index) # <-----
df.loc[idx, value_key] = df1_.loc[idx, value_key]
I have a small sample data set:
import pandas as pd
d = {
'measure1_x': [10,12,20,30,21],
'measure2_x':[11,12,10,3,3],
'measure3_x':[10,0,12,1,1],
'measure1_y': [1,2,2,3,1],
'measure2_y':[1,1,1,3,3],
'measure3_y':[1,0,2,1,1]
}
df = pd.DataFrame(d)
df = df.reindex(columns=[
    'measure1_x','measure2_x', 'measure3_x','measure1_y','measure2_y','measure3_y'
])
it looks like:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y
10 11 10 1 1 1
12 12 0 2 1 0
20 10 12 2 1 2
30 3 1 3 3 1
21 3 1 1 3 1
I created the column names to be almost the same except for the '_x' and '_y' suffixes to help identify which pairs should be multiplied: I want to multiply each pair whose column names match once '_x' and '_y' are disregarded, and then sum the products to get a total. Keep in mind my actual data set is huge and the columns are not in this perfect order, so this naming is a way of identifying the correct pairs to multiply:
total = measure1_x * measure1_y + measure2_x * measure2_y + measure3_x * measure3_y
so desired output:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y total
10 11 10 1 1 1 31
12 12 0 2 1 0 36
20 10 12 2 1 2 74
30 3 1 3 3 1 100
21 3 1 1 3 1 31
My attempt and thought process is below, but I cannot get any further syntax-wise:
#first identify the column names that has '_x' and '_y', then identify if
#the column names are the same after removing '_x' and '_y', if the pair has
#the same name then multiply them, do that for all pairs and sum the results
#up to get the total number
for colname in df.columns:
    if "_x".lower() in colname.lower() or "_y".lower() in colname.lower():
        if "_x".lower() in colname.lower():
            colnamex = colname
        if "_y".lower() in colname.lower():
            colnamey = colname
        #if colnamex[:-2] are the same for colnamex and colnamey then multiply and sum
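A minimal sketch of that idea: strip the suffixes, line the _x and _y blocks up by base name, and product-sum them. This assumes every _x column has a matching _y column, and it does not care about column order:
x = df.filter(like='_x').rename(columns=lambda c: c[:-2])
y = df.filter(like='_y').rename(columns=lambda c: c[:-2])
# multiply matching pairs (aligned by base name) and sum across each row
df['total'] = (x * y).sum(axis=1)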
filter + np.einsum
Thought I'd try something a little different this time: get your _x and _y columns separately, then do a product-sum. This is very easy to specify with einsum (and it's fast).
df = df.sort_index(axis=1) # optional, do this if your columns aren't sorted
i = df.filter(like='_x')
j = df.filter(like='_y')
df['Total'] = np.einsum('ij,ij->i', i, j) # (i.values * j).sum(axis=1)
df
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
A slightly more robust version which filters out non-numeric columns and performs an assertion beforehand:
df = df.sort_index(axis=1).select_dtypes(exclude=[object])
i = df.filter(regex='.*_x')
j = df.filter(regex='.*_y')
assert i.shape == j.shape
df['Total'] = np.einsum('ij,ij->i', i, j)
If the assertion fails, then the assumptions that 1) your columns are numeric and 2) the number of x and y columns is equal, as your question would suggest, do not hold for your actual dataset.
Use df.columns.str.split to generate a new MultiIndex
Use prod with axis and level arguments
Use sum with axis argument
Use assign to create new column
df.assign(
Total=df.set_axis(
df.columns.str.split('_', expand=True),
axis=1, inplace=False
).prod(axis=1, level=0).sum(1)
)
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
Restrict the dataframe to just the columns that look like 'measure[i]_[j]'
df.assign(
Total=df.filter(regex='^measure\d+_\w+$').pipe(
lambda d: d.set_axis(
d.columns.str.split('_', expand=True),
axis=1, inplace=False
)
).prod(axis=1, level=0).sum(1)
)
Debugging
See if this gets you the correct Totals
d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)
d_.prod(axis=1, level=0).sum(1)
0 31
1 36
2 74
3 100
4 31
dtype: int64
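Note that newer pandas (2.0+) removed the inplace keyword from set_axis and the level argument from prod, so the snippets above need a small adjustment there. A sketch of an equivalent per-measure product under those versions:
# split the columns into a MultiIndex, then take the per-measure product via a transposed groupby
tmp = df.set_axis(df.columns.str.split('_', expand=True), axis=1)
df['Total'] = tmp.T.groupby(level=0).prod().T.sum(axis=1)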
I have a wide table in a format as follows (for up to 10 people):
person1_status | person2_status | person3_status | person1_type | person2_type | person3_type
0 | 1 | 0 | 7 | 4 | 6
Where status can be a 0 or a 1 (first 3 cols).
Where type can be a # ranging from 4-7. The value here corresponds to another table that specifies a value based on type. So...
Type | Value
4 | 10
5 | 20
6 | 30
7 | 40
I need to calculate two columns, 'A', and 'B', where:
A is the sum of values of each person's type (in that row) where status = 0.
B is the sum of values of each person's type (in that row) where status = 1.
For example, the resulting columns 'A', and 'B' would be as follows:
A | B
70 | 10
An explanation of this:
'A' has value 70 because person1 and person3 have "status" 0 and corresponding types 7 and 6 (which correspond to values 40 and 30).
Similarly, there should be another column 'B' that has the value "10" because only person2 has status "1" and their type is "4" (which has corresponding value of 10).
This is probably a stupid question, but how do I do this in a vectorized way? I don't want to use a for loop or anything since it'll be less efficient...
I hope that made sense... could anyone help me? I think I'm brain dead trying to figure this out.
For simpler calculated columns I was getting away with just np.where but I'm a little stuck here since I need to calculate the sum of values from multiple columns given certain conditions while pulling in those values from a separate table...
hope that made sense
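A small vectorized sketch along those lines, assuming the lookup is held as a Series indexed by type and that the status and type columns come in the same person order:
values = pd.Series({4: 10, 5: 20, 6: 30, 7: 40})   # type -> value lookup
status = df.filter(like='_status')
types = df.filter(like='_type')
# swap each type code for its value, keeping the row/column shape
vals = types.apply(lambda col: col.map(values))
mask = status.to_numpy()                           # 0/1 per person, positionally aligned with vals
df['A'] = vals.where(mask == 0, 0).sum(axis=1)
df['B'] = vals.where(mask == 1, 0).sum(axis=1)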
Use the filter method which will filter the column names for those where a string appears in them.
Make a dataframe for the lookup values other_table and set the index as the type column.
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Full example below:
Create fake data
df = pd.DataFrame({'person_1_status':np.random.randint(0, 2,1000) ,
'person_2_status':np.random.randint(0, 2,1000),
'person_3_status':np.random.randint(0, 2,1000),
'person_1_type':np.random.randint(4, 8,1000),
'person_2_type':np.random.randint(4, 8,1000),
'person_3_type':np.random.randint(4, 8,1000)},
columns= ['person_1_status', 'person_2_status', 'person_3_status',
'person_1_type', 'person_2_type', 'person_3_type'])
person_1_status person_2_status person_3_status person_1_type \
0 1 0 0 7
1 0 1 0 6
2 1 0 1 7
3 0 0 0 7
4 0 0 1 4
person_2_type person_3_type
0 5 5
1 7 7
2 7 7
3 7 7
4 7 7
Make other_table
other_table = pd.Series({4:10, 5:20, 6:30, 7:40})
4 10
5 20
6 30
7 40
dtype: int64
Filter out status and type columns to their own dataframes
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
Make lookup table
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
Multiply element-wise and sum across rows.
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Output
person_1_status person_2_status person_3_status person_1_type \
0 0 0 1 7
1 0 1 0 4
2 0 1 1 7
3 0 1 0 6
4 0 0 1 5
person_2_type person_3_type A B
0 7 5 80 20
1 6 4 20 30
2 5 5 40 40
3 6 4 40 30
4 7 5 60 20
Consider the dataframe df
mux = pd.MultiIndex.from_product([['status', 'type'], ['p%i' % i for i in range(1, 6)]])
data = np.concatenate([np.random.choice((0, 1), (10, 5)), np.random.rand(10, 5)], axis=1)
df = pd.DataFrame(data, columns=mux)
df
The way this is structured, we can get the sum for status == 1 with
df.status.mul(df.type).sum(1)
0 0.935290
1 1.252478
2 1.354461
3 1.399357
4 2.102277
5 1.589710
6 0.434147
7 2.553792
8 1.205599
9 1.022305
dtype: float64
and for status == 0
df.status.rsub(1).mul(df.type).sum(1)
0 1.867986
1 1.068045
2 0.653943
3 2.239459
4 0.214523
5 0.734449
6 1.291228
7 0.614539
8 0.849644
9 1.109086
dtype: float64
You can get your columns in this format using the following code
df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, 1)
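Tying this back to the question's lookup table, a sketch assuming the columns have been split as above and the type columns hold the 4-7 codes:
values = {4: 10, 5: 20, 6: 30, 7: 40}
vals = df.type.replace(values)                 # map each type code to its value
B = df.status.mul(vals).sum(1)                 # sum of values where status == 1
A = df.status.rsub(1).mul(vals).sum(1)         # sum of values where status == 0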
Consider df
Index A B C
0 20161001 0 24.5
1 20161001 3 26.5
2 20161001 6 21.5
3 20161001 9 29.5
4 20161001 12 20.5
5 20161002 0 30.5
6 20161002 3 22.5
7 20161002 6 25.5
...
Also consider df2
Index Threshold
0 25
1 27
2 29
3 30
4 25
5 30
..
I want to add a column "Number of Rows" to df2 which contains the number of rows in df where (C > Threshold) & (A >= 20161001) & (A <= 20161002) holds true. This is basically to say that the condition involves more than one column of df.
Index Threshold Number of Rows
0 25 4
1 27 2
2 29 2
3 30 1
4 25 4
5 30 1
..
For Threshold=25 in df2, there are 4 rows in df where the "C" value exceeds 25.
I tried something like:
def foo(threshold,start,end):
    return len(df[(df['C'] > threshold) & (df['A'] > start) & (df['A'] < end)])
df2['Number of rows'] = df.apply(lambda df2: foo(df2['Threshold'],start = 20161001, end = 20161002),axis=1)
But this is populating the Number of Rows column with 0. Why is this?
You could make use of Boolean Indexing and the sum() aggregate function
# Create the first dataframe (df)
df = pd.DataFrame([[20161001,0 ,24.5],
[20161001,3 ,26.5],
[20161001,6 ,21.5],
[20161001,9 ,29.5],
[20161001,12,20.5],
[20161002,0 ,30.5],
[20161002,3 ,22.5],
[20161002,6 ,25.5]],columns=['A','B','C'])
# Create the second dataframe (df2)
df2 = pd.DataFrame(data=[25,27,29,30,25,30],columns=['Threshold'])
start = 20161001
end = 20161002
df2['Number of Rows'] = df2['Threshold'].apply(lambda x : ((df.C > x) & (df.A >= start) & (df.A <= end)).sum())
print(df2['Number of Rows'])
Out[]:
0 4
1 2
2 2
3 1
4 4
5 1
Name: Number of Rows, dtype: int64
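As for why the original attempt produced 0: the strict inequalities (df['A'] > start and df['A'] < end) exclude both 20161001 and 20161002, and the apply should run over df2, which holds the Threshold column, not df. A corrected sketch of that approach, kept as an apply for comparison:
def foo(threshold, start, end):
    # inclusive bounds; strict > and < exclude both boundary days, which is what gave 0
    return len(df[(df['C'] > threshold) & (df['A'] >= start) & (df['A'] <= end)])

df2['Number of Rows'] = df2.apply(lambda row: foo(row['Threshold'], start=20161001, end=20161002), axis=1)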