I have the following problem and do not know how to solve it in a performant way:
Input Pandas DataFrame:
   timestep  article  volume
         35        1      20
         37        2       5
        123        2      12
        155        3      10
        178        2      23
        234        1      17
        478        1      28
Output Pandas DataFrame:
   timestep  volume
         35      20
         37      25
        123      32
        178      53
        234      50
        478      61
Calculation Example for timestep 478:
28 (last article 1 volume) + 23 (last article 2 volume) + 10 (last article 3 volume) = 61
What is the best way to do this in pandas?
Try with ffill (as written, this handles only articles 1 and 2):
# sort if needed
df = df.sort_values("timestep")
df["volume"] = (df["volume"].where(df["article"].eq(1)).ffill().fillna(0) +
                df["volume"].where(df["article"].eq(2)).ffill().fillna(0))
output = df.drop("article", axis=1)
>>> output
timestep volume
0 35 20.0
1 37 25.0
2 123 32.0
3 178 43.0
4 234 40.0
5 478 51.0
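The same forward-fill idea can be extended to any number of articles by pivoting first. The following is a minimal sketch, not taken from the answers in this thread; it assumes each (timestep, article) pair occurs at most once, and the names `wide` and `output` are chosen for illustration only.

```python
import pandas as pd

# Input data from the question
df = pd.DataFrame({
    "timestep": [35, 37, 123, 155, 178, 234, 478],
    "article":  [1, 2, 2, 3, 2, 1, 1],
    "volume":   [20, 5, 12, 10, 23, 17, 28],
})

# One column per article, one row per timestep; NaN where an article
# has no record at that timestep (assumes unique timestep/article pairs).
wide = df.pivot(index="timestep", columns="article", values="volume")

# Carry each article's last known volume forward, then sum across articles.
# sum() skips the NaN of articles that have not appeared yet.
output = wide.ffill().sum(axis=1).astype(int).rename("volume").reset_index()
print(output)
```

This produces one row per input timestep, so it also contains a row for timestep 155, which the desired output in the question omits. For many distinct articles the pivoted frame can become wide, so memory use is a trade-off to keep in mind.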
Group by article, take the last element of each group, and sum (this returns only the final total, not a value per timestep):
df.groupby(['article']).tail(1)["volume"].sum()
You can assign a group number to each run of consecutive identical article values with .cumsum(). Then look up the last value of the previous group using .map() with GroupBy.last(). Finally, add that previous last value to volume, as follows:
# Get group number of consecutive `article`
g = df['article'].ne(df['article'].shift()).cumsum()
# Add `volume` to previous group last
df['volume'] += g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
Result:
print(df)
timestep article volume
0 35 1 20
1 37 2 25
2 123 2 32
3 178 2 43
4 234 1 40
5 478 1 51
Breakdown of steps
Previous group last values:
g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
0 0
1 20
2 20
3 20
4 43
5 43
Name: article, dtype: int64
Try:
df["new_volume"] = (
df.loc[df["article"] != df["article"].shift(-1), "volume"]
.reindex(df.index, method='ffill')
.shift()
+ df["volume"]
).fillna(df["volume"])
df
Output:
timestep article volume new_volume
0 35 1 20 20.0
1 37 2 5 25.0
2 123 2 12 32.0
3 178 2 23 43.0
4 234 1 17 40.0
5 478 1 28 51.0
Explained:
Find the last record of each group by comparing 'article' with the next row, then reindex that series to align with the original dataframe, forward-fill, and shift so that each group sees the previous group's last 'volume'. Add this to the current row's 'volume', and fill the missing first value with the original 'volume' value.
Related
I have a shipping records table with approx. 100K rows and
I want to calculate, for each row, for each material, how many qtys were shipped in last 30 days.
As you can see in below example, calculated qty depends on "material, shipping date".
I've tried to write very basic code and couldn't find a way to apply it to all rows.
df[(df['malzeme']==material) & (df['cikistarihi'] < shippingDate) & (df['cikistarihi'] >= (shippingDate-30))]['qty'].sum()
material  shippingDate  qty  shipped qtys in last 30 days
       A    23.01.2019    8  0
       A    28.01.2019   41  8
       A    31.01.2019   66  49 (8+41)
       A    20.03.2019   67  0
       B    17.02.2019   53  0
       B    26.02.2019   35  53
       B    11.03.2019    4  88 (53+35)
       B    20.03.2019   67  106 (35+4+67)
You can use .groupby with .rolling:
# convert shippingDate to datetime:
df["shippingDate"] = pd.to_datetime(df["shippingDate"], dayfirst=True)
# sort the values (if they aren't already)
df = df.sort_values(["material", "shippingDate"])
df["shipped qtys in last 30 days"] = (
df.groupby("material")
.rolling("30D", on="shippingDate", closed="left")["qty"]
.sum()
.fillna(0)
.values
)
print(df)
Prints:
material shippingDate qty shipped qtys in last 30 days
0 A 2019-01-23 8 0.0
1 A 2019-01-28 41 8.0
2 A 2019-01-31 66 49.0
3 A 2019-03-20 67 0.0
4 B 2019-02-17 53 0.0
5 B 2019-02-26 35 53.0
6 B 2019-03-11 4 88.0
7 B 2019-03-20 67 39.0
EDIT: Add .sort_values() before groupby
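To cross-check the rolling result on small data, the filter from the question can be applied row by row. This is only a sketch for verification: it is quadratic in the number of rows and would be slow on 100K rows. The helper name `shipped_last_30_days` and the column name `check` are made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "material": list("AAAABBBB"),
    "shippingDate": ["23.01.2019", "28.01.2019", "31.01.2019", "20.03.2019",
                     "17.02.2019", "26.02.2019", "11.03.2019", "20.03.2019"],
    "qty": [8, 41, 66, 67, 53, 35, 4, 67],
})
df["shippingDate"] = pd.to_datetime(df["shippingDate"], dayfirst=True)

def shipped_last_30_days(row):
    """Sum qty of the same material shipped in [date - 30 days, date)."""
    window_start = row["shippingDate"] - pd.Timedelta(days=30)
    mask = (
        (df["material"] == row["material"])
        & (df["shippingDate"] < row["shippingDate"])
        & (df["shippingDate"] >= window_start)
    )
    return df.loc[mask, "qty"].sum()

# Row-wise brute force; intended only as a check against the rolling version
df["check"] = df.apply(shipped_last_30_days, axis=1)
print(df)
```

The half-open window used here (start included, current row excluded) corresponds to closed="left" in the rolling call above, which is why the last row yields 39 rather than the 106 written in the question.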
Suppose I have the following dataframe:
. Column1 Column2
0 25 1
1 89 2
2 59 3
3 78 10
4 99 20
5 38 30
6 89 100
7 57 200
8 87 300
I'm not sure whether what I want to do is possible. I want to compare every three rows of Column1, take the highest 2 of the three, and assign the corresponding 2 Column2 values to a new column. It does not matter whether the values in Column3 are joined or sorted, because I know that every 2 values of Column3 belong to a block of 3 rows of Column1.
. Column1 Column2 Column3
0 25 1 2
1 89 2 3
2 59 3
3 78 10 20
4 99 20 10
5 38 30
6 89 100 100
7 57 200 300
8 87 300
You can use np.arange with np.repeat to create a grouping array that groups every 3 rows.
Then use GroupBy.nlargest, extract the indices of those values using pd.Index.get_level_values, and assign the result to Column3; pandas handles the index alignment.
n_grps = len(df) // 3  # assumes len(df) is a multiple of 3
g = np.repeat(np.arange(n_grps), 3)
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)
vals = df.loc[idx, 'Column2']
vals
# 1 2
# 2 3
# 4 20
# 3 10
# 6 100
# 8 300
# Name: Column2, dtype: int64
df['Column3'] = vals
df
Column1 Column2 Column3
0 25 1 NaN
1 89 2 2.0
2 59 3 3.0
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 NaN
8 87 300 300.0
To get the output shown in the question, sort within each group so that NaN comes last; this takes one additional step:
df['Column3'] = df.groupby(g)['Column3'].apply(lambda x:x.sort_values()).values
Column1 Column2 Column3
0 25 1 2.0
1 89 2 3.0
2 59 3 NaN
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 300.0
8 87 300 NaN
I've been working on this problem for a little while and am really close. Essentially, I want to create a time series of event counts by type from an event database. Here's what I've done so far:
Starting with an abbreviated version of my dataframe:
event_date year time_precision event_type \
0 2020-10-24 2020 1 Battles
1 2020-10-24 2020 1 Riots
2 2020-10-24 2020 1 Riots
3 2020-10-24 2020 1 Battles
4 2020-10-24 2020 2 Battles
I want the time series to be by month and year, so first I convert the dates to datetime:
nga_df.event_date = pd.to_datetime(nga_df.event_date)
Then, I want to create a time series of events by type, so I one-hot encode them:
nga_df = pd.get_dummies(nga_df, columns=['event_type'], prefix='', prefix_sep='')
Next, I need to extract the month, so that I can create monthly counts:
nga_df['month'] = nga_df.event_date.apply(lambda x: x.month)
Finally, I group my data by month and year and take the transpose:
conflict_series = nga_df.groupby(['year','month']).sum()
conflict_series.T
Which results in this lovely new dataframe:
year 1997 ... 2020
month 1 2 3 4 5 6 ... 5 6 7
fatalities 11 30 38 112 17 29 ... 1322 1015 619
Battles 4 4 5 13 2 2 ... 77 99 74
Explosions/Remote violence 2 1 0 0 3 0 ... 38 28 17
Protests 1 0 0 1 0 1 ... 31 83 50
Riots 3 3 4 1 4 1 ... 27 14 18
Strategic developments 1 0 0 0 0 0 ... 7 2 7
Violence against civilians 3 5 7 3 2 1 ... 135 112 88
So, I guess what I need to do is combine my index (columns after transpose) so that they are a single index. How do I do this?
The end goal is to combine this data with economic indicators to see if there is a trend, so I need both datasets to be in the same form, where the columns are monthly counts of different values.
Here's how I did it:
Step 1: flatten index:
# convert the multi-index to a flat set of tuples: (YYYY, MM)
index = conflict_series.index.to_flat_index().to_series()
Step 2: Add an arbitrary but required day of the month (28 here) for conversion to datetime:
index = index.apply(lambda x: x + (28,))
Step 3: Convert resulting 3-tuple to date time:
import datetime  # needed for datetime.date
index = index.apply(lambda x: datetime.date(*x))
Step 4: reset the DataFrame index:
conflict_series.set_index(index, inplace=True)
Results:
fatalities Battles Explosions/Remote violence Protests Riots \
1997-01-28 11 4 2 1 3
1997-02-28 30 4 1 0 3
1997-03-28 38 5 0 0 4
1997-04-28 112 13 0 1 1
1997-05-28 17 2 3 0 4
Strategic developments Violence against civilians total_events
1997-01-28 1 3 14
1997-02-28 0 5 13
1997-03-28 0 7 16
1997-04-28 0 3 18
1997-05-28 0 2 11
And now the plot I was looking for:
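As an alternative to steps 1 to 3 above, pd.to_datetime can assemble dates directly from year, month and day columns. The sketch below uses a small made-up stand-in for conflict_series (the values are taken from the first rows shown above) and picks day 1 instead of 28; both choices are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for conflict_series with a (year, month) MultiIndex
conflict_series = pd.DataFrame(
    {"Battles": [4, 4, 5], "Riots": [3, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1997, 1), (1997, 2), (1997, 3)], names=["year", "month"]
    ),
)

# Turn the index levels into columns and add an arbitrary day of the month
parts = conflict_series.index.to_frame(index=False)
parts["day"] = 1

# to_datetime assembles timestamps from columns named year, month and day
conflict_series.index = pd.DatetimeIndex(pd.to_datetime(parts))
print(conflict_series)
```

Because the result is a DatetimeIndex rather than an index of datetime.date objects, it should also line up more easily with other monthly series, for example when joining economic indicators.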
Say I have a DataFrame where the data is ordered with respect to time. I have a column of weights and I want to find the maximum weight relative to the current index. For example, the max value for the 10th row would be taken from elements 11 to the end.
I ended up writing this function, but performance is a big concern.
import pandas as pd

df = pd.DataFrame({"time": [100, 200, 300, 400, 500, 600, 700, 800],
                   "weights": [120, 160, 190, 110, 34, 55, 66, 33]})
totalRows = df['time'].count()

def findMaximumValRelativeToCurrentRow(row):
    index = row.name
    if index != totalRows:
        # max over the current row and everything after it
        tempDf = df[index:totalRows]
        val = tempDf['weights'].max()
        df.at[index, 'max'] = val  # df.set_value was removed in pandas 1.0
    else:
        df.at[index, 'max'] = row['weights']

df.apply(findMaximumValRelativeToCurrentRow, axis=1)
print(df)
Is there any better way to do the operation than this?
You can use cummax with iloc for reverse order:
print (df['weights'].iloc[::-1])
7 33
6 66
5 55
4 34
3 110
2 190
1 160
0 120
Name: weights, dtype: int64
df['max1'] = df['weights'].iloc[::-1].cummax()
print (df)
time weights max max1
0 100 120 190.0 190
1 200 160 190.0 190
2 300 190 190.0 190
3 400 110 110.0 110
4 500 34 66.0 66
5 600 55 66.0 66
6 700 66 66.0 66
7 800 33 33.0 33
New to pandas, I'm trying to sum up all previous values of a column. In SQL I did this by joining the table to itself, so I've been taking the same approach in pandas, but having some issues.
Original Data Frame
TeamName PlayerCount Goals CalMonth
0 A 25 126 1
1 A 25 100 2
2 A 25 156 3
3 B 22 205 1
4 B 30 300 2
5 B 28 189 3
Code
prev_month = np.where(df3['CalMonth'] == 12, df3['CalMonth'] - 11, df3['CalMonth'] + 1)
df4 = pd.merge(df3, df3, how='left', left_on=['TeamName','CalMonth'], right_on=['TeamName', prev_month])
print(df4.head(20))
Output
  TeamName  PlayerCount_x  Goals_x  CalMonth_x  PlayerCount_y  Goals_y  CalMonth_y
0        A             25      126           1            NaN      NaN         NaN
1        A             25      100           2             25      126           1
2        A             25      156           3             25      100           2
3        B             22      205           1             22      NaN         NaN
4        B             22      300           2             22      205           1
5        B             22      189           3             22      100           2
The output is what I had in mind, but what I want now is to create a column that is YTD and sum up all Goals from previous months. Here are my desired results (can either include the current month or not, that can be done in an additional step):
  TeamName  PlayerCount_x  Goals_x  CalMonth_x  PlayerCount_y  Goals_y  CalMonth_y  Goals_YTD
0        A             25      126           1            NaN      NaN         NaN        NaN
1        A             25      100           2             25      126           1        126
2        A             25      156           3             25      100           2        226
3        B             22      205           1             22      NaN         NaN        NaN
4        B             22      300           2             22      205           1        205
5        B             22      189           3             22      100           2        305
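One possible way to get a year-to-date total without a self-join is a grouped cumulative sum. This is a sketch under assumptions: it works on the original frame (df3) rather than the merged one, it assumes one row per team and month within a single year, and the column names `Goals_YTD_incl` and `Goals_YTD_prev` are made up for illustration.

```python
import pandas as pd

# Original data frame from the question
df3 = pd.DataFrame({
    "TeamName": list("AAABBB"),
    "PlayerCount": [25, 25, 25, 22, 30, 28],
    "Goals": [126, 100, 156, 205, 300, 189],
    "CalMonth": [1, 2, 3, 1, 2, 3],
})

# The cumulative sum depends on row order, so sort by team and month first
df3 = df3.sort_values(["TeamName", "CalMonth"])

# Running total per team, including the current month
df3["Goals_YTD_incl"] = df3.groupby("TeamName")["Goals"].cumsum()

# Total of previous months only: subtract the current month's goals
df3["Goals_YTD_prev"] = df3["Goals_YTD_incl"] - df3["Goals"]
print(df3)
```

Here the first month of each team gets 0 in Goals_YTD_prev rather than the NaN shown in the desired output; whether 0 or NaN is preferable depends on how the column is used downstream.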