My apologies for the vague title; it's hard to put what I want into words.
I'm trying to build a filled line chart with the date on the x axis and the total transaction amount over time on the y axis.
My data
The object is a pandas dataframe.
date | symbol | type | qty | total
----------------------------------------------
2020-09-10 ABC Buy 5 10
2020-10-18 ABC Buy 2 20
2020-09-19 ABC Sell 3 15
2020-11-05 XYZ Buy 10 8
2020-12-03 XYZ Buy 10 9
2020-12-05 ABC Buy 2 5
What I want
date | symbol | type | qty | total | aggregate_total
------------------------------------------------------------
2020-09-10 ABC Buy 5 10 10
2020-10-18 ABC Buy 2 20 10+20 = 30
2020-09-19 ABC Sell 3 15 10+20-15 = 15
2020-11-05 XYZ Buy 10 8 8
2020-12-03 XYZ Buy 10 9 8+9 = 17
2020-12-05 ABC Buy 2 5 10+20-15+5 = 20
Where I am now
I'm working with 2 nested for loops: one iterating over the symbols, one iterating over each row. I store the temporary results in lists. I'm still unsure how I will add the results to the final dataframe. I could reorder the dataframe by symbol and date, then concatenate the temporary lists and finally assign the combined list to a new column.
The code below is just the inner loop over the rows.
# Running totals start at 0; this placeholder is removed after the loop
temp_agg_total = [0]
temp_agg_qty = [0]
af = df.loc[df['symbol'] == 'ABC']
for i in range(af.shape[0]):
    # print(af.iloc[0:i,[2,4]])
    # if type is a buy, we add the last operation to the aggregate
    # (the data spells it "Buy", so comparing against "BUY" never matches)
    if af.iloc[i, 2] == "Buy":
        temp_agg_total.append(temp_agg_total[i] + af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] + af.iloc[i, 3])
    else:
        temp_agg_total.append(temp_agg_total[i] - af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] - af.iloc[i, 3])
# Remove first element of list (0)
temp_agg_total.pop(0)
temp_agg_qty.pop(0)
af = af.assign(agg_total=temp_agg_total,
               agg_qty=temp_agg_qty)
My question
Is there a better way to do this in pandas or numpy ? It feels really heavy for something relatively simple.
The presence of the Buy/Sell type of operation complicates things.
Regards
# negate total of Sells
df.loc[df['type']=='Sell', 'total'] *= -1
# cumulative sum of the total based on symbol
df['aggregate_total'] = df.groupby('symbol')['total'].cumsum()
Is this what you're looking for?
df['Agg'] = 1
df.loc[df['type'] == 'Sell', 'Agg'] = -1
df['Agg'] = df['Agg']*df['total']
# cumulative sum per symbol (a plain df['Agg'].cumsum() would mix ABC and XYZ)
df.groupby('symbol')['Agg'].cumsum()
df["Type_num"] = df["type"].map({"Buy":1,"Sell":-1})
df["Num"] = df.Type_num*df.total
df.groupby(["symbol"],as_index=False)["Num"].cumsum()
pd.concat([df,df.groupby(["symbol"],as_index=False)["Num"].cumsum()],axis=1)
date symbol type qty total Type_num Num CumNum
0 2020-09-10 ABC Buy 5 10 1 10 10
1 2020-10-18 ABC Buy 2 20 1 20 30
2 2020-09-19 ABC Sell 3 15 -1 -15 15
3 2020-11-05 XYZ Buy 10 8 1 8 8
4 2020-12-03 XYZ Buy 10 9 1 9 17
5 2020-12-05 ABC Buy 2 5 1 5 20
The most important step here is the cumulative sum. Grouping by symbol makes sure that the cumulative sum is computed separately for each symbol. The renaming and dropping of columns should be easy for you.
The trick is that I mapped {Buy, Sell} to {1, -1}.
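For reference, a minimal self-contained sketch of the sign-and-cumsum idea on the sample data from the question. The names sign and signed_total are illustrative and do not come from the answers above; the original total column is left untouched.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date": ["2020-09-10", "2020-10-18", "2020-09-19",
             "2020-11-05", "2020-12-03", "2020-12-05"],
    "symbol": ["ABC", "ABC", "ABC", "XYZ", "XYZ", "ABC"],
    "type": ["Buy", "Buy", "Sell", "Buy", "Buy", "Buy"],
    "qty": [5, 2, 3, 10, 10, 2],
    "total": [10, 20, 15, 8, 9, 5],
})

# Map each operation to a sign: Buy adds, Sell subtracts
sign = df["type"].map({"Buy": 1, "Sell": -1})
signed_total = df["total"] * sign

# Running total per symbol, in the current row order
df["aggregate_total"] = signed_total.groupby(df["symbol"]).cumsum()
print(df)
```

The running total follows the row order of the frame, so sort by date first if the rows are not already in the order you want.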
I am new to pandas and trying to figure out how to calculate the percentage change (difference) between two years, given that sometimes there is no previous year.
I am given a dataframe as follows:
company date amount
1 Company 1 2020 3
2 Company 1 2021 1
3 COMPANY2 2020 7
4 Company 3 2020 4
5 Company 3 2021 4
.. ... ... ...
766 Company N 2021 9
765 Company N 2020 1
767 Company XYZ 2021 3
768 Company X 2021 3
769 Company Z 2020 2
I wrote something like this:
for company in unique(df2.company):
    company_df = df2[df2.company== company]
    company_df.sort_values(by ="date")
    company_df_year = company_df.amount.tolist()
    company_df_year.pop()
    company_df_year.insert(0,0)
    company_df["value_year_before"] = company_df_year
    if any in company_df.value_year_before == None:
        company_df["diff"] = 0
    else:
        company_df["diff"] = (company_df.amount- company_df.value_year_before)/company_df.value_year_before
df2["ratio"] = company_df["diff"]
But I keep getting NaN.
Where did I make a mistake?
The main issue is that you are overwriting company_df in each iteration of the loop and only keeping the last one.
However, when you find yourself writing a for loop in pandas, there is usually an easier, vectorized way to accomplish the goal. Here you could use groupby and pct_change to compute the ratio of each group.
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change()
df['ratio'] = df['ratio'].fillna(0.0)
groupby keeps the order of the rows within each group, so we sort first to ensure that the dates are in the correct order, and fillna replaces any NaN values with 0.
Result:
company date amount ratio
3 COMPANY2 2020 7 0.000000
1 Company 1 2020 3 0.000000
2 Company 1 2021 1 -0.666667
4 Company 3 2020 4 0.000000
5 Company 3 2021 4 0.000000
765 Company N 2020 1 0.000000
766 Company N 2021 9 8.000000
768 Company X 2021 3 0.000000
767 Company XYZ 2021 3 0.000000
769 Company Z 2020 2 0.000000
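A minimal, self-contained sketch of the groupby and pct_change approach, run on a small made-up frame (the rows below are illustrative, not the asker's full data):

```python
import pandas as pd

# Illustrative rows, deliberately out of date order
df = pd.DataFrame({
    "company": ["Company 1", "Company 3", "COMPANY2", "Company 1", "Company 3"],
    "date": [2021, 2021, 2020, 2020, 2020],
    "amount": [1, 4, 7, 3, 4],
})

# Sort so that each company's rows are in chronological order
df = df.sort_values(["company", "date"])

# Relative change versus the previous row of the same company
df["ratio"] = df.groupby("company")["amount"].pct_change()

# The first year of each company has no predecessor, so it is NaN; use 0 instead
df["ratio"] = df["ratio"].fillna(0.0)
print(df)
```

Filling with 0 makes "no previous year" indistinguishable from "no change"; leave the NaN in place if that distinction matters.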
Apply an anonymous function that calculates the percentage change and returns it if the group has more than one value. Use:
df = pd.DataFrame({'company': [1,1,3], 'date':[2020,2021,2020], 'amount': [4,5,7]})
df.groupby('company')['amount'].apply(lambda x: (list(x)[1]-list(x)[0])/list(x)[0] if len(x)>1 else 'not enough values')
I am trying to parse and organize a trading history file. I want to aggregate every run of 3 or 4 rows that have the same Type (BUY or SELL) into one row, but only if they come directly after each other. If they don't, I want to keep the single row as it is.
In the example below, the consecutive buy trades should be aggregated into one row, followed by one row for the sell trade.
The result should be a new df with the aggregated trade prices and amounts.
link for csv: https://drive.google.com/file/d/1GoDRdI7G8uJzuLoFrm5InbDg23mAwW6o/view?usp=sharing
You can use this to get the results you are looking for. It uses a cumulative sum that increments whenever the current value is not equal to the previous value.
dictionary = { "BUY": 1, "SELL": 0}
df['id1'] = df['Type'].map(dictionary)
df['grp'] = (df['id1']!=df['id1'].shift()).cumsum()
Now you can aggregate the values using a simple groupby like below. This will sum the amount for each run of consecutive buys or sells.
df.groupby(['grp'])['Amount'].sum()
This is the output of grp column.
Type grp
0 BUY 1
1 BUY 1
2 BUY 1
3 BUY 1
4 SELL 2
5 SELL 2
6 SELL 2
7 SELL 2
8 BUY 3
9 SELL 4
10 BUY 5
11 SELL 6
12 BUY 7
13 SELL 8
14 BUY 9
15 BUY 9
16 SELL 10
17 SELL 10
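As a sketch of how the run labelling and the aggregation fit together, here is a self-contained example on made-up trades. Only the Type and Amount column names come from the thread; the values and the runs name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical trade history
df = pd.DataFrame({
    "Type": ["BUY", "BUY", "SELL", "BUY", "SELL", "SELL"],
    "Amount": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# A new run starts wherever Type differs from the previous row;
# the cumulative sum of those change flags gives each run its own id
df["grp"] = (df["Type"] != df["Type"].shift()).cumsum()

# One output row per run: keep the Type, sum the Amount
runs = df.groupby("grp").agg(Type=("Type", "first"), Amount=("Amount", "sum"))
print(runs)
```

Comparing the Type column directly avoids the intermediate numeric mapping; other columns can be added to agg with whichever reduction suits them.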
I have a dataframe that is of the following type. I have all the columns except the final column, "Total Previous Points P1", which I am hoping to create:
The data is sorted by the "Date" column.
Date | Points_P1 | P1_id | P2_id | Total_Previous_Points_P1
-------------+---------------+----------+-----------------------------------
10/08/15 | 5 | 100 | 90 | 500
-------------+---------------+----------+-----------------------------------
11/09/16 | 5 | 100 | 90 | 500
-------------+---------------+----------+-----------------------------------
20/09/19 | 10 | 10000 | 360 | 4,200
-------------+---------------+----------+-----------------------------------
... | | ... | ... | ...
-------------+---------------+----------+-----------------------------------
n | | | |
Now the column I want to create, is the "Total_Previous_Points_P1" column shown above.
The way to create it:
For each row, check the date (call this DATE_VAL) and P1_id (call this ID_VAL)
Now, for all rows before DATE_VAL AND where P1 id == ID_VAL, sum up the previous points.
Put this sum in the final column, in the current row
Is there a fast pandas pythonic way to do this? My data set is very large.
Thank you!
The solution by SIA computes sum of Points_P1 including the
current value of Points_P1, whereas the requirement is to sum
previous points (for all rows before...).
Assuming that dates in each group are unique (in your sample they are),
the proper, pandasonic solution should include the following steps:
Sort by Date.
Group by P1_id, then for each group:
Take Points_P1 column.
Compute cumulative sum.
Subtract the current value of Points_P1.
So the whole code should be:
df['Total_Previous_Points_P1'] = df.sort_values('Date')\
.groupby(['P1_id']).Points_P1.cumsum() - df.Points_P1
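A small self-contained sketch of these steps, assuming the dates are stored as dd/mm/yy strings as in the question's table (in that case they need parsing first, otherwise sorting is alphabetical rather than chronological). The rows are made up and the running name is illustrative.

```python
import pandas as pd

# Made-up rows, not in chronological order
df = pd.DataFrame({
    "Date": ["20/09/19", "11/09/16", "10/08/15", "21/09/19"],
    "Points_P1": [10, 5, 5, 7],
    "P1_id": [10000, 100, 100, 100],
})

# Assumption: day/month/two-digit-year strings
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")

# Inclusive running sum per P1_id, computed in date order
running = df.sort_values("Date").groupby("P1_id")["Points_P1"].cumsum()

# Subtracting the current row's points leaves only the earlier rows;
# the subtraction and the assignment both align on the index
df["Total_Previous_Points_P1"] = running - df["Points_P1"]
print(df)
```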
Edit
If Date is not unique (within the group of rows for a given P1_id), the case
is more complicated, as the following source DataFrame shows:
Date Points_P1 P1_id
0 2016-11-09 5 100
1 2016-11-09 3 100
2 2015-10-08 5 100
3 2019-09-20 10 10000
4 2019-09-21 7 100
5 2019-07-10 12 10000
6 2019-12-10 12 10000
Note that for P1_id == 100 there are two rows for 2016-11-09.
In this case, start by computing "group" sums of previous points,
for each P1_id and Date:
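# group_keys=False keeps the (P1_id, Date) index as is; recent pandas
# versions would otherwise prepend the group key as an extra index level
sumPrev = df.groupby(['P1_id', 'Date']).Points_P1.sum()\
    .groupby(level=0, group_keys=False).apply(lambda gr: gr.shift(fill_value=0).cumsum())\
    .rename('Total_Previous_Points_P1')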
The result is:
P1_id Date
100 2015-10-08 0
2016-11-09 5
2019-09-21 13
10000 2019-07-10 0
2019-09-20 12
2019-12-10 22
Name: Total_Previous_Points_P1, dtype: int64
Then merge df with sumPrev on P1_id and Date (in sumPrev on the index):
df = pd.merge(df, sumPrev, left_on=['P1_id', 'Date'], right_index=True)
To show the result, it is more instructive to sort df also on ['P1_id', 'Date']:
Date Points_P1 P1_id Total_Previous_Points_P1
2 2015-10-08 5 100 0
0 2016-11-09 5 100 5
1 2016-11-09 3 100 5
4 2019-09-21 7 100 13
5 2019-07-10 12 10000 0
3 2019-09-20 10 10000 12
6 2019-12-10 12 10000 22
As you can see:
The first sum for each P1_id is 0 (no points from previous dates).
E.g. for both rows with Date == 2016-11-09 the sum of previous
points is 5 (which is in row for Date == 2015-10-08).
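The same numbers can also be reached without apply. This is an alternative formulation rather than the code above: sum per (P1_id, Date), take the running sum within each P1_id, and subtract each date's own sum so that only strictly earlier dates remain. The names daily, running, prev and out are illustrative.

```python
import pandas as pd

# Source DataFrame from the non-unique Date example
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-11-09", "2016-11-09", "2015-10-08", "2019-09-20",
        "2019-09-21", "2019-07-10", "2019-12-10",
    ]),
    "Points_P1": [5, 3, 5, 10, 7, 12, 12],
    "P1_id": [100, 100, 100, 10000, 100, 10000, 10000],
})

# Total points per player and date (index is sorted by P1_id, then Date)
daily = df.groupby(["P1_id", "Date"])["Points_P1"].sum()

# Inclusive running sum of the daily totals within each player
running = daily.groupby(level="P1_id").cumsum()

# Removing the date's own total leaves the points from earlier dates only
prev = (running - daily).rename("Total_Previous_Points_P1")

# Attach the per-(P1_id, Date) value to every original row
out = pd.merge(df, prev, left_on=["P1_id", "Date"], right_index=True, how="left")
print(out.sort_values(["P1_id", "Date"]))
```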
Try:
df['Total_Previous_Points_P1'] = df.groupby(['P1_id'])['Points_P1'].cumsum()
How It Works
First, it groups the data by the P1_id column.
Then it accesses the Points_P1 values on the grouped dataframe and applies the cumulative sum function cumsum(), which returns the sum of points up to and including the current row for each group.
I have data sets that are ~30-60,000,000 lines each. Each Name has one or more unique IDs associated with it for every day in the data set. For some OP_DATE and OP_HOUR values, a unique ID can have 0 or blank values in Load1, Load2 or Load3.
I'm looking for a way to calculate the overall maximum values of columns over all the OP_DATE values, for data that looks like this:
Name ID OP_DATE OP_HOUR OP_TIME Load1 Load2 Load3
OMI 1 2001-01-01 1 1 11 10 12
OMI 1 2001-01-01 2 0.2 1 12 10
.
.
OMI 2A 2001-01-01 1 0.4 5
.
.
OMI 2A 2001-01-01 24 0.6 2 7 12
.
.
Kain 2 01 2002-01-01 1 0.1 6 12
Kain 2 01 2002-01-01 2 0.98 3 14 7
.
.
OMI 1 2018-01-01 1 0.89 12 10 20
.
.
I want to find the maximum values of Load1, Load2, Load3, and find what OP_DATE, OP_TIME and OP_HOUR that it occurred on.
The output I want is:
Name ID max OP_DATE max OP_HOUR max OP_TIME max Load1 max Load2 max Load3
OMI 1 2011-06-11 22 ..... max values on dates
OMI 2A 2012-02-01 12 ..... max values on dates
Kain 2 01 2006-01-01 1..... max values on dates
Is there a way I can do this easily?
I've tried:
unique_MAX = df.groupby(['Name','ID'])[['Load1', 'Load2', 'Load3']].max().reset_index()
But this would group only by the dates and give me a total maximum - I'd like the associated dates, hours, and times as well.
To get the full row of information for any given fields [max]:
Get the index locations for the max of each group you desire
Use the indexes to return the full row at each location
An example for finding the max Load1 for each Name & ID pair
idx = df.groupby(['Name','ID'])['Load1'].transform('max') == df['Load1']
df[idx]
Out[14]:
name ID dt x y
1 Fred 050 1/2/2018 2 4
4 Dave 001 1/3/2018 6 1
5 Carly 002 1/3/2018 5 7
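The two steps listed above (find the index of each group's maximum, then pull the full rows) can also be sketched literally with idxmax. The data below is made up, reusing the question's column names; max_rows is an illustrative name.

```python
import pandas as pd

# Made-up rows using the question's column names
df = pd.DataFrame({
    "Name": ["OMI", "OMI", "OMI", "Kain", "Kain"],
    "ID": ["1", "1", "2A", "01", "01"],
    "OP_DATE": ["2001-01-01", "2001-01-01", "2001-01-01", "2002-01-01", "2002-01-01"],
    "OP_HOUR": [1, 2, 1, 1, 2],
    "Load1": [1, 11, 5, 6, 3],
})

# Index label of the row holding the largest Load1 in each (Name, ID) group
idx = df.groupby(["Name", "ID"])["Load1"].idxmax()

# Full rows at those labels, so OP_DATE and OP_HOUR come along
max_rows = df.loc[idx]
print(max_rows)
```

Unlike the boolean mask, idxmax returns exactly one row per group (the first one on ties). Repeat it per column (Load2, Load3) if each maximum can fall on a different row.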
I am learning python and at the moment I am playing with some sales data. The data is in csv format and is showing weekly sales.
I have below columns with some sample data as below:
store# dept# dates weeklysales
1 1 01/01/2005 50000
1 1 08/01/2005 120000
1 1 15/01/2005 75000
1 1 22/01/2005 25000
1 1 29/01/2005 18000
1 2 01/01/2005 15000
1 2 08/01/2005 12000
1 2 15/01/2005 75000
1 2 22/01/2005 35000
1 2 29/01/2005 28000
1 1 01/02/2005 50000
1 1 08/02/2005 120000
1 1 15/02/2005 75000
1 1 22/03/2005 25000
1 1 29/03/2005 18000
I want to sum the weeklysales per month for each department and display the resulting records.
I have tried to use groupby function in Pandas from below links:
how to convert monthly data to quarterly in pandas
Pandas group by and sum two columns
Pandas group-by and sum
But with the approaches above I get the sum of all the columns, so the store and dept numbers are added up as well, giving the following output:
store# dept# dates weeklysales
4 3 01/2005 28800
4 1 01/2005 165000
4 3 02/2005 245000
4 3 03/2005 43000
I do not want to sum the store and dept numbers; I just want to sum the weeklysales figure for each month and display it like this:
store# dept# dates weeklysales
1 1 01/2005 28800
1 2 01/2005 165000
1 1 02/2005 245000
1 1 03/2005 43000
Will be grateful if I can get a solution for that.
Cheers,
Is this what you are after?
Convert dates to month/year format and then group and sum sales.
(df.assign(dates=df.dates.dt.strftime('%m/%Y'))
.groupby(['store#','dept#','dates'])
.sum()
.reset_index()
)
Out[243]:
store# dept# dates weeklysales
0 1 1 01/2005 288000
1 1 1 02/2005 245000
2 1 1 03/2005 43000
3 1 2 01/2005 165000
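For completeness, a self-contained sketch of the same idea starting from the raw CSV-style strings. It assumes the dates are day/month/year text, which has to be parsed before .dt can be used; the monthly name is illustrative.

```python
import pandas as pd

# Sample rows from the question
week_days = ["01/01/2005", "08/01/2005", "15/01/2005", "22/01/2005", "29/01/2005"]
df = pd.DataFrame({
    "store#": [1] * 15,
    "dept#": [1] * 5 + [2] * 5 + [1] * 5,
    "dates": week_days * 2 + ["01/02/2005", "08/02/2005", "15/02/2005",
                              "22/03/2005", "29/03/2005"],
    "weeklysales": [50000, 120000, 75000, 25000, 18000,
                    15000, 12000, 75000, 35000, 28000,
                    50000, 120000, 75000, 25000, 18000],
})

# Assumption: dates are dd/mm/yyyy strings
df["dates"] = pd.to_datetime(df["dates"], format="%d/%m/%Y")

# Group on store, dept and month so that only weeklysales is summed
monthly = (
    df.assign(dates=df["dates"].dt.strftime("%m/%Y"))
      .groupby(["store#", "dept#", "dates"], as_index=False)["weeklysales"]
      .sum()
)
print(monthly)
```

Because store# and dept# are grouping keys here, they are carried through unchanged instead of being added up.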