Pandas: cumulative sum with conditional subtraction - python
Consider a pandas dataframe like:
>> df
date_time op_type price volume
01-01-1970 9:30:01 ASK 100 1800
01-01-1970 9:30:25 ASK 90 1000
01-01-1970 9:30:28 BID 90 900
01-01-1970 9:30:28 TRADE 90 900
01-01-1970 9:31:01 BID 80 500
01-01-1970 9:31:09 ASK 80 100
01-01-1970 9:31:09 TRADE 80 100
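For reference, a frame like this can be built directly (a minimal sketch of the sample data shown above):

import pandas as pd

df = pd.DataFrame({
    'date_time': ['01-01-1970 9:30:01', '01-01-1970 9:30:25', '01-01-1970 9:30:28',
                  '01-01-1970 9:30:28', '01-01-1970 9:31:01', '01-01-1970 9:31:09',
                  '01-01-1970 9:31:09'],
    'op_type': ['ASK', 'ASK', 'BID', 'TRADE', 'BID', 'ASK', 'TRADE'],
    'price': [100, 90, 90, 90, 80, 80, 80],
    'volume': [1800, 1000, 900, 900, 500, 100, 100],
})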
I would like to do three calculations: i) the cumulative sum of volume over the op_type == "ASK" rows; ii) the cumulative sum of volume over the op_type == "BID" rows; and iii) the sum of those two cumulative volumes.
That is simple enough, but there is a condition for op_type == "TRADE" operations:
Whenever there is a TRADE operation whose price matches the price on a BID operation, I would like to subtract that TRADE operation's volume from the cumulative BID volume.
Whenever there is a TRADE operation whose price matches the price on an ASK operation, I would like to subtract that TRADE operation's volume from the cumulative ASK volume.
The output I'm looking for is:
>> df
date_time op_type price volume ASK_vol BID_vol BIDASK_vol
01-01-1970 9:30:01 ASK 100 1800 1800 0 1800
01-01-1970 9:30:25 ASK 90 1000 2800 0 2800
01-01-1970 9:30:28 BID 90 900 2800 900 3700
01-01-1970 9:30:28 TRADE 90 900 2800 0 2800
01-01-1970 9:31:01 BID 80 500 2800 500 3300
01-01-1970 9:31:09 ASK 80 100 2900 500 3400
01-01-1970 9:31:09 TRADE 80 100 2800 500 3300
I read this question but I'm not sure how to incorporate the conditional subtraction into that answer. I would greatly appreciate any help, thank you.
IIUC, this is what you need.
import numpy as np

# signed volume contributed by each quote row
a = np.where(df['op_type'] == 'ASK', df.volume, 0)
b = np.where(df['op_type'] == 'BID', df.volume, 0)

# for a TRADE row, subtract its volume when its price matches the price of the
# immediately preceding ASK (a_t) or BID (b_t) row
a_t = np.where(df['op_type'] == 'TRADE',
               np.where(df['op_type'].shift(1) == 'ASK',
                        np.where(df['price'] == df['price'].shift(1), -df.volume, 0), 0), 0)
b_t = np.where(df['op_type'] == 'TRADE',
               np.where(df['op_type'].shift(1) == 'BID',
                        np.where(df['price'] == df['price'].shift(1), -df.volume, 0), 0), 0)

df['ASK_vol'] = np.where(a_t != 0, a_t, a).cumsum()
df['BID_vol'] = np.where(b_t != 0, b_t, b).cumsum()
df['BIDASK_vol'] = df['ASK_vol'] + df['BID_vol']
Output:
date_time op_type price volume ASK_vol BID_vol BIDASK_vol
01-01-1970 9:30:01 ASK 100 1800 1800 0 1800
01-01-1970 9:30:25 ASK 90 1000 2800 0 2800
01-01-1970 9:30:28 BID 90 900 2800 900 3700
01-01-1970 9:30:28 TRADE 90 900 2800 0 2800
01-01-1970 9:31:01 BID 80 500 2800 500 3300
01-01-1970 9:31:09 ASK 80 100 2900 500 3400
01-01-1970 9:31:09 TRADE 80 100 2800 500 3300
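As a side note, given the df from the question, the same logic can be written a bit more compactly with np.select (a sketch only, still assuming the matching BID/ASK row is the one immediately before the TRADE):

import numpy as np

prev = df.shift(1)
is_match = df['op_type'].eq('TRADE') & df['price'].eq(prev['price'])

# signed volume per side: +volume for a quote, -volume for a trade hitting that side
ask_signed = np.select(
    [df['op_type'].eq('ASK'), is_match & prev['op_type'].eq('ASK')],
    [df['volume'], -df['volume']], default=0)
bid_signed = np.select(
    [df['op_type'].eq('BID'), is_match & prev['op_type'].eq('BID')],
    [df['volume'], -df['volume']], default=0)

df['ASK_vol'] = ask_signed.cumsum()
df['BID_vol'] = bid_signed.cumsum()
df['BIDASK_vol'] = df['ASK_vol'] + df['BID_vol']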
Related
How to compare two uneven dataset columns with each other?
As shown in the picture above, there are two datasets that do not have the same row count. The task is to compare the distance between each pair of cities to the range each vehicle can travel, i.e. compare the distance between city1 & city2 with all the vehicle type ranges.
Let's assume that we have 2 dictionaries (if you have a DataFrame, you can use the to_dict() method to convert it to a dictionary):

vehicles = {'A320': 5000, 'A330': 8000, 'B737': 5000, 'B747': 10000, 'Q400': 1500, 'ATR72': 1000}
city_distances = {'AA-BB': 3000, 'BB-CC': 6500, 'CC-AA': 400, 'AA-DD': 1000}

You can simply create a nested for loop and check whatever condition you want. I, for example, checked whether the vehicle could travel the city route.

for city_route in city_distances.keys():
    for vehicle in vehicles.keys():
        if vehicles[vehicle] >= city_distances[city_route]:
            print(f'Vehicle {vehicle} can travel the {city_route} route')
        else:
            print(f"Vehicle {vehicle} can't travel the {city_route} route")
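If the data starts out as two DataFrames rather than dictionaries, a conversion along these lines might work (a sketch; the frame and column names df_vehicles, df_distances, 'vehicle', 'range', 'route', 'distance' are assumptions for illustration, not from the post):

import pandas as pd

# Hypothetical input frames with assumed column names.
df_vehicles = pd.DataFrame({'vehicle': ['A320', 'Q400'], 'range': [5000, 1500]})
df_distances = pd.DataFrame({'route': ['AA-BB', 'CC-AA'], 'distance': [3000, 400]})

vehicles = df_vehicles.set_index('vehicle')['range'].to_dict()
city_distances = df_distances.set_index('route')['distance'].to_dict()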
Well, you didn't tell/show us the expected output, so I'll give you the code to let you .merge() both DFs; from there you can do pretty much everything you want:

df3 = df2.merge(df1, how='cross')
df3

Truncated result:

index  city_a  city_b  vehicles  range
0      AA-BB   3000    A320      5000
1      AA-BB   3000    A330      8000
2      AA-BB   3000    B737      5000
3      AA-BB   3000    B747      10000
4      AA-BB   3000    Q400      1500
5      AA-BB   3000    ATR72     1000
6      BB-CC   6500    A320      5000
7      BB-CC   6500    A330      8000
...
index  city_a  city_b  vehicles  range
16     CC-AA   400     Q400      1500
17     CC-AA   400     ATR72     1000
18     AA-DD   1000    A320      5000
19     AA-DD   1000    A330      8000
20     AA-DD   1000    B737      5000
21     AA-DD   1000    B747      10000
22     AA-DD   1000    Q400      1500
23     AA-DD   1000    ATR72     1000
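From there, keeping only the feasible vehicle/route pairs could be a one-liner (a sketch that reuses the column names shown in the truncated result above, where city_b appears to hold the route distance and range the vehicle range):

feasible = df3[df3['range'] >= df3['city_b']]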
How to make the output a percentage in python [duplicate]
This question already has answers here: Python format percentages (2 answers). Closed 1 year ago.

I have the following code and output:

# Percentage of data with 95% or greater discount
print((df[df["itc_disc_95_itc_sku"] == True].shape[0] / df.shape[0]) * 100, 'Percent of ITC Sku discounts were between 95 and 120')
print((df[df["itc_disc_95_disc_per_coupon"] == True].shape[0] / df.shape[0]) * 100, 'Percent of Discounts per coupon were between 95 and 120')
print((df[df["itc_disc_95_disc_open_box"] == True].shape[0] / df.shape[0]) * 100, 'Percent of Open Box Discounts were between 95 and 120')
print((df[df["itc_disc_95_disc_employee"] == True].shape[0] / df.shape[0]) * 100, 'Percent of Employee discounts were between 95 and 120')
print((df[df["itc_disc_95_disc_overide"] == True].shape[0] / df.shape[0]) * 100, 'Percent of Overide discounts were between 95 and 120')

12.0676663204247 Percent of ITC Sku discounts were between 95 and 120
8.827338637725374 Percent of Discounts per coupon were between 95 and 120
0.0855575236983875 Percent of Open Box Discounts were between 95 and 120
0.022239723285513567 Percent of Employee discounts were between 95 and 120
0.0 Percent of Overide discounts were between 95 and 120

I need to present the data in the notebook and want to format it so that it reads cleanly as a % with only two decimal places. My desired output would look like:

12.07% Percent of ITC Sku discounts were between 95 and 120
8.83% Percent of Discounts per coupon were between 95 and 120
0.09% Percent of Open Box Discounts were between 95 and 120
0.02% Percent of Employee discounts were between 95 and 120
0.0% Percent of Overide discounts were between 95 and 120
An f-string with %? Then you don't need the * 100.

>>> x = 0.120676663204247
>>> f'{x:.2%} of all questions are not completely terrible'
'12.07% of all questions are not completely terrible'
I guess you could just use f-strings, specifying .2f if you want two decimals:

print(f'{(df[df["itc_disc_95_itc_sku"] == True].shape[0] / df.shape[0]) * 100:.2f}%, Percent of ITC Sku discounts were between 95 and 120')
Did you try something like this?

num1 = ...
percentage = '{:.2%}'.format(num1)
print(percentage)

2 is the number of decimal places you need; you can change it if you want.
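Applied to the specific columns in the question, a small loop keeps the prints tidy (a sketch; it assumes the flag columns are boolean, and the labels are shortened versions of the original strings):

cols = {
    'itc_disc_95_itc_sku': 'ITC Sku discounts',
    'itc_disc_95_disc_per_coupon': 'Discounts per coupon',
    'itc_disc_95_disc_open_box': 'Open Box Discounts',
    'itc_disc_95_disc_employee': 'Employee discounts',
    'itc_disc_95_disc_overide': 'Overide discounts',
}
for col, label in cols.items():
    share = df[col].eq(True).mean()  # fraction of rows where the flag is True
    print(f'{share:.2%} of {label} were between 95 and 120')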
Pandas Relative Time Pivot
I have the last eight months of my customers' data; however, these months are not the same months, just the last months they happened to be with us. Monthly fees and penalties are stored in rows, but I want each of the last eight months to be a column.

What I have:

Customer  Amount  Penalties  Month
123       500     200        1/7/2017
123       400     100        1/6/2017
...
213       300     150        1/4/2015
213       200     400        1/3/2015

What I want:

Customer  Month-8-Amount  Month-7-Amount  ...  Month-1-Amount  Month-1-Penalties  ...
123       500             400                  450             300
213       900             250                  300             200
...

What I've tried:

df = df.pivot(index=num, columns=[amount, penalties])

I got this error:

ValueError: all arrays must be same length

Is there some ideal way to do this?
You can do it with unstack and set_index:

# assuming all dates are sorted properly, then we do cumcount
df['Month'] = df.groupby('Customer').cumcount() + 1

# slice the most recent 8
df = df.loc[df.Month <= 8, :]

# unstack to reshape your df
s = df.set_index(['Customer', 'Month']).unstack().sort_index(level=1, axis=1)

# flatten the MultiIndex columns into a single level
s.columns = s.columns.map('{0[0]}-{0[1]}'.format)
s.add_prefix("Month-")

Out[189]:
          Month-Amount-1  Month-Penalties-1  Month-Amount-2  Month-Penalties-2
Customer
123                  500                200             400                100
213                  300                150             200                400
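One detail worth calling out: cumcount() numbers rows in their existing order, so for Month-1 to mean the most recent month you likely need to sort by date first. A sketch of that step, assuming the Month column can be parsed by pd.to_datetime:

import pandas as pd

# Parse the Month column as dates (format assumed to be parseable).
df['Month'] = pd.to_datetime(df['Month'])
# Sort each customer's rows from most recent to oldest so that cumcount() + 1
# yields 1 for the latest month, 2 for the one before it, and so on.
df = df.sort_values(['Customer', 'Month'], ascending=[True, False])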
Python Groupby and Plot
With the following groupby, how can I ultimately group the data so that I can plot the price (x-axis) and size (y-axis) while iterating through every symbol and exchange? Thanks.

df_group = df.groupby(['symbol','exchange','price'])["size"].sum()

symbol  exchange  price
AAPL    ARCA      154.630     800
                  154.640     641
                  154.650     100
                  154.660     300
                  154.670     400
                  154.675     100
                  154.680     300
                  154.690    1390
                  154.695     100
                  154.700     360
                  154.705     100
                  154.710     671
                  154.720     190
                  154.725     100
                  154.730     400
...
XOM     PSX        80.67     1300
                   80.68     2721
                   80.69     1901
                   80.7       700
                   80.71      800
                   80.72      200
                   80.73      700
                   80.74      500
                   80.75      600
                   80.76      300
                   80.77      900
                   80.78      100
                   80.79     1000
                   80.8      1000
You can use aggregate functions and reset the index, so that symbol, exchange, price and size all become plain columns you can iterate over:

fun = {'size': 'sum'}
df_group = df.groupby(['symbol', 'exchange', 'price']).agg(fun).reset_index()
df_group
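To then draw one price-vs-size chart per symbol/exchange pair, a matplotlib sketch along these lines could work (it assumes df_group is the flat frame produced above):

import matplotlib.pyplot as plt

for (symbol, exchange), g in df_group.groupby(['symbol', 'exchange']):
    fig, ax = plt.subplots()
    ax.plot(g['price'], g['size'])
    ax.set_title(f'{symbol} @ {exchange}')
    ax.set_xlabel('price')
    ax.set_ylabel('size')
plt.show()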
Pandas column mathematical operations: no error, no answer
I am trying to perform some simple mathematical operations on the files. The columns in file_1.csv below are dynamic in nature; the number of columns will increase from time to time, so we cannot rely on a fixed last column.

master_ids.csv: before any pre-processing

Ids,ref0   #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345

master_count.csv: before any processing

Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300

master_ids.csv: after one pre-processing step

Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500

master_count.csv: expected output (append/merge)

Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,750

E.g. Ids 1234 appears 2 times, so its value at the current time (00:30:00), which is 500, is divided by the count of occurrences of that Ids and then added to the corresponding ref1 values, creating a new column named after the current time.

master_ids.csv: after another pre-processing step

Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600

master_count.csv: expected output after another execution (merge/append)

Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,750,600

So here the current time is 00:45:00; we divide the current-time value by the count of Ids occurrences and then add it to the corresponding ref1 values, creating a new column for the new current time.

Program (by Jianxun Li):

import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 have a duplicated column 00:00:00, use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

# do the division by the number of occurrences of each Ids
# and add each time-series column
def my_func(group):
    num_obs = len(group)
    # process columns from the next time series onward (inclusive)
    group.iloc[:, 4:] = (group.iloc[:, 4:] / num_obs).add(group.iloc[:, 3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)

The program executes with no errors and no output. Need some fixing suggestions please.
This program assumes updating of both master_counts.csv and master_ids.csv over time and should be robust to the timing of the updates. That is, it should produce correct results if run multiple times on the same update or if an update is missed.

import pandas as pd

# this program updates (and replaces) the original master_counts.csv with data
# in master_ids.csv, so we only want the first 5 columns when we read it in
master_counts = pd.read_csv('master_counts.csv').iloc[:, :5]

# this file is assumed to be periodically updated with the addition of new columns
master_ids = pd.read_csv('master_ids.csv')

for i in range(2, len(master_ids.columns)):
    master_counts = master_counts.merge(master_ids.iloc[:, [0, i]], on='Ids')
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    master_counts.iloc[:, -1] = master_counts['ref1'] + master_counts.iloc[:, -1] / count

master_counts.to_csv('master_counts.csv', index=False)

%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
import pandas as pd
import numpy as np

csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/lat_lon_master.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

Out[53]:
      00:00:00  00:30:00  00:45:00
Ids
1234      1000       500       100
8435      5243       300       200
2341       563       400       400
7352       345       500       600

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

Out[81]:
         Name   lat   lon  00:00:00
Ids
1234   London  40.4  10.1       500
1234   Prague  40.4  10.1       500
2341  NewYork  60.6  30.3       700
2341  Austria  60.6  30.3       700
7352    Japan  70.7  80.8       500
7352    China  70.7  80.8       500
8435    Paris  50.5  20.2       400
8435   Berlin  50.5  20.2       400

# df1 and df2 have a duplicated column 00:00:00, use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

Out[55]:
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids
1234   London  40.4  10.1       500       500       100
1234   Prague  40.4  10.1       500       500       100
2341  NewYork  60.6  30.3       700       400       400
2341  Austria  60.6  30.3       700       400       400
7352    Japan  70.7  80.8       500       500       600
7352    China  70.7  80.8       500       500       600
8435    Paris  50.5  20.2       400       300       200
8435   Berlin  50.5  20.2       400       300       200

# do the division by the number of occurrences of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process columns from 00:30:00 (inclusive) onward
    group.iloc[:, 4:] = (group.iloc[:, 4:] / num_obs).add(group.iloc[:, 3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)

Out[104]:
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids
1234   London  40.4  10.1       500       750       550
1234   Prague  40.4  10.1       500       750       550
2341  NewYork  60.6  30.3       700       900       900
2341  Austria  60.6  30.3       700       900       900
7352    Japan  70.7  80.8       500       750       800
7352    China  70.7  80.8       500       750       800
8435    Paris  50.5  20.2       400       550       500
8435   Berlin  50.5  20.2       400       550       500
My suggestion is to reformat your data so that it's like this:

Ids,ref0,current_time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None

Then after your "first preprocess" it will become like this:

Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
1234,1000,00:30:00,500
8435,5243,00:30:00,300
2341,563,00:30:00,400
7352,345,00:30:00,500
...

and so on. The idea is that you should make a single column to hold the time information, and then for each preprocess, insert the new data into new rows and give those rows a value in the time column indicating what time period they come from. You may or may not want to keep the initial rows with "None" in this table; maybe you just want to start with the "00:30:00" values and keep the "master ids" in a separate file. I haven't totally followed exactly how you're computing the new ref1 values, but the point is that doing this is likely to greatly simplify your life. In general, instead of adding an unbounded number of new columns, it can be much nicer to add a single new column whose values will then be the values you were going to use as headers for the open-ended new columns.
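A hedged sketch of that long-format idea using pandas.melt (the wide master_ids shape is taken from the question; 'time' and 'ref1' are the long-format column names suggested above):

import pandas as pd

# Wide table as described in the question: one column per time stamp.
master_ids = pd.DataFrame({
    'Ids': [1234, 8435, 2341, 7352],
    'ref0': [1000, 5243, 563, 345],
    '00:30:00': [500, 300, 400, 500],
    '00:45:00': [100, 200, 400, 600],
})

# Melt every time column into (time, ref1) rows, one per Ids and time stamp.
long_ids = master_ids.melt(id_vars=['Ids', 'ref0'], var_name='time', value_name='ref1')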