Efficient way to iterate through a large dataframe - python
I have a csv file that contains several thousand records of company stock data. It contains the following integer fields:
low_price, high_price, volume_traded
10, 20, 45667
15, 22, 256565
41, 47, 45645
30, 39, 547343
My requirement is to create a new csv file from this data by accumulating the volume_traded at every price level (from low to high). The final result would just be two columns as follows:
price, total_volume_traded
10, 45667
11, 45667
12, 45667
....
....
15, 302232
etc
In other words the final csv contains one record for every price level (not just the high/low but also the prices in-between), along with the total amount of volume_traded at that price level.
I have this working, however it is terribly slow and inefficient. I'm sure there must be better ways of accomplishing this.
Basically what I've done is use nested loops:
First iterate through each row.
On each row, create a nested loop to iterate through the price range from low_price to high_price.
Check if the price already exists in the new dataframe; if so, add the current volume_traded to it. If it doesn't exist, append the price and volume (i.e. create a new row).
Below is some of the relevant code. I would be grateful if anyone could advise a better way of doing this in terms of efficiency/speed:
df_existing = # dataframe created from existing csv
df_new = # dataframe for new Price/Volume values

for index, row in df_existing.iterrows():
    price = row['low_price']
    for i in range(row['low_price'], row['high_price'] + 1):
        volume = row['volume_traded']
        df_new = accumulate_volume(df_new, price, volume)
        price += 1
def accumulate_volume(df_new, price, volume):
    # If price level already exists, add volume to existing
    if df_new['Price'].loc[df_new['Price'] == price].count() > 0:
        df_new['Volume'].loc[df_new['Price'] == price] += volume
        return df_new
    else:
        # First occurrence of price level, add new row
        tmp = {'Price': int(price), 'Volume': volume}
        return df_new.append(tmp, ignore_index=True)
#once the above finishes, df_new is written to the new csv file
My guess for why this is so slow is at least partly because 'append' creates a new object every time it's called, and it gets called a LOT. In total, the nested loop from the above code gets run 1595653 times.
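For reference, a minimal sketch (with made-up data) of why repeated append hurts: each call copies the entire frame into a new object, so n appends cost roughly O(n^2) in total, while building the frame once from a list of records is linear. DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; it appears here only to illustrate the pattern.

import pandas as pd

rows = [{'Price': p, 'Volume': p * 10} for p in range(1000)]

# Slow pattern: every append copies the whole frame.
df_slow = pd.DataFrame(columns=['Price', 'Volume'])
for r in rows:
    df_slow = df_slow.append(r, ignore_index=True)  # O(len(df_slow)) per call

# Fast pattern: accumulate plain Python objects, build the frame once.
df_fast = pd.DataFrame(rows)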
I would be very grateful for any assistance.
Let's forget for a moment about potential issues with the methodology (think about how your results would look if 100k shares traded at a price of 50-51 and 100k traded at 50-59).
Below is a set of commented steps that should achieve your goal:
import pandas as pd

# Initialize DataFrame.
df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

# Initialize a price dictionary spanning the range of potential prices.
d = {price: 0 for price in range(min(df.low), max(df.high) + 1)}

# Create a helper function to add volume to a given price bucket.
def add_volume(price_dict, price, volume):
    price_dict[price] += volume

# Use a nested list comprehension to call the function and populate the dictionary.
_ = [[add_volume(d, price, volume) for price in range(low, high + 1)]
     for low, high, volume in zip(df.low, df.high, df.volume)]

# Convert the dictionary to a DataFrame and output to csv.
idx = pd.Index(list(d.keys()), name='price')
df = pd.DataFrame(list(d.values()), index=idx, columns=['total_volume_traded'])
df.to_csv('output.csv')
>>> df
total_volume_traded
price
10 45667
11 45667
12 45667
13 45667
14 45667
15 302232
16 302232
17 302232
18 302232
19 302232
20 302232
21 256565
22 256565
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 0
41 45645
42 45645
43 45645
44 45645
45 45645
46 45645
47 45645
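For what it's worth, the same bucketing can be done with no Python-level loop at all. Below is a minimal sketch (same toy frame as above) of a difference-array approach: add each row's volume at its low price, subtract it just past its high price, and take a cumulative sum. np.add.at is used because a plain fancy-indexed += would drop repeated indices.

import numpy as np
import pandas as pd

df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

lo, hi = df.low.min(), df.high.max()
diff = np.zeros(hi - lo + 2, dtype=np.int64)

# Volume enters at low and leaves just past high.
np.add.at(diff, df.low.to_numpy() - lo, df.volume.to_numpy())
np.add.at(diff, df.high.to_numpy() - lo + 1, -df.volume.to_numpy())

result = pd.DataFrame({'price': np.arange(lo, hi + 1),
                       'total_volume_traded': diff[:-1].cumsum()})
result.to_csv('output.csv', index=False)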
I would first groupby the 'low_price' column, then sum the volume_traded and reset the index. This effectively accumulates all the prices of interest. Then sort by price, which makes the prices monotonic so that we can use them as the index. After setting the index we can call reindex with a new index and fill the missing values using method='pad':
In [33]:
temp="""low_price,high_price,volume_traded
10,20,45667
15,22,256565
41,47,45645
10,20,12345
30,39,547343"""
df = pd.read_csv(io.StringIO(temp))
df
Out[33]:
low_price high_price volume_traded
0 10 20 45667
1 15 22 256565
2 41 47 45645
3 10 20 12345
4 30 39 547343
In [34]:
df1 = df.groupby('low_price')['volume_traded'].sum().reset_index()
df1
Out[34]:
low_price volume_traded
0 10 58012
1 15 256565
2 30 547343
3 41 45645
In [36]:
df1.sort_values(['low_price']).set_index(['low_price']).reindex(index=np.arange(df1['low_price'].min(), df1['low_price'].max() + 1), method='pad')
Out[36]:
volume_traded
low_price
10 58012
11 58012
12 58012
13 58012
14 58012
15 256565
16 256565
17 256565
18 256565
19 256565
20 256565
21 256565
22 256565
23 256565
24 256565
25 256565
26 256565
27 256565
28 256565
29 256565
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 547343
41 45645
Related
Remove the group of rows based on the condition of rows
I have a dataframe which has two columns, 'Group' and 'Sample_Number'. Each group has exactly one sample number '11', followed by sample numbers in the range 21 to 29 (for example 21, 22, 23, 24, 25, 26, 27, 28, 29) and sample numbers in the range 31 to 39 (for example 31, 32, 33, 34, 35, 36, 37, 38, 39). Hence each group should have one '11' sample number, at least one sample number in the range 21 to 29, and at least one sample number in the range 31 to 39. I wish to have my code go through each group and:

Check if there is a sample number 11 in the group or not.
Check if there is at least one sample number in the range 21 to 29.
Check if there is at least one sample number in the range 31 to 39.

If any of these three conditions is not met, the code should remove the entire group from the dataframe. Below is the dataframe in table format:

Group  Sample_Number
Z007   11
Z007   21
Z007   22
Z007   23
Z007   31
Z007   32
Z008   11
Z008   31
Z008   32
Z008   33
Z009   11
Z009   21
Z009   22
Z009   23
Z010   21
Z010   22
Z010   23
Z010   24
Z010   31
Z010   32
Z010   33
Z010   34

df = pd.DataFrame([['Z007', 11], ['Z007', 21], ['Z007', 22], ['Z007', 23], ['Z007', 31], ['Z007', 32],
                   ['Z008', 11], ['Z008', 31], ['Z008', 32], ['Z008', 33],
                   ['Z009', 11], ['Z009', 21], ['Z009', 22], ['Z009', 23],
                   ['Z010', 21], ['Z010', 22], ['Z010', 23], ['Z010', 24],
                   ['Z010', 31], ['Z010', 32], ['Z010', 33], ['Z010', 34]],
                  columns=['Group', 'Sample_Number'])

The code should remove group 'Z008' as it has no sample number in the range 21 to 29, group 'Z009' as it has no sample number in the range 31 to 39, and group 'Z010' as it does not have sample number '11'. The expected answer is below:

Group  Sample_Number
Z007   11
Z007   21
Z007   22
Z007   23
Z007   31
Z007   32

I could do it only for sample number 11 but am struggling to do the same for the other sample numbers in the ranges 21 to 29 and 31 to 39. Below is my code for sample number 11:

invalid_group_no = [i for i in df['Group'].unique()
                    if df[df['Group'] == i]['Sample_Number'].to_list().count(11) != 1]

Can anyone please help me with the other sample numbers? Please feel free to implement your own approach. Any help is appreciated.
Try this:

groups = (set(df['Group'][df['Sample_Number'] == 11])
          & set(df['Group'][df['Sample_Number'].isin(range(21, 30))])
          & set(df['Group'][df['Sample_Number'].isin(range(31, 40))]))
df = df[df['Group'].isin(groups)]

  Group  Sample_Number
0  Z007             11
1  Z007             21
2  Z007             22
3  Z007             23
4  Z007             31
5  Z007             32
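An alternative sketch (same column names as in the question) using groupby().filter, which keeps only the groups whose rows satisfy all three conditions:

kept = df.groupby('Group').filter(
    lambda g: (g['Sample_Number'] == 11).any()
              and g['Sample_Number'].between(21, 29).any()
              and g['Sample_Number'].between(31, 39).any())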
Pandas dataframe Plotly line chart with two lines
I have a pandas dataframe as below and I would like to produce a few charts with the data. The 'Acc' column holds the names of the accounts, the 'User' column is the number of users under each account, and the month columns are the login counts of each account in every month.

Acc      User  Jan  Feb  Mar  Apr  May  June
Nora       39    5   13   16   22   14    20
Bianca     53   14   31   22   21   20    29
Anna       65   30   17   18   28   12    13
Katie      46    9   12   30   34   25    15
Melissa    29   29   12   30   10    4     9

1st: I would like to monitor the trend of logins from January to May. One line illustrates Bianca's logins and the other line illustrates everyone else's logins.

2nd: I would like to monitor the percentage change of logins from January to May. One line illustrates Bianca's login percentage change and the other line illustrates everyone else's login percentage change.

Thank you for your time and assistance. I'm a beginner at this and I appreciate any help!
I suggest the best approach to grouping is to use categoricals. pct_change is not a direct aggregate function, so it's a bit more involved to get it.

import io

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv(io.StringIO("""Acc User Jan Feb Mar Apr May June
Nora 39 5 13 16 22 14 20
Bianca 53 14 31 22 21 20 29
Anna 65 30 17 18 28 12 13
Katie 46 9 12 30 34 25 15
Melissa 29 29 12 30 10 4 9"""), sep=r"\s+")

# Just set up 2 plot areas.
fig, ax = plt.subplots(1, 2, figsize=[20, 5])

# We want to divide the data into 2 groups.
df["grp"] = pd.Categorical(df["Acc"], ["Bianca", "Others"])
df["grp"] = df["grp"].fillna("Others")

# Just get the user counts out of the way...
df.drop(columns="User", inplace=True)

# Simple plot where an aggregate function exists directly; transpose to get one line per group.
df.groupby("grp").sum(numeric_only=True).T.plot(ax=ax[0])

# A bit more sophisticated to get pct change...
df.groupby("grp").sum(numeric_only=True).T.assign(
    Bianca=lambda x: x["Bianca"].pct_change().fillna(0) * 100,
    Others=lambda x: x["Others"].pct_change().fillna(0) * 100
).plot(ax=ax[1])

Output: two line charts, total logins on the left and percentage change on the right.
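Since the question mentions Plotly, here is a minimal sketch of the first chart with plotly.express, assuming the df with the 'grp' column built above:

import plotly.express as px

totals = df.groupby("grp").sum(numeric_only=True).T.reset_index()
totals = totals.rename(columns={"index": "month"})
long = totals.melt(id_vars="month", var_name="account", value_name="logins")

fig = px.line(long, x="month", y="logins", color="account")
fig.show()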
Calculating average/standard deviations of rows containing certain string in pandas dataframe
I have a large pandas dataframe read in as table. I would like to calculate the means and standard deviations of the two different groups, CRPS and CONTROLS, so I can plot them in a bar plot with the std deviations as error bars.

I can get the mean calculated over the Age column alone. I figured it's a for loop that I have to construct, but I don't know how to construct anything further than table["Age"].mean(), which just gives me the average of all data points' age values. This is where I need some guidance: I want to look in the group column and calculate the average and standard deviation of the ages for each group, so an average and standard deviation value for the ages of the CRPS group, for example.

The first 25 rows are shown below just to illustrate what the dataframe looks like. I have also imported numpy as np.

       Group  Age
0       CRPS   50
1       CRPS   59
2       CRPS   22
3       CRPS   48
4       CRPS   53
5       CRPS   48
6       CRPS   29
7       CRPS   44
8       CRPS   28
9       CRPS   42
10      CRPS   35
11  CONTROLS   54
12  CONTROLS   43
13      CRPS   50
14      CRPS   62
15  CONTROLS   64
16  CONTROLS   39
17      CRPS   40
18      CRPS   59
19      CRPS   46
20  CONTROLS   56
21      CRPS   21
22      CRPS   45
23  CONTROLS   41
24      CRPS   46
25  CONTROLS   35
I don't think you need a for-loop. Instead, you might try something like:

table.loc[table['Group'] == 'CRPS']['Age'].mean()

I haven't tested with your table, but I think that will work. The idea is to first create a boolean array, which is True for row indices where the group field contains 'CRPS', then to select all of those rows using loc, and finally to take the mean. (Boolean masks belong with loc rather than iloc, which expects integer positions.)

You could iterate over all of the groups in the following way:

mean_age = dict()
for group in set(table['Group']):
    mean_age[group] = table.loc[table['Group'] == group]['Age'].mean()

Maybe this is where you intended to use a for loop.
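A more idiomatic sketch (reusing the question's table variable): groupby computes both statistics for every group at once, and the result feeds directly into a bar plot with error bars:

import matplotlib.pyplot as plt

stats = table.groupby('Group')['Age'].agg(['mean', 'std'])

# One bar per group, with the std deviations as error bars.
stats['mean'].plot.bar(yerr=stats['std'])
plt.show()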
Pandas: Creating another column with row column multiplication
Priority  Expected  Actual
High            47      30
Medium          22      14
Required        16       5

I'm trying to create two other columns: 'Expected_values', which will have values like 47*5 for the High row, 22*3 for the Medium row, and 16*10 for the Required row; and 'Actual_values', which will have 30*5 for the High row, 14*3 for the Medium row, and 5*10 for the Required row, like this:

Priority  Expected  Actual  Expected_values  Actual_values
Required        16       5              160             50
High            47      30              235            150
Medium          22      14               66             42

Is there any simple way to do that in pandas or numpy?
Try:

import numpy as np

a = np.array([5, 3, 10])
df['Expected_values'] = df.Expected * a
df['Actual_values'] = df.Actual * a
print(df)

   Priority  Expected  Actual  Expected_values  Actual_values
0      High        47      30              235            150
1    Medium        22      14               66             42
2  Required        16       5              160             50
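A sketch of a slightly sturdier variant: map the multipliers by priority name instead of relying on row order (multiplier values as given in the question):

mult = df['Priority'].map({'High': 5, 'Medium': 3, 'Required': 10})
df['Expected_values'] = df['Expected'] * mult
df['Actual_values'] = df['Actual'] * mult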
Finding average value in a single data frame using pandas
import pandas as pd

l = []
url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"

for i in range(1981, 2018):
    df = pd.read_csv(url.format(i), sep=r'\|', skiprows=3, engine='python')
    l.append(df.loc[9])

print(pd.concat(l))

Region      CONUS
19810101       28
19810102       29
19810103       33
19810104       37
19810105       38
19810106       33
19810107       31
19810108       36
19810109       37
19810110       36
...
20171227       37
20171228       38
20171229       35
20171230       34
20171231       40
Name: 9, Length: 13551, dtype: object

This code gives the temperature from 1981 to 2017, and I am trying to find the average value for each month. pd.concat(l).mean() didn't work. Can anyone help me with this issue? Thank you!
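Since the concatenated object is a Series indexed by YYYYMMDD strings (plus a repeated 'Region' label from each file), one hedged sketch for the monthly means, under exactly those assumptions:

s = pd.concat(l)
s = s[s.index != 'Region']                     # drop the repeated label rows
s.index = pd.to_datetime(s.index, format='%Y%m%d')
s = s.astype(float)

# One mean per (year, month); group by s.index.month alone to average across years.
monthly = s.groupby([s.index.year, s.index.month]).mean()
print(monthly)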