Efficient way to iterate through a large dataframe - python
I have a csv file that contains several thousand records of company stock data. It contains the following integer fields:
low_price, high_price, volume_traded
10, 20, 45667
15, 22, 256565
41, 47, 45645
30, 39, 547343
My requirement is to create a new csv file from this data by accumulating the volume_traded at every price level (from low to high). The final result would just be two columns as follows:
price, total_volume_traded
10, 45667
11, 45667
12, 45667
....
....
15, 302232
etc
In other words the final csv contains one record for every price level (not just the high/low but also the prices in-between), along with the total amount of volume_traded at that price level.
I have this working, however it is terribly slow and inefficient. I'm sure there must be better ways of accomplishing this.
Basically what I've done is use nested loops:
First iterate through each row.
On each row, create a nested loop to iterate through the price range from low_price to high_price.
Check if the price already exists in the new dataframe; if so, add the current volume_traded to it. If it doesn't exist, append the price and volume (i.e. create a new row).
Below is some of the relevant code. I would be grateful if anyone could advise a better way of doing this in terms of efficiency/speed:
df_existing = # dataframe created from existing csv
df_new = # dataframe for new Price/Volume values

for index, row in df_existing.iterrows():
    price = row['low_price']
    for i in range(row['low_price'], row['high_price'] + 1):
        volume = row['volume_traded']
        df_new = accumulate_volume(df_new, price, volume)
        price += 1
def accumulate_volume(df_new, price, volume):
    # If price level already exists, add volume to existing
    if df_new['Price'].loc[df_new['Price'] == price].count() > 0:
        df_new['Volume'].loc[df_new['Price'] == price] += volume
        return df_new
    else:
        # First occurrence of price level, add new row
        tmp = {'Price': int(price), 'Volume': volume}
        return df_new.append(tmp, ignore_index=True)
#once the above finishes, df_new is written to the new csv file
My guess for why this is so slow is at least partly because 'append' creates a new object every time it's called, and it gets called a LOT. In total, the nested loop from the above code gets run 1595653 times.
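For reference, a minimal sketch (with made-up data) of why repeated append hurts: each call copies the entire frame into a new object, so n appends cost roughly O(n^2) in total, while building the frame once from a list of records is linear. DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; it appears here only to illustrate the pattern.

import pandas as pd

rows = [{'Price': p, 'Volume': p * 10} for p in range(1000)]

# Slow pattern: every append copies the whole frame.
df_slow = pd.DataFrame(columns=['Price', 'Volume'])
for r in rows:
    df_slow = df_slow.append(r, ignore_index=True)  # O(len(df_slow)) per call

# Fast pattern: accumulate plain Python objects, build the frame once.
df_fast = pd.DataFrame(rows)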
I would be very grateful for any assistance.
Let's forget for a moment about potential issues with the methodology (think about how your results would look if 100k shares traded at a price of 50-51 and 100k traded at 50-59).
Below is a set of commented steps that should achieve your goal:
import pandas as pd

# Initialize DataFrame.
df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

# Initialize a price dictionary spanning the range of potential prices.
d = {price: 0 for price in range(min(df.low), max(df.high) + 1)}

# Create a helper function to add volume to a given price bucket.
def add_volume(price_dict, price, volume):
    price_dict[price] += volume

# Use a nested list comprehension to call the function and populate the dictionary.
_ = [[add_volume(d, price, volume) for price in range(low, high + 1)]
     for low, high, volume in zip(df.low, df.high, df.volume)]

# Convert the dictionary to a DataFrame and output to csv.
idx = pd.Index(list(d.keys()), name='price')
df = pd.DataFrame(list(d.values()), index=idx, columns=['total_volume_traded'])
df.to_csv('output.csv')
>>> df
total_volume_traded
price
10 45667
11 45667
12 45667
13 45667
14 45667
15 302232
16 302232
17 302232
18 302232
19 302232
20 302232
21 256565
22 256565
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 0
41 45645
42 45645
43 45645
44 45645
45 45645
46 45645
47 45645
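For what it's worth, the same bucketing can be done with no Python-level loop at all. Below is a minimal sketch (same toy frame as above) of a difference-array approach: add each row's volume at its low price, subtract it just past its high price, and take a cumulative sum. np.add.at is used because a plain fancy-indexed += would drop repeated indices.

import numpy as np
import pandas as pd

df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

lo, hi = df.low.min(), df.high.max()
diff = np.zeros(hi - lo + 2, dtype=np.int64)

# Volume enters at low and leaves just past high.
np.add.at(diff, df.low.to_numpy() - lo, df.volume.to_numpy())
np.add.at(diff, df.high.to_numpy() - lo + 1, -df.volume.to_numpy())

result = pd.DataFrame({'price': np.arange(lo, hi + 1),
                       'total_volume_traded': diff[:-1].cumsum()})
result.to_csv('output.csv', index=False)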
I would first groupby the 'low_price' column, then sum the volume_traded and reset the index. This effectively accumulates all the prices of interest. Then sort by price, which makes the prices monotonic so that we can use them as the index. After setting the index we can call reindex with a new index and fill the missing values using method='pad':
In [33]:
temp="""low_price,high_price,volume_traded
10,20,45667
15,22,256565
41,47,45645
10,20,12345
30,39,547343"""
df = pd.read_csv(io.StringIO(temp))
df
Out[33]:
low_price high_price volume_traded
0 10 20 45667
1 15 22 256565
2 41 47 45645
3 10 20 12345
4 30 39 547343
In [34]:
df1 = df.groupby('low_price')['volume_traded'].sum().reset_index()
df1
Out[34]:
low_price volume_traded
0 10 58012
1 15 256565
2 30 547343
3 41 45645
In [36]:
df1.sort_values(['low_price']).set_index(['low_price']).reindex(index=np.arange(df1['low_price'].min(), df1['low_price'].max() + 1), method='pad')
Out[36]:
volume_traded
low_price
10 58012
11 58012
12 58012
13 58012
14 58012
15 256565
16 256565
17 256565
18 256565
19 256565
20 256565
21 256565
22 256565
23 256565
24 256565
25 256565
26 256565
27 256565
28 256565
29 256565
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 547343
41 45645
Related
Remove the group of rows based on the condition of rows
I have a dataframe which has two columns, 'Group' and 'Sample_Number'. Each group has exactly one sample number '11', followed by sample numbers in the range 21 to 29 (for example 21, 22, 23, 24, 25, 26, 27, 28, 29) and sample numbers in the range 31 to 39 (for example 31, 32, 33, 34, 35, 36, 37, 38, 39). Hence each group should have one '11' sample number, at least one sample number in the range 21 to 29, and at least one sample number in the range 31 to 39. I wish to have my code go through each group and:

Check if there is a sample number 11 in the group or not.
Check if there is at least one sample number in the range 21 to 29.
Check if there is at least one sample number in the range 31 to 39.

If any of these three conditions is not met, the code should remove the entire group from the dataframe. Below is the dataframe in table format:

Group  Sample_Number
Z007   11
Z007   21
Z007   22
Z007   23
Z007   31
Z007   32
Z008   11
Z008   31
Z008   32
Z008   33
Z009   11
Z009   21
Z009   22
Z009   23
Z010   21
Z010   22
Z010   23
Z010   24
Z010   31
Z010   32
Z010   33
Z010   34

df = pd.DataFrame([['Z007', 11], ['Z007', 21], ['Z007', 22], ['Z007', 23], ['Z007', 31], ['Z007', 32],
                   ['Z008', 11], ['Z008', 31], ['Z008', 32], ['Z008', 33],
                   ['Z009', 11], ['Z009', 21], ['Z009', 22], ['Z009', 23],
                   ['Z010', 21], ['Z010', 22], ['Z010', 23], ['Z010', 24],
                   ['Z010', 31], ['Z010', 32], ['Z010', 33], ['Z010', 34]],
                  columns=['Group', 'Sample_Number'])

The code should remove group 'Z008' as it has no sample number in the range 21 to 29, group 'Z009' as it has no sample number in the range 31 to 39, and group 'Z010' as it does not have sample number '11'. The expected answer is below:

Group  Sample_Number
Z007   11
Z007   21
Z007   22
Z007   23
Z007   31
Z007   32

I could do it only for sample number 11 but am struggling to do the same for the other sample numbers in the ranges 21 to 29 and 31 to 39. Below is my code for sample number 11:

invalid_group_no = [i for i in df['Group'].unique()
                    if df[df['Group'] == i]['Sample_Number'].to_list().count(11) != 1]

Can anyone please help me with the other sample numbers? Please feel free to implement your own approach. Any help is appreciated.
Try this:

groups = (set(df['Group'][df['Sample_Number'] == 11])
          & set(df['Group'][df['Sample_Number'].isin(range(21, 30))])
          & set(df['Group'][df['Sample_Number'].isin(range(31, 40))]))
df = df[df['Group'].isin(groups)]

  Group  Sample_Number
0  Z007             11
1  Z007             21
2  Z007             22
3  Z007             23
4  Z007             31
5  Z007             32
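An alternative sketch (same column names as in the question) using groupby().filter, which keeps only the groups whose rows satisfy all three conditions:

kept = df.groupby('Group').filter(
    lambda g: (g['Sample_Number'] == 11).any()
              and g['Sample_Number'].between(21, 29).any()
              and g['Sample_Number'].between(31, 39).any())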
Pandas dataframe Plotly line chart with two lines
I have a pandas dataframe as below and I would like to produce a few charts with the data. The 'Acc' column holds the names of the accounts, the 'User' column is the number of users under each account, and the month columns are the login counts of each account in every month.

Acc      User  Jan  Feb  Mar  Apr  May  June
Nora       39    5   13   16   22   14    20
Bianca     53   14   31   22   21   20    29
Anna       65   30   17   18   28   12    13
Katie      46    9   12   30   34   25    15
Melissa    29   29   12   30   10    4     9

1st: I would like to monitor the trend of logins from January to May. One line illustrates Bianca's logins and the other line illustrates everyone else's logins.

2nd: I would like to monitor the percentage change of logins from January to May. One line illustrates Bianca's login percentage change and the other line illustrates everyone else's login percentage change.

Thank you for your time and assistance. I'm a beginner at this and I appreciate any help!
I suggest the best approach to grouping is to use categoricals. pct_change is not a direct aggregate function, so it's a bit more involved to get it.

import io

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv(io.StringIO("""Acc User Jan Feb Mar Apr May June
Nora 39 5 13 16 22 14 20
Bianca 53 14 31 22 21 20 29
Anna 65 30 17 18 28 12 13
Katie 46 9 12 30 34 25 15
Melissa 29 29 12 30 10 4 9"""), sep=r"\s+")

# Just set up 2 plot areas.
fig, ax = plt.subplots(1, 2, figsize=[20, 5])

# We want to divide the data into 2 groups.
df["grp"] = pd.Categorical(df["Acc"], ["Bianca", "Others"])
df["grp"] = df["grp"].fillna("Others")

# Just get the user counts out of the way...
df.drop(columns="User", inplace=True)

# Simple plot where an aggregate function exists directly; transpose to get one line per group.
df.groupby("grp").sum(numeric_only=True).T.plot(ax=ax[0])

# A bit more sophisticated to get pct change...
df.groupby("grp").sum(numeric_only=True).T.assign(
    Bianca=lambda x: x["Bianca"].pct_change().fillna(0) * 100,
    Others=lambda x: x["Others"].pct_change().fillna(0) * 100
).plot(ax=ax[1])

Output: two line charts, total logins on the left and percentage change on the right.
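Since the question mentions Plotly, here is a minimal sketch of the first chart with plotly.express, assuming the df with the 'grp' column built above:

import plotly.express as px

totals = df.groupby("grp").sum(numeric_only=True).T.reset_index()
totals = totals.rename(columns={"index": "month"})
long = totals.melt(id_vars="month", var_name="account", value_name="logins")

fig = px.line(long, x="month", y="logins", color="account")
fig.show()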
Calculating average/standard deviations of rows containing certain string in pandas dataframe
I have a large pandas dataframe read in as table. I would like to calculate the means and standard deviations of the two different groups, CRPS and CONTROLS, so I can plot them in a bar plot with the std deviations as error bars.

I can get the mean calculated over the Age column alone. I figured it's a for loop that I have to construct, but I don't know how to construct anything further than table["Age"].mean(), which just gives me the average of all data points' age values. This is where I need some guidance: I want to look in the group column and calculate the average and standard deviation of the ages for each group, so an average and standard deviation value for the ages of the CRPS group, for example.

The first 25 rows are shown below just to illustrate what the dataframe looks like. I have also imported numpy as np.

       Group  Age
0       CRPS   50
1       CRPS   59
2       CRPS   22
3       CRPS   48
4       CRPS   53
5       CRPS   48
6       CRPS   29
7       CRPS   44
8       CRPS   28
9       CRPS   42
10      CRPS   35
11  CONTROLS   54
12  CONTROLS   43
13      CRPS   50
14      CRPS   62
15  CONTROLS   64
16  CONTROLS   39
17      CRPS   40
18      CRPS   59
19      CRPS   46
20  CONTROLS   56
21      CRPS   21
22      CRPS   45
23  CONTROLS   41
24      CRPS   46
25  CONTROLS   35
I don't think you need a for-loop. Instead, you might try something like:

table.loc[table['Group'] == 'CRPS']['Age'].mean()

I haven't tested with your table, but I think that will work. The idea is to first create a boolean array, which is True for row indices where the group field contains 'CRPS', then to select all of those rows using loc, and finally to take the mean. (Boolean masks belong with loc rather than iloc, which expects integer positions.)

You could iterate over all of the groups in the following way:

mean_age = dict()
for group in set(table['Group']):
    mean_age[group] = table.loc[table['Group'] == group]['Age'].mean()

Maybe this is where you intended to use a for loop.
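A more idiomatic sketch (reusing the question's table variable): groupby computes both statistics for every group at once, and the result feeds directly into a bar plot with error bars:

import matplotlib.pyplot as plt

stats = table.groupby('Group')['Age'].agg(['mean', 'std'])

# One bar per group, with the std deviations as error bars.
stats['mean'].plot.bar(yerr=stats['std'])
plt.show()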
Pandas: Creating another column with row column multiplication
Priority  Expected  Actual
High            47      30
Medium          22      14
Required        16       5

I'm trying to create two other columns: 'Expected_values', which will have values like 47*5 for the High row, 22*3 for the Medium row, and 16*10 for the Required row; and 'Actual_values', which will have 30*5 for the High row, 14*3 for the Medium row, and 5*10 for the Required row, like this:

Priority  Expected  Actual  Expected_values  Actual_values
Required        16       5              160             50
High            47      30              235            150
Medium          22      14               66             42

Is there any simple way to do that in pandas or numpy?
Try:

import numpy as np

a = np.array([5, 3, 10])
df['Expected_values'] = df.Expected * a
df['Actual_values'] = df.Actual * a
print(df)

   Priority  Expected  Actual  Expected_values  Actual_values
0      High        47      30              235            150
1    Medium        22      14               66             42
2  Required        16       5              160             50
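A sketch of a slightly sturdier variant: map the multipliers by priority name instead of relying on row order (multiplier values as given in the question):

mult = df['Priority'].map({'High': 5, 'Medium': 3, 'Required': 10})
df['Expected_values'] = df['Expected'] * mult
df['Actual_values'] = df['Actual'] * mult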
Finding average value in a single data frame using pandas
import pandas as pd

l = []
url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"

for i in range(1981, 2018):
    df = pd.read_csv(url.format(i), sep=r'\|', skiprows=3, engine='python')
    l.append(df.loc[9])

print(pd.concat(l))

Region      CONUS
19810101       28
19810102       29
19810103       33
19810104       37
19810105       38
19810106       33
19810107       31
19810108       36
19810109       37
19810110       36
...
20171227       37
20171228       38
20171229       35
20171230       34
20171231       40
Name: 9, Length: 13551, dtype: object

This code gives the temperature from 1981 to 2017, and I am trying to find the average value for each month. pd.concat(l).mean() didn't work. Can anyone help me with this issue? Thank you!
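Since the concatenated object is a Series indexed by YYYYMMDD strings (plus a repeated 'Region' label from each file), one hedged sketch for the monthly means, under exactly those assumptions:

s = pd.concat(l)
s = s[s.index != 'Region']                     # drop the repeated label rows
s.index = pd.to_datetime(s.index, format='%Y%m%d')
s = s.astype(float)

# One mean per (year, month); group by s.index.month alone to average across years.
monthly = s.groupby([s.index.year, s.index.month]).mean()
print(monthly)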