Index by condition in python-numpy?
I'm migrating from Matlab to Python and rewriting some of my Matlab code in Python for testing. I've installed Anaconda and am currently using the Spyder IDE. In Matlab I created a function that returns the commercial API 5L pipe diameter (diametro) and thickness (espesor) values that are closest to the function's input parameters. I did this using a Matlab table.
Note that the input diameter (diametro_entrada) and thickness (espesor_entrada) are in meters [m], while the thicknesses inside the function are in millimeters [mm]; that's why I multiply espesor_entrada by 1000 at the end.
function tabla_seleccion=tablaAPI(diametro_entrada,espesor_entrada)
% Returns the API 5L pipe table; enter diameter in [m] and thickness
% in [m]
Diametro_m=[0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;...
0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;...
0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;...
0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;...
0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;...
0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;...
0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;...
0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;...
0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;...
0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813];
Espesor_mm=[4.8;5.2;5.3;5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;31.8;...
4.8;5.2;5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;...
4.8;5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;...
5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;33.3;34.9;...
5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;33.3;34.9;36.5;38.1;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;33.3;34.9;36.5;38.1;39.7;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8];
TablaAPI=table(Diametro_m,Espesor_mm);
tabla_seleccion=TablaAPI(abs(TablaAPI.Diametro_m-diametro_entrada)<0.05 & abs(TablaAPI.Espesor_mm-(espesor_entrada*1000))<1.2,:);
end
With the input diameter (d) and input thickness (e) I get the commercial pipes whose diameter differs by less than 0.05 m and whose thickness differs by less than 1.2 mm from the inputs.
I want to reproduce this in Python with NumPy or another package.
First I defined two NumPy arrays with the same names as in Matlab, comma-separated instead of semicolon-separated and without the "..." at the end of each line. Then I defined another NumPy array as:
TablaAPI=numpy.array([Diametro_m,Espesor_mm])
I want to know if I can index that array in some way like I did in Matlab, or whether I have to define something totally different.
Thanks a lot!
You sure can!
Here's an example of how you can use numpy:
Using Numpy
import numpy as np

# Declare your Diametro_m and Espesor_mm arrays here, just like in your example

# Stack the two 1-D arrays side by side as columns (shape (N, 2))
arr = np.column_stack((Diametro_m, Espesor_mm))

# Boolean mask selecting the rows within both tolerances
mask = ((np.abs(arr[:, 0] - diametro_entrada) < 0.05) &
        (np.abs(arr[:, 1] - espesor_entrada * 1000) < 1.2))
selection = arr[mask]
Example usage from John Zwinck's answer
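If you want the same call signature as the Matlab function, you could wrap the selection in a small function; a minimal sketch, assuming the arrays above are defined (the inputs in the example call are illustrative, not from the question):

def tabla_api(diametro_entrada, espesor_entrada):
    # Inputs in meters, as in the Matlab function; the thickness column is in mm
    mask = ((np.abs(arr[:, 0] - diametro_entrada) < 0.05) &
            (np.abs(arr[:, 1] - espesor_entrada * 1000) < 1.2))
    return arr[mask]

print(tabla_api(0.4064, 0.0064))  # e.g. 0.4064 m diameter, 6.4 mm thickness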
Using Dataframes
DataFrames may also be a good fit for your application if you need to do heavier queries or mix column datatypes. This code should work for you, if you choose that option:
# These imports go at the top of your document
import pandas as pd

# Declare your Diametro_m and Espesor_mm arrays here, just like in your example

# Build a two-column DataFrame from the arrays
df = pd.DataFrame({'Diametro_m': Diametro_m, 'Espesor_mm': Espesor_mm})

# Now you can perform your selection like this:
selection = df[((df['Diametro_m'] - diametro_entrada).abs() < 0.05)
               & ((df['Espesor_mm'] - espesor_entrada * 1000).abs() < 1.2)]

# This "mask" of the DataFrame returns all rows that satisfy your query.
print(selection)
Since you have not given an example of your expected output, it's a bit of a guess what you are really after, but here is one version with numpy.
# rewritten arrays for numpy
Diametro_m=[0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,
0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,
0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,
0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,
0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,
0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,
0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,
0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,
0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,
0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813]
Espesor_mm=[4.8,5.2,5.3,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,31.8,
4.8,5.2,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,
4.8,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,
5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,
5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,36.5,38.1,
6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,36.5,38.1,39.7,
6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,
6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,
6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,
6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8]
import numpy as np
diametro_entrada = 0.4
espesor_entrada = 5
Diametro_m = np.array(Diametro_m)
Espesor_mm = np.array(Espesor_mm)
# Diametro_m and Espesor_mm have shape (223,)
# if not, reshape them so that they do
table = np.array([Diametro_m, Espesor_mm]).T
mask = np.where((np.abs(Diametro_m - diametro_entrada) < 0.05) &
(np.abs(Espesor_mm - espesor_entrada) < 1.2)
)
result = table[mask]
print('with numpy')
print(result)
Or you can do it with just Python...
# redo with python only
# based on a simple dict and list comprehension
D_m = [0.3556, 0.4064, 0.4570, 0.5080, 0.559, 0.610, 0.660, 0.711, 0.762, 0.813]
E_mm = [[4.8,5.2,5.3,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,31.8],
[4.8,5.2,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8],
[4.8,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8],
[5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9],
[5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,36.5,38.1],
[6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,36.5,38.1,39.7],
[6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4],
[6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4],
[6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8],
[6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8]]
table2 = dict(zip(D_m, E_mm))
result2 = []
for D, E in table2.items():
    if abs(D - diametro_entrada) < 0.05:
        Et = [t for t in E if abs(t - espesor_entrada) < 1.2]
        result2 += [(D, t) for t in Et]
print('with vanilla python')
print('\n'.join((str(r) for r in result2)))
Once you are in Python there are endless ways to do this; you could easily do the same with pandas or sqlite. My personal preference leans towards as few dependencies as possible, so in this case I would go for a CSV file as input and then do it without numpy; if it were a truly large-scale problem I would consider sqlite/numpy/pandas.
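As a minimal sketch of that CSV route using only the standard library (the file name and column headers are assumptions, not from the question):

import csv

# Assumed file 'api5l.csv' with header line: diametro_m,espesor_mm
with open('api5l.csv', newline='') as f:
    rows = [(float(r['diametro_m']), float(r['espesor_mm']))
            for r in csv.DictReader(f)]

# Same tolerances as above; espesor_entrada is in mm, matching this answer's convention
result3 = [(d, e) for d, e in rows
           if abs(d - diametro_entrada) < 0.05 and abs(e - espesor_entrada) < 1.2]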
Good luck with the transition, I don't think you will regret it.
Related
Alternatives to pandas.query() when the column name is a Python keyword (import, sum, min etc)?
The pandas documentation for query() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) states that column names which are Python keywords (like "list", "for", "import", etc) cannot be used. This is why, in the toy example below, I cannot use pandas.query() to filter the columns called 'min' or 'sum'. My question is: what is an efficient and easy-to-read way to filter on multiple columns if I cannot use query()? E.g. to run the pandas equivalent of a SQL statement like:

select m.*
from df m
where (m.[min] > 0.5 and m.[min] < 0.9) or m.[sum] > 0.7

Using pandas syntax, this comes to mind:

out5 = df.loc[ ( (df['min'] > 0.5 ) & df['min'] < 0.9 ) | df['sum'] > 0.7 ]

which, well, works, but I find it very hard to read, and much more obscure than in most other languages, from SQL to R. I also don't understand why the first condition [ min > 0.5 ] must be in brackets or else it won't work.

I had thought of using the pandasql package, which loads dataframes into an in-memory SQLite database, runs a query there, then exports back to pandas, but development stopped about 5 years ago, it never reached version 1, and I am afraid that data types could be messed up (e.g. SQLite doesn't explicitly support dates or datetimes).

The toy example is this:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['a','min','sum','col with space'], data=np.random.rand(100,4))

# this works
out1 = df.query("a > 0.5")

# this doesn't: TypeError: '>' not supported between instances of 'function' and 'float'
out2 = df.query(" min > 0.5 ")

# doesn't work:
# NumExprClobberingError: Variables in expression "(sum) > (0.5)" overlap with builtins: ('sum')
out3 = df.query(" sum > 0.5 ")

# this works
out4 = df.query(" `col with space` > 0.5 ")
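As to why the first condition needs brackets: in pandas boolean indexing, the bitwise operators & and | bind more tightly than comparisons such as > and <, so each comparison has to be parenthesized before the conditions are combined. A fully parenthesized version of the .loc expression above (same intent, explicit grouping):

# & and | bind tighter than > and <, hence the parentheses around each comparison
out5 = df.loc[((df['min'] > 0.5) & (df['min'] < 0.9)) | (df['sum'] > 0.7)]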
I prefer duckdb:

import duckdb

df = duckdb.query('''/%SQL Code%/''').to_df()
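For the toy example above, that could look like the following sketch (duckdb can query a pandas DataFrame that is in scope by name, and double-quoted identifiers sidestep the keyword problem):

import duckdb
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['a', 'min', 'sum', 'col with space'],
                  data=np.random.rand(100, 4))

# duckdb resolves `df` from the surrounding Python scope;
# double quotes escape column names that are keywords/builtins or contain spaces
out = duckdb.query(
    'SELECT * FROM df WHERE ("min" > 0.5 AND "min" < 0.9) OR "sum" > 0.7'
).to_df()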
Optimizing Python Code: Faster groupby and for loops
I want to make the for loop given below faster in Python.

import math
import pandas as pd
import numpy as np
import scipy.stats

np.random.seed(1)

xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})

tempno = [np.random.randint(1,100,size=5)]
k = 1
p = pd.DataFrame(0,index=np.arange(len(xl)),columns=['temp','cv']).astype(object)

for ib in [xb for xb in range(0,len(xl))]:
    tempno1 = np.append(tempno,ib)
    temp = list(set(tempno1))
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')['ships_x','ships_y'].sum().reset_index()
    temptab['contri'] = temptab['ships_x']/temptab['ships_y']
    p.ix[k-1,'cv'] = 1 if math.isnan(scipy.stats.variation(temptab['contri'])) else scipy.stats.variation(temptab['contri'])
    p.ix[k-1,'temp'] = temp
    k = k+1

where xl and yl are the two data frames I am working on, with columns like Concat, ships_x and ships_y, and tempno is an initial list of indices of the xl dataframe, referring to a list of 'Concat' values. So, in the for loop we add one extra index to tempno in each iteration and then subset the 'yl' dataframe based on 'Concat' values matching those of the 'xl' dataframe. Then we find the "coefficient of variation" (taken from the scipy lib) and note it in the new dataframe 'p'.

The problem is that it is taking too much time, as the number of iterations of the for loop runs into the thousands, and the groupby line takes the most time. I have tried and made a few changes; the code now looks as below, with the changes mentioned in comments. There is a slight improvement but it doesn't solve my purpose. Please suggest the fastest way possible to implement this. Many thanks.

# Getting all tempno1 into a list in one step
tempno1 = [np.append(tempno,ib) for ib in [xb for xb in range(0,len(xl))]]
temp = [list(set(tempk)) for tempk in tempno1]

# Taking only the needed columns from the x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]

# Shortlisting the y df and groupby in two diff steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')['ships_x','ships_y'].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [pd.DataFrame(tempkcontri[i],columns=['contri']) for i in range(0,len(tempkcontri))]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(0,len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
Tracking Error on a number of benchmarks
I'm trying to calculate tracking error for a number of different benchmarks versus a fund that I'm looking at (tracking error is defined as the standard deviation of the percent difference between the fund and benchmark). The time series for the fund and all the benchmarks are in a data frame that I'm reading from an Excel file, and what I have so far is this (with the idea that arg1 represents all the benchmarks and is then applied using applymap), but it's returning a KeyError; any suggestions?

import pandas as pd
import numpy as np

data = pd.read_excel('File_Path.xlsx')

def index_analytics(arg1):
    tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
    return tracking_err

data.applymap(index_analytics)
There are a few things that need fixing. First, applymap passes each individual value from all the columns to your calling function (index_analytics). So arg1 is an individual scalar value from your dataframe, and data[arg1] is always going to raise a KeyError unless all your values are also column names.

You also shouldn't need to use apply to do this. Assuming your benchmarks are in the same dataframe, you should be able to do something like this for each benchmark (next time, include a sample of your dataframe):

df['Benchmark1_result'] = (df['Fund'] - data['Benchmark1']) / data['Benchmark1']

And if you want to calculate all the standard deviations for all the benchmarks at once, you can do this:

# assume benchmark_columns lists all the benchmark column names
benchmark_columns = ['Benchmark1', 'Benchmark2']
# [:, None] lets the Fund column broadcast against each benchmark column;
# axis=0 takes the standard deviation over time, one value per benchmark
np.std((df['Fund'].values[:, None] - df[benchmark_columns].values)
       / df[benchmark_columns].values, axis=0)
Assuming you're following the definition of Tracking Error below:

import pandas as pd
import numpy as np

# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())

list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val**2
    temp_.append(x)

tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")

Or if you want it more compact (because apparently the cool kids do it):

df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")
How to convert an array of dates (format 'mm/dd/yy HH:MM:SS') to numerics?
I recently (1 week ago) decided to migrate my work from Matlab to Python. Since I am used to Matlab, I sometimes find it difficult to get the exact equivalent of what I want to do in Python.

Here's my problem: I have a set of csv files that I want to process. So far, I have succeeded in loading them into groups. Each column has a size of more than 600000 x 1. One of the columns in the csv file is the time, which has the format 'mm/dd/yy HH:MM:SS'. I want to convert the time column to a number and I am using date2num from matplotlib for that. Is there a 'matrix' way of doing it? The command in Matlab for doing that is datenum(time, 'mm/dd/yyyy HH:MM:SS'), where time is a 600000 x 1 matrix. Thanks.

Here is an example of the code that I am talking about:

import csv
import time
import datetime
from datetime import date
from matplotlib.dates import date2num

time = []
otherColumns = []
for d in csv.DictReader(open('MyFile.csv')):
    time.append(str(d['time']))
    otherColumns.append(float(d['otherColumns']))

timeNumeric = date2num(datetime.datetime.strptime(time, "%d/%m/%y %H:%M:%S"))
You could use a generator:

def pre_process(dict_sequence):
    for d in dict_sequence:
        d['time'] = date2num(datetime.datetime.strptime(d['time'], "%d/%m/%y %H:%M:%S"))
        yield d

Now you can process your csv:

for d in pre_process(csv.DictReader(open('MyFile.csv'))):
    process(d)

The advantage of this solution is that it doesn't copy sequences that are potentially large.

Edit:

So you want the contents of the file in a numpy array?

reader = csv.DictReader(open('MyFile.csv'))
# you might want to get rid of the intermediate list if the file is really big
data = numpy.array(list(d.values() for d in pre_process(reader)))

Now you have a nice big array that allows all kinds of operations. You want only the first column to get your 600000x1 matrix:

data[:,0]  # assuming time is the first column
The closest thing in Python to Matlab's matrix/vector operations is a list comprehension. If you would like to apply a Python function to each item in a list you could do:

new_list = [date2num(data) for data in old_list]

or

new_list = map(date2num, old_list)
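One caveat worth adding to this answer: on Python 3, map returns a lazy iterator rather than a list, so you may need to wrap it:

new_list = list(map(date2num, old_list))  # Python 3: map() is lazy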
What is the most efficient way to loop through dataframes with pandas?
I want to perform my own complex operations on financial data in dataframes in a sequential manner. For example I am using the following MSFT CSV file taken from Yahoo Finance:

Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27
....

I then do the following:

#!/usr/bin/env python
from pandas import *

df = read_csv('table.csv')

for i, row in enumerate(df.values):
    date = df.index[i]
    open, high, low, close, adjclose = row
    # now perform analysis on open/close based on date, etc..

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.
The newest versions of pandas now include a built-in function for iterating over rows:

for index, row in df.iterrows():
    # do some logic here

Or, if you want it faster, use itertuples(). But unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
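For reference, a minimal itertuples sketch (each yielded element is a namedtuple; the Close column name assumes the MSFT file from the question):

for row in df.itertuples():
    # row.Index holds the index label; columns are attributes of the namedtuple
    print(row.Index, row.Close)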
Pandas is based on NumPy arrays. The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if close is a 1-d array and you want the day-over-day percent change,

pct_change = close[1:]/close[:-1]

This computes the entire array of percent changes as one statement, instead of

pct_change = []
for row in close:
    pct_change.append(...)

So try to avoid the Python loop for i, row in enumerate(...) entirely, and think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.
Like what has been mentioned before, a pandas object is most efficient when you process the whole array at once. However, for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I have done a short test to see which one of the three is the least time consuming.

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time()-A)

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(time.time()-A)

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time()-A)

print(B)

Result:

[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]

This is probably not the best way to measure the time consumption but it's quick for me. Here are some pros and cons IMHO:

.iterrows(): returns index and row items in separate variables, but significantly slower
.itertuples(): faster than .iterrows(), but returns index together with row items; ir[0] is the index
zip: quickest, but no access to the index of the row

EDIT 2020/11/10

For what it is worth, here is an updated benchmark with some other alternatives (perf with MacBook Pro 2,4 GHz Intel Core i9, 8 cores, 32 GB 2667 MHz DDR4):

import sys
import tqdm
import time
import pandas as pd

B = []
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
for _ in tqdm.tqdm(range(10)):
    C = []
    A = time.time()
    for i,r in t.iterrows():
        C.append((r['a'], r['b']))
    B.append({"method": "iterrows", "time": time.time()-A})

    C = []
    A = time.time()
    for ir in t.itertuples():
        C.append((ir[1], ir[2]))
    B.append({"method": "itertuples", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(t['a'], t['b']):
        C.append((r[0], r[1]))
    B.append({"method": "zip", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(*t.to_dict("list").values()):
        C.append((r[0], r[1]))
    B.append({"method": "zip + to_dict('list')", "time": time.time()-A})

    C = []
    A = time.time()
    for r in t.to_dict("records"):
        C.append((r["a"], r["b"]))
    B.append({"method": "to_dict('records')", "time": time.time()-A})

    A = time.time()
    t.agg(tuple, axis=1).tolist()
    B.append({"method": "agg", "time": time.time()-A})

    A = time.time()
    t.apply(tuple, axis=1).tolist()
    B.append({"method": "apply", "time": time.time()-A})

print(f'Python {sys.version} on {sys.platform}')
print(f"Pandas version {pd.__version__}")
print(
    pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean")
)

## Output

Python 3.7.9 (default, Oct 13 2020, 10:58:24)
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Pandas version 1.1.4

                           mean       std
method
zip + to_dict('list')  0.002353  0.000168
zip                    0.003381  0.000250
itertuples             0.007659  0.000728
to_dict('records')     0.025838  0.001458
agg                    0.066391  0.007044
apply                  0.067753  0.006997
iterrows               0.647215  0.019600
You can loop through the rows by transposing and then calling iteritems:

for date, row in df.T.iteritems():
    # do some logic here

I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:

def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)
    for i from 0 <= i < n:
        foo = close[i] - open[i]  # will be extremely fast

I would recommend writing the algorithm in pure Python first, making sure it works and seeing how fast it is; if it's not fast enough, convert things to Cython like this with minimal work to get something that's about as fast as hand-coded C/C++.
You have three options:

By index (simplest):

>>> for index in df.index:
...     print ("df[" + str(index) + "]['B']=" + str(df['B'][index]))

With iterrows (most used):

>>> for index, row in df.iterrows():
...     print ("df[" + str(index) + "]['B']=" + str(row['B']))

With itertuples (fastest):

>>> for row in df.itertuples():
...     print ("df[" + str(row.Index) + "]['B']=" + str(row.B))

All three options display something like:

df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12

Source: alphons.io
I checked out iterrows after noticing Nick Crawford's answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1...) tuples. There's also iterkv, which iterates through (column, series) tuples.
Just as a small addition, you can also do an apply if you have a complex function that you apply to a single column:

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html

df['b'] = df['a'].apply(my_function)  # my_function is your own function of one value
As #joris pointed out, iterrows is much slower than itertuples: itertuples is approximately 100 times faster. I tested the speed of both methods on a DataFrame with 5 million records; the result is 1200 it/s for iterrows and 120000 it/s for itertuples.

If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column you can refer to the following example code:

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row.col1, row.col2)
...
1 0.1
2 0.2
For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray, either via df.values (as you do) or by accessing each column separately, df.column_name.values. Since you want to have access to the index too, you can use df.index.values for that.

index = df.index.values
column_of_interest1 = df.column_name1.values
...
column_of_interestk = df.column_namek.values

for i in range(df.shape[0]):
    index_value = index[i]
    ...
    column_value_k = column_of_interest_k[i]

Not pythonic? Sure. But fast.

If you want to squeeze more juice out of the loop you will want to look into cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance check memory views for cython.
Another suggestion would be to combine groupby with vectorized calculations if subsets of the rows share characteristics that allow you to do so, as sketched below.
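For instance, a small sketch of that idea using the MSFT file from the question (grouping by year is just an illustration):

import pandas as pd

df = pd.read_csv('table.csv', parse_dates=['Date'])
df['Range'] = df['High'] - df['Low']  # vectorized across all rows at once
# one vectorized aggregate per group instead of a Python-level loop over rows
yearly_range = df.groupby(df['Date'].dt.year)['Range'].mean()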
Look at the last one:

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(round(time.time()-A,5))

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(round(time.time()-A,5))

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(round(time.time()-A,5))

C = []
A = time.time()
for r in range(len(t)):
    C.append((t.loc[r, 'a'], t.loc[r, 'b']))
B.append(round(time.time()-A,5))

C = []
A = time.time()
[C.append((x,y)) for x,y in zip(t['a'], t['b'])]
B.append(round(time.time()-A,5))

B

0.46424
0.00505
0.00245
0.09879
0.00209
I believe the simplest and most efficient way to loop through DataFrames is using numpy and numba. In that case, looping can be approximately as fast as vectorized operations in many cases. If numba is not an option, plain numpy is likely to be the next best option. As has been noted many times, your default should be vectorization, but this answer merely considers efficient looping, given the decision to loop, for whatever reason.

For a test case, let's use the example from #DSM's answer of calculating a percentage change. This is a very simple situation, and as a practical matter you would not write a loop to calculate it, but as such it provides a reasonable baseline for timing vectorized approaches vs loops.

Let's set up the 4 approaches with a small DataFrame, and we'll time them on a larger dataset below.

import pandas as pd
import numpy as np
import numba as nb

df = pd.DataFrame({'close': [100, 105, 95, 105]})

pandas_vectorized = df.close.pct_change()[1:]

x = df.close.to_numpy()
numpy_vectorized = (x[1:] - x[:-1]) / x[:-1]

def test_numpy(x):
    pct_chng = np.zeros(len(x))
    for i in range(1, len(x)):
        pct_chng[i] = (x[i] - x[i-1]) / x[i-1]
    return pct_chng

numpy_loop = test_numpy(df.close.to_numpy())[1:]

@nb.jit(nopython=True)
def test_numba(x):
    pct_chng = np.zeros(len(x))
    for i in range(1, len(x)):
        pct_chng[i] = (x[i] - x[i-1]) / x[i-1]
    return pct_chng

numba_loop = test_numba(df.close.to_numpy())[1:]

And here are the timings on a DataFrame with 100,000 rows (timings performed with Jupyter's %timeit function, collapsed to a summary table for readability):

pandas/vectorized    1,130 microseconds
numpy/vectorized       382 microseconds
numpy/looped        72,800 microseconds
numba/looped           455 microseconds

Summary: for simple cases, like this one, you would go with (vectorized) pandas for simplicity and readability, and (vectorized) numpy for speed. If you really need to use a loop, do it in numpy. If numba is available, combine it with numpy for additional speed. In this case, numpy + numba is almost as fast as vectorized numpy code.

Other details:

Not shown are various options like iterrows, itertuples, etc. which are orders of magnitude slower and really should never be used.

The timings here are fairly typical: numpy is faster than pandas and vectorized is faster than loops, but adding numba to numpy will often speed numpy up dramatically.

Everything except the pandas option requires converting the DataFrame column to a numpy array. That conversion is included in the timings.

The time to define/compile the numpy/numba functions was not included in the timings, but would generally be a negligible component of the timing for any large dataframe.