How to use pandas dataframes and numpy arrays in Rpy2? - python
I'd like to use pandas for all my analysis along with numpy, but use R's full plotting capabilities via rpy2 for plotting my data. I want to do all analyses with pandas dataframes and then plot them through rpy2; I'm using IPython. What's the correct way to do this?
Nearly all commands I try fail. For example:
I'm trying to plot a scatter between two columns of a pandas DataFrame df. I'd like the column labels of df to be used as the x/y axis labels, just as they would be if it were an R dataframe. Is there a way to do this? When I try it with r.plot, I get this gibberish plot:
In: r.plot(df.a, df.b) # df is pandas DataFrame
yields:
Out: rpy2.rinterface.NULL
resulting in the plot:
As you can see, the axis labels are messed up, and it's not reading them from the DataFrame like it should (the x axis should be column a of df and the y axis column b).
If I try to make a histogram with r.hist, it doesn't work at all, yielding the error:
In: r.hist(df.a)
Out:
...
vectors.pyc in <genexpr>((x,))
293 if l < 7:
294 s = '[' + \
--> 295 ', '.join((p_str(x, max_width = math.floor(52 / l)) for x in self[ : 8])) +\
296 ']'
297 else:
vectors.pyc in p_str(x, max_width)
287 res = x
288 else:
--> 289 res = "%s..." % (str(x[ : (max_width - 3)]))
290 return res
291
TypeError: slice indices must be integers or None or have an __index__ method
And resulting in this plot:
Any idea what the error means? And again here, the axes are all messed up and littered with gibberish data.
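[Editorial note: the TypeError in the traceback looks like a Python 2 quirk inside rpy2's repr code: math.floor(52 / l) returns a float under Python 2, and a float is not a valid slice bound. A minimal sketch of the same failure, with a plain list instead of an rpy2 vector:]

```python
# rpy2's vector repr computes max_width = math.floor(52 / l); under
# Python 2 math.floor returns a float, and using a float as a slice
# bound raises exactly the TypeError shown in the traceback above.
data = list(range(10))

try:
    data[: 52 / 7]  # 52 / 7 is a float, not a valid slice index
except TypeError as exc:
    message = str(exc)

print(message)
```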
EDIT: This error occurs only when using ipython. When I run the command from a script, it still produces the problematic plot, but at least runs with no errors. It must be something wrong with calling these commands from ipython.
I also tried to convert the pandas DataFrame df to an R DataFrame as recommended by the poster below, but that fails too with this error:
com.convert_to_r_dataframe(mydf) # mydf is a pandas DataFrame
----> 1 com.convert_to_r_dataframe(mydf)
in convert_to_r_dataframe(df, strings_as_factors)
275 # FIXME: This doesn't handle MultiIndex
276
--> 277 for column in df:
278 value = df[column]
279 value_type = value.dtype.type
TypeError: iteration over non-sequence
How can I get these basic plotting features to work on Pandas DataFrame (with labels of plots read from the labels of the Pandas DataFrame), and also get the conversion between a Pandas DF to an R DF to work?
EDIT2: Here is a complete example with a csv file "test.txt" (http://pastebin.ca/2311928) and my code, in answer to @dale's comment:
import rpy2
from rpy2.robjects import r
import rpy2.robjects.numpy2ri
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
from rpy2.robjects.lib import grid
from rpy2.robjects.lib import ggplot2
rpy2.robjects.numpy2ri.activate()
from numpy import *
import scipy
# load up pandas df
import pandas
data = pandas.read_table("./test.txt")
# plotting a column fails
print "data.c2: ", data.c2
r.plot(data.c2)
# Conversion and then plotting also fails
r_df = com.convert_to_r_dataframe(data)
r.plot(r_df)
The call to plot the column data.c2 fails, even though data.c2 is a column of a pandas df and therefore should, for all intents and purposes, behave like a numpy array. I use the activate() call, so I thought it would handle this column as a numpy array and plot it.
The second call to plot the dataframe data after conversion to an R dataframe also fails. Why is that? If I load up test.txt from R as a dataframe, I'm able to plot() it and since my dataframe was converted from pandas to R, it seems like it should work here too.
When I do try rmagic in ipython, it does not fire up a plot window for some reason, though it does not error. I.e. if I do:
In [12]: X = np.array([0,1,2,3,4])
In [13]: Y = np.array([3,5,4,6,7])
In [14]: import rpy2
In [15]: from rpy2.robjects import r
In [16]: import rpy2.robjects.numpy2ri
In [17]: import pandas.rpy.common as com
In [18]: from rpy2.robjects.packages import importr
In [19]: from rpy2.robjects.lib import grid
In [20]: from rpy2.robjects.lib import ggplot2
In [21]: rpy2.robjects.numpy2ri.activate()
In [22]: from numpy import *
In [23]: import scipy
In [24]: r.assign("x", X)
Out[24]:
<Array - Python:0x592ad88 / R:0x6110850>
[ 0, 1, 2, 3, 4]
In [25]: r.assign("y", Y)
Out[25]:
<Array - Python:0x592f5f0 / R:0x61109b8>
[ 3, 5, 4, 6, 7]
In [27]: %R plot(x,y)
There's no error, but no plot window either. In any case, I'd like to stick to rpy2 and not rely on rmagic if possible.
Thanks.
[note: Your code in "edit 2" is working here (Python 2.7, rpy2-2.3.2, R 2.15.2).]
As @dale mentions, whenever R objects are anonymous (that is, no R symbol exists for the object), R's deparse(substitute()) ends up returning the structure() of the R object. A possible fix is to specify the "xlab" and "ylab" parameters; for some plots you will also have to specify main (the title).
Another way to work around this is to use R formulas and feed them the data frame (more below, after we work out the conversion part).
Forget about what is in pandas.rpy. It is both broken and seems to ignore features available in rpy2.
An earlier quick fix to conversion with ipython can be turned into a proper conversion rather easily. I am considering adding one to the rpy2 codebase (with more bells and whistles), but in the meantime just add the following snippet after all your imports in your code examples. It will transparently convert pandas' DataFrame objects into rpy2's DataFrame whenever an R call is made.
from collections import OrderedDict
import rpy2.robjects as ro
from rpy2.robjects.vectors import ListVector

py2ri_orig = rpy2.robjects.conversion.py2ri

def conversion_pydataframe(obj):
    if isinstance(obj, pandas.core.frame.DataFrame):
        od = OrderedDict()
        for name, values in obj.iteritems():
            if values.dtype.kind == 'O':
                od[name] = rpy2.robjects.vectors.StrVector(values)
            else:
                od[name] = rpy2.robjects.conversion.py2ri(values)
        return rpy2.robjects.vectors.DataFrame(od)
    elif isinstance(obj, pandas.core.series.Series):
        # converted as a numpy array
        res = py2ri_orig(obj)
        # a pandas "index" is equivalent to "names" in R
        if obj.ndim == 1:
            res.names = ListVector({'x': ro.conversion.py2ri(obj.index)})
        else:
            res.dimnames = ListVector(ro.conversion.py2ri(obj.index))
        return res
    else:
        return py2ri_orig(obj)

rpy2.robjects.conversion.py2ri = conversion_pydataframe
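[Editorial note: the dtype dispatch in the converter can be sanity-checked without R at all. A minimal sketch of the same column classification, pandas only (modern pandas spells iteritems as items); column names c2/c3 are taken from the test.txt example:]

```python
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame({'c2': [1.0, 2.0, 3.0], 'c3': ['a', 'b', 'c']})

# Mirror the converter's branch: 'O' (object) columns would become a
# StrVector, everything else goes through the numeric py2ri path.
kinds = OrderedDict((name, values.dtype.kind) for name, values in df.items())

print(kinds)
```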
Now the following code will "just work":
r.plot(rpy2.robjects.Formula('c3~c2'), data)
# `data` is converted to an rpy2 data.frame on the fly,
# producing a scatter plot of c3 vs c2 (with "c2" and "c3"
# as the labels on the x and y axes).
I also note that you are importing ggplot2 without using it. With ggplot2 the conversion currently has to be requested explicitly. For example:
p = ggplot2.ggplot(rpy2.robjects.conversion.py2ri(data)) +\
ggplot2.geom_histogram(ggplot2.aes_string(x = 'c3'))
p.plot()
You need to pass in the labels explicitly when calling the r.plot function.
r.plot([1,2,3],[1,2,3], xlab="X", ylab="Y")
When you plot in R, it grabs the labels via deparse(substitute(x)), which essentially grabs the variable name from plot(testX, testY). When you pass in Python objects via rpy2, they are anonymous R objects, akin to the following in R:
> deparse(substitute(c(1,2,3)))
[1] "c(1, 2, 3)"
which is why you're getting the crazy labels.
A lot of times it's saner to use rpy2 to only push data back and forth.
r.assign('testX', df.A)
r.assign('testY', df.B)
%R plot(testX, testY)
rdf = com.convert_to_r_dataframe(df)
r.assign('bob', rdf)
%R plot(bob$A, bob$B)
http://nbviewer.ipython.org/4734581/
Use rpy. The conversion is part of pandas, so you don't need to do it yourself.
http://pandas.pydata.org/pandas-docs/dev/r_interface.html
In [1217]: from pandas import DataFrame
In [1218]: df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[7,8,9]},
......: index=["one", "two", "three"])
......:
In [1219]: r_dataframe = com.convert_to_r_dataframe(df)
In [1220]: print type(r_dataframe)
<class 'rpy2.robjects.vectors.DataFrame'>
Related
Convert MatchIt summary object into a pandas dataframe with rpy2
I am using R's MatchIt package but calling it from Python via the rpy2 package. On the R side MatchIt gives me a complex result object including raw data and some additional statistical information. One part of it is a matrix I want to transform into a data frame, which I can do in R code like this:

# R code
m.out <- matchit(....)
m.sum <- summary(m.out)
# The following two lines should somehow be "translated" into
# Python's rpy2
balance <- m.sum$sum.matched
balance <- as.data.frame(balance)

My problem is that I don't know how to implement the last two lines with Python's rpy2 package. I am able to get m.out and m.sum with rpy2. See this MWE please:

#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset

if __name__ == '__main__':
    # import
    robjects.packages.importr('MatchIt')

    # data
    p_df = pydataset.data('respiratory')
    p_df.treat = p_df.treat.replace({'P': 0, 'A': 1})

    # Convert pandas data into R data
    with robjects.conversion.localconverter(
            robjects.default_converter + pandas2ri.converter):
        r_df = robjects.conversion.py2rpy(p_df)

    # Call R's matchit with the R data object
    match_out = robjects.r['matchit'](
        formula=robjects.Formula('treat ~ age + sex'),
        data=r_df,
        method='nearest',
        distance='glm')

    # matched data
    match_data = robjects.r['match.data'](match_out)

    # Convert R data into pandas data
    with robjects.conversion.localconverter(
            robjects.default_converter + pandas2ri.converter):
        match_data = robjects.conversion.rpy2py(match_data)

    # summary object
    match_sum = robjects.r['summary'](match_out)
    # x = robjects.r('''
    #     balance <- match_sum$sum.matched
    #     balance <- as.data.frame(balance)
    #     balance
    # ''')

When inspecting the Python object match_sum I can't find anything like sum.matched in it. So I have to "translate" match_sum$sum.matched somehow with rpy2, but I don't know how.

An alternative solution would be to run everything as R code with robjects.r(''' # r code ... '''). But in that case I don't know how to bring a pandas data frame into that code.

EDIT: Be aware that the MWE presented here uses an outdated solution for converting R objects into Python objects and vice versa. Please see the answer below for a better one.
Ah, it is always the same phenomenon: while formulating the question, the answer jumps right into your face. My (maybe not the best) solution is: use real R code and run it with rpy2.robjects.r(). That R code needs to create an R function() to be able to receive a dataframe from the outside (the caller). Besides that solution, and based on another answer, I also modified the conversion between R and Python data frames in that code.

#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset

if __name__ == '__main__':
    # For converting objects from/into pandas <-> R
    # Credits: https://stackoverflow.com/a/20808449/4865723
    pandas2ri.activate()

    # import
    robjects.packages.importr('MatchIt')

    # data
    df = pydataset.data('respiratory')
    df.treat = df.treat.replace({'P': 0, 'A': 1})

    # match object
    match_out = robjects.r['matchit'](
        formula=robjects.Formula('treat ~ age + sex'),
        data=df,
        method='nearest',
        distance='glm')

    # matched data
    match_data = robjects.r['match.data'](match_out)
    match_data = robjects.conversion.rpy2py(match_data)

    # SOLUTION STARTS HERE:
    get_balance_dataframe = robjects.r('''f <- function(match_out) {
        as.data.frame(summary(match_out)$sum.matched)
    }
    ''')
    balance = get_balance_dataframe(match_out)
    balance = robjects.conversion.rpy2py(balance)

    print(type(balance))
    print(balance)

Here is the output:

<class 'pandas.core.frame.DataFrame'>
          Means Treated  Means Control  Std. Mean Diff.  Var. Ratio  eCDF Mean  eCDF Max  Std. Pair Dist.
distance       0.514630       0.472067         0.471744    0.512239   0.077104  0.203704         0.507222
age           32.888889      34.129630        -0.089355    1.071246   0.063738  0.203704         0.721511
sexF           0.111111       0.259259        -0.471405         NaN   0.148148  0.148148         0.471405
sexM           0.888889       0.740741         0.471405         NaN   0.148148  0.148148         0.471405

EDIT: Make sure there are no umlauts or other Unicode-problematic characters in the cell values or in the column and row names when you do this on Windows. From time to time a UnicodeDecodeError comes up. I wasn't able to reproduce it reliably, so I have no fresh bug report about it.
Adding tooltip to folium.features.GeoJson from a geopandas dataframe
I am having issues adding tooltips to my folium.features.GeoJson. I can't get columns to display from the dataframe when I select them.

feature = folium.features.GeoJson(
    df.geometry,
    name='Location',
    style_function=style_function,
    tooltip=folium.GeoJsonTooltip(fields=[df.acquired],
                                  aliases=["Time"],
                                  labels=True))
ax.add_child(feature)

For some reason when I run the code above it responds with:

Name: acquired, Length: 100, dtype: object is not available in the data. Choose from: ().

I can't seem to link the data to my tooltip.
I have made your code a MWE by including some data.

Two key issues with your code:
- you need to pass properties, not just geometry, to folium.features.GeoJson(). Hence df is passed instead of df.geometry
- folium.GeoJsonTooltip() takes a list of properties (columns), not an array of values. Hence ["acquired"] is passed instead of an array of values from a dataframe column

An implied issue with your code: all dataframe columns need to contain values that can be serialised to JSON. Hence the conversion of acquired to string and the drop().

import geopandas as gpd
import pandas as pd
import shapely.wkt
import io
import folium

df = pd.read_csv(io.StringIO("""ref;lanes;highway;maxspeed;length;name;geometry
A3015;2;primary;40 mph;40.68;Rydon Lane;MULTILINESTRING ((-3.4851169 50.70864409999999, -3.4849879 50.7090007), (-3.4857269 50.70693379999999, -3.4853034 50.7081574), (-3.488620899999999 50.70365289999999, -3.4857269 50.70693379999999), (-3.4853034 50.7081574, -3.4851434 50.70856839999999), (-3.4851434 50.70856839999999, -3.4851169 50.70864409999999))
A379;3;primary;50 mph;177.963;Rydon Lane;MULTILINESTRING ((-3.4763853 50.70886769999999, -3.4786112 50.70811229999999), (-3.4746017 50.70944449999999, -3.4763853 50.70886769999999), (-3.470350900000001 50.71041779999999, -3.471219399999999 50.71028909999998), (-3.465049699999999 50.712158, -3.470350900000001 50.71041779999999), (-3.481215600000001 50.70762499999999, -3.4813909 50.70760109999999), (-3.4934747 50.70059599999998, -3.4930204 50.7007898), (-3.4930204 50.7007898, -3.4930048 50.7008015), (-3.4930048 50.7008015, -3.4919513 50.70168349999999), (-3.4919513 50.70168349999999, -3.49137 50.70213669999998), (-3.49137 50.70213669999998, -3.4911565 50.7023015), (-3.4911565 50.7023015, -3.4909108 50.70246919999999), (-3.4909108 50.70246919999999, -3.4902349 50.70291189999999), (-3.4902349 50.70291189999999, -3.4897693 50.70314579999999), (-3.4805021 50.7077218, -3.4806265 50.70770150000001), (-3.488620899999999 50.70365289999999, -3.4888806 50.70353719999999), (-3.4897693 50.70314579999999, -3.489176800000001 50.70340539999999), (-3.489176800000001 50.70340539999999, -3.4888806 50.70353719999999), (-3.4865751 50.70487679999999, -3.4882604 50.70375799999999), (-3.479841700000001 50.70784459999999, -3.4805021 50.7077218), (-3.4882604 50.70375799999999, -3.488620899999999 50.70365289999999), (-3.4806265 50.70770150000001, -3.481215600000001 50.70762499999999), (-3.4717096 50.71021009999998, -3.4746017 50.70944449999999), (-3.4786112 50.70811229999999, -3.479841700000001 50.70784459999999), (-3.471219399999999 50.71028909999998, -3.4717096 50.71021009999998))"""), sep=";")

df = gpd.GeoDataFrame(df, geometry=df["geometry"].apply(shapely.wkt.loads), crs="epsg:4326")
df["acquired"] = pd.date_range("8-feb-2022", freq="1H", periods=len(df))

def style_function(x):
    return {"color": "blue", "weight": 3}

ax = folium.Map(
    location=[sum(df.total_bounds[[1, 3]]) / 2, sum(df.total_bounds[[0, 2]]) / 2],
    zoom_start=12,
)

# datetime is not JSON serializable...
df["tt"] = df["acquired"].dt.strftime("%Y-%b-%d %H:%M")

feature = folium.features.GeoJson(
    df.drop(columns="acquired"),
    name='Location',
    style_function=style_function,
    tooltip=folium.GeoJsonTooltip(fields=["tt"], aliases=["Time"], labels=True))
ax.add_child(feature)
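[Editorial note: the "JSON-serialisable" point is the one that usually bites. The timestamp-to-string step can be verified on its own, without folium, using the same tt/acquired column names as the answer above:]

```python
import pandas as pd

df = pd.DataFrame({"acquired": pd.date_range("2022-02-08", freq="1h", periods=3)})

# GeoJSON properties must be JSON-serialisable, so format the
# timestamps as strings before handing the frame to GeoJson()
df["tt"] = df["acquired"].dt.strftime("%Y-%b-%d %H:%M")
df = df.drop(columns="acquired")

print(df["tt"].tolist())
```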
Error when creating dataframe "*** ValueError: If using all scalar values, you must pass an index"
I was trying to create a plot with seaborn, but I faced an error:

*** ValueError: If using all scalar values, you must pass an index

In my program I have read an output file from XFOIL and was trying to plot its results to check it out. XFOIL file format:

# x Cp
1.00000 0.24276
0.99818 0.23883
0.99591 0.22657
0.99342 0.21421
0.99066 0.20128
0.98759 0.18802
0.98413 0.17434
0.98020 0.16018
0.97569 0.14544
0.97044 0.12999
0.96429 0.11374
0.95703 0.09661

(the file is big, and I will not transcribe it completely here)

I decided to create a dataframe to make the plotting process easy:

lst = fl.readlines()
lst_splt = [s.replace('#','').split() for s in lst]
cp = pd.DataFrame.from_records(lst_splt[1:], columns=lst_splt[:1]).astype(float)

and finally I tried to plot it using seaborn:

sns.lineplot(x='x', y='Cp', data=cp)

but as I said at the beginning of the question, an error appeared:

*** ValueError: If using all scalar values, you must pass an index

What can I do to fix this error?
Not sure why you are having this error, but you can simply do:

import matplotlib.pyplot as plt

plt.plot(cp["x"], cp["Cp"])
plt.show()

EDIT: After some experimentation, it seems that your method for creating the dataframe is probably the culprit. You can replace it with:

cp = pd.read_csv(filename, sep="\s+", skiprows=2, names=["x", "Cp"])
# Make sure that you have the right value for skiprows
# (it should skip the header and nothing else)

# Then this works:
sns.lineplot(x="x", y="Cp", data=cp)
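[Editorial note: the read_csv route is easy to check on a tiny inline sample; skiprows=1 here because this sample has a single header line, while the right value depends on the actual file:]

```python
import io
import pandas as pd

raw = """# x Cp
1.00000 0.24276
0.99818 0.23883
0.99591 0.22657
"""

# whitespace-separated columns; skip the "# x Cp" header line
cp = pd.read_csv(io.StringIO(raw), sep=r"\s+", skiprows=1, names=["x", "Cp"])

print(cp.dtypes.tolist())
print(len(cp))
```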
The problem is related to the columns argument passed to the DataFrame constructor. When printing lst_splt[:1], which is the value you pass to the columns argument, I get this:

print(lst_splt[:1])
# [['x', 'Cp']]

The DataFrame constructor in this case needs a flat list, not a nested one. The problem is solved when you change lst_splt[:1] to lst_splt[:1][0], which when printed gives of course:

print(lst_splt[:1][0])
# ['x', 'Cp']

The modified version of your code below works fine:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

fl = open('data.txt', 'r')
lst = fl.readlines()
lst_splt = [s.replace('#','').split() for s in lst]
cp = pd.DataFrame.from_records(lst_splt[1:], columns=lst_splt[:1][0]).astype(float)

sns.lineplot(data=cp, x='x', y='Cp')
plt.show()
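[Editorial note: the nested-versus-flat distinction is the whole bug, and it is easy to see in isolation with a few hand-typed rows:]

```python
import pandas as pd

lst_splt = [['x', 'Cp'], ['1.00000', '0.24276'], ['0.99818', '0.23883']]

print(lst_splt[:1])     # [['x', 'Cp']] -> nested list, wrong for columns=
print(lst_splt[:1][0])  # ['x', 'Cp']   -> flat list, what the constructor wants

# With the flat list the frame gets proper column labels
cp = pd.DataFrame.from_records(lst_splt[1:], columns=lst_splt[:1][0]).astype(float)
print(list(cp.columns))
```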
Index by condition in python-numpy?
I'm trying to migrate from Matlab to Python. I'm rewriting some code that I had in Matlab to Python for testing. I've installed Anaconda and am currently using the Spyder IDE. In Matlab I created a function that returns the values of the commercial API 5L diameters (diametro) and thicknesses (espesor) of the pipes that are closest to the function's input parameters. I did this using a Matlab table. Note that the input diameter (diametro_entrada) and thickness (espesor_entrada) are in meters [m], while the thicknesses inside the function are in millimeters [mm]; that's why at the end I had to multiply espesor_entrada by 1000.

function tabla_seleccion=tablaAPI(diametro_entrada,espesor_entrada)
%Provides the table of API 5L pipes; enter the diameter in [m] and the
%thickness in [m]
Diametro_m=[0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;0.3556;...
0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;0.4064;...
0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;0.4570;...
0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;0.5080;...
0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;0.559;...
0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;0.610;...
0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;0.660;...
0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;0.711;...
0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;0.762;...
0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813;0.813];
Espesor_mm=[4.8;5.2;5.3;5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;31.8;...
4.8;5.2;5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;...
4.8;5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;...
5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;33.3;34.9;...
5.6;6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;33.3;34.9;36.5;38.1;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;33.3;34.9;36.5;38.1;39.7;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8;...
6.4;7.1;7.9;8.7;9.5;10.3;11.1;11.9;12.7;14.3;15.9;17.5;19.1;20.6;22.2;23.8;25.4;27.0;28.6;30.2;31.8];
TablaAPI=table(Diametro_m,Espesor_mm);
tabla_seleccion=TablaAPI(abs(TablaAPI.Diametro_m-diametro_entrada)<0.05 & abs(TablaAPI.Espesor_mm-(espesor_entrada*1000))<1.2,:);
end

With the input diameter (d) and input thickness (e) I get the commercial pipes whose diameter differs by less than 0.05 and whose thickness differs by less than 1.2 from the inputs.

I want to reproduce this in Python with Numpy or another package. First I defined 2 Numpy arrays with the same names as in Matlab, but comma separated instead of semicolon separated and without the "..." at the end of each line. Then I defined another Numpy array as:

TablaAPI=numpy.array([Diametro_m,Espesor_mm])

I want to know whether I can index that array in some way like I did in Matlab, or whether I have to define something totally different. Thanks a lot!
You sure can! Here's an example of how you can do it with numpy:

Using Numpy

import numpy as np

# Declare your Diametro_m and Espesor_mm arrays here just like in your example,
# then stack them as two columns
arr = np.column_stack((Diametro_m, Espesor_mm))

selection = arr[(np.abs(arr[:, 0] - diametro_entrada) < 0.05)
                & (np.abs(arr[:, 1] - espesor_entrada * 1000) < 1.2)]

Using Dataframes

Dataframes may also be great for your application in case you need to do heavier queries or mix column datatypes. This code should work for you, if you choose that option:

import pandas as pd
import numpy as np

# Declare your Diametro_m and Espesor_mm arrays here just like in your example
df = pd.DataFrame({'Diametro_m': Diametro_m, 'Espesor_mm': Espesor_mm})

# Now you can perform your selection like this:
selection = df[(df['Diametro_m'].sub(diametro_entrada).abs() < 0.05)
               & (df['Espesor_mm'].sub(espesor_entrada * 1000).abs() < 1.2)]

# This "mask" of the dataframe returns all rows that satisfy your query.
print(selection)
Since you have not given an example of your expected output it's a bit of guessing what you are really after, but here is one version with numpy.

import numpy as np

# arrays rewritten for numpy
Diametro_m = [0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,0.3556,
              0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,0.4064,
              0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,0.4570,
              0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,0.5080,
              0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,0.559,
              0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,0.610,
              0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,0.660,
              0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,0.711,
              0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,0.762,
              0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813,0.813]
Espesor_mm = [4.8,5.2,5.3,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,31.8,
              4.8,5.2,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,
              4.8,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,
              5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,
              5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,36.5,38.1,
              6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,36.5,38.1,39.7,
              6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,
              6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,
              6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,
              6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8]

diametro_entrada = 0.4
espesor_entrada = 5

Diametro_m = np.array(Diametro_m)
Espesor_mm = np.array(Espesor_mm)
# Diametro_m and Espesor_mm have shape (223,);
# if not, change them so that they do

table = np.array([Diametro_m, Espesor_mm]).T
mask = np.where((np.abs(Diametro_m - diametro_entrada) < 0.05)
                & (np.abs(Espesor_mm - espesor_entrada) < 1.2))
result = table[mask]

print('with numpy')
print(result)

or you can do it with just python...

# redo with python only,
# based on a simple dict and list comprehension
D_m = [0.3556, 0.4064, 0.4570, 0.5080, 0.559, 0.610, 0.660, 0.711, 0.762, 0.813]
E_mm = [[4.8,5.2,5.3,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,31.8],
        [4.8,5.2,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8],
        [4.8,5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8],
        [5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9],
        [5.6,6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,36.5,38.1],
        [6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8,33.3,34.9,36.5,38.1,39.7],
        [6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4],
        [6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4],
        [6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8],
        [6.4,7.1,7.9,8.7,9.5,10.3,11.1,11.9,12.7,14.3,15.9,17.5,19.1,20.6,22.2,23.8,25.4,27.0,28.6,30.2,31.8]]

table2 = dict(zip(D_m, E_mm))

result2 = []
for D, E in table2.items():
    if abs(D - diametro_entrada) < 0.05:
        Et = [t for t in E if abs(t - espesor_entrada) < 1.2]
        result2 += [(D, t) for t in Et]

print('with vanilla python')
print('\n'.join(str(r) for r in result2))

Once you are in python there are endless ways to do this; you could easily do the same with pandas or sqlite. My personal preference tends to lean towards as few dependencies as possible; in this case I would go for a csv file as input and then do it without numpy. If it were a truly large-scale problem I would consider sqlite/numpy/pandas. Good luck with the transition, I don't think you will regret it.
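[Editorial note: the boolean-mask selection above can be exercised on a cut-down table. A minimal sketch with toy data, using a plain boolean mask instead of np.where (which selects the same rows):]

```python
import numpy as np

# toy versions of the two columns; values taken from the API 5L table above
Diametro_m = np.array([0.3556, 0.4064, 0.4064, 0.4570])
Espesor_mm = np.array([4.8, 4.8, 5.2, 4.8])

diametro_entrada = 0.4
espesor_entrada = 5.0   # already in mm here, as in the answer

table = np.array([Diametro_m, Espesor_mm]).T
mask = ((np.abs(Diametro_m - diametro_entrada) < 0.05)
        & (np.abs(Espesor_mm - espesor_entrada) < 1.2))

print(table[mask])  # the first three rows match, the 0.4570 row does not
```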
Vectorize integration of pandas.DataFrame
I have a DataFrame of force-displacement data. The displacement array has been set to the DataFrame index, and the columns are my various force curves for different tests. How do I calculate the work done (which is "the area under the curve")? I looked at numpy.trapz, which seems to do what I need, but I think that I can avoid looping over each column like this:

import numpy as np
import pandas as pd

forces = pd.read_csv(...)

work_done = {}
for col in forces.columns:
    work_done[col] = np.trapz(forces.loc[col], forces.index))

I was hoping to create a new DataFrame of the areas under the curves rather than a dict, and thought that DataFrame.apply() or something might be appropriate, but I don't know where to start looking.

In short:
Can I avoid the looping?
Can I create a DataFrame of work done directly?

Thanks in advance for any help.
You could vectorize this by passing the whole DataFrame to np.trapz and specifying the axis= argument, e.g.:

import numpy as np
import pandas as pd

# some random input data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
forces = pd.DataFrame(x, columns=names)

# vectorized version
wrk = np.trapz(forces, x=forces.index, axis=0)
work_done = pd.DataFrame(wrk[None, :], columns=forces.columns)

# non-vectorized version for comparison
work_done2 = {}
for col in forces.columns:
    work_done2.update({col: np.trapz(forces.loc[:, col], forces.index)})

These give the following output:

from pprint import pprint

pprint(work_done.T)
#            0
# a -24.331560
# b -10.347663
# c   4.662212
# d -12.536040
# e -10.276861
# f   3.406740
# g  -3.712674
# h  -9.508454
# i  -1.044931
# j  15.165782

pprint(work_done2)
# {'a': -24.331559643023006,
#  'b': -10.347663159421426,
#  'c': 4.6622123535050459,
#  'd': -12.536039649161403,
#  'e': -10.276861220217308,
#  'f': 3.4067399176289994,
#  'g': -3.7126739591045541,
#  'h': -9.5084536839888187,
#  'i': -1.0449311137294459,
#  'j': 15.165781517623724}

There are a couple of other problems with your original example. col is a column name rather than a row index, so it needs to index the second dimension of your dataframe (i.e. .loc[:, col] rather than .loc[col]). Also, you have an extra trailing parenthesis on the last line.

Edit: You could also generate the output DataFrame directly by .applying np.trapz to each column, e.g.:

work_done = forces.apply(np.trapz, axis=0, args=(forces.index,))

However, this isn't really 'proper' vectorization - you are still calling np.trapz separately on each column. You can see this by comparing the speed of the .apply version against calling np.trapz directly:

In [1]: %timeit forces.apply(np.trapz, axis=0, args=(forces.index,))
1000 loops, best of 3: 582 µs per loop

In [2]: %timeit np.trapz(forces, x=forces.index, axis=0)
The slowest run took 6.04 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 53.4 µs per loop

This isn't an entirely fair comparison, since the second version excludes the extra time taken to construct the DataFrame from the output numpy array, but this should still be smaller than the difference in time taken to perform the actual integration.
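[Editorial note: a quick sanity check that the single whole-frame call really matches the per-column loop; this sketch uses np.trapezoid where available, since NumPy 2.0 renamed np.trapz:]

```python
import numpy as np
import pandas as pd

# np.trapz was renamed np.trapezoid in NumPy 2.0
trapz = getattr(np, "trapezoid", None) or np.trapz

x = np.linspace(0.0, 1.0, 6)
forces = pd.DataFrame({'a': x ** 2, 'b': 2 * x}, index=x)

# one call over the whole frame ...
vec = trapz(forces, x=forces.index, axis=0)

# ... gives the same numbers as a per-column loop
loop = [trapz(forces[col], forces.index) for col in forces.columns]

print(vec)
```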
Here's how to get the cumulative integral along a dataframe column using the trapezoidal rule. Alternatively, the following creates a pandas.Series method for doing your choice of the trapezoidal, Simpson's or Romberg's rule (source):

import pandas as pd
from scipy import integrate
import numpy as np

#%% Setup Functions
def integrate_method(self, how='trapz', unit='s'):
    '''Numerically integrate the time series.

    @param how: the method to use (trapz by default)
    @return

    Available methods:
     * trapz - trapezoidal
     * cumtrapz - cumulative trapezoidal
     * simps - Simpson's rule
     * romb - Romberg's rule

    See http://docs.scipy.org/doc/scipy/reference/integrate.html for the
    method details, or the source code:
    https://github.com/scipy/scipy/blob/master/scipy/integrate/quadrature.py
    '''
    available_rules = set(['trapz', 'cumtrapz', 'simps', 'romb'])
    if how in available_rules:
        rule = integrate.__getattribute__(how)
    else:
        print('Unsupported integration rule: %s' % (how))
        print('Expecting one of these sample-based integration rules: %s' % (str(list(available_rules))))
        raise AttributeError

    if how == 'cumtrapz':
        result = rule(self.values)
        result = np.insert(result, 0, 0, axis=0)
    else:
        result = rule(self.values)
    return result

pd.Series.integrate = integrate_method

#%% Setup (random) data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
df = pd.DataFrame(x, columns=names)

#%% Cumulative integral
df_cumulative_integral = df.apply(lambda x: x.integrate('cumtrapz'))
df_integral = df.apply(lambda x: x.integrate('trapz'))

df_do_they_match = df_cumulative_integral.tail(1).round(3) == df_integral.round(3)

if df_do_they_match.all().all():
    print("Trapz produces the last row of cumtrapz")