I want to plot a histogram from a dataframe column, but I can't seem to convert the data into an object type that matplotlib can handle. Here are some failed attempts. How do I fix this?
And more generally, how do you typically debug something like that?
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
filter(lambda v: v > 0, df['foo_col']).hist(bins=100)
---> 10 filter(lambda v: v > 0, df['foo_col']).hist(bins=100)
AttributeError: 'filter' object has no attribute 'hist'
hist(filter(lambda v: v > 0, df['foo_col']), bins=100)
---> 10 hist(filter(lambda v: v > 0, df['foo_col']), bins=100)
TypeError: 'Series' object is not callable
By all accounts, filter is lucky to be part of the standard library. IIUC, you just want to filter your dataframe to plot a histogram of values > 0. Pandas has its own syntax for that:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = np.random.randint(-50, 1000, 10000)
df = pd.DataFrame({'some_data': data})
df[df['some_data'] > 0].hist(bins=100)
plt.show()
Note that this will run much faster than the Python builtins (it makes little difference in this trivial example, but it will with bigger datasets). Use pandas methods on dataframes wherever possible because, in many cases, the calculation is vectorized and runs in highly optimised C/C++ code.
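As a minimal sketch of why the first attempt failed (the seed and the names `positive` and `via_builtin` are illustrative assumptions, not from the question): a boolean mask keeps the result a pandas Series, which still has a `.hist` method, whereas the builtin `filter` yields a plain iterator that has none.

```python
import numpy as np
import pandas as pd

# Illustrative data; the seed is arbitrary
rng = np.random.default_rng(0)
df = pd.DataFrame({"foo_col": rng.integers(-50, 1000, 10_000)})

# Boolean mask: the result is still a Series, so positive.hist(bins=100) would work
positive = df.loc[df["foo_col"] > 0, "foo_col"]

# Builtin filter: the result is an iterator (materialised into a list here)
via_builtin = list(filter(lambda v: v > 0, df["foo_col"]))
```

Both approaches select the same values; only the pandas version returns something with plotting methods attached.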
I have the following pandas dataframe df with 2 columns, which looks like:
0 0
1. 22
2. 34
3. 21
4. 21
5. 92
I would like to integrate the area under this curve, taking the first column as the x-axis and the second column as the y-axis. I have tried doing this using the integrate module from scipy (from scipy import integrate), applied as follows, as I have seen in examples online:
print(df.integrate)
However, it seems the integrate function does not work. I'm receiving the error:
Dataframe object has no attribute integrate
How would I go about this?
Thank you
You want numerical integration given a fixed sample of data. The SciPy package lists a handful of methods to do this: https://docs.scipy.org/doc/scipy/reference/integrate.html#integrating-functions-given-fixed-samples
For your data, the trapezoidal rule is probably the most straightforward. You provide the y and x values to the function. You did not post the column names of your data frame, so I am using the 0-index for the x values and the 1-index for the y values:
from scipy.integrate import trapezoid  # called trapz in older SciPy; trapz was removed in SciPy 1.14
trapezoid(df.iloc[:, 1], df.iloc[:, 0])
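To make the trapezoidal rule concrete, here is a sketch that computes it by hand with numpy only. The column names `x` and `y` are assumptions (the question did not give any), and the values are the ones shown in the question.

```python
import numpy as np
import pandas as pd

# Column names are hypothetical; the values come from the question
df = pd.DataFrame({"x": [0, 1, 2, 3, 4, 5], "y": [0, 22, 34, 21, 21, 92]})

x = df["x"].to_numpy(dtype=float)
y = df["y"].to_numpy(dtype=float)

# Each panel contributes (width) * (average of its two heights)
widths = np.diff(x)
area = np.sum(widths * (y[:-1] + y[1:]) / 2)
print(area)  # 144.0
```

A library routine for the trapezoidal rule should return the same number for these samples.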
Since integrate is a scipy module, not a pandas method, you need to call its functions directly:
from scipy.integrate import trapezoid, simpson  # called trapz and simps in older SciPy
print(trapezoid(*args))
https://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html
Try this
import pandas as pd
import numpy as np
def integrate(x, y):
    area = np.trapz(y=y, x=x)
    return area
df = pd.DataFrame({'x':[0, 1, 2, 3, 4, 4, 5],'y':[0, 1, 3, 3, 5, 6, 7]})
x = df.x.values
y = df.y.values
print(integrate(x, y))
I need the values of the autocorrelation coefficients that autocorrelation_plot() draws. The problem is that the function does not return those values, so I need another function to get them. That's why I used acf() from statsmodels, but its plot does not match the one from autocorrelation_plot(). Here is my code:
from statsmodels.tsa.stattools import acf
from pandas.plotting import autocorrelation_plot
import matplotlib.pyplot as plt
import numpy as np
y = np.sin(np.arange(1,6*np.pi,0.1))
plt.plot(acf(y))
plt.show()
So the result is not the same as this:
autocorrelation_plot(y)
plt.show()
This seems to be related to the nlags parameter of acf:
nlags: int, optional
Number of lags to return autocorrelation for.
I don't know what exactly this does, but in the source of acf there is a slicing that shortens the array:
avf = acovf(x, unbiased=unbiased, demean=True, fft=fft, missing=missing)
acf = avf[:nlags + 1] / avf[0]
If you use statsmodels.tsa.stattools.acovf directly the result is the same as with autocorrelation_plot:
avf = acovf(x, unbiased=unbiased, demean=True, fft=fft, missing=missing)
So you can call it like
plt.plot(acf(y, nlags=len(y)))
to make it work.
An explanation of lag: https://math.stackexchange.com/questions/2548314/what-is-lag-in-a-time-series/2548350
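If the goal is just to get the coefficients at every lag, they can also be computed directly with numpy. This is a sketch, not the code of either library: it assumes the usual biased estimator (lagged products of the demeaned series divided by n, then normalised by the lag-0 value), and the function name `autocorr_all_lags` is made up for illustration.

```python
import numpy as np

def autocorr_all_lags(values):
    """Autocorrelation at lags 0..n-1 using the biased (divide-by-n) estimator."""
    data = np.asarray(values, dtype=float)
    n = len(data)
    centered = data - data.mean()
    c0 = np.sum(centered ** 2) / n  # lag-0 autocovariance
    return np.array(
        [np.sum(centered[: n - h] * centered[h:]) / n / c0 for h in range(n)]
    )

y = np.sin(np.arange(1, 6 * np.pi, 0.1))
r = autocorr_all_lags(y)  # one coefficient per lag, r[0] is 1
```

Plotting `r` with `plt.plot(r)` gives one point per lag, which makes it easy to compare against the other two plots.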
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset=pd.read_csv('Churn_Modelling.csv')
X=dataset.iloc[:, 3:13]
Y=dataset.iloc[:, 13]
from sklearn.preprocessing import LabelEncoder
label_en1=LabelEncoder()
X.values[:, 1]=label_en1.fit_transform(X.values[:, 1])
label_en2=LabelEncoder()
X.values[:, 2]=label_en2.fit_transform(X.values[:, 2])
I tried creating dummy variables but it is not working. I am using X.values in the encoding section because the version of Spyder that I have does not support object arrays, so X and Y are left as dataframes. I added .values because dataframes do not support slice notation. Where might I have gone wrong?
I created a similar program before for creating dummy variables and it worked then. I don't understand why it is not happening for this one.
Edit:
Can you pass in a slice of your slice? Like so:
X.iloc[:, 1] = label_en1.fit_transform(X.iloc[:, 1])
You would essentially trim your dataframe down to what appears to be an array
Instead of accessing X.values, try accessing the feature / column name directly:
X['col_name'] = label_en1.fit_transform(X['col_name'])
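A small sketch of why writing through X.values has no effect, assuming the frame has mixed dtypes as in the question. The column names and values are hypothetical, and pandas' categorical codes stand in for LabelEncoder so that the example needs only pandas.

```python
import pandas as pd

# Hypothetical mixed-dtype frame (one numeric column, one string column)
X = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
})

# With mixed dtypes, .values builds a fresh object array on each access,
# so this write lands in a temporary copy and X is untouched
X.values[:, 1] = 0
after_values_write = X["Geography"].tolist()

# Assigning to the column itself does update the frame
# (categorical codes are a stand-in for an encoder's fit_transform)
X["Geography"] = X["Geography"].astype("category").cat.codes
after_column_write = X["Geography"].tolist()
```

The same column assignment should work with the result of `label_en1.fit_transform(...)` on the right-hand side.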
This question already has answers here:
python+numpy: why does numpy.log throw an attribute error if its operand is too big?
I am trying to plot several points using Matplotlib on a plot that has lines following the function defined in energy(). The points are plasma parameters and the lines follow the function that connects them using multiple values of the Debye length.
import matplotlib.pyplot as plt
import numpy as np
n_pts = [10**21,10**19,10**23,10**11,10**15,10**14,10**17,10**6]
KT_pts = [10000,100,1000,0.05,2,0.1,0.2,0.01]
n_set = np.logspace(6,25)
debye_set = 7.43*np.logspace(-1,-7,10)
def energy(n,debye):
    return n*(debye/7430)**2
fig,ax=plt.subplots()
ax.scatter(n_pts,KT_pts)
for debye in debye_set:
    ax.loglog(n_set,energy(n_set,debye))
plt.show()
This gives the following error:
AttributeError: 'int' object has no attribute 'log'
Python integers have arbitrary precision, so a value like 10**21, which is too large for a 64-bit integer, is still a perfectly valid int. numpy, however, cannot store such a value in any of its native integer dtypes, so it falls back to the object dtype. This, in turn, does not support ufuncs like np.log:
> np.log([10**3])
array([ 6.90775528])
> np.log([10**30])
AttributeError: 'int' object has no attribute 'log'
One easy solution here is to make sure that numpy converts n_pts, the array with the large numbers, into a dtype it can actually use, like float:
import matplotlib.pyplot as plt
import numpy as np
n_pts = np.array([10**21,10**19,10**23,10**11,10**15,10**14,10**17,10**6], dtype='float')
KT_pts = [10000,100,1000,0.05,2,0.1,0.2,0.01]
n_set = np.logspace(6,25)
debye_set = 7.43*np.logspace(-1,-7,10)
def energy(n,debye):
    return n*(debye/7430)**2
fig,ax=plt.subplots()
ax.scatter(n_pts,KT_pts)
for debye in debye_set:
    ax.loglog(n_set,energy(n_set,debye))
plt.show()
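To check which dtype numpy picked, inspecting `.dtype` is usually enough. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

# 10**21 does not fit in 64 bits, so numpy falls back to the object dtype
too_big = np.array([10**21, 10**19])
print(too_big.dtype)   # object

# Forcing a float dtype gives a regular numeric array that ufuncs accept
as_float = np.array([10**21, 10**19], dtype=float)
print(as_float.dtype)  # float64

logs = np.log10(as_float)
```

An `object` dtype on an array that should be numeric is a good hint that a conversion like the one above is needed.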
I tried to plot the output of the defined function with respect to z. However the error TypeError: unhashable type: 'numpy.ndarray' is shown. Please help.
import numpy as np
import matplotlib.pyplot as plt
import sympy as sp
a=1.48185562
b=0.57081914
c=-0.25098188
H0=70.32724312
z=np.linspace(0.0,1.5,100)
omega_m0=0.3
dlabel= 'w(z) vz z'
def func(z):
    sp.var('z+1')
    H=((2/H0)*((b*(z+1)+c*(z+1)**0.5+2.0-a-b-c)*(1-0.5*a*(z+1)**(-0.5)) - ((z+1)-a*(z+1)**0.5-1.0+a)*(b+c*0.5*(z+1)**(-0.5)))/(b*(z+1)+c*(z+1)**0.5+2.0-a-b-c)**2)**(-1)
    return ((2*(z+1)/3)*(sp.diff(sp.log(H)))-1)/(1-(H/H0)**2*omega_m0*(z+1)**3)
wz=func(z)
plt.plot(z,wz)
plt.xlabel('z')
plt.ylabel('w(z)')
plt.show()
I'm not sure what you want to do with sp.var('z+1')... at least I hope you were not trying to create a variable named z+1. I got the code to run, but I'll let you make sure it does what you want and complain if not :)
import numpy as np
import matplotlib.pyplot as plt
import sympy as sp
a=1.48185562
b=0.57081914
c=-0.25098188
H0=70.32724312
x=np.linspace(0.0,1.5,100)
omega_m0=0.3
dlabel= 'w(z) vz z'
sp.var('z')
def func(z):
    H=((2/H0)*((b*(z+1)+c*(z+1)**0.5+2.0-a-b-c)*(1-0.5*a*(z+1)**(-0.5)) - ((z+1)-a*(z+1)**0.5-1.0+a)*(b+c*0.5*(z+1)**(-0.5)))/(b*(z+1)+c*(z+1)**0.5+2.0-a-b-c)**2)**(-1)
    return ((2*(z+1)/3)*(sp.diff(sp.log(H)))-1)/(1-(H/H0)**2*omega_m0*(z+1)**3)
wz = [func(z).evalf(subs = {z : y}) for y in x]
plt.plot(x,wz)
plt.xlabel('z')
plt.ylabel('w(z)')
plt.show()
EDIT: to get wz, the following piece is much faster (cf. "Evaluate sympy expression from an array of values"):
from sympy.utilities.lambdify import lambdify
func_np_ready = lambdify(z, func(z),'numpy') # returns a numpy-ready function
wz = func_np_ready(x)
You may be better off flagging your question with sympy - it's probably the behaviour of one of those functions that's causing the issue, and someone else might know all about it.
It's probably a good idea to split those really long formulas up into multi lines (at least while debugging) to help you track down the error. Also put in some prints etc.
I know it's not what you want to achieve but if I cut out the sympy (I don't have it installed!) and adjust the array lengths it plots without error:
...
    H=((2/H0)*((b*(z+1)+c*(z+1)**0.5+2.0-a-b-c)*(1-0.5*a*(z+1)**(-0.5)) - ((z+1)-a*(z+1)**0.5-1.0+a)*(b+c*0.5*(z+1)**(-0.5)))/(b*(z+1)+c*(z+1)**0.5+2.0-a-b-c)**2)**(-1)
    return ((2*(z[:-1]+1)/3)*(np.diff(np.log(H)))-1)/(1-(H[:-1]/H0)**2*omega_m0*(z[:-1]+1)**3)
wz=func(z)
plt.plot(z[:-1],wz)
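One caveat with a numpy-only version: np.diff returns the differences between neighbouring samples, not a derivative, so it is not a drop-in replacement for sp.diff. If a numerical derivative with respect to z is what's wanted, np.gradient is one option, and it keeps the array the same length as z. A sketch, using a hypothetical stand-in for H whose log-derivative is known exactly:

```python
import numpy as np

z = np.linspace(0.0, 1.5, 100)

# Hypothetical stand-in for H(z): ln(H) = 2*z, so d(ln H)/dz is exactly 2
H = np.exp(2.0 * z)

# Finite-difference derivative of ln(H) with respect to z, same shape as z
dlnH_dz = np.gradient(np.log(H), z)
```

With the real H from the question, `dlnH_dz` could take the place of the `np.diff(np.log(H))` term without trimming `z` and `H` to `[:-1]`, though the result should be checked against the sympy version.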