When using argmax() on a pandas Series, I get this warning:
The current behaviour of Series.argmax is deprecated, use idxmax
instead. The behavior of argmax will be corrected to return the
positional maximum in the future. For now, use series.values.argmax
or np.argmax(np.array(values)) to get the position of the maximum
row.
"""Entry point for launching an IPython kernel.
Any ideas what this means? I have used np.argmax(np.array(values)) to get the position of the maximum row, but it just returns the max value, and idxmax returns another error.
Here is an example:
import numpy as np
import pandas as pd
In[3]:
mtx = np.random.randn(10)
mtx
Out[3]:
array([-1.47694909, -0.61658367, -1.2609941 , 0.33956725, 1.69096661,
0.10680407, -3.53473223, 0.61587513, 2.34405466, -1.49556778])
In[4]:
ser = pd.Series(mtx)
ser
Out[4]:
0 -1.476949
1 -0.616584
2 -1.260994
3 0.339567
4 1.690967
5 0.106804
6 -3.534732
7 0.615875
8 2.344055
9 -1.495568
dtype: float64
In[5]:
ser.idxmax()
Out[5]:
8
In[6]:
ser[ser.idxmax()]
Out[6]:
2.344054659817029
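For what it's worth, the warning is only asking you to request the positional maximum explicitly, and np.argmax returns a position, not a value; if you saw the maximum value instead, you most likely indexed the series with the result afterwards. A quick sketch on the series above:
pos = ser.values.argmax()    # positional maximum -> 8
pos = np.argmax(ser.values)  # same result via the numpy free function
ser.iloc[pos]                # the maximum value itself -> 2.344...
Here the position happens to equal the label returned by idxmax only because the series uses the default RangeIndex.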
The objective is to update the df rows by considering each element in the df and a reference value from an external np array.
Currently I use a for loop to update each row, as below.
However, I wonder whether this can be tackled using any pandas built-in module.
import pandas as pd
import numpy as np
arr = np.array([1, 2, 5, 100, 3, 6, 8, 3, 99, 12, 5, 6, 8, 11, 14, 11, 100, 1, 3])
arr = arr.reshape((1, -1))
df = pd.DataFrame(zip([1, 7, 13], [4, 11, 17], ['a', 'g', 't']), columns=['start', 'end', 'o'])
for n in range(len(df)):
    a = df.loc[n]
    drange = list(range(a['start'], a['end'] + 1))
    darr = arr[0, drange]
    r = np.where(darr == np.amax(darr))[0].item()
    df.loc[n, 'pos_peak'] = drange[r]
Expected output
start end o pos_peak
0 1 4 a 3.0
1 7 11 g 8.0
2 13 17 t 16.0
My approach would be to use pandas' apply() function, which applies a function to each row of your dataframe. To find the index of the maximum element, I use the numpy function argmax() on the relevant slice of arr. Here is the code:
import pandas as pd
import numpy as np
arr = np.array([1, 2, 5, 100, 3, 6, 8, 3, 99, 12, 5, 6, 8, 11, 14, 11, 100, 1, 3])
arr = arr.reshape((1, -1))
df = pd.DataFrame(zip([1, 7, 13], [4, 11, 17], ['a', 'g', 't']), columns=['start', 'end', 'o'])
df['pos_peak'] = df.apply(lambda x: x['start'] + np.argmax(arr[0][x['start']:x['end'] + 1]), axis=1)
df
Output:
start end o pos_peak
0 1 4 a 3
1 7 11 g 8
2 13 17 t 16
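The x['start'] + np.argmax(...) arithmetic works because argmax returns the offset of the maximum within the slice, so adding the slice's start converts it back to an absolute position in arr. Checking the first row by hand:
window = arr[0][1:4 + 1]    # positions 1..4 -> array([2, 5, 100, 3])
offset = np.argmax(window)  # offset 2 within the window
print(1 + offset)           # 3, matching pos_peak for the first row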
I have this very simple series.
pd.Series(np.random.randn(10), dtype=np.int32)
I want to force a dtype, but pandas will overrule my initial setup:
Out[6]:
0 0.764638
1 -1.451616
2 -0.318875
3 -1.882215
4 1.995595
5 -0.497508
6 -1.004066
7 -1.641371
8 -1.271198
9 0.907795
dtype: float64
I know I could do this:
pd.Series(np.random.randn(10), dtype=np.int32).astype("int32")
But my question is: why does pandas not handle the data the way I want it in the Series constructor? There is no force parameter or anything like that.
Can somebody explain to me what happens there, and how I can force the dtype in the Series constructor, or at least get a warning when the output dtype differs from what I initially requested?
You can use this:
>>> pd.Series(np.random.randn(10).astype(np.int32))
0 0
1 1
2 1
3 1
4 0
5 0
6 -1
7 0
8 0
9 0
dtype: int32
Pandas infers the data type correctly. You can force your dtype, with one exception: if your data is float and you want to force the dtype to intX, this will not work, because pandas does not take the responsibility of losing information by truncating the result.
That is why you see this behaviour:
>>> np.random.randn(10).dtype
dtype('float64')
>>> pd.Series(np.random.randn(10)).dtype
dtype('float64') # OK
>>> pd.Series(np.random.randn(10), dtype=np.int32).dtype
dtype('float64') # KO -> Pandas does not truncate the data
>>> np.random.randint(1, 10, 10).dtype
dtype('int64')
>>> pd.Series(np.random.randint(1, 10, 10)).dtype
dtype('int64') # OK
>>> pd.Series(np.random.randint(1, 10, 10), dtype=np.float64).dtype
dtype('float64') # OK -> float64 is a superset of int64
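To get at least a warning, as the question asks, you can wrap the constructor in a small helper. series_checked below is a hypothetical name, not a pandas API, and the sketch assumes a pandas version that silently falls back to float64 as shown above (newer versions may raise instead):
import warnings
import numpy as np
import pandas as pd

def series_checked(data, dtype):
    # Hypothetical helper: build the Series, then warn if pandas kept a
    # dtype other than the one requested.
    s = pd.Series(data, dtype=dtype)
    if s.dtype != np.dtype(dtype):
        warnings.warn(f"requested {np.dtype(dtype)} but got {s.dtype}; "
                      f"use .astype() to force the lossy cast")
    return s

s = series_checked(np.random.randn(10), np.int32)  # warns; dtype stays float64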
I have an R dataframe that I've processed:
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
pandas2ri.activate()
import pandas as pd
%%R
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
r_df = data.frame(n, s, b)
r_df[['c']]=NA
r_df
#out:
# n s b c
#1 2 aa 1 NA
#2 3 bb 0 NA
#3 5 cc 1 NA
When I convert it to pandas, the NA values are replaced with a large negative integer:
with localconverter(ro.default_converter + pandas2ri.converter):
pd_from_r_df = ro.conversion.rpy2py(ro.r('r_df'))
pd_from_r_df
#Out:
# n s b c
#1 2.0 aa 1 -2147483648
#2 3.0 bb 0 -2147483648
#3 5.0 cc 1 -2147483648
I have tried setting different data types on the columns of r_df, but to no avail. How can I fix this issue?
Note: setting r_df[is.na(r_df)]='None' prior to converting to pandas solves the issue, but it should be simpler than this.
The likely issue is that R has an "NA" value for boolean values ("logical vectors" in R lingo) and for integer values, while Python/numpy does not.
Look at how the dtype changed between the two following examples:
In [1]: import pandas
In [2]: pandas.Series([True, False, True])
Out[2]:
0 True
1 False
2 True
dtype: bool
In [3]: pandas.Series([True, False, None])
Out[3]:
0 True
1 False
2 None
dtype: object
What is happening here is that the column "c" in your R data frame is of type "logical" (LGLSXP), but at the C level this is an R array of integers using only the values 0, 1, and -2147483648 (for FALSE, TRUE, and NA respectively). The rpy2 converter produces a numpy vector of integers because:
rpy2 implements the numpy array interface, to allow matching C arrays across the two languages.
numpy uses that interface (numpy.array() is called by rpy2).
This is admittedly only one way to approach conversion, and there are situations where it is not the most convenient. A custom converter can get you a behavior that suits you better.
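As a minimal sketch of what such a custom converter could look like, assuming rpy2 3.x and pandas 1.0+ (the names rinterface.BoolSexpVector and rinterface.NA_Logical come from that rpy2 API; whether the rule also fires for every column of a converted data.frame depends on how pandas2ri walks the frame, so treat this as a starting point rather than a definitive fix):
import pandas as pd
import rpy2.rinterface as rinterface
import rpy2.robjects as ro
from rpy2.robjects.conversion import Converter, localconverter

# Custom rule: map R logical vectors to pandas' nullable boolean dtype,
# so NA survives instead of becoming -2147483648.
na_bool = Converter('NA-aware logicals')

def logical_to_pandas(vec):
    # rinterface.NA_Logical is the singleton rpy2 uses for R's logical NA
    return pd.array(
        [None if x is rinterface.NA_Logical else bool(x) for x in vec],
        dtype='boolean',
    )

na_bool.rpy2py.register(rinterface.BoolSexpVector, logical_to_pandas)

vec = ro.r('c(TRUE, FALSE, NA)')
with localconverter(ro.default_converter + na_bool):
    print(ro.conversion.rpy2py(vec))  # expected: [True, False, <NA>]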
PS: One more note about your workaround below
Note, setting r_df[is.na(r_df)]='None' prior to converting to pandas
solves the issue. But it should be simpler than this
What is happening here is that you are converting the R boolean vector into a vector of strings.
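Concretely, column "c" then arrives in pandas as the literal string 'None', which pandas does not treat as missing; a quick illustration of the footgun:
import pandas as pd

s = pd.Series(['None', 'None', 'None'])  # what column "c" becomes
print(s.dtype)         # object
print(s.isna().any())  # False -- these are strings, not missing values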
NumPy functions, e.g. np.mean(), np.var(), etc., accept an array-like argument, such as an np.array or a list.
But passing in a pandas dataframe also works. This means that a pandas dataframe can disguise itself as a numpy array, which I find a little strange (despite knowing that the underlying values of a df are indeed numpy arrays).
For an object to be array-like, I thought it should be sliceable using integer indexing the way a numpy array is sliced. So, for instance, df[1:3, 2:3] should work, but it leads to an error.
So possibly a dataframe gets converted into a numpy array when it goes inside the function. But if that is the case, then why does np.mean(numpy_array) give a different result than np.mean(df)?
a = np.random.rand(4,2)
a
Out[13]:
array([[ 0.86688862, 0.09682919],
[ 0.49629578, 0.78263523],
[ 0.83552411, 0.71907931],
[ 0.95039642, 0.71795655]])
np.mean(a)
Out[14]: 0.68320065182041034
gives a different result from what the code below gives:
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
columns=range(np.shape(a)[1]))
df
Out[18]:
0 1
0 0.866889 0.096829
1 0.496296 0.782635
2 0.835524 0.719079
3 0.950396 0.717957
np.mean(df)
Out[21]:
0 0.787276
1 0.579125
dtype: float64
The former output is a single number, whereas the latter is a column-wise mean. How does a numpy function know about the makeup of a dataframe?
If you step through this:
--Call--
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean
You can see that the type is not an ndarray, so it tries to call a.mean, which in this case is df.mean():
In [6]:
df.mean()
Out[6]:
0 0.572999
1 0.468268
dtype: float64
This is why the output is different.
Code to reproduce above:
In [3]:
a = np.random.rand(4,2)
a
Out[3]:
array([[ 0.96750329, 0.67623187],
[ 0.44025179, 0.97312747],
[ 0.07330062, 0.18341157],
[ 0.81094166, 0.04030253]])
In [4]:
np.mean(a)
Out[4]:
0.52063384885403818
In [5]:
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
columns=range(np.shape(a)[1]))
df
Out[5]:
0 1
0 0.967503 0.676232
1 0.440252 0.973127
2 0.073301 0.183412
3 0.810942 0.040303
numpy output:
In [7]:
np.mean(df)
Out[7]:
0 0.572999
1 0.468268
dtype: float64
If you call .values to get the underlying np array, then the output is the same:
In [8]:
np.mean(df.values)
Out[8]:
0.52063384885403818
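In other words, to get numpy's own scalar reduction you have to hand np.mean a real ndarray first; any of the following work (df.to_numpy() assumes pandas 0.24+, where it was introduced as the modern spelling of .values):
np.mean(np.asarray(df))  # convert to an ndarray first
np.mean(df.to_numpy())   # pandas 0.24+ equivalent of .values
df.values.mean()         # call ndarray.mean directly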
I have a dataframe that looks like this:
df = pd.DataFrame({'A':[100,300,500,600],
'B':[100,200,300,400],
'C':[1000,2000,3000,4000],
'D':[1,4,5,6],
'E':[2,5,2,7]})
and when applying the pairwise maximum to any two columns, using
maximum(df.A,df.B)
I get an error saying
NameError: global name 'maximum' is not defined
I was under the impression that this error only occurs when using a variable that has not been assigned yet. However, the maximum function does exist in numpy. I know I can just apply
df[['A','B']].apply(max)
but I am concerned about the cause of the error. Why is it complaining that a function I thought was built in is not defined?
Did you miss the "np." prefix by any chance, after importing numpy as np? Here is the output from my MacBook:
>>> import numpy as np
>>> np.maximum(df.A,df.B)
0 100
1 300
2 500
3 600
Name: A, dtype: int64
pandas alternative:
In [32]: df[['A','B']].max().max()
Out[32]: 600
step-by-step:
In [31]: df[['A','B']].max()
Out[31]:
A 600
B 400
dtype: int64
if you need a maximum per row:
In [35]: df[['A','B']].max(axis=1)
Out[35]:
0 100
1 300
2 500
3 600
dtype: int64
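Note that for the pairwise (element-wise) maximum the asker wanted, max(axis=1) on the two columns matches np.maximum exactly; a quick check on the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100, 300, 500, 600], 'B': [100, 200, 300, 400]})
pairwise = df[['A', 'B']].max(axis=1)
print((pairwise == np.maximum(df['A'], df['B'])).all())  # True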