How do numpy functions operate on pandas objects internally?

NumPy functions such as np.mean(), np.var(), etc. accept an array-like argument, e.g. an np.ndarray or a list.
But passing in a pandas DataFrame also works. This means that a pandas DataFrame can disguise itself as a NumPy array, which I find a little strange (even though I know the underlying values of a DataFrame are indeed NumPy arrays).
For an object to be array-like, I thought it should be sliceable with integer indexing the way a NumPy array is sliced. So, for instance, df[1:3, 2:3] should work, but it raises an error.
So possibly a DataFrame gets converted into a NumPy array when it goes inside the function. But if that is the case, why does np.mean(numpy_array) give a different result than np.mean(df)?
a = np.random.rand(4,2)
a
Out[13]:
array([[ 0.86688862,  0.09682919],
       [ 0.49629578,  0.78263523],
       [ 0.83552411,  0.71907931],
       [ 0.95039642,  0.71795655]])
np.mean(a)
Out[14]: 0.68320065182041034
gives a different result than what the code below gives...
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))
df
Out[18]:
          0         1
0  0.866889  0.096829
1  0.496296  0.782635
2  0.835524  0.719079
3  0.950396  0.717957
np.mean(df)
Out[21]:
0 0.787276
1 0.579125
dtype: float64
The former output is a single number, whereas the latter is a column-wise mean. How does a NumPy function know about the structure of a DataFrame?

If you step through this:
--Call--
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean
You can see that the type is not an ndarray, so it tries to call a.mean, which in this case is df.mean():
In [6]:
df.mean()
Out[6]:
0 0.572999
1 0.468268
dtype: float64
This is why the output is different.
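To see this dispatch in isolation, here is a minimal sketch using a toy class (purely hypothetical, not pandas code). np.mean hands off to whatever .mean method the argument defines; the exact keyword arguments forwarded vary across NumPy versions:
import numpy as np

class Duck:
    # np.mean sees this is not an ndarray and calls .mean on it,
    # forwarding axis/dtype/out (the exact kwargs depend on the NumPy version).
    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        return 'Duck.mean was called'

np.mean(Duck())  # -> 'Duck.mean was called'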
Code to reproduce above:
In [3]:
a = np.random.rand(4,2)
a
Out[3]:
array([[ 0.96750329,  0.67623187],
       [ 0.44025179,  0.97312747],
       [ 0.07330062,  0.18341157],
       [ 0.81094166,  0.04030253]])
In [4]:
np.mean(a)
Out[4]:
0.52063384885403818
In [5]:
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))

df
Out[5]:
          0         1
0  0.967503  0.676232
1  0.440252  0.973127
2  0.073301  0.183412
3  0.810942  0.040303
numpy output:
In [7]:
np.mean(df)
Out[7]:
0 0.572999
1 0.468268
dtype: float64
If you call .values to get the underlying NumPy array, then the output is the same:
In [8]:
np.mean(df.values)
Out[8]:
0.52063384885403818


Pandas Series - force dtype in Series constructor

I have this very simple series.
pd.Series(np.random.randn(10), dtype=np.int32)
I want to force a dtype, but pandas will overrule my initial setup:
Out[6]:
0 0.764638
1 -1.451616
2 -0.318875
3 -1.882215
4 1.995595
5 -0.497508
6 -1.004066
7 -1.641371
8 -1.271198
9 0.907795
dtype: float64
I know I could do this:
pd.Series(np.random.randn(10), dtype=np.int32).astype("int32")
But my question is: why does pandas not handle the data the way I want in the Series constructor? There is no force parameter or anything like that.
Can somebody explain what happens there, and how I can force the dtype in the Series constructor, or at least get a warning if the output differs from what I wanted initially?
You can use this:
>>> pd.Series(np.random.randn(10).astype(np.int32))
0 0
1 1
2 1
3 1
4 0
5 0
6 -1
7 0
8 0
9 0
dtype: int32
Pandas infers the data type correctly, and you can force your datatype with one exception: if your data is float and you want to force the dtype to intX, this will not work, because pandas does not take responsibility for losing information by truncating the values.
That is why you see this behaviour:
>>> np.random.randn(10).dtype
dtype('float64')
>>> pd.Series(np.random.randn(10)).dtype
dtype('float64') # OK
>>> pd.Series(np.random.randn(10), dtype=np.int32).dtype
dtype('float64') # KO -> Pandas does not truncate the data
>>> np.random.randint(1, 10, 10).dtype
dtype('int64')
>>> pd.Series(np.random.randint(1, 10, 10)).dtype
dtype('int64') # OK
>>> pd.Series(np.random.randint(1, 10, 10), dtype=np.float64).dtype
dtype('float64') # OK -> int64 -> float64 is a widening cast, nothing is truncated
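If you want at least a warning when pandas keeps a dtype other than the one you asked for, a small wrapper can check after construction. This is a hypothetical helper (series_with_dtype is not a pandas API), and note that recent pandas versions may raise an error here instead of silently keeping float64:
import warnings
import numpy as np
import pandas as pd

def series_with_dtype(data, dtype):
    # Hypothetical helper: build the Series, then compare the dtype pandas
    # actually kept against the requested one and warn on a mismatch.
    s = pd.Series(data, dtype=dtype)
    if s.dtype != np.dtype(dtype):
        warnings.warn("requested %s but got %s" % (np.dtype(dtype), s.dtype))
    return s

s = series_with_dtype(np.random.randn(10), np.int32)  # warns: float64 was kept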

Pandas argmax() deprecation message

When using argmax(), it returns this warning:
The current behaviour of Series.argmax is deprecated, use idxmax instead. The behavior of argmax will be corrected to return the positional maximum in the future. For now, use series.values.argmax or np.argmax(np.array(values)) to get the position of the maximum row.
"""Entry point for launching an IPython kernel.
Any ideas what this means? I have used np.argmax(np.array(values)) to get the position of the maximum row, but it just returns the max value. idxmax returns another error.
Here is an example:
import numpy as np
import pandas as pd
In[3]:
mtx = np.random.randn(10)
mtx
Out[3]:
array([-1.47694909, -0.61658367, -1.2609941 ,  0.33956725,  1.69096661,
        0.10680407, -3.53473223,  0.61587513,  2.34405466, -1.49556778])
In[4]:
ser = pd.Series(mtx)
ser
Out[4]:
0 -1.476949
1 -0.616584
2 -1.260994
3 0.339567
4 1.690967
5 0.106804
6 -3.534732
7 0.615875
8 2.344055
9 -1.495568
dtype: float64
In[5]:
ser.idxmax()
Out[5]:
8
In[6]:
ser[ser.idxmax()]
Out[6]:
2.344054659817029
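The distinction the warning is getting at: idxmax returns the index label of the maximum, while argmax (in its corrected behaviour) returns the integer position. With the default RangeIndex the two coincide, which is why Out[5] above looks like a position. A small sketch with a non-default index, where they differ:
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 5.0, 3.0], index=['a', 'b', 'c'])

ser.idxmax()         # 'b' -> the index label of the maximum
ser.values.argmax()  # 1   -> the integer position of the maximum
ser[ser.idxmax()]    # 5.0 -> the maximum value itself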

Vectorized Splitting of String on Character Count using Numpy or Pandas

Is there a way to split a Numpy Array in a vectorized manner based upon character count for each element?
Input:
In [1]: import numpy as np
In [2]: y = np.array([ 'USC00013160194806SNOW','USC00013160194806SNOW','USC00013160194806SNOW' ])
In [3]: y
Out[3]:
array(['USC00013160194806SNOW', 'USC00013160194806SNOW',
       'USC00013160194806SNOW'],
      dtype='|S21')
I want each element of the array split according to a certain number of characters.
Desired Output:
In [3]: y
Out[3]:
array(['USC00013160', 'USC00013160',
       'USC00013160'],
      dtype='|S21')
I've done this using standard Python loops, but I'm dealing with millions of values, so I'm trying to figure out the fastest method.
You can create a view using a data type with the same size as y's dtype that has subfields corresponding to the parts that you want. For example,
In [22]: y
Out[22]:
array(['USC00013160194806SNOW', 'USC00013160194806SNOW',
       'USC00013160194806SNOW'],
      dtype='|S21')
In [23]: dt = np.dtype([('part1', 'S11'), ('part2', 'S6'), ('part3', 'S4')])
In [24]: v = y.view(dt)
In [25]: v['part1']
Out[25]:
array(['USC00013160', 'USC00013160', 'USC00013160'],
      dtype='|S11')
In [26]: v['part2']
Out[26]:
array(['194806', '194806', '194806'],
      dtype='|S6')
In [27]: v['part3']
Out[27]:
array(['SNOW', 'SNOW', 'SNOW'],
      dtype='|S4')
Note that these are all views of the same data in y. If you modify them in place, you are also modifying y. For example,
In [32]: v3 = v['part3']
In [33]: v3
Out[33]:
array(['SNOW', 'SNOW', 'SNOW'],
      dtype='|S4')
Change v3[1] to 'RAIN':
In [34]: v3[1] = 'RAIN'
In [35]: v3
Out[35]:
array(['SNOW', 'RAIN', 'SNOW'],
      dtype='|S4')
Now see that y[1] is also changed:
In [36]: y
Out[36]:
array(['USC00013160194806SNOW', 'USC00013160194806RAIN',
       'USC00013160194806SNOW'],
      dtype='|S21')
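Incidentally, the view only works because the subfield widths sum exactly to the original itemsize; a quick check, continuing the session above:
In [37]: dt.itemsize == y.dtype.itemsize  # 11 + 6 + 4 == 21; a mismatch makes .view(dt) raise
Out[37]: True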
One possible solution I've found is just to do the operation with a pandas Series, but I'm wondering if this can be done using only NumPy array slicing. If not, that is fine; I'm more curious about the best practice.
Starting Pandas Series:
In [33]: x = pd.read_csv("data.txt", delimiter='\n', dtype=str, squeeze=True)
In [34]: x
Out[34]:
0 USC00013160194807SNOW
1 USC00013160194808SNOW
2 USC00013160194809SNOW
3 USC00013160194810SNOW
4 USC00013160194811SNOW
dtype: object
Vectorized String Processing based on Character Count:
In [37]: k = x.str[0:11]
Output:
In [38]: k
Out[38]:
0 USC00013160
1 USC00013160
2 USC00013160
3 USC00013160
4 USC00013160
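For completeness, NumPy itself can do this particular truncation in a vectorized way: casting to a shorter fixed-width byte dtype keeps only the leading bytes of each element (a sketch, assuming fixed-width byte strings as in the question):
import numpy as np

y = np.array([b'USC00013160194806SNOW'] * 3, dtype='|S21')

# Casting to '|S11' keeps only the first 11 bytes of each element.
ids = y.astype('|S11')
# array([b'USC00013160', b'USC00013160', b'USC00013160'], dtype='|S11')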

Extract on "-" delimited in pandas

I have a pandas series with some values like 19.99-20.99 (i.e. two numbers separated by a dash).
How would you just take the left or right value?
Use split("-") on the string and then access the result with index notation, i.e. split_result[1].
Here's an example:
In [5]: my_series = pandas.Series(['19.22-20.11','18.55-34.22','12.33-22.00','13.33-34.23'])
In [6]: my_series[0]
Out[6]: '19.22-20.11'
In [7]: my_series[0].split("-")
Out[7]: ['19.22', '20.11']
In [8]: my_series[0].split("-")[0]
Out[8]: '19.22'
In [9]: my_series[0].split("-")[1]
Out[9]: '20.11'
To do the same for a whole Series at once, use the vectorized .str accessor:
In [1]: s = pd.Series(['19.99-20.99', '20.99-21.99'])
In [2]: s.str.split('-').str[0]
Out[2]:
0 19.99
1 20.99
dtype: object
In [3]: s.str.split('-').str[1]
Out[3]:
0 20.99
1 21.99
dtype: object
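If you want both sides at once as separate columns, .str.split also accepts expand=True (available in reasonably recent pandas versions):
In [4]: s.str.split('-', expand=True)
Out[4]:
       0      1
0  19.99  20.99
1  20.99  21.99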

Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix

I noticed Pandas now has support for Sparse Matrices and Arrays. Currently, I create DataFrame()s like this:
return DataFrame(matrix.toarray(), columns=features, index=observations)
Is there a way to create a SparseDataFrame() with a scipy.sparse.csc_matrix() or csr_matrix()? Converting to dense format kills RAM badly. Thanks!
A direct conversion is not supported ATM. Contributions are welcome!
Try this; it should be OK on memory, as a SparseSeries is much like a csc_matrix (for one column) and pretty space efficient:
In [35]: from scipy.sparse import csc_matrix
In [36]: row = np.array([0,2,2,0,1,2])
In [37]: col = np.array([0,0,1,2,2,2])
In [38]: data = np.array([1,2,3,4,5,6],dtype='float64')
In [39]: m = csc_matrix( (data,(row,col)), shape=(3,3) )
In [40]: m
Out[40]:
<3x3 sparse matrix of type '<type 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Column format>
In [46]: pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
                              for i in np.arange(m.shape[0]) ])
Out[46]:
   0  1  2
0  1  0  4
1  0  0  5
2  2  3  6
In [47]: df = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
                                   for i in np.arange(m.shape[0]) ])
In [48]: type(df)
Out[48]: pandas.sparse.frame.SparseDataFrame
As of pandas v 0.20.0 you can use the SparseDataFrame constructor.
An example from the pandas docs:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sdf = pd.SparseDataFrame(sp_arr)
A much shorter version:
df = pd.DataFrame(m.toarray())
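Note that m.toarray() densifies the matrix first, which defeats the purpose for large inputs. On pandas 1.0 and later, where SparseDataFrame has been removed, the sparse accessor covers this case (a sketch):
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)

# Builds a DataFrame backed by sparse columns without densifying first.
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)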
