How can I get the index or column of a DataFrame as a NumPy array or Python list?
To get a NumPy array, you should use the values attribute:
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
A B
a 1 4
b 2 5
c 3 6
In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)
This accesses how the data is already stored, so there isn't any need for a conversion.
Note: This attribute is also available for many other pandas objects.
In [3]: df['A'].values
Out[3]: array([1, 2, 3])
To get the index as a list, call tolist:
In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']
And similarly, for columns.
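For the columns, a quick sketch along the same lines (using the same df as above):
In [5]: df.columns.values
Out[5]: array(['A', 'B'], dtype=object)
In [6]: df.columns.tolist()
Out[6]: ['A', 'B']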
You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for Series.
pandas >= 0.24
Deprecate your usage of .values in favour of these methods!
From v0.24.0 onwards, we will have two brand spanking new, preferred methods for obtaining NumPy arrays from Index, Series, and DataFrame objects: they are to_numpy(), and .array. Regarding usage, the docs mention:
We haven’t removed or deprecated Series.values or
DataFrame.values, but we highly recommend using .array or
.to_numpy() instead.
See this section of the v0.24.0 release notes for more information.
to_numpy() Method
df.index.to_numpy()
# array(['a', 'b'], dtype=object)
df['A'].to_numpy()
# array([1, 4])
By default, a view is returned when possible, so any modifications made will affect the original.
v = df.index.to_numpy()
v[0] = -1
df
A B
-1 1 2
b 4 5
If you need a copy instead, use to_numpy(copy=True):
v = df.index.to_numpy(copy=True)
v[-1] = -123
df
A B
a 1 2
b 4 5
Note that this function also works for DataFrames (while .array does not).
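As a quick sketch of that, you can call it on the whole frame as well (this assumes the same two-row df used in the setup shown just below):
df.to_numpy()
# array([[1, 2],
#        [4, 5]])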
array Attribute
This attribute returns an ExtensionArray object that backs the Index/Series.
pd.__version__
# '0.24.0rc1'
# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df
A B
a 1 2
b 4 5
df.index.array
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object
df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64
From here, it is possible to get a list using list:
list(df.index.array)
# ['a', 'b']
list(df['A'].array)
# [1, 4]
or, just directly call .tolist():
df.index.tolist()
# ['a', 'b']
df['A'].tolist()
# [1, 4]
Regarding what is returned, the docs mention,
For Series and Indexes backed by normal NumPy arrays, Series.array
will return a new arrays.PandasArray, which is a thin (no-copy)
wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially
useful on its own, but it does provide the same interface as any
extension array defined in pandas or by a third-party library.
So, to summarise, .array will return either
The existing ExtensionArray backing the Index/Series, or
If there is a NumPy array backing the series, a new ExtensionArray object is created as a thin wrapper over the underlying array.
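A short sketch of both cases (the categorical Series here is made up purely for illustration):
s_cat = pd.Series(['a', 'b', 'a'], dtype='category')
s_cat.array    # the existing Categorical ExtensionArray, returned as-is

df['A'].array  # a new PandasArray: a thin, no-copy wrapper over the int64 ndarray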
Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the
actual array, some transformation of it, or one of pandas custom
arrays (like Categorical). For example, with PeriodIndex, .values
generates a new ndarray of period objects each time. [...]
These two functions aim to improve the consistency of the API, which is a major step in the right direction.
Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API as soon as they can.
If you are dealing with a multi-index DataFrame, you may be interested in extracting the values of just one named level of the MultiIndex. You can do this as
df.index.get_level_values('name_sub_index')
and of course name_sub_index must be an element of the FrozenList df.index.names
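A minimal sketch (the MultiIndex, level names, and values here are made up for illustration):
midx = pd.MultiIndex.from_tuples([('x', 1), ('y', 2)], names=['letter', 'number'])
mdf = pd.DataFrame({'val': [10, 20]}, index=midx)

mdf.index.get_level_values('number').tolist()  # [1, 2]
mdf.index.names                                # FrozenList(['letter', 'number'])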
Since pandas v0.13 you can also use get_values (note that it has since been deprecated in newer versions in favour of to_numpy):
df.index.get_values()
A more recent way to do this is to use the .to_numpy() function.
If I have a dataframe with a column 'price', I can convert it as follows:
priceArray = df['price'].to_numpy()
You can also pass the data type, such as float or object, as an argument of the function.
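For example, a one-line sketch, still using the hypothetical 'price' column:
priceArray = df['price'].to_numpy(dtype='float64')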
I converted the pandas DataFrame to a list and then used the basic list.index(). Something like this:
dd = list(zone[0])  # where zone[0] is some specific column of the table
idx = dd.index(filename[i])
You now have your index value as idx.
Below is a simple way to convert a dataframe column into a NumPy array.
df = pd.DataFrame(somedict)
ytrain = df['label']
ytrain_numpy = np.array([x for x in ytrain])
ytrain_numpy is a NumPy array.
I tried with to_numpy(), but it gave me the below error:
TypeError: no supported conversion for types: (dtype('O'),) while doing Binary Relevance classification using Linear SVC.
to_numpy() was converting the DataFrame column into a NumPy array, but the inner elements' data type was list, because of which the above error was observed.
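For what it's worth, here is a small sketch of that situation (a made-up column whose cells are Python lists):
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': [[0, 1], [1, 0]]})
df['label'].to_numpy().dtype              # dtype('O') -- each cell is still a list
np.array([x for x in df['label']]).shape  # (2, 2) -- the comprehension stacks the lists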
I am trying to pass a series to a user defined function and getting this error:
Function:
def scale(series):
    sc = StandardScaler()
    sc.fit_transform(series)
    print(series)
Code for calling:
df['Value'].apply(scale) # df['Value'] is a Series having float dtype.
Error:
ValueError: Expected 2D array, got scalar array instead:
array=28.69.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Can anyone help address this issue?
The method apply will apply a function to each element in the Series (or in case of a DataFrame either each row or each column depending on the chosen axis). Here you expect your function to process the entire Series and to output a new Series in its stead.
You can therefore simply run:
StandardScaler().fit_transform(df['Value'].values.reshape(-1, 1))
StandardScaler expects a 2D array as input, where each row is a sample that consists of one or more features. Even if it is just a single feature (as seems to be the case in your example), it has to have the right dimensions. Therefore, before handing your Series over to sklearn, I am accessing the values (the NumPy representation) and reshaping them accordingly.
For more details on reshape(-1, ...) check this out: What does -1 mean in numpy reshape?
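If you would rather keep a helper function like the one in the question, here is a minimal sketch of one way to rewrite it so that it takes the whole Series (returning a Series instead of printing is my own choice, not part of the original code):
from sklearn.preprocessing import StandardScaler
import pandas as pd

def scale(series):
    # Reshape the 1D Series into the (n_samples, 1) shape sklearn expects,
    # then flatten the scaled output back into a Series with the original index.
    sc = StandardScaler()
    scaled = sc.fit_transform(series.values.reshape(-1, 1))
    return pd.Series(scaled.ravel(), index=series.index)

df['Value'] = scale(df['Value'])  # call it on the Series directly, not via .apply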
Now, the best bit. If your entire DataFrame consists of a single column you could simply do:
StandardScaler().fit_transform(df)
And even if it doesn't, you could still avoid the reshape:
StandardScaler().fit_transform(df[['Value']])
Note how in this case 'Value' is surrounded by two sets of square brackets, so this time it is not a Series but rather a DataFrame with a subset of the original columns (in case you do not want to scale all of them). Since a DataFrame is already 2-dimensional, you don't need to worry about reshaping.
Finally, if you want to scale just some of the columns and update your original DataFrame all you have to do is:
>>> df = pd.DataFrame({'A': [1,2,3], 'B': [0,5,6], 'C': [7, 8, 9]})
>>> columns_to_scale = ['A', 'B']
>>> df[columns_to_scale] = StandardScaler().fit_transform(df[columns_to_scale])
>>> df
A B C
0 -1.224745 -1.397001 7
1 0.000000 0.508001 8
2 1.224745 0.889001 9
I recently started learning python for data analysis and I am having problems trying to understand some cases of object assignment when using pandas DataFrame and Series.
First of all, I understand that changing the value of one object will not change another object whose value was assigned from the first one. The typical:
a = 7
b = a
a = 12
So far a = 12 and b = 7. But when using Pandas I have the following situation:
import pandas as pd
my_df = pd.DataFrame({'Col1': [2, 7, 9],'Col2': [1, 6, 12],'Col3': [1, 6, 9]})
pd_colnames = pd.Series(my_df.columns.values)
list_colnames = list(my_df.columns.values)
Now these two objects contain the same text, one as a pd.Series and the other as a list. But if I change some column names, the values change:
>>> my_df.columns.values[0:2] = ['a','b']
>>> pd_colnames
0 a
1 b
2 Col3
dtype: object
>>> list_colnames
['Col1', 'Col2', 'Col3']
Can somebody explain to me why the values did not change when using the built-in list, while with pandas.Series the values changed when I modified the data frame?
And what can I do to avoid this behavior in pandas.Series? I have a data frame whose column names I sometimes need in English and sometimes in Spanish, and I'd like to be able to keep both as pandas.Series objects in order to interact with them.
This is because list() is creating a new object (a copy) in list_colnames = list(my_df.columns.values). This is easily tested:
a = [1, 2, 3]
b = list(a)
a[0] = 5
print(b)
---> [1, 2, 3]
Once you create that copy, list_colnames is completely detached from the initial df (including the array of column names).
Conversely, my_df.columns.values gives you access to the underlying numpy array for the column names. You can see that with print(type(my_df.columns.values)). When you create a Series from this array, it has no need to create a copy, so the values in your Series are still linked to the column names of my_df (they are the same object).
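To address the 'how do I avoid this' part of the question: build the Series from a copy instead of from the live array. A minimal sketch:
# Copy the underlying array explicitly before wrapping it in a Series...
pd_colnames = pd.Series(my_df.columns.values.copy())
# ...or go through a plain list, which also copies the names.
pd_colnames = pd.Series(list(my_df.columns))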
First of all, I understand that changing the value of one object, will not change another object which value was assigned in the first one.
This is only true for immutable types (int, float, bool, str, tuple, unicode), and not mutable types (list, set, dict). See more here.
>>> a = [1, 2, 3]
>>> b = a
>>> b[0] = 4
>>> a
[4, 2, 3]
What is going on is that list_colnames is a copy of pd_colnames (made through the call to the list function), whereas pd_colnames is a mutable object that is still linked to my_df.
From the reindex docs:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
Therefore, I thought that by setting copy=False I would get the DataFrame reordered in place (!). It appears, however, that I do get a copy and need to assign it to the original object again. I don't want to assign it back, if I can avoid it (the reason comes from this other question).
This is what I am doing:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 5))
df.columns = [ 'a', 'b', 'c', 'd', 'e' ]
df.head()
Output:
a b c d e
0 0.234296 0.011235 0.664617 0.983243 0.177639
1 0.378308 0.659315 0.949093 0.872945 0.383024
2 0.976728 0.419274 0.993282 0.668539 0.970228
3 0.322936 0.555642 0.862659 0.134570 0.675897
4 0.167638 0.578831 0.141339 0.232592 0.976057
Reindex gives me the correct output, but I'd need to assign it back to the original object, which is what I wanted to avoid by using copy=False:
df.reindex( columns=['e', 'd', 'c', 'b', 'a'], copy=False )
The desired output after that line is:
e d c b a
0 0.177639 0.983243 0.664617 0.011235 0.234296
1 0.383024 0.872945 0.949093 0.659315 0.378308
2 0.970228 0.668539 0.993282 0.419274 0.976728
3 0.675897 0.134570 0.862659 0.555642 0.322936
4 0.976057 0.232592 0.141339 0.578831 0.167638
Why is copy=False not working in place?
Is it possible to do that at all?
Working with python 3.5.3, pandas 0.23.3
reindex is a structural change, not a cosmetic or transformative one. As such, a copy is always returned because the operation cannot be done in-place (it would require allocating new memory for the underlying arrays, etc.). This means you have to assign the result back; there's no other choice.
df = df.reindex(['e', 'd', 'c', 'b', 'a'], axis=1)
Also see the discussion on GH21598.
The one corner case where copy=False is actually of any use is when the indices used to reindex df are identical to the ones it already has. You can check by comparing the ids:
id(df)
# 4839372504
id(df.reindex(df.index, copy=False)) # same object returned
# 4839372504
id(df.reindex(df.index, copy=True)) # new object created - ids are different
# 4839371608
A bit off topic, but I believe this would rearrange the columns in place
for i, colname in enumerate(list_of_columns_in_desired_order):
    col = dataset.pop(colname)
    dataset.insert(i, colname, col)
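As a usage sketch with the df from the question above (the variable names are mine, mirroring the snippet):
dataset = df  # operate directly on the original frame
list_of_columns_in_desired_order = ['e', 'd', 'c', 'b', 'a']
for i, colname in enumerate(list_of_columns_in_desired_order):
    col = dataset.pop(colname)       # pop removes the column in place...
    dataset.insert(i, colname, col)  # ...and insert puts it back at position i
df.head()  # columns now read e, d, c, b, a, without df ever being rebound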
I'm trying to update a column from float to int. Consider df in the following two scenarios:
df = pd.DataFrame(dict(A=[1.1, 2], B=[1., 2]))
print(df.A.dtype)
df.loc[:, ['A']] = df[['A']].astype(int)
print(df.A.dtype)
df
The dtype failed to update to int but the value in 'A' is definitely truncated.
However,
df = pd.DataFrame(dict(A=[1.1, 2], B=[1., 2]))
print(df.A.dtype)
df.loc[:, 'A'] = df.A.astype(int)
print(df.A.dtype)
df
works just fine.
Is there a justification for these behaving differently?
Right from the documentation:
Note When trying to convert a subset of columns to a specified type
using astype() and loc(), upcasting occurs. loc() tries to fit in what
we are assigning to the current dtypes, while [] will overwrite them
taking the dtype from the right hand side. Therefore the following
piece of code produces the unintended result.
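A small sketch of that difference using the df from the question (exact behaviour can vary across pandas versions, so treat the printed dtypes as indicative):
import pandas as pd

df = pd.DataFrame(dict(A=[1.1, 2], B=[1., 2]))
df.loc[:, ['A']] = df[['A']].astype(int)  # .loc fits the ints into the existing float64 column
print(df.A.dtype)                         # float64 -- values truncated, dtype unchanged

df = pd.DataFrame(dict(A=[1.1, 2], B=[1., 2]))
df['A'] = df['A'].astype(int)             # [] replaces the column, taking the dtype from the RHS
print(df.A.dtype)                         # int64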
Bit of a general question, but I have been using pandas for more than a year now and I keep getting into trouble when I have mixed types in pandas DataFrame columns. I frequently have a DataFrame that comes in like this:
df2 =
0 1 2 3 4
val_str test test test test test
val_date 2014-01-15 2014-01-15 2014-01-15 2014-01-15 2014-01-15
val_float 1.5 1.5 1.5 1.5 1.5
val_int 1 1 1 1 1
as example generated by:
import pandas as pd
import datetime
df = pd.DataFrame(index=range(5))
df['val_str'] = "test"
df['val_date']= datetime.datetime(2014,1,15)
df['val_bool'] = True
df['val_float'] = 1.5
df['val_int'] = 1
df2=df.T
Convoluted example, but the data comes from Excel, CSV, etc., and a lot of the time the rows have consistent datatypes instead of the columns.
Pandas seems to handle its methods (mostly) well with this kind of data, but I frequently get unexpected results when selecting or trying to do boolean operations with the data.
Selecting data with e.g.
df2[2]['val_bool']  # seems to work without problem
seems to work well, even pulling rows out with e.g.:
df2.ix['val_bool'] # works fine
seems to work as expected. I frequently run into problems trying to use this slice to further select data.
df2.ix['val_bool'].dtype
>>> dtype('O')
# trying boolean operations on this gives numeric results?
Is there any pandas guideline as to whether this could cause problems? I have gone back to some of the initial tutorials and gathered that columns "should" have consistent datatypes. Pandas' flexibility, however, allows you to do this, but do some of the methods break? I vaguely remember one of Wes McKinney's talks where he mentioned that:
df.T.T != df
What are the differences and what should I be careful of when the columns in a DataFrame do not have consistent datatypes?
Datatypes are column based. Doing a transpose df.T on a mixed-type frame will necessarily convert to a type that can hold all of the types, meaning that a string and a float will yield an object dtype.
So df.T.T != df, but you can do df.T.T.convert_objects(), which will generally succeed in converting the object dtypes back to basic types.
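Note that convert_objects has since been removed from pandas; infer_objects() is the closest modern equivalent. A minimal sketch (numeric columns come back from object dtype; bool/datetime handling may vary by version):
df.T.T.infer_objects().dtypes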
Under the hood, Pandas stores columns or groups of columns with the same dtype in a Block. Thus, you can think of all the float columns being stored in one big array, and all the string columns in another array, etc.
When you have heterogeneous column data, such as in df2 above, every value is stored in an array of dtype object:
In [154]: df2._data
Out[154]:
BlockManager
Items: Int64Index([0, 1, 2, 3, 4], dtype='int64')
Axis 1: Index([u'val_str', u'val_date', u'val_bool', u'val_float', u'val_int'], dtype='object')
ObjectBlock: [0, 1, 2, 3, 4], 5 x 5, dtype: object
This is the worst kind of dtype to have, since it enjoys none of the speed advantages offered by NumPy numeric types. Moreover, some NumPy (and maybe Pandas) functions raise exceptions when operating on arrays of object dtype.
Even when you select a row which has only float values, you get back an array of object dtype:
In [149]: df2.loc['val_float'].dtype
Out[149]: dtype('O')
So the best way to take advantage of pandas is to load the data in a way which allows whole columns to have NumPy dtypes other than object, and never transpose (unless the entire DataFrame is of homogeneous dtype).
Note how the columns of df are segregated into blocks of different dtype. This is much better than df2's one big ObjectBlock.
In [155]: df._data
Out[155]:
BlockManager
Items: Index([u'val_str', u'val_date', u'val_bool', u'val_float', u'val_int'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
ObjectBlock: [val_str], 1 x 5, dtype: object
DatetimeBlock: [val_date], 1 x 5, dtype: datetime64[ns]
BoolBlock: [val_bool], 1 x 5, dtype: bool
FloatBlock: [val_float], 1 x 5, dtype: float64
IntBlock: [val_int], 1 x 5, dtype: int64