I recently started learning Python for data analysis and I am having trouble understanding some cases of object assignment when using pandas DataFrame and Series.
First of all, I understand that reassigning one variable will not change another variable whose value was assigned from the first. The typical:
a = 7
b = a
a = 12
So far a = 12 and b = 7. But when using Pandas I have the following situation:
import pandas as pd
my_df = pd.DataFrame({'Col1': [2, 7, 9],'Col2': [1, 6, 12],'Col3': [1, 6, 9]})
pd_colnames = pd.Series(my_df.columns.values)
list_colnames = list(my_df.columns.values)
Now these two objects contain the same text, one as a pd.Series and the other as a list. But if I change some of the column names, the Series values change:
>>> my_df.columns.values[0:2] = ['a','b']
>>> pd_colnames
0 a
1 b
2 Col3
dtype: object
>>> list_colnames
['Col1', 'Col2', 'Col3']
Can somebody explain to me why the values did not change when using the built-in list, while with pandas.Series the values changed when I modified the data frame?
And what can I do to avoid this behavior in pandas.Series? I have a data frame whose column names I sometimes need in English and sometimes in Spanish, and I'd like to keep both as pandas.Series objects in order to interact with them.
This is because list() is creating a new object (a copy) in list_colnames = list(my_df.columns.values). This is easily tested:
a = [1, 2, 3]
b = list(a)
a[0] = 5
print(b)
---> [1, 2, 3]
Once you create that copy, list_colnames is completely detached from the initial df (including the array of column names).
Conversely, my_df.columns.values gives you access to the underlying numpy array for the column names. You can see that with print(type(my_df.columns.values)). When you create a Series from this array, it has no need to create a copy, so the values in your Series are still linked to the column names of my_df (they are the same object).
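If you want a Series that is decoupled from the DataFrame, you can force a copy at construction time. A minimal sketch, using the copy argument of the pd.Series constructor:
pd_colnames = pd.Series(my_df.columns.values, copy=True)  # Series owns its own data
my_df.columns.values[0] = 'a'
pd_colnames[0]
# 'Col1' -- the copy is unaffected
This way you could keep both an English and a Spanish Series of column names, and later renames on the DataFrame will not touch either of them.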
First of all, I understand that changing the value of one object, will not change another object which value was assigned in the first one.
This is only guaranteed for immutable types (int, float, bool, str, tuple), not for mutable types (list, set, dict): mutating a shared mutable object through one name is visible through every other name bound to it.
>>> a = [1, 2, 3]
>>> b = a
>>> b[0] = 4
>>> a
[4, 2, 3]
What is going on is that list_colnames is a copy of the column values (made by the list() call), while pd_colnames is a mutable object that still references the column array of my_df.
I'm trying to use pandas' read_csv with the dtype parameter set to CategoricalDtype. It does generate the DataFrame with categories as expected but I have noticed that the categories themselves are object type instead of some kind of int. For example,
import pandas as pd
from io import StringIO
data = 'data\n1\n2\n3\n'
df = pd.read_csv(StringIO(data), dtype=pd.CategoricalDtype())
df['data']
results in
0 1
1 2
2 3
Name: data, dtype: category
Categories (3, object): ['1', '2', '3']
This is a bit surprising because if I create a list of numbers and then generate a Series, without using read_csv, the categories are int64.
lst = [1, 2, 3]
pd.Series(lst, dtype=pd.CategoricalDtype())
results in
0 1
1 2
2 3
dtype: category
Categories (3, int64): [1, 2, 3]
I know I can pass the categories explicitly to the CategoricalDtype to circumvent this, but this is a bit annoying. Is this behaviour expected?
Yes, this behavior is expected. When reading a CSV, all data starts out as strings, and pandas essentially guesses (intelligently) whether a column is supposed to be something else after parsing the data (unless given a dtype beforehand). This is probably an oversimplification of how pandas interprets text-based files, so someone please correct me if I'm wrong or can add more information.
If you remove the manual dtype in your pd.read_csv, pandas will read in your data and then accurately guess that the column should be of an int dtype. By manually setting dtype=pd.CategoricalDtype() (note you can also achieve the result with dtype="category"), pandas skips the implicit conversion to an int dtype and converts the raw strings directly to a CategoricalDtype, which is why your categories have an object dtype.
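If you do want integer categories straight from read_csv, one workaround (a sketch, not part of the original question) is to let pandas infer the numeric dtype first and only then convert:
import pandas as pd
from io import StringIO

data = 'data\n1\n2\n3\n'
# no dtype given: pandas infers int64, then we convert to categorical
df = pd.read_csv(StringIO(data)).astype('category')
df['data']
# 0    1
# 1    2
# 2    3
# Name: data, dtype: category
# Categories (3, int64): [1, 2, 3]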
In your second example, the data in your list lst are all numeric. Since you aren't explicitly supplying the categories, pandas draws on the unique values in lst to create them, and since all the values in lst are ints, the resulting categories have an int dtype. If you want the categories in your second example to be strings, you'll need to recast lst to contain strings (e.g. lst = [str(x) for x in lst]), or better yet, you can replace the underlying categories with a string version after creating the Series.
lst = [1, 2, 3]
s = pd.Series(lst, dtype=pd.CategoricalDtype())
# replace underlying categories with a string version
s = s.cat.rename_categories(s.cat.categories.astype(str))
print(s)
0 1
1 2
2 3
dtype: category
Categories (3, object): ['1', '2', '3']
How can I get the index or column of a DataFrame as a NumPy array or Python list?
To get a NumPy array, you should use the values attribute:
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
A B
a 1 4
b 2 5
c 3 6
In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)
This accesses how the data is already stored, so there isn't any need for a conversion.
Note: This attribute is also available for many other pandas objects.
In [3]: df['A'].values
Out[3]: Out[16]: array([1, 2, 3])
To get the index as a list, call tolist:
In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']
And similarly, for columns.
You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for Series.
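For example:
df = pd.DataFrame({'col': [1, 2, 3]}, index=['a', 'b', 'c'])
df.index.tolist()
# ['a', 'b', 'c']
df['col'].tolist()
# [1, 2, 3]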
pandas >= 0.24
Deprecate your usage of .values in favour of these methods!
From v0.24.0 onwards, we will have two brand spanking new, preferred methods for obtaining NumPy arrays from Index, Series, and DataFrame objects: they are to_numpy(), and .array. Regarding usage, the docs mention:
We haven’t removed or deprecated Series.values or
DataFrame.values, but we highly recommend using .array or
.to_numpy() instead.
See this section of the v0.24.0 release notes for more information.
to_numpy() Method
df.index.to_numpy()
# array(['a', 'b'], dtype=object)
df['A'].to_numpy()
# array([1, 4])
By default, a view is returned. Any modifications made will affect the original.
v = df.index.to_numpy()
v[0] = -1
df
A B
-1 1 2
b 4 5
If you need a copy instead, use to_numpy(copy=True):
v = df.index.to_numpy(copy=True)
v[-1] = -123
df
A B
a 1 2
b 4 5
Note that this function also works for DataFrames (while .array does not).
array Attribute
This attribute returns an ExtensionArray object that backs the Index/Series.
pd.__version__
# '0.24.0rc1'
# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df
A B
a 1 2
b 4 5
df.index.array
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object
df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64
From here, it is possible to get a list using list:
list(df.index.array)
# ['a', 'b']
list(df['A'].array)
# [1, 4]
or, just directly call .tolist():
df.index.tolist()
# ['a', 'b']
df['A'].tolist()
# [1, 4]
Regarding what is returned, the docs mention,
For Series and Indexes backed by normal NumPy arrays, Series.array
will return a new arrays.PandasArray, which is a thin (no-copy)
wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially
useful on its own, but it does provide the same interface as any
extension array defined in pandas or by a third-party library.
So, to summarise, .array will return either
1. the existing ExtensionArray backing the Index/Series, or
2. if there is a NumPy array backing the Series, a new ExtensionArray object created as a thin wrapper over the underlying array.
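For example, for a Series backed by a Categorical, .array returns the existing Categorical itself rather than a new wrapper:
s = pd.Series(pd.Categorical(['a', 'b', 'a']))
s.array
# ['a', 'b', 'a']
# Categories (2, object): ['a', 'b']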
Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the
actual array, some transformation of it, or one of pandas custom
arrays (like Categorical). For example, with PeriodIndex, .values
generates a new ndarray of period objects each time. [...]
These two functions aim to improve the consistency of the API, which is a major step in the right direction.
Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge you to migrate towards the newer API as soon as you can.
If you are dealing with a multi-index dataframe, you may be interested in extracting only the values of one named level of the multi-index. You can do this as
df.index.get_level_values('name_sub_index')
and of course name_sub_index must be an element of the FrozenList df.index.names
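A minimal sketch (the level name 'name_sub_index' is illustrative):
idx = pd.MultiIndex.from_tuples([('x', 1), ('y', 2)], names=['name_sub_index', 'other'])
df = pd.DataFrame({'val': [10, 20]}, index=idx)
df.index.get_level_values('name_sub_index')
# Index(['x', 'y'], dtype='object', name='name_sub_index')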
Since pandas v0.13 you can also use get_values:
df.index.get_values()
Note, however, that get_values has since been deprecated in favour of to_numpy().
A more recent way to do this is to use the .to_numpy() function.
If I have a dataframe with a column 'price', I can convert it as follows:
priceArray = df['price'].to_numpy()
You can also pass the data type, such as float or object, as an argument of the function.
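For example (a sketch assuming a 'price' column of numeric strings):
df = pd.DataFrame({'price': ['1.5', '2.0', '3.25']})
priceArray = df['price'].to_numpy(dtype=float)  # cast during the conversion
# array([1.5 , 2.  , 3.25])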
I converted the pandas dataframe to a list and then used the basic list.index(). Something like this:
dd = list(zone[0])  # where zone[0] is some specific column of the table
idx = dd.index(filename[i])
You now have your index value as idx.
Below is a simple way to convert a dataframe column into a NumPy array.
import numpy as np
import pandas as pd

df = pd.DataFrame(somedict)
ytrain = df['label']
ytrain_numpy = np.array([x for x in ytrain])
ytrain_numpy is a NumPy array.
I tried with to_numpy(), but it gave me the below error:
TypeError: no supported conversion for types: (dtype('O'),) while doing Binary Relevance classification using LinearSVC.
to_numpy() was converting the DataFrame into a NumPy array, but the inner elements' data type was list, which is why the above error was observed.
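To sketch the situation (with hypothetical data): when every element of a column is itself a list, to_numpy() produces an object array, and one workaround is to stack the inner lists into a proper 2-D numeric array:
import numpy as np
import pandas as pd

s = pd.Series([[1, 2], [3, 4]])  # each element is a list -> dtype('O')
s.to_numpy().dtype
# dtype('O')
np.stack(s.to_numpy()).dtype     # stack the inner lists row by row
# dtype('int64')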
Consider the numpy.array i
i = np.empty((1,), dtype=object)
i[0] = [1, 2]
i
array([list([1, 2])], dtype=object)
Example 1
index
df = pd.DataFrame([1], index=i)
df
0
[1, 2] 1
Example 2
columns
But
df = pd.DataFrame([1], columns=i)
Leads to this when I display it
df
TypeError: unhashable type: 'list'
However, df.T works!?
Question
Why is it necessary for index values to be hashable in a column context but not in an index context? And why only when it's displayed?
This is because of how pandas internally determines the string representation of the DataFrame object. Essentially, the difference between column labels and index labels here is that the column's dtype determines the format of its string representation (as a column could hold floats, ints, etc.).
The error thus happens because pandas stores a separate formatter object for each column in a dictionary and this object is retrieved using the column name. Specifically, the line that triggers the error is https://github.com/pandas-dev/pandas/blob/d1accd032b648c9affd6dce1f81feb9c99422483/pandas/io/formats/format.py#L420
The "unhashable type" error usually means that the type, list in this case, is mutable. Mutable types aren't hashable, because they may change after they have produced the hash code. This happens because you are trying to retrieve an item using a list as a key, but since a key has to be hashable, the retrieval fails.
From the pandas documentation, I gather that Series.axes will return a list, and indeed it is a list
$ python3 process_data.py
<class 'list'>
However, when I attempt to print the contents of that list, I get this.
Running print directly:
print(row.axes)
$ python3 process_data.py
Index(['rank', 'name','high', 'low', 'analysis'],
dtype='object')
Which doesn't look like a normal list at all.
>>> [1,2,3,4,5]
[1, 2, 3, 4, 5]
I can still access the information in this strange list with list_name[0][index], as if it were a two-dimensional list. If its type really is list, how can it behave like this? And if it is a numpy-array-like object, why is its type still list?
EDIT:
def process_nextfile(date, catagory):
    df1 = pd.read_csv('{}all_csv/{}/catagory{}'.format(BASE_DIR, date, catagory), header=None, names=CATAGORY_HEADER[catagory - 1])
    for index, row in df1.iterrows():
        print(row.axes)
        break

if __name__ == '__main__':
    process_nextfile('2016-04-05', 2)
When you use iterrows(), every row is a pandas Series, and its axes attribute returns a list containing the row's index. So what the list contains are Index objects; check this simple example:
s = pd.Series([1,2,3])
s.axes
# [RangeIndex(start=0, stop=3, step=1)]
To get a normal list, you can access the index object and then convert it to a list:
s.axes[0].tolist()
# [0, 1, 2]
I have a string:
str='ABCDEFG'
I also have numpy arrays defined:
A=numpy.array([1,2,3])
B=numpy.array([2,3,4])
Now I want to be able to convert the string into a numpy array with the rows defined by these variables:
str=[[1,2,3],[2,3,4],...]
These are very long strings and I would rather not loop through them with a find and replace type of operation.
List comprehension for the win:
In[18]: str='ABCDEFG'
In[19]: A=[1,2,3]
B=[2,3,4]
In[20]: [globals()[x] for x in str if x in globals()]
Out[20]: [[1, 2, 3], [2, 3, 4]]
This uses globals() rather than locals(): in Python 3 a list comprehension has its own scope, so locals() called inside it would not see A and B. Inside a function, capture the namespace first (e.g. names = locals()) and use that dict in the comprehension.
I wouldn't recommend the approach you're proposing. Managing variables and variable names is really the job of the programmer, not the program.
It seems like what you're trying to do would be accomplished easier with a DataFrame (i.e. table) object. Each row of the dataframe would have an identifier (a character within 'ABCDEFG' in your case).
I'd recommend checking out the pandas library: http://pandas.pydata.org/. It fits your use case well with minimal code:
import pandas as pd
rownames = list('AB')
dataframe = pd.DataFrame([[1,2,3],[2,3,4]], index=rownames)
dataframe.loc['B'] # Returns [2, 3, 4] as a Series