Can't Select A Cell In Pandas

This is a simple noob question, but it is vexing me. Following a tutorial, I want to select the first value in column "A". The tutorial says run print(df[0]['A']) but Python3 gives me an error. However, it works perfectly if I use print(df[0:1]['A']). Why is that?
Here is the full code for replication:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 3), index=pd.date_range('1/1/2000', periods=100), columns=['A', 'B', 'C'])
print(df[0:1]['A'])

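Because df[0]['A'] means "column 0, then index 'A'", and this frame has no column labelled 0, so it raises a KeyError. Use df.iloc[0]['A'] or df['A'].iloc[0] instead. df['A'][0] and df.ix[0]['A'] were also options when this was written, but .ix was removed in pandas 1.0, and df['A'][0] relies on an integer-position fallback that recent versions deprecate.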
see here for the indexing and slicing.
see here for when you get a copy as opposed to a view.
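As a minimal sketch of the options above, using the frame from the question (the variable names are only illustrative, and df.iat is an extra accessor not mentioned in the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randn(100, 3),
    index=pd.date_range("2000-01-01", periods=100),
    columns=["A", "B", "C"],
)

first_by_position = df["A"].iloc[0]                  # column by label, then row by position
first_by_row_then_col = df.iloc[0]["A"]              # row by position, then column by label
first_by_label = df.loc[df.index[0], "A"]            # row label and column label in one call
first_scalar = df.iat[0, df.columns.get_loc("A")]    # scalar access by integer position

try:
    df[0]["A"]                                       # looks for a COLUMN labelled 0
except KeyError as exc:
    print("KeyError:", exc)
```

All four expressions return the same value; only df[0]['A'] fails, because plain [] with a scalar selects a column.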

See the selecting ranges section of the docs. As mentioned:
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.
The flip side being that this is inconsistent.
It's worth mentioning that you can often be explicit with loc/iloc:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])
In [12]: df['A']
Out[12]:
0 1
1 3
2 5
Name: A, dtype: int64
In [13]: df.loc[:, 'A'] # equivalently
Out[13]:
0 1
1 3
2 5
Name: A, dtype: int64
In [14]: df.iloc[:, 0] # accessing column by position
Out[14]:
0 1
1 3
2 5
Name: A, dtype: int64
It's worth mentioning another inconsistency with slicing:
In [15]: df.loc[0:1, 'A']
Out[15]:
0 1
1 3
dtype: int64
In [16]: df.iloc[0:1, 0] # doesn't include 1th row
Out[16]:
0 1
dtype: int64
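A short sketch of that endpoint difference, assuming the same three-row frame (the names by_label and by_position are only illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])

by_label = df.loc[0:1, "A"]      # label slice: the end label 1 is included
by_position = df.iloc[0:1, 0]    # position slice: the end position 1 is excluded

print(len(by_label), len(by_position))
```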
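To select with a position and a label, older pandas offered ix (deprecated in 0.20 and removed in 1.0; on current versions use loc or iloc instead):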
In [17]: df.ix[0:1, 'A']
Out[17]:
0 1
1 3
Name: A, dtype: int64
Note labels take precedence with ix.
It's worth emphasising that assignment is guaranteed to work with a single loc/iloc (or ix, on versions that still have it), but may fail when chaining:
In [18]: df.loc[0:1, 'A'] = 7 # works (df.ix[0:1, 'A'] = 7 on pandas < 1.0)
In [19]: df['A'][0:1] = 7 # *sometimes* works, avoid!
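A hedged sketch of single-call assignment, assuming a pandas version where .ix is no longer available, so only loc and iloc are used:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])

# One indexing call on the frame itself, so pandas knows which object to modify.
df.loc[0:1, "A"] = 7                        # label-based: rows 0 and 1 of column A
df.iloc[2, df.columns.get_loc("B")] = 0     # position-based: last row of column B

print(df)
```

The chained form df['A'][0:1] = 7 is deliberately left out: whether it modifies df depends on the pandas version and on whether the intermediate object is a view or a copy.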

Related

TypeError: 'method' object is not iterable - Python [duplicate]

In both of the cases below:
import pandas
d = {'col1': 2, 'col2': 2.5}
df = pandas.DataFrame(data=d, index=[0])
print(df['col2'])
print(df.col2)
Both methods can be used to index on a column and yield the same result, so is there any difference between them?

The "dot notation", i.e. df.col2 is the attribute access that's exposed as a convenience.
You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute:
df['col2'] does the same: it returns a pd.Series of the column.
A few caveats about attribute access:
you cannot add a column (df.new_col = x won't work; worse, it silently creates a new attribute rather than a column - think monkey-patching here)
it won't work if you have spaces in the column name or if the column name is an integer.
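The first caveat can be checked with a small sketch (the names new_col and real_col are hypothetical; the warning filter is only there because newer pandas versions warn on this assignment):

```python
import warnings

import pandas as pd

df = pd.DataFrame({"col1": [2], "col2": [2.5]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # newer pandas emits a UserWarning here
    df.new_col = [1]                  # sets a plain Python attribute, not a column

df["real_col"] = [1]                  # bracket assignment creates an actual column

print(list(df.columns))
```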

They are the same as long as you're accessing a single column with a simple name, but you can do more with the bracket notation. You can only use df.col if the column name is a valid Python identifier (e.g., it does not contain spaces and other such stuff). Also, you may encounter surprises if your column name clashes with a pandas method name (like sum). With brackets you can select multiple columns (e.g., df[['col1', 'col2']]) or add a new column (df['newcol'] = ...), which can't be done with dot access.
The other question you linked to applies, but that is a much more general question. Python objects get to define how the . and [] operators apply to them. Pandas DataFrames have chosen to make them the same for this limited case of accessing single columns, with the caveats described above.
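A brief sketch of what brackets allow beyond dot access; the extra column "my col" is a hypothetical addition to the question's frame, used only to show a name with a space:

```python
import pandas as pd

df = pd.DataFrame({"col1": [2], "col2": [2.5], "my col": [1]})

single = df["col2"]              # one label -> Series
subset = df[["col1", "col2"]]    # list of labels -> DataFrame
spaced = df["my col"]            # name with a space: only reachable with brackets
df["newcol"] = df["col1"] * 2    # new column: only possible with brackets (or loc)

print(type(single).__name__, type(subset).__name__)
```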

Short answer for differences:
[] indexing (square bracket access) has the full functionality to operate on DataFrame column data.
Attribute access (dot access) is mainly a convenience for accessing existing DataFrame column data, and it has its limitations (e.g. special column names, creating a new column).
More explanation
Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions in how attribute access behaves between a pandas DataFrame and a normal Python object. But it's well documented and can be easily understood. Just a few points to note:
In Python, users may dynamically add data attributes of their own to an instance object using attribute access.
>>> class Dog(object):
... pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}
In pandas, index and columns are closely related to the data structure, so you may access an index on a Series, or a column on a DataFrame, as an attribute.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
a b
0 7 6
1 5 8
>>> vars(df)
{'_is_copy': None,
'_data': BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
'_item_cache': {}}
But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.
>>> df.a
0 7
1 5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
a b
0 7 1
1 5 1
And, the convenience is a tradeoff for full functionality. E.g. you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as an attribute, because they are either not a valid Python identifier (1, space bar) or conflict with an existing attribute name.
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
space bar 1 loc min index
0 4 4 4 8 9
1 3 0 1 2 3
In these cases, .loc, .iloc and [] indexing are the defined way to fully access and operate on the index and columns of Series and DataFrame objects.
>>> df_special_col_names['space bar']
0 4
1 3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0 8
1 2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0 4
1 0
Name: 1, dtype: int64
Another important difference shows up when trying to create a new column for a DataFrame. As shown below, df.c = df.a + df.b only creates a new attribute alongside the core data structure; from version 0.21.0 onwards this raises a UserWarning (silent no more).
>>> df
a b
0 7 1
1 5 1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
a b d
0 7 1 8
1 5 1 6
>>> df.c
0 8
1 6
dtype: int64
>>> vars(df)
{'_is_copy': None,
'_data':
BlockManager
Items: Index(['a', 'b', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64,
'_item_cache': {},
'c': 0 8
1 6
dtype: int64}
Finally, to create a new column for DataFrame, never use attribute access. The correct way is to use either [] or .loc indexing:
>>> df
a b
0 7 6
1 5 8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is a newly added column
a b c
0 7 6 13
1 5 8 13

. notation is very useful when working interactively and for exploration. However, for code clarity and to avoid crazy things happening, you definitely should use [] notation. Here is an example of why you should use [] when creating a new column.
df = pd.DataFrame(data={'A':[1, 2, 3],
'B':[4,5,6]})
# this does not add a column (it only sets a plain attribute)
df.D = 11
df
A B
0 1 4
1 2 5
2 3 6
# but this works
df['D'] = 11
df
Out[19]:
A B D
0 1 4 11
1 2 5 11
2 3 6 11

If you had a dataframe like this (I am not recommending these column names)...
df = pd.DataFrame({'min':[1,2], 'max': ['a','a'], 'class': [1975, 1981], 'sum': [3,4]})
print(df)
min max class sum
0 1 a 1975 3
1 2 a 1981 4
It all looks OK and there are no errors. You can even access the columns via df['min'] etc...
print(df['min'])
0 1
1 2
Name: min, dtype: int64
However, if you tried with df.<column_name> you would get problems:
print(df.min)
<bound method NDFrame._add_numeric_operations.<locals>.min of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.max)
<bound method NDFrame._add_numeric_operations.<locals>.max of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.class)
File "<ipython-input-31-3472b02a328e>", line 1
print(df.class)
^
SyntaxError: invalid syntax
print(df.sum)
<bound method NDFrame._add_numeric_operations.<locals>.sum of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
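The same clash can be checked programmatically. This is a minimal sketch on the frame above (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"min": [1, 2], "max": ["a", "a"], "class": [1975, 1981], "sum": [3, 4]})

attr = df.sum          # the DataFrame.sum method, not the column
col = df["sum"]        # the column named "sum"
klass = df["class"]    # df.class is a SyntaxError, brackets work fine

print(callable(attr), col.tolist(), klass.tolist())
```

Bracket access never collides with method names or Python keywords, which is why it is the safer default in scripts.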

The difference between accessing pandas Dataframe column by index or as an attribute? [duplicate]

df.date_column_name showing different output compared to df['date_column_name'] [duplicate]

Pandas: Difference between dot and [] [duplicate]

Accessing Pandas column using squared brackets vs using a dot (like an attribute)

In both the bellow cases:
import pandas
d = {'col1': 2, 'col2': 2.5}
df = pandas.DataFrame(data=d, index=[0])
print(df['col2'])
print(df.col2)
Both methods can be used to index on a column and yield the same result, so is there any difference between them?

The "dot notation", i.e. df.col2 is the attribute access that's exposed as a convenience.
You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute:
df['col2'] does the same: it returns a pd.Series of the column.
A few caveats about attribute access:
you cannot add a column (df.new_col = x won't work, worse: it will silently actually create a new attribute rather than a column - think monkey-patching here)
it won't work if you have spaces in the column name or if the column name is an integer.

They are the same as long you're accessing a single column with a simple name, but you can do more with the bracket notation. You can only use df.col if the column name is a valid Python identifier (e.g., does not contains spaces and other such stuff). Also, you may encounter surprises if your column name clashes with a pandas method name (like sum). With brackets you can select multiple columns (e.g., df[['col1', 'col2']]) or add a new column (df['newcol'] = ...), which can't be done with dot access.
The other question you linked to applies, but that is a much more general question. Python objects get to define how the . and [] operators apply to them. Pandas DataFrames have chosen to make them the same for this limited case of accessing single columns, with the caveats described above.

Short answer for differences:
[] indexing (squared brackets access) has the full functionality to operate on DataFrame column data.
While attribute access (dot access) is mainly for convenience to access existing DataFrame column data, but occasionally has its limitations (e.g. special column names, creating a new column).
More explanation
Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinction when involving attribute access between pandas DataFrame and normal Python objects. But it's well documented and can be easily understood. Just a few points to note:
In Python, users may dynamically add data attributes of their own to an instance object using attribute access.
>>> class Dog(object):
... pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}
In pandas, index and column are closely related to the data structure, you may access an index on a Series, column on a DataFrame as an attribute.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
a b
0 7 6
1 5 8
>>> vars(df)
{'_is_copy': None,
'_data': BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
'_item_cache': {}}
But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.
>>> df.a
0 7
1 5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
a b
0 7 1
1 5 1
And the convenience is a tradeoff against full functionality. E.g. you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as attributes, because they are either not valid Python identifiers (1, space bar) or conflict with an existing attribute name (loc, min, index).
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
space bar 1 loc min index
0 4 4 4 8 9
1 3 0 1 2 3
In these cases, .loc, .iloc and [] indexing are the defined way to fully access and operate on the index and columns of Series and DataFrame objects.
>>> df_special_col_names['space bar']
0 4
1 3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0 8
1 2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0 4
1 0
Name: 1, dtype: int64
Another important difference arises when trying to create a new column for a DataFrame. As you can see, df.c = df.a + df.b just creates a new attribute alongside the core data structure, so starting from version 0.21.0 this behavior raises a UserWarning (silent no more).
>>> df
a b
0 7 1
1 5 1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
a b d
0 7 1 8
1 5 1 6
>>> df.c
0 8
1 6
dtype: int64
>>> vars(df)
{'_is_copy': None,
'_data':
BlockManager
Items: Index(['a', 'b', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64,
'_item_cache': {},
'c': 0 8
1 6
dtype: int64}
Finally, to create a new column for DataFrame, never use attribute access. The correct way is to use either [] or .loc indexing:
>>> df
a b
0 7 6
1 5 8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is a newly added column
a b c
0 7 6 13
1 5 8 13

. notation is very useful when working interactively and for exploration. However, for code clarity and to avoid crazy things happening, you definitely should use [] notation. Here is an example of why you should use [] when creating a new column.
df = pd.DataFrame(data={'A':[1, 2, 3],
'B':[4,5,6]})
# this does not add a column (it only sets an attribute on the object)
df.D = 11
df
A B
0 1 4
1 2 5
2 3 6
# but this works
df['D'] = 11
df
Out[19]:
A B D
0 1 4 11
1 2 5 11
2 3 6 11
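One way to check this for yourself is to test membership in df.columns rather than eyeballing the printed frame. The snippet below is a small sketch under that assumption. The warnings filter is only there to silence the UserWarning that recent pandas versions emit when a list-like is assigned to a new attribute name.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore', UserWarning)
    # Sets a plain Python attribute on the object; no column is created.
    df.D = [7, 8, 9]

print('D' in df.columns)  # False

# Bracket assignment creates a real column.
df['D'] = [7, 8, 9]
print('D' in df.columns)  # True

# Dot assignment does work for a column that already exists.
df.A = [10, 20, 30]
print(df['A'].tolist())   # [10, 20, 30]
```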

If you had a dataframe like this (I am not recommending these column names)...
df = pd.DataFrame({'min':[1,2], 'max': ['a','a'], 'class': [1975, 1981], 'sum': [3,4]})
print(df)
min max class sum
0 1 a 1975 3
1 2 a 1981 4
It all looks OK and there are no errors. You can even access the columns via df['min'] etc...
print(df['min'])
0 1
1 2
Name: min, dtype: int64
However, if you tried with df.<column_name> you would get problems:
print(df.min)
<bound method NDFrame._add_numeric_operations.<locals>.min of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.max)
<bound method NDFrame._add_numeric_operations.<locals>.max of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.class)
File "<ipython-input-31-3472b02a328e>", line 1
print(df.class)
^
SyntaxError: invalid syntax
print(df.sum)
<bound method NDFrame._add_numeric_operations.<locals>.sum of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
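If you want to know up front which column names are safe for dot access, a rough check is possible. The helper below, attribute_safe, is a hypothetical name introduced here and not part of pandas. It is only a sketch: it tests for a valid identifier, a Python keyword, and a clash with an attribute defined on the DataFrame class, and it does not cover every case (for example attributes previously set on the instance itself).

```python
import keyword

import pandas as pd


def attribute_safe(df, name):
    """Hypothetical helper: rough check that df.<name> would give the column."""
    return (
        isinstance(name, str)
        and name.isidentifier()              # rules out 'space bar', '1'
        and not keyword.iskeyword(name)      # rules out 'class'
        and not hasattr(type(df), name)      # rules out 'min', 'max', 'sum'
    )


df = pd.DataFrame({'min': [1, 2], 'class': [1975, 1981],
                   'space bar': [0, 1], 'year': [3, 4]})

for col in df.columns:
    print(repr(col), attribute_safe(df, col))
```

Only 'year' passes the check in this frame; the other three need df['...'].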
