What is the difference between a DataFrame attribute and a column? [duplicate]

This question already has an answer here:
In pandas, what's the difference between df['column'] and df.column? [duplicate]
(1 answer)
Closed 4 years ago.
In [66]: data
Out[66]:
col1 col2 label
0 1.0 a c
1 2.0 b d
2 3.0 c e
3 0.0 d f
4 4.0 e 0
5 5.0 f 0
In [67]: data.label
Out[67]:
0 c
1 d
2 NaN
3 f
4 NaN
5 NaN
Name: col2, dtype: object
In [68]: data['label']
Out[68]:
0 c
1 d
2 e
3 f
4 0
5 0
Name: label, dtype: object
Why are data.label and data['label'] showing different results?

The big difference I've noticed is assignment.
import random
import pandas as pd
s = "SummerCrime|WinterCrime".split("|")
j = {x: [random.choice(["ASB", "Violence", "Theft", "Public Order", "Drugs"]) for j in range(300)] for x in s}
df = pd.DataFrame(j)
df.FallCrime = [random.choice(["ASB", "Violence", "Theft", "Public Order", "Drugs"]) for j in range(300)]
Gives: UserWarning: Pandas doesn't allow columns to be created via a new attribute name
However, there are also docs on attribute access, which include the following warnings that may be related to your problem:
You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed, but s['min'] is possible.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
They go on to say:
You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column. In 0.21.0 and later, this will raise a UserWarning.
So it's possible you did this without realizing.
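A minimal sketch of that failure mode (the DataFrame and its values here are stand-ins for yours):
import pandas as pd

data = pd.DataFrame({'col1': [1.0, 2.0], 'col2': ['a', 'b']})
data.label = data['col2']       # sets an instance attribute, NOT a column (UserWarning)
print('label' in data.columns)  # False -- no column was created
data['label'] = ['c', 'd']      # bracket assignment really does create the column
print(data.label.name)          # 'col2' -- the stale attribute shadows the new column
print(data['label'].name)       # 'label' -- brackets reach the real column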

The difference between these two shows up in assignment: data.label can read, and even modify, an existing column, but it cannot create a new one, whereas data["label"] works for reading, assignment, and creation alike.
Also, if your column name contains a space, only brackets work: df['label name'] is fine, whereas df.label name is a syntax error (see the example below).
For more information see the linked answer.
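For instance (the column name here is hypothetical):
import pandas as pd

df = pd.DataFrame({'label name': [1, 2, 3]})
print(df['label name'])   # works fine
# df.label name           # SyntaxError: an attribute name cannot contain a space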

Related

df.index vs df["index"] after resetting index [duplicate]

This question already has an answer here:
Proper way to access a column of a pandas dataframe
(1 answer)
Closed last month.
import pandas as pd
df1 = pd.DataFrame({"value": [1, 1, 1, 2, 2, 2]})
print(df1)
print("-------------------------")
print(df1.reset_index())
print("-------------------------")
print(df1.reset_index().index)
print("-------------------------")
print(df1.reset_index()["index"])
produces the output
value
0 1
1 1
2 1
3 2
4 2
5 2
-------------------------
index value
0 0 1
1 1 1
2 2 1
3 3 2
4 4 2
5 5 2
-------------------------
RangeIndex(start=0, stop=6, step=1)
-------------------------
0 0
1 1
2 2
3 3
4 4
5 5
Name: index, dtype: int64
I am wondering why print(df1.reset_index().index) and
print(df1.reset_index()["index"]) print different things in this case? The latter prints the "index" column, while the former prints the row index.
If we want to access the reset indices (the column), then it seems we have to use brackets?
The .index attribute of a pandas DataFrame always points to the row index (the row labels) of the DataFrame, never to a column named "index".
If we want to access the reset indices (the column), then it seems we have to use brackets?
Yes, or you can assign a name when resetting the index, for example:
df1.reset_index(names='the_index').the_index
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# Name: the_index, dtype: int64
Several things happened here. First, when you don't specify an index, pandas uses a RangeIndex object as a virtual index of the dataframe. The dataframe is a collection of numpy arrays, which are naturally indexed 0, 1, 2, and so on. Since a RangeIndex is just 0, 1, 2, ..., it doesn't actually materialize its values in memory. Had you printed the index of the original df1, it would be a RangeIndex, just like df1.reset_index().index.
reset_index has an optional drop parameter. By default, pandas takes the existing index and turns it into a column of the dataframe. That index was a RangeIndex object, but it had to be expanded into a realized column to sit alongside the other columns of the df. Had you included drop=True, there would be no "index" column.
When you reset the index, the dataframe still has to have some index, and the default is the virtual RangeIndex you see.
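A quick check of both behaviours, reusing df1 from the question:
print(df1.reset_index().columns.tolist())           # ['index', 'value']
print(df1.reset_index(drop=True).columns.tolist())  # ['value'] -- old index discarded
print(df1.reset_index(drop=True).index)             # RangeIndex(start=0, stop=6, step=1)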
DataFrames have a shortcut where some columns can be addressed by attribute name rather than item (the square brackets). But, if the column name doesn't meet python's attribute naming rules or if it clashes with an existing attribute, you can't reference it that way. .index is the dataframe index so if you happen to also have a column "index", you need to access it via the square bracket item protocol.
One could argue that pandas should never have allowed the attribute access path because it can't be used consistently. I wouldn't argue that (except I totally would).
It does this because you are printing different things:
print(df1.reset_index().index)
is the same as:
df = df1.reset_index()
print(df.index)
This first moves the old index into a column, then prints the actual (new) index of the df.
print(df1.reset_index()["index"])
is the equivalent of
df = df1.reset_index()
print(df["index"])
It likewise moves the old index into a column, keeping both the "index" and "value" columns. It then prints the column named "index" (which is NOT the index of the df).
If you want to make the "index" column the row index again, use set_index (note that df1 itself has no "index" column until you reset it):
df = df1.reset_index().set_index("index")
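A quick check of that round trip (df2 is a name introduced here):
df2 = df1.reset_index().set_index('index')
print(df2.index.name)          # 'index' -- the former column is now the row index
print(df2.columns.tolist())    # ['value']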

Pandas Drop Duplicates And Store Duplicates [duplicate]

This question already has answers here:
How do I get a list of all the duplicate items using pandas in python?
(13 answers)
Closed 2 months ago.
I use pandas.DataFrame.drop_duplicates to search for duplicates in a dataframe. This removes the duplicates from the dataframe, and it works great. However, I would like to know which data has been removed.
Is there a way to save that data in a new list before removing it?
Unfortunately I have found no information on this in the pandas documentation.
Thanks for the answer.
Use the duplicated function to find out which rows are duplicated. By default (keep='first') the first occurrence is marked False and every subsequent repeat is marked True. Filtering the original data with that boolean mask tells you which rows are kept and which are dropped.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
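A minimal sketch of that approach (the frame and its values are illustrative):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3, 3]})
mask = df.duplicated()   # False for first occurrences, True for repeats
removed = df[mask]       # exactly the rows drop_duplicates() would discard
kept = df[~mask]         # identical to df.drop_duplicates()
print(removed)
print(kept)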
You can use duplicated and boolean indexing with groupby.agg to keep the list of duplicates:
m = df.duplicated('group')                                # True for repeated 'group' values
dropped = df[m].groupby(df['group'])['value'].agg(list)   # collect dropped values per group
print(dropped)
df = df[~m]                                               # keep the first row of each group
print(df)
Output:
# print(dropped)
group
A [2]
B [4, 5]
C [7]
Name: value, dtype: object
# print(df)
group value
0 A 1
2 B 3
5 C 6
Used input:
group value
0 A 1
1 A 2
2 B 3
3 B 4
4 B 5
5 C 6
6 C 7

Pandas loc indexing with inplace fillna [duplicate]

What's the difference between:
pandas df.loc[:,('col_a','col_b')]
and
df.loc[:,['col_a','col_b']]
The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Thanks
If your DataFrame has a simple column index, then there is no difference.
For example,
In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))
In [9]: df.loc[:, ['A','B']]
Out[9]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
In [10]: df.loc[:, ('A','B')]
Out[10]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
But if the DataFrame has a MultiIndex, there can be a big difference:
df = pd.DataFrame(np.random.randint(10, size=(5, 4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2 + ['bar']*2,
                                                     list('ABAB')]),
                  index=pd.MultiIndex.from_arrays([['baz']*2 + ['qux']*3,
                                                   list('CDCDC')]))
# foo bar
# A B A B
# baz C 7 9 9 9
# D 7 5 5 4
# qux C 5 0 5 1
# D 1 7 7 4
# C 6 4 3 5
In [27]: df.loc[:, ('foo','B')]
Out[27]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
The KeyError is saying that the MultiIndex has to be lexsorted. If we sort the columns first (with sort_index(axis=1); the sortlevel method in the original answer has since been removed from pandas), we still get a different result:
In [29]: df.sort_index(axis=1).loc[:, ('foo','B')]
Out[29]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [30]: df.sort_index(axis=1).loc[:, ['foo','B']]
Out[30]:
foo
A B
baz C 7 9
D 7 5
qux C 5 0
D 1 7
C 6 4
Why is that? df.sort_index(axis=1).loc[:, ('foo','B')] selects the single column whose first column level equals foo and whose second column level equals B.
In contrast, df.sort_index(axis=1).loc[:, ['foo','B']] selects every column whose first column level is either foo or B. At the first column level there are no B columns, but there are two foo columns.
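If what you actually wanted was "level-1 label B under any level-0 label", a hedged sketch with pd.IndexSlice does that:
idx = pd.IndexSlice
df.sort_index(axis=1).loc[:, idx[:, 'B']]   # both ('bar','B') and ('foo','B') columns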
I think the operating principle with Pandas is that if you use df.loc[...] as
an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect.
However, if you make an assignment of the form
df.loc[...] = value
then you can trust Pandas to alter df itself.
The reason why the documentation warns about the distinction between views and copies is so that you are aware of the pitfall of using chained assignments of the form
df.loc[...][...] = value
Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then
df.loc[...][...] = value
is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well since there are no references to the copy and thus there is no way to access the copy after the assignment statement completes, and (at least in CPython) it is therefore soon-to-be garbage collected.
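A short sketch of the pitfall and its reliable replacement (the frame and column names are made up):
import pandas as pd

df = pd.DataFrame({'A': [1, -1, 2], 'B': [10, 20, 30]})
df.loc[df['A'] > 0]['B'] = 0    # chained: the write may land on a temporary copy
df.loc[df['A'] > 0, 'B'] = 0    # single .loc call: reliably modifies df itself
print(df)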
I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.
However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):
If the resultant NDFrame can not be expressed as a basic slice of the
underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
If the resultant NDFrame has columns of different dtypes, then df.loc
will again probably return a copy.
However, there is an easy way to determine a posteriori whether x = df.loc[...] is a view: simply see if changing a value in x affects df. If it does, it is a view; if not, x is a copy.
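A sketch of that empirical check (note that under copy-on-write in newer pandas versions, such selections always behave like copies):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))
x = df.loc[:, ['A', 'B']]   # list selection -> typically a copy
x.iloc[0, 0] = 99
print(df.iloc[0, 0])        # still 0 here, so x was a copy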

Dynamically accessing a pandas dataframe column

Consider this simple example
import pandas as pd
df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [1, 0, 0]})
df
Out[9]:
one two
0 1 1
1 2 0
2 3 0
I want to write a function that takes as inputs a dataframe df and a column mycol.
Now this works:
df.groupby('one').two.sum()
Out[10]:
one
1 1
2 0
3 0
Name: two, dtype: int64
this works too:
def okidoki(df, mycol):
    return df.groupby('one')[mycol].sum()
okidoki(df, 'two')
Out[11]:
one
1 1
2 0
3 0
Name: two, dtype: int64
but this FAILS
def megabug(df, mycol):
    return df.groupby('one').mycol.sum()
megabug(df, 'two')
AttributeError: 'DataFrameGroupBy' object has no attribute 'mycol'
What is wrong here?
I am worried that okidoki uses some chaining that might create some subtle bugs (https://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing).
How can I still keep the syntax groupby('one').mycol? Can the mycol string be converted to something that might work that way?
Thanks!
You pass a string as the second argument. In effect, you're trying to do something like:
df.'two'
which is invalid syntax. If you're trying to dynamically access a column, you'll need to use index notation, [...], because the dot/attribute accessor notation doesn't work for dynamic access.
Dynamic access on its own is possible. For example, you can use getattr (but I don't recommend this, it's an antipattern):
In [674]: df
Out[674]:
one two
0 1 1
1 2 0
2 3 0
In [675]: getattr(df, 'one')
Out[675]:
0 1
1 2
2 3
Name: one, dtype: int64
Dynamically selecting by attribute from a groupby call can be done, something like:
In [677]: getattr(df.groupby('one'), mycol).sum()
Out[677]:
one
1 1
2 0
3 0
Name: two, dtype: int64
But don't do it. It is a horrid anti-pattern, and much less readable than df.groupby('one')[mycol].sum().
I think you need [] for selecting a column by name, which is the general solution for selecting columns, because attribute access has many exceptions:
You can use this access only if the index element is a valid python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
def megabug(df, mycol):
    return df.groupby('one')[mycol].sum()
print (megabug(df, 'two'))
one
1 1
2 0
3 0
Name: two, dtype: int64

Pandas overwrite values in column selectively based on condition from another column

I have a dataframe in pandas with four columns. The data consists of strings. Sample:
A B C D
0 2 asicdsada v:cVccv u
1 4 ascccaiiidncll v:cVccv:ccvc u
2 9 sca V:c u
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: u
I want to replace the string 'u' in Col D with the string 'a' if Col C in that row contains the substring 'V' (case sensitive).
Desired outcome:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
I prefer to overwrite the value already in Column D, rather than assign two different values, because I'd like to selectively overwrite some of these values again later, under different conditions.
It seems like this should have a simple solution, but I cannot figure it out, and haven't been able to find a fully applicable solution in other answered questions.
df.ix[1]["D"] = "a"
changes an individual value.
df.ix[:]["C"].str.contains("V")
returns a series of booleans, but I am not sure what to do with it. I have tried many many combinations of .loc, apply, contains, re.search, and for loops, and I get either errors or replace every value in column D. I'm a novice with pandas/python so it's hard to know whether my syntax, methods, or conceptualization of what I even need to do are off (probably all of the above).
As you've already tried, use str.contains to get a boolean Series, and then use .loc to say "change these rows and the D column". For example:
In [5]: df.loc[df["C"].str.contains("V"), "D"] = "a"
In [6]: df
Out[6]:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
(Avoid .ix -- it was deprecated and has since been removed from pandas; use .loc or .iloc instead.)
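An alternative sketch, reusing the df above, builds the whole column in one pass with numpy.where:
import numpy as np

df['D'] = np.where(df['C'].str.contains('V'), 'a', df['D'])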
