pandas' replace function replaces targeted values with another value, as expected:
>>> import pandas as pd
>>> t = pd.Series([10,20,30])
>>> t
0 10
1 20
2 30
dtype: int64
>>> t.replace(to_replace=20, value=222)
0 10
1 222
2 30
dtype: int64
>>> from numpy import nan; t.replace(to_replace=20, value=nan)
0 10.0
1 NaN
2 30.0
dtype: float64
But when asked to replace with None, it replaces with the previous value instead.
>>> t.replace(to_replace=20, value=None)
0 10
1 10
2 30
dtype: int64
What is the rationale behind this, if any?
This is because to_replace is a scalar and value is None. That behaviour is described in the documentation:
The method to use when for replacement, when to_replace is a scalar,
list or tuple and value is None.
The default method is pad, if you take a look at the code. This likely happens because None is conventionally used to express the absence of a parameter, so it cannot also serve as the replacement value here.
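To make the fallback visible, here is a minimal sketch that passes method explicitly (note that the method parameter of replace is deprecated in recent pandas releases, so this reproduces the behaviour only on versions that still support it):
>>> t.replace(to_replace=20, method='pad')   # what value=None falls back to
0 10
1 10
2 30
dtype: int64
>>> t.replace(to_replace=20, method='bfill') # fill from the next value instead
0 10
1 30
2 30
dtype: int64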
How do I get back 10 from df.iloc[10], where df = pd.DataFrame({'a': np.arange(1, 12)})?
I tried df.index, but it returns an index object that doesn't obviously contain anything close to 10.
With df.iloc or df.loc, you obtain a series that corresponds to the columns of a given row in the dataframe:
>>> df
foo
a 44
b 34
c 65
>>> df.iloc[1]
foo 34
Name: b, dtype: int64
>>> df.index[1]
'b'
>>> df.loc['b']
foo 34
Name: b, dtype: int64
You can see that the index of this second row, here b, is kept as the name of the series. Hence we can use it to find the position in the index:
>>> ser = df.iloc[1]
>>> df.index.get_indexer([ser.name])[0]
1
Note that Index.get_indexer only works with arrays, hence the need to take the first element of the result.
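For a single label there is also Index.get_loc, which does the same lookup without wrapping the name in a list:
>>> df.index.get_loc(ser.name)
1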
Alternatively, you can always convert the index to a list and use list.index to find the element's position, but this will likely be much slower:
>>> df.index.to_list().index(ser.name)
1
The simplest solution, if the index matches the row numbers, is df.iloc[10].name, which returns 10.
I have two dataframes. The first has numbers as its index; the second has datetimes as its index. The slice operator (:) behaves differently on these dataframes.
Case 1
>>> df = pd.DataFrame({'A':[1,2,3]}, index=[0,1,2])
>>> df
A
0 1
1 2
2 3
>>> df[0:2]
A
0 1
1 2
Case 2
>>> a = dt.datetime(2000,1,1)
>>> b = dt.datetime(2000,1,2)
>>> c = dt.datetime(2000,1,3)
>>> df = pd.DataFrame({'A':[1,2,3]}, index = [a,b,c])
>>> df
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
>>> df[a:b]
A
2000-01-01 1
2000-01-02 2
Why does the final row get excluded in case 1 but not in case 2?
Don't use it; it is better to use loc for consistency:
df = pd.DataFrame({'A':[1,2,3]}, index=[0,1,2])
print (df.loc[0:2])
A
0 1
1 2
2 3
a = datetime.datetime(2000,1,1)
b = datetime.datetime(2000,1,2)
c = datetime.datetime(2000,1,3)
df = pd.DataFrame({'A':[1,2,3]}, index = [a,b,c])
print (df.loc[a:b])
A
2000-01-01 1
2000-01-02 2
The reason why the last row is omitted can be found in the docs:
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.
print (df[0:2])
A
0 1
1 2
For selecting by datetimes, exact indexing is used:
... In contrast, indexing with Timestamp or datetime objects is exact, because the objects have exact meaning. These also follow the semantics of including both endpoints.
Okay, to understand this, let's first run an experiment:
import pandas as pd
import datetime as dt
a = dt.datetime(2000,1,1)
b = dt.datetime(2000,1,2)
c = dt.datetime(2000,1,3)
df = pd.DataFrame({'A':[1,2,3]}, index=[a,b,c])
Now let's use
df[0:2]
Which gives us
A
2000-01-01 1
2000-01-02 2
This behavior is consistent with standard Python list slicing. But if you use
df[a:c]
You get
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
This is because df[a:c] does not fall back to plain positional slicing: the index values are not integers, so pandas uses label-based slicing instead, which includes the last element. If your indexes are integers, pandas defaults to the built-in positional slicing, which excludes the endpoint. As already mentioned in the answer by jezrael, it is better to use loc, as it behaves more consistently across the board.
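To keep the two behaviours straight, here is a minimal sketch that spells out the positional/label distinction on the integer-indexed frame from case 1:
import pandas as pd

df = pd.DataFrame({'A':[1,2,3]}, index=[0,1,2])
print(df.iloc[0:2])  # positional slicing: endpoint excluded, rows 0 and 1
print(df.loc[0:2])   # label slicing: endpoint included, rows 0, 1 and 2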
Consider this simple example
import pandas as pd
df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [1, 0, 0]})
df
Out[9]:
one two
0 1 1
1 2 0
2 3 0
I want to write a function that takes as inputs a dataframe df and a column mycol.
Now this works:
df.groupby('one').two.sum()
Out[10]:
one
1 1
2 0
3 0
Name: two, dtype: int64
This works too:
def okidoki(df, mycol):
    return df.groupby('one')[mycol].sum()
okidoki(df, 'two')
Out[11]:
one
1 1
2 0
3 0
Name: two, dtype: int64
But this FAILS:
def megabug(df, mycol):
    return df.groupby('one').mycol.sum()
megabug(df, 'two')
AttributeError: 'DataFrameGroupBy' object has no attribute 'mycol'
What is wrong here?
I am worried that okidoki uses some chaining that might create some subtle bugs (https://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing).
How can I still keep the syntax groupby('one').mycol? Can the mycol string be converted to something that might work that way?
Thanks!
You pass a string as the second argument. In effect, you're trying to do something like:
df.'two'
This is invalid syntax. If you're trying to dynamically access a column, you'll need to use the index notation, [...], because the dot/attribute accessor notation doesn't work for dynamic access.
Dynamic access on its own is possible. For example, you can use getattr (but I don't recommend this, it's an antipattern):
In [674]: df
Out[674]:
one two
0 1 1
1 2 0
2 3 0
In [675]: getattr(df, 'one')
Out[675]:
0 1
1 2
2 3
Name: one, dtype: int64
Dynamically selecting by attribute from a groupby call can be done with something like:
In [677]: getattr(df.groupby('one'), mycol).sum()
Out[677]:
one
1 1
2 0
3 0
Name: two, dtype: int64
But don't do it. It is a horrid antipattern, and much less readable than df.groupby('one')[mycol].sum().
I think you need [] to select a column by name, which is the general solution for selecting columns, because selection by attribute has many exceptions (illustrated after the example below):
You can use this access only if the index element is a valid python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
def megabug(df, mycol):
    return df.groupby('one')[mycol].sum()
print (megabug(df, 'two'))
one
1 1
2 0
3 0
Name: two, dtype: int64
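As a quick illustration of two of those exceptions, consider a hypothetical frame whose column names collide with a method name (min) and are not valid identifiers ('1'):
df2 = pd.DataFrame({'min': [1, 2], '1': [3, 4]})
df2.min     # attribute access resolves to the DataFrame.min method, not the column
df2['min']  # bracket indexing selects the column
df2['1']    # also works, even though df2.1 would be a SyntaxError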
I have a series containing data like
0 a
1 ab
2 b
3 a
And I want to replace any row containing 'b' with 1, and all others with 0. I've tried:
one = labels.str.contains('b')
zero = ~labels.str.contains('b')
labels.loc[one] = 1
labels.loc[zero] = 0
This does the trick, but it gives this pesky warning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
I know I've seen this warning the last few times I've used pandas. What is the recommended approach here? My method gives the desired result, but what should I be doing instead? Also, I think of Python as an 'if it makes logical sense and you type it, it will run' kind of language; my solution seems perfectly logical in the human-readable sense, so it feels very non-Pythonic that it raises a warning.
Try this:
ds = pd.Series(['a','ab','b','a'])
ds
0 a
1 ab
2 b
3 a
dtype: object
ds.apply(lambda x: 1 if 'b' in x else 0)
0 0
1 1
2 1
3 0
dtype: int64
You can use numpy.where. The output is a numpy.ndarray, so you have to use the Series constructor:
import pandas as pd
import numpy as np
ser = pd.Series(['a','ab','b','a'])
print(ser)
0 a
1 ab
2 b
3 a
dtype: object
print(np.where(ser.str.contains('b'),1,0))
[0 1 1 0]
print(type(np.where(ser.str.contains('b'),1,0)))
<class 'numpy.ndarray'>
print(pd.Series(np.where(ser.str.contains('b'),1,0), index=ser.index))
0 0
1 1
2 1
3 0
dtype: int32
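Since ser.str.contains('b') already returns a boolean Series, arguably the simplest option is to cast the mask directly:
print(ser.str.contains('b').astype(int))
0 0
1 1
2 1
3 0
dtype: int64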
I have a pandas dataframe created from a CSV file. One column of this dataframe contains numeric data that is initially cast as a string. Most entries are numeric-like, but some contain various error codes that are non-numeric. I do not know beforehand what all the error codes might be or how many there are. So, for instance, the dataframe might look like:
[In 1]: df
[Out 1]:
data OtherAttr
MyIndex
0 1.4 aaa
1 error1 foo
2 2.2 bar
3 0.8 bar
4 xxx bbb
...
743733 BadData ccc
743734 7.1 foo
I want to cast df.data as a float and throw out any values that don't convert properly. Is there a built-in functionality for this? Something like:
df.data = df.data.astype(float, skipbad = True)
(Although I know that specifically will not work and I don't see any kwargs within astype that do what I want)
I guess I could write a function using try and then use pandas apply or map, but that seems like an inelegant solution. This must be a fairly common problem, right?
Use the convert_objects method, which "attempts to infer better dtype for object columns":
In [11]: df['data'].convert_objects(convert_numeric=True)
Out[11]:
0 1.4
1 NaN
2 2.2
3 0.8
4 NaN
Name: data, dtype: float64
In fact, you can apply this to the entire DataFrame:
In [12]: df.convert_objects(convert_numeric=True)
Out[12]:
data OtherAttr
MyIndex
0 1.4 aaa
1 NaN foo
2 2.2 bar
3 0.8 bar
4 NaN bbb
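Note that convert_objects was deprecated and later removed from pandas; in modern versions the equivalent is pd.to_numeric with errors='coerce', which coerces anything unparseable to NaN:
In [13]: pd.to_numeric(df['data'], errors='coerce')
Out[13]:
0 1.4
1 NaN
2 2.2
3 0.8
4 NaN
Name: data, dtype: float64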