I'm converting a .ods spreadsheet to a Pandas DataFrame. I have whole columns and rows I'd like to drop because they contain only "None". As "None" is a str, I have:
pandas.DataFrame.replace("None", numpy.nan)
...on which I call: .dropna(how='all')
Is there a pandas equivalent to numpy.nan?
Is there a way to use .dropna() with the *string "None" rather than NaN?
You can use float('nan') if you really want to avoid importing things from the numpy namespace:
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> s[1] = float('nan')
>>> s
0 1.0
1 NaN
2 3.0
dtype: float64
>>>
>>> s.dropna()
0 1.0
2 3.0
dtype: float64
Moreover, if you have a string value "None", you can .replace("None", float("nan")):
>>> s[1] = "None"
>>> s
0 1
1 None
2 3
dtype: object
>>>
>>> s.replace("None", float("nan"))
0 1.0
1 NaN
2 3.0
dtype: float64
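To get all the way back to the original goal of dropping whole rows and columns that contain only "None", the same chain can be applied along both axes of the DataFrame; a minimal sketch with a made-up frame:
>>> df = pd.DataFrame({"x": ["None", 1, 2], "y": ["None", "None", "None"]})
>>> df.replace("None", float("nan")).dropna(how="all").dropna(how="all", axis=1)
x
1 1
2 2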
If you are trying to directly drop the rows containing a "None" string value (without first converting these "None" cells to NaN), I guess it can be done without using replace + dropna.
Considering a DataFrame like:
In [3]: df = pd.DataFrame({
"foo": [1,2,3,4],
"bar": ["None",5,5,6],
"baz": [8, "None", 9, 10]
})
In [4]: df
Out[4]:
bar baz foo
0 None 8 1
1 5 None 2
2 5 9 3
3 6 10 4
Using replace and dropna will return
In [5]: df.replace('None', float("nan")).dropna()
Out[5]:
bar baz foo
2 5.0 9.0 3
3 6.0 10.0 4
Which can also be obtained by simply selecting the rows you need:
In [7]: df[df.eval("foo != 'None' and bar != 'None' and baz != 'None'")]
Out[7]:
bar baz foo
2 5 9 3
3 6 10 4
You can also use the drop method of your dataframe, selecting the targeted axis/labels appropriately:
In [9]: df.drop(df[(df.baz == "None") |
(df.bar == "None") |
(df.foo == "None")].index)
Out[9]:
bar baz foo
2 5 9 3
3 6 10 4
These two methods are more or less interchangeable, as you can also do, for example:
df[(df.baz != "None") & (df.bar != "None") & (df.foo != "None")]
(But I guess the comparison df.some_column == "Some string" is only possible if the column dtype allows it. Unlike with eval, before these last two examples I had to do df = df.astype(object), since the foo column was of type int64.)
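For completeness, the cast that note refers to is just a one-liner, applied before the two boolean selections above (a sketch):
df = df.astype(object)  # cast every column (including the int64 foo) to object before comparing to "None"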
I can't figure out how to set the value of a Series at a specific index in a chainable style.
For example, say I have the following dataframe:
>>> df = pd.DataFrame({'a': [1,2,3], 'b': [0,0,0]})
>>> df
a b
0 1 0
1 2 0
2 3 0
If I want to change all the values of a column in a pipeline, I can use pandas.DataFrame.assign():
>>> df.assign(b=[4,5,6])
a b
0 1 4
1 2 5
2 3 6
...and then I can do other stuff with the dataframe on the same line, for example:
>>> df.assign(b=[4,5,6]).mul(100)
a b
0 100 400
1 200 500
2 300 600
But I can't do this for an individual value at a specific index in a Series.
>>> s = df['a']
>>> s
0 1
1 2
2 3
Name: a, dtype: int64
I can, of course, just use a normal Python assignment operation using =:
>>> s[1] = 9
>>> s
0 1
1 9
2 3
Name: a, dtype: int64
But the problems with that are:
It's in-place, so it modifies my existing dataframe
Assignment statements using = are not allowed in Python lambda functions
For example, what if I wanted to do this:
>>> df.apply(lambda x: x['b', 0] = 13, axis=1)
File "<stdin>", line 1
df.apply(lambda x: x['b', 0] = 13, axis=1)
^
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
(I understand that there are better ways to handle that particular case, but this is just a made-up example.)
How can I set the value at the specified index of a Series? I would like to be able to just do something like s.set_value(idx, 'my_val') and have it return the modified (copied) Series.
You can use pandas.Series.where() to return a copy of the column with the value at the specified index replaced.
This is basically like using .loc:
>>> df['b'].where(df['b'].index != 1, 13)
0 0
1 13
2 0
Name: b, dtype: int64
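Because where() returns a modified copy, it drops straight into the chainable style from the question, e.g. (a sketch reusing df from above):
>>> df.assign(b=df['b'].where(df['b'].index != 1, 13)).mul(100)
a b
0 100 0
1 200 1300
2 300 0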
If you have an index that isn't a RangeIndex, or one that doesn't start from zero, you can build the condition on s.reset_index().index instead; this works like the example above, except that it mimics the behavior of .iloc rather than .loc:
>>> s = pd.Series({'a': 0, None: 0, True: 0})
>>> s
a 0
NaN 0
True 0
dtype: int64
>>> s.where(s.reset_index().index != 1, 13)
a 0
NaN 13
True 0
dtype: int64
I have a pandas dataframe df with a column, call it A, that contains multiple data types. I want to select all rows of df where A has a particular data type.
For example, suppose that A has types int and str. I want to do something like df[type(df['A']) == int].
Setup
df = pd.DataFrame({'A': ['hello', 1, 2, 3, 'bad']})
This entire column will be assigned dtype Object. If you just want to find numeric values:
pd.to_numeric(df.A, errors='coerce').dropna()
1 1.0
2 2.0
3 3.0
Name: A, dtype: float64
However, this would also allow floats, string representations of numbers, etc. into the mix. If you really want to find elements that are of type int, you can use a list comprehension:
df.loc[[isinstance(val, int) for val in df.A], 'A']
1 1
2 2
3 3
Name: A, dtype: object
But notice that the dtype is still Object.
If the column has Boolean values, these will be kept, since bool is a subclass of int. If you don't want this behavior, you can use type instead of isinstance.
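A minimal sketch of that exact-type check, reusing the df from the setup (any bool values, if present, would now be excluded):
df.loc[[type(val) is int for val in df.A], 'A']
1 1
2 2
3 3
Name: A, dtype: object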
Group by type
dod = dict(tuple(df.groupby(df['A'].map(type), sort=False)))
Setup
df = pd.DataFrame(dict(A=[1, 'one', {1}, [1], (1,)] * 2))
Validation
for t, d in dod.items():
    print(t, d, sep='\n')
    print()
<class 'int'>
A
0 1
5 1
<class 'str'>
A
1 one
6 one
<class 'set'>
A
2 {1}
7 {1}
<class 'list'>
A
3 [1]
8 [1]
<class 'tuple'>
A
4 (1,)
9 (1,)
Using groupby, with the data from user3483203:
for _, x in df.groupby(df.A.apply(lambda x: type(x).__name__)):
    print(x)
A
1 1
2 2
3 3
A
0 hello
4 bad
d = {y: x for y, x in df.groupby(df.A.apply(lambda x: type(x).__name__))}  # dict keyed by type name
a = [2, 'B', 3.0, 'c', 1, 'a', 2.0, 'b', 3, 'C', 'A', 1.0]
df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()                     # NaN for non-string values
df['lower'] = df['a'].str.islower()                     # NaN for non-string values
df['int'] = df['a'].apply(isinstance, args=(int,))      # True where the value is an int
df['float'] = df['a'].apply(isinstance, args=(float,))  # True where the value is a float
print(df)
a upper lower int float
0 2 NaN NaN True False
1 B True False False False
2 3 NaN NaN False True
3 c False True False False
4 1 NaN NaN True False
5 a False True False False
6 2 NaN NaN False True
7 b False True False False
8 3 NaN NaN True False
9 C True False False False
10 A True False False False
11 1 NaN NaN False True
integer = df[df['int']]['a']
print(integer)
0 2
4 1
8 3
Name: a, dtype: object
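The float mask built above works the same way, e.g. (a quick sketch):
floats = df[df['float']]['a']  # rows 2, 6 and 11, i.e. the float-typed entries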
Suppose I have a DataFrame
df = pandas.DataFrame({'a': [1,2], 'b': [3,4]}, ['foo', 'bar'])
a b
foo 1 3
bar 2 4
And I want to add a column based on another Series:
s = pandas.Series({'foo': 10, 'baz': 20})
foo 10
baz 20
dtype: int64
How do I assign the Series to a column of the DataFrame and provide a default value if the DataFrame index value is not in the Series index?
I'm looking for something of the form:
df['c'] = s.withDefault(42)
Which would result in the following Dataframe:
a b c
foo 1 3 10
bar 2 4 42
#Note: bar got value 42 because it's not in s
Thank you in advance for your consideration and response.
Using map with get
get has an argument that you can use to specify the default value.
df.assign(c=df.index.map(lambda x: s.get(x, 42)))
a b c
foo 1 3 10
bar 2 4 42
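For reference, Series.get behaves like dict.get, so the default can be checked on its own (a quick sketch):
s.get('bar', 42)  # 'bar' is not in s's index, so the default 42 is returned
s.get('foo', 42)  # 'foo' is in the index, so this returns 10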
Use reindex with fill_value
df.assign(c=s.reindex(df.index, fill_value=42))
a b c
foo 1 3 10
bar 2 4 42
You can join df with a DataFrame built from s, and then fill the resulting NaN with the default value, which is 42 in your case.
df['c'] = df.join(pandas.DataFrame(s, columns=['c']))['c'].fillna(42).astype(int)
Output:
a b c
foo 1 3 10
bar 2 4 42
Is the condition None == None true or false?
I have 2 pandas-dataframes:
import pandas as pd
df1 = pd.DataFrame({'id':[1,2,3,4,5], 'value':[None,20,None,40,50]})
df2 = pd.DataFrame({'index':[1,2,3], 'value':[None,20,None]})
In [42]: df1
Out[42]: id value
0 1 NaN
1 2 20.0
2 3 NaN
3 4 40.0
4 5 50.0
In [43]: df2
Out[43]: index value
0 1 NaN
1 2 20.0
2 3 NaN
When I execute a merge, it looks like None == None is True:
In [37]: df3 = df1.merge(df2, on='value', how='inner')
In [38]: df3
Out[38]: id value index
0 1 NaN 1
1 1 NaN 3
2 3 NaN 1
3 3 NaN 3
4 2 20.0 2
but when I do this:
In [39]: df4 = df3[df3['value']==df3['value']]
In [40]: df4
Out[40]: id value index
4 2 20.0 2
In [41]: df3['value']==df3['value']
Out[41]: 0 False
1 False
2 False
3 False
4 True
It shows that None == None is false.
Pandas uses the floating-point Not a Number value, NaN, to indicate that something is missing in a series of numbers, because that is easier to handle in the internal representation of the data. You don't have any None objects in your series. That said, if you use dtype=object data, None is used to encode missing values. See Working with missing data.
Not that it matters here, but NaN is always, by definition, not equal to NaN:
>>> float('NaN') == float('NaN')
False
When merging or broadcasting, Pandas knows what 'missing' means; there is no equality test being done on the NaN or None values in a series. Nulls are handled explicitly.
If you want to test whether a value is null or not, use the series.isnull() and series.notnull() methods instead.
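On the merged frame above, that looks like (a sketch):
>>> df3[df3['value'].notnull()]
id value index
4 2 20.0 2
>>> df3['value'].isnull()
0 True
1 True
2 True
3 True
4 False
Name: value, dtype: bool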
I suspect this is a simpler form of my question here. [Update: unfortunately not so.]
If you do something like this (in Pandas 0.11):
df = pd.DataFrame([[1,2],[1,3],[2,4]],columns='a b'.split())
print df
g = df.groupby('a').count()
print type(g)
print g
You get the expected:
a b
0 1 2
1 1 3
2 2 4
<class 'pandas.core.frame.DataFrame'>
a b
a
1 2 2
2 1 1
But if there's only one resulting group, you get a very odd Series instead:
df = pd.DataFrame([[1,2],[1,3],[1,4]],columns='a b'.split())
...
a b
0 1 2
1 1 3
2 1 4
<class 'pandas.core.series.Series'>
a
1 a 3
b 3
Name: 1, dtype: int64
But I'd rather the result was a DataFrame equivalent to this:
print pd.DataFrame([[3,3]],index=pd.Index([1],name='a'),columns='a b'.split())
a b
a
1 3 3
I'm stuck as to how to get that easily from the series (and not sure why I get that in the first place).
In pandas 0.12 this does exactly what you ask.
In [3]: df = pd.DataFrame([[1,2],[1,3],[1,4]],columns='a b'.split())
In [4]: df.groupby('a').count()
Out[4]:
a b
a
1 3 3
To replicate what you're seeing, pass squeeze=True:
In [5]: df.groupby('a', squeeze=True).count()
Out[5]:
a
1 a 3
b 3
Name: 1, dtype: int64
If you can't upgrade then do:
In [3]: df.groupby('a').count().unstack()
Out[3]:
a b
a
1 3 3