pandas DataFrame combine_first and update methods have strange behavior

I'm running into a strange issue (or is it intended?) where combine_first or update causes values stored as bool to be upcast to float64 when the supplied argument does not include the boolean columns.
Example workflow in ipython:
In [144]: test = pd.DataFrame([[1,2,False,True],[4,5,True,False]], columns=['a','b','isBool', 'isBool2'])
In [145]: test
Out[145]:
a b isBool isBool2
0 1 2 False True
1 4 5 True False
In [147]: b = pd.DataFrame([[45,45]], index=[0], columns=['a','b'])
In [148]: b
Out[148]:
a b
0 45 45
In [149]: test.update(b)
In [150]: test
Out[150]:
a b isBool isBool2
0 45 45 0 1
1 4 5 1 0
Was this meant to be the behavior of the update function? I would think that if a column was not specified, update wouldn't touch it.
EDIT: I started tinkering around a little more, and the plot thickens. If I insert one more command, test.update([]), before running test.update(b), the boolean columns survive, at the cost of the numeric columns being upcast to object. This also applies to DSM's simplified example.
Based on the pandas source code, it looks like the reindex_like method is creating a DataFrame of dtype object, while reindexing b creates a DataFrame of dtype float64. Since object is more general, subsequent operations work with bools. Unfortunately, running np.log on the numerical columns will then fail with an AttributeError.
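For reference, a minimal sketch (using the test and b frames above; session numbering is illustrative) showing where the float64s come from: conforming b to test's shape fills the missing cells with NaN, and since NaN is a float, every column comes back as float64.
In [151]: b.reindex_like(test).dtypes
Out[151]:
a          float64
b          float64
isBool     float64
isBool2    float64
dtype: object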

This is a bug; update shouldn't touch unspecified columns. Fixed here: https://github.com/pydata/pandas/pull/3021

Before updating, the dataframe b is first conformed to the target frame by reindex_like (here a plays the role of the question's test), so that b becomes
In [5]: b.reindex_like(a)
Out[5]:
a b isBool isBool2
0 45 45 NaN NaN
1 NaN NaN NaN NaN
update then uses numpy.where to combine the two frames.
The tragedy is that when numpy.where is given two arrays of different dtypes, the more general one is used. For example:
In [20]: np.where(True, [True], [0])
Out[20]: array([1])
In [21]: np.where(True, [True], [1.0])
Out[21]: array([ 1.])
Since NaN in numpy is a floating-point value, it will also return a floating type:
In [22]: np.where(True, [True], [np.nan])
Out[22]: array([ 1.])
Therefore, after updating, your 'isBool' and 'isBool2' columns become floating type.
I've filed this issue on the pandas issue tracker.
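Until the fix lands, one workaround (a sketch, assuming you know up front which columns should stay boolean) is to cast them back after the update; this restores the dtype, with 0.0/1.0 mapping back to False/True:
In [153]: bool_cols = ['isBool', 'isBool2']  # the columns we expect to stay bool
In [154]: test.update(b)
In [155]: test[bool_cols] = test[bool_cols].astype(bool)  # undo the float64 upcast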


Getting first element of series from pandas .value_counts() [duplicate]

I am new to Python. This seems like a basic question to ask, but I really want to understand what is happening here.
import numpy as np
import pandas as pd
tempdata = np.random.random(5)
myseries_one = pd.Series(tempdata)
myseries_two = pd.Series(data = tempdata, index = ['a','b','c','d','e'])
myseries_three = pd.Series(data = tempdata, index = [10,11,12,13,14])
myseries_one
Out[1]:
0 0.291293
1 0.381014
2 0.923360
3 0.271671
4 0.605989
dtype: float64
myseries_two
Out[2]:
a 0.291293
b 0.381014
c 0.923360
d 0.271671
e 0.605989
dtype: float64
myseries_three
Out[3]:
10 0.291293
11 0.381014
12 0.923360
13 0.271671
14 0.605989
dtype: float64
Indexing the first element of each series:
myseries_one[0] #As expected
Out[74]: 0.29129291112626043
myseries_two[0] #As expected
Out[75]: 0.29129291112626043
myseries_three[0]
KeyError:0
Doubt 1: Why is this happening? Why does myseries_three[0] give me a KeyError?
What do we mean by calling myseries_one[0], myseries_two[0], or myseries_three[0]? Does calling this way mean we are calling by row names?
Doubt 2: Do row names and row numbers in Python work differently from row names and row numbers in R?
myseries_one[0:2]
Out[78]:
0 0.291293
1 0.381014
dtype: float64
myseries_two[0:2]
Out[79]:
a 0.291293
b 0.381014
dtype: float64
myseries_three[0:2]
Out[80]:
10 0.291293
11 0.381014
dtype: float64
Doubt 3: If calling myseries_three[0] means calling by row name, then how is myseries_three[0:3] producing output? Does myseries_three[0:4] mean we are calling by row number? Please explain and guide; I am migrating from R to Python, so it's a bit confusing for me.
When you are attempting to slice with myseries[something], the something is often ambiguous. You are highlighting a case of that ambiguity. In your case, pandas is trying to help you out by guessing what you mean.
myseries_one[0] #As expected
Out[74]: 0.29129291112626043
myseries_one has integer labels. It makes sense that when you attempt to slice with an integer, you intend to get the element labeled with that integer. It turns out that you have an element labeled 0, and so that is returned to you.
myseries_two[0] #As expected
Out[75]: 0.29129291112626043
myseries_two has string labels. It's highly unlikely that you meant to slice this series with a label of 0 when labels are all strings. So, pandas assumes that you meant a position of 0 and returns the first element (thanks pandas, that was helpful).
myseries_three[0]
KeyError:0
myseries_three has integer labels and you are attempting to slice with an integer... perfect. Let's just get that value for you... KeyError. Whoops, that index label does not exist. In this case, it is safer for pandas to fail than to guess that maybe you meant to slice by position. The documentation even suggests that if you want to remove the ambiguity, use loc for label-based slicing and iloc for position-based slicing.
Let's try loc
myseries_one.loc[0]
0.29129291112626043
myseries_two.loc[0]
KeyError:0
myseries_three.loc[0]
KeyError:0
Only myseries_one has a label 0. The other two raise KeyErrors.
Let's try iloc
myseries_one.iloc[0]
0.29129291112626043
myseries_two.iloc[0]
0.29129291112626043
myseries_three.iloc[0]
0.29129291112626043
They all have a position of 0 and return the first element accordingly.
For range slicing, pandas decides to be less interpretive and sticks to positional slicing for the integer slice 0:2. Keep in mind that actual real people (the programmers writing pandas code) are the ones making these decisions. When you attempt to do something ambiguous, you may get varying results. To remove the ambiguity, use loc and iloc.
iloc
myseries_one.iloc[0:2]
0 0.291293
1 0.381014
dtype: float64
myseries_two.iloc[0:2]
a 0.291293
b 0.381014
dtype: float64
myseries_three.iloc[0:2]
10 0.291293
11 0.381014
dtype: float64
loc
myseries_one.loc[0:2]
0 0.291293
1 0.381014
2 0.923360
dtype: float64
myseries_two.loc[0:2]
TypeError: cannot do slice indexing on <class 'pandas.indexes.base.Index'> with these indexers [0] of <type 'int'>
myseries_three.loc[0:2]
Series([], dtype: float64)
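The empty result on myseries_three is worth a note: a label slice selects whatever labels fall inside the range, and none of myseries_three's labels (10 through 14) fall between 0 and 2, so nothing comes back. Slicing with its actual labels works; unlike positional slices, .loc slices include both endpoints. A quick sketch:
myseries_three.loc[10:12]
10    0.291293
11    0.381014
12    0.923360
dtype: float64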

Why does fillna not work on float values?

I am trying to replace every empty cell in a dataset with the mean of its column.
I use modifiedData = data.fillna(data.mean())
but it works only on the integer columns.
I also have a column with float values, and on it fillna does not work.
Why?
.fillna() fills values that are NaN. NaN can't exist in an int column; the pandas int dtype does not support NaN.
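A quick sketch of that limitation (numpy imported as np): introducing a NaN into otherwise-integer data silently promotes the column to float.
In [6]: pd.Series([1, 2, 3]).dtype
Out[6]: dtype('int64')
In [7]: pd.Series([1, 2, np.nan]).dtype
Out[7]: dtype('float64')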
If you have a column with what seems to be integers, it is more likely an object column. Perhaps even filled with strings. Strings that are empty in some cases.
Empty strings are not filled by .fillna()
In [8]: pd.Series(["2", "1", ""]).fillna(0)
Out[8]:
0 2
1 1
2
dtype: object
An easy way to figure out what's going on is to use the df.Column.isna() method.
If that method gives you all False, you know there are no NaN to fill.
To turn empty strings into NaN values:
In [11]: s = pd.Series(["2", "1", ""])
In [12]: empty_string_mask = s.str.len() == 0
In [21]: s.loc[empty_string_mask] = float('nan')
In [22]: s
Out[22]:
0 2
1 1
2 NaN
dtype: object
After that you can fillna
In [23]: s.fillna(0)
Out[23]:
0 2
1 1
2 0
dtype: object
Another way of going about this problem is to check the dtype:
df.column.dtype
If it says 'object', that confirms your issue.
You can cast the column to a float column:
df.column = df.column.astype(float)
Note that astype(float) raises a ValueError if empty strings are still present, so replace those with NaN first. Manipulating dtypes in pandas usually leads to pains, but this may be an easier route to take for this particular problem.
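A sketch of an alternative that sidesteps the manual masking entirely: pd.to_numeric with errors='coerce' turns anything unparseable, including empty strings, into NaN in one step, after which fillna behaves as expected.
In [24]: s = pd.Series(["2", "1", ""])
In [25]: pd.to_numeric(s, errors='coerce').fillna(0)
Out[25]:
0    2.0
1    1.0
2    0.0
dtype: float64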

Uncomfortable output of mode() in pandas Dataframe

I have a dataframe with several columns (the features).
>>> print(df)
col1 col2
a 1 1
b 2 2
c 3 3
d 3 2
I would like to compute the mode of one of them. This is what happens:
>>> print(df['col1'].mode())
0 3
dtype: int64
I would like to output simply the value 3.
This behaviour is quite strange if you consider that the following, very similar code works:
>>> print(df['col1'].mean())
2.25
So two questions: why does this happen? How can I obtain the pure mode value as it happens for the mean?
Because Series.mode() can return multiple values:
consider the following DF:
In [77]: df
Out[77]:
col1 col2
a 1 1
b 2 2
c 3 3
d 3 2
e 2 3
In [78]: df['col1'].mode()
Out[78]:
0 2
1 3
dtype: int64
From docstring:
Empty if nothing occurs at least 2 times. Always returns Series
even if only one value.
If you want to choose the first value:
In [83]: df['col1'].mode().iloc[0]
Out[83]: 2
In [84]: df['col1'].mode()[0]
Out[84]: 2
I agree that it's too cumbersome:
df['col1'].mode().values[0]
A series can have only one mean(), but it can have more than one mode().
For example, <2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 7, 8> has modes 2, 3, and 4, so the output must be indexed.
mode() will return all values that tie for the most frequent value.
In order to support that functionality, it must return a collection, which takes the form of a DataFrame or Series.
For example, if you had a series:
[2, 2, 3, 3, 5, 5, 6]
Then the most frequent values each occur twice. The result would then be the series [2, 3, 5], since each of these occurs twice.
If you want to collapse this into a single value, you can access the first value, compute the max(), min(), or whatever makes most sense for your application.
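A sketch of those options, reusing the [2, 2, 3, 3, 5, 5, 6] series from above (mode() returns its values sorted, so the first element is the smallest mode):
In [85]: modes = pd.Series([2, 2, 3, 3, 5, 5, 6]).mode()
In [86]: modes.iloc[0]   # first (smallest) mode
Out[86]: 2
In [87]: modes.max()     # largest mode
Out[87]: 5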

Python Pandas drop columns based on max value of column

I'm just getting going with pandas as a tool for munging two-dimensional arrays of data. It's super overwhelming, even after reading the docs. You can do so much that I can't figure out how to do anything, if that makes any sense.
My dataframe (simplified):
Date Stock1 Stock2 Stock3
2014.10.10 74.75 NaN NaN
2014.9.9 NaN 100.95 NaN
2010.8.8 NaN NaN 120.45
So each column only has one value.
I want to remove all columns that have a max value less than x. So say here as an example, if x = 80, then I want a new DataFrame:
Date Stock2 Stock3
2014.10.10 NaN NaN
2014.9.9 100.95 NaN
2010.8.8 NaN 120.45
How can this be achieved? I've looked at dataframe.max(), which gives me a series. Can I use that, or can I use a lambda function somehow in select()?
Use df.max() to index with.
In [19]: from pandas import DataFrame
In [23]: df = DataFrame(np.random.randn(3,3), columns=['a','b','c'])
In [36]: df
Out[36]:
a b c
0 -0.928912 0.220573 1.948065
1 -0.310504 0.847638 -0.541496
2 -0.743000 -1.099226 -1.183567
In [24]: df.max()
Out[24]:
a -0.310504
b 0.847638
c 1.948065
dtype: float64
Next, we make a boolean expression out of this:
In [31]: df.max() > 0
Out[31]:
a False
b True
c True
dtype: bool
Next, you can index df.columns by this (this is called boolean indexing):
In [34]: df.columns[df.max() > 0]
Out[34]: Index([u'b', u'c'], dtype='object')
Which you can finally pass to the DataFrame:
In [35]: df[df.columns[df.max() > 0]]
Out[35]:
b c
0 0.220573 1.948065
1 0.847638 -0.541496
2 -1.099226 -1.183567
Of course, instead of 0, you use any value that you want as the cutoff for dropping.
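If you prefer a single expression, .loc accepts a boolean mask along the column axis, so the intermediate steps collapse into one line (a sketch against the same df, with 0 as the cutoff; substitute your own x):
In [37]: df.loc[:, df.max() > 0]
Out[37]:
b c
0 0.220573 1.948065
1 0.847638 -0.541496
2 -1.099226 -1.183567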

how to preserve pandas dataframe identity when extracting a single row

I am extracting a subset of my dataframe by index using either .xs or .loc (they seem to behave the same). When my condition retrieves multiple rows, the result stays a dataframe. When only a single row is retrieved, it is automatically converted to a series. I don't want that behavior, since that means I need to handle multiple cases downstream (different method sets available for series vs dataframe).
In [1]: df = pd.DataFrame({'a':range(7), 'b':['one']*4 + ['two'] + ['three']*2,
'c':range(10,17)})
In [2]: df.set_index('b', inplace=True)
In [3]: df.xs('one')
Out[3]:
a c
b
one 0 10
one 1 11
one 2 12
one 3 13
In [4]: df.xs('two')
Out[4]:
a 4
c 14
Name: two, dtype: int64
In [5]: type(df.xs('two'))
Out [5]: pandas.core.series.Series
I can manually convert that series back to a dataframe, but it seems cumbersome and will also require case testing to see if I should do that. Is there a cleaner way to just get a dataframe back to begin with?
IIUC, you can simply add brackets, [], and use .loc:
>>> df.loc["two"]
a 4
c 14
Name: two, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>
>>> df.loc[["two"]]
a c
b
two 4 14
[1 rows x 2 columns]
>>> type(_)
<class 'pandas.core.frame.DataFrame'>
This may remind you of how numpy advanced indexing works:
>>> a = np.arange(9).reshape(3,3)
>>> a[1]
array([3, 4, 5])
>>> a[[1]]
array([[3, 4, 5]])
Now this will probably require some refactoring of code so that you're always accessing with a list, even if the list only has one element, but it works well for me in practice.
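If refactoring every call site isn't practical, a retrieved Series can also be promoted back into a one-row DataFrame after the fact (a sketch): to_frame() turns the Series into a one-column frame, and .T transposes it into a one-row frame.
>>> df.xs('two').to_frame().T
     a   c
two  4  14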
