Subtract two DataFrames with non-overlapping indexes - python

I'm trying to subtract one DataFrame from another. I would like to treat missing values as 0. fillna() won't work here because I don't know the common indexes before doing the subtraction:
import pandas as pd
A = pd.DataFrame([1,2], index=['a','b'])
B = pd.DataFrame([3,4], index=['a','c'])
A - B
     0
a -2.0
b  NaN
c  NaN
Ideally, I would like to have:
A - B
   0
a -2
b  2
c -4
Is it possible to get that while keeping the code simple?

You can use the subtract method and specify a fill_value of zero:
A.subtract(B, fill_value=0)
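For reference, running this against the frames above gives (the float dtype comes from the NaN-filling that happens during alignment):
>>> A.subtract(B, fill_value=0)
     0
a -2.0
b  2.0
c -4.0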
Note: the method below, combineAdd, is deprecated from 0.17.0 onwards.
One way is to use the combineAdd method to add -B to A:
>>> A.combineAdd(-B)
   0
a -2
b  2
c -4
With this method, the two DataFrames are added and, at non-matching indices, the result defaults to the value present in either A or -B.
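On modern pandas, where combineAdd no longer exists, the same behaviour can be reproduced with DataFrame.add and a fill_value (a direct equivalent, as far as I know):
>>> A.add(-B, fill_value=0)  # missing labels fall back to the value in A or -B
     0
a -2.0
b  2.0
c -4.0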

Related

Pandas mask with composite expression behaviour

This question was previously asked (and then deleted) by a user. I was looking for a solution so I could post an answer when the question disappeared; moreover, I can't seem to make sense of pandas' behaviour, so I would appreciate some clarity. The original question stated something along the lines of:
How can I replace every negative value except those in a given list with NaN in a Pandas dataframe?
My setup to reproduce the scenario is the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A' : [x for x in range(4)],
'B' : [x for x in range(-2, 2)]
})
This should technically just be a matter of correctly passing a boolean expression as the mask; my attempted solution looks like:
df[df >= 0 | df.isin([-2])]
which produces:
   A    B
0  0  NaN
1  1  NaN
2  2  0.0
3  3  1.0
which also masks out the number in the list!
Moreover, if I mask the dataframe with each of the two conditions separately, I get the correct behaviour:
with df[df >= 0] (identical to the compound result):
   A    B
0  0  NaN
1  1  NaN
2  2  0.0
3  3  1.0
with df[df.isin([-2])] (different from the compound result):
     A    B
0  NaN -2.0
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
So it seems that either:
I am running into some undefined behaviour as a result of performing logic on NaN values, or
I have got something wrong.
Can anyone clarify this situation for me?
Solution
df[(df >= 0) | (df.isin([-2]))]
Explanation
In Python, bitwise OR (|) has higher operator precedence than comparison operators like >=: https://docs.python.org/3/reference/expressions.html#operator-precedence
When filtering a pandas DataFrame on multiple boolean conditions, you need to enclose each condition in parentheses. More from the boolean indexing section of the pandas user guide:
Another common operation is the use of boolean vectors to filter the
data. The operators are: | for or, & for and, and ~ for not. These
must be grouped by using parentheses, since by default Python will
evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
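For completeness, a quick check of the corrected mask against the example frame (output as produced on recent pandas versions; B becomes float because of the NaNs):
>>> df[(df >= 0) | df.isin([-2])]
   A    B
0  0 -2.0
1  1  NaN
2  2  0.0
3  3  1.0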

pandas grabbing value to fill in new columns

I am trying to create two new columns, B and C, with a condition: a value counts as positive (column B) if that day's 'A' is bigger than the previous day's 'A'; otherwise the value counts as negative (column C).
Here is an example of what I am trying to get:
    A       B       C
0.  167765
1.  235353  235353
2.  89260           89260
3.  188382  188382
4.  104677          104677
5.  207723  207723
I notice that this will cause an index error because the number of values in columns B and C will differ from the original column A.
Currently, I am testing moving specific data to column B via the line below, which raises a "Length of values does not match length of index" error:
df['B'] = np.where(df['A'] <= 250000)
How do I accomplish the desired output, where the first row is NA or empty?
Desired output:
    B       C
0.
1.  235353
2.          89260
3.  188382
4.          104677
5.  207723
I'm not able to understand how you got to your final result by the method you're describing.
In my understanding, a value should be placed in column B if it is greater than the value the day before, and otherwise in column C. You may need to correct me or adapt this answer if you meant differently.
The trick is to use .where on a pandas Series object, which inserts the NaNs automatically.
import pandas as pd

df = pd.DataFrame({'A': [167765, 235353, 89260, 188382, 104677, 207723]})
diffs = df['A'].diff()
df['B'] = df['A'].where(diffs >= 0)  # keep A where it rose (or stayed equal), else NaN
df['C'] = df['A'].where(diffs < 0)   # keep A where it fell, else NaN
diffs is going to be the following Series, which also comes with a handy NaN in the first row:
0 NaN
1 67588.0
2 -146093.0
3 99122.0
4 -83705.0
5 103046.0
Name: A, dtype: float64
Comparing with NaN always returns False, so the first row is omitted from both columns by comparing for the positive and the negative cases separately.
The resulting table looks like this:
A B C
0 167765 NaN NaN
1 235353 235353.0 NaN
2 89260 NaN 89260.0
3 188382 188382.0 NaN
4 104677 NaN 104677.0
5 207723 207723.0 NaN
You can also try giving an explicit list of indices:
df['B'] = np.where(df.index.isin([1, 2, 3]), df['A'], np.nan)
df['C'] = np.where(df.index.isin([4, 5]), df['A'], np.nan)
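Hard-coding the index lists is brittle, though. If the rule is the diff-based one from the first answer, the same np.where pattern generalizes (my adaptation, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [167765, 235353, 89260, 188382, 104677, 207723]})
diffs = df['A'].diff()
# The first row's diff is NaN, so both comparisons are False there
# and both B and C receive NaN, matching the desired output.
df['B'] = np.where(diffs >= 0, df['A'], np.nan)
df['C'] = np.where(diffs < 0, df['A'], np.nan)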

Difference between giving pandas a python iterable vs a pd.Series for column

What are some of the differences between passing a List vs a pd.Series type to create a new DataFrame column? For example, from trial-and-error I've noticed:
# (1d) We can also give it a Series, which is quite similar to giving it a List
df['cost1'] = pd.Series([random.choice([1.99,2.99,3.99]) for i in range(len(df))])
df['cost2'] = [random.choice([1.99,2.99,3.99]) for i in range(len(df))]
df['cost3'] = pd.Series([1,2,3]) # <== will pad length with `NaN`
df['cost4'] = [1,2,3] # <== this one will fail because not the same size
Are there any other reasons that pd.Series differs from passing a standard python list? Can a dataframe take any python iterable or are there restrictions on what can be passed to it? Finally, is using pd.Series the 'correct' way to add columns, or can it be used interchangably with other types?
Assigning a list to a DataFrame column requires the list to have the same length as the frame.
When assigning a pd.Series, pandas uses the index as the key to match against the original DataFrame's index, and fills each row with the Series value at the matching index.
df = pd.DataFrame([1, 2, 3], index=[9, 8, 7])
df['New'] = pd.Series([1, 2, 3])
# The default index is a RangeIndex from 0 to n - 1; since the
# DataFrame's index does not match the Series', the column is all NaN.
df
Out[88]:
0 New
9 1 NaN
8 2 NaN
7 3 NaN
With a different length but matching index labels:
df['New']=pd.Series([1,2],index=[9,8])
df
Out[90]:
0 New
9 1 1.0
8 2 2.0
7 3 NaN
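As a follow-up sketch (my addition, not from the answer above): if you want the values assigned positionally rather than by label, the usual idiom is to strip the index off the Series first.
import pandas as pd

df = pd.DataFrame([1, 2, 3], index=[9, 8, 7])
# Label-based: the Series' default RangeIndex (0, 1, 2) shares no
# labels with the frame's index (9, 8, 7), so the column is all NaN.
df['by_label'] = pd.Series([10, 20, 30])
# Positional: .to_numpy() drops the index, so values fill top to bottom.
df['by_position'] = pd.Series([10, 20, 30]).to_numpy()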

overwrite slice of multi-index dataframe with series

I have a multi-index DataFrame and want to set a slice of one of its columns equal to a Series, matched up by index: the column slice's innermost index and the Series' index contain the same labels, just in a different order (see example below).
I can do this by first sorting the series' index according to the column's index and then using series.values (see below), but this feels like a workaround and I was wondering if it's possible to directly assign the series to the column slice.
example:
import pandas as pd

multi_index = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])
df = pd.DataFrame(0, multi_index, ['p', 'q'])
s1 = pd.Series([1, 2], ['y', 'x'])
df.loc['a', 'p'] = s1[df.loc['a', 'p'].index].values
The code above gives the desired output, but I was wondering if the last line could be done more simply, e.g.:
df.loc['a','p']=s1
but this sets the column slice to NaNs.
Desired output:
p q
a x 2 0
y 1 0
b x 0 0
y 0 0
Output obtained from df.loc['a','p'] = s1:
p q
a x NaN 0
y NaN 0
b x 0.0 0
y 0.0 0
It seems like a simple issue to me but I haven't been able to find the answer anywhere.
Have you tried something like this?
df.loc['a']['p'] = s1
The resulting df is:
p q
a x 2 0
y 1 0
b x 0 0
y 0 0
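One caveat (mine, not the answerer's): df.loc['a']['p'] = s1 is chained indexing, which can raise SettingWithCopyWarning and will not write back to df once copy-on-write is enabled (opt-in in pandas 2.x, planned as the default in pandas 3). A variant that writes through a single .loc call, reusing the question's own reindexing idea:
import pandas as pd

multi_index = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])
df = pd.DataFrame(0, multi_index, ['p', 'q'])
s1 = pd.Series([1, 2], ['y', 'x'])
# Reorder s1 to the slice's inner index, then assign the raw values.
df.loc['a', 'p'] = s1.reindex(df.loc['a', 'p'].index).to_numpy()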

How do I find duplicate indices in a DataFrame?

I have a pandas DataFrame with a multi-level index ("instance" and "index"). I want to find all the first-level ("instance") index values which are non-unique and to print out those values.
My frame looks like this:
                A
instance index
a        1     10
         2     12
         3      4
b        1     12
         2      5
         3      2
b        1     12
         2      5
         3      2
I want to find "b" as the duplicated level-0 index value and print it out ("b").
You can use the get_duplicates() method:
>>> df.index.get_level_values('instance').get_duplicates()
[0, 1]
(In my example data 0 and 1 both appear multiple times.)
The get_level_values() method can accept a label (such as 'instance') or an integer and retrieves the relevant part of the MultiIndex.
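Note that Index.get_duplicates() was deprecated in pandas 0.23 and later removed. On current versions, an equivalent built on Index.duplicated() might look like this (my sketch, reconstructing the question's frame):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('a', 3),
     ('b', 1), ('b', 2), ('b', 3),
     ('b', 1), ('b', 2), ('b', 3)],
    names=['instance', 'index'])
df = pd.DataFrame({'A': [10, 12, 4, 12, 5, 2, 12, 5, 2]}, index=idx)

# Level-0 values whose (instance, index) pairs occur more than once.
lvl = df.index.get_level_values('instance')
print(lvl[df.index.duplicated()].unique())
# Index(['b'], dtype='object', name='instance')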
Assuming that your df has an index made of 'instance' and 'index' you could do this:
df1 = df.reset_index().pivot_table(index=['instance','index'], values='A', aggfunc='count')
df1[df1 > 1].index.get_level_values(0).drop_duplicates()
Which yields:
Index([u'b'], dtype='object')
Adding .values at the end (.drop_duplicates().values) will make an array:
array(['b'], dtype=object)
Or the same in one line using .groupby:
df[df.groupby(level=['instance','index']).count() > 1].dropna().index.get_level_values(0).drop_duplicates()
This gives you whole rows rather than just the values, which isn't quite what you asked for but might be close enough:
df[df.index.get_level_values('instance').duplicated()]
You want the duplicated method; since 'instance' is an index level rather than a column here, apply it to that level:
df.index.get_level_values('instance').duplicated()
