Pandas: summing multiple DataFrames with different columns - python

I have the following DataFrames:
A =
   0  1  2
0  1  1  1
1  1  1  1
2  1  1  1

B =
   0  5
0  1  1
5  1  1
I want to 'join' these two frames so that:
A + B =
   0  1  2  5
0  2  1  1  1
1  1  1  1  0
2  1  1  1  0
5  1  0  0  1
where A + B is a new DataFrame.

Using add:
df1.add(df2, fill_value=0).fillna(0)
Out[217]:
     0    1    2    5
0  2.0  1.0  1.0  1.0
1  1.0  1.0  1.0  0.0
2  1.0  1.0  1.0  0.0
5  1.0  0.0  0.0  1.0
If you need integers:
df1.add(df2, fill_value=0).fillna(0).astype(int)
Out[242]:
   0  1  2  5
0  2  1  1  1
1  1  1  1  0
2  1  1  1  0
5  1  0  0  1

import numpy as np
import pandas as pd
A = pd.DataFrame(np.ones(9).reshape(3, 3))
B = pd.DataFrame(np.ones(4).reshape(2, 2), columns=[0, 5], index=[0, 5])
A.add(B, fill_value=0).fillna(0)
[Out]
     0    1    2    5
0  2.0  1.0  1.0  1.0
1  1.0  1.0  1.0  0.0
2  1.0  1.0  1.0  0.0
5  1.0  0.0  0.0  1.0
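Equivalently, you can align both frames to the union of labels yourself with reindex and then add; since no NaN is ever introduced, the integer dtype survives without an astype (a sketch, not from the original answers):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.ones((3, 3), dtype=int))
B = pd.DataFrame(np.ones((2, 2), dtype=int), columns=[0, 5], index=[0, 5])

# Align both frames to the union of row/column labels, filling gaps with 0,
# then add elementwise; no NaN ever appears, so the result stays int.
idx = A.index.union(B.index)
cols = A.columns.union(B.columns)
result = (A.reindex(index=idx, columns=cols, fill_value=0)
          + B.reindex(index=idx, columns=cols, fill_value=0))
```

The two reindex calls do the same label alignment that add(..., fill_value=0) does implicitly, just one step earlier.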

Related

Calculating differences between rows within groups using pandas

I want to group by the id column in this dataframe:
   id  a  b  c
0   1  1  6  2
1   1  2  5  2
2   2  3  4  2
3   2  4  3  2
4   3  5  2  2
5   3  6  1  2
and add the differences between rows for the same column and group as additional columns to end up with this dataframe:
   id  a  b  c  a_diff  b_diff  c_diff
0   1  1  6  2    -1.0     1.0     0.0
1   1  2  5  2     1.0    -1.0     0.0
2   2  3  4  2    -1.0     1.0     0.0
3   2  4  3  2     1.0    -1.0     0.0
4   3  5  2  2    -1.0     1.0     0.0
5   3  6  1  2     1.0    -1.0     0.0
Data:
df = pd.DataFrame({'id': [1,1,2,2,3,3], 'a': [1,2,3,4,5,6],'b': [6,5,4,3,2,1], 'c': [2,2,2,2,2,2]})
Your desired output doesn't make much sense, but I can force it there with:
df[['a_diff', 'b_diff', 'c_diff']] = df.groupby('id').transform(lambda x: x.diff(1).fillna(x.diff(-1)))
Output:
   id  a  b  c  a_diff  b_diff  c_diff
0   1  1  6  2    -1.0     1.0     0.0
1   1  2  5  2     1.0    -1.0     0.0
2   2  3  4  2    -1.0     1.0     0.0
3   2  4  3  2     1.0    -1.0     0.0
4   3  5  2  2    -1.0     1.0     0.0
5   3  6  1  2     1.0    -1.0     0.0
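Put together as a self-contained script (same data and one-liner as above): diff(1) leaves NaN in each group's first row, and diff(-1) supplies the backward difference there.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'a': [1, 2, 3, 4, 5, 6],
                   'b': [6, 5, 4, 3, 2, 1],
                   'c': [2, 2, 2, 2, 2, 2]})

# Per group: forward difference, with each first row's NaN filled by the
# backward difference (x.diff(-1)[i] = x[i] - x[i+1]).
diffs = df.groupby('id').transform(lambda x: x.diff().fillna(x.diff(-1)))
df[['a_diff', 'b_diff', 'c_diff']] = diffs
```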

Efficient way to get 2d array (kind of adjacency matrix) from 1d array

For an array, say, a = np.array([1,2,1,0,0,1,1,2,2,2]), something like an adjacency "matrix" A needs to be created. I.e. A is a symmetric (n, n) numpy array where n = len(a) and A[i,j] = 1 if a[i] == a[j] and 0 otherwise (i = 0...n-1 and j = 0...n-1):
   0  1  2  3  4  5  6  7  8  9
0  1  0  1  0  0  1  1  0  0  0
1     1  0  0  0  0  0  1  1  1
2        1  0  0  1  1  0  0  0
3           1  1  0  0  0  0  0
4              1  0  0  0  0  0
5                 1  1  0  0  0
6                    1  0  0  0
7                       1  1  1
8                          1  1
9                             1
The trivial solution is
n = len(a)
A = np.zeros([n, n]).astype(int)
for i in range(n):
    for j in range(n):
        if a[i] == a[j]:
            A[i, j] = 1
        else:
            A[i, j] = 0
Can this be done in a numpy way, i.e. without loops?
You can use numpy broadcasting:
b = (a[:,None]==a).astype(int)
df = pd.DataFrame(b)
output:
0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 1 1 0 0 0
1 0 1 0 0 0 0 0 1 1 1
2 1 0 1 0 0 1 1 0 0 0
3 0 0 0 1 1 0 0 0 0 0
4 0 0 0 1 1 0 0 0 0 0
5 1 0 1 0 0 1 1 0 0 0
6 1 0 1 0 0 1 1 0 0 0
7 0 1 0 0 0 0 0 1 1 1
8 0 1 0 0 0 0 0 1 1 1
9 0 1 0 0 0 0 0 1 1 1
If you want the upper triangle only, blank the lower triangle with numpy.tril_indices_from:
b = (a[:,None]==a).astype(float)
b[np.tril_indices_from(b, k=-1)] = np.nan
df = pd.DataFrame(b)
output:
0 1 2 3 4 5 6 7 8 9
0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
1 NaN 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
2 NaN NaN 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
3 NaN NaN NaN 1.0 1.0 0.0 0.0 0.0 0.0 0.0
4 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 0.0
5 NaN NaN NaN NaN NaN 1.0 1.0 0.0 0.0 0.0
6 NaN NaN NaN NaN NaN NaN 1.0 0.0 0.0 0.0
7 NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0
8 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
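To check that the broadcast one-liner really matches the double loop, run both on the example array and compare (a quick sanity check, not part of the original answer):

```python
import numpy as np

a = np.array([1, 2, 1, 0, 0, 1, 1, 2, 2, 2])

# Broadcasting: an (n, 1) column compared against an (n,) row yields (n, n).
A_fast = (a[:, None] == a).astype(int)

# Reference double loop.
n = len(a)
A_slow = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        A_slow[i, j] = int(a[i] == a[j])
```

Both produce the same symmetric 0/1 matrix; the broadcast version simply avoids the Python-level loops.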

How to insert list of values into null values of a column in python?

I am new to pandas and am facing an issue with null values. I have a list of 3 values that have to be inserted into a column's missing values. How do I do that?
In [57]: df
Out[57]:
   a    b  c  d
0  0  1.0  2  3
1  0  NaN  0  1
2  0  NaN  3  4
3  0  1.0  2  5
4  0  NaN  2  6
In [58]: list = [11,22,44]
The output I want
Out[57]:
a b c d
0 0 1 2 3
1 0 11 0 1
2 0 22 3 4
3 0 1 2 5
4 0 44 2 6
If your list is the same length as the number of NaNs:
l=[11,22,44]
df.loc[df['b'].isna(),'b'] = l
print(df)
   a     b  c  d
0  0   1.0  2  3
1  0  11.0  0  1
2  0  22.0  3  4
3  0   1.0  2  5
4  0  44.0  2  6
Try with stack to flatten the frame, assign the values, then unstack back:
l = [11, 22, 44]
s = df.stack(dropna=False)
s.loc[s.isna()] = l  # use the name l: rebinding the built-in name list shadows it and causes confusing errors later
df = s.unstack()
df
Out[178]:
     a     b    c    d
0  0.0   1.0  2.0  3.0
1  0.0  11.0  0.0  1.0
2  0.0  22.0  3.0  4.0
3  0.0   1.0  2.0  5.0
4  0.0  44.0  2.0  6.0
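A third variant (my sketch, not from the answers above): build a Series indexed by the NaN positions and let fillna align it on the index. The helper name fill is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 0],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [2, 0, 3, 2, 2],
                   'd': [3, 1, 4, 5, 6]})
l = [11, 22, 44]

# Series whose index is exactly the rows where b is NaN; fillna aligns on it,
# so only those rows receive replacement values.
fill = pd.Series(l, index=df.index[df['b'].isna()])
df['b'] = df['b'].fillna(fill)
```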

Pandas Dataframe of staggered zeros

I'm building a Monte Carlo model and need to model how many new items I capture each month, for a given number of months. Each month I add a random number of items with a known mean and stdev.
months = ['2017-03', '2017-04', '2017-05']
new = np.random.normal(4, 3, size=len(months)).round()
print(new)
[ 1.  5.  4.]
df_new = pd.DataFrame(list(zip(months, new)), columns=['Period', 'newPats'])
print(df_new)
Period newPats
0 2017-03 1.0
1 2017-04 5.0
2 2017-05 4.0
I need to transform this into an item x month dataframe, where the value is a zero until the month that the given item starts.
Here's the shape I have:
df_full = pd.DataFrame(np.ones((int(new.sum()), len(months))), columns=months)
2017-03 2017-04 2017-05
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 1.0 1.0 1.0
8 1.0 1.0 1.0
9 1.0 1.0 1.0
and here's the output I need:
# perform transformation
print(df_out)
2017-03 2017-04 2017-05
0 1 1 1
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 0 1
The rule is that there was 1 item added in 2017-03, so all periods = 1 for the first record. The next 5 items were added in 2017-04, so all prior periods = 0. The final 4 items were added in 2017-05, so they are only = 1 in the last month. This is going into a monte carlo simulation which will be run thousands of times, so I can't manually iterate over the columns/rows - any vectorized suggestions for how to handle?
Beat you all to it.
df_out = pd.DataFrame([int(new[:x+1].sum()) * [1] + int(new.sum() - new[:x+1].sum()) * [0] for x in range(len(months))]).transpose()
df_out.columns = months
print(df_out)
2017-03 2017-04 2017-05
0 1 1 1
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 0 1
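A fully vectorized alternative (my sketch, not the accepted answer): compute each item's starting month with np.repeat, then broadcast a comparison against the column positions. Fixed counts [1, 5, 4] stand in for the random draw so the result is reproducible:

```python
import numpy as np
import pandas as pd

months = ['2017-03', '2017-04', '2017-05']
new = np.array([1., 5., 4.])  # stands in for the rounded random draw

counts = new.astype(int)
# starts[k] = index of the month in which item k first appears.
starts = np.repeat(np.arange(len(months)), counts)
# Item row k is 1 for every month at or after its start month.
df_out = pd.DataFrame((np.arange(len(months)) >= starts[:, None]).astype(int),
                      columns=months)
```

Because everything stays inside NumPy, this scales cleanly when the simulation is run thousands of times.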

Sum values from DataFrame into Parent Index - Python/Pandas

I'm working with Mint transaction data and trying to sum the values from each category into its parent category.
I have a DataFrame mint_data that is created from all my Mint transactions:
mint_data = transactions_data.pivot(index='Category', columns='Date', values='Amount')
(mint_data image)
And a dict with Category:Parent pairs (this uses xlwings to pull from an Excel sheet):
cat_parent = cats_sheet.range('A1').expand().options(dict).value
(Cat:Parent image)
I'm not sure how to go about looping through the mint_data df and summing amounts into the parent category. I would like to keep the data frame format exactly the same, just replacing the parent values.
Here is an example df:
       A  B  C  D  E
par_a  0  0  5  0  0
cat1a  5  2  3  2  1
cat2a  0  1  2  1  0
par_b  1  0  1  1  2
cat1b  0  1  2  1  0
cat2b  1  1  1  1  1
cat3b  0  1  2  1  0
I also have a dict with
{'par_a': 'par_a',
'cat1a': 'par_a',
'cat2a': 'par_a',
'par_b': 'par_b',
'cat1b': 'par_b',
'cat2b': 'par_b',
'cat3b': 'par_b'}
I am trying to get the DataFrame to end up with
       A  B   C  D  E
par_a  5  3  10  3  1
cat1a  5  2   3  2  1
cat2a  0  1   2  1  0
par_b  2  3   6  4  3
cat1b  0  1   2  1  0
cat2b  1  1   1  1  1
cat3b  0  1   2  1  0
Let's call your dictionary "dct" and then make a new column that maps to the parent:
>>> df['parent'] = df.reset_index()['index'].map(dct).values
A B C D E parent
par_a 0 0 5 0 0 par_a
cat1a 5 2 3 2 1 par_a
cat2a 0 1 2 1 0 par_a
par_b 1 0 1 1 2 par_b
cat1b 0 1 2 1 0 par_b
cat2b 1 1 1 1 1 par_b
cat3b 0 1 2 1 0 par_b
Then sum by parent:
>>> df_sum = df.groupby('parent').sum()
A B C D E
parent
par_a 5 3 10 3 1
par_b 2 3 6 4 3
In many cases you would stop there, but since you want to combine the parent/child data, you need some sort of merge. combine_first will work well here since it will selectively update in the direction you want:
>>> df_new = df_sum.combine_first(df)
A B C D E parent
cat1a 5.0 2.0 3.0 2.0 1.0 par_a
cat1b 0.0 1.0 2.0 1.0 0.0 par_b
cat2a 0.0 1.0 2.0 1.0 0.0 par_a
cat2b 1.0 1.0 1.0 1.0 1.0 par_b
cat3b 0.0 1.0 2.0 1.0 0.0 par_b
par_a 5.0 3.0 10.0 3.0 1.0 par_a
par_b 2.0 3.0 6.0 4.0 3.0 par_b
You mentioned a multi-index in a comment, so you may prefer to organize it more like this:
>>> df_new.reset_index().set_index(['parent','index']).sort_index()
A B C D E
parent index
par_a cat1a 5.0 2.0 3.0 2.0 1.0
cat2a 0.0 1.0 2.0 1.0 0.0
par_a 5.0 3.0 10.0 3.0 1.0
par_b cat1b 0.0 1.0 2.0 1.0 0.0
cat2b 1.0 1.0 1.0 1.0 1.0
cat3b 0.0 1.0 2.0 1.0 0.0
par_b 2.0 3.0 6.0 4.0 3.0
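The whole transformation can also be sketched without the helper column, grouping directly on the mapped index and writing the sums back over the parent rows (dct as in the question; the .loc write-back is my variant of the combine_first step):

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 5, 0, 0], [5, 2, 3, 2, 1], [0, 1, 2, 1, 0],
     [1, 0, 1, 1, 2], [0, 1, 2, 1, 0], [1, 1, 1, 1, 1], [0, 1, 2, 1, 0]],
    index=['par_a', 'cat1a', 'cat2a', 'par_b', 'cat1b', 'cat2b', 'cat3b'],
    columns=list('ABCDE'))
dct = {'par_a': 'par_a', 'cat1a': 'par_a', 'cat2a': 'par_a',
       'par_b': 'par_b', 'cat1b': 'par_b', 'cat2b': 'par_b', 'cat3b': 'par_b'}

# Sum every row into its parent label, then overwrite only the parent rows;
# the .loc assignment aligns the sums on index and columns.
sums = df.groupby(df.index.map(dct)).sum()
df.loc[sums.index] = sums
```

The child rows are untouched, so the frame keeps its original shape with only the parent rows replaced.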
