Pandas: summing multiple DataFrames with different columns

I have the following DataFrames:
A =
0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
B =
0 5
0 1 1
5 1 1
I want to 'join' these two frames such that:
A + B =
0 1 2 5
0 2 1 1 1
1 1 1 1 0
2 1 1 1 0
5 1 0 0 1
where A+B is a new dataframe

Using add (df1 and df2 here are the question's A and B):
df1.add(df2,fill_value=0).fillna(0)
Out[217]:
0 1 2 5
0 2.0 1.0 1.0 1.0
1 1.0 1.0 1.0 0.0
2 1.0 1.0 1.0 0.0
5 1.0 0.0 0.0 1.0
If you need int
df1.add(df2,fill_value=0).fillna(0).astype(int)
Out[242]:
0 1 2 5
0 2 1 1 1
1 1 1 1 0
2 1 1 1 0
5 1 0 0 1

import numpy as np
import pandas as pd
A = pd.DataFrame(np.ones(9).reshape(3, 3))
B = pd.DataFrame(np.ones(4).reshape(2, 2), columns=[0, 5], index=[0, 5])
A.add(B, fill_value=0).fillna(0)
[Out]
0 1 2 5
0 2.0 1.0 1.0 1.0
1 1.0 1.0 1.0 0.0
2 1.0 1.0 1.0 0.0
5 1.0 0.0 0.0 1.0
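
Since the title asks about multiple DataFrames, here is a minimal sketch of how the same add call might be folded over a list of frames with functools.reduce. The third frame C is hypothetical and only there for illustration. Note that fill_value=0 only helps where at least one of the two operands has a value; cells missing from both stay NaN, which is why the trailing fillna(0) is still needed.

```python
from functools import reduce

import numpy as np
import pandas as pd

A = pd.DataFrame(np.ones((3, 3), dtype=int))
B = pd.DataFrame(np.ones((2, 2), dtype=int), columns=[0, 5], index=[0, 5])
# Hypothetical third frame, not part of the question
C = pd.DataFrame(np.ones((1, 1), dtype=int), columns=[2], index=[5])

frames = [A, B, C]

# Pairwise add over the union of row and column labels.
total = reduce(lambda left, right: left.add(right, fill_value=0), frames)

# Cells that no frame covers are still NaN here; zero them and restore ints.
total = total.fillna(0).astype(int)
print(total)
```

With these three frames, row 5 ends up as 1, 0, 1, 1 because C contributes a 1 at row 5, column 2.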

Related

Calculating differences between rows within groups using pandas

I want to group by the id column in this dataframe:
id a b c
0 1 1 6 2
1 1 2 5 2
2 2 3 4 2
3 2 4 3 2
4 3 5 2 2
5 3 6 1 2
and add the differences between rows for the same column and group as additional columns to end up with this dataframe:
id a b c a_diff b_diff c_diff
0 1 1 6 2 -1.0 1.0 0.0
1 1 2 5 2 1.0 -1.0 0.0
2 2 3 4 2 -1.0 1.0 0.0
3 2 4 3 2 1.0 -1.0 0.0
4 3 5 2 2 -1.0 1.0 0.0
5 3 6 1 2 1.0 -1.0 0.0
Here is the data:
df = pd.DataFrame({'id': [1,1,2,2,3,3], 'a': [1,2,3,4,5,6],'b': [6,5,4,3,2,1], 'c': [2,2,2,2,2,2]})

Your desired output doesn't make much sense, but I can force it there with:
df[['a_diff', 'b_diff', 'c_diff']] = df.groupby('id').transform(lambda x: x.diff(1).fillna(x.diff(-1)))
Output:
id a b c a_diff b_diff c_diff
0 1 1 6 2 -1.0 1.0 0.0
1 1 2 5 2 1.0 -1.0 0.0
2 2 3 4 2 -1.0 1.0 0.0
3 2 4 3 2 1.0 -1.0 0.0
4 3 5 2 2 -1.0 1.0 0.0
5 3 6 1 2 1.0 -1.0 0.0
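
A variation on the same idea, as a sketch: the groupby object has its own diff method, so the lambda can be avoided. The backward diff fills the first row of each group, where the forward diff is NaN. The names cols, diffs and result are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'a': [1, 2, 3, 4, 5, 6],
                   'b': [6, 5, 4, 3, 2, 1],
                   'c': [2, 2, 2, 2, 2, 2]})

cols = ['a', 'b', 'c']
grouped = df.groupby('id')[cols]

# Forward diff leaves NaN in the first row of each group;
# fill those gaps with the backward diff (current row minus next row).
diffs = grouped.diff().fillna(grouped.diff(-1))

# Attach the new columns without touching the originals.
result = df.join(diffs.add_suffix('_diff'))
print(result)
```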

Efficient way to get 2d array (kind of adjacency matrix) from 1d array

For an array, say, a = np.array([1,2,1,0,0,1,1,2,2,2]), something like an adjacency "matrix" A needs to be created. I.e. A is a symmetric (n, n) numpy array where n = len(a) and A[i,j] = 1 if a[i] == a[j] and 0 otherwise (i = 0...n-1 and j = 0...n-1):
   0  1  2  3  4  5  6  7  8  9
0  1  0  1  0  0  1  1  0  0  0
1     1  0  0  0  0  0  1  1  1
2        1  0  0  1  1  0  0  0
3           1  1  0  0  0  0  0
4              1  0  0  0  0  0
5                 1  1  0  0  0
6                    1  0  0  0
7                       1  1  1
8                          1  1
9                             1
The trivial solution is
n = len(a)
A = np.zeros([n, n]).astype(int)
for i in range(n):
    for j in range(n):
        if a[i] == a[j]:
            A[i, j] = 1
        else:
            A[i, j] = 0
Can this be done in a numpy way, i.e. without loops?

You can use numpy broadcasting:
b = (a[:,None]==a).astype(int)
df = pd.DataFrame(b)
output:
0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 1 1 0 0 0
1 0 1 0 0 0 0 0 1 1 1
2 1 0 1 0 0 1 1 0 0 0
3 0 0 0 1 1 0 0 0 0 0
4 0 0 0 1 1 0 0 0 0 0
5 1 0 1 0 0 1 1 0 0 0
6 1 0 1 0 0 1 1 0 0 0
7 0 1 0 0 0 0 0 1 1 1
8 0 1 0 0 0 0 0 1 1 1
9 0 1 0 0 0 0 0 1 1 1
If you want the upper triangle only, use numpy.tril_indices_from to set the entries below the diagonal to NaN (this needs a float array):
b = (a[:,None]==a).astype(float)
b[np.tril_indices_from(b, k=-1)] = np.nan
df = pd.DataFrame(b)
output:
0 1 2 3 4 5 6 7 8 9
0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
1 NaN 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
2 NaN NaN 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
3 NaN NaN NaN 1.0 1.0 0.0 0.0 0.0 0.0 0.0
4 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 0.0
5 NaN NaN NaN NaN NaN 1.0 1.0 0.0 0.0 0.0
6 NaN NaN NaN NaN NaN NaN 1.0 0.0 0.0 0.0
7 NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0
8 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
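
If zeros below the diagonal are acceptable instead of NaN, np.triu keeps the array as integers. A small sketch, which also checks the broadcast result against the loop version from the question (the names full, upper and expected are illustrative):

```python
import numpy as np

a = np.array([1, 2, 1, 0, 0, 1, 1, 2, 2, 2])

# a[:, None] has shape (n, 1) and a has shape (n,), so the comparison
# broadcasts to an (n, n) boolean array of all pairwise equalities.
full = (a[:, None] == a).astype(int)

# Keep the diagonal and everything above it; entries below become 0.
upper = np.triu(full)

# Reference result from the double loop in the question.
n = len(a)
expected = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        expected[i, j] = int(a[i] == a[j])

print(np.array_equal(full, expected))
print(upper)
```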

How to insert list of values into null values of a column in python?

I am new to pandas and am having trouble with null values. I have a list of 3 values that has to be inserted into the missing values of a column. How do I do that?
In [57]: df
Out[57]:
a b c d
0 0 1 2 3
1 0 NaN 0 1
2 0 NaN 3 4
3 0 1 2 5
4 0 NaN 2 6
In [58]: list = [11,22,44]
The output I want
Out[57]:
a b c d
0 0 1 2 3
1 0 11 0 1
2 0 22 3 4
3 0 1 2 5
4 0 44 2 6

If your list has the same length as the number of NaN values:
l=[11,22,44]
df.loc[df['b'].isna(),'b'] = l
print(df)
a b c d
0 0 1.0 2 3
1 0 11.0 0 1
2 0 22.0 3 4
3 0 1.0 2 5
4 0 44.0 2 6
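
The assignment only works when the counts match, so it may be worth checking that first. A minimal sketch that rebuilds the frame from the question, guards against a length mismatch, and casts the column back to int once no NaN is left (the names values and mask are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 0],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [2, 0, 3, 2, 2],
                   'd': [3, 1, 4, 5, 6]})
values = [11, 22, 44]

# Boolean mask of the rows where b is missing
mask = df['b'].isna()

# Fail early with a clear message if the counts do not match
if mask.sum() != len(values):
    raise ValueError(f"{mask.sum()} missing values but {len(values)} replacements")

# Values are assigned to the masked rows in order, top to bottom
df.loc[mask, 'b'] = values

# The column was float because of NaN; it can be int again now
df['b'] = df['b'].astype(int)
print(df)
```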

Try stack, assign the values, then unstack back. Note that this fills every NaN in the frame, not only those in column b:
s = df.stack(dropna=False)
s.loc[s.isna()] = l  # the list is named l instead of list, so the Python built-in is not shadowed
df = s.unstack()
df
Out[178]:
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 11.0 0.0 1.0
2 0.0 22.0 3.0 4.0
3 0.0 1.0 2.0 5.0
4 0.0 44.0 2.0 6.0

Pandas Dataframe of staggered zeros

I'm building a Monte Carlo model and need to model how many new items I capture each month, for a given number of months. Each month I add a random number of items with a known mean and stdev.
months = ['2017-03','2017-04','2017-05']
new = np.random.normal(4,3,size = len(months)).round()
print(new)
[ 1. 5. 4.]
df_new = pd.DataFrame(list(zip(months,new)),columns = ['Period','newPats'])
print(df_new)
Period newPats
0 2017-03 1.0
1 2017-04 5.0
2 2017-05 4.0
I need to transform this into an item x month dataframe, where the value is a zero until the month that the given item starts.
Here's the shape I have:
df_full = pd.DataFrame(np.ones((int(new.sum()), len(months))),columns = months)
2017-03 2017-04 2017-05
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 1.0 1.0 1.0
8 1.0 1.0 1.0
9 1.0 1.0 1.0
and here's the output I need:
#perform transformation
print(df_out)
2017-03 2017-04 2017-05
0 1 1 1
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 0 1
The rule is that there was 1 item added in 2017-03, so all periods = 1 for the first record. The next 5 items were added in 2017-04, so all prior periods = 0. The final 4 items were added in 2017-05, so they are only = 1 in the last month. This is going into a monte carlo simulation which will be run thousands of times, so I can't manually iterate over the columns/rows - any vectorized suggestions for how to handle?

Beat you all to it.
counts = new.astype(int)  # list repetition needs integer counts, and new holds floats
df_out = pd.DataFrame([[1] * counts[:x+1].sum() + [0] * (counts.sum() - counts[:x+1].sum()) for x in range(len(months))]).transpose()
df_out.columns = months
print(df_out)
2017-03 2017-04 2017-05
0 1 1 1
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 0 1
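
Since the question asks for something vectorized, here is a sketch without the list comprehension, using np.repeat and a broadcast comparison. It uses the fixed counts 1, 5, 4 from the example instead of random draws so the result is reproducible. With real draws from np.random.normal the rounded values would need to be cast to int, and negative draws clipped to 0, before np.repeat accepts them; that handling is an assumption about what the model wants.

```python
import numpy as np
import pandas as pd

months = ['2017-03', '2017-04', '2017-05']
# Fixed counts from the example; in the simulation these come from the random draw
new = np.array([1, 5, 4])

# For every item, the index of the month in which it was added:
# [0, 1, 1, 1, 1, 1, 2, 2, 2, 2]
start = np.repeat(np.arange(len(months)), new)

# An item is present (1) in month j if j is at or after its start month.
# start[:, None] has shape (items, 1), so this broadcasts to (items, months).
present = (np.arange(len(months)) >= start[:, None]).astype(int)

df_out = pd.DataFrame(present, columns=months)
print(df_out)
```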

Sum values from DataFrame into Parent Index - Python/Pandas

I'm working with Mint transaction data and trying to sum the values from each category into its parent category.
I have a dataframe mint_data that is created from all my Mint transactions:
mint_data = tranactions_data.pivot(index='Category', columns='Date', values='Amount')
[image: mint_data]
And a dict with Category:Parent pairs (this uses xlwings to pull from excel sheet)
cat_parent = cats_sheet.range('A1').expand().options(dict).value
[image: Cat:Parent]
I'm not sure how to go about looping through the mint_data df and summing amounts into the parent category. I would like to keep the data frame format exactly the same, just replacing the parent values.
Here is an example df:
A B C D E
par_a 0 0 5 0 0
cat1a 5 2 3 2 1
cat2a 0 1 2 1 0
par_b 1 0 1 1 2
cat1b 0 1 2 1 0
cat2b 1 1 1 1 1
cat3b 0 1 2 1 0
I also have a dict that maps each category to its parent:
{'par_a': 'par_a',
'cat1a': 'par_a',
'cat2a': 'par_a',
'par_b': 'par_b',
'cat1b': 'par_b',
'cat2b': 'par_b',
'cat3b': 'par_b'}
I am trying to get the dataframe to end up with
A B C D E
par_a 5 3 10 3 1
cat1a 5 2 3 2 1
cat2a 0 1 2 1 0
par_b 2 3 6 4 3
cat1b 0 1 2 1 0
cat2b 1 1 1 1 1
cat3b 0 1 2 1 0

Let's call your dictionary "dct" and then make a new column that maps to the parent:
>>> df['parent'] = df.reset_index()['index'].map(dct).values
A B C D E parent
par_a 0 0 5 0 0 par_a
cat1a 5 2 3 2 1 par_a
cat2a 0 1 2 1 0 par_a
par_b 1 0 1 1 2 par_b
cat1b 0 1 2 1 0 par_b
cat2b 1 1 1 1 1 par_b
cat3b 0 1 2 1 0 par_b
Then sum by parent:
>>> df_sum = df.groupby('parent').sum()
A B C D E
parent
par_a 5 3 10 3 1
par_b 2 3 6 4 3
In many cases you would stop there, but since you want to combine the parent/child data, you need some sort of merge. combine_first will work well here since it will selectively update in the direction you want:
>>> df_new = df_sum.combine_first(df)
A B C D E parent
cat1a 5.0 2.0 3.0 2.0 1.0 par_a
cat1b 0.0 1.0 2.0 1.0 0.0 par_b
cat2a 0.0 1.0 2.0 1.0 0.0 par_a
cat2b 1.0 1.0 1.0 1.0 1.0 par_b
cat3b 0.0 1.0 2.0 1.0 0.0 par_b
par_a 5.0 3.0 10.0 3.0 1.0 par_a
par_b 2.0 3.0 6.0 4.0 3.0 par_b
You mentioned a multi-index in a comment, so you may prefer to organize it more like this:
>>> df_new.reset_index().set_index(['parent','index']).sort_index()
A B C D E
parent index
par_a cat1a 5.0 2.0 3.0 2.0 1.0
cat2a 0.0 1.0 2.0 1.0 0.0
par_a 5.0 3.0 10.0 3.0 1.0
par_b cat1b 0.0 1.0 2.0 1.0 0.0
cat2b 1.0 1.0 1.0 1.0 1.0
cat3b 0.0 1.0 2.0 1.0 0.0
par_b 2.0 3.0 6.0 4.0 3.0
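
If the goal is to keep the original row order and integer dtype, as in the desired output from the question, one possible sketch is to group by the dict directly (groupby accepts a mapping from index labels to group keys) and write the sums back into the parent rows. The names sums and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 5, 0, 0],
     [5, 2, 3, 2, 1],
     [0, 1, 2, 1, 0],
     [1, 0, 1, 1, 2],
     [0, 1, 2, 1, 0],
     [1, 1, 1, 1, 1],
     [0, 1, 2, 1, 0]],
    index=['par_a', 'cat1a', 'cat2a', 'par_b', 'cat1b', 'cat2b', 'cat3b'],
    columns=list('ABCDE'))

dct = {'par_a': 'par_a', 'cat1a': 'par_a', 'cat2a': 'par_a',
       'par_b': 'par_b', 'cat1b': 'par_b', 'cat2b': 'par_b', 'cat3b': 'par_b'}

# Each index label is looked up in dct to find its group (the parent).
sums = df.groupby(dct).sum()

# Overwrite only the parent rows; child rows and row order stay as they were.
result = df.copy()
result.loc[sums.index] = sums
print(result)
```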
