I have two dataframes that I want to add bin-wise. That is, given
dfc1 = pd.DataFrame(list(zip(range(10),np.zeros(10))), columns=['bin', 'count'])
dfc2 = pd.DataFrame(list(zip(range(0,10,2), np.ones(5))), columns=['bin', 'count'])
which gives me this
dfc1:
bin count
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
dfc2:
bin count
0 0 1
1 2 1
2 4 1
3 6 1
4 8 1
I want to generate this:
bin count
0 0 1
1 1 0
2 2 1
3 3 0
4 4 1
5 5 0
6 6 1
7 7 0
8 8 1
9 9 0
where I've added the count columns where the bin columns matched.
In fact, it turns out that I only ever add 1 (that is, count in dfc2 is always 1). So an alternate version of the question is "given an array of bin values (dfc2.bin), how can I add one to each of their corresponding count values in dfc1?"
My only solution thus far feels grossly inefficient (and slightly unreadable in the end): doing an outer join between the two bin columns, creating a third dataframe on which I do the computation, and then projecting out the unneeded column.
Suggestions?
First set bin as the index in both dataframes; then you can use add. The fill_value argument tells pandas to treat a bin that is missing from one dataframe as zero:
dfc1 = dfc1.set_index('bin')
dfc2 = dfc2.set_index('bin')
result = dfc1.add(dfc2, fill_value=0)
Pandas automatically aligns rows by index and adds the matching values.
By the way, if you need to perform this operation frequently, I strongly recommend numpy.bincount, which even handles repeated bin values within a single array; a sketch follows below.
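For illustration, a minimal sketch of the bincount route, starting from the original dfc1/dfc2 in the question (before set_index); it assumes the bins are non-negative integers and that dfc1 covers bins 0..n-1 in order:
import numpy as np

# np.bincount sums the weights that fall into each integer bin, so
# repeated bin values in dfc2 are accumulated automatically.
added = np.bincount(dfc2['bin'], weights=dfc2['count'], minlength=len(dfc1))
dfc1['count'] += added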
Since the dfc1 index is the same as your "bin" values, you could simply do the following:
dfc1.loc[dfc2['bin'], 'cnt'] += 1
Notice that I renamed your "count" column to "cnt": count is a DataFrame method, so attribute access like dfc1.count returns the method rather than the column, which can cause confusion and errors. Note also the single loc indexer; a chained expression such as dfc1.iloc[dfc2.bin].cnt += 1 would assign into a temporary copy and leave dfc1 unchanged.
As an alternative to @Alleo's answer, you can use the method combineAdd to simply add the 2 dataframes together, with set_index applied at the same time so that they are matched by bin:
dfc1.set_index('bin').combineAdd(dfc2.set_index('bin')).reset_index()
bin count
0 0 1
1 1 0
2 2 1
3 3 0
4 4 1
5 5 0
6 6 1
7 7 0
8 8 1
9 9 0
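Note that combineAdd was deprecated and later removed from pandas; a sketch of the equivalent using add with fill_value (same match-by-bin semantics):
result = (dfc1.set_index('bin')
              .add(dfc2.set_index('bin'), fill_value=0)
              .reset_index())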
For the past hour I've been searching here and couldn't find a very simple thing I need to do: duplicate a single row at index x and insert it at index x+1.
df
a b
0 3 8
1 2 4
2 9 0
3 5 1
Copy the row at index 2 and insert it as-is as the next row:
a b
0 3 8
1 2 4
2 9 0
3 9 0 # new row
4 5 1
What I tried is concat (with my own column names), which makes a mess:
line = pd.DataFrame({"date": date, "event": None}, index=[index+1])
return pd.concat([df.iloc[:index], line, df.iloc[index:]]).reset_index(drop=True)
How can I simply duplicate a full row at a given index?
You can use repeat(). Fill the dictionary with the index as the key and the number of extra rows you would like to add as the value. This works for multiple rows at once.
d = {2: 1}
df.loc[df.index.repeat(df.index.map(d).fillna(0).astype(int) + 1)].reset_index()
Output:
index a b
0 0 3 8
1 1 2 4
2 2 9 0
3 2 9 0
4 3 5 1
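The same pattern extends to several rows at once; for example (a hypothetical dictionary, not from the original answer), to duplicate row 1 twice and row 3 once:
d = {1: 2, 3: 1}
df.loc[df.index.repeat(df.index.map(d).fillna(0).astype(int) + 1)].reset_index(drop=True)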
Got it.
df.loc[index + 0.5] = df.loc[index].values
return df.sort_index().reset_index(drop=True)
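If you need this often, the trick wraps neatly in a small helper (duplicate_row is a hypothetical name, not a pandas API):
def duplicate_row(df, index):
    # Give the copy a fractional label so it sorts directly after the
    # original row, then restore a clean integer index.
    df = df.copy()
    df.loc[index + 0.5] = df.loc[index].values
    return df.sort_index().reset_index(drop=True)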
I would like to get a rolling count of how many rows there have been between the current row and the last peak. Example:
Value | Rows since Peak
-----------------------
1 0
3 0
1 1
2 2
1 3
4 0
6 0
5 1
You can compare the values to the cummax and use it for a groupby.cumcount:
df['Rows since Peak'] = (df.groupby(df['Value'].eq(df['Value'].cummax()).cumsum())
                           .cumcount())
How it works:
Every time a value equals the cumulative max (df['Value'].eq(df['Value'].cummax())) we start a new group, using cumsum to assign the group id. Then cumcount enumerates the rows since the start of the group.
output:
Value Rows since Peak
0 1 0
1 3 0
2 1 1
3 2 2
4 1 3
5 4 0
6 6 0
7 5 1
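To see the mechanics, here is a sketch that materializes the intermediate steps (the column names is_peak and group are illustrative, not part of the original answer):
tmp = df.copy()
tmp['is_peak'] = tmp['Value'].eq(tmp['Value'].cummax())   # True at each new running max
tmp['group'] = tmp['is_peak'].cumsum()                    # group id increments at each peak
tmp['Rows since Peak'] = tmp.groupby('group').cumcount()  # row position within its group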
I am creating a dataframe like this:
np.random.seed(2)
df = pd.DataFrame(np.random.randint(1, 6, (6, 6)))
Out[]:
   0  1  2  3  4  5
0  1  1  4  3  4  1
1  3  2  4  3  5  5
2  5  4  5  3  4  4
3  3  2  3  5  4  1
4  5  4  2  3  1  5
5  5  3  5  3  2  1
Splitting the dataframe into overlapping 3x3 matrices as below, there will be 16 matrices:
dfs = []
for col in range(df.shape[1] - 2):
    for row in range(df.shape[0] - 2):
        dfs.append(df.iloc[row:row+3, col:col+3])
Let's print a few:
dfs[0]
1 1 4
3 2 4
5 4 5
dfs[1]
3 2 4
5 4 5
3 2 3
.
.
.
dfs[15]
5 4 1
3 1 5
3 2 1
Now I write a function to change the values at locations [1,0] and [1,2] of each matrix to zero, so that my output looks like this:
dfs[0]
1 1 4
0 2 0
5 4 5
def process(x):
    new = []
    for d in x:
        d.iloc[1, 0] = 0
        d.iloc[1, 2] = 0
        new.append(d)
        print(d)
    return new
dfs = process(dfs.copy())
My expected output is
dfs[0]
1 1 4
0 2 0
5 4 5
but what my function returns is,
dfs[0]
1 1 4
0 0 0
0 0 0
dfs[1]
0 0 0
0 0 0
0 0 0
It produces zeros everywhere, across all the matrices. I don't know why it behaves unexpectedly or what I am doing wrong in my process function. Please help. Thanks.
Long story short, you are a victim of chained indexing, which can lead to bad things happening.
When you slice the original DataFrame, you get overlapping views.
Modifying one changes the others too: the second row of one chunk is the first row of another, the third row of the first chunk is the first row of yet another, and so on. That is why you see non-zero values only at the "edges", since those cells are unique to a single chunk.
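A quick way to see the overlap for yourself (in pandas versions without copy-on-write, where such slices can share data with df):
chunk = df.iloc[0:3, 0:3]   # first 3x3 slice
chunk.iloc[1, 0] = 0        # modify the slice...
print(df.iloc[1, 0])        # ...and the original df changes as well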
You can make copies of each slice, like this:
def process(x):
    new = []
    for d in x:
        d = d.copy()  # each one is now an independent copy
        d.iloc[1, 0] = 0
        d.iloc[1, 2] = 0
        new.append(d)
    return new
Lastly, note that dfs = process(dfs) is actually fine; you don't need to make a copy of the enclosing list.
Change your loop and your process call as below to get the required output. I use copy inside the for loop so that each subset dataframe is independent of future changes; in your version the slices share data with the original df, which is why the other entries of dfs end up all zeros:
for col in range(df.shape[1] - 2):
    for row in range(df.shape[0] - 2):
        dfs.append(df.iloc[row:row+3, col:col+3].copy())
dfs = process(dfs)
I have a dataframe like this:
source target weight
1 2 5
2 1 5
1 2 5
1 2 7
3 1 6
1 1 6
1 3 6
My goal is to remove the duplicate rows, treating source and target as unordered: a row counts as a duplicate if the same pair appears in either order with the same weight, and such rows should be removed. In this case, the expected result would be
source target weight
1 2 5
1 2 7
3 1 6
1 1 6
Is there any way to do this without loops?
Use frozenset and duplicated
df[~df[['source', 'target']].apply(frozenset, axis=1).duplicated()]
source target weight
0 1 2 5
4 3 1 6
5 1 1 6
If you want to account for unordered source/target and weight
df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, 1)).duplicated()]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
However, here is a more explicit version with more readable code.
# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')
# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()
df[~mask]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
Should be fairly easy.
data = [[1, 2, 5],
        [2, 1, 5],
        [1, 2, 5],
        [1, 2, 7],
        [3, 1, 6],
        [1, 1, 6],
        [1, 3, 6],
        ]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])
You can drop exact duplicates using drop_duplicates; keep=False drops every copy of a row that occurs more than once:
df = df.drop_duplicates(keep=False)
print(df)
would result in:
source target weight
1 2 1 5
3 1 2 7
4 3 1 6
5 1 1 6
6 1 3 6
Because you also want to handle the unordered source/target issue, sort each pair first (applied to the original dataframe):
def pair(row):
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row
df = df.apply(pair, axis=1)
and then you can use df.drop_duplicates():
source target weight
0 1 2 5
3 1 2 7
4 1 3 6
5 1 1 6
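As a faster, vectorized variant of the pair-sorting idea (a sketch; np.sort along axis=1 replaces the row-wise apply):
import numpy as np

df[['source', 'target']] = np.sort(df[['source', 'target']].to_numpy(), axis=1)
df = df.drop_duplicates()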
In a pandas dataframe, how can I drop a random subset of rows that obey a condition?
In other words, if I have a Pandas dataframe with a Label column, I'd like to drop 50% (or some other percentage) of rows where Label == 1, but keep all of the rest:
Label  A          Label  A
0      1          0      1
0      2          0      2
0      3    ->    0      3
1      10         1      11
1      11         1      12
1      12
1      13
I'd love to know the simplest and most pythonic/panda-ish way of doing this!
Edit: This question provides part of an answer, but it only talks about dropping rows by index, disregarding the row values. I'd still like to know how to drop only from rows that are labeled a certain way.
Use the frac argument of sample; note that sample returns the rows to keep:
df.sample(frac=.5)
If you define the amount you want to drop in a variable n
n = .5
df.sample(frac=1 - n)
To include the condition, use drop
df.drop(df.query('Label == 1').sample(frac=.5).index)
Label A
0 0 1
1 0 2
2 0 3
4 1 11
6 1 13
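If the result needs to be reproducible, sample accepts a random_state seed (a small addition, not in the original answer):
df.drop(df.query('Label == 1').sample(frac=.5, random_state=0).index)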
Using drop with sample; here sample(2) picks two of the four Label == 1 rows to drop:
df.drop(df[df.Label.eq(1)].sample(2).index)
Label A
0 0 1
1 0 2
2 0 3
3 1 10
5 1 12