sum numbers in two dataframes based on their intersecting indexes - python

i have 2 dataframes with some common index and some that are not:
df1
DATA
1 1
2 2
3 3
4 4
5 5
df2
DATA
3 3
4 4
5 5
6 6
7 7
I want to sum them / take the max (I actually need both, for different columns), and treat missing indexes as 0.
In this example the result should be:
df_results
DATA
1 1
2 2
3 6
4 8
5 10
6 6
7 7
where 3, 4 and 5 were summed, but the rest remained the same.
Thanks!

Try this:
combined = df1.add(df2, fill_value=0)
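The question also asks for the max. A minimal sketch covering both aggregations, rebuilding the sample frames from the question (the variable names here are just illustrative):

import numpy as np
import pandas as pd

# Sample frames from the question (index -> DATA)
df1 = pd.DataFrame({'DATA': [1, 2, 3, 4, 5]}, index=[1, 2, 3, 4, 5])
df2 = pd.DataFrame({'DATA': [3, 4, 5, 6, 7]}, index=[3, 4, 5, 6, 7])

# Sum, treating indexes missing from either frame as 0
summed = df1.add(df2, fill_value=0)

# Element-wise max with the same "missing counts as 0" behaviour
maxed = df1.combine(df2, np.maximum, fill_value=0)

# Equivalent alternative: stack both frames and take the max per index
maxed_alt = pd.concat([df1, df2]).groupby(level=0).max()

print(summed)
print(maxed)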

Related

How to add a column with order numbers based on rules to a DataFrame?

The questions suggested below don't solve my problem, because I want to add ordering based on rules; the suggested question doesn't answer that, and this question is not a duplicate.
I have a DataFrame and I need to add a 'new column' with the order number of each value.
I was able to do that, but I wonder:
1- Is there a more correct/elegant way to do this?
Also, is it possible:
2- To give rows with the same value the same order number? For example, in my case the second and third rows have the same value; is it possible to assign 2 to both of them?
3- To set a rule for defining the order? For example, if the difference between rows is less than 0.5, they should be assigned the same order number; if it is more, the order number should increase.
Thank you in advance!
import numpy as np
import pandas as pd

np.random.seed(42)
df2 = pd.DataFrame(np.random.randint(1, 10, 10), columns=['numbers'])
df2 = df2.sort_values('numbers')
df2['ord'] = 1 + np.arange(len(df2['numbers']))
If you want to assign the same order number to identical "numbers", use groupby.ngroup:
df2['ord'] = df2.groupby('numbers').ngroup().add(1)
Output:
numbers ord
5 3 1
1 4 2
9 4 2
3 5 3
8 5 3
0 7 4
4 7 4
6 7 4
2 8 5
7 8 5
Grouping with a threshold:
grouper = df2['numbers'].diff().gt(1).cumsum()
df2['ord_threshold'] = df2.groupby(grouper).ngroup().add(1)
Output:
numbers ord ord_threshold
5 3 1 1
1 4 2 1
9 4 2 1
3 5 3 1
8 5 3 1
0 7 4 2
4 7 4 2
6 7 4 2
2 8 5 2
7 8 5 2
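The question asked for a 0.5 threshold rather than 1. A sketch of the same idea with the threshold pulled out as a parameter (self-contained, reusing the random data from above):

import numpy as np
import pandas as pd

np.random.seed(42)
df2 = pd.DataFrame(np.random.randint(1, 10, 10), columns=['numbers'])
df2 = df2.sort_values('numbers')

# Start a new group whenever the gap to the previous (sorted) value exceeds the threshold
threshold = 0.5
grouper = df2['numbers'].diff().gt(threshold).cumsum()
df2['ord_threshold'] = df2.groupby(grouper).ngroup().add(1)
print(df2)

With this integer sample a 0.5 threshold simply gives every distinct value its own number; the parameter only makes a visible difference for fractional data.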
You can also do it by resetting the indexes:
import numpy as np
import pandas as pd

np.random.seed(42)
df2 = pd.DataFrame(np.random.randint(1, 10, 10), columns=['numbers'])
df2 = df2.sort_values('numbers').reset_index(drop=True)
# reset the index again to materialise the row position as a column
df2.reset_index(inplace=True)
# put the value of the new index (+1) in the ord column
df2['ord'] = df2['index'] + 1
# clean up the index column that was created
df2.drop(columns='index', inplace=True)
print(df2)
Result:
numbers ord
0 3 1
1 4 2
2 4 3
3 5 4
4 5 5
5 7 6
6 7 7
7 7 8
8 8 9
9 8 10
Let us try:
df2['ord'] = df2['numbers'].factorize()[0] + 1
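Note that factorize assigns codes in order of first appearance, so it only matches ngroup here because the frame is already sorted by numbers. A dense rank is a common alternative that does not depend on the sort order; this sketch is added for comparison and is not part of the original answers:

import numpy as np
import pandas as pd

np.random.seed(42)
df2 = pd.DataFrame(np.random.randint(1, 10, 10), columns=['numbers'])

# Dense rank: identical values share a number, and the number increases by 1 per distinct value
df2['ord'] = df2['numbers'].rank(method='dense').astype(int)
print(df2.sort_values('numbers'))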

Pandas replacing subset of column values with another column based on index

I tried to check other questions but didn't find what I needed.
I have a dataframe df:
a b
0 6 4
1 5 6
2 2 2
3 7 4
4 3 6
5 5 2
6 4 7
and a second dataframe df2
d
0 60
1 50
5 50
6 40
I want to replace the values in df['a'] with the values in df2['d'] - but only in the relevant indices.
Output:
a b
0 60 4
1 50 6
2 2 2
3 7 4
4 3 6
5 50 2
6 40 7
All the other questions I saw, like this one, refer to replacing a single value, but I want to replace values based on an entire column.
I know I can iterate the rows one by one and replace the values, but I'm looking for a more efficient way.
Note: df2 does not have any indices that are not in df. I want to replace all the values in df at the indices that appear in df2.
Simply use indexing:
df.loc[df2.index, 'a'] = df2['d']
output:
a b
0 60 4
1 50 6
2 2 2
3 7 4
4 3 6
5 50 2
6 40 7
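Not part of the original answer, but DataFrame.update is an equivalent alternative once the column names line up; a minimal sketch rebuilding the frames from the question:

import pandas as pd

df = pd.DataFrame({'a': [6, 5, 2, 7, 3, 5, 4],
                   'b': [4, 6, 2, 4, 6, 2, 7]})
df2 = pd.DataFrame({'d': [60, 50, 50, 40]}, index=[0, 1, 5, 6])

# Rename df2's column so it lines up with df['a'], then let update()
# align on the index and overwrite only the rows present in df2.
# Note: update may upcast an integer column to float.
df.update(df2.rename(columns={'d': 'a'}))
print(df)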

Find the max value of each column over a window of rows in a DataFrame

I want to code a sliding window, and I have a huge dataset shaped like this:
feature x
a b c d g
1 2 3 4 5 6
2 4 5 6 9 4
3 6 7 8 6 0
4 2 3 5 7 9
5 2 2 2 2 2
and
label y
0
1
1
2
0
I want to define df with the same columns, collapsing every 3 rows into one that holds the max value of each column (this is for feature x),
and define df2 with the same columns, collapsing every 3 rows into one that holds the most frequent value (this is for label y).
Can someone help me? :(
The first row might look like this:
6 7 8 9 6 with label 1
I got an answer, but you can improve it!
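Since the answer itself is not shown in this thread, here is a minimal sketch of one way to do it, assuming an overlapping sliding window of 3 rows (for non-overlapping blocks of 3 you could group by df.index // 3 instead); the frame construction just mirrors the sample data:

import pandas as pd

# Feature frame and label series from the example
x = pd.DataFrame(
    {'a': [2, 4, 6, 2, 2], 'b': [3, 5, 7, 3, 2], 'c': [4, 6, 8, 5, 2],
     'd': [5, 9, 6, 7, 2], 'g': [6, 4, 0, 9, 2]},
    index=[1, 2, 3, 4, 5],
)
y = pd.Series([0, 1, 1, 2, 0], name='label')

# Max of each column over every window of 3 rows
window_max = x.rolling(3).max().dropna()

# Most frequent label in each window (mode); ties resolve to the smallest value here
window_label = y.rolling(3).apply(lambda s: s.mode().iloc[0]).dropna()

print(window_max)    # first row: 6 7 8 9 6
print(window_label)  # first value: 1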

Index matching with multiple columns in python

I have two pandas DataFrames of different sizes. The two DataFrames look like:
df1 =
x y data
1 2 5
2 2 7
5 3 9
3 5 2
and another dataframe looks like:
df2 =
x y value
5 3 7
1 2 4
3 5 2
7 1 4
4 6 5
2 2 1
7 5 8
I am trying to merge these two DataFrames so that the final DataFrame contains the matching combinations of x and y with their respective values. I am expecting the final DataFrame in this format:
x y data value
1 2 5 4
2 2 7 1
5 3 9 7
3 5 2 2
I tried this code but I am not getting the expected results:
dfB.set_index('x').loc[dfA.x].reset_index()
Use merge; by default how='inner', so it can be omitted, and if joining only on the common columns, the on parameter can be omitted too:
print (pd.merge(df1,df2))
x y data value
0 1 2 5 4
1 2 2 7 1
2 5 3 9 7
3 3 5 2 2
If the real data has multiple columns with the same names, specify the join keys explicitly:
print (pd.merge(df1,df2, on=['x','y']))
x y data value
0 1 2 5 4
1 2 2 7 1
2 5 3 9 7
3 3 5 2 2
df1.merge(df2, on=['x', 'y'])
This will do as well.
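A self-contained sketch of the merge-based answer, rebuilding the two frames from the question:

import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 5, 3], 'y': [2, 2, 3, 5], 'data': [5, 7, 9, 2]})
df2 = pd.DataFrame({'x': [5, 1, 3, 7, 4, 2, 7], 'y': [3, 2, 5, 1, 6, 2, 5],
                    'value': [7, 4, 2, 4, 5, 1, 8]})

# Inner join on the shared columns keeps only the (x, y) pairs present in both frames
result = df1.merge(df2, on=['x', 'y'])
print(result)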

Select some elements of a column and find the maximum of them, repeatedly over a large file, using Python

I have a large file with 2.2 million rows.
Value Label
4 1
6 1
2 2
6 2
3 2
5 3
8 3
7 3
1 4
5 4
2 5
4 5
1 5
I want to know the fastest way to get the following output, where 'Max' stores the maximum value for each label:
Label Max
1 6
2 6
3 8
4 5
5 4
I implemented this with plain for and while loops in Python, but it takes hours. I expect pandas has something for tackling this.
Call max on a groupby object:
In [116]:
df.groupby('Label').max()
Out[116]:
Value
Label
1 6
2 6
3 8
4 5
5 4
If you want to restore the Label column from the index then call reset_index:
In [117]:
df.groupby('Label').max().reset_index()
Out[117]:
Label Value
0 1 6
1 2 6
2 3 8
3 4 5
4 5 4
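To match the exact output shown in the question (a column literally named Max), a small sketch using named aggregation; only the column name is added on top of the answer above:

import pandas as pd

df = pd.DataFrame({'Value': [4, 6, 2, 6, 3, 5, 8, 7, 1, 5, 2, 4, 1],
                   'Label': [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5]})

# Named aggregation renames the aggregated column to 'Max' in one step
out = df.groupby('Label', as_index=False).agg(Max=('Value', 'max'))
print(out)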
