group rows based on partial strings from two columns and sum values - python

df = pd.DataFrame({'c1':['Ax','Ay','Bx','By'], 'c2':['Ay','Ax','By','Bx'], 'c3':[1,2,3,4]})
   c1  c2  c3
0  Ax  Ay   1
1  Ay  Ax   2
2  Bx  By   3
3  By  Bx   4
I'd like to sum the c3 values by aggregating the same xy combinations from the c1 and c2 columns.
The expected output is:
  c1 c2  c3
0  x  y   4    # [Ax Ay] + [Bx By]
1  y  x   6    # [Ay Ax] + [By Bx]

You can select the values in c1 and c2 without their first letters and aggregate with sum:
df = df.groupby([df.c1.str[1:], df.c2.str[1:]]).sum().reset_index()
print(df)
  c1 c2  c3
0  x  y   4
1  y  x   6
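An equivalent spelling, if you prefer named columns over positional groupby keys (just a sketch of the same operation):

out = (df.assign(c1=df['c1'].str[1:], c2=df['c2'].str[1:])   # replace c1/c2 with their suffixes
         .groupby(['c1', 'c2'], as_index=False)['c3'].sum())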

Merging two DataFrames based on indexes from two other DataFrames

I'm new to pandas and have tried going through the docs and experimenting with various examples, but this problem I'm tackling has really stumped me.
I have the following two dataframes (DataA/DataB) which I would like to merge on a per global_index/item/value basis.
DataA                        DataB
row  item_id  valueA         row  item_id  valueB
0    x        A1             0    x        B1
1    y        A2             1    y        B2
2    z        A3             2    x        B3
3    x        A4             3    y        B4
4    z        A5             4    z        B5
5    x        A6             5    x        B6
6    y        A7             6    y        B7
7    z        A8             7    z        B8
The list of items (item_ids) is finite, and each of the two dataframes represents the value of a trait (trait A, trait B) for an item at a given global_index value.
The global_index can roughly be thought of as a unit of "time".
The mapping between each data frame (DataA/DataB) and the global_index is done via the following two mapper DFs:
DataA_mapper
global_index  start_row  num_rows
0             0          3
1             3          2
3             5          3

DataB_mapper
global_index  start_row  num_rows
0             0          2
2             2          3
4             5          3
Simply put, for a given global_index (e.g. 1) the mapper defines the range of rows in the respective DF (DataA or DataB) that are associated with that global_index.
For example, for a global_index value of 0:
In DF DataA rows 0..2 are associated with global_index 0
In DF DataB rows 0..1 are associated with global_index 0
Another example, for a global_index value of 2:
In DF DataB rows 2..4 are associated with global_index 2
In DF DataA there are no rows associated with global_index 2
The ranges [start_row, start_row + num_rows) do not overlap and each represents a unique range of rows in its respective dataframe (DataA, DataB).
In short, no row in either DataA or DataB will be found in more than one range.
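To make that convention concrete, here is a small illustrative helper (rows_for is just a made-up name for this example):

def rows_for(df, mapper, gi):
    # rows of df belonging to one global_index, per the [start_row, start_row + num_rows) convention
    sel = mapper.loc[mapper['global_index'] == gi]
    if sel.empty:                # e.g. DataA has no rows for global_index 2
        return df.iloc[0:0]      # empty slice with the same columns
    start = int(sel['start_row'].iloc[0])
    return df.iloc[start : start + int(sel['num_rows'].iloc[0])]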
I would like to merge the DFs so that I get the following dataframe:
row  global_index  item_id  valueA  valueB
0    0             x        A1      B1
1    0             y        A2      B2
2    0             z        A3      NaN
3    1             x        A4      B1
4    1             z        A5      NaN
5    2             x        A4      B3
6    2             y        A2      B4
7    2             z        A5      NaN
8    3             x        A6      B3
9    3             y        A7      B4
10   3             z        A8      B5
11   4             x        A6      B6
12   4             y        A7      B7
13   4             z        A8      B8
In the final dataframe, for any pair of global_index/item_id there will be either:
a value for both valueA and valueB
a value only for valueA
a value only for valueB
The requirement is that if there is only one value for a given global_index/item (e.g. valueA but no valueB), the last known value of the missing one should be carried forward.
First, you can create the 'global_index' column using the function pd.cut:
for df, m in [(df_A, map_A), (df_B, map_B)]:
    bins = np.insert(m['num_rows'].cumsum().values, 0, 0)  # create bins and add zero at the beginning
    df['global_index'] = pd.cut(df['row'], bins=bins, labels=m['global_index'], right=False)
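For example, with DataA_mapper as the map_A above, the bins work out to:

bins = np.insert(map_A['num_rows'].cumsum().values, 0, 0)
print(bins)  # [0 3 5 8] -> rows [0,3) get label 0, [3,5) label 1, [5,8) label 3

Note this relies on each mapper covering its frame's rows contiguously from 0, which holds for the data shown.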
Next, you can use an outer join to merge both data frames:
df = df_A.merge(df_B, on=['global_index', 'item_id'], how='outer')
And finally, you can use groupby and ffill to fill in the missing values:
for val in ['valueA', 'valueB']:
    df[val] = df.groupby('item_id')[val].ffill()
Output:
   item_id global_index valueA valueB
0        x            0     A1     B1
1        y            0     A2     B2
2        z            0     A3    NaN
3        x            1     A4     B1
4        z            1     A5    NaN
5        x            3     A6     B1
6        y            3     A7     B2
7        z            3     A8    NaN
8        x            2     A6     B3
9        y            2     A7     B4
10       z            2     A8     B5
11       x            4     A6     B6
12       y            4     A7     B7
13       z            4     A8     B8
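The rows come out in merge order rather than global_index order; if you want the ordering from the question, a final sort should do it (global_index may be categorical after pd.cut, hence the cast):

df['global_index'] = df['global_index'].astype(int)
df = df.sort_values(['global_index', 'item_id']).reset_index(drop=True)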
I haven't tested this out, since I don't have any good test data, but I think something like this should work. Rather than attempting a complicated join, it builds a series of lists to hold your data, which you can then put back together into a final dataframe at the end.
# index both frames by 'row' (set_index is not in-place, so assign the result back)
DataA = DataA.set_index('row')
DataB = DataB.set_index('row')
# every global_index that appears in either mapper; as written, the loop body
# assumes each index appears in BOTH mappers, otherwise .values[0] will fail
totalIndexes = sorted(set(DataA_mapper['global_index']) | set(DataB_mapper['global_index']))
# we're going to create the new dataframe from scratch, building a list for each column we want
global_index = []
AValues = []
AIndex = []
BValues = []
BIndex = []
for indexNum in totalIndexes:
    # for each global index, get the row range to extract from DataA and DataB
    AStart = DataA_mapper.loc[DataA_mapper['global_index'] == indexNum, 'start_row'].values[0]
    ARows = DataA_mapper.loc[DataA_mapper['global_index'] == indexNum, 'num_rows'].values[0]
    AStop = AStart + ARows
    BStart = DataB_mapper.loc[DataB_mapper['global_index'] == indexNum, 'start_row'].values[0]
    BRows = DataB_mapper.loc[DataB_mapper['global_index'] == indexNum, 'num_rows'].values[0]
    BStop = BStart + BRows
    # next we extract values from DataA and DataB, turn them into lists, and add them to our data
    AValues = AValues + list(DataA.iloc[AStart:AStop, 1].values)
    AIndex = AIndex + list(DataA.iloc[AStart:AStop, 0].values)
    BValues = BValues + list(DataB.iloc[BStart:BStop, 1].values)
    BIndex = BIndex + list(DataB.iloc[BStart:BStop, 0].values)  # was DataA's range here, a copy-paste slip
    # build a temporary list repeating the current global_index, and add it to our data
    global_index_temp = []
    for row in range(max(ARows, BRows)):
        global_index_temp.append(indexNum)
    global_index = global_index + global_index_temp
# combine all these individual lists into a dataframe (note zip truncates to the shortest list)
finalData = list(zip(global_index, AIndex, BIndex, AValues, BValues))
df = pd.DataFrame(data=finalData, columns=['global_index', 'item1', 'item2', 'valueA', 'valueB'])
# lastly you just need to merge item1 and item2 to get your item_id column
I've tried to comment it nicely so that hopefully the general plan makes sense and you can follow along and correct my mistakes or rewrite it your own way.
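For that last merge of item1 and item2, one possible (equally untested) final step, assuming the two columns agree wherever both are present:

df['item_id'] = df['item1'].combine_first(df['item2'])  # prefer item1, fall back to item2
df = df.drop(columns=['item1', 'item2'])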

How to compare two data frames with same columns but different number of rows?

df1 =
A   B   C   D
a1  b1  c1  1
a2  b2  c2  2
a3  b3  c3  4

df2 =
A   B   C   D
a1  b1  c1  2
a2  b2  c2  1
I want to compare the values of column 'D' in both dataframes. If both dataframes had the same number of rows I would just do this:
newDF = df1['D'] - df2['D']
However, there are times when the number of rows differs. I want a result dataframe like this:
resultDF =
A   B   C   D_df1  D_df2  Diff
a1  b1  c1      1      2    -1
a2  b2  c2      2      1     1
EDIT: if the 1st row of A,B,C in df1 and df2 is the same, then and only then compare the 1st row of column D in each dataframe. Similarly, repeat for all the rows.
Use merge and DataFrame.eval:
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
    A   B   C  D_df1  D_df2  Diff
0  a1  b1  c1      1      2    -1
1  a2  b2  c2      2      1     1
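If you'd rather not use eval, the same thing written out explicitly gives an identical result:

resultDF = df1.merge(df2, on=['A', 'B', 'C'], suffixes=['_df1', '_df2'])
resultDF['Diff'] = resultDF['D_df1'] - resultDF['D_df2']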

How to get rows where a set of columns are equal to a given value in Pandas?

I have a dataframe with many columns (around 1000).
Given a set of columns (around 10), which have 0 or 1 as values, I would like to select all the rows where I have 1s in the aforementioned set of columns.
Toy example. My dataframe is something like this:
c1,c2,c3,c4,c5
'a',1,1,0,1
'b',0,1,0,0
'c',0,0,1,1
'd',0,1,0,0
'e',1,0,0,1
And I would like to get the rows where the columns c2 and c5 are equal to 1:
'a',1,1,0,1
'e',1,0,0,1
What would be the most efficient way to do it?
Thanks!
This would be more generic for multiple columns cols:
In [1277]: cols = ['c2', 'c5']
In [1278]: df[(df[cols] == 1).all(1)]
Out[1278]:
    c1  c2  c3  c4  c5
0  'a'   1   1   0   1
4  'e'   1   0   0   1
Or,
In [1284]: df[np.logical_and.reduce([df[x]==1 for x in cols])]
Out[1284]:
    c1  c2  c3  c4  c5
0  'a'   1   1   0   1
4  'e'   1   0   0   1
Or,
In [1279]: df.query(' and '.join(['%s==1'%x for x in cols]))
Out[1279]:
    c1  c2  c3  c4  c5
0  'a'   1   1   0   1
4  'e'   1   0   0   1
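Since the question says these columns only ever hold 0 or 1, comparing row sums over cols is one more option (it would misfire if other values could appear):

df[df[cols].sum(axis=1) == len(cols)]  # rows where every column in cols is 1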
You can try doing something like this:
df.loc[(df['c2'] == 1) & (df['c5'] == 1)]
(Note the parentheses: & binds more tightly than ==, so they are required.)
import pandas as pd

frame = pd.DataFrame([
    ['a', 1, 1, 0, 1],
    ['b', 0, 1, 0, 0],
    ['c', 0, 0, 1, 1],
    ['d', 0, 1, 0, 0],
    ['e', 1, 0, 0, 1]], columns='c1,c2,c3,c4,c5'.split(','))
print(frame.loc[(frame['c2'] == 1) & (frame['c5'] == 1)])

merge two dataframe based on specific column information

I am trying to handle dataframes in several ways,
and now I'd like to merge two dataframes based on specific column information and delete the rows which are duplicated.
Is it possible?
I tried to use the concat function but failed...
For example, I want to merge df1 and df2 into d3 with these
conditions:
if the c1 & c2 information is the same, drop the duplicated rows (keep only the df1 row, even if the c3 data differs between df1 and df2)
if the c1 & c2 information is different, use both rows (df1, df2)
before:
df1
   c1 c2                c3
0   0  x    {'a':1, 'b':2}
1   0  y    {'a':3, 'b':4}
2   2  z    {'a':5, 'b':6}

df2
   c1 c2                c3
0   0  x  {'a':11, 'b':12}
1   0  y  {'a':13, 'b':14}
2   3  z  {'a':15, 'b':16}

expected result d3:
   c1 c2                c3
0   0  x    {'a':1, 'b':2}
1   0  y    {'a':3, 'b':4}
2   2  z    {'a':5, 'b':6}
3   3  z  {'a':15, 'b':16}
You can do this by first determining which rows are only in df2, using merge with how='right' and indicator=True, and then concat the result with df1:
In [125]:
merged = df1.merge(df2, left_on=['c1','c2'], right_on=['c1','c2'], how='right', indicator=True)
merged = merged[merged['_merge']=='right_only']
merged = merged.rename(columns={'c3_y':'c3'})
merged
Out[125]:
   c1 c2 c3_x                c3      _merge
2   3  z  NaN  {'a':15, 'b':16}  right_only
In [126]:
combined = pd.concat([df1, merged[df1.columns]])
combined
Out[126]:
   c1 c2                c3
0   0  x    {'a':1, 'b':2}
1   0  y    {'a':3, 'b':4}
2   2  z    {'a':5, 'b':6}
2   3  z  {'a':15, 'b':16}
If we break down the above:
In [128]:
merged = df1.merge(df2, left_on=['c1','c2'], right_on=['c1','c2'], how='right', indicator=True)
merged
Out[128]:
   c1 c2            c3_x              c3_y      _merge
0   0  x  {'a':1, 'b':2}  {'a':11, 'b':12}        both
1   0  y  {'a':3, 'b':4}  {'a':13, 'b':14}        both
2   3  z             NaN  {'a':15, 'b':16}  right_only
In [129]:
merged = merged[merged['_merge']=='right_only']
merged
Out[129]:
   c1 c2 c3_x              c3_y      _merge
2   3  z  NaN  {'a':15, 'b':16}  right_only
In [130]:
merged = merged.rename(columns={'c3_y':'c3'})
merged
Out[130]:
   c1 c2 c3_x                c3      _merge
2   3  z  NaN  {'a':15, 'b':16}  right_only
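A shorter alternative (not part of the breakdown above) that exploits the "keep the df1 row on duplicates" rule directly, assuming both frames share the c1/c2/c3 columns:

d3 = (pd.concat([df1, df2], ignore_index=True)
        .drop_duplicates(subset=['c1', 'c2'], keep='first')  # df1 comes first, so its rows win
        .reset_index(drop=True))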

how to convert column names into column values in pandas - python

df=pd.DataFrame(index=['x','y'], data={'a':[1,2],'b':[3,4]})
How can I convert column names into values of a column? This is my desired output:
   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b
You can use:
print(df.T.unstack().reset_index(level=1, name='c1')
        .rename(columns={'level_1':'c2'})[['c1','c2']])
   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b
Or:
print(df.stack().reset_index(level=1, name='c1')
        .rename(columns={'level_1':'c2'})[['c1','c2']])
   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b
try this:
In [279]: df.stack().reset_index().set_index('level_0').rename(columns={'level_1':'c2', 0:'c1'})
Out[279]:
        c2  c1
level_0
x        a   1
x        b   3
y        a   2
y        b   4
Try:
df1 = df.stack().reset_index(-1).iloc[:, ::-1]
df1.columns = ['c1', 'c2']
df1
In [62]: (pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
           .set_index('index'))
Out[62]:
      c2  c1
index
x      a   1
y      a   2
x      b   3
y      b   4
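If you want the rows grouped by the original index, as in the desired output, append a sort_index():

(pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
   .set_index('index')
   .sort_index()[['c1', 'c2']])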
