How to flat in one row Dataframe and concatenate them - python

I have several Dataframes with the same structure :
0
1
0
TITLE
TITLE1
1
A
A1
2
B
B1
3
C
C1
4
D
D1
5
E
E1
0
1
0
TITLE
TITLE2
1
A
A2
2
B
B2
3
C
C2
4
D
D2
5
E
E2
My goal is to have :
TITLE
A
B
C
D
E
TITLE1
A1
B1
C1
D1
E1
TITLE2
A2
B2
C2
D2
E2
How can I transform my Dataframes to flat them and concatenate them like this?

Use DataFrame.set_index with transpose and then concat:
df11 = df1.set_index(0).T
df22 = df2.set_index(0).T
df = pd.concat([df11,df22]).set_index('TITLE')
print (df)
0 A B C D E
TITLE
TITLE1 A1 B1 C1 D1 E1
TITLE2 A2 B2 C2 D2 E2
Or transpose after concat with axis=1:
df11 = df1.set_index(0)
df22 = df2.set_index(0)
df = pd.concat([df11,df22], axis=1).T.set_index('TITLE')

Related

Check if value of one column exists in another column, put a value in another column in pandas

Say I have a data frame like the following:
A B C D E
a1 b1 c1 d1 e1
a2 a1 c2 d2 e2
a3 a1 a2 d3 e3
a4 a1 a2 a3 e4
I want to create a new column with predefined values if a value found in other columns.
Something like this:
A B C D E F
a1 b1 c1 d1 e1 NA
a2 a1 c2 d2 e2 in_B
a3 a1 a2 d3 e3 in_B, in_C
a4 a1 a2 a3 e4 in_B, in_C, in_D
The in_B, in_C could be other string of choice. If values present in multiple columns, then value of F would be multiple. Example, row 3 and 4 of column F (in row 3 there are two values and in row 4 there are three values). So far, I have tried a below:
DF.F=np.where(DF.A.isin(DF.B), DF.A,'in_B')
But it does not give expected result. Any help
STEPS:
Stack the dataframe.
check for the duplicate values.
unstack to get the same structure back.
use dot to get the required result.
df['new_col'] = df.stack().duplicated().unstack().dot(
'In ' + k.columns + ',').str.strip(',')
OUTPUT:
A B C D E new_col
0 a1 b1 c1 d1 e1
1 a2 a1 c2 d2 e2 In B
2 a3 a1 a2 d3 e3 In B,In C
3 a4 a1 a2 a3 e4 In B,In C,In D

How to compare two data frames with same columns but different number of rows?

df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the value of the column 'D' in both dataframes. If both dataframes had same number of rows I would just do this.
newDF = df1['D']-df2['D']
However there are times when the number of rows are different. I want a result Dataframe which shows a dataframe like this.
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: if 1st row in A,B,C from df1 and df2 is same then and only then compare 1st row of column D for each dataframe. Similarly, repeat for all the row.
Use merge and df.eval
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1

How to group by two column with swapped values in pandas?

I want to group by columns where the commutative rule applies.
For example
column 1, column 2 contains values (a,b) in the first row and (b,a) for another row, then I want to group these two records perform a group by operation.
Input:
From To Count
a1 b1 4
b1 a1 3
a1 b2 2
b3 a1 12
a1 b3 6
Output:
From To Count(+)
a1 b1 7
a1 b2 2
b3 a1 18
I tried to apply group by after swapping the elements. But I don't have any approach to solve this problem. Help me to solve this problem.
Thanks in advance.
Use numpy.sort for sorting each row:
cols = ['From','To']
df[cols] = pd.DataFrame(np.sort(df[cols], axis=1))
print (df)
From To Count
0 a1 b1 4
1 a1 b1 3
2 a1 b2 2
3 a1 b3 12
4 a1 b3 6
df1 = df.groupby(cols, as_index=False)['Count'].sum()
print (df1)
From To Count
0 a1 b1 7
1 a1 b2 2
2 a1 b3 18

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cr3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add value from each frame df to total in the main_df according to Cri
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it using for loop, but at the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Is there other ways to solve it?
Thank you!
First you should align your numeric column names. In this case:
df_main = df_main.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([df_main, df_1, df_2, df_3])\
.groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
.reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0

select the first N elements of each row in a column

I am looking to select the first two elements of each row in column a and column b.
Here is an example
df = pd.DataFrame({'a': ['A123', 'A567','A100'], 'b': ['A156', 'A266666','A35555']})
>>> df
a b
0 A123 A156
1 A567 A266666
2 A100 A35555
desired output
>>> df
a b
0 A1 A1
1 A5 A2
2 A1 A3
I have been trying to use df.loc but not been successful.
Use
In [905]: df.apply(lambda x: x.str[:2])
Out[905]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
Or,
In [908]: df.applymap(lambda x: x[:2])
Out[908]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
In [107]: df.apply(lambda c: c.str.slice(stop=2))
Out[107]:
a b
0 A1 A1
1 A5 A2
2 A1 A3

Categories

Resources