I have a pandas dataframe, df, that looks like the following:
Col1 Col2 Col3
A 4 6
A 8 36
B 1 4
B 6 8
Now I want to pairwise divide the rows of the dataframe, resulting in:
Col1 Col2 Col3
A 2 6
B 6 2
That is, I want to divide the second row of each pair by the first. I am trying to use groupby, but without success.
Does anyone have a solution?
If you always have a pair of rows, you can just try iloc:
(df.iloc[1::2, 1:]
   .div(df.iloc[::2, 1:].to_numpy())  # divide each second row by the row before it
   .assign(Col1=df.iloc[1::2, 0])     # restore the Col1 labels
)
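Converting the divisor slice with to_numpy() matters here: dividing two DataFrames directly would align them on their (different) row indices and produce all NaN.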
If each Col1 value forms exactly one pair, you can group on it instead:
def divide(group):
    # You could also use head(1)/tail(1) or first()/last().
    return group.iloc[-1] / group.iloc[0]

df_ = df.groupby('Col1')[['Col2', 'Col3']].apply(divide).reset_index()
print(df_)
Col1 Col2 Col3
0 A 2.0 6.0
1 B 6.0 2.0
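The first()/last() alternative mentioned in the comment could also be written without apply; a minimal sketch:
g = df.groupby('Col1')
# Col1 becomes the index after aggregation, so the division aligns per group.
out = (g.last() / g.first()).reset_index()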
Another option is to group on the first column and use nth to divide the second row of each pair by the first:
g = df.groupby("Col1")
out = g.nth(1).div(g.nth(0)).reset_index()
print(out)
Col1 Col2 Col3
0 A 2.0 6.0
1 B 6.0 2.0
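Note: GroupBy.nth changed behaviour in pandas 2.0; it now returns the selected rows with their original index and keeps the grouping column, so the division above no longer aligns on Col1 there. A sketch for recent versions, restoring the alignment explicitly:
g = df.groupby("Col1")
# nth now behaves like a row filter, so set Col1 as the index before dividing.
out = g.nth(1).set_index("Col1").div(g.nth(0).set_index("Col1")).reset_index()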
Consider the following dataframe df:
df = pd.DataFrame(
    {
        "col1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "col2": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"],
        "col3": [1e-0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10],
        "col4": [0, 4, 2, 5, 6, 7, 6, 3, 6, 2, 1],
    }
)
I would like to select the rows where the col4 value is greater than the col4 values of both the previous and the next row, and store them in a new dataframe.
I wrote the following code, which works:
df1 = pd.DataFrame()
for i in range(1, len(df) - 1):
    if (df.iloc[i]['col4'] > df.iloc[i + 1]['col4']) and (df.iloc[i]['col4'] > df.iloc[i - 1]['col4']):
        df1 = pd.concat([df1, df.iloc[i:i + 1]])
and I got the expected dataframe df1:
col1 col2 col3 col4
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
But this code is ugly and hardly readable. Is there a better solution?
Use boolean indexing: compare against the previous and next values with Series.shift and Series.gt, chaining the two conditions with the bitwise AND operator &:
df = df[df['col4'].gt(df['col4'].shift()) & df['col4'].gt(df['col4'].shift(-1))]
print(df)
col1 col2 col3 col4
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
EDIT: solution that always includes the first and last rows:
mask = df['col4'].gt(df['col4'].shift()) & df['col4'].gt(df['col4'].shift(-1))
mask.iloc[[0, -1]] = True
df = df[mask]
print(df)
col1 col2 col3 col4
0 0 A 1.000000e+00 0
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
10 10 K 1.000000e-10 1
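For what it's worth, the strict local maxima (without the endpoint handling of the EDIT) can also be found with a comparator; a sketch, assuming scipy is available:
import numpy as np
from scipy.signal import argrelextrema

# Indices where col4 is strictly greater than both neighbours.
idx = argrelextrema(df['col4'].to_numpy(), np.greater)[0]
df1 = df.iloc[idx]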
I'd like to swap the col1 value with the col2 value when col1 >= 14, in pandas!
col1  col2
16    1
3     2
4     3
This should become:
col1  col2
1     16
3     2
4     3
Thanks!
Use Series.mask and reassign the two columns' values:
m = df["col1"].ge(14)
out = df.assign(
col1=df["col1"].mask(m, df["col2"]),
col2=df["col2"].mask(m, df["col1"])
)
Output:
col1 col2
0 1 16
1 3 2
2 4 3
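Both expressions passed to assign are evaluated against the original df before anything is assigned, so col2 still sees the pre-swap col1 values.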
A simple one-liner solution; using .values strips the column labels so the swapped values are not realigned on assignment:
df.loc[df['col1'] >= 14,['col1','col2']] = df.loc[df['col1'] >= 14,['col2','col1']].values
How can I merge two dataframes when the key column has a slight offset from the column I am merging to?
df1 =
col1  col2
1     a
2     b
3     c
df2 =
col1  col3
1.01  d
2     e
2.95  f
So the merged dataframe would end up like this, even though the values in col1 are slightly different:
df_merge =
col1  col2  col3
1     a     d
2     b     e
3     c     f
I have seen scenarios like this where "col1" is a string, but I'm wondering if it's possible to do this with something like pandas.merge() when there is a slight numerical offset (e.g. +/- 0.05).
Let's use merge_asof with the tolerance parameter. Note that df1['col1'] is cast to float because merge_asof requires the key columns to have matching dtypes:
pd.merge_asof(
    df1.astype({'col1': 'float'}).sort_values('col1'),
    df2.sort_values('col1'),
    on='col1',
    direction='nearest',
    tolerance=0.05
)
col1 col2 col3
0 1.0 a d
1 2.0 b e
2 3.0 c f
PS: if the dataframes are already sorted on col1 then there is no need to sort again.
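Also note that merge_asof is a left join: any df1 row with no df2 match within the tolerance simply gets NaN in col3.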
Let's take this dataframe:
df = pd.DataFrame(dict(Col1=["a", "c"], Col2=["b", "d"], Col3=[1, 3], Col4=[2, 4]))
Col1 Col2 Col3 Col4
0 a b 1 2
1 c d 3 4
I would like to have one row per value in columns Col1 and Col2 (2 rows and 2 value columns, so the expected dataframe has 2*2 = 4 rows).
Expected result :
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
How could I do this, please?
Pandas melt does the job here; the rest just has to do with repositioning and renaming the columns appropriately.
Use pandas melt to transform the dataframe, with Col3 and Col4 as the id variables; melt converts from wide to long format.
Next, reindex the columns so that variable and value come first.
Finally, rename the columns appropriately.
(df.melt(id_vars=['Col3', 'Col4'])
   .reindex(['variable', 'value', 'Col3', 'Col4'], axis=1)
   .rename({'variable': 'Ind', 'value': 'Value'}, axis=1)
)
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
I have a pandas dataframe:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
And I want to add a new row summing over the two columns [Col1, Col2], like:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
Total 3 5 NaN
Ignoring Col3. What should I do? Thanks in advance.
You can use the pandas.DataFrame.append and pandas.DataFrame.sum methods:
import numpy as np

df2 = df.append(df.sum(), ignore_index=True)        # append the column sums as a new row
df2.iloc[-1, df2.columns.get_loc('Col3')] = np.nan  # blank out Col3 in the total row
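Note that DataFrame.append was removed in pandas 2.0; a pd.concat equivalent that also labels the new row 'Total' could look like this (a sketch):
import pandas as pd

# Sum only Col1/Col2; Col3 is absent from the total row, so it ends up NaN.
total = df[['Col1', 'Col2']].sum().to_frame('Total').T
df2 = pd.concat([df, total])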
You can use pd.DataFrame.loc. Since NaN is a float, the row assignment upcasts the columns to float, so Col1 and Col2 are cast back to int afterwards:
import numpy as np
df.loc['Total'] = [df['Col1'].sum(), df['Col2'].sum(), np.nan]
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(int)
print(df)
Col1 Col2 Col3
0 1 2 3.0
1 2 3 4.0
Total 3 5 NaN
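Since assigning a Series to a row with loc aligns on the column labels, the same result can be written more compactly; a sketch:
# Columns missing from the summed Series (here Col3) are filled with NaN.
df.loc['Total'] = df[['Col1', 'Col2']].sum()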