sum() on specific columns of dataframe - python

I cannot work out how to add a new total row at the end of a DataFrame. The last row needs to apply sum() to some columns and divide two other columns, while only summing rows that pass a filter.
df:
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.79
1 Cat2 2 -81.91 -15.30 -16.00 10.06
2 Cat3 3 -57.70 -18.62 0.00 0.00
I would like the output to be like so:
3 Total -123.60 -119.02 -26.91 100*(-119.02/-26.91)
col3, col4 and col5 should be sums, and col6 should use the formula above.
If [CategID]==2, that row should not be included in the TOTAL.
I was able to get it almost as I wanted by using .query(), like so:
#tg is a list
df.loc['Total'] = df.query("CategID in @tg").sum()
But with the above I cannot make col6 equal to 100*(col4.sum() / col5.sum()), because every column just gets sum().
Then I tried with a Series, like so, but I don't understand how to apply the filter with .where():
s = pd.Series([df['col3'].sum(),
               df['col4'].sum(),
               df['col5'].sum(),
               100 * (df['col4'].sum() / df['col5'].sum())],
              index=['col3', 'col4', 'col5', 'col6'])
df.loc['Total'] = s.where('tag1' in tg)
Using the above Series() works until I add .where(), which gives this error:
ValueError: Array conditional must be same shape as self
So, can I accomplish this with the first method, using .query(), and just somehow modify one of the columns in the TOTAL row?
Otherwise, what am I doing wrong in the second method with .where()?
Thanks

IIUC, you can try:
s = df.mask(df['CategID'].eq(2)).drop(columns='CategID').sum()
s.loc['col6'] = 100 * (s['col4'] / s['col5'])
df.loc[len(df)] = s
df = df.fillna({'Categ': 'Total', 'CategID': ''})
print(df)
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.790000
1 Cat2 2 -81.91 -15.30 -16.00 10.060000
2 Cat3 3 -57.70 -18.62 0.00 0.000000
3 Total -123.60 -119.02 -26.91 442.289112
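For reference, the .query() route from the question can also work if col6 is simply overwritten after the sums are computed; a minimal sketch, assuming tg is the list of CategID values to include:
# tg is a list of CategID values to keep, e.g. tg = [1, 3]
total = df.query("CategID in @tg").drop(columns=["Categ", "CategID"]).sum()
total["col6"] = 100 * (total["col4"] / total["col5"])  # overwrite col6 with the ratio
df.loc["Total"] = total
df = df.fillna({"Categ": "Total", "CategID": ""})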

Related

Faster method of extracting characters for multiple columns in dataframe

I have a pandas DataFrame with multiple columns containing string data in a format like this:
id col1 col2 col3
1 '1:correct' '0:incorrect' '1:correct'
2 '0:incorrect' '1:correct' '1:correct'
What I would like to do is to extract the numeric character before the colon : symbol. The resulting data should look like this:
id col1 col2 col3
1 1 0 1
2 0 1 1
What I have tried is using regex, like the following:
colname = ['col1','col2','col3']
row = len(df)
for col in colname:
    df[col] = df[col].str.findall(r"(\d+):")
    for i in range(0, row):
        df[col].iloc[i] = df[col].iloc[i][0]
    df[col] = df[col].astype('int64')
The second loop selects the first and only element of the list created by the regex. I then convert the object dtype to integer. This code basically does what I want, but it is way too slow, even for a small dataset with a few thousand rows. I have heard that loops are not very efficient in Python.
Is there a faster, more Pythonic way of extracting numerics in a string and converting it to integers?
Use Series.str.extract to get the first value before : inside DataFrame.apply, processing each column with a lambda function:
colname = ['col1','col2','col3']
f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
Another solution uses split and selects the first value before the colon:
colname = ['col1','col2','col3']
f = lambda x: x.str.strip("'").str.split(':').str[0]
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
Another option is a dictionary comprehension with an inner list comprehension; since this operates on plain Python strings, it should be fast:
import re
pattern = re.compile(r"\d(?=:)")
result = {key: [int(pattern.search(arr).group(0))
                if isinstance(arr, str)
                else arr
                for arr in value.array]
          for key, value in df.items()}
pd.DataFrame(result)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
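If some cells could be missing or contain no digit before the colon, extract() yields NaN and astype('int64') would raise; a hedged variant that keeps such rows, using to_numeric and the nullable Int64 dtype:
colname = ['col1', 'col2', 'col3']
# extract() leaves NaN where nothing matches; to_numeric keeps it as NaN,
# and the nullable Int64 dtype stores it as <NA> instead of failing the cast
f = lambda x: pd.to_numeric(x.str.extract(r"(\d+):", expand=False))
df[colname] = df[colname].apply(f).astype('Int64')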

Create a column out of the 2nd portion of text of two columns in pandas

I have a dataframe with two columns. I want to create a third column that is the "sum" of the first two columns, but without the first bit of each column. I think this is best shown in an example:
col1 col2 col3 (need to make)
abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
psdb_what_I_want2 what_I_want2
vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
ertsa_what_I_want5 what_I_want5
abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
Note that what_I_want# will be different for every row, but the same between columns in the same row. The prefix will always be the same for each row but can differ/repeat between rows. Cells shown as blank are "" strings.
The code I have so far:
df["col3"] = df["col1"].str.split("_", 1) + df["col2"].str.split("_", 1)
From there I wanted just the 2nd (or last) element of the split so I tried both of the following:
df["col3"] = df["col1"].str.split("_", 1)[1] + df["col2"].str.split("_", 1)[1]
df["col3"] = df["col1"].str.split("_", 1)[-1] + df["col2"].str.split("_", 1)[-1]
Both of these returned errors. The first, I think, is because of replicated values (ValueError: cannot reindex from a duplicate axis). The second is a KeyError.
You were actually quite close; you just needed to select the correct part with str[1] and then fillna for the empty cells:
m = df['col1'].str.split('_', n=1).str[1].fillna('') + df['col2'].str.split('_', n=1).str[1].fillna('')
df['col3'] = m
col1 col2 col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
Another method would be to use apply where you can apply split on multiple columns at once:
m = df[['col1', 'col2']].apply(lambda x: x.str.split('_', n=1).str[1]).fillna('')
df['col3'] = m['col1']+m['col2']
col1 col2 col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
You can replace() all characters up to and including the first underscore and then apply() a join(), or use sum() on axis=1:
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').apply(''.join,axis=1)
Or:
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').sum(axis=1)
Or:
df['Col3']=(pd.Series(df.replace('^[^_]*_','',regex=True).fillna('').values.tolist())
.str.join(''))
col1 col2 Col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2 what_I_want2I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 NaN ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
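A further option, not shown above, is str.partition, which splits on the first underscore only and leaves empty strings empty, so no fillna is needed (a sketch assuming blank cells are '' rather than NaN):
# partition('_') returns three parts; column 2 is everything after the first underscore
df['col3'] = df['col1'].str.partition('_')[2] + df['col2'].str.partition('_')[2]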

Find index of first row closest to value in pandas DataFrame

So I have a dataframe containing multiple columns. For each column, I would like to get the index of the first row that is nearly equal to a user specified number (e.g. within 0.05 of desired number). The dataframe looks kinda like this:
ix col1 col2 col3
0 nan 0.2 1.04
1 0.98 nan 1.5
2 1.7 1.03 1.91
3 1.02 1.42 0.97
Say I want the first row that is nearly equal to 1.0, I would expect the result to be:
index 1 for col1 (not index 3 even though they are mathematically equally close to 1.0)
index 2 for col2
index 0 for col3 (not index 3 even though 0.97 is closer to 1 than 1.04)
I've tried an approach that makes use of argsort():
df.iloc[(df.col1-1.0).abs().argsort()[:1]]
This would, according to other topics, give me the index of the row in col1 with the value closest to 1.0. However, it returns only a dataframe full of nans. I would also imagine this method does not give the first value close to 1 it encounters per column, but rather the value that is closest to 1.
Can anyone help me with this?
Use DataFrame.sub for the difference, convert to absolute values with abs, compare with lt (<), and finally get the index of the first matching value with DataFrame.idxmax:
a = df.sub(1).abs().lt(0.05).idxmax()
print (a)
col1 1
col2 2
col3 0
dtype: int64
For a more general solution that also works when the boolean mask has no match in a column (no value is within tolerance), append a row of Trues with name NaN, so idxmax returns NaN for such columns:
print (df)
col1 col2 col3
ix
0 NaN 0.20 1.07
1 0.98 NaN 1.50
2 1.70 1.03 1.91
3 1.02 1.42 0.87
s = pd.Series([True] * len(df.columns), index=df.columns, name=np.nan)
a = df.sub(1).abs().lt(0.05).append(s).idxmax()
print (a)
col1 1.0
col2 2.0
col3 NaN
dtype: float64
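Note that DataFrame.append was removed in pandas 2.0; the same fallback row of Trues can be added with pd.concat instead (a sketch of the same logic as above):
# a one-row DataFrame of Trues labelled NaN, so idxmax falls back to NaN when nothing matches
fallback = pd.DataFrame(True, index=[np.nan], columns=df.columns)
a = pd.concat([df.sub(1).abs().lt(0.05), fallback]).idxmax()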
Suppose you have some tolerance value tol for the nearly-equal threshold. You can create a mask DataFrame for values below the threshold and use first_valid_index() on each column to get the index of the first match occurrence.
tol = 0.05
mask = df[(df - 1).abs() < tol]
for col in df:
    print(col, mask[col].first_valid_index())

Geometric mean applied on row

I have this data frame as example:
Col1 Col2 Col3 Col4
1 2 3 2.2
I would like to add a 4th column called 'Gmean' that calculates the geometric mean of the first 3 columns on each row.
How can I get it done?
Thanks!
One way would be with Scipy's geometric mean function -
from scipy.stats.mstats import gmean
df['Gmean'] = gmean(df.iloc[:,:3],axis=1)
Another way is with the formula of the geometric mean itself -
import numpy as np
df['Gmean'] = np.power(df.iloc[:, :3].prod(axis=1), 1.0 / 3)
If there are exactly 3 columns, just use df instead of df.iloc[:,:3]. Also, if you are looking for performance, you might want to work with the underlying array data with df.values or df.iloc[:,:3].values.
df.assign(Gmean=df.iloc[:, :3].prod(1) ** (1. / 3))
Col1 Col2 Col3 Col4 Gmean
0 1 2 3 2.2 1.817121
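If the row-wise products could get large, the geometric mean can also be computed as the exponential of the mean of the logs, which avoids overflow (a sketch; assumes all values are positive):
import numpy as np
df['Gmean'] = np.exp(np.log(df.iloc[:, :3]).mean(axis=1))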

Comparing two dataframes and finding a unique combination of columns

I have two DataFrames with different sizes and different numbers of columns, for example:
DF1:
index col1 col2 col3
1 AA A12 SH7B
2 Ac DJS 283
3 ZH 28S 48d
DF2:
index col1 col2 col3 col4
2 AA cc2 SH7B hd5
7 Ac DJS 283,dhb re
10 ZH 28S SJE,48d 385d
23 3V4 38D 350,eh4 sm4
44 S3 3YE 032,she 3927
So the indexes are different, and some combinations of data in the first DataFrame also appear in the second one; I want to find them. I want to iterate through the rows of the second DataFrame, build every single combination of data per row (for example, (7,Ac,DJS,283,re) and (7,Ac,DJS,dhb,re) are two combinations for index 7, since one column has more than one value), compare them with the first DataFrame's rows, and print the ones for which an identical combination exists in the second DataFrame as well.
result:
1 Ac DJS 283
2 ZH 28S 48d
thank you
You need to split col3 of data frame 2 first and then merge it back with data frame 1. To split col3, a common approach is to split and flatten it while using numpy.repeat to bring the other columns to the same length:
import pandas as pd
import numpy as np
from itertools import chain
# count how many repeats are needed for the other columns, based on commas
repeats = df2.col3.str.count(",") + 1
# repeat the columns except col3, split and flatten col3, then merge it back with df1
(df2.drop(columns='col3').apply(lambda col: np.repeat(col, repeats))
    .assign(col3=list(chain.from_iterable(df2['col3'].str.split(','))))
    .merge(df1))
# col1 col2 col4 col3
#0 Ac DJS re 283
#1 ZH 28S 385d 48d
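On pandas 0.25+ the same split-and-flatten can be written more directly with DataFrame.explode; a sketch of the same idea:
out = (df2.assign(col3=df2['col3'].str.split(','))  # turn col3 into lists of values
          .explode('col3')                          # one row per list element
          .merge(df1))                              # keep only combinations also present in df1
print(out)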
