Appending unique mixed string using pandas or python - python

I have a table or DataFrame (if pandas has a better way) in which one column holds mixed alphanumeric strings. I need to count the rows for each value and append a unique mixed string as the counter. What is the best way to do this: a Python loop, or does pandas have some syntax for it? Example data:
col0 col1 col2
ENSG0001 E001 ENSG001:E001
ENSG0001 E002 ENSG001:E002
.
.
ENSG001 E028 ENSG001:E028
ENSG002 E001 ENSG002:E001
.
ENSG002 E012 ENSG002:E012
Edit:
I need to count the elements in col0, but instead of a plain number I need a counter such as E001, and then concatenate col0 and col1 into col2.

Add to the column a Series created with cumcount, converted to strings with astype and zero-padded with zfill.
df['col3'] = (df['col0'] + ':E' +
              df.groupby('col0').cumcount().add(1).astype(str).str.zfill(3))
print(df)
col0 col1 col2 col3
0 ENSG0001 E001 ENSG001:E001 ENSG0001:E001
1 ENSG0001 E002 ENSG001:E002 ENSG0001:E002
2 ENSG001 E028 ENSG001:E028 ENSG001:E001
3 ENSG002 E001 ENSG002:E001 ENSG002:E001
4 ENSG002 E012 ENSG002:E012 ENSG002:E002
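As a self-contained sketch of the same idea, the example below builds both the counter column and the concatenated column from col0 alone. The sample values are made up for illustration, and it assumes the rows are already in the order in which they should be numbered.

```python
import pandas as pd

# Illustrative data only: three rows for one gene ID, two for another
df = pd.DataFrame({"col0": ["ENSG001", "ENSG001", "ENSG001", "ENSG002", "ENSG002"]})

# cumcount() numbers the rows within each group starting at 0, so add 1
counter = df.groupby("col0").cumcount().add(1)

# Zero-pad the counter to three digits and prefix it with "E"
df["col1"] = "E" + counter.astype(str).str.zfill(3)

# Join the group key and the counter with a colon
df["col2"] = df["col0"] + ":" + df["col1"]
print(df)
```

The counter restarts at E001 whenever the value in col0 changes, because cumcount is computed per group.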

Related

removing specific words from a dataset [duplicate]

I have a pandas data frame, which looks like the following:
col1 col2 col3 ...
field1:index1:value1 field2:index2:value2 field3:index3:value3 ...
field1:index4:value4 field2:index5:value5 field3:index5:value6 ...
The field is of int type, index is of int type and value could be int or float type.
I want to convert this data frame into the following expected output:
col1 col2 col3 ...
index1:value1 index2:value2 index3:value3 ...
index4:value4 index5:value5 index5:value6 ...
I want to remove the all field: values from all the cells. How to do this?
EDIT: An example of a cell looks like: 1:1:1.0445731675303e-06 and I would like to reduce such strings to 1:1.0445731675303e-06, in all the cells.
Given
>>> df
col1 col2 col3
0 1:index1:value1 2:index2:value2 3:index3:value3
1 1:index4:value4 2:index5:value5 3:index5:value6
you can use
>>> df.apply(lambda s: s.str.replace(r'^\d+:', '', regex=True))
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
The regex '^\d+:' matches a sequence of digits followed by a colon, but only at the beginning of a string, so just the leading field is removed.
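A minimal runnable sketch of this regex approach, using cells shaped like the one in the question's edit. The sample values other than 1:1:1.0445731675303e-06 are made up, and the sketch assumes every cell is a string.

```python
import pandas as pd

# Cells have the form field:index:value; only the first cell is taken from the question
df = pd.DataFrame({
    "col1": ["1:1:1.0445731675303e-06", "1:4:2.5"],
    "col2": ["2:2:0.5", "2:5:7"],
})

# Column by column, remove the leading "<digits>:" and nothing else
stripped = df.apply(lambda s: s.str.replace(r"^\d+:", "", regex=True))
print(stripped)
```

Because the pattern is anchored with ^, the second colon-separated number (the index) is left untouched.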
Try this (note that in pandas 2.1 and later, applymap is deprecated in favour of DataFrame.map):
df = df.applymap(lambda x: ':'.join(str(x).split(':')[1:]))
print(df)
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
Another possibility is to split each string with a regex that captures everything after the first colon, and then extract the captured part with .str[index]:
df.apply(lambda s: s.str.split(r'(^[a-z0-9]+:(.*))').str[-2])
Another possible solution is to run the string processing in a list comprehension and create a new dataframe, using the old dataframe's column names:
result = [[":".join(word.split(":")[1:])
           for word in ent]
          for ent in df.to_numpy()]
pd.DataFrame(result, columns=df.columns)
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
This is usually faster than running applymap or apply, since string processing tends to be much faster in plain Python than in pandas.
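The list-comprehension idea can be sketched in a slightly different form, splitting only once at the first colon instead of splitting everywhere and re-joining. The helper name drop_field and the sample data are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Illustrative field:index:value cells
df = pd.DataFrame({
    "col1": ["1:10:0.5", "1:40:2.5"],
    "col2": ["2:20:1.5", "2:50:7"],
})

def drop_field(cell):
    # Hypothetical helper: split once at the first colon and keep the remainder.
    # A cell without any colon is returned unchanged.
    return str(cell).split(":", 1)[-1]

# Process the raw values in plain Python, then rebuild the frame with the same labels
result = pd.DataFrame(
    [[drop_field(cell) for cell in row] for row in df.to_numpy()],
    columns=df.columns,
    index=df.index,
)
print(result)
```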

sum() on specific columns of dataframe

I cannot work out how to add a new row at the end of a DataFrame. The last row needs to hold the sum() of specific columns and, in one column, the ratio of two of those sums. A filter also has to be applied so that only specific rows are summed.
df:
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.79
1 Cat2 2 -81.91 -15.30 -16.00 10.06
2 Cat3 3 -57.70 -18.62 0.00 0.00
I would like the output to be like so:
3 Total -123.60 -119.02 -26.91 100*(-119.02/-26.91)
col3,col4,col5 would have sum(), and col6 would be the above formula.
If [CategID]==2, then don't include in the TOTAL
I was able to get it almost as I wanted by using .query(), like so:
#tg is a list
df.loc['Total'] = df.query("CategID in @tg").sum()
But with the above I cannot have the 'col6' like this 100*(col4.sum() / col5.sum()), because they are all sum().
Then I tried with Series like so, but I don't understand how to apply filter .where()
s = pd.Series([df['col3'].sum(),
               df['col4'].sum(),
               df['col5'].sum(),
               100 * (df['col4'].sum() / df['col5'].sum())],
              index=['col3', 'col4', 'col5', 'col6'])
df.loc['Total'] = s.where('tag1' in tg)
Using the above Series() works until I add .where(), which gives the error:
ValueError: Array conditional must be same shape as self
So, can I accomplish this with the first method, using .query(), and then somehow modify just one of the columns in the TOTAL row?
Otherwise, what am I doing wrong with .where() in the second method?
Thanks
IIUC, you can try:
# pandas 2.x needs drop(columns=...) and numeric_only=True here
s = df.mask(df['CategID'].eq(2)).drop(columns="CategID").sum(numeric_only=True)
s.loc['col6'] = 100*(s['col4'] / s['col5'])
df.loc[len(df)] = s
df = df.fillna({'Categ':'Total',"CategID":''})
print(df)
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.790000
1 Cat2 2 -81.91 -15.30 -16.00 10.060000
2 Cat3 3 -57.70 -18.62 0.00 0.000000
3 Total -123.60 -119.02 -26.91 442.289112
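The same Total row can also be built without mask, by filtering first and assembling the row explicitly. This is only a sketch under the question's rule that CategID 2 is excluded. The variable names kept, sums and total_row are illustrative, and the empty string for CategID mirrors the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "Categ": ["Cat1", "Cat2", "Cat3"],
    "CategID": [1, 2, 3],
    "col3": [-65.90, -81.91, -57.70],
    "col4": [-100.40, -15.30, -18.62],
    "col5": [-26.91, -16.00, 0.00],
    "col6": [23.79, 10.06, 0.00],
})

# Keep only the rows that should count towards the total
kept = df[df["CategID"] != 2]

# Sum the plain columns, then derive col6 from the summed values
sums = kept[["col3", "col4", "col5"]].sum()
total_row = {
    "Categ": "Total",
    "CategID": "",
    "col3": sums["col3"],
    "col4": sums["col4"],
    "col5": sums["col5"],
    "col6": 100 * sums["col4"] / sums["col5"],
}

# Append the total as a new last row
result = pd.concat([df, pd.DataFrame([total_row])], ignore_index=True)
print(result)
```

Computing col6 from the already summed col4 and col5 avoids the problem in the question, where every column of the Total row ended up as a plain sum().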

Create a column out of the 2nd portion of text of two columns in pandas

I have a dataframe with two columns. I want to create a third column that is the
"sum" of the first two columns, but without the first bit of each column. I think this is best shown in an example:
col1 col2 col3 (need to make)
abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
psdb_what_I_want2 what_I_want2
vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
ertsa_what_I_want5 what_I_want5
abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
Note that what_I_want# will be different for every row, but the same between columns in the same row. The prefix will always be the same for each row but can differ/repeat between rows. Cells shown as blank are "" strings.
The code I have so far:
df["col3"] = df["col1"].str.split("_", 1) + df["col2"].str.split("_", 1)
From there I wanted just the 2nd (or last) element of the split so I tried both of the following:
df["col3"] = df["col1"].str.split("_", 1)[1] + df["col2"].str.split("_", 1)[1]
df["col3"] = df["col1"].str.split("_", 1)[-1] + df["col2"].str.split("_", 1)[-1]
Both of these returned errors. The first error, I think, is because of replicated values (ValueError: cannot reindex from a duplicate axis). The second is a KeyError.
You were actually quite close; you just needed to select the correct element with str[1] and use fillna for the empty cells (n=1 is passed by keyword because pandas 2.0 and later no longer accept it positionally):
m = df['col1'].str.split('_', n=1).str[1].fillna('') + df['col2'].str.split('_', n=1).str[1].fillna('')
df['col3'] = m
col1 col2 col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
Another method would be to use apply where you can apply split on multiple columns at once:
m = df[['col1', 'col2']].apply(lambda x: x.str.split('_', n=1).str[1]).fillna('')
df['col3'] = m['col1']+m['col2']
col1 col2 col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
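For reference, here is a self-contained version of the split approach on a shortened copy of the question's data. It assumes, as the question states, that blank cells are empty strings rather than NaN.

```python
import pandas as pd

# Shortened sample: blank cells are "" as described in the question
df = pd.DataFrame({
    "col1": ["abc_what_I_want1", "psdb_what_I_want2", ""],
    "col2": ["abc_what_I_want1", "", "ertsa_what_I_want5"],
})

# Split once at the first underscore. For "" the split yields a single
# element, so .str[1] gives NaN, which fillna turns back into "".
tails = df[["col1", "col2"]].apply(lambda s: s.str.split("_", n=1).str[1]).fillna("")

# Concatenate the two remainders row by row
df["col3"] = tails["col1"] + tails["col2"]
print(df)
```

Limiting the split to one cut matters here because the part to keep contains underscores itself.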
You can replace() all characters up to and including the first underscore and then apply() a join() or sum() on axis=1:
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').apply(''.join,axis=1)
Or:
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').sum(axis=1)
Or:
df['Col3']=(pd.Series(df.replace('^[^_]*_','',regex=True).fillna('').values.tolist())
.str.join(''))
col1 col2 Col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2 what_I_want2I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 NaN ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6

Geometric mean applied on row

I have this data frame as example:
Col1 Col2 Col3 Col4
1 2 3 2.2
I would like to add a new column called 'Gmean' that calculates the geometric mean of the first 3 columns on each row.
How can I get this done?
Thanks!
One way would be with Scipy's geometric mean function -
from scipy.stats.mstats import gmean
df['Gmean'] = gmean(df.iloc[:,:3],axis=1)
Another way, with the formula of the geometric mean itself -
import numpy as np
df['Gmean'] = np.power(df.iloc[:,:3].prod(axis=1),1.0/3)
If there are exactly 3 columns, just use df instead of df.iloc[:,:3]. Also, if you are looking for performance, you might want to work with the underlying array data with df.values or df.iloc[:,:3].values.
df.assign(Gmean=df.iloc[:, :3].prod(1) ** (1. / 3))
Col1 Col2 Col3 Col4 Gmean
0 1 2 3 2.2 1.817121
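A variant not taken from the answers above: the geometric mean can also be computed as the exponential of the mean of the logarithms, which avoids forming a large product when there are many columns. This sketch assumes all values are strictly positive.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1], "Col2": [2], "Col3": [3], "Col4": [2.2]})

# Geometric mean of the first three columns as exp(mean(log(x))).
# Only valid for strictly positive values, since log is undefined otherwise.
df["Gmean"] = np.exp(np.log(df.iloc[:, :3]).mean(axis=1))
print(df)
```

For the single row here, the result is the cube root of 1 * 2 * 3 = 6, the same value the product-based formula gives.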

Changing values in a dataframe column based off a different column (python)

Col1 Col2
0 APT UB0
1 AK0 UUP
2 IL2 PB2
3 OIU U5B
4 K29 AAA
My data frame looks similar to the above data. I'm trying to change the values in Col1 if the corresponding values in Col2 have the letter "B" in it. If the value in Col2 has "B", then I want to add "-B" to the end of the value in Col1.
Ultimately I want Col1 to look like this:
Col1
0 APT-B
1 AK0
2 IL2-B
.. ...
I have an idea of how to approach it... but I'm somewhat confused because I know my code is incorrect. In addition there are NaN values in my actual code for Col1... which will definitely give an error when I'm trying to do val += "-B" since it's not possible to add a string and a float.
for value in dataframe['Col2']:
    if "Z" in value:
        for val in dataframe['Col1']:
            val += "-B"
Does anyone know how to fix/solve this?
Rather than using a loop, let's use pandas directly:
import pandas as pd
df = pd.DataFrame({'Col1': ['APT', 'AK0', 'IL2', 'OIU', 'K29'], 'Col2': ['UB0', 'UUP', 'PB2', 'U5B', 'AAA']})
df.loc[df.Col2.str.contains('B'), 'Col1'] += '-B'
print(df)
Output:
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
You have too many "for" loops in your code. You just need to iterate over the rows once, and for any row satisfying your condition you make the change.
for idx, row in df.iterrows():
    if 'B' in row['Col2']:
        df.loc[idx, 'Col1'] = str(df.loc[idx, 'Col1']) + '-B'
edit: I used str to convert the previous value in Col1 to a string before appending, since you said you sometimes have non-string values there. If this doesn't work for you, please post your test data and results.
You can use a lambda expression. If 'B' is in Col2, then '-B' gets appended to Col1. The end result is assigned back to Col1.
df['Col1'] = df.apply(lambda x: x.Col1 + ('-B' if 'B' in x.Col2 else ''), axis=1)
>>> df
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
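The question mentions NaN values in Col1, which none of the outputs above exercise. The following is a sketch of one way to handle them with a boolean mask. The sample data with missing values is made up, and leaving a missing Col1 as NaN is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values in both columns
df = pd.DataFrame({
    "Col1": ["APT", np.nan, "IL2", "OIU", "K29"],
    "Col2": ["UB0", "PB1", "PB2", np.nan, "AAA"],
})

# na=False treats a missing Col2 as "does not contain B"
has_b = df["Col2"].str.contains("B", na=False)

# Only touch rows where Col1 is present, so NaN stays NaN (assumed behaviour)
target = has_b & df["Col1"].notna()
df.loc[target, "Col1"] = df.loc[target, "Col1"] + "-B"
print(df)
```

If a missing Col1 should instead become the literal "-B", fill it with an empty string before appending.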
