Create new columns based on splitting a string from another column - python

I have used re.search to extract uniqueID strings from larger strings.
For example:
import re
string= 'example string with this uniqueID: 300-350'
combination = '(\d+)[-](\d+)'
m = re.search(combination, string)
print (m.group(0))
Out: '300-350'
I have created a dataframe with the UniqueID and the Combination as columns.
uniqueID combinations
0 300-350 (\d+)[-](\d+)
1 off-250 (\w+)[-](\d+)
2 on-stab (\w+)[-](\w+)
And a dictionary meaning_combination relating each combination to the variable meaning it represents:
meaning_combination={'(\\d+)[-](\\d+)': 'A-B',
'(\\w+)[-](\\d+)': 'C-A',
'(\\w+)[-](\\w+)': 'C-D'}
I want to create new columns for each variable (A, B, C, D) and fill them with their corresponding values.
The final result should look like this:
uniqueID combinations A B C D
0 300-350 (\d+)[-](\d+) 300 350
1 off-250 (\w+)[-](\d+) 250 off
2 on-stab (\w+)[-](\w+) stab on

I would fix your regexes to:
meaning_combination={'(\d+-\d+)': 'A-B',
'([^0-9\W]+\-\d+)': 'C-A',
'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
This captures the full match as a single group instead of three capturing groups.
I.e. (300-350, 300, 350) --> (300-350)
You don't need the two extra capturing groups: once a specific pattern is satisfied, you know where the word or digit characters sit (based on how you defined the pattern), so you can split on - to access them individually.
I.e.:
s = 'example string with this uniqueID: 300-350'
values = re.findall('(\d+-\d+)', s)
>>>['300-350']
#first number:
values[0].split('-')[0]
>>>'300'
If you use this approach, you can loop over the dictionary keys and the list of strings and test whether each pattern is satisfied in the string (len(re.findall(pattern, string)) != 0). When it is, grab the corresponding dictionary value for that key, split both that value and the match on -, and build a new dictionary in the loop with dictionary_value.split('-')[0] : match[0].split('-')[0] and dictionary_value.split('-')[1] : match[0].split('-')[1]; also assign the full match to Unique ID and the matched pattern to Combination. Then use pandas to make a DataFrame.
Altogether:
import re
import pandas as pd
stri= ['example string with this uniqueID: 300-350', 'example string with this uniqueID: off-250', 'example string with this uniqueID: on-stab']
meaning_combination={'(\d+-\d+)': 'A-B',
'([^0-9\W]+\-\d+)': 'C-A',
'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
values = [{'Unique ID': re.findall(x, st)[0],
           'Combination': x,
           y.split('-')[0]: re.findall(x, st)[0].split('-')[0],
           y.split('-')[1]: re.findall(x, st)[0].split('-')[1]}
          for st in stri for x, y in meaning_combination.items()
          if len(re.findall(x, st)) != 0]
df = pd.DataFrame.from_dict(values)
#just to sort it in order since default is alphabetical
col_val = ['Unique ID', 'Combination', 'A', 'B', 'C', 'D']
df = df.reindex(sorted(df.columns, key=lambda x: col_val.index(x) ), axis=1)
print(df)
output:
Unique ID Combination A B C D
0 300-350 (\d+-\d+) 300 350 NaN NaN
1 off-250 ([^0-9\W]+\-\d+) 250 NaN off NaN
2 on-stab ([^0-9\W]+\-[^0-9\W]+) NaN NaN on stab
Also, note that I think you have a typo in your expected output, because you have:
'(\\w+)[-](\\d+)': 'C-A'
which would match off-250, but in your final result you have:
uniqueID combinations A B C D
1 off-250 (\w+)[-](\d+) 250 off
Based on your key, these values should be under C and A instead.
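For completeness, a minimal sketch that works directly from the uniqueID/combinations dataframe in the question, keeping the original two-group patterns: each row's own pattern is applied to its uniqueID and the captured groups are written into the columns named by meaning_combination (the dataframe below is a reconstruction of the one shown above).
import re
import pandas as pd
# reconstruction of the question's dataframe and dictionary
df = pd.DataFrame({'uniqueID': ['300-350', 'off-250', 'on-stab'],
                   'combinations': ['(\\d+)[-](\\d+)', '(\\w+)[-](\\d+)', '(\\w+)[-](\\w+)']})
meaning_combination = {'(\\d+)[-](\\d+)': 'A-B',
                       '(\\w+)[-](\\d+)': 'C-A',
                       '(\\w+)[-](\\w+)': 'C-D'}
for idx, row in df.iterrows():
    m = re.search(row['combinations'], row['uniqueID'])
    if m:
        # look up which variable each captured group represents, e.g. 'A-B'
        names = meaning_combination[row['combinations']].split('-')
        for name, value in zip(names, m.groups()):
            df.loc[idx, name] = value
print(df)
This fills A/B/C/D per row and leaves the other variable columns as NaN, with off-250 landing in C and A as the key dictates.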

Related

Multiple lambda outputs in string replacement using apply [Python]

I have a list of "states" over which I have to iterate:
states = ['antioquia', 'boyaca', 'cordoba', 'choco']
I have to iterate over one column in a pandas df to replace or cut the string where the state text is found, so I try:
df_copy['joined'].apply([(lambda x: x.replace(x,x[:-len(j)]) if x.endswith(j) and len(j) != 0 else x) for j in states])
And the result is not what I want.
Result wanted: the joined column is the input and the desired output is the p_joined column.
If it's possible, I would also like to find the state not only at the end of the string but anywhere the string contains it, and replace it.
Thanks in advance for your help.
This will do what your question asks:
df_copy['p_joined'] = df_copy.joined.str.replace('(' + '|'.join(states) + ')$', '', regex=True)
Output:
joined p_joined
0 caldasantioquia caldas
1 santafeantioquia santafe
2 medelinantioquiamedelinantioquia medelinantioquiamedelin
3 yarumalantioquia yarumal
4 medelinantioquiamedelinantioquia medelinantioquiamedelin
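If the state can also appear in the middle of the string, as the question asks at the end, a minimal sketch, assuming the same df_copy and states, that strips it wherever it occurs (regex=True must be passed explicitly in pandas 2.0+):
import pandas as pd
df_copy = pd.DataFrame({'joined': ['caldasantioquia', 'medelinantioquiamedelinantioquia']})
states = ['antioquia', 'boyaca', 'cordoba', 'choco']
# remove the state name wherever it occurs, not only at the end of the string
df_copy['p_joined'] = df_copy['joined'].str.replace('|'.join(states), '', regex=True)
print(df_copy)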

How to fill the NaN values with the values contained in another column?

I want to find out whether the split column contains anything from the Class list. If yes, I want to update the Category column using the values from the Class list. The Desired Category column is my goal.
Sample data:
domain split Category Desired Category
abc#XYT.com XYT.com Null XYT
abb#XTY.com XTY.com Null Null
abc#ssa.com ssa.com Null ssa
bbb#bbc.com bbc.com Null bbc
ccc#abk.com abk.com Null abk
acc#ssb.com ssb.com Null ssb
Class=['NaN','XYT','ssa','abk','abc','def','asds','ssb','bbc','XY','ab']
for index, row in df.iterrows():
    for x in class:
        intersection = row.split.contains(x)
        if intersection:
            df.loc[index, 'class'] = intersection
Just cannot get it right
Please help,
Thanks
Use str.extract. Create a regular expression that will match one of the words in the list and extract the word that matches (or NaN if none).
Update: as the regex '|' operator is never greedy (it takes the first alternative that matches, even if a later one would produce a longer overall match), you have to reverse-sort your list manually so that longer names come first.
lst = ['NaN','XY','ab','XYT','ssa','abk','abc','def','asds','ssb','bbc']
lst = sorted(lst, reverse=True)
pat = fr"({'|'.join(lst)})"
df['Category'] = df['split'].str.extract(pat)
>>> df
domain split Category
0 abc#XYT.com XYT.com XYT
1 abb#XTY.com XTY.com NaN
2 abc#ssa.com ssa.com ssa
3 bbb#bbc.com bbc.com bbc
4 ccc#abk.com abk.com abk
5 acc#ssb.com ssb.com ssb
>>> lst
['ssb', 'ssa', 'def', 'bbc', 'asds', 'abk', 'abc', 'ab', 'XYT', 'XY', 'NaN']
>>> pat
'(ssb|ssa|def|bbc|asds|abk|abc|ab|XYT|XY|NaN)'
Assuming there can only be a maximum of one match, try:
df["Category"] = df["split"].apply(lambda x: " ".join(c for c in Class if c in x))
>>> df
domain split Category
0 abc#XYT.com XYT.com XYT
1 abc#XTY.com XTY.com
2 abc#ssa.com ssa.com ssa
3 bbb#bbc.com bbc.com bbc
4 ccc#abk.com abk.com abk
5 acc#ssb.com ssb.com ssb
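If the Category column can already hold values that should be kept (the question title asks to fill only the NaN entries), a minimal sketch assuming the df and Class list from the question, with the Null cells stored as actual NaN:
# prefer longer names first so e.g. 'XYT' wins over 'XY' in the alternation
pat = '(' + '|'.join(sorted(Class, key=len, reverse=True)) + ')'
# extract a match from 'split' and use it only where Category is still missing
df['Category'] = df['Category'].fillna(df['split'].str.extract(pat, expand=False))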

replace the text in one column with a dictionary in another column

I have text in one column and a corresponding dictionary in another column. I have tokenized the text and want to replace those tokens that match a key in the respective dictionary. The text and the dictionary are specific to each record of a pandas dataframe.
import pandas as pd
data =[['1','i love mangoes',{'love':'hate'}],['2', 'its been a long time we have not met',{'met':'meet'}],['3','i got a call from one of our friends',{'call':'phone call','one':'couple of'}]]
df = pd.DataFrame(data, columns = ['id', 'text','dictionary'])
The final output dataframe should be
data = [['1','i hate mangoes'],['2', 'its been a long time we have not meet'],['3','i got a phone call from couple of of our friends']]
df = pd.DataFrame(data, columns=['id', 'modified_text'])
I am using Python 3 on a Windows machine.
You can use the dict.get method after zipping the 2 columns and splitting the sentence:
df['modified_text'] = [' '.join([b.get(i, i) for i in a.split()])
                       for a, b in zip(df['text'], df['dictionary'])]
print(df)
Output:
id text \
0 1 i love mangoes
1 2 its been a long time we have not met
2 3 i got a call from one of our friends
dictionary \
0 {'love': 'hate'}
1 {'met': 'meet'}
2 {'call': 'phone call', 'one': 'couple of'}
modified_text
0 i hate mangoes
1 its been a long time we have not meet
2 i got a phone call from couple of of our friends
I added spaces to the keys and values to distinguish a whole word from part of one (note that a token at the very start or end of the text is not surrounded by spaces and so is not replaced, which is why met stays unchanged in the output below):
def replace(text, mapping):
    new_s = text
    for key in mapping:
        k = ' ' + key + ' '
        val = ' ' + mapping[key] + ' '
        new_s = new_s.replace(k, val)
    return new_s
df_out = (df.assign(modified_text=lambda f:
                    f.apply(lambda row: replace(row.text, row.dictionary), axis=1))
          [['id', 'modified_text']])
print(df_out)
id modified_text
0 1 i hate mangoes
1 2 its been a long time we have not met
2 3 i got a phone call from couple of of our friends
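A variant that also handles tokens at the very start or end of the text is to use regex word boundaries instead of space padding. A minimal sketch, reconstructing the df from the question:
import re
import pandas as pd
data = [['1', 'i love mangoes', {'love': 'hate'}],
        ['2', 'its been a long time we have not met', {'met': 'meet'}],
        ['3', 'i got a call from one of our friends', {'call': 'phone call', 'one': 'couple of'}]]
df = pd.DataFrame(data, columns=['id', 'text', 'dictionary'])
def replace_words(text, mapping):
    # \b word boundaries match whole tokens, including at the start/end of the string
    for key, val in mapping.items():
        text = re.sub(rf'\b{re.escape(key)}\b', val, text)
    return text
df['modified_text'] = df.apply(lambda row: replace_words(row['text'], row['dictionary']), axis=1)
print(df[['id', 'modified_text']])
This replaces met at the end of the second row as well, matching the expected output in the question.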

Reorder row values csv pandas

I have a csv file
1 , name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2 , name , Kelvi20-Flipcart, LG-Walmart
3, name , Kenstar-Walmart, Sony-Amazon , Kenstar-Flipcart
4, name , LG18-Walmart, Bravia-Amazon
I need the rows to be rearranged by the website, i.e. the part after the -:
1, name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2, name , , Kelv20-Flipcart, LG-Walmart
3, name , Sony-Amazon, Kenstar-Flipcart ,Kenstar-Walmart
4, name , Bravia-Amazon, ,LG18-Walmart
Is it possible using pandas? Finding the existence of a string, rearranging it, iterating through all rows and repeating this for the next string? I went through the documentation of Series.str.contains and str.extract but was unable to find a solution.
Using sorted with key
df.iloc[:,1:].apply(lambda x : sorted(x,key=lambda y: (y=='',y)),1)
2 3 4 5
1 ABC DEF GHI JKL
2 ABC DEF GHI
3 ABC DEF GHI JKL
#df.iloc[:,1:]=df.iloc[:,1:].apply(lambda x : sorted(x,key=lambda y: (y=='',y)),1)
Since you mention reindex, I think get_dummies will work:
s=pd.get_dummies(df.iloc[:,1:],prefix ='',prefix_sep='')
s=s.drop('',1)
df.iloc[:,1:]=s.mul(s.columns).values
df
1 2 3 4 5
1 name ABC DEF GHI JKL
2 name ABC DEF GHI
3 name ABC DEF GHI JKL
Assuming the empty value is np.nan:
# Fill in the empty values with some string to allow sorting
df.fillna('NaN', inplace=True)
# Flatten the dataframe, do the sorting and reshape back to a dataframe
pd.DataFrame(list(map(sorted, df.values)))
0 1 2 3
0 ABC DEF GHI JKL
1 ABC DEF GHI NaN
2 ABC DEF GHI JKL
UPDATE
Given the update to the question and the sample data being as follows
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
'b': ['1012B-Amazon', 'Kelvi20-Flipcart', 'Kenstar-Walmart', 'LG18-Walmart'],
'c': ['2044C-Flipcart', 'LG-Walmart', 'Sony-Amazon', 'Bravia-Amazon'],
'd': ['Bosh27-Walmart', np.nan, 'Kenstar-Flipcart', np.nan]})
a possible solution could be
def foo(df, retailer):
    # Find cells that contain the name of the retailer
    mask = df.where(df.apply(lambda x: x.str.contains(retailer)), '')
    # Squash the resulting mask into a series
    col = mask.max(skipna=True, axis=1)
    # Optional: trim the name of the retailer
    col = col.str.replace(f'-{retailer}', '')
    return col
df_out = pd.DataFrame(df['name'])
for retailer in ['Amazon', 'Walmart', 'Flipcart']:
    df_out[retailer] = foo(df, retailer)
resulting in
name Amazon Walmart Flipcart
0 name1 1012B Bosh27 2044C
1 name2 LG Kelvi20
2 name3 Sony Kenstar Kenstar
3 name4 Bravia LG18
Edit after Question Update:
This is the abc csv:
1,name,ABC,GHI,DEF,JKL
2,name,GHI,DEF,ABC,
3,name,JKL,GHI,ABC,DEF
This is the company csv (it is necessary to watch the commas carefully):
1,name,1012B-Amazon,2044C-Flipcart,Bosh27-Walmart
2,name,Kelvi20-Flipcart,LG-Walmart,
3,name,Kenstar-Walmart,Sony-Amazon,Kenstar-Flipcart
4,name,LG18-Walmart,Bravia-Amazon,
Here is the code
import pandas as pd
import numpy as np
#This solution assumes that each value that is not empty is not repeated
#within each row. If that is not the case for your data, it would be possible
#to do some transformations so that the non-empty values are unique for each row.
#"get_company" returns the company if the value is non-empty and an
#empty value if the value was empty to begin with:
def get_company(company_item):
    if pd.isnull(company_item):
        return np.nan
    else:
        company = company_item.split('-')[-1]
        return company
#Using the "define_sort_order" function, one can retrieve a template to later
#sort all rows in the sort_abc_rows function. The template is derived from all
#values, aside from empty values, within the matrix when "by_largest_row" = False.
#One could also choose the single largest row to serve as the
#template for all other rows to follow. Both options work similarly when
#all rows are subsets of the largest row i.e. Every element in every
#other row (subset) can be found in the largest row (or set)
#The difference relates to, when the items contain unique elements,
#Whether one wants to create a table with all sorted elements serving
#as the columns, or whether one wants to simply exclude elements
#that are not in the largest row when at least one non-subset row does not exist
#Rather than only having the application of returning the original data rows,
#one can get back a novel template with different values from that of the
#original dataset if one uses a function to operate on the template
def define_sort_order(data, by_largest_row=False, value_filtering_function=None):
    if not by_largest_row:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        #data.values returns a numpy array with rows and columns;
        #.flatten() puts all elements in a 1-dim array and
        #set gets all unique values in the array
        filtered_values = list(set(data.values.flatten()))
        filtered_values = [data_value for data_value in filtered_values if not_empty(data_value)]
        #sorted returns a list, even with np.arrays as inputs
        model_row = sorted(filtered_values)
    else:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        row_lengths = data.apply(lambda data_row: data_row.notnull().sum(), axis=1)
        #locates the numerical index for the row with the most non-empty elements:
        model_row_idx = row_lengths.idxmax()
        #sort and filter the row with the most values:
        filtered_values = list(set(data.iloc[model_row_idx]))
        model_row = [data_value for data_value in sorted(filtered_values) if not_empty(data_value)]
    return model_row
#"not_empty" is used in the above function in order to filter list models that
#they no empty elements remain
def not_empty(value):
return pd.notnull(value) and value not in ['',' ',None]
#Sorts all elements of each data_row into their corresponding positions within the model_row.
#Elements of the model_row that are missing from the current data_row are replaced with np.nan.
def reorder_data_rows(data_row, model_row, check_by_function=None):
    #Here we apply the same function that was used when computing the ordering of the
    #model_row: each value of the data row is transformed temporarily to determine whether
    #the transformed value is in the model row. If so, we determine where, and place the
    #original value at that position.
    if check_by_function:
        #creating an empty vector that is the same length as the template, or model_row
        sorted_data_row = [np.nan] * len(model_row)
        data_row = [value for value in data_row.values if not_empty(value)]
        for value in data_row:
            value_lookup = check_by_function(value)
            if value_lookup in model_row:
                idx = model_row.index(value_lookup)
                #placing company items in their respective row positions as
                #indicated by the model_row
                sorted_data_row[idx] = value
    else:
        sorted_data_row = [value if value in data_row.values else np.nan for value in model_row]
    return pd.Series(sorted_data_row)
##################### ABC ######################
#Reading the data:
#pandas would automatically treat the first row as the header if the
#header = None option were not included. Note: "name" and the 1,2,3 columns are not in the index.
abc = pd.read_csv("abc.csv", header=None, index_col=None)
#Returns a sorted, non-empty list. If you hard-code the order you want,
#you can simply pass that hard-coded order as the model_row input below and avoid
#all functions aside from reorder_data_rows.
model_row = define_sort_order(abc.iloc[:,2:], False)
#applying the "reorder_data_rows" function we created earlier to each row before saving back into
#the original dataframe.
#lambda allows us to create our own function without giving it a name;
#it is useful in this circumstance in order to pass two inputs to reorder_data_rows.
abc.iloc[:,2:] = abc.iloc[:,2:].apply(lambda abc_row: reorder_data_rows(abc_row,model_row),axis = 1).values
#Saving to a new csv that won't include the pandas created indices (0,1,2)
#or columns names (0,1,2,3,4):
abc.to_csv("sorted_abc.csv",header = False,index = False)
################################################
################## COMPANY #####################
company = pd.read_csv("company.csv",header=None,index_col=None)
model_row = define_sort_order(company.iloc[:,2:],by_largest_row = False,value_filtering_function=get_company)
#the only thing that changes here is that we tell the sort function what specific
#criteria to use to reorder each row by. We're using the result from the
#get_company function to do so. The custom function get_company, takes an input
#such as Kenstar-Walmart, and outputs Walmart (what's after the "-").
#we would then sort by the resulting list of companies.
#Because we used the define_sort_order function to retrieve companies rather than full company items,
#we need to use the same get_company function when reordering each element in the DataFrame.
company.iloc[:,2:] = company.iloc[:,2:].apply(lambda companies_row: reorder_data_rows(companies_row,model_row,check_by_function=get_company),axis=1).values
company.to_csv("sorted_company.csv",header = False,index = False)
#################################################
Here is the first result from sorted_abc.csv:
1 name ABC DEF GHI JKL
2 name ABC DEF GHI NaN
3 name ABC DEF GHI JKL
After modifying the code to the subsequent form inquired about,
here is the sorted_company.csv that resulted from running the
script.
1 name 1012B-Amazon 2044C-Flipcart Bosh27-Walmart
2 name NaN Kelvi20-Flipcart LG-Walmart
3 name Sony-Amazon Kenstar-Flipcart Kenstar-Walmart
4 name Bravia-Amazon NaN LG18-Walmart
I hope it helps!

find gene names from a list in a dataframe

I actually have to know whether I got certain genes in my result. To do so I have one list with my genes' names and a dataframe with the same names:
For example:
liste = ["gene1","gene2","gene3","gene4","gene5"]
and a dataframe:
name1 name2
gene1_0035 gene1_0042
gene56_0042 gene56_0035
gene4_0042 gene4_0035
gene2_0035 gene2_0042
gene57_0042 gene57_0035
then I did:
df=pd.read_csv("dataframe_not_max.txt",sep='\t')
df=df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
#print(df)
print(list(df.columns.values))
name1=df.ix[:,1]
name2=df.ix[:,2]
liste=[]
for record in SeqIO.parse(data, "fasta"):
liste.append(record.id)
print(liste)
print(len(liste))
count=0
for a, b in zip(name1, name2):
if a in liste:
count+=1
if b in liste:
count+=1
print(count)
And what I want is to know how many times I find a gene from my list in my dataframe, but they do not have exactly the same ID, since in the list there is no _number after the gene name, so the if a in liste check does not recognize the ID.
Is it possible to say something like :
if a without_number in liste:
In the above example it would be count = 3, because only genes 1, 2 and 4 are present in both the list and the dataframe.
Here is a more complicated example to see if your script indeed works for my data:
Let's say I have a dataframe such:
cluster_name qseqid sseqid pident_x
15 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0035
16 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0042
18 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0035
19 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0042
29 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0042
30 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0035
34 cluster_015707 EOG090X00LI_0042_0035 g1726.t1_0035_0042
37 cluster_015707 EOG090X00LI_0042_0042 g1726.t1_0035_0042
and a list : ["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
Here I get 6 but I should get 2, because I have only 2 sequences in my data, EOG090X00LI and EOG090X00GO.
In fact, here I want to count a sequence as present only once, even if it appears several times, and even if it is, for example, EOG090X00LI vs seq123454.
I do not know if that is clear?
For the example I used:
df=pd.read_csv("test_busco_augus.csv",sep=',')
#df=df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
print(df)
print(list(df.columns.values))
name1=df.ix[:,3]
name2=df.ix[:,4]
liste=["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
print(liste)
#get boolean mask for each column
m1 = name1.str.contains('|'.join(liste))
m2 = name2.str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
Using isin
df.apply(lambda x : x.str.split('_').str[0],1).isin(l).sum(1).eq(2).sum()
Out[923]: 3
Adding value_counts
df.apply(lambda x : x.str.split('_').str[0],1).isin(l).sum(1).value_counts()
Out[925]:
2 3
0 2
dtype: int64
Adjusted for updated OP
find where sum is equal to 1
df.stack().str.split('_').str[0].isin(liste).sum(level=0).eq(1).sum()
2
Old Answer
stack and str accessor
You can use split on '_' to scrape the first portion, then use isin to determine membership. I also use stack and all with the parameter level=0 to check whether membership is True for all columns.
df.stack().str.split('_').str[0].isin(liste).all(level=0).sum()
3
applymap
df.applymap(lambda x: x.split('_')[0] in liste).all(1).sum()
3
sum/all with generators
sum(all(x.split('_')[0] in liste for x in r) for r in df.values)
3
Two many map
sum(map(lambda r: all(map(lambda x: x.split('_')[0] in liste, r)), df.values))
3
I think you need:
#add _ to end of values
liste = [record.id + '_' for record in SeqIO.parse(data, "fasta")]
#liste = ["gene1_","gene2_","gene3_","gene4_","gene5_"]
#get boolean mask for each column
m1 = df['name1'].str.contains('|'.join(liste))
m2 = df['name2'].str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
3
EDIT:
liste=["EOG090X00LI","EOG090X00GO","EOG090X00BA"]
#extract each value before _, remove duplicates and compare with liste
a = name1.str.split('_').str[0].drop_duplicates().isin(liste)
b = name2.str.split('_').str[0].drop_duplicates().isin(liste)
#compare a with b for equality and sum Trues
c = a.eq(b).sum()
print (c)
2
You could convert your dataframe to a series (combining all columns) using stack(), then search for your gene names in liste followed by an underscore _ using Series.str.match():
s = df.stack()
sum([s.str.match(i+'_').any() for i in liste])
Which returns 3
Details:
df.stack() returns the following Series:
0 name1 gene1_0035
name2 gene1_0042
1 name1 gene56_0042
name2 gene56_0035
2 name1 gene4_0042
name2 gene4_0035
3 name1 gene2_0035
name2 gene2_0042
4 name1 gene57_0042
name2 gene57_0035
Since all your genes are followed by an underscore in that series, you just need to see if gene_name followed by _ is in that Series. s.str.match(i+'_').any() returns True if that is the case. Then, you get the sum of True values, and that is your count.
