Pandas DataFrame get substrings from column - python

I have a column named "KL" with for example:
sem_0405M4209F2057_1.000
sem_A_0103M5836F4798_1.000
Now I want to extract the four digits after "M" and the four digits after "F". But with df["KL"].str.extract I can't get it to work.
Locations of M and F vary, thus just using the slice [9:13] won't work for the complete column.

If you want to use str.extract, here's how:
>>> df['KL'].str.extract(r'M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})')
M F
0 4209 2057
1 5836 4798
Here, M(?P<M>[0-9]{4}) matches the character 'M' and then captures 4 digits following it (the [0-9]{4} part). This is put in the column M (specified with ?P<M> inside the capturing group). The same thing is done for F.
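To show how the named groups end up as columns, here is a minimal sketch that joins the extracted values back onto the original frame. The third row is a hypothetical non-matching value added for illustration; rows that do not match the pattern should come back as NaN.

```python
import pandas as pd

# Sample values from the question, plus one hypothetical non-matching row.
df = pd.DataFrame({"KL": [
    "sem_0405M4209F2057_1.000",
    "sem_A_0103M5836F4798_1.000",
    "no_match_here",
]})

# Named groups (?P<M>...) and (?P<F>...) become the column names M and F.
parts = df["KL"].str.extract(r"M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})")

# Attach the new columns to the original frame by index.
result = df.join(parts)
print(result)
```

The extracted values are strings; converting them to integers would be a separate step.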

You could use split to achieve this, probably a better way exists:
In [147]:
s = pd.Series(['sem_0405M4209F2057_1.000','sem_A_0103M5836F4798_1.000'])
s
Out[147]:
0 sem_0405M4209F2057_1.000
1 sem_A_0103M5836F4798_1.000
dtype: object
In [153]:
m = s.str.split('M').str[1].str.split('F').str[0].str[:4]
f = s.str.split('M').str[1].str.split('F').str[1].str[:4]
print(m)
print(f)
0 4209
1 5836
dtype: object
0 2057
1 4798
dtype: object

You can also use regex:
import re
import pandas as pd
def get_data(x):
    # capture the four digits after 'M' and the four digits after 'F'
    data = re.search(r'M(\d{4})F(\d{4})', x)
    if data:
        m = data.group(1)
        f = data.group(2)
        return m, f
df = pd.DataFrame(data={'a': ['sem_0405M4209F2057_1.000', 'sem_0405M4239F2027_1.000']})
df['data'] = df['a'].apply(get_data)
>>
a data
0 sem_0405M4209F2057_1.000 (4209, 2057)
1 sem_0405M4239F2027_1.000 (4239, 2027)

Related

Pandas : Changing a column of dataset from string to integer

One Column of my dataset is like this:
0 10,000+
1 500,000+
2 5,000,000+
3 50,000,000+
4 100,000+
Name: Installs, dtype: object
and I want to change these 'xxx,yyy,zzz+' strings to integers.
first I tried this function:
df['Installs'] = pd.to_numeric(df['Installs'])
and I got this error:
ValueError: Unable to parse string "10,000" at position 0
and then I tried to remove '+' and ',' with this method:
df['Installs'] = df['Installs'].str.replace('+','',regex = True)
df['Installs'] = df['Installs'].str.replace(',','',regex = True)
but nothing changed!
How can I convert these strings to integers?
With regex=True, the + (plus) character is interpreted specially, as a regex quantifier. You can either disable regular expression replacement (regex=False) or, even better, change your regular expression to match + or , and remove both at once:
df['Installs'] = df['Installs'].str.replace('[+,]', '', regex=True).astype(int)
Output:
>>> df['Installs']
0 10000
1 500000
2 5000000
3 50000000
4 100000
Name: 0, dtype: int64
A bare + is not a valid regex (it is a quantifier with nothing to repeat). Remove every non-digit character instead:
df['Installs'] = pd.to_numeric(df['Installs'].str.replace(r'\D', '', regex=True))
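As a self-contained sketch of the two options mentioned above (literal replacement versus a regex character class), assuming a small sample Series named `installs` built from the question's values:

```python
import pandas as pd

installs = pd.Series(["10,000+", "500,000+", "5,000,000+"], name="Installs")

# Option 1: literal replacement, no regex involved.
literal = (
    installs.str.replace("+", "", regex=False)
            .str.replace(",", "", regex=False)
            .astype(int)
)

# Option 2: one regex character class; inside [...] the plus sign is literal.
char_class = installs.str.replace(r"[+,]", "", regex=True).astype(int)

print(literal.tolist())
print(char_class.tolist())
```

Both variants should give the same integers; the character class simply does it in one pass.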

Pandas replace function - insert pad 0 to 3 numbers

In a pandas DataFrame (df), I have a column with the following strings. I want to pad the number inside each string with a leading 0 when it is below 100:
Freq
XXX100KHz
XYC200KHz
YYY80KHz
YYY50KHz
to:
Freq
XXX100KHz
XYC200KHz
YYY080KHz
YYY050KHz
The following call doesn't work: writing \1 followed by a literal 0 is read as \10, a group that doesn't exist.
df.replace({'Freq': r'^([A-Za-z]+)(\d\d[A-Za-z]*)$'}, {'Freq': r'\10\2'}, regex=True, inplace=True)
Try:
df["Freq"] = df["Freq"].str.replace(
r"(?<=\D)\d{1,2}(?=KHz)",
lambda g: "{:0>3}".format(g.group()),
regex=True,
)
print(df)
Prints:
Freq
0 XXX100KHz
1 XYC200KHz
2 YYY080KHz
3 YYY050KHz
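The ambiguity between \1 followed by 0 and \10 can also be avoided with Python's `\g<1>` group reference, which the `re` module accepts in replacement strings. A minimal sketch, reusing the pattern from the question and therefore only handling two-digit numbers:

```python
import pandas as pd

df = pd.DataFrame({"Freq": ["XXX100KHz", "XYC200KHz", "YYY80KHz", "YYY50KHz"]})

# \g<1> is the unambiguous form of \1, so a literal 0 can follow it directly.
# The pattern only matches letters + exactly two digits + letters,
# so three-digit values such as XXX100KHz are left untouched.
df["Freq"] = df["Freq"].str.replace(
    r"^([A-Za-z]+)(\d\d[A-Za-z]*)$",
    r"\g<1>0\2",
    regex=True,
)
print(df)
```

Single-digit values would need a different pattern or the format-based callable shown above.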

How to split a string without a given delimiter in pandas

dfcolumn = [PUEF2CarmenXFc034DpEd, PUEF2BalulanFc034CamH, CARF1BalulanFc013Baca, ...]
My output should be:
dfnewcolumn1 = [PUEF2, PUEF2 , CARF1]
dfnewcolumn2 = [CarmenXFc034DpEd, BalulanFc034CamH, BalulanFc013Baca]
Assuming you want to split at a fixed number of characters (e.g. 5 here), you can use:
df['dfnewcolumn1'] = df['dfcolumn'].str[:5]
df['dfnewcolumn2'] = df['dfcolumn'].str[5:]
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
If you want to split after the first digit in the string, you can use:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Using the following modified original data with more test cases:
dfcolumn
0 PUEF2CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH
2 CARF1BalulanFc013Baca
3 CAF1BalulanFc013Baca
4 PUEFA2BalulanFc034CamH
Run code:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
3 CAF1BalulanFc013Baca CAF1 BalulanFc013Baca
4 PUEFA2BalulanFc034CamH PUEFA2 BalulanFc034CamH
Assuming your prefix consists of a sequence of letters followed by a sequence of digits, both of variable length, a regex split function can be constructed and applied to each cell.
Solution
import pandas as pd
import re
# data
df = pd.DataFrame()
df["dfcolumn"] = ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]
def f_split(s: str):
    """Split into two parts by regex"""
    # letter(s) followed by digit(s), then the rest
    o = re.match(r"^([A-Za-z]+\d+)(.*)$", s)
    # may add exception handling here if there is no match
    return o.group(1), o.group(2)
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].apply(f_split).to_list()
Note the .to_list(): it turns the Series of tuples returned by apply into a plain list of tuples, which is required for the two-column assignment to work.
Result
print(df)
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
How about this compact solution:
import pandas as pd
df = pd.DataFrame({"original": ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]})
df2 = pd.DataFrame(df.original.str.split(r"(\d)", n=1).to_list(), columns=["part1", "separator", "part2"])
df2.part1 = df2.part1 + df2.separator.astype(str)
df2
part1 separator part2
0 PUEF2 2 CarmenXFc034DpEd
1 PUEF2 2 BalulanFc034CamH
2 CARF1 1 BalulanFc013Baca
I use:
Series.str.split with a regex pattern and a kwarg to specify that it should only split on the first match.
in the regex pattern, a group (the parentheses in (\d)) to capture the separating character
to_list() to output the split as a list of lists
DataFrame constructor to build a new DataFrame from that list
string concat of two columns
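Another way to get both parts in one step is Series.str.extract with two capture groups, which avoids the helper column and the concatenation. This is a sketch under the assumption that the prefix ends with the first run of digits in the string:

```python
import pandas as pd

df = pd.DataFrame({"dfcolumn": [
    "PUEF2CarmenXFc034DpEd",
    "PUEF2BalulanFc034CamH",
    "CARF1BalulanFc013Baca",
]})

# Group 1: any non-digits followed by the first run of digits.
# Group 2: everything after that.
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].str.extract(r"^(\D*\d+)(.*)$")
print(df)
```

Strings without any digit would not match and would produce NaN in both new columns.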

find gene name from liste to dataframe

I need to know whether certain genes appear in my results. To check, I have a list of gene names and a dataframe containing the same names:
For example:
liste = ["gene1","gene2","gene3","gene4","gene5"]
and a dataframe:
name1 name2
gene1_0035 gene1_0042
gene56_0042 gene56_0035
gene4_0042 gene4_0035
gene2_0035 gene2_0042
gene57_0042 gene57_0035
then I did:
import pandas as pd
from Bio import SeqIO
df=pd.read_csv("dataframe_not_max.txt",sep='\t')
df=df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
#print(df)
print(list(df.columns.values))
# .ix was removed in pandas 1.0; .iloc selects by position
name1=df.iloc[:,1]
name2=df.iloc[:,2]
liste=[]
for record in SeqIO.parse(data, "fasta"):
    liste.append(record.id)
print(liste)
print(len(liste))
count=0
for a, b in zip(name1, name2):
    if a in liste:
        count+=1
    if b in liste:
        count+=1
print(count)
What I want to know is how many times the genes from my list appear in my dataframe. The IDs are not identical, though: the list entries lack the _number suffix that follows the gene name in the dataframe, so the test `if a in liste` does not recognize them.
Is it possible to say something like:
if a without_number in liste:
In the above example the result would be:
count = 3, because only genes 1, 2 and 4 are present in both the list and the dataframe.
Here is a more complicated example to check whether your script works for my data:
Let's say I have a dataframe such as:
cluster_name qseqid sseqid pident_x
15 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0035
16 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0042
18 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0035
19 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0042
29 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0042
30 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0035
34 cluster_015707 EOG090X00LI_0042_0035 g1726.t1_0035_0042
37 cluster_015707 EOG090X00LI_0042_0042 g1726.t1_0035_0042
and a list: ["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
Here I get 6, but I should get 2, because only 2 sequences from the list occur in my data: EOG090X00LI and EOG090X00GO.
In fact, I want to count a sequence as present only once, even if it appears against a different ID, for example EOG090X00LI vs seq123454.
I hope this is clear.
For this example I used:
df=pd.read_csv("test_busco_augus.csv",sep=',')
#df=df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
print(df)
print(list(df.columns.values))
name1=df.iloc[:,3]
name2=df.iloc[:,4]
liste=["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
print(liste)
#get boolean mask for each column
m1 = name1.str.contains('|'.join(liste))
m2 = name2.str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
Using isin
df.apply(lambda x : x.str.split('_').str[0],1).isin(liste).sum(1).eq(2).sum()
Out[923]: 3
Adding value_counts
df.apply(lambda x : x.str.split('_').str[0],1).isin(liste).sum(1).value_counts()
Out[925]:
2 3
0 2
dtype: int64
Adjusted for updated OP
find where sum is equal to 1
df.stack().str.split('_').str[0].isin(liste).groupby(level=0).sum().eq(1).sum()
2
Old Answer
stack and str accessor
You can use split on '_' to scrape the first portion then use isin to determine membership. I also use stack and all with the parameter level=0 to see if membership is True for all columns
df.stack().str.split('_').str[0].isin(liste).groupby(level=0).all().sum()
3
applymap
df.applymap(lambda x: x.split('_')[0] in liste).all(1).sum()
3
sum/all with generators
sum(all(x.split('_')[0] in liste for x in r) for r in df.values)
3
Two many map
sum(map(lambda r: all(map(lambda x: x.split('_')[0] in liste, r)), df.values))
3
I think you need:
#add _ to end of values
liste = [record.id + '_' for record in SeqIO.parse(data, "fasta")]
#liste = ["gene1_","gene2_","gene3_","gene4_","gene5_"]
#get boolean mask for each column
m1 = df['name1'].str.contains('|'.join(liste))
m2 = df['name2'].str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
3
EDIT:
liste=["EOG090X00LI","EOG090X00GO","EOG090X00BA"]
#extract each values before _, remove duplicates and compare by liste
a = name1.str.split('_').str[0].drop_duplicates().isin(liste)
b = name2.str.split('_').str[0].drop_duplicates().isin(liste)
#compare a with b for equality and sum Trues
c = a.eq(b).sum()
print (c)
2
You could convert your dataframe to a series (combining all columns) using stack(), then search for your gene names in liste followed by an underscore _ using Series.str.match():
s = df.stack()
sum([s.str.match(i+'_').any() for i in liste])
Which returns 3
Details:
df.stack() returns the following Series:
0 name1 gene1_0035
name2 gene1_0042
1 name1 gene56_0042
name2 gene56_0035
2 name1 gene4_0042
name2 gene4_0035
3 name1 gene2_0035
name2 gene2_0042
4 name1 gene57_0042
name2 gene57_0035
Since all your genes are followed by an underscore in that series, you just need to see if gene_name followed by _ is in that Series. s.str.match(i+'_').any() returns True if that is the case. Then, you get the sum of True values, and that is your count.
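A runnable sketch of this counting step on the question's sample data follows. The call to re.escape is an addition not present in the answer above; it is there on the assumption that real gene IDs may contain regex metacharacters such as a dot.

```python
import re
import pandas as pd

liste = ["gene1", "gene2", "gene3", "gene4", "gene5"]
df = pd.DataFrame({
    "name1": ["gene1_0035", "gene56_0042", "gene4_0042", "gene2_0035", "gene57_0042"],
    "name2": ["gene1_0042", "gene56_0035", "gene4_0035", "gene2_0042", "gene57_0035"],
})

s = df.stack()

# str.match anchors at the start of each string, so "gene5_" does not match "gene56_0042".
# re.escape keeps characters like "." in a gene name from being read as regex syntax.
count = int(sum(s.str.match(re.escape(gene) + "_").any() for gene in liste))
print(count)
```

This counts each gene from the list at most once, regardless of how many cells it appears in.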

Vectorized format function for Pandas series

Say I start with a Series of unformatted phone numbers (as strings), and I would like to format them as (XXX) YYY-ZZZZ.
I can get the sub-components of my input using regular expressions and str.match or str.extract. And I can perform the formatting using the result of either:
ser = pd.Series(data=['1234567890', '2345678901', '3456789012'])
matched = ser.str.match(r'(\d{3})(\d{3})(\d{4})')
extracted = ser.astype(str).str.extract(r'(?P<first>\d{3})(?P<second>\d{3})(?P<third>\d{4})')
formatmatched = matched.apply(lambda x: '({0}) {1}-{2}'.format(*x))
print('formatmatched')
print(formatmatched)
formatextracted = extracted.apply(lambda x: '({first}) {second}-{third}'.format(**x.to_dict()), axis=1)
print('formatextracted')
print(formatextracted)
Results:
formatmatched
0 (123) 456-7890
1 (234) 567-8901
2 (345) 678-9012
dtype: object
formatextracted
0 (123) 456-7890
1 (234) 567-8901
2 (345) 678-9012
dtype: object
Is there a vectorized way to apply that formatting command in either context?
You can do this directly with Series.str.replace():
In [47]: s = pandas.Series(["1234567890", "5552348866", "13434"])
In [49]: s
Out[49]:
0 1234567890
1 5552348866
2 13434
dtype: object
In [50]: s.str.replace(r"(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", regex=True)
Out[50]:
0 (123) 456-7890
1 (555) 234-8866
2 13434
dtype: object
You could also imagine doing another transformation first to remove any non-digit characters.
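A minimal sketch of that two-step idea, assuming hypothetical inputs that already contain some punctuation: first strip everything that is not a digit, then apply the formatting replacement to exactly ten digits.

```python
import pandas as pd

# Hypothetical inputs with mixed punctuation; the last one is too short to format.
s = pd.Series(["123-456-7890", "(555) 234 8866", "13434"])

# Step 1: drop every non-digit character.
digits = s.str.replace(r"\D", "", regex=True)

# Step 2: reformat strings that consist of exactly ten digits.
formatted = digits.str.replace(
    r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", regex=True
)
print(formatted)
```

Values that do not have exactly ten digits after cleaning are left as the bare digit string.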
Why don't you try this:
import pandas as pd
ser = pd.Series(data=['1234567890', '2345678901', '3456789012'])
def f(val):
    # slice the ten-digit string into area code, prefix and line number
    return '({0}) {1}-{2}'.format(val[:3],val[3:6],val[6:])
print(ser.apply(f))
