I'm trying to compare the characters of each string in a list of strings to see which ones match a certain character. Afterwards, I want to figure out what percentage of the characters in the list of strings match the given character.
So in the end I want a percentage for each character of each string.
This is what I could think of, but it does not work the way I want it to:
def GC_content_pos(reads_list):
    for read in reads_list:
        for position in range(len(read)):
            if read[position] == "G" or read[position] == "C":
                pass  # do something
If I understand your question, you want the percentage of characters in each string that match 'G' or 'C'? In that case, if you start with this list of strings
>>> reads_list = ['GC', 'GA', 'GABCDE']
You can use a list comprehension to count the number of letter matches for each string, then divide the count of matches by the length of the string
>>> [sum(1 for i in s if i in 'GC')/len(s) for s in reads_list]
[1.0, 0.5, 0.3333333333333333]
Or multiply by 100 to get percentages
>>> [sum(1 for i in s if i in 'GC')/len(s)*100 for s in reads_list]
[100.0, 50.0, 33.33333333333333]
For finding matches, a regex can be more convenient than iterating through each character yourself. This function returns the percentage of characters in each string that are either G or C. You can modify it to get the percentages separately for G and C if that is the requirement.
import re

def str_match_per(reads_list):
    match_percentages = dict.fromkeys(reads_list)
    for read in reads_list:
        matches = re.findall(r'(G|C)', read)
        matched_percent = len(matches)/len(read)
        match_percentages[read] = round(matched_percent*100, 2)
    return match_percentages
In [32]: strl = ['G and C', 'only G', 'double CC']
In [33]: str_match_per(strl)
Out[33]: {'G and C': 28.57, 'only G': 16.67, 'double CC': 22.22}
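The question can also be read as asking for a percentage per position across all reads (the function name GC_content_pos hints at that). A minimal sketch under that assumption follows; the name gc_content_by_position is hypothetical, and reads of different lengths are handled by counting only the reads that actually reach a given position.

```python
from itertools import zip_longest


def gc_content_by_position(reads_list):
    """Return, for each position, the percentage of reads with G or C there."""
    percentages = []
    # zip_longest walks the reads column by column and pads short reads with None
    for column in zip_longest(*reads_list):
        bases = [base for base in column if base is not None]
        gc_count = sum(1 for base in bases if base in "GC")
        percentages.append(gc_count / len(bases) * 100)
    return percentages


print(gc_content_by_position(['GC', 'GA', 'GABCDE']))
```

For the sample list, position 0 is G in all three reads, so the first value is 100.0; position 1 is C in only one of three reads.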
I have a badly formatted log and need to extract its contents into a dictionary using Python.
# Pattern: (keys are not kw1, kw2 ,etc... no pattern in key)
"para1=a, kw2=b, (b, b=b), bb, kw3=c, t4=..."
# where
# - para1=a
# - kw2=b, (b, b=b), bb
# - kw3=c
# - and so on
# extract into a dict:
out = {"para1": "a", "kw2": "b, (b, b=b), bb", "kw3": "c", "t4": ...}
# Note several important features
'''
1. all 'kw=value' pairs are joined with a fixed separator: ', '
2. the value and kw themselves may contain the separator, e.g. 'f(x, y)=3, f(x=3, y=2, z=1)=g(x=1, t=2)'
3. all brackets are paired (so separators inside a kw or value can be identified).
4. every part of the message belongs to a kw or a value.
'''
Q1: Is there a regular expression (or some Python code) that helps me get the keys and values above?
There's no pattern in the keys; kwN is just a placeholder for a key.
Q2 (update: thanks to Laurent, I now know why this doesn't work): I got an unexpected result. ', (.*?)=' should give me the shortest match between ',' and '=', right?
msg = 'a, a, b=b, c=c'
re.findall(', (.*?)=', msg)
>>> ['a, b', 'c']
# I was expecting ['b','c']
# shouldn't ', (.*?)=' give me the shortest matching between ',' and '='? which is 'b' instead of 'a, b'
(New) Q3: Since I'm working with huge amounts of data, efficiency is my first priority. I've written Python code that achieves the goal, but it doesn't feel fast enough. Could you help me make it better?
import re

def my_not_efficient_solution(msg):
    '''
    Notes:
    1. all 'kw=value' pairs are joined with a fixed separator: ', '
    2. the value and kw themselves may contain the separator, e.g. 'f(x, y)=3, f(x=3, y=2, z=1)=g(x=1, t=2)'
    3. all brackets are paired (so separators inside a kw or value can be identified).
    4. every part of the message belongs to a kw or a value.
    Solution:
    1. split the message on the separator -> get entries
    2. check the brackets and the equal sign of each entry
    3. for each entry: append it to the last one, keep it as part of the next one, or accept it as is
    '''
    spliter=', '
    eq_sign=['=']
    first=False
    bracket_map={'(':1,")":-1,"[":1,"]":-1}
    # True when opening and closing brackets in s balance out
    pair_chk_func = lambda s: not sum([bracket_map.get(i,0) for i in s])
    # non-zero when s contains an equal sign
    eq_chk_func = lambda s: sum([i in s for i in eq_sign])
    assert pair_chk_func(msg), 'msg pair check fail.'
    res = msg.split(spliter)
    # step1: split entry
    entries=[]
    do_pre='' # last entry is not complete(lack bracket)
    do_first = '' # last entry is not complete(lack eq sign)
    while len(res)>0:
        if first and len(entries)==2:
            entries.pop(-1)
            break
        if do_first and len(entries)==0:
            do=do_first+res.pop(0)
        else:
            do_first=''
            do=res.pop(0)
        eq_chk=eq_chk_func(do_pre+do)
        pair_chk=pair_chk_func(do_pre+do)
        # case1: not valid entry, no eq sign
        # case2: previous entry not complete
        # case3: current entry not valid(no eq sign, will drop) and pair incomplete(may be part of next entry)
        if not eq_chk or do_pre:
            if len(entries) > 0:
                entries[-1]+=spliter+do
                pair_chk=pair_chk_func(entries[-1])
                if pair_chk: do_pre=''
                else: do_pre=entries[-1]
            elif not pair_chk:
                do_first=do
        # case4: current entry good to go
        elif eq_chk and pair_chk:
            entries.append(do)
            do_pre=''
        # case5: current entry not complete(pair not complete)
        else:
            entries.append(do)
            do_pre=do
    # step2: split each into dict
    output={}
    split_mark = '|'.join(eq_sign)
    for entry in entries:
        splits=re.split(split_mark, entry)
        if len(splits)<2:
            raise ValueError('split fail for message')
        kw = splits.pop(0)
        while not pair_chk_func(kw):
            kw += '='+splits.pop(0)
        output[kw]='='.join(splits)
    return output
msg = 'B_=a, kw2=b, f(A=3, k=2)=g(t=3, v=5), mark[(blabla), f(xx tt)=33]'
my_not_efficient_solution(msg)
>>> {'B_': 'a',
'kw2': 'b',
'f(A=3, k=2)': 'g(t=3, v=5), mark[(blabla), f(xx tt)=33]'}
Answer to Q1:
Here is my suggestion:
import re
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."
pattern = r'(?=(kw.)=(.*?)(?:, kw.=|$))'
result = dict(re.findall(pattern, s))
print(result) # {'kw1': 'a', 'kw2': 'b, (b, b=b), bb', 'kw3': 'c', 'kw4': '...'}
To explain the regex:
the (?=...) is a lookahead assertion to let you find overlapping matches
the ? in (.*?) makes the quantifier * (asterisk) non-greedy
the ?: makes the group (?:, kw.=|$) non-capturing
the |$ at the end accounts for the last value in your string
Answer to Q2:
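No, this is wrong. The regex engine scans from left to right and starts the match at the first ', ' from which a match is possible; the non-greedy quantifier *? only keeps the match as short as possible from that starting point. Moreover, there is no search for overlapping matches, which could be done with (?=...). So your observed result is the expected one.
I suggest this simple solution: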
msg = 'a, a, b=b, c=c'
result = re.findall(', ([^,]*?)=', msg)
print(result) # ['b', 'c']
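To make the point about the starting position concrete, here is a small check using re.search on the same msg. It only illustrates the behaviour described above: the match begins at the first ', ' (index 1), and the lazy group then grows until it reaches an '='.

```python
import re

msg = 'a, a, b=b, c=c'

# The first ', ' is at index 1, so the match is anchored there.
m = re.search(r', (.*?)=', msg)
print(m.start())   # 1
print(m.group(1))  # a, b

# Excluding commas from the group forces the engine to give up on the
# first ', ' and retry from the next one.
m2 = re.search(r', ([^,]*?)=', msg)
print(m2.start())   # 4
print(m2.group(1))  # b
```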
Q1: Is there a regex expression that helps me get above key and value?
To get the key-value pairs into a dictionary, say your string is
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=dd, kw10=jndn"
Using the following regex gives you the keys and values as a list of tuples
results = re.findall(r'(\bkw\d+)=(.*?)(?=,+\s*\bkw\d+=|$)', s)
[('kw1', 'a'), ('kw2', 'b, (b, b=b), bb'), ('kw3', 'c'), ('kw4', 'dd'), ('kw10', 'jndn')]
You can convert it to a dictionary as
dict(results)
Output :
{
'kw1': 'a',
'kw2': 'b, (b, b=b), bb',
'kw3': 'c',
'kw4': 'dd',
'kw10': 'jndn'
}
Explanation :
\b Word boundary, so that kw is matched but not something like XYZkw
\bkw\d+= Match the word kw followed by 1+ digits and =
.*? (Lazy match) Match as few chars as possible
(?= Positive lookahead, assert to the right
,+\s*\bkw\d+= Match 1+ commas, optional whitespace chars, then kw, 1+ digits and =
| Or
$ Assert the end of the string for the last part
) Close the lookahead
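Both regexes above rely on the keys looking like kw followed by something, whereas the question states that the keys follow no pattern. As an alternative, here is a minimal sketch of a single-pass parser that tracks bracket depth. The name parse_log is hypothetical, and the sketch assumes the rules from the question: the separator is ', ', brackets are balanced, and a top-level chunk without its own top-level '=' belongs to the previous value.

```python
def parse_log(msg, sep=", "):
    """Split msg into key/value pairs, ignoring separators inside brackets."""
    opening, closing = "([", ")]"
    segments = []          # (text, offset of first top-level '=' or -1)
    depth = 0
    start, eq_pos = 0, -1
    i = 0
    while i < len(msg):
        ch = msg[i]
        if ch in opening:
            depth += 1
        elif ch in closing:
            depth -= 1
        elif depth == 0:
            if ch == "=" and eq_pos < 0:
                eq_pos = i - start            # first '=' outside any bracket
            elif msg.startswith(sep, i):
                segments.append((msg[start:i], eq_pos))
                i += len(sep)
                start, eq_pos = i, -1
                continue
        i += 1
    segments.append((msg[start:], eq_pos))

    result = {}
    last_key = None
    for text, eq in segments:
        if eq >= 0:
            last_key = text[:eq]
            result[last_key] = text[eq + 1:]
        elif last_key is not None:
            # no top-level '=': treat the chunk as part of the previous value
            result[last_key] += sep + text
        # assumption: a leading chunk with no '=' and no previous key is skipped
    return result


msg = 'B_=a, kw2=b, f(A=3, k=2)=g(t=3, v=5), mark[(blabla), f(xx tt)=33]'
print(parse_log(msg))
```

Each character is visited once, so the work grows linearly with the message length, which may help with the efficiency concern in Q3; that has not been measured here.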
With regex only, how can I match an exact number of consecutive repetitions of an arbitrary single token? For example, matching the "aaa" in "ttaaabbb" but not the "aaaa" in "ttaaaabbb", given that the desired number of repetitions is 3.
Clarification: Note that I was using "a" as an example; the token can be an arbitrary character, number or symbol. That is, given that the desired number of repetitions is 3, the desired match of "aaaa**!!!cccc333**" only gives "!!!" and "333".
In short, I want to find a list of tokens "X" where YXXXY appears in the given string (Y is some other token that is different from X; Y can also be the start or the end of the string). Note that there can be repeated tokens in the list, e.g., "aaabbbbaaa" should give ["a", "a"].
Some other examples:
Input: "aaabbbbbbaaa****ccc", output: ["a", "a", "c"]
Input: "!!! aaaabbbaaa ccc!!!", output: ["!", "b", "a", "c", "!"].
What I have tried: I tried (.)\1{2}, but unfortunately it also matches within "aaaa" and "cccc" in the example above. I then changed it to (?!\1)(.)\1{2}(?!\1) so that the characters before and after the repeated run differ from it. However, I got an error in this case, since the first \1 refers to a group that is not yet defined at that point.
You might use a pattern with 2 capture groups and a repeated backreference.
First match 4 or more times the same repeated character that you want to avoid, then match 3 times the same character.
The single characters that you want are in capture group 2, which you can get using re.finditer for example.
(\S)\1{3,}|(\S)\2{2}
The pattern matches:
(\S)\1{3,} Capture group 1, match a non whitespace char and repeat the backreference 3 or more times
| Or
(\S)\2{2} Capture group 2, match a non whitespace char and repeat the backreference 2 times
For example:
import re

strings = [
    "aaaa**!!!cccc333**",
    "aaabbbbaaa",
    "aaabbbbbbaaa****ccc",
    "!!! aaaabbbaaa ccc!!!"
]
pattern = r"(\S)\1{3,}|(\S)\2{2}"
for s in strings:
    result = []
    for match in re.finditer(pattern, s):
        # group 2 is only set for runs of exactly 3
        if match.group(2):
            result.append(match.group(2))
    print(result)
Output
['!', '3']
['a', 'a']
['a', 'a', 'c']
['!', 'b', 'a', 'c', '!']
You can do something like this using a regex and a loop:
import re

def exact_re_match(string, length):
    # (.)\1* matches a maximal run of one repeated character
    regex = re.compile(r'(.)\1*')
    for match in regex.finditer(string):
        elm = match.group()
        if len(elm) == length:
            yield elm

string = "aaaa!!!cccc333"
out = list(exact_re_match(string, 3))
print(out)
# ['!!!', '333']
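Since the question asks for a regex-only approach, here is one more sketch that moves the "different character before" check behind the capture group, so the backreference is already defined when it is used. It assumes Python 3.5 or later, where the re module accepts fixed-length group references inside a lookbehind; the helper name exact_runs is hypothetical.

```python
import re


def exact_runs(string, n=3):
    """Return the character of every run that is exactly n long."""
    # (.)         capture the first character of a candidate run
    # (?<!\1.)    the character before it must not be the same character
    # \1{n-1}     the run continues for n-1 more characters
    # (?!\1)      and stops there
    pattern = r'(.)(?<!\1.)\1{%d}(?!\1)' % (n - 1)
    return re.findall(pattern, string)


for s in ["aaaa**!!!cccc333**", "aaabbbbaaa",
          "aaabbbbbbaaa****ccc", "!!! aaaabbbaaa ccc!!!"]:
    print(exact_runs(s))
```

Unlike the \S-based pattern above, this version uses ., so a run of three spaces would also be reported.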
Given a single word (x), return the possible n-grams that can be found in that word.
You can modify the n-gram value as you want;
it is in the curly braces in the pat variable.
The default n-gram value is 4.
For example, for the word (x):
x = 'abcdef'
The possible 4-grams are:
['abcd', 'bcde', 'cdef']
def ngram_finder(x):
    pat = r'(?=(\S{4}))'
    xx = re.findall(pat, x)
    return xx
The question is:
How can I combine an f-string with a raw string in the regex, given that both use curly braces?
You can use this string to combine the n value into your regexp, using double curly brackets to create a single one in the output:
fr'(?=(\S{{{n}}}))'
The regex needs to have {} to make a quantifier (as you had in your original regex {4}). However f strings use {} to indicate an expression replacement so you need to "escape" the {} required by the regex in the f string. That is done by using {{ and }} which in the output create { and }. So {{{n}}} (where n=4) generates '{' + '4' + '}' = '{4}' as required.
Complete code:
import re

def ngram_finder(x, n):
    pat = fr'(?=(\S{{{n}}}))'
    return re.findall(pat, x)

x = 'abcdef'
print(ngram_finder(x, 4))
print(ngram_finder(x, 5))
Output:
['abcd', 'bcde', 'cdef']
['abcde', 'bcdef']
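To see what the doubled braces produce, it can help to print the pattern itself before handing it to re. The short check below only illustrates the escaping rule described above; the variable names are arbitrary.

```python
n = 4

# {{ and }} become literal braces, {n} is replaced by the value of n
pat = fr'(?=(\S{{{n}}}))'
print(pat)  # (?=(\S{4}))

# the prefix order does not matter: rf'' and fr'' are equivalent
same = rf'(?=(\S{{{n}}}))'
print(pat == same)  # True
```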
I would appreciate help with a one-liner idiom to do the following efficiently.
I have a string with groups separated by braces as below:
{1:xxxx}{2:xxxx}{3:{10:xxxx}}{4:xxxx\r\n:xxxx}....
How do I convert this into a dictionary format?
dict={1:'xxxx',2:'xxxx',3:'{10:xxxx}',4:'xxxx\r\n:xxxx'}
import re

r = r"""(?x)
    {
        (\w+)
        :
        (
            (?:
                [^{}]
                |
                {.+?}
            )+
        )
    }
"""
z = "{1:xxxx}{2:xxxx}{3:{10:xxxx}}{4:'xxxx'}"
print(dict(re.findall(r, z)))
# {'1': 'xxxx', '2': 'xxxx', '3': '{10:xxxx}', '4': "'xxxx'"}
Feel free to convert to an one-liner if you want - just remove (?x) and all whitespace from the regex.
The above handles only one level of nesting. To handle arbitrary depths, consider the more advanced regex module, which supports recursive patterns:
import regex

r = r"""(?x)
    {
        (\w+)
        :
        (
            (?:
                [^{}]
                |
                (?R)
            )+
        )
    }
"""
z = "{1:abc}{2:{3:{4:foo}}}{5:bar}"
print(dict(regex.findall(r, z)))
# {'1': 'abc', '2': '{3:{4:foo}}', '5': 'bar'}
This is how I would do it:
raw = """{1:xxxx}{2:xxxx}{3:{10:xxxx}}{4:'xxxx\r\n:xxxx'}"""

def parse(raw):
    # split into chunks by '}{' and remove the outer '{}'
    parts = raw[1:-1].split('}{')
    for part in parts:
        # split by the first ':'
        num, data = part.split(':', 1)
        # yield each entry found
        yield int(num), data

# make a dict from it
print(dict(parse(raw)))
It keeps the '{10:xxxx}' as a string just like in your example.
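Splitting on '}{' can break if a nested value itself contains '}{'. If that can happen in your data, a small brace-depth scanner is one way to handle arbitrary nesting with only the standard library. This is a sketch under the assumption that braces are balanced and every top-level group has the form {key:value} with an integer key; the name parse_groups is hypothetical.

```python
def parse_groups(raw):
    """Map each top-level {key:value} group to an int key and a string value."""
    result = {}
    depth = 0
    start = None
    for i, ch in enumerate(raw):
        if ch == "{":
            if depth == 0:
                start = i + 1          # content starts after the outer brace
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # split on the first ':' only, so values may contain ':'
                key, value = raw[start:i].split(":", 1)
                result[int(key)] = value
    return result


print(parse_groups("{1:abc}{2:{3:{4:foo}}}{5:bar}"))
# {1: 'abc', 2: '{3:{4:foo}}', 5: 'bar'}
```

Nested groups are kept as strings, matching the format requested in the question.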