compare a list of strings on each character

I'm trying to compare the characters of each string in a list of strings to see which ones match a certain character. Afterwards, I want to work out what percentage of the characters in the list of strings match the given character.
So in the end I want a percentage for each character of each string.
This is what I could think of, but it does not work the way I want it to:
def GC_content_pos(reads_list):
    for read in reads_list:
        for position in range(len(read)):
            if read[position] == "G" or read[position] == "C":
                pass  # do something

If I understand your question, you want the percentage of characters in each string that match 'G' or 'C'. In that case, if you start with this list of strings
>>> reads_list = ['GC', 'GA', 'GABCDE']
You can use a list comprehension to count the matching letters in each string, then divide that count by the length of the string
>>> [sum(1 for i in s if i in 'GC')/len(s) for s in reads_list]
[1.0, 0.5, 0.3333333333333333]
Or multiply by 100 to get percentages
>>> [sum(1 for i in s if i in 'GC')/len(s)*100 for s in reads_list]
[100.0, 50.0, 33.33333333333333]
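The function name in the question, GC_content_pos, suggests the asker may instead want one percentage per position across all reads. Here is a minimal sketch of that reading. It assumes reads may differ in length and that a position is scored only against the reads long enough to have it. The name gc_percent_by_position is hypothetical, not from the question.

```python
from itertools import zip_longest


def gc_percent_by_position(reads_list):
    """Return, for each position, the percentage of reads with 'G' or 'C' there."""
    percentages = []
    # zip_longest pads shorter reads with None so every column is visited.
    for column in zip_longest(*reads_list):
        bases = [base for base in column if base is not None]
        gc_count = sum(1 for base in bases if base in "GC")
        percentages.append(100 * gc_count / len(bases))
    return percentages


print(gc_percent_by_position(["GC", "GA", "GABCDE"]))
```

The result has one entry per position of the longest read.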

For finding matches, a regex can be used instead of iterating through each character yourself. This function returns the percentage of each string that is either G or C. You can modify it to get separate percentages for G and C if that is what you need.
import re
def str_match_per(reads_list):
    match_percentages = dict.fromkeys(reads_list)
    for read in reads_list:
        matches = re.findall(r'(G|C)', read)
        matched_percent = len(matches)/len(read)
        match_percentages[read] = round(matched_percent*100, 2)
    return match_percentages
In [32]: strl = ['G and C', 'only G', 'double CC']
In [33]: str_match_per(strl)
Out[33]: {'G and C': 28.57, 'only G': 16.67, 'double CC': 22.22}
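The answer mentions that the function can be modified to report G and C separately. One possible sketch of that variant uses str.count instead of a regex. The function name base_percentages and the bases parameter are assumptions for illustration, not part of the original answer.

```python
def base_percentages(reads_list, bases="GC"):
    """Map each read to {base: percentage of the read made up of that base}."""
    result = {}
    for read in reads_list:
        if not read:
            # An empty read has no characters; report 0.0 instead of dividing by zero.
            result[read] = {base: 0.0 for base in bases}
            continue
        result[read] = {
            base: round(100 * read.count(base) / len(read), 2)
            for base in bases
        }
    return result


print(base_percentages(['G and C', 'only G', 'double CC']))
```

Adding the per-base values for a read gives the same combined figure as str_match_per, up to rounding.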

Related

Convert consecutive duplicate character string to known word

I am trying to convert a string with consecutive duplicate characters to its 'dictionary' word. For example, 'aaawwesome' should be converted to 'awesome'.
Most answers I've come across have either removed all duplicates (so 'stations' would be converted to 'staion' for example), or have removed all consecutive duplicates using itertools.groupby(), however this doesn't account for cases such as 'happy' which would be converted to 'hapy'.
I have also tried using string intersections with Counter(), however this disregards the order of the letters so 'botaniser' and 'baritones' would match incorrectly.
In my specific case, I have a few lists of words:
list_1 = ["wife", "kid", "hello"]
list_2 = ["husband", "child", "goodbye"]
and, given a new word, for example 'hellllo', I want to check if it can be reduced down to any of the words, and if it can, replace it with that word.

Use the enchant module; you may need to install it using pip.
Find which letters are duplicated, then remove one occurrence at a time until the word is in the English dictionary.
import enchant
d = enchant.Dict("en_US")
list_1 = ["wiffe", "kidd", "helllo"]
def dup(x):
    for n,j in enumerate(x):
        y = [g for g,m in enumerate(x) if m==j]
        for h in y:
            if len(y)>1 and not d.check(x) :
                x = x[:h] + x[h+1:]
    return x
list_1 = list(map(dup,list_1))
print(list_1)
>>> ['wife', 'kid', 'hello']
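Since the question already has fixed lists of target words, a dictionary lookup may not be needed at all. As a rough sketch of an alternative (not the method used in the answer above), each known word can be turned into a regex in which every letter may repeat one or more times. The function name reduce_to_known is hypothetical.

```python
import re


def reduce_to_known(word, known_words):
    """Return the first known word that `word` collapses to, else `word` unchanged."""
    for candidate in known_words:
        # 'hello' becomes the pattern 'h+e+l+l+o+', so 'hellllo' matches it
        # while a word with too few repeats (e.g. 'helo') does not.
        pattern = "".join(re.escape(ch) + "+" for ch in candidate)
        if re.fullmatch(pattern, word):
            return candidate
    return word


list_1 = ["wife", "kid", "hello"]
print(reduce_to_known("hellllo", list_1))
```

Because each letter of the candidate must appear at least once and in order, 'happy' keeps its double p and anagrams such as 'botaniser' and 'baritones' never match each other.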

Splitting string using different scenarios using regex

I have 2 scenarios for splitting a string:
"##$hello?? getting good.<li>hii"
Scenario 1: I want it to be split as 'hello','getting','good.<li>hii'
Scenario 2: 'hello','getting','good','li','hi'
Any ideas please?

Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
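Both re.split calls return a leading empty string because the input starts with separator characters. A small self-contained sketch that drops the empty pieces, assuming s is the string from the question (the names scenario_1 and scenario_2 are illustrative):

```python
import re

s = "##$hello?? getting good.<li>hii"

# Split on runs of characters that are not word characters, '<', '>' or '.',
# then discard the empty string produced by the leading separators.
scenario_1 = [piece for piece in re.split(r"[^\w<>.]+", s) if piece]

# Split on any run of non-word characters.
scenario_2 = [piece for piece in re.split(r"\W+", s) if piece]

print(scenario_1)
print(scenario_2)
```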

This might be what you're looking for: \w+ matches one or more word characters (letters, digits or underscore), as many as possible. Here is a working JavaScript example:
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z] to match only letters and not numbers, as shown here:
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches what is in the brackets one or more times, as many as possible: in this case the range a-z of lowercase characters and the range A-Z of uppercase characters. Hope this is what you want.

For the first scenario, just use a regex to find all runs made up of word characters and <>.:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For the second one, you need to collapse repeated characters only if the word is not a valid English word. You can do this using the nltk words corpus and re.sub:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']

In case you are looking for a solution without regex: string.punctuation gives you a string of all punctuation characters.
Use it in a list comprehension to get your desired result:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: step by step, this is how it works:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']
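The same replace-then-split idea can also be written with str.translate, which avoids building an intermediate list of characters. This is only a sketch of an equivalent approach, and the variable names are illustrative:

```python
import string

my_string = '##$hello?? getting good.<li>hii'

# Build a translation table mapping every punctuation character to a space.
to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# Replace punctuation with spaces, then split on whitespace.
words = my_string.translate(to_space).split()
print(words)
```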

Split a unicode string into components containing numbers and letters

I'd like to split the string u'123K' into 123 and K. I've tried re.match("u'123K", "\d+") to match the number and re.match("u'123K", "K") to match the letter but they don't work. What is a Pythonic way to do this?

Use re.findall() to find all numbers and characters:
>>> s = u'123K'
>>> re.findall(r'\d+|[a-zA-Z]+', s) # or use r'\d+|\D+' as mentioned in comment in order to match all numbers and non-numbers.
['123', 'K']
If you are only dealing with this string, or you only want to split off the last character, you can simply use slicing:
num, character = s[:-1], s[-1:]
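One likely reason the attempts in the question fail is argument order: re.match takes the pattern first and the string second, and the u prefix belongs to the Python literal, not to the text being matched. A short sketch with the arguments in that order, assuming the input is always digits followed by letters:

```python
import re

s = u'123K'

# Pattern first, string second; two groups capture the digits and the letters.
match = re.match(r'(\d+)([A-Za-z]+)$', s)
if match:
    number, letters = match.groups()
    print(number, letters)
```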

You can also use itertools.groupby method, grouping digits:
>>> import itertools as it
>>> for _,v in it.groupby(s, key=str.isdigit):
...     print(''.join(v))
...
123
K

How to use * or + with brackets in regular expressions in Python?

There are multiple space separated characters in the input eg: string = "a b c d a s e "
What should the pattern be such that when I do re.search on the input using the pattern, I'd get the j'th character along with the space following it in the input by using .group(j)?
I tried something of the sort "^(([a-zA-Z])\s)+" but this is not working. What should I do?
EDIT
My actual question is in the heading and the body described only a special case of it:
Here's the general version of the question: if I have to take in all patterns of a specific type (initial question had the pattern "[a-zA-Z]\s") from a string, what should I do?

Use findall() instead and get the j-th match by index:
>>> j = 2
>>> re.findall(r"[a-zA-Z]\s", string)[j]
'c '
where [a-zA-Z]\s would match a lower or upper case letter followed by a single space character.

Why use a regex when you can simply use the str.split() method and access the characters by index?
>>> new = string.split()
>>> new
['a', 'b', 'c', 'd', 'a', 's', 'e']

You could do:
>>> string = "a b c d a s e "
>>> j=2
>>> re.search(r'([a-zA-Z]\s){%i}' % j, string).group(1)
'b '
Explanation:
With the pattern ([a-zA-Z]\s) you capture a letter then the space;
With the repetition {2} added, you capture the last of the repetition -- in this case the second one (base 1 vs base 0 indexing...).
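The behaviour behind the original "^(([a-zA-Z])\s)+" attempt is that a capture group under + or * keeps only its final repetition, so .group(j) cannot index the individual repeats. A small sketch contrasting that with re.finditer, which yields every occurrence (variable names are illustrative):

```python
import re

string = "a b c d a s e "

# A repeated group remembers only the last thing it matched.
whole = re.search(r'^([a-zA-Z]\s)+', string)
last_repetition = whole.group(1)

# finditer returns one match object per occurrence, so indexing works.
pairs = [m.group(0) for m in re.finditer(r'[a-zA-Z]\s', string)]
j = 2
print(repr(last_repetition), repr(pairs[j]))
```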

Extracting multiple substring from a string

I have a complicated string and would like to extract multiple substrings from it.
The string consists of a set of items, separated by commas. Each item has an identifier (id-n) followed by a pair of words enclosed in brackets. I want to get only the word inside the brackets that has a number attached to its end (e.g. 'This-1'). The number indicates the position the word should take when the words are arranged after extraction.
#Example of how the individual items would look like
id1(attr1, is-2) #The number 2 here indicates word 'is' should be in position 2
id2(attr2, This-1) #The number 1 here indicates word 'This' should be in position 1
id3(attr3, an-3) #The number 3 here indicates word 'an' should be in position 3
id4(attr4, example-4) #The number 4 here indicates word 'example' should be in position 4
id5(attr5, example-4) #This is a duplicate of the word 'example'
#Example of string - this is how the string with the items looks like
string = "id1(attr1, is-1), id2(attr2, This-2), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
#This is how the result should look after extraction
result = 'This is an example'
Is there an easier way to do this? Regex doesn't work for me.

A trivial/naive approach (with s holding the string from the question):
>>> z = [x.split(',')[1].strip().strip(')') for x in s.split('),')]
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for i in z:
...     b = i.split('-')
...     d[b[1]].append(b[0])
...
>>> ' '.join(' '.join(d[t]) for t in sorted(d.keys(), key=int))
'is This an example example'
You have duplicated positions for example in your sample string, which is why example is repeated in the output.
However, your sample string does not match your stated requirements either; this result follows your description, with the words arranged by their position indicators.
Now, if you want to get rid of duplicates:
>>> ' '.join(e for t in sorted(d.keys(), key=int) for e in set(d[t]))
'is This an example'

Why not regex? This works.
In [44]: s = "id1(attr1, is-2), id2(attr2, This-1), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
In [45]: z = [(m.group(2), m.group(1)) for m in re.finditer(r'(\w+)-(\d+)\)', s)]
In [46]: [x for y, x in sorted(set(z))]
Out[46]: ['This', 'is', 'an', 'example']
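Note that sorting the (position, word) tuples compares positions as strings, which would place '10' before '2' in longer inputs. A hedged variant of the same regex idea that converts positions to integers and keeps the first word seen at each position (the function name words_in_order is hypothetical):

```python
import re


def words_in_order(s):
    """Join the 'word-N)' items of s in ascending numeric order of N."""
    by_position = {}
    for word, pos in re.findall(r'(\w+)-(\d+)\)', s):
        # setdefault keeps the first word for a position and ignores duplicates.
        by_position.setdefault(int(pos), word)
    return ' '.join(by_position[p] for p in sorted(by_position))


s = "id1(attr1, is-2), id2(attr2, This-1), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
print(words_in_order(s))
```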

OK, how about this:
sample = ("id1(attr1, is-2), id2(attr2, This-1), "
          "id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)")
def make_cryssie_happy(s):
    words = {} # we will use this dict later
    ll = s.split(',')[1::2]
    # we only want items like This-1, an-3, etc.
    for item in ll:
        tt = item.replace(')','').lstrip()
        (word, pos) = tt.split('-')
        words[pos] = word
        # there can only be one word at a particular position
        # using a dict with the numbers as positions keys
        # is an alternative to using sets
    res = [words[i] for i in sorted(words, key=int)]
    # sort the keys numerically, do not rely on dict order!
    # create a list of the values of the dict in sorted order
    return ' '.join(res)
    # return a nice string
print(make_cryssie_happy(sample))
