Python frequency table, exclude characters - python

Good evening,
I'm wondering how I can exclude certain characters from a frequency table.
First I read the file and create a frequency table. After that I convert it to a list of tuples so I can get the percentage of occurrence for each letter.
However, I am wondering how I can make it skip certain letters,
i.e. an exclude list.
with open('test.txt', 'r') as file:
    data = file.read().replace('\n', '')
frequency_table = {char : data.count(char) for char in set(data)}
x0= ("Character frequency table for '{}' is :\n {}".format(data, str(frequency_table)))
from collections import Counter
res = [(*key, val) for key, val in Counter(frequency_table).most_common()]
print("Frequency Tuple list : " + str(res))
print(res[1][1]/res[1][1])#

Sounds like you need an if at the end of your dictionary comprehension:
frequency_table = {char : data.count(char) for char in set(data) if char not in EXCLUDE}
You can then set your EXCLUDE as, for example:
a list, i.e. ['a', 'b', 'c', 'd'] or list('abcd')
or, you can just use a string of characters directly, such as 'abcd'.
>>> data = 'aaaabcdefededefefedf'
>>> EXCLUDE_LIST = 'a'
>>> frequency_table = {char : data.count(char) for char in set(data) if char not in EXCLUDE_LIST}
>>> frequency_table
{'b': 1, 'c': 1, 'e': 6, 'f': 4, 'd': 4}

Add an if to your comprehension:
exclude = {'a', 'r'}
frequency_table = {char: data.count(char) for char in set(data) if char not in exclude}
Alternatively, use the difference between two sets:
exclude = {'a', 'r'}
frequency_table = {char: data.count(char) for char in set(data) - exclude}
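The answers above call data.count once per distinct character. As a minimal sketch, assuming the goal is both the filtered counts and the percentages mentioned in the question, the same idea can be written as a single pass with collections.Counter. The function names `frequency_table` and `percentages` and the sample string are illustrative, not part of the original post.

```python
from collections import Counter

def frequency_table(data, exclude=""):
    """Count each character in data, skipping anything listed in exclude."""
    excluded = set(exclude)
    return Counter(ch for ch in data if ch not in excluded)

def percentages(table):
    """Map each character to its share, in percent, of the counted characters."""
    total = sum(table.values())
    if total == 0:
        return {}
    return {ch: 100 * n / total for ch, n in table.items()}

data = "aaaabcdefededefefedf"  # sample input, assumed for illustration
table = frequency_table(data, exclude="a")
shares = percentages(table)
print(table.most_common())
print(shares)
```

Because excluded characters are dropped before counting, the percentages are relative to the remaining characters only. If they should be relative to the whole text, divide by len(data) instead.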

Related

Converting string in python by counting letters next to eachother

I tried many ways, but none of them worked. I have to convert a string like assdggg to a2sd3g in Python. If identical letters are next to each other, we keep only one letter and write before it how many of them were adjacent. Any idea how this can be done?
I'd suggest itertools.groupby, then format as you need:
from itertools import groupby
# groupby("assdggg") yields (key, group) pairs, roughly:
# ('a', ['a']), ('s', ['s', 's']), ('d', ['d']), ('g', ['g', 'g', 'g'])
result = ""
for k, v in groupby("assdggg"):
    count = len(list(v))
    result += (str(count) if count > 1 else "") + k
print(result) # a2sd3g
Try using .groupby():
from itertools import groupby
txt = "assdggg"
print(''.join(str(l) + k if (l := len(list(g))) != 1 else k for k, g in groupby(txt)))
output :
a2sd3g
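You can try this:
string = 'assdggg'
compression = ''
for char in string:
    if char not in compression:
        if string.count(char) != 1:
            compression += str(string.count(char))
        compression += char
print(compression)
# 'a2sd3g'
Note that string.count counts occurrences across the whole string, so this only gives the right answer when each letter appears in a single run.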
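For reuse, the groupby approach can be wrapped in a function. This is a minimal sketch; the name `compress_runs` is made up for illustration. It counts each run separately, so a letter that appears in two different runs is handled per run.

```python
from itertools import groupby

def compress_runs(text):
    """Prefix each run of repeated characters with its length; runs of 1 get no prefix."""
    parts = []
    for char, group in groupby(text):
        run_length = sum(1 for _ in group)  # consume the group iterator to count it
        parts.append((str(run_length) if run_length > 1 else "") + char)
    return "".join(parts)

print(compress_runs("assdggg"))  # a2sd3g
print(compress_runs("aabaa"))    # 2ab2a
```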

Check if a string contains a string except a list

I have a string as follows:
f = 'ATCTGTCGTYCACGT'
I want to check whether the string contains any characters except: A, C, G or T, and if so, print them.
for i in f:
    if i != 'A' and i != 'C' and i != 'G' and i != 'T':
        print(i)
Is there a way to achieve this without looping through the string?
You can use a set to get the desired output.
f = 'ATCTGTCGTYCACGTXYZ'
valid = {'A', 'C', 'G', 'T'}
unique = set(f)
print(unique - valid)
Output:
{'Y', 'X', 'Z'}  # characters in f other than 'A', 'C', 'G', 'T' (set order may vary)
Depending on the size of your input string, the for loop might be the most efficient solution.
However, since you explicitly ask for a solution without an explicit loop, this can be done with a regex.
import re
f = 'ABCDEFG'
print(*re.findall('[^ABC]', f), sep='\n')
Outputs
D
E
F
G
Just do
l = ['A', 'C', 'G', 'T']
for i in f:
    if i not in l:
        print(i)
It checks whether each character of the string is in the list.
If you don't want to loop through the string, you can do:
import re
l = ['A', 'C', 'G', 'T']
# True if f contains any character outside the list
contains = bool(re.search("[^" + "".join(l) + "]", f))
Technically this loops, but we convert your input string to a set, which removes duplicate values:
accepted_values = ['a','t','c','g']
sequence = 'ATCTGTCGTYCACGT'
print([i for i in set(sequence.lower()) if i not in accepted_values])
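The set-difference idea can be packaged as a small helper. This is a sketch only: the name `invalid_characters` and the default alphabet are assumptions for illustration. Sorting the result makes the output deterministic, since set ordering is not guaranteed.

```python
def invalid_characters(sequence, allowed="ACGT"):
    """Return the distinct characters of sequence that are not in allowed, sorted."""
    return sorted(set(sequence) - set(allowed))

print(invalid_characters("ATCTGTCGTYCACGT"))  # ['Y']
```

An empty list means the string contains only allowed characters, so the result can also be used directly in an if test.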

Loop over letters in a string that contains the alphabet to determine which are missing from a dictionary

I am very new to python and trying to find the solution to this for a class.
I need the function missing_letters to take a list, check the letters using histogram and then loop over the letters in alphabet to determine which are missing from the input parameter. Finally I need to print the letters that are missing, in a string.
alphabet = "abcdefghijklmnopqrstuvwxyz"
test = ["one","two","three"]
def histogram(s):
    d = dict()
    for c in s:
        if c not in d:
            d[c] = 1
        else:
            d[c] += 1
    return d
def missing_letter(s):
    for i in s:
        checked = (histogram(i))
As you can see I haven't gotten very far, at the moment missing_letters returns
{'o': 1, 'n': 1, 'e': 1}
{'t': 1, 'w': 1, 'o': 1}
{'t': 1, 'h': 1, 'r': 1, 'e': 2}
I now need to loop over alphabet to check which characters are missing and print. Any help and direction will be much appreciated. Many thanks!
You can use set functions in python, which is very fast and efficient:
alphabet = set('abcdefghijklmnopqrstuvwxyz')
s1 = 'one'
s2 = 'two'
s3 = 'three'
list_of_missing_letters = set(alphabet) - set(s1) - set(s2) - set(s3)
print(list_of_missing_letters)
Or like this:
from functools import reduce
alphabet = set('abcdefghijklmnopqrstuvwxyz')
list_of_strings = ['one', 'two', 'three']
list_of_missing_letters = set(alphabet) - \
    reduce(lambda x, y: set(x).union(set(y)), list_of_strings)
print(list_of_missing_letters)
Or using your own histogram function:
alphabet = "abcdefghijklmnopqrstuvwxyz"
test = ["one", "two", "three"]
def histogram(s):
    d = dict()
    for c in s:
        if c not in d:
            d[c] = 1
        else:
            d[c] += 1
    return d
def missing_letter(t):
    test_string = ''.join(t)
    counts = histogram(test_string)  # build the histogram once, not once per letter
    result = []
    for l in alphabet:
        if l not in counts:
            result.append(l)
    return result
print(missing_letter(test))
Output:
['a', 'b', 'c', 'd', 'f', 'g', 'i', 'j', 'k', 'l', 'm', 'p', 'q', 's', 'u', 'v', 'x', 'y', 'z']
from string import ascii_lowercase
words = ["one","two","three"]
letters = [l.lower() for w in words for l in w]
# all alphabet letters that do not appear in the words
letter_str = "".join(x for x in ascii_lowercase if x not in letters)
Output:
'abcdfgijklmpqsuvxyz'
It is not the easiest question to understand, but from what I can gather you want all the letters of the alphabet that are not in the input to be printed to the console.
So a loop, as opposed to the set-based approaches already shown, would be:
def output():
    missing = ""
    for i in alphabet:
        if i not in checked:
            missing += i
    print(missing)
Sidenote: please either make checked a global variable or define it outside the function so that this function can use it.
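Putting the pieces together in the shape the assignment describes (take a list, build a histogram, loop over the alphabet, return a string), a minimal sketch could look like this. It assumes that one histogram over all the words joined together is what is wanted, and that the result should be a string rather than a list.

```python
from string import ascii_lowercase

def histogram(s):
    """Count how often each character occurs in s."""
    d = {}
    for c in s:
        d[c] = d.get(c, 0) + 1
    return d

def missing_letters(words):
    """Return the lowercase ASCII letters that appear in none of the words."""
    counts = histogram("".join(words).lower())
    return "".join(letter for letter in ascii_lowercase if letter not in counts)

print(missing_letters(["one", "two", "three"]))  # abcdfgijklmpqsuvxyz
```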

how to dict map each word to list of words which follow it in python?

What I am trying to do:
Build a dict that maps each word that appears in the file
to a list of all the words that immediately follow that word in the file.
The list of words can be in any order and should include
duplicates. So for example the key "and" might have the list
["then", "best", "then", "after", ...] listing
all the words which came after "and" in the text.
f = open(filename,'r')
s = f.read().lower()
words = s.split()#list of words in the file
dict = {}
l = []
i = 0
for word in words:
    if i < (len(words)-1) and word == words[i]:
        dict[word] = l.append(words[i+1])
print dict.items()
sys.exit(0)
collections.defaultdict is helpful for such iterations. For simplicity, I've invented a string rather than loaded from a file.
from collections import defaultdict
import string
x = '''This is a random string with some
string elements repeated. This is so
that, with someluck, we can solve a problem.'''
translator = str.maketrans('', '', string.punctuation)
y = x.lower().translate(translator).split()  # split() handles spaces and newlines
result = defaultdict(list)
for i, j in zip(y, y[1:]):
    result[i].append(j)
# result
# defaultdict(list,
# {'a': ['random', 'problem'],
# 'can': ['solve'],
# 'elements': ['repeated'],
# 'is': ['a', 'so'],
# 'random': ['string'],
# 'repeated': ['this'],
# 'so': ['that'],
# 'solve': ['a'],
# 'some': ['string'],
# 'someluck': ['we'],
# 'string': ['with', 'elements'],
# 'that': ['with'],
# 'this': ['is', 'is'],
# 'we': ['can'],
# 'with': ['some', 'someluck']})
You can use defaultdict for this:
from collections import defaultdict
words = ["then", "best", "then", "after"]
words_dict = defaultdict(list)
for w1,w2 in zip(words, words[1:]):
    words_dict[w1].append(w2)
Results:
defaultdict(<class 'list'>, {'then': ['best', 'after'], 'best': ['then']})
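The pairing step is the same in both answers: zip the word list with itself shifted by one. A minimal sketch as a reusable function follows; the name `build_followers` and the sample sentence are invented for illustration, and the text would normally come from f.read() as in the question.

```python
from collections import defaultdict

def build_followers(text):
    """Map each word to the list of words that immediately follow it."""
    words = text.lower().split()
    followers = defaultdict(list)
    # zip pairs every word with its successor; the last word has no successor
    for current, following in zip(words, words[1:]):
        followers[current].append(following)
    return dict(followers)

sample = "we came and then we saw and best of all we left and then slept"
followers = build_followers(sample)
print(followers["and"])  # ['then', 'best', 'then']
```

Returning a plain dict means that looking up a word with no followers raises KeyError instead of silently creating an empty list.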

KeyError: '\n' python 2.7.5

I have a dictionary that I want to compare to my string, character by character. When a character matches a key, I want to replace it with that key's value. E.g. if A is in the string, it will match A in the dictionary and be replaced with T, which is written to the file line2_u_rev_comp. However, the error KeyError: '\n' occurs instead. What does this signal, and how can it be removed?
REV_COMP = {
'A': 'T',
'T': 'A',
'C': 'G',
'G': 'C',
'N': 'N',
'U': 'A'
}
tbl = REV_COMP
line2_u_rev_comp = [tbl[k] for k in line2_u_rev[::-1]]
''.join(line2_u_rev_comp)
'\n' means new line, and you can get rid of it (and other extraneous whitespace) using str.strip, e.g.:
line2_u_rev_comp = [tbl[k] for k in line2_u_rev.strip()[::-1]]
line2_u_rev_comp = [tbl.get(k,k) ... ]
This will either get the value from the dictionary or, if the key is missing, return the character itself.
The problem is that tbl[k] does not check whether the key exists in the dict; if it does not, you need to return k itself.
You also need to reverse the list again, since your for statement iterates in reverse.
Try this code:
line2_u_rev = "MY TEST IS THIS"
REV_COMP = {
'A': 'T',
'T': 'A',
'C': 'G',
'G': 'C',
'N': 'N',
'U': 'A'
}
tbl = REV_COMP
line2_u_rev_comp = [tbl[k] if k in tbl else k for k in line2_u_rev[::-1]][::-1]
print ''.join(line2_u_rev_comp)
Output:
MY AESA IS AHIS
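Combining the two suggestions (strip the trailing newline, and fall back to the character itself for unknown keys) gives a sketch like the one below. The function name `reverse_complement` is an assumption, and the code keeps the reversal from the question rather than undoing it. It uses print as a function, which also works in Python 2.7 when given a single argument.

```python
REV_COMP = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N", "U": "A"}

def reverse_complement(line):
    """Strip surrounding whitespace, reverse the sequence, and complement each base."""
    sequence = line.strip()
    # .get(base, base) leaves characters without a mapping unchanged
    return "".join(REV_COMP.get(base, base) for base in reversed(sequence))

print(reverse_complement("ATCGN\n"))  # NCGAT
```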
