Frequency of letters in each column in Python

I want to calculate the frequency of occurrence of each letter in every column.
For example, I have these three sequences:
seq1=AATC
seq2=GCCT
seq3=ATCA
Here, in the first column the frequency of 'A' is 2 and 'G' is 1.
In the second column the frequency of 'A' is 1, 'C' is 1 and 'T' is 1 (and likewise for the remaining columns).
First, I tried writing the code that calculates the frequency for a single sequence.
For example:
s='AATC'
dic={}
for x in s:
    dic[x] = s.count(x)
this gives: {'A':2,'T':1,'C':1}
Now I want to apply this to the columns. For that I use this instruction:
f=list(zip(seq1,seq2,seq3))
gives:
[('A', 'G', 'A'), ('A', 'C', 'T'), ('T', 'C', 'C'), ('C', 'T', 'A')]
So now I want to calculate the frequency of the letters inside each tuple.
How can I do this?
If I work with a file of sequences, how can I apply this code to the sequences in the file?
For example, my file contains 100 sequences, and each time I take three sequences and apply this code.

Here:
sequences = ['AATC',
'GCCT',
'ATCA']
f = zip(*sequences)
counts = [{letter: column.count(letter) for letter in column} for column in f]
print(counts)
Output (reformatted):
[{'A': 2, 'G': 1},
{'A': 1, 'C': 1, 'T': 1},
{'C': 2, 'T': 1},
{'A': 1, 'C': 1, 'T': 1}]
Salient features:
Rather than explicitly naming seq1, seq2, etc., we put them into a list.
We unpack the list with the * operator.
We use a dict comprehension inside a list comprehension to generate the counts for each letter in each column. It's basically what you did for the one-sequence case, but more readable (IMO).
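The same column counts can also be built with collections.Counter, and the follow-up about working through a file three sequences at a time can be handled by slicing the list of sequences into groups. This is a minimal sketch rather than the answerer's code: the helper names column_counts and in_groups are made up for illustration, and the sequences are assumed to be already read into a list of equal-length strings (for example, one per line of a file).

```python
from collections import Counter

def column_counts(sequences):
    """Return one Counter per column of equal-length sequences."""
    # zip(*sequences) yields the letters found at each position
    return [Counter(column) for column in zip(*sequences)]

def in_groups(items, size):
    """Yield consecutive slices of `items` holding up to `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical data standing in for the lines of a sequence file
sequences = ['AATC', 'GCCT', 'ATCA', 'TTGA', 'TAGC', 'CTGA']

for group in in_groups(sequences, 3):
    print([dict(c) for c in column_counts(group)])
```

If the number of sequences is not a multiple of three, the last group is simply shorter; whether that is acceptable depends on the data.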

As with my answer to your last question, you should wrap your functionality in a function:
def lettercount(pos):
    return {c: pos.count(c) for c in pos}
Then you can easily apply it to the tuples from zip:
counts = [lettercount(t) for t in zip(seq1, seq2, seq3)]
Or combine it into the existing loop:
...
counts = []
for position in zip(seq1, seq2, seq3):  # sets at same position
    counts.append(lettercount(position))
    for pair in combinations(position, 2):  # pairs within set
        ...

Related

Replacing all the elements of a list that are consecutive and duplicates in Python

Say I have a list with some numbers that are duplicates.
list = [1,1,1,1,2,3,4,4,1,2,5,6]
I want to identify all the elements in the list that are repeated consecutively, including the first element of each run, and replace them with values from a dictionary:
mydict = {1: 'a', 4: 'd'}
list = ['a','a','a','a',2,3,'d','d',1,2,5,6]
Because I want to replace the first instance of the repetition as well, I am quite confused as to how to proceed!
itertools.groupby is your friend:
from itertools import groupby
mydict = {1: 'a', 4: 'd'}
A = [1,1,1,1,2,3,4,4,1,2,5,6]
res = []
for k, g in groupby(A):
    size = len(list(g))
    if size > 1:
        res.extend([mydict[k]] * size)  # see note 1
    else:
        res.append(k)
print(res) # -> ['a', 'a', 'a', 'a', 2, 3, 'd', 'd', 1, 2, 5, 6]
Notes:
1. If you want to catch possible KeyErrors and have a default value to fall back on, use mydict.get(k, <default>) instead of mydict[k].
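A small sketch of the fallback mentioned in the note, wrapped in a function. The name replace_runs is hypothetical, and falling back to the original value for unmapped keys is an assumption; the question does not say what should happen in that case.

```python
from itertools import groupby

def replace_runs(values, mapping):
    """Replace every run of 2+ equal consecutive values via `mapping`."""
    result = []
    for key, group in groupby(values):
        size = len(list(group))
        if size > 1:
            # assumed fallback: keep the original value when it is not mapped
            result.extend([mapping.get(key, key)] * size)
        else:
            result.append(key)
    return result

data = [1, 1, 1, 1, 2, 3, 4, 4, 1, 2, 5, 6, 7, 7]
print(replace_runs(data, {1: 'a', 4: 'd'}))
# -> ['a', 'a', 'a', 'a', 2, 3, 'd', 'd', 1, 2, 5, 6, 7, 7]
```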

Split string into character and numbers and store in a map Python

I've a string like
'A15B7C2'
It represents the count of each character.
I am using re right now to split it into characters and numbers. After that I will eventually store it in a dict.
import re
data_str = 'A15B7C2'
re.split(r"(\d+)", data_str)
# prints --> ['A', '15', 'B', '7', 'C', '2', '']
But if I have a string like
'A15B7CD2Ef5'
it means that the count of C is 1 (it's implicit) and the count of Ef is 5 (an uppercase letter and a subsequent lowercase letter count as one key). Consequently I get
'CD' = 2 (Not correct)
'Ef' = 5 (Correct)
How do I modify it to give me the proper count?
What's the best approach to parse the string, get the counts and store them in a dict?
You can do this all in one fell swoop:
In [2]: s = 'A15B7CD2Ef5'
In [3]: {k: int(v) if v else 1 for k,v in re.findall(r"([A-Z][a-z]?)(\d+)?", s)}
Out[3]: {'A': 15, 'B': 7, 'C': 1, 'D': 2, 'Ef': 5}
The regex is essentially a direct translation of your requirements, leveraging .findall and capture groups:
r"([A-Z][a-z]?)(\d+)?"
Essentially, an uppercase letter that may be followed by a lowercase letter as the first group, and one or more digits that may or may not be there as the second group (this will return '' if they aren't there).
A trickier example:
In [7]: s = 'A15B7CD2EfFGHK5'
In [8]: {k: int(v) if v else 1 for k,v in re.findall(r"([A-Z][a-z]?)(\d+)?", s)}
Out[8]: {'A': 15, 'B': 7, 'C': 1, 'D': 2, 'Ef': 1, 'F': 1, 'G': 1, 'H': 1, 'K': 5}
Finally, breaking it down with an even trickier example:
In [10]: s = 'A15B7CD2EfFGgHHhK5'
In [11]: re.findall(r"([A-Z](?:[a-z])?)(\d+)?", s)
Out[11]:
[('A', '15'),
('B', '7'),
('C', ''),
('D', '2'),
('Ef', ''),
('F', ''),
('Gg', ''),
('H', ''),
('Hh', ''),
('K', '5')]
In [12]: {k: int(v) if v else 1 for k,v in re.findall(r"([A-Z][a-z]?)(\d+)?", s)}
Out[12]:
{'A': 15,
'B': 7,
'C': 1,
'D': 2,
'Ef': 1,
'F': 1,
'Gg': 1,
'H': 1,
'Hh': 1,
'K': 5}
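One thing to keep in mind with the dict comprehension: if the same key occurs more than once in the string, only the last value survives. The sketch below sums the counts instead, using the same regex. The function name parse_counts is hypothetical, and summing repeated keys is an assumption; the question does not say how repeats should be treated.

```python
import re
from collections import Counter

# Same pattern as above: a key, optionally followed by a number
TOKEN = re.compile(r"([A-Z][a-z]?)(\d+)?")

def parse_counts(s):
    """Sum the counts per key; a missing number counts as 1."""
    totals = Counter()
    for key, number in TOKEN.findall(s):
        totals[key] += int(number) if number else 1
    return dict(totals)

print(parse_counts('A15B7CD2Ef5'))  # {'A': 15, 'B': 7, 'C': 1, 'D': 2, 'Ef': 5}
print(parse_counts('H2OH'))         # {'H': 3, 'O': 1}
```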
You could use some regex logic and .span():
([A-Z])[a-z]*(\d+)
See a demo on regex101.com.
In Python this would be:
import re
string = "A15B7CD2Ef5"
rx = re.compile(r'([A-Z])[a-z]*(\d+)')
def analyze(string=None):
    result = []
    lastpos = 0
    for m in rx.finditer(string):
        span = m.span()
        if lastpos != span[0]:
            # a letter without a number was skipped: record it with count 1
            # (the match that follows the gap, 'D2' here, is not recorded)
            result.append((string[lastpos], 1))
        else:
            result.append((m.group(1), m.group(2)))
        lastpos = span[1]
    return result
print(analyze(string))
# [('A', '15'), ('B', '7'), ('C', 1), ('E', '5')]
Search for the letters in the string, instead of digits.
import re
data_str = 'A15B7C2'
temp = re.split("([A-Za-z])", data_str)[1:]  # First element is just "", don't want that
temp = [a if a != "" else "1" for a in temp]  # add the 1's that were implicit in the original string
finalDict = dict(zip(temp[0::2], temp[1::2])) # turn the list into a dict
In keeping with your original logic, instead of using re.split() we can find all the numbers, split the string on the first match, keep the second half of the string for the next split, and store the pairs as tuples for later.
import re
raw = "A15B7CD2Ef5"
# find all the numbers
found = re.findall(r"(\d+)", raw)
# save the pairs as a list of tuples
pairs = []
# check that numbers were found
if found:
    # iterate over all matches
    for f in found:
        # split the raw string at most once, so that duplicate numbers don't produce more than 2 parts
        part = raw.split(f, 1)
        # set the original string to the second half of the split
        raw = part[1]
        # append pair
        pairs.append((part[0], f))
# Now for fun expand values
long_str = ""
for p in pairs:
    long_str += p[0] * int(p[1])
print(pairs)
print(long_str)

Getting key values from a list of dictionaries

I have a list that contains dictionaries with Letters and Frequencies. Basically, I have 53 dictionaries, one for each letter of the alphabet (lowercase and uppercase) plus the space.
adict = {'Letter':'a', 'Frequency':0}
bdict = {'Letter':'b', 'Frequency':0}
cdict = {'Letter':'c', 'Frequency':0}
If you input a word, it will scan the word and update the frequency for its corresponding letter.
for ex in range(0, len(temp)):
    if temp[count] == 'a': adict['Frequency']+=1
    elif temp[count] == 'b': bdict['Frequency']+=1
    elif temp[count] == 'c': cdict['Frequency']+=1
For example, if I enter the word "Hello", the letters H, e, l, l, o are detected and their frequencies updated. Non-zero frequencies will be transferred to a new list.
if adict['Frequency'] != 0 : newArr.append(adict)
if bdict['Frequency'] != 0 : newArr.append(bdict)
if cdict['Frequency'] != 0 : newArr.append(cdict)
After this, I had newArr sorted by Frequency and transferred to a new list called finalArr. Below are sample list contents for the word "Hello":
{'Letter': 'H', 'Frequency': 1}
{'Letter': 'e', 'Frequency': 1}
{'Letter': 'o', 'Frequency': 1}
{'Letter': 'l', 'Frequency': 2}
Now what I want is to transfer only the values to 2 separate lists, letterArr and numArr. How do I do this? My desired output is:
letterArr = [H,e,o,l]
numArr = [1,1,1,2]
Why don't you just use a collections.Counter? For example:
from collections import Counter
from operator import itemgetter
word = input('Enter a word: ')
c = Counter(word)
letter_arr, num_arr = zip(*sorted(c.items(), key=itemgetter(1,0)))
print(letter_arr)
print(num_arr)
Note the use of sorted() to sort by increasing frequency. itemgetter(1, 0) builds the sort key (frequency, letter) from each (letter, frequency) item, so the sort is performed first on the frequency and then on the letter. The sorted pairs are then separated into two tuples using zip() on the unpacked list.
Demo
Enter a word: Hello
('H', 'e', 'o', 'l')
(1, 1, 1, 2)
The results are tuples, but you can convert to lists if you want with list(letter_arr) and list(num_arr).
I have a hard time understanding your data structure choice for this problem.
Why don't you just go with a dictionary like this:
frequencies = { 'H': 1, 'e': 1, 'l': 2, 'o': 1 }
Which is even easier to implement with a Counter:
from collections import Counter
frequencies = Counter("Hello")
print(frequencies)
# Counter({'l': 2, 'H': 1, 'e': 1, 'o': 1})
Then to add another word, you'd simply have to use the update method:
frequencies.update("How")
print(frequencies)
# Counter({'H': 2, 'l': 2, 'o': 2, 'e': 1, 'w': 1})  (the order of ties can differ across Python versions)
Finally, to get your 2 arrays, you can do:
letterArr, numArr = zip(*frequencies.items())
This will give you tuples, though. If you really need lists, just do: list(letterArr)
You wanted a simple answer without extras like zip, collections, itemgetter etc. This does the minimum to get it done: 3 lines in a loop.
finalArr= [{'Letter': 'H', 'Frequency': 1},
{'Letter': 'e', 'Frequency': 1},
{'Letter': 'o', 'Frequency': 1},
{'Letter': 'l', 'Frequency': 2}]
letterArr = []
numArr = []
for entry in finalArr:
    letterArr.append(entry['Letter'])
    numArr.append(entry['Frequency'])
print(letterArr)
print(numArr)
Output is
['H', 'e', 'o', 'l']
[1, 1, 1, 2]
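The same extraction can be written with two list comprehensions, which some readers find easier to scan than appending inside a loop. A minimal sketch, using snake_case names (final_arr, letter_arr, num_arr) that are not in the original:

```python
final_arr = [
    {'Letter': 'H', 'Frequency': 1},
    {'Letter': 'e', 'Frequency': 1},
    {'Letter': 'o', 'Frequency': 1},
    {'Letter': 'l', 'Frequency': 2},
]

# Pull one field out of every dictionary, preserving the list order
letter_arr = [entry['Letter'] for entry in final_arr]
num_arr = [entry['Frequency'] for entry in final_arr]

print(letter_arr)  # ['H', 'e', 'o', 'l']
print(num_arr)     # [1, 1, 1, 2]
```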

Python - using dictionary to count keys and values

I am a student in a python course where we created a list of tuples (containing 2 elements) that we're trying to manipulate in various ways. In addition, we are to convert those tuple elements into a dictionary and re-create the manipulations using the dictionary and avoiding for loops. The task I'm stuck on is that given a specific id (which could be a key OR value in the dictionary) the function returns all the other keys/values that are found in that dictionary.
It doesn't seem efficient to use a dictionary for this, but that's the section we are on in the course and is specifically asked by the assignment. Also no for loops (if that is possible?). Recall that the id can be either a key or a value in the dictionary.
example_dictionary = {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'}
def get_interactions(example_dictionary, id):
    output = ''
    for j, k in example_dictionary.items():
        if j == id:
            output = output + k + ' '
        if k == id:
            output = output + j + ' '
    return output
This code works just fine, however it 1) has a for loop (no good) and 2) isn't very pythonic (kind of an eyesore)! How could I use the dictionary more efficiently and condense down my lines? I am in Python 3, Thank you!
Expected result
Given a dictionary and a value named wanted, you want to create another dict that is a copy of the original one with all items removed whose key or value does not equal wanted.
This can be expressed as a pytest test case with a couple of scenarios.
import pytest
scenarios = [
    [
        # dct
        {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'},
        # wanted
        "A",
        # expected (result)
        {'A': 'C', 'D': 'A'},
    ],
    [
        # dct
        {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'},
        # wanted
        "E",
        # expected (result)
        {'R': 'E'},
    ],
    [
        # dct
        {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'},
        # wanted
        "D",
        # expected (result)
        {'D': 'A', 'C': 'D'},
    ],
    [
        # dct
        {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'},
        # wanted
        "nothere",
        # expected (result)
        {},
    ],
    [
        # dct
        {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'},
        # wanted
        "A",
        # expected (result)
        {'A': 'C', 'D': 'A'},
    ],
]
# replace with real implementation
def get_key_or_val_itms(dct, wanted):
    # something comes here
    return result
@pytest.mark.parametrize("scenario", scenarios)
def test_it(scenario):
    dct, wanted, expected = scenario
    assert get_key_or_val_itms(dct, wanted) == expected
Do not bother with anything apart from scenarios. It lists a couple of test scenarios, each with an input dictionary, a value for wanted and the expected result.
Building blocks for the solution
dict.items() - dict to list of tuples
>>> dct = {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'}
>>> list(dct.items())
[('A', 'C'), ('R', 'E'), ('D', 'A'), ('L', 'R'), ('C', 'D')]
testing membership of a value in a tuple/list
>>> 'A' in ('A', 'C')
True
>>> 'A' in ('R', 'E')
False
Lambda function testing whether wanted is present in a tuple
lambda allows "in place" function definition. It is often used in places where some function expects a reference to a function.
First, create a named function tuple_wanted
>>> wanted = "A"
>>> def tuple_wanted(tpl):
...     return wanted in tpl
...
and test it (note that wanted now has the value "A"):
>>> tuple_wanted(('A', 'C'))
True
>>> tuple_wanted(('R', 'E'))
False
Now create the same function with lambda. To play with it, we store the result of lambda in fun:
>>> fun = lambda tpl: wanted in tpl
It can be used in the same manner as tuple_wanted before:
>>> fun(('A', 'C'))
True
>>> fun(('R', 'E'))
False
Later on we will use the result of lambda directly (see filter) without storing it in any variable.
filter removing all list items not passing some test
filter takes a test function and an iterable (e.g. a list of items) to test with it.
The result of calling filter holds the items from the iterable which passed the test (a list in Python 2, a lazy iterator in Python 3, hence the list() calls below).
In our case, we want to pass only the tuples containing the wanted value (e.g. "A")
>>> list(filter(tuple_wanted, dct.items()))
[('A', 'C'), ('D', 'A')]
>>> list(filter(fun, dct.items()))
[('A', 'C'), ('D', 'A')]
>>> list(filter(lambda tpl: wanted in tpl, dct.items()))
[('A', 'C'), ('D', 'A')]
Convert list of tuples with 2 items into dictionary
>>> tpllst = [('A', 'C'), ('D', 'A')]
>>> dict(tpllst)
{'A': 'C', 'D': 'A'}
Function doing the work
Long version
This version is here to explain what is going on step by step:
def get_key_or_val_itms(dct, wanted):
    # dict as [(key, val), (key2, val2), ...]
    tpldct = dct.items()
    # find tuples where either key or val equals the `wanted` value
    # first make a function which detects the tuple we search for
    def tuple_wanted(tpl):
        return wanted in tpl
    # now use it to filter only what we search for
    restpldct = filter(tuple_wanted, tpldct)
    # finally, turn the result into dict
    return dict(restpldct)
Short version
def get_key_or_val_itms(dct, wanted):
    return dict(filter(lambda tpl: wanted in tpl, dct.items()))
Conclusions
It works (with either long or short version of the function):
>>> dct = {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'}
>>> wanted = "A"
>>> get_key_or_val_itms(dct, wanted)
{'A': 'C', 'D': 'A'}
If you put the function into file with test suite, calling $ py.test -sv the_file.py shall output:
$ py.test -sv the_file.py
================================ test session starts ================================
platform linux2 -- Python 2.7.9, pytest-2.8.7, py-1.4.31, pluggy-0.3.1 -- /home/javl/.virtualenvs/stack/bin/python2
cachedir: .cache
rootdir: /home/javl/sandbox/stack/dict, inifile:
collected 5 items
countdict.py::test_it[scenario0] PASSED
countdict.py::test_it[scenario1] PASSED
countdict.py::test_it[scenario2] PASSED
countdict.py::test_it[scenario3] PASSED
countdict.py::test_it[scenario4] PASSED
============================= 5 passed in 0.01 seconds ==============================
As can be seen, all the scenarios are passing.
An explanation of how py.test works is out of scope for this answer; to learn more about it, see http://pytest.org/latest/
I wouldn't know how to avoid using a for loop, other than making your own for loop, similar to the following:
# `dictionary` and `id` are the inputs from the question
def func(tup, id):
    # return the id (plus a space) if the pair contains it, else False
    output = False
    if tup[0] == id or tup[1] == id:
        output = id + ' '
    return output
dictionary_items = list(dictionary.items())
func(dictionary_items[0], id)
func(dictionary_items[1], id)
func(dictionary_items[2], id)
And so on. However, that would be ugly and extremely non-pythonic.
As for making your code more Pythonic, you can shorten the lines output = output + k + ' ' to output += k + ' '. Plain assignment, output = k + ' ', gives the same result only when the id matches at most one item, because any earlier match would be overwritten.
Furthermore, you could check if j == id or k == id in one statement rather than two separate if statements and then say output = id + ' ', though that returns the id itself rather than the other member of the pair, so it only fits if you merely need to know that the id occurs.
You have to check all the keys and values, so there is always going to be some type of loop. Python has many ways to iterate (ie. loop) through items without explicit use of for.
One good way to iterate through items without for is with the filter and map built-in functions (and functools.reduce), along with the lambda syntax for creating small, anonymous functions.
from itertools import chain
# Get values for matching keys and vice versa
values = map(lambda x: x[1] if x[0] == id else None, dct.items())
keys = map(lambda x: x[0] if x[1] == id else None, dct.items())
# Then you filter out the None values
# itertools.chain allows us to conveniently do this in one line
matches = filter(lambda x: x is not None, chain(keys, values))
If you can't use itertools.chain, you'll just need a few extra steps (in Python 3, filter returns an iterator, so convert to lists before concatenating):
keys = list(filter(lambda x: x is not None, keys))
values = list(filter(lambda x: x is not None, values))
matches = keys + values
If you need a space separated output of values:
output = ' '.join(matches)
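Put together as a self-contained function, the map/filter idea might look like the sketch below. It is a variation rather than the exact snippet above: it filters the matching pairs first and then maps them to their partner, which avoids the intermediate None values. The parameter name wanted is an illustrative choice, and the dictionary values are assumed to be strings so that they can be joined.

```python
from itertools import chain

def get_interactions(dct, wanted):
    """Return the partners of `wanted`, space separated, without a for statement."""
    items = list(dct.items())
    # values stored under the key `wanted`
    values = map(lambda kv: kv[1], filter(lambda kv: kv[0] == wanted, items))
    # keys whose value is `wanted`
    keys = map(lambda kv: kv[0], filter(lambda kv: kv[1] == wanted, items))
    return ' '.join(chain(values, keys))

example_dictionary = {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'}
print(get_interactions(example_dictionary, 'A'))  # C D
```

Unlike the loop in the question, this version lists the value matches before the key matches and leaves no trailing space.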
You could use list comprehensions, although one could argue that it is a kind of for loop:
example_dictionary = {'A': 'C', 'R': 'E', 'D': 'A', 'L': 'R', 'C': 'D'}
def get_interactions(dic, id):
    output = [v for k, v in dic.items() if k == id] + [k for k, v in dic.items() if v == id]
    return output

Sort python dictionaries with 'value' as primary key and 'key' as secondary

What I am trying to do here is to display characters according to number of occurrences in a string in descending order. If two characters share the same number of occurrences, then they should be displayed as per the alphabetic order.
So given a string, 'abaddbccdd', what I want to display as output is:
['d', 'a', 'b', 'c']
Here is what I have done so far:
>>> from collections import Counter
>>> s = 'abaddbccdd'
>>> b = Counter(s)
>>> b
Counter({'d': 4, 'a': 2, 'c': 2, 'b': 2})
>>> b.keys()
['a', 'c', 'b', 'd']
>>> c = sorted(b, key=b.get, reverse=True)
>>> c
['d', 'a', 'c', 'b']
>>>
But how to handle the second part? 'a', 'b' and 'c' all appear in the text exactly twice and are out of order. What is the best way (hopefully shortest too) to do this?
This can be done in a single sorting pass. The trick is to do an ascending sort with the count numbers negated as the primary sorting key and the dictionary's key strings as the secondary sorting key.
b = {'d': 4, 'a': 2, 'c': 2, 'b': 2}
c = sorted(b, key=lambda k:(-b[k], k))
print(c)
output
['d', 'a', 'b', 'c']
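Starting from the string rather than a hand-written dict, the negated-count key can be wrapped up as in this short sketch (the function name by_frequency is made up for illustration):

```python
from collections import Counter

def by_frequency(s):
    """Characters of `s`, most frequent first, ties in alphabetical order."""
    counts = Counter(s)
    # negating the count turns the ascending sort into "highest count first",
    # while the character itself breaks ties in ascending order
    return sorted(counts, key=lambda ch: (-counts[ch], ch))

print(by_frequency('abaddbccdd'))  # ['d', 'a', 'b', 'c']
```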
The shortest way is:
>>> sorted(sorted(b), key=b.get, reverse=True)
['d', 'a', 'b', 'c']
So sort the sequence once in its natural order (the key order) then reverse sort on the values.
Note this won't have the fastest running time if the dictionary is large as it performs two full sorts, but in practice it is probably simplest because you want the values descending and the keys ascending.
The reason it works is that Python guarantees the sort to be stable. That means when the keys are equal the original order is preserved, so if you sort repeatedly from the last key back to the first you will get the desired result. Also reverse=True is different than just reversing the output as it also respects stability and only reverses the result where the keys are different.
You can use a lambda function:
>>> sorted(b, key=lambda char: (b.get(char), 1-ord(char)), reverse=True)
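If you are already using a Counter object, there is the Counter.most_common method. This will return a list of the items in order of highest to lowest frequency. Note that it does not sort ties alphabetically: since Python 3.7, items with equal counts come out in the order they were first encountered, which happens to be alphabetical for this string.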
>>> b.most_common()
[('d', 4), ('a', 2), ('b', 2), ('c', 2)]
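When the ties must be alphabetical regardless of where the characters first appear, the output of most_common can be re-sorted with an explicit key. A minimal sketch with a made-up input string in which the tied letters appear in reverse alphabetical order:

```python
from collections import Counter

counts = Counter('ccbbaad')

# Sort by descending count first, then by the character itself
ranked = sorted(counts.most_common(), key=lambda pair: (-pair[1], pair[0]))

print(ranked)  # [('a', 2), ('b', 2), ('c', 2), ('d', 1)]
```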
