Python - find longest path

The function takes a dictionary as input, and I want to find the length of the longest path in it. If key2 matches value1, key3 matches value2, and so forth, that counts as a path. For example:
{'a':'b', 'b':'c', 'c':'d'}
In the case above, the length should be three. How would I achieve this? More specifically, how would I compare keys to values? (They could be anything: strings, numbers, etc.)
Many thanks in advance!

I would treat the dictionary as a list of edges in a directed acyclic graph (DAG) and use the networkx module to find the longest path in the graph:
import networkx as nx
data = {'a':'b', 'b':'c', 'c':'d'}
G = nx.DiGraph()
G.add_edges_from(data.items())
try:
    path = nx.dag_longest_path(G)
    print(path)
    # ['a', 'b', 'c', 'd']
    print(len(path) - 1)
    # 3
except nx.exception.NetworkXUnfeasible: # There's a loop!
    print("The graph has a cycle")

If you insist on not importing anything, you could do something like:
def find_longest_path(data):
    missing = object()  # sentinel, so falsy keys such as 0 or '' still work
    longest = 0
    for key in data:
        seen = set()
        length = -1
        # follow key -> value -> next value until the chain ends
        while key is not missing:
            if key in seen:
                raise RuntimeError('Graph has loop')
            seen.add(key)
            key = data.get(key, missing)
            length += 1
        if length > longest:
            longest = length
    return longest
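Walking the chain from every key repeats work when chains overlap. As a rough sketch of one way around that (the function name `longest_chain` and the `best` cache are my own illustration, not part of the answer above), the length found for each key can be remembered so every key is followed only once:

```python
def longest_chain(data):
    """Length (in edges) of the longest key -> value chain in `data`."""
    best = {}  # key -> number of edges in the chain starting at that key

    for start in data:
        path, on_path = [], set()
        node = start
        # Walk until the chain leaves the mapping or reaches a solved key.
        while node in data and node not in best:
            if node in on_path:
                raise ValueError('mapping contains a cycle')
            on_path.add(node)
            path.append(node)
            node = data[node]
        # Unwind the walk, recording the chain length for each visited key.
        length = best.get(node, 0)
        for visited in reversed(path):
            length += 1
            best[visited] = length
    return max(best.values(), default=0)


print(longest_chain({'a': 'b', 'b': 'c', 'c': 'd'}))  # 3
```

This assumes all values are hashable, as in the question's example.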

Related

How to use enumerate in a list comprehension with two lists?

I just started using list comprehensions and I'm struggling with them. In this case, I need the index in each list (sequence_0 and sequence_1) that the iteration is at each time. How can I do that?
The idea is to get the longest sequence of equal nucleotides (a motif) between the two sequences. Once a matching pair is found, the program should continue with the next nucleotides of the sequences, checking whether they are also equal and, if so, elongating the motif with them. The final output should be a list of all the motifs found.
The problem is that, to continue with the next nucleotides once a pair is found, I need the position of the pair in both sequences. The index function does not work in this case, and that's why I need enumerate.
Also, I don't understand exactly why x and y are in parentheses; it would be good to understand that too :)
Just to explain, the lists contain DNA sequences, so it's basically something like:
sequence_1 = ['A', 'T', 'C', 'A', 'C']
def find_shared_motif(arq):
    data = fastaread(arq)
    seqs = [list(sequence) for sequence in data.values()]
    motifs = [[]]
    i = 0
    sequence_0, sequence_1 = seqs[0], seqs[1] # just to simplify
    for x, y in [(x, y) for x in zip(sequence_0[::], sequence_0[1::]) for y in zip(sequence_1[::], sequence_1[1::])]:
        print(f'Pairs {"".join(x)} and {"".join(y)} being analyzed...')
        if x == y:
            print(f'Pairs {"".join(x)} and {"".join(y)} match!')
            motifs[i].append(x[0]), motifs[i].append(x[1])
            k = sequence_0.index(x[0]) + 2 # IS NOT RETURNING THE RIGHT NUMBER
            u = sequence_1.index(y[0]) + 2
            print(k, u)
            # Determines if the rest of the sequence is compatible
            print(f'Starting to elongate the motif {x}...')
            for j, m in enumerate(sequence_1[u::]):
                try:
                    # Checks if the nucleotide is equal for both of the sequences
                    print(f'Analyzing the pair {sequence_0[k + j]}, {m}')
                    if m == sequence_0[k + j]:
                        motifs[i].append(m)
                        print(f'The pair {sequence_0[k + j]}, {m} is equal!')
                    # Stop in the first nonequal residue
                    else:
                        print(f'The pair {sequence_0[k + j]}, {m} is not equal.')
                        break
                except IndexError:
                    print('IndexError, end of the string')
            else:
                i += 1
                motifs.append([])
    return motifs
...
One way to go with it is to start zipping both lists:
a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
c = list(zip(a,b))
In that case, c will have the list of tuples below
c = [('A','A'), ('T','T'), ('C','C'), ('A','C'), ('C','T')]
Then, you can go with list comprehension and enumerate:
d = [(i, t) for i, t in enumerate(c)]
This will bring something like this to you:
d = [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Of course you can go for a one-liner, if you want:
d = [(i, t) for i, t in enumerate(zip(a,b))]
>>> [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Now, you have to deal with the nested tuples. Focus on the internal ones. It is obvious that what you want is to compare the first element of the tuples with the second ones. But, also, you will need the position where the difference resides (that lies outside). So, let's build a function for it. Inside the function, i will capture the positions, and t will capture the inner tuples:
def compare(a, b):
    d = [(i, t) for i, t in enumerate(zip(a,b))]
    for i, t in d:
        if t[0] != t[1]:
            return i
    return -1
In that way, if you get -1 at the end, it means that all elements in both lists are equal, side by side. Otherwise, you will get the position of the first difference between them.
It is important to notice that, in the case of two lists with different sizes, the zip function will bring a list of tuples with the size matching the smaller of the lists. The extra elements of the other list will be ignored.
Ex.
list(zip([1,2], [3,4,5]))
>>> [(1,3), (2,4)]
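If the trailing elements of the longer list matter, the standard library's itertools.zip_longest pads the shorter input instead of truncating. A small sketch with the same two lists (the variable names are only illustrative):

```python
from itertools import zip_longest

# zip stops at the shorter input; zip_longest pads with fillvalue (None by default)
truncated = list(zip([1, 2], [3, 4, 5]))
padded = list(zip_longest([1, 2], [3, 4, 5]))

print(truncated)  # [(1, 3), (2, 4)]
print(padded)     # [(1, 3), (2, 4), (None, 5)]
```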
You can use the function compare with your code to get the positions where the lists differ, and use that to build your motifs.
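To show how the positions in both sequences can be tracked at the same time, here is a minimal sketch that uses enumerate on each sequence and zip on the slices that start at a matching pair. The function name `shared_runs` and the `min_len` parameter are assumptions made for illustration; this is not the asker's find_shared_motif, and it makes no attempt to be fast.

```python
def shared_runs(seq0, seq1, min_len=2):
    """Return (index_in_seq0, index_in_seq1, run) for each shared run found."""
    runs = []
    for k, x in enumerate(seq0):          # k is the position in seq0
        for u, y in enumerate(seq1):      # u is the position in seq1
            if x != y:
                continue
            # Elongate the match while both sequences keep agreeing.
            length = 0
            for p, q in zip(seq0[k:], seq1[u:]):
                if p != q:
                    break
                length += 1
            if length >= min_len:
                runs.append((k, u, ''.join(seq0[k:k + length])))
    return runs


a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
print(shared_runs(a, b))  # [(0, 0, 'ATC'), (1, 1, 'TC')]
```

Because enumerate yields the index together with the element, there is no need for list.index, which always returns the first occurrence of a value.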

Checking which key has the most letters in its dictionary in Python

I have a dictionary in a Python code like this:
S = {(x0): 'omicron', (x1): 'a', (x2): 'ab', (x3): 'abbr', (x4): 'abr', (x5): 'abrf', (x6): 'abrfa', (x7): 'af', '(x8)': 'afc'}
I would like to check which key has the value with the highest number of letters, excluding the one whose value is 'omicron'. The answer in this example should be (x6), because its value has 5 letters, more than any other key, not counting (x0): 'omicron'.
Is there an efficient way to do this? Thank you.
You could use the key parameter of max:
res = max(S, key=lambda x: (S[x] != 'omicron', len(S[x])))
print(res)
Output
(x6)
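The key function returns a tuple, and tuples compare element by element. Keys whose value is not 'omicron' get True as the first element, so they rank above any key whose value is 'omicron' (True > False, i.e. 1 > 0). Among the keys that do not have the 'omicron' value, the length of the value decides.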
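A short sketch of what the tuple key looks like for individual entries. The sample data is hypothetical: string keys stand in for the question's (x0)...(x8), and the helper name `rank` is my own.

```python
# Hypothetical sample data; string keys stand in for the question's (x0)...(x8).
S = {'x0': 'omicron', 'x1': 'a', 'x5': 'abrf', 'x6': 'abrfa', 'x7': 'af'}

def rank(key):
    """Sort key: non-'omicron' values first (True > False), then value length."""
    return (S[key] != 'omicron', len(S[key]))

ranks = {key: rank(key) for key in S}
best = max(S, key=rank)

# Alternative with the same result: filter first, then compare lengths only.
best_filtered = max((k for k in S if S[k] != 'omicron'), key=lambda k: len(S[k]))

print(ranks['x0'], ranks['x6'])  # (False, 7) (True, 5)
print(best, best_filtered)       # x6 x6
```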
S = {'x0': 'omicron', 'x1': 'a', 'x2': 'ab', 'x3': 'abbr', 'x4': 'abr', 'x5': 'abrf', 'x6': 'abrfa', 'x7': 'af', 'x8': 'afc'}
longest = 0
word = ''
for key, value in S.items():
    if len(value) > longest and value != 'omicron':
        longest = len(value)
        word = key
print(longest, word)
Output:
5 x6

How to identify matching values in dictionary and create a new string with only those keys?

I have a method that pulls repeating letters from a string and adds them to a dictionary with the amount of times they repeat as the values. Now what I would like to do is pull all the keys that have matching values and create a string with only those keys.
example:
text = "theerrrdd"
count = {}
same_value = ""
for ch in text:
    if text.count(ch) > 1:
        count[ch] = text.count(ch)
How can I check count for keys with matching values, and if found, add those keys to same_value?
So in this example "e" and "d" would both have a value of 2. I want to add them to same_value so that when called, same_value would return "ed".
I basically just want to be able to identify which letters repeated the same amount of time.
First create a letter to count mapping, then reverse this mapping. Using the collections module:
from collections import defaultdict, Counter
text = 'theerrrdd'
# create dictionary mapping letter to count
letter_count = Counter(text)
# reverse mapping to give count to letters mapping
count_letters = defaultdict(list)
for letter, count in letter_count.items():
    count_letters[count].append(letter)
Result:
print(count_letters)
defaultdict(<class 'list'>, {1: ['t', 'h'],
2: ['e', 'd'],
3: ['r']})
Then, for example, count_letters[2] gives you all letters which are seen twice in your input string.
Using str.count in a loop is inefficient as it requires a full iteration of your string for each letter. In other words, such an algorithm has quadratic complexity, while collections.Counter has linear complexity.
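Putting the pieces together, a minimal sketch that returns only the groups of repeated letters sharing the same count, which is what the question's same_value is after. The function name `letters_sharing_counts` is an assumption for illustration.

```python
from collections import Counter, defaultdict


def letters_sharing_counts(text):
    """Map each repeat count (> 1) shared by several letters to those letters."""
    groups = defaultdict(list)
    for letter, n in Counter(text).items():  # a single pass over the text
        if n > 1:
            groups[n].append(letter)
    # Keep only the counts that more than one letter has.
    return {n: ''.join(letters) for n, letters in groups.items() if len(letters) > 1}


print(letters_sharing_counts('theerrrdd'))  # {2: 'ed'}
```

Counter keeps letters in order of first appearance, so the joined string follows the order of the input text.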
Another approach would be to use set() to get just the unique characters in the string, loop through the set and create a dict where the counts are the keys with lists of characters for each count. Then you can generate strings for each count using join().
text = "theerrrdd"
chars = set(text)
counts = {}
for ch in chars:
    ch_count = text.count(ch)
    if counts.get(ch_count, None):
        counts[ch_count].append(ch)
    else:
        counts[ch_count] = [ch]
# print string of chars where count is 2
print(''.join(counts[2]))
# OUTPUT
# ed (or de, since set iteration order is arbitrary)
I think this is the simplest solution of all:
from collections import Counter
text = "theerrrdd"
count = Counter(text)
same_value = ''.join([k for k in count.keys() if count[k] > 1])
print(count)
print(same_value)
From the count dictionary, create another dictionary with the count as key and all letters with that count as the value. Then iterate through the values of this new dictionary to find those whose length is greater than 1, and form strings from them:
from collections import defaultdict
text = "theerrrdd"
count = {}
new_dict = defaultdict(list)
for ch in text:
    if text.count(ch) > 1:
        count[ch] = text.count(ch)
for k, v in count.items():
    new_dict[v].append(k)
same_values_list = [v for v in new_dict.values() if len(v) > 1]
for x in same_values_list:
    print(''.join(x))
# ed
new_dict is the newly created dictionary with count as key and all letters with that count as the value for that key:
print(new_dict)
# defaultdict(<class 'list'>, {2: ['e', 'd'],
# 3: ['r']})

How to standardize the format of element in the list from big data

Trying to count unique value from the following list without using collection:
('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
The output which I require is :
('TOILET':2,'AIR CONDITIONiNGS':3)
My code currently is
for i in Data:
    if i in number:
        number[i] += 1
    else:
        number[i] = 1
print(number)
Is it possible to get the output?
Using difflib.get_close_matches to help determine uniqueness
import difflib
a = ('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
d = {}
for word in a:
    similar = difflib.get_close_matches(word, d.keys(), cutoff = 0.6, n = 1)
    #print(similar)
    if similar:
        d[similar[0]] += 1
    else:
        d[word] = 1
The actual keys in the dictionary will depend on the order of the words in the list.
difflib.get_close_matches uses difflib.SequenceMatcher to calculate the closeness (ratio) of the word against all possibilities even if the first possibility is close - then sorts by the ratio. This has the advantage of finding the closest key that has a ratio greater than the cutoff. But as the dictionary grows the searches will take longer.
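To get a feel for the ratios involved, here is a small sketch that prints the SequenceMatcher ratio for a few of the words from the question. The helper name `closeness` is my own; the cutoff of 0.6 is the one used above.

```python
import difflib


def closeness(a, b):
    """Similarity in [0, 1] as computed by difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()


pairs = [
    ('TOILET', 'TOILETS'),
    ('AIR CONDITIONING', 'AIR-CONDITIONINGS'),
    ('TOILET', 'AIR CONDITIONING'),
]
for a, b in pairs:
    print(a, '|', b, '|', round(closeness(a, b), 3))

# get_close_matches keeps only candidates at or above the cutoff, best first.
match = difflib.get_close_matches('TOILETS', ['TOILET', 'AIR CONDITIONING'], n=1, cutoff=0.6)
print(match)  # ['TOILET']
```

The first two pairs score well above 0.6 and the last one well below, which is why the words collapse into two groups.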
If needed, you might be able to optimize a little by sorting the list first so that similar words appear in sequence and doing something like this (lazy evaluation) - choosing an appropriately large cutoff.
import difflib, collections
z = collections.OrderedDict()
a = sorted(a)
cutoff = 0.6
for word in a:
    for key in z.keys():
        if difflib.SequenceMatcher(None, word, key).ratio() > cutoff:
            z[key] += 1
            break
    else:
        z[word] = 1
Results:
>>> d
{'TOILET': 2, 'AIR CONDITIONING': 3}
>>> z
OrderedDict([('AIR CONDITIONING', 3), ('TOILET', 2)])
>>>
I imagine there are python packages that do this sort of thing and may be optimized.
I don't believe the python list has an easy built-in way to do what you are asking. It does, however, have a count method that can tell you how many of a specific element there are in a list. Example:
some_list = ['a', 'a', 'b', 'c']
some_list.count('a') #=> 2
Usually the way you get what you want is to build a dictionary of counts by taking advantage of the dict.get(key, default) method:
some_list = ['a', 'a', 'b', 'c']
counts = {}
for el in some_list:
    counts[el] = counts.get(el, 0) + 1
counts # => {'a': 2, 'b': 1, 'c': 1}
You can try this:
import re
data = ('TOILETS','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
# replace runs of non-word characters (such as '-') with a single space
new_data = [re.sub(r"\W+", ' ', i) for i in data]
print(new_data)
final_data = {}
for i in new_data:
    # existing keys that are a prefix of the current word
    s = [b for b in final_data if i.startswith(b)]
    if s:
        final_data[s[0]] += 1
    else:
        final_data[i] = 1
print(final_data)
Output:
{'TOILETS': 2, 'AIR CONDITIONING': 3}
original = ('TOILETS', 'TOILETS', 'AIR CONDITIONING',
'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
a_set = set(original)
result_dict = {element: original.count(element) for element in a_set}
First, making a set from original list (or tuple) gives you all values from it, but without repeating.
Then you create a dictionary with keys from that set and values as occurrences of them in the original list (or tuple), employing the count() method.
a = ['TOILETS', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING']
b = {}
for i in a:
    b.setdefault(i,0)
    b[i] += 1
You can use this code but, as Jon Clements pointed out, TOILET and TOILETS aren't the same string, so you must normalize them first.
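Since exact counting treats TOILET and TOILETS as different strings, one option is to normalize each entry before counting. The sketch below assumes two deliberately naive rules that are mine, not the answerers': punctuation becomes a space, and a single trailing 'S' is dropped. The second rule would mangle words such as GLASS, so real data would need something more careful. It uses a plain dict, as the question asks to avoid collections.

```python
import re


def normalize(label):
    """Collapse punctuation to spaces and drop one trailing 'S' (naive plural rule)."""
    label = re.sub(r'\W+', ' ', label.upper()).strip()
    return label[:-1] if label.endswith('S') else label


def count_normalized(items):
    counts = {}
    for item in items:
        key = normalize(item)
        counts[key] = counts.get(key, 0) + 1
    return counts


data = ('TOILET', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
print(count_normalized(data))  # {'TOILET': 2, 'AIR CONDITIONING': 3}
```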

How to create a frequency matrix?

I just started using Python and I just came across the following problem:
Imagine I have the following list of lists:
list = [["Word1","Word2","Word2","Word4566"],["Word2", "Word3", "Word4"], ...]
The result (matrix) I want to get should look like this:
The displayed columns and rows are all the words that appear (no matter in which list).
What I want is a program that counts the appearances of words in each list (list by list).
The picture is the result after the first list.
Is there an easy way to achieve something like this or something similar?
EDIT:
Basically I want a list/matrix that tells me how many times words 2-4566 appeared when word 1 was also in the list, and so on.
So I would get a list for each word that displays the absolute frequency of all other 4555 words in relation to this word.
So I would need an algorithm that iterates through all these lists of words and builds the result lists.
As far as I understand you want to create a matrix that shows the number of lists where two words are located together for each pair of words.
First of all we should fix the set of unique words:
lst = [["Word1","Word2","Word2","Word4566"],["Word2", "Word3", "Word4"], ...] # list is the name of a built-in type in Python; don't shadow it with a variable name
words = set()
for sublst in lst:
    words |= set(sublst)
words = list(words)
Second we should define a matrix with zeros:
result = [[0] * len(words) for _ in range(len(words))] # zeros matrix N x N, one separate list per row
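One detail worth checking when building a zero matrix from nested lists: multiplying the outer list repeats references to a single row rather than creating separate rows, so an update to one cell shows up in every row. A minimal sketch (the names `shared` and `independent` are only illustrative):

```python
n = 3

shared = [[0] * n] * n   # the outer list holds n references to ONE row
shared[0][0] += 1
print(shared)            # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]

independent = [[0] * n for _ in range(n)]  # a fresh row per iteration
independent[0][0] += 1
print(independent)       # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```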
And finally we fill the matrix going through the given list:
for sublst in lst:
    sublst = list(set(sublst)) # selecting unique words only
    for i in range(len(sublst)):
        for j in range(i + 1, len(sublst)):
            index1 = words.index(sublst[i])
            index2 = words.index(sublst[j])
            result[index1][index2] += 1
            result[index2][index1] += 1
print(result)
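With thousands of distinct words, most cells of an N x N matrix stay zero and each words.index call scans the whole list. As an alternative sketch (not the answerer's method; the function name `cooccurrence` is hypothetical), the pairs that actually occur can be counted directly with itertools.combinations and a Counter:

```python
from collections import Counter
from itertools import combinations


def cooccurrence(lists):
    """Count, for each unordered word pair, the number of lists containing both."""
    pairs = Counter()
    for sub in lists:
        # sorted(set(...)) removes duplicates and gives each pair a canonical order
        for a, b in combinations(sorted(set(sub)), 2):
            pairs[(a, b)] += 1
    return pairs


sample = [["Word1", "Word2", "Word2", "Word4566"], ["Word2", "Word3", "Word4"]]
counts = cooccurrence(sample)
print(counts[("Word1", "Word2")])  # 1
print(counts[("Word1", "Word3")])  # 0 (a Counter returns 0 for missing pairs)
```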
I find it really hard to understand what you're really asking for, but I'll try by making some assumptions:
(1) You have a list (A), containing other lists (b) of multiple words (w).
(2) For each b-list in A-list
(3) For each w in b:
(3.1) count the total number of appearances of w in all of the b-lists
(3.2) count how many of the b-lists, in which w appears only once
If these assumptions are correct, then the table doesn't correspond correctly to the list you've provided. If my assumptions are wrong, then I still believe my solution may give you inspiration or some ideas on how to solve it correctly. Finally, I do not claim my solution to be optimal with respect to speed or similar.
Note: I use Python's built-in dictionaries. They are hash tables with average constant-time lookups, so filling them with thousands of words should not be a problem. Have a look at: https://docs.python.org/2/tutorial/datastructures.html#dictionaries
frq_dict = {} # num of appearances / frequency
uqe_dict = {} # unique
for list_b in list_A:
    temp_dict = {}
    for word in list_b:
        if word in temp_dict:
            temp_dict[word] += 1
        else:
            temp_dict[word] = 1
    # frq is the number of appearances
    for word, frq in temp_dict.items():
        if frq > 1:
            if word in frq_dict:
                frq_dict[word] += frq
            else:
                frq_dict[word] = frq
        else:
            if word in uqe_dict:
                uqe_dict[word] += 1
            else:
                uqe_dict[word] = 1
I managed to come up with the right answer to my own question:
lst = [["Word1","Word2","Word2"],["Word2", "Word3", "Word4"],["Word2","Word3"]]
# Names of all dicts
all_words = sorted(set(w for sublist in lst for w in sublist))
# Creating the dicts
dicts = []
for i in all_words:
    dicts.append([i, dict.fromkeys([w for w in all_words if w != i], 0)])
# Updating the dicts
for l in lst:
    for word in sorted(set(l)):
        ind = [w[0] for w in dicts].index(word)
        for w in dicts[ind][1]:
            dicts[ind][1][w] += l.count(w)
print(dicts)
Gets the result:
[['Word1', {'Word2': 2, 'Word3': 0, 'Word4': 0}], ['Word2', {'Word1': 1, 'Word3': 2, 'Word4': 1}], ['Word3', {'Word1': 0, 'Word2': 2, 'Word4': 1}], ['Word4', {'Word1': 0, 'Word2': 1, 'Word3': 1}]]
