Creating a dictionary with calculated values

I have a large text string and I would like to create a dictionary where each key is a pair of words from the string (I have to go through all possible combinations) and each value is the frequency of that pair. In effect it is a 2D matrix in which each element is a number: the frequency of the pair formed by the row word and the column word. The position of the words in the pair is irrelevant: e.g. if ridebike = 4 (a frequency) then bikeride = 4 as well.
The end goal is to populate the matrix and then select the top N pairs.
I am new to working with text strings and to Python in general, and I got hopelessly lost (there are also way too many loops in my "code").
This is what I have (after deleting stopwords and punctuation):
textNP = 'stopped traffic bklyn bqe 278 wb manhattan brtillary stx29 wb cadman pla hope oufootball makes safe manhattan kansas tomorrow boomersooner beatwildcats theyhateuscuztheyaintus hatersgonnahate rt bringonthecats bring cats exclusive live footage oklahoma trying get manhattan http colktsoyzvvz rt jonfmorse bring cats exclusive live footage oklahoma trying get manhattan'
Some code (incomplete and wrong):
txtU = set(textNP)
lntxt = len(textNP)
lntxtS = len(txtU)
matrixNP = {}
for b1, i1 in txtU:
    for b2, i2 in txtU:
        if i1< i2:
            bb1 = b1+b2
            bb2 = b2+b1
            freq = 0
            for k in textNP:
                for j in textNP:
                    if k < j:
                        kj = k+j
                        if kj == bb1 | kj == bb2:
                            freq +=1
            matrixNP[i1][i2] = freq
            matrixNP[i2][i1] = freq
        elif i1 == i2: matrixNP[i1][i1] = 1
One issue is that I am certain having this many loops is wrong. Also, I am not sure how to assign calculated keys (concatenations of words) to a dictionary (I think I got the values right).
The text string is not the final product: it will be cleaned of numbers and a few other things with various regexes.
Your help will be very much appreciated!

Are you looking for all combinations of 2 words? If so, you can use itertools.combinations and a collections.Counter to do what you want:
>>> from itertools import combinations
>>> from collections import Counter
>>> N = 5
>>> c = Counter(tuple(sorted(a)) for a in combinations(textNP.split(), 2))
>>> c.most_common(N)
[(('manhattan', 'rt'), 8),
(('exclusive', 'manhattan'), 8),
(('footage', 'manhattan'), 8),
(('manhattan', 'oklahoma'), 8),
(('bring', 'manhattan'), 8)]
Or are you looking for all pairs of consecutive words? Then you can create a pairwise function:
>>> from itertools import tee
>>> from collections import Counter
>>> def pairwise(iterable):
...     a, b = tee(iterable)
...     next(b, None)
...     return zip(a, b) # itertools.izip() in python2
>>> N = 5
>>> c = Counter(tuple(sorted(a)) for a in pairwise(textNP.split()))
>>> c.most_common(N)
[(('get', 'manhattan'), 2),
(('footage', 'live'), 2),
(('get', 'trying'), 2),
(('bring', 'cats'), 2),
(('exclusive', 'live'), 2)]
Either way, I don't see 'bike ride' anywhere in the results.
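To illustrate why sorting each pair makes word order irrelevant, here is a minimal sketch. The helper names pair_counts and pair_count and the sample string are hypothetical, chosen only for this example; the counting itself is the same Counter-over-combinations idea used above.

```python
from collections import Counter
from itertools import combinations

def pair_counts(text):
    """Count unordered word pairs over all 2-word combinations in text."""
    words = text.split()
    # Sorting each pair maps ('ride', 'bike') and ('bike', 'ride') to one key.
    return Counter(tuple(sorted(pair)) for pair in combinations(words, 2))

def pair_count(counts, w1, w2):
    """Look up a pair regardless of the order the two words are given in."""
    return counts[tuple(sorted((w1, w2)))]

sample = "ride bike ride bike city"  # hypothetical input, not from the question
counts = pair_counts(sample)
print(pair_count(counts, "ride", "bike"))  # 4
print(pair_count(counts, "bike", "ride"))  # 4
```

Because a Counter returns 0 for missing keys, looking up a pair that never occurs does not raise an error.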

Related

Python Comprehension - Replace Nested Loop

Working on the 'two-sum' problem..
Input: An unsorted array A (of integers), and a target sum t
The goal: to return a list of tuple pairs (x,y) where x + y = t
I've implemented a hash-table H to store the contents of A. By using a nested loop to iterate through H, I'm achieving the desired output. However, in the spirit of learning the art of Python, I'd like to replace the nested loop with a nice one-liner using a comprehension and maybe a lambda function. Suggestions?
Source Code:
import csv
with open('/Users/xxx/Developer/Algorithms/Data Structures/_a.txt') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter ='\n')
    hash_table = {int(num[0]):int(num[0]) for(num) in csv_reader} # {int: int}
def two_sum(hash_table, target):
    pairs = list()
    for x in hash_table.keys():
        for y in hash_table.keys():
            if x == y:
                continue
            if x + y == target:
                pairs.append((x,y))
    return pairs

When you have two iterables and want every combination of their elements, as in your case, you can combine the nested loops into one using itertools.product. You can replace the code below
range1 = [1,2,3,4]
range2 = [3, 4, 5]
for x in range1:
    for y in range2:
        print(x, y)
with
from itertools import product
for x, y in product(range1, range2):
    print(x, y)
Both code blocks produce
1 3
1 4
1 5
2 3
2 4
2 5
3 3
3 4
3 5
4 3
4 4
4 5
But you would still need the if check with this construct. However, product returns a lazy iterator, and you can pass that as the iterable to map or filter along with a lambda function.
In your case you only want to include pairs that meet a criterion, so filter is what you want. In my simple example, if we only want combinations whose sum is even, then we could do something like
gen = product(range1, range2)
f = lambda i: (i[0] + i[1]) % 2 == 0
desired_pairs = filter(f, gen)
This can be written as a one-liner:
desired_pairs = filter(lambda i: (i[0] + i[1]) % 2 == 0, product(range1, range2))
which is still simple enough to understand.
Note that, like product and map, filter returns a lazy iterator (in Python 3), which is good if you are just going to loop over it later to do some other work. If you really need a list, convert it with list:
desired_pairs = list(filter(lambda i: (i[0] + i[1]) % 2 == 0, product(range1, range2)))
If we print this we get
[(1, 3), (1, 5), (2, 4), (3, 3), (3, 5), (4, 4)]
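Applied to the two-sum question itself, a comprehension can also replace the inner loop with a membership test, which avoids scanning every pair. This is only a sketch under stated assumptions: the function takes any iterable of integers rather than the asker's file-backed hash table, and the sample numbers are made up.

```python
def two_sum(values, target):
    """Return ordered pairs (x, y) of distinct values with x + y == target."""
    table = set(values)  # average O(1) membership tests, like dict keys
    # For each x, the only possible partner is target - x, so look it up
    # directly instead of looping over every candidate y.
    return [(x, target - x) for x in table
            if target - x in table and x != target - x]

numbers = [1, 4, 6, 9, 5, 10]  # hypothetical sample data
print(sorted(two_sum(numbers, 10)))
# [(1, 9), (4, 6), (6, 4), (9, 1)]
```

Like the original nested loop, this reports each pair in both orders and skips x == y; the result order depends on set iteration, hence the sorted call when printing.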

How to get a value from a two-dimensional list that is in an adjacent column from one that matches the value of another two-dimensional list

I have 2 files that I converted to list-of-lists format. Short examples:
a
c1 165.001 17593685
c2 1650.94 17799529
c3 16504399 17823261
b
1 rs3094315 **0.48877594** *17593685* G A
1 rs12562034 0.49571378 768448 A G
1 rs12124819 0.49944228 776546 G A
Using a 'for' loop I tried to find the values common to these lists, but I can't get the loop to work. I need this because I want the value adjacent to the one that is common to the two lists (in this example it is 0.48877594, since 17593685 is common to 'a' and 'b'). My attempts, which froze completely:
for i in a:
    if i[2] == [d[3] for d in b]:
        print(i[0], i[2] + d[2])
or
for i in a and d in b:
    if i[2] == d[3]
        print(i[0], i[2] + d[2]
Overall I need to get the first file with a new column containing that bold adjacent value. This is my first month of programming and I can't understand the logic. Thanks in advance!
+++
List's original format:
a = [['c1', '165.001', '17593685'], ['c2', '1650.94', '17799529'], ['c3', '16504399', '17823261']]
[['c1', '16504399', '17593685.1\n'], ['c2', '16504399', '17799529.1\n'], ['c3', '16504399', '17823261.\n']]
++++ My original data
Two or more people can have identical DNA segments because they were inherited from a common ancestor. File 'a' contains the following columns:
SegmentID, start of segment, end of segment, IDs of the individuals that share this segment (from 2 upward). Example (just a small part, since the real list has more than 1000 rows, i.e. segments 'c'; the number of individuals can differ):
c1 16504399 17593685 19N 19N.0 19N 19N.0 182AR 182AR.0 182AR 182AR.0 6i 6i.1 6i 6i.1 153A 153A.1 153A 153A.1
c2 14404399 17799529 62BB 62BB.0 62BB 62BB.0 55k 55k.0 55k 55k.0 190k 190k.0 190k 190k.0 51A 51A.1 51A 51A.1 3A 3A.1 3A 3A.1 38k 38k.1 38k 38k.1
c3 1289564 177953453 164Bur 164Bur.0 164Bur 164Bur.0 38BO 38BO.1 38BO 38BO.1 36i 36i.1 36i 36i.1 100k 100k.1 100k 100k.1
file b:
This one always has 6 columns, but the number of rows is more than 100 million, so this is only part of it:
1 rs3094315 0.48877594 16504399 G A
1 rs12562034 0.49571378 17593685 A G
1 rs12124819 0.49944228 14404399 G A
1 rs3094221 0.48877594 17799529 G A
1 rs12562222 0.49571378 1289564 A G
1 rs121242223 0.49944228 177953453 G A
So, I need to compare a[1] with b[3] and, if they are equal,
print(a[1],b[3]), because b[3] is also the position of the segment, but in another measurement system. That is what I can't do.

Taking a leap (because the question isn't really clear), I think you are looking for the product of a, b, e.g.:
In []:
for i in a:
    for d in b:
        if i[2] == d[3]:
            print(i[0], i[2] + d[2])
Out[]:
c1 175936850.48877594
You can do the same with itertools.product():
In []:
import itertools as it
for i, d in it.product(a, b):
    if i[2] == d[3]:
        print(i[0], i[2] + d[2])
Out[]:
c1 175936850.48877594
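Both versions above compare every row of a with every row of b. A possible alternative, sketched here under the assumption that b fits in memory and that each position occurs at most once in b, is to index b by position once and then look up each row of a. The names value_by_position and a_extended are made up for this example; the data is the short sample from the question.

```python
a = [['c1', '165.001', '17593685'],
     ['c2', '1650.94', '17799529'],
     ['c3', '16504399', '17823261']]
b = [['1', 'rs3094315', '0.48877594', '17593685', 'G', 'A'],
     ['1', 'rs12562034', '0.49571378', '768448', 'A', 'G'],
     ['1', 'rs12124819', '0.49944228', '776546', 'G', 'A']]

# Index b once: position (column 3) -> adjacent value (column 2).
# If a position were repeated in b, the last row would win.
value_by_position = {row[3]: row[2] for row in b}

# Append the matching value (or None when there is no match) as a new column.
a_extended = [row + [value_by_position.get(row[2])] for row in a]
for row in a_extended:
    print(row)
```

Each lookup is then a dictionary access instead of a scan over b, which matters when b has many rows.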

It would be much faster to leave your data as strings and search:
for a_line in [_ for _ in a.split('\n') if _]: # skip blank lines
    search_term = a_line.strip().split()[-1] # get search term
    term_loc_in_b = b.find(search_term) # get search term location in file b
    if term_loc_in_b != -1: # -1 means term not found
        # split b once just before search term starts
        value_in_b = b[:term_loc_in_b].strip().rsplit(maxsplit=1)[-1]
        print(value_in_b)
    else:
        print('{} not found'.format(search_term))
If the files are large, you might consider using mmap to search b.
mmap.find requires bytes, e.g. search_term.encode()

Find values in list which differ from reference list by up to N characters

I have a list like the following:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
And a reference list like this:
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
I want to extract the values from Test if they differ by N or fewer characters from any one of the items in Ref.
For example, if N = 1, only the first two elements of Test should be output. If N = 2, all three elements fit this criterion and should be returned.
It should be noted that I am only looking for values of the same character length (ASDFGY -> ASDFG should not match for N = 1), so I want something more efficient than Levenshtein distance.
I have over 1000 values in Ref and a couple hundred million in Test, so efficiency is key.

Using a generator expression with sum:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
def comparer(x, y, n):
    return (len(x) == len(y)) and (sum(i != j for i, j in zip(x, y)) <= n)
res = [a for a, b in zip(Ref, Test) if comparer(a, b, 1)]
print(res)
['ASDFGY', 'QWERTYI']
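The question asks for Test values that are close to any item in Ref, not only to the item at the same index. A minimal sketch of that variant follows; the helper names hamming_within and close_matches are hypothetical, and grouping Ref by length is just one way to skip comparisons that can never match.

```python
from collections import defaultdict

def hamming_within(x, y, n):
    """True if equal-length strings x and y differ in at most n positions."""
    mismatches = 0
    for cx, cy in zip(x, y):
        if cx != cy:
            mismatches += 1
            if mismatches > n:
                return False  # stop early once the budget is exceeded
    return True

def close_matches(test, ref, n):
    """Return the items of test within n substitutions of any item in ref."""
    refs_by_len = defaultdict(list)
    for r in ref:
        refs_by_len[len(r)].append(r)  # only same-length strings can match
    return [t for t in test
            if any(hamming_within(t, r, n) for r in refs_by_len.get(len(t), ()))]

Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
print(close_matches(Test, Ref, 1))  # ['ASDFGH', 'QWERTYU']
print(close_matches(Test, Ref, 2))  # ['ASDFGH', 'QWERTYU', 'ZXCVB']
```

This is still a plain Python loop per comparison, so for hundreds of millions of Test values it should be read as a statement of the logic rather than a tuned implementation.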

Using difflib
Demo:
import difflib
N = 1
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
result = []
for i,v in zip(Test, Ref):
    c = 0
    for j,s in enumerate(difflib.ndiff(i, v)):
        if s.startswith("-"):
            c += 1
    if c <= N:
        result.append( i )
print(result)
Output:
['ASDFGH', 'QWERTYU']

The newer regex module offers a "fuzzy" match possibility:
import regex as re
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
for item in Test:
    rx = re.compile('(' + item + '){s<=3}')
    for r in Ref:
        if rx.search(r):
            print(rf'{item} is similar to {r}')
This yields
ASDFGH is similar to ASDFGY
ASDFGH is similar to ASDFGI
ASDFGH is similar to ASDFGX
QWERTYU is similar to QWERTYI
ZXCVB is similar to ZXCAA
You can control it via the {s<=3} part, which allows three or fewer substitutions.
To have pairs, you could write
pairs = [(origin, difference)
for origin in Test
for rx in [re.compile(rf"({origin}){{s<=3}}")]
for difference in Ref
if rx.search(difference)]
Which would yield for
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
the following output:
[('ASDFGH', 'ASDFGY'), ('ASDFGH', 'ASDFGI'),
('ASDFGH', 'ASDFGX'), ('QWERTYU', 'QWERTYI'),
('ZXCVB', 'ZXCAA')]

How to split a list into subsets with no repeating elements in python

I need code that takes a list (up to n=31) and returns all possible groupings into subsets of size 3 such that no two elements appear together in a subset more than once (think of people who are teaming up in groups of 3 with new people every time):
list=[1,2,3,4,5,6,7,8,9]
and returns
[1,2,3][4,5,6][7,8,9]
[1,4,7][2,5,8][3,6,9]
[1,6,8][2,4,9][3,5,7]
but not:
[1,5,7][2,4,8][3,6,9]
because 1 and 7 have appeared together already (likewise, 3 and 9).
I would also like to do this for subsets of n=2.
Thank you!!

Here's what I came up with:
from itertools import permutations, combinations, ifilter, chain
people = [1,2,3,4,5,6,7,8,9]
#get all combinations of 3 sets of 3 people
combos_combos = combinations(combinations(people,3), 3)
#filter out sets that don't contain all 9 people
valid_sets = ifilter(lambda combo:
                     len(set(chain.from_iterable(combo))) == 9,
                     combos_combos)
#a set of people that have already been paired
already_together = set()
for sets in valid_sets:
    #get all (sorted) combinations of pairings in this set
    pairings = list(chain.from_iterable(combinations(combo, 2) for combo in sets))
    pairings = set(map(tuple, map(sorted, pairings)))
    #if all of the pairings have never been paired before, we have a new one
    if len(pairings.intersection(already_together)) == 0:
        print sets
        already_together.update(pairings)
This prints:
~$ time python test_combos.py
((1, 2, 3), (4, 5, 6), (7, 8, 9))
((1, 4, 7), (2, 5, 8), (3, 6, 9))
((1, 5, 9), (2, 6, 7), (3, 4, 8))
((1, 6, 8), (2, 4, 9), (3, 5, 7))
real 0m0.182s
user 0m0.164s
sys 0m0.012s

Try this:
from itertools import permutations
lst = list(range(1, 10))
n = 3
triplets = list(permutations(lst, n))
triplets = [set(x) for x in triplets]
def array_unique(seq):
    checked = []
    for x in seq:
        if x not in checked:
            checked.append(x)
    return checked
triplets = array_unique(triplets)
result = []
m = n * 3
for x in triplets:
    for y in triplets:
        for z in triplets:
            if len(x.union(y.union(z))) == m:
                result += [[x, y, z]]
def groups(sets, i):
    result = [sets[i]]
    for x in sets:
        flag = True
        for y in result:
            for r in x:
                for p in y:
                    if len(r.intersection(p)) >= 2:
                        flag = False
                        break
                    else:
                        continue
                if flag == False:
                    break
        if flag == True:
            result.append(x)
    return result
for i in range(len(result)):
    print('%d:' % (i + 1))
    for x in groups(result, i):
        print(x)
Output for n = 10:
http://pastebin.com/Vm54HRq3

Here's my attempt of a fairly general solution to your problem.
from itertools import combinations
n = 3
l = range(1, 10)
def f(l, n, used, top):
    if len(l) == n:
        if all(set(x) not in used for x in combinations(l, 2)):
            yield [l]
    else:
        for group in combinations(l, n):
            if any(set(x) in used for x in combinations(group, 2)):
                continue
            for rest in f([i for i in l if i not in group], n, used, False):
                config = [list(group)] + rest
                if top:
                    # Running at top level, this is a valid
                    # configuration. Update used list.
                    for c in config:
                        used.extend(set(x) for x in combinations(c, 2))
                yield config
                break
for i in f(l, n, [], True):
    print i
However, it is very slow for high values of n, too slow for n=31. I don't have time right now to try to improve the speed, but I might try later. Suggestions are welcome!

My wife had this problem trying to arrange breakout groups for a meeting with nine people; she wanted no pairs of attendees to repeat.
I immediately busted out itertools and was stumped and came to StackOverflow. But in the meantime, my non-programmer wife solved it visually. The key insight is to create a tic-tac-toe grid:
1 2 3
4 5 6
7 8 9
Then simply take 3 groups going across, 3 groups going down, 3 groups going diagonally (wrapping around), and 3 groups going diagonally the other way (also wrapping around).
You can do it just in your head then.
- : 123,456,789
| : 147,258,369
\ : 159,267,348
/ : 168,249,357
I suppose the next question is how far can you take a visual method like this? Does it rely on the coincidence that the desired subset size * the number of subsets = the number of total elements?
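One way to explore that question is to generate the rows, columns and wrapped diagonals programmatically. The sketch below assumes a p-by-p grid where p is prime; with that assumption the "slopes" 0 to p-1 plus the rows give p + 1 rounds in which no pair repeats. The function name grid_rounds is made up for this example, and it does not cover the asker's n=31 case, which is not a square grid.

```python
from itertools import combinations

def grid_rounds(p):
    """Return p + 1 rounds of p groups each for people numbered 1..p*p.

    Assumes p is prime; for other grid sizes pairs may repeat.
    """
    def cell(r, c):
        return r * p + c + 1  # number written at row r, column c

    # Round 0: the rows of the grid.
    rounds = [[[cell(r, c) for c in range(p)] for r in range(p)]]
    # One round per slope m: group k takes column (k + m*r) mod p in row r.
    # m = 0 gives the columns, other m give the wrapped diagonals.
    for m in range(p):
        rounds.append([[cell(r, (k + m * r) % p) for r in range(p)]
                       for k in range(p)])
    return rounds

for groups in grid_rounds(3):
    print(groups)
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
# [[1, 5, 9], [2, 6, 7], [3, 4, 8]]
# [[1, 6, 8], [2, 4, 9], [3, 5, 7]]
```

For p = 3 this reproduces the four rounds listed above, and the 36 pairs it produces are exactly the 36 possible pairs of nine people.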

Better algorithm to riffle shuffle (or interleave) multiple lists of varying lengths

I like to watch my favorite TV shows on the go. I have all episodes of each show I'm following in my playlist. Not all shows consist of the same number of episodes. Unlike some who prefer marathons, I like to interleave episodes of one show with those of another.
For example, if I have a show called ABC with 2 episodes, and a show called XYZ with 4 episodes, I would like my playlist to look like:
XYZe1.mp4
ABCe1.mp4
XYZe2.mp4
XYZe3.mp4
ABCe2.mp4
XYZe4.mp4
One way to generate this interleaved playlist is to represent each show as a list of episodes and perform a riffle shuffle on all shows. One could write a function that would compute, for each episode, its position on a unit-time interval (between 0.0 and 1.0 exclusive, 0.0 being beginning of season, 1.0 being end of season), then sort all episodes according to their position.
I wrote the following simple function in Python 2.7 to perform an in-shuffle:
def riffle_shuffle(piles_list):
    scored_pile = ((((item_position + 0.5) / len(pile), len(piles_list) - pile_position), item) for pile_position, pile in enumerate(piles_list) for item_position, item in enumerate(pile))
    shuffled_pile = [item for score, item in sorted(scored_pile)]
    return shuffled_pile
To get the playlist for the above example, I simply need to call:
riffle_shuffle([['ABCe1.mp4', 'ABCe2.mp4'], ['XYZe1.mp4', 'XYZe2.mp4', 'XYZe3.mp4', 'XYZe4.mp4']])
This works fairly well most of the time. However, there are cases where results are non-optimal--two adjacent entries in the playlist are episodes from the same show. For example:
>>> riffle_shuffle([['ABCe1', 'ABCe2'], ['LMNe1', 'LMNe2', 'LMNe3'], ['XYZe1', 'XYZe2', 'XYZe3', 'XYZe4', 'XYZe5']])
['XYZe1', 'LMNe1', 'ABCe1', 'XYZe2', 'XYZe3', 'LMNe2', 'XYZe4', 'ABCe2', 'LMNe3', 'XYZe5']
Notice that there are two episodes of 'XYZ' that appear side-by-side. This situation can be fixed trivially (manually swap 'ABCe1' with 'XYZe2').
I am curious to know if there are better ways to interleave, or perform riffle shuffle, on multiple lists of varying lengths. I would like to know if you have solutions that are simpler, more efficient, or just plain elegant.
Solution proposed by belisarius (thanks!):
import itertools
def riffle_shuffle_belisarius(piles_list):
    def grouper(n, iterable, fillvalue=None):
        args = [iter(iterable)] * n
        return itertools.izip_longest(fillvalue=fillvalue, *args)
    if not piles_list:
        return []
    piles_list.sort(key=len, reverse=True)
    width = len(piles_list[0])
    pile_iters_list = [iter(pile) for pile in piles_list]
    pile_sizes_list = [[pile_position] * len(pile) for pile_position, pile in enumerate(piles_list)]
    grouped_rows = grouper(width, itertools.chain.from_iterable(pile_sizes_list))
    grouped_columns = itertools.izip_longest(*grouped_rows)
    shuffled_pile = [pile_iters_list[position].next() for position in itertools.chain.from_iterable(grouped_columns) if position is not None]
    return shuffled_pile
Example run:
>>> riffle_shuffle_belisarius([['ABCe1', 'ABCe2'], ['LMNe1', 'LMNe2', 'LMNe3'], ['XYZe1', 'XYZe2', 'XYZe3', 'XYZe4', 'XYZe5']])
['XYZe1', 'LMNe1', 'XYZe2', 'LMNe2', 'XYZe3', 'LMNe3', 'XYZe4', 'ABCe1', 'XYZe5', 'ABCe2']

A deterministic solution (i.e. not random)
Sort your shows by decreasing number of episodes.
Take the longest show and arrange a matrix whose number of columns equals its number of episodes, filled in the following way:
A A A A A A   <- First show consists of 6 episodes
B B B B C C   <- Second and third shows - 4 episodes each
C C D D       <- Fourth show - 2 episodes
Then collect by columns
{A,B,C}, {A,B,C}, {A,B,D}, {A,B,D}, {A,C}, {A,C}
Then Join
{A,B,C,A,B,C,A,B,D,A,B,D,A,C,A,C}
And now assign sequential numbers
{A1, B1, C1, A2, B2, C2, A3, B3, D1, A4, B4, D2, A5, C3, A6, C4}
Edit
Your case
[['A'] * 2, ['L'] * 3, ['X'] * 5]
X X X X X
L L L A A
-> {X1, L1, X2, L2, X3, L3, X4, A1, X5, A2}
Edit 2
As I have no Python here, perhaps some Mathematica code may be of use:
l = {, , ,}; (* Prepare input *)
l[[1]] = {a, a, a, a, a, a};
l[[2]] = {b, b, b, b};
l[[3]] = {c, c, c, c};
l[[4]] = {d, d};
le = Length@First@l;
k = DeleteCases[ (*Make the matrix*)
Flatten@Transpose@Partition[Flatten[l], le, le, 1, {Null}], Null];
Table[r[i] = 1, {i, k}]; (*init counters*)
ReplaceAll[#, x_ :> x[r[x]++]] & /@ k (*assign numbers*)
->{a[1], b[1], c[1], a[2], b[2], c[2], a[3], b[3], d[1], a[4], b[4],
d[2], a[5], c[3], a[6], c[4]}
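The matrix method described in this answer can also be sketched in Python 3. The function name interleave_by_columns is hypothetical, and unlike the version in the question this sketch sorts a copy rather than the caller's list.

```python
from itertools import zip_longest

def interleave_by_columns(piles):
    """Fill a matrix row by row with pile labels, then read it by columns."""
    piles = sorted(piles, key=len, reverse=True)  # longest show first
    if not piles:
        return []
    width = len(piles[0])  # one column per episode of the longest show
    # One label per episode, e.g. [0, 0, 0, 0, 0, 1, 1, 1, 2, 2].
    labels = [index for index, pile in enumerate(piles) for _ in pile]
    rows = [labels[i:i + width] for i in range(0, len(labels), width)]
    episodes = [iter(pile) for pile in piles]
    # zip_longest pads a short last row with None; skip those cells.
    return [next(episodes[index])
            for column in zip_longest(*rows)
            for index in column if index is not None]

shows = [['ABCe1', 'ABCe2'],
         ['LMNe1', 'LMNe2', 'LMNe3'],
         ['XYZe1', 'XYZe2', 'XYZe3', 'XYZe4', 'XYZe5']]
print(interleave_by_columns(shows))
# ['XYZe1', 'LMNe1', 'XYZe2', 'LMNe2', 'XYZe3', 'LMNe3', 'XYZe4', 'ABCe1', 'XYZe5', 'ABCe2']
```

Taking episodes from per-show iterators is what assigns the sequential numbers: each time a label comes up, the next unused episode of that show is emitted.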

My try:
program, play = [['ABCe1.mp4', 'ABCe2.mp4'],
                 ['XYZe1.mp4', 'XYZe2.mp4', 'XYZe3.mp4', 'XYZe4.mp4',
                  'XYZe5.mp4', 'XYZe6.mp4', 'XYZe7.mp4'],
                 ['OTHERe1.mp4', 'OTHERe2.mp4']], []
start_part = 3
while any(program):
    m = max(program, key = len)
    if (len(play) >1 and
        play[-1][:start_part] != m[0][:start_part] and
        play[-2].startswith(play[-1][:start_part])):
        play.insert(-1, m.pop(0))
    else:
        play.append(m.pop(0))
print play

This would ensure that there is at least 1 and no more than 2 other episodes between two successive episodes of a show:
While there are more than 3 shows, chain the two shortest shows (i.e. those having the fewest episodes) together end-to-end.
Let A be the longest show and B and C the other two.
If B is shorter than A, pad it with Nones at the end.
If C is shorter than A, pad it with Nones at the beginning.
The shuffled playlist is [x for x in itertools.chain.from_iterable(zip(A, B, C)) if x is not None]
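A minimal Python 3 sketch of these steps follows. The names pad and three_way_interleave are hypothetical, and two details are assumptions not stated in the answer: B is taken to be the longer of the two remaining shows, and no episode is itself None.

```python
from itertools import chain

def pad(show, length, at_end):
    """Pad show with None up to length, at the end or at the beginning."""
    filler = [None] * (length - len(show))
    return show + filler if at_end else filler + show

def three_way_interleave(shows):
    shows = sorted((list(s) for s in shows), key=len)
    # Chain the two shortest shows end-to-end until at most three remain.
    while len(shows) > 3:
        merged = shows[0] + shows[1]
        shows = sorted([merged] + shows[2:], key=len)
    # With fewer than three shows, add empty ones so the zip still works.
    while len(shows) < 3:
        shows.insert(0, [])
    c, b, a = shows  # ascending by length, so a is the longest
    b = pad(b, len(a), at_end=True)
    c = pad(c, len(a), at_end=False)
    # Flatten the (A, B, C) triples and drop the padding.
    return [x for x in chain.from_iterable(zip(a, b, c)) if x is not None]

shows = [['ABCe1', 'ABCe2'],
         ['LMNe1', 'LMNe2', 'LMNe3'],
         ['XYZe1', 'XYZe2', 'XYZe3', 'XYZe4', 'XYZe5']]
print(three_way_interleave(shows))
# ['XYZe1', 'LMNe1', 'XYZe2', 'LMNe2', 'XYZe3', 'LMNe3', 'XYZe4', 'ABCe1', 'XYZe5', 'ABCe2']
```

Note that chain.from_iterable is needed to flatten the tuples produced by zip; chaining the zip object itself would yield whole tuples.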

This gives a true shuffle, i.e. a different result each time, and avoids contiguous items from the same playlist as far as possible.
The approach you asked about can probably return only one or two distinct results, given your constraints.
from random import choice, randint
from functools import reduce  # reduce is not a builtin in Python 3
from operator import add
def randpop(playlists):
    pl = choice(playlists)
    el = pl.pop(randint(0, len(pl) -1))
    return pl, el
def shuffle(playlists):
    curr_pl = None
    while any(playlists):
        try:
            curr_pl, el = randpop([pl for pl in playlists if pl and pl != curr_pl])
        except IndexError:
            break
        else:
            yield el
    # yield whatever is left over; the generator then simply ends
    for el in reduce(add, playlists):
        yield el
if __name__ == "__main__":
    sample = [
        'A1 A2 A3 A4'.split(),
        'B1 B2 B3 B4 B5'.split(),
        'X1 X2 X3 X4 X5 X6'.split()
    ]
    for el in shuffle(sample):
        print(el)
Edit:
If episode order must be preserved, simplify the randpop function so that it always takes the first element:
def randpop(playlists):
    pl = choice(playlists)
    el = pl.pop(0)
    return pl, el
