I am trying to count the frequency of word occurrences in a variable. The variable contains more than 700,000 observations. The output should be a dictionary with the words that occurred the most. I used the code below to do this:
d1 = {}
for i in range(len(words)-1):
    x = words[i]
    c = 0
    for j in range(i, len(words)):
        c = words.count(x)
    count = dict({x: c})
    if x not in d1.keys():
        d1.update(count)
I've run the code for the first 1,000 observations and it worked perfectly. The output is shown below:
[('semantic', 23),
('representations', 11),
('models', 10),
('task', 10),
('data', 9),
('parser', 9),
('language', 8),
('languages', 8),
('paper', 8),
('meaning', 8),
('rules', 8),
('results', 7),
('performance', 7),
('parsing', 7),
('systems', 7),
('neural', 6),
('tasks', 6),
('entailment', 6),
('generic', 6),
('te', 6),
('natural', 5),
('method', 5),
('approaches', 5)]
When I try to run it for 100,000 observations, it keeps running. I've let it run for more than 24 hours and it still doesn't finish. Does anyone have an idea?
You can use collections.Counter.
from collections import Counter
counts = Counter(words)
print(counts.most_common(20))
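Counter also makes it easy to sanity-check on a small input first. A minimal sketch with a made-up word list (not your data):

```python
from collections import Counter

# Made-up sample; your 700,000-observation `words` list would go here.
words = ["semantic", "parser", "semantic", "task", "semantic", "parser"]

counts = Counter(words)
print(counts.most_common(2))  # → [('semantic', 3), ('parser', 2)]
```

Counter tallies everything in a single pass over the list, which is why it stays fast even for hundreds of thousands of items.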
@Jon's answer is the best in your case; however, in some cases collections.Counter will be slower than plain iteration (especially if you don't need to sort by frequency afterwards), as I asked in this question.
You can count frequencies by iteration.
d1 = {}
for item in words:
    if item in d1:
        d1[item] += 1
    else:
        d1[item] = 1
# finally sort the dictionary of frequencies
print(dict(sorted(d1.items(), key=lambda item: item[1])))
But again, for your case, @Jon's answer is faster and more compact.
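If you want to compare the two approaches on your own data, a rough timeit sketch (with a made-up word list, not your corpus) could look like this:

```python
import timeit
from collections import Counter

words = ["semantic", "parser", "task", "data"] * 1000  # made-up corpus

def manual_count(ws):
    # One-pass dictionary counting, as in the loop above
    d = {}
    for w in ws:
        if w in d:
            d[w] += 1
        else:
            d[w] = 1
    return d

# Sanity check: both approaches agree on the counts.
assert manual_count(words) == dict(Counter(words))

print("Counter:", timeit.timeit(lambda: Counter(words), number=100))
print("manual :", timeit.timeit(lambda: manual_count(words), number=100))
```

Actual timings will vary with Python version and data, so it's worth measuring rather than assuming.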
#...
for i in range(len(words)-1):
    #...
    #...
    for j in range(i, len(words)):
        c = words.count(x)
    #...
    if x not in d1.keys():
        #...
I've tried to highlight the problems your code is having above. In English, this looks something like:
"Count the number of occurrences of each word after the word I'm looking at, repeatedly, for every word in the whole list. Also, look through the whole dictionary I'm building again for every word in the list, while I'm building it."
This is far more work than you need to do; you only need to look at each word in the list once. You do need to look in the dictionary once for every word, but looking at d1.keys() makes this far slower by converting the dictionary's keys to another list and searching through the whole thing. The following code will do what you want, much more quickly:
words = ['able', 'baker', 'charlie', 'dog', 'easy', 'able', 'charlie', 'dog', 'dog']
word_counts = {}
# Look at each word in our list once
for word in words:
    # If we haven't seen it before, create a new count in our dictionary
    if word not in word_counts:
        word_counts[word] = 0
    # We've made sure our count exists, so just increment it by 1
    word_counts[word] += 1
print(word_counts.items())
The above example will give:
[
('charlie', 2),
('baker', 1),
('able', 2),
('dog', 3),
('easy', 1)
]
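A variant of the same one-pass idea, using collections.defaultdict to avoid the explicit membership check (my addition, not part of the answer above):

```python
from collections import defaultdict

words = ['able', 'baker', 'charlie', 'dog', 'easy', 'able', 'charlie', 'dog', 'dog']

word_counts = defaultdict(int)  # missing keys start at 0
for word in words:
    word_counts[word] += 1

print(dict(word_counts))
# → {'able': 2, 'baker': 1, 'charlie': 2, 'dog': 3, 'easy': 1}
```

defaultdict(int) supplies the initial zero automatically, so the loop body is a single increment.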
Say I have the following tuples.
dummy = [("text", 10), ("This is the Sentence", 20),
         ("that I Want", 20), ("to Get", 20),
         ("text", 8), ("text", 6)]
I want to get "This is the Sentence that I Want to Get" and ignore the rest. The texts I want always have the largest value (in this case it's 20), and they are next to each other. Basically, it should only collect the tuples with the max value that are next to each other.
With the following code I only get the first max tuple; it ignores the rest.
from operator import itemgetter
max(dummy, key=itemgetter(1))
How do I make it that it will get all other max values?
Why not get the max value from the dict's values and filter by it:
m_value = max(dict(dummy).values())
" ".join([x for x, n in dummy if n == m_value])
my result is:
'This is the Sentence that I Want to Get'
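One caveat with `dict(dummy)` (not raised in the answer above): duplicate keys keep only their last value, so with unlucky data the true maximum could be dropped before `max` sees it. Taking the max over the tuples directly sidesteps that:

```python
dummy = [("text", 10), ("This is the Sentence", 20),
         ("that I Want", 20), ("to Get", 20),
         ("text", 8), ("text", 6)]

# Max over the values themselves, so duplicate keys can't hide anything
m_value = max(n for _, n in dummy)
print(" ".join(x for x, n in dummy if n == m_value))
# → This is the Sentence that I Want to Get
```

For the sample data both versions give the same result; the difference only shows up when a duplicated key carries the maximum value.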
Something like this?
>>> import numpy as np
>>> t = np.array([d[0] for d in dummy])
>>> v = np.array([d[1] for d in dummy])
>>> print(t[v==v.max()])
['This is the Sentence' 'that I Want' 'to Get']
Here's my approach:
from operator import itemgetter

dummy = [("text", 10), ("This is the Sentence", 20),
         ("that I Want", 20), ("to Get", 20),
         ("text", 8), ("text", 6)]

max_num = max(dummy, key=itemgetter(1))[1]
text_blocks = [text for text, num in dummy if num == max_num]
sentence = ' '.join(text_blocks)
print(sentence)
# This is the Sentence that I Want to Get
You could improve the code further by using namedtuples for dummy's items.
This will work for you.
from operator import itemgetter

# find the max value
max_sen, max_val = max(dummy, key=itemgetter(1))
# filter based on max value and join
" ".join([x[0] for x in filter(lambda x: x[1] == max_val, dummy)])
Why not a Pandas one-liner, with a walrus operator as a bonus?
import pandas as pd
' '.join((df := pd.DataFrame(dummy)).loc[df[1] == max(df[1]), 0])
output
'This is the Sentence that I Want to Get'
My initial assumption seems to be wrong. Don't use the implementations in this answer; I'm leaving it here for informational purposes.
Most answers here use sort or max or similar approaches, which iterate through the data twice. I believed that was unnecessary, and that the implementations below would give the desired output with a single pass through the data.
However, the answers I added seem to perform worse, especially answer 2, which builds a string. I believe the culprit is reassigning the string every time.
Also, list comprehensions seem to perform far better than appending, making answer 1 comparatively slower than some other answers as well.
To verify this, you can try this code snippet https://gist.github.com/RitwikGopi/1b36a900219e7c087c95baa99fdf65e2#file-test-py
answer 1:
dummy = [("text", 10), ("This is the Sentence", 20),
         ("that I Want", 20), ("to Get", 20),
         ("text", 8), ("text", 6)]

max_val = float("-inf")
max_data = []
for data, value in dummy:
    if value > max_val:
        max_val = value
        max_data = [data]
    elif value == max_val:
        max_data.append(data)
    else:
        continue
print(max_data)
answer 2:
dummy = [("text", 10), ("This is the Sentence", 20),
         ("that I Want", 20), ("to Get", 20),
         ("text", 8), ("text", 6)]

max_val = float("-inf")
max_data = ""
for data, value in dummy:
    if value > max_val:
        max_val = value
        max_data = data
    elif value == max_val:
        max_data += " " + data
    else:
        continue
print(max_data)
max_val = max(dict(dummy).values())
[i for i, j in dict(dummy).items() if j == max_val]
This would give the entries with the max value. Example result:
['This is the Sentence', 'that I Want', 'to Get']
I have the following problem: I run parameter tests, and for every parameter combination I create a new object, which is replaced by the next object created with other parameters. The object has a Jaccard coefficient attribute and an ID attribute. In every step I want to store the Jaccard coefficient of the object. At the end I want the top ten Jaccard coefficients and their corresponding IDs.
r = ["%.2f" % r for r in np.arange(3, 5, 1)]
fs = ["%.2f" % fs for fs in np.arange(2, 5, 1)]
co = ["%.2f" % co for co in np.arange(1, 5, 1)]
frc_networks = []
bestJC = []
bestPercent = []
best10Candidates = []
count = 0
for parameters in itertools.product(r, fs, co):
    args = parser.parse_args(["path1.csv", "path2.csv", "--r", parameters[0], "--fs", parameters[1], "--co", parameters[2]])
    if not os.path.isfile('FCR_Network_Coordinates_ID_{}_r_{}_x_{}_y_{}_z_{}_fcr_{}_co_{}_1.csv'.format(count, args.r, args.x, args.y, args.z, args.fs, args.co)):
        FRC_Network(count, args.p[0], args.p[1], args.x, args.y, args.z, args.r, args.fs, args.co)
The attributes can be called by FRC_Network.ID and FRC_Network.JC
I think I'd use heapq.heappushpop() for this. That way, no matter how large your input set is, your data requirement is limited to a list of 10 tuples.
Note the use of tuples to keep the JC and ID parameters together. Since tuple comparisons are lexicographic, this will always sort by JC first.
Also, note that the final call to .sort() is optional. If you just want the ten best, skip the call. If you want the ten best in order, keep the call.
import heapq

#UNTESTED
best = []
for parameters in itertools.product(r, fs, co):
    # ...
    if len(best) < 10:
        heapq.heappush(best, (FRC_Network.JC, FRC_Network.ID))
    else:
        heapq.heappushpop(best, (FRC_Network.JC, FRC_Network.ID))
best.sort(reverse=True)
Here is a tested version that demonstrates the concept:
import heapq
import random
from pprint import pprint

best = []
for ID in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ':
    JC = random.randint(0, 100)
    if len(best) < 10:
        heapq.heappush(best, (JC, ID))
    else:
        heapq.heappushpop(best, (JC, ID))
pprint(best)
Result:
[(81, 'E'),
(82, 'd'),
(83, 'G'),
(92, 'i'),
(95, 'Z'),
(100, 'p'),
(89, 'q'),
(98, 'a'),
(96, 'z'),
(97, 'O')]
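If you only need the final top ten and not the incremental push/pop behavior, heapq.nlargest can do the selection in one call. A sketch on made-up (JC, ID) pairs:

```python
import heapq
import random

random.seed(0)  # reproducible made-up data
pairs = [(random.randint(0, 100), ID) for ID in 'abcdefghijklmnopqrstuvwxyz']

# Ten largest (JC, ID) tuples, already sorted in descending order
best = heapq.nlargest(10, pairs)
print(best)
```

nlargest keeps at most 10 items in memory at a time, so it has the same bounded-memory property as the heappushpop loop above.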
So my txt file looks like this:
68,125
113,69
65,86
108,149
152,53
78,90
54,160
20,137
107,90
48,12
I need to read this file and then put it into a list of (x, y) coordinate tuples.
My output should be
[(68, 125), (113, 69), (65, 86), (108, 149), (152, 53), (78, 90), (54, 160), (20, 137), (107, 90), (48, 12)]
I am stuck on how to do this. I need to use basic Python only.
Edit:
My attempt so far is this
numbers = []
input_file = open(filename, 'r')
numbers_list = input_file.readlines()
input_file.close()
for i in numbers_list:
    numbers += [i]
return numbers
My output returns as this:
['68,125\n', '113,69\n', '65,86\n', '108,149\n', '152,53\n', '78,90\n', '54,160\n', '20,137\n', '107,90\n', '48,12\n']
How do I get rid of the '\n', and how can I put each individual element of the list into a tuple? Thank you. My mistake for not adding my attempt.
Read the content of the file line by line.
Strip the newline from each string.
Then convert each string into a tuple by splitting on the comma.
Below is the code, with a text file input having the content you posted and the result you expected.
import sys

def test(filename):
    f = open(filename)
    lines = f.readlines()
    lines = [item.rstrip("\n") for item in lines]
    newList = list()
    for item in lines:
        item = item.split(",")
        item = tuple(int(items) for items in item)
        newList.append(item)
    f.close()
    print(newList)

if __name__ == "__main__":
    test(sys.argv[1])
O/P:
techie#gateway2:myExperiments$ python test.py /export/home/techie/myExperiments/test.txt
[(68, 125), (113, 69), (65, 86), (108, 149), (152, 53), (78, 90), (54, 160), (20, 137), (107, 90), (48, 12)]
Hope this will help. :-)
Here are 3- and 2-line answers:
with open("my_txt_file") as f:
    lines = f.readlines()
result = [tuple(int(s) for s in line.strip().split(",")) for line in lines]
Better, as Ilja Everilä pointed out, use the open file as an iterator:
with open("my_txt_file") as f:
    result = [tuple(int(s) for s in line.strip().split(",")) for line in f]
As your file contains comma separated integer values, you could use the csv module to handle it:
import csv

with open(filename, newline='') as f:
    reader = csv.reader(f)
    numbers = [tuple(map(int, row)) for row in reader]
I have a list of tuples where each tuple is a (start-time, end-time). I am trying to merge all overlapping time ranges and return a list of distinct time ranges.
For example
[(1, 5), (2, 4), (3, 6)] ---> [(1,6)]
[(1, 3), (2, 4), (5, 8)] ---> [(1, 4), (5,8)]
Here is how I implemented it.
# Algorithm
# initialranges: [(a,b), (c,d), (e,f), ...]
# First we sort each tuple, then the whole list.
# This will ensure that a<b, c<d, e<f ... and a < c < e ...
# BUT the order of b, d, f ... is still random.
# Now we have only 3 possibilities:
#================================================
# b<c<d:   a-------b               Ans: [(a,b),(c,d)]
#                      c---d
# c<=b<d:  a-------b               Ans: [(a,d)]
#                c---d
# c<d<b:   a-------b               Ans: [(a,b)]
#             c---d
#================================================
def mergeoverlapping(initialranges):
    i = sorted(set([tuple(sorted(x)) for x in initialranges]))
    # initialize final ranges to [(a,b)]
    f = [i[0]]
    for c, d in i[1:]:
        a, b = f[-1]
        if c <= b < d:
            f[-1] = a, d
        elif b < c < d:
            f.append((c, d))
        else:
            # else case included for clarity. Since
            # we already sorted the tuples and the list,
            # the only remaining possibility is c<d<b,
            # in which case we can silently pass
            pass
    return f
I am trying to figure out:
Is there a built-in function in some Python module that can do this more efficiently? Or
Is there a more Pythonic way of accomplishing the same goal?
Your help is appreciated. Thanks!
A few ways to make it more efficient and Pythonic:
Eliminate the set() construction, since the algorithm should prune out duplicates during the main loop.
If you just need to iterate over the results, use yield to generate the values.
Reduce construction of intermediate objects. For example, move the tuple() call to the point where the final values are produced, saving you from having to construct and throw away extra tuples, and reuse a list saved for storing the current time range for comparison.
Code:
def merge(times):
    saved = list(times[0])
    for st, en in sorted([sorted(t) for t in times]):
        if st <= saved[1]:
            saved[1] = max(saved[1], en)
        else:
            yield tuple(saved)
            saved[0] = st
            saved[1] = en
    yield tuple(saved)

data = [
    [(1, 5), (2, 4), (3, 6)],
    [(1, 3), (2, 4), (5, 8)],
]
for times in data:
    print(list(merge(times)))
Sort each tuple, then the list; whenever t1.right >= t2.left, merge the pair and restart with the new list:
def f(l, sort=True):
    if sort:
        sl = sorted(tuple(sorted(i)) for i in l)
    else:
        sl = l
    for k in range(len(sl) - 1):
        if sl[k][1] >= sl[k + 1][0]:
            # merge the overlapping pair, then restart with the shorter list
            sl[k] = (sl[k][0], max(sl[k][1], sl[k + 1][1]))
            del sl[k + 1]
            return f(sl, False)
    return sl
The sort part: use standard sorting, it compares tuples the right way already.
sorted_tuples = sorted(initial_ranges)
The merge part. It eliminates duplicate ranges, too, so no need for a set. Suppose you have current_tuple and next_tuple.
c_start, c_end = current_tuple
n_start, n_end = next_tuple
if n_start <= c_end:
    merged_tuple = min(c_start, n_start), max(c_end, n_end)
I hope the logic is clear enough.
To peek at the next tuple, you can use indexed access into the sorted tuples; it's a fully known sequence anyway.
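Assembling those pieces into a full loop, one possible sketch (my wording of the idea above, not code from the answer) is:

```python
def merge_ranges(initial_ranges):
    # Standard sort compares tuples the right way already
    sorted_tuples = sorted(initial_ranges)
    merged = [sorted_tuples[0]]
    for n_start, n_end in sorted_tuples[1:]:
        c_start, c_end = merged[-1]
        if n_start <= c_end:
            # overlap: extend the current range
            merged[-1] = min(c_start, n_start), max(c_end, n_end)
        else:
            merged.append((n_start, n_end))
    return merged

print(merge_ranges([(1, 5), (2, 4), (3, 6)]))  # → [(1, 6)]
print(merge_ranges([(1, 3), (2, 4), (5, 8)]))  # → [(1, 4), (5, 8)]
```

Duplicate ranges are absorbed by the overlap branch, so no separate de-duplication step is needed.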
Sort all boundaries then take all pairs where a boundary end is followed by a boundary start.
def mergeOverlapping(initialranges):
    def allBoundaries():
        for r in initialranges:
            yield r[0], True
            yield r[1], False

    def getBoundaries(boundaries):
        yield boundaries[0][0]
        for i in range(1, len(boundaries) - 1):
            if not boundaries[i][1] and boundaries[i + 1][1]:
                yield boundaries[i][0]
                yield boundaries[i + 1][0]
        yield boundaries[-1][0]

    return getBoundaries(sorted(allBoundaries()))
Hm, not that beautiful but was fun to write at least!
EDIT: Years later, after an upvote, I realised my code was wrong! This is the new version just for fun:
def mergeOverlapping(initialRanges):
    def allBoundaries():
        for r in initialRanges:
            yield r[0], -1
            yield r[1], 1

    def getBoundaries(boundaries):
        openrange = 0
        for value, boundary in boundaries:
            if not openrange:
                yield value
            openrange += boundary
            if not openrange:
                yield value

    def outputAsRanges(b):
        while b:
            yield (b.next(), b.next())

    return outputAsRanges(getBoundaries(sorted(allBoundaries())))
Basically I mark the boundaries with -1 or 1, sort them by value, and only output a value when the balance between open and close boundaries is zero.
Late, but this might help someone looking for it. I had a similar problem, but with dictionaries. Given a list of time ranges, I wanted to find overlaps and merge them where possible. A little modification to @samplebias's answer led me to this:
Merge function:
def merge_range(ranges: list, start_key: str, end_key: str):
    ranges = sorted(ranges, key=lambda x: x[start_key])
    saved = dict(ranges[0])
    for range_set in ranges:
        if range_set[start_key] <= saved[end_key]:
            saved[end_key] = max(saved[end_key], range_set[end_key])
        else:
            yield dict(saved)
            saved[start_key] = range_set[start_key]
            saved[end_key] = range_set[end_key]
    yield dict(saved)
Data:
data = [
    {'start_time': '09:00:00', 'end_time': '11:30:00'},
    {'start_time': '15:00:00', 'end_time': '15:30:00'},
    {'start_time': '11:00:00', 'end_time': '14:30:00'},
    {'start_time': '09:30:00', 'end_time': '14:00:00'}
]
Execution:
print(list(merge_range(ranges=data, start_key='start_time', end_key='end_time')))
Output:
[
{'start_time': '09:00:00', 'end_time': '14:30:00'},
{'start_time': '15:00:00', 'end_time': '15:30:00'}
]
When using Python 3.7, following the suggestion given in "'RuntimeError: generator raised StopIteration' every time I try to run app", the method outputAsRanges from @UncleZeiv should be:
def outputAsRanges(b):
    while b:
        try:
            yield (next(b), next(b))
        except StopIteration:
            return
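For reference, a complete Python 3 sketch of @UncleZeiv's boundary-counting approach with this fix applied (assembled here for convenience, not part of the original answers) might look like:

```python
def merge_overlapping(initial_ranges):
    def all_boundaries():
        for start, end in initial_ranges:
            yield start, -1  # range opens
            yield end, 1     # range closes

    def get_boundaries(boundaries):
        open_count = 0
        for value, boundary in boundaries:
            if not open_count:
                yield value          # a new merged range starts here
            open_count += boundary
            if not open_count:
                yield value          # balance back to zero: range ends

    def output_as_ranges(b):
        while True:
            try:
                yield (next(b), next(b))
            except StopIteration:
                return

    return output_as_ranges(get_boundaries(sorted(all_boundaries())))

print(list(merge_overlapping([(1, 5), (2, 4), (3, 6)])))  # → [(1, 6)]
print(list(merge_overlapping([(1, 3), (2, 4), (5, 8)])))  # → [(1, 4), (5, 8)]
```

Because opening boundaries sort before closing ones at the same value, touching ranges such as (1, 3) and (3, 5) are merged as well.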