How to append only unique values to a key in a dictionary? - python

Sorry, this is likely a complete noob question; I'm new to Python and have been unable to get any of the suggestions I found online to actually work. I need to decrease the run time of the code for larger files, so I need to reduce the number of iterations I'm doing.
How do I modify the append_value function below so that it appends only UNIQUE values to dict_obj, removing the need for another series of iterations to do this later on?
EDIT: Sorry, here is an example input/output
Sample Input:
6
5 6
0 1
1 4
5 4
1 2
4 0
Sample Output:
1
4
I'm attempting to solve:
http://orac.amt.edu.au/cgi-bin/train/problem.pl?problemid=416
Output Result
input_file = open("listin.txt", "r")
output_file = open("listout.txt", "w")
ls = []
n = int(input_file.readline())
for i in range(n):
    a, b = input_file.readline().split()
    ls.append(int(a))
    ls.append(int(b))
def append_value(dict_obj, key, value): # How to append only UNIQUE values to
    if key in dict_obj:                  # dict_obj?
        if not isinstance(dict_obj[key], list):
            dict_obj[key] = [dict_obj[key]]
        dict_obj[key].append(value)
    else:
        dict_obj[key] = value
mx = []
ls.sort()
Dict = {}
for i in range(len(ls)):
    c = ls.count(ls[i])
    append_value(Dict, int(c), ls[i])
    mx.append(c)
x = max(mx)
lss = []
list_set = set(Dict[x]) #To remove the need for this
unique_list = (list(list_set))
for x in unique_list:
    lss.append(x)
lsss = sorted(lss)
for i in lsss:
    output_file.write(str(i) + "\n")
output_file.close()
input_file.close()
Thank you

The answer to your question, 'how to only append unique values to this container' is fairly simple: change it from a list to a set (as @ShadowRanger suggested in the comments). This isn't really a question about dictionaries, though; you're not appending values to 'dict_obj', only to a list stored in the dictionary.
Since the source you linked to shows this is a training problem for people newer to coding, you should know that changing the lists to sets might be a good idea, but it's not the cause of the performance issues.
The problem boils down to: given a file containing a list of integers, print the most common integer(s). Your current code iterates over the list, and for each index i, iterates over the entire list to count matches with ls[i] (this is the line c = ls.count(ls[i])).
Some operations are more expensive than others: calling count() is one of the more expensive operations on a Python list. It reads through the entire list every time it's called. This is an O(n) function, which is inside a length n loop, taking O(n^2) time. All of the set() filtering for non-unique elements takes O(n) time total (and is even quite fast in practice). Identifying linear-time functions hidden in loops like this is a frequent theme in optimization, but profiling your code would have identified this.
In general, you'll want to use something like the Counter class in Python's standard library for frequency counting. That kind of defeats the whole point of this training problem, though, which is to encourage you to improve on the brute-force algorithm for finding the most frequent element(s) in a list. One possible way to solve this problem is to read the description of Counter, and try to mimic its behavior yourself with a plain Python dictionary.
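As a rough sketch of what "mimicking Counter with a plain dictionary" could look like: the function name most_common_ids and the flat list of IDs are my own assumptions for illustration, not part of the original problem statement. Each ID is counted in a single pass, so there is no count() call inside the loop.

```python
def most_common_ids(ids):
    """Return the ID(s) that occur most often, in ascending order."""
    counts = {}
    for player in ids:
        # dict.get supplies 0 the first time an ID is seen
        counts[player] = counts.get(player, 0) + 1
    if not counts:
        return []
    top = max(counts.values())
    return sorted(player for player, seen in counts.items() if seen == top)


# The sample input from the question, flattened into one list of IDs.
sample = [5, 6, 0, 1, 1, 4, 5, 4, 1, 2, 4, 0]
print(most_common_ids(sample))  # [1, 4]
```

The counting loop touches each ID once, and the final filter touches each distinct ID once, so the whole thing is linear apart from the final sort of the winners.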

Answering the question you haven't asked: Your whole approach is overkill.
You don't need to worry about uniqueness; the question prompt guarantees that if you see 2 5, you'll never see 5 2, nor a repeat of 2 5
You don't even care who is friends with who, you just care how many friends an individual has
So don't even bother making the pairs. Just count how many times each player ID appears at all. If you see 2 5, that means 2 has one more friend, and 5 has one more friend, it doesn't matter who they are friends with.
The entire problem can simplify down to a simple exercise in separating the player IDs and counting them all up (because each appearance means one more unique friend), then keeping only the ones with the highest counts.
A fairly idiomatic solution (reading from stdin and writing to stdout; tweaking it to open files is left as an exercise) would be something like:
import sys
from collections import Counter
from itertools import chain, islice
def main():
    numlines = int(next(sys.stdin))
    friend_pairs = map(str.split, islice(sys.stdin, numlines)) # Convert lines to friendship pairs
    counts = Counter(chain.from_iterable(friend_pairs)) # Flatten to friend mentions and count mentions to get friend count
    max_count = max(counts.values()) # Identify maximum friend count
    winners = [pid for pid, cnt in counts.items() if cnt == max_count]
    winners.sort(key=int) # Sort winners numerically
    print(*winners, sep="\n")
if __name__ == '__main__':
    main()
Technically, it doesn't even require the use of islice nor storing to numlines (the line count at the beginning might be useful to low level languages to preallocate an array for results, but for Python, you can just read line by line until you run out), so the first two lines of main could simplify to:
next(sys.stdin)
friend_pairs = map(str.split, sys.stdin)
But either way, you don't need to uniquify friendships, nor preserve any knowledge of who is friends with whom to figure out who has the most friends, so save yourself some trouble and skip the unnecessary work.

If your intention is to have a list in each value of the dictionary, why not check that list before appending, the same way you check for the key?
if key in dict_obj:
    # dict_obj[key] is assumed to be a list
    if value not in dict_obj[key]:
        dict_obj[key].append(value)
else:
    dict_obj[key] = [value]

Related

Is there a better way than a while loop to perform this function?

I was attempting some python exercises and I hit the 5s timeout on one of the tests. The function is pre-populated with the parameters and I am tasked with writing code that is fast enough to run within the max timeframe of 5s.
There are N dishes in a row on a kaiten belt, with the ith dish being of type Di. Some dishes may be of the same type as one another. The N dishes will arrive in front of you, one after another in order, and for each one you'll eat it as long as it isn't the same type as any of the previous K dishes you've eaten. You eat very fast, so you can consume a dish before the next one gets to you. Any dishes you choose not to eat as they pass will be eaten by others.
Determine how many dishes you'll end up eating.
Issue
The code "works" but is not fast enough.
Code
The idea here is to add the D[i] entry if it is not in the pastDishes list (which can be of size K).
from typing import List
# Write any import statements here
def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    numDishes=0
    pastDishes=[]
    i=0
    while (i<N):
        if(D[i] not in pastDishes):
            numDishes+=1
            pastDishes.append(D[i])
            if len(pastDishes)>K:
                pastDishes.pop(0)
        i+=1
    return numDishes
Is there a more effective way?
After much trial and error, I have finally found a solution that is fast enough to pass the final case in the puzzle you are working on. My previous code was neat and quick, but I have since found a tool that makes this much faster. It's from collections, just as deque is, but it is called Counter.
This was my original code:
def getMaximumEatenDishCount(N: int, D: list, K: int) -> int:
    numDishes=lastMod=0
    pastDishes=[0]*K
    for Dval in D:
        if Dval in pastDishes:continue
        pastDishes[lastMod] = Dval
        numDishes,lastMod = numDishes+1,(lastMod+1)%K
    return numDishes
I then implemented Counter like so:
from typing import List
# Write any import statements here
from collections import Counter
def getMaximumEatenDishCount(N: int, D: 'list[int]', K: int) -> int:
    eatCount=lastMod = 0
    pastDishes=[0]*K
    eatenCounts = Counter({0:K})
    for Dval in D:
        if Dval in eatenCounts:continue
        eatCount +=1
        eatenCounts[Dval] +=1
        val = pastDishes[lastMod]
        if eatenCounts[val] <= 1: eatenCounts.pop(val)
        else: eatenCounts[val] -= 1
        pastDishes[lastMod]=Dval
        lastMod = (lastMod+1)%K
    return eatCount
Which ended up working quite well. I'm sure you can make it less clunky, but this should work fine on its own.
Some Explanation of what i am doing:
A while loop with a manual index would work too, but since I would need to access the value at that index several times, I believe a for loop over the list is the better fit here (in CPython it is also typically no slower than the equivalent while loop). You can see I also initialised the list to the maximum size it needs to be and overwrite the values instead of popping and appending, which saves a lot of time. Additionally, as pointed out by @outis, another small improvement comes from using the modulo operator on lastMod, which removes the need for an additional if statement. Counter is essentially a special dict that holds a hashable as the key and an int as the value. Because lastMod indexes the entry that would normally be removed through list.pop(0), I use it to find the dish whose count must be decremented, or removed, in the counter.
Note that it is not considered 'pythonic' to assign multiple variables on one line; I believe it gives a slight performance boost, which is why I have done it, but this can be argued (see this post).
If anyone else is interested the problem that we were trying to solve, it can be found here: https://www.facebookrecruiting.com/portal/coding_puzzles/?puzzle=958513514962507
Can we use an appropriate data structure? If so:
Data structures
Seems like an ordered set which you have to shrink to a capacity restriction of K.
To meet that, if exceeds (len(ordered_set) > K) we have to remove the first n items where n = len(ordered_set) - K. Ideally the removal will perform in O(1).
However since removal on a set is in unordered fashion. We first transform it to a list. A list containing unique elements in the order of appearance in their original sequence.
From that ordered list we can then remove the first n elements.
For example: the function lru returns the least-recently-used items for a sequence seq limited by capacity-limit k.
To obtain the length we can simply call len() on that LRU return value:
maximumEatenDishCount = len(lru(seq, k))
See also:
Does Python have an ordered set?
Fastest way to get sorted unique list in python?
Using set for uniqueness (up to Python 3.6; note that a set does not preserve the order of appearance)
def lru(seq, k):
    return list(set(seq))[:k]
Using dict for uniqueness (since Python 3.7)
Same mechanics as above, but using the insertion order that dicts are guaranteed to preserve since Python 3.7 (in CPython 3.6 this was an implementation detail):
using OrderedDict explicitly
from collections import OrderedDict
def lru(seq, k):
    return list(OrderedDict.fromkeys(seq).keys())[:k]
using dict factory-method:
def lru(seq, k):
    return list(dict.fromkeys(seq).keys())[:k]
using dict-comprehension:
def lru(seq, k):
    return list({i:0 for i in seq}.keys())[:k]
See also:
The order of keys in dictionaries
Using ordered dictionary as ordered set
How do you remove duplicates from a list whilst preserving order?
Real Python: OrderedDict vs dict in Python: The Right Tool for the Job
As the problem is an exercise, exact solutions are not included. Instead, strategies are described.
There are at least a couple potential approaches:
Use a data structure that supports fast containment testing (a set in use, if not in name) limited to the K most recently eaten dishes. Fortunately, since dict preserves insertion order in newer Python versions and testing key containment is fast, it will fit the bill. dict requires that keys be hashable, but since the problem uses ints to represent dish types, that requirement is met.
With this approach, the algorithm in the question remains unchanged.
Rather than checking whether the next dish type is any of the last K dishes, check whether the last time the next dish was eaten is within K of the current plate count. If it is, skip the dish. If not, eat the dish (update both the record of when the next dish was last eaten and the current dish count). In terms of data structures, the program will need to keep a record of when any given dish type was last eaten (initialized to -K-1 to ensure that the first time a dish type is encountered it will be eaten; defaultdict can be very useful for this).
With this approach, the algorithm is slightly different. The code ends up being slightly shorter, as there's no shortening of the data structure storing information about the dishes as there is in the original algorithm.
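For readers who want to check their own attempt against something concrete, here is one possible sketch of the second approach. The names count_eaten_dishes and last_eaten are my own, and the comparison `eaten - last_eaten[dish] < K` is just one way to express "eaten within the last K dishes"; treat it as an illustration under those assumptions rather than the intended solution.

```python
from collections import defaultdict
from typing import List


def count_eaten_dishes(D: List[int], K: int) -> int:
    # last_eaten[dish] is the value of the eaten-dish counter at the moment
    # that dish type was last eaten. The default of -K - 1 guarantees that a
    # dish type seen for the first time is always far enough in the past.
    last_eaten = defaultdict(lambda: -K - 1)
    eaten = 0
    for dish in D:
        if eaten - last_eaten[dish] < K:
            continue  # this type is among the K most recently eaten dishes
        eaten += 1
        last_eaten[dish] = eaten
    return eaten


print(count_eaten_dishes([1, 2, 3, 3, 2, 1], 1))  # 5
print(count_eaten_dishes([1, 2, 3, 3, 2, 1], 2))  # 4
```

Each dish costs one dictionary lookup and at most one update, so the running time no longer depends on K.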
There are two takeaways from the latter approach that might be applied when solving other problems:
More broadly, reframing a problem (such as from "the dish is in the last K dishes eaten" to "the dish was last eaten within K dishes of now") can result in a simpler approach.
Less broadly, sometimes it's more efficient to work with a flipped data structure, swapping keys/indices and values.
Approach & takeaway 2 both remind me of a substring search algorithm (the name escapes me) that uses a table of positions in the needle (the string to search for) of where each character first appears (for characters not in the string, the table has the length of the string); when a mismatch occurs, the algorithm uses the table to align the substring with the mismatching character, then starts checking at the start of the substring. It's not the most efficient string search algorithm, but it's simple and more efficient than the naive algorithm. It's similar to but simpler and less efficient than the skip search algorithm, which uses the positions of every occurrence of each character in the needle.
from typing import List
# Write any import statements here
from collections import deque, Counter
def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    q = deque()
    cnt = 0
    dish_counter = Counter()
    for d in D:
        if dish_counter[d] == 0:
            cnt += 1
            q.append(d)
            dish_counter[d] += 1
            if len(q) == K + 1:
                remove = q.popleft()
                dish_counter[remove] -= 1
    return cnt

the code is giving me a number from the list instead of the mode

In one of my assignments I need to find the mode of a list called "dataset" without using any module or function that would find the mode by itself.
I tried to make it output the mode, or the list of modes, depending on the list of numbers. I used two for loops so that each number in the list is compared against every number in the list, including itself, to count how many times it occurs; for example, if my list was 1, 2, 3, 4, 1, 5 it would say there are two ones, and it does this for all the numbers in the list. The number with the highest count would be the mode. The bottom section of the code, with the if, elif and else, is where it checks whether the number has the highest count by comparing it with the previous highest count, checking whether it has more occurrences or the same number.
I've tried changing the order of the code, but I'm still confused about why it behaves this way.
pop_number = []
pop_amount = 0
amount = 0
for i in range(len(dataset)):
    for x in dataset:
        if dataset[i] == x:
            amount += 1
    if amount>pop_amount:
        pop_amount = amount
        pop_number = []
        pop_number.append(x)
        amount = 0
    elif amount==pop_amount:
        pop_amount = amount
        if x not in pop_number:
            pop_number.append(x)
        pop_amount = amount
        amount = 0
    else:
        continue
print(pop_number)
I expected the output to be the mode of the list, or the list of modes, but it came up with the last number from the list.
As this is apparently homework, I will present a sketch, not working code.
Observe that a dict in Python can hold key-value mappings.
Let the numbers in the input list be the keys, and the values the number of times they occur. Going over the list, use each item as the key for the dict, and add one to the value (starting at 0 -- defaultdict(int) is good for this). If the result is bigger than any previous maximum, remember this key.
Since you want to allow for more than one mode value, the variable which remembers the maximum key should be a list; but since you have a new maximum, replace the old list with a list containing just this key. If another value also reaches the maximum, add it to the list. (That's the append method.)
(See how this is if bigger than maximum so far and then else if equal to maximum so far and then otherwise there is no need to do anything.)
When you have looped over all items in the input list, the list of remembered keys is your result.
Go back and think about what variables you need already before the loop. The maximum so far should be defined but guaranteed to be smaller than any value you will see -- it makes sense to start this at 0 because as soon as you see one key, it will have a bigger count than zero. And the keys you want to remember can start out as an empty list.
Now think about how you would test this. What happens if the input list is empty? What happens if the input list contains just the same number over and over? What happens if every item on the input list is unique? Can you think of other corner cases?
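If you get stuck, the sketch described above might end up looking roughly like the following. The function name find_modes and the variable names are my own choices, and this is only one way to arrange the if / elif logic; try writing your own version before reading it.

```python
from collections import defaultdict


def find_modes(dataset):
    counts = defaultdict(int)  # number -> how many times it has been seen
    best = 0                   # highest count seen so far
    modes = []                 # numbers that currently share that count
    for item in dataset:
        counts[item] += 1
        if counts[item] > best:
            # New maximum: throw away the old list and start over.
            best = counts[item]
            modes = [item]
        elif counts[item] == best:
            # Ties with the current maximum: remember it as well.
            modes.append(item)
        # Otherwise there is nothing to do.
    return modes


print(find_modes([1, 2, 3, 4, 1, 5]))  # [1]
print(find_modes([1, 1, 2, 2, 3]))     # [1, 2]
```

An empty input returns an empty list, and a list where every item is unique returns every item, which covers two of the corner cases mentioned above.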
Even without using any module or function that finds the mode for you, you can do this with much less code. Your code will work with a little more effort, and I highly suggest you try to solve the problem with your own logic, but meanwhile let me show you how to use Python's built-in data structures (lists, tuples, dictionaries and sets) to do it in 7-8 lines. There is also unpacking at the end (*). I suggest you look these up when you get time.
lst = [1,1,1,1,2,2,2,3,3,3,3,3,3,4,2,2,2,5,5,6]
# finds the unique elements
unique_elems = set(lst)
# creates a dictionary with the unique elems as keys and initializes the values to 0
count = dict.fromkeys(unique_elems,0)
# gets the frequency of each element in the lst
for elem in unique_elems:
    count[elem] = lst.count(elem)
# finds max frequency
max_freq = max(count.values())
# stores list of mode(s)
modes = [i for i in count if count[i] == max_freq]
# prints mode(s), I have used unpacking here so that in case there is one mode,
# you don't have to print ugly [x]
print(*modes)
Or if you want to go for the shortest (I really shouldn't be making such bold claims in StackOverflow), then I guess this will be it (even though, writing short codes for the sake of it is discouraged)
lst = [1,1,1,1,2,2,2,3,3,3,3,3,3,4,2,2,2,5,5,6]
freq_dist = [(i, lst.count(i)) for i in set(lst)]
[print(i,end=' ') for i,j in freq_dist if j==max(freq_dist, key=lambda x:x[1])[1]]
And if you just want to go bonkers and say goodbye to loops (Goes without saying, this is ugly, really ugly):
lst = [1,1,1,1,2,2,2,3,3,3,3,3,3,4,2,2,2,5,5,6]
unique_elems = set(lst)
freq_dist = list(map(lambda x:(x, lst.count(x)), unique_elems))
print(*list(map(lambda x:x[0] if x[1] == max(freq_dist,key = lambda y: y[1])[1] else '', freq_dist)))

For each bigram in list, print number of times it appears in other lists - python NLTK

I am new to coding and could use help. Here is my task:
I have a csv of online marketing image titles. It is a single column. Each cell in this column holds the marketing image title text for each ad. It is just a string of words. For instance cell A1 reads: "16 Maddening Tire Fails" and etc etc. To load csv I do:
with open('usethis.csv', 'rb') as f:
    mycsv = csv.reader(f)
    mycsv = list(mycsv)
I initialize a list:
mylist = []
my desire is to take the text in each cell and extract the bigrams. I do that as follows:
for i, c in enumerate(mycsv):
    mylist.append(list(nltk.bigrams(word_tokenize(' '.join(c)))))
mylist then looks like this, but with more data:
[[('16', 'Maddening'), ('Maddening', 'Tire'), ('Tire', 'Fails')], [('16', 'Maddening'), ('Maddening', 'Tire'), ('Tire', 'Fails'), ('Fails', 'That'), ('That', 'Show'), ('Show', 'What'), ('What', 'True'), ('True', 'Negligence'), ('Negligence', 'Looks'), ('Looks', 'Like')]
mylist holds individual lists which are the bigrams created from each cell in my csv.
Now I am wanting to loop through every bigram in all lists and next to each bigram print the number of times it appears in another list (cell). This would be the same as a countifs in excel, basically. For instance, if the bigram "('16', 'Maddening')" in the first list (cell A1) appears 3 other times in (mylist) then print the number 3 next to it. And so on for each bigram. If it is easier to return this information into a new list that's fine. Just printing it out somewhere that makes sense.
I have done a lot of reading online, for instance this link kind of was along the general idea:
How to check if all elements of a list matches a condition?
And also this link about dictionaries was similar in that it is returning a number next to each value as I want to return a count next to each bigram..
What are Python dictionary view objects?....
But I really am at a loss as to how to do this. Thank you so much in advance for your help! Let me know if I need to explain something better.
You can use collections.Counter for this task.
Since you are already using NLTK, FreqDist and derived classes might come in handy when you want to do more than just counting, but for now let's stick with the simpler Counter.
Counter is a subclass of dict, i.e. it can do everything a dictionary can, but it has additional functionality.
The following snippet extends the code you showed:
from collections import Counter
bigram_counts = Counter()
for cell in mylist:
    for bigram in cell:
        bigram_counts[bigram] += 1
After this, you can look up individual bigrams with subscript, eg. bigram_counts['16', 'Maddening'] will return 3 or whatever the actual count was.
With bigram_counts.most_common(5) you get the 5 most frequent bigrams.
Update
... to actually answer the specific problem in your question.
In order to know the number of occurrences in all but one cell, you need to have separate counters for each cell.
Replace the previous snippet with the following:
# Populate n+1 counters.
bigram_totals = Counter()
separate_counters = []
for cell in mylist:
    bigram_current = Counter()
    separate_counters.append(bigram_current)
    for bigram in cell:
        bigram_totals[bigram] += 1
        bigram_current[bigram] += 1
# Look up all bigram counts.
for cell, bigram_current in zip(mylist, separate_counters):
    for bigram in cell:
        count = bigram_totals[bigram] - bigram_current[bigram]
        # print(bigram, count) or whatever...
So, in addition to the total counts, we have a separate counter for each cell.
When doing a lookup, we subtract the local count from the global count to get the sum of occurrences everywhere else.
Btw, since you mentioned learning purposes, the first block can be written a bit shorter by taking advantage of special Counter features:
# Populate n+1 counters.
bigram_totals = Counter()
separate_counters = []
for cell in mylist:
    bigram_current = Counter(cell)
    separate_counters.append(bigram_current)
    bigram_totals.update(bigram_current)
I think this is a bit more elegant, but it might be harder to understand for a beginner.
Decide for yourself which version you think is more readable.
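To see the "total minus local" idea end to end, here is a small self-contained sketch. The wrapper function counts_elsewhere and the two-cell sample data are made up for illustration; in your script, mylist would take the place of cells.

```python
from collections import Counter


def counts_elsewhere(cells):
    """For every bigram in every cell, count its occurrences in the other cells."""
    totals = Counter()
    per_cell = []
    for cell in cells:
        current = Counter(cell)
        per_cell.append(current)
        totals.update(current)
    # Global count minus the count inside the bigram's own cell.
    return [
        [(bigram, totals[bigram] - current[bigram]) for bigram in cell]
        for cell, current in zip(cells, per_cell)
    ]


cells = [
    [("16", "Maddening"), ("Maddening", "Tire"), ("Tire", "Fails")],
    [("16", "Maddening"), ("Maddening", "Tire"), ("Tire", "Fails"), ("Fails", "That")],
]
for row in counts_elsewhere(cells):
    print(row)
```

With this sample, every bigram of the first cell gets a count of 1 (it appears once in the second cell), while ('Fails', 'That') gets 0 because it appears nowhere else.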

Find all the possible N-length anagrams - fast alternatives

I am given a sequence of letters and have to produce all the N-length anagrams of the sequence given, where N is the length of the sequence.
I am following a kinda naive approach in python, where I am taking all the permutations in order to achieve that. I have found some similar threads like this one but I would prefer a math-oriented approach in Python. So what would be a more performant alternative to permutations? Is there anything particularly wrong in my attempt below?
from itertools import permutations
def find_all_anagrams(word):
    pp = permutations(word)
    perm_set = set()
    for i in pp:
        perm_set.add(i)
    ll = [list(i) for i in perm_set]
    ll.sort()
    print(ll)
If there are lots of repeated letters, the key will be to produce each anagram only once instead of producing all possible permutations and eliminating duplicates.
Here's one possible algorithm which only produces each anagram once:
from collections import Counter
def perm(unplaced, prefix):
    if unplaced:
        for element in unplaced:
            yield from perm(unplaced - Counter(element), prefix + element)
    else:
        yield prefix
def permutations(iterable):
    yield from perm(Counter(iterable), "")
That's actually not much different from the classic recursion to produce all permutations; the only difference is that it uses a collections.Counter (a multiset) to hold the as-yet-unplaced elements instead of just using a list.
The number of Counter objects produced in the course of the iteration is certainly excessive, and there is almost certainly a faster way of writing that; I chose this version for its simplicity and (hopefully) its clarity.
This is very slow for long words with many repeated characters (slow compared to the theoretical maximum performance, that is). For example, permutations("mississippi") will produce a much longer list than necessary: it has a length of 39916800, but the set has a size of only 34650.
>>> len(list(permutations("mississippi")))
39916800
>>> len(set(permutations("mississippi")))
34650
So the big flaw with your method is that you generate ALL anagrams and then remove the duplicates. Use a method that only generates the unique anagrams.
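Since a math-oriented angle was asked for: the two numbers above can be predicted without generating anything. The number of distinct anagrams of a word of length n is n! divided by the factorial of each letter's repeat count. A small sketch (the function name count_distinct_anagrams is my own):

```python
from collections import Counter
from math import factorial


def count_distinct_anagrams(word):
    # n! orderings in total, divided by k! for each letter that occurs k times,
    # because swapping identical letters does not produce a new anagram.
    total = factorial(len(word))
    for repeats in Counter(word).values():
        total //= factorial(repeats)
    return total


print(factorial(len("mississippi")))           # 39916800
print(count_distinct_anagrams("mississippi"))  # 34650
```

This gives a quick way to estimate how much work is wasted by generating every permutation first: for "mississippi" the naive approach produces 1152 copies of each anagram.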
EDIT:
Here is some working, but extremely ugly and possibly buggy code. I'm making it nicer as you're reading this. It does give 34650 for mississippi, so I assume there aren't any major bugs. Warning again. UGLY!
# Returns a dictionary with letter count
# get_letter_list("mississippi") returns
# {'i':4, 'm':1, 'p': 2, 's':4}
def get_letter_list(word):
    w = sorted(word)
    dd = {}
    dd[w[0]]=1
    for l in range(1,len(w)):
        if w[l]==w[l-1]:
            dd[w[l]]=dd[w[l]]+1
        else:
            dd[w[l]]=1
    return dd
def sum_dict(d):
    s=0
    for x in d:
        s=s+d[x]
    return s
# Recursively create the anagrams. It takes a letter list
# from the above function as an argument.
def create_anagrams(dd):
    if sum_dict(dd)==1: # If there's only one letter left
        for l in dd:
            return l # Ugly hack, because I'm not used to dicts
    a = []
    for l in dd:
        if dd[l] != 0:
            newdd=dict(dd)
            newdd[l]=newdd[l]-1
            if newdd[l]==0:
                newdd.pop(l)
            newl=create_anagrams(newdd)
            for x in newl:
                a.append(str(l)+str(x))
    return a
>>> print (len(create_anagrams(get_letter_list("mississippi"))))
34650
It works like this: for every unique letter l, create all unique permutations of the remaining letters (with one fewer occurrence of l), and then prepend l to each of these permutations.
For "mississippi", this is way faster than set(permutations(word)) and it's far from optimally written. For instance, dictionaries are quite slow and there's probably lots of things to improve in this code, but it shows that the algorithm itself is much faster than your approach.
Maybe I am missing something, but why don't you just do this:
from itertools import permutations
def find_all_anagrams(word):
    return sorted(set(permutations(word)))
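You could simplify to the following, but note that converting the word to a set discards repeated letters, so it only gives the expected result when every letter in the word is distinct:
from itertools import permutations
def find_all_anagrams(word):
    word = set(''.join(sorted(word)))
    return list(permutations(word))
The documentation for permutations gives detailed equivalent code, and it already seems optimized.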
I don't know python but I want to try to help you: probably there are a lot of other more performant algorithm, but I've thought about this one: it's completely recursive and it should cover all the cases of a permutation. I want to start with a basic example:
permutation of ABC
Now, this algorithm works in this way: for Length times you shift right the letters, but the last letter will become the first one (you could easily do this with a queue).
Back to the example, we will have:
ABC
BCA
CAB
Now you repeat the first (and only) step with the substring built from the second letter to the last one.
Unfortunately, with this algorithm you cannot consider permutation with repetition.

Python loop through a row of numbers to find the ones which are not being used [closed]

I am trying to make a Python script that loops through a text file. In the text file I have something similar to this:
abc1
abc2
abc3
abc5
abc6
Now I want it to loop through all of this and find the numbers that are not being used. In this case it would be abc4, which it should print. But I'm stuck. I've tried searching for a way to approach this but can't seem to phrase the question to get a good answer...
I hope someone can help me or point me in the right direction!
I will add. The text is always abcN (N = a number) the numbers are also in a row. Like in the example
Read the data, discard the text and only keep the numbers. Put the numbers in a set while finding the maximum value. This will assure that you have all numbers in the file, without duplicates, and also the max number to look for.
Once you have the numbers in the set, just loop from the zero to the max value, and check if the number is in the set.
This might not be the most effective or Pythonic solution, but it's a solution.
If you want to get adventurous with itertools, a Pythonic solution using generators would seem ideal.
It is worth noting that it handles the edge cases well and is highly scalable.
Implementation
from itertools import tee, islice
with open("test.txt") as fin:
    fin1, fin2 = tee(int(line[3:]) for line in fin)
    # Pair each number with its successor: fin2 is advanced by one element.
    print([line1 + 1 for line2, line1 in zip(islice(fin2, 1, None), fin1)
           if line2 - line1 > 1])
Output (for same input)
abc1
abc2
abc3
abc5
abc6
abc8
[4, 7]
try this:
import re
my_numbers = [int(re.search(r'\d+', line).group()) for line in open('myfile.txt')]
reference_numbers = range(0, max(my_numbers))
missing_numbers = [num for num in reference_numbers if num not in my_numbers]
For small files:
with open("file") as inp:
    c=[]
    for line in inp:
        c.append(int(line.strip("abc")))
check=set(range(min(c),max(c)+1))
print(c)
print(check)
print("difference : "+" ".join(map(str,check-set(c))))
[1, 2, 3, 5, 6]
{1, 2, 3, 4, 5, 6}
difference : 4
You could use set instead of list
As suggested, this is an implementation of one possible solution:
nums=[]
for line in file:
    i = int(line[3:])
    nums.append(i)
singles=set(nums)
largest=max(singles)  # renamed so the built-in max is not shadowed
missing=[]
for k in range(largest):
    if k not in singles:
        missing.append(k)
print(missing)
Hope this helps!
If it is possible to generate all the possible entries without reading the file (eg, if N is limited to a fixed range, say 0-9), you could build all of those into a set, using something like:
possibilities = {'abc{}'.format(i) for i in range(10)}
You can then generate a similar set of the entries that are actually in the file:
entries = {line.strip() for line in file}  # strip the trailing newline so entries match 'abcN'
Then your problem is reduced to "the things in the set possibilities which are not also in entries", which sets support directly:
missing = possibilities - entries
If the size of possibilities is large, you may wish to fill it with numbers instead, and parse out the numbers from each entry in the file. If it is bounded only by the largest number actually present in the file, you would need to dynamically generate it from the entries.
If the file is large enough that keeping all the entries and all the possibilities in memory at once is prohibitive, you can take advantage of the order by using nested loops. Create a generator to give you just the numbers:
entries = (parse_num(line) for line in file)
(where parse_num takes abcN and gives you N, as an int). You can then iterate over these lines while keeping a separate counter of where you expect to be up to - whenever it is different to where you are up to, you have a missing value:
def missing_numbers(entries):
    expected = 0
    for entry in entries:
        while expected < entry:
            yield expected
            expected += 1
        expected = entry + 1  # this entry is present, so move past it
Since the OP pointed out that the numbers will be in order, I thought of this solution which will always expect the nextnum for the following line:
import re
nextnum = 1
for line in open('input_file.txt'):
    match = re.search(r'abc(\d+)$', line)
    if not match:
        print('error: line "%s" did not match' % line)
        continue
    linenum = int(match.group(1))
    if linenum > nextnum:
        print('line abc%d skipped, found abc%d!' % (nextnum, linenum))
    nextnum = linenum + 1
Note that this will only give one "skipped" output even if there are multiple subsequent numbers missing because it will simply expect the next line to have the current line's number plus one.
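If every missing number in a gap should be reported, one possible variation is to collect range(nextnum, linenum) for each line instead of printing a single message. This is only a sketch: the function name find_missing, the start parameter, and the choice to silently skip lines that do not match are my own assumptions.

```python
import re


def find_missing(lines, start=1):
    """Return every number skipped in an ascending sequence of abcN lines."""
    missing = []
    nextnum = start
    for line in lines:
        match = re.search(r'abc(\d+)$', line.strip())
        if not match:
            continue  # assumption: ignore lines that do not look like abcN
        linenum = int(match.group(1))
        # Everything from the expected number up to (not including) this one is missing.
        missing.extend(range(nextnum, linenum))
        nextnum = linenum + 1
    return missing


print(find_missing(["abc1\n", "abc2\n", "abc3\n", "abc5\n", "abc6\n", "abc9\n"]))  # [4, 7, 8]
```

Any iterable of lines works here, so an open file object can be passed in directly in place of the list.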
