I have just a small problem with lists and replacing some list entries.
Some background on my problem. My idea is really simple. I use the mmap module to read out bigger files: FORTRAN files with 7 columns and one million lines. Some values don't fit the format of the FORTRAN output, so I just get ten stars instead of a number. I can't change the output format in the source code, so I have to deal with this. After loading the file with mmap, I use str.split() to convert the data to a list and then search for the bad values. Look at the following source code:
import mmap

f = open(fname,'r+b')
A = str(mmap.mmap(f.fileno(),0)[:]).split()
for i in range(A.count('********')):
    A[A.index('********')] = '0.0'
I know it's probably not the best solution, but it's quick and dirty. OK, it's quick if A.count('********') is small, and that's actually my problem: for some files the replacing doesn't work fast at all. If the count is too big, it takes a lot of time. Is there any other method, or a totally different approach, to replace my bad values without wasting a lot of time?
Thanks for any help or any suggestions.
EDIT:
How does the method list.count() work? I could also run through the whole list and do the replacement myself:
for k in range(len(A)):
    if A[k] == '**********': A[k] = '0.0'
This would be faster for many replacements. But would it still be faster if I only had one match?
The main problem in your code is the use of A.index inside the loop. The index method walks linearly through your list, from the start up to the next occurrence of '********' - this turns an O(n) problem into an O(n²) one, hence your perceived lack of performance.
In Python the most obvious way is usually the best way to do it: walking through your list in a Python for loop will in this case undoubtedly beat the O(n²) loops in C done by the count and index methods. The not-so-obvious part is the recommended use of the built-in function enumerate to get both an item's value and its index from the list in the for loop.
f = open(fname,'r+b')
A = str(mmap.mmap(f.fileno(),0)[:]).split()
for i, value in enumerate(A):
    if value == "********":
        A[i] = "0.0"
If you are eventually going to convert this to an array, you might consider using numpy and np.genfromtxt, which has the ability to deal with missing data:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
With a binary file, you can use np.memmap and then use masked arrays to deal with the missing elements.
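For example, an untested sketch of the genfromtxt approach; the file name and the exact run of asterisks in the bad fields are assumptions:
import numpy as np

# treat the star runs as missing data and fill them with 0.0
data = np.genfromtxt('fortran_output.dat',
                     missing_values='**********',
                     filling_values=0.0)
# data is now an (n_lines, 7) float array with the bad entries set to 0.0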
fin = open(fname, 'r')
fout = open(fname + '_fixed', 'w')
for line in fin:
    # replace 10 asterisks by 7 spaces + '0.0'
    # If you don't mind losing the fixed-column-width format,
    # omit the seven spaces
    line = line.replace('**********', '       0.0')
    fout.write(line)
fin.close()
fout.close()
Alternatively, if your file is smallish, replace the loop with this:
fout.write(fin.read().replace('**********', '       0.0'))
Since you convert A to one huge string anyway, you could first change all the bad values with a single call to the str.replace() method and only then split it; you'd get the same result, likely a lot faster. Something like:
f = open(fname,'r+b')
A = str(mmap.mmap(f.fileno(),0)[:]).replace('********', '0.0').split()
It would use a lot of memory, but that's often the trade-off for speed.
Instead of manipulating A, try using a list comprehension to make a new A:
A = [v if v != '********' else 0.0 for v in A]
I think you'll find this surprisingly fast.
Sorry, this is likely a complete noob question; I'm new to Python and have been unable to implement any online suggestions so that they actually work. I need to decrease the run-time of the code for larger files, so I need to reduce the number of iterations I'm doing.
How do I modify the append_value function below so that it appends only UNIQUE values to dict_obj, removing the need for another series of iterations to do this later on?
EDIT: Sorry, here is an example input/output
Sample Input:
6
5 6
0 1
1 4
5 4
1 2
4 0
Sample Output:
1
4
I'm attempting to solve:
http://orac.amt.edu.au/cgi-bin/train/problem.pl?problemid=416
Here is my code, which produces the output above:
input_file = open("listin.txt", "r")
output_file = open("listout.txt", "w")

ls = []
n = int(input_file.readline())
for i in range(n):
    a, b = input_file.readline().split()
    ls.append(int(a))
    ls.append(int(b))

def append_value(dict_obj, key, value):          # How to append only UNIQUE values to
    if key in dict_obj:                          # dict_obj?
        if not isinstance(dict_obj[key], list):
            dict_obj[key] = [dict_obj[key]]
        dict_obj[key].append(value)
    else:
        dict_obj[key] = value

mx = []
ls.sort()
Dict = {}
for i in range(len(ls)):
    c = ls.count(ls[i])
    append_value(Dict, int(c), ls[i])
    mx.append(c)

x = max(mx)
lss = []
list_set = set(Dict[x])       # To remove the need for this
unique_list = (list(list_set))
for x in unique_list:
    lss.append(x)
lsss = sorted(lss)
for i in lsss:
    output_file.write(str(i) + "\n")

output_file.close()
input_file.close()
Thank you
The answer to your question, 'how to only append unique values to this container' is fairly simple: change it from a list to a set (as #ShadowRanger suggested in the comments). This isn't really a question about dictionaries, though; you're not appending values to 'dict_obj', only to a list stored in the dictionary.
Since the source you linked to shows this is a training problem for people newer to coding, you should know that changing the lists to sets might be a good idea, but it's not the cause of the performance issues.
The problem boils down to: given a file containing a list of integers, print the most common integer(s). Your current code iterates over the list, and for each index i, iterates over the entire list to count matches with ls[i] (this is the line c = ls.count(ls[i])).
Some operations are more expensive than others: calling count() is one of the more expensive operations on a Python list, since it reads through the entire list every time it's called. This O(n) function sits inside a loop of length n, giving O(n^2) time overall. By contrast, all of the set() filtering for non-unique elements takes only O(n) time in total (and is quite fast in practice). Identifying linear-time functions hidden inside loops like this is a frequent theme in optimization; profiling your code would also have pointed you to it.
In general, you'll want to use something like the Counter class in Python's standard library for frequency counting. That kind of defeats the whole point of this training problem, though, which is to encourage you to improve on the brute-force algorithm for finding the most frequent element(s) in a list. One possible way to solve this problem is to read the description of Counter, and try to mimic its behavior yourself with a plain Python dictionary.
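For example, here is a rough sketch of that dictionary-based counting (essentially what collections.Counter automates), using the question's sample input flattened into a list:
ls = [5, 6, 0, 1, 1, 4, 5, 4, 1, 2, 4, 0]

counts = {}
for value in ls:                          # one O(n) pass instead of count() inside a loop
    counts[value] = counts.get(value, 0) + 1

max_count = max(counts.values())          # the highest friend count
most_common = sorted(v for v, c in counts.items() if c == max_count)
print(most_common)                        # [1, 4] -- matches the sample output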
Answering the question you haven't asked: Your whole approach is overkill.
You don't need to worry about uniqueness; the question prompt guarantees that if you see 2 5, you'll never see 5 2, nor a repeat of 2 5
You don't even care who is friends with whom; you just care how many friends an individual has.
So don't even bother building the pairs. Just count how many times each player ID appears at all. If you see 2 5, that means 2 has one more friend and 5 has one more friend; it doesn't matter who they are friends with.
The entire problem simplifies down to an exercise in separating the player IDs and counting them all up (because each appearance means one more unique friend), then keeping only the ones with the highest counts.
A fairly idiomatic solution (reading from stdin and writing to stdout; tweaking it to open files is left as an exercise) would be something like:
import sys
from collections import Counter
from itertools import chain, islice

def main():
    numlines = int(next(sys.stdin))
    friend_pairs = map(str.split, islice(sys.stdin, numlines))  # Convert lines to friendship pairs
    counts = Counter(chain.from_iterable(friend_pairs))  # Flatten to friend mentions and count mentions to get friend count
    max_count = max(counts.values())  # Identify maximum friend count
    winners = [pid for pid, cnt in counts.items() if cnt == max_count]
    winners.sort(key=int)  # Sort winners numerically
    print(*winners, sep="\n")

if __name__ == '__main__':
    main()
Technically, it doesn't even require the use of islice or storing numlines (the line count at the beginning might be useful for low-level languages to preallocate an array for results, but in Python you can just read line by line until you run out), so the first two lines of main could simplify to:
next(sys.stdin)
friend_pairs = map(str.split, sys.stdin)
But either way, you don't need to uniquify friendships, nor preserve any knowledge of who is friends with whom to figure out who has the most friends, so save yourself some trouble and skip the unnecessary work.
If your intention is to have a list in each value of the dictionary, why not iterate over that list the same way you iterated over the keys?
if key in dict_obj:
    if value not in dict_obj[key]:   # assuming dict_obj[key] is a list
        dict_obj[key].append(value)  # append only if the value isn't already there
else:
    dict_obj[key] = [value]
I am trying to convert a for loop with an assignment into a list comprehension.
More precisely, I am trying to replace only one element of each sub-list, which has three entries.
Can it be done?
for i in range(len(data)):
    data[i][0] = data[i][0].replace('+00:00','Z').replace(' ','T')
Best
If you really, really want to convert it to a list comprehension, you could try something like this, assuming the sub-lists have three elements, as you stated in the question:
new_data = [[a.replace('+00:00','Z').replace(' ','T'), b, c] for (a, b, c) in data]
Note that this does not modify the existing list; it creates a new one. However, in this case I'd just stick with a regular for loop, which conveys much better what you are actually doing. Instead of iterating over the indices, you could iterate over the elements directly, though:
for x in data:
    x[0] = x[0].replace('+00:00','Z').replace(' ','T')
I believe it could be done, but it's not the best way to do it.
First, you would create high Jones complexity, making the line hard for another reader of your code to follow.
Second, you would exceed the preferred line length of 80 characters, which again causes readability problems.
Third, a list comprehension is meant to build a new list out of an existing one; here you are changing your original list, which is not best practice either.
A list comprehension is useful when you are making a list, so it is not recommended here. But still, you can try this simple solution -
print([ele[0].replace('+00:00','Z').replace(' ','T') for ele in data])
Although I don't recommend using a list comprehension in this case, here is an example if you really want to.
It can handle sub-lists of different lengths, if you need that.
code:
data = [["1 +00:00",""],["2 +00:00","",""],["3 +00:00"]]
print([[i[0].replace('+00:00','Z').replace(' ','T'),*i[1:]] for i in data])
result:
[['1TZ', ''], ['2TZ', '', ''], ['3TZ']]
Is there a faster, more efficient way to split the rows in a list? My current setup isn't slow, but it does take longer than I'd like to split the whole list, maybe due to how many iterations are required to go through it.
I currently have the code below
import pandas as pd

found_reader = pd.read_csv(file, delimiter='\n', engine='c')
loaded_list = found_reader
loaded_email_list = []
for i in range(len(loaded_list)):
    loaded_email_list = loaded_email_list + [loaded_list[i].split(':')[0]]
I just would like a method that does the above in the quickest, most efficient way.
Here's how you do that efficiently if both loaded_list and loaded_email_list were regular lists (it may need slight adaptation for whatever it is that Pandas uses):
loaded_email_list += [x.partition(':')[0] for x in loaded_list]
Why this is better:
It iterates over the list directly, instead of using range, len, and an index variable
It uses partition, which stops looking after the first :, instead of split, which walks the whole string (see the short example after this list)
It uses a list comprehension to create the new list all at once, rather than creating and concatenating a bunch of single-element lists
It uses x += y, instead of x = x + y, which could theoretically be faster if its __iadd__ is more efficient than assigning its __add__ result back to itself.
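To make the partition point concrete, here's a quick illustration with made-up data:
line = 'user@example.com:extra:fields'
print(line.partition(':')[0])   # 'user@example.com' -- stops scanning at the first ':'
print(line.split(':')[0])       # same result, but split walks the whole string
                                # and builds a three-element list first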
I'm trying to learn Python and am working on an external merge sort using an input file of ints. I'm using heapq.merge, and my code almost works, but it seems to be sorting my lines as strings instead of ints. If I try to convert to ints, writelines won't accept the data. Can anyone help me find an alternative? Additionally, am I correct in thinking this will allow me to sort a file bigger than memory (given adequate disk space)?
import itertools
from itertools import islice
import tempfile
import heapq

# converts heapq.merge to ints
# def merge(*temp_files):
#     return heapq.merge(*[itertools.imap(int, s) for s in temp_files])

with open("path\to\input", "r") as f:
    temp_file = tempfile.TemporaryFile()
    temp_files = []
    elements = []
    while True:
        elements = list(islice(f, 1000))
        if not elements:
            break
        elements.sort(key=int)
        temp_files.append(elements)
        temp_file.writelines(elements)
        temp_file.flush()
        temp_file.seek(0)
        with open("path\to\output", "w") as output_file:
            output_file.writelines(heapq.merge(*temp_files))
Your elements are read as strings by default; you have to do something like:
elements = list(islice(f, 1000))
elements = [int(elem) for elem in elements]
so that they would be interpreted as integers instead.
That would also mean that you need to convert them back to strings when writing, e.g.:
temp_file.writelines([str(elem) for elem in elements])
Apart from that, you would need to convert your elements to int again for the final merging. In your case, you probably want to uncomment your merge method (and then convert the result back to strings again, the same way as above).
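For what it's worth, here is a minimal, self-contained sketch of that idea with the merge() helper uncommented, written for Python 3 (where the built-in map replaces itertools.imap); the chunk lists and the output file name are made up:
import heapq

def merge(*chunks):
    # each chunk is an already-sorted list of numeric strings
    return heapq.merge(*[map(int, chunk) for chunk in chunks])

chunks = [['1\n', '5\n', '9\n'], ['2\n', '3\n', '8\n']]

with open('merged.txt', 'w') as output_file:
    # convert the merged ints back to strings (plus newline) before writing
    output_file.writelines('{}\n'.format(n) for n in merge(*chunks))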
Your code doesn't make much sense to me (temp_files.append(elements)? Merging inside the loop?), but here's a way to merge files sorting numerically:
import heapq
files = open('a.txt'), open('b.txt')
with open('merged.txt', 'w') as out:
    out.writelines(map('{}\n'.format,
                       heapq.merge(*(map(int, f)
                                     for f in files))))
First, map(int, ...) turns each file's lines into ints. Then those get merged with heapq.merge. Then map('{}\n'.format, ...) turns each of the integers back into a string with a newline. Then writelines writes those lines. In other words, you were already close; you just had to convert the ints back to strings before writing them.
A different way to write it (might be clearer for some):
import heapq
files = open('a.txt'), open('b.txt')
with open('merged.txt', 'w') as out:
    int_streams = (map(int, f) for f in files)
    int_stream = heapq.merge(*int_streams)
    line_stream = map('{}\n'.format, int_stream)
    out.writelines(line_stream)
And in any case, do use itertools.imap if you're using Python 2, as otherwise map will read the whole files into memory at once. In Python 3, you can just use the normal map.
And yes, if you do it right, this will allow you to sort gigantic files with very little memory.
You are doing the k-way merge within the loop, which adds a lot of runtime complexity. Better to store the file handles in a separate list and perform a single k-way merge after the loop.
You also don't have to strip the newline and add it back; just sort based on the number:
sorted(temp_files, key=lambda no: int(no.strip()))
The rest is fine.
https://github.com/melvilgit/external-Merge-Sort/blob/master/README.md
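A minimal sketch of that restructuring, assuming one integer per line and using made-up file names and a placeholder chunk size: write each sorted chunk to its own temporary file, keep the handles in a list, and do a single k-way merge after the loop.
import heapq
import tempfile
from itertools import islice

chunk_size = 1000                     # placeholder; tune to the available memory
temp_files = []

with open('input.txt') as f:
    while True:
        chunk = list(islice(f, chunk_size))
        if not chunk:
            break
        chunk.sort(key=int)
        tmp = tempfile.TemporaryFile(mode='w+')
        tmp.writelines(line if line.endswith('\n') else line + '\n' for line in chunk)
        tmp.seek(0)
        temp_files.append(tmp)        # keep the handle; merge once, after the loop

with open('output.txt', 'w') as out:
    int_streams = (map(int, tf) for tf in temp_files)
    out.writelines('{}\n'.format(n) for n in heapq.merge(*int_streams))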
I have data that looks like this:
Probes FOO BAR
1452463_x_at 306.564 185.705
1439374_x_at 393.742 330.495
1426392_a_at 269.850 209.931
1433432_x_at 636.145 487.012
The second column contains a trailing space before the tab.
import sys
import csv
import pprint

with open('tmp.txt') as tsvfile:
    tabreader = csv.reader(tsvfile, delimiter="\t")
    for row in tabreader:
        #val = s.strip() for s in [row[1:3]]
        val = row[1:3]
        print val
Here is what the code prints:
['FOO', 'BAR']
['306.564 ', '185.705']
['393.742 ', '330.495']
['269.850 ', '209.931']
['636.145 ', '487.012']
Now what I want to do is strip the whitespace on the fly while iterating through the row, without storing the values in a temporary array.
Especially with this line:
#val = s.strip() for s in [row[1:3]]
But why did it fail? What's the way to do it?
You've got the syntax wrong. You want a list-comprehension:
val = [s.strip() for s in row[1:3]]
Now, I'm not exactly sure what you want, but I have created a new list. There's no clean[1] way around that.
[1] You could use an explicit loop and strip the values while re-assigning them to the original list, but ... Yuck...
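For reference, that explicit-loop version would look something like this, shown with a made-up row:
row = ['1452463_x_at', '306.564 ', '185.705']
for i in range(1, 3):
    row[i] = row[i].strip()   # strip each value and re-assign it into the original row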
If you really want to, you can mutate the row in place this way:
row[:] = [s.strip() for s in row[1:3]]
But I'm not completely sure what advantage you'd get here.
There's a concept of generator expressions in Python. This is a lazily evaluated version of a list comprehension that does not create the resulting list immediately. However, an ordinary print does not cause the generator to evaluate, so you'll need to convert it to a list before printing.
So, with your code it should look like this (note the round brackets):
for row in tabreader:
    val = (s.strip() for s in row[1:3])
    print list(val)
Using a generator expression doesn't really have any advantage over a list comprehension in your example, as you are going to print the result right away. It can be very handy if you need to do some additional processing on huge lists, reducing the memory footprint because the generator expression does not allocate memory to hold the results.
In two words: a list comprehension works like range (it allocates the list and fills it with data right away), while a generator expression works like xrange (it generates the next item on demand).
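To make the range/xrange analogy concrete, here's a tiny hypothetical illustration (Python 2 syntax, to match the question's code):
squares_list = [x * x for x in xrange(5)]   # builds the whole list immediately
squares_gen = (x * x for x in xrange(5))    # produces items lazily, on demand
print squares_list       # [0, 1, 4, 9, 16]
print squares_gen        # <generator object <genexpr> at 0x...> -- print does not evaluate it
print list(squares_gen)  # [0, 1, 4, 9, 16] -- forcing evaluation, as suggested above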