write better code instead of 2 for loops - python

I have two for loops and I want to write this more cleanly, e.g. with a list comprehension or a lambda or something else. How can I achieve the same result?
For example:
import glob
import os

filename = ['a.txt', 'b.txt', 'c.txt']
for files in filename:
    for f in glob.glob(os.path.join(source_path, files)):
        print f
        # ... some processing ...

Your code is perfectly fine as it is. You can only make it less legible by introducing unnecessarily complex constructs.

You can compress the two for loops into a single generator expression*, with a new for loop to extract the file names from it.
for f in (f_ for files in filename
          for f_ in glob.glob(os.path.join(source_path, files))):
    print f
    # ...
As the other answer said, this is not better, this is worse and you shouldn't use it (I'm not sure that's enough emphasis!). It is far harder to understand what is going on, and probably has little performance benefit (in fact, the extra layers of indirection mean it is likely to be slower).
(* basically equivalent to a list comprehension, but better in situations like this, since it is evaluated lazily instead of building the whole list in memory.)

I would do it like below. The reason is that you can then separate the search-pattern construction, the searching, and the file processing, which makes the code easier to extend since the three steps are independent.
If your system is slightly exotic (e.g. a distributed network drive), the line that combines glob and os.path.join can be a nasty one. Although, as others have mentioned, two loops are perfectly OK.
filename = ['a.txt', 'b.txt', 'c.txt']
searchPatterns = [os.path.join(source_path, files) for files in filename]
searchResults = [glob.glob(pattern) for pattern in searchPatterns]
fileListFlat = sum(searchResults, [])
for file in fileListFlat:
    print file

A long expression is hard to read when you have to scan to the right and wrap back around. It is even worse when many local variables, lambdas and comprehensions are packed into a few lines, separated only by parens and commas. Use them only if your code does not get longer and more complex.
For your case, I prefer to extract a find helper as a trade-off. But just as the top answer said, your code is fine enough.
from itertools import chain

find = lambda p: glob.glob(os.path.join(source_path, p))
# chain.from_iterable flattens the per-pattern match lists into one stream
for file in chain.from_iterable(map(find, filename)):
    """
    =) I like the one-level indentation here.
    =( I don't know which file pattern is currently in use,
       unless I use a longer expression...
    """

Related

Efficiently search for many different strings in large file

I am trying to find a fast way of searching for strings in a file. First of all, I don't have only one string to find: I have a list of 1900 strings to find in a file that is 150 MB. So basically I am opening the file and looping 1900 times to find all occurrences of each string. Here are some of the attributes of my search:
The file to be searched is 150 MB; it's a text file.
I need to find all occurrences of 1900 strings in the file, which means I loop over the entire file 1900 times.
It's not a simple search; I have to use a regex to find the strings.
In a few cases, I need the line above and the line below the one where I found the search string, so I need to use file.readlines(), not file.read().
In a few cases I also have to replace the found string with a new string.
First I am trying to find the best way to search the file. My code is taking too long, and I am not sure this is the best way to do it:
# searchstrings is a list of 1900 strings
file = open("mytextfile.txt", "r")
for line in file:
    for i in range(len(searchstrings)):
        if searchstrings[i] in line:
            print(line)
file.close()
This code does the job, but it's extremely slow. It also does not give me the option of picking the line above or below the one where the search string is found.
Another piece of code I am using, to replace strings, is below. It is also extremely slow. Here I am using a regex:
file = open("mytextfile.txt", "r")
file_data = file.read()
#searchstrings is list of 1900 strings
#replacestrings is list of 1900 strings that needs to be replaced
for i in range(len(searchstrings)):
src_str = re.compile(searchstrings[i], re.IGNORECASE)
file_data = src_str.sub(replacestrings[i], file_data)
file.close()
I know the performance of the code depends on the computing power as well; however, I just want to know the best way to write this code so that it runs at optimum speed on the given hardware. I would also like to know how to time the program's execution.
I like Unix commands; they are fun, fast and efficient.
import re, sys
# Python 2: map() is eager here. Usage: python script.py PATTERN < file
map(sys.stdout.write,
    (string_x for string_x in sys.stdin if re.search(sys.argv[1], string_x)))
A few observations.
For idiomatic Python, you usually want
for string in searchstrings:
    ...
instead of
for i in range(len(searchstrings)):
    searchstrings[i]
and with open(filename) as f: ... instead of open()/close(). The with statement will close the file automatically.
When you want to replace any of several strings with a regex, you can do
re.sub('|'.join(YOUR_STRINGS), replacement, text)
because | is the regex symbol for "or", instead of looping over them all individually. (If your strings can contain regex metacharacters, escape them with re.escape first.)
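A minimal sketch of that idea applied to the search loop (searchstrings and the file name come from the question; re.escape guards against strings that contain regex metacharacters):
import re

# Build one alternation pattern from all 1900 strings,
# so the file is scanned in a single pass.
pattern = re.compile("|".join(re.escape(s) for s in searchstrings),
                     re.IGNORECASE)

with open("mytextfile.txt") as f:
    for line in f:
        if pattern.search(line):
            print(line, end="")
One compiled pattern means a single pass over the file instead of 1900.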
For performance, I might try switching from CPython to PyPy. PyPy is another implementation of the same language but often much faster.
On the other hand, if that's really all your program is supposed to do, you might want to use a dedicated tool for the job, like Ag or RipGrep which has already been optimized for this job. Possibly through the subprocess.run() function if you're working in Python.
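For example, a minimal sketch (it assumes ripgrep is installed and on the PATH, and patterns.txt is a hypothetical file holding one search string per line):
import subprocess

# rg -f FILE reads one pattern per line and searches for all of them at once.
result = subprocess.run(["rg", "-f", "patterns.txt", "mytextfile.txt"],
                        capture_output=True, text=True)
print(result.stdout)
And as for the question's last point, timing the execution: a simple way with only the standard library is to bracket the work with time.perf_counter() (for micro-benchmarks, the timeit module is the better tool):
import time

start = time.perf_counter()
# ... run the search here ...
print("took %.2f seconds" % (time.perf_counter() - start))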

Proper way of breaking line accessing dictionary

I am trying to conform to PEP 8 directives and therefore to break the following line:
config_data_dict['foo']['bar']['foobarfoo'] \
    ['barfoobar'] = something_else
However, I am now getting the following warning, just after the ['foobarfoo'] part:
whitespace before '[' pep8(E211)
How should I properly break a line like the above (assuming I cannot break it around the =)?
Parentheses seem to work:
(config_data_dict['foo']['bar']['foobarfoo']
    ['barfoobar']) = something_else
This also seems to be the recommended style according to PEP8:
The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.
You could break inside the [...] (though I'm not really sure which would be considered more readable: breaking after the [, or before the ], or both):
config_data_dict[
    'foo'][
    'bar'][
    'foobarfoo'][
    'barfoobar'] = something_else
As a general rule, either put all the keys on the same line, or put each key on a separate line. This applies to the explicit parenthesization used in other answers, for example,
(config_data_dict
    ['foo']
    ['bar']
    ['foobarfoo']
    ['barfoobar']) = something_else
However, I would just use one or more temporary variables:
d = config_data_dict['foo']['bar']['foobarfoo']
d['barfoobar'] = something_else
You have lots of ways to do this, but here's my opinion on the two "best" ways (I say "best" loosely, as opinion and context apply).
Using operator.setitem (this is almost the same as described in this answer, but to me it looks much more readable since it doesn't have the leading parentheses):
from operator import setitem

setitem(config_data_dict['foo']['bar']['foobarfoo'],
        'barfoobar', something_else)
Or some prefer the reduce method with operator.getitem (much like this answer; the reduce approach can be easier on the eyes if you're getting REALLY deeply nested, but I prefer the former as it doesn't add yet another somewhat unnecessary function into the mix):
from functools import reduce
from operator import getitem
path = ['foo', 'bar', 'foobarfoo']
reduce(getitem, path, config_data_dict)['barfoobar'] = something_else
Or, to allow for nicer indentation, use setitem here too:
setitem(reduce(getitem, path, config_data_dict),
        'barfoobar', something_else)
All of that being said, you could also use shorter variable names; for instance, config_data_dict really doesn't need the dict at the end. The suffix does make the variable more descriptive, although most readers should easily be able to discern that it's a dict from how you're accessing it.

python speed up this regex sub

import re

p = re.compile('>.*\n')
p.sub('', text)
I want to delete all lines starting with a '>'. I have a really huge file (3 GB) that I process in chunks of 250 MB, so the variable "text" is a string of about 250 MB. (I tried different chunk sizes, but the performance was always the same for the complete file.)
Now, can I speed up this regex somehow? I tried multi-line matching, but it was a lot slower. Or are there even better ways?
(I already tried splitting the string and then filtering out the lines like this, but it was also slower. I also tried a lambda instead of def del_line; that might not be working code, it's just from memory:)
def del_line(x): return x[0] != '>'

def func():
    # ...
    text = file.readlines(chunksize)
    text = filter(del_line, text)
    # ...
EDIT:
As suggested in the comments, I also tried walking line by line:
text = []
for line in file:
    if line[0] != '>':
        text.append(line)
text = ''.join(text)
That's also slower: it needs ~12 sec, while my regex needs ~7 sec. (Yeah, that's fast, but it must also run on slower machines.)
EDIT: Of course, I also tried str.startswith('>'); it was slower...
If you have the chance, running grep as a subprocess is probably the most pragmatic choice.
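For this task it could look something like the sketch below (the file names are placeholders; grep -v '^>' keeps every line that does not start with '>'):
import subprocess

# Let grep do the filtering and send the kept lines to the output file.
with open('output.txt', 'w') as out:
    subprocess.run(['grep', '-v', '^>', 'input.txt'], stdout=out)
Note that subprocess.run is Python 3.5+; on older Pythons, subprocess.call works the same way here.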
If for whatever reason you can't rely on grep, you could try implementing some of the "tricks" that make grep fast. From the author himself, you can read about them here: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
At the end of the article, the author summarizes the main points. The one that stands out to me the most is:
Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
The idea would be to load the entire file into memory and iterate over it at the byte level instead of the line level. Only when you find a match do you look for the surrounding line boundaries and delete the line.
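A rough sketch of that idea in Python (my own interpretation, not grep's actual implementation; drop_marked_lines is a made-up helper): search the whole buffer for the '\n>' marker and only locate line boundaries around actual hits:
def drop_marked_lines(text):
    # Remove lines starting with '>' without splitting
    # the whole buffer into lines first.
    out = []
    pos = 0
    n = len(text)
    while pos < n:
        if text.startswith('>', pos):
            # We are at the start of a marked line: skip past it.
            end = text.find('\n', pos)
            pos = n if end == -1 else end + 1
        else:
            # Copy everything up to the start of the next marked line.
            nxt = text.find('\n>', pos)
            if nxt == -1:
                out.append(text[pos:])
                pos = n
            else:
                out.append(text[pos:nxt + 1])
                pos = nxt + 1
    return ''.join(out)
Whether this beats the regex depends on how common the '>' lines are, so benchmark it on your data.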
You say you have to run this on other computers. If it's within your reach and you are not doing it already, consider running it on PyPy instead of CPython (the default interpreter). This may (or may not) improve the runtime by a significant factor, depending on the nature of the program.
Also, as some comments already mentioned, benchmark against the actual grep to get a baseline of how fast you can reasonably go. Get it via Cygwin if you are on Windows; it's easy enough.
Is this not faster?
def cleanup(chunk):
    return '\n'.join(st for st in chunk.split('\n') if not (st and st[0] == '>'))
EDIT: yeah, that is not faster. That's twice as slow.
Maybe consider using subprocess and a tool like grep, as suggested by Ryan P. You could even take advantage of multiprocessing.
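A sketch of the multiprocessing idea (it assumes each chunk ends on a line boundary, which the plain read below does not guarantee by itself, and the file names are placeholders):
from multiprocessing import Pool

CHUNK = 250 * 1024 * 1024  # 250 MB, as in the question

def cleanup(chunk):
    # Drop lines starting with '>'; splitlines(True) keeps the newlines.
    return ''.join(line for line in chunk.splitlines(True)
                   if not line.startswith('>'))

if __name__ == '__main__':
    with open('input.txt') as f, open('output.txt', 'w') as out:
        chunks = iter(lambda: f.read(CHUNK), '')
        with Pool() as pool:
            for cleaned in pool.imap(cleanup, chunks):
                out.write(cleaned)
Shipping 250 MB strings between processes is itself expensive, so benchmark this against the single-process version before committing to it.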

Python - How to style this line of code?

First off, I'm not sure if SO is the right place for this question, so feel free to move it somewhere more appropriate if necessary.
cmd_folder = os.path.realpath(os.path.abspath(os.path.split(inspect.getfile( inspect.currentframe() ))[0]))
I have this line of code. The Python PEP 8 document recommends limiting lines to 79 characters to preserve readability on smaller screens.
What is the most elegant way to style this line of code to fit the PEP recommendations?
cmd_folder = os.path.realpath(os.path.abspath(os.path.split
                              (inspect.getfile
                              ( inspect.currentframe() ))[0]))
Is this the most appropriate way, or is there a better one that I have not thought of?
I would say that in the case of your example code, it would be more appropriate to split it up into individual operations, as opposed to making the code harder to read.
aFile = inspect.getfile(inspect.currentframe())
cmd_folder = os.path.realpath(os.path.abspath(os.path.split(aFile)[0]))
It should not be such a chore to trace the opening and closing of all those parentheses to figure out what is happening with each temp variable. Variable names can help with clarity, by naming the intention/type of the results.
If it is two, maybe three, nested calls, I might break one call across newlines, but definitely not when you have a ton of parentheses and list indexing squished in between. Normally I am only more inclined to do that with chained calls like foo.bar().biz().baz(), since the chain flows left to right.
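For illustration (foo, bar, biz and baz are made-up names), such a chain breaks naturally like this:
result = (foo.bar()
              .biz()
              .baz())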
Always assume some poor random developer will have to read your code tomorrow.
I generally agree with the answer provided by jdi (split it into several expressions), but I would like to show another aspect of this issue.
In general, if you are just trying to properly break and indent this line, you should also follow PEP 8's rules on indentation:
Use 4 spaces per indentation level.
For really old code that you don't want to mess up, you can continue to use 8-space tabs.
Continuation lines should align wrapped elements either vertically using Python's implicit line joining inside parentheses, brackets and braces, or using a hanging indent. When using a hanging indent the following considerations should be applied; there should be no arguments on the first line and further indentation should be used to clearly distinguish itself as a continuation line.
[several examples follow]
So in your case it could look like this:
#---------------------------------------------------------79th-column-mark--->|
cmd_folder = os.path.realpath(
    os.path.abspath(os.path.split(inspect.getfile(inspect.currentframe()))[0]))
But as I mentioned at the beginning, and as PEP20 (The Zen of Python) mentions:
Flat is better than nested.
Sparse is better than dense.
Readability counts.
thus you should definitely split your code into a few expressions, as jdi notes.
I tend to do it like this:
cmd_folder = os.path.realpath(os.path.abspath(os.path.split(
    inspect.getfile(inspect.currentframe()))[0]))
Open-parens at the end of the first line, and indentation on the continuation lines. Ultimately it boils down to what you think is more aesthetic.
There's also no shame in doing part of the computation and saving the result in a variable, like this:
f = inspect.getfile(inspect.currentframe())
cmd_folder = os.path.realpath(os.path.abspath(os.path.split(f)[0]))
If you really insist on keeping it one expression, I'd go with an "extreme" simply because it actually looks nice:
cmd_folder = os.path.realpath(
    os.path.abspath(
        os.path.split(
            inspect.getfile(
                inspect.currentframe()
            ))[0]
    ))

Awk, bash or python for converting a regular file?

I have a text file with lots of lines and this structure:
[('name_1a',
  'name_1b',
  value_1),
 ('name_2a',
  'name_2b',
  value_2),
 .....
 .....
 ('name_XXXa',
  'name_XXXb',
  value_XXX)]
I would like to convert it to:
name_1a, name_1b, value_1
name_2a, name_2b, value_2
......
name_XXXa, name_XXXb, value_XXX
I wonder what would be the best way to do it: awk, Python or bash.
Thanks
Jose
Have you tried evaluating it with Python? It looks like a list of tuples to me.
eval(your_string)
Note, it's massively unsafe! If there's code in there to delete your hard disk, evaluating it will run that code!
I would like to use Python:
lines = open('filename.txt', 'r').readlines()
n = len(lines)  # assumes n % 3 == 0
for i in range(0, n, 3):
    # strip quotes, parentheses, brackets, commas and line endings
    name1 = lines[i].strip("'(),[]\n\r")
    name2 = lines[i+1].strip("'(),[]\n\r")
    value = lines[i+2].strip("'(),[]\n\r")
    print name1, name2, value
It looks like legal Python. You might be able to just import it as a module and then write it back out after formatting it.
Oh boy, here is a job for ast.literal_eval:
(literal_eval is safer than eval, since it restricts the input string to literals such as strings, numbers, tuples, lists, dicts, booleans and None.)
import ast

filename = 'in'
with open(filename, 'r') as f:
    contents = f.read()

data = ast.literal_eval(contents)
for elt in data:
    print(', '.join(map(str, elt)))
Here's one way to do it with (g)awk:
$ awk -vRS=")," ' { gsub(/\n|[\047\]\[)(]/,"") } 1' file
name_1a,name_1b,value_1
name_2a,name_2b,value_2
name_XXXa,name_XXXb,value_XXX
Awk is typically line-oriented, and bash is a shell with a limited number of string-manipulation functions. It really depends on where your strength as a programmer lies, but all other things being equal, I would choose Python.
Did you ever consider that by redirecting the time it took to post this on SO, you could have had it done?
"AWK is a language for processing
files of text. A file is treated as a
sequence of records, and by default
each line is a record. Each line is
broken up into a sequence of fields,
so we can think of the first word in a
line as the first field, the second
word as the second field, and so on.
An AWK program is of a sequence of
pattern-action statements. AWK reads
the input a line at a time. A line is
scanned for each pattern in the
program, and for each pattern that
matches, the associated action is
executed." - Alfred V. Aho[2]
Asking what the best language is for a given task is a very different question from asking "what is the best way of doing a given task in a particular language". The first, which is what you're asking, is in most cases entirely subjective.
Since this is a fairly simple task, I would suggest going with what you know (unless you're doing this for learning purposes, which I doubt).
If you know any of the languages you suggested, go ahead and solve it in a matter of minutes. If you know none of them, now enters the subjective part: I would suggest learning Python, since it's so much more fun than the other two ;)
If the values are legal Python values, you can take advantage of eval(), since your data is a legal Python data structure. The following works if the values are integers; otherwise you might have to massage the print call a bit:
input = """[('name_1a',
'name_1b',
1),
('name_2a',
'name_2b',
2),
('name_XXXa',
'name_XXXb',
3)]"""
for e in eval(input):
print '%s,%s,%d' % e
P.S. Using eval() is quite controversial, since it will execute any valid Python code that you pass into it, so take care.
