Improve efficiency in Python matching - python

We have the following input, and we would like to keep rows that share the same value in the "APP_ID" column (the 4th column) when their "Category" column (the 18th column) values are one "Cell" and one "Biochemical", or one "Cell" and one "Enzyme".
A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-2 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
The ideal output will be
A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
"APP-1" is kept because their column 3 are the same and their Category are one "Cell" and the other one is "Enzyme". The same thing for "APP-3", which has one "Cell" and the other one is "Biochemical" in its "Category" column.
The following attempt could do the trick:
import os

App = ["1"]
for a in App:
    outname = "App_" + a + "_target_overlap.csv"
    out = open(outname, 'w')
    ticker = 0
    cell_comp_id = []
    final_comp_id = []
    # make compound with cell activity (to a target) list first
    filename = "App_" + a + "_target_Detail_average.csv"
    if os.path.exists(filename):
        file = open(filename)
        line = file.readlines()
        if ticker == 0:  # Deal with the title
            out.write(line[0])
            ticker = ticker + 1
        for c in line[1:]:
            c = c.split(',')
            if c[17] == " Cell ":
                cell_comp_id.append(c[3])
            else:
                cell_comp_id = list(set(cell_comp_id))
    # now that we have the list of compounds with cell activity, we search the Bio and Enz and make one final compound list
    if os.path.exists(filename):
        for c in line[1:]:
            temporary_line = c  # for output_temp
            c = c.split(',')
            for comp in cell_comp_id:
                if c[3] == comp and c[17] == " Biochemical ":
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                elif c[3] == comp and c[17] == " Enzyme ":
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                else:
                    final_comp_id = list(set(final_comp_id))
    # After we obtain a final compound list for target a, we go through the whole csv again to output the cell data
    filename = "App_" + a + "_target_Detail_average.csv"
    if os.path.exists(filename):
        for c in line[1:]:
            temporary_line = c  # for output_temp
            c = c.split(',')
            for final in final_comp_id:
                if c[3] == final and c[17] == " Cell ":
                    out.write(str(temporary_line))
    out.close()
When the input file is small (tens of thousands of lines), this script finishes its job in reasonable time. However, when the input files grow to millions or billions of lines, it takes forever (days...). I think the issue is that we first create a list of APP_IDs whose 18th column is "Cell", and then compare this "Cell" list (maybe half a million entries) against the whole file (say a million lines): if an APP_ID appears in both the Cell list and the whole file, and the 18th column of that row is "Enzyme" or "Biochemical", we keep the row. This step seems to be very time consuming.
I am thinking that maybe preparing "Cell", "Enzyme" and "Biochemical" dictionaries and comparing them would be faster? Does anyone have a better way to process this? Any example/comment will be helpful. Thanks.
We use Python 2.7.6.

reading the file(s) efficiently
One big problem is that you're reading the whole file in one go using readlines. This requires loading it all into memory at once, and I doubt you have that much memory available.
Try:
with open(filename) as fh:
    out.write(fh.readline())  # ticker
    for line in fh:  # iterate through lines 'lazily', reading as you go
        c = line.split(',')
style code to start with. This should help a lot. Here, in context:
# make compound with cell activity (to a target) list first
if os.path.exists(filename):
    with open(filename) as fh:
        out.write(fh.readline())  # ticker
        for line in fh:
            cols = line.split(',')
            if cols[17] == " Cell ":
                cell_comp_id.append(cols[3])
The with open(...) as syntax is a very common Python idiom which automatically handles closing the file when you finish the with block, or if there is an error. Very useful.
sets
Next thing is, as you suggest, using sets a little better.
You don't need to recreate the set each time; you can just update it to add items. Here's some example set code (written in the Python interpreter style, >>> at the beginning means it's a line of stuff to type - don't actually type the >>> bit!):
>>> my_set = set()
>>> my_set
set([])
>>> my_set.update([1,2,3])
>>> my_set
set([1,2,3])
>>> my_set.update(["this","is","stuff"])
>>> my_set
set([1,2,3,"this","is","stuff"])
>>> my_set.add('apricot')
>>> my_set
set([1,2,3,"this","is","stuff","apricot"])
>>> my_set.remove("is")
>>> my_set
set([1,2,3,"this","stuff","apricot"])
So you can add items to, and remove them from, a set without creating a new set from scratch (which you are doing each time with the cell_comp_id=list(set(cell_comp_id)) bit).
You can also get differences, intersections, etc:
>>> set(['a','b','c','d']) & set(['c','d','e','f'])
set(['c','d'])
>>> set([1,2,3]) | set([3,4,5])
set([1,2,3,4,5])
See the docs for more info.
So let's try something like:
cells = set()
enzymes = set()
biochemicals = set()

with open(filename) as fh:
    out.write(fh.readline())  # ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]
        if row_category == ' Cell ':
            cells.add(row_id)
        elif row_category == ' Biochemical ':
            biochemicals.add(row_id)
        elif row_category == ' Enzyme ':
            enzymes.add(row_id)
Now you have sets of cells, biochemicals and enzymes. You only want the intersections of these, so:
cells_and_enzymes = cells & enzymes
cells_and_biochemicals = cells & biochemicals
You can then go through all the files again and simply check if row_id (or c[3]) is in either of those sets, and if so, print it.
You can actually combine those two sets even further:
cells_with_enz_or_bio = cells_and_enzymes | cells_and_biochemicals
which would be the cells which have enzymes or biochemicals.
So then when you run through the files the second time, you can do:
if row_id in cells_with_enz_or_bio:
    out.write(line)
after all that?
Just using those suggestions might be enough to get you by. You still are storing in memory the entire sets of cells, biochemicals and enzymes, though. And you're still running through the files twice.
So there are two ways we could potentially speed it up, while still staying with a single Python process. I don't know how much memory you have available; if you run out of memory, then it might possibly slow things down slightly.
reducing sets as we go.
If you do have a million records, and 800000 of them are pairs (have a cell record and a biochemical record) then by the time you get to the end of the list, you're storing 800000 IDs in sets. To reduce memory usage, once we've established that we do want to output a record, we could save that information (that we want to print the record) to a file on disk, and stop storing it in memory. Then we could read that list back later to figure out which records to print.
Since this does increase disk IO, it could be slower. But if you are running out of memory, it could reduce swapping, and thus end up faster. It's hard to tell.
with open('to_output.tmp', 'a') as to_output:
    for a in App:
        # ... do your reading thing into the sets ...
        if row_id in cells and (row_id in biochemicals or row_id in enzymes):
            to_output.write('%s,' % row_id)
            # discard() rather than remove() so a missing id doesn't raise KeyError
            cells.discard(row_id)
            biochemicals.discard(row_id)
            enzymes.discard(row_id)
Once you've read through all the files, you have a file (to_output.tmp) which contains all the ids that you want to keep. So you can read that back into Python:
with open('to_output.tmp') as ids_file:
    ids_to_keep = set(ids_file.read().split(','))
which means you can then on your second run through the files simply say:
if row_id in ids_to_keep:
    out.write(line)
using dict instead of sets:
If you have plenty of memory, you could bypass all of that and use dicts, rather than sets, for storing the data, which would let you run through the files only once.
cells = {}
enzymes = {}
biochemicals = {}

with open(filename) as fh:
    out.write(fh.readline())  # ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]
        if row_category == ' Cell ':
            cells[row_id] = line
        elif row_category == ' Biochemical ':
            biochemicals[row_id] = line
        elif row_category == ' Enzyme ':
            enzymes[row_id] = line
        if row_id in cells and row_id in biochemicals:
            out.write(cells[row_id])
            out.write(biochemicals[row_id])
            if row_id in enzymes:
                out.write(enzymes[row_id])
        elif row_id in cells and row_id in enzymes:
            out.write(cells[row_id])
            out.write(enzymes[row_id])
The problem with this method is that if any rows are duplicated, it will get confused.
If you are sure that the input records are unique, and that they either have enzyme or biochemical records, but not both, then you could easily add del cells[row_id] and the appropriate others to remove rows from the dicts once you've printed them, which would reduce memory usage.
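For instance, a minimal sketch of that clean-up step, using dict.pop with a default instead of a bare del so that a key which was never stored doesn't raise a KeyError:

if row_id in cells and row_id in biochemicals:
    out.write(cells[row_id])
    out.write(biochemicals[row_id])
    # once the pair has been written we no longer need it in memory
    cells.pop(row_id, None)
    biochemicals.pop(row_id, None)
    enzymes.pop(row_id, None)
elif row_id in cells and row_id in enzymes:
    out.write(cells[row_id])
    out.write(enzymes[row_id])
    cells.pop(row_id, None)
    enzymes.pop(row_id, None)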
I hope this helps :-)

A technique I have used to deal with massive files quickly in Python is to use the multiprocessing library to split the file into large chunks, and process those chunks in parallel in worker subprocesses.
Here's the general algorithm:
Based on the amount of memory you have available on the system that will run this script, decide how much of the file you can afford to read into memory at once. The goal is to make the chunks as large as you can without causing thrashing.
Pass the file name and chunk beginning/end positions to subprocesses, which will each open the file, read in and process their sections of the file, and return their results.
Specifically, I like to use a multiprocessing pool, then create a list of chunk start/stop positions, then use the pool.map() function. This will block until everyone has completed, and the results from each subprocess will be available if you catch the return value from the map call.
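For context, here is a minimal sketch of that driver side. The worker function process_chunk and the chunk_positions helper are assumed names, and a real chunker should snap boundaries to the next newline so no line is split across two chunks:

import multiprocessing
import os

def chunk_positions(file_name, num_chunks):
    # Hypothetical helper: split the file into roughly equal byte ranges.
    size = os.path.getsize(file_name)
    step = size // num_chunks
    return [(i * step, min((i + 1) * step, size)) for i in range(num_chunks)]

if __name__ == '__main__':
    file_name = "App_1_target_Detail_average.csv"  # assumed input file
    pool = multiprocessing.Pool()  # defaults to one worker per CPU
    jobs = [(file_name, start, end) for start, end in chunk_positions(file_name, 8)]
    # pool.map blocks until every chunk is done and returns one result per chunk, in order.
    results = pool.map(process_chunk, jobs)  # process_chunk is the worker sketched below
    pool.close()
    pool.join()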
For example, you could do something like this in your subprocesses:
import operator

# assume we have passed in a byte position to start and end at, and a file name:
with open("fi_name", 'r') as fi:
    fi.seek(chunk_start)
    chunk = fi.readlines(chunk_end - chunk_start)

retriever = operator.itemgetter(3, 17)  # extracts only the elements we want
APPIDs = {}

for line in chunk:
    ID, category = retriever(line.split(','))  # the file is comma-separated
    try:
        APPIDs[ID].append(category)  # we've seen this ID before, add category to its list
    except KeyError:
        APPIDs[ID] = [category]  # we haven't seen this ID before - make an entry

# APPIDs entries will look like this:
#
# <APPID> : [list of categories]

return APPIDs
In your main process, you would retrieve all the returned dictionaries and resolve duplicates or overlaps, then output something like this:
for ID, categories in APPIDs.iteritems():
    if ('Cell' in categories) and ('Biochemical' in categories or 'Enzyme' in categories):
        pass  # print or whatever
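Merging the per-chunk dicts returned by pool.map back into that single APPIDs mapping could look like this (a sketch; results is the return value of the map call):

APPIDs = {}
for partial in results:  # one dict per chunk, as returned by pool.map
    for ID, categories in partial.iteritems():
        APPIDs.setdefault(ID, []).extend(categories)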
A couple of notes/caveats:
Pay attention to the load on your hard disk/SSD/wherever your data is located. If your current method is already maxing out its throughput, you probably won't see any performance improvements from this. You can try implementing the same algorithm with threading instead.
If you do get a heavy hard disk load that's NOT due to memory thrashing, you can also reduce the number of simultaneous subprocesses you're allowing in the pool. This will result in fewer read requests to the drive, while still taking advantage of truly parallel processing.
Look for patterns in your input data you can exploit. For example, if you can rely on matching APPIDs being next to each other, you can actually do all of your comparisons in the subprocesses and let your main process hang out until it's time to combine the subprocess data structures.
TL;DR
Break your file up into chunks and process them in parallel with the multiprocessing library.

Related

How to remove duplicate lines in two large text files by number of appearance?

I have two large text files with the same number of lines (9 million lines, about 12 GB each), so they cannot be loaded into memory.
Lines in these text files, presented as a table, look like this:
I need to remove duplicates in A.txt and B.txt and leave only the most frequent combination for each line from A.txt. If two combinations are tied for most repetitions, the program should choose the one that appeared first in the text and remove all others.
In the real files, lines aren't just (A,B,C,D,...1,6,7,..); each line has about 2000 characters.
The final text files, presented as a table, should look like this:
How can you avoid reading 2 × 12 GB into memory at once, but still process all the data?
By loading those 24 GB chunk by chunk, and discarding data you don't need anymore as you go. As your files are line-based, reading and processing line-by-line seems prudent. Having 4000-ish characters in memory at once shouldn't pose a problem on modern personal computers.
Combining the files
You want the end result ordered (or maybe even sorted) by the line contents of A.txt. For not losing the relationship between the lines in A.txt and B.txt when changing their order, we need to combine their contents first.
Do that by
opening both files without reading them yet
opening a new file AB.txt for writing
repeating the following until all of A.txt and B.txt have been processed:
reading a line of A.txt
reading a line of B.txt
append the combined content as a new line to AB.txt
discard what we've read so far from memory
If you know that a certain character (say, '\t') cannot occur in A.txt you can use that as the separator:
with \
        open("A.txt") as a_file, \
        open("B.txt") as b_file, \
        open("AB.txt", "w") as ab_file:
    for a_line, b_line in zip(a_file, b_file):
        # get rid of the line endings, whatever they are
        a_line, = a_line.splitlines()
        b_line, = b_line.splitlines()
        # output the combined content to AB.txt
        print(f"{a_line}\t{b_line}", file=ab_file)
(Note that this relies on zip acting "lazy" and returning a generator rather than reading the files completely and returning a huge list as it would in Python 2.)
If all lines in A.txt have the same fixed length you don't need any separator at all. (For keeping your sanity while debugging, you might still want to use one, though.) If you don't know of any character that can't occur in A.txt, you can create a csv.writer and use its writerow method to write the lines to AB.txt. It will take care of the required escaping or quoting for you.
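For instance, a minimal sketch of that csv.writer variant, assuming the same file names as above:

import csv

# Combine A.txt and B.txt into AB.txt; csv.writer handles quoting if a
# line happens to contain the delimiter.
with open("A.txt") as a_file, \
        open("B.txt") as b_file, \
        open("AB.txt", "w", newline="") as ab_file:
    writer = csv.writer(ab_file)
    for a_line, b_line in zip(a_file, b_file):
        writer.writerow([a_line.rstrip("\n"), b_line.rstrip("\n")])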
You might wonder where the step
discard what we've read so far from memory
is implemented. This happens implicitly here, because the only variables that hold data from the files, a_line and b_line are overwritten for each iteration.
Ordering the combined file
For ordering the whole file we have to completely load it into memory, right?
Nope. Well, actually yes, but again, not all of it at once. We can use external sorting. You can try to implement this yourself in Python, or you can just use the UNIX command line tool sort (manual page), which does just that. On a Linux or macOS system, you already have it available. On Windows, it should be included in any UNIX emulator such as Cygwin or MinGW. If you already have Git installed with the default Git installer for Windows, you can use UNIX sort from the included "Git Bash".
Note that due to the order of our content within each line, the file will be sorted first by the content that came from A.txt, then (if that's the same) by the content that came from B.txt.
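As a concrete example, invoking UNIX sort from Python could look like this (the output file name AB_sorted.txt is an assumption):

import subprocess

# External sort on disk: `sort` keeps only a bounded amount of data in
# memory and spills to temporary files as needed.
subprocess.check_call(["sort", "-o", "AB_sorted.txt", "AB.txt"])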
Counting
Once you have the sorted combined file, you can again process it line-by-line but you have to keep some data around between the lines.
What we want to do is:
For each block of subsequent lines with the same A-content:
within that, for each block of subsequent lines with the same B-content:
count its lines
keep a tally of which B-content seen so far (within the A-content block) has had the most lines
at the end of the A-content block: output a line with the A-content and the most frequent B-content for that A-content
Because we can rely on the ordering we imposed above, this will produce the wanted result.
Something like this should work:
read a line
split it into A-content and B-content
if the A-content is the same as on the previous line*:
    if the B-content is the same as on the previous line*:
        increase the counter for the current a-b-content combination
    else (i.e., if the B-content is different than on the previous line):
        store which B-content is the most seen so far and its tally
        (it's either the previously most seen B-content or the one from the previous line)
        reset the counter for the current a-b-content combination
        increase that counter by one (for the current line)
        store the B-content somewhere, so we can compare it to that of the next line in the next iteration
else (i.e., if the A-content is different than on the previous line)**:
    output the A-content of the previous line and the most seen B-content from the previous line
    reset the counter for the current a-b-content combination
    reset the information on what the most seen B-content was and its tally
    store A-content and B-content of the current line, so they can be compared to those of the next line in the next iteration
repeat until the whole file has been processed
* for the first line, this is implicitly false
** also do this when you've reached the end of the file
Actually implementing this in Python is left as an exercise for the reader. Note that you'll have to define some of the variables used before the step where they're first mentioned in the description above, so that they have the right scope.
Note that you can also do the counting step more cleverly than described here using capabilities of Python's comprehensive standard library. See Heap Overflow's answer for a great example.
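For reference, here is a minimal streaming sketch of that counting pass over the externally sorted, tab-separated file; the file name AB_sorted.txt is an assumption and the first-appearance tie-breaking rule is glossed over:

from collections import Counter
from itertools import groupby

# Only one A-block's Counter is held in memory at a time.
with open("AB_sorted.txt") as sorted_file:
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_file)
    for a_content, block in groupby(pairs, key=lambda pair: pair[0]):
        counts = Counter(b_content for _, b_content in block)
        most_common_b, _count = counts.most_common(1)[0]
        print(f"{a_content}\t{most_common_b}")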
This option would not require the whole file in memory, but will need to keep a dictionary with A as keys and multiple dicts with B as keys. That can be simplified if you could hash or categorize the values (assigning an int value to each unique A and for each unique B).
Edit: Changed the dictionaries to use hashed keys to reduce the memory footprint at the expense of CPU, and changed the output to show the lines to keep (as the original A and B values are obfuscated).
Assuming my file is:
Lines,A.txt,B.txt
1,A,1
2,A,1
3,A,2
4,B,1
5,B,2
from collections import Counter
from csv import DictReader

_ = {}
_file = DictReader(open('abc.txt', 'r'), delimiter=',')
hash_to_line = {}

# Count how often each (A, B) combination occurs, keyed by hashes to save memory,
# and remember the first line number on which each combination was seen.
for row in _file:
    a = hash(row['A.txt'])
    b = hash(row['B.txt'])
    if a not in _:
        _[a] = Counter()
    if b not in _[a]:
        hash_to_line[(a, b)] = row['Lines']
    _[a][b] += 1

# For each A, pick the B with the highest count and keep its first line.
output = []
for A in _:
    _vals = list(_[A].values())
    _keys = list(_[A].keys())
    _max = max(_vals)
    output.append(hash_to_line[A, _keys[_vals.index(_max)]])
print('lines to keep:', output)
Replace the print with the appropriate storing of results.
Building on das-g's answer, counting the combined and sorted lines with nested groupby:
from itertools import groupby
from operator import itemgetter

# Combine and sort; in reality done like das-g described.
A = 'ABCDACADAEF'
B = '16781918216'
combined_and_sorted = sorted(zip(A, B))

# Count and produce the results
for a, agroup in groupby(combined_and_sorted, itemgetter(0)):
    bcounts = ((b, sum(1 for _ in group))
               for b, group in groupby(agroup, itemgetter(1)))
    print(a, max(bcounts, key=itemgetter(1))[0])
Output:
A 1
B 6
C 7
D 8
E 1
F 6
As I mentioned in the comments, I think using the shelve module would allow you to do what you want.
Here's an example implementation and its output. Note that I added four lines to the end of A.txt and B.txt to make sure the correct value was selected when it wasn't the first one encountered. Also note that I've left in development and debugging code that you will definitely want to remove or disable before running it on really large input files.
As a table, the two input files looked like this:
1 A 1
2 B 6
3 C 7
4 D 8
5 A 1
6 C 9
7 A 1
8 D 8
9 A 2
10 E 1
11 F 6
12 G 5
13 G 7
15 G 7
16 G 7
And, again as a table, here's what the two output files would be if they were rewritten:
1 A 1
2 B 6
3 C 7
4 D 8
5 E 1
6 F 6
7 G 7
import glob
from operator import itemgetter, methodcaller
import os
import shelve

shelf_name = 'temp_shelf'
file_a_name = 'A.txt'
file_b_name = 'B.txt'

with open(file_a_name) as file_a, \
     open(file_b_name) as file_b, \
     shelve.open(shelf_name, flag='n', writeback=False) as shelf:

    for line, items in enumerate(zip(file_a, file_b)):
        x, fx = map(methodcaller('rstrip'), items)  # Remove line endings.
        d = shelf.setdefault(x, {})  # Each shelf entry is a regular dict.
        d[fx] = d.setdefault(fx, 0) + 1  # Update its value.
        shelf[x] = d  # Update shelf.
        print('{!r} -> {!r}'.format(x, shelf[x]))

    # Show what ended up in shelf during development.
    print('\nFinal contents of shelf:')
    for k, d in shelf.items():
        print(' {!r}: {!r}'.format(k, d))

# Change file names to prevent overwriting originals during development.
file_a_name = 'A_updated.txt'
file_b_name = 'B_updated.txt'
comp_name = 'composite.txt'  # Used for development only.

# Update files leaving only most frequent combination.
with open(file_a_name, 'w') as file_a, \
     open(file_b_name, 'w') as file_b, \
     open(comp_name, 'w') as file_c, \
     shelve.open(shelf_name, flag='r') as shelf:

    for line, (k, d) in enumerate(shelf.items(), 1):
        file_a.write(k + '\n')
        fx = sorted(d.items(),  # Rev sort by number of occurrences.
                    key=itemgetter(1), reverse=True)[0][0]  # Most common fx.
        file_b.write(fx + '\n')
        file_c.write('{} {} {}\n'.format(line, k, fx))  # For development only.

# Clean up by removing shelf files.
for filename in glob.glob(shelf_name + '.*'):
    print('removing', filename)
    os.remove(filename)

print('\nDone')
print('\nDone')

Python:Loop through .csv of urls and save it as another column

New to Python; I've read a bunch and watched a lot of videos. I can't get it to work and I'm getting frustrated.
I have a list of links like below:
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
I'm trying to get Python to go to each "URL" and save the download in a folder named after "Location", with the "API" value plus ".las" as the filename.
ex) ......"location"/Section/"API".las
C://.../T32S R29W/Sec.27/15-119-00164.las
The file has hundreds of rows and links to download. I also wanted to implement a sleep function so as not to bombard the servers.
What are some of the different ways to do this? I've tried pandas and a few other methods... any ideas?
You will have to do something like this
import urllib

for link, file_name in zip(links, file_names):  # links and file_names are your own lists
    u = urllib.urlopen(link)
    udata = u.read()
    f = open(file_name + ".las", "wb")  # 'wb': the downloaded payload is binary
    f.write(udata)
    f.close()
    u.close()
If the contents of your file are not what you wanted, you might want to look at a scraping library like BeautifulSoup for parsing.
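Putting the pieces together, here is a minimal sketch (Python 2, to match the code above) that builds the Location/Section/API.las path, downloads each URL, and sleeps between requests; the input file name data.csv and the output root are assumptions:

import csv
import os
import time
import urllib

root_dir = '/tmp/logs'  # assumed output root
with open('data.csv', 'rb') as csvfile:  # assumed input file name
    for row in csv.DictReader(csvfile):
        # "T32S R29W, Sec. 27, SW SW NE" -> ["T32S R29W", "Sec. 27", ...]
        parts = [p.strip() for p in row['Location'].split(',')]
        out_dir = os.path.join(root_dir, parts[0], parts[1])
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        data = urllib.urlopen(row['URL']).read()
        with open(os.path.join(out_dir, row['API'] + '.las'), 'wb') as f:
            f.write(data)
        time.sleep(1)  # be gentle with the server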
This might be a little dirty, but it's a first pass at solving the problem. This is all contingent on each value in the CSV being encompassed in double quotes. If this is not true, this solution will need heavy tweaking.
Code:
import os

csv = """
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
""".strip()  # trim excess space at top and bottom

root_dir = '/tmp/so_test'

lines = csv.split('\n')  # break CSV on newlines
header = lines[0].strip('"').split('","')  # grab first line and consider it the header

lines_d = []  # we're about to perform the core actions, and we're going to store it in this variable

for l in lines[1:]:  # we want all lines except the top line, which is a header
    line_broken = l.strip('"').split('","')  # strip off leading and trailing double-quote
    line_assoc = zip(header, line_broken)  # creates a tuple of tuples out of the line with the header at matching position as key
    line_dict = dict(line_assoc)  # turn this into a dict
    lines_d.append(line_dict)

    section_parts = [s.strip() for s in line_dict['Location'].split(',')]  # break Section value to get pieces we need

    file_out = os.path.join(root_dir, '%s%s%s%sAPI.las' % (section_parts[0], os.path.sep, section_parts[1], os.path.sep))  # format output filename the way I think is requested

    # stuff to show what's actually put in the files
    print file_out, ':'
    print ' ', '"%s"' % ('","'.join(header),)
    print ' ', '"%s"' % ('","'.join(line_dict[h] for h in header))
output:
~/so_test $ python so_test.py
/tmp/so_test/T32S R29W/Sec. 27/API.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
/tmp/so_test/T34S R26W/Sec. 2/API.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
~/so_test $
Approach 1 :-
Suppose your file has 1000 rows.
Create a masterlist which has the data stored in this form ->
[row1, row2, row3 and so on]
Once done, loop through this masterlist. You will get a row in string format in every iteration.
Split it to make a list, slice out the last column (the URL, i.e. row[-1]),
and append it to an empty list named result_url (see the sketch below). Once it has run for all rows, save it to a file; you can easily create a directory using the os module and move your file over there.
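A rough sketch of approach 1 (the input file name data.csv is an assumption):

# Read everything into a master list, then pull out the URL column.
with open('data.csv') as f:
    masterlist = f.readlines()[1:]  # skip the header row

result_url = []
for row in masterlist:
    columns = row.strip().strip('"').split('","')
    result_url.append(columns[-1])  # the URL is the last column

with open('result_urls.txt', 'w') as out:
    out.write('\n'.join(result_url))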
Approach 2 :-
If the file is too huge, read lines one by one inside a try block and process your data (using the csv module you can get each row as a list, slice out the URL and write it to the file API.las every time).
Once your program moves to the 1001st line it will move to the except block, where you can just 'pass' or print a message to get notified.
In approach 2, you are not saving all the data in any data structure; you only store a single row while processing it, so it is faster.
import csv, os

directory_creater = os.mkdir('Locations')
fme = open('./Locations/API.las', 'w+')

with open('data.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    print spamreader.next()
    while True:
        try:
            row = spamreader.next()
            get_url = row[-1]
            to_write = get_url + '\n'
            fme.write(to_write)
        except:
            print "Program has run. Check output."
            exit(1)
This code can do all that you mentioned efficiently in less time.

Huge memory consumption on list .split()

I have a simple function that reads a file and returns a list of words:
def _read_words(filename):
    with tf.gfile.GFile(filename, "r") as f:
        words = f.read().replace("\n", " %s " % EOS).split()
        print(" %d Mo" % (sys.getsizeof(words) / (1000 * 1000)))
        return words
Notes:
tf.gfile.GFile comes from TensorFlow; the same thing happens with the built-in open, so you can ignore it.
EOS is a string containing "<eos>"
When I run it with a 1.3 GB file, the process reserves more than 20 GB of RAM (see the htop screenshot), but it prints 2353 Mo for sys.getsizeof(words).
Note that this process is nothing more than:
import reader
path = "./train.txt"
w = reader._read_words(path)
When I run it step by step I see the following:
d = file.read() => 4.039 GB RAM
d = d.replace('\n', ' <eos> ') => 5.4GB RAM
d = d.split() => 22GB RAM
So here I am:
I can understand that split uses extra memory, but that much looks ridiculous.
Could I use better operations/data structures to do it? A solution using numpy would be great.
I can find a workaround (see below) specific to my case, but I would still not understand why I can't do it the easiest way.
Any clues or suggestions are welcome,
Thx,
Paul
Workaround:
As explained in the comment I need to:
Build a vocabulary i.e. a dict {"word": count}.
Then I select only the top n words, by occurrence (n being a parameter)
Each of these words is assigned an integer (1 to n; 0 is for the end-of-sentence tag <eos>)
I load the whole data, splitting by sentence (\n or our internal tag <eos>)
We sort it by sentence length
What I can do better:
Count words on the fly (line by line), as in the sketch after this list
Build the vocabulary using this word count (we don't need the whole data then)
Load the whole data as integers, using the vocabulary, and therefore saving some memory.
Sort the list, same length, less memory consumption.
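A minimal sketch of the first two improvements, assuming a plain open() instead of tf.gfile.GFile and the same <eos> tag:

import collections

EOS = "<eos>"

def build_vocab(filename, n):
    # Count words line by line, then keep the n most frequent ones and
    # map each to an integer id (0 is reserved for the EOS tag).
    counts = collections.Counter()
    with open(filename) as f:
        for line in f:  # never holds the whole file in memory
            counts.update(line.split())
            counts[EOS] += 1  # one EOS per line
    counts.pop(EOS, None)  # EOS gets the fixed id 0 below
    vocab = {EOS: 0}
    for i, (word, _) in enumerate(counts.most_common(n), start=1):
        vocab[word] = i
    return vocab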

eliminate text after certain character in python pipeline- with slice?

This is a short script I've written to refine and validate a large dataset that I have.
# The purpose of this script is the refinement of the job data attained from the
# JSI as it is rendered by the `csv generator` contributed by Luis for purposes
# of presentation on the dashboard map.
import csv
# The number of columns
num_headers = 9
# Remove invalid characters from records
def url_escaper(data):
    for line in data:
        yield line.replace('&', '&amp;')
# Be sure to configure input & output files
with open("adzuna_input_THRESHOLD.csv", 'r') as file_in, open("adzuna_output_GO.csv", 'w') as file_out:
    csv_in = csv.reader( url_escaper( file_in ) )
    csv_out = csv.writer(file_out)

    # Get rid of rows that have the wrong number of columns
    # and rows that have only whitespace for a columnar value
    for i, row in enumerate(csv_in, start=1):
        if not [e for e in row if not e.strip()]:
            if len(row) == num_headers:
                csv_out.writerow(row)
        else:
            print "line %d is malformed" % i
I have one field that is structured like so:
finance|statistics|lisp
I've seen ways to do this using other utilities like R, but I would ideally like to achieve the same effect within the scope of this Python code.
Maybe I can iterate over all the characters of all the columnar values, perhaps as a list, and if I see a | I can dispose of the | and all the text that follows it within the scope of the column value.
I think surely it can be achieved with slices, as they do here, but I don't quite understand how the indices with slices work, and I can't see how I could include this process harmoniously within the cascade of the current script pipeline.
With regex I guess it's something like this
(?:|)(.*)
Why not use string's split method?
In[4]: 'finance|statistics|lisp'.split('|')[0]
Out[4]: 'finance'
It does not fail with exception when you do not have separator character in the string too:
In[5]: 'finance/statistics/lisp'.split('|')[0]
Out[5]: 'finance/statistics/lisp'
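For example, a sketch of how that could slot into the existing loop; the column index category_col is a made-up name, so adjust it to the real position of the pipe-delimited field:

category_col = 3  # hypothetical index of the 'finance|statistics|lisp' field

for i, row in enumerate(csv_in, start=1):
    if not [e for e in row if not e.strip()]:
        if len(row) == num_headers:
            # keep only the text before the first '|'
            row[category_col] = row[category_col].split('|')[0]
            csv_out.writerow(row)
    else:
        print "line %d is malformed" % i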

Creating a table which has sentences from a paragraph each on a row with Python

I have an abstract which I've split into sentences in Python. I want to write to two tables. One has the following columns: abstract ID (which is the file number that I extracted from my document), sentence ID (automatically generated) and each sentence of this abstract on a row.
I would want a table that looks like this
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by(1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (each on a row) to the table and assign the sentence ID as shown above?
This is my code:
import glob;
import re;
import json

org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);

for name in files:
    fileA = open(name, 'r');
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name, 'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n', '')
    abstract = abstract.split();
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer

    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments

    fh.close()

    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for reasons cited is best to avoid. In the smaller version, you can see how I substituted in string literals in places where they are only used once and called print statements immediately instead of storing the results for later. The results are usually more concise and easily understood.
