Replacing string with id using dictionary in Python

I have a dictionary file that contains a word in each line.
titles-sorted.txt
a&a
a&b
a&c_bus
a&e
a&f
a&m
....
For each word, its line number is the word's id.
Then I have another file that contains a set of words separated by tab in each line.
a.txt
a_15 a_15_highway_(sri_lanka) a_15_motorway a_15_motorway_(germany) a_15_road_(sri_lanka)
I'd like to replace each word with its id if it exists in the dictionary, so that the output looks like:
3454 2345 123 5436 322 ....
So I wrote the following Python code to do this:
f = open("titles-sorted.txt")
lines = f.readlines()
titlemap = {}
nr = 1
for l in lines:
    l = l.replace("\n", "")
    titlemap[l.lower()] = nr
    nr += 1

fw = open("a.index", "w")
f = open("a.txt")
lines = f.readlines()
for l in lines:
    tokens = l.split("\t")
    if tokens[0] in titlemap.keys():
        fw.write(str(titlemap[tokens[0]]) + "\t")
        for t in tokens[1:]:
            if t in titlemap.keys():
                fw.write(str(titlemap[t]) + "\t")
        fw.write("\n")
fw.close()
f.close()
But this code is ridiculously slow, which makes me suspect that I haven't done everything right.
Is this an efficient way to do this?

The write loop contains a lot of calls to write, which are usually inefficient. You can probably speed things up by writing only once per line (or once per file if the file is small enough)
tokens = l.split("\t")
fw.write('\t'.join(str(titlemap[t]) for t in tokens if t in titlemap))
fw.write("\n")
or even:
lines = []
for l in f:
    lines.append('\t'.join(str(titlemap[t]) for t in l.split('\t') if t in titlemap))
fw.write('\n'.join(lines))
Also, if your tokens are used more than once, you can save time by converting the ids to strings when you read them:
titlemap = {l.strip().lower(): str(index) for index, l in enumerate(f, start=1)}

So, I suspect this differs based on the operating system you're running and the specific Python implementation (someone wiser than I may be able to provide some clarity here), but I have a suspicion about what is going on:
Every time you call write, some amount of your desired write request gets written to a buffer, and once the buffer is full, this information is written to the file. The file needs to be fetched from your hard disk (as it doesn't exist in main memory), so your computer pauses for the several milliseconds it takes to fetch the block from the hard disk and write to it. On the other hand, the parsing of the string and the lookup in your hash map take a couple of nanoseconds, so you spend most of the time waiting for the write requests to finish!
Instead of writing immediately, what if you kept a list of the lines that you wanted to write and then wrote them all at the end, in a row; or, if you're handling a huge file that would exceed the capacity of your main memory, wrote them out once you have parsed a certain number of lines?
This allows the writing to disk to be optimized, as you can write multiple blocks at a time (again, this depends on how Python and the operating system handle the write call).
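For instance, here is a minimal sketch of that batching idea, assuming titlemap has been built as in the question; the batch size is just an illustrative number:
BATCH_SIZE = 10000  # flush every 10,000 lines; tune to your memory budget

buffered = []
with open("a.txt") as f, open("a.index", "w") as fw:
    for line in f:
        tokens = line.rstrip("\n").split("\t")
        ids = [str(titlemap[t]) for t in tokens if t in titlemap]
        buffered.append("\t".join(ids))
        if len(buffered) >= BATCH_SIZE:
            # write many lines in one call instead of one write per token
            fw.write("\n".join(buffered) + "\n")
            buffered.clear()
    if buffered:
        # write whatever is left at the end
        fw.write("\n".join(buffered) + "\n")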

If we apply the suggestions so far and clean up your code some more (e.g. remove unnecessary .keys() calls), is the following still too slow for your needs?
title_map = {}
token_file = open("titles-sorted.txt")
for number, line in enumerate(token_file):
    title_map[line.rstrip().lower()] = str(number + 1)
token_file.close()

input_file = open("a.txt")
output_file = open("a.index", "w")
for line in input_file:
    tokens = line.split("\t")
    if tokens[0] in title_map:
        output_list = [title_map[tokens[0]]]
        output_list.extend(title_map[token] for token in tokens[1:] if token in title_map)
        output_file.write("\t".join(output_list) + "\n")
output_file.close()
input_file.close()
If it's still too slow, give us slightly more data to work with including an estimate of the number of lines in each of your two input files.

Related

Most efficient way to convert large .txt files (size >30GB) .txt into .csv after pre-processing using Python

I have data in a .txt file that looks like this (let's name it "myfile.txt"):
28807644'~'0'~'Maun FCU'~'US#####28855353'~'0'~'WNB Holdings LLC'~'US#####29212330'~'0'~'Idaho First Bank'~'US#####29278777'~'0'~'Republic Bank of Arizona'~'US#####29633181'~'0'~'Friendly Hills Bank'~'US#####29760145'~'0'~'The Freedom Bank of Virginia'~'US#####100504846'~'0'~'Community First Fund Federal Credit Union'~'US#####
I have tried a couple of ways to convert this .txt into a .csv; one of them was using the csv library, but since I like Pandas a lot, I used the following:
import pandas as pd
import time
#time at the start of program is noted
start = time.time()
# We set the path where our file is located and read it
path = r'myfile.txt'
f = open(path, 'r')
content = f.read()
# We replace undesired strings and introduce a breakline.
content_filtered = content.replace("#####", "\n").replace("'", "")
# We read everything in columns with the separator "~"
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')], columns = ['a', 'b', 'c', 'd'])
# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index = None)
end = time.time()
#total time taken to print the file
print("Execution time in seconds: ",(end - start))
This takes about 35 seconds to process a 300 MB file, which is performance I can accept, but I'm trying to do the same for a much larger file of 35 GB, and it produces a MemoryError.
I tried using the csv library, but the results were similar: I attempted putting everything into a list and afterwards writing it out to a CSV:
import csv

# We write to CSV
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)
Results were similar, not a huge improvement. Is there a way or methodology I can use to convert VERY large .txt files into .csv? Likely above 35GB?
I'd be happy to read any suggestions you may have, thanks in advance!
I took your sample string, and made a sample file by multiplying that string by 100 million (something like your_string*1e8...) to get a test file that is 31GB.
Following #Grismar's suggestion of chunking, I made the following, which processes that 31GB file in ~2 minutes, with a peak RAM usage depending on the chunk size.
The complicated part is keeping track of the field and record separators, which are multiple characters, and will certainly span across a chunk, and thus be truncated.
My solution is to inspect the end of each chunk and see if it has a partial separator. If it does, that partial is removed from the end of the current chunk, the current chunk is written-out, and the partial becomes the beginning of (and should be completed by) the next chunk:
CHUNK_SZ = 1024 * 1024
FS = "'~'"
RS = '#####'

# With chars repeated in the separators, check most specific (least ambiguous)
# to least specific (most ambiguous) to definitively catch a partial with the
# fewest number of checks
PARTIAL_RSES = ['####', '###', '##', '#']
PARTIAL_FSES = ["'~", "'"]
ALL_PARTIALS = PARTIAL_FSES + PARTIAL_RSES

f_out = open('out.csv', 'w')
f_out.write('a,b,c,d\n')

f_in = open('my_file.txt')
line = ''
while True:
    # Read chunks till no more, then break out
    chunk = f_in.read(CHUNK_SZ)
    if not chunk:
        break

    # Any previous partial separator, plus new chunk
    line += chunk

    # Check end-of-line for a partial FS or RS; only when separators are more than one char
    final_partial = ''

    if line.endswith(FS) or line.endswith(RS):
        pass  # Write-out will replace complete FS or RS
    else:
        for partial in ALL_PARTIALS:
            if line.endswith(partial):
                final_partial = partial
                line = line[:-len(partial)]
                break

    # Process/write chunk
    f_out.write(line
                .replace(FS, ',')
                .replace(RS, '\n'))

    # Add partial back, to be completed next chunk
    line = final_partial

# Clean up
f_in.close()
f_out.close()
Since your code just does straight up replacement, you could just read through all the data sequentially and detect parts that need replacing as you go:
def process(fn_in, fn_out, columns):
    new_line = b'#####'

    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns) + '\n').encode())

        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0

                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)

process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])
That does it pretty quickly. You may be able to improve performance by reading in chunks, but this is pretty quick all the same.
This approach has the advantage over yours that it holds almost nothing in memory, but it does very little to optimise reading the file fast.
Edit: there was a big mistake in an edge case, which I realised after re-reading, fixed now.
Just to share an alternative way, based on convtools (table docs | github).
This solution is faster than the OP's, but ~7 times slower than Zach's (Zach works with str chunks, while this one works with row tuples, reading via csv.reader).
Still, this approach may be useful as it lets you tap into stream processing and work with columns: rearrange them, add new ones, and so on.
from convtools import conversion as c
from convtools.contrib.fs import split_buffer
from convtools.contrib.tables import Table

def get_rows(filename):
    with open(filename, "r") as f:
        for row in split_buffer(f, "#####"):
            yield row.replace("'", "")

Table.from_csv(
    get_rows("tmp.csv"), dialect=Table.csv_dialect(delimiter="~")
).into_csv("tmp_out.csv", include_header=False)

python merge files by rules

I need to write a script in Python that accepts and merges 2 files into a new file according to the following rules:
1) Take one word from the first file, followed by two words from the second file.
2) When we reach the end of one file, copy the rest of the other file to the merged file without change.
I wrote a script, but I only managed to take one word from each file.
A complete script would be nice, but I really want to understand in words how I can do this on my own.
This is what I wrote:
def exercise3(file1, file2):
    lstFile1 = readFile(file1)
    lstFile2 = readFile(file2)
    with open("mergedFile", 'w') as outfile:
        merged = [j for i in zip(lstFile1, lstFile2) for j in i]
        for word in merged:
            outfile.write(word)

def readFile(filename):
    lines = []
    with open(filename) as file:
        for line in file:
            line = line.strip()
            for word in line.split():
                lines.append(word)
    return lines
Your immediate problem is that zip alternates items from the iterables you give it: in short, it's a 1:1 mapping, where you need 1:2. Try this:
lstFile2a = lstFile2[0::2]
lstFile2b = lstFile2[1::2]
... zip(lstFile1, lstFile2a, lstFile2b)
This is a bit inefficient, but gets the job done.
Another way is to zip up pairs (2-tuples) in lstFile2 before zipping it with lstFile1. A third way is to forget zipping altogether, and run your own indexing:
for i in range(min(len(lstFile1), len(lstFile2)//2)):
    outfile.write(lstFile1[i])
    outfile.write(lstFile2[2*i])
    outfile.write(lstFile2[2*i+1])
However, this leaves you with the leftovers of the longer file to handle.
These aren't particularly elegant, but they should get you moving.
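For example, one way to also cover the leftovers is to group the second list into pairs and pad both sides with itertools.zip_longest; this is only a sketch, assuming the word lists come from readFile as above:
from itertools import zip_longest

def merge_words(lstFile1, lstFile2):
    # group the second file's words into pairs; a trailing odd word becomes a 1-tuple
    pairs = [tuple(w for w in p if w is not None)
             for p in zip_longest(lstFile2[0::2], lstFile2[1::2])]
    merged = []
    for w1, pair in zip_longest(lstFile1, pairs):
        if w1 is not None:    # still have words from the first file
            merged.append(w1)
        if pair is not None:  # still have pairs from the second file
            merged.extend(pair)
    return merged
Because zip_longest pads the shorter side with None, whichever file runs out first simply stops contributing and the rest of the other file is copied through unchanged; the result is a flat list of words you can write out in whatever format you need.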

What is the most efficient way of looping over each line of a file?

I have a file, dataset.nt, which isn't too large (300Mb). I also have a list, which contains around 500 elements. For each element of the list, I want to count the number of lines in the file which contain it, and add that key/value pair to a dictionary (the key being the name of the list element, and the value the number of times this element appears in the file).
This is the first thing I tried to achieve that result:
mydict = {}
for i in mylist:
    regex = re.compile(r"/Main/"+re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total+1
    mydict[i] = total
It didn't work (as in, it runs indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:
mydict = {}
with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total
Performance didn't improve; the script still ran indefinitely. So I googled around, and I tried this:
mydict = {}
file = open("dataset.nt", "rb")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total
That one has been running for the last 30 minutes, so I'm assuming it's not any better.
How should I structure this code so that it completes in a reasonable amount of time?
I'd favor a slight modification of your second version:
mydict = dict.fromkeys(mylist, 0)
re_list = [re.compile(r"/Main/"+re.escape(i)) for i in mylist]
with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue
        # do the regex-part; accumulate counts instead of resetting them per line
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1
As #matsjoyce already suggested, this avoids re-compiling the regex on each iteration.
If you really need that many different regex patterns then I don't think there's much you can do.
Maybe it's worth checking if you can regex-capture whatever comes after "/Main/" and then compare that to your list. That may help reduce the number of "real" regex searches.
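A sketch of that capture idea, assuming the terms of interest directly follow "/Main/" and run up to the next whitespace character (adjust the captured pattern to match how your terms actually end):
import re

counts = dict.fromkeys(mylist, 0)
wanted = set(mylist)
main_re = re.compile(r"/Main/(\S+)")  # capture whatever follows "/Main/"

with open("dataset.nt") as input_file:
    for line in input_file:
        # set() so a line containing the same term twice is still counted once
        for term in set(main_re.findall(line)):
            if term in wanted:
                counts[term] += 1
This way each line is scanned by a single regex, and the per-term work is just a set lookup.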
Looks like a good candidate for some map/reduce like parallelisation... You could split your dataset file in N chunks (where N = how many processors you have), launch N subprocesses each scanning one chunk, then sum the results.
This of course doesn't prevent you from first optimizing the scan, ie (based on sebastian's code):
targets = [(i, re.compile(r"/Main/"+re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)
with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue
        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1
Note that this could be better optimized if you posted a sample from your dataset. If for example your dataset can be sorted on "/Main/{i}" (using the system sort program for example), you wouldn't have to check each line for each value of i. Or if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string (which can be faster than a regexp).
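And here is a rough sketch of the map/reduce idea with multiprocessing, splitting the work by chunks of lines; the worker count, chunk size, and the scan_chunk/parallel_scan helper names are illustrative, not a drop-in solution:
import re
from collections import Counter
from multiprocessing import Pool

def scan_chunk(lines, terms):
    # count, per term, the lines in this chunk that contain "/Main/<term>"
    targets = [(t, re.compile(r"/Main/" + re.escape(t))) for t in terms]
    counts = Counter()
    for line in lines:
        if '/Main/' not in line:
            continue
        for t, regex in targets:
            if regex.search(line):
                counts[t] += 1
    return counts

def parallel_scan(path, terms, n_workers=4, chunk_lines=100000):
    with open(path) as f:
        lines = f.readlines()
    chunks = [lines[k:k + chunk_lines] for k in range(0, len(lines), chunk_lines)]
    with Pool(n_workers) as pool:
        partials = pool.starmap(scan_chunk, [(chunk, terms) for chunk in chunks])
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # sum the per-chunk counts
    return totals

# mydict = parallel_scan("dataset.nt", mylist)
Note that shipping whole chunks of lines to worker processes has some pickling overhead; for a 300 MB file that is fine, but for much larger inputs you would want to pass byte offsets and let each worker read its own slice of the file.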
The other solutions are very good. But, since there is a regex for each element and it doesn't matter whether the element appears more than once per line, you could count the lines containing the target expression using re.findall.
Also, past a certain number of lines it is better to read the whole file into memory (if you have enough memory and it isn't a design restriction).
import re

mydict = {}
mylist = [...]  # A list with 500 items

# Optimizing calls
findall = re.findall  # so Python doesn't have to resolve these functions on every call
escape = re.escape

with open("dataset.nt", "rb") as input:
    # Read the file once and keep it in memory instead of reading it line by line.
    # If the number of lines is big this is faster.
    text = input.read()
    for elem in mylist:
        # Count the lines where the target regex matches.
        mydict[elem] = len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text))
I tested this with a file of size 800 MB (I wanted to see how long it takes to load a file this big into memory; it is faster than you would think).
I didn't test the whole code with real data, just the findall part.

Referring to a list of names using Python

I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you tuples of (contig, sequence):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0].rstrip() in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.remove(p[0].rstrip())
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import izip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = cont_list.next().strip()
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters.
From there, writing the grabbed sequences to an output file is pretty straightforward; see the sketch after the example output.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
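A minimal sketch of that write-out step, reusing get_sequences() from above (the output file name is illustrative):
valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
sequences = get_sequences('dataT.txt', valid_contigs)

with open('output.txt', 'w') as data_out:
    for sequence in sequences:
        data_out.write(sequence + '\n')  # one sequence per line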

Is there a way to read a file in a loop in python using a separator other than newline

I usually read files like this in Python:
f = open('filename.txt', 'r')
for x in f:
    doStuff(x)
f.close()
However, this splits the file by newlines. I now have a file which has all of its info in one line (45,000 strings separated by commas). While a file of this size is trivial to read in using something like
f = open('filename.txt', 'r')
doStuff(f.read())
f.close()
I am curious whether, for a much larger file that is all on one line, it would be possible to achieve a similar iteration effect as in the first code snippet, but splitting by comma instead of newline, or by any other character.
The following function is a fairly straightforward way to do what you want:
def file_split(f, delim=',', bufsize=1024):
    prev = ''
    while True:
        s = f.read(bufsize)
        if not s:
            break
        split = s.split(delim)
        if len(split) > 1:
            yield prev + split[0]
            prev = split[-1]
            for x in split[1:-1]:
                yield x
        else:
            prev += s
    if prev:
        yield prev
You would use it like this:
for item in file_split(open('filename.txt')):
    doStuff(item)
This should be faster than the solution that EMS linked, and will save a lot of memory over reading the entire file at once for large files.
Open the file using open(), then use the file.read(x) method to read (approximately) the next x bytes from the file. You could keep requesting blocks of 4096 characters until you hit end-of-file.
You will have to implement the splitting yourself - you can take inspiration from the csv module, but I don't believe you can use it directly because it wasn't designed to deal with extremely long lines.
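A sketch of that manual approach, reading fixed-size blocks with iter() and a sentinel, and carrying any partial token over to the next block; the block size and helper name are illustrative:
def iter_tokens(path, delim=',', blocksize=4096):
    leftover = ''
    with open(path, 'r') as f:
        # f.read() returns '' at end-of-file, which stops the iterator
        for block in iter(lambda: f.read(blocksize), ''):
            parts = (leftover + block).split(delim)
            leftover = parts.pop()  # the last piece may be incomplete; keep it
            for part in parts:
                yield part
    if leftover:
        yield leftover

for x in iter_tokens('filename.txt'):
    doStuff(x)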
