python iterate over a file and replace strings

I'm using the 're' library to replace occurrences of different strings in multiple files. The replacement pattern works fine, but the changes don't persist in the files. I'm trying to get the same functionality I have with the following lines:
with open(KEY_FILE, mode='r', encoding='utf-8-sig') as f:
    replacements = csv.DictReader(f)
    user_data = open(temp_file, 'r').read()
    for col in replacements:
        user_data = user_data.replace(col[ORIGINAL_COLUMN], col[TARGET_COLUMN])
    data_output = open(f"{temp_file}", 'w')
    data_output.write(user_data)
    data_output.close()
The key line here is:
user_data = user_data.replace(col[ORIGINAL_COLUMN], col[TARGET_COLUMN])
It takes care of carrying the changes forward, because the result of replace is assigned back to user_data on every iteration.
I need to do the same but with the 're' library:
with open(KEY_FILE, mode='r', encoding='utf-8-sig') as f:
    replacements = csv.DictReader(f)
    user_data = open(temp_file, 'r').read()
    a = open(f"{test_file}", 'w')
    for col in replacements:
        original_str = col[ORIGINAL_COLUMN]
        target_str = col[TARGET_COLUMN]
        compiled = re.compile(re.escape(original_str), re.IGNORECASE)
        result = compiled.sub(target_str, user_data)
    a.write(result)
I only end up with the last item in the .csv dict changed in the output file. Can't seem to get the changes made in previous iterations of the for loop to persist.
I know that it is pulling from the same file each time... which is why it is getting reset each loop, but I can't sort out a workaround.
Thanks

Try something like this?
#!/usr/bin/env python3
import csv
import re
import sys
from io import StringIO

KEY_FILE = '''aaa,bbb
xxx,yyy
'''

TEMP_FILE = '''here is aaa some text xxx
bla bla aaaxxx
'''

ORIGINAL_COLUMN = 'FROM'
TARGET_COLUMN = 'TO'

user_data = StringIO(TEMP_FILE).read()

with StringIO(KEY_FILE) as f:
    reader = csv.DictReader(f, ['FROM', 'TO'])
    for row in reader:
        original_str = row[ORIGINAL_COLUMN]
        target_str = row[TARGET_COLUMN]
        compiled = re.compile(re.escape(original_str), re.IGNORECASE)
        user_data = compiled.sub(target_str, user_data)

sys.stdout.write("modified user_data:\n" + user_data)
Some things to note:
The main problem was result = compiled.sub(target_str, user_data): each pass applied the substitution to the original user_data and threw the previous result away, so only the last replacement survived. You want to keep updating the same string (user_data = compiled.sub(target_str, user_data)) rather than always applying to the original.
Compiling the regexes is fairly pointless in this case, since each one is used only once.
I don't have access to your test files, so I used StringIO versions inline and printed to stdout; hopefully that's easy enough to translate back to your real code (: — there's a sketch of that translation after these notes.
In future posts, you might consider doing similar, so that your question has 100% runnable code someone else can try out without guessing.
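For reference, here is roughly how that fix might translate back to the file-based version from the question. This is a sketch only; it assumes KEY_FILE, temp_file, ORIGINAL_COLUMN and TARGET_COLUMN are defined exactly as in your original code:

import csv
import re

# Load the replacement table once, then read, substitute, and write back.
with open(KEY_FILE, mode='r', encoding='utf-8-sig') as f:
    replacements = list(csv.DictReader(f))

with open(temp_file, 'r') as f:
    user_data = f.read()

for col in replacements:
    compiled = re.compile(re.escape(col[ORIGINAL_COLUMN]), re.IGNORECASE)
    user_data = compiled.sub(col[TARGET_COLUMN], user_data)  # reassign, don't discard

with open(temp_file, 'w') as f:
    f.write(user_data)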


MemoryError in Python by searching a large file using mmap and re.findall

I'm looking to implement a few lines of python, using re, to firstly manipulate a string then use that string as a regex search. I have strings with *'s in the middle of them, i.e. ab***cd, with the *'s being any length. The aim of this is to do the regex search in a document to extract any lines that match the starting and finishing characters, with any number of characters in between. i.e. ab12345cd, abbbcd, ab_fghfghfghcd, would all be positive matches. Examples of negative matches: 1abcd, agcd, bb111cd.
I have come up with the regex of [\s\S]*? to input instead of the *'s. So I want to get from an example string of ab***cd to ^ab[\s\S]*?cd, I will then use that for a regex search of a document.
I then wanted to open the file in mmap, search through it using the regex, then save the matches to file.
import re
import mmap

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def searchFile(list_txt, raw_str):
    search = "^" + raw_str  # add regex ^ newline operator
    search_rgx = re.sub(r'\*+', r'[\\s\\S]*?', search)  # replace * with regex function
    # search file
    with open(list_txt, 'r+') as f:
        data = mmap.mmap(f.fileno(), 0)
        results = re.findall(bytes(search_rgx, encoding="utf-8"), data, re.MULTILINE)
    # save results
    f1 = open('results.txt', 'w+b')
    results_bin = b'\n'.join(results)
    f1.write(results_bin)
    f1.close()
    print("Found " + str(file_len("results.txt")) + " results")

searchFile("largelist.txt", "ab**cd")
Now this works fine with a small file. However when the file gets larger (1gb of text) I get this error:
Traceback (most recent call last):
  File "c:\Programming\test.py", line 27, in <module>
    searchFile("largelist.txt","ab**cd")
  File "c:\Programming\test.py", line 21, in searchFile
    results_bin = b'\n'.join(results)
MemoryError
Firstly - can anyone help optimize the code slightly? Am I doing something seriously wrong? I used mmap because I knew I wanted to look at large files and read them line by line rather than all at once (hence someone suggested mmap).
I've also been told to have a look at the pandas library for more data manipulation. Would pandas replace mmap?
Thanks for any help. I'm pretty new to python as you can tell - so appreciate any help.
You are doing line by line processing so you want to avoid accumulating data in memory. Regular file reads and writes should work well here. mmap is backed by virtual memory, but that has to turn into real memory as you read it. Accumulating results in findall is also a memory hog. Try this as an alternate:
import re

# buffer to 1Meg but any effect would be modest
MEG = 2**20

def searchFile(filename, raw_str):
    # extract start and end from "ab***cd"
    startswith, endswith = re.match(r"([^\*]+)\*+?([^\*]+)", raw_str).groups()
    with open(filename, buffering=MEG) as in_f, open("results.txt", "w", buffering=MEG) as out_f:
        for line in in_f:
            stripped = line.strip()
            if stripped.startswith(startswith) and stripped.endswith(endswith):
                out_f.write(line)

# write test file
test_txt = """ab12345cd
abbbcd
ab_fghfghfghcd
1abcd
agcd
bb111cd
"""

want = """ab12345cd
abbbcd
ab_fghfghfghcd
"""

open("test.txt", "w").write(test_txt)
searchFile("test.txt", "ab**cd")
result = open("results.txt").read()
print(result == want)
I am not sure what advantage you believe you will get from opening the input file with mmap, but since each string that must be matched is delimited by a newline (as per your comment), I would use the below approach (Note that it is Python, but deliberately kept as pseudo code):
with open(input_file_path, "r") as input_file:
    with open(output_file_path, "x") as output_file:
        for line in input_file:
            if is_match(line):  # is_match stands in for your matching test
                print(line, file=output_file)
possibly tuning the end parameter of the print function to your needs (for example end='', since each line already carries its newline).
This way results are written as they are generated, and you avoid holding a large results list in memory before writing it.
Furthermore, you don't need to concern yourself with newlines, only with whether each line matches.
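If it helps, is_match could be built from the same pattern transformation described in the question. A minimal sketch (the helper name make_matcher is illustrative, not part of the answer above):

import re

def make_matcher(raw_str):
    # Same transformation as in the question: "ab**cd" -> "^ab[\s\S]*?cd"
    search_rgx = "^" + re.sub(r"\*+", r"[\\s\\S]*?", raw_str)
    pattern = re.compile(search_rgx)
    return lambda line: pattern.match(line) is not None

is_match = make_matcher("ab**cd")
print(is_match("ab12345cd"))  # True
print(is_match("1abcd"))      # False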
How about this? In this situation, what you want is a list of all of your lines represented as strings. The following emulates that, resulting in a list of strings:
import io
longstring = """ab12345cd
abbbcd
ab_fghfghfghcd
1abcd
agcd
bb111cd
"""
list_of_strings = io.StringIO(longstring).read().splitlines()
list_of_strings
Outputs
['ab12345cd', 'abbbcd', 'ab_fghfghfghcd', '1abcd', 'agcd', 'bb111cd']
This is the part that matters
import pandas as pd

s = pd.Series(list_of_strings)
s[s.str.match(r'^ab[\s\S]*?cd')]
Outputs
0 ab12345cd
1 abbbcd
2 ab_fghfghfghcd
dtype: object
Edit2: Try this (I don't see a reason for you to want it as a function, but I've done it like that since that's what you did in the comments):
import pandas as pd

def newsearch(filename):
    with open(filename, 'r', encoding="utf-8") as f:
        list_of_strings = f.read().splitlines()
    s = pd.Series(list_of_strings)
    s = s[s.str.match(r'^ab[\s\S]*?cd')]
    s.to_csv('output.txt', header=False, index=False)

newsearch('list.txt')
A chunk-based approach
import os
import pandas as pd

def newsearch(filename):
    outpath = 'output.txt'
    if os.path.exists(outpath):
        os.remove(outpath)
    for chunk in pd.read_csv(filename, sep='|', header=None, chunksize=10**6):
        chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
        chunk[0].to_csv(outpath, index=False, header=False, mode='a')

newsearch('list.txt')
A dask approach
import dask.dataframe as dd

def newsearch(filename):
    chunk = dd.read_csv(filename, header=None, blocksize=25e6)
    chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
    chunk[0].to_csv('output.txt', index=False, header=False, single_file=True)

newsearch('list.txt')

How would you update a line in a text file, if it contains a certain string, with user input in Python?

Currently I have a program that accepts a text file as a settings parameter. The text file is a list of KEY=VALUE settings, one per line (for example PRIMER_PRODUCT_SIZE_RANGE=100-300).
What I would like to do in Python is parse the file: read each line, check whether it contains a specific string, and if it does, append/update user-input text in place at the end of that line.
For example, "PRIMER_PRODUCT_SIZE_RANGE=" is found in the text file and I want to update it from "PRIMER_PRODUCT_SIZE_RANGE=100-300" to read, for example, "PRIMER_PRODUCT_SIZE_RANGE=500-1000". I want to be able to do this for several of the parameters (the ones I have chosen for my needs).
A few tools I've looked into include the fileinput module, a regular file open/write (which apparently can't edit in place), and a module I found called in_place.
Some preliminary code I have using the in_place module:
def SettingFileParser(filepath):
    with in_place.InPlaceText(filepath) as text_file:
        for line in text_file:
            if 'PRIMER_PRODUCT_SIZE_RANGE=' in line:
                text_file.write("idk what to put here")
I'm a noob programmer to preface so any help or guidance in the right direction would be appreciated.
You need to open the file for reading (default) and then you need to open a file for writing:
def SettingFileParser(filepath):
    with open(filepath, 'r') as read_file:
        lines = read_file.readlines()
    with open(filepath, 'w') as write_file:
        for line in lines:
            if line.startswith('PRIMER_PRODUCT_SIZE_RANGE='):
                # keep the line the same except the part needing replacement
                write_file.write(line.replace('100-300', '500-1000'))
            else:
                # otherwise just write the line as is
                write_file.write(line)
For this task, regular expressions are the way to go. I refer you to this post.
Concretely, something like this might do the trick:
import re

# read in file here...
s = re.sub(r"PRIMER_PRODUCT_SIZE_RANGE=[0-9,-]*", "PRIMER_PRODUCT_SIZE_RANGE=500-1000", s)
Here, s would be the entire string in the text file.
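Put together, the round trip might look something like this — a sketch, with the file name 'settings.txt' as a placeholder:

import re

# Read the whole settings file, substitute the parameter, write it back.
# The file name and the new range are illustrative only.
with open('settings.txt', 'r') as f:
    s = f.read()

s = re.sub(r"PRIMER_PRODUCT_SIZE_RANGE=[0-9,-]*",
           "PRIMER_PRODUCT_SIZE_RANGE=500-1000", s)

with open('settings.txt', 'w') as f:
    f.write(s)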
You can use re for changing/manipulating strings (explanation of this regex on external site):
data = """
# PRIMER_PRODUCT_SIZE_RANGE=xxx-yyy
PRIMER_PRODUCT_SIZE_RANGE=100-300
PRIMER_NUM_RETURN=3
PRIMER_MAX_END_STABILITY=9.0
PRIMER_MAX_HAIRPIN_TH=24.0
"""
import re
def change_parameter(data, parameter, new_value):
    return re.sub(r'^(\s*(?<!#)\s*{}\s*)=.*$'.format(parameter), r'\1={}'.format(new_value), data, flags=re.M|re.I)
data = change_parameter(data, 'PRIMER_PRODUCT_SIZE_RANGE', '100-500')
data = change_parameter(data, 'PRIMER_MAX_HAIRPIN_TH', '99.9')
print(data)
This prints:
# PRIMER_PRODUCT_SIZE_RANGE=xxx-yyy
PRIMER_PRODUCT_SIZE_RANGE=100-500
PRIMER_NUM_RETURN=3
PRIMER_MAX_END_STABILITY=9.0
PRIMER_MAX_HAIRPIN_TH=99.9
For reading/writing to file you can use this snippet:
with open('parameters.txt', 'r', newline='') as f_in:
    data = f_in.read()

with open('parameters.txt', 'w', newline='') as f_out:
    data = change_parameter(data, 'PRIMER_PRODUCT_SIZE_RANGE', '100-500')
    data = change_parameter(data, 'PRIMER_MAX_HAIRPIN_TH', '99.9')
    f_out.write(data)
EDIT:
Extended version of change_parameter():
import re
data = """
PRIMER_PRODUCT_SIZE_RANGE=100-500 200-400
PRIMER_NUM_RETURN=3
PRIMER_MAX_END_STABILITY=9.0
"""
def change_parameter_ext(data, parameter, old_value, new_value):
    def _my_sub(g):
        return g[1] + '=' + re.sub(r'{}'.format(old_value), new_value, g[2], flags=re.I).strip()
    return re.sub(r'^(\s*(?<!#)\s*{}\s*)=(.*)$'.format(parameter), _my_sub, data, flags=re.M|re.I).strip()
data = change_parameter_ext(data, 'PRIMER_PRODUCT_SIZE_RANGE', '200-400', '500-600')
data = change_parameter_ext(data, 'PRIMER_NUM_RETURN', '3', '100-200 300-400')
print(data)
Prints:
PRIMER_PRODUCT_SIZE_RANGE=100-500 500-600
PRIMER_NUM_RETURN=100-200 300-400
PRIMER_MAX_END_STABILITY=9.0

Python - How to read a specific line in a text file?

I have a huge text file (12GB). The lines are tab delimited and the first column contains an ID. For each ID I want to do something. Therefore, my plan is to start with the first line and go through the first column line by line until the next ID is reached.
start_line = b
num_lines = 377763316

while b < num_lines:
    plasmid1 = linecache.getline("Result.txt", b-1)
    plasmid1 = plasmid1.strip("\n")
    plasmid1 = plasmid1.split("\t")
    plasmid2 = linecache.getline("Result.txt", b)
    plasmid2 = plasmid2.strip("\n")
    plasmid2 = plasmid2.split("\t")
    if not str(plasmid1[0]) == str(plasmid2[0]):
        end_line = b
        #do something
The code works, but the problem is that linecache seems to reload the text file every time. The code would run for several years if I don't improve the performance.
I appreciate your help if you have a good idea how to solve the issue or know an alternative approach!
Thanks,
Philipp
I think numpy.loadtxt() is the way to go. Also, it would be nice to pass the usecols argument to specify which columns you actually need from the file. The numpy package is a solid library written with high performance in mind.
After calling loadtxt() you will get an ndarray back.
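For example, something along these lines — a sketch only, assuming the ID sits in the first column and that the resulting array fits in memory, which is far from guaranteed with a 12GB file:

import numpy as np

# Read only the first (ID) column of the tab-delimited file as strings.
ids = np.loadtxt("Result.txt", delimiter="\t", usecols=(0,), dtype=str)

# Indices where the ID changes from one line to the next.
change_points = np.where(ids[1:] != ids[:-1])[0] + 1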
You can use itertools:
from itertools import takewhile

class EqualityChecker(object):
    def __init__(self, id):
        self.id = id

    def __call__(self, current_line):
        result = False
        current_id = current_line.split('\t')[0]
        if self.id == current_id:
            result = True
        return result

with open('hugefile.txt', 'r') as f:
    for id in ids:
        checker = EqualityChecker(id)
        for line in takewhile(checker, f):  # iterate the file object directly (xreadlines is Python 2 only)
            do_stuff(line)
In the outer loop, id can actually be obtained from the first line whose id doesn't match the previous value.
You should open the file just once, and iterate over the lines.
with open('Result.txt', 'r') as f:
    aline = f.next()
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid
You get the idea, just use plain python.
Only one line is read in each iteration. The extra 1 argument in the split will split only at the first tab, increasing performance. You will not get better performance with any specialized library. Only a plain C language implementation could beat this approach.
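As a quick illustration of that maxsplit argument:

# The second argument to split limits the number of splits, so only the
# first tab is processed and the rest of the line is left untouched.
line = "id42\tcolB\tcolC\tcolD\n"
print(line.split('\t', 1))  # ['id42', 'colB\tcolC\tcolD\n']
print(line.split('\t'))     # ['id42', 'colB', 'colC', 'colD\n']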
If you get AttributeError: '_io.TextIOWrapper' object has no attribute 'next', it is probably because you are using Python 3.X (see question io-textiowrapper-object). Try this version instead:
with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':
        aline = f.readline()
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid

How to replace a list of special characters in a csv in python

I have some csv files that may or may not contain characters like “”à that are undesirable, so I want to write a simple script that will feed in a csv and feed out a csv (or its contents) with those characters replaced with more standard characters, so in the example:
bad_chars = '“”à'
good_chars = '""a'
The problem so far is that my code seems to produce a csv with perhaps the wrong encoding? Any help would be appreciated in making this simpler and/or making sure my output csv doesn't force an incorrect regex encoding--maybe using pandas?
Attempt:
import csv, string, sys

upload_path = sys.argv[1]
input_file = open('{}'.format(upload_path), 'rb')
upload_csv = open('{}_fixed.csv'.format(upload_path.strip('.csv')), 'wb')
data = csv.reader(input_file)
writer = csv.writer(upload_csv, quoting=csv.QUOTE_ALL)

in_chars = '\xd2\xd3'
out_chars = "''"
replace_list = string.maketrans(in_chars, out_chars)

for line in input_file:
    line = str(line)
    new_line = line.translate(replace_list)
    writer.writerow(new_line.split(','))

input_file.close()
upload_csv.close()
As you stamped your question with the pandas tag - here is a pandas solution:
import pandas as pd

(pd.read_csv('/path/to/file.csv')
    .replace(r'RegEx_search_for_str', r'RegEx_replace_with_str', regex=True)
    .to_csv('/path/to/fixed.csv', index=False)
)
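Applied to the characters from the question, that might look like the sketch below. The file names are placeholders, and the mapping assumes you simply want the curly quotes turned into straight quotes and à into a:

import pandas as pd

# Each key is treated as a regex pattern and each value as its replacement;
# regex=True makes pandas do substring substitution in every string column.
char_map = {'“': '"', '”': '"', 'à': 'a'}

(pd.read_csv('input.csv')
    .replace(char_map, regex=True)
    .to_csv('input_fixed.csv', index=False)
)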

Adding each item in list to end of specific lines in FASTA file

I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
Hard to explain but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element in a list I've already made.
For example:
File1:
">seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC"
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']
Result (modified file1):
">seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down. Also, an important note: you don't close your file handles. This could result in errors, specifically when writing to a file; either way, it's bad practice. Code:
#!/usr/bin/python
import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:  # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])  # added tab as separator
    return annos

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)  # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]  # first whitespace-separated field, kept as a one-element list
                line_split.append(annos.pop(0))  # append data of interest to current id line
                output.write(' '.join(line_split) + '\n')  # join and write to file with a newline character
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
get_annos(anno)
This is not perfect, but it cleans things up a bit. I might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time — see the sketch below for an ID-keyed alternative.
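For instance, if the annotation file's first column holds the sequence ID (an assumption — adjust the column indexes to your data), the lookup could be keyed by ID instead of relying on order:

# A sketch of an order-independent alternative. It assumes the annotation
# file is tab-separated with the sequence ID in column 0 and the annotation
# in column 5, which may not match your actual layout.
def get_annos_by_id(infile):
    annos = {}
    with open(infile, 'r') as fh:
        for line in fh:
            cols = line.rstrip('\n').split('\t')
            annos[cols[0]] = cols[5]
    return annos

def add_annos_by_id(anno_file, fasta_file, outfile):
    annos = get_annos_by_id(anno_file)
    with open(fasta_file, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                seq_id = line.split()[0]  # e.g. '>seq1'
                anno = annos.get(seq_id.lstrip('>'), '')
                output.write((seq_id + ' ' + anno).rstrip() + '\n')
            else:
                output.write(line)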
There is a great Python library, Biopython, for FASTA and other sequence file parsing. It is widely used in bioinformatics, and you can manipulate the parsed data however you need.
Here is a simple example extracted from the library website:
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
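Applied to the question itself, a rough sketch with SeqIO might look like this. It assumes the annotations have already been read into a list called annos, in the same order as the FASTA records:

from Bio import SeqIO

# annos is assumed to hold one annotation string per record, in file order.
records = list(SeqIO.parse("testseq.fasta", "fasta"))
for record, anno in zip(records, annos):
    # Replace everything after the ID with the annotation (possibly empty).
    record.description = "{} {}".format(record.id, anno).strip()
SeqIO.write(records, "out.fasta", "fasta")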
***********EDIT*********
I solved this before anyone could help. This is my code, can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? Seems like it would take a long time/lots of memory.
#!/usr/bin/python
# Script takes unedited FASTA file, removes seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt
import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1)  # contains list of annos
    f2 = open(infile2, 'r')
    output = open(out, 'w')
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
get_annos(anno)
