I'm cleaning newspaper articles stored in separate text files.
In one of the cleaning stages, I want to remove all the text within one file that comes after the delimiter 'LOAD-DATE:'. I use a small piece of code that does the job when applied to a single string. See below.
line = 'A little bit of text. LOAD-DATE: And some redundant text'
import re
m = re.match('(.*LOAD-DATE:)', line)
if m:
    line = m.group(1)
    line = re.sub('LOAD-DATE:', '', line)
print(line)
A little bit of text.
However, when I translate the code into a loop to clean a whole bunch of separate text files (which works fine in other stages of the script), it produces gigantic, identical text files that don't look at all like the original newspaper articles. See the loop:
files = glob.glob("*.txt")
for f in files:
    with open(f, "r") as fin:
        try:
            import re
            m = re.match('(.*LOAD-DATE:)', fin)
            if m:
                data = m.group(1)
                data = re.sub('LOAD-DATE:', '', data)
        except:
            pass
    with open(f, 'w') as fout:
        fout.writelines(data)
Something clearly goes wrong in the loop, but I have no idea what.
Try going line by line through the file. Something like
import re
import glob

files = glob.glob("*.txt")
for f in files:
    with open(f, "r") as fin:
        data = []
        for line in fin:
            m = re.match('(.*LOAD-DATE:)', line)
            if m:
                line = m.group(1)
                line = re.sub('LOAD-DATE:', '', line)
            data.append(line)
    with open(f, 'w') as fout:
        fout.writelines(data)
I made 10 txt files all containing the string:
A little bit of text. LOAD-DATE: And some redundant text
I changed the m variable as patrick suggested, so that the file contents are actually read:
m = re.match('(.*LOAD-DATE:)', fin.read())
But I also found that I had to move the writelines inside the if statement:
if m:
    data = m.group(1)
    data = re.sub('LOAD-DATE:', '', data)
    with open(f, 'w') as fout:
        fout.writelines(data)
It changed them all no problem and very quickly.
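Putting both changes together, the whole loop might look something like this (a sketch; note that .* does not cross line breaks, so re.DOTALL is needed if the text before LOAD-DATE: spans several lines):

import re
import glob

files = glob.glob("*.txt")
for f in files:
    with open(f, "r") as fin:
        m = re.match('(.*LOAD-DATE:)', fin.read(), re.DOTALL)
    if m:
        data = re.sub('LOAD-DATE:', '', m.group(1))
        with open(f, 'w') as fout:
            fout.write(data)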
I hope this helps.
I have files that sometimes have weird end-of-line characters like \r\r\n. With this, it works the way I want:
with open('test.txt', 'wb') as f:  # simulate a file with weird end-of-lines
    f.write(b'abc\r\r\ndef')
with open('test.txt', 'rb') as f:
    for l in f:
        print(l)
# b'abc\r\r\n'
# b'def'
I want to be able to get the same result from a string. I thought about splitlines, but it does not give the same result:
print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']
Even with keepends=True, it's not the same result.
Question: how can I get the same behaviour as for l in f using splitlines()?
Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232
Note: I don't want to put everything in a BytesIO or StringIO, because that roughly halves the speed (already benchmarked); I want to keep a plain string. So it's not a duplicate of How do I wrap a string in a file in Python?.
Why don't you just split it:
input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n')
print(result)
[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
You will lose the trailing \n, but it can be added back to every line later if you really need it. For the last line you have to check whether it is actually needed. Like:
fixed = [bstr + b'\n' for bstr in result]
if input[-1:] != b'\n':  # compare a one-byte slice; input[-1] would be an int on bytes
    fixed[-1] = fixed[-1][:-1]
print(fixed)
[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
Another variant uses a generator. This way it stays memory-friendly on huge files, and the syntax is similar to the original: for l in bin_split(input):
def bin_split(input_str):
    start = 0
    while start >= 0:
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start:found]
            start = found
        else:
            yield input_str[start:]
            break
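For example, on the same input the generator yields each line with its ending intact (hypothetical usage):

data = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
for l in bin_split(data):
    print(l)
# b'\n'
# b'abc\r\r\r\n'
# b'd\ref\n'
# b'ghi\r\n'
# b'jkl'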
There are a couple of ways to do this, but none are especially fast.
If you want to keep the line endings, you might try the re module:
lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)
If you need the endings and the file is really big, you may want to iterate instead:
for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
If you don't need the endings, then you can do it much more easily:
lines = list(filter(None, text.splitlines()))
You can omit the list() part if you just iterate over the results (or if you're using Python 2):
for line in filter(None, text.splitlines()):
    pass  # do stuff with line
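For the sample string from the question, the first approach keeps the endings and the last one drops them (a quick check, assuming a str input rather than bytes):

import re

text = 'abc\r\r\ndef'
print(re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text))  # ['abc\r\r\n', 'def']
print(list(filter(None, text.splitlines())))          # ['abc', 'def']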
I would iterate through it like this:
text = 'abc\r\r\ndef'
results = text.split('\r\r\n')
for r in results:
    print(r)
This is a for l in f: solution:
The key to this is the newline argument on the open call. From the documentation:
(The answer originally included a screenshot of the open() documentation for the newline argument: when writing with newline='', no newline translation takes place; when reading with newline='\n', input lines are terminated only by '\n' and the endings are returned untranslated.)
Therefore, you should use newline='' when writing to suppress newline translation and then when reading use newline='\n', which will work if all your lines terminate with 0 or more '\r' characters followed by a '\n' character:
with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')

with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))
Prints:
'abc\r\r\n'
'def'
A quasi-splitlines solution:
This is, strictly speaking, not a splitlines solution, since to handle arbitrary line endings a regular-expression version of split would have to be used, capturing the line endings and then re-assembling the lines with their endings. So instead, this solution just uses a regular expression to break up the input text, allowing line endings that consist of any number of '\r' characters followed by a '\n' character:
import re
input = '\nabc\r\r\ndef\nghi\r\njkl'
with open('test.txt', 'w', newline='') as f:
    f.write(input)
with open('test.txt', 'r', newline='') as f:
    text = f.read()
lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
for line in lines:
    print(repr(line))
Prints:
'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'
Regex Demo
I am very new to Python so please excuse ignorant questions or overly complicated code. :)
I am very thankful for any help.
The code I have so far opens and reads one or several text files, searches the lines for keywords,
and then writes new text files while leaving out the lines containing those keywords. This is to clean the files (newspaper articles) of information I do not want before analysing the remaining text. The problem is that I am only able to search for single words. However, sometimes I would like to search for a specific combination of words, i.e. not just "Rechte", but "Alle Rechte vorbehalten".
If I put this combination into my delword list, it doesn't work (I think because for part in line.split() only checks single words).
Any help is very much appreciated!
import os

delword = ['Quelle:', 'Ressort:', 'Ausgabe:', 'Dokumentnummer:', 'Rechte', 'Alle Rechte vorbehalten']
path = r'C:\files'
pathnew = r'C:\files\new'
dir = []
for f in os.listdir(path):
    if f.endswith(".txt"):
        #print(os.path.join(path, f))
        print(f)
        if f not in dir:
            dir.append(f)
for f in dir:
    fpath = os.path.join(path, f)
    print(fpath)
    fopen = open(fpath, encoding="utf-8", errors='ignore')
    printline = True
    #print(fopen.read())
    fnew = 'clean' + f
    fpathnew = os.path.join(pathnew, fnew)
    with open(fpath, encoding="utf-8", errors='ignore') as input:
        with open(fpathnew, "w", errors='ignore') as output:
            for line in input:
                printline = True
                for part in line.split():
                    for i in range(len(delword)):
                        if delword[i] in part:
                            #line = " ".join((line).split())
                            printline = False
                            #print('Found: ', line)
                if printline == False:
                    output.write('\n')
                if printline == True:
                    output.write(line)
    input.close()
    output.close()
    fopen.close()
For this particular case you don't need to split the line; you can run a similar check with:
for line in input:
    for word in delword:
        if word in line: ...
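Applied to the loop from the question, the inner check might look like this (a sketch that keeps the rest of the file handling as it is):

for line in input:
    # keep the line only if none of the (possibly multi-word) phrases occurs in it
    if any(word in line for word in delword):
        output.write('\n')
    else:
        output.write(line)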
Just as a side note: more generic or complex problems are usually handled with regular expressions, the tool created for this kind of processing; see the sketch below.
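For instance, a regex variant might precompile one pattern from the phrase list (a sketch; re.escape makes sure any special characters in the phrases are taken literally):

import re

delword = ['Quelle:', 'Ressort:', 'Dokumentnummer:', 'Alle Rechte vorbehalten']
pattern = re.compile('|'.join(re.escape(w) for w in delword))

print(bool(pattern.search('Alle Rechte vorbehalten.')))  # True
print(bool(pattern.search('Ein ganz normaler Satz.')))   # False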
I have a large number of entries in a file. Let me call it file A.
File A:
('aaa.dat', 'aaa.dat', 'aaa.dat')
('aaa.dat', 'aaa.dat', 'bbb.dat')
('aaa.dat', 'aaa.dat', 'ccc.dat')
I want to use these entries, line by line, in a program that would iteratively pick an entry from file A and concatenate the corresponding files in this way:
filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat']  ### entry number 3
with open('out.dat', 'w') as outfile:          ### the name has to be aaa-aaa-ccc.dat
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read().strip())
All I need to do is to substitute the filenames iteratively and create an output in a "aaa-aaa-aaa.dat" format. I would appreciate any help-- feeling a bit lost!
Many thanks!!!
You can retrieve and modify the file names in the following way:
import re

pattern = re.compile(r'\W')

with open('fnames.txt', 'r') as infile:
    for line in infile:
        line = re.sub(pattern, ' ', line).split()
        # Old filenames - to concatenate contents
        content = [x + '.dat' for x in line[::2]]
        # New filename
        new_name = '-'.join(line[::2]) + '.dat'
        # Write the concatenated content to the new
        # file (first read the content all at once)
        with open(new_name, 'w') as outfile:
            for con in content:
                with open(con, 'r') as old:
                    new_content = old.read()
                outfile.write(new_content)
This program reads your input file, here named fnames.txt, with the exact structure from your post, line by line. For each line it splits the entries using a precompiled regex (precompiling is suitable here and should make things a bit faster). This assumes that your filenames contain only word characters, since the regex substitutes every non-word character with a space.
After the substitution and split, each line becomes a flat list such as ['aaa', 'dat', 'aaa', 'dat', 'ccc', 'dat']. Taking every second entry starting from 0 gives the base names, which are joined with '-' and given a .dat extension to form the new file name.
The same every-second-entry slice, with '.dat' appended, gives the list content of the individual files whose contents will be concatenated.
Finally, it reads each of the files in content and writes them to the common file new_name. It reads each of them all at once, which may be a problem if these files are big, and in general there may be more efficient ways of doing all this. Also, if you are planning to do more things with the content from the old files before writing, consider moving the old-file-specific operations to a separate function for readability and easier debugging.
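For large inputs, one way to avoid reading each old file into memory in one go is shutil.copyfileobj, which streams the data in chunks (a sketch of the inner part only):

import shutil

# stream each old file into the combined file chunk by chunk
with open(new_name, 'w') as outfile:
    for con in content:
        with open(con, 'r') as old:
            shutil.copyfileobj(old, outfile)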
Something like this:
with open(fname) as infile, open('out.dat', 'w') as outfile:
    for line in infile:
        line = line.strip()
        if line:  # not empty
            filenames = eval(line)                    # read tuple (ast.literal_eval is safer)
            filenames = [f[:-4] for f in filenames]   # remove extension
            filename = '-'.join(filenames) + '.dat'   # make filename
            outfile.write(filename + '\n')            # write
If your problem is just calculating the new filenames, how about using os.path.splitext?
'-'.join([
    f[0] for f in [os.path.splitext(path) for path in filenames]
]) + '.dat'
Which can probably be better understood if you see it like this:
import os
clean_fnames = []
filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat']
for fname in filenames:
    name, extension = os.path.splitext(fname)
    clean_fnames.append(name)
name_without_ext = '-'.join(clean_fnames)
name_with_ext = name_without_ext + '.dat'
print(name_with_ext)
HOWEVER: If your issue is that you can not get the filenames in a list by reading the file line by line, you must keep in mind that when you read files, you get text (strings) NOT Python structures. You need to rebuild a list from a text like: "('aaa.dat', 'aaa.dat', 'aaa.dat')\n".
You could take a look at ast.literal_eval or try to rebuild it yourself. The code below outputs a lot of messages to show what's happening:
import pprint

collected_fnames = []
with open('./fileA.txt') as f:
    for line in f:
        print("Read this (literal) line: %s" % repr(line))
        line_without_whitespaces_on_the_sides = line.strip()
        if not line_without_whitespaces_on_the_sides:
            print("line is empty... skipping")
            continue
        else:
            line_without_parenthesis = (
                line_without_whitespaces_on_the_sides
                .lstrip('(')
                .rstrip(')')
            )
            print("Cleaned parenthesis: %s" % line_without_parenthesis)
            chunks = line_without_parenthesis.split(', ')
            print("Collected %s chunks in a %s: %s" % (len(chunks), type(chunks), chunks))
            chunks_without_quotations = [chunk.replace("'", "") for chunk in chunks]
            print("Now we don't have quotations: %s" % chunks_without_quotations)
            collected_fnames.append(chunks_without_quotations)

print("collected %s lines with filenames:\n%s" %
      (len(collected_fnames), pprint.pformat(collected_fnames)))
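For comparison, a minimal ast.literal_eval version might look like this (a sketch, assuming the same fileA.txt layout):

import ast

collected_fnames = []
with open('./fileA.txt') as f:
    for line in f:
        line = line.strip()
        if line:
            # safely turn "('aaa.dat', 'aaa.dat', 'ccc.dat')" into a tuple of strings
            collected_fnames.append(list(ast.literal_eval(line)))
print(collected_fnames)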
I have a working script to extract certain data from a series of huge text files. Unfortunately I went down the route of 'readlines', and consequently my code runs out of memory after a certain number of files have been processed.
I am trying to re-write my code to process the files line by line using the 'for line in file' format, but I am now having problems with my line processing once a string is found.
Basically, once my string is found I want to go back to various surrounding lines in the text file, say 16 (and 10 and 4) lines before it, and do some processing there to collect data associated with the matched line. With the readlines route I enumerated the file, but I am struggling to work out the correct approach with a line-by-line method (or indeed to find out whether it is even possible!).
Here's my code, I'll admit I have some bad code in there as I have played about a bit with the line grabbing, basically around the line[-xx] parts...
searchstringsFilter1 = ['Filter Used : 1']

with open(file, 'r') as f:
    for line in f:
        timestampline = None
        timestamp = None
        for word in searchstringsFilter1:
            if word in line:
                #print line
                timestampline = line[-16]
                #print timestampline
                keyline = line
                Rline = line[-10]
                print Rline
                Rline = re.sub('[()]', '', Rline)
                SNline = line[-4]
                SNline = re.sub('[()]', '', SNline)
                split = keyline.split()
                str = timestampline
                match = re.search(r'\d{2}:\d{2}:\d{2}.\d{3}', str)
                valueR = Rline.split()
                valueSN = SNline.split()
                split = line.split()
                worksheetFilter.write(row_num, 0, match.group())
                worksheetFilter.write(row_num, 1, split[3], integer_format)
                worksheetFilter.write(row_num, 2, valueR[4], decimal_format)
                worksheetFilter.write(row_num, 3, valueSN[3], decimal_format)
                row_num += 1
                tot = tot + 1
                break

print 'total count for', '"', searchstringsFilter1[a], '"', 'is', tot
Filtertot = tot
tot = 0
Is there anything obvious I am doing wrong, or am I following a completely incorrect path to do what I am trying to do?
Many thanks for reading this,
MikG
You need a circular buffer to temporarily hold the previous lines in memory. This can be obtained using collections.deque:
import collections

ring_buf = collections.deque(maxlen=17)
with open(file, 'r') as f:
    for line in f:
        # append the new line; once the buffer is full, the oldest entry
        # is discarded automatically (FIFO style)
        ring_buf.append(line)
        timestampline = None
        timestamp = None
        for word in searchstringsFilter1:
            if word in line:
                #print line
                timestampline = ring_buf[-16]
                #print timestampline
                keyline = line
                Rline = ring_buf[-10]
                print Rline
                Rline = re.sub('[()]', '', Rline)
                SNline = ring_buf[-4]
                SNline = re.sub('[()]', '', SNline)
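A quick illustration of how the maxlen bound behaves (a minimal, hypothetical example):

import collections

buf = collections.deque(maxlen=3)
for i in range(5):
    buf.append(i)
print(buf)      # deque([2, 3, 4], maxlen=3) - only the last 3 items are kept
print(buf[-1])  # 4, the newest item
print(buf[-3])  # 2, the oldest item still in the buffer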
If you know how many lines you need to use at a time (let's say that you need 16 lines at a time), you can do this:
with open(file, 'r') as f:
    # Some sort of loop...
    chunk = [next(f) for x in xrange(16)]
chunk should contain the next 16 lines of the file.
EDIT: after some clarification, this might be more useful:
with open(file, 'r') as f:
    chunk = [next(f) for x in xrange(16)]
    while not whatWeWant(chunk[15]):
        chunk.append(next(f))
        chunk.pop(0)
Obviously, this would need some guards and checks, but I think this is what you want. chunk[15] will be the line you were trying to find, and chunk[0:15] will be the lines before it.
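One possible set of guards, written for Python 3 (range instead of xrange) and stopping cleanly at the end of the file, might look like this (a sketch; whatWeWant stands for whatever search test you use):

with open(file, 'r') as f:
    chunk = [line for _, line in zip(range(16), f)]  # at most 16 lines, even for short files
    if len(chunk) == 16:
        while not whatWeWant(chunk[15]):
            nxt = next(f, None)   # None signals end of file
            if nxt is None:
                break             # search string never found
            chunk.append(nxt)
            chunk.pop(0)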
How can I insert a string at the beginning of each line in a text file? I have the following code:
f = open('./ampo.txt', 'r+')
with open('./ampo.txt') as infile:
    for line in infile:
        f.insert(0, 'EDF ')
f.close
I get the following error:
'file' object has no attribute 'insert'
Python comes with batteries included:
import fileinput
import sys
for line in fileinput.input(['./ampo.txt'], inplace=True):
    sys.stdout.write('EDF {l}'.format(l=line))
Unlike the solutions already posted, this also preserves file permissions.
You can't modify a file in place like that. Files do not support insertion. You have to read it all in and then write it all out again.
You can do this line by line if you wish. But in that case you need to write to a temporary file and then replace the original. So, for small enough files, it is just simpler to do it in one go like this:
with open('./ampo.txt', 'r') as f:
    lines = f.readlines()
lines = ['EDF ' + line for line in lines]
with open('./ampo.txt', 'w') as f:
    f.writelines(lines)
Here's a solution where you write to a temporary file and move it into place. You might prefer this version if the file you are rewriting is very large, since it avoids keeping the contents of the file in memory, as versions that involve .read() or .readlines() will. In addition, if there is any error in reading or writing, your original file will be safe:
from shutil import move
from tempfile import NamedTemporaryFile
filename = './ampo.txt'
tmp = NamedTemporaryFile(delete=False)
with open(filename) as finput:
    with open(tmp.name, 'w') as ftmp:
        for line in finput:
            ftmp.write('EDF ' + line)
move(tmp.name, filename)
For a file not too big:
with open('./ampo.txt', 'rb+') as f:
    x = f.read()
    f.seek(0, 0)
    # the file is in binary mode, so write bytes, not str
    f.writelines((b'EDF ', x.replace(b'\n', b'\nEDF ')))
    f.truncate()
Note that, in this particular case (the content only gets longer), the f.truncate() is not strictly necessary: the new data completely overwrites the old data, so nothing is left past the end of the write. There is no 'EOF character' that gets written when a file is closed; a file simply ends where its data ends. The call does matter when the new content is shorter than the old one: without truncate(), the bytes of the old content that lie beyond the last write would remain in the file as trailing garbage. Keeping the call in both cases is a cheap way to stay on the safe side.
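A tiny demonstration of why truncate() matters when the content shrinks (a hypothetical example):

# write 10 characters, then overwrite them with only 3
with open('demo.txt', 'w') as f:
    f.write('0123456789')
with open('demo.txt', 'r+') as f:
    f.write('abc')
    # without f.truncate() the file would now contain 'abc3456789'
    f.truncate()
with open('demo.txt') as f:
    print(f.read())  # 'abc'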
For a big file, to avoid putting all the content of the file in RAM at once:
import os

def addsomething(filepath, ss):
    if filepath.rfind('.') > filepath.rfind(os.sep):
        a, _, c = filepath.rpartition('.')
        tempi = a + 'temp.' + c
    else:
        tempi = filepath + 'temp'
    with open(filepath, 'rb') as f, open(tempi, 'wb') as g:
        g.writelines(ss + line for line in f)
    os.remove(filepath)
    os.rename(tempi, filepath)

# pass the prefix as bytes, since the files are opened in binary mode
addsomething('./ampo.txt', b'WZE')
f = open('./ampo.txt', 'r')
lines = ['EDF ' + l for l in f.readlines()]
f.close()

f = open('./ampo.txt', 'w')
f.writelines(lines)  # note: a lazy map(f.write, lines) would not actually write in Python 3
f.close()