Combining Regex files in Python

I have 48 .rx.txt files and I'm trying to combine them using Python. I know that when you combine .rx.txt files, you have to include a "|" in between the files.
Here's the code that I'm using:
import glob
read_files = filter(lambda f: f!='final.txt' and f!='result.txt', glob.glob('*.txt'))
with open("REGEXES.rx.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
        outfile.write('|')
But when I try to run that I get this error:
Traceback (most recent call last):
  File "/Users/kosay.jabre/Desktop/Password Assessor/RegexesNEW/CombineFilesCopy.py", line 10, in <module>
    outfile.write('|')
TypeError: a bytes-like object is required, not 'str'
Any ideas on how I can combine my files into one file?

Your REGEXES.rx.txt is opened in binary mode, but with outfile.write('|') you are attempting to write a str to it instead of bytes. It seems that all of your files contain text data, so open them in text mode rather than binary mode, i.e.:
with open("REGEXES.rx.txt", "w") as outfile:
    for f in read_files:
        with open(f, "r") as infile:
            outfile.write(infile.read())
        outfile.write('|')

In Python 2.7 your code works as written, but in Python 3 a file opened in binary mode only accepts bytes. Add the b prefix, outfile.write(b'|'), so that the separator is a bytes literal and can be written to the binary file.
Your code for Python 3 then becomes:
import glob
read_files = filter(lambda f: f!='final.txt' and f!='result.txt', glob.glob('*.txt'))
with open("REGEXES.rx.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
        outfile.write(b'|')
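
Both versions above write a '|' after every file, including the last one, which leaves a trailing empty alternative in the combined pattern. A minimal sketch that joins the patterns instead, assuming each file holds a single pattern as UTF-8 text; the function name combine_patterns and its parameters are hypothetical and not part of the original answers:

```python
import os


def combine_patterns(paths, out_path, encoding="utf-8"):
    """Join the pattern stored in each file with '|' and write it to out_path."""
    patterns = []
    for path in sorted(paths):
        # Skip the output file itself in case it matches the same glob.
        if os.path.abspath(path) == os.path.abspath(out_path):
            continue
        with open(path, "r", encoding=encoding) as infile:
            # Assumption: a trailing newline is not part of the pattern.
            pattern = infile.read().rstrip("\r\n")
        if pattern:
            patterns.append(pattern)
    # join() puts '|' only between patterns, never after the last one.
    combined = "|".join(patterns)
    with open(out_path, "w", encoding=encoding) as outfile:
        outfile.write(combined)
    return combined
```

It could be called as combine_patterns(glob.glob('*.rx.txt'), 'REGEXES.rx.txt') after importing glob; the '*.rx.txt' glob is a guess based on the file names in the question.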

Related

Removing comma from Text file using Python

I'm starting to play around with Python and trying to merge a couple of files I have into a single file. When I use the below code:
import glob
path = "C:\\Users\\abc\\OneDrive\\Trading\\"
read_files = glob.glob(path + "*.txt")
with open("result.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
My output file has the names surrounded by runs of commas, for example:
ASX:MCR,,,,,,,
,ASX:RHC,,,,,,
,,ASX:LTR,,,,,
,,,,ASX:MAY,,,
,,,,,,ASX:ANP,
How can I remove all the commas to get a list of stock codes, one per line, with any duplicates removed:
ASX:BGT
ASX:CNB
ASX:BFG
ASX:ICI
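
One possible approach, sketched under the assumption that every non-empty comma-separated field is a stock code and that the files are plain text; the function name collect_codes is hypothetical:

```python
def collect_codes(paths):
    """Return the unique non-empty comma-separated fields, in first-seen order."""
    seen = {}
    for path in paths:
        with open(path, "r") as infile:
            for line in infile:
                for field in line.split(","):
                    code = field.strip()
                    if code:
                        # A dict keeps insertion order, so duplicates are
                        # dropped while the first-seen order is preserved.
                        seen.setdefault(code, None)
    return list(seen)
```

The result can then be written one code per line with outfile.write("\n".join(codes) + "\n") on a file opened in text mode.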

Concatenating dump files

How to open a dump file (binary)? The answer provided in this question isn't working:
filenames = ['file1.dmp', "file2.dmp", "file3.dmp"]
with open('test_file.obj', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)
file1: 367kb
file2: 1kb
file3: 1000kb
The output file is only 5kb.
When I count the lines in the file it returns 4, when I know it's much bigger. I think it has to do with the hex representation, which Python isn't able to parse?

You are opening the output file with 'w' (text mode), which generally won't work for binary files. Open the output file with 'wb' and the input files with 'rb' instead:
filenames = ['file1.dmp', "file2.dmp", "file3.dmp"]
with open('test_file.obj', 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'rb') as infile:
            for line in infile:
                outfile.write(line)
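
Binary data has no meaningful lines, so iterating line by line works but splits the data at arbitrary newline bytes. A minimal alternative sketch that copies fixed-size chunks with shutil.copyfileobj from the standard library; the function name concatenate_binary and the 1 MiB chunk size are illustrative assumptions:

```python
import shutil


def concatenate_binary(filenames, out_path, chunk_size=1024 * 1024):
    """Append the raw bytes of each input file to out_path, in order."""
    with open(out_path, "wb") as outfile:
        for fname in filenames:
            with open(fname, "rb") as infile:
                # Copies chunk_size bytes at a time, so large dump files
                # never have to fit in memory at once.
                shutil.copyfileobj(infile, outfile, chunk_size)
```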

Trying to merge all text files in a folder and append file as well

I am trying to merge all text files in a folder. I have this part working, but when I try to write the file name before the contents of each text file, I get an error that reads: TypeError: a bytes-like object is required, not 'str'
The code below must be pretty close, but something is definitely off. Any thoughts what could be wrong?
import glob
folder = 'C:\\my_path\\'
read_files = glob.glob(folder + "*.txt")
with open(folder + "final_result.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(f)
            outfile.write(infile.read())
outfile.close

outfile.write(f) seems to be your problem: f is a str, but you opened the file in binary mode with 'wb'. You can convert the name to bytes using encode. You also don't need to close outfile on your last line, because the with block already does that (and outfile.close without parentheses doesn't call the function anyway). So something like this might work for you:
import glob
folder = 'C:\\my_path\\'
read_files = glob.glob(folder + "*.txt")
with open(folder + "final_result.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(f.encode('utf-8'))
            outfile.write(infile.read())
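
As written, the encoded file name runs straight into the file's contents with no separator. A small variation that puts each name on its own line, sketched under the assumption that a newline-delimited header is what is wanted; the function name merge_with_headers is hypothetical:

```python
import os


def merge_with_headers(paths, out_path, encoding="utf-8"):
    """Write each file's base name on its own line, then the file's bytes."""
    with open(out_path, "wb") as outfile:
        for path in paths:
            header = os.path.basename(path) + "\n"
            # The output is in binary mode, so the str header must be encoded.
            outfile.write(header.encode(encoding))
            with open(path, "rb") as infile:
                outfile.write(infile.read())
            # Separate this file's contents from the next header.
            outfile.write(b"\n")
```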

on opening a data file to reading the lines included in that file

I am using the following code segment to partition a data file into two parts:
def shuffle_split(infilename, outfilename1, outfilename2):
    with open(infilename, 'r') as f:
        lines = f.readlines()
    lines[-1] = lines[-1].rstrip('\n') + '\n'
    shuffle(lines)
    with open(outfilename1, 'w') as f:
        f.writelines(lines[:90000])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[90000:])
    outfilename1.close()
    outfilename2.close()
shuffle_split(data_file, training_file,validation_file)
Running this code segment causes the following error:
in shuffle_split
    with open(infilename, 'r') as f:
TypeError: coercing to Unicode: need string or buffer, file found
What's wrong with the way of opening the data_file for input?

Whatever you're passing in as infilename is already a file, rather than a file's path name.
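
A minimal sketch of the same function taking path strings rather than open file objects, which also makes the explicit close() calls unnecessary. The parameter names n_first and seed are additions for illustration and are not in the original question:

```python
import random


def shuffle_split(in_path, out_path1, out_path2, n_first=90000, seed=None):
    """Shuffle the lines of in_path and split them across two output files."""
    with open(in_path, "r") as f:
        lines = f.readlines()
    if lines:
        # Make sure the last line ends with a newline before shuffling.
        lines[-1] = lines[-1].rstrip("\n") + "\n"
    random.Random(seed).shuffle(lines)
    # Each with block closes its file, so no explicit close() is needed.
    with open(out_path1, "w") as f:
        f.writelines(lines[:n_first])
    with open(out_path2, "w") as f:
        f.writelines(lines[n_first:])
```

All three arguments are expected to be path names such as "data.txt", not the result of an earlier open() call.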

Replace and overwrite instead of appending

I have the following code:
import re
#open the xml file for reading:
file = open('path/test.xml','r+')
#convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
file.close()
where I'd like to replace the old content that's in the file with the new content. However, when I execute my code, the file "test.xml" is appended to, i.e. I have the old content followed by the new "replaced" content. What can I do in order to delete the old stuff and only keep the new?

You need to seek to the beginning of the file before writing, and then call file.truncate() if you want to replace the content in place:
import re
myfile = "path/test.xml"
with open(myfile, "r+") as f:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
The other way is to read the file and then open it again with open(myfile, 'w'):
with open(myfile, "r") as f:
    data = f.read()
with open(myfile, "w") as f:
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).
By the way, this is not really related to Python. The interpreter calls the corresponding low level API. The method truncate() works the same in the C programming language: See http://man7.org/linux/man-pages/man2/truncate.2.html
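
To see why truncate() matters after seek(0), here is a small self-contained sketch that rewrites a temporary file with shorter content, once without and once with truncation. The helper names rewrite_in_place and demo are hypothetical:

```python
import os
import tempfile


def rewrite_in_place(path, new_text, truncate=True):
    """Overwrite path from the start using 'r+'; optionally drop leftover bytes."""
    with open(path, "r+") as f:
        f.read()          # read the old content first, as in the question
        f.seek(0)         # go back to the start before writing
        f.write(new_text)
        if truncate:
            f.truncate()  # cut the file off at the current position


def demo():
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write("0123456789")
        rewrite_in_place(path, "abc", truncate=False)
        with open(path) as f:
            without_truncate = f.read()
        with open(path, "w") as f:
            f.write("0123456789")
        rewrite_in_place(path, "abc", truncate=True)
        with open(path) as f:
            with_truncate = f.read()
        return without_truncate, with_truncate
    finally:
        os.remove(path)
```

Here demo() returns ("abc3456789", "abc"): without truncate() the tail of the old content survives whenever the new content is shorter than the old.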

file='path/test.xml'
with open(file, 'w') as filetowrite:
    filetowrite.write('new content')
Opening the file in 'w' mode discards its current text, so whatever you write is saved as the new contents.

Using truncate(), the solution could be:
import re
#open the xml file for reading:
with open('path/test.xml','r+') as f:
    #convert to string:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
    f.truncate()

import os  # must import this library
if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # this deletes the file
else:
    print("The file does not exist")  # add this to prevent errors
I had a similar problem, and instead of overwriting my existing file using the different 'modes', I just deleted the file before using it again, so that it would be as if I was appending to a new file on each run of my code.

See How to Replace String in File, which works in a simple way and gives an answer that uses replace:
fin = open("data.txt", "rt")
fout = open("out.txt", "wt")
for line in fin:
    fout.write(line.replace('pyton', 'python'))
fin.close()
fout.close()

In my case the following code did the trick:
import json
# "w+" creates the file if it does not exist and overwrites any existing content
with open("output.json", "w+") as outfile:
    json.dump(result_plot, outfile)

Using python3 pathlib library:
import re
from pathlib import Path
import shutil
shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
filepath = Path("/tmp/test.xml")
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
A similar method using a different approach to backups:
import re
from pathlib import Path
filepath = Path("/tmp/test.xml")
backup = filepath.with_suffix('.bak')
filepath.rename(backup) # different approach to backups: move the original aside
content = backup.read_text() # the original path no longer exists, so read the backup
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
