Read text file as a whole [duplicate] - python

I need your help. I want to read a text file "as a whole" rather than line by line, because my regex needs the whole text and doesn't work well when applied one line at a time. So far this is what I am doing:
with open(r"AllText.txt") as fp:
    for line in fp:
        for i in re.finditer(regexp_v3, line):
            print(i.group())
I need to open my file, read it all, search it for my regex and print the results. How can I accomplish this?

To get all the content of a file, just use file.read():
all_text = fp.read() # Within your with statement.
all_text is now a single string containing the data in the file.
Note that this will contain newline characters, but if you are extracting things with a regex they shouldn't be a problem.
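Putting this together with the loop from the question, a minimal sketch might look like the following. The pattern and the sample text are made up for illustration (the question does not show regexp_v3), and a temporary file stands in for AllText.txt.

```python
import os
import re
import tempfile

# Hypothetical pattern that only matches across a line break;
# the question's regexp_v3 is not shown, so this is a stand-in.
pattern = re.compile(r"start\n\s*end")

# Create a small sample file standing in for AllText.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("start\n  end\nnoise\nstart\nend\n")
    path = tmp.name

with open(path) as fp:
    all_text = fp.read()  # the whole file as one string

# Search the complete text instead of one line at a time.
matches = [m.group() for m in pattern.finditer(all_text)]
for match in matches:
    print(match)

os.remove(path)
```

Neither match could be found line by line, because each one spans a newline.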

For that use read:
with open("AllText.txt") as fp:
    whole_file_text = fp.read()
Note, however, that your text will contain \n wherever there was a line break in the file.
For example, if this was your text file:
#AllText.txt
Hello
How
Are
You
Your whole_file_text string will be as follows:
>>> whole_file_text
'Hello\nHow\nAre\nYou'
You can do either of the following:
>>> whole_file_text.replace('\n', ' ')
'Hello How Are You'
>>> whole_file_text.replace('\n', '')
'HelloHowAreYou'

If you don't want to read the entire file into memory, you can use mmap.
Memory-mapped file objects behave like both strings and file objects.
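import re, mmap
with open(r'AllText.txt', 'r+') as f:
    # Length 0 maps the whole file.
    data = mmap.mmap(f.fileno(), 0)
    # On Python 3 the mapping is bytes-like, so regexp_v3 must be a bytes pattern.
    mo = re.finditer(regexp_v3, data)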
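For Python 3, where a memory map is searched with a bytes pattern, the same idea could be sketched as below. The sample file and the pattern are assumptions made for the example; the file is opened read-only here, which is a choice of this sketch rather than something the answer requires.

```python
import mmap
import os
import re
import tempfile

# Sample file standing in for AllText.txt.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"Hello\nHow\nAre\nYou\n")
    path = tmp.name

# Hypothetical bytes pattern: words starting with "H".
pattern = re.compile(rb"H\w+")

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
        found = [m.group() for m in pattern.finditer(data)]

os.remove(path)
print(found)
```

The matches come back as bytes, so decode them if you need str values.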

Related

Editing a txt file in Python to edit the formatting and then create new txt file

Thank you for taking the time to read this post. This is literally my first time trying to use Python, so bear with me.
My Target/Goal: Edit the original text file (Original .txt file) so that for every domain listed an "OR" is added in between them (below target formatting image). Any help is greatly appreciated.
I have been able to google the information to open and read the txt file, however, I am not sure how to do the formatting part.
[Images: Script; Original .txt file; Target formatting]
You can achieve this in a couple lines as:
with open(my_file) as fd:
    result = fd.read().replace("\n", " OR ")
You could then write this to another file with:
with open(formatted_file, "w") as fd:
    fd.write(result)
Something you could do is the following:
import re
# This opens the file in read mode
with open('Original.txt', 'r') as file:
    # Read the contents of the file
    contents = file.read()
# It seems that your original file has a line break after each domain, so
# you could replace each one with the word "OR" using a regular expression
contents = re.sub(r'\n+', ' OR ', contents)
# Then you should open the file in write mode
with open('Original.txt', 'w') as file:
    # and finally write the modified contents to the file
    file.write(contents)
A suggestion: you may want to write to a different file first to check that you are happy with the result (or make a copy of Original.txt just in case):
with open('AnotherOriginal.txt', 'w') as file:
    file.write(contents)
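One thing to watch with both approaches: if the file ends with a newline, replacing every "\n" leaves a trailing " OR " at the end. A small sketch that avoids this by joining the non-empty lines instead; the domain names are placeholders, since the original file is only shown as an image.

```python
# Sample contents standing in for Original.txt (hypothetical domains).
contents = "example.com\nexample.org\n\nexample.net\n"

# Keep only non-empty lines, then put " OR " between them.
domains = [line.strip() for line in contents.splitlines() if line.strip()]
result = " OR ".join(domains)

print(result)
```

Because join only places the separator between items, no " OR " is left dangling at either end.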

How to fix a text file with characters \u2014, \u2017, etc using python?

A text file has contents like
"Length: As per client\u2019s need|\u2022 Material: CFC|\u2022"
I'm trying to convert these escape sequences to the actual characters. How can I read the file, convert them and save the result back?
In general, something along the lines of
import re
uni_chr_re = re.compile(r'\\u([a-fA-F0-9]{4})')
lines = []
with open(filename) as f:
    for line in f:
        # chr is the Python 3 spelling; on Python 2 use unichr instead.
        lines.append(uni_chr_re.sub(lambda m: chr(int(m.group(1), 16)), line))
That's the general approach, but the specifics depend on the details such as where this text came from, as Martijn pointed out.
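As a quick check of the approach on the sample line from the question, here is a Python 3 sketch. The helper name unescape is made up for this example, and it assumes the file really contains literal backslash-u sequences rather than, say, JSON that should be parsed with a JSON library.

```python
import re

uni_chr_re = re.compile(r"\\u([a-fA-F0-9]{4})")

def unescape(text):
    # Replace each literal backslash-u escape with the character it names.
    return uni_chr_re.sub(lambda m: chr(int(m.group(1), 16)), text)

# Raw string, so the backslashes are kept literally, as they would be in the file.
sample = r"Length: As per client\u2019s need|\u2022 Material: CFC|\u2022"
converted = unescape(sample)
print(converted)
```

To save the result back, write it out with an explicit encoding such as open(filename, "w", encoding="utf-8"), so the non-ASCII characters survive.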

Delete comments in text file

I am trying to delete comments starting on new lines in a Python code file using Python code and regular expressions. For example, for this input:
first line
#description
hello my friend
I would like to get this output:
first line
hello my friend
Unfortunately this code didn't work for some reason:
with open(input_file,"r+") as f:
    string = re.sub(re.compile(r'\n#.*'),"",f.read())
    f.seek(0)
    f.write(string)
For some reason the output I get is the same as the input.
1) There is no reason to call re.compile unless you save the result. You can always just use the regular expression text.
2) Seeking to the beginning of the file and writing there may cause problems for you if your replacement text is shorter than your original text. It is easier to re-open the file and write the data.
Here is how I would fix your program:
import re
input_file = 'in.txt'
with open(input_file,"r") as f:
    data = f.read()
data = re.sub(r'\n#.*', "", data)
with open(input_file, "w") as f:
    f.write(data)
It doesn't seem right to start the regular expression with \n, and I don't think you need to use re.compile here.
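In addition, you have to pass the re.M flag so that ^ and $ match at the start and end of every line.
The following is meant to delete all lines that start with # as well as empty lines.
with open(input_file, "r+") as f:
    text = f.read()
    string = re.sub(r'^(#.*)|(\s*)$', '', text, flags=re.M)
    # Rewind and truncate; otherwise the result is appended after the old text.
    f.seek(0)
    f.write(string)
    f.truncate()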
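To see the effect of re.M on the sample input from the question, here is a small sketch that works on a string rather than a file. The pattern is a variation chosen for this example (it is not the one from either answer): it anchors on the start of a line and also consumes the comment line's own line break.

```python
import re

text = "first line\n#description\nhello my friend\n"

# With re.M, ^ matches at the start of every line. The optional \n
# removes the comment line's own line break as well.
cleaned = re.sub(r"^#.*\n?", "", text, flags=re.M)

print(cleaned)
```

A # that appears later in a line is left alone, because the pattern only matches at the start of a line.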

Multiple line file into one string

Hello, I'm making a Python program that takes in a file. I want its contents stored in a single string. My current code is:
with open('myfile.txt') as f:
    title = f.readline().strip()
    content = f.readlines()
The text file (simplified) is:
Title of Document
asdfad
adfadadf
adfadaf
adfadfad
I want to strip the title (which my program does) and then make the rest one string. Right now the output is:
['asdfad\n', 'adfadadf\n', etc...]
and I want:
asdfadadfadadf etc...
I am new to python and I have spent some time trying to figure this out but I can't find a solution that works. Any help would be appreciated!
You can do this:
with open('/tmp/test.txt') as f:
    title = next(f)  # read (and skip) the title line
    data = ''.join(line.rstrip() for line in f)
Use list.pop(0) to remove the first line from content.
Then str.join(iterable). You'll also need to strip off the newlines.
content.pop(0)
done = "".join([l.strip() for l in content])
print(done)
Another option is to read the entire file, then remove the newlines instead of joining together:
with open('somefile') as fin:
    next(fin, None)  # ignore first line
    one_big_string = fin.read().replace('\n', '')
If you want the rest of the file in a single chunk, just call the read() method:
with open('myfile.txt') as f:
    title = f.readline().strip()
    content = f.read()
This reads the rest of the file until EOF is encountered, with the line breaks kept in the string.
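Combining the ideas from the answers above, a self-contained sketch using the sample file from the question could look like this. A temporary file stands in for myfile.txt, and splitlines() is one of several ways to drop the line breaks.

```python
import os
import tempfile

# Sample file standing in for myfile.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Title of Document\nasdfad\nadfadadf\nadfadaf\nadfadfad\n")
    path = tmp.name

with open(path) as f:
    title = f.readline().strip()              # first line only
    content = "".join(f.read().splitlines())  # the rest, line breaks removed

os.remove(path)
print(title)
print(content)
```

Since readline() has already consumed the title, read() only returns what follows it.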

Delete a specific string (not line) from a text file python

I have a text file with two lines:
<BLAHBLAH>483920349<FOOFOO>
<BLAHBLAH>4493<FOOFOO>
That's the only thing in the text file. Using Python, I want to rewrite the file so that BLAHBLAH and FOOFOO are removed from each line. It seems like a simple task, but even after brushing up on file manipulation I can't seem to find a way to do it.
Help is greatly appreciated :)
Thanks!
If it's a text file as you say, and not HTML/XML/something else, just use replace:
for line in infile.readlines():
    cleaned_line = line.replace("BLAHBLAH","")
    cleaned_line = cleaned_line.replace("FOOFOO","")
and write cleaned_line to an output file.
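# Read first: opening with "w+" would truncate the file before it can be read.
with open(path_to_file, "r") as f:
    text = f.read()
with open(path_to_file, "w") as f:
    f.write(text.replace("<BLAHBLAH>","").replace("<FOOFOO>",""))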
Update (saving to another file):
f = open(path_to_input_file, "r")
output = open(path_to_output_file, "w")
output.write(f.read().replace("<BLAHBLAH>","").replace("<FOOFOO>",""))
f.close()
output.close()
Consider the regular expressions module re.
result_text = re.sub('<(.|\n)*?>',replacement_text,source_text)
This matches each string enclosed in < and >. The match is non-greedy, i.e. it accepts the shortest possible substring. For example, given "<1> text <2> more text", a greedy match would take in "<1> text <2>", whereas a non-greedy match takes in "<1>" and "<2>" separately.
And of course, your replacement_text would be '' and source_text would be each line from the file.
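Applied to the two lines from the question, that could be sketched as follows. Note that this example swaps in a negated character class, [^>]*, in place of the non-greedy (.|\n)*? above; it matches the same tags in this input and is just one way to write it.

```python
import re

lines = ["<BLAHBLAH>483920349<FOOFOO>", "<BLAHBLAH>4493<FOOFOO>"]

# [^>]* matches everything up to the next ">", so each tag is
# removed separately without needing a non-greedy quantifier.
cleaned = [re.sub(r"<[^>]*>", "", line) for line in lines]
print(cleaned)
```

This removes any <...> tag, not only BLAHBLAH and FOOFOO, so use the plain replace() calls if other tags must be kept.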
