I am trying to extract complete sentences from a long text file and add them as strings to a list in Python 2.7. I want to automate this rather than cut and paste them into the list.
Here is what I have:
from sys import argv
script, filename = argv  # script = alien.py; filename = roswell.txt
listed = []
text = open(filename, 'rw')
for i in text:
    lines = readline(i)
    listed.append(lines)
print listed
text.close()
Nothing loads to the list.
You can do it with a while loop:
listed = []
with open(filename, "r") as text:
    Line = text.readline()
    while Line != '':
        listed.append(Line)
        Line = text.readline()
print listed
In the previous example, I assumed that each sentence is on its own line. If that's not the case, use this code instead:
listed = []
with open(filename, "r") as text:
    Line = text.readline()
    while Line != '':
        Line1 = Line.split(".")
        for Sentence in Line1:
            listed.append(Sentence)
        Line = text.readline()
print listed
On a side note, try using with open(...) as text: instead of text = open(...), so that the file is closed automatically.
Sentences are normally separated by '. ', not '\n'. In that case, split on period + space (without the newline):
listed = []
fd = open(filename, "r")
try:
    data = fd.read()
    sentences = data.split(". ")
    for sentence in sentences:
        listed.append(sentence)
    print listed
finally:
    fd.close()
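The splitting approaches above can be combined into one regular expression. What follows is only a minimal sketch, written for Python 3 (in Python 2.7 the print call would need adjusting), and it assumes a sentence ends with '.', '!' or '?' followed by whitespace. The helper name split_sentences and the sample text are made up for illustration. Abbreviations such as "Dr. Smith" would be split incorrectly.

```python
import re

def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    # Collapse newlines so a sentence wrapped across lines stays in one piece.
    flat = " ".join(text.split())
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [p for p in parts if p]

# Hypothetical sample; in the question the text would come from open(filename).read()
sample = "They landed at night. Nobody saw them!\nOr did they? The report\nsays otherwise."
listed = split_sentences(sample)
print(listed)
```

Unlike split(". "), this keeps the terminating punctuation on each sentence and does not leave a trailing fragment when the text ends with a period.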
I have a txt file with the following info:
545524---Python foundation---Course---https://img-c.udemycdn.com/course/100x100/647442_5c1f.jpg---Outsourcing Development Work: Learn My Proven System To Hire Freelance Developers
Another line with the same format but different info and continue....
Here, on line 1, "Python foundation" is the course title. If a user inputs "foundation", how do I print out "Python foundation"? Basically, I want to print the whole title of a course based on a given word.
I can use something like:
input_text = 'foundation'
file1 = open("file.txt", "r")
readfile = file1.read()
if input_text in readfile:
    # This prints only the foundation keyword, not the whole title
I assume that your input file has multiple lines, separated by newlines, in this format:
<Course-id>---<Course-name>---Course---<Course-image-link>---<Desc>
import re
input_text = 'foundation'
# The title is the text between the first pair of '---' separators
book_title_pattern = re.compile(r'---([\w\s.,;:()]+)---')
with open('file.txt', 'r') as file1:
    for line in file1:
        match = book_title_pattern.search(line)
        if match:
            matched_title = match.group(1)
            if input_text in matched_title:
                print(matched_title)
Get the key value that you're searching for. This could come from user input; it is hard-coded here for demo purposes.
Open the file and read one line at a time. Use a regular expression to parse each line, looking for a specific pattern. Check that a token matching the pattern was actually found, then check whether it contains the key value. Print the result as appropriate.
import re
key = 'foundation'
with open('input.txt') as infile:
    for line in map(str.strip, infile):
        if (t := re.findall(r'---([a-zA-Z\s]+)---', line)) and key in t[0]:
            print(t[0])
You can use a regex to match ---Course name--- using ---([a-zA-Z ]+)---. This gives you all the course names. Then check for user_input in each course name and print the name if it contains user_input:
import re
user_input = 'foundation'
file1 = open("file.txt", "r")
readfile = file1.read()
course_name = re.findall('---([a-zA-Z ]+)---', readfile)
for course in course_name:
    if user_input in course:  # Then check for 'foundation' in course_name
        print(course)
Output:
Python foundation
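Since every record uses '---' as a field separator, the title can also be picked out with str.split instead of a regex. This is only a sketch under the assumption that the title is always the second field and that no field contains '---' itself. The function name find_titles and the second sample record are hypothetical.

```python
def find_titles(lines, key):
    """Return the course titles that contain key (case-insensitive)."""
    matches = []
    for line in lines:
        fields = line.strip().split("---")
        # Assumed layout: id, title, type, image URL, description.
        if len(fields) >= 2 and key.lower() in fields[1].lower():
            matches.append(fields[1])
    return matches

sample_lines = [
    "545524---Python foundation---Course---https://img-c.udemycdn.com/course/100x100/647442_5c1f.jpg---Outsourcing Development Work: Learn My Proven System To Hire Freelance Developers\n",
    # Hypothetical second record, made up for this example
    "100001---Intro to SQL---Course---https://example.com/sql.jpg---A made-up description\n",
]
print(find_titles(sample_lines, "foundation"))
```

With a real file, the list could be replaced by the open file object, since iterating over it yields one line at a time. Splitting also accepts titles with digits or punctuation, which the [a-zA-Z ]+ pattern would reject.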
I read from a file; wherever the text contains a ".", a newline "\n" should be added after it, and the result should be written back to a file. I tried this code, but it does not work.
inp = open('rawCorpus.txt', 'r')
out = open("testFile.text", "w")
for line in iter(inp):
    l = line.split()
    if l.endswith(".")
        out.write("\n")
    s = '\n'.join(l)
    print(s)
    out.write(str(s))
inp.close()
out.close()
Try this (the normal way):
with open("rawCorpus.txt", 'r') as read_file:
    raw_data = read_file.readlines()
my_save_data = open("testFile.text", "a")
for lines in raw_data:
    if "." in lines:
        re_lines = lines.replace(".", ".\r\n")
        my_save_data.write(re_lines)
    else:
        my_save_data.write(lines + "\n")
my_save_data.close()
If your text file is not big, you can try this too:
with open("rawCorpus.txt", 'r') as read_file:
    raw_data = read_file.read()
re_data = raw_data.replace(".", ".\n")
with open("testFile.text", "w") as save_data:
    save_data.write(re_data)
UPDATE (how the new lines are displayed also depends on your text viewer: some editors treat "\n" as a line break, while others expect "\r\n"):
Input sample:
This is a book. i love it.
This is a apple. i love it.
This is a laptop. i love it.
This is a pen. i love it.
This is a mobile. i love it.
Code:
last_buffer = []
read_lines = [line.rstrip('\n') for line in open('input.txt')]
my_save_data = open("output.txt", "a")
for lines in read_lines:
    re_make_lines = lines.split(".")
    for items in re_make_lines:
        if items.replace(" ", "") == "":
            pass
        else:
            result = items.strip() + ".\r\n"
            my_save_data.write(result)
my_save_data.close()
Output will be:
This is a book.
i love it.
This is a apple.
i love it.
This is a laptop.
i love it.
This is a pen.
i love it.
This is a mobile.
i love it.
You are overwriting the string s in every loop with s = '\n'.join(l).
Initialize s = '' as an empty string before the for loop and append to it on every iteration, e.g. with s += '\n'.join(l) (short for s = s + '\n'.join(l)).
This should work:
inp = open('rawCorpus.txt', 'r')
out = open('testFile.text', 'w')
s = ''  # empty string
for line in iter(inp):
    l = line.split('.')
    s += '\n'.join(l)  # add new lines to s
print(s)
out.write(str(s))
inp.close()
out.close()
Here is my own solution, but I still want one more newline after each ".", which this solution does not produce:
read_lines = [line.rstrip('\n') for line in open('rawCorpus.txt')]
words = []
my_save_data = open("my_saved_data.txt", "w")
for lines in read_lines:
    words.append(lines)
for word in words:
    w = word.rstrip().replace('.', '\n.')
    w = w.split()
    my_save_data.write(str("\n".join(w)))
    print("\n".join(w))
my_save_data.close()
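For comparison, the same effect can be had with a single re.sub call that consumes the whitespace after each period, so no leading spaces or doubled line breaks are left behind. This is a sketch for Python 3 that assumes every "." ends a sentence (a decimal such as 3.14 would be split too). The function name one_sentence_per_line is made up for illustration.

```python
import re

def one_sentence_per_line(text):
    """Put a newline after every period, dropping the whitespace that followed it."""
    # r"\.\s*" matches a period plus any whitespace after it, old newlines included.
    return re.sub(r"\.\s*", ".\n", text)

sample = "This is a book. i love it.\nThis is a pen. i love it.\n"
result = one_sentence_per_line(sample)
print(result, end="")
```

To get an extra blank line after each sentence, the replacement string could be changed to ".\n\n". For a real file, read it with read(), pass the text through the function, and write the result to the output file.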
Assume:
self.base_version = 1000
self.target_version = 2000
I have a file as follows:
some text...
<tsr_args> \"upgrade_test test_mode=upgrade base_sw=1000 target_sw=2000 system_profile=eth\"</tsr_args>
some text...
<tsr_args> \"upgrade_test test_mode=rollback base_sw=2000 target_sw=1000 system_profile=eth manufacture_type=no-manufacture\"</tsr_args>
some text...
<tsr_args> \"upgrade_test test_mode=downgrade base_sw=2000 target_sw=1000 system_profile=eth no_boot_next_enable_flag=True\"</tsr_args>
I need the base and target version values to be placed as specified above (Note that on the 2nd and 3rd entry, the base and target are opposite).
I tried to do it as follows, but it does not work:
base_regex = re.compile('.*test_mode.*base_sw=(.*)')
target_regex = re.compile('.*test_mode.*target_sw=(.*)')
o = open(file, 'a')
for line in open(file):
    if 'test_mode' in line:
        if 'upgrade' in line:
            new_line = (re.sub(base_regex, self.base_version, line))
            new_line = (re.sub(target_regex, self.target_version, line))
            o.write(new_line)
        elif 'rollback' in line or 'downgrade' in line:
            new_line = (re.sub(base_regex, self.target_version, line))
            new_line = (re.sub(target_regex, self.base_version, line))
            o.write(new_line)
o.close()
Assume the above code runs properly without any syntax errors.
The file is not modified at all.
The complete line is modified instead of just the captured group. How can I make re.sub to substitute only the captured group?
You are opening the file in 'a' (append) mode, so your changes end up at the end of the file. Instead, write to a new file and replace the old one with it at the end of your script.
The only way I know to replace several matching groups is to first find the value with the regexp and then replace it as a plain string, without the regexp.
Thanks Jimilan for your remarks. I fixed my code, and now it's working:
base_regex = re.compile(r'.*test_mode.*base_sw=(\S*)')
target_regex = re.compile(r'.*test_mode.*target_sw=(\S*)')
for file in self.upgrade_cases_files_list:
    file_handle = open(file, 'r')
    file_string = file_handle.read()
    file_handle.close()
    base_version_result = base_regex.search(file_string)
    target_version_result = target_regex.search(file_string)
    if base_version_result is not None:
        current_base_version = base_version_result.group(1)
    else:
        raise Exception("Could not detect base version in the following file: -> %s \n" % (file))
    if target_version_result is not None:
        current_target_version = target_version_result.group(1)
    else:
        raise Exception("Could not detect target version in the following file: -> %s \n" % (file))
    file_string = file_string.replace(current_base_version, self.base_version)
    file_string = file_string.replace(current_target_version, self.target_version)
    file_handle = open(file, 'w')
    file_handle.write(file_string)
    file_handle.close()
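On the original question of making re.sub touch only part of a match: one common pattern is to capture the text that should stay (the key) and re-emit it in the replacement with \g<1>, so only the digits after it change. The sketch below assumes the versions are plain digits and that rollback and downgrade entries take the two versions swapped, as described in the question. The function name set_versions and the sample lines are made up for illustration.

```python
import re

def set_versions(line, base_version, target_version):
    """Rewrite only the numbers after base_sw= and target_sw= in one line."""
    # Rollback and downgrade entries carry the two versions swapped.
    if "test_mode=rollback" in line or "test_mode=downgrade" in line:
        base_version, target_version = target_version, base_version
    # Group 1 keeps the key; only the digits that follow it are replaced.
    line = re.sub(r"(base_sw=)\d+", r"\g<1>%s" % base_version, line)
    line = re.sub(r"(target_sw=)\d+", r"\g<1>%s" % target_version, line)
    return line

# Hypothetical lines with placeholder versions 1 and 2
upgrade_line = '<tsr_args> "upgrade_test test_mode=upgrade base_sw=1 target_sw=2 system_profile=eth"</tsr_args>'
rollback_line = '<tsr_args> "upgrade_test test_mode=rollback base_sw=1 target_sw=2 system_profile=eth"</tsr_args>'
print(set_versions(upgrade_line, 1000, 2000))
print(set_versions(rollback_line, 1000, 2000))
```

The \g<1> form is used rather than \1 because a digit following \1 in the replacement (for example \11000) would be read as a different group number. Lines without base_sw= or target_sw= pass through unchanged, so the function can be applied to every line of a file.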
I have two different functions in my program, one writes an output to a txt file (function A) and the other one reads it and should use it as an input (function B).
Function A works just fine (although I'm always open to suggestions on how I could improve it).
It looks like this:
def createFile():
    fileName = raw_input("Filename: ")
    fileNameExt = fileName + ".txt"  # to make sure a .txt extension is used
    line1 = "1.1.1"
    line2 = int(input("Enter line 2: "))
    line3 = int(input("Enter line 3: "))
    file = open(fileNameExt, "w+")
    file.write("%s\n%s\n%s" % (line1, line2, line3))
    file.close()
    return
This appears to work fine and will create a file like
1.1.1
123
456
Now, function B should use that file as an input. This is how far I've gotten:
def loadFile():
    loadFileName = raw_input("Filename: ")
    loadFile = open(loadFileName, "r")
    line1 = loadFile.read(5)
That's where I'm stuck. I know how to use the first 5 characters, but I need lines 2 and 3 as variables too.
f = open('file.txt')
lines = f.readlines()
f.close()
lines is what you want
Other option:
f = open("file.txt", "r")
lines = []
for line in f:
    lines.append(line)
f.close()
Further reading:
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
from string import ascii_uppercase
my_data = dict(zip(ascii_uppercase, open("some_file_to_read.txt")))
print my_data["A"]
This will store the lines in a dictionary with letters as keys. If you really want to cram them into variables (note that in general this is a TERRIBLE idea), you can do:
globals().update(my_data)
print A
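Because the file always holds exactly three lines, they can also be unpacked straight into three names. The following is a sketch for Python 3 that assumes lines 2 and 3 are integers, as written by createFile. The helper name parse_saved is hypothetical.

```python
def parse_saved(text):
    """Split the saved text into its three values."""
    # splitlines() drops the trailing newlines that readlines() would keep.
    line1, line2, line3 = text.splitlines()
    # Lines 2 and 3 were written from ints, so convert them back.
    return line1, int(line2), int(line3)

# Inside loadFile() the text would come from the file, for example:
#     with open(loadFileName) as f:
#         line1, line2, line3 = parse_saved(f.read())
print(parse_saved("1.1.1\n123\n456"))
```

The unpacking raises ValueError if the file does not contain exactly three lines, which makes a malformed file easy to notice.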
I need a Bash or Python script to find and replace the text between two tags.
E.g:
<start>text to find and replace with the one I give as input<end>
'text to find and replace with the one I give as input' is just an example; it could vary every time.
I want to do something like ./changetxt inputfile.xxx newtext
where changetxt has the script;
inputfile.xxx has the text that needs a change and newtext is what goes into inputfile.xxx
python:
import sys
if __name__ == "__main__":
    # adjust these to your needs
    starttag = "<foo>"
    endtag = "</foo>"
    inputfilename = sys.argv[1]
    outputfilename = inputfilename + ".out"
    replacestr = sys.argv[2]
    # open the input file named by the first argument
    inputfile = open(inputfilename, 'r')
    # open an output file to put the result in
    outputfile = open(outputfilename, 'w')
    # test every line in the file for the start tag and end tag
    for line in inputfile:
        if starttag in line and endtag in line:
            # compose a new line with the replaced string
            newline = line[:line.find(starttag) + len(starttag)] + replacestr + line[line.find(endtag):]
            # and write the new line to the output file
            outputfile.write(newline)
        else:
            outputfile.write(line)
    outputfile.close()
    inputfile.close()
Save this in a replacetext.py file and run as python replacetext.py \path\to\inputfile "I want this text between the tags"
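The line-by-line script above only handles tags that open and close on the same line. If the tagged text may span several lines, a regex over the whole file content is one alternative. This is a sketch, not a drop-in replacement: the function name replace_between is made up, and the default tags are taken from the question's example.

```python
import re

def replace_between(text, new_text, start_tag="<start>", end_tag="<end>"):
    """Replace whatever sits between start_tag and end_tag with new_text."""
    pattern = re.compile(
        re.escape(start_tag) + r".*?" + re.escape(end_tag),
        re.DOTALL,  # let the tagged text span several lines
    )
    # A function replacement keeps backslashes in new_text from being read as escapes.
    return pattern.sub(lambda m: start_tag + new_text + end_tag, text)

print(replace_between("a <start>old\ntext<end> b", "new text"))
```

The non-greedy .*? stops at the first end tag, so several tagged spans in one file are each replaced separately.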
You could also use BeautifulSoup for this. From their docs:
If you set a tag’s .string attribute, the tag’s contents are replaced with the string you give:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>