Thanks in advance for your help.
I have a large text document made up of many books. All the books have "running headers", and I have noticed that they appear just before the line containing the page number. The page number has 1 to 4 digits and is on a line of its own.
I want to iterate through the file and have Python delete the previous line whenever it reaches a line that starts with a page number.
Thanks
Bennett
My sample code is:
import re
with open("corpus.txt") as f:
    for line in f:
        line = line.rstrip()
        if re.search(r'^[0-9]{1,4}\b', line):
            pass  # delete previous line (this is the part I don't know how to do)
Code:
file = open(r"C:\Users\Asus\Desktop\sample.txt").read().splitlines()
output = open(r"C:\Users\Asus\Desktop\output.txt", 'w')
for index, line in enumerate(file):
    try:
        # keep a line only if neither it nor the next line is a page number
        if not file[index+1].strip().isdigit() and not file[index].strip().isdigit():
            output.write(file[index])
            output.write('\n')
    except IndexError:
        # there is no next line, so write the last line
        output.write(file[index])
        output.write('\n')
output.close()
Input:
wearing sandals, a cock that crows, a cloak
to dissect, a sponge, some vinegar and one
man to hammer the nails home.
56
Or you can take a length of steel,
castle to hold your banquet in.
77
wearing sandals, a cock that crows, a cloak
to dissect, a sponge, some vinegar and one
Output:
wearing sandals, a cock that crows, a cloak
to dissect, a sponge, some vinegar and one
Or you can take a length of steel,
wearing sandals, a cock that crows, a cloak
to dissect, a sponge, some vinegar and one
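For a very large corpus, a streaming variant avoids loading the whole file into memory: hold back one line and drop it whenever the next line turns out to be a page number. This is only a minimal sketch, assuming a page-number line contains nothing but 1 to 4 digits; the function name drop_running_headers is made up for illustration.

```python
import re

# Assumption: a page-number line contains nothing but 1 to 4 digits.
PAGE_NUMBER = re.compile(r"\d{1,4}")


def drop_running_headers(lines):
    """Yield lines, skipping each page number and the line just before it."""
    held = None  # most recent line, not yet known to be safe to keep
    for line in lines:
        if PAGE_NUMBER.fullmatch(line.strip()):
            held = None  # the held line was a running header: discard both
            continue
        if held is not None:
            yield held
        held = line
    if held is not None:
        yield held


sample = [
    "to dissect, a sponge, some vinegar and one",
    "man to hammer the nails home.",
    "56",
    "Or you can take a length of steel,",
]
cleaned = list(drop_running_headers(sample))
```

The function accepts any iterable of lines, so an open file object can be passed directly and the result written out with writelines().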
I have a text file in which every line currently ends with the same character (N), which is used to track the progress the system makes. I want to change the end character to "Y" as each line is processed, so that if the program ends via an error or other interruption, on restart it will search until it finds a line with the end character "N" and begin working from there. Below is my code as well as a sample from the text file.
UPDATED CODE:
from geopy.geocoders import Nominatim

def GeoCode():
    f = open("geocodeLongLat.txt", "a")
    with open("CstoGC.txt",'r') as file:
        print("Geocoding...")
        new_lines = []
        for line in file.readlines():
            check = line.split('~')
            print(check)
            if 'N' in check[-1]:
                geolocator = Nominatim()
                dot_number, entry_name, PHY_STREET,PHY_CITY,PHY_STATE,PHY_ZIP = check[0],check[1],check[2],check[3],check[4],check[5]
                address = PHY_STREET + " " + PHY_CITY + " " + PHY_STATE + " " + PHY_ZIP
                f.write(dot_number + '\n')
                try:
                    location = geolocator.geocode(address)
                    f.write(dot_number + "," + entry_name + "," + str(location.longitude) + "," + str(location.latitude) + "\n")
                except AttributeError:
                    try:
                        address = PHY_CITY + " " + PHY_STATE + " " + PHY_ZIP
                        location = geolocator.geocode(address)
                        f.write(dot_number + "," + entry_name + "," + str(location.longitude) + "," + str(location.latitude) + "\n")
                    except AttributeError:
                        print("Cannot Geocode")
                check[-1] = check[-1].replace('N','Y')
            new_lines.append('~'.join(check))
    with open('CstoGC.txt','r+') as file: # IMPORTANT to open as 'r+' mode as 'w/w+' will truncate your file!
        for line in new_lines:
            file.writelines(line)
    f.close()
Output:
2967377~DARIN COLE~22112 TWP RD 209~ALVADA~OH~44802~Y
WAY 64 SUITE 100~EADS~TN~38028~N
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~N
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~N
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~N
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~N
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~N
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~N
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
143608~LARRY A PETERSON & DONNA M PETERSON~W6359 450TH AVE~ELLSWORTH~WI~54011~N
635528~JAMES E WEBB~3926 GREEN ROAD~SPRINGFIELD~TN~37172~N
805496~WAYNE MLADY~22272 135TH ST~CRESCO~IA~52136~N
704996~SAVINA C MUNIZ~814 W LA QUINTA DR~PHARR~TX~78577~N
893169~BINDEWALD MAINTENANCE INC~213 CAMDEN DR~SLIDELL~LA~70459~N
948130~LOGISTICIZE LTD~861 E PERRY ST~PAULDING~OH~45879~N
438760~SMOOTH OPERATORS INC~W8861 CREEK ROAD~DARIEN~WI~53114~N
518872~A B C RELOCATION SERVICES INC~12 BOCKES ROAD~HUDSON~NH~03051~N
576143~E B D ENTERPRISES INC~29 ROY ROCHE DRIVE~WINNIPEG~MB~R3C 2E6~N
968264~BRIAN REDDEMANN~706 WESTGOR STREET~STORDEN~MN~56174-0220~N
721468~QUALITY LOGISTICS INC~645 LEONARD RD~DUNCAN~SC~29334~N
As you can see I am already keeping track of which line I am at just by using x. Should I use something like file.readlines()?
Sample of text document:
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~N
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~N
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~N
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~N
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~N
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~N
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~N
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
Thank you!
Edit: updated code thanks to @idlehands
There are a few ways to do this.
Option #1
My original thought was to use the tell() and seek() methods to go back a few steps, but it quickly became clear that you cannot do this conveniently when the file is not opened in binary mode, and definitely not inside a for loop over readlines(). You can see the reference threads here:
Is it possible to modify lines in a file in-place?
How to solve "OSError: telling position disabled by next() call"
The investigation led to this piece of code:
with open('file.txt','rb+') as file:
    line = file.readline()  # initiate the loop
    while line:  # readline() returns b'' at end of file, which ends the loop
        print(line)
        check = line.split(b'~')[-1]  # last field, including the line ending
        if check.startswith(b'N'):
            # ... do stuff ... #
            file.seek(-len(check), 1)  # move back to the start of the last field
            file.write(check.replace(b'N', b'Y'))  # replace "N" with "Y"
        line = file.readline()  # read next line
In the first referenced thread one of the answers mentioned this could lead you to potential problems, and directly modifying the bytes on the buffer while reading it is probably considered a bad idea™. A lot of pros probably will scold me for even suggesting it.
Option #2a
(if file size is not horrendously huge)
with open('file.txt','r') as file:
    new_lines = []
    for line in file.readlines():
        check = line.split('~')
        if 'N' in check[-1]:
            # ... do stuff ... #
            check[-1] = check[-1].replace('N','Y')
        new_lines.append('~'.join(check))
with open('file.txt','r+') as file: # IMPORTANT to open as 'r+' mode as 'w/w+' will truncate your file!
    for line in new_lines:
        file.writelines(line)
This approach loads all the lines into memory first, so you do the modification in memory but leave the buffer alone. Then you reload the file and write the lines that were changed. The caveat is that technically you are rewriting the entire file line by line - not just the string N even though it was the only thing changed.
Option #2b
Technically you could open the file in r+ mode from the outset and then, after the iterations have completed, do this (still within the with block but outside of the loop):
        # ... new_lines.append('~'.join(check)) #
    file.seek(0)
    for line in new_lines:
        file.writelines(line)
I'm not sure what distinguishes this from Option #1 since you're still reading and modifying the file in the same go. If someone more proficient in IO/buffer/memory management wants to chime in please do.
The disadvantage of Option 2a/b is that you always end up storing and rewriting every line in the file, even if only a few lines still need to be updated from 'N' to 'Y'.
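Put together, Option #2b might look like the sketch below. It opens the file once in r+ mode, rewinds with seek(0) after reading, and calls truncate() in case the rewritten content ends up shorter than the original. The function name mark_processed and the process callback are hypothetical stand-ins for the "do stuff" step, not part of the original code.

```python
def mark_processed(path, process):
    """Flip the trailing N to Y for every pending line in the file at path."""
    with open(path, "r+") as file:
        new_lines = []
        for line in file.readlines():
            check = line.split("~")
            if "N" in check[-1]:
                process(check)  # hypothetical stand-in for "... do stuff ..."
                check[-1] = check[-1].replace("N", "Y")
            new_lines.append("~".join(check))
        file.seek(0)        # rewind before writing
        file.writelines(new_lines)
        file.truncate()     # drop leftover data if the content shrank
```

Because the write happens only after the loop has finished, an exception raised inside process leaves the file unchanged.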
Results (for all solutions):
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~Y
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~Y
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~Y
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~Y
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~Y
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~Y
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~Y
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~Y
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~Y
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~Y
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~Y
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~Y
And if you were to, say, encounter a break at the line starting with 220940, the file would become:
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~Y
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~Y
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~Y
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~Y
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~Y
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~Y
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~Y
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
There are pros and cons to these approaches. Try and see which one fits your use case the best.
I would read the entire input file into a list and .pop() the lines off one at a time. In case of an error, append the popped item back to the list and overwrite the input file with what remains. This way the file will always be up to date and you won't need any other logic.
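A rough sketch of that idea follows, assuming the file fits in memory. The names process_pending and handle are hypothetical, and the list is reversed up front so that pop() hands out lines in their original order.

```python
def process_pending(path, handle):
    """Process lines one by one; always save the unprocessed remainder."""
    with open(path) as f:
        pending = f.readlines()
    pending.reverse()  # so that pop() takes lines in file order
    try:
        while pending:
            line = pending.pop()
            try:
                handle(line)  # hypothetical per-line work, e.g. geocoding
            except BaseException:
                pending.append(line)  # put the failed line back
                raise
    finally:
        pending.reverse()
        with open(path, "w") as f:
            f.writelines(pending)  # only unprocessed lines remain
```

After a failure, the input file holds the failed line followed by everything not yet processed, so simply rerunning the program resumes from there. No N/Y marker is needed with this design, at the cost of the input file shrinking as work completes.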
I am trying to read a file with the format shown below. There are two '\n' characters between every line.
Great tool for healing your life--if you are ready to change your beliefs!<br /><a href="http
Bought this book for a friend. I read it years ago and it is one of those books you keep forever. Love it!
I read this book many years ago and have heard Louise Hay speak a couple of times. It is a valuable read...
I am using the Python code below to read the lines and convert them into a DataFrame:
import pandas as pd
open_reviews = open("C:\\Downloads\\review_short.txt","r",encoding="Latin-1" ).read()
documents = []
for r in open_reviews.split('\n\n'):
    documents.append(r)
df = pd.DataFrame(documents)
print(df.head())
The output I am getting is as below:
0 I was very inspired by Louise's Hay approach t...
1 \n You Can Heal Your Life by
2 \n I had an older version
3 \n I love Louise Hay and
4 \n I thought the book was exellent
Since I split on two newlines (\n\n), a leftover \n ends up at the beginning of each line. Is there another way to handle this, so that I get the output below?
0 I was very inspired by Louise's Hay approach t...
1 You Can Heal Your Life by
2 I had an older version
3 I love Louise Hay and
4 I thought the book was exellent
This appends every non-blank line.
filename = "..."
lines = []
with open(filename) as f:
    for line in f:
        line = line.strip()
        if line:
            lines.append(line)
>>> lines
['Great tool for healing your life--if you are ready to change your beliefs!<br /><a href="http',
'Bought this book for a friend. I read it years ago and it is one of those books you keep forever. Love it!',
'I read this book many years ago and have heard Louise Hay speak a couple of times. It is a valuable read...']
lines = pd.DataFrame(lines, columns=['my_text'])
>>> lines
my_text
0 Great tool for healing your life--if you are r...
1 Bought this book for a friend. I read it years...
2 I read this book many years ago and have heard...
Try using the .strip() method. It removes whitespace characters from the beginning and end of a string.
You can use it like this:
for r in open_reviews.split('\n\n'):
    documents.append(r.strip())
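To see why the stray "\n " appears and why strip() removes it, here is a small self-contained sketch. The sample text is hypothetical: it assumes each separator is three newlines plus a space, which is one layout that would produce the output shown in the question.

```python
# Hypothetical sample: each separator is "\n\n\n " (an assumption about
# the real file, chosen because it reproduces the "\n " prefix).
text = "first review\n\n\n second review\n\n\n third review"

raw = text.split("\n\n")
# split() consumes only two of the three newlines, so the third one and
# the space stay attached to the start of the next piece.

documents = [r.strip() for r in raw if r.strip()]
# strip() removes the leftover whitespace; empty pieces are skipped.
```

Here raw is ['first review', '\n second review', '\n third review'], while documents holds the three clean strings.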
Use readlines() and clean the line with strip().
filename = "C:\\Downloads\\review_short.txt"
open_reviews = open(filename, "r", encoding="Latin-1")
documents = []
for r in open_reviews.readlines():
    r = r.strip() # clean spaces and \n
    if r:
        documents.append(r)
I have two files which look exactly the same:
file1
1 in seattle today the secretary of education richard riley delivered his address
1 one of the things he focused on as the president had done
1 abc's michele norris has been investigating this
2 we're going to take a closer look tonight at the difficulty of getting meaningful
file2
1 in seattl today the secretari of educ richard riley deliv hi address
1 one of the thing he focus on a the presid had done
1 abc michel norri ha been investig thi
2 we'r go to take a closer look tonight at the difficulti of get meaning
When I run this code:
from collections import defaultdict
result=defaultdict(list)
with open("onthis.txt","r") as filer:
    for line in filer:
        label, sentence= line.strip().split(' ', 1)
        result[label].append(sentence)
It works perfectly for file1 but gives me a value error for file2:
label, sentence= line.strip().split(' ', 1)
ValueError: need more than 1 value to unpack
I can't see the reason, since both files are in the same format.
So, I just removed the empty lines by this terminal command:
sed '/^$/d' onthis.txt > trial
But the same error appears.
They can't be exactly the same. My guess is that there is an empty / white-space-only line somewhere in your second file, most likely right at the end.
The error is telling you that when it is performing the split, there are no spaces to split on so only one value is being returned, rather than a value for both label and sentence.
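The failure is easy to reproduce in isolation. In this minimal sketch the whitespace-only line is a made-up stand-in for whatever is hiding in the second file:

```python
line = "   \n"  # hypothetical whitespace-only line

parts = line.strip().split(" ", 1)
# strip() leaves an empty string, and splitting an empty string on " "
# returns a single-element list, so there is no value for `sentence`.

try:
    label, sentence = parts
except ValueError as exc:
    message = str(exc)  # the unpacking error from the question
```

A line that holds a label but no sentence, such as "3", fails the same way because split() again returns only one element.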
Based on your edit, I suspect you might still have "empty" lines in your text file. Or rather: lines filled with nothing but whitespace.
I've extended your example file:
1 in seattl today the secretari of educ richard riley deliv hi address
1 one of the thing he focus on a the presid had done
1 abc michel norri ha been investig thi
2 we'r go to take a closer look tonight at the difficulti of get meaning
3 foo
4 bar
5 qun
It's probably not clear, but the line between 3 foo and 4 bar is filled with a couple of spaces, while the lines between 4 bar and 5 qun are "just" newlines (\n).
Notice the output of sed '/^$/d'
1 in seattl today the secretari of educ richard riley deliv hi address
1 one of the thing he focus on a the presid had done
1 abc michel norri ha been investig thi
2 we'r go to take a closer look tonight at the difficulti of get meaning
3 foo
4 bar
5 qun
The empty lines are truly removed, no doubt. But the pseudo-empty whitespace line is still there. Running your Python script will throw an error when it reaches this line:
2 we'r go to take a closer look tonight at the difficulti of get meaning
3 foo
Traceback (most recent call last):
File "python.py", line 9, in <module>
label, sentence= line.strip().split(' ', 1)
ValueError: need more than 1 value to unpack
So my suggestion would be to extend your script by one line, making it skip empty lines in your input file.
for line in filer:
    if not line.strip(): continue
Doing so has the positive side effect that you don't have to prepare your input files with some sed magic beforehand.
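With that guard in place, the whole loop could be sketched as follows. An io.StringIO object stands in for the real file here, and its contents are invented for the example:

```python
import io
from collections import defaultdict

# Hypothetical stand-in for the input file, including a whitespace-only
# line and a truly empty line.
filer = io.StringIO(
    "1 abc michel norri\n"
    "   \n"
    "\n"
    "2 we'r go to take\n"
    "1 one of the thing\n"
)

result = defaultdict(list)
for line in filer:
    if not line.strip():
        continue  # skip empty and whitespace-only lines
    label, sentence = line.strip().split(" ", 1)
    result[label].append(sentence)
```

Both kinds of blank line are skipped, so the unpacking never sees fewer than two values for these inputs.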
Based on what you have provided (with a tweak), this seems to give the expected result.
result = {}
with open("test.txt", "r") as filer:
    for line in filer:
        label, sentence = line.strip().split(' ', 1)
        try:
            result[label].append(sentence)
        except KeyError:
            result[label] = [sentence]
Output:
{'2': ["we'r go to take a closer look tonight at the difficulti of get meaning"], '1': ['in seattl today the secretari of educ richard riley deliv hi address', 'one of the thing he focus on a the presid had done', 'abc michel norri ha been investig thi']}
So this must mean that there is something missing from what you have provided. If the above doesn't give you what you need, then more info is required.
this is my code:
>>> p = open(r'/Users/ericxx/Desktop/dest.txt','r+')
>>> xx = p.read()
>>> xx = xx[:0]+"How many roads must a man walk down\nBefore they call him a man" +xx[0:]
>>> p.writelines(xx)
>>> p.close()
the original file content looks like:
How many seas must a white dove sail
Before she sleeps in the sand
the result looks like :
How many seas must a white dove sail
Before she sleeps in the sand
How many roads must a man walk down
Before they call him a man
How many seas must a white dove sail
Before she sleeps in the sand
expected output :
How many roads must a man walk down
Before they call him a man
How many seas must a white dove sail
Before she sleeps in the sand
You have to "rewind" the file between reading and writing:
p.seek(0)
The whole code will look like this (with other minor changes):
p = open('/Users/ericxx/Desktop/dest.txt','r+')
xx = p.read()
xx = "How many roads must a man walk down\nBefore they call him a man" + xx
p.seek(0)
p.write(xx)
p.close()
Adding to @messa's answer: seeking back to the start to add data at the front can also leave you with old data at the end of your file if you ever shortened xx at any point.
This is because p.seek(0) moves the file position to the beginning of the file, and any .write() operation overwrites existing content as it goes. If the new content is shorter than what is already in the file, some old data is left at the end, not overwritten.
To avoid this you could open the file twice, using 'w' as the mode the second time around, or store the length of the file contents and pad your new content. Or truncate the file to your new desired length.
To truncate the file, simply add p.truncate() after you've written the data.
Also, use the with statement:
with open('/Users/ericxx/Desktop/dest.txt','r+') as p:
    xx = p.read()
    xx = "How many roads must a man walk down\nBefore they call him a man" + xx
    p.seek(0)
    p.write(xx)
    p.truncate()
I'm on my phone, so apologies if the explanation is somewhat short and blunt and lacking code. Can update more tomorrow.
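The leftover-data problem can be reproduced with a throwaway file. The sketch below writes shorter content over longer content twice, once without and once with truncate(), the file method that cuts the file off at the current position. The temporary file and the "short" replacement text are made up for the demonstration.

```python
import os
import tempfile

# Work on a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w") as f:
    f.write("How many seas must a white dove sail\n")

with open(path, "r+") as p:
    p.read()
    p.seek(0)
    p.write("short\n")  # shorter than the old content
with open(path) as f:
    without_truncate = f.read()  # the old tail is still in the file

with open(path, "r+") as p:
    p.read()
    p.seek(0)
    p.write("short\n")
    p.truncate()  # cut the file at the current position
with open(path) as f:
    with_truncate = f.read()  # only the new content remains

os.remove(path)
```

In the first case the file still ends with the tail of the original line; in the second it contains only the new text.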
I have been working with Beautiful Soup to extract data from website APIs for use in a fan site I am building.
I have extracted the data into text files; however, I am having trouble formatting it.
Charles Dance
Lord Tywin Lannister (S 02+)
Natalie Dormer
Queen Margaery Tyrell (S 02+)
Harry Lloyd
Viserys Targaryen (S 01)
Mark Addy
King Robert Baratheon (S 01)
Alfie Allen
Theon Greyjoy
Sean Bean
Lord Eddard Stark (S 01)
I have several text files like this for shows.
I would like to have both the actor and character on the same line separated with commas for input into a database later.
Charles Dance , Lord Tywin Lannister (S 02+)
Natalie Dormer , Queen Margaery Tyrell (S 02+)
Harry Lloyd , Viserys Targaryen (S 01)
Mark Addy , King Robert Baratheon (S 01)
Alfie Allen , Theon Greyjoy
Sean Bean , Lord Eddard Stark (S 01)
If anyone could provide any assistance or pointers it would be greatly appreciated.
Solved:
Many thanks to Tdelaney and wnnmaw . you da real MVPs
def readline(fp):
    # Read a line from a file, strip the newline and raise IndexError
    # on end of file
    line = fp.readline()
    if not line:
        raise IndexError()
    return line.strip()
with open('Casts/GOTcast.txt') as in_file, open('GOTcastFIXED.txt', 'w') as out_file:
    try:
        while True:
            out_file.write("%s, %s\n" % (readline(in_file), readline(in_file)))
    except IndexError:
        pass
This ought to work:
>>> lst = ["a", "A", "b", "B", "c", "C"]
>>> lines = [" , ".join(tup) for tup in zip(lst[::2], lst[1::2])]
>>> lines
['a , A', 'b , B', 'c , C']
You can read your text file like this:
with open("yourfile.txt", 'r') as f:
    lines = f.readlines()
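Combining the two steps, the slicing trick can be applied to lines read from a file once the newlines are stripped. This is a sketch under two assumptions: blank lines should be ignored, and the helper name pair_lines is invented here. The cast list stands in for the result of f.readlines().

```python
def pair_lines(lines):
    """Join every two consecutive lines as 'actor , character'."""
    # Assumption: blank lines carry no data and can be dropped.
    stripped = [line.strip() for line in lines if line.strip()]
    return [" , ".join(pair) for pair in zip(stripped[::2], stripped[1::2])]


# Stand-in for f.readlines() on one of the cast files.
cast = [
    "Charles Dance\n",
    "Lord Tywin Lannister (S 02+)\n",
    "Natalie Dormer\n",
    "Queen Margaery Tyrell (S 02+)\n",
]
rows = pair_lines(cast)
```

Note that zip() stops at the shorter slice, so an unpaired last line would be dropped silently if the file had an odd number of lines.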
So, you just want to read the file and combine every two lines into one. That's easy as long as you deal with a few details, like stripping the newline character and knowing when you've finished reading the file.
One way to solve this is to wrap the details in a function so that a simple loop can do the rest.
def readline(fp):
    """Read a line from a file, strip the newline and raise IndexError
    on end of file"""
    line = fp.readline()
    if not line:
        raise IndexError()
    return line.strip()
with open('infile.txt') as in_file, open('outfile.txt', 'w') as out_file:
    try:
        while True:
            out_file.write("%s, %s\n" % (readline(in_file), readline(in_file)))
    except IndexError:
        pass