How to Remove Random Text Breaks in HTML Text - python

I'm looking to scrape some of the text out of a couple HTML documents, but I can't get rid of some of the line breaks. Currently I have Beautiful Soup parsing the web pages, then I read in all the lines and attempt to strip all the newline characters out of the text, but I can't get rid of the ones in the middle of strings. For example,
<font face="ARIAL" size="2">Thomas
H. Lenagh </font>
I'm looking to get the name of this person out on one line, but there is a newline character of some sort in the middle. Here's what I've tried so far:
line=line.replace("\n"," ")
line=line.replace("\\n"," ")
line=line.replace("\r\n", " ")
line=line.replace("\t", " ")
line=line.replace("\\r\\n"," ")
I've also tried the following regular expressions:
line=re.sub("\n"," ",line)
line=re.sub("\\n", " ",line)
line=re.sub("\s\s+", " ",line)
None have worked so far and I'm not sure what character I'm missing. Any ideas?
EDIT: Here's the full code that I'm using (minus error checking):
soup=BeautifulSoup(threePage) #make the soup
paragraph=soup.stripped_strings
if paragraph is not None:
for i in range (len(data)): #for all rows...
lineCounter=lineCounter+1
row =data[i]
row=row.replace("\n"," ") #remove newline (<enter>) characters
row = re.sub("---+"," ",row) #remove dashed lines
row =re.sub(","," ",row) #replace commas with spaces
row=re.sub("\s\s+", " ",row) #remove
if ("/s/" in row): #if /s/ is in the row, remove it
row=re.sub(".*/s/"," ",row)
if ("/S/" in row): #upper case of the last removal
row=re.sub(".*/S/"," ",row)
row = row.replace("\n"," ")
row=row.strip()#remove any weird characters

You haven't shared what the rest of your code looks like after the for loop, but I'm guessing a very simplified version is something like:
data = ["a\nb", "c\nd", "e\nf"]
for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")
#let's see if that fixed it...
print(data)
#output: ['a\nb', 'c\nd', 'e\nf']
#hey, the newlines are still there! What gives?
This occurs because calling replace on a string doesn't mutate it in-place, and assigning new values to row doesn't change what values are being stored in data. If you want data to be changed too, you've got to assign the values back.
data = ["a\nb", "c\nd", "e\nf"]
for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")
    data[i] = row
#let's see if that fixed it...
print(data)
#output: ['ab', 'cd', 'ef']
#looking good!
Bonus style tip: if your replacement logic is simple enough to express in one expression, you can get it all on one line and avoid messing around with range and indices etc:
data = [row.replace("\n", "") for row in data]
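For the name in the question, the break inside the text may also be a tab or a "\r\n" pair rather than a plain "\n". A minimal sketch, assuming you simply want every run of whitespace collapsed to one space (the helper name collapse_whitespace is made up for this example):

```python
def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, "\n", "\r\n") and drops the empty pieces,
    # so joining with a single space leaves one clean line.
    return " ".join(text.split())


raw = "Thomas\nH. Lenagh "
name = collapse_whitespace(raw)
print(name)  # Thomas H. Lenagh
```

As with replace, the result has to be assigned back (or collected in a new list); the original string is left untouched.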

Related

python 3 parsing a semicolon separated very long string to remove each second element

I'm pretty new to Python and am looking for a way to get the following result from a long string.
I'm reading in lines of a text file where each line looks like this:
; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;
after dataprocessing the data shall be stored in another textfile with this data
short example
2:55:12;66,81;66,75;35,38;
the real string is much longer but always with the same pattern
; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38; Puff2OG;30,25; Puff1OG;29,25; PuffFB;23,50; ....
So this means remove leading semicolon
keep second element
remove third element
keep fourth element
remove fifth element
keep sixth element
and so on
the number of elements can vary so I guess as a first step I have to parse the string to get the number of elements and then do some looping through the string and assign each part that shall be kept to a variable
I have tried some variations of the command .split() but with no success.
Would it be easier to store all elements in a list and then for-loop through the list keeping and dropping elements?
If Yes how would this look like so at the end I have stored a file with
lines like this
2:55:12 ; 66,81 ; 66,75 ; 35,38 ;
2:56:12 ; 67,15 ; 74;16 ; 39,15 ;
etc. ....
best regards Stefan
This solution works independently of the content between the semicolons
One line, though it's a bit messier:
result = ' ; '.join(string.split(';')[1::2])
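Getting rid of the leading semicolon:
Just slice it off! The first two characters are the semicolon and the space after it:
string = string[2:]
Splitting by semicolon & every second element:
Given the sliced string, we can split by semicolon and keep every second element:
arr = string.split(';')[::2]
The [::2] means to take every second element, starting with index 0. Once the leading "; " has been sliced off, this keeps the first, third, fifth, ... fields (the values) and drops the names between them. On the unsliced string the leading semicolon produces an empty first field, so there you would use [1::2] instead, as in the one-liner above.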
Resulting string
To produce the string result you want, simply .join:
result = ' ; '.join(arr)
A regex based solution, which operates on the original input:
import re
inp = "; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;"
output = re.sub(r'\s*[A-Z][^;]*?;', '', inp)[2:]
print(output)
This prints:
2:55:12;66,81;66,75;35,38;
This shows how to do it for one line of input if the same pattern repeats itself every time
input_str = "; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;"
output_list = input_str.split(';')[1::2] # create list with numbers of interest
# open text to write to; the with block closes the file when it ends
with open('output.txt', 'w') as f:
    # write to file
    for out in output_list:
        f.write(f"{out.strip()} ; ")
    # end line
    f.write("\n")
thank you very much for the quick response. You are awesome.
Your solutions are very compact.
In the meantime I found another solution but this solution needs more lines of code
best regards Stefan
I'm not familiar with how to insert code as a code-section properly
So I add it as plain text
fobj = open(r"C:\Users\Stefan\AppData\Local\Programs\Python\Python38-32\Heizung_2min.log")
wobj = open(r"C:\Users\Stefan\AppData\Local\Programs\Python\Python38-32\Heizung_number_2min.log","w")
for line in fobj:
    # the for loop already yields each line; calling fobj.readline() here
    # as well would skip every other line of the file
    TextLine = line
    print(TextLine)
    myList = TextLine.split(';')
    TextLine = ""
    # keep only the elements at odd indices (the values)
    for index, item in enumerate(myList):
        if index % 2 == 1:
            TextLine += item
            TextLine += ";"
    TextLine += '\n'
    print(TextLine)
    wobj.write(TextLine)
fobj.close()
wobj.close()
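For comparison, the split-and-slice idea from the answers above can be wrapped in a small function and applied to every line. This is only a sketch: the names keep_values and convert are made up here, and it assumes every line starts with a semicolon and alternates value, name, value, name as in the sample.

```python
def keep_values(line):
    fields = line.strip().split(";")
    # The leading ';' makes fields[0] empty, so the values sit at the
    # odd indices 1, 3, 5, ... and the names at the even ones.
    values = [field.strip() for field in fields[1::2]]
    return " ; ".join(values) + " ;"


def convert(lines):
    # Blank lines are skipped; every other line is converted.
    return [keep_values(line) for line in lines if line.strip()]


sample = ["; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;\n"]
for out in convert(sample):
    print(out)  # 2:55:12 ; 66,81 ; 66,75 ; 35,38 ;
```

An open file object can be passed to convert directly, since iterating over it yields one line at a time.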

String Cutting with multiple lines

I'm new to Python, apart from some experience with Tkinter (some GUI experiments).
I read an .mbox file and copy the plain/text in a string. This text contains a registering form. So a Stefan, living in Maple Street, London working for the Company "MultiVendor XXVideos" has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put in a .csv row with column
"Name", "Adress", "Company",...
Now i tried to cut and slice everything. For debugging i use "print"(IDE = KATE/KDE + terminal... :-D ).
Problem is, that the data contains multiple lines after keywords but i only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string
fieldnames = ["ID","Subject","Name", "Adress", "Company"]
searchKeys = [ 'Name_OF_Person','Adress_HOME','Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"
if __name__ == "__main__":
with open(export_file_name,"w") as csvfile:
writer = csv.DictWriter(csvfile, dialect='excel',fieldnames=fieldnames)
writer.writeheader()
for message in mailbox.mbox(mbox_file):
if message.is_multipart():
content = '\n'.join(part.get_payload() for part in message.get_payload())
content = content.split('<')[0] # only want text/plain.. Ill split #right before HTML starts
#print content
else:
content = message.get_payload()
idea = message['message-id']
sub = message['subject']
fr = message['from']
date = message['date']
writer.writerow ('ID':idea,......) # CSV writing will work fine
for line in content.splitlines():
line = line.strip()
for pose in searchKeys:
if pose in line:
tmp = line.split(pose)
pmt = tmp[1].split(":")[1]
if next in line !=:
print pose +"\t"+pmt
sleep(1)
csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
@Yaniv
Thank you, I am still trying to understand every step, but just wanted to give a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
There are about 20 keywords in the emails. Additionally, my values are sometimes broken across lines by "=".
I was thinking something like:
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there
should i replace ["="," = "] with "" ?
I would go for a "routine" parsing loop over the input lines, and maintain a current_key and current_value variables, as a value for a certain key in your data might be "annoying", and spread across multiple lines.
I've demonstrated such parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be the case of such "annoying" value (spread across multiple lines). Such lines would be concatenated into a single value, using some configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespaces from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            # Continuation line: it belongs to the value of the current key
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
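About the stray "=" and "=20" mentioned in the question's EDIT2: "=20", "=3D" and a lone "=" at the end of a line look like quoted-printable encoding (an encoded space, an encoded "=", and a soft line break), so it is probably safer to decode the text than to replace "=" by hand. A minimal sketch using the standard-library quopri module; the sample bytes are made up from the snippets in the question, and UTF-8 is an assumption about the charset:

```python
import quopri

# Hypothetical sample assembled from the fragments shown in the question.
raw = (
    b'Name_OF_=\n'
    b'Person: Stefan\n'
    b'<font face=3D"Verdana" size=3D"1">Stefan </font>\n'
)

# decodestring works on bytes: "=\n" (soft line break) is removed and
# "=XX" escapes are turned back into the characters they stand for.
decoded = quopri.decodestring(raw).decode("utf-8")  # charset is assumed
print(decoded)
```

When reading from the mailbox, calling get_payload(decode=True) on a non-multipart part should give you bytes that are already decoded according to the part's Content-Transfer-Encoding, which avoids the manual step.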
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process string data that spans multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done:
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1] #take second element of the list
person = content.split("Adress_HOME:")[0] # take content before "Adress_HOME"
content = content.split("Adress_HOME:")[1] #take second element of the list
address = content.split("Company_NAME:")[0] # take content before "Company_NAME"
company = content.split("Company_NAME:")[1] #take second element of the list (the remainder), which is the company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make a pattern such as .* "cut" across multiple lines, you would use the re.DOTALL option, which lets . match newlines as well (re.MULTILINE only changes how ^ and $ behave). So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL)
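A minimal sketch of the regex variant, assuming the headers always appear in the same order. It uses re.DOTALL so that . also matches newlines, and a non-greedy group so the match stops at the next header:

```python
import re

content = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""

# Capture everything between the two headers, newlines included.
match = re.search(r"Adress_HOME:(.*?)Company_NAME:", content, re.DOTALL)

# Collapse the line breaks and extra spaces inside the captured value.
address = " ".join(match.group(1).split())
print(address)  # London, Maple Street 45
```

If a header can be missing, re.search returns None, so real code should check match before calling group.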

python - matching string and replacing

I have a file i am trying to replace parts of a line with another word.
it looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
i need to delete everything but bob123#bobscarshop.com, but i need to match 23rh32o3hro2rh2 with 23rh32o3hro2rh2:poniacvibe , from a different text file and place poniacvibe infront of bob123#bobscarshop.com
so it would look like this bob123#bobscarshop.com:poniacvibe
I've had a hard time trying to go about doing this, but i think i would have to split the bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":") , but some of the lines have a (:) in a spot that i don't want the line to be split at, if that makes any sense...
if anyone could help i would really appreciate it.
ok, it looks to me like you are using a colon : to separate your strings.
in this case you can use .split(":") to break your strings into their component substrings
eg:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
and assuming your substrings will always be in the same order, and the same number of substrings in the main string you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")
secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")
if firstdata[3] == seconddata[0]:
    outputdata = firstdata
    outputdata.insert(1,seconddata[1])
    outputstring = ""
    for item in outputdata:
        if outputstring == "":
            outputstring = item
        else:
            outputstring = outputstring + ":" + item
what this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string using the colon as the separator
the reason your strings need to be the same length is because the index is being used to find the relevant strings rather than trying to use some form of string type matching (which gets much more complex)
if you can keep your data in this form it gets much simpler.
to protect against malformed data (lists too short) you can explicitly test for them before you start using len(list) to see how many elements are in it.
or you could let it run and catch the exception, however in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
hope this helps
James
EDIT:
ok so if you are trying to match up a long list of strings from files you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode = "r")
secondfile= open("secondfile.txt",mode = "r")
first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()
first_data = []
for item in first_raw_data:
    first_data.append(item.replace("\n","").split(":"))
second_data = []
for item in second_raw_data:
    second_data.append(item.replace("\n","").split(":"))
output_strings = []
for item in first_data:
    searchstring = item[3]
    for entry in second_data:
        if searchstring == entry[0]:
            output_data = item
            output_string = ""
            output_data.insert(1,entry[1])
            for data in output_data:
                if output_string == "":
                    output_string = data
                else:
                    output_string = output_string + ":" + data
            output_strings.append(output_string)
            break
for entry in output_strings:
    print(entry)
this should achieve what you're after and, as a proof of concept, will print the resulting list of strings for you.
if you have any questions feel free to ask.
James
Second edit:
to make this output the results into a file change the last two lines to:
outputfile = open("outputfile.txt", mode = "w")
for entry in output_strings:
    outputfile.write(entry+"\n")
outputfile.close()
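The question asked for output of the form bob123#bobscarshop.com:poniacvibe, that is, only the second field plus the matched word. A hedged sketch of that variant, using a dictionary for the lookup instead of a nested loop. The function names load_pairs and match_lines are made up, and it assumes each line of the first file has exactly five colon-separated fields; lines with extra colons are simply skipped rather than guessed at.

```python
def load_pairs(lines):
    """Map each key (the text before the first ':') to the word after it."""
    pairs = {}
    for line in lines:
        key, sep, word = line.strip().partition(":")
        if sep:  # ignore lines that contain no colon at all
            pairs[key] = word
    return pairs


def match_lines(first_lines, pairs):
    results = []
    for line in first_lines:
        fields = line.strip().split(":")
        if len(fields) != 5:
            continue  # malformed line (too few or too many colons)
        email, key = fields[1], fields[3]
        if key in pairs:
            results.append(email + ":" + pairs[key])
    return results


first = ["bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212\n"]
second = ["23rh32o3hro2rh2:poniacvibe\n"]
print(match_lines(first, load_pairs(second)))
# ['bob123#bobscarshop.com:poniacvibe']
```

With a dictionary each lookup is a single step, which matters once both files grow to thousands of lines.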

How do I parse a sequentially organized multiline string into a data structure using regex/python?

I need to parse a multi-line string into a data structure containing (1) the identifier and (2) the text after the identifier (but before the next > symbol). the identifier always comes on its own line, but the text can take up multiple lines.
>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
after execution I might have the data structured something like this:
id = ['identifier1', 'identifier2', 'identifier3']
and
txt =
['lalalalalalalalalalalalalalalalala',
'bababababababababababababababababa',
'wawawawawawawawawawawawawawawawawa']
It seems I would want to use regex to find (1) things after > but before carriage return, and (2) things between >'s, having temporarily deleted the identifier string and EOL, replacing with "".
The thing is I will have hundreds of these identifiers so I need to run the regex sequentially. Any ideas on how to attack this problem? I am working in python but feel free to use whatever language you want in your response.
*Update 1: code from slater getting closer but things are still not partitioned sequentially into id, text, id, text, etc *
teststring = '''>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
but the output was:
['', 'identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
3
3
note: it needs to work for a multiline string, dealing with all the \n's. a better test case might be:
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
current output:
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
4
4
Personally, I feel that you should use regex as little as possible. It's slow, difficult to maintain, and generally unreadable.
That said, solving this in python is extremely straightforward. I'm a little unclear on what exactly you mean by running this "sequentially", but let me know if this solution doesn't fit your needs.
# First, split the text into relevant chunks
split_text = text.split('>')
id = [text.partition('\n')[0] for text in split_text]
txt = [text.partition('\n')[2] for text in split_text]
Obviously, you could make the code more efficient, but if you're only dealing with hundreds of identifiers it really shouldn't be needed.
If you want to remove any blank entries that might occur, you can do the following:
list_with_blanks = ['', 'hello', '', '', 'world']
list(filter(None, list_with_blanks))  # list() is needed on Python 3, where filter returns an iterator
# result: ['hello', 'world']
Let me know if you have any more questions.
Unless I misunderstood the question, it's as easy as
for line in your_file:
    if line.startswith('>'):
        id.append(line[1:].strip())
    else:
        text.append(line.strip())
Edit: to concatenate multiple lines:
ids, text = [], []
for line in teststring.splitlines():
    if line.startswith('>'):
        ids.append(line[1:])
        text.append('')
    elif text:
        text[-1] += line
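The same loop can be wrapped in a function and tried on a multi-line test case. This is just a sketch of the approach above; the function name parse_records and the shortened sample text are made up for the example.

```python
def parse_records(text):
    ids, texts = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new identifier starts a new, initially empty, text entry.
            ids.append(line[1:].strip())
            texts.append("")
        elif texts:
            # Any other line is appended to the most recent entry;
            # lines before the first identifier are ignored.
            texts[-1] += line.strip()
    return ids, texts


teststring = """
>identifier1
lalala
lalala
>identifier2
bababa
"""

ids, texts = parse_records(teststring)
print(ids)    # ['identifier1', 'identifier2']
print(texts)  # ['lalalalalala', 'bababa']
```

Because the two lists stay aligned, dict(zip(ids, texts)) gives a lookup by identifier if that is more convenient.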
I found a solution. It's certainly not very pythonic but it works.
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala\n
lalalalalalalalalalalalalalalalala\n
>identifier2
bababababababababababababababababa\n
bababababababababababababababababa\n
>identifier3
wawawawawawawawawawawawawawawawawa\n
wawawawawawawawawawawawawawawawawa\n'''
i = 0
j = 0
#split the multiline string by line
dsplit = teststring.split('\n')
#the indicies of identifiers
index = list()
for line in dsplit:
if line.startswith('>'):
print line
index.append(i)
j = j + 1
i = i+1
index.append(i) #add this so you get the last block of text
#the text corresponding to each index
thetext = list()
#the names corresponding to each gene
thenames = list()
for n in range(0, len(index)-1):
thetext.append("")
for k in range(index[n]+1, index[n+1]):
thetext[n] = thetext[n] + dsplit[k]
thenames.append(dsplit[index[n]][1:]) # the [1:] removes the first character (>) from the line
print "the indicies", index
print "the text: ", thetext
print "the names", thenames
print "this many text entries: ", len(thetext)
print "this many index entries: ", j
this gives the following output:
>identifier1
>identifier2
>identifier3
the indicies [1, 6, 11, 16]
the text: ['lalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalala', 'babababababababababababababababababababababababababababababababababa', 'wawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawa']
the names ['identifier1', 'identifier2', 'identifier3']
this many text entries: 3
this many index entries: 3

while iterating if statement wont evaluate

this little snippet of code is my attempt to pull multiple unique values out of rows in a CSV. the CSV looks something like this in the header:
descr1, fee part1, fee part2, descr2, fee part1, fee part2,
with the descr columns having many unique names in a single column. I want to take these unique fee names and make a new header out of them. to do this I decided to start by getting all the different descr columns names, so that when I start pulling data from the actual rows I can check to see if that row has a fee amount or one of the fee names I need. There are probably a lot of things wrong with this code, but I am a beginner. I really just want to know why my first if statement is never triggered when the l in fin does equal a comma, I know it must at some point as it writes a comma to my row string. thanks!
row = ''
header = ''
columnames = ''
cc = ''
#fout = open(","w")
fin = open ("raw data.csv","rb")
for l in fin:
if ',' == l:
if 'start of cust data' not in row:
if 'descr' in row:
columnames = columnames + ' ' + row
row = ''
else:
pass
else:
pass
else:
row = row+l
print(columnames)
print(columnames)
When you iterate over a file, you get lines, not characters -- and they have the newline character, \n, at the end. Your if ',' == l: statement will never succeed because even if you had a line with only a single comma in it, the value of l would be ",\n".
I suggest using the csv module: you'll get much better results than trying to do this by hand like you're doing.
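A minimal sketch of what that could look like, assuming the header really uses column names starting with "descr" as in the question. The sample data is invented, and io.StringIO stands in for the real file so the example runs on its own:

```python
import csv
import io

# Invented sample; with a real file you would pass the result of
# open("raw data.csv", newline="") to csv.reader instead.
sample = (
    "descr1,fee part1,fee part2,descr2,fee part1,fee part2\n"
    "setup,10,2,shipping,5,1\n"
    "setup,12,3,handling,4,0\n"
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # the first row holds the column names

# Remember which column positions are "descr" columns.
descr_columns = [i for i, name in enumerate(header) if name.strip().startswith("descr")]

# Collect the distinct fee names in the order they first appear.
unique_names = []
for row in reader:
    for i in descr_columns:
        if i < len(row) and row[i] not in unique_names:
            unique_names.append(row[i])

print(unique_names)  # ['setup', 'shipping', 'handling']
```

csv.reader hands you each row already split into fields, so there is no need to compare individual characters against a comma.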
