How to clean HTML removing repeated paragraphs? - python

I'm trying to clean an HTML file that has repeated paragraphs within the body. Below I show the input file and expected output.
Input.html
https://jsfiddle.net/97ptc0Lh/4/
Output.html
https://jsfiddle.net/97ptc0Lh/1/
I've been trying with the following code using BeautifulSoup, but I don't know why it is not working: the resulting list CleanHtml still contains the repeated elements (paragraphs) that I'd like to remove.
from bs4 import BeautifulSoup

fp = open("Input.html", "rb")
soup = BeautifulSoup(fp, "html5lib")

Uniques = set()
CleanHtml = []
for element in soup.html:
    if element not in Uniques:
        Uniques.add(element)
        CleanHtml.append(element)
print(CleanHtml)
Could someone help me reach this goal, please?

I think this should do it:
elms = []
for elem in soup.find_all('font'):
    if elem not in elms:
        elms.append(elem)
    else:
        target = elem.findParent().findParent()
        target.decompose()
print(soup.html)
This should get you the desired output.
Edit:
To remove only those paragraphs whose size is not 4 or 5, change the else block to:
else:
    if elem.attrs['size'] != "4" and elem.attrs['size'] != "5":
        target = elem.findParent().findParent()
        target.decompose()
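Putting the pieces together, a self-contained sketch might look like the following. It assumes, as in the fiddles, that the duplicated text sits inside font tags and that the paragraph to drop is two parent levels up; the output filename is an assumption.

from bs4 import BeautifulSoup

# minimal end-to-end sketch; structure assumed from the fiddles:
# duplicated text lives in <font> tags two levels below the paragraph to drop
with open("Input.html", "rb") as fp:
    soup = BeautifulSoup(fp, "html5lib")

seen = []
for elem in soup.find_all('font'):
    if elem not in seen:
        seen.append(elem)  # first occurrence: keep it
    elif elem.attrs.get('size') not in ("4", "5"):
        # duplicate outside sizes 4/5: drop its enclosing paragraph
        elem.findParent().findParent().decompose()

with open("Output.html", "w") as out:  # assumed output file
    out.write(str(soup.html))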

Related

BeautifulSoup trying to remove HTML data from list

As mentioned above, I am trying to remove HTML from the printed output so that I just get the text plus my dividing | and -. I get span tags, among others, that I would like to remove. Since this part of the program runs in a loop, I cannot search for the individual text on the page, as it changes; the page architecture stays the same, which is why the list indices stay the same. I am wondering what the easiest way to clean the output would be. Here is the code section:
infoLink = driver.find_element_by_xpath("//a[contains(@href, '?tmpl=component&detail=true&parcel=')]").click()
driver.switch_to.window(driver.window_handles[1])
aInfo = driver.current_url
data = requests.get(aInfo)
src = data.text
soup = BeautifulSoup(src, "html.parser")
parsed = soup.find_all("td")
for item in parsed:
    Original = parsed[21]
    Owner = parsed[13]
    Address = parsed[17]
print(*Original, "|", *Owner, "-", *Address)
Example output is:
<span class="detail-text">123 Main St</span> | <span class="detail-text">Banner,Bruce</span> - <span class="detail-text">1313 Mockingbird Lane<br>Santa Monica, CA 90405</br></span>
Thank you!
To get the text between the tags, just use get_text(), but to avoid errors you should be aware that there has to be text between the tags:
for item in parsed:
    Original = parsed[21].get_text(strip=True)
    Owner = parsed[13].get_text(strip=True)
    Address = parsed[17].get_text(strip=True)
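With plain strings, the print call no longer needs the * unpacking; something like this should do (same variable names as above):

print(Original, "|", Owner, "-", Address)  # bare text joined by your separators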
I wrote an algorithm recently that does something like this. It won't work if your target text has a < or a > in it, though.
def remove_html_tags(string):
    data = string.replace(string[string.find("<"):string.find(">") + 1], '').strip()
    if ">" in data or "<" in data:
        return remove_html_tags(data)
    else:
        return str(data)
It recursively removes the text between < and >, inclusive.
Let me know if this works!
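For instance, fed one of the spans from the example output above, it should peel the tags off one recursion at a time:

# quick check against a span from the example output above
print(remove_html_tags('<span class="detail-text">123 Main St</span>'))  # -> 123 Main St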

Python scraping delete duplicates

I don't want to have an email address twice. With this code I get the error
TypeError: unhashable type: 'list'
So I assume that the line
allLinks = set()
is wrong and that I have to use a tuple and not a list. Is that right?
That's my code:
import requests
from bs4 import BeautifulSoup as soup

def get_emails(_links: list):
    for i in range(len(_links)):
        new_d = soup(requests.get(_links[i]).text, 'html.parser').find_all('a', {'class': 'my_modal_open'})
        if new_d:
            yield new_d[-1]['title']

start = 20
while True:
    d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start={page_id}'.format(page_id=start)).text, 'html.parser')
    results = [i['href'] for i in d.find_all('a')][52:-9]
    results = [link for link in results if link.startswith('http://')]
    next_page = d.find('div', {'class': 'paging'}, 'weiter')
    if next_page:
        start += 20
    else:
        break
    allLinks = set()
    if results not in allLinks:
        print(list(get_emails(results)))
        allLinks.add(results)
You're trying to add an entire list of emails as a single entry in the set. What you want is to add the actual emails, each as a separate set entry.
The problem is in this line:
allLinks.add(results)
It adds the entire results list as a single element in the set and that doesn't work. Use this instead:
allLinks.update(results)
This will update the set with elements from the list, but each element will be a separate entry in the set.
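A quick illustration of the difference, with made-up links:

allLinks = set()
# allLinks.add(['http://a.example'])  # TypeError: unhashable type: 'list'
allLinks.update(['http://a.example', 'http://b.example'])  # each link becomes its own entry
allLinks.update(['http://a.example'])                      # already present: no duplicate added
print(allLinks)  # {'http://a.example', 'http://b.example'}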
I got it working but I still get duplicate emails.
allLinks = []
if results not in allLinks:
    print(list(get_emails(results)))
    allLinks.append(results)
Does anybody know why?

Beautifulsoup scrape content of a cell beside another one

I am trying to scrape the content of a cell beside another cell whose name I know, e.g. "Staatsform", "Amtssprache", "Postleitzahl", etc. In the table, the needed content is always in the cell to the right.
The basic code is the following one, but I am stuck with it:
source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
staatsform = soup.find(text="Staatsform:")...???
Many thanks in advance!
I wanted to exercise care in limiting the search to what is called the 'Infobox' in the English-language Wikipedia. Therefore, I searched first for the heading 'Basisdaten', requiring that it be a th element. Not exactly definitive perhaps, but more likely to be. Having found that, I looked for tr elements under 'Basisdaten' until I found another tr containing a (presumed different) heading. In this case I search for 'Postleitzahlen:', but this approach makes it possible to find any/all of the items between 'Basisdaten' and the next heading.
PS: I should also mention the reason for if not current.name. I noticed some lines consisting of just newlines, which BeautifulSoup treats as strings. These don't have names, hence the need to treat them specially in the code.
import requests
import bs4

page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')

def getInfoBoxBasisDaten(s):
    return str(s) == 'Basisdaten' and s.parent.name == 'th'

basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
current = basisdaten.parent.parent.nextSibling
while True:
    if not current.name:
        current = current.nextSibling
        continue
    if wanted in current.text:
        items = current.findAll('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):
        break
    current = current.nextSibling
The result is as follows: two separate td elements, as requested.
<td>Postleitzahlen:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
27499</td>
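If only the plain strings are wanted, a get_text() call on each td should do; the separator and strip arguments below are assumptions, and the exact spacing of the result may vary:

print(items[0].get_text(strip=True))       # Postleitzahlen:
print(items[1].get_text(" ", strip=True))  # 20095–21149, 22041–22769, 27499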
This works most of the time:
def get_content_from_right_column_for_left_column_containing(text):
    """return the text contents of the cell adjoining a cell that contains `text`"""
    navigable_strings = soup.find_all(text=text)
    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')
    if len(navigable_strings) == 0:
        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')
        # but `td`s and `th`s do.
        else:
            altered_text = text + ":"
        navigable_strings = soup.find_all(text=altered_text)
    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
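Hypothetical calls, assuming soup was built from the Hamburg page as in the question:

# labels taken from the question; results depend on the live page
print(get_content_from_right_column_for_left_column_containing('Staatsform:'))
print(get_content_from_right_column_for_left_column_containing('Postleitzahlen:'))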

How can I group it by using "search" function in regular expression?

I have been developing a Python web crawler to collect used-car stock data from this website. (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=20)
First of all, I would like to collect only "BMW" entries from the list, so I used the "search" function from the regular expression module as in the code below. But it keeps returning "None".
Is there anything wrong in my code?
Please give me some advice.
Thanks.
from bs4 import BeautifulSoup
import urllib.request
import re

CAR_PAGE_TEMPLATE = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page="

def fetch_post_list():
    for i in range(20, 21):
        URL = CAR_PAGE_TEMPLATE + str(i)
        res = urllib.request.urlopen(URL)
        html = res.read()
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', class_='cyber')
        print("Page#", i)
        # 50 listings per page
        lists = table.find_all('tr', itemtype="http://schema.org/Article")
        count = 0
        r = re.compile("[BMW]")
        for lst in lists:
            if lst.find_all('td')[3].find('em').text:
                lst_price = lst.find_all('td')[3].find('em').text
                lst_title = lst.find_all('td')[1].find('a').text
                lst_link = lst.find_all('td')[1].find('a')['href']
                lst_photo_url = ''
                if lst.find_all('td')[0].find('img'):
                    lst_photo_url = lst.find_all('td')[0].find('img')['src']
                count += 1
            else:
                continue
            print('#', count, lst_title, r.search("lst_title"))
    return lst_link

fetch_post_list()
r.search("lst_title")
This is searching inside the string literal "lst_title", not the variable named lst_title, which is why it never matches.
r=re.compile("[BMW]")
The square brackets indicate that you're looking for one of those characters. So, for example, any string containing M will match. You just want "BMW". In fact you don't even need regular expressions, you can just test:
"BMW" in lst_title

Using BeautifulSoup to find a tag and evaluate whether it fits some criteria

I am writing a program to extract text from a website and write it into a text file. Each entry in the text file should have 3 values separated by a tab. The first value is hard-coded to XXXX, the 2nd value should initialize to the first item on the website with a p class of "style4", and the third value is the next item on the website with a p class of "style5". The logic I'm trying to introduce is looking for the first "style4" and writing the associated string into the text file, then finding the next "style5" and writing the associated string into the text file. Then, look for the next p class. If it's "style4", start a new line; if it's another "style5", write it into the text file with the first style5 entry but separated with a comma (alternatively, the program could just skip the next style5).
I'm stuck on that last part: getting the program to look for the next p class and evaluate it against style4 and style5. Since I was having problems with finding and evaluating the p class tag, I pulled my code out of the loop and just tried to accomplish the first iteration of the task for starters. Here's my code so far:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
next_vendor = soup.find('p', {'class': 'style4'})
print next_vendor
next_commodity = next_vendor.find_next('p', {'class': 'style5'})
print next_commodity
next = next_commodity.find_next('p')
print next
I'd appreciate any help anybody can provide! Thanks in advance!
I am not entirely sure what you expect your output to look like. I am assuming that you are trying to get the data from the webpage into the format:
Alphabet \t Vendor \t Category
You can do this:
# The basic things
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
Get the td of interest:
table = soup.find('table')
data = table.find_all('tr')[-1]
data = data.find_all('td')[1:]
Now, we will create a nested output dictionary with alphabets as the keys and an inner dict as the value. The inner dict has the vendor name as key and the category information as its value:
output_dict = {}
current_alphabet = ""
current_vendor = ""
for td in data:
    for p in td.find_all('p'):
        print p.text.strip()
        if p.get('class')[0] == 'style6':
            current_alphabet = p.text.strip()
            vendors = {}
            output_dict[current_alphabet] = vendors
            continue
        if p.get('class')[0] == 'style4':
            print "Here"
            current_vendor = p.text.strip()
            category = []
            output_dict[current_alphabet][current_vendor] = category
            continue
        output_dict[current_alphabet][current_vendor].append(p.text.strip())
This gets the output_dict in the format:
{ ...
  u'W': { u'WTI - Weatherproofing Technologies': [u'Roofing'],
          u'Wenger Corporation': [u'Musical Instruments and Equipment'],
          u'Williams Scotsman, Inc': [u'Modular/Portable Buildings'],
          u'Witt Company': [u'Interactive Technology']
        },
  u'X': { u'Xerox': [u"Copiers & MFD's", u'Printers'] }
}
Skipping the earlier parts for brevity. Now it is just a matter of walking this dictionary and writing it out to a tab-separated file.
Hope this helps.
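That last step might look roughly like this; the hard-coded "XXXX" column comes from the question, while the output filename is an assumption:

# rough sketch of the write-out step, assuming output_dict is built as above
with open("contracts.txt", "w") as out:
    for alphabet in output_dict:
        for vendor, categories in output_dict[alphabet].items():
            out.write("XXXX\t%s\t%s\n" % (vendor, ", ".join(categories)))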
Agree with @shaktimaan. Using a dictionary or list is a good approach here. My attempt is slightly different.
import requests as rq
from bs4 import BeautifulSoup as bsoup
import csv

url = "http://www.kcda.org/KCDA_Awarded_Contracts.htm"
r = rq.get(url)
soup = bsoup(r.content)

primary_line = soup.find_all("p", {"class": ["style4", "style5"]})
final_list = {}
for line in primary_line:
    txt = line.get_text().strip().encode("utf-8")
    if txt != "\xc2\xa0":
        if line["class"][0] == "style4":
            key = txt
            final_list[key] = []
        else:
            final_list[key].append(txt)

with open("products.csv", "wb") as ofile:
    f = csv.writer(ofile)
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])
For the scrape, we isolate the style4 and style5 tags right away; I did not bother going for the style6 alphabet headers. We then get the text inside each tag. If the text is not a non-breaking space (these are all over the tables, probably obfuscation or bad mark-up), we check whether it's style4 or style5. If it's the former, we assign it as a key pointing to a blank list. If it's the latter, we append it to the blank list of the most recent key. Since the key only changes when we hit a new style4, it's a relatively safe approach.
The last part is easy: we just use ", ".join on the value part of the key-value pair to concatenate the list as one string. We then write it to a CSV file.
Due to the dictionary being unsorted, the resulting CSV file will not be sorted alphabetically.
Changing it to a tab-delimited file is up to you. That's simple enough. Hope this helps!
