Python scraping delete duplicates

I don't want to have an email address twice. With this code I get the error
TypeError: unhashable type: 'list'
so I assume that the line
allLinks = set()
is wrong and that I have to use a tuple and not a list. Is that right?
This is my code:
import requests
from bs4 import BeautifulSoup as soup

def get_emails(_links: list):
    for i in range(len(_links)):
        new_d = soup(requests.get(_links[i]).text, 'html.parser').find_all('a', {'class': 'my_modal_open'})
        if new_d:
            yield new_d[-1]['title']

start = 20
while True:
    d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start={page_id}'.format(page_id=start)).text, 'html.parser')
    results = [i['href'] for i in d.find_all('a')][52:-9]
    results = [link for link in results if link.startswith('http://')]
    next_page = d.find('div', {'class': 'paging'}, 'weiter')
    if next_page:
        start += 20
    else:
        break
    allLinks = set()
    if results not in allLinks:
        print(list(get_emails(results)))
        allLinks.add(results)

You're trying to add an entire list of emails as a single entry in the set.
What you want is to add the actual emails, each one as a separate set entry.
The problem is in this line:
allLinks.add(results)
It adds the entire results list as a single element in the set and that doesn't work. Use this instead:
allLinks.update(results)
This will update the set with elements from the list, but each element will be a separate entry in the set.
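A minimal illustration of that difference, on made-up link strings (no scraping involved):

```python
# add() stores its argument as ONE element; update() unpacks an iterable
# into individual elements, which is what deduplication needs here.
links = ["http://a.example", "http://b.example", "http://a.example"]

all_links = set()
all_links.update(links)          # each string becomes its own entry
print(all_links == {"http://a.example", "http://b.example"})  # True

# add() raises TypeError for a list, because lists are unhashable:
try:
    all_links.add(links)
except TypeError as e:
    print(e)  # unhashable type: 'list'
```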

I got it working, but I still get duplicate emails.
allLinks = []
if results not in allLinks:
    print(list(get_emails(results)))
    allLinks.append(results)
Does anybody know why?

Related

Change scraped output

I have a loop putting URLs into my browser and scraping their content, generating this output:
2PRACE,0.0014
Hispanic,0.1556
API,0.0688
Black,0.0510
AIAN,0.0031
White,0.7200
The code looks like this:
f1 = open('urlz.txt', 'r', encoding="utf8")
ethnicity_urls = f1.readlines()
f1.close()

from urllib import request
from bs4 import BeautifulSoup
import time
import openpyxl
import pprint

for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    print(soup1)
    resultFile = open('results.csv', 'a')
    resultFile.write(pprint.pformat(soup1))
    resultFile.close()
My problem is quite simple, yet I cannot find any tool that helps me achieve it. I would like to change the output from a list with "\n" in it to this:
2PRACE,0.0014 Hispanic,0.1556 API,0.0688 Black,0.0510 AIAN,0.0031 White,0.7200
I did not succeed with replace, as it told me I was treating a number of elements as a single element.
My approach here was:
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    soup2 = soup1.replace('\n', ' ')
    print(soup2)
    resultFile = open('results.csv', 'a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
Can you help me find the correct approach to mutate the output before writing it to a csv?
The error message I get:
AttributeError: ResultSet object has no attribute 'replace'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
See the solution to the problem in my answer below. Thanks for all the responses!
soup1 is an iterable (a ResultSet), so you cannot just call replace on it.
Instead you could loop through the items in soup1, convert each one to a string, call replace on it, and append the changed string to a soup2 list. Something like this:
soup2 = []
for e in soup1:
    soup2.append(str(e).replace('\n', ' '))
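The same loop, runnable on stand-in strings (real BeautifulSoup tags are not strings, so they need str() or .get_text() first):

```python
# Stand-in for the text content of the scraped <p> tags.
soup1 = ["2PRACE,0.0014\n", "Hispanic,0.1556\n"]

soup2 = []                    # initialise the result list first
for e in soup1:
    soup2.append(e.replace('\n', ' '))

# Each entry now ends in a space instead of a newline.
print(soup2)
```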
You need to iterate over the soup.
soup1 is a list of elements.
The BS4 documentation is excellent and has many examples:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Use strip() to remove the \n:
for x in soup1:
    for r in x.children:
        try:
            print(r.strip())
        except TypeError:
            pass
Thank you both for the ideas and resources. I could implement what you suggested. The current build is
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    soup2 = str(soup1)
    soup2 = soup2.replace('\n', '')
    print(soup2)
    resultFile = open('results.csv', 'a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
and it works just fine. I can do the final adjustments in Excel now.

How to clean HTML removing repeated paragraphs?

I'm trying to clean an HTML file that has repeated paragraphs within the body. Below I show the input file and the expected output.
Input.html
https://jsfiddle.net/97ptc0Lh/4/
Output.html
https://jsfiddle.net/97ptc0Lh/1/
I've been trying with the following code using BeautifulSoup, but I don't know why it is not working, since the resulting list CleanHtml still contains the repeated elements (paragraphs) that I'd like to remove.
from bs4 import BeautifulSoup

fp = open("Input.html", "rb")
soup = BeautifulSoup(fp, "html5lib")

Uniques = set()
CleanHtml = []
for element in soup.html:
    if element not in Uniques:
        Uniques.add(element)
        CleanHtml.append(element)
print(CleanHtml)
Could someone help me reach this goal, please?
I think this should do it:
elms = []
for elem in soup.find_all('font'):
    if elem not in elms:
        elms.append(elem)
    else:
        target = elem.findParent().findParent()
        target.decompose()
print(soup.html)
This should get you the desired output.
Edit:
To remove only those paragraphs whose size is not 4 or 5, change the else block to:
else:
    if elem.attrs['size'] != "4" and elem.attrs['size'] != "5":
        target = elem.findParent().findParent()
        target.decompose()
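Stripped of BeautifulSoup, the core of this answer is just "remember what you have seen and drop repeats"; a sketch on plain strings standing in for the paragraphs (decompose() plays the role of the drop):

```python
# Stand-in paragraph texts, with duplicates.
paragraphs = ["intro", "body", "intro", "footer", "body"]

seen = set()
unique = []
for p in paragraphs:
    if p not in seen:        # first occurrence: keep it
        seen.add(p)
        unique.append(p)
    # repeats fall through and are discarded

print(unique)  # ['intro', 'body', 'footer']
```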

Python scraping href adjust url

This code works for the URL http://www.schulliste.eu/schule/ but not for
http://www.schulliste.eu/type/gymnasien/
Does anybody know why? I think it has something to do with the keyword "title".
I would also like to have the plain email addresses (without brackets and quotes), each on its own line. Is that possible?
import requests
from bs4 import BeautifulSoup as soup

def get_emails(_links: list, _r=[0, 10]):
    for i in range(*_r):
        new_d = soup(requests.get(_links[i]).text, 'html.parser').find_all('a', {'class': 'my_modal_open'})
        if new_d:
            yield new_d[-1]['title']

d = soup(requests.get('http://www.schulliste.eu/schule/').text, 'html.parser')
results = [i['href'] for i in d.find_all('a')][52:-9]
print(list(get_emails(results)))
I guess it does not work because the searched item 'a', {'class': 'my_modal_open'} is not found on the page behind the second link.
To print the addresses without quotes, you could try this:
items = list(get_emails(results))
for item in items:
    print(item)
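The brackets and quotes come from printing the whole list object; printing each item gives the bare strings, one per line. A tiny demo on made-up addresses:

```python
emails = ["info@schule-a.example", "info@schule-b.example"]  # hypothetical results

print(emails)        # list repr: brackets and quotes included
for item in emails:
    print(item)      # bare address, one per line
```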

When I try to use list indexing in my if statement, why does it fail?

When I try to index in the statement, it says the index is out of range. I am trying to scrape data from the website.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.set.or.th/set/factsheet.do?symbol=TRUE&ssoPageId=3&language=en&country=US')
soup = BeautifulSoup(page.text, 'html.parser')

list_stuff = list()
for x in soup.findAll('table', {'class': 'factsheet'}):
    for tr in x.findAll('tr'):
        stuff = [td for td in tr.stripped_strings]
        if stuff[0] == 'Beta':
            list_stuff.append(stuff[1])
The code returns an error saying list index out of range, pointing to the stuff[0] line in the for loop.
Add a step to check whether the list is empty or not.
if stuff:  # check that stuff is not empty
    if stuff[0] == 'Beta':
        list_stuff.append(stuff[1])
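Why the guard matters: header or spacer rows yield no stripped strings, so indexing stuff[0] on them raises IndexError. A sketch with stand-in row data (each inner list stands for the stripped_strings of one tr):

```python
rows = [[], ["Beta", "1.04"], ["P/E", "12.3"], []]  # made-up table rows

list_stuff = []
for stuff in rows:
    if stuff:                        # skip empty rows instead of indexing them
        if stuff[0] == 'Beta':
            list_stuff.append(stuff[1])

print(list_stuff)  # ['1.04']
```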

range() integer end argument expected, got Tag

I'm trying to write a for loop to go through an HTML table consisting of th and td tags. It's contained in the URL:
https://www.saa.gov.uk/search.php?SEARCHED=1&SEARCH_TABLE=valuation_roll_cpsplit&SEARCH_TERM=edinburgh%2C+GOGARBANK%2C+EDINBURGH%2C+Edinburgh%2C+City+Of&x=16&y=8&DISPLAY_COUNT=10&ASSESSOR_ID=&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=GOGARBANK%2C+EDINBURGH%2C+Edinburgh%2C+City+Of&DD_UNITARY_AUTHORITY=Edinburgh%2C+City+Of&DD_TOWN=EDINBURGH&DD_STREET=GOGARBANK&DISPLAY_MODE=FULL&UARN=103G494E2%28B%29&PPRN=000000000000532&ASSESSOR_IDX=10&#results
I think th is the table heading and I'd like to extract the td (table data). The for loops I am trying to use give me the error:
range() integer end argument expected, got Tag.
Can someone explain to me why, please? The output I want is
103G494E2(B)(LOTHIAN VJB)
YARD
I've also tried using
for i in range(len(elems)):
but it gives me the error object of type 'int' has no len(). Is i in this case being defined as an integer by the range function? This method has worked for me before, so I'm not really sure why it doesn't this time. Many thanks.
import requests
from bs4 import BeautifulSoup as soup
import csv

url = 'https://www.saa.gov.uk/search.php?SEARCHED=1&SEARCH_TABLE=valuation_roll_cpsplit&SEARCH_TERM=edinburgh%2C+GOGARBANK%2C+EDINBURGH%2C+Edinburgh%2C+City+Of&x=16&y=8&DISPLAY_COUNT=10&ASSESSOR_ID=&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=GOGARBANK%2C+EDINBURGH%2C+Edinburgh%2C+City+Of&DD_UNITARY_AUTHORITY=Edinburgh%2C+City+Of&DD_TOWN=EDINBURGH&DD_STREET=GOGARBANK&DISPLAY_MODE=FULL&UARN=103G494E2%28B%29&PPRN=000000000000532&ASSESSOR_IDX=10&#results'
baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)

# content of search page in soup
html = soup(response.content, "lxml")
# list of result entries
rslt_table = html.find("table", {"summary": "Property details"})

ref = 'n/a'
vsr = 'n/a'
for col in rslt_table:
    elems = col.find("th")
    data = col.find("td")
    #for i in range(len(elems)):
    for i in range(elems):
        if elems[i].text == "Ref No. / Office":
            ref = data[i].text
            print ref
        if elems[i].text == 'Description':
            vsr = data[i].text
            print vsr
You don't need range here; use enumerate(): just write for i, elem in enumerate(elems) and then check elem instead of elems[i]. enumerate also keeps track of the index, so you can access the matching elements in data.
That for loop would look like this:
for col in rslt_table:
    elems = col.find_all("th")
    data = col.find_all("td")
    for i, elem in enumerate(elems):
        if elem.text == "Ref No. / Office":
            ref = data[i].text
            print ref
        if elem.text == 'Description':
            vsr = data[i].text
            print vsr
You should also use find_all() instead of find() to get a list of items, not just the single one. So your rslt_table should look like:
rslt_table = html.find_all("table", {"summary":"Property details"})
You're making several mistakes. First of all, find returns a single element; to get a collection of elements, you must use find_all everywhere. And range takes an integer, not an element or a list; use enumerate() or range(len(...)).
Fixed code would be:
rslt_table = html.find_all("table", {"summary": "Property details"})

for col in rslt_table:
    elems = col.find_all("th")
    data = col.find_all("td")
    for i, e in enumerate(elems):
        if e.text == "Ref No. / Office":
            ref = data[i].text
            print(ref)
        if e.text == 'Description':
            vsr = data[i].text
            print(vsr)
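The enumerate() pattern above pairs each header with the data cell at the same index; on plain lists standing in for the th/td texts it looks like this:

```python
# Stand-in texts for the th and td cells of one table row group.
headers = ["Ref No. / Office", "Description", "Other"]
cells = ["103G494E2(B)(LOTHIAN VJB)", "YARD", "n/a"]

ref = vsr = 'n/a'
for i, name in enumerate(headers):  # i indexes the matching cell in cells
    if name == "Ref No. / Office":
        ref = cells[i]
    if name == 'Description':
        vsr = cells[i]

print(ref)  # 103G494E2(B)(LOTHIAN VJB)
print(vsr)  # YARD
```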
