Duplicate links in Python

Good morning world
I'm new to Python and trying things out. I'm trying to remove duplicate links from the run below.
Currently there are 253 links retrieved. Can someone please help me with this?
import requests
from bs4 import BeautifulSoup
import csv

page = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(page)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
links = soup.find_all("a")
print('Number of links retrieved: ', len(links))

Convert it to a set and it will remove duplicates:
links = set(soup.find_all("a"))
out:
Number of links retrieved: 244
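If you only care about the link targets rather than the tags themselves, you can deduplicate the href values directly. A minimal sketch (not part of the original answer), using the same page:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.census.gov/programs-surveys/popest.html")
soup = BeautifulSoup(r.text, 'html.parser')
# A set comprehension over the href attributes drops duplicate targets,
# even when the surrounding <a> tags differ in text or other attributes.
hrefs = {a["href"] for a in soup.find_all("a", href=True)}
print('Number of unique hrefs: ', len(hrefs))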

A set will not preserve the insertion order.
Therefore I used a list instead, cleaning each href properly.
Now the length is 123:
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.census.gov/programs-surveys/popest.html")
soup = BeautifulSoup(r.text, 'html.parser')
links = []
for item in soup.findAll("a", href=True):
    item = item.get("href")
    # Absolute URLs start with "h" (http/https); prefix relative hrefs with the site root.
    if item.startswith("h"):
        pass
    else:
        item = f"https://www.census.gov/{item}"
    if item not in links:
        links.append(item)
        print(item)
print(len(links))
Output:
https://www.census.gov/#content
https://www.census.gov/en.html
https://www.census.gov/topics/population/age-and-sex.html
https://www.census.gov/businessandeconomy
https://www.census.gov/topics/education.html
https://www.census.gov/topics/preparedness.html
https://www.census.gov/topics/employment.html
https://www.census.gov/topics/families.html
https://www.census.gov/topics/population/migration.html
https://www.census.gov/geo
https://www.census.gov/topics/health.html
https://www.census.gov/topics/population/hispanic-origin.html
https://www.census.gov/topics/housing.html
https://www.census.gov/topics/income-poverty.html
https://www.census.gov/topics/international-trade.html
https://www.census.gov/topics/population.html
https://www.census.gov/topics/population/population-estimates.html
https://www.census.gov/topics/public-sector.html
https://www.census.gov/topics/population/race.html
https://www.census.gov/topics/research.html
https://www.census.gov/topics/public-sector/voting.html
https://www.census.gov/about/index.html
https://www.census.gov/data
https://www.census.gov/academy
https://www.census.gov/about/what/admin-data.html
https://www.census.gov/data/data-tools.html
https://www.census.gov/developers/
https://www.census.gov/data/experimental-data-products.html
https://www.census.gov/data/related-sites.html
https://www.census.gov/data/software.html
https://www.census.gov/data/tables.html
https://www.census.gov/data/training-workshops.html
https://www.census.gov/library/visualizations.html
https://www.census.gov/library.html
https://www.census.gov/AmericaCounts
https://www.census.gov/library/audio.html
https://www.census.gov/library/fact-sheets.html
https://www.census.gov/library/photos.html
https://www.census.gov/library/publications.html
https://www.census.gov/library/video.html
https://www.census.gov/library/working-papers.html
https://www.census.gov/programs-surveys/are-you-in-a-survey.html
https://www.census.gov/programs-surveys/decennial-census/2020census-redirect.html
https://www.census.gov/2020census
https://www.census.gov/programs-surveys/acs
https://www.census.gov/programs-surveys/ahs.html
https://www.census.gov/programs-surveys/abs.html
https://www.census.gov/programs-surveys/asm.html
https://www.census.gov/programs-surveys/cog.html
https://www.census.gov/programs-surveys/cbp.html
https://www.census.gov/programs-surveys/cps.html
https://www.census.gov/EconomicCensus
https://www.census.gov/internationalprograms
https://www.census.gov/programs-surveys/metro-micro.html
https://www.census.gov/popest
https://www.census.gov/programs-surveys/popproj.html
https://www.census.gov/programs-surveys/saipe.html
https://www.census.gov/programs-surveys/susb.html
https://www.census.gov/programs-surveys/sbo.html
https://www.census.gov/sipp/
https://www.census.gov/programs-surveys/surveys-programs.html
https://www.census.gov/newsroom.html
https://www.census.gov/partners
https://www.census.gov/programs-surveys/sis.html
https://www.census.gov/NAICS
https://www.census.gov/library/reference/code-lists/schedule/b.html
https://www.census.gov/data/developers/data-sets/Geocoding-services.html
https://www.census.gov/about-us
https://www.census.gov/about/who.html
https://www.census.gov/about/what.html
https://www.census.gov/about/business-opportunities.html
https://www.census.gov/careers
https://www.census.gov/fieldjobs
https://www.census.gov/about/history.html
https://www.census.gov/about/policies.html
https://www.census.gov/privacy
https://www.census.gov/regions
https://www.census.gov/about/contact-us/staff-finder.html
https://www.census.gov/about/contact-us.html
https://www.census.gov/about/faqs.html
https://www.commerce.gov/
https://www.census.gov//en.html
https://www.census.gov//programs-surveys.html
https://www.census.gov//popest
https://www.census.gov//programs-surveys/popest/about.html
https://www.census.gov//programs-surveys/popest/data.html
https://www.census.gov//programs-surveys/popest/geographies.html
https://www.census.gov//programs-surveys/popest/guidance.html
https://www.census.gov//programs-surveys/popest/guidance-geographies.html
https://www.census.gov//programs-surveys/popest/library.html
https://www.census.gov//programs-surveys/popest/news.html
https://www.census.gov//programs-surveys/popest/technical-documentation.html
https://www.census.gov//programs-surveys/popest/data/tables.html
https://www.census.gov//programs-surveys/popest/about/schedule.html
https://www.census.gov//newsroom/press-releases/2019/popest-nation.html
https://www.census.gov//newsroom/press-releases/2019/popest-nation/popest-nation-spanish.html
https://www.census.gov//newsroom/press-releases/2019/new-years-2020.html
https://www.census.gov//data/tables/time-series/demo/popest/pre-1980-national.html
https://www.census.gov//data/tables/time-series/demo/popest/pre-1980-state.html
https://www.census.gov//data/tables/time-series/demo/popest/pre-1980-county.html
https://www.census.gov//library/publications/2015/demo/p25-1142.html
https://www.census.gov//library/publications/2010/demo/p25-1139.html
https://www.census.gov//library/publications/2010/demo/p25-1138.html
https://www.census.gov//programs-surveys/popest/library/publications.html
https://www.census.gov//library/visualizations/2020/comm/superbowl.html
https://www.census.gov//library/visualizations/2019/comm/slower-growth-nations-pop.html
https://www.census.gov//library/visualizations/2019/comm/happy-new-year-2020.html
https://www.census.gov//programs-surveys/popest/library/visualizations.html
https://www.census.gov/#
https://www.census.gov/#uscb-nav-skip-header
https://www.census.gov/newsroom/blogs.html
https://www.census.gov/newsroom/stories.html
https://www.facebook.com/uscensusbureau
https://twitter.com/uscensusbureau
https://www.linkedin.com/company/us-census-bureau
https://www.youtube.com/user/uscensusbureau
https://www.instagram.com/uscensusbureau/
https://www.census.gov/quality/
https://www.census.gov/datalinkage
https://www.census.gov/about/policies/privacy/privacy-policy.html#accessibility
https://www.census.gov/foia
https://www.usa.gov/
https://www.census.gov//
123
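Note the double slashes in some of the results above (e.g. https://www.census.gov//en.html), which appear because a "/" is appended to the base even when the relative href already starts with one. A sketch using urllib.parse.urljoin (an alternative to the string prefixing above, not from the original answer) resolves both relative and absolute hrefs cleanly:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "https://www.census.gov/programs-surveys/popest.html"
soup = BeautifulSoup(requests.get(base).text, 'html.parser')

links = []
for a in soup.find_all("a", href=True):
    url = urljoin(base, a["href"])  # resolves relative paths without doubling slashes
    if url not in links:
        links.append(url)
print(len(links))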

Related

webscraping bus stops with beautifulsoup

I am trying to web scrape bus stop names for a given line; here is an example page for line 212: https://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=212. I want two lists as output: one with the bus stop names in one direction, and the other with the opposite direction (it's clearly visible on the web page). I managed to get all the names in one list with
import requests
from bs4 import BeautifulSoup

def download_bus_schedule(bus_number):
    URL = "http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=" + bus_number
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')
    print(soup.prettify())
    all_bus_stops = []
    table = soup.find_all('a')
    for element in table:
        if element.get_text() in all_bus_stops:
            continue
        else:
            all_bus_stops.append(element.get_text())
    return all_bus_stops

print(download_bus_schedule('212'))
I guess the solution would be to somehow divide the soup into two parts.
You can use soup.select with a CSS class selector and then the bs4.element.Tag.findAll method on each match:
import requests
from bs4 import BeautifulSoup

def download_bus_schedule(bus_number):
    all_bus_stops = []
    URL = "http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=" + bus_number
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html.parser')
    for s in soup.select(".holo-list"):
        bus_stops = []
        for f in s.findAll("li"):
            if f.text not in bus_stops:
                bus_stops.append(f.text)
        all_bus_stops.append(bus_stops)
    return all_bus_stops

print(download_bus_schedule('212'))
Output:
[['Pl.Hallera', 'Pl.Hallera', 'Darwina', 'Namysłowska', 'Rondo Żaba', 'Rogowska', 'Kołowa', 'Dks Targówek', 'Metro Targówek Mieszkaniowy', 'Myszkowska', 'Handlowa', 'Metro Trocka', 'Bieżuńska', 'Jórskiego', 'Łokietka', 'Samarytanka', 'Rolanda', 'Żuromińska', 'Targówek-Ratusz', 'Św.Wincentego', 'Malborska', 'Ch Targówek'],
['Ch Targówek', 'Ch Targówek', 'Malborska', 'Św.Wincentego', 'Targówek-Ratusz', 'Żuromińska', 'Gilarska', 'Rolanda', 'Samarytanka', 'Łokietka', 'Jórskiego', 'Bieżuńska', 'Metro Trocka', 'Metro Trocka', 'Metro Trocka', 'Handlowa', 'Myszkowska', 'Metro Targówek Mieszkaniowy', 'Dks Targówek', 'Kołowa', 'Rogowska', 'Rondo Żaba', '11 Listopada', 'Bródnowska', 'Szymanowskiego', 'Pl.Hallera', 'Pl.Hallera']]
import requests
from bs4 import BeautifulSoup

def download_bus_schedule(bus_number):
    URL = "http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=" + bus_number
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')
    bus_stops_1 = []
    bus_stops_2 = []
    directions = soup.find_all("ul", {"class": "holo-list"})
    # Compare the stripped text, not the Tag object, so duplicates are actually caught.
    for stop in directions[0].find_all("a"):
        if stop.text.strip() not in bus_stops_1:
            bus_stops_1.append(stop.text.strip())
    for stop in directions[1].find_all("a"):
        if stop.text.strip() not in bus_stops_2:
            bus_stops_2.append(stop.text.strip())
    all_bus_stops = (bus_stops_1, bus_stops_2)
    return all_bus_stops

print(download_bus_schedule('212')[0])
print(download_bus_schedule('212')[1])
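A small aside: the two print statements each call download_bus_schedule again, so the page is downloaded twice. Calling the function once and unpacking the tuple avoids the extra request:

bus_stops_1, bus_stops_2 = download_bus_schedule('212')
print(bus_stops_1)
print(bus_stops_2)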
I may have misunderstood as I do not know Polish but see if this helps.
from bs4 import BeautifulSoup
import requests

url = 'https://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=212'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, "html.parser")
d = {}
for h2 in soup.select('h2.holo-divider'):
    d[h2.text] = []
    ul = h2.next_sibling
    for li in ul.select('li'):
        if li.a.text not in d[h2.text]:
            d[h2.text].append(li.a.text)

from pprint import pprint
pprint(d)
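One caveat with h2.next_sibling: depending on the markup and parser, the node right after the <h2> can be a whitespace text node rather than the <ul>. If that happens, find_next_sibling is the safer lookup. A sketch of the same loop with that change (otherwise identical logic):

from bs4 import BeautifulSoup
import requests

url = 'https://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=212'
soup = BeautifulSoup(requests.get(url).content, "html.parser")
d = {}
for h2 in soup.select('h2.holo-divider'):
    ul = h2.find_next_sibling('ul')  # skips intervening whitespace/text nodes
    stops = []
    for li in ul.select('li'):
        if li.a and li.a.text not in stops:
            stops.append(li.a.text)
    d[h2.text] = stops

from pprint import pprint
pprint(d)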
As all stops are encapsulated in the next unordered list, you could use the find_next function of bs4,
e.g.
import requests
from bs4 import BeautifulSoup

def get_stops(bus_number):
    URL = f"http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l={bus_number}"
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')
    directions = ["Ch Targówek", "Pl.Hallera"]
    result = {}
    for direction in directions:
        # Locate the direction header text, then the first <ul> that follows it.
        header = soup.find(text=direction)
        stops_list = header.find_next("ul")
        stops_names = [stop.get_text() for stop in stops_list.find_all("li")]
        result[direction] = stops_names
    return result
Plus, you might want to use f-strings (as in the URL above) to format your strings, as they improve readability and are less error-prone.

AttributeError: type object 'typing.re' has no attribute 'split'

I am trying to scrape URL links from Google. Users can enter any search term and then get back the URL links, but the main problem is that this split function doesn't work, and I can't fix it. So please help me.
(Suppose a user searches for "all useless website"; Google then shows results, and the user should get only the URL links.)
from typing import re
from bs4 import BeautifulSoup
import requests

user_input = input('Enter value for search : ')
print('Please Wait')
page_source = requests.get("https://www.google.com/search?q=" + user_input)
soup = BeautifulSoup(page_source.text, 'html.parser')
print(soup.title)
print(soup.title.string)
print(soup.title.parent.name)
all_links = soup.find_all('a')
for link in all_links:
    link_google = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
    print(link_google.find["a"])
You're importing re from the wrong place. You need to use it via import re, as follows:
import re
...
link_google = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
Updates to make your code work:
1. Import re correctly.
2. Change all_links = soup.find_all('a') to all_links = soup.find_all('a', href=True).
3. Take the link and clean it up as you did before (re.split() works fine, but it returns a list), then add it to a list or print it.
Here is the code, updated to make it work:
# issue 1
import re
from bs4 import BeautifulSoup
import requests

user_input = input('Enter value for search : ')
print('Please Wait')
page_source = requests.get("https://www.google.com/search?q=" + user_input)
soup = BeautifulSoup(page_source.text, 'html.parser')
print(soup.title)
print(soup.title.string)
print(soup.title.parent.name)

# issue 2
all_links = soup.find_all('a', href=True)
for link in all_links:
    link_from_google = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
    # issue 3
    print(link_from_google[0])

>>> {returns all the http links}

One-liner list comprehension, for fun:

google_links = [re.split(":(?=http)", link["href"].replace("/url?q=", ""))[0]
                for link in soup.find_all('a', href=True)]

>>> {returns all the http links}
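For clarity, the ":(?=http)" pattern splits on a colon only when it is immediately followed by "http" (a lookahead), which is how the concatenated Google redirect URLs come apart. A quick standalone check:

import re

# The colon inside "https://" is not followed by "http", so only the
# separator between the two URLs triggers a split.
parts = re.split(":(?=http)", "https://a.example/page:https://b.example/page")
print(parts)  # ['https://a.example/page', 'https://b.example/page']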

BeautifulSoup Python3 append multiple links output to single list

import requests
from bs4 import BeautifulSoup
import re

links = ["https://bitcointalk.org/index.php?board=159.0",
         "https://bitcointalk.org/index.php?board=159.40",
         "https://bitcointalk.org/index.php?board=159.80"]

def get_span():
    for url in links:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "html.parser")
        t1 = str(soup.findAll("span", id=re.compile('^msg_')))
        print(t1)
        t2 = [x for x in re.findall(r'\d+\.\d+', t1)]
        t2.sort(key=float, reverse=True)
        t3 = "https://bitcointalk.org/index.php?topic"
        for hn in t2:
            if len(hn) >= 9:
                hn = '{}={}'.format(t3, hn)
                print(hn)

get_span()
Hello!
My code iterates over the items in links, finds the spans with id=msg_, extracts all the numbers from them, and sorts them in descending order. The problem is that it processes the first item and prints its output, then the second item, and so on, so the output contains three separate lists; the items are sorted separately. I want a single output with the items from all three links sorted into one list.
You can use list.extend to add the items to a single list and then sort that final list before returning it.
For example:
import re
import requests
from bs4 import BeautifulSoup

links = ["https://bitcointalk.org/index.php?board=159.0",
         "https://bitcointalk.org/index.php?board=159.40",
         "https://bitcointalk.org/index.php?board=159.80"]

def get_span(links):
    rv = []
    r = re.compile(r'\d{7,}\.\d+')
    for url in links:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        rv.extend(a['href'] for a in soup.select('span[id^="msg_"] > a') if r.search(a['href']))
    return sorted(rv, key=lambda k: float(r.search(k).group(0)), reverse=True)

all_links = get_span(links)

# print links on screen:
for link in all_links:
    print(link)
Prints:
https://bitcointalk.org/index.php?topic=5255494.0
https://bitcointalk.org/index.php?topic=5255416.0
https://bitcointalk.org/index.php?topic=5255389.0
https://bitcointalk.org/index.php?topic=5255376.0
https://bitcointalk.org/index.php?topic=5255316.0
https://bitcointalk.org/index.php?topic=5254720.0
https://bitcointalk.org/index.php?topic=5254480.0
https://bitcointalk.org/index.php?topic=5254448.0
https://bitcointalk.org/index.php?topic=5254287.0
https://bitcointalk.org/index.php?topic=5252504.0
https://bitcointalk.org/index.php?topic=5251621.0
https://bitcointalk.org/index.php?topic=5250998.0
https://bitcointalk.org/index.php?topic=5250388.0
https://bitcointalk.org/index.php?topic=5250185.0
https://bitcointalk.org/index.php?topic=5248406.0
https://bitcointalk.org/index.php?topic=5247112.0
... and so on.
EDIT: If you want to show the link text next to the URL, you can use this example:
import re
import requests
from bs4 import BeautifulSoup

links = ["https://bitcointalk.org/index.php?board=159.0",
         "https://bitcointalk.org/index.php?board=159.40",
         "https://bitcointalk.org/index.php?board=159.80"]

def get_span(links):
    rv = []
    r = re.compile(r'\d{7,}\.\d+')
    for url in links:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        rv.extend((a['href'], a.text) for a in soup.select('span[id^="msg_"] > a') if r.search(a['href']))
    return sorted(rv, key=lambda k: float(r.search(k[0]).group(0)), reverse=True)

all_links = get_span(links)

# print links on screen:
for link, text in all_links:
    print('{} {}'.format(link, text))
Prints:
https://bitcointalk.org/index.php?topic=5255494.0 NUL Token - A new hyper-deflationary experiment! Airdrop!
https://bitcointalk.org/index.php?topic=5255416.0 KEEP NETWORK - A privacy layer for Ethereum
https://bitcointalk.org/index.php?topic=5255389.0 [ANN] ICO - OBLICHAIN | Blockchain technology at the service of creative genius
https://bitcointalk.org/index.php?topic=5255376.0 UniChain - The 4th Generation Blockchain Made For The Smart Society 5.0
https://bitcointalk.org/index.php?topic=5255316.0 INFINITE RICKS ! First Multiverse Cryptocurrency ! PoS 307%
https://bitcointalk.org/index.php?topic=5254720.0 [GMC] GameCredits - Unofficial & Unmoderated for Censored Posts.
https://bitcointalk.org/index.php?topic=5254480.0 [ANN] [BTCV] Bitcoin VaultA higher standard in security
https://bitcointalk.org/index.php?topic=5254448.0 [ANN] Silvering (SLVG) token - New Silver Asset Backed Cryptocurrency
... and so on.
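The sort key relies on the regex r'\d{7,}\.\d+', which pulls the topic id (seven or more digits, a dot, then more digits) out of each URL. A quick standalone check of what it extracts:

import re

r = re.compile(r'\d{7,}\.\d+')
m = r.search("https://bitcointalk.org/index.php?topic=5255494.0")
print(m.group(0))         # 5255494.0
print(float(m.group(0)))  # the float used as the sort key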

Newbie, scraping table from url and not able to get output in python command prompt

This is a homework assignment. I have to sum up all the data in the span tags and print the total. I extracted the text from the span tags and appended it to a list, but I don't know how to go beyond that: any function I type outside the for loop doesn't run. Also, I have to hit Enter twice after running this in the Python command prompt to get an output.
I am new here, so please forgive the format of the question. Thanks for the help.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

lst = list()
url = 'http://py4e-data.dr-chuck.net/comments_42.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
spans = soup.findAll('span', attrs={'class': 'comments'})
for span in spans:
    num = int(span.text)
    lst.append(num)
    print(num)
There's no need to collect the values into a list if you are just adding them together. You can do it like this:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = 'http://py4e-data.dr-chuck.net/comments_42.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
spans = soup.findAll('span', attrs={'class': 'comments'})
sum = 0
for span in spans:
    sum += int(span.text)
print("Total Sum = " + str(sum))
Since you have already prepared the list of numbers from the spans, you can get their total with Python's built-in sum() function. Pass your list as the argument to sum().
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

lst = list()
url = 'http://py4e-data.dr-chuck.net/comments_42.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
spans = soup.findAll('span', attrs={'class': 'comments'})
for span in spans:
    num = int(span.text)
    lst.append(num)
    print(num)

sum_of_nums = sum(lst)
print(sum_of_nums)
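The collect-then-sum pattern can also be collapsed into a single expression with a generator, so no intermediate list is built. A sketch against the same page:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_42.html').read()
soup = BeautifulSoup(html, 'html.parser')
# sum() consumes the generator of ints directly
total = sum(int(span.text) for span in soup.find_all('span', class_='comments'))
print(total)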

HTML Parsing gives no response

I'm trying to parse a web page, and here is my code:
from bs4 import BeautifulSoup
import urllib2

openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
read = BeautifulSoup(openurl.read())
soup = BeautifulSoup(openurl)
x = soup.find('ul', {"class": "i_p0"})
sp = soup.findAll('a href')
for x in sp:
    print x
I really wish I could be more specific, but as the title says, it gives me no response. No errors, nothing.
First of all, omit the line read = BeautifulSoup(openurl.read()).
Also, the line x = soup.find('ul', {"class": "i_p0"}) doesn't actually make any difference, because you reuse the x variable in the loop.
Also, soup.findAll('a href') doesn't find anything.
Also, instead of the old-fashioned findAll(), there is find_all() in BeautifulSoup4.
Here's the code with several alterations:
from bs4 import BeautifulSoup
import urllib2

openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl)
sp = soup.find_all('a')
for x in sp:
    print x['href']
This prints the value of the href attribute for every link on the page.
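One caveat (an addition, not from the original answer): if the page contains an <a> tag without an href attribute, x['href'] raises a KeyError. Restricting the search avoids that:

sp = soup.find_all('a', href=True)  # only anchors that actually carry an href
for x in sp:
    print x['href']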
Hope that helps.
I altered a couple of lines in your code and I do get a response; I'm not sure if that is what you want, though.
Here:
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl.read())  # This is what you need to use for selecting elements
# soup = BeautifulSoup(openurl)  # This is not needed
# x = soup.find('ul', {"class": "i_p0"})  # You don't seem to be making use of this either
sp = soup.findAll('a')
for x in sp:
    print x.get('href')  # This is to get the href
Hope this helps.
