Python University Names and Abbreviations and Weblink - python

I want to prepare a dataframe of universities, their abbreviations and website links.
My code:
import requests
import pandas as pd

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)
Current output:
ValueError: No tables found
Expected output:
df =
|   | university_full_name                | uni_abb | uni_url                                                           |
|---|-------------------------------------|---------|-------------------------------------------------------------------|
| 0 | Albert Einstein College of Medicine | AECOM   | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine |

That's one funky page you have there...
First, there are indeed no tables in there. Second, some organizations don't have links, others have redirect links and still others use the same abbreviation for more than one organization.
So you need to bring in the heavy artillery: xpath...
import pandas as pd
import requests
from lxml import html as lh

url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
doc = lh.fromstring(response.text)

rows = []
for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):
    info = uni.text.split(' – ')
    abb = info[0]
    # for those with no links
    if not uni.xpath('.//a'):
        rows.append((abb, " ", info[1]))
    # now to account for those using the same abbreviation for multiple organizations
    for a in uni.xpath('.//a'):
        dat = a.xpath('./@*')
        # for those with redirects
        if len(dat) == 3:
            del dat[1]
        link = f"https://en.wikipedia.org{dat[0]}"
        rows.append((abb, link, dat[1]))

# and now, at last, to the dataframe
cols = ['abb', 'url', 'full name']
df = pd.DataFrame(rows, columns=cols)
df
Output:
abb url full name
0 AECOM https://en.wikipedia.org/wiki/Albert_Einstein_... Albert Einstein College of Medicine
1 AFA https://en.wikipedia.org/wiki/United_States_Ai... United States Air Force Academy
etc.
Note: you can rearrange the order of columns in the dataframe, if you are so inclined.
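For example, a minimal one-liner to put the full name first, using the df built above:
# reorder columns to match the asker's expected layout
df = df[['full name', 'abb', 'url']]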

Select and iterate over only the expected <li> elements and extract their information, but be aware there is a university without an <a> (SUI – State University of Iowa), so this should be handled with an if-statement, as in the example:
for e in soup.select('h2 + ul li'):
    data.append({
        'abb': e.text.split('-')[0],
        'full_name': e.text.split('-')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

data = []
for e in soup.select('h2 + ul li'):
    data.append({
        'abb': e.text.split('-')[0],
        'full_name': e.text.split('-')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })

pd.DataFrame(data)
Output:
                            abb                                       full_name                                                                           url
0                         AECOM             Albert Einstein College of Medicine            https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine
1                           AFA                 United States Air Force Academy                https://en.wikipedia.org/wiki/United_States_Air_Force_Academy
2                     Annapolis                              U.S. Naval Academy                     https://en.wikipedia.org/wiki/United_States_Naval_Academy
3                           A&M  Texas A&M University, but also others; see A&M                          https://en.wikipedia.org/wiki/Texas_A%26M_University
4  A&M-CC or A&M-Corpus Christi                                  Corpus Christi  https://en.wikipedia.org/wiki/Texas_A%26M_University%E2%80%93Corpus_Christi
...

There are no tables on this page, only lists. So the goal will be to go through the <ul> and then the <li> tags, skipping the lists you are not interested in (the first and those after the 26th).
You can extract the abbreviation of the university this way:
uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
For example, for the item 'AECOM – Albert Einstein College of Medicine' this yields 'AECOM'.
while to get the url you have to access the 'href' and 'title' attributes inside the <a> tag:
for a in li.find_all('a', href=True):
    title = a['title']
    url = f"https://en.wikipedia.org{a['href']}"
Accumulate the extracted information into a list, and finally create the dataframe by assigning appropriate column names.
Here is the complete code, in which I use BeautifulSoup:
import requests
import pandas as pd
from bs4 import BeautifulSoup

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
soup = BeautifulSoup(abb_html, 'html.parser')

l = []
for ul in soup.find_all("ul")[1:26]:
    for li in ul.find_all("li"):
        uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
        for a in li.find_all('a', href=True):
            l.append((a['title'], uni_abb, f"https://en.wikipedia.org{a['href']}"))

df = pd.DataFrame(l, columns=['university_full_name', 'uni_abb', 'uni_url'])
Result:
university_full_name uni_abb uni_url
0  Albert Einstein College of Medicine  AECOM  https://en.wikipedia.org/wiki/Albert_Einstein...
1      United States Air Force Academy    AFA  https://en.wikipedia.org/wiki/United_States_A...

Related

How to webscrape a list element without class name?

I am making a web scraper for Book Depository and I came across a problem with the HTML elements of the site. The page for a book has a section called Product Details and I need to take each element from the list. However some of the elements, not all, like Language, have this structure (see the sample image in the original post). How is it possible to get this element?
My work in progress is below. Thanks a lot in advance.
import bs4
from urllib.request import urlopen
book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book_description = soup.find('div', class_='item-excerpt trunc')
book_title = soup.find('h1').text
book_info = soup.find('ul', class_='biblio-info')
book_pages = book_info.find('span', itemprop='numberOfPages').text
book_ibsn = book_info.find('span', itemprop='isbn').text
book_publication_date = book_info.find('span', itemprop='datePublished').text
book_publisher = book_info.find('span', itemprop='name').text
book_author = soup.find('span', itemprop="author").text
book_cover = soup.find('div', class_='item-img-content').img
book_language = book_info.find_next(string='Language',)
book_format = book_info.find_all(string='Format', )
print('Number of Pages: ' + book_pages.strip())
print('ISBN Number: ' + book_ibsn)
print('Publication Date: ' + book_publication_date)
print('Publisher Name: ' + book_publisher.strip())
print('Author: '+ book_author.strip())
print(book_cover)
print(book_language)
print(book_format)
To get the corresponding <span> to your label you could go with:
book_info.find_next(string='Language').find_next('span').get_text(strip=True)
A more generic approach to get all these product details could be:
import bs4, re
from urllib.request import urlopen

book_isbn = "9781399703994"
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source, 'lxml')

book = {
    'description': soup.find('div', class_='item-excerpt trunc').get_text(strip=True),
    'title': soup.find('h1').text
}
book.update({e.label.text.strip(): re.sub(r'\s+', ' ', e.span.text).strip() for e in soup.select('.biblio-info li')})
book
Output:
{'description': "'A breathtaking memoir...I was so moved by this book.' Oprah'It is startlingly honest and, at times, a jaw-dropping read, charting her rise from poverty and abuse to becoming the first African-American to win the triple crown of an Oscar, Emmy and Tony for acting.' BBC NewsTHE DEEPLY PERSONAL, BRUTALLY HONEST ACCOUNT OF VIOLA'S INSPIRING LIFEIn my book, you will meet a little girl named Viola who ran from her past until she made a life changing decision to stop running forever.This is my story, from a crumbling apartment in Central Falls, Rhode Island, to the stage in New York City, and beyond. This is the path I took to finding my purpose and my strength, but also to finding my voice in a world that didn't always see me.As I wrote Finding Me, my eyes were open to the truth of how our stories are often not given close examination. They are bogarted, reinvented to fit into a crazy, competitive, judgmental world. So I wrote this for anyone who is searching for a way to understand and overcome a complicated past, let go of shame, and find acceptance. For anyone who needs reminding that a life worth living can only be born from radical honesty and the courage to shed facades and be...you.Finding Me is a deep reflection on my past and a promise for my future. My hope is that my story will inspire you to light up your own life with creative expression and rediscover who you were before the world put a label on you.show more",
'title': 'Finding Me : A Memoir - THE INSTANT SUNDAY TIMES BESTSELLER',
'Format': 'Hardback | 304 pages',
'Dimensions': '160 x 238 x 38mm | 520g',
'Publication date': '26 Apr 2022',
'Publisher': 'Hodder & Stoughton',
'Imprint': 'Coronet Books',
'Publication City/Country': 'London, United Kingdom',
'Language': 'English',
'ISBN10': '1399703994',
'ISBN13': '9781399703994',
'Bestsellers rank': '31'}
You can check if the label text is equal to Language and then print the span text. I have also added a better approach that parses the whole product-details section in a single iteration.
Check the code given below:
import bs4
from urllib.request import urlopen
import re

book_isbn = "9781399703994"
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source, 'lxml')
book_info = soup.find('ul', class_='biblio-info')
lis = book_info.find_all('li')

# Check if the label name is Language and then print the span text
for val in lis:
    label = val.find('label')
    if label.text.strip() == 'Language':
        span = val.find('span')
        span_text = span.text.strip()
        print('Language--> ' + span_text)

# A better approach: get all the name/value pairs in the Product details section in a single iteration
for val in lis:
    label = val.find('label')
    span = val.find('span')
    span_text = span.text.strip()
    modified_text = re.sub(r'\n', ' ', span_text)
    modified_text = re.sub(' +', ' ', modified_text)
    print(label.text.strip() + '--> ' + modified_text)
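Run against the same page, the second loop should print pairs like the following (values taken from the product-details output shown above):
Format--> Hardback | 304 pages
Publication date--> 26 Apr 2022
Language--> English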
You can grab the desired data from the details portion using CSS selectors:
import bs4
from urllib.request import urlopen
import re
book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
#print(book_urls)
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book_description = soup.find('div', class_='item-excerpt trunc')
book_title = soup.find('h1').text
book_info = soup.find('ul', class_='biblio-info')
book_pages = book_info.find('span', itemprop='numberOfPages').text
book_ibsn = book_info.find('span', itemprop='isbn').text
book_publication_date = book_info.find('span', itemprop='datePublished').text
book_publisher = book_info.find('span', itemprop='name').text
book_author = soup.find('span', itemprop="author").text
book_cover = soup.find('div', class_='item-img-content').img.get('src')
book_language =soup.select_one('.biblio-info > li:nth-child(7) span').get_text(strip=True)
book_format = soup.select_one('.biblio-info > li:nth-child(1) span').get_text(strip=True)
book_format = re.sub(r'\s+', ' ',book_format).replace('|','')
print('Number of Pages: ' + book_pages.strip())
print('ISBN Number: ' + book_ibsn)
print('Publication Date: ' + book_publication_date)
print('Publisher Name: ' + book_publisher.strip())
print('Author: '+ book_author.strip())
print(book_cover)
print(book_language)
print(book_format)
Output:
Number of Pages: 304 pages
ISBN Number: 9781399703994
Publication Date: 26 Apr 2022
Publisher Name: Hodder & Stoughton
Author: Viola Davis
https://d1w7fb2mkkr3kw.cloudfront.net/assets/images/book/lrg/9781/3997/9781399703994.jpg
English
Hardback 304 pages

Create a data frame with headings as column names and <li> tag content as rows, then print this data frame into a text file

I'm trying to get the main body data from this website
I want to get a data frame (or any other object which makes life easier) as output with subheadings as column names and body under the subheading as lines under that column.
My code is below:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')  # or "html.parser"
article = soup.find(class_='entry-content')

headings = []
lines = []
my_df = pd.DataFrame(index=range(100))

for strong in article.findAll('strong'):
    if strong.parent.name == 'p':
        if strong.find(text=re.compile("News")):
            headings.append(strong.text)
#headings

k = 0
for ul in article.findAll('ul'):
    for li in ul.findAll('li'):
        lines.append(li.text)
    lines = lines + [""]
    my_df[k] = pd.Series(lines)
    k = k + 1
my_df
I want to use the "headings" list to get the data frame column names.
Clearly I'm not writing the correct logic. I explored nextSibling, descendants and other attributes too, but I can't figure out the correct logic. Can someone please help?
Once you get the headline, use .find_next() to get that headline's news article list. Then add the items into a list under the headline as a key in a dictionary. Finally, simply use pd.concat() with ignore_index=False:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')  # or "html.parser"
article = soup.find(class_='entry-content')

headlines = {}
news_headlines = article.find_all('p', text=re.compile("News"))
for news_headline in news_headlines:
    end_of_news = False
    sub_title = news_headline.find_next('p')
    headlines[news_headline.text] = []
    #print(news_headline.text)
    while end_of_news == False:
        headlines[news_headline.text].append(sub_title.text)
        articles = sub_title.find_next('ul')
        for li in articles.findAll('li'):
            headlines[news_headline.text].append(li.text)
            #print(li.text)
        sub_title = articles.find_next('p')
        if 'News' in sub_title.text or sub_title.text == '':
            end_of_news = True

df_list = []
for headings, lines in headlines.items():
    temp = pd.DataFrame({headings: lines})
    df_list.append(temp)
my_df = pd.concat(df_list, ignore_index=False, axis=1)
Output:
print(my_df)
National News ... Obituaries News
0 1. Cabinet approves 100% FDI under automatic r... ... 11. Eminent Kashmiri Writer Aziz Hajini passes...
1 The Union Cabinet, chaired by Prime Minister N... ... Noted writer and former secretary of Jammu and...
2 A total of 9 structural and 5 process reforms ... ... He has over twenty books in Kashmiri to his cr...
3 Change in the definition of AGR: The definitio... ... 12. Former India player and Mohun Bagan great ...
4 Rationalised Spectrum Usage Charges: The month... ... Former India footballer and Mohun Bagan captai...
5 Four-year Moratorium on dues: Moratorium has b... ... Bhabani Roy helped Mohun Bagan win the Rovers ...
6 Foreign Direct Investment (FDI): The governmen... ... 13. 2 times Olympic Gold Medalist Yuriy Sedykh...
7 Auction calendar fixed: Spectrum auctions will... ... Double Olympic hammer throw gold medallist Yur...
8 Important takeaways for all competitive exams: ... He set the world record for the hammer throw w...
9 Minister of Communications: Ashwini Vaishnaw. ... He won his first gold medal at the 1976 Olympi...
[10 rows x 8 columns]
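The question also asks to print this dataframe into a text file. A minimal sketch of that last step, using the my_df built above (the filename is just an example):
# tab-separated text file (filename is illustrative)
my_df.to_csv('gk_update.txt', sep='\t', index=False)

# or, to preserve the aligned layout shown above:
with open('gk_update.txt', 'w', encoding='utf-8') as f:
    f.write(my_df.to_string())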

Getting All Classes from A Beautiful Soup Object that Contain a Class Name but not ID

import requests
import bs4

main_url = 'https://www.indeed.com'
url_to_list = ["https://www.indeed.com/jobs?q=Data%20Scientist&advn=9634682979760233"]
for tab_number in range(1, 101):
    url_temp = f'https://www.indeed.com/jobs?q=Data+Scientist&start={tab_number}0'
    url_to_list.append(url_temp)

url = url_to_list[0]
resp = requests.get(url)
soup = bs4.BeautifulSoup(resp.text, 'html.parser')
#class_titles = soup.findAll("div", {"class": "title"})  # contains description URL and title
#class_sjcl = soup.findAll("div", {"class": "sjcl"})  # contains ratings, company name, area
#class_salarySnippet_holisticSalar = soup.findAll("div", {"class": "salarySnippet holisticSalary"})  # contains pay
main_class = soup.findAll('div', class_='jobsearch-SerpJobCard unifiedRow row result clickcard')
This is supposed to scrape Indeed, but I got stuck. I want this to give me a title, description, etc. The hard part is that I want it to return None if, for example, there is no salary available. Everything is contained in class_='jobsearch-SerpJobCard unifiedRow row result clickcard'.
However each container has a different ID, so the findAll() method returns an empty list. I've been trying a bunch of other solutions to similar problems but it keeps returning an empty list.
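A side note on why the exact multi-class string is fragile: when you pass a single class name, BeautifulSoup matches it against each of an element's classes separately, but passing the full space-separated value requires an exact match of the whole class attribute, so any variation in the page's class list returns nothing. Matching on one token is more robust; a minimal sketch:
# match any div that has 'result' among its classes, regardless of the others
cards = soup.findAll('div', class_='result')

# equivalent CSS-selector form
cards = soup.select('div.result')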
This script will go through the pages and get some information about each result. If the result doesn't have, for example, salary - it will put '-' into the data (you can change it to None if you want):
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.indeed.com/jobs?q=Data+Scientist&start={}'

data = []
for p in range(0, 100, 10):
    print('Scraping results {}...'.format(p))
    soup = BeautifulSoup(requests.get(base_url.format(p)).content, 'html.parser')
    for result in soup.select('.result'):
        title = result.select_one('.title').get_text(strip=True)
        job_url = result.select_one('.title a')['href']
        company = result.select_one('.company').get_text(strip=True) if result.select_one('.company') else '-'
        rating = result.select_one('.ratingsDisplay').get_text(strip=True) if result.select_one('.ratingsDisplay') else '-'
        location = result.select_one('.location').get_text(strip=True) if result.select_one('.location') else '-'
        salary = result.select_one('.salary').get_text(strip=True) if result.select_one('.salary') else '-'
        data.append((title, company, rating, location, salary, job_url))

# just print the data for now:
print('{:<65} {:<50} {:<10} {:<65} {:<10}'.format(*'Title Company Rating Location Salary'.split()))
for row in data:
    print('{:<65} {:<50} {:<10} {:<65} {:<10}'.format(*row[:-1]))
Prints:
Scraping results 0...
Scraping results 10...
Scraping results 20...
...
Title Company Rating Location Salary
Data Scientist – Pricing Optimization Delta 4.2 Atlanta, GA -
Data Scientist - Entry Level Numerdox - Sacramento, CA -
Data Scientist RTI International 3.7 Durham, NC 27709 -
Entry Level Data Scientist IBM 3.9 United States -
Data Scientist - Economic Data Zillow 3.8 Seattle, WA 98101(Downtown area) -
Data Scientist FCA 4.0 Detroit, MI 48201 -
Data Scientist, Analytics (University Grad) Facebook 4.2 New York, NY 10017 -
Data Scientist Oath Inc 3.8 Champaign, IL -
...and so on.
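Since the question asks for None instead of a placeholder, a minimal follow-up sketch (assuming the data list built above; the column names are illustrative):
import pandas as pd

# '-' placeholders become proper missing values
df = pd.DataFrame(data, columns=['title', 'company', 'rating', 'location', 'salary', 'url'])
df = df.replace({'-': None})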

Unable to scrape string and list at the same time

I'm trying to get name, address and key contacts from a webpage using a python script. I can get them individually in the right way. However, what I wish to do is get the name and address as string and the key contacts in a list so that I can write them in a csv file in 6 columns. I can't find any way to include the value of data-cfemail within the list of contacts.
Website address
I've tried with:
import requests
from bs4 import BeautifulSoup
link = "https://www.fis.com/fis/companies/details.asp?l=e&filterby=species&specie_id=615&page=1&company_id=160574&country_id="
res = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,'lxml')
name = soup.select_one("#name").text.strip()
address = soup.select("#description_details tr:contains('Address:') td")[1].text
contacts = [' '.join(item.get_text(strip=True).split()) for item in soup.select("#contacts table tr td")]
print(name,address,contacts)
Current output:
Bahia Grande S.A. - BG Group
Maipú 1252 Piso 8°
['Founder & PresidentMr Guillermo Jacob', 'VP FinanceMr Andres Jacob[email protected]', 'ControllerMr Juan Carlos Peralta[email protected]', 'VP AdmnistrationMs Veronica Vinuela[email protected]', '']
Expected output (as the emails are protected the value of data-cfemail will do):
Bahia Grande S.A. - BG Group
Maipú 1252 Piso 8°
[Founder & President, Mr Guillermo Jacob]
[VP Finance, Mr Andres Jacob,bbdad1dad8d4d9fbd9dad3d2dadcc9dad5dfde95d8d4d695dac9]
[Controller,Mr Juan Carlos Peralta,0b61687b6e796a677f6a4b696a63626a6c796a656f6e25686466256a79]
[VP Admnistration,Ms Veronica Vinuela,87f1f1eee9f2e2ebe6c7e5e6efeee6e0f5e6e9e3e2a9e4e8eaa9e6f5]
You could do it the following way: restrict to the appropriate tds with #contacts td[height], then to the appropriate ids with
td.select('#contacts_title, #contacts_name, #contacts_email'), then test in a list comprehension whether the current element has the cfemail and act accordingly.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.fis.com/fis/companies/details.asp?l=e&filterby=species&specie_id=615&page=1&company_id=160574&country_id=')
soup = bs(r.content, 'lxml')
name = soup.select_one('#name').text.strip()
address = soup.select_one('#description_details td:contains("Address:") + td div').text
print(name)
print(address)

for td in soup.select('#contacts td[height]'):
    print([i.text.strip().replace('\xa0', ' ') if i.select_one('.__cf_email__') is None
           else i.select_one('.__cf_email__')['data-cfemail']
           for i in td.select('#contacts_title, #contacts_name, #contacts_email')])
OP's implementation:
contacts = [', '.join([i.text.strip().replace('\xa0',' ') if i.select_one('.__cf_email__') is None else i.select_one('.__cf_email__')['data-cfemail'] for i in td.select('#contacts_title, #contacts_name, #contacts_email')]) for td in soup.select('#contacts td[height]')]
You can iterate over the table storing the contact information:
import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.fis.com/fis/companies/details.asp?l=e&filterby=species&specie_id=615&page=1&company_id=160574&country_id=').text, 'html.parser')
title, address = d.find('div', {'id':'name'}).text, d.find('div', {'id':'description_details'}).tr.div.text
contacts = [[i.find_all('div') for i in b.find_all('td')] for b in d.find('div', {'id':'contacts'}).table.find_all('tr')]
result = [[j.get_text(strip=True) if j.a is None else j.a.span['data-cfemail'] for j in i] for b in contacts for i in b if i]
Output:
'\xa0Bahia Grande S.A. - BG Group' #title
'Maipú 1252 Piso 8°' #address
[['Founder & President', 'Mr\xa0Guillermo\xa0Jacob'], ['VP Finance', 'Mr\xa0Andres\xa0Jacob', 'e6878c87858984a684878e8f87819487888283c885898bc88794'], ['Controller', 'Mr\xa0Juan Carlos\xa0Peralta', '264c45564354474a52476644474e4f474154474842430845494b084754'], ['VP Admnistration', 'Ms\xa0Veronica\xa0Vinuela', 'baccccd3d4cfdfd6dbfad8dbd2d3dbddc8dbd4dedf94d9d5d794dbc8']] #contact info
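As an aside, if you ever need the plain addresses rather than the hex strings: Cloudflare's data-cfemail value is the email XOR-ed byte-by-byte with a one-byte key stored in the first two hex digits, so it can be decoded like this (a sketch, not part of either answer above):
def decode_cfemail(cfemail):
    # first byte is the XOR key, the rest are the obfuscated characters
    key = int(cfemail[:2], 16)
    return ''.join(chr(int(cfemail[i:i + 2], 16) ^ key) for i in range(2, len(cfemail), 2))

# the value shown above for Mr Andres Jacob
print(decode_cfemail('bbdad1dad8d4d9fbd9dad3d2dadcc9dad5dfde95d8d4d695dac9'))
# -> ajacob@bahiagrande.com.ar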

Find .nextsibling with Beautifulsoup4

I am trying to get (some) contents of a table from a URL.
So far I have managed to get two desired contents of the page, but there is a third one (third column) that I would like to get only its text. The problem is, the underlying link exists elsewhere on the page (with different text) and if I want to load the table into an SQL database, the contents of the third colum won't match the first two columns.
import urllib2
from bs4 import BeautifulSoup

startURL = "http://some.url/website.html"
page = urllib2.urlopen(startURL).read()
soup = BeautifulSoup(page, "html.parser")

for links in soup.findAll("a"):
    if "href" in links.attrs:
        www = links.attrs.values()
        if not "https://" in www[0]:  # to exclude all non-relative links, e.g. external links
            if "view/" in www[0]:  # to get only my desired links of column 1
                link_of_column1 = www[0]  # this is now my wanted link
Okay, so with this code I can get the second column. Where and how would I have to apply the .nextsibling() function, to get the next link in the next (3rd) column?
Edit:
As I have been asked: The URL is https://myip.ms/browse/web_hosting/World_Web_Hosting_Global_Statistics.html and I want to get the contents from Column 2 and 3, which is "Hosting Company" (link-text and link) and "Country" (only the text).
Edit2:
Another thing I forgot... how can I extract the information that it's 137,157 records?
First find the table which has all the info using its id=web_hosting_tbl attribute. Then iterate over all the rows of the table. But, if you look at the page source, the rows you need are not consecutive but alternate, and they don't have any class names. Also, the first row of the table is the header row, so we have to skip that.
After getting the required rows (using table.find_all('tr')[1::2]), find all the columns and then get the required information from the corresponding columns.
Code:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://myip.ms/browse/web_hosting/World_Web_Hosting_Global_Statistics.html')
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find('table', id='web_hosting_tbl')

for row in table.find_all('tr')[1::2]:
    all_columns = row.find_all('td')
    name = all_columns[1].a.text
    link = all_columns[1].a['href']
    country = all_columns[2].a.text
    print(name, link, country, sep=' | ')
Partial output:
Godaddy.com, LLC | /view/web_hosting/2433/Godaddy_com_LLC.html | USA
Cloudflare, Inc | /view/web_hosting/4638/Cloudflare_Inc.html | USA
Amazon.com, Inc | /view/web_hosting/615/Amazon_com_Inc.html | USA
Ovh Sas | /view/web_hosting/7593/Ovh_Sas.html | France
Hetzner Online Ag | /view/web_hosting/45081/Hetzner_Online_Ag.html | Germany
Hostgator.com Llc | /view/web_hosting/26757/Hostgator_com_Llc.html | USA
Google Inc | /view/web_hosting/617/Google_Inc.html | USA
Bluehost Inc | /view/web_hosting/3886/Bluehost_Inc.html | USA
...
Code: (Python 3.6+, uses f-strings)
import urllib.parse
from collections import namedtuple
from datetime import datetime

import bs4
import requests

HostingCompany = namedtuple('HostingCompany',
                            ('name', 'country', 'websites', 'usage', 'usage_by_top', 'update_time'))

class MyIpLink:
    url_base = 'https://myip.ms'

    def __init__(self, tag: bs4.element.Tag, *, is_anchor=False):
        a_tag = tag.find('a')
        if is_anchor:  # treat `tag` as an anchor tag
            a_tag = tag
        self.text = tag.text.strip()
        self.url = urllib.parse.urljoin(self.url_base, a_tag['href'])

    def __repr__(self):
        return f'{self.__class__.__name__}(text={repr(self.text)}, url={repr(self.url)})'

url = 'https://myip.ms/browse/web_hosting/World_Web_Hosting_Global_Statistics.html'
html = requests.get(url).text
soup = bs4.BeautifulSoup(html, 'html.parser')

rows = soup.select('#web_hosting_tbl > tbody > tr')[::2]  # skips "more info" rows

companies = []
for row in rows:
    tds = row.find_all('td')
    name = MyIpLink(tds[1])
    country = MyIpLink(tds[2])
    websites = [MyIpLink(a, is_anchor=True) for a in tds[3].find_all('a')]
    usage = MyIpLink(tds[4])
    usage_by_top = MyIpLink(tds[5])
    update_time = datetime.strptime(tds[6].text.strip(), '%d %b %Y, %H:%M')
    company = HostingCompany(name, country, websites, usage, usage_by_top, update_time)
    companies.append(company)

import pprint
pprint.pprint(companies)
print(companies[0].name.text)
print(companies[0].name.url)
print(companies[0].country.text)
Output:
[HostingCompany(name=MyIpLink(text='Godaddy.com, LLC', url='https://myip.ms/view/web_hosting/2433/Godaddy_com_LLC.html'), country=MyIpLink(text='USA', url='https://myip.ms/view/best_hosting/USA/Best_Hosting_in_USA.html'), websites=[MyIpLink(text='www.godaddy.com', url='https://myip.ms/go.php?1229687315_ITg7Im93dCkWE0kNAhQSEh0FUeHq5Q==')], usage=MyIpLink(text='512,701 sites', url='https://myip.ms/browse/sites/1/ownerID/2433/ownerIDii/2433'), usage_by_top=MyIpLink(text='951 sites', url='https://myip.ms/browse/sites/1/rankii/100000/ownerID/2433/ownerIDii/2433'), update_time=datetime.datetime(2018, 5, 2, 5, 17)),
HostingCompany(name=MyIpLink(text='Cloudflare, Inc', url='https://myip.ms/view/web_hosting/4638/Cloudflare_Inc.html'), country=MyIpLink(text='USA', url='https://myip.ms/view/best_hosting/USA/Best_Hosting_in_USA.html'), websites=[MyIpLink(text='www.cloudflare.com', url='https://myip.ms/go.php?840626136_OiEsK2ROSxAdGl4QGhYJG+Tp6fnrv/f49w==')], usage=MyIpLink(text='488,119 sites', url='https://myip.ms/browse/sites/1/ownerID/4638/ownerIDii/4638'), usage_by_top=MyIpLink(text='16,160 sites', url='https://myip.ms/browse/sites/1/rankii/100000/ownerID/4638/ownerIDii/4638'), update_time=datetime.datetime(2018, 5, 2, 5, 10)),
HostingCompany(name=MyIpLink(text='Amazon.com, Inc', url='https://myip.ms/view/web_hosting/615/Amazon_com_Inc.html'), country=MyIpLink(text='USA', url='https://myip.ms/view/best_hosting/USA/Best_Hosting_in_USA.html'), websites=[MyIpLink(text='www.amazonaws.com', url='https://myip.ms/go.php?990446041_JyYhKGFxThMQHUMRHhcDExHj8vul7f75')], usage=MyIpLink(text='453,230 sites', url='https://myip.ms/browse/sites/1/ownerID/615/ownerIDii/615'), usage_by_top=MyIpLink(text='9,557 sites', url='https://myip.ms/browse/sites/1/rankii/100000/ownerID/615/ownerIDii/615'), update_time=datetime.datetime(2018, 5, 2, 5, 4)),
...
]
Godaddy.com, LLC
https://myip.ms/view/web_hosting/2433/Godaddy_com_LLC.html
USA
Try the approach below. It should give you the texts from column 2, the links from column 2 and the texts from column 3 out of that table. I used lxml instead of BeautifulSoup to make it faster. Thanks.
import requests
from urllib.parse import urljoin
from lxml.html import fromstring

URL = 'https://myip.ms/browse/web_hosting/World_Web_Hosting_Global_Statistics.html'
res = requests.get(URL)
root = fromstring(res.text)

for items in root.cssselect('#web_hosting_tbl tr:not(.expand-child)')[1:]:
    name = items.cssselect("td.row_name a")[0].text
    link = urljoin(URL, items.cssselect("td.row_name a")[0].attrib['href'])
    country = items.cssselect("td a[href^='/view/best_hosting/']")[0].text
    print(name, link, country)
Results:
Godaddy.com, LLC https://myip.ms/view/web_hosting/2433/Godaddy_com_LLC.html USA
Cloudflare, Inc https://myip.ms/view/web_hosting/4638/Cloudflare_Inc.html USA
Amazon.com, Inc https://myip.ms/view/web_hosting/615/Amazon_com_Inc.html USA
Ovh Sas https://myip.ms/view/web_hosting/7593/Ovh_Sas.html France
Hetzner Online Ag https://myip.ms/view/web_hosting/45081/Hetzner_Online_Ag.html Germany
Hostgator.com Llc https://myip.ms/view/web_hosting/26757/Hostgator_com_Llc.html USA
Google Inc https://myip.ms/view/web_hosting/617/Google_Inc.html USA
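Regarding Edit2 (the 137,157 records figure): neither snippet above pulls that number. A rough sketch is to search the fetched page text for an "N records" pattern; the regex is an assumption about how the total is rendered on the page:
import re
import requests

res = requests.get('https://myip.ms/browse/web_hosting/World_Web_Hosting_Global_Statistics.html')
# assumes the page renders the total as something like "137,157 records"
match = re.search(r'([\d,]+)\s+records', res.text, re.IGNORECASE)
if match:
    total_records = int(match.group(1).replace(',', ''))
    print(total_records)  # e.g. 137157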
