I'm practising some Python scraping and I'm a bit stuck with the following exercise. The aim is to scrape the tickers that result from applying some filters. Code below:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

tickers = []
counter = 1

while True:
    url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r=" + str(counter))
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    html = soup(webpage, "html.parser")
    rows = html.select('table[bgcolor="#d3d3d3"] tr')
    for i in rows[1:]:
        a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])
        i = a1
        tickers.append(i)
    counter += 20
    if tickers[-1] == tickers[-2]:
        break
I'm not sure how to extract only one column, so I'm using the code for all of them (a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])). Is there a way to get just the first column?
Is there a way to avoid having to hardcode '20' in the script?
When I run the code it creates a duplicate of the last ticker. Is there another way to make the code stop once it has gone through all the entries?
Since you are only interested in the values of the tickers column, select it more specifically; based on its content, target the <a> element:
html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary')
To avoid working with the hardcoded 20, check whether there is a "next" page element and use its href:
html.select_one('.tab-link:-soup-contains("next")')
Example
import requests, time
from bs4 import BeautifulSoup

url = "https://finviz.com/screener.ashx?v=111&f=cap_large"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36','accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'}
tickers = []

while True:
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.text, "html.parser")

    for a in html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary'):
        tickers.append(a.text)

    if html.select_one('.tab-link:-soup-contains("next")'):
        url = "https://finviz.com/" + html.select_one('.tab-link:-soup-contains("next")')['href']
    else:
        break

    # be kind and add some delay between your requests
    time.sleep(1)

tickers
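As a small usage note, the bare tickers on the last line only displays the list in a REPL or notebook; in a plain script you would print or otherwise use it, for example:

print(len(tickers), "tickers collected")
print(tickers[:10])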
You can use an nth-child range to filter out the first row of the table, then nth-child(2) to get the tickers column within the remaining rows; a small self-contained toy example follows the links below.
tickers = [td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')]
With an existing list, use
tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])
Read about nth-child here:
http://nthmaster.com/
and
https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child
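For illustration, here is a self-contained toy example of that selector applied to a minimal, made-up table (the real screener markup has more columns, but the principle is the same):

from bs4 import BeautifulSoup

# toy HTML for illustration only: a header row followed by data rows
html_doc = """
<table bgcolor="#d3d3d3">
  <tr><td>No.</td><td>Ticker</td><td>Company</td></tr>
  <tr><td>1</td><td>AAPL</td><td>Apple Inc.</td></tr>
  <tr><td>2</td><td>MSFT</td><td>Microsoft Corp.</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# tr:nth-child(n+2) skips the header row, td:nth-child(2) takes the ticker cell
tickers = [td.text for td in soup.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')]
print(tickers)  # ['AAPL', 'MSFT']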
You can stop when there is no "next" element present. counter needs to increment by 20 on each request.
import requests
from bs4 import BeautifulSoup as bs

tickers = []
counter = 1

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    while True:
        # print(counter)
        url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r=" + str(counter))
        res = s.get(url)
        html = bs(res.text, "html.parser")
        tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])

        if html.select_one('.tab-link b:-soup-contains("next")') is None:
            break

        counter += 20
Related
I am trying to run this code in IDLE 3.10.6 and I am not seeing any of the data that should be extracted from Indeed. All this data should be in the output when I run it, but it isn't. Below is the code I am running:
#Indeed data
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko"}
    url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}"
    r = requests.get(url,headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

def transform(soup):
    divs = soup.find_all("div", class_ = "jobsearch-SerpJobCard")
    for item in divs:
        title = item.find("a").text.strip()
        company = item.find("span", class_="company").text.strip()
        try:
            salary = item.find("span", class_ = "salarytext").text.strip()
        finally:
            salary = ""
        summary = item.find("div",{"class":"summary"}).text.strip().replace("\n","")
        job = {
            "title":title,
            "company":company,
            'salary':salary,
            "summary":summary
        }
        joblist.append(job)

joblist = []
for i in range(0,40,10):
    print(f'Getting page, {i}')
    c = extract(10)
    transform(c)

df = pd.DataFrame(joblist)
print(df.head())
df.to_csv('jobs.csv')
Here is the output I get
Getting page, 0
Getting page, 10
Getting page, 20
Getting page, 30
Empty DataFrame
Columns: []
Index: []
Why is this happening and what should I do to get that extracted data from Indeed? What I am trying to get is the job title, company, salary, and summary information. Any help would be greatly appreciated.
The URL string includes {page}, but it's not an f-string, so it's not being interpolated, and the URL you are fetching is:
https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}
That returns an error page.
So you should add an f before the opening quote when you set url.
Also, you are calling extract(10) each time, instead of extract(i).
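For illustration, a minimal sketch of those two changes applied to the original snippet (everything else, including transform, is assumed unchanged; headers is also passed as a keyword argument so the User-Agent is actually sent):

def extract(page):
    headers = {"User-Agent": "Mozilla/5.0"}
    # the f prefix makes {page} interpolate into the URL
    url = f"https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}"
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, "html.parser")

for i in range(0, 40, 10):
    print(f'Getting page, {i}')
    c = extract(i)  # pass the loop variable instead of the constant 10
    transform(c)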
This is the correct way of building the url:
url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}".format(page=page)
r = requests.get(url,headers)
Here r.status_code returns 403, which means the request is forbidden: the site is blocking your request. Use the Indeed job search API instead.
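For illustration, a quick way to confirm the block for yourself before parsing anything (a sketch; the URL is the one from the question with page 0 filled in):

import requests

url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}".format(page=0)
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
print(r.status_code)   # 403 means the server refused the request
r.raise_for_status()   # raises requests.exceptions.HTTPError for 4xx/5xx responses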
In this URL https://doc8643.com/aircrafts I want to scrape all rows.
Then for each individual row, for example https://doc8643.com/aircraft/A139
I want to scrape these three areas of data
<table class="table centered-table">
<h4>Manufacturers</h4>
<h4>Technical Data</h4>
Can this be done in Python?
import requests, csv
from bs4 import BeautifulSoup
from urllib.request import Request

url = 'https://doc8643.com/aircrafts'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'})

with open('doc8643.csv', "w", encoding="utf-8") as f:
    writer = csv.writer(f)
    while True:
        print(url)
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')

        # Go through the table body and extract the data under the 'td' tag
        for row in soup.select('ul.nav.nav-pills.nav-stacked li.aircraft_item'):
            writer.writerow([c.text if c.text else '' for c in row.select('h3')])
            print(row)

        # If there is more than one page, iterate through all of them
        if soup.select_one('ul.pagination li.active + li a'):
            url = soup.select_one('ul.pagination li.active + li a')['href']
        else:
            break
You should create a function which takes the value c.text (e.g., A139), builds the full URL like https://doc8643.com/aircraft/A139, and uses Request or requests plus BeautifulSoup to get all the needed data.
def scrape_details(number):
    url = 'https://doc8643.com/aircraft/' + number
    print('details:', url)

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # ... scrape details and put in list `results` ...

    return results
and run it in your loop
for row in soup.select('ul.nav.nav-pills.nav-stacked li.aircraft_item'):
    data = [c.text if c.text else '' for c in row.select('h3')]
    for item in data:
        values = scrape_details(item)
        writer.writerow([item] + values)
The biggest problem is scraping the details.
For some of them you need to scrape the dl, then all the dt and dd elements, and use zip() to group them into pairs.
Something like
def scrape_details(number):
    url = 'https://doc8643.com/aircraft/' + number
    print('details:', url)

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    results = []

    all_dl = soup.find_all('dl')
    for item in all_dl:
        all_dt = item.find_all('dt')
        all_dd = item.find_all('dd')
        for dt, dd in zip(all_dt, all_dd):
            pair = f"{dt.string}: {dd.string}"
            results.append(pair)
            print(pair)

    #print(results)

    return results
but this needs more code, so I skip that part.
Minimal working code
EDIT: I added url = 'https://doc8643.com' + url
import csv
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

# --- functions ---

def scrape_details(number):
    url = 'https://doc8643.com/aircraft/' + number
    print('details:', url)

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    results = []

    all_dl = soup.find_all('dl')
    for item in all_dl:
        all_dt = item.find_all('dt')
        all_dd = item.find_all('dd')
        for dt, dd in zip(all_dt, all_dd):
            pair = f"{dt.string}: {dd.string}"
            results.append(pair)
            print(pair)

    #print(results)

    return results

# --- main ---

url = 'https://doc8643.com/aircrafts'

with open('doc8643.csv', "w", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["data1", "data2", "data3", "etc..."])

    while True:
        print('url:', url)
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Go through the table body and extract the data under the 'td' tag
        for row in soup.select('ul.nav.nav-pills.nav-stacked li.aircraft_item'):
            data = [c.text if c.text else '' for c in row.select('h3')]
            for item in data:
                values = scrape_details(item)
                writer.writerow([item] + values)

        # If there is more than one page, iterate through all of them
        if soup.select_one('ul.pagination li.active + li a'):
            url = soup.select_one('ul.pagination li.active + li a')['href']
            url = 'https://doc8643.com' + url
        else:
            break
BTW:
Maybe it would be better to keep results as a dictionary:
results[dt.string] = [dd.string]
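A hedged sketch of that variant, reusing the request and parsing from the function above (the function name is just illustrative; you could also keep each value in a list as shown in the line above):

def scrape_details_as_dict(number):
    url = 'https://doc8643.com/aircraft/' + number
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    results = {}
    for dl in soup.find_all('dl'):
        for dt, dd in zip(dl.find_all('dt'), dl.find_all('dd')):
            results[dt.string] = dd.string  # map each dt label to its dd value
    return results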
I'm writing code to web-scrape the Transfermarkt website, but I'm having some issues with it.
The code had returned an error that was fixed through the topic: Loop thru multiple URLs in Python - InvalidSchema("No connection adapters were found for {!r}".format
After this fix, other problems came up.
First: the code is duplicating the results in the data frame.
Second: the code is taking only the last element of each URL. In fact, what I want is to get all the agency URLs in pagina = range(1) and then scrape all players in each agency, through the URLs scraped in the first part.
P.S.: pagina = range(1) will become range(1, 40); it's the number of pages that I will scrape to get all the agencies' links.
Can anyone give me a hand with these issues?
Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from requests.sessions import default_headers

nome = []
posicao = []
nacionalidade = []
idade = []
clube = []
contrato = []
valor = []

tf = f"http://www.transfermarkt.com.br"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}

pagina = range(1, 5)

def main(url):
    with requests.Session() as req:
        links = []
        for lea in pagina:
            print(f"Extraindo links da página {lea}")
            r = req.get(url.format(lea), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            link = [f"{tf}{item.next_element.get('href')}" for item in soup.findAll(
                "td", class_="hauptlink")]
            links.extend(link)

        print(f"Collected {len(links)} Links")
        time.sleep(1)

        for url in links:
            r = requests.get(url, headers=headers)
            r.status_code
            soup = BeautifulSoup(r.text, 'html.parser')
            player_info = soup.find_all('tr', class_=['odd', 'even'])

            for info in player_info:
                player = info.find_all("td")
                vall = info.find('td', {'class': 'zentriert hauptlink'})
                nome.append(player[2].text)
                posicao.append(player[3].text)
                nacionalidade.append(player[4].img['alt'])
                idade.append(player[5].text)
                clube.append(player[6].img['alt'])
                contrato.append(player[7].text)
                valor.append(vall)

            time.sleep(1)

        df = pd.DataFrame(
            {"NOME": nome,
             "POSICAO": posicao,
             "NACIONALIDADE": nacionalidade,
             "IDADE": idade,
             "CLUBE": clube,
             "CONTRATO": contrato,
             "VALOR": valor}
        )
        print(df)
        df
        #df.to_csv('MBB.csv', index=False)

main("https://www.transfermarkt.com.br/berater/beraterfirmenuebersicht/berater?ajax=yw1&page={}")
I am a beginner to Python programming. I am practicing web scraping using the bs4 module in Python.
I have extracted some fields from a web page, but it is extracting only 13 items whereas the web page has more than 13 items. I cannot understand why the rest of the items are not extracted.
Another thing is that I want to extract the contact number and the email address of each item on the web page, but they are only available in each item's respective link. I am a beginner and, frankly speaking, I am stuck on how to access and scrape each item's individual web page from within the given web page. Kindly tell me where I am going wrong and, if possible, suggest what is to be done.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.post('https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A')
soup = bs(res.content, 'lxml')
data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})

records = []
for item in data:
    name = item.find('h2').text.strip()
    position = item.find('h3').text.strip()
    records.append({'Names': name, 'Position': position})

df = pd.DataFrame(records, columns=['Names', 'Position'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\NelsonAlexander.xls', sheet_name='MyData2', index=False, header=True)
The above code just extracts the name and position of each item, but it only scrapes 13 records even though there are more on the web page. I could not write any code for extracting the contact number and the email address of each record because they sit inside each item's individual page, which is where I got stuck.
That website loads the list dynamically as you scroll; however, you can trace the AJAX request and parse the data directly:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0', 'Referer': 'https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A'}

records = []

with requests.Session() as s:
    for i in range(1, 22):
        res = s.get(f'https://www.nelsonalexander.com.au/real-estate-agents/page/{i}/?ajax=1&agent=A', headers=headers)
        soup = bs(res.content, 'lxml')
        data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})
        for item in data:
            name = item.find('h2').text.strip()
            position = item.find('h3').text.strip()
            phone = item.find("div", {"class": "small-6 columns text-left"}).find("a").get('href').replace("tel:", "")
            records.append({'Names': name, 'Position': position, 'Phone': phone})

df = pd.DataFrame(records, columns=['Names', 'Position', 'Phone'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\NelsonAlexander.xls', sheet_name='MyData2', index=False, header=True)
I am convinced that the emails are nowhere in the DOM. I made some modifications to #drec4s code to instead keep going until there are no entries (dynamically).
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import itertools

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0', 'Referer': 'https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A'}

records = []

with requests.Session() as s:
    for i in itertools.count():
        res = s.get('https://www.nelsonalexander.com.au/real-estate-agents/page/{}/?ajax=1&agent=A'.format(i), headers=headers)
        soup = bs(res.content, 'lxml')
        data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})
        if len(data) > 0:
            for item in data:
                name = item.find('h2').text.strip()
                position = item.find('h3').text.strip()
                phone = item.find("div", {"class": "small-6 columns text-left"}).find("a").get('href').replace("tel:", "")
                records.append({'Names': name, 'Position': position, 'Phone': phone})
                print({'Names': name, 'Position': position, 'Phone': phone})
        else:
            break
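If you also want the same Excel output as the original script, you can reuse its DataFrame step on these records afterwards (a sketch; the columns, path, and sheet name are the ones from the question and the answer above):

df = pd.DataFrame(records, columns=['Names', 'Position', 'Phone'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\NelsonAlexander.xls', sheet_name='MyData2', index=False, header=True)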
I want to search for different company names on the website. Website link: https://www.firmenwissen.de/index.html
On this website, I want to use the search engine to search for companies. Here is the code I am trying to use:
from bs4 import BeautifulSoup as BS
import requests
import re

companylist = ['ABEX Dachdecker Handwerks-GmbH']

url = 'https://www.firmenwissen.de/index.html'
payloads = {
    'searchform': 'UFT-8',
    'phrase': 'ABEX Dachdecker Handwerks-GmbH',
    "mainSearchField__button": 'submit'
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

html = requests.post(url, data=payloads, headers=headers)
soup = BS(html.content, 'html.parser')

link_list = []
links = soup.findAll('a')
for li in links:
    link_list.append(li.get('href'))
print(link_list)
This code should bring me to the next page with the company information. But unfortunately, it returns only the home page. How can I fix this?
Change the initial url you are searching against. Grab only the appropriate hrefs and add them to a set to ensure no duplicates (or alter the selector to return only one match, if possible); add those items to a final set for looping, so you only loop over the required number of links. I have used a Session on the assumption that you will repeat this for many companies.
Iterate over the set using selenium to navigate to each company url and extract whatever info you need.
This is an outline.
from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver

d = webdriver.Chrome()

companyList = ['ABEX Dachdecker Handwerks-GmbH', 'SUCHMEISTEREI GmbH']
url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}

finalLinks = set()

## searches section; gather into set
with requests.Session() as s:
    for company in companyList:
        payloads = {
            'searchform': 'UFT-8',
            'phrase': company,
            "mainSearchField__button": 'submit'
        }
        html = s.post(url, data=payloads, headers=headers)
        soup = BS(html.content, 'lxml')
        companyLinks = {baseUrl + item['href'] for item in soup.select("[href*='firmeneintrag/']")}
        # print(soup.select_one('.fp-result').text)
        finalLinks = finalLinks.union(companyLinks)

for item in finalLinks:
    d.get(item)
    info = d.find_element_by_css_selector('.yp_abstract_narrow')
    address = d.find_element_by_css_selector('.yp_address')
    print(info.text, address.text)

d.quit()
If you only want the first link for each company:
from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver

d = webdriver.Chrome()

companyList = ['ABEX Dachdecker Handwerks-GmbH', 'SUCHMEISTEREI GmbH', 'aktive Stuttgarter']
url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}

finalLinks = []

## searches section; add to list
with requests.Session() as s:
    for company in companyList:
        payloads = {
            'searchform': 'UFT-8',
            'phrase': company,
            "mainSearchField__button": 'submit'
        }
        html = s.post(url, data=payloads, headers=headers)
        soup = BS(html.content, 'lxml')
        companyLink = baseUrl + soup.select_one("[href*='firmeneintrag/']")['href']
        finalLinks.append(companyLink)

for item in set(finalLinks):
    d.get(item)
    info = d.find_element_by_css_selector('.yp_abstract_narrow')
    address = d.find_element_by_css_selector('.yp_address')
    print(info.text, address.text)

d.quit()
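As an aside, recent Selenium releases (4.x) have removed the find_element_by_css_selector helpers, so with a current driver the element lookups above would need the By-based form instead. A sketch of the equivalent calls, assuming the same d driver and loaded company page as in the snippets above:

from selenium.webdriver.common.by import By

# Selenium 4.x equivalent of find_element_by_css_selector
info = d.find_element(By.CSS_SELECTOR, '.yp_abstract_narrow')
address = d.find_element(By.CSS_SELECTOR, '.yp_address')
print(info.text, address.text)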