Issue with scraping in python

I am trying to scrape specific lines from a page (the URL is in the code below) and build a table from the collected data, but I cannot get anything more precise than the entire body text, so I am stuck.
For example, I would like to arrive at a table built from details in the body content. All the details are there; any help on how to retrieve them in table form would be much appreciated.
My code is:
import requests
from bs4 import BeautifulSoup
# providing url
url = 'https://www.polskawliczbach.pl/wies_Baniocha'
# creating request object
req = requests.get(url)
# creating soup object
data = BeautifulSoup(req.text, 'html')
# finding all li tags in ul and printing the text within it
data1 = data.find('body')
for li in data1.find_all("li"):
    print(li.text, end=" ")

First find the ul, then find the li tags inside it. Scrape the data you need, store it in a variable, and build the table with pandas. After that, either save the table to a CSV file or just print it.
Here is the code that implements these steps:
from bs4 import BeautifulSoup
import requests
import pandas as pd
page = requests.get('https://www.polskawliczbach.pl/wies_Baniocha')
soup = BeautifulSoup(page.content, 'lxml')
lis = soup.find_all("ul", class_="list-group row")[1].find_all("li")[1:-1]
dic = {"name": [], "value": []}
for li in lis:
    try:
        name = li.find(text=True, recursive=False).strip()
        value = li.find("span").text.replace(" ", "")
    except AttributeError:
        # skip items without a direct text node or a span
        continue
    # append only when both parts were found, so the two lists stay the same length
    dic["name"].append(name)
    dic["value"].append(value)
    print(name, value)
df=pd.DataFrame(dic)
print(df)
# If you want to save this as file then uncomment following line:
# df.to_csv("<FILENAME>.csv")
If you also want to scrape all the other categories (I don't read the language, so I can't tell which ones are useful), replace the corresponding part of the code above with this:
soup = BeautifulSoup(page.content, 'lxml')
dic = {"name": [], "value": []}
for ul in soup.find_all("ul", class_="list-group row"):
    for li in ul.find_all("li")[1:-1]:
        try:
            name = li.find(text=True, recursive=False).strip()
            value = li.find("span").text.replace(" ", "").replace(",", "")
        except AttributeError:
            # skip items without a direct text node or a span
            continue
        dic["name"].append(name)
        dic["value"].append(value)
        print(name, "\t", value)
df = pd.DataFrame(dic)

Find the main ul tag by its class, then find all the li tags inside it:
main_data = data.find("ul", class_="list-group").find_all("li")[1:-1]
names = []
values = []
main_values = []
for i in main_data:
    values.append(i.find("span").get_text())
    names.append(i.find(text=True, recursive=False))
# one row holding all the collected values
main_values.append(values)
For a table representation, use the pandas module:
import pandas as pd
df = pd.DataFrame(columns=names, data=main_values)
df
Output:
  Liczba mieszkańców (2011) Kod pocztowy Numer kierunkowy
0                     1 935       05-532         (+48) 22
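The value 1 935 in the output above is still a string with a space as a thousands separator. If a numeric column is wanted, the whitespace has to be removed before conversion. A minimal sketch, assuming the separator is ordinary whitespace or a non-breaking space; to_int is a hypothetical helper name, not part of the answer above:

```python
def to_int(raw: str) -> int:
    """Convert a string such as '1 935' to the integer 1935."""
    # str.split() without arguments splits on any whitespace, including the
    # non-breaking space (\xa0) that web pages often use as a separator.
    return int("".join(raw.split()))


population = to_int("1 935")
print(population)
```

The same helper could be applied to a whole column with df[column].map(to_int), provided every cell in that column is numeric.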

Related

HTML parts locating

I am trying to extract each row individually so that I can eventually build a DataFrame and export it to CSV, but I can't locate the individual parts of the HTML.
I can find and save the entire content (although I only seem to manage this inside a loop, so the pages appear hundreds of times), but I can't find any of the HTML parts nested beneath it. My code, which tries to find the first row, is as follows:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
content = soup.find('div', {'class': 'view-content'})
for infos in content:
    try:
        data = infos.find('div', {'class': 'type type_18'}).text
    except:
        print("None found")
df = pd.DataFrame(data)
df.columns = df.columns.str.lower().str.replace(': ','')
df[['type','rrr']] = df['rrr'].str.split("|",expand=True)
df.to_csv (r'savehere.csv', index = False, header = True)
This code just prints "None found" because, I assume, it hasn't found anything else to print. I don't know whether I am looking for the wrong HTML part or whether something else is wrong.
Any help would be much appreciated.

What happens?
The main issue is that content = soup.find('div', {'class': 'view-content'}) is a single Tag, not a ResultSet. Iterating over it yields its direct children, and some of those children are plain strings rather than tags.
As a result, you switch from the BeautifulSoup method find() to the Python string method find(), and the two behave differently. Without the try/except you can see what is going on; for the string children it searches for a substring:
for x in soup.find('div', {'class': 'view-content'}):
    print(x.find('div'))
Output
...
-1
<div class="views-field views-field-title-1"> <span class="views-label views-label-title-1">RRR: </span> <span class="field-content"><div class="type type_18">Eleemosynary grant</div>2256</span> </div>
...
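The behaviour described above can be reproduced offline. The sketch below uses a made-up HTML fragment (an assumption, modelled on the output shown) and records what find('div') returns for each child of the container: string children fall back to str.find() and give -1, while tag children use BeautifulSoup's find() and give a Tag.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up fragment shaped like the page discussed above.
html = """<div class="view-content">
<div class="views-row"><div class="type type_18">Eleemosynary grant</div></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", {"class": "view-content"})

results = []
for child in content:  # iterates over the direct children of the Tag
    kind = "string" if isinstance(child, NavigableString) else "tag"
    # For strings this is str.find (substring search);
    # for tags it is BeautifulSoup's Tag.find (element search).
    results.append((kind, child.find("div")))

for kind, found in results:
    print(kind, repr(found))
```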
How to fix?
Select your elements more specifically, in this case the views-row elements:
sections = soup.find_all('div', {'class': 'views-row'})
While iterating over the sections, you can select the expected value from each one:
sections = soup.find_all('div', {'class': 'views-row'})
for section in sections:
    print(section.select_one('div[class*="type_"]').text)
Example
This scrapes all the information and creates a DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
url = '...'  # put the link here
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
sections = soup.find_all('div', {'class': 'views-row'})
for section in sections:
    d = {}
    for row in section.select('div.views-field'):
        # label span as key, text of the second span as value
        d[row.span.text] = row.select_one('span:nth-of-type(2)').get_text('|', strip=True)
    data.append(d)
df = pd.DataFrame(data)
# remove ': ' from the headers and set them all to lower case
df.columns = df.columns.str.lower().str.replace(': ', '')
...

I think you wanted to paginate with a for loop and range() and grab the RRR value. I've handled the pagination through the page number in the long URL.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = '...'  # insert the URL here, with a {page} placeholder for the page number
data = []
for page in range(1, 7):
    req = requests.get(url.format(page=page))
    soup = BeautifulSoup(req.content, 'lxml')
    for r in soup.select('[class="views-field views-field-title-1"] span:nth-child(2)'):
        rr = list(r.stripped_strings)[-1]
        # print(rr)
        data.append(rr)
df = pd.DataFrame(data, columns=['RRR'])
print(df)
# df.to_csv('data.csv', index=False)
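For url.format(page=page) to have any effect, the URL string must contain a {page} placeholder. A small sketch with a hypothetical address (example.com is an assumption; the real URL is not given above) shows which pages range(1, 7) produces:

```python
# Hypothetical URL template; the real address is not given above.
url = "https://example.com/results?page={page}"

# range(1, 7) yields 1..6, so six page URLs are built.
page_urls = [url.format(page=page) for page in range(1, 7)]

for page_url in page_urls:
    print(page_url)
```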

Empty Dataframe when scraping specific column from website

I want to scrape a specific column (the company details column) from the CNBC Nasdaq 100 website, specifically for the Adobe stock. Below is a snippet of my code:
# Importing Libraries
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
def get_company_info(url):
    original_url = url
    key = {}
    l = []
    page_response = requests.get(url, timeout=240)
    page_content = BeautifulSoup(page_response.content, "html.parser")
    name = page_content.find('div',{"class":"quote-section-header large-header"}).find("span",{"class":"symbol"}).text
    description = page_content.find_all('div',{"class":"moduleBox"})
    for items in description:
        for i in range(len(items.find_all("tr"))-1):
            # Gather data
            key["stock_desc"] = items.find_all("td", {"class":"desc"})[i].find('div',attrs={'id':'descLong'}).text
            shares = items.find_all("td").find("table",attrs={"id":"shares"})
            for rest_of_items in shares:
                for i in range(len(items.find_all("tr"))-1):
                    key["stock_outstanding-shares"] = items.find_all("td", {"class":"bold aRit"})[i].text
                    key["stock_ownership"] = items.find_all("td", {"class":"bold aRit"})[i].text
                    key["stock_market_cap"] = items.find_all("td", {"class":"bold aRit"})[i].text
                    key["stock_lastSplit"] = items.find_all("td", {"class":"bold aRit"})[i].text
                    # Print ("")
                    l.append(key)
    key['name'] = name
    df = pd.DataFrame(l)
    print(df)
    return key, df
get_company_info("https://www.cnbc.com/quotes/?symbol=ADBE&tab=profile")
I'd like the result as a DataFrame so that I can export it to a CSV file, but my code keeps producing an empty DataFrame.

The information you are looking for is not available at the URL you requested, because the page fetches it with JavaScript, which in turn requests a different URL that provides the data.
Example code
from bs4 import BeautifulSoup
import requests
page=requests.get("https://apps.cnbc.com/view.asp?symbol=ADBE.O&uid=stocks/summary")
soup = BeautifulSoup(page.content, 'html.parser')
Name=soup.find("h5",id="companyName").text
stock_desc= soup.find("div",id="descLong").text
table=soup.find("table",id="shares")
details=table.find_all("td", class_="bold aRit")
stock_outstanding_shares= details[0].text
stock_ownership= details[1].text
stock_market_cap= details[2].text
stock_lastSplit= details[3].text
You can then create a DataFrame from these values and export it to CSV.
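A minimal sketch of that last step, assuming the scraped strings have been collected into a single dict. The values below are placeholders, not real data, and record is a hypothetical name:

```python
import pandas as pd

# Placeholder values standing in for the scraped strings (not real data).
record = {
    "name": "Example Corp",
    "stock_desc": "Example description",
    "stock_outstanding_shares": "100M",
    "stock_ownership": "50%",
    "stock_market_cap": "1B",
    "stock_lastSplit": "2:1",
}

# A list with one dict gives a DataFrame with one row and one column per key.
df = pd.DataFrame([record])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file name to to_csv instead would write the same content to disk.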

Unable to print once to get all the data altogether

I've written a script in Python to scrape the tabular content from a webpage. The first column of the main table holds the names. Some names link to another page; others are plain names without a link. My intention is to parse just the row when a name has no link. When the name does link to another page, the script should first parse the row from the main table and then follow the link to parse the associated information from the table at the bottom, under the title Companies. Finally, everything should be written to a CSV file.
site link
I've tried so far:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("table tr")[1:]:
    if not item.select_one("td a[href]"):
        first_table = [i.text for i in item.select("td")]
        print(first_table)
    else:
        first_table = [i.text for i in item.select("td")]
        print(first_table)
        url = urljoin(base,item.select_one("td a[href]").get("href"))
        resp = requests.get(url)
        soup_ano = BeautifulSoup(resp.text,"lxml")
        for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
            associated_info = [elem.text for elem in elems.select("td")]
            print(associated_info)
The script above does almost everything, but I can't work out the logic to print once, rather than three times, so that I get all the data together and can write it to a CSV file.

Put all your scraped data into a list (here I've called it associated_info). Then all the data is in one place, and you can iterate over the list to write it out to a CSV if you like:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
associated_info = []
for item in soup.select("table tr")[1:]:
    if not item.select_one("td a[href]"):
        associated_info.append([i.text for i in item.select("td")])
    else:
        associated_info.append([i.text for i in item.select("td")])
        url = urljoin(base,item.select_one("td a[href]").get("href"))
        resp = requests.get(url)
        soup_ano = BeautifulSoup(resp.text,"lxml")
        for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
            associated_info.append([elem.text for elem in elems.select("td")])
print(associated_info)
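Once every row sits in one list, writing it out is a single call to the standard csv module. A minimal sketch with made-up rows (the names and values are assumptions, not scraped data); rows_to_csv is a hypothetical helper that returns the CSV text so the result is easy to inspect:

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise a list of rows (lists of strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)  # one CSV line per inner list
    return buffer.getvalue()


# Made-up rows standing in for the scraped table cells.
associated_info = [
    ["Doe, Jane", "Director"],
    ["Example Ltd", "Active"],
]
csv_text = rows_to_csv(associated_info)
print(csv_text)
```

To write to a file instead, open it with newline="" and pass the file object to csv.writer in place of the StringIO buffer.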

How to store URL from BeautifulSoup results to a list and then to a table

I'm scraping a real estate webpage to collect some URLs and then build a table from them.
https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html
I have spent days trying to store the results in a list or dictionary so that I can then create a table, but I'm really stuck.
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')
#Extract URL
link_text = ''
URL=[]
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
    link_text = a['href']
    URL='https://www.zonaprop.com.ar'+link_text
    print(URL)
The output looks fine to me:
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html#map
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html#map
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html
The thing is that the output consists of real links (you can click on them and go to the page).
But when I try to store them in a new variable (a list or dictionary with the column name 'Address', to join with PlacesDf, which has the same 'Address' column), or to convert them to a table, I cannot find a solution. In fact, when I try to convert to pandas:
Address = pd.dataframe(URL)
it only creates a one-row table.
I expect to see something like this:
Addresses=['https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map',
'https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html',...]
or a dictionary, or anything else I can turn into a table with pandas.

You should do the following:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')
#Extract URL
all_url = []
link_text = ''
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
    link_text = a['href']
    URL='https://www.zonaprop.com.ar'+link_text
    print(URL)
    all_url.append(URL)
df = pd.DataFrame({"URLs":all_url}) #replace "URLs" with your desired column name
Hope this helps.

I don't know where you are getting lat and lon from, and I am making an assumption about the address. I can see you have a lot of duplicates in the URLs you currently return. I would suggest the following CSS selectors to target just the listing links. These are class selectors, so they are faster than your current method.
Use the len of the returned list of links to define the row dimension; you already have the columns.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re
r = requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html')
soup = bs(r.content, 'lxml') #'html.parser'
links = ['https://www.zonaprop.com.ar' + item['href'] for item in soup.select('.aviso-data-title a')]
locations = [re.sub('\n|\t','',item.text).strip() for item in soup.select('.aviso-data-location')]
df = pd.DataFrame(index=range(len(links)),columns= ['Address', 'Lat', 'Lon', 'Link'])
df.Link = links
df.Address = locations
print(df)
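The duplicates noted above (each listing appears several times, once with a #map fragment) can also be removed after collection. A minimal sketch, assuming that two links which differ only in the fragment count as the same listing; unique_links is a hypothetical helper name:

```python
from urllib.parse import urldefrag


def unique_links(links):
    """Drop URL fragments, then remove duplicates while keeping the order."""
    without_fragment = (urldefrag(link).url for link in links)
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(without_fragment))


links = [
    "https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map",
    "https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html",
    "https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html",
    "https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html",
]
cleaned = unique_links(links)
print(cleaned)
```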

Scraping row elements from a dynamic table using Bs4

I'm attempting to scrape the list of tickers for the Nasdaq 100 from the CNBC website: https://www.cnbc.com/nasdaq-100/. I am new to Beautiful Soup, so if there is a more straightforward way to scrape the list and save the data, I am interested in any solution.
The code below does not return an error; however, it does not return any tickers either.
import bs4 as bs
import pickle # serializes any python object so that we do not have to go back to the CNBC website
# to get the tickers each time we want to use the 100 ticker symbols
import requests
def save_nasdaq_tickers():
    ''' We start by getting the source code for CNBC. We will use the requests module for this'''
    resp = requests.get('https://www.cnbc.com/nasdaq-100')
    soup = bs.BeautifulSoup(resp.text,"lxml") # we use txt when the response comes from the requests module, I think because resp.txt is the text of the source code
    table = soup.find('table',{'class':"data quoteTable"}) # we want the table of the class we think matches the table data we want from cnbc
    tickers = [] # empty tickers list
    # Next we iterate through the table.
    for row in table.findAll('tr')[1:]: # we want all table rows except the header row, which should be row 0, so 1 onward [:1]
        ticker = row.findAll('td')[0].txt # td is the columns of the table; 0 is the first column, which I perceived to be the tickers
        # We specify .txt because it is a soup object
        tickers.append(ticker)
    # Save this list of tickers using pickle and with open???
    with open("Nasdaq100Tickers","wb") as f: # name the file Nasdaq100... etc
        pickle.dump(tickers,f) # dumping the tickers to file f
    print(tickers)
    return tickers
save_nasdaq_tickers()

There is just a small mistake in your code, which is why you got nothing in your tickers: change ticker = row.findAll('td')[0].txt to ticker = row.findAll('td')[0].text. However, to get the full content of a dynamic page, you need Selenium.
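import bs4 as bs
from selenium import webdriver

def save_nasdaq_tickers():
    # create the driver before the try block so that it always exists in finally
    dr = webdriver.Chrome()
    try:
        dr.get("https://www.cnbc.com/nasdaq-100")
        text = dr.page_source
    finally:
        # shut the browser down even if loading the page fails
        dr.quit()
    soup = bs.BeautifulSoup(text,"lxml")
    table = soup.find('table',{'class':"data quoteTable"})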
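To see why .txt produced no tickers and raised no error: on a BeautifulSoup Tag, an unknown attribute name is treated as a search for a child tag with that name, so cell.txt looks for a txt element and returns None. A small sketch with a made-up table row (the HTML is an assumption, not the CNBC markup):

```python
from bs4 import BeautifulSoup

# Made-up row standing in for one line of the quote table.
html = "<table><tr><td>AAPL</td><td>Apple Inc</td></tr></table>"
cell = BeautifulSoup(html, "html.parser").find("td")

missing = cell.txt   # searches for a child <txt> tag, finds none: None
ticker = cell.text   # the text content of the cell

print(repr(missing), repr(ticker))
```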

You can mimic the XHR request the page makes and parse out the JSON containing the data you are after:
import requests
import pandas as pd
import json
from bs4 import BeautifulSoup
url = 'https://quote.cnbc.com/quote-html-webservice/quote.htm?partnerId=2&requestMethod=quick&exthrs=1&noform=1&fund=1&output=jsonp&symbols=AAL|AAPL|ADBE|ADI|ADP|ADSK|ALGN|ALXN|AMAT|AMGN|AMZN|ATVI|ASML|AVGO|BIDU|BIIB|BMRN|CDNS|CELG|CERN|CHKP|CHTR|CTRP|CTAS|CSCO|CTXS|CMCSA|COST|CSX|CTSH&callback=quoteHandler1'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
# remove the JSONP wrapper quoteHandler1( ... ) around the JSON
s = soup.select('html')[0].text.strip('quoteHandler1(').strip(')')
data = json.loads(s)
# pd.json_normalize needs pandas 1.0 or later; older versions use: from pandas.io.json import json_normalize
data = pd.json_normalize(data)
df = pd.DataFrame(data)
print(df[['symbol','last']])
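Note that str.strip() removes a set of characters from both ends rather than a prefix, so it can eat into the payload for other callback names or data. A more defensive way to unwrap JSONP is sketched below with the standard library only. The sample payload is made up, and unwrap_jsonp is a hypothetical helper name; the real response structure is not shown above.

```python
import json
import re


def unwrap_jsonp(payload: str):
    """Return the JSON value wrapped in a JSONP call such as callback({...})."""
    # Callback name, opening parenthesis, the body, closing parenthesis,
    # and an optional trailing semicolon.
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", payload, flags=re.DOTALL)
    if match is None:
        raise ValueError("payload is not JSONP")
    return json.loads(match.group(1))


# Made-up payload using the callback name from the URL above.
sample = 'quoteHandler1({"quotes": [{"symbol": "AAL", "last": "1.00"}]})'
data = unwrap_jsonp(sample)
print(data["quotes"][0]["symbol"])
```

With a live response, res.text could be passed to the helper directly, without going through BeautifulSoup.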
