I have the following code which is trying to scrape the main table on this page. I need to get the NORAD ID and Launch date from the 2nd and 4th columns. However, I can't get BeautifulSoup to find the table by going off its ID.
import requests
from bs4 import BeautifulSoup
data = []
URL = 'https://www.n2yo.com/satellites/?c=52&srt=2&dir=1'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find("table", id="categoriestab")
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
print(data)
To get the NORAD ID and Launch date, you can try this:
import pandas as pd
url = "https://www.n2yo.com/satellites/?c=52&srt=2&dir=0"
df = pd.read_html(url)
data = df[2].drop(["Name", "Int'l Code", "Period[minutes]", "Action"], axis=1)
print(data)
The output will be a DataFrame containing just the NORAD ID and Launch date columns.
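If you also want the result on disk, a small follow-up writing those columns to a CSV could look like this (the file name is just an example):
data.to_csv('norad_launch_dates.csv', index=False)  # 'data' is the DataFrame built above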
Change
soup = BeautifulSoup(page.content, 'html.parser')
to
soup = BeautifulSoup(page.content, 'lxml')
If you print the soup and search it, you will not find the id you are looking for in the output. This most likely means the page is JavaScript-rendered. You can look into using PhantomJS or Selenium. I used Selenium to solve a problem like this that I ran into. You will need to download ChromeDriver: https://chromedriver.chromium.org/downloads. Here is the code that I used:
from selenium import webdriver  # plus the BeautifulSoup import from your original script
options = webdriver.ChromeOptions()  # e.g. options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=<YOUR PATH>, options=options)
driver.get('YOUR URL')
driver.implicitly_wait(1)
soup_file = BeautifulSoup(driver.page_source, 'html.parser')
What this does is set up the driver, connect to the URL, wait until it's loaded, grab all the rendered HTML and put it into the BeautifulSoup object.
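If the rendered source really does contain the table, you can then go back to the original parsing. Here is a minimal sketch against the question's page; the column positions for NORAD ID and Launch date are taken from the question, and chromedriver is assumed to be on your PATH:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('https://www.n2yo.com/satellites/?c=52&srt=2&dir=1')
driver.implicitly_wait(1)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

table = soup.find('table', id='categoriestab')
data = []
for row in table.find_all('tr')[1:]:  # skip the header row
    cols = [td.text.strip() for td in row.find_all('td')]
    if len(cols) >= 4:
        data.append((cols[1], cols[3]))  # 2nd column = NORAD ID, 4th = Launch date
print(data)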
Hope this helps!
I'm writing a Python script to extract records of all people on a site using selenium, BeautifulSoup and pandas. However, I don't know how to go about it because the site is designed so that someone has to search first before getting any results. For testing purposes, I'm passing a search value and driving the search via selenium. The issue is that after writing the script interactively in IPython I get the desired results, but the same code throws an error when run as a Python file via the python command.
Code:
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd
import requests
import re
# br (the webdriver instance) and url are defined earlier in the full script (not shown here)
br.get(url)
content = br.page_source
soup = BeautifulSoup(content, 'lxml')
sleep(2)
sName = br.find_element_by_xpath("/html/body/div[1]/div[2]/section/div[2]/div/div/div/div/div/div/div[2]/form/div[1]/div/div/input")
sleep(3)
sName.send_keys("martin")
br.find_element_by_xpath("//*[@id='provider']/div[1]/div/div/div/button").click()
sleep(3)
table = soup.find('table')
tbody = table.find_all('tbody')
body = tbody.find_all('tr')
#
# get column heads
head = body[0]
body_rows = body[1:]
headings = []
for item in head.find_all('th'):
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)
#declare an empty list for holding all records
all_rows = []
# loop through all table rows to get all table datas
for row_num in range(len(body_rows)):
    row = []
    for row_item in body_rows[row_num].find_all('td'):
        stripA = re.sub("(\xa0)|(\n)|,", "", row_item.text)
        row.append(stripA)
    all_rows.append(row)
# match each record to its field name
# cols = ['name', 'license', 'xxx', 'xxxx']
df = pd.DataFrame(data=all_rows, columns=headings)
You don't need the overhead of a browser or to worry about waits. You can simply mimic the POST request the page makes:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
data = {'search_register': '1', 'search_text': 'Martin'}
r = requests.post('https://osp.nckenya.com/ajax/public', data=data)
soup = bs(r.content, 'lxml')
results = pd.read_html(str(soup.select_one('#datatable2')))
print(results)
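Since read_html returns a list of DataFrames, a short follow-up if you want the matched table on disk might be (file name chosen arbitrarily):
df = results[0]  # the single table matched by #datatable2
df.to_csv('providers.csv', index=False)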
Okay, so I need to scrape the following webpage: https://www.programmableweb.com/category/all/apis?deadpool=1
It's a list of APIs. There are approx 22,000 APIs to scrape.
I need to:
1) Get the URL of each API in the table (pages 1-889), and also to scrape the following info:
API name
Description
Category
Submitted
2) I then need to scrape a bunch of information from each URL.
3) Export the data to a CSV
The thing is, I'm a bit lost on how to think about this project. From what I can see, there are no AJAX calls being made to populate the table, which means I'm going to have to parse the HTML directly (right?)
In my head, the logic would be something like this:
Use the requests & BS4 libraries to scrape the table
Then, somehow grab the HREF from every row
Access that HREF, scrape the data, move onto the next one
Rinse and repeat for all table rows.
Am I on the right track, is this possible with requests & BS4?
Here are some screenshots of what I've been trying to explain.
Thank you SOO much for any help. This is hurting my head haha
Here we go using requests, BeautifulSoup and pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page='
num = int(input('How Many Page to Parse?> '))
print('please wait....')
name = []
desc = []
cat = []
sub = []
for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for item1 in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        name.append(item1.text)
    for item2 in soup.findAll('td', attrs={'class': 'views-field views-field-search-api-excerpt views-field-field-api-description hidden-xs visible-md visible-sm col-md-8'}):
        desc.append(item2.text)
    for item3 in soup.findAll('td', attrs={'class': 'views-field views-field-field-article-primary-category'}):
        cat.append(item3.text)
    for item4 in soup.findAll('td', attrs={'class': 'views-field views-field-created'}):
        sub.append(item4.text)
result = []
for item in zip(name, desc, cat, sub):
    result.append(item)
df = pd.DataFrame(
    result, columns=['API Name', 'Description', 'Category', 'Submitted'])
df.to_csv('output.csv')
print('Task Completed, Result saved to output.csv file.')
Now for the href parsing:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.programmableweb.com/category/all/apis?deadpool=0&page='
num = int(input('How Many Page to Parse?> '))
print('please wait....')
links = []
for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        for href in link.findAll('a'):
            result = 'https://www.programmableweb.com' + href.get('href')
            links.append(result)
spans = []
for link in links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    span = [span.text for span in soup.select('div.field span')]
    spans.append(span)
data = []
for item in spans:
    data.append(item)
df = pd.DataFrame(data)
df.to_csv('data.csv')
print('Task Completed, Result saved to data.csv file.')
In case you want those 2 CSV files merged together, here's the code:
import pandas as pd
a = pd.read_csv("output.csv")
b = pd.read_csv("data.csv")
merged = a.merge(b)
merged.to_csv("final.csv", index=False)
You should read more about scraping if you are going to pursue it.
from bs4 import BeautifulSoup
import csv, os, requests
from urllib import parse
def SaveAsCsv(list_of_rows):
    try:
        with open('data.csv', mode='a', newline='', encoding='utf-8') as outfile:
            csv.writer(outfile).writerow(list_of_rows)
    except PermissionError:
        print("Please make sure data.csv is closed\n")
if os.path.isfile('data.csv') and os.access('data.csv', os.R_OK):
print("File data.csv Already exists \n")
else:
SaveAsCsv([ 'api_name','api_link','api_desc','api_cat'])
BaseUrl = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page={}'
for i in range(1, 890):
    print('## Getting Page {} out of 889'.format(i))
    url = BaseUrl.format(i)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    table_rows = soup.select('div.view-content > table[class="views-table cols-4 table"] > tbody tr')
    for row in table_rows:
        tds = row.select('td')
        api_name = tds[0].text.strip()
        api_link = parse.urljoin(url, tds[0].find('a').get('href'))
        api_desc = tds[1].text.strip()
        api_cat = tds[2].text.strip() if len(tds) >= 3 else ''
        SaveAsCsv([api_name, api_link, api_desc, api_cat])
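Once the run finishes, the CSV written by SaveAsCsv can be loaded straight back for further work; a minimal follow-up, assuming the script above completed and produced data.csv:
import pandas as pd

df = pd.read_csv('data.csv')  # header row was written by the else-branch above
print(df.head())
print('APIs scraped:', len(df))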
I'm trying to grab the table on this page https://nces.ed.gov/collegenavigator/?id=139755 under the Net Price expandable object. I've gone through tutorials for BS4, but I get so confused by the complexity of the html in this case that I can't figure out what syntax and which tags to use.
Here's a screenshot of the table and html I'm trying to get:
This is what I have so far. How do I add other tags to narrow down the results to just that one table?
import requests
from bs4 import BeautifulSoup
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
soup = soup.find(id="divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02")
print(soup.prettify())
Once I can parse that data, I will format into a dataframe with pandas.
In this case I'd probably just use pandas to retrieve all tables, then index in for the appropriate one:
import pandas as pd
table = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')[10]
print(table)
If you are worried about future ordering, you could loop over the tables returned by read_html and test for the presence of a unique string to identify the right table, or use bs4's :has and :contains functionality (bs4 4.7.1+) to identify the right table to then pass to read_html, or continue handling it with bs4.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
r = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = bs(r.content, 'lxml')
table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
print(table)
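The other option mentioned above, looping over the tables returned by read_html and testing each one for an identifying string, might look like this (using "Average net price" as the marker, which is an assumption about the target table's text):
import pandas as pd

tables = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')
# keep the first table whose cells mention the marker string
target = next(t for t in tables
              if t.astype(str).apply(lambda col: col.str.contains('Average net price', na=False)).any().any())
print(target)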
OK, maybe this can help you; I added pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find("div", {"id": "divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02"})
table = div.findAll("table", {"class": "tabular"})[1]
l = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    if td:
        row = [i.text for i in td]
        l.append(row)
df = pd.DataFrame(l, columns=["AVERAGE NET PRICE BY INCOME", "2015-2016", "2016-2017", "2017-2018"])
print(df)
Here is a basic script to scrape that first table in that accordion:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
print(desired_table.prettify())
I assume you only want the values within the table, so I did an overkill version of this as well that will combine the column names and values together:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
header_row = desired_table.find_all('th')
headers = []
for header in header_row:
    header_text = header.get_text()
    headers.append(header_text)
money_values = []
data_row = desired_table.find_all('td')
for rows in data_row:
    row_text = rows.get_text()
    money_values.append(row_text)
for yrs, money in zip(headers, money_values):
    print(yrs, money)
This will print out the following:
Average net price
2015-2016 $13,340
2016-2017 $15,873
2017-2018 $16,950
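If you would rather end up with a DataFrame than printed pairs, a small sketch continuing from the headers and money_values lists built above (the column names here are just placeholders):
import pandas as pd

df = pd.DataFrame(list(zip(headers, money_values)), columns=['Heading', 'Value'])
print(df)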
I would like to get all the values from the table "Elektriciteit NL" on https://powerhouse.net/forecast-prijzen-onbalans/. However, after endlessly trying to find the right XPath using selenium, I was not able to scrape the table.
I tried to use "inspect" and copy the XPath from the table to identify the length of the table for scraping later. After this failed, I tried to use "contains", but this was not successful either. Afterwards I tried some things using BeautifulSoup, also without any luck.
#%%
import pandas as pd
from selenium import webdriver
#%% powerhouse Elektriciteit NL base & peak
url = "https://powerhouse.net/forecast-prijzen-onbalans/"
#%% open webpagina
driver = webdriver.Chrome(executable_path = path + 'chromedriver.exe')
driver.get(url)
#%%
prices = []
#loop for values in table
for j in range(len(driver.find_elements_by_xpath('//tr[@id="endex_nl_forecast"]/div[3]/table/tbody/tr[1]/td[4]'))):
    base = driver.find_elements_by_xpath('//tr[@id="endex_nl_forecast"]/div[3]/table/tbody/tr[1]/td[4]')[j]
#%%
#trying with BeautifulSoup
from bs4 import BeautifulSoup
import requests
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
table = soup.find('table', id = 'endex_nl_forecast')
rows = soup.find_all('tr')
I would like to have the table in a dataframe and understand how xpath exactly works. I'm kind of new to the whole concept.
If you are open to ways other than xpath you could do this without selenium or xpath:
you could just use pandas
import pandas as pd
table = pd.read_html('https://powerhouse.net/forecast-prijzen-onbalans/')[4]
If you want a text representation of the icons, you could extract the class name of the svg, which describes the arrow direction, from the appropriate tds.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://powerhouse.net/forecast-prijzen-onbalans/')
soup = bs(r.content, 'lxml')
table = soup.select_one('#endex_nl_forecast table')
rows = []
headers = [i.text for i in table.select('th')]
for tr in table.select('tr')[1:]:
    rows.append([i.text if i.svg is None else i.svg['class'][2].split('-')[-1] for i in tr.select('td')])
df = pd.DataFrame(rows, columns = headers)
print(df)
You can use the Selenium driver to locate the table and its contents:
from selenium import webdriver
import time
driver = webdriver.Chrome()  # assumes chromedriver is available on your PATH
url = 'https://powerhouse.net/forecast-prijzen-onbalans/'
driver.get(url)
time.sleep(3)
To Read Table Headers & Print
tableHeader = driver.find_elements_by_xpath("//*[#id='endex_nl_forecast']//thead//th")
print(tableHeader)
for header in tableHeader:
    print(header.text)
To Find number of rows in the table
rowElements = driver.find_elements_by_xpath("//*[#id='endex_nl_forecast']//tbody/tr")
print('Total rows in the table:', len(rowElements))
To print each rows as is
for row in rowElements:
    print(row.text)
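If you want the same data as a DataFrame, here is a sketch continuing from tableHeader and rowElements above; it assumes each body row has one cell per header and uses the same older find_elements_by_xpath API as this answer:
import pandas as pd

columns = [h.text for h in tableHeader]
data = [[cell.text for cell in row.find_elements_by_xpath('./td')] for row in rowElements]
df = pd.DataFrame(data, columns=columns)
print(df.head())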
I've written a script in Python to scrape the tabular content from a webpage. In the first column of the main table there are names. Some names have links leading to another page, some are just names without any link. My intention is to parse the rows when a name has no link to another page. However, when the name has a link to another page, the script will first parse the concerning rows from the main table and then follow that link to parse the associated information for that name from the table located at the bottom under the title Companies. Finally, write them to a csv file.
I've tried so far:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("table tr")[1:]:
    if not item.select_one("td a[href]"):
        first_table = [i.text for i in item.select("td")]
        print(first_table)
    else:
        first_table = [i.text for i in item.select("td")]
        print(first_table)
        url = urljoin(base, item.select_one("td a[href]").get("href"))
        resp = requests.get(url)
        soup_ano = BeautifulSoup(resp.text, "lxml")
        for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
            associated_info = [elem.text for elem in elems.select("td")]
            print(associated_info)
My above script can do almost everything, but I can't create any logic to print once, rather than printing three times, to get all the data altogether so that I can write it to a csv file.
Put all your scraped data into a list; here I've called the list associated_info. Then all the data is in one place and you can iterate over the list to print it out or write it to a CSV if you like...
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
associated_info = []
for item in soup.select("table tr")[1:]:
    if not item.select_one("td a[href]"):
        associated_info.append([i.text for i in item.select("td")])
    else:
        associated_info.append([i.text for i in item.select("td")])
        url = urljoin(base, item.select_one("td a[href]").get("href"))
        resp = requests.get(url)
        soup_ano = BeautifulSoup(resp.text, "lxml")
        for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
            associated_info.append([elem.text for elem in elems.select("td")])
print(associated_info)
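To finish the job and write the collected rows to a csv file, a minimal follow-up using the csv module could look like this (the file name is arbitrary):
import csv

with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(associated_info)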