I would like to get all the values from the table "Elektriciteit NL" on https://powerhouse.net/forecast-prijzen-onbalans/. However, after endlessly trying to find the right XPath with Selenium, I was not able to scrape the table.
I tried to use "inspect" and copy the XPath of the table to identify its length for scraping later. After this failed I tried to use "contains", but this was not successful either. Afterwards I tried some things with BeautifulSoup, without any luck.
#%%
import pandas as pd
from selenium import webdriver
#%% powerhouse Elektriciteit NL base & peak
url = "https://powerhouse.net/forecast-prijzen-onbalans/"
#%% open webpagina
driver = webdriver.Chrome(executable_path=path + 'chromedriver.exe')  # path: folder containing chromedriver.exe, defined elsewhere
driver.get(url)
#%%
prices = []
#loop for values in table
for j in range(len(driver.find_elements_by_xpath('//tr[@id="endex_nl_forecast"]/div[3]/table/tbody/tr[1]/td[4]'))):
    base = driver.find_elements_by_xpath('//tr[@id="endex_nl_forecast"]/div[3]/table/tbody/tr[1]/td[4]')[j]
#%%
#trying with BeautifulSoup
from bs4 import BeautifulSoup
import requests
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
table = soup.find('table', id = 'endex_nl_forecast')
rows = soup.find_all('tr')
I would like to have the table in a DataFrame and to understand how XPath actually works. I'm kind of new to the whole concept.
If you are open to other approaches, you could do this without Selenium or XPath at all: you could just use pandas.
import pandas as pd
table = pd.read_html('https://powerhouse.net/forecast-prijzen-onbalans/')[4]
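The index 4 is simply where the "Elektriciteit NL" table happened to sit in the list that read_html returned at the time of writing. If you would rather not rely on the position, you could pick the table by a header you expect; a small sketch, where the header text 'Base' is an assumption:
import pandas as pd

# read_html returns one DataFrame per <table> on the page
tables = pd.read_html('https://powerhouse.net/forecast-prijzen-onbalans/')

# pick the table whose columns mention an expected header ('Base' is an assumed name)
target = next((t for t in tables if any('Base' in str(c) for c in t.columns)), None)
print(target)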
If you want a text representation of the icons, you can extract the class name of the svg that describes the arrow direction from the appropriate tds.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://powerhouse.net/forecast-prijzen-onbalans/')
soup = bs(r.content, 'lxml')
table = soup.select_one('#endex_nl_forecast table')
rows = []
headers = [i.text for i in table.select('th')]
for tr in table.select('tr')[1:]:
    rows.append([i.text if i.svg is None else i.svg['class'][2].split('-')[-1] for i in tr.select('td')])
df = pd.DataFrame(rows, columns = headers)
print(df)
You can use the Selenium driver to locate the table and its contents:
from selenium import webdriver
import time

driver = webdriver.Chrome()
url = 'https://powerhouse.net/forecast-prijzen-onbalans/'
driver.get(url)
time.sleep(3)
To read the table headers and print them:
tableHeader = driver.find_elements_by_xpath("//*[@id='endex_nl_forecast']//thead//th")
print(tableHeader)
for header in tableHeader:
    print(header.text)
To find the number of rows in the table:
rowElements = driver.find_elements_by_xpath("//*[@id='endex_nl_forecast']//tbody/tr")
print('Total rows in the table:', len(rowElements))
To print each row as is:
for row in rowElements:
    print(row.text)
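Since the original goal was a DataFrame, you could also combine those located elements into pandas. A minimal sketch, assuming each body row has the same number of td cells as there are th headers:
import pandas as pd

headers = [th.text for th in tableHeader]
data = []
for row in rowElements:
    cells = row.find_elements_by_xpath('./td')   # the td cells of this row
    data.append([cell.text for cell in cells])

df = pd.DataFrame(data, columns=headers)
print(df)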
Related
I need to scrape all the table data from the Rajya Sabha website. However, instead of scraping from the URL built for each page, the code keeps scraping the original table page after page.
from selenium import webdriver
import chromedriver_binary
import os
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import csv
import time
import lxml
url = 'https://rsdebate.nic.in/simple-search?query=climate+change&sort_by=dc.identifier.sessionnumber_sort&order=asc&rpp=100&etal=0&start=0'
#url_call = f"https://rsdebate.nic.in/simple-search?query=climate+change&sort_by=dc.identifier.sessionnumber_sort&order=asc&rpp=100&etal=0&start={i}"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('table', id='sam_table')
headers = []
for a in table1.find_all('th'):
    title = a.text
    headers.append(title)
rsdata = pd.DataFrame(columns = headers)
rsdata.to_csv('rs_debate_data.csv', mode ='a',index=False)
# Create a for loop to fill rajya sabha data
for k in range(0,96):
    url_call = f"https://rsdebate.nic.in/simple-search?query=climate+change&sort_by=dc.identifier.sessionnumber_sort&order=asc&rpp=100&etal=0&start={k}"
    page = requests.get(url_call)
    for j in table1.find_all('tr')[1:]:
        row_data = j.find_all('td')
        row = [i.text for i in row_data]
        length = len(rsdata)
        rsdata.loc[length] = row
    rsdata.to_csv('rs_debate_data.csv', mode='a', index=False, header=False)
    print(k)
# Export to csv
# Try to read csv
#rs_data = pd.read_csv('rs_debate_data.csv')
I was trying to scrape only the rows related to the keyword "climate change" in the Debate Title column of the table.
for k in range(0,96):
    url_call = "..."
    page = requests.get(url_call)
    for j in table1.find_all('tr')[1:]:
This loop does a find_all() on the original table1 result, not on the page it just fetched, so every iteration re-reads the same table from the first page. You need to parse each response with BeautifulSoup and re-locate the table inside the loop.
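A minimal sketch of the corrected loop, reusing the names from the question code above:
for k in range(0, 96):
    url_call = f"https://rsdebate.nic.in/simple-search?query=climate+change&sort_by=dc.identifier.sessionnumber_sort&order=asc&rpp=100&etal=0&start={k}"
    page = requests.get(url_call)
    soup = BeautifulSoup(page.text, 'lxml')        # parse the page that was just fetched
    table1 = soup.find('table', id='sam_table')    # re-locate the table in that page, not the first one
    for j in table1.find_all('tr')[1:]:
        row = [i.text for i in j.find_all('td')]
        rsdata.loc[len(rsdata)] = row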
I'm writing a Python script to extract records of all people on a site using Selenium, BeautifulSoup and pandas. However, I don't know how to go about it, because the site is designed so that someone has to search first before getting any results. For test purposes, I'm passing a search value and driving the search via Selenium. The issue is that when I run the script in an IPython shell I get the desired results, but the same code throws an error when run as a Python file via the python command.
code
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd
import requests
import re
br.get(url)  # br (the webdriver) and url are presumably defined earlier in the full script
content = br.page_source
soup = BeautifulSoup(content, 'lxml')
sleep(2)
sName = br.find_element_by_xpath("/html/body/div[1]/div[2]/section/div[2]/div/div/div/div/div/div/div[2]/form/div[1]/div/div/input")
sleep(3)
sName.send_keys("martin")
br.find_element_by_xpath("//*[@id='provider']/div[1]/div/div/div/button").click()
sleep(3)
table = soup.find('table')
tbody = table.find_all('tbody')
body = tbody.find_all('tr')
#
# get column heads
head = body[0]
body_rows = body[1:]
headings = []
for item in head.find_all('th'):
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)
#declare an empty list for holding all records
all_rows = []
# loop through all table rows to get all table datas
for row_num in range(len(body_rows)):
    row = []
    for row_item in body_rows[row_num].find_all('td'):
        stripA = re.sub("(\xa0)|(\n)|,", "", row_item.text)
        row.append(stripA)
    all_rows.append(row)
# match each record to its field name
# cols = ['name', 'license', 'xxx', 'xxxx']
df = pd.DataFrame(data=all_rows, columns=headings)
You don't need the overhead of a browser or to worry about waits. You can simply mimic the POST request the page makes:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
data = {'search_register': '1', 'search_text': 'Martin'}
r = requests.post('https://osp.nckenya.com/ajax/public', data=data)
soup = bs(r.content, 'lxml')
results = pd.read_html(str(soup.select_one('#datatable2')))
print(results)
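read_html gives back a list of DataFrames, so the table itself is the first element; for example:
df = results[0]
print(df.head())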
I have the following code which tries to scrape the main table on this page. I need to get the NORAD ID and Launch date, the 2nd and 4th columns. However, I can't get BeautifulSoup to find the table by its ID.
import requests
from bs4 import BeautifulSoup
data = []
URL = 'https://www.n2yo.com/satellites/?c=52&srt=2&dir=1'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find("table", id="categoriestab")
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
print(data)
For getting the NORAD ID and Launch date, you can try this:
import pandas as pd
url = "https://www.n2yo.com/satellites/?c=52&srt=2&dir=0"
df = pd.read_html(url)
data = df[2].drop(["Name", "Int'l Code", "Period[minutes]", "Action"], axis=1)
print(data)
Change
soup = BeautifulSoup(page.content, 'html.parser')
to
soup = BeautifulSoup(page.content, 'lxml')
The lxml parser handles the page's imperfect markup differently, which lets find() locate the table by its id.
If you print the soup and search it, you will not find the id you are looking for in the output. This most likely means the page is JavaScript-rendered. You can look into using PhantomJS or Selenium. I used Selenium to solve a problem like this that I ran into. You will need to download the Chrome driver: https://chromedriver.chromium.org/downloads. Here is the code that I used.
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(executable_path=<YOUR PATH>, options=options)
driver.get('YOUR URL')
driver.implicitly_wait(1)
soup_file = BeautifulSoup(driver.page_source, 'html.parser')
What this does is set up the driver to connect to the URL, wait until it is loaded, grab all the rendered HTML and put it into the BeautifulSoup object.
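From there you can also hand the rendered HTML straight to pandas instead of walking the soup by hand. A small sketch, assuming the table keeps its categoriestab id once the JavaScript has run and that the column labels are 'NORAD ID' and 'Launch date' as in the question:
import pandas as pd

# driver.page_source now holds the JavaScript-rendered HTML
tables = pd.read_html(driver.page_source, attrs={'id': 'categoriestab'})
df = tables[0]
print(df[['NORAD ID', 'Launch date']])   # column labels assumed from the question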
Hope this helps!
I have the below link:
http://www.igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Pune
From this page I want to scrape the data into a properly formatted Excel file. Each SurveyNo link reveals extra data when it is clicked, and I want the row-wise data together with the data shown on clicking the survey number.
I also want the format that I have attached in the image (desired output in Excel).
import urllib.request
from bs4 import BeautifulSoup
import csv
import os
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.keys import Keys
import time
url = 'http://www.igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Pune'
chrome_path =r'C:/Users/User/AppData/Local/Programs/Python/Python36/Scripts/chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_path)
driver.implicitly_wait(10)
driver.get(url)
Select(driver.find_element_by_name('ctl00$ContentPlaceHolder5$ddlTaluka')).select_by_value('5')
Select(driver.find_element_by_name('ctl00$ContentPlaceHolder5$ddlVillage')).select_by_value('1872')
soup=BeautifulSoup(driver.page_source, 'lxml')
table = soup.find("table" , attrs = {'id':'ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate' })
with open('Baner.csv', 'w', encoding='utf-16', newline='') as csvfile:
    f = csv.writer(csvfile, dialect='excel')
    f.writerow(['SurveyNo', 'Subdivision', 'Open ground', 'Resident house', 'Offices', 'Shops', 'Industrial', 'Unit (Rs./)'])  # headers
    rows = table.find_all('tr')[1:]
    data = []
    for tr in rows:
        cols = tr.find_all('td')
        for td in cols:
            links = driver.find_elements_by_link_text('SurveyNo')
            l = len(links)
            data12 = []
            for i in range(l):
                newlinks = driver.find_elements_by_link_text('SurveyNo')
                newlinks[i].click()
                soup = BeautifulSoup(driver.page_source, 'lxml')
                td1 = soup.find("textarea", attrs={'class': 'textbox'})
                data12.append(td1.text)
            data.append(td.text)
            data.append(data12)
    print(data)
Please find the image; that is the format in which I need the scraped output.
You could do the following and simply re-arrange the columns at the end, along with any desired renaming. The assumption is that SurveyNo exists for all wanted rows. I extract the hrefs from the SurveyNo cells; these are actually executable javascript strings you can pass to execute_script to show the survey numbers, without worrying about stale elements etc.
from selenium import webdriver
import pandas as pd
url = 'http://www.igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Pune'
d = webdriver.Chrome()
d.get(url)
d.find_element_by_css_selector('[value="5"]').click()
d.find_element_by_css_selector('[value="1872"]').click()
tableElement = d.find_element_by_id('ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate')
table = pd.read_html(tableElement.get_attribute('outerHTML'))[0]
table.columns = table.iloc[0]
table = table.iloc[1:]
table = table[table.Select == 'SurveyNo'] #assumption SurveyNo exists for all wanted rows
surveyNo_scripts = [item.get_attribute('href') for item in d.find_elements_by_css_selector("#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate [href*='Select$']")]
i = 0
for script in surveyNo_scripts:
    d.execute_script(script)
    surveys = d.find_element_by_css_selector('textarea').text
    table.iloc[i]['Select'] = surveys
    i += 1
print(table)
#rename and re-order columns as required
table.to_csv(r"C:\Users\User\Desktop\Data.csv", sep=',', encoding='utf-8-sig',index = False )
If you put this in a loop over pages, you can collect the per-page DataFrames and concat them all, then write out in one go (my preference), or append to the csv after each page.
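A rough sketch of the concat-at-the-end variant, assuming a hypothetical village_values list of dropdown values to loop over (the element id is the same one used above):
import pandas as pd

dfs = []
for value in village_values:   # village_values is a hypothetical list of dropdown values
    d.find_element_by_css_selector(f'[value="{value}"]').click()
    tableElement = d.find_element_by_id('ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate')
    dfs.append(pd.read_html(tableElement.get_attribute('outerHTML'))[0])

result = pd.concat(dfs, ignore_index=True)   # combine everything once at the end
result.to_csv(r"C:\Users\User\Desktop\Data.csv", sep=',', encoding='utf-8-sig', index=False)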
I'm scraping a real estate webpage, trying to get some URLs in order to then create a table.
https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html
I have spent days trying to store the results in a list or dictionary so I can then create a table, but I'm really stuck.
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')
#Extract URL
link_text = ''
URL=[]
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
    link_text = a['href']
    URL = 'https://www.zonaprop.com.ar' + link_text
    print(URL)
OK, this output is fine for me:
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html#map
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html#map
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html
The thing is that the output consists of real links (you can click on them and go to the page).
But when I try to store them in a new variable (a list or dictionary with the column name 'Address', to join with PlacesDf on the same 'Address' column), convert them to a table, or any other trick, I cannot find the solution. In fact, when I try to convert to pandas with
Address = pd.dataframe(URL)
it only creates a one-row table.
I expect to see something like this:
Addresses = ['https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map',
             'https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html', ...]
or a dictionary, or whatever I can turn into a table with pandas.
You should do the following:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')
#Extract URL
all_url = []
link_text = ''
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
    link_text = a['href']
    URL = 'https://www.zonaprop.com.ar' + link_text
    print(URL)
    all_url.append(URL)
df = pd.DataFrame({"URLs":all_url}) #replace "URLs" with your desired column name
Hope this helps!
I don't know where you are getting lat and lon from, and I am making an assumption about the address. I can see you have a lot of duplicates in your current URL returns. I would suggest the following CSS selectors to target just the listing links; these are class selectors, so faster than your current method.
Use the len of that returned list of links to define the row dimension; you already have the columns.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re
r = requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html')
soup = bs(r.content, 'lxml') #'html.parser'
links = ['https://www.zonaprop.com.ar' + item['href'] for item in soup.select('.aviso-data-title a')]
locations = [re.sub('\n|\t','',item.text).strip() for item in soup.select('.aviso-data-location')]
df = pd.DataFrame(index=range(len(links)),columns= ['Address', 'Lat', 'Lon', 'Link'])
df.Link = links
df.Address = locations
print(df)
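If you later fill the PlacesDf frame from the question with coordinates, the two frames could then be joined on the shared Address column; a sketch assuming the address strings match exactly:
merged = df.merge(PlacesDf, on='Address', how='left')
print(merged)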