I am new to scraping with Python and have encountered a weird issue.
I am attempting to scrape OCR'd newspaper articles from a list of URLs using Selenium -- the proxy settings on the data source make this easier than other options.
However, I receive tracebacks for the text data every time I run my code. Here is the code that I am using:
article_links = []
for link in driver.find_elements_by_xpath('/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a'):
    links = link.get_attribute("href")
    article_links.append(links)
articles = []
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    driver.find_element_by_css_selector("#js-doc-explorer-show-additional-views").click()
    time.sleep(1)
    for article_text in driver.find_elements_by_css_selector("#ocr-container > div.fulltext-ocr.js-page-ocr"):
        articles.append(article_text)
I get closest to solving the issue by using .click(), which opens a hidden panel containing my data. However, with this code, the only row that gets filled is the last one in the dataset. Without the .click(), every row comes back empty. Changing the sleep settings does not help either.
The Xpath for the text data is:
/html/body/div[2]/main/section/div[2]/div[2]/section[2]/div/div[4]/text()
Alternatively, is there a way to get each link's source code and parse it with beautifulsoup after the fact?
UPDATE: There has to be something wrong with the loops -- I can get either the first or last values, but nothing in between.
In more recent versions of Selenium, the method find_elements_by_xpath() is deprecated. Is that the issue you are facing? If it is, add from selenium.webdriver.common.by import By and change the call to find_elements(By.XPATH, ...). Similarly, find_elements_by_css_selector() is replaced by find_elements(By.CSS_SELECTOR, ...).
You don't specify whether this is even the issue, but if it is, I hope this helps :-)
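For example, the two lookups from your code would become (a minimal sketch of the new-style calls, everything else unchanged):

from selenium.webdriver.common.by import By

# old: driver.find_elements_by_xpath(...)
links = driver.find_elements(By.XPATH, '/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a')

# old: driver.find_element_by_css_selector(...)
driver.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views").click()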
The solution is found by selecting the relevant (unique) class and specifying that it must actually contain text.
from selenium.webdriver.common.by import By

news = []
for article in article_links:
    driver2.get(article)
    driver2.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views").click()
    article_text = driver2.find_element(By.XPATH, '//div[@class="fulltext-ocr js-page-ocr"][contains(text()," ")]')
    news.append([article_text.text])
Classic case of "the code used to work, I changed nothing, and now it doesn't work any more". I'm trying to extract a list of unique appid values from this page, which I'm saving locally as roguelike.html.
The code I have looks like this, and it used to work as of a couple of months ago when I last ran it, but now the end result is a list containing a single None. Any ideas as to what's going wrong here?
from bs4 import BeautifulSoup

# read the locally saved page and parse it
text_file = open("roguelike.html", "rb")
steamdb_text = text_file.read()
text_file.close()
soup = BeautifulSoup(steamdb_text, "html.parser")

apps = []
for app in soup.find_all('tr'):
    apps.append(app.get('data-appid'))
appset = list(set(apps))
Is there a simpler way to get the unique appids from the page source? The individual elements I'm trying to cycle over and grab look like:
<tr class="app" data-appid="98821" data-cache="1533726913">
where I want all the unique data-appid values. I'm scratching my head trying to figure out whether the formatting of the page changed (it doesn't seem like it) or whether some version upgrade in Spyder, Python, or BeautifulSoup broke something that used to be working.
Any ideas?
I tried this code and it worked well for me. You should make sure that the HTML file you have is the right file; perhaps you've hit a captcha test in the saved HTML.
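As for a simpler way to get the unique appids: a minimal sketch, assuming the rows look like the <tr class="app" data-appid="98821" ...> sample above, is to select only the rows that actually carry the attribute, which also keeps None out of the result:

from bs4 import BeautifulSoup

with open("roguelike.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# CSS attribute selector: only <tr> elements that have a data-appid attribute
appids = {tr["data-appid"] for tr in soup.select("tr[data-appid]")}
print(sorted(appids))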
Here is my code:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://tieba.baidu.com/f?kw=比特币&ie=utf-8&tab=good')
driver.find_elements_by_css_selector('a.j_th_tit')[0].click()
a = driver.find_elements_by_css_selector('div.d_post_content.j_d_post_content.clearfix')
for i in a:
    print(i.text)
Here is the HTML I'm struggling with. There are many texts on the page, but they all share the same class: d_post_content j_d_post_content clearfix.
<div id='post_content_52497574149' class='d_post_content j_d_post_content clearfix' style='display:;'> Here is the text that I need to get; it is written in Chinese and Stack Overflow may not permit writing Chinese in the body </div>
I want to automatically access the website and get some texts for my homework assignment. With the code above, I can open the website and click the link, but I cannot access the text I need. All of the texts I need are in that class, so I tried to access the class to get them, but it didn't work. When I check the length of the list a, len(a) is zero. Could anyone help me?
This line brings you to a new tab:
driver.find_elements_by_css_selector('a.j_th_tit')[0].click()
So you need to switch to it first. After performing the click, just add this line:
driver.switch_to.window(driver.window_handles[-1])
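Putting it together, a minimal sketch of the corrected flow:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://tieba.baidu.com/f?kw=比特币&ie=utf-8&tab=good')
driver.find_elements_by_css_selector('a.j_th_tit')[0].click()

# the click opened a new tab; move the driver's focus there
driver.switch_to.window(driver.window_handles[-1])

for post in driver.find_elements_by_css_selector('div.d_post_content.j_d_post_content.clearfix'):
    print(post.text)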
When you click the link in this statement:
driver.find_elements_by_css_selector('a.j_th_tit')[0].click()
A new tab is opened. But you are not switching to that tab.
I would recommend adding this statement:
driver.switch_to.window(driver.window_handles[-1])
before you actually call find_elements_by_css_selector.
That should solve your issue.
I want to export all the store data from the following website into an Excel file:
https://www.ybpn.de/ihre-parfuemerien
The problem: the map is "dynamic", so the data I need only loads once you enter a postal code.
The data I need is stored in the div class "storefinder__list-item", with a unique reference in the data-storefinder-reference attribute, for example: data-storefinder-reference="132"
I tried:
soup.find("div", {"data-storefinder-reference": "132"})
But the output is: None
I think this problem is caused by the fact that the page is dynamic: the data only loads once you enter a postal code. So when I search for the reference id "132" it is "there", but it hasn't been loaded into the page yet, and bs4 can't find that id.
Any ideas to improve the code?
For this you might need to look into tools like Selenium and/or headless Firefox.
Selenium in particular lets you "remote-control" web pages with Python.
Here is a tutorial: https://realpython.com/modern-web-automation-with-python-and-selenium/
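A minimal sketch of that approach, using the selectors from your question (note: the postal-code input selector below is a guess and must be checked against the real page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.ybpn.de/ihre-parfuemerien")

# hypothetical selector for the postal-code box -- inspect the page for the real one
search = driver.find_element(By.CSS_SELECTOR, "input[name='plz']")
search.send_keys("10115")
search.submit()

# wait until the JavaScript has filled in the store list
items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.storefinder__list-item")))

for item in items:
    print(item.get_attribute("data-storefinder-reference"), item.text)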
If the problem is waiting for the page to load, you can do that with Selenium:
result = driver.execute_script('var text = document.title ; return text')
If there is jQuery on the page, you can also wait until the DOM is ready:
result = driver.execute_async_script("""
    var done = arguments[arguments.length - 1];
    $(document).ready(function () {
        // hand the selected text back to Selenium once the DOM is ready
        done($('yourselector').text());
    });
""")
Note: for more on Selenium, you can look at its documentation.
You could just open the page in Chrome or Firefox, open the web debug console, and query the elements. If you see them, they are in the DOM and thus queryable. But that will be done in JavaScript. If you're lucky, they use jQuery.
I am trying to create a Python script to scrape the public county records website. I ultimately want to be able to have a list of owner names and have the script run through all the names and pull the most recent deed of trust information (lender name and date filed). For the code below, I just wrote the owner name as a string, 'ANCHOR EQUITIES LTD'.
I have used Selenium to automate entering the owner name into the form boxes, but when the 'return' button is pressed and my results are shown, the website URL does not change. I try to locate the specific text in the table using XPath, but the path does not exist when I look for it. I have concluded that the path does not exist because the search runs against the first page, where no results are shown yet. BeautifulSoup4 wouldn't work in this situation because parsing the URL would only return the blank search form's HTML.
See my code below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
browser.get('http://deed.co.travis.tx.us/ords/f?p=105:5:0::NO:::#results')
ownerName = browser.find_element_by_id("P5_GRANTOR_FULLNAME")
ownerName.send_keys('ANCHOR EQUITIES LTD')
docType = browser.find_element_by_id("P5_DOCUMENT_TYPE")
docType.send_keys("deed of trust")
ownerName.send_keys(Keys.RETURN)
print(browser.page_source)
#lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]/text()")
I have commented out the variable that is giving me trouble. Please help!
If I am not explaining my problem correctly, please feel free to ask and I will clear up any questions.
I think you almost have it.
You match the element you are interested in using:
lenderNameElement = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]")
Next you access the text of that element:
lenderName = lenderNameElement.text
Or in a single step:
lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]").text
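Since the results table only appears after the search runs (which is why the XPath is missing right after pressing return), you may also need an explicit wait before the lookup; a minimal sketch, assuming the same element:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until the results table cell exists in the DOM
lenderName = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//*[@id='report_results']/tbody[2]/tr/td/table/tbody/tr[25]/td[9]"))
).text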
Have you tried the following XPath?
//table[contains(@summary,"Search Results")]/tbody/tr
I have checked that it works perfectly. With it, you have to iterate over each tr.
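A minimal sketch of that iteration (pick the cell index you actually need from each row):

rows = browser.find_elements_by_xpath('//table[contains(@summary, "Search Results")]/tbody/tr')
for row in rows:
    cells = row.find_elements_by_tag_name('td')
    if cells:
        # print the whole row; index into cells for a specific column
        print([cell.text for cell in cells])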
I've decided to take a swing at web scraping using Python (with lxml and requests). The webpage I'm trying to scrape to learn is: http://www.football-lineups.com/season/Real_Madrid/2013-2014
What I want to scrape is the table on the left of the webpage (the table with the scores and formations used). Here is the code I'm working with:
from lxml import html
import requests

page = requests.get("http://www.football-lineups.com/season/Real_Madrid/2013-2014")
tree = html.fromstring(page.text)
competition = tree.xpath('//*[@id="sptf"]/table/tbody/tr[2]/td[4]/font/text()')
print(competition)
The XPath that I input is the one I copied over from Chrome. The code should normally return the competition of the first match in the table (i.e. La Liga). In other words, it should return the second-row, fourth-column entry (there is a random second column in the web layout, I don't know why). However, when I run the code, I get back an empty list. Where might this code be going wrong?
If you inspect the raw source of the page you will see that the lineup table is not there.
It is fed in after the page loads using AJAX, so you won't be able to fetch it just by getting http://www.football-lineups.com/season/Real_Madrid/2013-2014, since the JS won't be interpreted and thus the AJAX won't be executed.
The AJAX request is the following:
URL: http://www.football-lineups.com/ajax/get_sectf.php
method: POST
data: d1=3&d2=-2013&d3=0&d4=1&d5=0&d6=1&d7=20&d8=0&d9=&d10=0&d11=0&d12=undefined
Maybe you can forge the request to get this data. I'll let you analyse those well-named dX arguments :)
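A minimal sketch of forging that request with requests (the payload is copied verbatim from the captured call above; the response format still needs to be inspected):

import requests

url = "http://www.football-lineups.com/ajax/get_sectf.php"
payload = {
    "d1": "3", "d2": "-2013", "d3": "0", "d4": "1", "d5": "0", "d6": "1",
    "d7": "20", "d8": "0", "d9": "", "d10": "0", "d11": "0", "d12": "undefined",
}

response = requests.post(url, data=payload)
print(response.status_code)
print(response.text[:500])  # peek at what the server returns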
Here I give the full code which fulfills your requirement:
from selenium import webdriver
import csv

url = "http://www.football-lineups.com/season/Real_Madrid/2013-2014"
driver = webdriver.Chrome('./chromedriver.exe')
driver.get(url)

# Python 3's csv module wants a text-mode file opened with newline=''
myfile = open('demo.csv', 'w', newline='')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)

# grab every row of the lineup table once the JS has rendered it
tr_list = driver.find_elements_by_xpath("//span[@id='sptf']/table/tbody/tr")
for tr in tr_list:
    lst = []
    for td in tr.find_elements_by_tag_name('td'):
        lst.append(td.text)
    wr.writerow(lst)

driver.quit()
myfile.close()