I've been trying to scrape the website below but I'm having some problems. I cannot figure out how they build the list of "empresas" (in English: companies) that they show. When I select a category and submit the form, the URL doesn't change; I've tried to look at the requests but had no success (not a web developer here).
http://www.vitrinedoexportador.gov.br
I first tried to go through all the links on the webpage. The first approach I tried was brute-forcing all the URLs. They have this syntax:
"http://www.vitrinedoexportador.gov.br/bens/ve/br/detalhes/index/cdEmpresa/" + 6 digit code + "#inicio".
But I think that trying out 999999 possibilities would be the wrong way to approach the problem.
The next approach I'm trying is navigating through the pages using Selenium WebDriver, with the code below:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
browser.get('http://www.vitrinedoexportador.gov.br/bens/ve/br#a')

# navigate to the page
select = Select(browser.find_element_by_id('cdSetor'))
print(select.options)
for opt in select.options:
    print(opt.text)
    opt.click()
    if opt.text != 'Escolha':
        opt.submit()
        time.sleep(5)  # needed so the page can finish loading
        listaEmpresas = browser.find_elements_by_tag_name("h6")
        for link in listaEmpresas:
            print(link)
        print(listaEmpresas)
        listaEmpresas[0].click()
But it seems incredibly slow, and I could still only get one company. Is there a smarter way to do this?
Another approach I've tried is using Scrapy. I can already parse an entire company page with all the fields that I want, so if you can help me find a way to get all the IDs, I can parse them in my already-built Scrapy project.
Thank you.
I've done something very similar to this already and there is no super easy way. There is usually no list with all companies, because that belongs to the backend. You have to use the frontend to navigate to a page where you can build a loop to scrape what you want.
For example: I clicked the main URL, then I changed the filter 'Valor da empresa', which has only five options. I chose the first, which gave me 3436 companies. Now it depends whether you want to scrape the details of each company or only the main info, like phone, CEP, and address, which are already on this page. If you want details, you have to build a loop that clicks each link, scrapes the details from its page, goes back to the search, and clicks the next link. If you need only the main information, you can already get that on the search page by grabbing class="resultitem" with BeautifulSoup and looping through the data of the first page.
In any case, the next step (after all links on the first page are scraped) is pressing the second page and doing it again.
After you scrape all 3436 companies of the first filter, do it again for the other 4 filters, and you will have all companies.
You could use other filters instead, but they have many options, and to cover all companies you would have to go through every one of them, which is more work. A rough sketch of the search-page loop is below.
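This is only a minimal sketch of that loop, assuming the main info is enough; the "resultitem" class is what I saw on the search page, while the next-page link text is a guess, so inspect the real pager and adjust it:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
browser.get('http://www.vitrinedoexportador.gov.br/bens/ve/br#a')
# ... apply the 'Valor da empresa' filter here, as described above ...

companies = []
while True:
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    # each search result sits in an element with class "resultitem"
    for item in soup.find_all(class_='resultitem'):
        companies.append(item.get_text(separator=' ', strip=True))
    # the next-page link text is an assumption; inspect the pager and adjust
    next_links = browser.find_elements_by_partial_link_text('Próxima')
    if not next_links:
        break
    next_links[0].click()
    time.sleep(5)  # wait for the next page of results to load

print(len(companies))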
Hope that helps!
I am trying to scrape a website with product listings that, when clicked, redirect the user to a new tab with further information / contact-the-seller details. I am trying to retrieve that URL without actually having to click on each listing in the catalog and wait for the page to load, as this would take a lot of time.
I have searched in the web inspector for the "href", but the only link available is to the image source of each listing. However, I noticed that after clicking each element, a GET request is sent to this URL (https://api.wallapop.com/api/v3/items/v6g2v4y045ze?language=es). It contains pretty much all the information I need; I'm not sure if it's of any use, but it's the furthest I've gotten.
UPDATE: I tried the code I was suggested (with modifications to specifically find the 'href' attributes in the clickable elements), but it returns None. I have been looking for an 'onclick' attribute or something similar that might have what I'm looking for, but so far it looks like the solution will end up being clicking each element and extracting all the information from there.
elements123 = driver.find_elements(By.XPATH, '//a[contains(@class, "ItemCardList__item")]')
for e in elements123:
    print(e.get_attribute('href'))
I appreciate any insights, thank you in advance.
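In case it is useful, here is a rough sketch of how I imagine that endpoint could be queried directly with requests (I have not confirmed that it accepts plain, unauthenticated GET requests):

import requests

# the item id (v6g2v4y045ze) came from the GET request I saw in the inspector;
# presumably there is one of these ids per listing
url = 'https://api.wallapop.com/api/v3/items/v6g2v4y045ze'
response = requests.get(url, params={'language': 'es'})

if response.ok:
    data = response.json()
    print(data)  # the JSON appears to contain the listing details I need
else:
    print('Request failed:', response.status_code)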
You need something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://google.com")
# Get all the elements available with tag name 'a'
elements = driver.find_elements(By.TAG_NAME, 'a')
for e in elements:
    print(e.get_attribute('href'))
I am trying to build a simple web scraper to extract flight information from Student Universe.
I used Selenium to navigate the webpages to get the flight information for my desired location and date. There is no problem for me in getting to the right page with all the information.
However, I have difficulties in extracting the information from the webpage. I used XPath to locate the elements that contain the desired info, but extracting the information is unsuccessful unless I manually scroll up and down the webpage. It seems that this has something to do with subframes embedded in the website. I tried to iterate over all the iframes with driver.switch_to.frame() to see if I could get the information, but the problem remains.
It would be great if anyone could offer some help on how to scrape information from websites like this; the problem may not be caused by the existence of subframes. Any input is appreciated.
The code I used to extract the flight info is shown below; an article tag contains all the info (carrier name, departure time, arrival time and so on). What I did first was to locate this element.
import re
import datetime
import lxml.html

def parseprice(driver):
    driver.maximize_window()
    parser = lxml.html.fromstring(driver.page_source, driver.current_url)
    flights = parser.xpath('//article[@class="itin activeCheck"]')
    driver.quit()
    carriername = flights[0].xpath('//p[@id="airlineName0"]/text()')
    duration = flights[0].xpath('//strong[@id="duration0"]/text()')
    depttime = flights[0].xpath('//span[@id="departureTime0"]/text()')
    arrtime = flights[0].xpath('//span[@id="arrivalTime0"]/text()')
    price = flights[0].xpath('//p[@ng-click="pricePoint()"]//text()')
    stops = flights[0].xpath('//p[@id="stops0"]//text()')
    stoplis = list()
    for st in stops:
        res1 = re.search(r'^(\d+)\D*', st)
        if res1 is not None:
            stoplis.append(int(res1.group(1)))
    now = datetime.datetime.now()
    now = now.timetuple()
    for i in range(20):
        yield {'current time': str(now[1]) + '/' + str(now[2]) + '/' + str(now[0]),
               'carrier': carriername[i], 'duration': duration[i], 'price': price[i],
               'numstops': stoplis[i], 'departure_time': depttime[i], 'arrival_time': arrtime[i]}
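One thing I have been experimenting with (not sure if it is the right fix) is scrolling the page with JavaScript before grabbing the page source, roughly like this:

import time

def scroll_page(driver, steps=10, pause=1.0):
    # scroll down the page in steps so lazily rendered results get a chance to load
    height = driver.execute_script("return document.body.scrollHeight")
    for i in range(1, steps + 1):
        driver.execute_script("window.scrollTo(0, arguments[0]);", height * i / steps)
        time.sleep(pause)
    # scroll back to the top before grabbing page_source
    driver.execute_script("window.scrollTo(0, 0);")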
https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=
This website seems to be built with jQuery (AJAX). I would like to scrape the tables on all pages. When I inspect the page tags 1, 2, 3, 4, they do not have a specific href link. Besides, clicking on them does not create a clear pattern of GET requests; therefore, I find it hard to use Python urllib to send a GET request for each page.
You can use Selenium with Python (http://selenium-python.readthedocs.io/) to navigate through the pages. I would find the Next button and .click() it, then time.sleep(seconds), and scrape the page. I can't navigate to the last page on this site, unfortunately (it seems broken, which you should also be aware of), but I'm assuming the Next button disappears or something when you get to the last page. If not, you might want to save what you've scraped every time you go to a new page; this way you don't lose your data in the event of an error. A rough sketch of that idea is below.
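This is only a minimal sketch, assuming the table is a standard jQuery DataTables widget; the 'paginate_button next' selector is a guess based on typical DataTables markup, so inspect the pager and adjust it:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=')

rows = []
while True:
    time.sleep(2)  # let the AJAX-loaded table render
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for tr in soup.select('table tbody tr'):
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')])
    # the 'next' button selector is an assumption based on typical jQuery DataTables markup
    next_buttons = driver.find_elements_by_css_selector('a.paginate_button.next:not(.disabled)')
    if not next_buttons:
        break
    next_buttons[0].click()

print(len(rows), 'rows scraped')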
I'm trying to scrape data from this website called Anhembi
But when I try all the options from Selenium to find elements, I get nothing. Does anyone know why this happens?
I've already tried:
driver.find_element_by_xpath('//*[@class="agenda_result_laco_box"]')
And I made a for-loop through those to click on every single one and get the info I need, which consists of the day, website, and name of the events. How can I do that?
Clearly, there is an iframe involved; you need to switch the focus of your WebDriver in order to interact with elements that are inside an iframe/frameset/frame.
You can try with this code:
driver.get("http://www.anhembi.com.br/agenda/")
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[src='http://intranet.spturis.com.br/intranet/modulos/booking/anhembisite_busca.php']"))
all_data = driver.find_elements_by_css_selector("div.agenda_result_laco_box")
print(len(all_data))
for data in all_data:
    print(data.text)
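Once you are done with the elements inside the iframe, you can switch the driver's focus back to the main document with:

driver.switch_to.default_content()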
I am trying to scrape a website. This is a continuation of this
soup.findAll is not working for table
I was able to obtain the needed data, but the site has multiple pages, which vary by the day. Some days it can be 20 pages, and 33 pages on another. I was trying to implement this solution by obtaining the last-page element: How to scrape the next pages in python using Beautifulsoup
but when I got to the pager div on the site I want to scrape, I found this format:
<a class="ctl00_cph1_mnuPager_1" href="javascript:__doPostBack('ctl00$cph1$mnuPager','32')">32</a>
<a class="ctl00_cph1_mnuPager_1">33</a>
How can I scrape all the pages on the site, given that the number of pages changes daily?
By the way, the page URL does not change when the page changes.
BS4 will not solve this issue, because it can't run JS.
First, you can try to use Scrapy and this answer
You can use Selenium for it
I would learn how to use Selenium -- it's simple and effective in handling situations where BS4 won't do the job.
You can use it to log into sites, enter keys into search boxes, and click buttons on the screen. Not to mention, you can watch what it's doing with a browser.
I use it even when I'm doing something in BS4, to better monitor the progress of a scraping project.
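For example, here is a rough sketch with Selenium for a pager like yours; the starting URL is a placeholder, and it assumes each pager link's text is simply the page number, as in the anchors you posted:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('http://example.com/page-with-the-table')  # placeholder: the real page URL

all_rows = []
page = 1
while True:
    time.sleep(2)  # let the postback finish rendering the table
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    all_rows.extend(soup.find_all('tr'))

    page += 1
    # the pager anchors carry the __doPostBack call and their text is the page number
    pager_links = driver.find_elements_by_link_text(str(page))
    if not pager_links:
        break  # no link for the next page number, so this was the last page
    pager_links[0].click()

print(len(all_rows), 'rows collected')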
Like some people have mentioned, you might want to look at Selenium. I wrote a blog post about doing something like this a while back: http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/
Now things are much better with headless Chrome and Firefox.
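For example, headless Chrome can be started like this (with a reasonably recent Selenium and Chrome):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a visible window
driver = webdriver.Chrome(options=options)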
Okay, so if I'm understanding correctly, there's an undetermined number of pages that you want to scrape? I had a similar issue, if that's the case. Look at the inspected page and see if there is an element that doesn't exist there but exists on the pages with content.
In my for loop I used:

import requests
from bs4 import BeautifulSoup

# 5000 is just a large number that what I was searching for wouldn't reach
pages = list(map(str, range(1, 5000)))

for n in pages:
    base_url = 'url here'
    url = base_url + n  # n is the page number at the end of my URL
    # fetch the page and parse it with BeautifulSoup
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # "figure" is the element that no longer exists once the pages with content are finished
    figure = soup.find_all("figure")
    if figure:
        pass
    else:
        # break out of the page iteration and jump to my other listing in another URL,
        # since there wasn't any content left on the last page
        break
I hope this helps some, or helps cover what you needed.