Trouble scraping phone numbers from a website - python

So I've been trying to extract every single phone number from a website that deals in properties (renting/buying houses, apartments, etc.).
There are plenty of categories (cities, types of property) and ads in each of those. Whenever you open an ad, there are obviously more pictures, descriptions, and a phone number at the bottom.
This is the site in question.
https://www.nekretnine.rs/
I wrote a Python script that's supposed to extract those phone numbers, but it's giving me nothing. This is the script.
I figure it's not working because it's looking for that information on the home page, and the info isn't there, but I just can't figure out how to include all those ads across all those categories in my loop. Don't even ask about an API, they have none. I mean, I crashed their website with the original version of the script, which had no sleep between requests.
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

data = []
for i in range(1, 50):
    url = "https://www.nekretnine.rs/" + str(i)
    page = urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    x = soup.find_all("div", {"class": "label-small"})
    time.sleep(2)
    for item in x:
        # the original line mixed up the tag and the attribute; a span with class "cell-number" is the likely intent
        number = item.find_all("span", {"class": "cell-number"})[0].text
        data.append(number)
print(data)

If the content you need is not on the home page, use BeautifulSoup to find the links to the other pages you need (the individual ads), then send a request for each of those pages and look for the phone number in that HTML.
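A rough sketch of that two-step approach on this site (the listing URL, the href filter, and the phone-number selector below are all guesses that would need checking against the real HTML):
import time
import requests
from bs4 import BeautifulSoup

base = "https://www.nekretnine.rs"
listing_url = base + "/"  # replace with a real category/listing page
data = []
listing = BeautifulSoup(requests.get(listing_url).text, "html.parser")
# Collect links to individual ads; the "/oglasi" filter is only an assumption about the URL layout
ad_links = {a["href"] for a in listing.find_all("a", href=True) if "/oglasi" in a["href"]}
for link in ad_links:
    ad_url = link if link.startswith("http") else base + link
    ad_page = BeautifulSoup(requests.get(ad_url).text, "html.parser")
    # The phone-number element is also an assumption -- inspect an ad page to confirm it
    number = ad_page.find("span", {"class": "cell-number"})
    if number:
        data.append(number.get_text(strip=True))
    time.sleep(2)  # be polite so the site isn't hammered again
print(data)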

For anyone stumbling here, I found the answer:
https://webscraper.io/
This browser extension has everything I needed. It's simple, no coding required, minus some regex if you need it.

Related

How to web-scrape all articles of a certain category from New York Times

I need to be able to scrape the content of many articles of a certain category from the New York Times. For example, let's say we want to look at all of the articles related to "terrorism." I would go to this link to view all of the articles: https://www.nytimes.com/topic/subject/terrorism
From here, I can click on the individual links, which directs me to a URL that I can scrape. I am using Python with the BeautifulSoup package to help me retrieve the article text.
Here is the code that I have so far, which lets me scrape all of the text from one specific article:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = "https://www.nytimes.com/2019/10/23/world/middleeast/what-is-going-to-happen-to-us-inside-isis-prison-children-ask-their-fate.html"
req = session.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.get_text())
The problem is, I need to be able to scrape all of these articles under the category, and I'm not sure how to do that. Since I can scrape one article as long as I am given the URL, I would assume my next step is to find a way to gather all of the URLs under this specific category, and then run my above code on each of them. How would I do this, especially given the format of the page? What do I do if the only way to see more articles is to manually select the "SHOW MORE" button at the bottom of the list? Are these capabilities that are included in BeautifulSoup?
You're probably going to want to put a limit on how many articles you pull at a time. I clicked the Show More button a handful of times for the terrorism category and it just keeps going.
To find the links, you need to analyze the HTML structure and find patterns. In this case, each article preview is in a list element with class "css-13mho3u". However, I checked another category and that class isn't consistent across categories. But you can see that these list elements all sit under an ordered list element with class "polite", and that is consistent across the other news categories.
Under each list element there is one link to the article, so you simply have to grab it and extract the href. Your code can look something like this:
ol = soup.find('ol', {'class': 'polite'})
items = ol.findAll('li')
for item in items:
    link = item.find('a')
    url = link['href']
To click on the Show More button you'll need to use additional tools outside of beautiful soup. You can use Selenium webdriver to click it to open up the next page. You can follow the top answer at this SO question to learn to do that.
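If you go that route, a minimal Selenium sketch could look like this (Chrome is assumed, and the button selector is a guess -- check the actual button in the inspector):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.nytimes.com/topic/subject/terrorism")
for _ in range(5):  # limit how many times the list is expanded
    try:
        driver.find_element(By.XPATH, "//button[contains(., 'Show More')]").click()
        time.sleep(2)  # give the new article previews time to load
    except Exception:
        break  # button not found or no longer clickable
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
ol = soup.find('ol', {'class': 'polite'})
links = [li.find('a')['href'] for li in ol.findAll('li') if li.find('a')]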

Get method from requests library seems to return homepage rather than specific URL

I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data, as well as their info... and also the page's advertisements, the contact info, the social media buttons and links, the adblock detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page, it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.
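For example, a quick sketch along those lines (assuming the table with id "contracts" sits directly in the returned HTML):
import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/contracts/IND.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
table = soup.find('table', {'id': 'contracts'})  # just the contracts table, not the whole page
for row in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    print(cells)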
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the get request in the terminal won't be very helpful, as the returned HTML content is long and your terminal will truncate the printed response. My guess is that in your case the website reuses parts of the homepage on other pages as well, so the output looks confusing.
I recommend writing the response into a file and then opening the file in the browser. You will see that your code is pulling the right page.
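For example, something as simple as this (a quick sketch using your existing page variable) lets you open the saved file in a browser and compare it to the live page:
with open('contracts_page.html', 'w', encoding='utf-8') as f:
    f.write(page.text)  # save the full response for inspection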

Script cannot fetch data from a web page

I am trying to write a program in Python that can take the name of a stock and its price and print it. However, when I run it, nothing is printed. It seems like the data is not being fetched from the website. I double-checked that the XPath from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests

page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
# XPath attribute tests use @, not #
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print('Prices:', Prices)
Here is the website I am trying to get the data from.
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page itself you see when you manually visit the website. It seems that the website was smart enough to see that your request to this URL was from a script and not manually from a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this special Captcha page. I get the same when I try scraping with urllib2.
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works. Though of course you'll need a different xpath to find the cell you need.
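If you do try Selenium, a bare-bones sketch could look like this (the class name is copied from your XPath and may well have changed, and Bloomberg may still serve a captcha to an automated browser):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.bloomberg.com/quote/UKX:IND')
# Verify the class name in the browser's inspector first; it looks auto-generated
prices = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.priceText__1853e8a5')]
print('Prices:', prices)
driver.quit()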

How can I scrape a site with multiple pages using beautifulsoup and python?

I am trying to scrape a website. This is a continuation of this question:
soup.findAll is not working for table
I was able to obtain the data I needed, but the site has multiple pages, and the number varies by day. Some days it can be 20 pages, 33 on another. I was trying to implement this solution by obtaining the last page element, as in How to scrape the next pages in python using Beautifulsoup,
but when I got to the pager div on the site I want to scrape, I found this format:
<a class="ctl00_cph1_mnuPager_1" href="javascript:__doPostBack('ctl00$cph1$mnuPager','32')">32</a>
<a class="ctl00_cph1_mnuPager_1">33</a>
How can I scrape all the pages on the site given that the number of pages changes daily?
By the way, the page URL does not change when you change pages.
BS4 alone will not solve this issue, because it can't run JavaScript.
First, you can try to use Scrapy and this answer.
Or you can use Selenium for it.
I would learn how to use Selenium -- it's simple and effective in handling situations where BS4 won't do the job.
You can use it to log into sites, enter keys into search boxes, and click buttons on the screen. Not to mention, you can watch what it's doing with a browser.
I use it even when I'm doing something in BS4, to better monitor the progress of a scraping project.
Like some people have mentioned, you might want to look at Selenium. I wrote a blog post about doing something like this a while back: http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/
Things are much better now with headless Chrome and Firefox.
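For reference, a bare-bones headless setup these days is roughly this (Chrome assumed):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')
html = driver.page_source  # rendered HTML, ready to hand to BeautifulSoup
driver.quit()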
Okay, so if I'm understanding correctly, there's an undetermined number of pages that you want to scrape? I had a similar issue, if that's the case. Inspect a page past the end of the content and see if there is an element that doesn't exist there but does exist on the pages with content.
In my for loop I used something like this:
import requests
from bs4 import BeautifulSoup

# 5000 is just a large number that the page count I was searching for would never reach
pages = list(map(str, range(1, 5000)))
for n in pages:
    base_url = 'url here'
    url = base_url + n  # n is the page number at the end of my url
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # <figure> is the element that no longer existed once the pages with content ran out
    figure = soup.find_all("figure")
    if figure:
        pass
    else:
        # break out of the page iteration and jump to my other listing at
        # another url once there wasn't any content left on the last page
        break
I hope this helps some, or helps cover what you needed.

Web scraping with urls changing by date

I am writing a script that uses Python and BeautifulSoup4. The script itself is finished; the only part that has brought up an issue is the URLs being used.
I am passing the urls with this code:
urllist = ["samplewebsitename.com/2015/05/xxx-chapter-{}.html".format(str(pgnum).zfill(2)) for pgnum in range(1, chapter_number+1)]
for url in urllist:
url_queue.put(url)
A problem I have come across is that part of the URL changes depending on when the material was uploaded. For example:
samplewebsitename.com/2015/05/xxx-chapter-01.html
samplewebsitename.com/2015/06/xxx-chapter-32.html
samplewebsitename.com/2015/10/xxx-chapter-47.html
I can deal with the chapters because they are sequential, but there is no set pattern for the months and years when the material was added. I'm wondering if there is a way to figure this out.
The year and month would also need to become variables, replacing the hard-coded ones in the example, but getting them from the website seems a bit harder than I thought it would be.
EDIT
Apparently you can grab the links from a dropdown list, which simplifies the whole problem to just parsing the dropdown itself for all the links.
The only minor issue I am having now is how to actually parse it correctly. I'm currently trying to find the select element of the site, but I'm still quite new at this.
# Gets all the URLs for each chapter
import requests
from bs4 import BeautifulSoup

urllist = []
starturl = "http://www.bimanga.com/2015/05/read-manga-tokyo-ghoul-re-chapter-01.html"
response = requests.get(starturl)
html = response.content
soup = BeautifulSoup(html, "html.parser")
for option in soup.findAll('option'):
    #urllist.append(option["value"])
    print(option["value"])  # Debugging
The year and month can be gained from the dropdown you see here: http://i.imgur.com/pvKgnDw.png
Parse the dropdown (the select element) and get the links. Then you probably won't even need to construct the URL from year and month; the dropdown might contain the entire URL to the chapter.
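A sketch of that, building on the code in the question (it assumes each option's value already holds a full chapter URL, which is worth checking first):
import requests
from bs4 import BeautifulSoup

starturl = "http://www.bimanga.com/2015/05/read-manga-tokyo-ghoul-re-chapter-01.html"
soup = BeautifulSoup(requests.get(starturl).content, "html.parser")
select = soup.find("select")  # scope the search to the chapter dropdown, not every <option> on the page
urllist = []
if select:
    for option in select.find_all("option"):
        value = option.get("value", "")
        if value.startswith("http"):  # assumption: the value is already a full URL; if relative, prepend the base URL
            urllist.append(value)
print(urllist)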
