I am writing a script that uses Python and BeautifulSoup4. The script itself is finished; the only part that has brought up an issue is the URLs being used.
I am passing the urls with this code:
urllist = ["samplewebsitename.com/2015/05/xxx-chapter-{}.html".format(str(pgnum).zfill(2)) for pgnum in range(1, chapter_number+1)]
for url in urllist:
url_queue.put(url)
A problem I have come across is that part of the URL changes depending on when the material was uploaded. For example:
samplewebsitename.com/2015/05/xxx-chapter-01.html
samplewebsitename.com/2015/06/xxx-chapter-32.html
samplewebsitename.com/2015/10/xxx-chapter-47.html
I can deal with the chapter numbers because they are sequential, but there is no set pattern for the months and years in which the material was added. I'm wondering if there is a way to figure this out.
The year and month would also need to become variables replacing the hard-coded ones in the example, but getting them from the website seems a bit harder than I thought it would be.
EDIT
Apparently you can grab the links from a dropdown list, which simplifies the whole problem to just parsing the dropdown itself for all the links.
The only minor issue I am having now is how to actually parse it correctly. I am currently trying to find the select element of the site, but I'm still quite new at this.
import requests
from bs4 import BeautifulSoup

# Gets all the URLs for each chapter from the dropdown
urllist = []
starturl = "http://www.bimanga.com/2015/05/read-manga-tokyo-ghoul-re-chapter-01.html"
response = requests.get(starturl)
html = response.content
soup = BeautifulSoup(html, "html.parser")
for option in soup.findAll('option'):
    #urllist.append(option["value"])
    print(option["value"])  # Debugging
The year and month can be gained from the dropdown you see here: http://i.imgur.com/pvKgnDw.png
Parse the dropdown (select element) and get the links. Then you probably won't even need to construct the URL from year and month. The dropdown might contain the entire URL to each chapter.
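A minimal sketch of that idea, assuming the chapter dropdown is the first select element on the page and that each option's value holds a full chapter URL (both assumptions to verify against the real markup):

import queue
import requests
from bs4 import BeautifulSoup

starturl = "http://www.bimanga.com/2015/05/read-manga-tokyo-ghoul-re-chapter-01.html"
soup = BeautifulSoup(requests.get(starturl).content, "html.parser")

# Assumption: the first <select> on the page is the chapter dropdown.
chapter_select = soup.find("select")

url_queue = queue.Queue()
for option in chapter_select.find_all("option"):
    value = option.get("value")  # placeholder options may have no value
    if value and value.startswith("http"):
        url_queue.put(value)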
Related
So I've been trying to extract every single phone number from a website that deals in properties (renting/buying houses, apartments, etc.).
There's plenty of categories (cities, type of properties) and ads in each of those. Whenever you enter an ad, there's obviously more pictures, descriptions, and a phone number at the bottom.
This is the site in question.
https://www.nekretnine.rs/
I wrote a Python script that's supposed to extract those phone numbers, but it's giving me nothing. This is the script.
I figure it's not working because it's looking for that information on the home page, and the info is not there, but I just can't figure out how to include all those ads across all those categories in my loop. Don't even ask about an API; they have none. I mean, I crashed their website with the original, sleepless script.
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

data = []
for i in range(1, 50):
    url = "https://www.nekretnine.rs/" + str(i)
    page = urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    x = soup.find_all("div", {"class": "label-small"})
    time.sleep(2)
    for item in x:
        number = item.find_all("form", attrs={"span": "cell-number"})[0].text
        data.append(number)
print(data)
If the content you need is not on the home page, you should use BeautifulSoup to find the links to the other pages that you need, then send a request to get that HTML and look for the information there.
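A rough sketch of that approach, assuming you start from a category/listing page; the category path and the CSS selectors for ad links and phone numbers below are hypothetical placeholders to replace after inspecting the real pages:

import time
import requests
from bs4 import BeautifulSoup

BASE = "https://www.nekretnine.rs"
category_url = BASE + "/stambeni-objekti/"   # hypothetical category page; adjust to a real one

listing = BeautifulSoup(requests.get(category_url).text, "html.parser")

phone_numbers = []
for ad_link in listing.select("a.ad-title"):  # hypothetical selector for links to individual ads
    ad_url = ad_link["href"]
    if ad_url.startswith("/"):
        ad_url = BASE + ad_url
    ad_page = BeautifulSoup(requests.get(ad_url).text, "html.parser")
    phone = ad_page.select_one("span.cell-number")  # hypothetical selector for the phone number
    if phone:
        phone_numbers.append(phone.get_text(strip=True))
    time.sleep(2)  # be polite; the site has already been crashed once

print(phone_numbers)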
For anyone stumbling here, I found the answer
https://webscraper.io/
This browser extension has everything I needed: it's simple, no coding required, minus some regex if you need it.
I need to be able to scrape the content of many articles of a certain category from the New York Times. For example, let's say we want to look at all of the articles related to "terrorism." I would go to this link to view all of the articles: https://www.nytimes.com/topic/subject/terrorism
From here, I can click on the individual links, which directs me to a URL that I can scrape. I am using Python with the BeautifulSoup package to help me retrieve the article text.
Here is the code that I have so far, which lets me scrape all of the text from one specific article:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = "https://www.nytimes.com/2019/10/23/world/middleeast/what-is-going-to-happen-to-us-inside-isis-prison-children-ask-their-fate.html"
req = session.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.get_text())
The problem is, I need to be able to scrape all of these articles under the category, and I'm not sure how to do that. Since I can scrape one article as long as I am given the URL, I would assume my next step is to find a way to gather all of the URLs under this specific category, and then run my above code on each of them. How would I do this, especially given the format of the page? What do I do if the only way to see more articles is to manually select the "SHOW MORE" button at the bottom of the list? Are these capabilities that are included in BeautifulSoup?
You're probably going to want to put a limit on how many articles you pull at a time. I clicked the Show More button a handful of times for the terrorism category and it just keeps going.
To find the links, you need to analyze the HTML structure and find patterns. In this case, each article preview is in a list element with class = "css-13mho3u". However, I checked another category and this class pattern isn't consistent across categories. But you can see that these list elements all sit under an ordered list element whose class = "polite", and this is consistent across the other news categories.
Under each list element, there is one link pointing to the article. So you simply have to grab it and extract the href. Your code can look something like this:
ol = soup.find('ol', {'class': 'polite'})
items = ol.findAll('li')
for item in items:
    link = item.find('a')
    url = link['href']
To click on the Show More button you'll need to use additional tools outside of beautiful soup. You can use Selenium webdriver to click it to open up the next page. You can follow the top answer at this SO question to learn to do that.
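For what it's worth, here is a rough sketch of combining the two, assuming chromedriver is installed; the "Show More" button XPath is a guess to verify against the live page, and the ol class="polite" structure is the one described above:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.nytimes.com/topic/subject/terrorism")

# Click "Show More" a few times; the button XPath is an assumption to verify.
for _ in range(5):
    try:
        show_more = driver.find_element(By.XPATH, "//button[contains(., 'Show More')]")
        show_more.click()
        time.sleep(2)  # give the new previews time to load
    except Exception:
        break  # button no longer present; stop clicking

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

ol = soup.find('ol', {'class': 'polite'})
article_urls = [li.find('a')['href'] for li in ol.findAll('li') if li.find('a')]
print(article_urls)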
I am using the get method of the requests library in python to scrape information from a website which is organized into pages (i.e paginated with numbers at the bottom).
Page 1 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian
I am able to extract the data that I need from the first page but when I feed my code the url for the second page, I get the same data from the first page. Now after carefully analyzing my code, I am certain the issue is not my code logic but the way the second page url is structured.
So my question is: how can I get my code to work as I want? I suspect it is a question of parameters, but I am not 100% sure. If it is indeed parameters that I need to pass to the request, I would appreciate some guidance on how to break them down. My page 2 link is attached below.
Thanks.
Page 2 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
Note: The pages are not really links per se.
It looks like the platform is ASP.NET and the pagination links are operated by JS. I seriously doubt you will have it easy with plain Python here, since BeautifulSoup is an HTML parser/extractor, so if you really want to use this site, I would suggest looking into Selenium or even PhantomJS, since they fully replicate a browser.
But in this particular case you are lucky, because there's a legacy website version which doesn't use modern bells and whistles :)
http://legacy.realfood.tesco.com/recipes/search.html?st=vegetarian&cr=False&page=3&srt=searchRelevance
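For instance, a minimal sketch that walks the legacy pages until one comes back empty; the recipe-link selector is an assumption to adjust after inspecting the legacy markup:

import requests
from bs4 import BeautifulSoup

base = "http://legacy.realfood.tesco.com/recipes/search.html"
params = {"st": "vegetarian", "cr": "False", "srt": "searchRelevance"}

page = 1
while True:
    params["page"] = page
    resp = requests.get(base, params=params)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Assumption: recipe results are links whose href contains "/recipes/".
    recipes = [a["href"] for a in soup.find_all("a", href=True) if "/recipes/" in a["href"]]
    if not recipes:
        break  # no results on this page, so we've run out of pages

    print("page", page, "->", len(recipes), "recipe links")
    page += 1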
It looks like the pagination of this site is handled by the query parameters passed in the second URL you posted, i.e.:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
The query string is URL-encoded: %3D is = and %26 is &. It might be more readable like this:
q='selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian'
For example, if you wanted to pull back the fifth page of Vegetarian Recipes the URL would look like this:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D5%26perpage%3D30%26DietaryOption%3DVegetarian'
You can keep incrementing the page number until you get a page with no results which looks like this.
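To see that decoding for yourself, or to build the URL for an arbitrary page number, urllib.parse can do the quoting; here is a small sketch. Note that because the page number lives in the URL fragment, it is the site's JavaScript that reads it in a browser, so a plain requests/urllib fetch of these URLs may still return the first page:

from urllib.parse import unquote, quote

encoded = "selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian"
print(unquote(encoded))
# selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian

def page_url(page):
    """Build the fragment-style URL for a given results page."""
    query = "selectedobjecttype=RECIPES&page={}&perpage=30&DietaryOption=Vegetarian".format(page)
    return ("https://realfood.tesco.com/search.html?DietaryOption=Vegetarian"
            "#!q='" + quote(query, safe="") + "'")

print(page_url(5))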
What about this?
from bs4 import BeautifulSoup
import urllib.request

for numb in range(1, 11):
    # The page number has to appear in the URL itself; the legacy URL from the
    # answer above accepts it directly, while the main site handles paging in JS.
    url = ("http://legacy.realfood.tesco.com/recipes/search.html"
           "?st=vegetarian&cr=False&page={}&srt=searchRelevance".format(numb))
    resp = urllib.request.urlopen(url)
    soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        print(link['href'])
Hopefully it works for you. I can't test it because my office blocks these kinds of things. I'll try it when I get home tonight to see if it does what it should do...
I am trying to scrape a website. This is a continuation of this
soup.findAll is not working for table
I was able to obtain the needed data, but the site has multiple pages, and the number varies by the day. Some days it can be 20 pages, and 33 on another. I was trying to implement this solution by obtaining the last page element: How to scrape the next pages in python using Beautifulsoup
but when I got to the pager div on the site I want to scrape, I found this format:
<a class="ctl00_cph1_mnuPager_1" href="javascript:__doPostBack('ctl00$cph1$mnuPager','32')">32</a>
<a class="ctl00_cph1_mnuPager_1">33</a>
How can I scrape all the pages on the site, given that the number of pages changes daily?
By the way, the page URL does not change when the page changes.
BS4 will not solve this issue on its own, because it can't run JS.
First, you can try to use Scrapy and this answer
You can use Selenium for it
I would learn how to use Selenium -- it's simple and effective in handling situations where BS4 won't do the job.
You can use it to log into sites, enter keys into search boxes, and click buttons on the screen. Not to mention, you can watch what it's doing with a browser.
I use it even when I'm doing something in BS4, to better monitor the progress of a scraping project.
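If you go the Selenium route, a minimal sketch for this kind of JavaScript pager might look like the following. The pager link class comes from the snippet in the question, but the site URL is a placeholder and the table-extraction step is left for you to fill in:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")             # run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get("http://example.com/report.aspx")   # placeholder URL for the site in question

page = 1
while True:
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # ... extract the table data for the current page from `soup` here ...

    # Find the pager link for the next page number; it is absent once we reach the last page.
    next_links = driver.find_elements(
        By.XPATH, "//a[@class='ctl00_cph1_mnuPager_1' and text()='{}']".format(page + 1))
    if not next_links:
        break                                  # no link for the next page, so we're done
    next_links[0].click()                      # triggers the __doPostBack for that page
    time.sleep(2)                              # crude wait for the postback to finish
    page += 1

driver.quit()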
Like some people have mentioned, you might want to look at Selenium. I wrote a blog post about doing something like this a while back: http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/
Things are much better now with headless Chrome and Firefox.
Okay, so if I'm understanding correctly, there's an undetermined number of pages that you want to scrape? I had a similar issue, if that's the case. Inspect the pages and see if there is an element that exists on the pages with content but not on the empty ones.
In my for loop I used
import requests
from bs4 import BeautifulSoup

# 5000 is just a large number that what I was searching for wouldn't reach
pages = list(map(str, range(1, 5000)))

for n in pages:
    base_url = 'url here'            # the listing URL, with the page number appended at the end
    url = base_url + n
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # "figure" is the element that no longer existed once the pages with content ran out
    figure = soup.find_all("figure")
    if figure:
        pass                          # content found, keep going
    else:
        # break out of the page iterations and move on to my other listing at another URL,
        # since there wasn't any content left past the last page
        break
I hope this helps some, or helps cover what you needed.
I have been using scrapy for a personal project. My problem is very similar to the question asked on the following page:
Scrapy: Follow link to get additional Item data?
The page I am scraping is the following:
http://www.tennisinsight.com/player_activity.php?player_id=51
This page has a list of matches in this form for eg:
Round of 16 Def. Ivan Dodig(37,2.41) (CRO) 6-3 6-3 Recap Match Stats $1.043
I have currently written Scrapy code that opens every "Match Stats" link on the page and scrapes the data on that page into an individual record.
In addition to this, I want to scrape the "Odds" column (which is the $1.043 above) and add this data to the record.
I have searched for an answer and it seems that I have to use the Request meta field and pass this data along to the parse method. However, I have a problem because I am struggling to incorporate it into my code. The answer from the stackoverflow link I linked above is "To scrape additional fields which are on other pages, in a parse method extract URL of the page with additional info, create and return from that parse method a Request object with that URL and pass already extracted data via its meta parameter."
This makes perfect sense; however, the URLs that I scrape are defined in the rules, so I don't know how to extract the required data.
Here is part of my code so far which will hopefully better explain my problem.
rules = (
    Rule(SgmlLinkExtractor(allow=r"match_stats_popup.php\?matchID=\d+",
                           restrict_xpaths='//td[@class="matchStyle"]',
                           tags='a', attrs='href', process_value=getPopupLink),
         callback='parse_match', follow=True),
)
The parse_match function parses the match stats into one item.
So what happens is that each of these match stats links are opened up, and there is no way for me to access the main page's Odds column.
Any help will be much appreciated.
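In case it helps, here is a rough sketch of one way around this: skip the Rule for the stats links and instead parse the activity page directly, pulling the Odds value from each row and passing it along in the Request's meta. The XPaths below are placeholders to adapt to the real table, and since the original rule uses process_value=getPopupLink to turn the popup links into real URLs, the same transformation would be applied to stats_url here:

import scrapy

class TennisInsightSpider(scrapy.Spider):
    name = "tennis_insight"
    start_urls = ["http://www.tennisinsight.com/player_activity.php?player_id=51"]

    def parse(self, response):
        # Placeholder row selector: each match row contains the "Match Stats" cell.
        for row in response.xpath('//tr[td[@class="matchStyle"]]'):
            stats_url = row.xpath('.//td[@class="matchStyle"]/a/@href').get()
            odds = row.xpath('.//td[last()]/text()').get()   # placeholder: the Odds cell
            if stats_url:
                yield scrapy.Request(
                    response.urljoin(stats_url),
                    callback=self.parse_match,
                    meta={"odds": odds},        # carry the already-extracted value along
                )

    def parse_match(self, response):
        item = {"odds": response.meta["odds"]}
        # ... fill in the rest of the match stats from this page as before ...
        yield item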