I want to scrape the dates and the respective news headlines/articles for a period of 6 days, i.e. when the Python script runs today, it should scrape headlines/articles from today (10th August) back to 4th August.
At the moment I am able to scrape the dates and headlines/URLs for all dates from here.
Here is the code:
websites = ['https://www.thespiritsbusiness.com/tag/rum/']

for spirits in websites:
    browser.get(spirits)
    time.sleep(1)

    # headline links
    news_links = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3')
    n_links = [ele.find_element_by_tag_name('a').get_attribute('href') for ele in news_links]

    # publication dates
    dates = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/small')
    n_dates = [ele.text for ele in dates]

    print(n_links)
    print(n_dates)
But how do I scrape only the last 6 days from today? Any ideas?
The page 2 URL is
https://www.thespiritsbusiness.com/tag/rum/page/2/
which means that for the next iteration you need to append /page/2/ to the URL.
You can have a websites list like:
websites = ['https://www.thespiritsbusiness.com/tag/rum/', 'https://www.thespiritsbusiness.com/tag/rum/page/2/', 'https://www.thespiritsbusiness.com/tag/rum/page/3/']
and so on, to achieve this.
Or you can build the page URLs programmatically:
page_number = 1
base_url = 'https://www.thespiritsbusiness.com/tag/rum/'

while page_number <= 3:  # or however many pages you need
    browser.get(base_url + f"page/{page_number}/")
    page_number = page_number + 1
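To restrict the results to the last 6 days, one option is to combine this paging loop with the scraping code from the question, parse the scraped date strings, and keep only the entries inside the window. A minimal sketch, assuming the dates on the page render like "10th August 2018" (the strptime format is an assumption and may need adjusting to the site's actual text):
import re
from datetime import datetime, timedelta

cutoff = datetime.now() - timedelta(days=6)

recent = []
for link, date_text in zip(n_links, n_dates):
    # strip ordinal suffixes, e.g. "10th August 2018" -> "10 August 2018"
    cleaned = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', date_text)
    parsed = datetime.strptime(cleaned, '%d %B %Y')  # assumed date format
    if parsed >= cutoff:
        recent.append((parsed.date(), link))

print(recent)
Once a page yields no dates inside the window, you can stop requesting further pages.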
I am trying to scrape a site.
I want to loop through all the dates in dateelements and, if one matches day, click on that date to select it, collect the data, and add it to the dataframe. I want to do it in one go, as each day has unique data.
# Select the Display by splitting the data by AM/PM (split at 12 pm)
browser.find_element_by_xpath("//*[contains(@id,'react-select-4--value-item')]").click()
Display = browser.find_element_by_xpath("//*[contains(@id,'react-select-4--option-2')]").click()

# Select the date for scraping from the planning search calendar
browser.find_element_by_id("room-planning-from-date").click()
planning_From_Dates = browser.find_elements_by_xpath("//*[contains(@class,'DayPicker-Day')]")

for dateelements in itertools.islice(planning_From_Dates, None, None, n):
    date = dateelements.text
    for day in days:
        time.sleep(0.1)
        print('This is the date:', date, '| This is the day', day)
        if date == day:
            dateelements.click()
            time.sleep(3)
            child_groups = browser.find_elements_by_xpath("//*[contains(@class,'roomPlanning__attendanceInfoHeader')]")
            for group in child_groups:
                # time.sleep(0.2)
                Group = group.text
                group.click()
                # Bring the page source into the tool to make sure we have the correct page
                page = browser.page_source
                dfs = pd.read_html(page, header=0)
                df = dfs[0]
                # Add dfs together
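For the "Add dfs together" step, one common pattern is to collect each day's frame in a list and concatenate once at the end. A minimal sketch under that assumption (the date and group tagging columns are illustrative, not part of the original code):
import pandas as pd

all_frames = []  # built up inside the date/group loops

# inside the innermost loop, instead of keeping only dfs[0]:
#     df = dfs[0]
#     df['date'] = date      # hypothetical columns to tell days/groups apart
#     df['group'] = Group
#     all_frames.append(df)

# after the loops finish, combine everything into one dataframe
if all_frames:
    combined = pd.concat(all_frames, ignore_index=True)
    print(combined.shape)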
I'm just beginning to dabble with Python, and as many have done, I am starting with a web-scraping example to try the language.
I have seen many examples of using zip and map to combine lists, but I am having trouble getting the combined list to print.
Again, I am new, so please be gentle.
The code gathers everything from two particular tag types (the date and title of a post) and returns them as two lists. For this I am using BeautifulSoup and requests.
The site I am practicing on for this test is the blog for a small game called 'Staxel'.
I can get my code to print a full list for one tag using soup.find and print in a for loop, but when I attempt to add a second list to the print, the script simply terminates with no output and no error.
Any tips on how to correctly print the two lists?
I am looking for output like
Entry 2019-01-06 New Years
Entry 2018-11-30 Staxel Changelog for 1.3.52
# import libraries
import requests
import ssl
from bs4 import BeautifulSoup
# set the URL string
quote_page = 'https://blog.playstaxel.com'
# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)
# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')
# grab the post title and date elements
title_box = soup.find_all('h1', attrs={'class': 'entry-title'})
date_box = soup.find_all('span', attrs={'class': 'entry-date published'})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip() for date in date_box]
date_list = zip(dates, titles)
for heading in date_list:
    print("Entry {}".format(heading))
The problem is your query for dates is returning an empty list, so the zipped result will also be empty. To extract the date from that page, you want to look for tags of type time, not span, with class entry-date published:
like this:
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
So with the following code:
import requests
from bs4 import BeautifulSoup
quote_page = "https://blog.playstaxel.com"
page = requests.get(quote_page)
soup = BeautifulSoup(page.content, "lxml")
title_box = soup.find_all("h1", attrs={"class": "entry-title"})
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip() for date in date_box]
for date, title in zip(dates, titles):
    print(f"{date}: {title}")
The result becomes:
2019-01-10: Magic update – feature preview
2019-01-06: New Years
2018-11-30: Staxel Changelog for 1.3.52
2018-11-13: Staxel Changelog for 1.3.49
2018-10-21: Staxel Changelog for 1.3.48
2018-10-12: Halloween Update & GOG
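If you want the exact "Entry <date> <title>" layout from the question, only the print line needs to change:
for date, title in zip(dates, titles):
    print(f"Entry {date} {title}")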
I'm trying to scrape Morningstar.com to get the financial data and prices of each fund available on the website. Fortunately I have no problem scraping the financial data (holdings, asset allocation, portfolio, risk, etc.), but when it comes to finding the URL that hosts the daily prices in JSON format for each fund, there is a "dataid" value that is not available in the HTML code, and without it there is no way to know the exact URL that hosts all the prices.
I have tried printing the whole page as text for many funds, and none of them show the "dataid" value I need in the HTML code in order to get the prices. The URL that hosts the prices also includes the "secid", which is very easy to scrape but has no relationship at all with the "dataid" that I need.
import requests
from lxml import html
import re
import json
quote_page = "https://www.morningstar.com/etfs/arcx/aadr/quote.html"
prices1 = "https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids="
prices2 = "&dataid="
prices3 = "&startdate="
prices4 = "&enddate="
starting_date = "2018-01-01"
ending_date = "2018-12-28"
quote_html = requests.get(quote_page, timeout=10)
quote_tree = html.fromstring(quote_html.text)
security_id = re.findall('''meta name=['"]secId['"]\s*content=['"](.*?)['"]''', quote_html.text)[0]
security_type = re.findall('''meta name=['"]securityType['"]\s*content=['"](.*?)['"]''', quote_html.text)[0]
data_id = "8225"
daily_prices_url = prices1 + security_id + ";" + security_type + prices2 + data_id + prices3 + starting_date + prices4 + ending_date
daily_prices_html = requests.get(daily_prices_url, timeout=10)
json_prices = daily_prices_html.json()
for json_price in json_prices["data"]["r"]:
    j_prices = json_price["t"]
    for j_price in j_prices:
        daily_prices = j_price["d"]
        for daily_price in daily_prices:
            print(daily_price["i"] + " || " + daily_price["v"])
The code above works for the "AADR" ETF only because I copied and pasted the "dataid" value manually into the "data_id" variable; without this piece of information there is no way to access the daily prices. I would not like to use Selenium as an alternative to find the "dataid" because it is a very slow tool, and my intention is to scrape data for more than 28k funds, so I have only tried non-browser (requests-based) scraping methods.
Do you have any suggestion on how to reach, without a browser, the request that the Network inspection tool shows, which is the only place I have found the "dataid" so far?
Thanks in advance
The data id may not be that important. I varied the code F00000412E that is associated with AADR whilst keeping the data id constant.
I got a list of all those codes from here:
https://www.firstrade.com/scripts/free_etfs/io.php
Then add the code of your choice into the URL, e.g.
[
"AIA",
"iShares Asia 50 ETF",
"FOUSA06MPQ"
]
Use FOUSA06MPQ
https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=FOUSA06MPQ;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30
You can verify the values by adding the other fund as a benchmark to your chart e.g. XNAS:AIA
28th December has a value of 55.32, which matches the value in the JSON retrieved.
I repeated this with
[
"ALD",
"WisdomTree Asia Local Debt ETF",
"F00000M8TW"
]
https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=F00000M8TW;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30
dataId 8217 works well for me, irrespective of the security.
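If you want to automate this across many funds, a sketch along those lines might look like the following. It assumes the firstrade endpoint returns a JSON array of [ticker, name, secid] triples (as in the snippets above) and that dataid=8225 keeps working for every security; both are assumptions, not guarantees:
import requests

etf_list_url = "https://www.firstrade.com/scripts/free_etfs/io.php"
chart_url = ("https://mschart.morningstar.com/chartweb/defaultChart"
             "?type=getcc&secids={secid};FE&dataid=8225"
             "&startdate=2017-01-01&enddate=2018-12-30")

# assumed: the endpoint returns JSON like [["AIA", "iShares Asia 50 ETF", "FOUSA06MPQ"], ...]
etfs = requests.get(etf_list_url, timeout=10).json()

for ticker, name, secid in etfs[:5]:  # first few funds as a demo
    prices = requests.get(chart_url.format(secid=secid), timeout=10).json()
    print(ticker, name, "->", len(str(prices)), "characters of price JSON")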
I want to fetch the stock price from the website http://www.bseindia.com/.
For example, the stock price appears as "S&P BSE :25,489.57". I want to fetch the numeric part of it as "25489.57".
This is the code I have written so far. It fetches the entire div in which this amount appears, but not the amount itself:
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = "http://www.bseindia.com"
html_page = urlopen(page)
html_text = html_page.read()
soup = BeautifulSoup(html_text,"html.parser")
divtag = soup.find_all("div", {"class": "sensexquotearea"})

for oye in divtag:
    tdidTags = oye.find_all("div", {"class": "sensexvalue2"})
    for tag in tdidTags:
        tdTags = tag.find_all("div", {"class": "newsensexvaluearea"})
        for newtag in tdTags:
            tdnewtags = newtag.find_all("div", {"class": "sensextext"})
            for rakesh in tdnewtags:
                tdtdsp1 = rakesh.find_all("div", {"id": "tdsp"})
                for texts in tdtdsp1:
                    print(texts)
I had a look at what is going on when that page loads the information, and I was able to simulate what the JavaScript is doing in Python.
I found out it is referencing a page called IndexMovers.aspx?ln=en (the full URL is used in the code below).
It looks like this page is a comma-separated list of values: first comes the name, next the price, and then a couple of other things you don't care about.
To simulate this in Python, we request the page, split it by commas, then read through every 6th value in the list, adding that value and the one after it to a new list called stockInformation.
Now we can just loop through stockInformation and get the name with item[0] and the price with item[1].
import requests
newUrl = "http://www.bseindia.com/Msource/IndexMovers.aspx?ln=en"
response = requests.get(newUrl).text
commaItems = response.split(",")
#create list of stocks, each one containing information
#index 0 is the name, index 1 is the price
#the last item is not included because for some reason it has no price info on indexMovers page
stockInformation = []
for i, item in enumerate(commaItems[:-1]):
    if i % 6 == 0:
        newList = [item, commaItems[i+1]]
        stockInformation.append(newList)

#print each item and its price from your list
for item in stockInformation:
    print(item[0], "has a price of", item[1])
This prints out:
S&P BSE SENSEX has a price of 25489.57
SENSEX#S&P BSE 100 has a price of 7944.50
BSE-100#S&P BSE 200 has a price of 3315.87
BSE-200#S&P BSE MidCap has a price of 11156.07
MIDCAP#S&P BSE SmallCap has a price of 11113.30
SMLCAP#S&P BSE 500 has a price of 10399.54
BSE-500#S&P BSE GREENEX has a price of 2234.30
GREENX#S&P BSE CARBONEX has a price of 1283.85
CARBON#S&P BSE India Infrastructure Index has a price of 152.35
INFRA#S&P BSE CPSE has a price of 1190.25
CPSE#S&P BSE IPO has a price of 3038.32
#and many more... (total of 40 items)
This is clearly equivalent to the values shown on the page.
So there you have it: you can simulate exactly what the JavaScript on that page is doing to load the information. In fact you now have even more information than was shown to you on the page, and the request is going to be faster because we are downloading just the data, not all that extraneous HTML.
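Since the question asks for the numeric part ("25489.57" rather than "25,489.57"), the price strings can be normalised before use, for example:
def to_number(price_text):
    # strip thousands separators (if any) and convert, e.g. "25,489.57" -> 25489.57
    return float(price_text.replace(",", ""))

for item in stockInformation:
    print(item[0], to_number(item[1]))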
If you look into the source code of your page (e.g. by storing it in a file and opening it with an editor), you will see that the actual stock price 25,489.57 does not show up directly. The price is not in the stored HTML but is loaded in a different way.
You could use the linked page where the numbers show up:
http://www.bseindia.com/sensexview/indexview_new.aspx?index_Code=16&iname=BSE30
Hi, newcomer to Python here. As a salesman I travel a lot and would like to save some bucks on hotel bookings, so I am using Python to scrape certain hotels on certain days for personal use.
I can use Python to scrape a specific webpage, but I'm having trouble making a serial search.
The single webpage scrape goes like this:
import requests
from bs4 import BeautifulSoup
url ="http://hotelname.com/arrivalDate=05%2F23%2F2016**&departureDate=05%2F24%2F2016" #means arrive on May23 and leaves on May
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text,'lxml')
names = soup.select('.PropertyName')
prices = soup.select('.RateSection ')
for name, price in zip(names, prices):
    data = {
        "name": name.get_text(),
        "price": price.get_text()
    }
    print(data)
By doing this I can get the price of the hotels on that day. But I would like to know the prices over a longer period (say 15 days), so I can arrange my travel and save some bucks. The question is: how can I make the search loop automatically?
eg. hotelname('') price(200USD) May 1 Check in(CI) and May 2 check out(CO)
hotelname('') price(150USD) May 2 CI May 3 CO
..........
hotelname('') price(170USD) May30 CI May 31 CO
I hope I have made my intentions clear. Can someone help guide me on how to achieve this automatic search? It is too much work to manually change the dates in the URLs. Thanks.
You can use the datetime lib to get the dates and increment a day at a time in the loop for n days:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
def n_booking(n):
    # start tomorrow
    bk = datetime.now() + timedelta(days=1)
    # check the next n days
    for i in range(n):
        mon, day, year = bk.month, bk.day, bk.year
        # move to the next day for the departure date
        bk = bk + timedelta(days=1)
        d_mon, d_day, d_year = bk.month, bk.day, bk.year
        # zero-pad month/day so the URL matches the 05%2F23%2F2016 pattern from the question
        url = ("http://hotelname.com/arrivalDate={mon:02d}%2F{day:02d}%2F{year}"
               "**&departureDate={d_mon:02d}%2F{d_day:02d}%2F{d_year}").format(
                   mon=mon, day=day, year=year,
                   d_mon=d_mon, d_day=d_day, d_year=d_year)
        wb_data = requests.get(url)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        names = soup.select('.PropertyName')
        prices = soup.select('.RateSection')
        for name, price in zip(names, prices):
            yield {
                "name": name.get_text(),
                "price": price.get_text()
            }
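A quick usage sketch (hotelname.com is the placeholder from the question, so this only illustrates the call pattern):
for result in n_booking(15):
    print(result["name"], result["price"])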