lxml not getting updated webpage - python

Simple script here: I'm just trying to get the number of people in a gym from a webpage every 15 minutes and save the result in a text file. However, the script keeps outputting the result from the first time I ran it (39) rather than the updated number of 93 (which can be seen by refreshing the webpage). Any ideas why this is? Note, I set the sleep time to 10 seconds in case you want to run it yourself.
from lxml import html
import time
import requests

x = 'x'
while x == x:
    time.sleep(10)
    page = requests.get('http://www.puregym.com/gyms/holborn/whats-happening')
    string = html.fromstring(page.content)
    people = string.xpath('normalize-space(//span[@class="people-number"]/text()[last()])')
    print people
    # printing it for debug purposes
    f = open("people.txt", "w")
    f.write(people)
    f.write("\n")
Cheers

You are not closing the people.txt file after each loop. It is better to use Python's with statement to handle this, as follows:
from lxml import html
import time
import requests

x = 'x'
while x == 'x':
    time.sleep(10)
    page = requests.get('http://www.puregym.com/gyms/holborn/whats-happening')
    string = html.fromstring(page.content)
    people = string.xpath('normalize-space(//span[@class="people-number"]/text()[last()])')
    print people
    # printing it for debug purposes
    with open("people.txt", "w") as f:
        f.write('{}\n'.format(people))
If you want to keep a log of all entries, you would need to move the with statement outside your while loop, as sketched below. Also, I think you meant while x == 'x'. Currently the site is showing 39, which is what appears in people.txt.
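A minimal sketch of that logging variant, with the with block moved outside the loop (the flush call is only there to push each reading to disk straight away):
from lxml import html
import time
import requests

# keep the file open for the whole run so earlier readings are not overwritten
with open("people.txt", "w") as f:
    while True:
        time.sleep(10)
        page = requests.get('http://www.puregym.com/gyms/holborn/whats-happening')
        tree = html.fromstring(page.content)
        people = tree.xpath('normalize-space(//span[@class="people-number"]/text()[last()])')
        f.write('{}\n'.format(people))
        f.flush()  # write each reading to disk immediately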

Beautifulsoup unable to get data from mds-data-table from morningstar

I'm trying to get the dividend information from Morningstar.
The following code works for scraping info from Finviz, but the dividend information there does not match what my broker platform shows.
import urllib3
from bs4 import BeautifulSoup

symbol = 'bxs'
morningstar_url = 'https://www.morningstar.com/stocks/xnys/' + symbol + '/dividends'
http = urllib3.PoolManager()
response = http.request('GET', morningstar_url)
soup = BeautifulSoup(response.data, 'lxml')
html = list(soup.children)[1]
[type(item) for item in list(soup.children)]

def display_elements(L, show=0):
    test = list(L.children)
    if show:
        for i in range(len(test)):
            print(i)
            print(test[i])
            print()
    return test

test = display_elements(html, 1)
I have no issue printing out the elements but cannot find the element that houses the information such as "Total Yield %" of 2.8%. How do I get inside the mds-data-table to extract the information?
Great question! I've actually worked on this specifically, but years ago. Morningstar only loads the tables after running a script, precisely to prevent this type of scraping. If you view the source immediately on load, you won't see any of that HTML.
What you're going to want to do is find the JavaScript code that is loading the elements, and point bs4 at whatever it fetches. You'll have to poke around the files, but somewhere deep in those js files you'll find a dynamic URL. It'll be hidden, but it'll be in there somewhere. I'll go look at some of my old code and see if I can find something that helps.
So here's an edited sample of what used to work for me:
import time
import logging
from urllib.request import urlopen

def fetch_morningstar_financials(exchange, ticker):
    # edited sample, cut down from a larger function
    if exchange == 'NYSE':
        exchange_code = "XNYS"
    elif exchange in ["NasdaqNM", "NASDAQ"]:
        exchange_code = "XNAS"
    else:
        logging.info("Unknown Exchange Code for {}".format(ticker))
        return
    time_now = int(time.time())
    time_delay = int(time.time() + 150)
    morningstar_raw = urlopen(f'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t={exchange_code}:{ticker}&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=asc&columnYear=5&rounding=3&view=raw&r=354589&callback=jsonp{time_now}&_={time_delay}')
    print(morningstar_raw)

fetch_morningstar_financials('NYSE', 'V')
Granted, this solution is from a file last edited sometime in 2018, and they may have changed up their scripting, but you can find this and much more in my GitHub project wxStocks.
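If that old endpoint no longer responds, the general approach still applies: open the browser's developer tools, watch the network tab while the dividends page loads, and call the request that returns the table data directly. A rough sketch of that pattern (the URL below is a placeholder, not a real Morningstar endpoint; the real one has to be copied from the network inspector):
import requests

# Placeholder endpoint - substitute the XHR URL found in the browser's network tab
data_url = 'https://example.com/api/dividends?ticker=BXS'

resp = requests.get(data_url, headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()

# These dynamic endpoints usually return JSON rather than HTML,
# so the values can be read straight from the parsed payload
payload = resp.json()
print(payload)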

While Loop with time.sleep

I want to use a while loop to refresh a method periodically.
def usagePerUserApi():
    while True:
        url = ....
        resp = requests.get(url, headers=headers, verify=False)
        data = json.loads(resp.content)
        code = resp.status_code
        Verbindungscheck.ausgabeVerbindungsCode(code)
        head = .....
        table = []
        for item in (data['data']):
            if item['un'] == tecNo:
                table.append([
                    item['fud'],
                    item['un'],
                    str(item['lsn']),
                    str(item['fns']),
                    str(item['musage']) + "%",
                    str(item['hu']),
                    str(item['mu']),
                    str(item['hb']),
                    str(item['mb'])
                ])
        print(tabulate(table, headers=head, tablefmt="github"))
        time.sleep(300)
If I leave time.sleep like this, I get an error. If I put it directly under the while loop, it runs constantly and does not wait 5 minutes.
I don't know where the mistake is. I hope you can help me.
You need to import the Python time library.
If you place
import time
at the top of your file it should work
Have you imported the time library? If not, then add
import time
to the top of your code, and it should work.
Also bear in mind that there may be problems with output buffering, where the program won't wait as expected, and so you'll need to turn it off, as shown by this answer.
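One simple way to rule out buffering, assuming Python 3, is to flush on every print so output appears immediately rather than when the buffer fills:
import time

while True:
    # flush=True pushes the text out straight away instead of leaving it in the output buffer
    print("refreshing...", flush=True)
    time.sleep(300)  # wait 5 minutes between refreshes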

Why am I not getting any data back from website?

So I'm brand new to the whole web scraping thing. I've been working on a project that requires me to get the word of the day from here. I have successfully grabbed the word; now I just need to get the definition, but when I do so I get this result:
Avuncular (Correct word of the day)
Definition:
[]
here's my code:
from lxml import html
import requests
page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = html.fromstring(page.content)
word = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[1]/div[2]/div[1]/div/h1/text()')
WOTD = str(word)
WOTD = WOTD[2:]
WOTD = WOTD[:-2]
print(WOTD.capitalize())
print("Definition:")
wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[2]/div[1]/div/div[1]/p[1]/text()')
print(wordDef)
The empty [] is where the first definition is supposed to be, but it won't work for some reason.
Any help would be greatly appreciated.
Your xpath is slightly off. Here's the correct one:
wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[3]/div[1]/div/div[1]/p[1]/text()')
Note div[3] after main/article instead of div[2]. Now when running you should get:
Avuncular
Definition:
[' suggestive of an uncle especially in kindliness or geniality']
If you wanted to avoid hardcoding indices within the xpath, the following would be an alternative to your current attempt:
import requests
from lxml.html import fromstring
page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = fromstring(page.text)
word = tree.xpath("//*[#class='word-header']//h1")[0].text
wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p/strong")[0].tail.strip()
print(f'{word}\n{wordDef}')
If wordDef fails to get the full portion, try replacing it with the one below:
wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p")[0].text_content()
Output:
avuncular
suggestive of an uncle especially in kindliness or geniality

Python - How would I go about getting a block of text from an HTML document?

https://www.cpms.osd.mil/Content/AF%20Schedules/survey-sch/111/111R-03Apr2003.html
This is the page I am trying to parse. It's from a government site, and such sites in my experience are not known for keeping their certificates up to date, so your browser will probably warn you that it is not safe. All I want is this part: http://imgur.com/a/BL14W.
Edit: Sorry for the lack of information. I started asking this question, then got called away at work. It's no excuse, but when I came back it was time to go home, so I just kinda hit submit.
I have already tried doing it more "manually" but apparently not all of the documents came out exactly the same. Here is what I tried:
def table_parser(page):
    file = open(page)
    table = []
    num = 0
    for line in file:
        if 'Grade' in line:
            num += 1
        if num > 0:
            num += 1
        if 3 <= num < 21:
            line = line.rstrip()
            if line != '':
                split_line = line.split(' ')
                split_line = [x for x in split_line if x != '']
                strip_line = split_line[:16]
                table.append(strip_line)
    WG = []
    WL = []
    WS = []
    for l in table:
        WG.append(l[1:6])
        WL.append(l[6:11])
        WS.append(l[11:16])
    file.close()
    # Return 3 lists for the 3 charts I want
    return WG, WL, WS
This is what I used; it got about half of the 65k files I started with mostly right. I passed the returned lists into csv writers to store them until I can get them all cleaned up. I know there is probably a better way, but I came up with this before I could wrap my head around BeautifulSoup. I don't necessarily want the code written for me, just pointers on where to start. I tried to find documentation on BeautifulSoup but I couldn't figure out where to start for what I need.
Your question is a little vague so I'll try my best to help you.
1. Install Beautiful Soup 4
To get a block of text from a webpage, you will need to use the external library BeautifulSoup4 (BS4). Once it is downloaded and installed on your computer, first import it with from bs4 import BeautifulSoup and import urllib.request. Then simply set up BS4 using soup = BeautifulSoup("", "html.parser").
2. Download Webpage
Downloading a webpage is simple: just use site_download = urllib.request.urlopen(url). In your case, simply replace url with the URL you provided here. Then we need to read what we've downloaded using site_read = site_download.read().decode('utf-8'), followed by soup = BeautifulSoup(site_read, "html.parser").
3. Get Block of Text
You can get text in many different ways, so I'll show you a few examples.
To get the text of the first <p> (paragraph) tag:
text = soup.find("p")
text = text.getText()
To get all instances of the <p> tag:
paragraphs = soup.findAll("p")
text = [p.getText() for p in paragraphs]
To get text from a specific class:
text = soup.find(attrs={"class": "class_name_here"})
text = text.getText()
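Putting steps 2 and 3 together for your page might look roughly like this; the loop over <p> tags at the end is a guess, since I don't know which element actually wraps the pay table, so check the page source and swap in the appropriate find(...) call:
import ssl
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.cpms.osd.mil/Content/AF%20Schedules/survey-sch/111/111R-03Apr2003.html'

# The site's certificate may be rejected, so skip verification here
# (acceptable only because we are just reading a public schedule page)
context = ssl._create_unverified_context()
site_download = urllib.request.urlopen(url, context=context)
site_read = site_download.read().decode('utf-8')

soup = BeautifulSoup(site_read, "html.parser")

# Print the text of every paragraph; replace this with
# soup.find(attrs={"class": "..."}) once you know which class holds the table
for paragraph in soup.findAll("p"):
    print(paragraph.getText())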
4. Further Info
More information on how to get different types of tags, and other things you can do with BS4, can be found in the BeautifulSoup documentation.

Bioinformatics : Programmatic Access to the BacDive Database

the resource at "BacDive" - ( http://bacdive.dsmz.de/) is a highly useful database for accessing bacterial knowledge, such as strain information, species information and parameters such as growth temperature optimums.
I have a scenario in which I have a set of organism names in a plain text file, and I would like to programmatically search them 1 by 1 against the Bacdive database (which doesnt allow a flat file to be downloaded) and retrieve the relevent information and populate my text file accordingly.
What are the main modules (such as beautifulsoups) that I would need to accomplish this? Is it straight forward? Is it allowed to programmatically access webpages ? Do I need permission?
A bacteria name would be "Pseudomonas putida" . Searching this would give 60 hits on bacdive. Clicking one of the hits, takes us to the specific page, where the line : "Growth temperature: [Ref.: #27] Recommended growth temperature : 26 °C " is the most important.
The script would have to access bacdive (which i have tried accessing using requests, but I feel they do not allow programmatic access, I have asked the moderator about this, and they said I should register for their API first).
I now have API access. This is the page (http://www.bacdive.dsmz.de/api/bacdive/). This may seem quite simple to people who do HTML scraping, but I am not sure what to do now that I have access to the API.
Here is the solution...
import re
import urllib
from bs4 import BeautifulSoup

def get_growth_temp(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    no_hits = int(map(float, re.findall(r'[+-]?[0-9]+', str(soup.find_all("span", class_="searchresultlayerhits"))))[0])
    if no_hits > 1:
        letters = soup.find_all("li", class_="searchresultrow1") + soup.find_all("li", class_="searchresultrow2")
        all_urls = []
        for i in letters:
            all_urls.append('http://bacdive.dsmz.de/index.php' + i.a["href"])
        max_temp = []
        for ind_url in all_urls:
            soup = BeautifulSoup(urllib.urlopen(ind_url).read())
            a = soup.body.findAll(text=re.compile('Recommended growth temperature :'))
            if a:
                max_temp.append(int(map(float, re.findall(r'[+-]?[0-9]+', str(a)))[0]))
        print "Recommended growth temperature : %d °C:\t" % max(max_temp)

url = 'http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida'

if __name__ == "__main__":
    # To open the file and iterate through the URLs/bacteria:
    # with open('file.txt', 'rU') as f:
    #     for url in f:
    #         get_growth_temp(url)
    get_growth_temp(url)
Edit:
Here I am passing a single URL. If you want to pass multiple URLs and get their growth temperatures, call the function for each URL after opening the file; that part of the code is commented out above.
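Since the text file holds organism names rather than URLs, one way to bridge the gap is a small helper that turns each name into a BacDive search URL before calling the function. A sketch only, in the same Python 2 style as the code above, assuming one organism name per line in file.txt:
import urllib

def name_to_search_url(name):
    # "Pseudomonas putida" -> "http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida"
    return 'http://bacdive.dsmz.de/index.php?search=' + urllib.quote_plus(name.strip())

with open('file.txt', 'rU') as f:
    for name in f:
        if name.strip():
            get_growth_temp(name_to_search_url(name))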
Hope it helped you..
Thanks
