How to use Python 3 to save data from this page?

I am trying to save price data from this page using Python 3.x.
I want my script to go through every option under the Fund Provider dropdown, and then save the resulting table to a local file.
Unfortunately, when I look at the source code, it appears that all the menu options and table data come from JSON files, and I am not sure where to begin, as I can't seem to read the files from a browser. I know how to use urlretrieve and have used it for simple static web pages, but my skills are not advanced enough to navigate documents built from multiple resources.
Any advice on how I can achieve my goal would be most appreciated.

Sorry for doing an incorrect copy and paste with the URL. Anyway, I found a solution. What I needed to do was:
use Firebug (an extension for Firefox) to identify the location of the JSON files, along with the posted information,
then use urlretrieve to download the data, including the POST information with each request.
Example code:
from urllib.request import urlretrieve
import urllib.parse

url = 'http://www.example.com'
values = {'example_param1': 'example value 1',
          'example_param2': 'example value 2'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')  # POST data must be bytes

# save_root and fund_provider are placeholders for your own values
save_root = '/path/to/save/dir/'
fund_provider = 'provider_name'
save_path = save_root + fund_provider + '.json'
urlretrieve(url, save_path, data=data)

Related

Extracting JSON data from an HTML page with Python

I am fairly new to HTML and JSON and am struggling a little with extracting the data I am after in a usable format within Python, for a Raspberry Pi project.
I am using a device which outputs some live data over a Wi-Fi link in the form of an HTML page. Although the data shown on the page can be changed, I am only really concerned with getting data from a single page for now. When viewed in Notepad++ the page looks like:
<!DOCTYPE html>
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"><style>.b{position:absolute;top:0;bottom:0;left:0;right:0;height:100%;background-color:#000;height:auto !important;}.f{border-radius: 10px;font-weight:bold;position:absolute;top:50%;left:0;right:0;margin:auto;background:#024d27;padding:50px;box-sizing:border-box;color:#FF0;margin:30px;box-shadow:0px 2px 18px -4px #0F0;transform:translateY(-50%);}#V{font-size:96px;}#U{font-size: 56px;}#N{font-size: 36px;}</style></head><body><div class="b"><div class="f"><span id="N">Voltage</span><br><span id="V">12.53</span> <span id="U">V</span><br></div></div><script>reqData();setInterval(reqData, 200);function reqData() {var xhr = new XMLHttpRequest();xhr.onload = function() {if (this.status == 200) {var data = JSON.parse(xhr.responseText);document.getElementById('N').innerHTML = data.n;document.getElementById('V').innerHTML = data.v;document.getElementById('U').innerHTML = data.u;} else {document.getElementById('N').innerHTML = "?";document.getElementById('V').innerHTML = "?";document.getElementById('U').innerHTML = "?";}};xhr.open('GET', 'readVal', true);xhr.send();}</script></body></html>
As you can see, it is a fairly simple page which just presents the information I am trying to extract, in a green box with yellow text on a black background.
From staring at the source a little, the information I am trying to extract is what's associated with the span IDs 'V' (voltage), 'N' (name) and 'U' (units).
The data is displayed live on the webpage (i.e. it updates every 200 ms, I think, without refreshing the page) and I would like to extract the values as frequently as possible.
I have tried a few different blocks of code/methods and this seems to be the only one which I am currently able to gain any success with:
import urllib.request, json
data = urllib.request.urlopen("http://192.168.4.1").read()
print(data)
This correctly returns the HTML source code for the page (albeit with a delay of about 5 seconds, which may just be down to the low spec of the Pi Zero I am running it on).
However, I don't seem to be able to extract the JSON data from within this. I have tried:
data_json = json.loads(data)
but this gives me a JSONDecodeError: Expecting value: line 1 column 1 (char 0), which I am assuming is because 'data' is still a mix of HTML code and JSON. I have also noticed that the actual values I am trying to retrieve (Voltage, 12.53 and V from the example source page at the top) show up as '?' placeholders when I fetch the page with urllib, rather than as the actual values shown in a browser.
Is anyone able to offer me any pointers at all please?
Thanks in advance,
Steve
As you've noticed from the error message and the raw HTML code, the result you're getting from your device isn't JSON data; it's HTML with JavaScript. The HTML you posted makes an AJAX request (a JavaScript GET request) to some local endpoint (/readVal, perhaps?).
Try opening http://192.168.4.1 in your browser, open the dev tools, and observe what network requests the page makes under the hood - specifically, look for XHR requests. Check the request URL and response; I bet you'll find a local endpoint that returns the raw JSON data you want.
Or just try http://192.168.4.1/readVal and see if that's it.
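If that endpoint does return raw JSON, a minimal polling sketch could look like the following (the /readVal path and the n/v/u keys are taken from the posted HTML; adjust them if your device differs):
import json
import time
import urllib.request

ENDPOINT = "http://192.168.4.1/readVal"

while True:
    with urllib.request.urlopen(ENDPOINT) as resp:
        data = json.loads(resp.read().decode())
    print(data["n"], data["v"], data["u"])  # e.g. Voltage 12.53 V
    time.sleep(0.2)  # the page itself polls every 200 ms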

Read a CSV file from Bitbucket using Python and convert it to a df

I am trying to read a CSV file from a Bitbucket URL into a df using Python. Also, for the work I am doing, I cannot read it locally; it has to come from Bitbucket every time.
Any ideas on how to do this? Thank you!
Here is my example:
import pandas as pd

url = 'https://bitbucket.EXAMPLE.com/EXAMPLE/EXAMPLE/EXAMPLE/EXAMPLE/raw/wpcProjects.csv?at=refs%2Fheads%2Fmaster'
colnames = ['project_id', 'project_name', 'gourmet_url']
df7 = pd.read_csv(url, names=colnames)
However, the output is not correct: it is not the df being output, but some bad data.
You have multiple options, but your question is actually two separate questions:
1. How to get a file (.csv in this case) from a remote location.
2. How to load a CSV into a "df", i.e. a pandas DataFrame.
For #2, you simply import pandas and use the df = pandas.read_csv() function call. See the documentation! If the CSV file were in the current directory, you would do pandas.read_csv('myfile.csv').
For #1, the CSV is on a server somewhere; in this case it happens to be on Bitbucket's servers, accessed through their website. You can fetch it and save it locally, then access it; fetch it to a temporary location, read it into pandas, and discard it; or even read the data from the file into Python as a string. However, having a lot of options doesn't mean they are all useful, and I am just listing them for completeness. Looking at the documentation, pandas already has remote fetching built into read_csv(): you can pass in a URL directly, where
"Valid URL schemes include http, ftp, s3, gs, and file".
If you want to save it locally, you can use pandas to do so once again, using the .to_csv() method of a DataFrame.
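For example, a minimal round trip (the URL here is a placeholder; substitute your raw Bitbucket link):
import pandas as pd

df = pd.read_csv('https://example.com/data.csv')  # fetch remotely
df.to_csv('local_copy.csv', index=False)          # save a local copy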
FOR BITBUCKET SPECIFICALLY:
You need to make sure you link to the 'raw' file on Bitbucket: get the link to the raw file, and pass that in. The link used to view the file in your web browser is not, by default, a direct link to the raw file; it's a webpage that offers a view of that file. Get the raw file link, then pass that into pandas.
Code example:
Assume we want (a random CSV file I found on Bitbucket):
https://bitbucket.org/pedrorijo91/nodejstutorial/src/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv?at=master
What you need is a link to the raw file! Clicking on '...' and pressing 'Open raw', we get:
https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv
Let's look at this in detail. The link is the same up to the project name:
https://bitbucket.org/pedrorijo91/nodejstutorial/
afterwards, the raw file is under raw/
then it's the same commit hash (random-looking, but the same letters and numbers in both links):
db4c991864e65c4d72e98a1dc94e33606e3adde9/
Finally, it's the same directory structure:
node_modules/levelmeup/data/horse_js.csv
The first link ends with ?at=master, which is parsed by the web server, and goes through src/; the second link, the actual link to the raw file, goes through raw/ and ends with .csv.
import pandas as pd
RAW_Bitbucket_URL = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
df = pd.read_csv(RAW_Bitbucket_URL)
The above code is successful for me.
Alternatively, you can download the entire file by making the request with requests and then reading it as a local file with pandas.read_csv().
>>> import pandas as pd
>>> import requests
>>> url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
>>> r = requests.get(url, allow_redirects=True)
>>> open('file.csv', 'wb').write(r.content)
>>> pd.read_csv('file.csv', encoding='utf-8-sig').head()
ID Tweet Date Via
0 374667940827635712 So, yes, a 100% JS App is 100% awesome 08:59:32, 9-3, 2013 web
1 374656867466637312 "vituperating priests" who rail against JavaSc... 08:15:32, 9-3, 2013 web
2 374654221292806144 Node/Browserify/CJS folks, is there any benefit 08:05:01, 9-3, 2013 Twitter for iPhone
3 374640446955212800 100% JavaScript applications. You may get some 07:10:17, 9-3, 2013 Twitter for iPhone
4 374613490763169792 A node.js app that will order you a sandwich 05:23:10, 9-3, 2013 web
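If you'd rather skip the temporary file, you can also feed the response body to pandas through an in-memory buffer (one of the options listed earlier); a minimal sketch:
import io
import pandas as pd
import requests

url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
r = requests.get(url)
df = pd.read_csv(io.BytesIO(r.content), encoding='utf-8-sig')  # read straight from memory
print(df.head())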

When I run the code, it runs without errors, but the CSV file is not created. Why?

I found a tutorial and I'm trying to run this script; I have not worked with Python before.
tutorial
I've already traced what is running with logging.debug, checked whether it is connecting to Google, and tried creating the CSV file with other scripts.
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv

def scrape_run():
    with open('/Users/Work/Desktop/searches.txt') as searches:
        for search in searches:
            userQuery = search
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            links = page.cssselect('.r a')
            csvfile = '/Users/Work/Desktop/data.csv'
            for row in links:
                raw_url = row.get('href')
                title = row.text_content()
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    csvRow = [userQuery, url[0], title]
                    with open(csvfile, 'a') as data:
                        writer = csv.writer(data)
                        writer.writerow(csvRow)
            print(links)

scrape_run()
The TL;DR of this script is that it does three basic things:
Locates and opens your searches.txt file.
Uses those keywords and searches the first page of Google for each result.
Creates a new CSV file and prints the results (keyword, URLs, and page titles).
Solved
Google added a captcha because I made too many requests.
It works when I use mobile internet.
Assuming the links variable is full and contains data - please verify.
If it's empty, test the API call you are making; maybe it returns something different than you expected.
Other than that, I think you just need to tweak your file handling a little bit.
https://www.guru99.com/reading-and-writing-files-in-python.html
Here you can find some guidelines regarding file handling in Python.
From my perspective, you need to make sure you create the file first.
Start with a script which is able to just create a file.
After that, enhance the script to be able to write and append to the file.
From there on, I think you are good to go and can continue with your script.
Other than that, you would probably prefer opening the file only once instead of on each loop iteration; it could mean much faster execution time. See the sketch below.
Let me know if something is not clear.
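A minimal sketch of that last suggestion, reusing the path and row format from the question's script (rows is a hypothetical list standing in for the [userQuery, url[0], title] rows collected in the loop):
import csv

csvfile = '/Users/Work/Desktop/data.csv'
with open(csvfile, 'a', newline='') as data:  # opened once, not once per row
    writer = csv.writer(data)
    for csvRow in rows:  # hypothetical: rows gathered earlier in the loop
        writer.writerow(csvRow)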

Post text from files into a web page form using Python

I have a set of text files that I need to input one after the other to a web server. I know how to submit text using mechanize, but I have no idea how to extract text from files stored on the computer and input them one after the other. In other words, say I have 10 files on my hard disk; I need to post text from one file, submit it, then post another file, and the process should go on until all the files are posted. Please help me with suggestions.
Thank you.
First make a for loop to iterate through your files. Then read each file and encode it into a POST request with urllib and urllib2. You need to change the URL, filename pattern, and form fields accordingly.
import glob
import urllib
import urllib2

url = "http://www.example.com/form"

for filename in glob.glob("file*.txt"):
    filedata = open(filename).read()
    data = urllib.urlencode({'data': filedata})
    urllib2.urlopen(url=url, data=data)
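Since you mentioned mechanize, the same loop could also be written with it; a minimal sketch, assuming the form is the first one on the page and has a text field named 'data' (adjust both to your actual form):
import glob
import mechanize

br = mechanize.Browser()
for filename in glob.glob("file*.txt"):
    br.open("http://www.example.com/form")
    br.select_form(nr=0)                # first form on the page (assumption)
    br["data"] = open(filename).read()  # field name 'data' is an assumption
    br.submit()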

Parsing XML in Python - don't understand the DOM

I've been reading up on parsing XML with Python all day, but looking at the site I need to extract data from, I'm not sure if I'm barking up the wrong tree. Basically, I want to get the 13-digit barcodes from a supermarket website (found in the names of the images). For example:
http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985
has 11 items and 11 images; the barcode for the first item is 0000003235676. However, when I look at the page source (I assume this is the best way to extract all of the barcodes in one go with Python, urllib and BeautifulSoup), all of the barcodes are on one line (line 12), and the data doesn't seem to be structured as I would expect in terms of elements and attributes.
new TESCO.sites.UI.entities.Product({name:"Lb Mens Mattifying Dust 7G",xsiType:"QuantityOnlyProduct",productId:"275303365",baseProductId:"72617958",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/805/5021320051805/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"g",unitPrice:3.58,catchWeight:"0",shelfName:"Mens Styling",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Thickening Shampoo 250Ml",xsiType:"QuantityOnlyProduct",productId:"275301223",baseProductId:"72617751",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/225/5021320051225/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:1,catchWeight:"0",shelfName:"Mens Shampoo ",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Sculpting Puty 75Ml",xsiType:"QuantityOnlyProduct",productId:"275301557",baseProductId:"72617906",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/287/5021320051287/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:3.34,catchWeight:"0",shelfName:"Pastes, Putty, Gums, Pomades",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
Maybe something like BeautifulSoup is overkill? I understand the DOM tree is not the same thing as the raw source, but why are they so different - when I go to Inspect Element in Firefox, the data seems structured as I would expect.
Apologies if this comes across as totally stupid; thanks in advance.
Unfortunately, the barcode is not given in the HTML as structured data; it only appears embedded as part of a URL. So we'll need to isolate the URL and then pick off the barcode with string manipulation:
import urllib2
import bs4 as bs
import re
import urlparse

url = 'http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985'
response = urllib2.urlopen(url)
content = response.read()

# with open('/tmp/test.html', 'w') as f:
#     f.write(content)

# Useful for debugging off-line:
# with open('/tmp/test.html', 'r') as f:
#     content = f.read()

soup = bs.BeautifulSoup(content)
barcodes = set()
for tag in soup.find_all('img', {'src': re.compile(r'/pi/')}):
    href = tag['src']
    scheme, netloc, path, query, fragment = urlparse.urlsplit(href)
    barcodes.add(path.split('\\')[1])

print(barcodes)
yields
set(['0000003222737', '0000010039670', '0000010036297', '0000010008393', '0000003050453', '0000010062951', '0000003239438', '0000010078402', '0000010016312', '0000003235676', '0000003203132'])
Since the site uses JavaScript to render its content, you might find it useful to switch from urllib to a tool like Selenium. That way you can crawl pages as they render for a real user in a web browser. This GitHub project seems to solve your task.
Another option is to filter the JSON data out of the page's JavaScript and get the data directly from there.
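A minimal Selenium sketch of the first option (it assumes a Chrome driver is installed; the rendered HTML is then handed to BeautifulSoup as before):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985')
soup = BeautifulSoup(driver.page_source, 'html.parser')  # fully rendered DOM
driver.quit()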
