I have a Google Sheet that grabs data from a website using the =IMPORTXML function. I also have a Python script that grabs data from the Google Sheet. The whole thing works, but I'm now trying to streamline it. It started as a manual process in Google Sheets; it's automated now, but it's not pretty.
Two specific questions:
1) What is the best way to scrape a website using Python? I'd like to get this all running in a single script. Would something like Beautiful Soup be a good solution?
2) Currently, each query to the Google API is coded to run separately (it's not a sub-function, but I'd like to turn it into one). It essentially copies the quickstart script:
spreadsheetId = 'xxxx'
rangeName = 'xxxx'
result = service.spreadsheets().values().get(
    spreadsheetId=spreadsheetId, range=rangeName).execute()
values = result.get('values', [])
variable = ''
for row in values:
    variable = '%s' % (row[0])

if variable != storedVariable:
    print('Condition not met...')
    return

# Do a thing
My code has various versions of setting a variable, checking it against a stored value, and proceeding if the correct conditions exist. Is there an easier way to parse the values returned from the API call so that it's set as a variable?
BeautifulSoup will work well for scraping data as long as the page is completely static. For most webpages you'll need to be able to interact with the page to get at the data you need, or to iterate through multiple pages. Selenium is great for these situations.
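For the static case, a minimal requests + Beautiful Soup sketch might look something like this (the URL and selector are placeholders, not anything from your sheet):

import requests
from bs4 import BeautifulSoup

# Placeholder URL and CSS selector -- swap in the page and elements you
# currently pull with IMPORTXML.
response = requests.get('https://example.com/data-page', timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
for cell in soup.select('td.value'):
    print(cell.get_text(strip=True))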
I don't have a better solution for this question. The google-api-python-client library is cumbersome. It looks like gspread used to be a good alternative with more features, but it hasn't been updated in almost a year and seems to have fallen behind the Google library.
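That said, you can at least fold the repeated quickstart boilerplate into one helper. A rough sketch, assuming service is already built as in the quickstart and that each range resolves to the single cell you compare against your stored value:

def get_cell(service, spreadsheet_id, range_name):
    # Return the first cell of the range, or None if the range is empty.
    result = service.spreadsheets().values().get(
        spreadsheetId=spreadsheet_id, range=range_name).execute()
    values = result.get('values', [])
    return values[0][0] if values else None

variable = get_cell(service, 'xxxx', 'xxxx')
if variable != storedVariable:
    print('Condition not met...')
else:
    pass  # Do a thing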
first post so go easy on me :)
The situation: I'm trying to scrape information off a web-based customer-management system (CMS) that holds sales information, and get those values into Excel or Google Sheets to build a report, saving the time and errors of flipping through everything manually.
I remember once using a solution (multiple tools) that would basically go through the pages, take values from defined fields on those pages, and then throw that information into columns on a sheet that we'd then manipulate manually. I'm pretty sure it was Python-based and (I think) used the Tampermonkey extension to get the information from a dev/debugger version of Chrome.
The process looked something like this:
Already logged into the CMS -> execute the tool/script, which would automatically open an order in a new window
It'd then go through that order, take values from specific fields, and copy those values into a sheet
It'd then close the window and move on to the next order in the specified range
Once it completed the specified (date) range, the columns would be something like salesperson / order number / sale amount / attachment amount / etc., to then be manipulated manually; no further automation needed beyond the formulas in the sheet
Anyone have any ideas on how to get this done, or know of any guides for this specific type of task? Trying to automate this as much as possible. Thanks in advance.
Python should be a good choice as it provides you with many different tools. Depending on the functionality of the CMS you can choose different packages.
Simple HTML scraping
For simple scraping of static HTML content, Scrapy or Beautiful Soup should be enough.
Scraping including executable content
For these cases you can use Selenium, which you can combine with Beautiful Soup. More details can be found in this related question and this one.
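A rough sketch of that combination, assuming Chrome and a matching driver are installed; the URL and element IDs below are placeholders for whatever your CMS actually uses:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
try:
    # Placeholder order page -- in practice you would loop over the orders
    # in your date range.
    driver.get('https://example-cms.local/orders/12345')
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # Hypothetical field IDs; inspect the CMS pages to find the real ones.
    salesperson = soup.select_one('#salesperson').get_text(strip=True)
    amount = soup.select_one('#sale-amount').get_text(strip=True)
    print(salesperson, amount)
finally:
    driver.quit()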
def live_price(stock):
    string = (data.decode("utf-8"))
    conn.request("GET", f"/stock/{stock}/ohlc", headers=headers)
    print(Price)

live_price("QCOM")
I want to be able to type live_price("stockname") and then have the function output the data for that stock. If anyone can help, that would be great. All other variables mentioned are defined elsewhere in the code.
import yfinance

def live_price(stock):
    # Download the price history and print the most recent 'Open' price.
    inst = yfinance.download(stock)
    print(inst['Open'].iloc[-1])
When one has a hammer, everything looks like a nail. Or, to put it differently: the best solution for your problem is actually Google Sheets itself, as it has access to Google Finance live data (which is by far the best possible data source for live prices). If you'd later like to do any analysis in Python, you can just pull the data from your Google Sheet, either locally with your preferred code editor or, even better, in Google Colaboratory.
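For example, a single Sheets formula (kept in Sheets rather than Python, since that's the point) pulls the current price straight into a cell:

=GOOGLEFINANCE("NASDAQ:QCOM", "price")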
I'm using Python 2.7 and Zeep to call SuiteTalk v2017_2_0, the SOAP-based NetSuite web service API. The command I'm running is search like so:
from zeep import Client

netsuite = Client(WSDL)
TransactionSearchAdvanced = netsuite.get_type('ns19:TransactionSearchAdvanced')
TransactionSearchRow = netsuite.get_type('ns19:TransactionSearchRow')

# login removed for brevity

r = netsuite.service.search(TransactionSearchAdvanced(
    savedSearchId=search, columns=TransactionSearchRow()))
Now the results of this include all the data I want but I can't figure out how (if at all) I can determine the display columns that the website would show for this saved search and the order they go in.
I figure I could probably call netsuite.service.get() and pass the internalId of the saved search, but what type do I specify? Along those lines, has anyone found a decent reference for all the objects, type enumerations, etc.?
https://stackoverflow.com/a/50257412/1807800
Check out the above link regarding Search Preferences. It explains how to limit the columns returned to only those in the saved search.
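An untested sketch of setting that preference through Zeep: the bodyFieldsOnly/returnSearchColumns fields are standard SuiteTalk, but the ns4 prefix and the searchPreferences header name below are assumptions about your particular WSDL, so verify them (e.g. with python -m zeep <WSDL>).

# Prefix and header part name are assumptions -- check your WSDL.
SearchPreferences = netsuite.get_type('ns4:SearchPreferences')
prefs = SearchPreferences(bodyFieldsOnly=False, returnSearchColumns=True)

r = netsuite.service.search(
    TransactionSearchAdvanced(savedSearchId=search,
                              columns=TransactionSearchRow()),
    _soapheaders={'searchPreferences': prefs})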
I wrote a script that scrapes various things from around the web and stores them in a Python list, and I have a few questions about the best way to get it into an HTML table to display on a web page.
First off, should my data be in a list? It will be at most 25 by 9.
I'm assuming I should write the list to a file for the web site to import? Is a text file preferred, or something like a CSV or XML file?
What's the standard way to import a file into a table? In my quick look around the web I didn't see an obvious answer (major web-design beginner here). Is JavaScript the best thing to use? Or can Python write out something that can easily be read by HTML?
Thanks
Store everything in a database, e.g. SQLite, MySQL, MongoDB, Redis...
then query the DB every time you want to display the data.
This is good if you later want to change the data or feed it from multiple sources.
Store everything in a "flat file": SQLite, XML, JSON, msgpack.
Again, open and read the file whenever you want to use the data,
or read it in completely on startup.
Simple and often fast enough.
Generate an HTML file from your list with a template engine, e.g. Jinja, and save it as an HTML file (a small sketch of this follows below).
Good for simple hosters.
There are some good Python web frameworks out there; some I have used:
Flask, Bottle, Django, Twisted, Tornado.
They all more or less output HTML.
Feel free to use HTML5/DHTML/JavaScript.
You could also use a web framework to create an "api" on the backend which serves JSON or XML;
then your JavaScript callback will display it on your site.
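A small sketch of the template-engine option, assuming Jinja2 is installed (pip install jinja2); the placeholder rows stand in for your 25-by-9 list:

from jinja2 import Template

TABLE_TMPL = Template(
    '<table>\n'
    '{% for row in rows %}'
    '<tr>{% for col in row %}<td>{{ col }}</td>{% endfor %}</tr>\n'
    '{% endfor %}'
    '</table>\n')

my_list = [['a', 1, 2.5], ['b', 3, 4.0]]  # placeholder data
with open('table.html', 'w') as f:
    f.write(TABLE_TMPL.render(rows=my_list))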
The most direct way to create an HTML table is to loop through your list and print out the rows.
print('<table><tr><th>Column1</th><th>Column2</th>...</tr>')
for row in my_list:
    print('<tr>')
    for col in row:
        print('<td>%s</td>' % col)
    print('</tr>')
print('</table>')
Adjust the code as needed for your particular table.
I'm not new to programming, but I am (very) new to web scraping. I'd like to get data from this website in this manner:
Get the team-data from the given URL and store it in some text file.
"Click" the links of each of the team members and store that data in some other text file.
Click various other specific links and store data in its own separate text file.
Again, I'm quite new to this. I have tried opening the specified website with urllib2 (in hopes of being able to parse it with BeautifulSoup), but opening it resulted in a timeout.
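A stripped-down sketch of the sort of thing I tried (the URL is a placeholder, and I've added an explicit timeout and a browser-like User-Agent here):

import urllib2
from bs4 import BeautifulSoup

url = 'http://example.com/team-page'  # placeholder team URL
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request, timeout=15).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)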
Ultimately, I'd like to do something like specify a team's URL to a script, and have said script update associated text files of the team, its players, and various other things in different links.
Considering what I want to do, would it be better to learn how to create a web-crawler, or directly do things via urllib2? I'm under the impression that a spider is faster, but will basically click on links at random unless told to do otherwise (I do not know whether or not this impression is accurate).