I'm new at Python and I need expert guidance for the project I'm trying to finish at work, as none of my coworkers are programmers.
I'm making a script that logs into a website and pulls a CSV dataset. Here are the steps that I'd like to automate:
Open Chrome, go to a website
Login with username/password
Navigate to another internal site via menu dropdown
Input text into a search tag box or delete search tags, e.g. "Hours", press "Enter" or "Tab" to select (repeat this for 3-4 search tags)
Click "Run data"
Wait until data loads, then click "Download" to get a CSV file with 40-50k rows of data
Repeat this process 3-4 times for different data pulls, different links and different search tags
This process usually takes 30-40 minutes for a total of 4 or 5 data pulls each week so it's like watching paint dry.
I've tried to automate this using the pyautogui module, but it isn't working out for me. It works too fast, or doesn't work at all. I think I'm using it wrong.
This is my code:
import webbrowser
import pyautogui

# pyautogui.position()          # handy while setting up: shows the current mouse coordinates
# print(pyautogui.position())

# Steps 1-2: open Chrome and go to a website
pyautogui.FAILSAFE = True       # slam the mouse into a screen corner to abort
chrome_path = 'open -a /Applications/Google\ Chrome.app %s'   # macOS; the backslash escapes the space for the shell

url = 'http://Google.com/'
webbrowser.get(chrome_path).open(url)
pyautogui.moveTo(185, 87, duration=0.25)   # address bar coordinates, found manually with pyautogui.position()
pyautogui.click()
pyautogui.typewrite('www.linkedin.com')
pyautogui.press('enter')
# login here? Research
In case pyautogui is not suited for this task, can you recommend an alternative way?
The way you are going about grabbing your data is very error-prone and not how people generally grab data from websites. What you want is a web scraper, which lets you pull information out of web pages; alternatively, some companies provide APIs that give you easier access to the data.
LinkedIn itself has a built-in API for grabbing information. You did mention that you were navigating to another site, though, in which case I would check whether that site has an API, or look into Scrapy, a web-scraping framework that should let you pull the information you need.
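In case it helps, here is a very rough sketch of what a Scrapy spider looks like; the URL and the CSS selectors are placeholders, since I don't know your internal site's markup:
import scrapy

class ReportSpider(scrapy.Spider):
    name = "reports"
    # placeholder URL for the internal report page you navigate to
    start_urls = ["https://internal.example.com/reports"]

    def parse(self, response):
        # placeholder selectors; adjust them to the real page structure
        for row in response.css("table.report tr"):
            yield {
                "hours": row.css("td.hours::text").get(),
                "value": row.css("td.value::text").get(),
            }
You would run it with something like scrapy runspider reports.py -o report.csv to get a CSV out.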
Side note: you can also look into synchronous and asynchronous programming with Python to make multiple requests faster/easier.
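For instance, a sketch of the asynchronous approach using asyncio and aiohttp (the report URLs are placeholders for whichever endpoints you end up hitting):
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # fire all the requests concurrently instead of one at a time
        return await asyncio.gather(*(fetch(session, u) for u in urls))

results = asyncio.run(main([
    "https://internal.example.com/report1",   # placeholder
    "https://internal.example.com/report2",   # placeholder
]))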
Related
I'd like to ask somebody with experience with headless browsers and Python whether it's possible to extract the box info with the distance from the closest strike on the webpage below. Until now I was using Python's bs4, but since everything here is driven by jQuery, a simple download of the page doesn't work. I found PhantomJS, but I wasn't able to extract it with that either, so I am not sure it's possible. Thanks for any hints.
https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0
This isn't really a Linux question; it's a Stack Overflow question, so I won't go into too much detail.
The thing you want to do can be easily done with Selenium. Selenium has both a headless mode and a "heady" mode (where you can watch it open your browser and click on things). The DOM query API is a bit less extensive than bs4's, but it does have nice visual query (location on screen) functions. So you would write a Python script that initializes Selenium, goes to your website, and interacts with it. You may need to do some image recognition on screenshots at some point. It may be as simple as searching for a certain query image on the screen, or something much more complicated.
You'd have to go through the Selenium tutorials first to see how it works, which would take you 1-2 days. Then figure out what Selenium stuff you can use to do what you want, that depends on luck and whether what you want happens to be easy or hard for that particular website.
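To give a flavour of what that script might look like, here is a minimal headless-Selenium sketch; the CSS selector is a guess, not the page's real markup:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0')

# Wait for the jQuery-rendered content, then read the element you care about.
# '.lightning-distance' is a placeholder selector.
element = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.lightning-distance'))
)
print(element.text)
driver.quit()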
Instead of using Selenium, though, I recommend trying to reverse engineer the API. For example, the page you linked to hits https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark with parameters like:
_
callback
isGpsLocation
location
locationtype
safetyMessage
shortMessage
units
verbose
authid
timestamp
hash
You can figure out by trial and error which ones you need and what to put in them. You can capture requests from your browser and then read them yourself. Then construct appropriate requests from a Python program and hit their API. It would save you from having to deal with a Web UI designed for humans.
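As a starting point, something like this; the parameter values are guesses I have not verified, and you would fill them in from a captured browser request:
import requests

params = {
    'location': '49.13688,16.56522',        # guess: the lat/lon from the page URL
    'locationtype': 'latitudelongitude',    # guess
    'units': 'metric',                      # guess
    'verbose': 'true',                      # guess
}
resp = requests.get(
    'https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark',
    params=params,
)
print(resp.status_code)
print(resp.text)   # inspect the payload to see which other fields (authid, hash, ...) are actually required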
I'm using Python to access the SEC's website for downloadable 10-K spreadsheets. I created code that requests user input for a stock ticker symbol, successfully opens Firefox, accesses the EDGAR search page at https://www.sec.gov/edgar/searchedgar/companysearch.html, and inputs the correct ticker symbol. The problem is downloading the spreadsheet automatically and saving it.
Right now, I can manually click on "View Excel Spreadsheet" and the spreadsheet downloads automatically. But when I run my Python code, I get a dialog box from Firefox. I've set Firefox to download automatically, and I've tried using 'find_element_by_xpath' and 'find_element_by_css_selector', but neither works to simply download the file; both methods merely call up the same dialog box. I tried 'find_element_by_link_text' and got an error message about not being able to find "view Excel Spreadsheet". My example ticker symbol was CAT for Caterpillar (NYSE: CAT). My code is below:
import selenium.webdriver.support.ui as ui
from pathlib import Path
import selenium.webdriver as webdriver
import time
ticker = input("please provide a ticker symbol: ")
# can do this other ways, but will create a function to do this
def get_edgar_results(ticker):
    url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" + str(ticker) + "&type=10-k&dateb=20200501&count=20"
    # open Firefox via my executable path for geckodriver
    driver = webdriver.Firefox(executable_path=r"C:\Program Files\JetBrains\geckodriver.exe")
    # timers to wait for the webpage to open and display the page itself
    wait = ui.WebDriverWait(driver, 40)
    driver.set_page_load_timeout(40)
    driver.get(url)
    # timers to have the page wait to load.
    # it seemed that the total amount of time was necessary; not sure about these additional lines
    driver.set_page_load_timeout(50)
    wait = ui.WebDriverWait(driver, 50)
    time.sleep(30)
    # search the resulting page for the button that opens the interactive (Excel) data for download
    annual_links = driver.find_element_by_xpath('//*[@id="interactiveDataBtn"]')
    annual_links.click()
    # need to download the Excel spreadsheet itself from the "financial report" page
    driver.set_page_load_timeout(50)
    wait = ui.WebDriverWait(driver, 50)
    excel_sheet = driver.find_element_by_xpath('/html/body/div[5]/table/tbody/tr[1]/td/a[2]')
    excel_sheet.click()
    # I'm setting the resulting dialog box to open and download automatically from now on. If I want to change it back,
    # I'll need to use this page: https://support.mozilla.org/en-US/kb/change-firefox-behavior-when-open-file
    # Testing showed that the dialog box's "open as" option probably suits my needs better than "save".
    driver.close()
    driver.quit()

get_edgar_results(ticker)
Any help or suggestions are greatly appreciated. Thanks!!
This is not so much a recommendation based on your actual code or on how Selenium works, but more general advice for gathering information from the web.
Given the opportunity, accessing a website through its API is far more friendly to programming than attempting the same task through Selenium. When you use Selenium for web scraping, websites very often do not behave the way they do when accessed through a normal browser. This could be for any number of reasons, not the least of which is websites intentionally blocking automated browsers like Selenium.
In this case, the SEC's EDGAR system provides an HTTPS access service through which you should be able to get the information you're looking for.
Without digging too deeply into the data, it should not be tremendously difficult to instead request this information with an HTTP library like requests and save it that way.
import requests
result = requests.get("https://www.sec.gov/Archives/edgar/data/18230/000001823020000214/Financial_Report.xlsx")
with open("file.xlsx", "wb") as excelFile:
excelFile.write(result.content)
The only difficulty comes with getting the stock ticker's CIK to build the above URL, but that shouldn't be too hard with the same API information.
The EDGAR website exposes its data fairly transparently through its URLs. You can bypass all the Selenium weirdness and instead just build the URL and request the information directly, without loading all the JavaScript, etc.
EDIT: You can also browse through this information in a more programmatic fashion. The link above mentions that each directory within edgar/full-index also provides a JSON file, which is easily computer-readable. So you could request https://www.sec.gov/Archives/edgar/full-index/index.json, parse out the year you want, request that year, parse out the quarter you want, request that quarter, then parse out the company you want, and request that company's information, and so on.
For instance, to get the CIK number of Caterpillar, you would get and parse the company.gz file from https://www.sec.gov/Archives/edgar/full-index/2020/QTR4.json, parse it into a dataframe, find the line with CATERPILLAR INC on it, find the CIK and accession numbers from the associated .txt file, and then find the right URL to download their Excel file. A bit circuitous, but if you can work out a way to skip straight to the CIK number, you can cut down on the number of requests needed.
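A rough sketch of the first couple of hops through that index, assuming the JSON uses the directory/item layout the HTML listings suggest (worth verifying against the actual response):
import requests

base = 'https://www.sec.gov/Archives/edgar/full-index/'

# Top-level index: should list one entry per year.
index = requests.get(base + 'index.json').json()
years = [item['name'] for item in index['directory']['item']]   # assumed key layout
print(years)

# Drill into one year to list its quarters, then into a quarter for the
# company/form/master files mentioned above.
year_index = requests.get(base + '2020/index.json').json()
quarters = [item['name'] for item in year_index['directory']['item']]
print(quarters)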
So I have a program I want to run, using Selenium specifically, that takes a series of actions on a password-protected website. Basically, I need to be able to input a unique link and password when I get them, which will take me to the main website I have automated. The issue is that Selenium takes a long time to load a webpage when you start it up, and time is very important in this application. Inputting the link and launching the browser to that link directly takes too long. What I have tried is preloading the browser to a different website (e.g. https://google.com) beforehand and then waiting on user input for the link to the actual page. This works a lot quicker, but I'm having trouble getting it to work inside a function and with multiprocessing. I am using multiprocessing to run this at a wide scale with lots of instances, and I am trying to start all of my functions the second a link is defined by me. I am on Windows 10, using Python 3.8.3, and using Chrome for my Selenium browser.
from selenium import webdriver

global link
link = input('Paste Link Here: ')

def instance_1():
    browser1 = webdriver.Chrome(*my webdriver file path*)
    browser1.get('https://google.com')
    # need something that waits here until the link variable is defined by me
    browser1.get(link)
    # the rest of the automation works fine from here
Ideally, the solution would be able to work with multiprocessing. The ideal flow would be something like this:
1. All Selenium instances (written as their own functions) start up and preload to a website (this part works fine)
2. They wait until the link to go to is specified (this is where the issue is)
3. They then go to the link and execute the automation (this part works fine)
TL;DR: basically anything that would allow me to let the program continue while waiting on the input would be nice.
I am looking for a Python module that will let me navigate the search bars, links, etc. of a website.
For context, I am looking to do a little web scraping of this website: https://www.realclearpolitics.com/
I simply want to take the information on each state (polling data etc.) in relation to the 2020 election and organize it all in a database.
Obviously there are a lot of states to go through, and each is on a separate webpage. So I'm looking for a method in Python with which I could quickly navigate the site and take the data from each page, as well as update and add to existing data. A way of quickly navigating links and search bars with my input data would be very helpful.
Any suggestions would be greatly appreciated.
# a simple list that contains the names of each state
states = ["Alabama", "Alaska", "Arizona", "....."]

for state in states:
    # code to look up the state in the search bar of the website
    # figures being taken from the website etc.
    break
Here is the rough idea I have.
There are many options to accomplish this with Python. As #LD mentioned, you can use Selenium. Selenium is a good option if you need to interact with a website's UI via a headless browser, e.g. clicking a button, entering text into a search bar, etc. If your needs aren't that complex, for instance if you just need to quickly scrape all the raw content from a web page and process it, then you should use the requests library.
For processing the raw content from a crawl, I would recommend Beautiful Soup.
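A small sketch of that requests + Beautiful Soup route; the state-page URL pattern and the selectors are assumptions about the site's markup, not something I've checked:
import requests
from bs4 import BeautifulSoup

states = ["alabama", "alaska", "arizona"]   # etc.

for state in states:
    # Hypothetical URL pattern; confirm it against the real site structure.
    url = f"https://www.realclearpolitics.com/epolls/2020/president/{state}/"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Pull whatever table rows hold the polling numbers; this selector is a guess.
    for row in soup.select("table tr"):
        print(state, [cell.get_text(strip=True) for cell in row.find_all("td")])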
Hope that helps!
I am trying to scrape data from a URL. The data is not in HTML tables, so pandas.read_html() is not picking it up.
The URL is:
https://www.athlinks.com/event/1015/results/Event/638761/Course/988506/Results
The data I'd like to get is a table of gender, age, and time for the past 5K races (name is not really important). The data is presented on the web page 50 rows at a time, across roughly 25 pages.
The site uses various JavaScript frameworks for the UI (Node.js, React). I found this out using the "What Runs" add-on in the Chrome browser.
Here's the real reason I'd like this data: I'm a new runner and will be participating in this 5K next weekend, and I would like to explore some distribution statistics for past races (it's an annual race, and the data goes back to the 1980s).
Thanks in advance!
The data comes from socket.io, and there are Python packages for it. How did I find that?
If you open the Network panel in your browser and choose the XHR filter, you'll find something like:
https://results-hub.athlinks.com/socket.io/?EIO=3&transport=polling&t=MYOPtCN&sid=5C1HrIXd0GRFLf0KAZZi
Look into its content; that is what we need.
Luckily this site ships source maps.
Now you can go to More tools -> Search and search for this domain.
Then find resultsHubUrl in the settings.
That property is used inside setUpSocket.
And setUpSocket is used inside IndividualResultsStream.js and RaseStreams.js.
Now you can press CMD + P and dig down into these files.
So... it took me around five minutes to find this. You can go ahead! Now you have all the necessary tools. Feel free to use breakpoints and read more about the Chrome developer tools.
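As a starting point with the python-socketio client (installed as python-socketio, imported as socketio); the event name and connection details below are assumptions to be confirmed against the XHR traffic you captured:
import socketio

sio = socketio.Client()

@sio.on('message')            # hypothetical event name; watch the real frames to see what the server emits
def handle_message(data):
    print(data)

# The polling transport matches the EIO=3&transport=polling request seen in the Network panel;
# an older python-socketio release may be needed to speak that Engine.IO protocol version.
sio.connect('https://results-hub.athlinks.com', transports=['polling'])
sio.wait()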
You actually need to render the JS in a browser engine before crawling the generated HTML. Have you tried https://github.com/scrapinghub/splash, https://github.com/miyakogi/pyppeteer, or https://www.npmjs.com/package/spa-crawler ? You can also try inspecting the page (F12 -> Network) while it is loading the data relevant to you (from a RESTful API, I suppose), and then make the same calls from the command line using curl or the requests Python library.
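For the rendering route, a minimal pyppeteer sketch might look like this; the wait selector is a placeholder rather than the page's actual markup:
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.athlinks.com/event/1015/results/Event/638761/Course/988506/Results')
    await page.waitForSelector('table')   # placeholder: wait for whatever element holds the results
    html = await page.content()           # fully rendered HTML, ready for bs4 or similar
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(main())
print(len(html))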