I am attempting to download a file from a link, parse it, then save specific data to my Heroku database. I have successfully set up my Selenium Chrome webdriver and I am able to log in. Normally, when I get the URL, the file begins downloading automatically. I have set up a new directory on Heroku for the file to be saved to, but the file does not appear there or anywhere else.
I have tried different ways of setting the download directory and other methods of logging in to the website, and it works locally, but not in Heroku production.
# importing libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
import datetime
from datetime import timedelta
import os
import json
import csv
# temporary credentials to later be stored
# as env vars
user = "user"
psw = "pasw"
account = 'account'
# this is the directory to download the file
file_directory = os.path.abspath('files')
# making this directory the default chrome web driver directory
options = webdriver.ChromeOptions()
prefs = {
    "download.default_directory": file_directory
}
options.add_experimental_option('prefs', prefs)
# setting up web driver
driver = webdriver.Chrome(chrome_options=options)
# logging in to pinterest
url_login = 'https://www.pinterest.com/login/?referrer=home_page'
driver.get(url_login)
username = driver.find_element_by_id("email")
username.send_keys(user)
password = driver.find_element_by_id("password")
password.send_keys(psw)
driver.find_element_by_id("password").send_keys(Keys.ENTER)
# sleep 20 sec so page loads fully
time.sleep(20)
# collect metrics for yesterday
yesterday = datetime.date.today() - datetime.timedelta(days=1)
yesterday = str(yesterday)
# download link for metrics
url = "https://analytics.pinterest.com/analytics/profile/" + account + "/export/?application=all&tab=impressions&end_date=" + yesterday + '&start_date=' + yesterday
driver.get(url)
# setting up file identification for pinterest CSV file
date = datetime.date.today() - datetime.timedelta(days=2)
date = str(date)[:10]
file_location = os.path.join(file_directory,'profile-'+account+'-impressions-all-'+date+'.csv')
# opening up file
test_list = []
with open(file_location, newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        test_list.append(row)
# gathering relevant metrics for yesterday
this_list = test_list[1:3]
# re-organizing metrics
this_dict = {}
i = 0
while i < len(this_list[0]):
    this_dict[this_list[0][i]] = this_list[1][i]
    i += 1
# note: this return only works if the code above is wrapped in a function
return this_dict
driver.close()
I expect the driver.get(...) call on the export URL ("https://analytics.pinterest.com/analytics/profile/" + account + "/export/?application=all&tab=impressions&end_date=" + yesterday + '&start_date=' + yesterday) to download the CSV to the directory I have specified, but it does not. I have used heroku run bash and searched the filesystem, but the file is nowhere to be found.
UPDATE I do NOT need to store the file permanently. I need to store it temporarily and parse it. I understand that on a dyno restart it will all be lost.
UPDATE I have solved this with another method. I passed the cookies and headers from the Selenium session to a requests session, using a 'User-Agent' of a Chrome browser on Linux. I then assigned the response to a variable (csv_file = s.get(url)), split its text into lines, joined them back into one big string with .join(), and parsed that string on the delimiters that would normally separate CSV fields. I now have the relevant metrics.
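For reference, a minimal sketch of that approach, keeping the download in memory instead of on the dyno filesystem; the cookie hand-off and the export URL come from the code above, while the session setup and the io.StringIO parsing are assumptions:
import csv
import io
import requests

# assumption: `driver` is the logged-in Selenium session from the code above
s = requests.Session()
for cookie in driver.get_cookies():
    s.cookies.set(cookie['name'], cookie['value'])
s.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})

# `url` is the same export URL built above
resp = s.get(url)
resp.raise_for_status()

# parse the CSV text in memory, without touching the dyno filesystem
rows = list(csv.reader(io.StringIO(resp.text)))
this_dict = dict(zip(rows[0], rows[1]))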
The thing you're missing is that heroku run bash will start a different dyno, with no access to the filesystem of the one that downloaded the file.
It's fine to use the Heroku filesystem as temporary storage for actions within the same process. But if you need access to stored files from a separate process, you should use something else, e.g. S3.
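If a separate dyno or one-off process does need the file, here is a minimal boto3 sketch of that S3 hand-off; the bucket name and key are placeholders, and credentials are assumed to come from Heroku config vars:
import boto3

# assumption: AWS credentials and region are provided via config vars
s3 = boto3.client('s3')

# push the freshly downloaded CSV somewhere durable
s3.upload_file(file_location, 'my-bucket', 'pinterest/metrics.csv')

# any other dyno or one-off process can pull it back down later
s3.download_file('my-bucket', 'pinterest/metrics.csv', '/tmp/metrics.csv')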
Related
I'm trying to log COVID data from a website and update it each day with new cases. So far I have managed to write the case numbers to the file through scraping, but each day I have to manually enter the dates and re-run the file to get the updated statistics. How would I go about writing a script that updates the CSV each day with the new date and the new number of cases, while keeping the old ones for future use? I wrote the code below and run it in Visual Studio Code.
import csv
import bs4
import urllib
from urllib.request import urlopen as uReq
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
#For sites that can't be opened due to Urllib blocker, use a Mozilla User agent to get access
pageRequest = Request('https://coronavirusbellcurve.com/', headers = {'User-Agent': 'Mozilla/5.0'})
htmlPage = urlopen(pageRequest).read()
page_soup = soup(htmlPage, 'html.parser')
specificDiv = page_soup.find("div", {"class": "table-responsive-xl"})
TbodyStats = specificDiv.table.tbody.tr.contents
TbodyDates = specificDiv.table.thead.tr.contents
def writeCSV():
    with open('CovidHTML.csv', 'w', newline='') as file:
        theWriter = csv.writer(file)
        theWriter.writerow(['5/8', ' 5/9', ' 5/10', ' 5/11', ' 5/12'])
        row = []
        for i in range(3, len(TbodyStats), 2):
            row.append([TbodyStats[i].text])
        theWriter.writerow(row)

writeCSV()
If you want to preserve the older contents of the CSV file, open the file in append mode (as correctly pointed out by @bfris):
with open('CovidHTML.csv','a', newline= '') as file:
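A minimal sketch of the daily append, reusing TbodyStats from the question's scraping code; the date format and the one-row-per-day layout are assumptions:
import csv
import datetime

def appendDailyRow():
    # one row per run: today's date followed by the scraped case numbers
    today = datetime.date.today().strftime('%m/%d')
    cases = [TbodyStats[i].text for i in range(3, len(TbodyStats), 2)]
    with open('CovidHTML.csv', 'a', newline='') as file:
        theWriter = csv.writer(file)
        theWriter.writerow([today] + cases)

appendDailyRow()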
If you are using Linux, you can set up a cron job to invoke the python script every day at some specific time.
First, locate the path to python using the which command:
$ which python3
This gave me
/usr/bin/python3
Then the cron job will look like:
10 14 * * * /usr/bin/python3 /path/to/python/file.py
Add this line to the crontab file. It will call the Python script every day at 2:10 PM.
You can take a look here for details.
In case you are using Windows, you can take a look at this question.
I am writing a script that will upload a file from my local machine to a webpage. This is the URL: https://tutorshelping.com/bulkask, and it has an upload option, but I can't work out how to upload the file.
My current script:
import webbrowser, os
def fileUploader(dirname):
    mydir = os.getcwd() + dirname
    filelist = os.listdir(mydir)
    for file in filelist:
        filepath = os.path.join(mydir, file)  # this is the file's absolute path
        print(filepath)
        url = "https://tutorshelping.com/bulkask"
        webbrowser.open_new(url)  # open in default browser
        webbrowser.get('firefox').open_new_tab(url)

if __name__ == '__main__':
    dirname = '/testdir'
    fileUploader(dirname)
A quick solution would be to use something like AppRobotic personal macro software to either interact with the Windows pop-ups and apps directly, or just use X,Y coordinates to move the mouse, click on buttons, and then send keyboard keys to type or tab through your files.
Something like this would work when tweaked, so that it runs at the point when you're ready to click the upload button and browse for your file:
import win32com.client
x = win32com.client.Dispatch("AppRobotic.API")
import webbrowser
# specify URL
url = "https://www.google.com"
# open with default browser
webbrowser.open_new(url)
# wait a bit for page to open
x.Wait(3000)
# use UI Item Explorer to find the X,Y coordinates of Search box
x.MoveCursor(438, 435)
# click inside Search box
x.MouseLeftClick
x.Type("AppRobotic")
x.Type("{ENTER}")
I don't think the Python webbrowser package can do anything else than open a browser / tab with a specific url.
If I understand your question well, you want to open the page, set the file to upload and then simulate a button click. You can try pyppeteer for this.
Disclaimer: I have never used the Python version, only the JS version (puppeteer).
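A rough sketch of how that could look with pyppeteer; the CSS selectors for the file input and the submit button are guesses and need to be checked against the actual page markup:
import asyncio
from pyppeteer import launch

async def upload(filepath):
    browser = await launch(headless=False)
    page = await browser.newPage()
    await page.goto('https://tutorshelping.com/bulkask')
    # assumption: the page exposes a plain <input type="file"> element
    file_input = await page.querySelector('input[type="file"]')
    await file_input.uploadFile(filepath)
    # assumption: the form is submitted with a regular submit button
    await page.click('button[type="submit"]')
    await browser.close()

asyncio.get_event_loop().run_until_complete(upload('/testdir/question.txt'))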
I have a truckload of trace files I'm trying to catalog. The idea is to open each one with "chrome://tracing" then save a screenshot. Screenshots are easy to catalog.
Here is the process:
start chrome = works
open "chrome://tracing" = works
open file <== missing part <- I need help with
save screenshot = works
There are 2 ways to open the file in chrome://tracing:
a) - use the "load" button, navigate to file and open
Update: I was able to locate and click on the "Load" button using Selenium
Now - need to handle the file open / loading ??
b) - drag and drop a trace file to the main part of the window - opens it
[ no idea how to do this..]
Here is the actual code I have so far:
import time
import datetime
from selenium import webdriver

driver = webdriver.Chrome()  # optional argument; if not specified, the driver is searched for on PATH
driver.get("chrome://tracing");
time.sleep(2) # Let the user actually see something
# Find load button
# or drop file to main window ?
# Send the file location to the button
file_location = 'C:\........json'
driver.send_keys(file_location) # don't know where to sent it :: idea from https://towardsdatascience.com/controlling-the-web-with-python-6fceb22c5f08
time.sleep(15) # some files are big - might take 15 seconds to load
date_stamp = str(datetime.datetime.now()).split('.')[0]
date_stamp = date_stamp.replace(" ", "_").replace(":", "_").replace("-", "_")
file_name = date_stamp + ".png"
driver.save_screenshot(file_name)
After some research and trial and error, here is my final(?) working code. It:
located the "Load" button and opened the file Open dialog
used pywinauto to handle communication with the Open dialog
saved a screenshot, using a unique filename generated from a datestamp
import time
from selenium import webdriver
from pywinauto.application import Application
import datetime
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = webdriver.Chrome(chrome_options=options)
driver.get("chrome://tracing");
time.sleep(2)
# Find load button
sdomele = driver.find_element_by_tag_name("tr-ui-timeline-view")
ele = driver.execute_script("return arguments[0].shadowRoot;",sdomele)
button_found = ele.find_element_by_id("load-button")
button_found.click() # let's load that file
time.sleep(3)
# here comes the pywinauto part, which handles communication with the Open file dialog
app = Application().connect(title='Open') # connect to an existing window
dlg = app.window(title='Open') # communicate with this window
#file_location = os.path.join(submission_dir, folder, file_name)
file_location = "C:\\FILES2OPEN\\file01.json"
app.dlg.type_keys(file_location) # txt goes to the "File Name" box
time.sleep(2) #type is slow - this is just for safety
app.dlg.OpenButton.click() # click the open button
time.sleep(15) # some files might be big
# generate filename based on current time
date_stamp = str(datetime.datetime.now()).split('.')[0]
date_stamp = date_stamp.replace(" ", "_").replace(":", "_").replace("-", "_")
file_name = date_stamp + ".png"
driver.save_screenshot(file_name) # save screenshot (just the "inner" part of the browser window / not a full screenshot)
time.sleep(2)
driver.quit()
The reason you were not able to find the load button is that it's present in a shadow DOM. So first you need to get hold of the shadow DOM using execute_script, then locate the "Load" button as usual. The following code worked for me:
sdomele = _driver.find_element_by_tag_name("tr-ui-timeline-view")
ele = _driver.execute_script("return arguments[0].shadowRoot;",sdomele)
ele.find_element_by_id("load-button").click()
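If the Load button just forwards to a hidden <input type="file"> inside the same shadow root (an assumption I have not verified against chrome://tracing), you might be able to skip the OS dialog and pywinauto entirely by sending the path straight to that input:
sdomele = driver.find_element_by_tag_name("tr-ui-timeline-view")
ele = driver.execute_script("return arguments[0].shadowRoot;", sdomele)
# hypothetical: locate the file input backing the Load button and feed it the path
file_input = ele.find_element_by_css_selector('input[type="file"]')
file_input.send_keys("C:\\FILES2OPEN\\file01.json")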
I copied some Python code in order to download data from a website. Here is my specific website:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1
Here is the code which I copied:
import requests
from bs4 import BeautifulSoup
def _getUrls_(res):
    hrefs = []
    soup = BeautifulSoup(res.text, 'lxml')
    main_content = soup.find('div', {'id': 'content-core'})
    table = main_content.find("table")
    for a in table.findAll('a', href=True):
        hrefs.append(a['href'])
    return hrefs
bidurl = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
r = requests.get(bidurl)
hrefs = _getUrls_(r)
def _getPdfs_(hrefs, basedir):
    for i in range(len(hrefs)):
        print(hrefs[i])
        respdf = requests.get(hrefs[i])
        pdffile = basedir + "/pdf_dot/" + hrefs[i].split("/")[-1] + ".pdf"
        try:
            with open(pdffile, 'wb') as p:
                p.write(respdf.content)
        except FileNotFoundError:
            print("No PDF produced")
basedir= "/Users/ABC/Desktop"
_getPdfs_(hrefs, basedir)
The code runs successfully, but it does not download anything at all, even though no FileNotFoundError is raised.
I tried the following two URLs:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-088a-035-20360
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-r100-258-21125
However, both of these URLs just print No PDF produced.
The thing is that the code worked and downloaded the files successfully for other people, but not for me.
Your code works; I just tested it. You need to make sure that basedir exists, so add this to your code:
import os

if not os.path.exists(basedir):
    os.makedirs(basedir)
I used this exact (indented) code but replaced basedir with my own directory, and it worked only after I made sure that the path actually exists. The original code does not create the folder if it does not exist.
As others have pointed out, you need to create basedir beforehand. The user running the script may not have the directory created. Make sure you insert this code at the beginning of the script, before the main logic.
Additionally, hardcoding the base directory might not be a good idea when transferring the script to different systems. It would be preferable to use the user's %USERPROFILE% environment variable:
from os import environ
from os.path import join

basedir = join(environ["USERPROFILE"], "Desktop", "pdf_dot")
Which would be the same as C:\Users\blah\Desktop\pdf_dot.
However, the above environment variable only exists on Windows. If you want it to work on Linux, you will have to use os.environ["HOME"] instead.
If you need to transfer between both systems, then you can use os.name:
from os import name
from os import environ
from os.path import join

# Windows
if name == 'nt':
    basedir = join(environ["USERPROFILE"], "Desktop", "pdf_dot")
# Linux
elif name == 'posix':
    basedir = join(environ["HOME"], "Desktop", "pdf_dot")
You don't need to specify the directory or create any folder manually. All you need to do is run the following script. When the execution is done, you should get a folder named pdf_dot on your desktop containing the PDF files you wish to grab.
import requests
from bs4 import BeautifulSoup
import os
URL = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
dirf = os.path.join(os.environ['USERPROFILE'], 'Desktop', 'pdf_dot')
if not os.path.exists(dirf):
    os.makedirs(dirf)
os.chdir(dirf)
res = requests.get(URL)
soup = BeautifulSoup(res.text, 'lxml')
pdflinks = [itemlink['href'] for itemlink in soup.find_all("a",{"data-linktype":"internal"}) if "reject" not in itemlink['href']]
for pdflink in pdflinks:
    filename = f'{pdflink.split("/")[-1]}{".pdf"}'
    with open(filename, 'wb') as f:
        f.write(requests.get(pdflink).content)
I actually want my bookmarks for a text classifier. It needs data in .json format, so I want a Python script that will retrieve the data from the bookmarks directory and store it in a .json file. (I am using Ubuntu.)
Google Chrome already saves bookmarks in a form of JSON. Your question does not define the desired outcome, so here is simple code to access and print the whole file of saved Google Chrome bookmarks on Windows. You will need to make some adjustments, as the code is designed to run on Windows rather than Ubuntu, which I do not have access to at the moment.
import getpass
import json
user = getpass.getuser()
loc = "C:/Users/{}/AppData/Local/Google/Chrome/User Data/Default/Bookmarks.bak".format(user)
f = open(loc, encoding="utf8")
data = json.load(f)
print(data)
Edit:
import getpass
import json
user = getpass.getuser()
loc = "C:/Users/{}/AppData/Local/Google/Chrome/User Data/Default/Bookmarks.bak".format(user)
with open(loc, encoding="utf8") as f:
    data = json.load(f)
for y in range(0, 100):
    try:
        for x in data["roots"]["bookmark_bar"]["children"][y]["children"]:
            print(x["url"])
    except:
        pass
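Since the question mentions Ubuntu: on Linux, Chrome keeps the same JSON file under the user's config directory, so only the path changes (the profile name Default is an assumption; adjust it if you use a different profile). The data can then be written back out as a standalone .json file for the classifier:
import json
import os

# typical Chrome bookmarks location on Ubuntu; "Default" is the profile folder
loc = os.path.expanduser("~/.config/google-chrome/Default/Bookmarks")

with open(loc, encoding="utf8") as f:
    data = json.load(f)

# dump the whole bookmark tree to a .json file
with open("bookmarks_export.json", "w", encoding="utf8") as out:
    json.dump(data, out, indent=2)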