Downloading a file using wget or requests - Python

I am automating the process of downloading the CSV file from Google Trends after performing a search. It can be done by clicking the download button using Selenium, but I am thinking of doing it with requests. When I click the button, a request is definitely sent to some URL; is there any way I can fetch that URL?
The url looks like this:
https://trends.google.com/trends/api/widgetdata/multiline/csv?req=%7B%22time%22%3A%222021-03-09%202022-03-09%22%2C%22resolution%22%3A%22WEEK%22%2C%22locale%22%3A%22en-US%22%2C%22comparisonItem%22%3A%5B%7B%22geo%22%3A%7B%22country%22%3A%22PK%22%7D%2C%22complexKeywordsRestriction%22%3A%7B%22keyword%22%3A%5B%7B%22type%22%3A%22BROAD%22%2C%22value%22%3A%22bahria%20town%20karachi%22%7D%5D%7D%7D%5D%2C%22requestOptions%22%3A%7B%22property%22%3A%22%22%2C%22backend%22%3A%22IZG%22%2C%22category%22%3A29%7D%7D&token=APP6_UEAAAAAYioYw6KoUJcW4_Xv6d4fc4sJn3swSoON&tz=-300
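For what it's worth, once that URL has been copied from the browser's Network tab, fetching it with requests is straightforward. A minimal sketch, assuming the captured URL is still fresh (the token query parameter is session-bound and expires, so it has to be re-captured for each session):

import requests

# URL copied from the browser's dev tools; the req and token query
# parameters are the ones captured for this particular search session.
csv_url = "https://trends.google.com/trends/api/widgetdata/multiline/csv?req=...&token=...&tz=-300"

resp = requests.get(csv_url)
resp.raise_for_status()
with open("trends.csv", "wb") as f:
    f.write(resp.content)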

Related

Trying to download full HTML pages

I am trying to download a few hundred HTML pages in order to parse them and calculate some measures.
I tried it with Linux wget, and with a loop of the following code in Python:
import urllib.request

url = "https://www.camoni.co.il/411788/168022"
html = urllib.request.urlopen(url).read()
but the HTML file I get doesn't contain all the content I see in the browser on the same page. For example, text I see on the screen is not found in the HTML file. Only when I right-click the page in the browser and choose "Save As" do I get the full page.
The problem: I need a large number of pages and cannot do it by hand.
URL example: https://www.camoni.co.il/411788/168022 - the last number changes
Thank you
That's because the site is not static. It uses JavaScript (in this example the jQuery library) to fetch additional data from the server and paste it into the page.
So instead of trying to GET the raw HTML, you should inspect the requests in the developer tools. There's a POST request to https://www.camoni.co.il/ajax/tabberChangeTab with data like this:
tab_name=tab_about
memberAlias=ד-ר-דינה-ראלט-PhD
currentURL=/411788/ד-ר-דינה-ראלט-PhD
And the result is HTML that gets pasted into the page afterwards.
So instead of trying to just download the page, you should inspect the page and its requests to get the data, or use a headless browser such as Google Chrome to emulate the 'Save As' button and save the data.
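Following that advice, a minimal sketch of replaying the observed POST with requests (the field values are the ones shown above; the endpoint may additionally expect session cookies or headers, which would also need to be copied from dev tools):

import requests

data = {
    "tab_name": "tab_about",
    "memberAlias": "ד-ר-דינה-ראלט-PhD",
    "currentURL": "/411788/ד-ר-דינה-ראלט-PhD",
}
resp = requests.post("https://www.camoni.co.il/ajax/tabberChangeTab", data=data)
print(resp.text)  # the HTML fragment that the page pastes in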

Automatically Upload CSV to Webapp Upload Form

How can I make a bot which will automatically go to a file upload form on my webapp at a public link, click the upload a file button, select a file and submit it?
For example, I have a file called "stored.csv" on my desktop, and I have a webapp with an upload form that looks like this:
All I'm trying to do is have a script which can grab that stored.csv file, go to the public link (http://website.com/upload/) that takes you to this page and then submit the file so that it all happens automatically when the script is run.
It would be much easier to send the POST request that the button triggers directly.
All you need to do is:
On that page, open the dev tools in your browser (F12 most likely)
In the window that appears, click on the "Network" tab
Then, leaving this window open, choose any file and click "submit"
A new record will appear at the end of the "Network" tab containing information about the request that was made
Knowing the request you need to make, you can easily implement it in Python:
import requests as req

url = "Url that you will acquire"
data = {
    "smth": "path/to/file"  # just copy the body from the known request
}
res = req.post(url=url, data=data)
print(res.status_code)  # Response has no .status attribute; use .status_code
And that's it.
There is some stuff that you'll need to figure out on your own, but now you have a map.
Hope this helps!
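One caveat: if the captured request turns out to be a multipart/form-data upload (which file-upload buttons usually are), the file contents should go through requests' files parameter rather than data. A rough sketch, where the field name "file" is an assumption; copy the real one from the captured request:

import requests

# "file" is a hypothetical field name; use the one shown in the Network tab.
with open("stored.csv", "rb") as f:
    res = requests.post("http://website.com/upload/", files={"file": f})
print(res.status_code)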

Scraping a secure website requiring clicks on javascript links

I have a daily task at work to download some files from an internal company website. The site requires a login, but the main URL is something like:
https://abcd.com
But when I open that in the browser, it redirects to something like:
https://abcdGW/ln-eng.aspx?lang=eng&lnid=e69d5d-xxx-xxx-1111cef&regl=en-US
Normally my task is to open this site, log in, click some links back and forth, and download some files. This takes me 10 minutes every day, but I want to automate it using Python. Using my basic knowledge I have written the code below:
import http.cookiejar
import urllib3
import requests
from bs4 import BeautifulSoup

url = "https://abcd.com"
redirectURL = requests.get(url).url
jar = http.cookiejar.CookieJar(policy=None)
pool = urllib3.PoolManager()  # renamed so it no longer shadows the http module
acc_pwd = {'datasouce': 'Data1', 'user': 'xxxx', 'password': 'xxxx'}
response = pool.request('GET', redirectURL)
soup = BeautifulSoup(response.data, 'html.parser')
r = requests.get(redirectURL, cookies=jar)
r = requests.post(redirectURL, cookies=jar, data=acc_pwd)
print("RData %s" % r.text)
This shows that I am able to log in successfully. The next step is where I am stuck. On the page after login there are some links on the left side, one of which I need to click. When I inspect them in Chrome, I see:
<a href="javascript:__doPostBack('myAppControl$menu_itm_proj11','')"><div class="menu-cell">
<img class="menu-image" src="images/LiteMenu/projects.png" style="border-width:0px;"><span class="menu-text">Projects</span></div></a>
This is presumably a JavaScript link. I need to click it, then on the new page another link, then another to download a file, and then go back to the main page and do it all over again to download different files.
I would be grateful to anyone who can help or suggest.
Thanks to Chris, I was able to complete this.
First, using the requests library, I got the redirect URL:
redirectURL = requests.get(url).url
After that I used Scrapy and Selenium to click the links and download the files.
With Selenium driving the browser, it was quite simple.
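Since __doPostBack links have no plain href to follow, one way to "click" them with Selenium is to run the same JavaScript call directly. A minimal sketch, assuming a Chrome driver and that the login steps have already been performed in this driver session:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://abcd.com")
# ... perform the login steps here ...
# Trigger the same postback the menu link would fire:
driver.execute_script("__doPostBack('myAppControl$menu_itm_proj11','')")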

How to fix HTML downloading instead of image file

I'm trying to download a file from a link using urllib in Python 3.7, and it downloads the HTML file instead of the image file.
I'm trying to receive information from a Google Form; the information is sent to a Google Sheet, and I'm able to receive it in the sheet no problem. However, the Form requires an image submission, which appears in the sheet as a URL. (Example: https://drive.google.com/open?id=1YCBmEOz6_l7WDQw5t6AYBSb9B5XXKTuX)
This is my code:
import urllib.request
import random
Then I create a download function:
def downloader(image_url):
    file_name = random.randrange(1, 10000)
    full_file_name = str(file_name) + '.png'
    print(full_file_name)
    urllib.request.urlretrieve(image_url, full_file_name)
I get the URL and isolate the ID of the image:
ImgId="https://drive.google.com/open?id=1Mp5XYoyyEfWJryz8ojLbHuZ6V0IzERIV"
ImgId=ImgId[33:]
Then I put the ID in a download link:
ImgId="https://drive.google.com/uc?authuser=0&id="+ImgId+"&export=download"
Which results in (in the above example) "https://drive.google.com/uc?authuser=0&id=1YCBmEOz6_l7WDQw5t6AYBSb9B5XXKTuX&export=download".
Next I run the download function:
downloader(ImgId)
After this I expected the PNG file to be downloaded into the program's folder; however, it downloaded an HTML file of the Google Drive log-in page instead of an image file (or even an HTML file of the image). Noting that viewing or downloading the image in the browser requires being signed in to Google, could authorization be the issue?
(Note: If I manually paste the download link as generated by the program into my browser it downloads the image correctly)
(P.S I'm an absolute noob, so yeah)
(Thanks in advance for any answers)
Instead of using urllib for downloading, use requests: get the page contents with a GET call, convert the response content to soup using BeautifulSoup, and then point to the content you want to download. The download element in the HTML will have a download link associated with it; send a GET request again to that link to fetch the actual file.
import requests
import bs4
response = requests.get(<your_url>)
soup = bs4.BeautifulSoup(response.content, 'html5lib')
# Get the download link and supply all the necessary values to the link
# Initiate Requests again
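For Google Drive specifically, a commonly used alternative is to hit the uc?export=download endpoint with a requests session and handle the confirmation token Drive returns for files it won't virus-scan. A sketch of that pattern; note the download_warning cookie name is how the token has historically been exposed and may change, and files that require sign-in will still return the login page:

import requests

def download_drive_file(file_id, destination):
    URL = "https://drive.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={"id": file_id}, stream=True)
    # Large or unscanned files answer with a confirmation token in a cookie.
    for key, value in response.cookies.items():
        if key.startswith("download_warning"):
            response = session.get(URL, params={"id": file_id, "confirm": value}, stream=True)
            break
    with open(destination, "wb") as f:
        for chunk in response.iter_content(32768):
            if chunk:
                f.write(chunk)

download_drive_file("1YCBmEOz6_l7WDQw5t6AYBSb9B5XXKTuX", "image.png")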

How to get the PDF file that downloads when I click 'submit', which also redirects me to a new page

I am using mechanize to automatically download some pdf documents from webpages. When there is a pdf icon on the page, I can do this to get the file:
b.find_link(text="PDF download")
req = b.click_link(text="PDF download")
b.open(req)
Then I just write it to a new file.
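For reference, writing it out might look like this (the filename here is arbitrary):

with open("document.pdf", "wb") as f:
    f.write(b.response().read())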
However, for some of the documents I need, there is no direct 'PDF download' link on the page. Instead I have to click a 'submit' button to make a "delivery request" for the document: after clicking this button, the download starts while I am taken to another page which says "delivery request in progress" and then, once the download has finished, "Your delivery request is complete".
I have tried using mechanize to click the submit button, and then save the file that downloads by doing this:
b.select_form(nr=0)
b.submit()
downloaded_file = b.response().read()
but this stores the HTML of the page I am redirected to, not the file that downloads.
How do I get the file that downloads after I click 'submit'?
For anyone with a similar problem, I found a workaround: mechanize emulates a browser without JavaScript, so I turned JavaScript off in my own browser too. When I then went to the download page, I could see a link saying 'if the download hasn't already started, click here to download'. I could just get mechanize to find that link and follow it in the normal way, then write the response to a new file.
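In code, that workaround might look roughly like this, assuming the fallback link's wording (the exact text on the page may differ):

b.select_form(nr=0)
b.submit()
# With JavaScript unavailable, the page exposes a plain fallback link.
req = b.click_link(text_regex=r"click here to download")
response = b.open(req)
with open("document.pdf", "wb") as f:
    f.write(response.read())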
