Automate .get requests via Python

I have a Python script that scrapes a page and uses the jinja2 templating engine to output the appropriate HTML, which I finally got working thanks to you kind folks and the people of The Coding Den Discord.
I'm looking to automate the .get request I'm making at the top of the file.
I have thousands of URLs I want this script to run on. What's a good way to go about this? I've tried passing an array of URLs, but requests says no to that: it complains that the URL must be a string. So it seems I need to iterate over the compiledUrls variable each time. Any advice on the subject would be much appreciated.

Build a text file with the URLs.
urls.txt
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters1.html
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters2.html
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters3.html
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters4.html
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters5.html
Then read the URLs and process each one:
with open("urls.txt") as file:
for single_url in file:
url = requests.get(single_url.strip())
..... # your code continue here
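
Since there are thousands of URLs, it may also be worth reusing one connection and skipping pages that fail rather than letting a single bad URL stop the run. A rough sketch along those lines (this is not the asker's code; the scraping/templating step is only hinted at in the final comment):

import requests

session = requests.Session()  # reuses TCP connections across requests

with open("urls.txt") as file:
    for single_url in file:
        single_url = single_url.strip()
        if not single_url:
            continue
        try:
            response = session.get(single_url, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f"skipping {single_url}: {exc}")
            continue
        # scrape response.text and render it with the jinja2 template here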

How to use a URL to get .csv data in Python

First post - be gentle!
I am starting to learn Python and would like to get information from a table on a web page (https://en.wikipedia.org/wiki/European_Union#Demographics) into a pandas DataFrame.
I am using Google Colab, and from researching a bit I understand the process has something to do with 'web scraping': turning HTML into .csv.
Any thoughts welcome please. Worth noting I am constrained by not being able to download additional software due to the secure nature of my work.
Thanks.
You need a library to help you parse the HTML; a well-known library for that in Python would be BeautifulSoup.
There are also some available tools online that do this kind of thing for you, and you can take some inspiration from them, even if you can't use them directly: https://wikitable2csv.ggor.de/
As you can see, the website above uses the CSS selector "table.wikitable" to identify the tables.
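For example, a minimal sketch (not part of the original answer) that grabs the first wikitable from that page into a pandas DataFrame and writes it out as CSV; it assumes requests, beautifulsoup4, pandas and lxml are available, which they are by default in Colab:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO

url = "https://en.wikipedia.org/wiki/European_Union"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# same CSS selector the tool above uses to find tables
table = soup.select_one("table.wikitable")

# let pandas parse that single table's HTML into a DataFrame
df = pd.read_html(StringIO(str(table)))[0]
df.to_csv("eu_table.csv", index=False)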
You can use Scrapy, a Python-based scraping framework, to fetch and parse the data as required. In Scrapy, you create spiders which crawl a set of URLs that you have initialized. Furthermore, you can parse the HTML data using something like Beautiful Soup to get your table from the response. The Scrapy documentation itself is pretty useful and should get you set up quickly. Scrapy also lets you export the parsed data as CSV, which should help you with the export part.
All the best!
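A bare-bones spider along those lines might look like the following sketch (the class name, output field and selector are just illustrative placeholders):

import scrapy

class WikiTableSpider(scrapy.Spider):
    name = "wikitable"
    start_urls = ["https://en.wikipedia.org/wiki/European_Union"]

    def parse(self, response):
        # walk the rows of the wikitables and yield the cell texts
        for row in response.css("table.wikitable tr"):
            cells = [c.strip() for c in row.css("th::text, td::text").getall()]
            if cells:
                yield {"cells": cells}

You could then run it with something like scrapy runspider wikitable_spider.py -o table.csv to get the CSV export.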

HTML scraping vs json file in aspnet framework?

I would like to download the data in this table:
http://portal.ujn.gov.rs/Izvestaji/IzvestajiVelike.aspx
I know how to use selenium to go through the pages and the CSS selectors are helpful enough that it shouldn't be too difficult to get all the data...
However, I am curious whether anyone knows some way of getting at the JSON, or whatever intermediary object is used to build the HTML? As in, whatever raw data format file gets exported by the server? Is this possible with ASP.NET frameworks?
I have found such solutions in the past, but with much simpler web pages and web pages with get requests...
Thank you!
Taking a look at the website (I have no experience with Russian at all, but it's not like it matters much), it looks to me like it is pulling the information from a database via PHP (in my book the "old" way of doing it), not a JSON file. Which means that you're basically stuck doing it the normal web scraping route like you said, OR finding a SQL injection (which I am in NO WAY SUGGESTING, as it is illegal) to bypass the limitations of their crappy search page.

Python search script in an HTML webpage

I have an HTML webpage. It has a search textbox. I want to allow the user to search within a dataset. The dataset is represented by a bunch of files on my server. I wrote a python script which can make that search.
Unfortunately, I'm not familiar with how I can tie the HTML page and a Python script together.
The task is to wire a Python script into the HTML page so that:
Python code will be run on the server side
Python code can somehow take the values from the HTML page as input
Python code can somehow put the search results to the HTML webpage as output
Question 1: How can I do this?
Question 2: How should the Python code be stored on the website?
Question 3: How should it take HTML values as input?
Question 4: How can it output the results to the webpage? Do I need to install/use any additional frameworks?
Thanks!
There are too many things to get wrong if you try to implement that by yourself with only what the standard library provides.
I would recommend using a web framework, like flask or django. I linked to the quickstart sections of the comprehensive documentation of both. Basically, you write code and URL specifications that are mapped to the code, e.g. an HTTP GET on /search is mapped to a method returning the HTML page.
You can then use a form submit button to GET /search?query=<param>, with <param> being the user's input. Based on that input, you search the dataset and return a new HTML page with the results.
Both frameworks have template languages that help you put the search results into HTML.
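For instance, a bare-bones Flask sketch of that flow (the /search endpoint, the inline template and the search_dataset stub are made-up placeholders, not the asker's actual code):

from flask import Flask, request, render_template_string

app = Flask(__name__)

PAGE = """
<form action="/search" method="get">
  <input type="text" name="query">
  <button type="submit">Search</button>
</form>
<ul>{% for hit in results %}<li>{{ hit }}</li>{% endfor %}</ul>
"""

def search_dataset(query):
    # stand-in for the asker's existing Python search logic
    return ["dummy result for " + query] if query else []

@app.route("/search")
def search():
    query = request.args.get("query", "")
    return render_template_string(PAGE, results=search_dataset(query))

if __name__ == "__main__":
    app.run(debug=True)  # development server only; use gunicorn/uwsgi in production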
For testing purposes, web frameworks usually come with a simple web server you can use. For production purposes, there are better solutions like uwsgi and gunicorn.
Also, you should consider putting the data into a database; parsing files for each query can be quite inefficient.
I'm sure you will have more questions on the way, but that's what stackoverflow is for, and if you can ask more specific questions, it is easier to provide more focused answers.
I would look at the cgi module in Python.
You should check out Django; it's a very flexible and easy-to-use Python web framework.

Code for web crawling with Python 2.7.3 in mac terminal?

I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website targeting the comments section specifically. Ideally, I'd like to tell python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget on terminal and get all kinds of text from websites which is awesome IF I could actually figure out how to save the individual output html files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
You will most likely encounter this as you go, but in some cases, if the site is employing 3rd party services for comments, like Disqus, you will find that you will not be able to pull the comments down in this manner. Just a heads up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing using Python's string handling functions.
If you don't need to implement it purely in Python, you can make use of wget's recursive mirroring option to handle the content pull, then write your python code to parse the downloaded files.
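For the "save everything into one big .txt file" part, here is a rough Python 3 sketch (despite the 2.7 in the question) that walks a wget-mirrored directory; the mirrored_site folder name and the div.comment selector are placeholders, since the real comment markup is site-specific:

import glob
from bs4 import BeautifulSoup

with open("all_comments.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("mirrored_site/**/*.html", recursive=True):
        with open(path, encoding="utf-8", errors="ignore") as f:
            soup = BeautifulSoup(f, "html.parser")
        # inspect the site to find the real comment markup; "div.comment" is a guess
        for comment in soup.select("div.comment"):
            out.write(comment.get_text(strip=True) + "\n")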
I'll add my two cents here as well.
The first things to check are that you installed Beautiful Soup and that it lives somewhere it can be found. There are all kinds of things that can go wrong here.
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.

How do I output a dynamically generated web page to a .html page instead of a .py CGI page?

So I've just started learning Python on WAMP. I've got the results of an HTML form using cgi and successfully performed a database search with MySQLdb. I can return the results to a page that ends with .py by using print statements in the Python CGI code, but I want to create a webpage that's .html and have that returned to the user, and/or keep them on the same web address when the database search results return.
Thanks,
Paul
Edit: to clarify, on my local machine I see /localhost/search.html in the address bar when I submit the HTML form, and I receive a results page at /localhost/cgi-bin/searchresults.py. I want to see the results on /localhost/results.html or /localhost/search.html. If this was on a public server I'm ASSUMING it would return .../cgi-bin/searchresults.py; the last time I saw /cgi-bin/ directories in a URL was in the 90s. I've glanced at AddHandler, as David suggested, but I'm not sure if that's what I want.
Edit: thanks all of you for your input. Yep, without using frameworks, mod_rewrite seems the way to go, but having looked at that, I decided to save myself the trouble and go with Django with mod_wsgi, mainly because of the size of its user base and amount of docs. I might switch to a lighter/more customisable framework once I've got the basics.
First, I'd suggest that you remember that URLs are URLs and that file extensions don't matter, and that you should just leave it.
If that isn't enough, then remember that URLs are URLs and that file extensions don't matter, and configure Apache to use a different rule to determine that a given file is a CGI program rather than a static file to be served up as is. You can use AddHandler to add a handler for files on disk with a .html extension.
Alternatively, you could use mod_rewrite to tell Apache that …/foo.html means …/foo.py
Finally, I'd suggest that if you do muck around with what URLs look like, you remove any sign of something that looks like a file extension (so that …/foo is requested rather than …/foo.anything).
As for keeping the user on the same address for results as for the request … that is just a matter of having the program output the basic page without results if it doesn't get the query string parameters that indicate a search term had been passed.
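For reference, the AddHandler and mod_rewrite approaches mentioned above would look roughly like this in the Apache configuration (the directives are real, the directory path and rewrite target are placeholders):

# Option 1: have Apache treat .html files in this directory as CGI scripts
<Directory "/var/www/html">
    Options +ExecCGI
    AddHandler cgi-script .html
</Directory>

# Option 2: mod_rewrite, serving requests for foo.html from the CGI script foo.py
RewriteEngine On
RewriteRule ^(.*)\.html$ /cgi-bin/$1.py [PT]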
