I am attempting to retrieve information from a health inspection website, parse the data into variables, and then possibly save the records to a file. I suppose I could use a dictionary to store the information from each business.
The website in question is: http://www.swordsolutions.com/Inspections.
Clicking [Search] on the website will start displaying information.
I need to be able to pass some search data to the website, and then parse the information that is returned into variables and then to files.
I am fetching the website to a file using:
import urllib.request  # plain `urllib.urlopen` in Python 2

# Fetch the page and save the raw HTML to a local file.
u = urllib.request.urlopen('http://www.swordsolutions.com/Inspections')
data = u.read()
with open('data.html', 'wb') as f:
    f.write(data)
This is the data retrieved by urllib: http://bpaste.net/show/126433/. It currently does not show anything useful.
Any ideas?
I'll just point you in the right direction.
You want to submit a form with several pre-defined field values and then parse the data returned. The next steps depend on how easy it is to automate that form POST request.
You have plenty of options here:
use your browser's developer tools to analyze what happens when you click "Search"; if it turns out to be a simple POST request, simulate it using urllib2, requests, mechanize, or whatever you like (see the sketch after this list)
give Scrapy and its FormRequest class a try
use a real automated browser with the help of Selenium: fill in the fields, click submit, then get and parse the data, all with the same tool
Basically, if there is a lot of JavaScript logic involved in the form submission process, you'll have to go with an automated browsing tool like Selenium.
Plus, note that there are several tools for parsing HTML: BeautifulSoup, lxml.
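For the simple-POST case, here is a minimal sketch using requests and BeautifulSoup. The form field names in the payload are placeholders, not the site's real ones; copy the actual names and values from the request your browser's developer tools show when you click Search.

import requests
from bs4 import BeautifulSoup

# Placeholder field names -- replace with the real ones from dev tools.
payload = {
    'EstablishmentName': '',
    'City': 'Springfield',
}

r = requests.post('http://www.swordsolutions.com/Inspections', data=payload)
r.raise_for_status()

# Parse whatever table the results come back in; the selector is a guess.
soup = BeautifulSoup(r.text, 'html.parser')
for row in soup.select('table tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(cells)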
Also see:
Web scraping with Python
Hope that helps.
What I am trying to do...
I am trying to automate the download of a zip file from a URL that does not redirect anywhere, but instead opens a "Save as" prompt the moment you open it.
What I have tried...
"Urllib request", "Wget", and "Requests" libraries are all giving me a 1KB file which in a text editor reads "Invalid request". This could make sense as the Website URL I am inputting is blank by default, and I don't believe its redirecting the URL to anywhere as I had "allow_redirects=True" using the "Requests" Library. I believe this link is using JavaScript to redirect to the "Save as" and when I click it and head to downloads (In Chrome) and see that there is a download link for this file. This download link appears to always work but I am unsure how to grab it with Python.
Leads...
I have found a lead on Stack Overflow about using the library "Spynner", but I am not sure HOW and WHY that would solve my problem.
I am using Python 3.8.2
You need a web scraping tool. They usually provide headless browsers and everything you need to imitate human behaviour. I would recommend Selenium, because you can use it from Python directly; here is an example: File managing Selenium.
Be careful: web scraping is not always legal, so you should have authorization before using it on any web service. Proceed with caution.
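For reference, here is a minimal sketch of the Selenium approach, assuming chromedriver is installed and using a placeholder download URL. It tells Chrome to save files into a known folder instead of showing the "Save as" prompt:

import os
import time
from selenium import webdriver

download_dir = os.path.abspath('downloads')
os.makedirs(download_dir, exist_ok=True)

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    'download.default_directory': download_dir,  # save here instead of prompting
    'download.prompt_for_download': False,
})

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com/file.zip')  # placeholder URL
    time.sleep(10)  # crude wait for the download to finish
finally:
    driver.quit()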
Like Juan said, I just needed to use a web scraping tool. After learning Selenium, I was able to get past the "Save as" requirement.
My question is similar to this one but I have a few concerns:
1) How do we query for a photo after typing a few keywords?
2) How do we take a random photo out of the results?
3) How do I display the photo that comes up?
4) How can I do all of this without opening a browser?
I would appreciate any help you could give me.
(this is all in Python 3)
OK, your question is a bit broad for Stack Overflow, so it's best if you ask about one small point at a time. I'll provide some resources below so you know where to start your research; then, when you have questions while implementing each step, post detailed questions about them.
You can also use this as a guide for structuring your project.
5) IMPORTANT This is actually the most important step: if you are going to be scraping or grabbing a large number of images from a website, make sure the site you are grabbing from allows you to do it! Check out this article here about scraping etiquette, so you don't get into any trouble.
1) For number 1, the process consists of two steps. The first step is accepting the string input from the user; you can take a look at this tutorial here, or search for how to accept and handle user input. The second step is feeding that input into what is called a 'webdriver'. The webdriver allows you to automate a web browser to browse the web for you and query for images (e.g. through Google Images). You can find some resources about the Chrome webdriver here or online.
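As a minimal sketch of both halves, assuming chromedriver is installed (the Google Images query string is one common choice):

from urllib.parse import quote_plus
from selenium import webdriver

# Step 1a: accept a search string from the user.
query = input('Search for images of: ')

# Step 1b: have the webdriver open an image search for that string.
driver = webdriver.Chrome()
driver.get('https://www.google.com/search?tbm=isch&q=' + quote_plus(query))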
2) This step is much easier than step one; just consult the question you posted yourself on how to extract the URLs: link. You could make a list of links from the results and grab an image from a random link (a short snippet follows the code below). To add to that, for Python 3 the function you need is urllib.request.urlopen(url). Best practice makes use of the code referenced here:
import urllib.request
import shutil
...
# Download the file from `url` and save it locally under `file_name`:
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
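For the random pick, something like this works, assuming `urls` is the list of image links you collected in step 1:

import random

url = random.choice(urls)   # `urls` is assumed to come from step 1
file_name = 'result.jpg'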
3) This step involves using a library to display images from Python; you can find a question detailing it here. The details depend on which OS you are using, but there are helpful posts on Stack Overflow for each.
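For example, a minimal sketch with Pillow (pip install Pillow), which hands the saved file to the OS default image viewer:

from PIL import Image

# `file_name` is the file saved in step 2.
Image.open(file_name).show()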
4) If you want to do this without opening a browser, look up how to run the WebDriver without a GUI display (headless), either on the web or on Stack Overflow.
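With Chrome, that boils down to one extra option (the flag name has varied across Chrome versions; '--headless=new' is the recent form):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # no visible browser window
driver = webdriver.Chrome(options=options)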
Good luck!
In reference to my question: how would one input data into, and retrieve data from, various websites (without using an API)?
Is there a module that searches or acts like a human, filling in the given search fields in order to (as said before) retrieve data?
Sorry if my question is hard to follow; here's an example of what I am trying to accomplish:
Directing an AI towards a specific website.
Inputting data into the search field.
Then finally, retrieving said data once the previous steps have run.
I'm fairly new to the field of manipulating websites via APIs or other (unknown to me) code; sorry if I missed anything!
You can use the
mechanize,
BeautifulSoup,
urllib, or
urllib2
modules in Python. What I suggest you use is the mechanize module. It lets you scrape a website through a Python program; it is essentially a browser driven by Python code.
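A minimal mechanize sketch of your three steps (the URL and the form field name 'q' are placeholders; inspect the target page for the real ones):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # check the site's terms before doing this

# 1. Direct the browser to a specific website (placeholder URL).
br.open('http://example.com/search')

# 2. Input data into the search field ('q' is a placeholder name).
br.select_form(nr=0)          # pick the first form on the page
br['q'] = 'my search terms'

# 3. Submit and retrieve the resulting data.
response = br.submit()
print(response.read())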
Is there a way to scrape data from a popup? I'd like to import data from the site tennisinsight.com.
For example, http://tennisinsight.com/match-preview/?matchid=191551201
This is a sample data extraction link. When you click "Overview" there is a "Match Stats" button; I'd like to be able to import that data, for many links, into a text or CSV file.
What's the best way to accomplish this? Is Scrapy able to do this? Is there software able to do this?
You want to open the network analyzer in your browser (e.g. Web Developer in Firefox) to see what requests are sent when you click the "Match Stats" button, in order to replicate them using Python.
When I do it, a POST request is sent to http://tennisinsight.com/wp-admin/admin-ajax.php with action and matchID parameters.
You presumably already know the match ID (see URL you posted above), so you just need to set up a POST request for each matchID you have.
import requests

r = requests.post('http://tennisinsight.com/wp-admin/admin-ajax.php',
                  data={'action': 'showMatchStats', 'matchID': '191551201'})
print(r.text)  # this is your content of interest
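To cover the many-links case from the question, the same request can be repeated for a list of match IDs read from a file. A minimal sketch, assuming a hypothetical match_ids.txt with one ID per line:

import requests

with open('match_ids.txt') as f:                 # hypothetical input file
    match_ids = [line.strip() for line in f if line.strip()]

for match_id in match_ids:
    r = requests.post('http://tennisinsight.com/wp-admin/admin-ajax.php',
                      data={'action': 'showMatchStats', 'matchID': match_id})
    with open('stats_%s.html' % match_id, 'w') as out:  # one file per match
        out.write(r.text)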
I would like to know if there is some tool that, given a URL to a blog or webpage, identifies and extracts the main text. An article page, say a blog post, may have several different blocks of text, only one of which is the article itself. Is there a way to identify and extract it?
Thank you.
There are three steps for this problem:
Retrieve the data from the URL
Extract article text (removing ads ...)
Summarize the text
Step 1 is easily done with Python's urllib2.urlopen (urllib.request.urlopen in Python 3).
For step 2, if you know the structure of the website (the main HTML tags and such), this can be done easily with tools such as BeautifulSoup. Removing ads in a generic way is a bigger subject; you can find some research on it online.
For step 3, creating a summary by extracting sentences is a well-studied field. I think NLTK has some modules to do that. You can even take a look at a simple (and effective) approach I wrote a while back.
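A minimal sketch of steps 1 and 2, assuming a placeholder URL and that the page wraps its main text in an <article> tag (falling back to all <p> tags otherwise):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://example.com/blog-post').read()  # placeholder URL
soup = BeautifulSoup(html, 'html.parser')

# Prefer the <article> element if the page has one; otherwise take all <p> tags.
article = soup.find('article') or soup
text = '\n'.join(p.get_text(strip=True) for p in article.find_all('p'))
print(text)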
You could use an AJAX call to grab the content, but you have to be on the same domain. You can't copy someone else's content.
Alternatively, grab it with PHP using $content = file_get_contents('{filename}'); and then split it on an HTML tag (e.g. '<section>').
What are you using it for? If it is your own content, I would use AJAX and always put the content you want to grab in a tag with a specific class assigned. If it is someone else's content, you should ask their permission first.