I would like to know if there is a tool that, given a URL to a blog or webpage, identifies and extracts the main text. An article page, say a blog post, may contain several blocks of text, only one of which is the article itself. Is there a way to identify and extract it?
Thank you.
There are three steps for this problem:
Retrieve the data from the URL
Extract the article text (removing ads, etc.)
Summarize the text
Step 1 is easily done with Python's urllib2.urlopen.
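For example, a minimal sketch of step 1 (Python 2's urllib2; the URL is a placeholder):

import urllib2

url = 'http://example.com/blog-post'  # placeholder URL
html = urllib2.urlopen(url).read()    # raw HTML of the page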
For step 2, if you know the structure of the web site (the main HTML tags and such), this is easily done with tools such as BeautifulSoup. Removing ads in a generic way is a bigger subject; you can find some research on it online.
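Continuing from the fetch above, a sketch of step 2, assuming you already know which tag wraps the article body (the tag and class here are assumptions, not a general solution):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# assume the article body is wrapped in <div class="post-content">
article = soup.find('div', {'class': 'post-content'})
text = article.get_text() if article else ''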
For step 3, creating a summary by extracting sentences is a well-known field. I think NLTK has some modules for it. You can even take a look at a simple (and effective) approach I wrote a while back.
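As a rough sketch of step 3, here is a frequency-based extractive summary with NLTK (one simple approach among many; assumes the punkt and stopwords corpora have been downloaded):

import nltk
from nltk.corpus import stopwords
from collections import defaultdict

sentences = nltk.sent_tokenize(text)
stop = set(stopwords.words('english'))

# score each word by how often it appears, ignoring stopwords
freq = defaultdict(int)
for word in nltk.word_tokenize(text.lower()):
    if word.isalnum() and word not in stop:
        freq[word] += 1

# rank sentences by the total frequency of their words
def score(sentence):
    return sum(freq[w] for w in nltk.word_tokenize(sentence.lower()))

summary = sorted(sentences, key=score, reverse=True)[:3]  # top 3 sentences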
You could use an AJAX call to grab the content, but you would have to be on the same domain; you can't copy someone else's content.
Alternatively, grab it with PHP using $content = file_get_contents('{filename}'); and then split it on an HTML tag (e.g. '<section>').
What are you using it for? If it is your content, I would use AJAX and always put the content you want to grab in a tag with a specific class assigned. If it is someone else's content, you might want to ask their permission first.
This is related to a website we are creating at work. The website is written in HTML and Python. We want to translate the website into different languages, text by text. Our idea is to save all the words and phrases in an Excel file and use some API to translate everything and save the results in the same file. I thought of using the Google Translate API, but I want to save all the translations to Excel once, so we don't have to pay for the API again and again.
I am looking for a less tedious way to:
1) Save all the words and phrases from the website into an Excel file
2) Translate those saved words and phrases and save them in the same file.
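For reference, a minimal sketch of that plan using openpyxl (the filename, the column layout, and translate_text are all assumptions; translate_text stands in for whichever translation API you call once):

from openpyxl import load_workbook

def translate_text(text, target_lang):
    # hypothetical stand-in for a one-time call to a translation API
    raise NotImplementedError

wb = load_workbook('strings.xlsx')             # placeholder filename
ws = wb.active
for row in range(2, ws.max_row + 1):           # assume row 1 is a header
    source = ws.cell(row=row, column=1).value  # column A: English text
    ws.cell(row=row, column=2).value = translate_text(source, 'de')
wb.save('strings.xlsx')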
This is actually a common problem that has been solved in a way similar to what you have in mind. I would suggest taking a look at https://www.mattlayman.com/2015/i18n.html, https://www.python.org/community/sigs/current/i18n-sig/, and https://docs.python.org/2/library/gettext.html, which describe using gettext to display the proper translation. This is a common problem in web development that is handled out of the box by frameworks like Ruby on Rails.
I believe you will still have to produce the translations yourself, but if you save them in the proper files, there are built-in functions that retrieve the right translation based on the user's locale.
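A minimal sketch of the gettext approach (assumes you have compiled .mo files under a locale/ directory):

import gettext

# load the Spanish catalog from locale/es/LC_MESSAGES/messages.mo
translation = gettext.translation('messages', localedir='locale',
                                  languages=['es'], fallback=True)
_ = translation.gettext

print(_('Welcome to our site'))  # prints the Spanish string if one exists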
Thanks in advance! I was hoping someone might point me in the right direction on how to scrape a searchable online database. Here is the URL: https://hord.ca/projects/eow/. If possible, I'd like to access all of the data from the site's database; I'm just not sure how to do it with bs4... Maybe bs4 isn't the answer here though. Still a relatively new Pythonista, so any help is greatly appreciated!
Since you are new, there are a combination of things you need to address. You need a good handle on where to look in the HTML, and you need to understand how the site works: what does it put into its URLs, and why? What are the class names of the important parts of the site you will want to reference? And how does it handle multi-page display (if it does so at all)?
Once you are intimate with the website you are scraping, you can apply that knowledge when you build your automation.
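As a first reconnaissance step, here is a small sketch with requests and bs4 that lists the search form's fields (the URL is from the question; the form structure is discovered at runtime):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://hord.ca/projects/eow/')
soup = BeautifulSoup(page.text, 'html.parser')

# print every form and its input names to see how searches are submitted
for form in soup.find_all('form'):
    print(form.get('action'), form.get('method'))
    for field in form.find_all(['input', 'select']):
        print('  ', field.get('name'), field.get('type'))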
For beginners I'd highly recommend this ebook: https://automatetheboringstuff.com/
It's a great read and easy to follow, even for a beginner in both Python and HTML. Even better, it's free to read on the site!
Chapter 11 is the part you are specifically looking for on web scraping; it will give you the rundown on what to look for and how to go about planning your code.
But I highly recommend you read the whole thing once you are done with your current project.
In reference to my question: how would one input data into, and retrieve data from, various websites (without using an API)?
Is there a module that acts like a human, filling in the applicable search fields, in order to (as said before) retrieve data?
Sorry if my question is hard to follow; here's an example of what I am trying to accomplish:
Directing an AI towards a specific website.
Inputting data into the search field.
Then finally, retrieving the results once those steps have run.
I'm fairly new to the field of manipulating websites via APIs or other code, so sorry if I missed anything!
You can use the mechanize, BeautifulSoup, urllib, or urllib2 modules in Python. What I suggest is the mechanize module: it lets you scrape a website from a Python program, and is essentially a browser driven through Python code.
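A minimal sketch with mechanize (Python 2; the URL and the form field name are placeholders you would replace with the real site's):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)           # some sites disallow robots via robots.txt
br.open('http://example.com/search')  # placeholder URL

br.select_form(nr=0)                  # pick the first form on the page
br['q'] = 'my search terms'           # 'q' is a placeholder field name
response = br.submit()

html = response.read()                # the results page, ready to parse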
I am attempting to retrieve information from a Health Inspection website, parse the data into variables, and then maybe save the records to a file. I suppose I could use dictionaries to store the information for each business.
The website in question is: http://www.swordsolutions.com/Inspections.
Clicking [Search] on the website will start displaying information.
I need to be able to pass some search data to the website, and then parse the information that is returned into variables and then to files.
I am fetching the website to a file using:
import urllib

# fetch the search page (this GETs the page itself, not the search results)
u = urllib.urlopen('http://www.swordsolutions.com/Inspections')
data = u.read()

# save the raw HTML to a file for inspection
f = open('data.html', 'wb')
f.write(data)
f.close()
This is the data that is retrieved by urllib: http://bpaste.net/show/126433/ and currently does not show anything useful.
Any ideas?
I'll just point you in the right direction.
You want to submit a form with several pre-defined field values and then parse the data returned. The next steps depend on whether it is easy to automate that form POST request.
You have plenty of options here:
Using browser developer tools, analyze what happens when you click "Submit". Then, if it is a simple POST request, simulate it using urllib2 or requests or mechanize or whatever you like (see the sketch below).
Give Scrapy and its FormRequest class a try.
Use a real automated browser with the help of selenium: fill in the fields, click submit, and get and parse the data, all with the same tool.
Basically, if there is a lot of javascript logic involved in the form submission process, you'll have to go with an automated browsing tool like selenium.
Plus, note that there are several tools for parsing HTML: BeautifulSoup, lxml.
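To illustrate the first option, a minimal sketch with requests and BeautifulSoup (the form field names are invented placeholders; the real ones come from the developer-tools inspection described above):

import requests
from bs4 import BeautifulSoup

# placeholder field names -- copy the real ones from the POST request
# you observe in the browser's developer tools
payload = {'businessName': '', 'city': 'Birmingham'}
response = requests.post('http://www.swordsolutions.com/Inspections',
                         data=payload)

# assume the results come back as table rows
soup = BeautifulSoup(response.text, 'html.parser')
for row in soup.find_all('tr'):
    print([cell.get_text(strip=True) for cell in row.find_all('td')])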
Also see:
Web scraping with Python
Hope that helps.
I am currently doing a research project, and I am attempting to figure out a good way to identify ads given access to the HTML of a webpage.
I thought it might be a good idea to start with AdBlock. AdBlock is a program that prevents ads from being displayed to the user, so presumably it has a mechanism for identifying things as ads.
I downloaded the source code for Adblock Plus, but I find myself completely lost in all of the files. I am not sure where to start looking for this detection mechanism, so I was wondering if anyone had advice on where to start. Alternatively, if you have dealt with AdBlock before and are familiar with it, I would appreciate any extra information.
For example, if the webpage needs to be rendered in a real browser for AdBlock to work, there are programs that automate loading a webpage, so that wouldn't be a problem; but I am not sure how to figure out whether that is what AdBlock does in the first place.
Note: AdBlock is written in Python and Perl :)
Thanks!
I would advise you to first have a look at writing adblock filter rules.
Then, once you get an idea of this, you can start parsing adblock lists available in various languages to suit your needs.
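As a toy illustration of how such a list can be applied, here is a heavily simplified matcher for the most basic Adblock pattern syntax (real lists have many more rule types -- element hiding, exceptions, options -- which this ignores):

import re

def rule_to_regex(rule):
    # handle only the simplest syntax: * wildcard, || domain anchor, ^ separator
    pattern = re.escape(rule)
    pattern = pattern.replace(r'\*', '.*')
    pattern = pattern.replace(r'\|\|', r'https?://([^/]+\.)?')
    pattern = pattern.replace(r'\^', r'[^\w.%-]')
    return re.compile(pattern)

rules = [rule_to_regex(r) for r in ['||doubleclick.net^', '/banner/*.gif']]

def is_ad(url):
    return any(r.search(url) for r in rules)

print(is_ad('http://ad.doubleclick.net/some/ad'))  # True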