Mechanism for Identifying Ads on a Webpage [Specifically AdBlock] [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am currently doing a research project and I am attempting to figure out a good way to identify ads given access to the html of a webpage.
I thought it might be a good idea to start with AdBlock. AdBlock is a program that prevents ads from being displayed to the user, so presumably it has a mechanism for identifying things as ads.
I downloaded the source code for AdBlockPlus, but I find myself completely lost in all of the files. I am not sure where to start looking for this detection mechanism, so I was wondering if anyone had any advice on where to start. Alternatively if you have dealt with AdBlock before and are familiar with it, I would appreciate any extra information.
For example, if the webpage needs to be rendered in a real browser to use AdBlock, that wouldn't be a problem, since there are programs that automate the loading of a webpage. But I am not sure how to figure out whether this is what AdBlock does in the first place.
Note: AdBlock is written in Python and Perl :)
Thanks!

I would advise you to first have a look at writing adblock filter rules.
Then, once you get an idea of this, you can start parsing adblock lists available in various languages to suit your needs.
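To get a feel for what parsing such a list involves, here is a minimal sketch that sorts filter lines into rough categories by their leading markers (`!` for comments, `@@` for exceptions, `##` for element hiding, `#@#` for element hiding exceptions). The sample rules, the function name `classify_filter`, and the category names are assumptions made for illustration. Real lists also carry `$` options and other syntax that this sketch ignores.

```python
# Minimal sketch: sort Adblock-style filter lines into rough categories.
# The sample rules and category names below are illustrative assumptions.

def classify_filter(line):
    """Return a rough category for one line of an Adblock-style filter list."""
    rule = line.strip()
    if not rule or rule.startswith("!") or rule.startswith("["):
        return "comment"            # blank lines, "! ..." comments, "[Adblock Plus 2.0]" header
    if "#@#" in rule:
        return "hiding_exception"   # switches off an element hiding rule
    if "##" in rule:
        return "element_hiding"     # e.g. "##.banner-ad" hides matching elements
    if rule.startswith("@@"):
        return "exception"          # allows requests that would otherwise be blocked
    return "blocking"               # e.g. "||ads.example.com^" blocks matching requests


sample_rules = [
    "[Adblock Plus 2.0]",
    "! Title: hypothetical sample list",
    "||ads.example.com^",
    "@@||example.com/ads/allowed.js",
    "example.com##.banner-ad",
    "example.com#@#.banner-ad",
]

categories = [classify_filter(rule) for rule in sample_rules]
for rule, category in zip(sample_rules, categories):
    print(f"{category:17} {rule}")
```

For identifying ads in raw HTML, the element hiding rules are probably the most directly useful, since they are CSS selectors that can be matched against a parsed page.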

Related

Scraping searchable online dictionary [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
Thanks in advance! I was hoping someone might be able to point me in the right direction as to how to scrape a searchable online database. Here is the URL: https://hord.ca/projects/eow/. If possible, I'd like to be able to access all of the data from the site's database; I'm just not sure how to access it using bs4. Maybe bs4 isn't the answer here, though. I'm still a relatively new Pythonista, so any help is greatly appreciated!
Since you are new, there is a combination of things you need to address. You need a good handle on where to look in the HTML, and you need to understand how the site works: what does it put into its URLs, and why? What are the class names of the important parts of the site that you will want to reference? How does it handle multipage display (if it does so at all)?
Once you are intimately familiar with the website you are scraping, you will need to apply that knowledge when you write your automation.
For beginners I'd highly recommend this ebook: https://automatetheboringstuff.com/
It's a great read and easy to follow even for a beginner in both Python and HTML. Even better, it's free to read on the site!
Chapter 11 is the part on web scraping that you are specifically looking for. It will give you the rundown on what you need to look for and how to go about planning your code.
But I highly recommend you read the whole thing once you are done focusing on your current project.
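As a rough illustration of "knowing the class names", the sketch below pulls the text out of every element carrying a given class. It uses only the standard library's `html.parser` so that it runs without extra installs; BeautifulSoup (bs4) can do the same more compactly. The HTML snippet and the class name `entry` are made up for the example and are not taken from the site in the question.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that has a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1                 # nested tag inside a match
        elif self.target_class in classes:
            self.depth = 1                  # entering a matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Hypothetical page fragment; a real page would be downloaded first.
sample_html = """
<ul>
  <li class="entry"><b>word</b>: definition one</li>
  <li class="entry other">term: definition two</li>
  <li class="nav">next page</li>
</ul>
"""

extractor = ClassTextExtractor("entry")
extractor.feed(sample_html)
entries = extractor.results
print(entries)
```

The sketch assumes reasonably well-formed HTML; a forgiving parser such as the ones bs4 wraps copes better with messy real-world pages.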

Could Python3 script (requests) used to get data from a website seem 'suspicious'? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
Might be a silly question... I want to use a python script to get some data from a website every 10 or 20min.
I'm using:
import requests
response = requests.get("http://somewebsite.php")
data = response.text
to get the data, and the rest is basically extraction of values from the string etc.
I would like to loop it and make a new request to the website every 10 or 20min to get data.
Assuming I'm running this script for few hours:
Would it look suspicious to the owner of the website?
Would it in any way 'hurt' the website or is it just equivalent to refreshing the website in the browser?
I just don't want someone, somewhere think something malicious is happening when I'm just playing around learning python. The data is not even important, I just want to see if the script that I wrote works. I just figured I might ask here before running it.
Thanks for any replies in advance.
Although you don't want to do any harm, you can misconfigure the script by accident (we are just humans), generate suspicious activity and a real person might spend some time investigating your activity (I'm not kidding, these things really happen).
My suggestion is to use a testing service like https://httpbin.org/ to play with the requests library. HttpBin is actually created by the same person who started the requests library (Kenneth Reitz).
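One way to reduce the risk of a misconfigured loop is to keep the interval and the total number of requests explicit and bounded. The sketch below does that with a hypothetical helper named `poll`; the fetch function and the sleep function are passed in, so the loop can be tried offline with a stand-in before it is pointed at a real server. The names and the stand-in payload are assumptions for illustration.

```python
import time


def poll(fetch, interval_seconds, max_requests, sleep=time.sleep):
    """Call fetch() max_requests times, pausing interval_seconds between calls."""
    results = []
    for attempt in range(max_requests):
        results.append(fetch())
        if attempt < max_requests - 1:   # no need to wait after the last request
            sleep(interval_seconds)
    return results


# Stand-in for a real HTTP call so the sketch runs offline.
def fake_fetch():
    return "payload"


# Record the requested waits instead of actually sleeping.
waits = []
responses = poll(fake_fetch, interval_seconds=600, max_requests=3, sleep=waits.append)
print(responses, waits)
```

For a real run, `fetch` could be something like `lambda: requests.get("https://httpbin.org/get", timeout=10).text`, with the default `sleep` left in place so the script actually waits ten minutes between requests.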

How would I go about pulling data from a website using Python? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
In reference to my question: how would one be able to input data into, and retrieve data from, various websites (not using an API)?
Is there a module that searches or acts like a human, for example by filling in the relevant search fields, in order to retrieve data?
Sorry if my question is hard to follow; if so, here's an example of what I am trying to accomplish:
Directing an AI towards a specific website.
Inputting data into the search field.
Then finally, retrieving said data after previously ran processes.
I'm fairly new to manipulating websites via APIs or other (unknown) code, so sorry if I missed anything!
You can use the mechanize, BeautifulSoup, urllib, and urllib2 modules in Python. I suggest the mechanize module: it lets you scrape a website from a Python program and behaves, in effect, like a simple browser driven by Python code.
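For a plain GET search form, "inputting data into the search field" often comes down to putting the field values into the query string of the URL. The sketch below shows that step with the standard library only (in Python 3, the functionality of `urllib2` lives in `urllib.request` and `urllib.parse`). The base URL, the field names `q` and `page`, and the helper `build_search_url` are hypothetical; a real site's field names have to be read from its HTML form.

```python
from urllib.parse import urlencode


def build_search_url(base_url, fields):
    """Return base_url with the form fields encoded as a query string."""
    return f"{base_url}?{urlencode(fields)}"


# Hypothetical search page and field names, for illustration only.
url = build_search_url("https://example.com/search", {"q": "hello world", "page": 1})
print(url)
```

The resulting URL can then be fetched and the response parsed. Forms that use POST, cookies, or JavaScript need more than this, which is where a tool like mechanize helps.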

How to write a crawler for a desk-top [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I'd like to write a program that indexes the PDF and music files on my hard drive (not a server). I plan to do this in Perl or Python, or both. I'll basically be writing a crawler for my desktop. The user interface will be in JavaFX, in which I think I am quite fluent; I've done a couple of projects in JavaFX. I have not done anything in Perl or Python, although I have written a few lines of code in them while teaching myself the syntax.
The question is: which topics should I start researching when embarking on writing a crawler? I've seen quite a number of crawler tutorials online, but they all do web page indexing. Also, which modules should I look into?
In Python, to find the files you can use os.walk; the examples in the help are very helpful.
Assuming that you want to do more than just locate the files and get their names, you will need to look into getting more information about their contents. There are Python libraries that can extract text from PDF files, such as PDFMiner and pdfquery.
Likewise there are numerous python tools that can get you some more information on your music files.
It all depends on how you are planning on indexing them.
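As a starting point, a minimal sketch of the file-finding step with `os.walk` might look like this. The helper name `find_files` and the extension list are assumptions, and the demo builds a throwaway directory so that it can run anywhere; in practice `root` would be a folder on your own drive.

```python
import os
import tempfile


def find_files(root, extensions):
    """Yield paths under root whose names end with one of the extensions."""
    wanted = tuple(ext.lower() for ext in extensions)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(wanted):   # case-insensitive match
                yield os.path.join(dirpath, name)


# Demo on a temporary directory tree with a few empty files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "music"))
    for rel in ("paper.PDF", "notes.txt", os.path.join("music", "song.mp3")):
        open(os.path.join(root, rel), "w").close()
    found = sorted(os.path.relpath(path, root)
                   for path in find_files(root, [".pdf", ".mp3"]))

print(found)
```

From here, each path could be handed to a PDF or audio metadata library and the extracted information stored in whatever index structure you choose.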

Can I use urllib2 to play videos? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Can I use urllib2 to open a webpage which contains a video (like a Vimeo page), and will this visit be counted as a view?
In general, yes. A request done with urllib2 will be a normal HTTP request and as such will be recognized as a normal “visit” for the server you are connecting to. Depending on what additional headers you set, you can even make yourself look like a common browser, so they won’t be able to filter you out either.
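To make "additional headers" concrete, here is a small sketch of attaching a header to a request object. It uses `urllib.request`, the Python 3 counterpart of `urllib2`; the URL and the User-Agent string are placeholders, and the request is only constructed, never sent.

```python
from urllib.request import Request

# Placeholder URL and header value, for illustration only.
request = Request(
    "https://example.com/video-page",
    headers={"User-Agent": "Mozilla/5.0 (hypothetical example)"},
)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing the object to `urllib.request.urlopen` would perform the actual HTTP request with that header set.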
As far as video counts go however, I’m pretty sure that simply visiting the site—without executing any code on it, and without actually playing the video—will not increase the view counter. In addition, these sites employ some systems to prevent abuse of the counter too. So if you have the hope to be able to spoof real views and increment the view counter by repeatedly visiting the page, then you will be out of luck.
As for actually playing—if you are interested in the content instead of the view counter—then yes, you can use Python to get access to the video. Of course Python won’t be able to play it, but you can download it instead. There are scripts like this one that already do this for you too.
