I'm doing some crawling with Python, and would like to be able to identify (however imperfectly) the Flash I come across - is it a video, an ad, a game, or whatever.
I assume I would have to decompile the SWF, which seems doable. But what sort of processing would I do with the decompiled ActionScript to figure out what its purpose is?
Edit: or any better ideas would be most welcome also.
I think your best bet would be to check the context where you see the SWF file.
They're usually embedded within web pages, so if that page has 100 occurrences of the word "game", for example, it might be a game.
Detecting an ad might be trickier, but I think checking the domain name where the SWF is hosted might do the trick; the HTML tags around the SWF will also be of great use.
It might help to look at the arguments passed to the Flash movie. If there's a reference to an FLV file, then there's a good chance the SWF is being used to play a movie.
The path to the SWF might help too. If it's under, say, an /ads directory then it's probably just a banner ad. Or if it's under /games then it's probably a game.
Other than using heuristics like this there's probably not much you can do. SWFs can be used for a lot of different things, and there's really nothing in the SWF itself that would tell you what "type" it is.
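A rough Python sketch of the heuristics from the two answers above (the URL patterns, keyword threshold and helper name are all invented for illustration):

import re
from urllib.parse import urlparse

def guess_swf_type(swf_url, flashvars, page_html):
    """Hypothetical classifier combining the path, argument and keyword heuristics."""
    path = urlparse(swf_url).path.lower()
    # Path-based hints: /ads/ or /games/ directories
    if "/ads/" in path or "banner" in path:
        return "ad"
    if "/games/" in path:
        return "game"
    # A reference to an .flv file in the arguments suggests a video player
    if ".flv" in flashvars.lower():
        return "video"
    # Fall back to counting keywords in the embedding page
    if len(re.findall(r"\bgame\b", page_html, re.IGNORECASE)) > 20:
        return "game"
    return "unknown"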
Tough one. I guess you should try to find a scope for the SWF's context.
As you said, SWFs can be ads, games, video players; they can also contain experimental art.
Who knows. Once you know what exactly you're after, it should be easier to figure out how to look for that kind of data.
I think it would be easier to get started with commercial websites. Those need promotion, so they might have promotional RIAs set up with a little SEO in mind; look for things like SWFObject, SWFAddress and tracking stuff (Omniture and who knows what else). They should have keywords in the embedding HTML.
Google and Yahoo are working with Adobe, as far as I know, to make SWFs indexable. There is something mentioned about a custom Flash Player used for Flash indexing in the Flash Internals presentation from Adobe MAX.
Hope it helps.
I am aware that this can't be done with a bash script alone, as far as I know (and I'm still learning). This is why I'm asking for help. What more do I need? Are there specific tools?
This is what I'd like to do:
Upload an image to https://www.google.com/searchbyimage/upload
Then find all the identical images
Download the one which has the greatest resolution
So far I've been able to upload an image to Search by Image through curl. The uploaded image then yields a very long token that is used to search for similar images, along with some supplementary keywords.
The uploaded image creates a link composed like so:
https://www.google.com/search?tbs=sbi:
After this is the awfully long token: AMhZZith3JfR2OzwmuyQjufBifvdFWNjMShRMypWIE2-g005QfYLeTATLhGHAWz8MLI-tbgHzZp-bREPlJbsNWhY7U4Z2_19bu0oHII6VJPIVVJSPANODqnrJXp6X5VKKoXHMLcBCmI9eIpxS_1EX9g9YJPFL2XFEfJqIApLX83erP5mlRM7rSiIF5Te_1RPNyVkp4IPZPBRtoOKGhpDw2xad-JZsqd2ai4F5sMvyO2A_18PMFKg21nTRH_1jVeOeUhz8U5zkL4lycIg3kafAYlNy8YwmjSFcmc2nZB_10t9MFyi2BnBmemDRp4DCACI0FVM6pLTIB8VCBpU9A
And it adds this at the end: &hl=fr.
Finally the image is searched, and I have the choice between clicking "similar images" or "all sizes" (it's "all sizes" I want, as "similar images" doesn't guarantee they will be identical). This will add some keywords from Google's analysis of the picture (here, a photograph of Émile Zola) and create a second token:
The picture I searched here
https://www.google.com/search?safe=strict&hl=fr&
q=emile+zola&tbm=isch
&tbs=simg:
CAQSmQEJthA57uIOXdcajQELEKjU2AQaBggXCD0IQgwLELCMpwgaYgpgCAMSKLQZ9QH3BLMZ2A6xGdcO3w70Ad0OwjrEOqEuwzqiLsE67iSTLoM4oC4aMIk1iw7XQn7Wu55hLB2k-bnfW3_1yf24eA0N-w-baKvWkDj48J67yZZS-uQ-BgjCRQyAEDAsQjq7-CBoKCggIARIEnfZWUgw&sa=X&ved=0ahUKEwi965ashtrhAhWI3eAKHSmRCBwQ2A4IKygB
&biw=1920&bih=944
With the resolution of the picture at the end. The idea is to recreate this second link, and then download the highest-resolution image among what Google has found. I have to get the token, but everything else can be found in the picture file itself: the file is properly named after the picture, and thus could provide the keywords, and its resolution is also easily known. I'd like to make it a script, to download higher-resolution versions of many paintings - over a thousand - that I have in low quality. Ideally I'd use it quite often. So far I had found how to upload a picture with curl, and it gave me back a token, but an incomplete one. Beyond this, I was completely lost.
In theory this doesn't seem impossible. The problem is I'm too much of a newbie: I'm enjoying Linux and bash a lot so far, but I still know very little. I have of course done some hours of googling beforehand; nothing showed up that I knew I could use. There is nothing like it on GitHub either: a lot of scripts that search for similar images, but none for identical ones, and none that compare the sizes of these images. There's also a Python API for reverse image searching, but it didn't seem like it could search for identical images, and it seems tied to the Google API, which is problematic. All of this is probably terribly hard for me because I'm only a beginner, and I don't know enough to build this script; but on the other hand - maybe due to my lack of knowledge - it doesn't seem impossible at all, and I'm very willing to try, fail, try again: learn. So here I am, to ask: how do I do that? Can it be done in bash only? If not, what must I include? Or perhaps it cannot be done?
Lastly, I know there is a Google API for reverse image searching. That'd be very useful if it weren't limited to a hundred image searches a day: if you want more, you've got to pay. And at 100 images a day, it'd take me around eleven days to reverse-search all the images I want in better quality: in the end, I'd be done just as fast searching for them all myself, by hand. So neither of these options seems to be a solution: and this script doesn't seem impossible. It is only beyond my current capacities.
Thank you in advance, if anyone has got an idea !
PS: I can use Linux either through WSL or a virtual machine. Both work fine so far, with whatever command or package; WSL is much faster. And sorry for my English, I'm French!
Second PS: I've been asked to show what code I have, but it doesn't go beyond this:
curl -i -F sch=sch -F encoded_image=@path/to/my/imagefile.jpg https://www.google.com/searchbyimage/upload
Which was a partial answer to my question I had found here:
How to use google search by image in curl
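For reference, a rough translation of that curl call into Python's requests library. This is only a sketch: I can't guarantee Google still responds this way, and the expectation that the token arrives in the redirect's Location header is an assumption based on what the upload gave me back.

import requests

# Sketch only: the endpoint's behaviour may change; field names mirror the curl command above
with open("path/to/my/imagefile.jpg", "rb") as img:
    resp = requests.post(
        "https://www.google.com/searchbyimage/upload",
        data={"sch": "sch"},
        files={"encoded_image": img},
        allow_redirects=False,
    )

# The sbi: token is expected in the redirect target
print(resp.headers.get("Location"))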
There are two fundamental ways to use the web programmatically:
via API: this is purpose-built for computers to access web resources and is always preferred. You follow strict rules and get well-defined results back.
by crawling: this is when the computer pretends to be a user, emulating the clicking on links done in a browser. Basically curl, but over and over again, with state stored in between, parameters generated correctly, encoding applied, etc.
As you say, there's an API available, so if it does what you want then it's the right way to go. The fact that it does what you want, but enforces limits, is a very useful sign that what you're trying to do has limits. Those limits will have been carefully set to incentivise you to work within them. Trying to crawl for the same results will likely breach either Google's terms of service or your sanity.
So if you really want to work around the API, then use a crawler library such as Python Scrapy. But note that the API limits might be a useful indication of how far you can expect to get without paying.
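For what it's worth, a minimal Scrapy spider looks something like this. The start URL and selector are placeholders, not Google's actual markup, and crawling Google's result pages specifically is against their terms of service, which is the point made above:

import scrapy

class ImageLinkSpider(scrapy.Spider):
    """Tiny spider that collects image links from a results page."""
    name = "image_links"
    start_urls = ["https://example.com/search-results"]  # placeholder URL

    def parse(self, response):
        # Yield every image source found on the page
        for src in response.css("img::attr(src)").getall():
            yield {"image_url": response.urljoin(src)}

Run it with: scrapy runspider spider.py -o links.json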
Ok, so I've looked around on how to do this and haven't really found an answer that showed me examples that I could work from.
What I'm trying to do is have a script that can do things like:
- Log into a website
- Fill out forms or boxes, etc.
Something simple that might help me, for example, would be a script that lets me log into one of those text message websites like TextNow, fill out a text message, and send it to myself.
If anyone knows a good place that explains how to do something like this, or if anyone would be kind enough to give some guidance of their own then that would be greatly appreciated.
So after some good answers and further research, I have found that Selenium is the thing that best suits my needs. It works not only with Python, but supports other languages as well. If anyone else is looking for something like what I was when I asked my question, a quick Google search for "selenium" should give them all the information they need about the tool that I found best for what I needed.
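For anyone landing here, a minimal Selenium sketch of the log-in-and-fill-a-form idea looks roughly like this. The URL and field names are invented; inspect the real page to find yours:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get("https://example.com/login")  # placeholder URL

# Fill in the login form (field names are hypothetical)
driver.find_element(By.NAME, "username").send_keys("me@example.com")
driver.find_element(By.NAME, "password").send_keys("my-password")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# Fill out and submit a message form after logging in
driver.find_element(By.NAME, "message").send_keys("Hello from a script!")
driver.find_element(By.NAME, "send").click()

driver.quit()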
I'm learning a practice called 'web scraping' using python. From what I can tell so far the idea is to send out a request to load the site data from a server, store the DOM html in a variable, and then basically data mine the s*** out of the resulting string until you are able to quickly access exactly and only the information you need.
Well I'm ready to start fiddling with statements that might help me do the actual data mining, but first I need to see and understand all of the html in my string. After I've got the hang of it I won't care what the html looks like, but right now I need to be able to reference it to properly analyze my output. So far I've tried Google, python.net, YouTube, various blogs, etc. But they all read like an alien language to me.
I'm just looking for the typical stuff you know?
<html>
  <head>
    <meta>
    <script src="">
    <style src="">
    <title></title>
  </head>
  <body>
    <div class="">
      <img src="">
    </div>
    <div>
      <h1>my page</h1>
      <li></li>
      <li></li>
      <li></li>
      <li></li>
      <li></li>
      <li></li>
      <p>click here</p>
    </div>
  </body>
</html>
You get what I'm saying? Just a website... that uses like... html... to render some simple structured data.
P.S. This is kind of neat. I went to give this post some tags and I discovered 'simple-html-dom'. So I googled it. Apparently it's some kind of library that lets you parse html from online sources in exactly the way I am trying to. I may check that out later, but I still want to figure out how to do this with python.
EDIT Actually something like this would work fine but it's just so big. I would prefer something smaller to work with.
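A minimal sketch for fetching a page and printing its HTML in indented, readable form, using requests and BeautifulSoup (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch the page, parse it, and pretty-print the HTML with indentation
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())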
While it would probably be nice to build your own web pages to use, you can also try looking for pages "optimized for lynx". Lynx is a text-only browser with which "simple" pages naturally work best.
Most of the links you'll find will be dead already, but I found this list for instance, which still has many alive and equally simple pages: http://www.put.com/dead.html (please ignore the content itself... there is no particular reason I chose this example other than that it probably works nicely for your purposes!)
I want to make a Python script available as a service on the net. The script, which is my first 'proper' Python program, takes a txt file as an argument and writes an image into the working directory. So:
How difficult is it for somebody who is new to Python and web development?
How much work is it?
Do I need a framework (Django, cherryPy, web2py)?
Are there good tutorials?
How do I prevent the server from being compromised?
What are my next steps?
==> What is the easiest way?
In the end it is enough if it is a white page with some text and a button which, when clicked, opens a file dialog. After the txt is processed, the server should just return the image which was written to the hard drive. I already have access, through a friend, to a server with Ubuntu installed.
[update]
Thanks for all your answers. After reading them I want to stress again that I want to keep it as minimal as possible. Srikar's suggestion sounds like the easiest one:
Put it in the executable directory of your OS (commonly known as the CGI path). Provide a simple HTML form and, upon form submission, hit this script, which executes and returns the image you want to display.
Any objections or comments? Do you know any tutorials for that?
[update 2]
I found this SO answer: File Sharing Site in Python. Is this a sensible approach?
It's not too difficult. Actually, it sounds like a good first project.
That's too subjective to answer. Anywhere from an hour to days.
No, you don't need one, but I'd use one if I were you. They abstract away some of the stuff you really don't care about, and you'll learn a tool you can use again in the future.
Plenty. If you want a real rundown of how Python works for the web, read the HOWTO from Python.org. If you just want to learn how to do this one project, pick a framework and do their tutorial.
This question is so broad and complex that I'm not going to try to answer it. Search this site, or Google, for questions like that.
Your next step should be to pick a framework; I've used Django successfully. Just download it, follow the installation instructions, and work your way through their tutorial; it should tell you everything you need to know to do what you want. If you still have questions once you've learned how to do the basics, come back and ask again!
Edit: The answer to that other question will certainly work for you. There, they just receive a GET request and respond with data from a Python file. You need to receive a GET request, respond with an HTML page (easy enough), then respond to a POST request that includes an uploaded file (slightly more complicated), run your Python routine on the uploaded file, and respond with the created image (or a link to it).
Take a look at this page which includes a simple Python script to do file uploads. You should easily be able to modify it to do what you want.
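As a sketch of that approach, a bare-bones CGI upload handler might look like this. The form field name and the process_text() routine are hypothetical stand-ins for your script; also note the stdlib cgi module was removed in Python 3.13, so this only fits older interpreters:

#!/usr/bin/env python3
import cgi
import sys

form = cgi.FieldStorage()
item = form["txtfile"]          # hypothetical form field name
text = item.file.read()

png_bytes = process_text(text)  # hypothetical: your routine that renders the image

# A CGI response is headers, a blank line, then the body
sys.stdout.buffer.write(b"Content-Type: image/png\r\n\r\n")
sys.stdout.buffer.write(png_bytes)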
How difficult is it for somebody who is new to Python and web development?
Depends on your level of knowledge.
How much work is it?
Depends on which method you choose to solve the problem.
Do I need a framework (Django, cherryPy, web2py)?
Not necessarily - you could get started with the cgi module (http://docs.python.org/library/cgi.html)
Are there good tutorials?
Yes, there are plenty. The Python docs are an excellent place to start.
How do I prevent the server from being compromised?
Again, depends on the method you choose to solve the problem, although there are commonalities.
What are my next steps?
Dare I say it again, choose a method, read the docs, have a play!
If it's just as simple as you have described, then you might not even need Django. You could simply use CGI scripting. All of these design decisions depend on whether:
You need (or foresee) a SQL storage?
or a Content-Management-System?
Will you need multiple-user support?
Do you need tight security?
Do you need different privileges for different users?
Do you need an Admin to manage your site?
If the answer to at least 60% of the above questions is yes, then you might consider Django. Otherwise, just write a Python script, put it in the executable directory of your OS (commonly known as the CGI path), provide a simple HTML form and, upon form submission, hit this script, which executes and returns the image you want to display. So it all depends on the features you need...
In the end, I created what I needed with Flask.
They have a well-documented pattern/tutorial on Uploading Files. The tutorial is understandable even for people with little Python and web experience.
Getting a first working version took me two hours, and the resulting code was only 50 lines. This includes starting the web server, an HTML file/form with file upload, and serving a file back to the user.
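For anyone curious, the shape of that Flask solution is roughly this, following Flask's documented upload pattern. The make_image() call is a hypothetical stand-in for the OP's actual script:

import os
from flask import Flask, request, send_file
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

@app.route("/", methods=["GET", "POST"])
def upload():
    if request.method == "POST":
        # Save the uploaded txt file under a sanitized name
        f = request.files["txtfile"]
        path = os.path.join(UPLOAD_DIR, secure_filename(f.filename))
        f.save(path)
        image_path = make_image(path)  # hypothetical: the script that writes the image
        return send_file(image_path, mimetype="image/png")
    # On GET, show a minimal upload form
    return ('<form method="post" enctype="multipart/form-data">'
            '<input type="file" name="txtfile"><input type="submit" value="Upload">'
            '</form>')

if __name__ == "__main__":
    app.run()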
We have lots of data and some charts representing one logical item. Charts and data are stored in various files. As a result, most users can easily access and re-use the information in their applications.
However, this is not exactly a good way of storing data. Amongst other reasons, charts belong to some data, the charts and data have meta-information that is not reflected in the file system, there are a lot of files, etc.
Ideally, we want
one big "file" that can store all information (text, data and charts)
the "file" to be human readable, portable and accessible by non-technical users
typical office applications like MS Word or MS Excel to be able to extract text, data and charts easily
a light-weight, easy solution. Quick and dirty is sufficient. Not many users.
I am happy to use some scripting language like Python to generate the "file", third-party tools (ideally free as in beer), and everything that you find on a typical Windows-centric office computer.
Some ideas that we currently ponder:
using VB or pywin32 to script MS Word or Excel
creating HTML and publishing it on a RESTful web server
Could you expand on the ideas above? Do you have any other ideas? What should we consider?
I can only agree with Reef on the general concepts he presented:
You will almost certainly prefer the data in a database to a single large file
You should not worry that the data is not directly manipulated by users because, as Reef mentioned, it can only go wrong. And you would be surprised at how ugly it can get
Concerning the usage of MS Office integration tools, I disagree with Reef. You can quite easily create an ActiveX server (in Python, if you like) that is accessible from the MS Office suite. As long as you have a solid infrastructure that allows some sort of file share, you could use that shared area to keep your code. I guess the mess Reef was talking about is mostly about keeping users' versions of your extract/import code in sync. If you do not use some sort of shared repository (a simple shared folder), or if your infrastructure fails often so that the shared folder becomes unavailable, you will be in great pain. One thing is also somewhat painful if you do not have the appropriate tools but deal with many users: the ActiveX server is best registered on each machine.
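A sketch of such a COM/ActiveX server with pywin32. The ProgID, GUID and the method are all invented; generate your own GUID, e.g. with pythoncom.CreateGuid():

import win32com.server.register

class ChartDataServer:
    """Minimal COM server exposing one method to MS Office."""
    _public_methods_ = ["GetChartData"]
    _reg_progid_ = "MyCompany.ChartData"                    # hypothetical ProgID
    _reg_clsid_ = "{8A3C1F44-2B1D-4C87-9E43-0F4B6F1A2D55}"  # replace with your own GUID

    def GetChartData(self, name):
        # Hypothetical lookup; return something COM-friendly
        return "data for %s" % name

if __name__ == "__main__":
    # Registers the server on this machine; run with --unregister to remove it
    win32com.server.register.UseCommandLine(ChartDataServer)

From VBA you would then call CreateObject("MyCompany.ChartData") to reach it.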
So... I just said MS Office integration is very doable. But whether it is the best thing to do is a different matter. I strongly believe you will serve your users better if you build a website that handles their data for them. This sort of tool, however, almost certainly becomes an "ongoing project". Often, even as an "ongoing project", the time saved by your users could still make it worth it. But sometimes, strategically, you want to give your users a poorer experience to control project costs. In that case, the ActiveX server I mentioned could be what you want.
Instead of using one big file, you should use a database. Yes, you can store various types of files, like GIFs, in the database if you like.
The file would not be human readable or accessible by non-technical users, but this is good.
The database would have a website that your non-technical users would use to insert, update and get data from. They would be able to display it on the page or export it to CSV (or even XLS - it's not that hard, I've seen some CSV-to-XLS converters). You could look into some open standard document formats; I think it should be quite easy to output data in them. Do not try to output in "doc" format (but you could try "docx"). You should be able to easily teach the users how to export their data to a CSV and upload it to the site, or they could use the web interface to insert the data if they like.
If you allow your users to mess with the raw data, they will break it (I have tried that; you have no idea what those guys can do). The only way to prevent it is to make a web form that only allows them to perform the specific actions that you know they are supposed to perform.
The database + web page solution is the good one. Using VB or pywin32 to script MS Office will get you into more trouble than I can even imagine.
You could use gnuplot or some other graphics library to draw the charts (pretty straightforward to implement; it does all the hard work for you).
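For example, with matplotlib used here in place of gnuplot (the data file and column names are made up):

import csv
import matplotlib
matplotlib.use("Agg")  # render to a file; no display needed on a server
import matplotlib.pyplot as plt

xs, ys = [], []
with open("measurements.csv") as f:  # hypothetical data file with "time" and "value" columns
    for row in csv.DictReader(f):
        xs.append(float(row["time"]))
        ys.append(float(row["value"]))

plt.plot(xs, ys)
plt.xlabel("time")
plt.ylabel("value")
plt.savefig("chart.png")  # the web page can then serve this image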
I am afraid the "quick and dirty" solution is tempting, but I can only say one thing: it will not be quick. In a few weeks you will find that hacking around with MS Office scripting is messy, buggy and unreliable, and the non-technical guys will hate it and say that in other companies they used to have a simple web panel that did that. Then you will find that you cannot ask about the scripting, because everyone uses web interfaces nowadays, as they are quite easy to implement and maintain.
This is not a small project, it's a medium-sized one; you need to remember this while writing it. It will take some time to write and test, and you will have to add new features as the non-technical guys start using it. I knew some passionate PHP teenagers who would be able to write this panel in a week, but as I understand it you have better resources, so I hope you will come up with a really reliable, modular, extensible solution with good usability and happy users.
Good luck!