How to write a crawler for a desk-top [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I'd like to write a program that indexes the PDF and music files on my hard drive (not a server). I plan to do this in Perl or Python, or both; I'll basically be writing a crawler for my desktop. The user interface will be in JavaFX, which I think I'm quite fluent in, having done a couple of projects with it. I have not built anything in Perl or Python, though I have written a few lines of code in each while teaching myself the syntax.
The question is: what topics should I start researching before writing a crawler? I've seen quite a number of crawler tutorials online, but they all index web pages. Also, what modules should I look into?

In Python, you can find the files with os.walk; the examples in its documentation are very helpful.
Assuming you want to do more than locate the files and record their names, you will need to extract some information about their contents. There are Python libraries that can get text out of PDF files, such as PDFMiner and pdfquery.
Likewise, there are numerous Python tools that can read more information from your music files.
It all depends on how you plan to index them.
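As a rough illustration of the os.walk approach, here is a minimal sketch that collects the path and size of every PDF or music file under a directory. The extension set and the function name build_index are assumptions made for this example, not something from the question.

```python
import os

# Extensions to index. This set is an assumption for the example; adjust as needed.
INDEXED_EXTENSIONS = {".pdf", ".mp3", ".flac", ".ogg"}


def build_index(root):
    """Walk `root` recursively and return a list of (path, size_in_bytes) tuples."""
    index = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare extensions case-insensitively so "SONG.MP3" is also matched.
            ext = os.path.splitext(name)[1].lower()
            if ext not in INDEXED_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Skip files that vanish or cannot be read while walking.
                continue
            index.append((path, size))
    return index
```

Calling build_index(os.path.expanduser("~")) would scan the home directory. Extracting PDF text or music tags would then be a second pass over the returned paths with whichever library you choose.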

Related

Could you check it's possible (Selenium python automation + PHP) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
Our system is developed in PHP, and one of our coworkers developed an Amazon automation program in Python.
I am wondering whether it's possible to integrate the two.
If it is, please recommend ways I can do this.
Here's the code for the Amazon automation program:
https://github.com/jasonminsookim/order_automation/blob/master/src/amzn.py
Thank you.
There are lots of ways to do this, but I would weigh what you have available and go from there. The temp-file solution is the most general and is a common interface pattern between any two or more languages; if performance is a major concern, you can get more exotic with pipes.
Temp-file
The most rudimentary way to do this would be to have the Python program write some data to a file that PHP can read, or vice versa.
For example, create a directory called /orders where PHP puts order.json files; Python picks those up, reads them, computes the result, and writes it back as order-result.json. Essentially, it is a temp-file system for communicating between the two.
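The Python side of that exchange might look roughly like the sketch below. The process_order function and the fields in the JSON are hypothetical placeholders; only the order.json and order-result.json file names come from the description above.

```python
import json
import os
import tempfile


def process_order(order):
    """Hypothetical placeholder for the real automation step."""
    return {"order_id": order.get("order_id"), "status": "processed"}


def handle_order_file(order_path):
    """Read one order file and write order-result.json into the same directory."""
    with open(order_path, encoding="utf-8") as f:
        order = json.load(f)

    result = process_order(order)

    directory = os.path.dirname(os.path.abspath(order_path))
    result_path = os.path.join(directory, "order-result.json")

    # Write to a temporary file and rename it afterwards, so the PHP side
    # never reads a half-written result.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(result, f)
    os.replace(tmp_path, result_path)
    return result_path
```

On the PHP side, reading the result would be a matter of decoding the JSON file once it appears.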
Pipes
Alternatively, depending on your setup, you could pipe results from Python into PHP with something like the subprocess module and a PHP CLI that interfaces with your DB.
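A minimal sketch of the pipe approach with Python's subprocess module follows. Since the actual PHP CLI is not shown in the question, the child process here is a stand-in Python one-liner; in a real setup the command would be something like ["php", "save_order.php"], where the script name is hypothetical.

```python
import json
import subprocess
import sys


def run_json_command(command, payload):
    """Send `payload` as JSON on the child's stdin and parse JSON from its stdout."""
    completed = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits with an error
    )
    return json.loads(completed.stdout)


# Stand-in child process (an assumption for this example): echoes its JSON
# input back with one extra field. Replace with the real PHP CLI command.
ECHO_CHILD = [
    sys.executable,
    "-c",
    "import json, sys; d = json.load(sys.stdin); d['received'] = True; print(json.dumps(d))",
]
```

Exchanging JSON over stdin and stdout keeps the two programs decoupled: either side can be replaced as long as it reads and writes the same structure.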

Translation of website in different languages [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
This is related to a website we are creating at work. The site is written in HTML and Python. We want to translate the website into different languages, text by text. Our idea is to save all the words and phrases in an Excel file and use some API to translate everything and save the results in the same Excel file. I thought of using the Google API, but I want to save all the translations in Excel once so we don't have to pay for the API again and again.
I am looking for a less tedious way to:
1) Save all the words and phrases from the website into the Excel file
2) Translate those saved words and phrases and save the results in the same file.
This is actually a common problem that has been solved in a way similar to what you have in mind. I would suggest taking a look at https://www.mattlayman.com/2015/i18n.html https://www.python.org/community/sigs/current/i18n-sig/ and https://docs.python.org/2/library/gettext.html which describe using gettext to display the proper translation. This is a common need in web development, and web frameworks like Ruby on Rails support it out of the box.
I believe you will still have to obtain the translations yourself, but if you save them in the proper files, there are built-in functions that can retrieve the right translation for you based on the user's locale.

How would I go about pulling data from a website using Python? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
In reference to my question: how would one input data into, and retrieve data from, various websites without using an API?
Is there a module that searches or acts like a human, filling in the applicable fields in order to retrieve data?
Sorry if my question is hard to follow; here's an example of what I am trying to accomplish:
Directing an AI to a specific website.
Inputting data into the search field.
Then, finally, retrieving the data produced by the preceding steps.
I'm fairly new to manipulating websites via APIs or other (unknown) code, so sorry if I missed anything!
You can use the mechanize, BeautifulSoup, urllib, and urllib2 modules in Python. I suggest the mechanize module: it lets you scrape a website from a Python program and, moreover, behaves like a simple browser driven by Python code.
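For a sense of the two basic steps (submitting a search and pulling data out of the response), here is a small sketch that uses only the Python 3 standard library rather than mechanize. It builds the URL a GET search form would request and extracts links from an HTML string. The function names, the field name, and the example URL are assumptions for illustration, and no network request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_search_url(base_url, field_name, query):
    """Build the URL that a GET search form would request for `query`."""
    return base_url + "?" + urlencode({field_name: query})


class LinkTextExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return all (href, text) pairs found in the given HTML string."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

In a real script, the URL from build_search_url would be fetched (for example with urllib.request.urlopen) and the response body passed to extract_links. Sites that rely on POST forms, cookies, or JavaScript need more than this, which is where a browser-like library such as mechanize helps.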

Python - First Interface with a Program [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have spent the last six months learning Python as a way to automate my working environment. So far I have automated data extraction and report downloading from various web-based sources using web crawlers, interacted with Excel files, created visual representations of data with matplotlib, and removed almost all the monotony from bank reconciliation.
I now come to a new task which takes up a large amount of my daily workload. We use an accounts program called Sage 50 Accounts. I effectively want to begin to learn how to manipulate the data contained within this program so that my daily thought patterns can be put into Python code.
Because this hasn't been done, there's no pre-made API. So my question is:
When wishing to interact with a new program through Python, how does a programmer begin such an inquiry?
Please accept that this question is only vague and general because I'm incredibly new to such a task.
SData is Sage's general data access API layer and should suit your purposes.
Otherwise you might need to invest in or obtain a Sage Development SDK.

Mechanism for Identifying Ads on a Webpage [Specifically AdBlock] [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am currently doing a research project and I am attempting to figure out a good way to identify ads given access to the html of a webpage.
I thought it might be a good idea to start with AdBlock. AdBlock is a program that prevents ads from being displayed to the user, so presumably it has a mechanism for identifying things as ads.
I downloaded the source code for AdBlockPlus, but I find myself completely lost in all of the files. I am not sure where to start looking for this detection mechanism, so I was wondering if anyone had any advice on where to start. Alternatively if you have dealt with AdBlock before and are familiar with it, I would appreciate any extra information.
For example, if the webpage needs to be rendered in a real browser to use Adblock, there are programs that will automate the loading of a webpage so this wouldn't be a problem but I am not sure how to figure out if this is what AdBlock does in the first place.
Note: AdBlock is written in Python and Perl :)
Thanks!
I would advise you to first look at how Adblock filter rules are written.
Once you understand the rule syntax, you can start parsing the Adblock filter lists, which are available in various languages, to suit your needs.
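As a starting point for parsing such lists, here is a minimal sketch that sorts filter lines into broad categories. It covers only a simplified subset of the Adblock Plus filter syntax (comments starting with "!", exception rules starting with "@@", element hiding rules containing "##", and element hiding exceptions containing "#@#"); the function names and category labels are assumptions for this example.

```python
def classify_filter_line(line):
    """Classify one line of an Adblock-style filter list (simplified subset)."""
    line = line.strip()
    if not line or line.startswith("["):
        return "ignored"  # blank line or list header such as [Adblock Plus 2.0]
    if line.startswith("!"):
        return "comment"
    if "#@#" in line:
        return "hiding_exception"
    if "##" in line:
        return "element_hiding"
    if line.startswith("@@"):
        return "exception"
    return "blocking"


def hiding_selectors(lines):
    """Return (domains, css_selector) pairs for every element hiding rule."""
    selectors = []
    for line in lines:
        if classify_filter_line(line) == "element_hiding":
            # Everything before "##" is an optional comma-separated domain list.
            domains, _, selector = line.strip().partition("##")
            selectors.append((domains, selector))
    return selectors
```

For the research project described in the question, the CSS selectors from the element hiding rules can be matched against a page's HTML, while the blocking rules describe URL patterns of resources that are treated as ads.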
