Keeping an optimised log file for a predetermined number of requests? - python

I need to log certain requests made to my app being developed in Flask.
I know in advance whether a request needs to be logged or not.
I'll have at most 1000 requests to log. That is an upper limit and the real number will probably be lower; it is still uncertain for now, as we're adding requests, but it will be fixed once deployed. Each request will generate on average about two A4 pages of text.
I want to save the content, the time, and the filenames (usually at most 2 to 3 files) that came in with each request.
I don't mind the space this log file will take up.
I would like it to be fast to open, and fast to search for previously made requests. I care more about opening the file quickly and appending new information than about the searching itself, since it'll be just a few hundred requests.
Should I pickle, save as JSON, or use some other format?
The way I was thinking of structuring the information would be this:
{ "past_request": <sorted_list_filenames>,
"filename_1" : {"content":..., "time_requested":...,...},
"filename_102" : {"content":..., "time_requested":...,...}
...
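A minimal sketch of how that structure could be kept in a single JSON file (the helper names, the log.json path, and the ISO timestamps are just illustrative assumptions, not a finished design):

import json
import os
from datetime import datetime

LOG_PATH = "log.json"  # hypothetical location of the log file

def load_log():
    # With a few hundred requests at roughly two A4 pages each, the whole
    # file is only a few MB, so reloading it on demand stays fast.
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as f:
            return json.load(f)
    return {"past_request": []}

def log_request(filename, content):
    log = load_log()
    log["past_request"].append(filename)
    log["past_request"].sort()
    log[filename] = {"content": content, "time_requested": datetime.now().isoformat()}
    with open(LOG_PATH, "w") as f:
        json.dump(log, f)

def already_requested(filename):
    return filename in load_log()["past_request"]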

Related

How to call an internet Python script from another local Python script

I have a Python script that does some complicated statistical calculations on data and I want to sell it. I thought about compiling it, but I have read that Python compiled with py2exe can be easily decompiled. So I tried another approach, namely running the script from the Internet.
Now my code, in file local.py looks like:
local.py
#imports
#define variables and arrays
#very complicated code I want to protect
I want to obtain something like:
File local.py:
#imports
#define variables and arrays
#call to online.py
print(result)
File online.py:
#code protected
#exit(result)
Is there a better way to do this?
I want the client to run the code without being able to see it.
A web API would work great for this, and it's a super simple thing to do. There are numerous Python frameworks for creating simple HTTP APIs, and even writing one from scratch with just low-level sockets wouldn't be too hard. If you only want to protect the code itself, you don't even need security features.
If you only want people who've paid for the usage to have access to your API, then you'll need some sort of authentication, like an API key. That's a pretty easy thing to do too, and may come nearly for free from some of the aforementioned frameworks.
So your client's code might look something like this:
File local.py:
import requests

inputdata = get_data_to_operate_on()
r = requests.post('https://api.gustavo.com/somemagicfunction?apikey=23423h234h2g432j34', data=inputdata)
if r.status_code == 200:
    result = r.json()
    # operate on the resulting JSON here
    ...
This code does an HTTP POST request, passing whatever data is returned by the get_data_to_operate_on() call in the body of the request. The body of the response is the result of the processing, which in this code is assumed to be JSON data, but could be in some other format as well.
There are all sorts of options you could add, by changing the path of the request (the '/somemagicfunction' part) or by adding additional query parameters.
This might help you to get started on the server side: https://nordicapis.com/8-open-source-frameworks-for-building-apis-in-python. And here's one way to host your Python server code: https://devcenter.heroku.com/articles/getting-started-with-python
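For completeness, a minimal sketch of what the server side might look like with Flask (the compute() function and the hard-coded key set are illustrative assumptions, not a finished design):

from flask import Flask, abort, jsonify, request

app = Flask(__name__)
API_KEYS = {"23423h234h2g432j34"}  # hypothetical keys handed out to paying clients

def compute(data):
    # placeholder for the complicated statistical code you want to keep server-side
    return {"ok": True, "input_length": len(data)}

@app.route("/somemagicfunction", methods=["POST"])
def somemagicfunction():
    if request.args.get("apikey") not in API_KEYS:
        abort(401)  # reject callers without a valid key
    result = compute(request.get_data())  # raw request body sent by the client
    return jsonify(result)

if __name__ == "__main__":
    app.run()

The client code above should work against this unchanged.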

Search/Filter/Select/Manipulate data from a website using Python

I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc.), search a name, select among the results those with a specific type (filtering, in other words), pick the option to save those results as opposed to emailing them, pick a format to save them in, and then download them by clicking the save button.
My question is, is there a way to do those steps using a Python program? I am only aware of extracting data and downloading pages/images, but I was wondering if there was a way to write a script that would manipulate the data, and do what a person would manually do, only for a large number of iterations.
I've thought of looking into the URL structure and finding a way to generate the correct URL for each iteration. But even if that works, I'm still stuck on the "Save" button: I can't find a link that would automatically download the data I want, and using a function of the urllib2 library would download the page but not the actual file that I want.
Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!
EDIT: When I inspect the save button, here is what I get: [screenshot of the save button's markup]
This would depend a lot on the website you're targeting and how their search is implemented.
Some websites, like Reddit, have an open API where you can add a .json extension to a URL and get a JSON string response instead of pure HTML.
Whether you're using a REST API or any other JSON response, you can load it as a Python dictionary using the json module, like this:
import json

json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'
rdict = json.loads(json_response)

def print_names(data):
    for entry in data["customers"]:
        print(entry["name"])

print_names(rdict)
You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn about how you can do search and filter through their API. This will make everything much easier than manipulating a browser through something like Selenium. If there's an API, then you could easily scale your solution up or down.
If there's no API, then you have to:
Use Selenium with a browser (I prefer Firefox).
Try to get as much info generated, filtered, etc. without actually having to push any buttons on that page, by learning how their search engine works with GET and POST requests. For example, if you're looking for books within a range, conduct that search manually and look at how the URL changes. If you're lucky, you'll see that your search criteria are in the URL. Using this info you can conduct a search just by visiting that URL, which means your program won't have to fill out a form and push buttons, drop-downs, etc. A rough sketch of this URL-based approach appears right after this list.
If you have to use the browser through Selenium (for example, if you want to save the whole page with its HTML, CSS, and JS files, you have to press Ctrl+S and then click the "Save" button), then you need to find libraries that allow you to manipulate the keyboard from Python. There are such libraries for Ubuntu; they allow you to press any key on the keyboard and even do key combinations.
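A rough sketch of that URL-based approach (the site, parameters, and CSV format here are made up, and the requests library is assumed to be available):

import requests

# Suppose inspecting a manual search shows a URL like
# https://www.example.org/search?title=dickens&year=1850&format=csv
params = {"title": "dickens", "year": "1850", "format": "csv"}
r = requests.get("https://www.example.org/search", params=params)
r.raise_for_status()

# If the "Save" option is really just a download link, the response body is
# the file itself and can be written straight to disk.
with open("results.csv", "wb") as f:
    f.write(r.content)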
An example of what's possible:
I wrote a script that logs me in to a website, then navigates to some page, downloads specific links on that page, visits every link, saves every page, avoids saving duplicate pages, and avoids getting caught (i.e. it doesn't behave like a bot by, for example, visiting 100 pages per minute).
The whole thing took 3-4 hours to code and it actually worked in a virtual Ubuntu machine I had running on my Mac, which means that while it was doing all that work I could still use my machine. If you don't use a virtual machine, then you'll either have to leave the script running and not interfere with it, or make a much more robust program that IMO is not worth coding since you can just use a virtual machine.

How to pass large data between systems

I have pricing data that is stored in XML format and is being generated on an hourly basis. It is roughly 100MB in size when stored as XML. I need to send this data to my main system in order to process it. In the future, it is also possible that this data will be sent every 1m.
What would be the best way to send this data? My thinking thus far was:
- It would be too large to send as JSON to a POST endpoint
- Possible to send it as XML and store it on my server
Is there a more optimal way to do this?
As mentioned in the answer by Michael Anderson, you could possibly send only a diff of the changes across each system.
One way to do this is to introduce a protocol such as git.
With git, you could:
Generate the data on the first system and push to a private repo
Have your second system pull the changes
This would be much more efficient than pulling the entire copy of the data every time.
It would also be compressed, and over an encrypted channel (depending on the git server/service)
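A rough sketch of what the push side could look like, driven from Python (the repo path and file name are assumptions; any scheduler such as cron could call this hourly):

import subprocess

DATA_REPO = "/srv/pricing-data"  # hypothetical local clone of the private repo

def push_hourly_snapshot():
    # assumes the fresh prices.xml has already been written into DATA_REPO
    subprocess.run(["git", "add", "prices.xml"], cwd=DATA_REPO, check=True)
    # commit exits non-zero when nothing changed, so don't treat that as fatal
    subprocess.run(["git", "commit", "-m", "hourly pricing snapshot"], cwd=DATA_REPO)
    # git's pack protocol transfers compressed deltas, so each push moves far
    # less data than the full 100MB file
    subprocess.run(["git", "push", "origin", "main"], cwd=DATA_REPO, check=True)

The receiving system then only needs a periodic git pull.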
Assuming you're on Linux and the data is already written somewhere in your filesystem, why not just do a simple scp or rsync in a crontab entry?
You probably want to compress before sending, or enable compression in the protocol.
If your data only changes slightly, you could also try sending a patch against the previous version (generated with diff) instead of the entire data, and then regenerating on the other end.
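A small sketch of that diff idea using difflib, Python's equivalent of the diff tool (the file paths are illustrative; the receiving end can rebuild the new version with the standard patch utility):

import difflib

def make_patch(old_path, new_path, patch_path):
    with open(old_path) as old, open(new_path) as new:
        diff = difflib.unified_diff(
            old.readlines(), new.readlines(),
            fromfile=old_path, tofile=new_path,
        )
    # The patch is usually a tiny fraction of the full file when only a few
    # prices change between runs.
    with open(patch_path, "w") as out:
        out.writelines(diff)

make_patch("prices_previous.xml", "prices_current.xml", "prices.patch")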

Database in Excel using win32com or xlrd Or Database in mysql

I have developed a website where the pages are simply html tables. I have also developed a server by expanding on python's SimpleHTTPServer. Now I am developing my database.
Most of the table contents on each page are static and don't need to be touched. However, there is one column per table (i.e. page) that needs to be editable and stored. The values are simply text that the user can enter. The user enters the text via HTML textareas that are appended to the tables via JavaScript.
The database is to store key/value pairs where the value is the user entered text (for now at least).
Current situation
Because the original format of my webpages was xlsx files, I opted to use an Excel workbook as my database that basically just mirrors the displayed HTML tables (pages).
I hook up to the Excel workbook through win32com. Every time a table (page) loads, JavaScript iterates through the HTML textareas and sends an individual request to the server to load in its respective text from the database.
Currently this approach works but is terribly slow. I have tried to optimize everything as much as I can and I believe the speed limitation is a direct consequence of win32com.
Thus, I see four possible ways to go:
Replace my current win32com functionality with xlrd
Try to load all the html textareas for a table (page) at once through one server call to the database using win32com
Switch to something like sql (probably use mysql since it's simple and robust enough for my needs)
Use xlrd but make a single call to the server for each table (page) as in (2)
My schedule to build this functionality is around two days.
Does anyone have any thoughts on the tradeoffs in time-spent-coding versus speed of these approaches? If anyone has any better/more streamlined methods in mind please share!
Probably not the answer you were looking for, but your post is very broad, and I've used win32com and Excel a fair bit and don't see those as good tools towards your goal. An easier strategy is this:
for the server, use Flask: it is a Python HTTP server that makes it crazy easy to respond to HTTP requests via Python code and HTML templates. You'll have a fully capable server running in 5 minutes, then you will need a bit of time to create code to get data from your DB and render it from templates (which are really easy to use).
for the database, use SQLite (there is far more overhead integrating with MySQL); because you only have 2 days,
you could also use a simple CSV file, since the API (Python has a CSV file read/write module) is much simpler and needs less ramp-up time. One CSV per user, easy to manage. You don't worry about inserting rows for a user, you just append; and you don't implement removing rows for a user, you just mark them as inactive (a column for active/inactive in your CSV). When processing a GET request from the client, as you read from the CSV you can count how many rows are inactive and occasionally rewrite the CSV, so once in a while a request will be a little slower to respond to the client. (A sketch of this append-and-mark-inactive idea appears at the end of this answer.)
even simpler yet, you could use an in-memory data structure of your choice if you don't need persistence across restarts of the server. If this is for a demo, this should be an acceptable limitation.
for the client side, use jQuery on top of javascript -- maybe you are doing that already. Makes it super easy to manipulate the DOM and use effects like slide-in/out etc. Get yourself the book "Learning jQuery", you'll be able to make good use of jQuery in just a couple hours.
If you only have two days it might be a little tight, but you will probably need more than 2 days to get around the issues you are facing with your current strategy, and issues you will face imminently.
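A rough sketch of the append-and-mark-inactive CSV idea mentioned above (one file per user; the column names are just an assumption):

import csv

FIELDS = ["key", "value", "active"]

def append_row(path, key, value, active="1"):
    # An "edit" or a "delete" is just another appended row; no rewriting needed.
    with open(path, "a", newline="") as f:
        csv.DictWriter(f, fieldnames=FIELDS).writerow(
            {"key": key, "value": value, "active": active}
        )

def load_active(path):
    # Later rows win, so the last appended value for a key is the current one.
    values = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f, fieldnames=FIELDS):
            if row["active"] == "1":
                values[row["key"]] = row["value"]
            else:
                values.pop(row["key"], None)
    return values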

How do I go to a random website? - python

How do I generate a random yet valid website link, regardless of language? Actually, the more diverse the languages of the websites it generates, the better.
I've been doing it by using other people's scripts on their webpages; how can I avoid relying on these random-site-forwarding scripts and make my own? I've been doing it as such:
import webbrowser
from random import choice

random_page_generator = ['http://www.randomwebsite.com/cgi-bin/random.pl',
                         'http://www.uroulette.com/visit']
webbrowser.open(choice(random_page_generator), new=2)
There are two ways to do this:
Create your own spider that amasses a huge collection of websites, and pick from that collection.
Access some pre-existing collection of websites, and pick from that collection. For example, DMOZ/ODP lets you download their entire database;* Google used to have a customized random site URL;** etc.
There is no other way around it (short of randomly generating and testing valid strings of arbitrary characters, which would be a ridiculously bad idea).
Building a web spider for yourself can be a fun project. Link-driven scraping libraries like Scrapy can do a lot of the grunt work for you, leaving you to write the part you care about.
* Note that ODP is a pretty small database compared to something like Google's or Yahoo's, because it's primarily a human-edited collection of significant websites rather than an auto-generated collection of everything anyone has put on the web.
** Google's random site feature was driven by both popularity and your own search history. However, by feeding it an empty search history, you could remove that part of the equation. Anyway, I don't think it exists anymore.
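Following up on the spider suggestion: a tiny sketch of a link-collecting Scrapy spider (the seed URL is arbitrary, and real crawling needs politeness settings and some limit on depth or page count):

import scrapy

class LinkCollector(scrapy.Spider):
    name = "linkcollector"
    start_urls = ["https://en.wikipedia.org/wiki/Special:Random"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}              # save it for the random picker
            yield response.follow(href, callback=self.parse)   # keep crawling

Run it with something like scrapy runspider linkcollector.py -o urls.jl, then pick a random line from urls.jl.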
A conceptual explanation, not a code one.
Their scripts are likely very large and comprehensive. If it's a random website selector, they have a huge, huge list of websites line by line, and the script just picks one. If it's a random URL generator, it probably generates a string of letters (e.g. "asljasldjkns"), plugs it between http:// and .com, tries to see if it is a valid URL, and if it is, sends you that URL.
The easiest way to design your own might be to ask to have a look at theirs, though I'm not certain of the success you'd have there.
The best way as a programmer is simply to decipher the nature of URL language. Practice the building of strings and testing them, or compile a huge database of them yourself.
As a hybridization, you might try building two things. One script that, while you're away, searches for/tests URLs and adds them to a database. Another script that randomly selects a line out of this database to send you on your way. The longer you run the first, the better the second becomes.
EDIT: Do Abarnert's thing about spiders, that's much better than my answer.
The other answers suggest building large databases of URLs; there is another method which I've used in the past and documented here:
http://41j.com/blog/2011/10/find-a-random-webserver-using-libcurl/
The idea is to create a random IP address and then try to grab a site from port 80 of that address. This method is not perfect with modern virtually hosted sites, and of course it only fetches the top page, but it can be an easy and effective way of getting random sites. The code linked above is C, but it should be easily callable from Python, or the method could easily be adapted to Python.
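A rough Python adaptation of that idea (requests is assumed to be available; expect the vast majority of attempts to time out or be refused):

import random
import requests

def random_site(max_tries=1000):
    for _ in range(max_tries):
        ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
        try:
            r = requests.get(f"http://{ip}/", timeout=2)
            if r.ok:
                return r.url  # redirects are followed, so this is the final URL
        except requests.RequestException:
            continue
    return None

print(random_site())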
