I'd like to start by asking for your opinion on how I should tackle this task, instead of simply how to structure my code.
Here is what I'm trying to do: I have a lot of data loaded into a MySQL table for a large number of unique names + dates (i.e., where the date is a separate field). My goal is to be able to select a particular name (using raw_input, and perhaps in the future add a drop-down menu) and see a monthly trend, with a moving average, and perhaps other stats, for one of the fields (revenue, revenue per month, clicks, etc). What is your advice - to move this data to an Excel workbook via Python, or is there a way to display this information in Python (with charts that compare to Excel, of course)?
Thanks!
Analyzing this kind of (name, date) data comes down to issuing ad-hoc SQL queries that return time series.
You 'sample' your data by a date/time frame (day/week/month/year, or in more detail by hour/minute), depending on how large your dataset is.
I often use queries like this, where the date field is truncated to the sample rate. In MySQL the DATE_FORMAT function works well for that (PostgreSQL and Oracle use date_trunc and trunc respectively).
What you want to see in your data goes in the WHERE conditions.
select DATE_FORMAT(date_field,'%Y-%m-%d') as day,
COUNT(*) as nb_event
FROM yourtable
WHERE name = 'specific_value_to_analyze'
GROUP BY DATE_FORMAT(date_field,'%Y-%m-%d');
Execute this query and write the output to a CSV file. You could use the mysql command-line client directly for that, but I recommend writing a Python script that executes the query, with getopt options for output formatting (with or without column headers, a separator other than the default one, etc.). You can even build the query dynamically based on some of those options.
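A minimal sketch of such an export script, for illustration only. To keep it runnable without a MySQL server it uses an in-memory SQLite database as a stand-in, so the date truncation is spelled strftime instead of DATE_FORMAT. The function name export_query_to_csv and the sample rows are made up for this example; the header and delimiter arguments are the kind of thing you would wire to command-line options.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, sql, params, out, header=True, delimiter=","):
    """Run `sql` on a DB-API connection and write the rows to `out` as CSV.

    Returns the number of data rows written.
    """
    cur = conn.cursor()
    cur.execute(sql, params)
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    if header:
        # cursor.description has one entry per column; item 0 is the name.
        writer.writerow([col[0] for col in cur.description])
    written = 0
    for row in cur:
        writer.writerow(row)
        written += 1
    return written


# Stand-in data (assumption): SQLite instead of MySQL, invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (name TEXT, date_field TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [
        ("acme", "2013-01-01 09:00:00"),
        ("acme", "2013-01-01 17:30:00"),
        ("acme", "2013-01-03 12:00:00"),
        ("other", "2013-01-01 10:00:00"),
    ],
)

# On MySQL this would be the DATE_FORMAT query from the answer. With MySQLdb
# or PyMySQL the placeholder is %s, and a literal % in the SQL (as in the
# DATE_FORMAT pattern) then has to be written %%.
QUERY = """
    SELECT strftime('%Y-%m-%d', date_field) AS day, COUNT(*) AS nb_event
    FROM yourtable
    WHERE name = ?
    GROUP BY day
    ORDER BY day
"""

buffer = io.StringIO()
row_count = export_query_to_csv(conn, QUERY, ("acme",), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing the name as a query parameter rather than pasting it into the SQL string matters once the value comes from user input.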
To plot this kind of data, look at time series tools. If you have missing data (dates that won't appear in the result of such a SQL query), take that into account when choosing a tool. I don't think Excel is the right one for that (or I haven't mastered it well enough), but it could be a start.
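If you do the gap handling yourself before plotting, it could look roughly like this. It is only a sketch: it assumes that a day missing from the query result means zero events (true for a COUNT, not necessarily for other metrics), and the function names and sample numbers are invented for the example.

```python
from datetime import date, timedelta


def fill_missing_days(counts, start, end, fill=0):
    """Return [(day, value)] for every day from start to end inclusive,
    using `fill` for days that are absent from `counts`."""
    series = []
    day = start
    while day <= end:
        series.append((day, counts.get(day, fill)))
        day += timedelta(days=1)
    return series


def moving_average(values, window):
    """Trailing moving average. The first points, where fewer than `window`
    values exist, are averaged over what is available so far."""
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        averages.append(running / min(i + 1, window))
    return averages


# Invented sample: 2013-01-02 is missing from the query result.
counts = {date(2013, 1, 1): 2, date(2013, 1, 3): 1, date(2013, 1, 4): 3}
series = fill_missing_days(counts, date(2013, 1, 1), date(2013, 1, 4))
values = [value for _, value in series]
smoothed = moving_average(values, window=2)
print(list(zip([day.isoformat() for day, _ in series], values, smoothed)))
```

Without the fill step the moving average would silently treat 2013-01-01 and 2013-01-03 as neighbours, which is the pitfall with missing dates.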
Personally, I find dygraphs, a JavaScript library, really good for time series plotting, and it can use a CSV file as its source. Be careful with that setup: because of cross-domain security constraints, the CSV file and the HTML page that displays the Dygraph object should be on the same server (or whatever your browser's security constraints will accept).
I used to build this kind of web app with Django, as it's my favourite web framework, wrapping the URL calls like this:
GET /timeserie/view/<category>/<value_to_plot>
GET /timeserie/csv/<category>/<value_to_plot>
The first URL calls a view that simply renders a template with a variable referencing the URL of the CSV file for the Dygraph object:
<script type="text/javascript">
g3 = new Dygraph(
document.getElementById("graphdiv3"),
"{{ csv_url }}",
{
rollPeriod: 15,
showRoller: true
}
);
</script>
The second URL calls a view that generates the SQL query and outputs the result as text/csv to be rendered by Dygraph.
It's "home made": it can stay simple or be extended, runs easily on any desktop computer, and could be extended to output JSON for use by other JavaScript libraries/frameworks.
Otherwise there are open-source tools for this kind of reporting (though their time series capabilities often fall short of my needs), such as Pentaho, JasperReports and SOFA. You define the query as a data source inside a report in such a tool and build a graph that displays the time series.
I find that today's web techniques, with the right JavaScript library/framework, are really starting to challenge that old style of reporting with classical BI tools, and they make things interactive :-)
Your problem can be broken down into two main pieces: analyzing the data, and presenting it. I assume that you already know how to do the data analysis part, and you're wondering how to present it.
This seems like a problem that's particularly well suited to a web app. Is there a reason why you would want to avoid that?
If you're very new to web programming and programming in general, then something like web2py could be an easy way to get started. There's a simple tutorial here.
For a desktop database-heavy app, have a look at Dabo. It makes things like creating views on database tables really simple. wxPython, on which it's built, also has lots of simple graphing features.
Related
I would like to create a website where I show some text but mainly dynamic data in tables and plots. Let us assume that the user can choose whether he wants to see the DAX or the DOW JONES prices for a specific timeframe. I guess these data I have to store in a database. As I am not experienced with creating websites, I have no idea what the most reasonable setup for this website would be.
Would it be reasonable for this example to choose a database where every row consists of 9 fields, where the first column is the timestamp (let's say data for every minute), the next four columns correspond to the high, low, open, close price of DAX for this timestamp and columns 6 to 9 correspond to the high, low, open, close price for DOW JONES?
Could this be scaled to hundreds of columns with a reasonable speed of the database?
Is this an efficient implementation?
When this website is online, you can choose whether you want to see DAX or DOW JONES prices for a specific timeframe. The corresponding data would be chosen via python from the database and plotted in the graph. Is this the general idea how this will be implemented?
To get the data, I can run another python script on the webserver to dynamically collect the desired data and write them in the database?
As a total beginner with web hosting (is this even the right term?) it is very hard for me to ask precise questions. I would be happy if I could find out what general structure I need to create the website, the database and the connection between both. I was thinking about Amazon Web Services.
You could use a database, but that doesn't seem necessary for what you described.
It would be reasonable to build the database as you described. Look into SQL for doing so. You can download the XAMPP package, which will give you pretty much everything you need for that. This scales easily to hundreds of thousands of entries - that's what databases are for.
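As a rough sketch of the table layout from the question and of picking one index for a timeframe from Python: this uses SQLite from the standard library only so that it runs anywhere, and the table name, column names and prices are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prices (
        ts TEXT PRIMARY KEY,  -- one row per minute
        dax_high REAL, dax_low REAL, dax_open REAL, dax_close REAL,
        dow_high REAL, dow_low REAL, dow_open REAL, dow_close REAL
    )
    """
)
# Made-up sample values, not real market data.
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("2020-01-02 09:00", 13250.0, 13240.0, 13245.0, 13248.0,
         28800.0, 28790.0, 28795.0, 28798.0),
        ("2020-01-02 09:01", 13255.0, 13246.0, 13248.0, 13252.0,
         28805.0, 28796.0, 28798.0, 28801.0),
        ("2020-01-02 09:02", 13260.0, 13250.0, 13252.0, 13258.0,
         28810.0, 28799.0, 28801.0, 28807.0),
    ],
)

KNOWN_INDEXES = ("dax", "dow")


def close_prices(conn, index, start, end):
    """Return (timestamp, close) rows for one index with start <= ts < end."""
    # Column names cannot be bound as query parameters, so the user's choice
    # is checked against a fixed list before it goes into the SQL text.
    if index not in KNOWN_INDEXES:
        raise ValueError("unknown index: %r" % (index,))
    sql = (
        "SELECT ts, {0}_close FROM prices "
        "WHERE ts >= ? AND ts < ? ORDER BY ts".format(index)
    )
    return conn.execute(sql, (start, end)).fetchall()


dax = close_prices(conn, "dax", "2020-01-02 09:00", "2020-01-02 09:02")
dow = close_prices(conn, "dow", "2020-01-02 09:00", "2020-01-02 10:00")
print(dax)
```

The rows returned here are what the web page would hand to its plotting code.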
If your example of stock prices is actually what you are trying to show, however, this is completely unnecessary as there are already plenty of databases that have this data and will allow you to query them. What you would really want in this scenario is an API. Alpha Vantage is a free service that will serve you data on stock prices, and has plenty of documentation to help you get it set up with python.
I would structure the project like this:
Use the python library Flask to set up the back end.
In addition to instantiating the Flask app, instantiate the Alpha Vantage class as well (you will need to pip install both of these).
In one of the routes you declare under Flask, use the Alpha Vantage api to get the data you need and simply display it to the screen.
If I am assuming you are a complete beginner, one or more of those steps may not make sense to you, in which case take them one at a time. Start by learning how to build a basic Flask app, then look at the API.
YouTube is your friend for both of these things.
Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using python with pandas, xlsxwriter, etc., and am now in the process of replicating what I have done in the past in this web app. I am using a PostgreSQL database to store the data, and then using Django to build the app and fusioncharts for the visualization. In order to get the information into Postgres, I am using a python script with sqlalchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will be populating the charts. 1) I can use the same script that exports the data to postgres to arrange the data as I like it before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer for instance), then perform calculations on the groups by columns. I could do this for each different slice I want and then export different tables for each model class to postgres.
2) I can upload the entire database to postgres and manipulate it later with django commands that produce SQL queries.
I am much more comfortable doing it up front with python because I have been doing it that way for a while. I also understand that django's queries are a little more difficult to implement. However, doing it with python would mean that I will need more tables (because I will have grouped them in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performance and flexibility. With the first approach you get better performance (your schema is tailored to the exact queries you want to run) but lose flexibility (if you need to add more queries, the schema might not match so well, or even not match at all, in which case you'll have to repopulate the database, possibly from raw sources, with an updated schema). With the second one you (hopefully) have a well-normalized schema, but one that makes queries much more complex and much heavier on the database server.
Now the question is: do you really have to choose? You could also have both the fully normalized data AND the denormalized (pre-processed) data alongside.
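To make the "both alongside" idea concrete, here is a small sketch: the raw rows stay in one table, and a pre-aggregated table is rebuilt from them whenever the raw data changes. It uses SQLite purely so that it runs without a server; the table names, columns and figures are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Raw data, kept as uploaded.
    CREATE TABLE part_sales (
        customer TEXT, part TEXT, sold_on TEXT, amount REAL
    );
    INSERT INTO part_sales VALUES
        ('acme',   'bolt', '2015-01-05', 100.0),
        ('acme',   'nut',  '2015-01-20',  50.0),
        ('acme',   'bolt', '2015-02-02',  25.0),
        ('zenith', 'bolt', '2015-01-11',  10.0);
    """
)


def rebuild_summary(conn):
    """Recreate the pre-processed table the dashboard charts read from."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS sales_by_customer_month;
        CREATE TABLE sales_by_customer_month AS
        SELECT customer,
               substr(sold_on, 1, 7) AS month,  -- 'YYYY-MM'
               SUM(amount) AS total,
               COUNT(*) AS n_sales
        FROM part_sales
        GROUP BY customer, month;
        """
    )


rebuild_summary(conn)
summary = conn.execute(
    "SELECT customer, month, total, n_sales "
    "FROM sales_by_customer_month ORDER BY customer, month"
).fetchall()
print(summary)
```

If a new chart needs a different slice, you add another summary table and rebuild it from the raw rows, without re-importing anything. On PostgreSQL a materialized view can play the role of the summary table.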
As a side note: Django's ORM is indeed mostly an "80/20" tool. It's designed to make the 80% of simple queries super easy (much easier than, say, SQLAlchemy), and after that it becomes a bit of a PITA. But nothing forces you to use Django's ORM for everything (you can always drop down to raw SQL or use SQLAlchemy alongside).
Oh, and yes: your problem is nothing new. You may want to read about OLAP.
This might sound like a bit of an odd question - but is it possible to load data from a (in this case MySQL) table to be used in Django without the need for a model to be present?
I realise this isn't really the Django way, but given my current scenario, I don't really know how better to solve the problem.
I'm working on a site, which for one aspect makes use of a table of data which has been bought from a third party. The columns of interest are likely to remain stable, however the structure of the table could change with subsequent updates to the data set. The table is also massive (in terms of columns) - so I'm not keen on typing out each field in the model one-by-one. I'd also like to leave the table intact - so coming up with a model which represents the set of columns I am interested in is not really an ideal solution.
Ideally, I want to have this table in a database somewhere (possibly separate to the main site database) and access its contents directly using SQL.
You can always execute raw SQL directly against the database: see the docs.
Django has a feature called inspectdb. For legacy databases, such as an existing MySQL schema, it generates models automatically by inspecting your DB tables. You save its output in your app as models.py, so you don't need to type out every column manually. Read the documentation carefully before creating the models, though, because it may affect the DB data. I hope this is useful for you.
I guess you can use any SQL library available for Python. For example : http://www.sqlalchemy.org/
You then just have to connect to your database, run your query and use the data as you wish. I think you can't use Django without its model system, but nothing prevents you from using another library for this in parallel.
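Whichever library you pick, the raw-SQL route tends to look the same, because Python database drivers share the DB-API cursor interface. A sketch, using SQLite as a stand-in for the MySQL table; vendor_data, its columns and the helper name fetch_dicts are invented for the example.

```python
import sqlite3


def fetch_dicts(conn, sql, params=()):
    """Run a query on a DB-API connection and return rows as dicts keyed by
    column name, so no model class is needed."""
    cur = conn.cursor()
    cur.execute(sql, params)
    names = [col[0] for col in cur.description]
    return [dict(zip(names, row)) for row in cur.fetchall()]


# Stand-in for the purchased table (assumption): a few of its many columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_data (id INTEGER, region TEXT, score REAL, unused_col TEXT)"
)
conn.executemany(
    "INSERT INTO vendor_data VALUES (?, ?, ?, ?)",
    [(1, "north", 0.5, "x"), (2, "south", 0.9, "y")],
)

# Select only the columns of interest; the rest of the table can change
# without breaking this code.
rows = fetch_dicts(
    conn, "SELECT id, score FROM vendor_data WHERE region = ?", ("south",)
)
print(rows)
```

Inside a Django project the cursor would come from django.db.connection (or django.db.connections for a second, separately configured database) instead of sqlite3.connect, and MySQL drivers use %s rather than ? as the placeholder.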
I am new to python and pyramid and I am trying to figure out a way to print out some object values that I am using in a view callable to get a better idea of how things are working. More specifically, I am wanting to see what is coming out of a sqlalchemy query.
DBSession.query(User).filter(User.name.like('%'+request.matchdict['search']+'%'))
I need to take that query and then look up what Office a user belongs to by the office_id attribute that is part of the User object. I was thinking of looping through the users that come up from that query and doing another query to look up the office information (in the offices table). I need to build a dictionary that includes some User information and some Office information then return it to the browser as json.
Is there a way that I can experiment with different attempts at this while viewing my output, without having to rely on the browser? I am more of a front end developer, so when I am writing JavaScript I just view my outputs using console.log(output).
console.log(output) is to JavaScript
as
????? is to Python (specifically pyramid view callable)
Hope the question is not dumb. Just trying to learn. Appreciate anyone's help.
This is a good reason to experiment with pshell, Pyramid's interactive python interpreter. From within pshell you can tinker with things on the command-line and see what they will do before adding them to your application.
http://docs.pylonsproject.org/projects/pyramid/en/1.4-branch/narr/commandline.html#the-interactive-shell
Of course, you can always use "print" and things will show up in the console. SQLAlchemy also has the sqlalchemy.echo ini option that you can turn on to see all queries. And finally, it sounds like you just need to do a join but maybe aren't familiar with how to write complex database queries, so I'd suggest you look into that before resorting to writing separate queries. Likely a single query can return you what you need.
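To show what "a single query with a join" means here, this is the idea in plain SQL, run through the standard-library sqlite3 module rather than SQLAlchemy so that it is self-contained. The table layout and the sample users are guesses based on the question, not the real schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be read by column name
conn.executescript(
    """
    CREATE TABLE offices (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        office_id INTEGER REFERENCES offices(id)
    );
    INSERT INTO offices VALUES (1, 'Berlin'), (2, 'Austin');
    INSERT INTO users VALUES (1, 'Joanna', 1), (2, 'John', 2), (3, 'Mary', 1);
    """
)


def search_users(conn, search):
    """One query returns each matching user together with their office."""
    sql = """
        SELECT users.name AS name, offices.city AS office
        FROM users
        JOIN offices ON offices.id = users.office_id
        WHERE users.name LIKE ?
        ORDER BY users.id
    """
    rows = conn.execute(sql, ("%" + search + "%",)).fetchall()
    return [{"name": row["name"], "office": row["office"]} for row in rows]


result = search_users(conn, "Jo")
# print() is the console.log() of this example: the output shows up in the
# terminal where the script (or the development server) is running.
print(json.dumps(result))
```

With SQLAlchemy the same shape comes from joining Office onto the User query instead of looping over users and querying once per user.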
Note: Scroll down to the Background section for useful details. Assume the project uses Python-Django and South, in the following illustration.
What's the best way to import the following CSV
"john","doe","savings","personal"
"john","doe","savings","business"
"john","doe","checking","personal"
"john","doe","checking","business"
"jemma","donut","checking","personal"
Into a PostgreSQL database with the related tables Person, Account, and AccountType considering:
Admin users can change the database model and CSV import-representation in real-time via a custom UI
The saved CSV-to-Database table/field mappings are used when regular users import CSV files
So far two approaches have been considered
ETL-API Approach: Providing an ETL API a spreadsheet, my CSV-to-Database table/field mappings, and connection info to the target database. The API would then load the spreadsheet and populate the target database tables. Looking at pygrametl, I don't think what I'm aiming for is possible. In fact, I'm not sure any ETL APIs do this.
Row-level Insert Approach: Parsing the CSV-to-Database table/field mappings, parsing the spreadsheet, and generating SQL inserts in "join-order".
I implemented the second approach but am struggling with algorithm defects and code complexity. Is there a python ETL API out there that does what I want? Or an approach that doesn't involve reinventing the wheel?
Background
The company I work at is looking to move hundreds of project-specific design spreadsheets hosted in sharepoint into databases. We're near completing a web application that meets the need by allowing an administrator to define/model a database for each project, store spreadsheets in it, and define the browse experience. At this stage of completion transitioning to a commercial tool isn't an option. Think of the web application as a django-admin alternative, though it isn't, with a DB modeling UI, CSV import/export functionality, customizable browse, and modularized code to address project-specific customizations.
The implemented CSV import interface is cumbersome and buggy, so I'm trying to get feedback and find alternate approaches.
How about separating the problem into two separate problems?
Create a Person class which represents a person in the database. This could use Django's ORM, or extend it, or you could do it yourself.
Now you have two issues:
Create a Person instance from a row in the CSV.
Save a Person instance to the database.
Now, instead of just CSV-to-Database, you have CSV-to-Person and Person-to-Database. I think this is conceptually cleaner. When the admins change the schema, that changes the Person-to-Database side. When the admins change the CSV format, they're changing the CSV-to-Person side. Now you can deal with each separately.
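A rough sketch of that split, using the CSV from the question. Everything here is illustrative: the Person dataclass, the two function names and the two-table layout (the AccountType table is folded into a text column to keep it short) are assumptions, and SQLite stands in for PostgreSQL.

```python
import csv
import io
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Person:
    first_name: str
    last_name: str
    accounts: list = field(default_factory=list)  # (account, account_type)


def people_from_csv(lines, columns=("first_name", "last_name",
                                    "account", "account_type")):
    """CSV-to-Person. `columns` is the CSV-side mapping the admins edit."""
    people = {}
    for raw in csv.reader(lines):
        row = dict(zip(columns, raw))
        key = (row["first_name"], row["last_name"])
        person = people.setdefault(key, Person(*key))
        person.accounts.append((row["account"], row["account_type"]))
    return list(people.values())


def save_person(conn, person):
    """Person-to-Database. The only code that knows the table layout."""
    cur = conn.execute(
        "INSERT INTO person (first_name, last_name) VALUES (?, ?)",
        (person.first_name, person.last_name),
    )
    conn.executemany(
        "INSERT INTO account (person_id, name, type) VALUES (?, ?, ?)",
        [(cur.lastrowid, name, kind) for name, kind in person.accounts],
    )


CSV_TEXT = "\n".join([
    '"john","doe","savings","personal"',
    '"john","doe","savings","business"',
    '"john","doe","checking","personal"',
    '"john","doe","checking","business"',
    '"jemma","donut","checking","personal"',
])

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE account (id INTEGER PRIMARY KEY, person_id INTEGER, name TEXT, type TEXT);
    """
)

people = people_from_csv(io.StringIO(CSV_TEXT))
for person in people:
    save_person(conn, person)
account_count = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
print(len(people), account_count)
```

A change in CSV column order only touches the `columns` argument, and a schema change only touches save_person.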
Does that help any?
I write import sub-systems almost every month at work, and since I do that kind of task so much, I wrote django-data-importer some time ago. This importer works like a Django form and has readers for CSV, XLS and XLSX files that give you lists of dicts.
With the data_importer readers you can read a file into lists of dicts, iterate over them with a for loop and save the lines to the DB.
With the importer you can do the same, with the bonus of validating each field of a line, logging errors and actions, and saving everything at the end.
Please take a look at https://github.com/chronossc/django-data-importer. I'm pretty sure that it will solve your problem and will help you process any kind of CSV file from now on :)
To solve your problem I suggest using data-importer with Celery tasks. You upload the file and fire the import task via a simple interface. The Celery task sends the file to the importer, and you can validate lines, save them, and log errors. With some effort you can even show the task's progress to the users who uploaded the sheet.
I ended up taking a few steps back to address this problem per Occam's razor using updatable SQL views. It meant a few sacrifices:
Removing: South.DB-dependent real-time schema administration API, dynamic model loading, and dynamic ORM syncing
Defining models.py and an initial south migration by hand.
This allows for a simple approach to importing flat datasets (CSV/Excel) into a normalized database:
Define unmanaged models in models.py for each spreadsheet
Map those to updatable SQL Views (INSERT/UPDATE-INSTEAD SQL RULEs) in the initial south migration that adhere to the spreadsheet field layout
Iterate through the CSV/Excel spreadsheet rows and perform an INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>);
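For readers who want to try the idea without a PostgreSQL server, here is a small analogue. Note the swap: SQLite has no rule system, so this sketch uses the nearest SQLite feature, an INSTEAD OF INSERT trigger on a view, in place of the INSERT-INSTEAD rules described above. The table and view names and the three-table layout are invented for the example.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
        UNIQUE (first_name, last_name)
    );
    CREATE TABLE account_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE account (
        id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(id),
        kind TEXT,
        account_type_id INTEGER REFERENCES account_type(id)
    );

    -- Flat view with the same field layout as the spreadsheet.
    CREATE VIEW account_sheet AS
    SELECT p.first_name AS first_name, p.last_name AS last_name,
           a.kind AS kind, t.name AS account_type
    FROM account a
    JOIN person p ON p.id = a.person_id
    JOIN account_type t ON t.id = a.account_type_id;

    -- Inserting into the view fans out into the normalized tables.
    CREATE TRIGGER account_sheet_insert INSTEAD OF INSERT ON account_sheet
    BEGIN
        INSERT OR IGNORE INTO person (first_name, last_name)
        VALUES (NEW.first_name, NEW.last_name);
        INSERT OR IGNORE INTO account_type (name) VALUES (NEW.account_type);
        INSERT INTO account (person_id, kind, account_type_id)
        VALUES (
            (SELECT id FROM person
             WHERE first_name = NEW.first_name AND last_name = NEW.last_name),
            NEW.kind,
            (SELECT id FROM account_type WHERE name = NEW.account_type)
        );
    END;
    """
)

CSV_TEXT = "\n".join([
    '"john","doe","savings","personal"',
    '"john","doe","savings","business"',
    '"john","doe","checking","personal"',
    '"john","doe","checking","business"',
    '"jemma","donut","checking","personal"',
])

# The import loop itself knows nothing about the normalized tables.
for row in csv.reader(io.StringIO(CSV_TEXT)):
    conn.execute(
        "INSERT INTO account_sheet (first_name, last_name, kind, account_type) "
        "VALUES (?, ?, ?, ?)",
        row,
    )

counts = {
    table: conn.execute("SELECT COUNT(*) FROM " + table).fetchone()[0]
    for table in ("person", "account_type", "account", "account_sheet")
}
print(counts)
```

The import code stays a plain loop of inserts, and all knowledge of the normalized layout lives in the view definition.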
Here is another approach that I found on GitHub. Basically it detects the schema and allows overrides. Its whole goal is just to generate raw SQL to be executed by psql and/or whatever driver.
https://github.com/nmccready/csv2psql
% python setup.py install
% csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
% psql -f enrolled.sql
There are also a bunch of options for doing alters (creating primary keys from many existing cols) and merging / dumps.