I'm trying to import some data into my app. I'm a 'n00b' and this is my first app, so any advice is welcome.
The data I want to upload is a .csv of about 350 MB. I tried to upload it with the bulk upload tool, but I got this error:
Unable to download kind stats for all-kinds download.
Kind stats are generated periodically by the appserver.
Kind stats are not available on dev_appserver.
I searched a little and found that this happens because the app doesn't have datastore statistics yet. I created the model and populated it with some entities a few days ago, and the app still doesn't have the statistics. I need some advice: what would be the best way to upload this amount of data to the app?
In case it's useful, the data is here:
http://dados.gov.br/dataset/atendimentos-de-consumidores-nos-procons-sindec
Thanks!
You must be using the automatic option to generate bulkloader.yaml. For the auto-generation, App Engine uses the datastore statistics as its reference. Initially you had no data, hence no statistics, and so you got the error about missing kind stats.
Later, when you added data to your datastore, the statistics may still have been empty, because it normally takes up to 24 hours, and sometimes a few days, for App Engine to update datastore statistics. So your data stats are probably still empty, and the auto-generation option will not work until they exist.
You can follow the bulk upload documentation and manually create a bulkloader.yaml based on the example given there, modifying it to match your datastore kinds and their property types. Using this manually written file, you should be able to proceed.
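As a rough illustration, a hand-written bulkloader.yaml could look like the sketch below. The kind name and property names are placeholders (they are not taken from your model), so replace them with your own fields and adjust the transforms to match their types:

python_preamble:
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.api.datastore

transformers:
- kind: Atendimento              # placeholder kind name
  connector: csv
  connector_options:
    encoding: utf-8
  property_map:
    - property: consumidor       # placeholder string property
      external_name: consumidor
    - property: uf
      external_name: uf
    - property: data_abertura    # placeholder date property
      external_name: data_abertura
      import_transform: transform.import_date_time('%Y-%m-%d')

With a file like this in place, the bulk uploader can use it directly without waiting for datastore statistics.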
I'm doing a web scraping and data analysis project that consists of scraping product prices every day, cleaning the data, and storing it in a PostgreSQL database. The final user won't have access to this database (the daily scrapes grow so large that eventually I won't be able to upload them to GitHub), but I want to explain how to replicate the project. The steps are basically:
Scrape with Selenium (Python) and save the raw data into CSV files (already on GitHub);
Read these CSV files, clean the data, and store it in the database (the cleaning script is already on GitHub);
Retrieve the data from the database to create dashboards and anything else I want (not yet implemented).
To clarify, my question is about how I can teach someone who sees my project to replicate it, given that this person won't have the database details (tables, columns). My idea is:
Add SQL queries in a folder, showing how to create the database skeleton (same tables and columns);
Add info to the README, such as how to create the environment variables used to access the database (a sketch of this part is shown below).
Is it okay to do that? I'm looking for best practices in this context. Thanks!
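For the environment-variable part, a minimal sketch of what someone replicating the project would run could look like this (the variable names and the psycopg2 driver are assumptions; use whatever your cleaning script already uses):

import os

import psycopg2  # assumed PostgreSQL driver

# Read connection details from environment variables so that no credentials or
# server-specific settings live in the repository.
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

Together with the SQL files that create the empty tables, that is usually enough for someone else to rebuild the database and run your scripts against it.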
I'm trying to add real-time features to my Django web app. Basically, I want to show real-time data on a web page.
I have an external Python script that generates some JSON data, not big data but around 10 records per second. On the other side I have a Django app, and I would like the Django app to receive that data and show it on an HTML page in real time. I've already considered writing the data to a database and then retrieving it from Django, but that would mean far too many queries: Django would query the DB at least once per second for every user, and my external script would be writing a lot of data every second.
What I'm missing is a 'central' system, a way to make these two pieces communicate. I know the question is probably not specific enough, but is there some way to do this? I know a bit about Django Channels, but I don't know whether it can do what I want; I've also considered pushing the data onto a RabbitMQ queue and consuming it from Django, but that isn't the best use of RabbitMQ.
So is there a way to do this with Django Channels? Any kind of advice is appreciated.
I would suggest using Django Channels. You can also use Redis instead of RabbitMQ. In your case, Redis might be a better choice.
Here is an approach: http://www.maxburstein.com/blog/realtime-django-using-nodejs-and-socketio/
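For the Channels route specifically, here is a very rough sketch. The group name, consumer name, and payload are made up, and it assumes Channels is installed with a Redis channel layer configured in settings:

# consumers.py -- relays anything published to the "live_data" group to the browser.
from channels.generic.websocket import AsyncJsonWebsocketConsumer

class LiveDataConsumer(AsyncJsonWebsocketConsumer):
    async def connect(self):
        await self.channel_layer.group_add("live_data", self.channel_name)
        await self.accept()

    async def disconnect(self, close_code):
        await self.channel_layer.group_discard("live_data", self.channel_name)

    async def live_data_message(self, event):
        # Called for group messages sent with type "live_data.message".
        await self.send_json(event["payload"])

# In the external script (configured against the same Redis channel layer), each
# JSON record is pushed to the group instead of being written to a database:
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

channel_layer = get_channel_layer()
async_to_sync(channel_layer.group_send)(
    "live_data",
    {"type": "live_data.message", "payload": {"price": 42.0}},
)

The HTML page then opens a WebSocket to the route that serves this consumer and appends each incoming record to the table or chart.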
I would like to create a website that shows some text but mainly dynamic data in tables and plots. Let's assume the user can choose whether to see DAX or DOW JONES prices for a specific timeframe. I guess I have to store these data in a database. As I am not experienced with creating websites, I have no idea what the most reasonable setup for this website would be.
Would it be reasonable for this example to choose a database where every row consists of 9 fields: the first column is the timestamp (let's say data for every minute), the next four columns are the high, low, open, and close prices of the DAX for this timestamp, and columns 6 to 9 are the high, low, open, and close prices of the DOW JONES?
Could this be scaled to hundreds of columns while keeping the database reasonably fast?
Is this an efficient implementation?
When the website is online, the user can choose whether to see DAX or DOW JONES prices for a specific timeframe. The corresponding data would then be fetched from the database via Python and plotted in the graph. Is this the general idea of how this would be implemented?
To get the data, could I run another Python script on the web server that continuously collects the desired data and writes it to the database?
As a total beginner with web hosting (is that even the right term?), it is very hard for me to ask precise questions. I would be happy to find out what general structure I need to create the website, the database, and the connection between the two. I was thinking about Amazon Web Services.
You could use a database, but that doesn't seem necessary for what you described.
It would be reasonable to build the database as you described. Look into SQL for doing so. You can download the XAMPP package, which will give you pretty much everything you need for that. This easily scales to hundreds of thousands of entries; that's what databases are for.
If your example of stock prices is actually what you are trying to show, however, this is completely unnecessary, as there are already plenty of databases with this data that will let you query them. What you really want in this scenario is an API. Alpha Vantage is a free service that serves stock price data and has plenty of documentation to help you set it up with Python.
I would structure the project like this:
Use the Python library Flask to set up the back end.
In addition to instantiating the Flask app, instantiate the Alpha Vantage class as well (you will need to pip install both of these).
In one of the routes you declare in Flask, use the Alpha Vantage API to get the data you need and simply display it on the screen (see the sketch below).
If you are a complete beginner, one or more of those steps may not make sense to you, in which case take them one at a time. Start by learning how to build a basic Flask app, then look at the API.
YouTube is your friend for both of these things.
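To make those steps concrete, here is a minimal sketch. The TimeSeries class and get_daily method follow the commonly used alpha_vantage pip package, but treat the exact names as assumptions and check the library's documentation:

# A tiny Flask app that fetches daily prices from Alpha Vantage and shows them.
from flask import Flask
from alpha_vantage.timeseries import TimeSeries

app = Flask(__name__)
ts = TimeSeries(key="YOUR_API_KEY", output_format="pandas")  # placeholder API key

@app.route("/prices/<symbol>")
def prices(symbol):
    # Fetch daily open/high/low/close data for the requested symbol
    # and render it as a simple HTML table.
    data, _meta = ts.get_daily(symbol=symbol)
    return data.to_html()

if __name__ == "__main__":
    app.run(debug=True)

Once something like this works, you can swap the plain HTML table for a templated page with a charting library.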
I packed some functions into a class that translates text into another language using the Bing or Google translation API, and I use Flask to render the templates, as you might expect.
During the translation I have some status and progress information, like 'completed 10%', generated while the paragraphs are translated. That information is produced inside the translation class, as you can imagine, since the class does the translation job.
In my Flask app, after calling the class to do the translation, I want an AJAX call from the web page to the Flask app to retrieve that '10%' value generated inside the class.
Here is what I did:
If I don't use a class and put all the functions in the main Flask app file, I can use a global variable to store the '10%' value, but that makes the code messy, and I want to keep all the associated functions in a class.
In the Flask app I tried using session['translation_pos'] to retrieve the value that I store in session['translation_pos'] inside the class, but that doesn't seem to work.
I use Python 3 and Flask, and I don't know how to get this progress percentage from the class, where the data is generated, into the app.
Maybe one option would be to store the number in a text file or somewhere similar and read that file in the app, but I'm sure that shouldn't be the way to handle this problem.
If anyone could advise me with some ideas, it would be much appreciated.
Thank you all.
You may want to look at a different approach to running the task, using something like Celery or Redis Queue - covered very well here in the mega tutorial.
By using one of these, you can run the task and query the runner periodically for progress to report it back to the user.
If it were me, for the data processing I would store the result in a database. When the task is complete, it gets re-queried and passed through to the UI as a template variable (or streamed from an AJAX function if it's a large data set).
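A minimal sketch of that pattern might look like the following. It assumes Redis is running locally, and names like translate_paragraphs and the /progress route are made up for illustration:

from celery import Celery
from flask import Flask, jsonify

flask_app = Flask(__name__)
celery = Celery(__name__, broker="redis://localhost:6379/0",
                backend="redis://localhost:6379/0")

@celery.task(bind=True)
def translate_paragraphs(self, paragraphs):
    translated = []
    for i, paragraph in enumerate(paragraphs, start=1):
        translated.append(paragraph)  # call your translation class here
        # Publish progress so the web process can read it from the result backend.
        self.update_state(state="PROGRESS",
                          meta={"percent": int(100 * i / len(paragraphs))})
    return translated

@flask_app.route("/translate", methods=["POST"])
def start_translation():
    task = translate_paragraphs.delay(["first paragraph", "second paragraph"])
    return jsonify({"task_id": task.id})

@flask_app.route("/progress/<task_id>")
def progress(task_id):
    # The page's ajax call polls this endpoint to display "completed N%".
    result = translate_paragraphs.AsyncResult(task_id)
    info = result.info if isinstance(result.info, dict) else {}
    return jsonify({"state": result.state, "percent": info.get("percent", 0)})

The progress percentage lives in the task's result backend rather than in a global variable, a session, or a text file, so the translation class and the Flask app stay decoupled.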
Note: scroll down to the Background section for useful details. In the following illustration, assume the project uses Python/Django and South.
What's the best way to import the following CSV
"john","doe","savings","personal"
"john","doe","savings","business"
"john","doe","checking","personal"
"john","doe","checking","business"
"jemma","donut","checking","personal"
Into a PostgreSQL database with the related tables Person, Account, and AccountType considering:
Admin users can change the database model and CSV import-representation in real-time via a custom UI
The saved CSV-to-Database table/field mappings are used when regular users import CSV files
So far, two approaches have been considered:
ETL-API approach: provide an ETL API with a spreadsheet, my CSV-to-database table/field mappings, and connection info for the target database. The API would then load the spreadsheet and populate the target database tables. Looking at pygrametl, I don't think what I'm aiming for is possible. In fact, I'm not sure any ETL APIs do this.
Row-level insert approach: parse the CSV-to-database table/field mappings, parse the spreadsheet, and generate SQL inserts in "join order".
I implemented the second approach but am struggling with algorithm defects and code complexity. Is there a Python ETL API out there that does what I want? Or an approach that doesn't involve reinventing the wheel?
Background
The company I work at wants to move hundreds of project-specific design spreadsheets hosted in SharePoint into databases. We're close to completing a web application that meets this need by allowing an administrator to define/model a database for each project, store spreadsheets in it, and define the browse experience. At this stage of completion, transitioning to a commercial tool isn't an option. Think of the web application as a django-admin alternative (though it isn't one) with a DB modeling UI, CSV import/export functionality, customizable browsing, and modularized code to address project-specific customizations.
The implemented CSV import interface is cumbersome and buggy, so I'm trying to get feedback and find alternative approaches.
How about splitting the problem into two separate problems?
Create a Person class which represents a person in the database. This could use Django's ORM, or extend it, or you could do it yourself.
Now you have two issues:
Create a Person instance from a row in the CSV.
Save a Person instance to the database.
Now, instead of just CSV-to-Database, you have CSV-to-Person and Person-to-Database. I think this is conceptually cleaner. When the admins change the schema, that changes the Person-to-Database side. When the admins change the CSV format, that changes the CSV-to-Person side. Now you can deal with each separately.
Does that help any?
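To make the split concrete, a rough sketch might look like this (the column order and field names are placeholders taken from the sample rows, and save() is left for the real ORM code):

import csv

class Person:
    def __init__(self, first_name, last_name, account, account_type):
        self.first_name = first_name
        self.last_name = last_name
        self.account = account
        self.account_type = account_type

    @classmethod
    def from_csv_row(cls, row):
        # CSV-to-Person: only this method knows the column layout of the file,
        # so an admin changing the CSV format only touches this mapping.
        return cls(first_name=row[0], last_name=row[1],
                   account=row[2], account_type=row[3])

    def save(self):
        # Person-to-Database: only this method knows the table/field mapping,
        # e.g. through the Django ORM; schema changes only touch this side.
        pass  # e.g. Person/Account/AccountType get_or_create calls in the real code

with open("accounts.csv", newline="") as fh:
    for row in csv.reader(fh):
        Person.from_csv_row(row).save()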
I write import subsystems almost every month at work, and because I do that kind of task so much, I wrote django-data-importer some time ago. This importer works like a Django form and has readers for CSV, XLS and XLSX files that give you lists of dicts.
With the data_importer readers you can read a file into a list of dicts, iterate over it with a for loop, and save the lines to the DB.
With the importer you can do the same, but with the bonus of validating each field of a line, logging errors and actions, and saving everything at the end.
Please take a look at https://github.com/chronossc/django-data-importer. I'm pretty sure it will solve your problem and help you process any kind of CSV file from now on :)
To solve your problem I suggest using data-importer with Celery tasks. You upload the file and fire the import task via a simple interface. The Celery task sends the file to the importer, and you can validate lines, save them, and log errors. With some effort you can even show the task's progress to the users who uploaded the sheet.
I ended up taking a few steps back to address this problem, per Occam's razor, using updatable SQL views. It meant a few sacrifices:
Removing the South/DB-dependent real-time schema administration API, dynamic model loading, and dynamic ORM syncing
Defining models.py and an initial South migration by hand.
This allows for a simple approach to importing flat datasets (CSV/Excel) into a normalized database:
Define unmanaged models in models.py for each spreadsheet layout
Map those to updatable SQL views (using INSERT/UPDATE INSTEAD SQL rules) in the initial South migration, matching the spreadsheet field layout
Iterate through the CSV/Excel spreadsheet rows and perform an INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>); for each one (a sketch follows below)
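As a rough sketch of what one spreadsheet ends up looking like (the model, view, and column names below are placeholders, not the real project's):

from django.db import connection, models

class AccountImportRow(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)
    account = models.CharField(max_length=100)
    account_type = models.CharField(max_length=100)

    class Meta:
        managed = False                   # Django never creates or alters this
        db_table = "account_import_view"  # the updatable SQL view

def import_row(row):
    # INSERT INTO the view; the INSTEAD rules defined in the migration fan the
    # flat row out into the normalized Person/Account/AccountType tables.
    with connection.cursor() as cursor:
        cursor.execute(
            "INSERT INTO account_import_view "
            "(first_name, last_name, account, account_type) "
            "VALUES (%s, %s, %s, %s)",
            row,
        )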
Here is another approach that I found on GitHub. Basically, it detects the schema and allows overrides. Its whole goal is just to generate raw SQL to be executed by psql and/or whatever driver.
https://github.com/nmccready/csv2psql
% python setup.py install
% csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
% psql -f enrolled.sql
There are also a bunch of options for performing ALTERs (e.g. creating primary keys from several existing columns) and for merges/dumps.