I'm looking for a way to access a csv file's cells in a random fashion. If I use Python's csv module, I can only iterate through all lines, which is rather slow. I should also add that the file is pretty large (>100MB) and that I'm looking for short response times.
I could preprocess the file into a different data format for faster row/column access. Perhaps someone has done this before and can share some experiences.
Background:
I'd like to show an extract of the csv on screen, served by a web server (which extract depends on the scroll position). Keeping the file in memory is not an option.
I have found SQLite good for this sort of thing. It is easy to set up and stores the data locally, and you get finer control over what you select than with csv files, plus the ability to add indexes etc.
There is also a built-in facility for loading csv files into a table: http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles.
Let me know if you want any further details on the SQLite route i.e. how to create the table, load the data in or query it from Python.
SQLite instructions to load a .csv file into a table
To create a database file you can just add the required filename as an argument when opening SQLite. Navigate to the directory containing the csv file from the command line (I am assuming here that you want the SQLite .db file to live in the same directory). If using Windows, add SQLite to your PATH environment variable if not already done (instructions here if you need them), then open SQLite with the name you want to give your database file as an argument, e.g.:
sqlite3 example.db
Check the database file has been created by entering:
.databases
Create a table to hold the data. I am using an example for a simple customer table here. If data types are inconsistent for any columns, use text:
create table customers (ID integer, Title text, Forename text, Surname text, Postcode text, Addr_Line1 text, Addr_Line2 text, Town text, County text, Home_Phone text, Mobile text, Comments text);
Specify the separator to be used:
.separator ","
Issue the command to import the data; the syntax takes the form .import filename.ext table_name e.g.:
.import cust.csv customers
Check that the data has loaded in:
select count(*) from customers;
Add an index for columns that you are likely to filter on (full syntax described here) e.g.:
create index cust_surname on customers(surname);
You should now have fast access to the data when filtering on any of the indexed columns. To leave SQLite use .exit; to get a list of other helpful non-SQL commands use .help.
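Once the data is loaded and indexed, querying it from Python for just the window you need on screen is straightforward. A minimal sketch, assuming the example.db file and customers table above; the page size is an arbitrary assumption:

import sqlite3

def fetch_page(db_path, offset, limit=50):
    # Pull one window of rows for display; LIMIT/OFFSET maps naturally
    # to a scroll position without reading the rest of the data.
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT ID, Forename, Surname, Town FROM customers "
            "ORDER BY ID LIMIT ? OFFSET ?",
            (limit, offset),
        )
        return cur.fetchall()
    finally:
        conn.close()

print(fetch_page("example.db", offset=1000))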
Python Alternative
Alternatively, if you want to stick with pure Python and pre-process the file, you could load the data into a dictionary, which allows much faster access: dictionary keys behave like an index, so you can get to the values associated with a key quickly without going through the records one by one. I would need further details of your input data and which fields the lookups are based on to say exactly how to implement this, but a rough sketch is shown after the next paragraph.
However, unless you know in advance when the data will be required (so that you can pre-process the file before the request for data), you would still have the overhead of loading the file from disk into memory every time you run this. Depending on your exact usage, this may make the database solution more appropriate.
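A minimal sketch of the dictionary approach, assuming the lookups are keyed on an ID column (swap in whichever column you actually query by):

import csv

def build_index(path, key_field="ID"):
    # Read the whole csv once and index every row by the key column.
    index = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            index[row[key_field]] = row
    return index

rows_by_id = build_index("cust.csv")
print(rows_by_id["42"])   # constant-time lookup, no scanning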
Related
I would like to store CSV files in SQL Server. I've created a table with a column "myDoc" as varbinary(max). I generate the CSVs on a server using Python/Django. I would like to insert the actual CSV (not the path) as a BLOB object so that I can later retrieve the actual CSV file.
How do I do this? I haven't been able to make much headway with this documentation, as it mostly refers to .jpg files:
https://msdn.microsoft.com/en-us/library/a1904w6t(VS.80).aspx
Edit:
I wanted to add that I'm trying to avoid FILESTREAM. The CSVs are too small (5 KB) and I don't need full-text search over them.
Not sure why you want varbinary over varchar, but it will work either way
Insert Into YourTable (myDoc)
Select doc = BulkColumn FROM OPENROWSET(BULK 'C:\Working\SomeXMLFile.csv', SINGLE_BLOB) x
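Since you are generating the CSVs in Python/Django anyway, you may find it easier to insert the bytes directly with a parameterized query rather than OPENROWSET (which reads the file from the SQL Server machine's filesystem). A rough sketch using pyodbc; the connection string follows a common pattern and the table/column names match the example above, so adjust as needed:

import pyodbc

def store_csv_blob(conn_str, csv_bytes):
    # Parameterized insert of the raw CSV bytes into the varbinary(max) column.
    conn = pyodbc.connect(conn_str)
    try:
        conn.cursor().execute(
            "INSERT INTO YourTable (myDoc) VALUES (?)", pyodbc.Binary(csv_bytes)
        )
        conn.commit()
    finally:
        conn.close()

store_csv_blob(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;",
    b"ID,NAME,AGE\n1,John,23\n",
)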
I have a dump that was made from a PostgreSQL database. I want to check for some information in that dump, specifically checking if there are entries in a certain table with certain values in certain fields.
This is for a Python program that should run automatically on many different inputs on customer machines, so I need a programmatic solution, not manually opening the file and looking for where that table is defined. I could restore the dump to a database and then delete it, but I'm worried that this operation is heavy or that it has side effects. I want there to be no side effects from my query; I just want to do the check without it affecting anything in my system.
Is that possible in any way? Preferably in Python?
Any dump format: restore and query
The most practical thing to do is restore the dump to a temporary PostgreSQL database, then query that database. It's by far the simplest option. If you have a non-superuser with createdb rights you can do this pretty trivially and safely with pg_restore.
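A rough sketch of that flow, assuming createdb, dropdb and pg_restore are on the PATH, psycopg2 is installed, and the current user may create databases; the table, column and value being checked are placeholders:

import subprocess
import psycopg2

TEMP_DB = "dump_check_tmp"   # scratch database, dropped again afterwards

subprocess.run(["createdb", TEMP_DB], check=True)
try:
    subprocess.run(["pg_restore", "-d", TEMP_DB, "out.dump"], check=True)
    conn = psycopg2.connect(dbname=TEMP_DB)
    try:
        cur = conn.cursor()
        cur.execute("SELECT count(*) FROM address WHERE town = %s", ("Springfield",))
        print(cur.fetchone()[0])
    finally:
        conn.close()
finally:
    subprocess.run(["dropdb", TEMP_DB], check=True)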
SQL-format
If it's a plain-text (.sql) format dump and you're desperate, and you know the dump was not created with the --inserts or --column-inserts options and you don't use the same table name in multiple schemas, you could just search for the text
COPY tablename (
at the start of a line, then read the COPY-format data (see below) until you find \. at the start of a line.
If you do use the same table name in different schemas you have to parse the dump to find the SET search_path entry for the schema you want, then start looking for the desired table COPY statement.
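A rough sketch of that scan in Python, under the same assumptions (plain-text dump, no --inserts/--column-inserts, table name unique across schemas). The file and table names are placeholders; note that depending on your pg_dump version the table name may be schema-qualified (e.g. COPY public.address), and that this ignores COPY's escaping rules, so it is only a first approximation:

def copy_rows(dump_path, table):
    # Scan the dump for the table's COPY block and return its rows
    # as lists of raw (still escaped) column values.
    rows = []
    in_copy = False
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            if not in_copy:
                in_copy = line.startswith("COPY %s (" % table)
            elif line.startswith("\\."):    # end-of-data marker
                break
            else:
                rows.append(line.rstrip("\n").split("\t"))
    return rows

for row in copy_rows("out.sql", "address"):
    print(row)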
Custom-format
However, if the dump is in the PostgreSQL custom format, which you should always prefer and request by using -Fc with pg_dump, it is really an archive with its own header and table of contents. You can either parse that format yourself and extract what you need, or, far more simply, use pg_restore to list its contents and then extract individual tables.
For this task I'd do the latter. To list tables in the dump:
pg_restore --list out.dump
To dump a particular table as tab-separated COPY format by qualified name, e.g. table address in schema public:
pg_restore -n public -t address out.dump
The output has a bunch of stuff at the start that you can't get pg_restore to skip, but your script can just look for the word COPY (uppercase) at the start of a line and start reading on the next line, until it reaches a \. at the end of a line. For details on the format see the PostgreSQL manual on COPY.
Of course you need the pg_restore binary for this.
Make sure there is no PGDATABASE environment variable set when you invoke pg_restore. Otherwise it'll restore to a DB instead of printing output to stdout.
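A sketch of driving that from Python, under the same assumptions as above (pg_restore on the PATH, a custom-format dump named out.dump, table address in schema public):

import os
import subprocess

env = dict(os.environ)
env.pop("PGDATABASE", None)   # ensure pg_restore writes to stdout, per the note above

proc = subprocess.run(
    ["pg_restore", "-n", "public", "-t", "address", "out.dump"],
    capture_output=True, text=True, check=True, env=env,
)

rows = []
in_copy = False
for line in proc.stdout.splitlines():
    if not in_copy:
        in_copy = line.startswith("COPY ")
    elif line == "\\.":
        break
    else:
        rows.append(line.split("\t"))

print(len(rows), "rows in the dump for that table")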
Dump the database to a CSV file (or a CSV file for each table) and then you can load and query them using pandas.
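A short sketch of that route, assuming you have access to the live database (or a restored copy) and psql on the PATH; the database, table and column names are placeholders:

import subprocess
import pandas as pd

# Export one table to CSV with psql's \copy, then query it with pandas.
subprocess.run(
    ["psql", "-d", "mydb", "-c", r"\copy address TO 'address.csv' CSV HEADER"],
    check=True,
)

df = pd.read_csv("address.csv")
print(df[df["town"] == "Springfield"])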
You could convert your dump to an INSERT INTO style dump with this little tool I've written:
https://github.com/freddez/pg-dump2insert
It will be easier to grep specific table data in this form.
I am new to Django and I have been scouting the site to find good ways to store images generated using Python (there seem to be contradicting views on saving to a local folder, file or database). My website's sole purpose is to make QR codes all the time. On my local machine I would save it in my program folder like:
import pyqrcode
qr = pyqrcode.create("{'name': 'myqr'}")
qr.png("horn.png", scale=6)
print("All done!")
I am running into a situation where I will be dealing with many users doing the same thing. I doubt saving to a local folder would be a viable option. I am set on saving these images as BLOBs in MySQL.
Has anyone done this sort of thing? If so, what was the best way to implement it? Example code would also be helpful.
Storing images in the database is somewhat of an edge case. To relieve the database, I would not store them in the database but in a folder on your server.
I understand your question to mean that some users might create the same QR codes. In that case, I would create a database table like this:
CREATE TABLE qrcodes (
value VARCHAR(255) PRIMARY KEY,
fname TEXT
);
With this table you can find the QR codes by the encoded value. If an entry exists for a given value, then you can just return the contents of the file whose name is stored.
That leaves one question: how to create the file names. There are many possibilities. One would be to create a UUID and make a filename from it. Another option would be to use a global counter and give every file a new number. The counter must be stored in a database table, of course.
Of course you can still store the image in the database if you want. Just don't use the fname field, but a BLOB field that stores the content. That solution should work on most databases, but is likely to be slower than the file approach when you have really large amounts of data.
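A rough sketch of that lookup-or-create flow, assuming the qrcodes table above, a qr_images/ folder on the server, the UUID naming scheme, and a MySQL connection via the mysql-connector-python package (all of which you can swap out):

import os
import uuid
import pyqrcode
import mysql.connector

def get_or_create_qr(conn, value, folder="qr_images"):
    # Reuse the existing image if this value has been encoded before,
    # otherwise render it once and remember the file name.
    cur = conn.cursor()
    cur.execute("SELECT fname FROM qrcodes WHERE value = %s", (value,))
    row = cur.fetchone()
    if row:
        return os.path.join(folder, row[0])

    fname = uuid.uuid4().hex + ".png"       # collision-free file name
    pyqrcode.create(value).png(os.path.join(folder, fname), scale=6)
    cur.execute("INSERT INTO qrcodes (value, fname) VALUES (%s, %s)", (value, fname))
    conn.commit()
    return os.path.join(folder, fname)

conn = mysql.connector.connect(user="app", database="mydb")
print(get_or_create_qr(conn, "{'name': 'myqr'}"))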
I have a text file (CSV) which acts as a database for my application formatted as follows:
ID(INT),NAME(STRING),AGE(INT)
1,John,23
2,Paul,34
3,Jack,12
Before you ask, I cannot get away from a CSV text file (imposed) but I can remove/change the first row (header) into another format or into another file all together (I added it to keep track of the schema).
When I start my application I want to read all the data into memory so I can then query it, change it and so on. I need to extract:
- Schema (column names and types)
- Data
I am trying to figure out what would be the best way to store this in memory using Python (very new to the language and its constructs) - any suggestions/recommendations?
Thanks,
If you use a Pandas DataFrame you can query it as if it were an SQL table, and read it directly from CSV and write it back out as well. I think this is the best option for you. It's fast, and it builds on solid, proven technologies.
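A short sketch of that, assuming the header row stays in the form shown above (name plus type in parentheses) and the file is called data.csv; the type annotations are stripped from the column names after loading:

import pandas as pd

df = pd.read_csv("data.csv")
df.columns = [c.split("(")[0] for c in df.columns]   # ID, NAME, AGE

# Query it
print(df[df["AGE"] > 18])

# Change it and write it back out
# (note this writes plain column names, without the (INT)/(STRING) annotations)
df.loc[df["NAME"] == "Jack", "AGE"] = 13
df.to_csv("data.csv", index=False)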
I recently created a script that parses several web proxy logs into a tidy sqlite3 db file that is working great for me... with one snag: the file size. I have been pressed to use this format (a sqlite3 db) and Python handles it natively like a champ, so my question is this... what is the best form of string compression that I can use for db entries when file size is the sole concern? zlib? base-n? Klingon?
Any advice would help me loads, again just string compression for characters that are compliant for URLs.
Here is a page with an SQLite extension to provide compression.
This extension provides a function that can be called on individual fields.
Here is some of the example text from the page
create a test table:
sqlite> create table test(name varchar(20), surname varchar(20));
insert some text into the test table by compressing it (you can also compress binary content and insert it into a blob field):
sqlite> insert into test values(mycompress('This is a sample text'), mycompress('This is a sample text'));
this shows nothing because our data is in binary format and compressed:
sqlite> select * from test;
the following works, it uncompresses the data:
sqlite> select myuncompress(name), myuncompress(surname) from test;
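If you would rather not build an extension, you can get the same effect in pure Python by running zlib over the fields before insert and after select. A small sketch; the file and table names are just examples:

import sqlite3
import zlib

conn = sqlite3.connect("proxy_logs.db")
conn.execute("CREATE TABLE IF NOT EXISTS test (name BLOB, surname BLOB)")

def compress(text):
    return zlib.compress(text.encode("utf-8"))

def uncompress(blob):
    return zlib.decompress(blob).decode("utf-8")

conn.execute("INSERT INTO test VALUES (?, ?)",
             (compress("This is a sample text"), compress("This is a sample text")))
conn.commit()

for name, surname in conn.execute("SELECT name, surname FROM test"):
    print(uncompress(name), uncompress(surname))

Bear in mind that zlib gains little on very short strings, which ties in with the point below about most of the duplication living between entries rather than within them.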
What sort of parsing do you do before you put it in the database? I get the impression that it is fairly simple, with a single table holding each entry - if not then my apologies.
Compression is all about removing duplication, and in a log file most of the duplication is between entries rather than within each entry so compressing each entry individually is not going to be a huge win.
This is off the top of my head so feel free to shoot it down in flames, but I would consider breaking the table into a set of smaller tables holding the individual parts of the entry. A log entry would then mostly consist of a timestamp (as DATE type rather than a string) plus a set of indexes into other tables (e.g. requesting IP, request type, requested URL, browser type etc.)
This would have a trade-off of course, since it would make the database a lot more complex to maintain, but on the other hand it would enable meaningful queries such as "show me all the unique IPs that requested page X in the last week".
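A rough sketch of that idea in sqlite3: small lookup tables for the repetitive parts of an entry, and a log table that stores only the timestamp plus integer keys. The table and column names are just illustrative:

import sqlite3

conn = sqlite3.connect("proxy_logs.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS ips  (id INTEGER PRIMARY KEY, ip  TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS urls (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS log  (
    ts     TEXT,                        -- timestamp of the request
    ip_id  INTEGER REFERENCES ips(id),
    url_id INTEGER REFERENCES urls(id)
);
""")

def lookup_id(table, column, value):
    # Store each distinct value once and always reuse its small integer key.
    conn.execute("INSERT OR IGNORE INTO %s (%s) VALUES (?)" % (table, column), (value,))
    return conn.execute("SELECT id FROM %s WHERE %s = ?" % (table, column),
                        (value,)).fetchone()[0]

conn.execute("INSERT INTO log VALUES (?, ?, ?)",
             ("2016-01-01 12:00:00",
              lookup_id("ips", "ip", "10.0.0.1"),
              lookup_id("urls", "url", "http://example.com/index.html")))
conn.commit()

# e.g. "show me all the unique IPs that requested page X"
for (ip,) in conn.execute("""
        SELECT DISTINCT ips.ip FROM log
        JOIN ips  ON ips.id  = log.ip_id
        JOIN urls ON urls.id = log.url_id
        WHERE urls.url = ?""", ("http://example.com/index.html",)):
    print(ip)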
Instead of inserting compression/decompression code into your program, you could store the table itself on a compressed drive.