Best way to store a csv database in memory? - python

I have a text file (CSV) which acts as a database for my application formatted as follows:
ID(INT),NAME(STRING),AGE(INT)
1,John,23
2,Paul,34
3,Jack,12
Before you ask, I cannot get away from a CSV text file (imposed) but I can remove/change the first row (header) into another format or into another file all together (I added it to keep track of the schema).
When I start my application I want to read all the data into memory so I can then query it, change it, and so on. I need to extract:
- Schema (column names and types)
- Data
I am trying to figure out what would be the best way to store this in memory using Python (I'm very new to the language and its constructs) - any suggestions/recommendations?
Thanks,

If you use a Pandas DataFrame you can query it much as you would an SQL table, read it directly from CSV, and write it back out as well. I think this is the best option for you: it is fast and builds on solid, proven technology.
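A minimal sketch of that approach (assuming the header row is reduced to plain column names; the file name and dtypes are examples you would adapt):

import pandas as pd

# Read the CSV; pandas infers the schema, or you can pin it with dtype
df = pd.read_csv("people.csv", dtype={"ID": int, "NAME": str, "AGE": int})

# "Schema": column names and types
print(df.dtypes)

# Query it much like a table
print(df[df["AGE"] >= 18])

# Modify a value and write the whole file back out
df.loc[df["NAME"] == "Jack", "AGE"] = 13
df.to_csv("people.csv", index=False)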

Related

Python: Local Database for storing and retrieving historical data

I'm trying to create some weather models and I want to store and retrieve data on my hard drive.
Data is in this format:
{'Date_Time':'2020-07-18 18:16:17','Temp':29.0, 'Humidity':45.3}
{'Date_Time':'2020-07-18 18:18:17','Temp':28.9, 'Humidity':45.4}
{'Date_Time':'2020-07-18 18:20:17 ','Temp':28.8, 'Humidity':48.3}
I have new data coming in every day, I have old data from ~5 years ago.
I would like to periodically merge the data sets and create one large data set to manipulate.
Things I need:
1. Check if the date-time pair already exists, else add new data
2. Change old data values
3. Add new data values to the database
4. Must be on a local storage, I have plenty of space.
Things I would like but do not need:
1. Fastest Read access possible, not so concerned about storage time as that happens in the background mostly.
2. Something that makes searching for all data from today, last 7 days etc easy to retrieve
Things I have tried:
Appending to a json file
Works for now but is slow because I have to load the entire data set every time I want to append/modify
Appending to a text file
Easy to store, but hard to modify/check values
SQLite3
I looked into this and it seemed workable, just wanted to know if there was something better before I just go ahead and do this.
Thank you for your help!
Not sure whether it's "better" but json_database seems to do what you're looking for:
- save and load from file
- search recursively by key and key/value pairs
- fuzzy search
- supports arbitrary objects
The selection of JSON vs TXT vs SQL or NoSQL DB would be based on your current and future requirements.
From your inputs, you have data for the last 5 years, and the data in the example arrives every two minutes. Based on this, it seems like you will have a large dataset or will need to prune it frequently. For large datasets, using a SQL or NoSQL DB would be ideal so that you do not load all the data into memory for every read/write operation.
Using the date-time as your primary key, you would be able to read-write pretty quickly using a database.
Using SQLite is a good start, but if your data is going to grow, you should plan to move to an external SQL/NoSQL database.
Seeing that your data is mostly time-based, it would also be worth evaluating a time-series database like InfluxDB or Graphite.
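As a rough sketch of the SQLite route (file, table, and column names are placeholders, and the upsert syntax assumes SQLite 3.24 or newer):

import sqlite3

conn = sqlite3.connect("weather.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        Date_Time TEXT PRIMARY KEY,
        Temp REAL,
        Humidity REAL
    )
""")

row = {'Date_Time': '2020-07-18 18:16:17', 'Temp': 29.0, 'Humidity': 45.3}

# Insert if the timestamp is new, otherwise update the existing values (your points 1-3)
conn.execute(
    "INSERT INTO readings VALUES (:Date_Time, :Temp, :Humidity) "
    "ON CONFLICT(Date_Time) DO UPDATE SET Temp = excluded.Temp, Humidity = excluded.Humidity",
    row,
)
conn.commit()

# "All data from the last 7 days" becomes a simple range query
recent = conn.execute(
    "SELECT * FROM readings WHERE Date_Time >= datetime('now', '-7 days')"
).fetchall()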

CSV vs JSON vs DB - which is fastest and scalable to load in the memory and retrieve data

I have a large 1.5Gb data file with multiple fields separated by tabs.
I need to do lookups in this file from a web interface (ajax queries, like an API), possibly with a large number of ajax requests coming in each second, so responses need to be fast.
What is the fastest option for retrieving this data? Is there performance-tested info, benchmarking?
The tab-separated file is a flat file that would be loaded into memory, but it cannot carry an index. JSON is more verbose because each record repeats the field names, but an 'indexed' JSON can be created by grouping entries under a certain field.
Neither. They are both horrible for your stated purpose. JSON cannot be partially loaded; TSV can be scanned without loading it into memory, but only with sequential access. Use a proper database.
If, for some reason, you can't use a database, you can MacGyver it by using TSV or JSONL (not JSON) with an additional index file that records the byte position of the start of the record for each ID (or another searchable field).
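A sketch of that index-file idea, assuming a JSONL data file where every record carries an "id" field (all names here are illustrative):

import json

def build_index(data_path, key="id"):
    # Map each record's key to the byte offset where its line starts
    index = {}
    with open(data_path, "rb") as f:
        offset = f.tell()
        for raw in iter(f.readline, b""):
            index[str(json.loads(raw)[key])] = offset
            offset = f.tell()
    return index

def lookup(data_path, index, record_id):
    # Seek straight to the record instead of scanning the whole file
    with open(data_path, "rb") as f:
        f.seek(index[str(record_id)])
        return json.loads(f.readline())

index = build_index("data.jsonl")      # build once; persist it to a file if you like
print(lookup("data.jsonl", index, "42"))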

Insert CSV file into SQL Server (not import!)

I would like to store CSV files in SQL Server. I've created a table with column "myDoc" as varbinary(max). I generate the CSV's on a server using Python/Django. I would like to insert the actual CSV (not the path) as a BLOB object so that I can later retrieve the actual CSV file.
How do I do this? I haven't been able to make much headway with this documentation, as it mostly refers to .jpg's
https://msdn.microsoft.com/en-us/library/a1904w6t(VS.80).aspx
Edit:
I wanted to add that I'm trying to avoid filestream. The CSVs are too small (5kb) and I don't need text search over them.
Not sure why you want varbinary over varchar, but it will work either way
Insert Into YourTable (myDoc)
Select doc = BulkColumn FROM OPENROWSET(BULK 'C:\Working\SomeXMLFile.csv', SINGLE_BLOB) x
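If you are generating the CSV in Python/Django anyway, you can also skip the file-on-the-server step and insert the bytes directly with a parameterised query. A rough sketch using pyodbc (connection string, table and column names are placeholders):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret"
)

csv_text = "id,name\n1,John\n"          # stand-in for the CSV your Django code generates
csv_bytes = csv_text.encode("utf-8")

# pyodbc sends Python bytes as varbinary, so the blob round-trips unchanged
conn.cursor().execute("INSERT INTO YourTable (myDoc) VALUES (?)", csv_bytes)
conn.commit()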

How to import data from LibreOffice Calc to a SQL database?

What's the best way to switch to a database management software from LibreOffice Calc?
I would like to move everything from a master spreadsheet to a database with certain conditions. Is it possible to write a script in Python that would do all of this for me?
The data I have is well structured I have about 300 columns of assets and under every asset there is 0 - ~50 filenames. The asset names are uniform as well as the filenames.
Thank you all!
You can of course use Python for this task, but it might be overkill.
The CSV export / import sequence is likely much faster, less error-prone and needs less ongoing maintenance (e.g. if you change the spreadsheet columns). The sequence is roughly as follows:
select the sheet that you want to import into a DB
select File / Save As... and then Text CSV
select a column separator that will not interfere with your data (e.g. |)
The import sequence into a database depends on your choice of db, but today many IDEs and database GUI environments will automatically import / introspect your CSV file and create the table / insert the data for you. Things to double-check:
You may have to indicate that the first row is a header
The assigned datatype may need fine tuning if the automated guesses are not optimal
You can create a Python script that reads this spreadsheet row by row and then runs insert statements against a database. In fact, it would be even easier to save the spreadsheet as CSV first, if you only need the data in it.
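A minimal sketch of such a script, assuming the sheet was exported to CSV and already reshaped to two columns (asset, filename); every name here is a placeholder for your own layout:

import csv
import sqlite3

conn = sqlite3.connect("assets.db")
conn.execute("CREATE TABLE IF NOT EXISTS assets (asset TEXT, filename TEXT)")

with open("assets.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="|")   # match the separator chosen on export
    next(reader)                            # skip the header row
    conn.executemany("INSERT INTO assets (asset, filename) VALUES (?, ?)", reader)

conn.commit()
conn.close()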

Random access csv file content

I'm looking for a way to access a csv file's cells in a random fashion. If I use Python's csv module, I can only iterate through all lines, which is rather slow. I should also add that the file is pretty large (>100MB) and that I'm looking for a short response time.
I could preprocess the file into a different data format for faster row/column access. Perhaps someone has done this before and can share some experiences.
Background:
I'd like to show an extract of the csv on screen provided by a web server (depending on scroll position). Keeping the file in memory is not an option.
I have found SQLite good for this sort of thing. It is easy to set up and you can store the data locally, but you also get easier control over what you select than csv files and you get the facility to add indexes etc.
There is also a built-in facility for loading csv files into a table: http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles.
Let me know if you want any further details on the SQLite route i.e. how to create the table, load the data in or query it from Python.
SQLite Instructions to load .csv file to table
To create a database file you can just add the required filename as an argument when opening SQLite. Navigate to the directory containing the csv file from the command line (I am assuming here that you want the SQLite .db file to live in the same dir). If using Windows, add SQLite to your PATH environment variable if not already done (instructions here if you need them), then open SQLite with an argument for the name that you want to give your database file, e.g.:
sqlite3 example.db
Check the database file has been created by entering:
.databases
Create a table to hold the data. I am using an example for a simple customer table here. If data types are inconsistent for any columns use text:
create table customers (ID integer, Title text, Forename text, Surname text, Postcode text, Addr_Line1 text, Addr_Line2 text, Town text, County text, Home_Phone text, Mobile text, Comments text);
Specify the separator to be used:
.separator ","
Issue the command to import the data; the syntax takes the form .import filename.ext table_name e.g.:
.import cust.csv customers
Check that the data has loaded in:
select count(*) from customers;
Add an index for columns that you are likely to filter on (full syntax described here) e.g.:
create index cust_surname on customers(surname);
You should now have fast access to the data when filtering on any of the indexed columns. To leave SQLite use .exit; to get a list of other helpful non-SQL commands use .help.
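For the querying-from-Python part, the standard sqlite3 module is enough; a short sketch reusing the example.db and customers names from above:

import sqlite3

conn = sqlite3.connect("example.db")
rows = conn.execute(
    "SELECT ID, Forename, Surname FROM customers WHERE Surname = ?",
    ("Smith",),
).fetchall()
for row in rows:
    print(row)
conn.close()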
Python Alternative
Alternatively, if you want to stick with pure Python and pre-process the file, you could load the data into a dictionary, which would allow much faster access: the dictionary keys behave like an index, meaning you can get to the values associated with a key quickly without going through the records one by one. I would need more details about your input data and which fields the lookups would be based on to say how to implement this.
However, unless you will know in advance when the data will be required (to be able to pre-process the file before the request for data) then you would still have the overhead of loading the file from disk into memory every time you run this. Depending on your exact usage this may make the database solution more appropriate.
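If you do go the dictionary route, a sketch of the idea (assuming the first CSV column is a unique ID that the lookups are keyed on; file name is an example):

import csv

def load_as_dict(path):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Key each row by its first column for O(1) lookups
        return header, {row[0]: row for row in reader}

header, rows_by_id = load_as_dict("big.csv")
print(rows_by_id.get("12345"))   # direct lookup instead of scanning every line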
