Complete beginner trying to create a flat file database in Python

Trying to keep it stupidly simple. Is it a bad idea to move a txt file into and out of a Python list? The txt files will probably get to about 2-5k entries. What is the preferred method to create a simple flat file database?

It might or might not be a bad idea. It depends on what you are trying to achieve, how much memory you have, and how big those lines are on average. It also depends on what you are doing with that data; maybe it is worth reading and processing the file line by line? In any case, a database implies indexes: what are you going to do with a list of strings without an index? You cannot search it efficiently, for example.
If you feel like you need a database, take a look at SQLite. It is a small embedded SQL database engine written in C with a Python interface (the standard sqlite3 module). It is stable and proven to work; for example, it is used on the iPhone in tons of applications.
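To give a sense of what that looks like, here is a minimal sqlite3 sketch; the database file, table, and column names are just illustrative:
import sqlite3

conn = sqlite3.connect("mydata.db")  # illustrative file name
conn.execute("CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany("INSERT INTO entries (text) VALUES (?)",
                 [("first entry",), ("second entry",)])
conn.commit()

# Indexed lookup instead of scanning a plain list of strings
for row in conn.execute("SELECT text FROM entries WHERE id = ?", (1,)):
    print(row[0])
conn.close()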

If you're looking for a very simple file database, maybe you should look at the shelve module. Example usage:
import shelve
with shelve.open("myfile") as mydb:
mydb["0"] = "first value"
mydb["1"] = "second value"
# ...
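Reading the values back on a later run is symmetric; the keys here follow the example above:
import shelve

with shelve.open("myfile") as mydb:
    print(mydb["0"])   # -> "first value"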

Related

discord.py: too big variable?

I'm very new to Python and programming in general, and I'm looking to make a Discord bot that has a lot of hand-written chat lines to randomly pick from and send back to the user. Making a really huge variable holding a list of sentences seems like a bad idea. Is there a way I can store the chat lines in a separate file and have the bot pick from the lines in that file? Or is there anything else that would be better, and how would I do it?
I'll interpret this question as "how large a variable is too large", to which the answer is pretty simple. A variable is too large when it becomes a problem. So, how can a variable become a problem? The big one is that the machine could run out of memory, and an OOM killer (out-of-memory killer) or similar will stop your program. How would you know if your variable is causing these issues? Pretty simple: your program crashes.
If the variable is static (with a size fully known at compile time or prior to interpretation), you can calculate how much RAM it will take. (This is a bit finicky with Python, so it might be easier to load it up at runtime and figure it out with a profiler.) If it's more than ~500 megabytes, you should be concerned. Over a gigabyte, and you'll probably want to reconsider your approach[^0]. So, what do you do then?
As suggested by @FishballNooodles, you can store your data line by line in a file and read the lines into an array. Unfortunately, the code they've provided still reads the entire thing into memory. If you want to avoid that, you've got a few options, non-exhaustively listed below.
Consume a random number of newlines from the file when you need a line of text. You would look at one character at a time, compare it to \n, and read the line if you've encountered the requested number of newlines. This is O(n) worst case with respect to the number of lines in the file.
Rather than storing the text you need at a given index, store its location in the file. Then you can seek to that location (which is essentially O(1)) and read the text. This requires an O(n) construction cost at the start of the program, but works much better at runtime; a sketch of this approach appears below.
Use an actual database. It's usually better not to reinvent the wheel. If you're just storing plain text, this is probably overkill, but don't discount it.
[^0]: These numbers are actually just random. If you control the server environment on which you run the code, then you can probably come up with some more precise signposts.
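Here is a minimal sketch of the offset-index idea from the second option above; the file name is illustrative and the index is rebuilt on every start:
import random

def build_index(path):
    # Record the byte offset at which each line starts; O(n) once at startup.
    offsets = []
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            if not f.readline():
                break
            offsets.append(offset)
    return offsets

def random_line(path, offsets):
    # Seek straight to the start of a random line and read just that one line.
    with open(path, "rb") as f:
        f.seek(random.choice(offsets))
        return f.readline().decode("utf-8").rstrip("\n")

offsets = build_index("responses.txt")
print(random_line("responses.txt", offsets))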
You can store your data in a file, say response.txt,
and retrieve it in the discord bot file with open("response.txt").readlines()
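A minimal sketch of that approach (note that readlines() still loads every line into memory at once):
import random

with open("response.txt") as f:
    lines = f.readlines()

reply = random.choice(lines).strip()   # pick one chat line at random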

Gremlin: how IO import works with Python

I am trying to find out which functions are called by the command g.io('file.json').read().iterate()
I see that a 'read' step is put in the step_instructions, but I can't find the original function that imports the file into the graph.
This is because I want to import a lot of data, but without a file, using a Python object.
I see that io().read() imports a big file in a minute and I want to recreate that, but without using a file.
Thanks a lot.
First of all, to be clear on the nomenclature: io() is a step, while read() and write() are step modulators, and those modulators can only apply to the io() step to tell it to read or write, respectively. Since io() currently only works with a string file name, you can only read from and write to files.
If you want to send "a lot of data" with Python, I'd first consider what you mean by that in size. If you're talking millions of vertices and edges, you should first check if the graph database you are using has its own bulk loading tool. If it does, you should use that. You may also consider methods using Gremlin/Spark as described here in the case of JanusGraph. Finally, if you must use pure Gremlin to do your bulk loading, then parameterized traversal with your Python object (I assume a list/dict of some sort) is probably the approach to take. This blog post might offer some inspiration in that line of thinking.
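As a rough illustration of that last option, here is a minimal gremlinpython sketch that builds vertices from a Python list in a single traversal; the endpoint, label, and property names are assumptions, not part of the original question:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")  # assumed endpoint
g = traversal().withRemote(conn)

people = [{"name": "alice"}, {"name": "bob"}]  # the Python object holding your data

# Chain several addV() calls so the whole batch goes over the wire as one traversal.
t = g.addV("person").property("name", people[0]["name"])
for person in people[1:]:
    t = t.addV("person").property("name", person["name"])
t.iterate()

conn.close()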

Removing JSON objects that aren't correctly formatted Python

I'm building a chatbot database at the moment. I use data from pushshift.io. In order to deal with the big data file (I understand that json loads everything into RAM, so if you only have 16GB of RAM and are working with 30GB of data, that is a no-no), I wrote a bash script that splits the big file into smaller chunks of about 3GB each so that I can run them through json.loads (or pd.read_json). The problem is that whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I took a look at the temp JSON file I had just created, and I see this in it:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
A correctly formatted record looks like this:
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I notice that my bash script split the file without paying attention to the JSON object boundaries. So my question is: is there a way to write a function in Python that can detect JSON objects that are not correctly formatted and delete them?
There isn't a lot of information to go on, but I would challenge the frame a little.
There are several incremental JSON parsers available in Python. A quick search suggests ijson should allow you to traverse your very large data structure without blowing up memory.
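For example, a minimal ijson sketch, assuming the file holds one big top-level JSON array (the prefix would change for other layouts); the file name is illustrative:
import ijson

count = 0
with open("huge.json", "rb") as f:
    for record in ijson.items(f, "item"):
        count += 1                   # each record is yielded one at a time
print(count, "records parsed without loading the whole file")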
You should also consider another data format (or a real database); otherwise you will easily find yourself spending time reimplementing much slower versions of features that already exist in the right tools.
If you are using the json standard library, then calling json.loads on badly formatted data will raise a JSONDecodeError. You can wrap your code in a try/except statement and check for this exception to make sure you only process correctly formatted data.
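A minimal sketch of that filtering, assuming the chunks are newline-delimited JSON (one object per line, which is what pushshift dumps look like); the file name is illustrative:
import json

good_records = []
with open("chunk_01.json") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            good_records.append(json.loads(line))
        except json.JSONDecodeError:
            # Records cut in half by the byte-based split fail to parse; skip them.
            pass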

Interfacing with Python code via file read/write?

I'm working with a Windows program that has its own language with minimal interfacing options for external code, but it can read and write files. I am looking for a method to send a set of configuration values such as "12,43,47,62" to Python 3 code in order to query data in Pandas and return the associated results.
Someone mentioned this could possibly be done through a file interface, where inputs are written to a file by the originating program and values are read back from an alternate file. I have a couple of questions regarding this concept that I hope someone can clarify for me.
How well does this method handle simultaneous access where multiple calls are being made for different queries?
What is the correct terminology for this type of task?
Is there a way to do it so the Python code senses the change as opposed to repeatedly checking for changes?
1) Poorly. You should put each query in its own file, put responses in their own files, and encode request IDs or other correlating information in the file names.
2) I'm not sure there is one. "File Based Communication" maybe.
3) Yes, Python watchdog.
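A minimal watchdog sketch, assuming requests arrive as new files in a requests/ directory (the directory, extension, and handler body are illustrative):
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class RequestHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Fires when the external program writes a new request file.
        if not event.is_directory and event.src_path.endswith(".req"):
            print("new request:", event.src_path)  # run the Pandas query here

observer = Observer()
observer.schedule(RequestHandler(), path="requests", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()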

Storing user data in a Python script

What is the preferred/usual way of storing data that is entered by the user when running a Python script, if I need the data again the next time the script runs?
For example, my script performs calculations based on what the user enters and then when the user runs the script again, it fetches the result from the last run.
For now, I write the data to a text file and read it from there. I don't think that I would need to store very many records (fewer than 100, I'd say).
I am targeting Windows and Linux users both with this script, so a cross platform solution would be good. My only apprehension with using a text file is that I feel it might not be the best and the usual way of doing it.
So my question is, if you ever need to store some data for your script, how do you do it?
You could use a SQLite database or a CSV file. They are both very easy to work with, but they lend themselves to rows with the same type of information.
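A minimal CSV sketch; the file and column names are illustrative:
import csv

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["run", "result"])     # header row
    writer.writerow([1, 3.14])

# next run
with open("results.csv", newline="") as f:
    rows = list(csv.reader(f))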
The best option might be the shelve module:
import shelve

shelf = shelve.open(filename)   # filename is whatever path you choose
shelf['key1'] = value1
shelf['key2'] = value2
shelf.close()

# next run
shelf = shelve.open(filename)
value1 = shelf['key1']
# etc.
shelf.close()
For small amounts of data, Python's pickle module is great for stashing away data you want easy access to later: just pickle the data objects from memory and write them to a (hidden) file in the user's home folder (good for Linux etc.) or Application Data (on Windows).
Or, as @aaronnasterling mentioned, a sqlite3 file-based database is so small, fast, and easy to use that it's no wonder so many popular programs like Firefox and Pidgin rely on it.
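A minimal sketch of the pickle route mentioned above; the file name and location are illustrative:
import pickle
from pathlib import Path

state_file = Path.home() / ".myscript_state"   # hidden file in the user's home folder

def save_state(data):
    with open(state_file, "wb") as f:
        pickle.dump(data, f)

def load_state():
    try:
        with open(state_file, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None                            # first run: nothing stored yet

previous = load_state()
save_state({"last_result": 42})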
For 100 lines, plain text is fine with either the standard ConfigParser or csv modules.
Assuming your data structure is simple, text affords opportunities (e.g. grep, vi, notepad) that more complex formats preclude.
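A minimal sketch with configparser (the Python 3 spelling of ConfigParser); the section and option names are illustrative:
import configparser

config = configparser.ConfigParser()
config["last_run"] = {"result": "42"}
with open("state.ini", "w") as f:
    config.write(f)

# next run
config = configparser.ConfigParser()
config.read("state.ini")
print(config["last_run"]["result"])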
Since you only need the last result, just store the result in a file.
Example:
with open("result.txt", "w") as f:   # "result.txt" is just an example name
    f.write("something")
This will only keep the last result. Then, when you re-run the script, open the file again and read the previous result.
