discord.py: too big variable? - python

I'm very new to Python and programming in general, and I'm looking to make a Discord bot that has a lot of hand-written chat lines to randomly pick from and send back to the user. Making a really huge variable holding a list of sentences seems like a bad idea. Is there a way that I can store the chat lines in a separate file and have the bot pick from the lines in that file? Or is there anything else that would be better, and how would I do it?

I'll interpret this question as "how large a variable is too large", to which the answer is pretty simple: a variable is too large when it becomes a problem. So, how can a variable become a problem? The big one is that the machine could run out of memory, and an OOM killer (out-of-memory killer) or similar will stop your program. How would you know if your variable is causing these issues? Pretty simple: your program crashes.
If the variable is static (with a size fully known at compile time or prior to interpretation), you can calculate how much RAM it will take. (This is a bit finicky with Python, so it might be easier to load it up at runtime and figure it out with a profiler.) If it's more than ~500 megabytes, you should be concerned. Over a gigabyte, and you'll probably want to reconsider your approach[^0].
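If you want a quick runtime estimate without a full profiler, here is a rough sketch using only the standard library; the list of lines is made up, and since sys.getsizeof doesn't follow references, the string sizes are summed separately and shared strings get counted more than once, so treat the number as an approximation:

import sys

# Hypothetical data standing in for your real list of chat lines.
lines = ["hello there", "general kenobi, you are a bold one"] * 100_000

# getsizeof(lines) covers only the list object itself (the pointers),
# so add the size of each string for a rough total.
total = sys.getsizeof(lines) + sum(sys.getsizeof(s) for s in lines)
print(f"roughly {total / (1024 * 1024):.1f} MiB")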
So, what do you do then? As suggested by #FishballNooodles, you can store your data line by line in a file and read the lines into an array. Unfortunately, the code they've provided still reads the entire thing into memory. If that's a concern, you've got a few options, non-exhaustively listed below.
Consume a random number of newlines from the file when you need a line of text. You would look at one character at a time, compare it to \n, and read the line once you've encountered the requested number of newlines. This is O(n) worst case with respect to the number of lines in the file.
Rather than storing the text you need at a given index, store its byte offset in the file. Then you can seek to that location (which is effectively O(1)) and read the text. This requires an O(n) construction cost at the start of the program, but works much better at runtime (see the sketch after these options).
Use an actual database. It's usually better not to reinvent the wheel. If you're just storing plain text, this is probably overkill, but don't discount it.
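Here is a minimal sketch of the second option, assuming the lines live in a file called responses.txt (the file name and helper names are made up). It pays the O(n) scan once at startup to record byte offsets, and each request after that is a seek plus a single readline:

import random

def build_index(path):
    # One pass over the file, recording the byte offset where each line starts.
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    return offsets

def random_line(path, offsets):
    # Jump straight to a random line; only that one line is read into memory.
    with open(path, "rb") as f:
        f.seek(random.choice(offsets))
        return f.readline().decode("utf-8").rstrip("\n")

offsets = build_index("responses.txt")
print(random_line("responses.txt", offsets))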
[^0]: These numbers are actually just random. If you control the server environment on which you run the code, then you can probably come up with some more precise signposts.

You can store your data in a file, say named response.txt, and retrieve it in the Discord bot file with open("response.txt").readlines().
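For example, a minimal sketch of that answer (note that this does read the whole file into memory, which is fine for a few thousand hand-written lines):

import random

with open("response.txt", encoding="utf-8") as f:
    responses = [line.rstrip("\n") for line in f]  # whole file ends up in memory

reply = random.choice(responses)  # pick one line to send back to the user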

Related

Append unique fingerprint to file

I have a set of files (compiled software) that I want to give a unique fingerprint before distribution. The idea is to write a script that:
Randomly generates a character sequence
Appends the character sequence to a file in the project
Stores the fingerprint in a database with the addressee
Distributes the software to the addressee
The requirements for the fingerprinting process are that:
The fingerprint is difficult to detect (i.e. not stored in the file metadata or easily accessible areas)
The fingerprint does not corrupt the data of the file the sequence is added to
The fingerprint can be added to an executable or dll file
It's easy to read the fingerprint if you know where to look
Are there any open source solutions built for the purpose of fingerprinting files?
Storing information in the file without corrupting it, and in a way that is not easily detectable, is an exercise in steganography, and quite a hard one. This theoretical tool would need to be able to parse the executable structure, modify it properly, edit offsets if needed, detect padding areas, and basically do some of the work that the compiler does. I doubt that such a tool exists or is reliable.
However, there are quite a few steganography tools that can store information in pictures by subtly changing the colors of the pixels; perhaps you can store your information in the icon of the exe file or in any included asset.
Another way is to hide the data at compilation time, in the optimization level of performance-uncritical parts of the executable, so that the compiler generates slightly different code while the behavior is guaranteed to stay consistent. You can then use file hashes as your fingerprint.
Yet another way is to just create an unused string inside some random function, mark it as volatile (or the analog in your language of choice) to prevent the compiler from optimizing it out of your program, and put something noticeable in it, like REPLACE_ME. Now you can open the file, search for this string, and replace it with the identifier that you have generated. As long as the identifier and the placeholder are the same length, you won't damage your software.
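A minimal sketch of that idea in Python (the marker, file name, and fingerprint here are all made up; the only hard requirement is that the replacement is exactly as long as the placeholder):

MARKER = b"REPLACE_ME______"  # 16 bytes of placeholder baked into the binary

def stamp(path, fingerprint):
    if len(fingerprint) != len(MARKER):
        raise ValueError("fingerprint must be exactly as long as the marker")
    with open(path, "rb") as f:
        data = bytearray(f.read())
    idx = data.find(MARKER)
    if idx == -1:
        raise ValueError("marker not found in binary")
    data[idx:idx + len(MARKER)] = fingerprint  # same length, so no offsets move
    with open(path, "wb") as f:
        f.write(data)

stamp("app.exe", b"A1B2C3D4E5F60718")  # hypothetical file and identifier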
Another, more subtle way is to create multiple different rephrasings of the same messages in your app and swap them in and out as a way to differentiate versions. If your programming language stores null-terminated strings, this is very easy: just make the strings in the code as long as the longest rephrasing. If your language stores the length of the string, then you have to recalculate it as well.
Alternatively, if you are working with Unicode strings in your code, you can use similar-looking glyphs in some strings as a lower-effort version of the previous idea; basically, you are performing a homograph attack on your own strings. You can also use Unicode control characters (ZWJ, ZWNJ, etc.) that do not affect most languages and are invisible.
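A hedged sketch of the zero-width variant: encode the identifier's bits as zero-width joiners and non-joiners appended to some user-visible string (the message text and the 32-bit width are arbitrary choices):

ZWNJ, ZWJ = "\u200c", "\u200d"  # invisible in most rendered text

def embed(text, ident, bits=32):
    payload = format(ident, f"0{bits}b")
    return text + "".join(ZWJ if b == "1" else ZWNJ for b in payload)

def extract(text, bits=32):
    tail = text[-bits:]
    return int("".join("1" if ch == ZWJ else "0" for ch in tail), 2)

marked = embed("Welcome back!", 0x2F)
assert extract(marked) == 0x2F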
All of these schemes are easily discovered by diffing two different distributions of the software. The one with different optimization levels could plausibly be written off as just different builds, but a persistent attacker could still figure it out.
Since you are talking about compiled software, maybe an alternative solution could be to use an executable-encrypting tool. When you execute the file it will ask for a password; if it's correct, it will use the password to generate a key. It then uses that key to decrypt the program directly in memory. That way they won't be able to analyze the binary, and even with the key it would be a lot more difficult to do so, much less modify it. You can put as many fingerprints as you like, as regular text strings, into the code and they will most likely stay there.

Ignoring a function's return to save memory in Python

This might not even be an issue but I've got a couple of related Python questions that will hopefully help clear up a bit of debugging I've been stuck on for the past week or two.
If you call a function that returns a large object, is there a way to ignore the return value in order to save memory?
My best example: let's say you are piping a large text file, line by line, to another server, and when you complete the pipe, the function returns a confirmation for every successful line. If you pipe too many lines, the returned list of confirmations could potentially overrun your available memory.
for line in lines:
    connection.put(line)
response = connection.execute()
If you remove the response variable, I believe the return value is still loaded into memory so is there a way to ignore/block the return value when you don't really care about the response?
More background: I'm using the redis-python package to pipeline a large number of set-additions. My processes occasionally die with out-of-memory issues even though the file itself is not THAT big and I'm not entirely sure why. This is just my latest hypothesis.
I don't think the confirmation response is big enough to overrun your memory. In Python, when you read all the lines of a file at once, the whole list of lines stays in memory, which can be a large memory cost.
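That said, if you do suspect the pipeline, you can flush it in chunks and simply not keep the replies. A rough sketch with redis-py (the key name, chunk size, and input file name are assumptions):

import redis

r = redis.Redis()
pipe = r.pipeline(transaction=False)

with open("members.txt", encoding="utf-8") as f:  # iterate lazily, line by line
    for i, line in enumerate(f, 1):
        pipe.sadd("myset", line.rstrip("\n"))
        if i % 10000 == 0:
            pipe.execute()  # replies aren't assigned, so they're freed immediately
pipe.execute()  # flush whatever is left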

Data persistence for python when a lot of lookups but few writes?

I am working on a project that basically monitors a set of remote directories (FTP, networked paths, and another). If a file is considered new and meets our criteria, we download it and process it. However, I am stuck on what the best way is to keep track of the files we already downloaded. I don't want to download any duplicate files, so I need to keep track of what is already downloaded.
Originally I was storing it as a tree:
server->directory->file_name
When the service shuts down, it writes the tree to a file and reads it back in when it starts up. However, once there are around 20,000 or so files in the tree, things start to slow down a lot.
Is there a better way to do this?
EDIT
The lookup times start to slow down a lot; my basic implementation is a dict of dicts. Storing the structure on disk is fine, it's more or less just the lookup time. I know I can optimize the tree and partition it, but that seems excessive for such a small project; I was hoping Python would have something like that built in.
I would create a set of tuples, then pickle it to a file. The tuples would be (server, directory, file_name), or even just (server, full_file_name_including_directory). There's no need for a multi-level data structure. The tuples will hash into the set and give you O(1) lookups.
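A minimal sketch of that approach (the pickle file name and the example key are made up):

import os
import pickle

SEEN_PATH = "seen.pickle"

def load_seen():
    if os.path.exists(SEEN_PATH):
        with open(SEEN_PATH, "rb") as f:
            return pickle.load(f)
    return set()

def save_seen(seen):
    with open(SEEN_PATH, "wb") as f:
        pickle.dump(seen, f)

seen = load_seen()
key = ("ftp://example.com", "/incoming/report.csv")
if key not in seen:                   # O(1) membership test on the set
    # download_and_process(key)       # placeholder for your real work
    seen.add(key)
save_seen(seen)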
You mention "stuff starts to slow down alot," but you don't say if it's reading and writing time, or lookup times that are slowing down. If your lookup times are slowing down, you may be paging. Is your data structure approaching a significant fraction of your physical memory?
One way to get back some memory is to intern() the server names (sys.intern() in Python 3). This way, each server name will be stored only once in memory.
An interesting alternative is to use a Bloom filter. This will let you use far less memory, but will occasionally download a file that you didn't have to. This might be a reasonable trade-off, depending on why you didn't want to download the file twice.
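If you want to experiment with that trade-off, here is a toy Bloom filter using nothing but the standard library; the bit-array size and hash count are arbitrary, and a real deployment would use a tuned library instead:

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("server1:/incoming/report.csv")
print("server1:/incoming/report.csv" in seen)  # True
print("server2:/other/file.bin" in seen)       # almost certainly False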

Adding space in an output file without having to read the entire thing first

Question: How do you write data to an already existing file at the beginning of the file, without writing over what's already there and without reading the entire file into memory (i.e. prepend)?
Info:
I'm working on a project right now where the program frequently dumps data into a file. This file will very quickly balloon up to 3-4 GB. I'm running this simulation on a computer with only 768 MB of RAM. Pulling all that data into RAM over and over would be a great pain and a huge waste of time. The simulation already takes long enough to run as it is.
The file is structured such that the number of dumps it makes is listed at the beginning as a simple value, like 6. Each time the program makes a new dump I want that to be incremented, so now it's 7. The problem lies with the 10th, 100th, 1000th, and so on: the program will write the 10 just fine, but overwrite the first character of the next line:
"9\n580,2995,2083,028\n..."
"10\n80,2995,2083,028\n..."
Obviously, the difference between 580 and 80 in this case is significant. I can't lose these values, so I need a way to add a little space in there so that I can write this new data without losing my data or having to pull the entire file up and then rewrite it.
Basically what I'm looking for is a kind of prepend function. something to add data to the beginning of a file instead of the end.
Programmed in Python
~n
See the answers to this question:
How do I modify a text file in Python?
Summary: you can't prepend in place without rewriting what follows (this is down to how file systems work, rather than being a Python limitation).
It's not addressing your original question, but here are some possible workarounds:
Use SQLite (it's bundled with your Python)
Use a fancier database, either RDBMS or NoSQL
Just track the number of dumps in a different text file
The first couple of options are a little more work up front, but provide more flexibility. The last option is the easiest solution to your current problem.
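A minimal sketch of that last option, assuming a tiny sidecar file named dump_count.txt kept next to the big data file:

def bump_dump_count(path="dump_count.txt"):
    try:
        with open(path) as f:
            count = int(f.read().strip() or 0)
    except FileNotFoundError:
        count = 0
    count += 1
    with open(path, "w") as f:  # the counter file is tiny, so rewriting it is cheap
        f.write(str(count))
    return count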
You could quite easily create a new file, write the data you wish to prepend to it, then copy the contents of the existing file, append them to the new one, and finally rename.
This avoids having to read the whole file into memory, if that is the primary issue.
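A rough sketch of that approach (the file name is a placeholder); shutil.copyfileobj streams in fixed-size chunks, so the 3-4 GB never has to fit in RAM, although the copy still costs time and temporary disk space:

import os
import shutil

def prepend(path, data):
    tmp = path + ".tmp"
    with open(tmp, "wb") as out, open(path, "rb") as src:
        out.write(data)                 # the new header goes in first
        shutil.copyfileobj(src, out)    # then the old contents, chunk by chunk
    os.replace(tmp, path)               # rename over the original

prepend("dumps.txt", b"7\n")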

Data Carving loop improvements

I currently have a Python app in development which data-carves a block device for JPEG files. Let's just say that it sometimes works and sometimes doesn't. I have written it so that I read the block device until I find an ffd8 marker, then I keep the stream open and loop searching for the ffd9 closure. I also need to take into account every ffd9 closure after the first, so it tends to be a really intensive operation. Given a device with, say, 25 JPEGs as well as lots of other data, the looping is pretty dramatic and it runs through a lot.
The program is not the slowest thing in the world, but I think it could be much faster and much more efficient. I am looking for a better way to search the block device and extract the data in a more efficient manner. I also don't want to kill the HDD or the drive holding the image of the block device.
So does anybody know of a better way to systematically handle the searching and extraction of the data?
The trouble with reading the block device directly is that there is no guarantee that the blocks of any given file are contiguous. That means that even if you find your magic marker bytes 0xFFD8 in block 13, say, there is no guarantee that block 14 belongs to the same file, whether or not it contains the 0xFFD9 end marker. (Most files will start on a block boundary; the end of the file may be anywhere, possibly even across block boundaries.)
What's the better way to deal with it? Well, it depends what you're after, but if you're looking only at currently allocated blocks, then scan the file system using the Python analog of the POSIX C function ftw (nftw) and read each file in turn. This won't find evidence of deleted JPEG files in the free list; if that's what you are after, then you'll need to do more or less what you are doing, but correlate that information with what you find in the file system proper. Mapping those blocks will (at best) be hard.
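A hedged sketch of that suggestion, using os.walk (the usual Python stand-in for nftw) over an assumed mount point; checking the first and last two bytes of each file is only a heuristic, since some valid JPEGs carry trailing data after 0xFFD9:

import os

SOI, EOI = b"\xff\xd8", b"\xff\xd9"  # JPEG start- and end-of-image markers

def find_jpegs(mount_point):
    for root, _dirs, files in os.walk(mount_point):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as f:
                    if f.read(2) != SOI:
                        continue
                    f.seek(-2, os.SEEK_END)
                    tail = f.read(2)
            except OSError:
                continue
            if tail == EOI:
                yield path

for path in find_jpegs("/mnt/image"):  # hypothetical mount point of the imaged device
    print(path)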
