I have a large amount of data coming in every second in the form of Python dictionaries. Right now I am saving it to a MySQL server as it comes in, but that creates a backlog that's more than a few hours long. What is the best way to save the data locally and move it to a MySQL server every hour or so as a chunk, to save time? I have tried Redis, but it can't save a list of these dictionaries which I can later move to MySQL.
A little-known fact about the Python native pickle format is that you can happily concatenate multiple pickles into a single file.
That is, simply open a file in append mode and pickle.dump() your dictionary into that file. If you want to be extra fancy, you could do something like timestamped files:
import pickle
from datetime import datetime

def ingest_data(data_dict):
    filename = '%s.pickles' % datetime.now().strftime('%Y-%m-%d_%H')
    with open(filename, 'ab') as outf:
        pickle.dump(data_dict, outf, pickle.HIGHEST_PROTOCOL)

def read_data(filename):
    with open(filename, 'rb') as inf:
        while True:
            try:
                yield pickle.load(inf)
            except EOFError:  # no more pickles left in the file
                break
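To push an hour's worth of pickles into MySQL as a single chunk, you could read a finished file back with read_data() and bulk-insert the rows. A minimal sketch, assuming a hypothetical events table with two columns and the mysql-connector-python package (adapt the columns, keys and connection details to your schema):

import mysql.connector  # pip install mysql-connector-python

def flush_to_mysql(filename):
    conn = mysql.connector.connect(user='user', password='secret',
                                   host='localhost', database='mydb')
    cursor = conn.cursor()
    # one row tuple per pickled dictionary; adjust the keys to your data
    rows = [(d['key'], d['value']) for d in read_data(filename)]
    # a single executemany() is far cheaper than one INSERT per dictionary
    cursor.executemany("INSERT INTO events (k, v) VALUES (%s, %s)", rows)
    conn.commit()
    cursor.close()
    conn.close()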
I am trying to create a binary file (called textsnew) and then append two (previously created) binary files to it. When I print the resulting file (textsnew), it only shows the first file appended to it, not the second one. I do, however, see that the size of the new file (textsnew) is the sum of the two appended files. Maybe I'm opening it incorrectly? This is my code:
with open("/path/textsnew", "ab") as myfile, open("/path/names", "rb") as file2:
myfile.write(file2.read())
with open("/path/textsnew", "ab") as myfile, open("/path/namesthree", "rb") as file2:
myfile.write(file2.read())
This code is for reading the file:
import pickle
infile1 = open('/path/textsnew','rb')
names1 = pickle.load(infile1)
print (names1)
Open the new file, write its data.
Then, while the new file is still open (in append mode), open the second file, read its data and immediately write that data to the first file.
Then repeat the procedure for the third file.
Everything in binary, of course, although it will work just as well with text files. Linux/macOS/*nix don't even really care.
This also assumes that a single read() will return the full file contents in one go, as in your question. Otherwise, you would need to create a loop around the read/write parts (see the chunked sketch after the code below).
with open('/path/textsnew', 'ab') as fpout:
    fpout.write(data)  # whatever initial data you want the new file to start with

    with open('/path/names', 'rb') as fpin:
        fpout.write(fpin.read())

    with open('/path/namesthree', 'rb') as fpin:
        fpout.write(fpin.read())
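If the files are too large to read in one go, the loop mentioned above (or shutil.copyfileobj, which does the same thing for you) could look like this:

import shutil

with open('/path/textsnew', 'ab') as fpout:
    with open('/path/names', 'rb') as fpin:
        while True:
            chunk = fpin.read(64 * 1024)  # copy 64 KiB at a time
            if not chunk:                 # empty bytes object means end of file
                break
            fpout.write(chunk)
    with open('/path/namesthree', 'rb') as fpin:
        shutil.copyfileobj(fpin, fpout)   # the stdlib version of the same loop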
Hello, I have a huge CSV file (1 GB) that can be updated (the server often adds new values).
I want to read this file in Python line by line (not load the whole file into memory), and I want to read it in "real time".
This is an example of my CSV file:
id,name,lastname
1,toto,bob
2,tutu,jordan
3,titi,henri
First, I want to get the header of the file (the column names); in my example I want to get this: id,name,lastname
Second, I want to read the file line by line, not load the whole file into memory.
Third, I want to try to read new values every 10 seconds (with sleep(10), for example).
I am currently looking for a solution using pandas.
I read this topic:
Reading a huge .csv file
import pandas as pd

chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
but I don't understand:
1) I don't know the size of my CSV file, so how do I define chunksize?
2) When I finish reading, how do I tell pandas to try to read new values every 10 seconds (for example)?
Thanks in advance for your help.
First of all, 1GB is not huge - pretty much any modern device can keep that in its working memory. Second, pandas doesn't let you poke around the CSV file, you can only tell it how much data to 'load' - I'd suggest using the built-in csv module if you want to do more advanced CSV processing.
Unfortunately, the csv module's reader() will produce an exhaustible iterator for your file so you cannot just build it as a simple loop and wait for the next lines to become available - you'll have to collect the new lines manually and then feed them to it to achieve the effect you want, something like:
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        reader = csv.reader(f.readlines())  # create a CSV reader for the new lines
        for row in reader:  # iterate over the new rows, if any
            print("Processing new row: {}".format(row))  # process each row however you want
        time.sleep(10)  # wait 10 seconds before attempting again
Beware of the edge cases that may break this process - for example, if you attempt to read new lines as they are being added, some data might get lost/split (depending on the flushing mechanism used for the additions), and if you delete previous lines the reader might get corrupted, etc. If at all possible, I'd suggest controlling the CSV writing process in such a way that it explicitly informs your processing routines.
UPDATE: The above processes the CSV file line by line; the file never gets loaded whole into working memory. The only part that loads more than one line into memory is when an update to the file occurs, where it picks up all the new lines at once because it's faster to process them that way and, unless you're expecting millions of rows of updates between two checks, the memory impact would be negligible. However, if you want to have that part processed line by line as well, here's how to do it:
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        line = f.readline()  # collect the next line, if any available
        if line.strip():  # new line found, we'll ignore empty lines too
            row = next(csv.reader([line]))  # load a line into a reader, parse it immediately
            print("Processing new row: {}".format(row))  # process the row however you want
            continue  # avoid waiting before grabbing the next line
        time.sleep(10)  # wait 10 seconds before attempting again
Chunk size is the number of lines it would read at once, so it doesn't depend on the file size. At the end of the file the for loop will end.
The chunk size depends on the optimal size of data for your processing. In some cases 1 GB is not a problem, as it can fit in memory, and you don't need chunks at all. If you aren't OK with 1 GB loaded at once, you can select, for example, 1M lines with chunksize = 1e6; with a line length of about 20 characters that would be something less than 100 MB, which seems reasonably low, but you may vary the parameter depending on your conditions.
When you need to read the updated file, you just start your for loop once again.
If you don't want to read the whole file just to see that it hasn't changed, you can look at its modification time (details here) and skip reading if it hasn't changed.
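A rough sketch of that modification-time check, using os.path.getmtime from the standard library (process_file stands in for whatever reading routine you use):

import os
import time

last_mtime = 0.0
while True:
    mtime = os.path.getmtime(filename)
    if mtime != last_mtime:      # the file changed since we last looked
        last_mtime = mtime
        process_file(filename)   # your reading/processing routine
    time.sleep(10)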
If the question is about re-reading every 10 seconds, it can be done in an infinite loop with sleep, like:
import time

while True:
    do_what_you_need()
    time.sleep(10)
In fact the period will be more than 10 seconds, as do_what_you_need() also takes time.
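If you want the checks to run roughly every 10 seconds regardless of how long do_what_you_need() takes, you can subtract the elapsed time (a small sketch):

import time

period = 10
while True:
    start = time.monotonic()
    do_what_you_need()
    elapsed = time.monotonic() - start
    time.sleep(max(0, period - elapsed))  # never sleep a negative amount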
If the question is about reading the tail of a file, I don't know a good way to do that in pandas, but you can do some workarounds.
The first idea is just to read the file without pandas and remember the last position. The next time you need to read, you can seek to that position. Or you can implement the seek-and-read yourself and feed the new data to pandas.read_csv through a StringIO buffer.
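For example, a rough sketch of that seek idea, feeding only the newly appended bytes to pandas through StringIO (the read_new_rows helper and its last_pos bookkeeping are illustrative, not part of the pandas API):

import io
import pandas as pd

def read_new_rows(filename, columns, last_pos):
    """Return (new_rows_dataframe, new_offset); last_pos is the byte offset already processed."""
    with open(filename, 'r') as f:
        f.seek(last_pos)        # jump past everything already handled
        new_text = f.read()     # only the freshly appended part of the file
        new_pos = f.tell()      # remember where we stopped for the next call
    if not new_text.strip():    # nothing new since last time
        return pd.DataFrame(columns=columns), new_pos
    return pd.read_csv(io.StringIO(new_text), names=columns, header=None), new_pos

Start last_pos just past the header line (for example, by reading the header once with a plain open()/readline()) so the header is not parsed as a data row.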
The other workaround is to use the Unix command tail to cut off the last n lines, if you are sure not too much was added at once. tail reads from the end of the file, so it is much faster than reading and parsing all lines with pandas; still, seek is theoretically faster on very long files. Here you need to check whether too many lines were added (you don't see the last processed id); in that case you'll need to take a longer tail or read the whole file.
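And a sketch of the tail workaround (the 50-line window is an arbitrary choice; pick something comfortably larger than the number of rows you expect between checks, and drop rows whose id you have already processed):

import io
import subprocess
import pandas as pd

def read_last_rows(filename, columns, n=50):
    # let the OS grab the last n lines instead of parsing the whole file in Python
    out = subprocess.check_output(["tail", "-n", str(n), filename], text=True)
    return pd.read_csv(io.StringIO(out), names=columns, header=None)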
All of that involves additional code, logic and potential mistakes. One of them is that the last line could be broken (if you read at the moment it is being written). So the approach I like most is just to switch from a text file to sqlite, which is an SQL-compatible database that stores its data in a file and doesn't need a separate server process to access it. It has a Python library (sqlite3) which makes it easy to use. It will handle all the stuff with the long file, simultaneous writing and reading, and reading only the data you need. Just save the last processed id and make a request like this: SELECT * FROM table_name WHERE id > last_processed_id;. This is possible only if you also control the server code and can save in this format.
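A minimal sketch of that sqlite approach with the standard sqlite3 module (the people table mirrors the CSV columns from the question; the names are illustrative):

import sqlite3

conn = sqlite3.connect("data.db")  # a plain file; no separate server process needed
conn.execute("CREATE TABLE IF NOT EXISTS people (id INTEGER PRIMARY KEY, name TEXT, lastname TEXT)")

# writer side: insert new rows as they arrive
conn.execute("INSERT OR IGNORE INTO people VALUES (?, ?, ?)", (1, "toto", "bob"))
conn.commit()

# reader side: fetch only the rows you haven't processed yet
last_processed_id = 0
for row in conn.execute("SELECT * FROM people WHERE id > ?", (last_processed_id,)):
    print(row)  # process the row, then remember the highest id you have seen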
I'm trying to create a simple function which I can use to store JSON data to a file. I currently have this code:
import json

def data_store(key_id, key_info):
    try:
        with open('data.txt', 'a') as f:
            data = json.load(f)
            data[key_id] = key_info
            json.dump(data, f)
    except Exception:
        print("Error in data store")
The idea is to load what data is currently within the text file, then create or edit the JSON data. So running the code...
data_store("foo","bar")
The function should then read what's within the text file, then allow me to append to the JSON data, either replacing what's there (if "foo" already exists) or creating it if it doesn't exist.
This has been throwing errors at me, however. Any ideas?
The a mode does not work for both reading and writing at the same time. Instead, use r+:
with open('data.txt', 'r+') as f:
    data = json.load(f)
    data[key_id] = key_info
    f.seek(0)
    json.dump(data, f)
    f.truncate()
The seek(0) call here moves the cursor back to the beginning of the file. truncate() helps in situations where the new file contents are shorter than the old ones.
And, as a side note, try to avoid having a bare except clause, and/or log the error and the traceback properly.
Basically I want to be able to calculate a parameter, store it in a text file, then read it back in later in the program.
myFile = 'example.txt'
Using with will automatically close the file when you leave that structure
# perform your writing
with open(myFile, 'w') as f:
    f.write('some stuff')

# doing other work
# more code

# perform your reading
with open(myFile, 'r') as f:
    data = f.read()
    # do stuff with data
You need to use close() before changing mode (read / write):
def MyWrite(myfile):
    file = open(myfile, "w")
    file.write("hello world in the new file\n")
    file.close()

def MyRead(myfile):
    file = open(myfile, "r")
    data = file.read()  # keep the contents so the caller can actually use them
    file.close()
    return data
Also, you could open a file for reading AND writing, using:
fd = open(myfile, "r+")
However, you must be very careful, since every operation, either read or write, changes the pointer position, so you may need to use fd.seek to make sure you're placed in the right position where you want to read or write.
Also, keep in mind that your file becomes a sort of memory-mapped string(*) that sometimes syncs with the disk. If you want to save changes at a specific point, you must use fd.flush and os.fsync(fd) to effectively commit the changes to disk without closing the file.
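A small sketch of that r+ mode with the seek and flush/fsync calls mentioned above:

import os

with open(myfile, "r+") as fd:
    data = fd.read()            # reading leaves the pointer at the end of the file
    fd.write("another line\n")  # so this write lands after the existing contents
    fd.flush()                  # push Python's buffer to the OS...
    os.fsync(fd.fileno())       # ...and ask the OS to commit it to disk
    fd.seek(0)                  # jump back to the start to re-read everything
    data = fd.read()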
All in all, I'd say it's better to stick to one mode of operation and then close the file and open it again, unless there's a very good reason to have read/write available without switching modes.
* There's also a module for memory-mapped files, but I think that's way beyond what you were asking.
I have a JSON file with some data, and would like to occasionally update this file.
I read the file:
with open('index.json', 'rb') as f:
    idx = json.load(f)
then check for the presence of a key from the potentially new data, and if the key is not present, update the file:
with open('index.json', mode='a+') as f:
    json.dump(new_data, f, indent=4)
However, this procedure just creates a new JSON object (Python dict) and appends it as a new object in the output JSON file, making the file invalid JSON.
Is there any simple way to append new data to the JSON file without overwriting the whole file, by updating the initial dict?
One way to do what you're after is to write one JSON object per line in the file. I'm using that approach and it works quite well.
A nice benefit is that you can read the file more efficiently (memory-wise) because you can read one line at a time. If you need all of them, there's no problem with assembling a list in Python, but if you don't you're operating much faster and you can also append.
So to initially write all your objects, you'd do something like this:
with open(json_file_path, "w") as json_file:
for data in data_iterable:
json_file.write("{}\n".format(json.dumps(data)))
Then to read efficiently (will consume little memory, no matter the file size):
with open(json_file_path, "r") as json_file:
for line in json_file:
data = json.loads(line)
process_data(data)
To update/append:
with open(json_file_path, "a") as json_file:
json_file.write("{}\n".format(json.dumps(new_data)))
Hope this helps :)