I am trying to make my program read from a file every 5 seconds until it reaches its end.
I have this code:
import pandas as pd

data = pd.read_csv(path + file_name)
n = len(data.index)
for i in range(0, n):
    # read the i-th element (the original always read index 0)
    element = data['first_column_name'][i]
I tried writing time.sleep(5) after reading the element, but it has to be from the beginning, to look like streaming data... if possible
How can I make it read the element from the file every 5 seconds?
Python is single-threaded by default, meaning that it executes one thing at a time. When you call time.sleep, the thread that called it can do nothing else until the timer runs out. You seem to want your program to keep doing other work and periodically read from a file. What I think you are looking for is async or multithreading. This is a big topic with many options for different circumstances; this Real Python article gives a gentle introduction.
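A minimal sketch of that idea, not a definitive implementation: a generator hands out one value per interval, and a background thread consumes it so the main program stays free to do other work. It uses the standard-library csv module in place of pandas to stay self-contained, and the names stream_column and start_stream and the interval parameter are made up for illustration.

```python
import csv
import threading
import time


def stream_column(path, column, interval=5.0):
    """Yield one value of `column` per `interval` seconds, starting at the first row."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row[column]
            time.sleep(interval)  # blocks only the thread running this generator


def start_stream(path, column, handle, interval=5.0):
    """Feed each value to `handle` from a background thread; return the thread."""
    def worker():
        for value in stream_column(path, column, interval):
            handle(value)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Something like start_stream(path + file_name, 'first_column_name', print) would then print one element every 5 seconds while the rest of the program keeps running.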
I'm new to the platform, this is my first message and I need your help.
I'm working on a school project where I have to analyze data. I chose to analyze the Binance stream, especially the trades. I had no problem using their web socket, I get lines.
The problem is that I get a lot of lines. In one second, for example, I can receive 10 or 15 lines, or many more.
I would like to receive only 1 line per second, for example.
I tried adding a time.sleep(1), but it doesn't work. It just "pauses" the stream, which then resumes at the line where it stopped. I want to skip some lines entirely, which is why I would like to get 1 line per second.
I use this library
https://python-binance.readthedocs.io/en/latest/websockets.html
from datetime import datetime

def handle_message(msg):
    if msg['e'] == 'error':
        print(msg['m'])
    else:
        bitcoins_exchanged = float(msg['p']) * float(msg['q'])
        timestamp = msg['T'] / 1000
        timestamp = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
        print("{} - {} - Price: {}".format(timestamp, msg['s'], msg['p']))

# bm is a BinanceSocketManager created earlier (not shown)
conn_key = bm.start_trade_socket('BTCUSDT', handle_message)
bm.start()
Thanks for your help. For information, I use Python 2.7.
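One possible way to get roughly one line per second is to drop every message that arrives less than a second after the last one that was processed, instead of sleeping. The sketch below is only an illustration under that assumption: throttled, min_interval and clock are hypothetical names, not part of python-binance, and it is written to run on both Python 2.7 and 3.

```python
import time


def throttled(handler, min_interval=1.0, clock=time.time):
    """Wrap `handler` so it runs at most once per `min_interval` seconds.

    Messages that arrive too soon after the last handled one are dropped.
    """
    state = {"last": None}  # a dict, so the closure also works on Python 2

    def wrapper(msg):
        now = clock()
        if state["last"] is not None and now - state["last"] < min_interval:
            return  # too soon: skip this message
        state["last"] = now
        handler(msg)

    return wrapper
```

The wrapped callback would be passed in place of the original one, for example bm.start_trade_socket('BTCUSDT', throttled(handle_message)). Note that error messages arriving inside the interval are dropped as well.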
To briefly explain context, I am downloading SEC prospectus data for example. After downloading I want to parse the file to extract certain data, then output the parsed dictionary to a JSON file which consists of a list of dictionaries. I would use a SQL database for output, but the research cluster admins at my university are being slow getting me access. If anyone has any suggestions for how to store the data for easy reading/writing later I would appreciate it, I was thinking about HDF5 as a possible alternative.
Here is a minimal example of what I am doing, with the spots that I think need to be improved labeled.
import json
import os
from math import ceil
from multiprocessing import Pool

import grequests
from tqdm import tqdm

# Assumed to be the same base URL as in the bash script below
BASE_URL = "https://www.sec.gov/Archives/"

def classify_file(doc):
    try:
        data = {
            'link': doc.url
        }
    except AttributeError:
        return {'flag': 'ATTRIBUTE ERROR'}
    # Do a bunch of parsing using regular expressions
    return data

if __name__=="__main__":
    items = list()
    for d in tqdm([y + ' ' + q for y in ['2019'] for q in ['1']]):
        stream = os.popen('bash ./getformurls.sh ' + d)
        stacked = stream.read().strip().split('\n')
        # split each line into the fixed-width fields
        widths=(12,62,12,12,44)
        items += [[item[sum(widths[:j]):sum(widths[:j+1])].strip() for j in range(len(widths))] for item in stacked]
    urls = [BASE_URL + item[4] for item in items]
    resp = list()
    # PROBLEM 1
    filelimit = 100
    for i in range(ceil(len(urls)/filelimit)):
        print(f'Downloading: {i*filelimit/len(urls)*100:2.0f}%... ',end='\r',flush=True)
        resp += [r for r in grequests.map((grequests.get(u) for u in urls[i*filelimit:(i+1)*filelimit]))]
    # PROBLEM 2
    with Pool() as p:
        rs = p.map_async(classify_file,resp,chunksize=20)
        rs.wait()
        prospectus = rs.get()
    # open for writing; the default read mode would make json.dump fail
    with open('prospectus_data.json', 'w') as f:
        json.dump(prospectus,f)
The getformurls.sh script referenced above is a bash script I wrote because it was faster than doing it in Python, since I could use grep. The code for it is:
#!/bin/bash
BASE_URL="https://www.sec.gov/Archives/"
INDEX="edgar/full-index/"
url="${BASE_URL}${INDEX}$1/QTR$2/form.idx"
# [AB] matches A or B; inside brackets a | would be matched literally
out=$(curl -s "${url}" | grep "^485[AB]POS")
echo "$out"
PROBLEM 1: So I am currently pulling about 18k files in the grequests map call. I was running into an error about too many files being open so I decided to split up the urls list into manageable chunks. I don't like this solution, but it works.
PROBLEM 2: This is where my actual error is. This code runs fine on a smaller set of urls (~2k) on my laptop (uses 100% of my cpu and ~20GB of RAM ~10GB for the file downloads and another ~10GB when the parsing starts), but when I take it to the larger 18k dataset using 40 cores on a research cluster it spins up to ~100GB RAM and ~3TB swap usage then crashes after parsing about 2k documents in 20 minutes via a KeyboardInterrupt from the server.
I don't really understand why the swap usage is getting so crazy, but I think I really just need help with memory management here. Is there a way to create a generator of unsent requests that will be sent when I call classify_file() on them later? Any help would be appreciated.
Generally, when you have runaway memory usage with a Pool, it's because the workers are being reused and accumulate memory with each iteration. You can occasionally close and re-open the pool to prevent this, but it's such a common issue that Python now has a built-in parameter to do it for you:
Pool(...maxtasksperchild) is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool.
There's no way for me to tell you what the right value is, but you generally want to set it low enough that resources can be freed fairly often but not so low that it slows things down. (Maybe a minute's worth of processing, just as a guess.)
with Pool(maxtasksperchild=5) as p:
    rs = p.map_async(classify_file,resp,chunksize=20)
    rs.wait()
    prospectus = rs.get()
For your first problem, you might consider just using requests and moving the call inside the worker process you already have. Pulling 18K URLs and caching all that data up front is going to take time and memory. If it's all encapsulated in the worker, you'll minimize memory usage and you won't need to keep so many file handles open.
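A rough sketch of what moving the download into the worker could look like, under several assumptions: classify_text is a hypothetical stand-in for the regex parsing, fetch_text assumes the third-party requests package, and results are written one JSON object per line (rather than as a single JSON list) so that nothing has to be held in memory until the end.

```python
import json
from multiprocessing import Pool


def fetch_text(url):
    """Download one document. Assumes the third-party `requests` package."""
    import requests  # lazy import: only needed when a real download happens
    return requests.get(url, timeout=30).text


def classify_text(url, text):
    """Hypothetical stand-in for the regex parsing done in classify_file()."""
    return {"link": url, "n_chars": len(text)}


def download_and_classify(url, fetch=fetch_text):
    """Runs inside a worker: fetch, parse, and return only the small result."""
    try:
        text = fetch(url)
    except Exception as exc:  # keep one bad URL from killing the whole run
        return {"link": url, "flag": repr(exc)}
    return classify_text(url, text)


def run(urls, out_path="prospectus_data.jsonl", processes=None):
    """Stream results to disk, one JSON object per line, as workers finish."""
    with Pool(processes, maxtasksperchild=5) as pool, open(out_path, "w") as out:
        for result in pool.imap_unordered(download_and_classify, urls, chunksize=20):
            out.write(json.dumps(result) + "\n")
```

With this layout only the URL strings and the small result dictionaries cross process boundaries; the downloaded documents never accumulate in the parent process.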
This tutorial https://www.dataquest.io/blog/python-json-tutorial/ has a 600MB file that they work with, however when I run their code
import ijson
filename = "md_traffic.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, 'meta.view.columns.item')
    columns = list(objects)
I'm running into 10+ minutes of waiting for the file to be read into ijson and I'm really confused how this is supposed to be reasonable. Shouldn't there be parsing? Am I missing something?
The main problem is not that you are creating a list after parsing (that only collects the individual results into a single structure), but that you are using the default pure-python backend provided by ijson.
There are other backends that are much faster. ijson's homepage explains how to import them. The yajl2_cffi backend is the fastest currently available, but I've created a new yajl2_c backend (there's a pull request pending acceptance) that performs even better.
On my laptop (Intel(R) Core(TM) i7-5600U), using the yajl2_cffi backend your code runs in ~1.5 minutes. Using the yajl2_c backend it runs in ~10.5 seconds (Python 3) and ~15 seconds (Python 2.7.12).
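As a hedged sketch of how a faster backend might be selected: ijson backends live in modules under ijson.backends, and which of them can be imported depends on the installed ijson version and on whether the yajl library is present. The helper names pick_backend and first_column are made up for this example.

```python
import importlib


def pick_backend(names=("yajl2_c", "yajl2_cffi", "yajl2", "python")):
    """Return the first importable ijson backend module, or None."""
    for name in names:
        try:
            return importlib.import_module("ijson.backends." + name)
        except (ImportError, OSError):
            continue  # backend (or its yajl dependency) is not available
    return None


def first_column(path, backend):
    """Parse only as far as the first item under meta.view.columns."""
    with open(path, "rb") as f:
        for item in backend.items(f, "meta.view.columns.item"):
            return item
    return None
```

Usage would be along the lines of backend = pick_backend() followed by first_column("md_traffic.json", backend), after checking that backend is not None.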
Edit: @lex-scarisbrick is of course also right in that you can quickly break out of the loop if you are only interested in the column names.
This looks like a direct copy/paste of the tutorial found here:
https://www.dataquest.io/blog/python-json-tutorial/
The reason it's taking so long is the list() around the output of the ijson.items function. This effectively forces parsing of the entire file before returning any results. Taking advantage of the ijson.items being a generator, the first result can be returned almost immediately:
import ijson
filename = "md_traffic.json"
with open(filename, 'r') as f:
    for item in ijson.items(f, 'meta.view.columns.item'):
        print(item)
        break
EDIT: The very next step in the tutorial is print(columns[0]), which is why I included printing the first item in the answer. Also, it's not clear whether the question was for Python 2 or 3, so the answer uses syntax that works in both, albeit inelegantly.
I tried running your code and killed the program after 25 minutes. So yes, 10 minutes is reasonably fast.
I have a list of about 200,000 entities, and I need to query a specific RESTful API for each of those entities, and end up with all the 200,000 entities saved in JSON format in txt files.
The naive way of doing it is to go through the list of the 200,000 entities and query them one by one, add the returned JSON to a list, and, when it's done, write it all to a text file. Something like:
import simplejson
from apiWrapper import api
from entities import listEntities #list of the 200,000 entities
a=api()
fullEntityList=[]
for entity in listEntities:
    fullEntityList.append(a.getFullEntity(entity))
with open("fullEntities.txt","w") as f:
    simplejson.dump(fullEntityList,f)
Obviously this is not reliable, as 200,000 queries to the API will take about 10 hours or so, so I guess something will cause an error before it gets to write it to the file.
I guess the right way is to write it in chunks, but not sure how to implement it. Any ideas?
Also, I cannot do this with a database.
I would recommend writing them to a SQLite database. This is the way I do it for my own tiny web spider applications, because you can query the keys quite easily and check which ones you have already retrieved. This way, your application can easily continue where it left off, in particular if some 1000 new entries get added next week.
Do design "recovery" into your application from the beginning. If there is some unexpected exception (say, a timeout due to network congestion), you don't want to restart from the beginning, but to redo only those queries you have not yet successfully retrieved. At 200,000 queries, an uptime of 99.9% means you have to expect 200 failures!
For space efficiency and performance it will likely pay off to use a compressed format, such as compressing the json with zlib before dumping it into the database blob.
SQLite is a good choice, unless your spider runs on multiple hosts at the same time. For a single application, sqlite is perfect.
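A minimal sketch of the resumable approach described above, assuming string keys and a caller-supplied fetch function (in the question that would wrap a.getFullEntity). The table layout and the names open_store, fetch_missing and load are illustrative only.

```python
import json
import sqlite3
import zlib


def open_store(path="entities.db"):
    """Open (or create) the SQLite file that holds one row per entity."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS entity (key TEXT PRIMARY KEY, body BLOB)")
    return db


def fetch_missing(db, keys, fetch):
    """Fetch only keys not stored yet; commit after each one so a crash loses little."""
    done = {row[0] for row in db.execute("SELECT key FROM entity")}
    failed = []
    for key in keys:
        if key in done:
            continue
        try:
            body = zlib.compress(json.dumps(fetch(key)).encode("utf-8"))
        except Exception:
            failed.append(key)  # left out of the table, so a rerun retries it
            continue
        db.execute("INSERT INTO entity VALUES (?, ?)", (key, body))
        db.commit()
        done.add(key)
    return failed


def load(db, key):
    """Return the stored JSON for `key` as a Python object."""
    (body,) = db.execute("SELECT body FROM entity WHERE key = ?", (key,)).fetchone()
    return json.loads(zlib.decompress(body).decode("utf-8"))
```

Running fetch_missing again after a failure only queries the keys that are still missing, which is the "continue where it left off" behaviour.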
The easy way is to open the file in 'a' (append) mode and write them one by one as they come in.
The better way is to use a job queue. This will allow you to spawn off a.getFullEntity calls into worker thread(s) and handle the results however you want when/if they come back, or schedule retries for failures, etc.
See Queue.
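Both suggestions can be combined in one rough sketch: worker threads take entities from a job queue, and the main thread appends each result to the file as one JSON object per line. This is written for Python 3, where the module is named queue (Queue in Python 2); fetch_all, n_workers and the fetch callable are assumptions for illustration, with fetch standing in for a.getFullEntity.

```python
import json
import queue
import threading


def fetch_all(entities, fetch, out_path, n_workers=4):
    """Fetch entities in worker threads; append each result as one JSON line."""
    jobs = queue.Queue()
    results = queue.Queue()
    for entity in entities:
        jobs.put(entity)

    def worker():
        while True:
            try:
                entity = jobs.get(block=False)
            except queue.Empty:
                return  # no work left
            try:
                results.put((entity, fetch(entity), None))
            except Exception as exc:
                results.put((entity, None, exc))  # report instead of dying

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    failed = []
    with open(out_path, "a") as f:
        for _ in range(len(entities)):  # every job yields exactly one result
            entity, data, error = results.get()
            if error is None:
                f.write(json.dumps(data) + "\n")
                f.flush()  # keep finished work on disk if the run dies later
            else:
                failed.append(entity)
    for t in threads:
        t.join()
    return failed
```

The returned list of failed entities can be passed to another call to schedule retries.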
I'd also use a separate Thread that does the file writing, and use a Queue to keep track of all entities. When I started off, I thought this would be done in 5 minutes, but then it turned out to be a little harder. simplejson and all other such libraries I'm aware of do not support partial writing, so you cannot first write one element of a list, later add another, etc. So I tried to solve this manually, by writing "[", "," and "]" separately to the file and then dumping each entity separately.
Without being able to check it (as I don't have your api), you could try:
import threading
import Queue  # Python 2; the module is named queue in Python 3
import simplejson
from apiWrapper import api
from entities import listEntities #list of the 200,000 entities

CHUNK_SIZE = 1000

class EntityWriter(threading.Thread):
    lines_written = False
    _filename = "fullEntities.txt"

    def __init__(self, queue):
        super(EntityWriter, self).__init__()
        self._q = queue
        self.running = False

    def run(self):
        self.running = True
        with open(self._filename,"a") as f:
            while True:
                try:
                    entity = self._q.get(block=False)
                    if not EntityWriter.lines_written:
                        # first entity overall: open the JSON list
                        EntityWriter.lines_written = True
                        f.write("[")
                        simplejson.dump(entity,f)
                    else:
                        f.write(",\n")
                        simplejson.dump(entity,f)
                except Queue.Empty:
                    break
        self.running = False

    def finish_file(self):
        # close the JSON list
        with open(self._filename,"a") as f:
            f.write("]")

a=api()
fullEntityQueue=Queue.Queue(2*CHUNK_SIZE)
n_entities = len(listEntities)
writer = None
for i, entity in enumerate(listEntities):
    fullEntityQueue.put(a.getFullEntity(entity))
    if (i+1) % CHUNK_SIZE == 0 or i == n_entities-1:
        if writer is None or not writer.running:
            writer = EntityWriter(fullEntityQueue)
            writer.start()
writer.join()
writer.finish_file()
What this script does
The main loop still iterates over your list of entities, getting the full information for each. Each entity is then put into a Queue. Every 1000 entities (and at the end of the list) an EntityWriter thread is launched that runs in parallel to the main thread. This EntityWriter takes entities from the Queue and dumps them to the desired output file.
Some additional logic is required to make the JSON a list; as mentioned above, I write "[", "," and "]" manually. The resulting file should, in principle, be understood by simplejson when you reload it.
I have some code that will need to write about 20 bytes of data every 10 seconds.
I'm on Windows 7 using python 2.7
You guys recommend any 'least strain to the os/hard drive' way to do this?
I was thinking about opening and closing the same file every 10 seconds:
f = open('log_file.txt', 'w')
f.write(information)
f.close()
Or should I keep it open and just flush() the data and not close it as often?
What about sqlite? Will it improve performance and be less intensive than the open and close file operations?
(Isn't it just a flat file database so == to text file anyways...?)
What about mysql (this uses a local server/process.. not sure the specifics on when/how it saves data to hdd) ?
I'm just worried about not frying my hard drive and about improving the performance of this logging procedure. I will be receiving new log information about every 10 seconds, and this will be going on 24/7.
Your advice?
i.e. think about programs like utorrent that need to save large amounts of data on a constant basis for long periods of time (my log file is significantly less data than what is written by such "downloader type" programs).
import random
import time

def get_data():
    letters = 'isn\'t the code obvious'
    data = ''
    for i in xrange(20):
        data += random.choice(letters)
    return data

while True:
    f = open('log_file.txt', 'w')
    f.write(get_data())
    f.close()
    time.sleep(10)
My CPU starts whining after about 15 seconds... (or is that my hdd? )
As expected, Python comes with a great tool for this: have a look at the logging module.
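A small sketch of what that could look like, assuming a plain append-only log file; the logger name "datalog" and the helper make_logger are made up for this example.

```python
import logging


def make_logger(path="log_file.txt"):
    """Return a logger that appends timestamped lines to `path`."""
    logger = logging.getLogger("datalog")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.FileHandler(path)  # opens in append mode by default
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

In the loop from the question, log = make_logger() would be created once, and each iteration would call log.info(get_data()) followed by time.sleep(10). For a process that runs around the clock, logging.handlers.RotatingFileHandler may be worth a look to keep the file from growing without bound.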
Use the logging framework. This is exactly what it is designed to do.
Edit: Balls, beaten to it :).
Don't worry about "frying" your hard drive - 20 bytes every 10 seconds is a small fraction of the data written to the disk in the normal operation of the OS.