Python websocket - How to let some data flow? - python

I'm new to the platform, this is my first message and I need your help.
I'm working on a school project where I have to analyze data. I chose to analyze the Binance stream, especially the trades. I had no problem using their websocket; I receive the lines of trade data.
The problem is that I get a lot of lines: in a single second I can receive 10, 15 lines, or many more.
I would like to process only one line per second, for example.
I tried adding a time.sleep(1), but it doesn't work: it just "pauses" the stream and then resumes at the line where it stopped. What I actually want is to skip some of the lines, which is why I would like to keep only one line per second.
I use this library
https://python-binance.readthedocs.io/en/latest/websockets.html
from datetime import datetime

def handle_message(msg):
    if msg['e'] == 'error':
        print(msg['m'])
    else:
        bitcoins_exchanged = float(msg['p']) * float(msg['q'])
        timestamp = msg['T'] / 1000
        timestamp = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
        print("{} - {} - Price: {}".format(timestamp, msg['s'], msg['p']))

# bm is the BinanceSocketManager described in the linked docs
conn_key = bm.start_trade_socket('BTCUSDT', handle_message)
bm.start()
Thanks for your help. For information, I use Python 2.7.
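One way to get that behaviour, sketched here under the assumption that the callback shape stays the same, is to remember when the last trade was processed and return early from handle_message for anything that arrives less than a second later:
import time

last_processed = [0.0]   # a list so the callback can update it (Python 2 has no nonlocal)

def handle_message(msg):
    if msg['e'] == 'error':
        print(msg['m'])
        return
    now = time.time()
    if now - last_processed[0] < 1.0:
        return           # skip this trade: less than a second since the last one kept
    last_processed[0] = now
    # ... process the kept trade exactly as before ...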

Related

How to read from a file every 5 seconds in Python?

I am trying to make my program read from a file every 5 seconds until it reaches its end.
I have this code:
import pandas as pd

data = pd.read_csv(path + file_name)
n = len(data.index)
for i in range(0, n):
    element = data['first_column_name'][i]
I tried writing time.sleep(5) after reading each element, but it has to happen from the very beginning, so that it looks like streaming data... if that's possible.
How can I make it read an element from the file every 5 seconds?
Python is by default single threaded, meaning that it only executes one thing at a time. When you use time.sleep, Python can do nothing else but watch that timer count down. You seem to want your program to do other work and periodically check on a file. What I think you are looking for is async/multithreading. This is a big topic with lots of different options for different circumstances; this Real Python article gives a gentle introduction.
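For the simple streaming case described in the question, a background thread that sleeps between rows is usually enough. A minimal sketch, assuming the column name from the question and a placeholder file name:
import threading
import time
import pandas as pd

def stream_rows(frame, column, interval=5):
    # Walk the rows one by one, pausing between them to simulate a stream.
    for i in range(len(frame.index)):
        print(frame[column][i])
        time.sleep(interval)

data = pd.read_csv('data.csv')   # placeholder file name
worker = threading.Thread(target=stream_rows, args=(data, 'first_column_name'))
worker.daemon = True             # don't keep the interpreter alive just for this
worker.start()

# The main thread is free to do other work here; join() simply waits for the stream to finish.
worker.join()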

python - How to use file as a Queue

I need to use a file as a queue, but I don't know how to start (any other approach is welcome too). I have a non-secure transmission between my device and a computer, and I need all the data to be saved until it is sent and successfully received. The DATA is a list which always holds the same type and number of elements. I imagine the file structure to be something like this:
FILE
DATA 0 <- send_pointer
DATA 1
DATA 2
DATA 3
<- new_item
So the code will look like:
while True:
    DATA = data_gather()
    FILE.write(DATA, new_item)
    new_item += 1
    x = FILE.read(send_pointer)
    if send_function(x):
        FILE.delete(send_pointer)
        send_pointer += 1
    else:
        print('error sending x')
I hope you understand my issue; my English is not the best.
EDIT
I installed this module: https://pypi.python.org/pypi/pqueue/0.1.1
But I don't know how to use it well. I can't find a way to delete the data I have already read from the file.
Thanks!
EDIT 2
Solved with pqueue.
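For reference, a minimal sketch of how that might look, assuming pqueue's Queue exposes the same put/get/task_done interface as the standard library Queue (which is what its documentation describes); the storage directory name is a placeholder:
from pqueue import Queue

q = Queue('queue_storage')   # items are persisted to disk in this directory

# Producer side: everything gathered is written to the on-disk queue first.
q.put(data_gather())

# Consumer side: read an item, try to send it, and only acknowledge it once
# the transmission succeeded.  Items never marked with task_done() are
# restored the next time the queue is opened, so nothing is lost on a crash.
item = q.get()
if send_function(item):
    q.task_done()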
#!/usr/bin/python
import time

offset = 0
while True:
    infile = open("./log.txt")
    infile.seek(offset)
    for line in infile:
        print line  # do something with the line
    offset = infile.tell()
    infile.close()
    time.sleep(10)
Only updates to log.txt are printed using this method

Optimizing web-scraper python loop

I'm scraping articles from a news site on behalf of the owner. I have to keep it to <= 5 requests per second, or ~100k articles in 6 hrs (overnight), but I'm getting ~30k at best.
Using a Jupyter notebook, it runs fine at first, but becomes less and less responsive. After 6 hrs, the kernel is normally un-interruptable and I have to restart it. Since I'm storing each article in memory, this is a problem.
So my question is: is there a more efficient way to do this to reach ~100k articles in 6 hours?
The code is below. For each valid URL in a Pandas dataframe column, the loop:
downloads the webpage
extracts the relevant text
cleans out some encoding garbage from the text
writes that text to another dataframe column
every 2000 articles, it saves the dataframe to a CSV (overwriting the last backup), to handle the eventual crash of the script.
Some ideas I've considered:
Write each article to a local SQL server instead of in-mem (speed concerns?)
save each article text in a csv with its url, then build a dataframe later
delete all "print()" calls and rely solely on logging (my logger config doesn't seem to perform well, though; I'm not sure it's logging everything I tell it to)
i = 0
# lots of NaNs in the column, hence the subsetting
for u in unique_urls[unique_urls['unique_suffixes'].isnull() == False]\
        .unique_suffixes.values[:]:
    i = i + 1
    if pd.isnull(u):
        continue
    # save our progress every 2k articles just in case
    if i % 2000 == 0:
        unique_urls.to_csv('/backup-article-txt.csv', encoding='utf-8')
    try:
        # pull the data
        html_r = requests.get(u).text
        # the phrase "TX:" indicates start of article
        # text, so if it's not present, URL must have been bad
        if html_r.find("TX:") == -1:
            continue
        # capture just the text of the article
        txt = html_r[html_r.find("TX:")+5:]
        # fix encoding/formatting quirks
        txt = txt.replace('\n', ' ')
        txt = txt.replace('[^\x00-\x7F]', '')  # note: str.replace treats this as a literal string, not a regex
        # wait 200 ms to spare site's servers
        time.sleep(.2)
        # write our article to our dataframe
        unique_urls.loc[unique_urls.unique_suffixes == u, 'article_text'] = txt
        logging.info("done with url # %s -- %s remaining", i, (total_links - i))
        print "done with url # " + str(i)
        print total_links - i
    except:
        logging.exception("Exception on article # %s, URL: %s", i, u)
        print "ERROR with url # " + str(i)
        continue
This is the logging config I'm using. I found it on SO, but w/ this particular script it doesn't seem to capture everything.
logTime = "{:%d %b-%X}".format(datetime.datetime.now())
logger = logging.getLogger()
fhandler = logging.FileHandler(filename=logTime + '.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)
eta: some details in response to answers/comments:
script is only thing running on a 16 GB/ram EC2 instance
articles are ~100-800 words apiece
I'm going to take an educated guess and say that your script turns your machine into a swap storm as you get around 30k articles, according to your description. I don't see anything in your code where you could easily free up memory using:
some_large_container = None
Setting something that you know has a large allocation to None tells Python's memory manager that it's available for garbage collection. You also might want to explicitly call gc.collect(), but I'm not sure that would do you much good.
Alternatives you could consider:
sqlite3: Instead of a remote SQL database, use sqlite3 as intermediate storage. There is a Python module for it in the standard library.
Keep appending to the CSV checkpoint file.
Compress your strings with zlib.compress().
Whichever way you decide to go, you're probably best off doing the collection as phase 1 and constructing the Pandas dataframe as phase 2. It never pays to be clever by half; the other half tends to hang you.
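A minimal sketch of the sqlite3 suggestion, with a made-up database file and table name: each article is written to disk as soon as it is scraped (phase 1), and the dataframe is built from the database afterwards (phase 2).
import sqlite3
import pandas as pd

# Phase 1: append each article to disk as soon as it is scraped.
conn = sqlite3.connect('articles.db')   # placeholder file name
conn.execute('CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, txt TEXT)')

def store_article(url, txt):
    conn.execute('INSERT OR REPLACE INTO articles (url, txt) VALUES (?, ?)', (url, txt))
    conn.commit()   # nothing accumulates in memory between articles

# Phase 2 (later, even in a fresh process): build the dataframe in one go.
df = pd.read_sql_query('SELECT url, txt FROM articles', conn)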

How to read .cap files, other than with Pyshark, in a way that is faster than Scapy's rdpcap()?

I have been looking for a way to get 802.11 Packets from a .cap file into an Array. So far I have found:
Scapy:
which is kind of nice and has documentation available, but it is too slow: when I try to open a file larger than 40 MB, it just keeps hanging until it consumes all my RAM (all 16 GB of it), at which point my PC freezes and I have to reboot it
Pyshark:
doesn't have any of Scapy's problems, but the documentation is scarce and I can't find a way to handle 802.11 packets and read their attributes
So I was thinking maybe there are better solutions out there, or maybe someone has some experience with pyshark?
from scapy.all import *
import pyshark
from collections import defaultdict
import sys
import math
import numpy as np

ipcounter = 0
Stats = np.zeros((14))
filename = 'cap.cap'

a = rdpcap(filename)
print len(a)

for p in a:
    pkt = p.payload
    # Management packets
    if p.haslayer(Dot11) and p.type == 0:
        ipcounter = ipcounter + 1
        Stats[p.subtype] = Stats[p.subtype] + 1

print Stats
Note: when I launch the program with a 10 MB input (for instance), it takes about 20 seconds or so, but it does work. I wonder why that is, why it is so different from pyshark, and what kind of computation it is doing.
You can patch scapy's utils.py file so that it won't load everything into memory.
Change:
def read_all(self, count=-1):
    """return a list of all packets in the pcap file
    """
    res = []
    while count != 0:
        count -= 1
        p = self.read_packet()
        if p is None:
            break
        res.append(p)
    return res
to:
def read_all(self, count=-1):
    """return an iterable of all packets in the pcap file
    """
    while count != 0:
        count -= 1
        p = self.read_packet()
        if p is None:
            break
        yield p
    return
Credit goes to: http://comments.gmane.org/gmane.comp.security.scapy.general/4462 (but the link is now dead).
Scapy will load all the packets into your memory and create a PacketList instance.
I think there are two solutions to your problem.
Capture packets with a filter. In my work, I have never captured more than 2 MB of packets, since I only capture on one wireless channel at a time.
Divide the huge packet file into several smaller parts and then deal with them (see the sketch below).
Hope it helps.
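A rough sketch of the second suggestion, assuming scapy's PcapReader and wrpcap (the file names and chunk size are made up): read the capture one packet at a time and write it back out in smaller pieces.
from scapy.all import PcapReader, wrpcap

CHUNK_SIZE = 10000   # packets per output file; pick whatever fits in memory
chunk, part = [], 0

with PcapReader('huge.cap') as reader:   # reads one packet at a time
    for pkt in reader:
        chunk.append(pkt)
        if len(chunk) >= CHUNK_SIZE:
            wrpcap('part-%d.cap' % part, chunk)
            chunk, part = [], part + 1

if chunk:   # write out the final partial chunk
    wrpcap('part-%d.cap' % part, chunk)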
If pyshark suits your needs, you can use it like so:
cap = pyshark.FileCapture('/tmp/mycap.cap')
for packet in cap:
    my_layer = packet.layer_name  # or packet['layer name'] or packet[layer_index]
To see which layers are available and what attributes they have, just print them (or use layer.pretty_print() / packet.pretty_print()), use autocomplete, or look at packet.layer._all_fields.
For instance packet.udp.srcport.
What is missing in the documentation?
Note that you can also apply a filter as an argument to the FileCapture instance (either a display filter or a BPF filter, see the docs).
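For example, a display filter can be passed when the capture is opened; the filter expression here (802.11 management frames) is just an illustration:
# Keep only 802.11 management frames while reading the file.
cap = pyshark.FileCapture('/tmp/mycap.cap', display_filter='wlan.fc.type == 0')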
with PcapReader('filename.pcapng') as pcap_reader:
    for pkt in pcap_reader:
        # do something with the packet
        ...
This works well! PcapReader is to rdpcap() what xrange() is to range(): it yields packets one at a time instead of building the whole list in memory.
Have you tried dpkt? It has a nice Reader interface which seems to lazy-load packets (I have loaded 100MB+ pcap files with it, no problem).
Sample:
from dpkt.pcap import Reader

with open(...) as f:
    for pkt in Reader(f):
        ...
Thanks to @KimiNewt. After spending some time with the pyshark source code, I got some understanding of its nuts and bolts.
PS: opening a 450 MB file using pyshark doesn't take any time at all, and the data access is fairly easy. I don't see any downsides to using it at the moment, but I will try to keep this post up to date as I advance in my project.
This is sample code for 802.11 packet parsing using pyshark; I hope it will help those working on similar projects.
import pyshark

# Opening the cap file
filename = 'data-cap-01.cap'
cap = pyshark.FileCapture(filename)

# Getting a list of all fields of this packet at the level of this specific layer,
# looking something like this: ['fc_frag', 'fc_type_subtype', ..., 'fc_type']
print cap[0]['WLAN']._field_names

# Getting the value of a specific field, the packet type in this case
# (Control, Management or Data), which is represented by an integer (0, 1, 2)
print cap[0]['WLAN'].get_field_value('fc_type')
I will be later on working on packet decryption for WEP and WPA and getting 3rd layer headers, so I might add that too.

Advice on writing to a log file with python

I have some code that will need to write about 20 bytes of data every 10 seconds.
I'm on Windows 7 using python 2.7
Do you guys recommend any 'least strain on the OS/hard drive' way to do this?
I was thinking about opening and closing the same file every 10 seconds:
f = open('log_file.txt', 'w')
f.write(information)
f.close()
Or should I keep it open and just flush() the data and not close it as often?
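For reference, the keep-it-open variant described above would look roughly like this; flush() pushes the data to the OS, and os.fsync() (optional) forces it onto the disk:
import os
import time

f = open('log_file.txt', 'a')
while True:
    f.write(information)   # 'information' stands in for the 20 bytes of log data
    f.flush()              # hand the bytes to the OS
    os.fsync(f.fileno())   # optional: force them onto the physical disk
    time.sleep(10)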
What about sqlite? Will it improve performance and be less intensive than the open and close file operations?
(Isn't it just a flat-file database, so essentially the same as a text file anyway...?)
What about MySQL (this uses a local server/process; I'm not sure of the specifics on when/how it saves data to the HDD)?
I'm just worried about not frying my hard drive and about the performance of this logging procedure. I will be receiving new log information about every 10 seconds, and this will be going on 24/7.
Your advice?
i.e.: think about programs like uTorrent that need to save large amounts of data on a constant basis for long periods of time (my log file is significantly less data than what such "downloader-type programs" like uTorrent write).
import random
import time

def get_data():
    letters = 'isn\'t the code obvious'
    data = ''
    for i in xrange(20):
        data += random.choice(letters)
    return data

while True:
    f = open('log_file.txt', 'w')
    f.write(get_data())
    f.close()
    time.sleep(10)
My CPU starts whining after about 15 seconds... (or is that my hdd? )
As expected, Python comes with a great tool for this; have a look at the logging module.
Use the logging framework. This is exactly what it is designed to do.
Edit: Balls, beaten to it :).
Don't worry about "frying" your hard drive - 20 bytes every 10 seconds is a small fraction of the data written to the disk in the normal operation of the OS.
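A minimal sketch of the logging suggestion, assuming a rotating file handler is acceptable (the file name and size limits are made up); the handler keeps the file open between writes, so there is no repeated open/close:
import logging
import logging.handlers
import time

logger = logging.getLogger('mylogger')
logger.setLevel(logging.INFO)

# Rotate at ~1 MB and keep a few backups so the file never grows without bound.
handler = logging.handlers.RotatingFileHandler('log_file.txt',
                                               maxBytes=1024 * 1024,
                                               backupCount=3)
handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
logger.addHandler(handler)

while True:
    logger.info(get_data())   # get_data() is the asker's 20-byte payload generator
    time.sleep(10)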
