Collect data in chunks from stdin: Python

I have the following Python code where I collect data from standard input into a list and run syntaxnet on it. The data is in the form of json objects from which I will extract the text field and feed it to syntaxnet.
data = []
for line in sys.stdin:
    data.append(line)
run_syntaxnet(data)  # This is a function
I am doing this because I do not want Syntaxnet to run for every single tweet, since that would take a very long time and hurt performance.
Also, when I run this code on very large data, I do not want to keep collecting it forever and run out of memory. So I want to collect data in chunks, maybe 10000 tweets at a time, and run Syntaxnet on them. Can someone help me with how to do this?
Also, I want to understand what the maximum length of the list data can be so that I do not run out of memory.
EDIT:
I used the code:
data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)  # This is a function
        data = []
which runs perfectly fine as long as the number of rows in the input data is a multiple of 10000. I am not sure what to do with the remaining rows.
For example, if the total number of rows is 12000, the first 10000 rows get processed as I want, but the last 2000 are left out since the condition len(data) == 10000 is never met again.
I want to do something like:
if len(data) == 10000 or 'EOF of input file is reached':
    run_syntaxnet(data)
Can someone tell me how to check for the EOF of the input? Thanks in advance!
PS: All the data fed to the Python file comes from Pig Streaming. Also, I cannot afford to actually count the number of rows in the input data and pass it as a parameter, since I have millions of rows and counting itself would take forever.

I think this is all you need:
data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)  # This is a function
        data = []
Once the list gets to 10000 entries, run the function and then reset your data list. Also, the maximum size of the list will vary from machine to machine depending on how much memory you have, so it will probably be best to try it out with different lengths and find out what is optimal.

I would gather the data into chunks and process those chunks when they get "large":
LARGE_DATA = 10
data = []
for line in sys.stdin:
    data.append(line)
    if len(data) > LARGE_DATA:
        run_syntaxnet(data)
        data = []

if data:
    run_syntaxnet(data)  # process whatever is left over after the loop
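As a variation (not from either answer above), the chunking can also be wrapped in a small generator built on itertools.islice, which hands back fixed-size lists plus a short final list automatically; run_syntaxnet is assumed to be the existing function from the question:

import sys
from itertools import islice

def chunks(iterable, size):
    # Yield lists of up to `size` items, including a short final chunk.
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

for data in chunks(sys.stdin, 10000):
    run_syntaxnet(data)  # assumed to be the existing function from the question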

Related

Any way to make this more efficient? Tenable API report call

I have a script that does a bunch of data manipulation, but it is getting bottlenecked by this function.
The length of the Tenable generator array ips is always around 1000, give or take. The length of ips[row] is 5.
Are there any improvements that I can make here to make things more efficient? I feel like this takes far longer than it should.
def get_ten(sc):
    now = time.time()
    ips = [sc.analysis.vulns(('ip', '=', ip), tool='sumseverity', sortDirection='desc')
           for ip in [x[15] for x in csv.reader(open('full.csv', 'r'))
                      if x[15] != 'PrivateIpAddress']]
    row = 0
    while row < len(ips):
        scan_data = []
        scan_count = 0
        for scan in ips[row]:
            count = scan['count']
            scan_data.append(count)
            scan_count += int(count)
        row += 1
    print(time.time() - now)
Output: 2702.747463464737
Thanks!
I would suggest you try inverting your logic.
From looking at your code, it looks like you currently:
Read a CSV to get IPs.
Call the sc.analysis API for each IP
Process the results
My best guess is that the majority of the time is taken sending out the API calls and then waiting for the results. Instead I suggest you try:
Call sc.analysis API (without filtering ip) and read the results to get all IPs (as a set?)
Read CSV and get IPs as another set.
Take the set intersection to find which of your IP addresses are vulnerable.
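A rough sketch of that approach, assuming sc.analysis.vulns can be called once without the per-IP filter and that each returned record carries an 'ip' field (both are assumptions; check your Tenable/SecurityCenter API docs for the exact call and response shape):

import csv
import time

def get_ten(sc):
    start = time.time()

    # One API call instead of ~1000: pull every vulnerability record.
    # The tool name and 'ip' field are assumptions; adjust to your API's actual response.
    vulnerable_ips = {record['ip'] for record in sc.analysis.vulns(tool='sumip')}

    # IPs of interest from the CSV (column 16, skipping the header value).
    with open('full.csv', 'r') as f:
        csv_ips = {row[15] for row in csv.reader(f) if row[15] != 'PrivateIpAddress'}

    # IPs present in both sets are the vulnerable ones we care about.
    matches = csv_ips & vulnerable_ips

    print(time.time() - start)
    return matches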

Which is faster, saving to a 2D list or CSV for millions of data?

I have over 20 million lines of data, each line with 60-200 int elements. My present method is:
with open("file.txt") as f:
for block in reading_file(f):
for line in block:
a = line.split(" ")
op_on_data(a)
where reading_file() is a function that takes around 1000 lines at a time. And op_on_data() is a function where I do some basic operations:
def op_on_data(a):
    if a[0] == "keyw":
        print 'keyw: ', a[0], a[1]
    else:
        # some_operations on arr[]
        for v in arr[splicing_here]:
            if v > 100:
                # more_operations here
                two_d_list(particular_list_location).append(v)
        for e in arr[splicing_here]:
            if e < -100:
                two_d_list_2(particular_list_location).append(e)
        sys.stdout.flush()
And in the end I save the two_d_list to a Pandas DataFrame in ONE move; I do not save in chunks. For a test dataset of around 40,000 lines I got an initial time of ~10.5 s. But when I run the whole dataset, my system crashes after a few million lines, probably because the list gets too large.
I need to know what is the best way to save the data after doing the operations. Do I keep using lists or save directly to a CSV file inside the function itself like line by line? How do I improve the speed and prevent system from crashing?
Edit: I am open to other options apart from lists and CSV.
I would try to make the code more efficient and generator based.
I see too many for loops for no good reason.
Since you are iterating through lines anyway, begin with this:
for line in open("file.txt"): # open here is a generator (avoid using your own read functions)
a = line.split(" ")
op_on_data(a)
And for the second code snippet, here are some more code review comments:
def op_on_data(a):
    if a[0] == "keyw":
        print 'keyw: ', a[0], a[1]  # Avoid printing when iterating millions of lines!
    else:
        # some_operations on arr[]
        for v in arr[splicing_here]:
            if v > 100:
                # more_operations here
                two_d_list(particular_list_location).append(v)
        for e in arr[splicing_here]:
            if e < -100:
                two_d_list_2(particular_list_location).append(e)
        sys.stdout.flush()
CODE REVIEW COMMENTS:
Do not build extra copies of big arrays just to loop over them; use lazy iterators/generators where you can, e.g.:
my_iter_arr = iter(arr)  # a plain iterator over arr, no copy made (itertools.cycle would repeat forever here)
for v in my_iter_arr:
    # do something
Try to combine the 2 for loops into 1. I cannot see the reason why you are using 2 for loops; try:
for v in my_iter_arr:
    if v > 100:
        two_d_list(particular_list_location).append(v)
    elif v < -100:
        two_d_list_2(particular_list_location).append(v)
And the worst part is appending millions of elements to a list held in RAM. Avoid letting two_d_list(particular_list_location).append(v) grow without bound (I am not sure how efficient two_d_list is); instead, once the list reaches X elements, dump them to a file and continue appending to a fresh list.
Try reading about Python Generators / Lazy iteration
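To make that concrete, here is a minimal sketch of the "dump when the buffer gets big" idea, flushing to a CSV file; FLUSH_EVERY, flush_to_disk and the filtering rule are illustrative placeholders, not names from the original post:

import csv

FLUSH_EVERY = 100000  # tune to your memory budget

def flush_to_disk(rows, path="output.csv"):
    # Append the buffered rows to a CSV file; the caller clears the buffer.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)

buffer_rows = []
with open("file.txt") as f:
    for line in f:
        parts = line.split()
        if not parts or parts[0] == "keyw":
            continue  # placeholder: skip header-like lines from the question
        values = [int(v) for v in parts]
        kept = [v for v in values if abs(v) > 100]  # placeholder for the real filtering
        if kept:
            buffer_rows.append(kept)
        if len(buffer_rows) >= FLUSH_EVERY:
            flush_to_disk(buffer_rows)
            buffer_rows = []

if buffer_rows:  # whatever is left at the end
    flush_to_disk(buffer_rows)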

Python Pickle acting differently when iterating through large number of files vs small number

I am trying to iterate through dictionaries using pickle, and it seems to work fine for individual files and for small batches of files, but when I do more, something wonky happens. Here is the code I have:
def search_dictionary():
    with open('full_set_name.txt', 'r') as file:
        set_name = []
        for f in file:
            set_name.append(f.strip('\n'))
    count = 0
    while count < 24:
        pickle_in = open("{0}.pickle".format(set_name[count]), 'rb')
        card_d = pickle.load(pickle_in)
        print(set_name[count], card_d)
        count += 1
When I run this code with while count less than 24, each dictionary prints out as expected, in the order of the list. When I do 24 or more, contents from set_name appear at the top of the console instead of a dictionary, with each name from the list appearing hundreds of times.
I ran each of these files independently and each dictionary printed out just fine. If I run while count < len(set_name), which covers all the dictionaries, I get the error:
card_d = pickle.load(pickle_in)
EOFError: Ran out of input
From what I've read, that happens when you try to read an empty pickle file, but each pickle definitely contains a dictionary. Is there something wrong with my syntax? Or maybe a file got corrupted? How would I tell? The contents look fine when I print them individually. I did use a dictionary comprehension to change the keys of each dictionary to lowercase. Is it possible it was corrupted that way? Here is the code I ran:
count = 0
with open('full_set_name.txt') as sets:
    set_list = []
    for s in sets:
        set_list.append(s.rstrip())
while count < len(set_list):
    pickle_in = open('{0}.pickle'.format(set_list[198]), 'rb')
    file = pickle.load(pickle_in)
    file = {k.lower(): v for k, v in file.items()}
    pickle.dump(file, open('{0}.pickle'.format(set_list[198]), 'wb'))
    print(file)
    count += 1
What do you guys think? I'm sure I probably did something dumb, but I cannot figure out the actual problem. I have a picture to show what the first line looks like: it should be a dictionary, but it's actually the contents of the set_name list repeated hundreds of times. Couldn't paste it. Thanks for taking a look.
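For reference, a minimal sketch of that rewrite loop using context managers and the loop variable consistently (rather than a fixed index like 198, which in the code above rereads and rewrites the same file on every pass); this is illustrative only, not a confirmed diagnosis of the corruption:

import pickle

with open('full_set_name.txt') as sets:
    set_list = [s.rstrip() for s in sets]

for name in set_list:
    path = '{0}.pickle'.format(name)
    with open(path, 'rb') as pickle_in:
        card_d = pickle.load(pickle_in)
    card_d = {k.lower(): v for k, v in card_d.items()}
    with open(path, 'wb') as pickle_out:  # rewrite the same file it was read from
        pickle.dump(card_d, pickle_out)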

Python - process a chunk of lines in a file

I have a file containing x number of values, each on its own line.
I need to be able to take n values from this file, put them into an array, pass that array to a new process, clear the array, and then take another n values from the file to give to the next process.
The problem I'm having is when x is a value like 12 and I'm trying to give, let's say, 10 values to each process.
The first process will get its first 10 values no problem, but I'm having trouble giving the remaining 2 to the last process.
The problem would also arise if, let's say, you tell the program to give each process 10 values from the file, but the file only has 1, or even 9, values.
I need to know when I'm at the last set of values that is smaller than n.
I want to avoid taking every value in the file and storing it in an array all at once, since I could run into memory problems if there were millions of values in that file.
Here's an example of what I've tried to do:
chunk = 10
value_list = []
with open('file.txt', 'r') as f:
    for value in f:
        value_list.append(value)
        if len(value_list) >= chunk:
            print 'Got %d' % len(value_list)
            value_list = []  # Clear the list
            # Put array into new process
This will catch every 10 in this example, but it won't work if there happened to be fewer than 10 in the file to begin with.
What I typically do in this situation is just handle the last (short) array after the for loop. For example,
chunk = 10
value_list = []
with open('file.txt', 'r') as f:
    for value in f:
        if len(value_list) >= chunk:
            print 'Got %d' % len(value_list)
            value_list = []  # Clear the list
            # Put array into new process
        value_list.append(value)

# send leftovers to new process
if value_list:
    print 'Got %d' % len(value_list)
    # Put final array into new process
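If the goal really is to hand each chunk to a separate process, a worker pool can consume a chunk generator directly. This is a sketch under the assumption that process_chunk stands in for whatever work each process should do (it is not a function from the question):

from itertools import islice
from multiprocessing import Pool

def read_chunks(path, size):
    # Yield lists of up to `size` lines, including a short final chunk.
    with open(path, 'r') as f:
        while True:
            chunk = [line.strip() for line in islice(f, size)]
            if not chunk:
                return
            yield chunk

def process_chunk(values):
    # Placeholder for the real per-chunk work.
    return len(values)

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        for result in pool.imap(process_chunk, read_chunks('file.txt', 10)):
            print('Processed chunk of %d values' % result)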

Cassandra buffered read of millions of columns

I've got a cassandra cluster with a small number of rows (< 100). Each row has about 2 million columns. I need to get a full row (all 2 million columns), but things start failing all over the place before I can finish my read. I'd like to do some kind of buffered read.
Ideally I'd like to do something like this using Pycassa (no this isn't the proper way to call get, it's just so you can get the idea):
results = {}
start = 0
while True:
    # Fetch blocks of size 500
    buffer = column_family.get(key, column_offset=start, column_count=500)
    if len(buffer) == 0:
        break
    # Merge these results into the main one
    results.update(buffer)
    # Update the offset
    start += len(buffer)
Pycassa (and by extension Cassandra) doesn't let you do this. Instead you need to specify a column name for column_start and column_finish. This is a problem since I don't actually know what the start or end column names will be. The special value "" can indicate the start or end of the row, but that doesn't work for any of the values in the middle.
So how can I accomplish a buffered read of all the columns in a single row? Thanks.
From the pycassa 1.0.8 documentation, it would appear that you could use something like the following [pseudocode]:
results = {}
startColumn = ""
while True:
    # Fetch a block of columns (here 100 at a time)
    buffer = get(key, column_start=startColumn, column_finish="", column_count=100)
    # iterate the returned values
    # set startColumn to the last column name returned
Remember that on each subsequent call you only get 99 new results, because the call also returns startColumn, which you've already seen. I'm not skilled enough in Python yet to iterate on buffer to extract the column names.
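Filling in that gap, here is a sketch of the paging loop; it assumes pycassa's ColumnFamily.get returns an ordered mapping of column name to value, so the last key of each batch can seed column_start for the next call (column_family and key are as in the pseudocode above):

results = {}
start_column = ""
batch_size = 500

while True:
    buffer = column_family.get(key,
                               column_start=start_column,
                               column_finish="",
                               column_count=batch_size)
    # Drop the overlapping first column on every call after the first.
    if start_column:
        buffer.pop(start_column, None)
    if not buffer:
        break
    results.update(buffer)
    # Columns come back in order, so the last key seeds the next page.
    start_column = list(buffer.keys())[-1]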
In v1.7.1+ of pycassa you can use xget and get a row as wide as 2**63-1 columns.
for col in cf.xget(key, column_count=2**63-1):
    # do something with the column
