Kernel dies with strange behavior - python

I have a simple function exhibiting strange behavior. I already searched for explanations, but couldn't find any.
def myfunc(frame):
    lol = []
    for i in range(frame.shape[0]):
        if frame.iloc[i,3] == 3:
            lol.append(frame.iloc[i,7])
    return np.asarray(lol,dtype=np.int32)

print('before')
x = myfunc(x)
print('after')
The result of the above code is
before
Kernel died
def myfunc(frame):
    lol = []
    for i in range(frame.shape[0]):
        if frame.iloc[i,3] == 3:
            lol.append(frame.iloc[i,7])
    print('myfunc')
    return np.asarray(lol,dtype=np.int32)

print('before')
x = myfunc(x)
print('after')
However, simply adding a single print statement inside the function gives
before
myfunc
after
Kernel died
The print statement is the only difference, and I've tested this maybe 50 times. Setting aside the kernel dying itself, I have no idea why this is happening. I would appreciate any insight.

I tried to reproduce the error with sample data, and it worked fine for me.
You can try the same with your data in a Colab environment. If the error doesn't appear there, the issue is probably one of the following.
Your data is too large to fit into memory. You can try specifying engine='python' while loading the CSV file: df = pd.read_csv('input.csv', engine='python').
Alternatively, you can split the dataframe into multiple dataframes and merge them later.
You can also use TensorFlow's tf.data to load CSV files and build input pipelines, which is more efficient for large files (a minimal sketch follows).
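For the tf.data suggestion, here is a minimal sketch of streaming a CSV in batches instead of loading it all at once; the file name and batch size are placeholders, not taken from the question:

import tensorflow as tf

# 'input.csv' and batch_size are placeholders; adjust to your file and memory budget
dataset = tf.data.experimental.make_csv_dataset(
    'input.csv',
    batch_size=1024,   # rows are streamed in batches instead of loaded at once
    num_epochs=1,
    shuffle=False)

for batch in dataset.take(1):
    # each batch is an OrderedDict mapping column name -> tensor of shape (batch_size,)
    print({name: values.shape for name, values in batch.items()})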
Below is the working code with sample data.
import tensorflow as tf
import pandas as pd
import numpy as np

df = pd.read_csv("/content/sample_data/mnist_train_small.csv")

def myfunc(frame):
    lol = []
    for i in range(frame.shape[0]):
        if frame.iloc[i,3] == 3:
            lol.append(frame.iloc[i,7])
    return np.asarray(lol,dtype=np.int32)

print('before')
x = myfunc(df)
print('after')
Result:
before
after
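As a side note, the row-by-row .iloc loop can usually be replaced with a vectorized selection, which is both faster and lighter on memory for large frames. A sketch, assuming the same column positions (3 and 7) as in the question:

import numpy as np

def myfunc_vectorized(frame):
    # build a boolean mask over column 3, then take column 7 where it is True
    mask = frame.iloc[:, 3] == 3
    return frame.iloc[:, 7][mask].to_numpy(dtype=np.int32)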

Attempt to get argmax of an empty sequence

When I try to execute this code, it shows an 'attempt to get argmax of an empty sequence' error.
Code:
import os
import re
import numpy as np
output_directory = './fine_tuned_model'
lst = os.listdir(model_dir)
lst = [l for l in lst if 'model.ckpt-' in l and '.meta' in l]
steps = np.array([int(re.findall(r'\d+', l)[0]) for l in lst])
last_model = lst[steps.argmax()].replace('.meta', '')
last_model_path = os.path.join(model_dir, last_model)
print(last_model_path)

!python /content/models/research/object_detection/export_inference_graph.py \
    --input_type=image_tensor \
    --pipeline_config_path={pipeline_fname} \
    --output_directory={output_directory} \
    --trained_checkpoint_prefix={last_model_path}
I think the error is saying exactly what is happening: you are creating steps with no data in it, so argmax() won't run. Perhaps you need to adjust how the checkpoint files are loaded so that steps actually gets data. It's hard to say based on the info provided.
Simplified Example to Demonstrate Issue:
steps = np.array([1,2,3])
#Works fine with data
print(steps.argmax())
#argmax() throws an error since the array is empty
emptySteps = np.delete(steps,[0,1,2])
print(emptySteps.argmax())
Possible Workaround
It appears you are searching a directory. If there are no matching files available, you probably don't want an error; you can achieve that with a simple check before calling argmax() to see whether there are files to process:
if steps.size > 0:
    print("Do something with argmax()")
else:
    print("No data in steps array")

Dataframe not updating through multiprocessing; Python keeps running even if finished

I am new to multiprocessing, and I am using that library in Python to parallelize the computation of a parameter for the rows of a dataframe.
The idea is the following:
I have two functions, g for the actual computation and f for filling the dataframe with the computed values. I call the function f with pool.apply_async. The problem is that at the end of the apply_async calls the dataframe has not been updated, even though a print inside f clearly shows that it is computing the values correctly. So I thought of saving the results to a CSV file inside the f function, as shown in my pseudo code below. However, what I get is that the CSV file where I save the results stops being updated after 2 values, and the kernel keeps running even though the terminal shows that the script has computed all the values.
This is my pseudo code:
def g(path_to_image1, path_to_image2):
    # vectorize images
    # do the computation
    return value  # value is a float

def f(row, index):
    value = g(row.image1, row.image2)
    df.at[index, 'value'] = value
    df.to_csv('dftest.csv')
    return df

def callbackf(result):
    global results
    results.append(result)
Inside the main:
results = []
pool = mp.Pool(N_CORES)
for index, row in df.iterrows():
    pool.apply_async(f,
                     args=(row, index),
                     callback=callbackf)
I tried using with get_context("spawn").Pool() as pool inside the main, as suggested by https://pythonspeed.com/articles/python-multiprocessing/, but it didn't solve my problem. What am I doing wrong? Is it possible that vectorizing the images for each row causes problems for the multiprocessing?
In the end I saved the results to a txt file instead of a CSV, and it worked. I don't know why it didn't work with the CSV, though.
Here's the code I put in place of the CSV and pickle lines:
with open('results.txt', 'a') as f:
    f.write(image1 +
            '\t' + image2 +
            '\t' + str(value) +
            '\n')
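For reference, here is a minimal, self-contained sketch of the apply_async/callback pattern in which the workers only return (index, value) pairs and the parent process assigns them and writes the file once at the end. g here is a stand-in for the real image computation, and the file names are made up:

import multiprocessing as mp
import pandas as pd

def g(image1, image2):
    # stand-in for the real per-row computation
    return float(len(image1) + len(image2))

def f(row, index):
    # workers do not touch the parent's dataframe; they just return a result
    return index, g(row.image1, row.image2)

if __name__ == '__main__':
    df = pd.DataFrame({'image1': ['a.png', 'b.png'],
                       'image2': ['c.png', 'd.png']})
    results = []
    pool = mp.Pool(2)
    for index, row in df.iterrows():
        pool.apply_async(f, args=(row, index), callback=results.append)
    pool.close()
    pool.join()

    # assign the computed values in the parent process, then save once
    for index, value in results:
        df.at[index, 'value'] = value
    df.to_csv('dftest.csv')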

Sharing large objects in multiprocessing pools

I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
import numpy as np
import pandas as pd
from multiprocessing import Pool

def func(i):
    return i*2

def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']
    return pd.Series([func(i) for i in values])

N = 10000
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10,10))

pool = Pool(cores)  # cores = number of worker processes
gen = ({'values' : i, 'df' : df}
       for i in data_split)
data = pd.concat(pool.map(par_func_dict, gen), axis=0)
pool.close()
pool.join()
I'm wondering if there's a way I can avoid feeding the generator with copies of the dataframe, to keep it from taking up so much memory.
The answer to the question linked above suggests using multiprocessing.Process(), but from what I can tell it's difficult to use that with functions that return things (you need to incorporate signals/events), and the comments indicate that each process still ends up using a large amount of memory.
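One commonly suggested workaround (not from the linked answer) is to hand the dataframe to each worker once, via the pool's initializer, rather than packing it into every chunk, so it is transferred once per worker instead of once per task. A sketch under that assumption:

import numpy as np
import pandas as pd
from multiprocessing import Pool

_df = None  # per-worker global, set once by the initializer

def init_worker(shared_df):
    # runs once in each worker process; the dataframe is sent once per
    # worker instead of once per submitted chunk
    global _df
    _df = shared_df

def par_func(values):
    # _df is available here read-only; this toy function only uses `values`
    return pd.Series([i * 2 for i in values])

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randn(10, 10))
    data_split = np.array_split(list(range(10000)), 3)
    with Pool(3, initializer=init_worker, initargs=(df,)) as pool:
        data = pd.concat(pool.map(par_func, data_split), axis=0)
    print(data.shape)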

Python - Reducing Import and Parse Time for Large CSV Files

My first post:
Before beginning, I should note I am relatively new to OOP, though I have done DB/stat work in SAS, R, etc., so my question may not be well posed: please let me know if I need to clarify anything.
My question:
I am attempting to import and parse large CSV files (~6MM rows, with larger ones likely to come). The two limitations that I've run into repeatedly have been runtime and memory (I'm on a 32-bit implementation of Python). Below is a simplified version of my neophyte (nth) attempt at importing and parsing in reasonable time. How can I speed up this process? Due to the memory limitations, I am splitting the file as I import it and performing interim summaries, using pandas for the summarization:
Parsing and Summarization:
def ParseInts(inString):
    try:
        return int(inString)
    except:
        return None

def TextToYearMo(inString):
    try:
        return 100*int(inString[0:4])+int(inString[5:7])
    except:
        return 100*int(inString[0:4])+int(inString[5:6])

def ParseAllElements(elmValue,elmPos):
    if elmPos in [0,2,5]:
        return elmValue
    elif elmPos == 3:
        return TextToYearMo(elmValue)
    else:
        if elmPos == 18:
            return ParseInts(elmValue.strip('\n'))
        else:
            return ParseInts(elmValue)

def MakeAndSumList(inList):
    df = pd.DataFrame(inList, columns = ['x1','x2','x3','x4','x5',
                                         'x6','x7','x8','x9','x10',
                                         'x11','x12','x13','x14'])
    return df[['x1','x2','x3','x4','x5',
               'x6','x7','x8','x9','x10',
               'x11','x12','x13','x14']].groupby(
               ['x1','x2','x3','x4','x5']).sum().reset_index()
Function Calls:
def ParsedSummary(longString,delimtr,rowNum):
    keepColumns = [0,3,2,5,10,9,11,12,13,14,15,16,17,18]
    #Do some other stuff that takes very little time
    return [pse.ParseAllElements(longString.split(delimtr)[i],i) for i in keepColumns]

def CSVToList(fileName, delimtr=','):
    with open(fileName) as f:
        enumFile = enumerate(f)
        listEnumFile = set(enumFile)
    # enumerate() was consumed by set(), so count rows from the materialized set
    lineCount = len(listEnumFile) - 1
    maxSplit = math.floor(lineCount / 10) + 1
    counter = 0
    Summary = pd.DataFrame({}, columns = ['x1','x2','x3','x4','x5',
                                          'x6','x7','x8','x9','x10',
                                          'x11','x12','x13','x14'])
    for counter in range(0,10):
        startRow = int(counter * maxSplit)
        endRow = int((counter + 1) * maxSplit)
        includedRows = set(range(startRow,endRow))
        listOfRows = [ParsedSummary(row,delimtr,rownum)
                      for rownum, row in listEnumFile if rownum in includedRows]
        Summary = pd.concat([Summary,pse.MakeAndSumList(listOfRows)])
        listOfRows = []
        counter += 1
    return Summary
(Again, this is my first question - so I apologize if I simplified too much or, more likely, too little, but I am at a loss as to how to expedite this.)
For runtime comparison:
Using Access, I can import, parse, summarize, and merge several files in this size range in under 5 minutes (though I am right at its 2GB limit). I'd hope to get comparable results in Python; presently I'm estimating ~30 minutes of runtime for one file. Note: I threw something together in Access' miserable environment only because I didn't have admin rights readily available to install anything else.
Edit: Updated the parsing code. I was able to shave off five minutes (estimated runtime now ~25 minutes) by changing some conditional logic to try/except. Also, the runtime estimate doesn't include the pandas portion; I'd forgotten I'd commented that out while testing, but its impact seems negligible.
If you want to optimize performance, don't roll your own CSV reader in Python. There is already a standard csv module. Perhaps pandas or numpy have faster csv readers; I'm not sure.
From https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file:
In short, pandas.io.parsers.read_csv beats everybody else, NumPy's loadtxt is impressively slow and NumPy's from_file and load impressively fast.
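For the interim-summary use case in the question, pandas can also read the file in chunks and aggregate as it goes, which keeps memory bounded without hand-splitting the file. A sketch with hypothetical file and column names:

import pandas as pd

group_cols = ['x1', 'x2', 'x3', 'x4', 'x5']   # hypothetical grouping columns

partial_sums = []
for chunk in pd.read_csv('input.csv', chunksize=500000):
    # summarize each chunk as it is read, so only the summaries stay in memory
    partial_sums.append(chunk.groupby(group_cols).sum())

# fold the per-chunk summaries into one final summary
summary = (pd.concat(partial_sums)
           .reset_index()
           .groupby(group_cols, as_index=False)
           .sum())
print(summary.head())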

Python Pandas Multiprocessing Apply

I am wondering if there is a way to do a pandas dataframe apply function in parallel. I have looked around and haven't found anything. At least in theory I think it should be fairly simple to implement, but I haven't seen anything. This is practically the textbook definition of parallel, after all. Has anyone else tried this or know of a way? If no one has any ideas, I think I might just try writing it myself.
The code I am working with is below. Sorry for the lack of import statements. They are mixed in with a lot of other things.
def apply_extract_entities(row):
    names = []
    counter = 0
    print row
    for sent in nltk.sent_tokenize(open(row['file_name'], "r+b").read()):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'node'):
                names += [chunk.node, ' '.join(c[0] for c in chunk.leaves())]
        counter += 1
        print counter
    return names

data9_2['proper_nouns'] = data9_2.apply(apply_extract_entities, axis=1)
EDIT:
So here is what I tried. I tried running it with just the first five elements of my iterable, and it is taking longer than it would if I ran it serially, so I assume it is not working.
os.chdir(str(home))
data9_2 = pd.read_csv('edgarsdc3.csv')
os.chdir(str(home)+str('//defmtest'))

#import stuff
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer

#define apply function and apply it
os.chdir(str(home)+str('//defmtest'))

####
#this is our apply function
def apply_extract_entities(row):
    names = []
    counter = 0
    print row
    for sent in nltk.sent_tokenize(open(row['file_name'], "r+b").read()):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'node'):
                names += [chunk.node, ' '.join(c[0] for c in chunk.leaves())]
        counter += 1
        print counter
    return names

#need something that populates a list of sections of a dataframe
def dataframe_splitter(df):
    df_list = range(len(df))
    for i in xrange(len(df)):
        sliced = df.ix[i]
        df_list[i] = sliced
    return df_list

df_list = dataframe_splitter(data9_2)
#df_list=range(len(data9_2))
print df_list

#the multiprocessing section
import multiprocessing

def worker(arg):
    print arg
    (arg)['proper_nouns'] = arg.apply(apply_extract_entities, axis=1)
    return arg

pool = multiprocessing.Pool(processes=10)

# get list of pieces
res = pool.imap_unordered(worker, df_list[:5])
res2 = list(itertools.chain(*res))
pool.close()
pool.join()

# re-assemble pieces into the final output
output = data9_2.head(1).concatenate(res)
print output.head()
With multiprocessing, it's best to generate several large blocks of data, then re-assemble them to produce the final output.
import multiprocessing

def worker(arg):
    return arg*2

pool = multiprocessing.Pool()

# get list of pieces
res = pool.map(worker, [1,2,3])
pool.close()
pool.join()

# re-assemble pieces into the final output
output = sum(res)
print 'got:', output
Output:
got: 12
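Applied to the dataframe case, the same split/compute/re-assemble pattern might look like the sketch below; the row function is a stand-in for the NLTK extraction, and the data and column names are made up:

import multiprocessing
import numpy as np
import pandas as pd

def process_block(block):
    # stand-in for the real per-row work (e.g. the NLTK entity extraction)
    return block.apply(lambda row: len(str(row['file_name'])), axis=1)

if __name__ == '__main__':
    data = pd.DataFrame({'file_name': ['a.txt', 'b.txt', 'c.txt', 'd.txt']})
    blocks = np.array_split(data, 4)           # get list of pieces
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(process_block, blocks)  # one Series per block, in order
    pool.close()
    pool.join()
    data['lengths'] = pd.concat(results)       # re-assemble into the final output
    print(data)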
