I have a moderately sized DataFrame of ~500k rows and 200 columns, taking about 8 GB of memory.
My problem is that when I go to slice my data, even down to very small subsets of 6k rows and 200 columns, it just hangs for 10-15+ minutes. If I then hit the STOP button in the Python interactive window and retry, the operation completes in 2-3 seconds.
I don't know why the row slicing can't run in those 2-3 seconds the first time. It is making it impossible to run programs, as things just hang and have to be manually stopped before they work.
I am following the approach laid out on the h2o webpage:
import h2o
h2o.init()
# Import the iris dataset with headers
path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"
df = h2o.import_file(path=path)
# Slice 1 row by index
c1 = df[15,:]
c1.describe()
# Slice a range of rows
c1_1 = df[range(25,50,1),:]
c1_1.describe()
# Slice using a boolean mask. The output dataset will include rows with a sepal length
# less than 4.6.
mask = df["sepal_len"] < 4.6
cols = df[mask,:]
cols.describe()
# Filter out rows that contain missing values in a column. Note the use of '~' to
# perform a logical not.
mask = df["sepal_len"].isna()
cols = df[~mask,:]
cols.describe()
The error message from the console is as follows; I see this same message repeated several times:
/opt/anaconda3/lib/python3.7/site-packages/h2o/expr.py in (.0)
149 return self._cache._id # Data already computed under ID, but not cached
150 assert isinstance(self._children,tuple)
--> 151 exec_str = "({} {})".format(self._op, " ".join([ExprNode._arg_to_expr(ast) for ast in self._children]))
152 gc_ref_cnt = len(gc.get_referrers(self))
153 if top or gc_ref_cnt >= ExprNode.MAGIC_REF_COUNT:
~/opt/anaconda3/lib/python3.7/site-packages/h2o/expr.py in _arg_to_expr(arg)
161 return "[]" # empty list
162 if isinstance(arg, ExprNode):
--> 163 return arg._get_ast_str(False)
164 if isinstance(arg, ASTId):
Following up on my previous question: I have a list of records, shown below, taken from this table:
itemImage    name    nameFontSize  nameW  nameH  conutry  countryFont  countryW  countryH  code  codeFontSize  codeW  codeH
sample.jpg   Apple   142           1200   200    US       132          1200      400       1564  82            1300   600
sample2.jpg  Orange  142           1200   200    UK       132          1200      400       1562  82            1300   600
sample3.jpg  Lemon   142           1200   200    FR       132          1200      400       1563  82            1300   600
Right now, I have one function, setText, which takes all the elements of a row from this table.
I only have name, country, and code for now, but I will be adding other fields in the future.
I want to make this code more future-proof and dynamic. For example, if I added four new columns to my data following the same pattern, how do I make Python adjust to that automatically, instead of me declaring new variables in my code every time?
Basically, I want to send each group of four columns, starting from name, to a function, and continue until no columns are left. Once that's done, go to the next row and continue the loop.
Thanks to @Samwise, who helped me clean up the code a bit.
import os
from PIL import Image, ImageFont, ImageDraw
import pandas as pd

path = './'
files = []
for (dirpath, dirnames, filenames) in os.walk(path):
    files.extend(filenames)

df = pd.read_excel(r'./data.xlsx')
records = list(df.to_records(index=False))

def setText(itemImage, name, nameFontSize, nameW, nameH,
            conutry, countryFontSize, countryW, countryH,
            code, codeFontSize, codeW, codeH):
    font1 = ImageFont.truetype(r'./font.ttf', nameFontSize)
    font2 = ImageFont.truetype(r'./font.ttf', countryFontSize)
    font3 = ImageFont.truetype(r'./font.ttf', codeFontSize)
    file = Image.open(f"./{itemImage}")
    draw = ImageDraw.Draw(file)
    draw.text((nameW, nameH), name, font=font1, fill='#ff0000',
              align="right", anchor="rm")
    draw.text((countryW, countryH), conutry, font=font2, fill='#ff0000',
              align="right", anchor="rm")
    draw.text((codeW, codeH), str(code), font=font3, fill='#ff0000',
              align="right", anchor="rm")
    file.save(f'done {itemImage}')

for i in records:
    setText(*i)
Sounds like df.columns might help. It returns an Index of the column labels, which you can iterate through to handle whatever columns are present:
for col in df.columns:
The answers in this thread should help dial you in:
How to iterate over columns of pandas dataframe to run regression
It sounds like you also want row-wise results, so you could nest this within df.iterrows (or vice versa), though going cell by cell is generally undesirable and could end up quite slow as your df grows.
So perhaps think about how you could use your function with df.apply(); a sketch of the column-grouping idea follows.
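For concreteness, here is a minimal sketch of that idea, assuming the columns after itemImage always come in groups of four (text, font size, W, H) as in the table above; the helper name draw_field is hypothetical:

import pandas as pd
from PIL import Image, ImageFont, ImageDraw

def draw_field(draw, text, fontSize, w, h):
    # One four-column group: the text plus its font size and anchor position.
    font = ImageFont.truetype(r'./font.ttf', int(fontSize))
    draw.text((w, h), str(text), font=font, fill='#ff0000',
              align="right", anchor="rm")

df = pd.read_excel(r'./data.xlsx')
for _, row in df.iterrows():
    file = Image.open(f"./{row['itemImage']}")
    draw = ImageDraw.Draw(file)
    # Walk the remaining columns four at a time, however many groups exist.
    values = row.drop('itemImage').tolist()
    for i in range(0, len(values), 4):
        draw_field(draw, *values[i:i+4])
    file.save(f"done {row['itemImage']}")

This way, adding four more columns in the same pattern requires no code changes.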
I have a Dask DataFrame whose index (client_id) is not unique. Repartitioning and resetting the index ends up with very uneven partitions; some contain only a few rows, some hundreds of thousands. For instance, the following code:
for p in range(ddd.npartitions):
    print(len(ddd.get_partition(p)))
prints out something like this:
55
17
5
41
51
1144
4391
75153
138970
197105
409466
415925
486076
306377
543998
395974
530056
374293
237
12
104
52
28
My DataFrame is one-hot encoded and has over 500 columns, and the larger partitions don't fit in memory. I want to repartition the DataFrame so that the partitions are even in size. Do you know an efficient way to do this?
EDIT 1
A simple reproduction:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': np.arange(0, 10000), 'y': np.arange(0, 10000)})
df2 = pd.DataFrame({'x': np.append(np.arange(0, 4995), np.arange(5000, 10000, 1000)),
                    'y2': np.arange(0, 10000, 2)})
dd_df = dd.from_pandas(df, npartitions=10).set_index('x')
dd_df2 = dd.from_pandas(df2, npartitions=5).set_index('x')
new_ddf = dd_df.merge(dd_df2, how='right')
#new_ddf = new_ddf.reset_index().set_index('x')
#new_ddf = new_ddf.repartition(npartitions=2)
new_ddf.divisions
for p in range(new_ddf.npartitions):
    print(len(new_ddf.get_partition(p)))
Note the last partitions (a single element each):
1000
1000
1000
1000
995
1
1
1
1
1
Even when we uncomment the commented lines, the partitions remain uneven in size.
Edit II: Workaround
A simple workaround can be achieved with the following code.
Is there a more elegant way to do this (more in the Dask way)?
def repartition(ddf, npartitions=None):
    MAX_PART_SIZE = 100*1024

    if npartitions is None:
        npartitions = ddf.npartitions

    one_row_size = sum([dt.itemsize for dt in ddf.dtypes])
    length = len(ddf)
    requested_part_size = length/npartitions*one_row_size
    if requested_part_size <= MAX_PART_SIZE:
        np = npartitions
    else:
        np = length*one_row_size/MAX_PART_SIZE

    chunksize = int(length/np)

    vc = ddf.index.value_counts().to_frame(name='count').compute().sort_index()
    vsum = 0
    divisions = [ddf.divisions[0]]
    for i, v in vc.iterrows():
        vsum += v['count']
        if vsum > chunksize:
            divisions.append(i)
            vsum = 0
    divisions.append(ddf.divisions[-1])
    return ddf.repartition(divisions=divisions, force=True)
You're correct that .repartition won't do the trick since it doesn't handle any of the logic for computing divisions and just tries to combine the existing partitions wherever possible. Here's a solution I came up with for the same problem:
import numpy as np
import dask.dataframe as dd

def _rebalance_ddf(ddf):
    """Repartition a dask DataFrame so that partitions are roughly equal size.

    Assumes `ddf.index` is already sorted.
    """
    if not ddf.known_divisions:  # e.g. for read_parquet(..., infer_divisions=False)
        ddf = ddf.reset_index().set_index(ddf.index.name, sorted=True)
    index_counts = ddf.map_partitions(lambda _df: _df.index.value_counts().sort_index()).compute()
    index = np.repeat(index_counts.index, index_counts.values)
    divisions, _ = dd.io.io.sorted_division_locations(index, npartitions=ddf.npartitions)
    return ddf.repartition(divisions=divisions)
The internal function sorted_division_locations does what you want already, but it only works on an actual list-like, not a lazy dask.dataframe.Index. This avoids pulling the full index when there are many duplicates: it just fetches the per-value counts and reconstructs the index locally from those.
If your dataframe is so large that even the index won't fit in memory, then you'd need to do something even more clever.
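As a usage sketch, assuming new_ddf from the reproduction above is the frame being fixed (and the numpy/dask imports shown with the function):

balanced = _rebalance_ddf(new_ddf)
for p in range(balanced.npartitions):
    print(len(balanced.get_partition(p)))  # sizes should now be roughly equal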
I am having performance issues with iterrows on my DataFrame as I start to scale up my data analysis.
Here is the current loop that I am using.
dl = []  # indices of rows to drop
for ii, i in a.iterrows():
    for ij, j in a.iterrows():
        if ii != ij:
            if i['DOCNO'][-5:] == j['DOCNO'][4:9]:
                if i['RSLTN1'] > j['RSLTN1']:
                    dl.append(ij)
                else:
                    dl.append(ii)
            elif i['DOCNO'][-5:] == j['DOCNO'][-5:]:
                if i['RSLTN1'] > j['RSLTN1']:
                    dl.append(ij)
                else:
                    dl.append(ii)
c = a.drop(a.index[dl])
The point of the loop is to find 'DOCNO' values that differ in the DataFrame but are known to be equivalent, denoted by five characters that match but sit at different positions in the string. When a pair is found, I want to drop the row with the smaller value in the associated 'RSLTN1' column. Additionally, my data set may have multiple entries for a unique 'DOCNO', of which I want to drop all but the highest-'RSLTN1' result.
I ran this successfully on small quantities of data (~1000 rows), but as I scale up 10x I run into performance issues. Any suggestions?
A sample from the dataset:
In [107]:a[['DOCNO','RSLTN1']].sample(n=5)
Out[107]:
DOCNO RSLTN1
6815 MP00064958 72386.0
218 MP0059189A 65492.0
8262 MP00066187 96497.0
2999 MP00061663 43677.0
4913 MP00063387 42465.0
How does this fit your needs?
import pandas as pd
from io import StringIO

s = '''\
DOCNO RSLTN1
MP00059189 72386.0
MP0059189A 65492.0
MP00066187 96497.0
MP00061663 43677.0
MP00063387 42465.0'''

# Recreate the dataframe
df = pd.read_csv(StringIO(s), sep=r'\s+')

# Create a mask
# We sort first to make sure we keep only the highest value per key
# Remove all non-digits according to: https://stackoverflow.com/questions/44117326/
m = (df.sort_values(by='RSLTN1', ascending=False)['DOCNO']
       .str.extract(r'(\d+)', expand=False)
       .astype(int)
       .duplicated())

# Apply the inverted (`~`) mask
df = df.loc[~m]
Resulting df:
DOCNO RSLTN1
0 MP00059189 72386.0
2 MP00066187 96497.0
3 MP00061663 43677.0
4 MP00063387 42465.0
In this example the following row was removed:
MP0059189A 65492.0
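As a sketch of an equivalent approach (not the code above), the same filter can be written with a groupby on the digits extracted from DOCNO, keeping the row with the largest RSLTN1 in each group; this assumes the recreated df from before the mask was applied:

key = df['DOCNO'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.loc[df.groupby(key)['RSLTN1'].idxmax()]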
TL;DR: The df.query() tool doesn't seem to work if the df's columns are tuples, or even tuples converted into strings. How can I work around this to get the slice I'm aiming for?
Long version: I have a pandas dataframe that looks like this (although there are many more columns and rows...):
> dosage_df
Score ("A_dose","Super") ("A_dose","Light") ("B_dose","Regular")
28 1 40 130
11 2 40 130
72 3 40 130
67 1 90 130
74 2 90 130
89 3 90 130
43 1 40 700
61 2 40 700
5 3 40 700
Along with my data frame, I also have a Python dictionary holding the relevant ranges for each feature. The keys are the feature names, and the values are the lists of values each feature can take:
# Original Version
dosage_df.columns = ['First Score', 'Last Score', ("A_dose","Super"), ("A_dose","Light"), ("B_dose","Regular")]
dict_of_dose_ranges = {("A_dose","Super"):[1,2,3],
("A_dose","Light"):[40,70,90],
("B_dose","Regular"):[130,200,500,700]}
For my purposes, I need to generate a particular combination (say A_dose = 1, B_dose = 90, and C_dose = 700) and, based on those settings, take the relevant slice out of my dataframe, do the relevant calculations on that smaller subset, and save the results somewhere.
I'm doing this by implementing the following:
from itertools import product

for dosage_comb in product(*dict_of_dose_ranges.values()):
    dosage_items = zip(dict_of_dose_ranges.keys(), dosage_comb)
    query_str = ' & '.join('{} == {}'.format(*x) for x in dosage_items)
    sub_df = dosage_df.query(query_str)
    ...
The problem is that it gets hung up on the query step, as it returns the following error message:
TypeError: argument of type 'int' is not iterable
In this case, the generated query looks like this:
query_str = '("A_dose","Light") == 40 & ("A_dose","Super") == 1 & ("B_dose","Regular") == 130'
Troubleshooting attempts:
I've confirmed that this approach does work for a dataframe with plain string columns, as found here. In addition, I've tried "tricking" the tool by converting the columns and the dictionary keys into strings with the following code... but that returned the same error.
# String Version
dosage_df.columns = ['First Score', 'Last Score', '("A_dose","Super")', '("A_dose","Light")', '("B_dose","Regular")']
dict_of_dose_ranges = {
'("A_dose","Super")':[1,2,3],
'("A_dose","Light")':[40,70,90],
'("B_dose","Regular")':[130,200,500,700]}
Is there an alternative tool in Python that can take tuples as inputs, or a different way for me to trick this into working?
You can build a list of conditions and logically condense them with np.all instead of using query:
import numpy as np

for dosage_comb in product(*dict_of_dose_ranges.values()):
    dosage_items = zip(dict_of_dose_ranges.keys(), dosage_comb)
    condition = np.all([dosage_df[col] == dose for col, dose in dosage_items], axis=0)
    sub_df = dosage_df[condition]
This method seems to be a bit more flexible than query, but when filtering across many columns I've found that query often performs better. I don't know if this is true in general though.
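To see the boolean-mask route end to end, here is a self-contained sketch with made-up data, using tuple column labels that query cannot parse:

import numpy as np
import pandas as pd

df = pd.DataFrame({("A_dose", "Super"): [1, 2, 3],
                   ("B_dose", "Regular"): [130, 700, 130]})
criteria = {("A_dose", "Super"): 1, ("B_dose", "Regular"): 130}
# Tuple labels still index fine with [], so the mask approach just works.
mask = np.all([df[col] == val for col, val in criteria.items()], axis=0)
sub_df = df[mask]  # keeps only the first row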
I've created a tuple generator that extracts information from a file, filtering only the records of interest and converting each one to a tuple, which the generator returns.
I've tried to create a DataFrame from it:
import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)
but it throws an error:
...
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1046 values.append(row)
1047 i += 1
-> 1048 if i >= nrows:
1049 break
1050
TypeError: unorderable types: int() >= NoneType()
I managed to get it to work by consuming the generator into a list, but that uses twice the memory:
df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)
The files I want to load are big, and memory consumption matters. On my last try, my computer spent two hours trying to grow its virtual memory :(
The question: does anyone know a way to create a DataFrame from a record generator directly, without converting it to a list first?
Note: I'm using Python 3.3 and pandas 0.12 with Anaconda on Windows.
Update:
It's not a problem with reading the file; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, then yields the fields as a generator of tuples.
Some numbers: it scans 2,111,412 records in a 130 MB gzip file (about 6.5 GB uncompressed) in about a minute, with little memory used.
Pandas 0.12 does not allow generators; the dev version allows them, but it puts the whole generator into a list and then converts that to a frame. It's not efficient, but it's something pandas has to deal with internally. Meanwhile, I'll have to think about buying some more memory.
You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:
import pandas as pd
someGenerator = ( (x, chr(x)) for x in range(48,127) )
someDf = pd.DataFrame(someGenerator)
Produces:
type(someDf) #pandas.core.frame.DataFrame
someDf.dtypes
#0 int64
#1 object
#dtype: object
someDf.tail(10)
# 0 1
#69 117 u
#70 118 v
#71 119 w
#72 120 x
#73 121 y
#74 122 z
#75 123 {
#76 124 |
#77 125 }
#78 126 ~
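If you want named columns, the constructor also takes a columns argument; a minimal variation on the generator above (the names are just for illustration):

someDf = pd.DataFrame(((x, chr(x)) for x in range(48, 127)),
                      columns=['code', 'char'])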
You cannot create a DataFrame from a generator with the 0.12 version of pandas. You can either update to the development version (get it from GitHub and compile it, which is a little painful on Windows, but I would prefer this option).
Or, since you said you are filtering the lines, you can first filter them, write them to a file, and then load that with read_csv or something else...
If you want to get super complicated, you can create a file-like object that returns the lines:
def gen():
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        try:
            return next(self.g)
        except StopIteration:
            return ''
And then use read_csv:
>>> pd.read_csv(Reader(gen()))
col1 col2
0 foo bar
1 foo baz
2 bar baz
To make it memory efficient, read in chunks and stack them row-wise. Something like this, using Viktor's Reader class from above (note the default axis=0, so the chunks stack as rows rather than columns):
df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)))
You can also use something like this (tested in Python 2.7.5):
from itertools import izip

def dataframe_from_row_iterator(row_iterator, colnames):
    col_iterator = izip(*row_iterator)
    return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})
You can also adapt this to append rows to a DataFrame.
If the generator is just like a list of DataFrames, you just need to create a new DataFrame by concatenating the elements of that list:
result = pd.concat(list_of_dfs)
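Note that pd.concat accepts any iterable of DataFrames, including a generator, so you don't have to build the list yourself first; a minimal sketch with a hypothetical chunk generator:

import pandas as pd

def chunk_gen():
    # Hypothetical generator yielding small DataFrame chunks.
    for start in range(0, 100, 25):
        yield pd.DataFrame({'x': range(start, start + 25)})

result = pd.concat(chunk_gen(), ignore_index=True)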
Recently I've faced the same problem.