Memory error during dataframe creation in Python

Hey guys,
I had a problem during the creation of my dataset with Python. I'm doing this:
userTab = pd.read_csv('C:\\Users\\anto-\\Desktop\\Ex. Resource\\mapping_user_id.tsv', delimiter="\t", names=["User", "Sequence"])
wordTab = pd.read_csv('C:\\Users\\anto-\\Desktop\\Ex. Resource\\mapping_word_id.tsv', delimiter="\t", names=["Word", "Sequence"])
df = pd.DataFrame(data=data, index=userTab.User, columns=wordTab.Word)
I'm trying to create a DataFrame from two elements: userTab.User provides the row labels and wordTab.Word the column labels.
Maybe the shape is too big to compute this way.
I printed the shape of each element, because at first I thought I had gotten the dimensions wrong:
((603668,), (37419,), (603668, 37419))
After that I printed the types: user and word are Series, and the data is a scipy.sparse.csc.csc_matrix.
Maybe I need to process this shape in chunks, but I looked at the pandas.DataFrame reference and there is no such attribute.
I have 8 GB of RAM and 64-bit Python. The sparse matrix is stored in an .npz file (about 300 MB).
The error is a MemoryError:
MemoryError                               Traceback (most recent call last)
<ipython-input-26-ad363966ef6a> in <module>()
     10 type(sparse_matrix)
     11
---> 12 df = pd.DataFrame(data=sparse_matrix, index=np.array(userTab.User), columns=np.array(wordTab.Word))

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    416         if arr.ndim == 0 and index is not None and columns is not None:
    417             values = cast_scalar_to_array((len(index), len(columns)),
--> 418                                           data, dtype=dtype)
    419             mgr = self._init_ndarray(values, index, columns,
    420                                      dtype=values.dtype, copy=False)

~\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in cast_scalar_to_array(shape, value, dtype)
   1164         fill_value = value
   1165
-> 1166     values = np.empty(shape, dtype=dtype)
   1167     values.fill(fill_value)
   1168

MemoryError:
Maybe the problem could be this: I have a sort of ID, so when I try to access the User column, the ID remains attached to userTab.User.
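For context, a dense 603668 x 37419 float64 array would need roughly 180 GB, which is why np.empty fails. A minimal sketch of one way to avoid densifying the matrix, assuming sparse_matrix is the scipy matrix loaded from the .npz file (the file name here is hypothetical) and a pandas version with DataFrame.sparse.from_spmatrix (0.25+):
import pandas as pd
import scipy.sparse

# load the sparse matrix saved in the .npz file (path is a placeholder)
sparse_matrix = scipy.sparse.load_npz('sparse_matrix.npz')

# build a DataFrame backed by sparse columns instead of a dense 603668 x 37419 block
df = pd.DataFrame.sparse.from_spmatrix(sparse_matrix,
                                        index=userTab.User,
                                        columns=wordTab.Word)
print(df.sparse.density)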

Related

ValueError: Incompatible indexer with Series while adding date to Date to Data Frame

I am new to python and I can't figure out why I get this error: ValueError: Incompatible indexer with Series.
I am trying to add a date to my data frame.
The date I am trying to add:
date = (chec[(chec['Día_Sem']=='Thursday') & (chec['ID']==2011957)]['Entrada'])
date
Date output:
56 1900-01-01 07:34:00
Name: Entrada, dtype: datetime64[ns]
Then I try to add 'date' to my data frame using loc:
rep.loc[2039838,'Thursday'] = date
rep
And I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-347-3e0678b0fdbf> in <module>
----> 1 rep.loc[2039838,'Thursday'] = date
2 rep
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
188 key = com.apply_if_callable(key, self.obj)
189 indexer = self._get_setitem_indexer(key)
--> 190 self._setitem_with_indexer(indexer, value)
191
192 def _validate_key(self, key, axis):
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
640 # setting for extensionarrays that store dicts. Need to decide
641 # if it's worth supporting that.
--> 642 value = self._align_series(indexer, Series(value))
643
644 elif isinstance(value, ABCDataFrame):
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _align_series(self, indexer, ser, multiindex_indexer)
781 return ser.reindex(ax)._values
782
--> 783 raise ValueError('Incompatible indexer with Series')
784
785 def _align_frame(self, indexer, df):
ValueError: Incompatible indexer with Series
I was also facing a similar issue, but in a different scenario. I came across threads about duplicate indices, but of course that was not the case for me. What worked for me was to use .at in place of .loc. You can try it and see if it works:
rep['Thursday'].at[2039838] = date.values[0]
Try date.iloc[0] instead of date:
rep.loc[2039838,'Thursday'] = date.iloc[0]
Because date is actually a Series (so basically like a list/array) of the values, and .iloc[0] actually selects the value.
You use loc to set a specific cell, but your date is a Series (or DataFrame), so the two types cannot match. You can change the code to assign the scalar value of date to rep.loc[2039838,'Thursday']. For example, if date is a non-empty Series, you can do this:
rep.loc[2039838,'Thursday'] = date.values[0]
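A small self-contained sketch of the same fix, with made-up stand-ins for rep and the looked-up date (the frame and index value here are hypothetical, only to show the pattern):
import pandas as pd

rep = pd.DataFrame({'Thursday': [pd.NaT]}, index=[2039838])
date = pd.Series(pd.to_datetime(['1900-01-01 07:34:00']), name='Entrada')

# rep.loc[2039838, 'Thursday'] = date          # a one-element Series: this is what raised the ValueError for the asker
rep.loc[2039838, 'Thursday'] = date.iloc[0]    # assign the scalar value instead
print(rep)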

Column selection in Python

I am trying to find a solution to the problem below, but it seems I am going wrong with the approach.
I have a set of Excel files with columns such as ISBN, Title, etc. The column names in the Excel files are not consistently formatted: ISBN is named ISBN in some files, but ISBN-13, Alias, ISBN13, etc. in others. The same applies to Title and the other columns.
I have read all these Excel files into DataFrames in Python using read_excel and used str.contains to find the columns based on a substring. Please find the code below:
searchfor = ['ISBN13','BAR CODE','ISBN NO#','ISBN','ISBN1','ISBN 13','ISBN_13','ITEM','ISBN NUMBER','ISBN No','ISBN-13','ISBN (13 DIGITS)','EAN','ALIAS','ITEMCODE']
searchfor1 = ['TITLE','BOOK NAME','NAME','TITLE NAME','TITLES','BOOKNAME','BKDESC','PRODUCT NAME','ITEM DESCRIPTION','TITLE 18','COMPLETETITLE']
for f, i in zip(files_txt1, num1):
    df = pd.read_excel(f, encoding='sys.getfilesystemencoding()')
    df.columns = df.columns.str.upper()
    df1['Isbn'] = df[df.columns[df.columns.str.contains('|'.join(searchfor))]]
    df1['Title'] = df[df.columns[df.columns.to_series().str.contains('|'.join(searchfor1))]]
The code works fine if the Excel file has a column whose name appears in the list. However, it throws an error when the Excel file does not have any column with a name similar to the list entries. The code also does not work for ISBN.
Please see the detailed error below:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
    376             result = np.empty(len(values), dtype=object)
--> 377             result[:] = values
    378         except ValueError:
ValueError: could not broadcast input array from shape (31807,0) into shape (31807)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in _ensure_valid_index(self, value)
   2375             try:
-> 2376                 value = Series(value)
   2377             except:
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    247                 data = _sanitize_array(data, index, dtype, copy,
--> 248                                        raise_cast_failure=True)
    249
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   3028     else:
-> 3029         subarr = _asarray_tuplesafe(data, dtype=dtype)
   3030
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
    379             # we have a list-of-list
--> 380             result[:] = [tuple(x) for x in values]
    381
ValueError: cannot copy sequence with size 0 to array axis with dimension 31807

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-23-9e043c13fef2> in <module>()
     11     df.columns = df.columns.str.upper()
     12     #print(list(df.columns))
---> 13     df1['Isbn'] = df[df.columns[df.columns.str.contains('|'.join(searchfor))]]
     14     df1['Title'] = df[df.columns[df.columns.to_series().str.contains('|'.join(searchfor1))]]
     15     df1['Curr'] = df[df.columns[df.columns.to_series().str.contains('|'.join(searchfor2))]]
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   2329         else:
   2330             # set column
-> 2331             self._set_item(key, value)
   2332
   2333     def _setitem_slice(self, key, value):
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   2394         """
   2395
-> 2396         self._ensure_valid_index(value)
   2397         value = self._sanitize_column(key, value)
   2398         NDFrame._set_item(self, key, value)
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in _ensure_valid_index(self, value)
   2376                 value = Series(value)
   2377             except:
-> 2378                 raise ValueError('Cannot set a frame with no defined index '
   2379                                  'and a value that cannot be converted to a '
   2380                                  'Series')
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
You don't need all of this. If you know your columns beforehand, restrict them at the time you read the file into pandas; that way you will also reduce memory usage significantly.
df = pd.read_csv(file_name, usecols=['ISBN13','BAR CODE','ISBN NO#','ISBN','ISBN1','ISBN 13','ISBN_13','ITEM','ISBN NUMBER','ISBN No','ISBN-13','ISBN (13 DIGITS)','EAN','ALIAS','ITEMCODE']).fillna('')
This would work as long as you have either no match or exactly one match:
searchfor = ['ISBN13','BAR CODE','ISBN NO#','ISBN','ISBN1','ISBN 13','ISBN_13','ITEM','ISBN NUMBER','ISBN No','ISBN-13','ISBN (13 DIGITS)','EAN','ALIAS','ITEMCODE']
searchfor1 = ['TITLE','BOOK NAME','NAME','TITLE NAME','TITLES','BOOKNAME','BKDESC','PRODUCT NAME','ITEM DESCRIPTION','TITLE 18','COMPLETETITLE']
for f, i in zip(files_txt1, num1):
    df = pd.read_excel(f, encoding='sys.getfilesystemencoding()')
    df.columns = df.columns.str.upper()
    cols = df.columns
    is_isbn = cols.isin(searchfor)
    df1['Isbn'] = df[cols[is_isbn]] if is_isbn.any() else None
    is_title = cols.isin(searchfor1)
    df1['Title'] = df[cols[is_title]] if is_title.any() else None
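If a file can match more than one candidate name, one hedged variant is to keep only the first matching column (files_txt1, num1 and df1 are assumed from the question and not verified here):
for f, i in zip(files_txt1, num1):
    df = pd.read_excel(f)
    df.columns = df.columns.str.upper()

    isbn_cols = df.columns[df.columns.isin(searchfor)]
    title_cols = df.columns[df.columns.isin(searchfor1)]

    # take the first matching column if there is one, otherwise fill with None
    df1['Isbn'] = df[isbn_cols[0]] if len(isbn_cols) else None
    df1['Title'] = df[title_cols[0]] if len(title_cols) else None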

How to fix Numpy 'otypes' within Pandas dataframe?

Objective: to run association rules on a dataset of binary values.
d = {'col1': [0, 0,1], 'col2': [1, 0,0], 'col3': [0,1,1]}
df = pd.DataFrame(data=d)
This produces a data frame with 0's and 1's for corresponding column values.
The problem is when I make use of code like the following:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
frequent_itemsets = apriori(pattern_dataset, min_support=0.50,use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules
Typically this runs just fine, but this time I encountered an error.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-61-46ec6f572255> in <module>()
4 frequent_itemsets = apriori(pattern_dataset, min_support=0.50,use_colnames=True)
5 frequent_itemsets
----> 6 rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
7 rules
D:\AnaConda\lib\site-packages\mlxtend\frequent_patterns\association_rules.py in association_rules(df, metric, min_threshold, support_only)
127 values = df['support'].values
128 frozenset_vect = np.vectorize(lambda x: frozenset(x))
--> 129 frequent_items_dict = dict(zip(frozenset_vect(keys), values))
130
131 # prepare buckets to collect frequent rules
D:\AnaConda\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
1970 vargs.extend([kwargs[_n] for _n in names])
1971
-> 1972 return self._vectorize_call(func=func, args=vargs)
1973
1974 def _get_ufunc_and_otypes(self, func, args):
D:\AnaConda\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
2040 res = func()
2041 else:
-> 2042 ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
2043
2044 # Convert args to object arrays first
D:\AnaConda\lib\site-packages\numpy\lib\function_base.py in _get_ufunc_and_otypes(self, func, args)
1996 args = [asarray(arg) for arg in args]
1997 if builtins.any(arg.size == 0 for arg in args):
-> 1998 raise ValueError('cannot call `vectorize` on size 0 inputs '
1999 'unless `otypes` is set')
2000
ValueError: cannot call `vectorize` on size 0 inputs unless `otypes` is set
These are the dtypes I have in pandas; any help would be appreciated.
col1 int64
col2 int64
col3 int64
dtype: object
128 frozenset_vect = np.vectorize(lambda x: frozenset(x))
--> 129 frequent_items_dict = dict(zip(frozenset_vect(keys), values))
Here np.vectorize wraps the frozenset(x) function in code that can take an array or list (keys) and pass each element in for evaluation. It's a kind of numpy iteration (convenient, but not fast). But to determine what kind (dtype) of array it should return, it performs a test run with the first element of keys. An alternative to this test run is to use the otypes parameter.
Anyway, in this particular run keys is evidently empty, a size-0 array or list. np.vectorize could return a result array of the same shape, but it still has to pick a dtype. Hence the error.
Evidently the code writer never anticipated the case where keys is empty. So you need to tackle the question: why is it empty?
We need to look at the association_rules code to see how keys is set. Its use in line 129 suggests that it has the same number of elements as values, which is derived from the df with:
values = df['support'].values
If keys has 0 elements, then values does as well, and df has 0 rows.
What is the size of frequent_itemsets?
I added an mlxtend tag because the error arises while using that library's code. You/we need to examine that code or its documentation to determine why this DataFrame is empty.
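A minimal standalone sketch of the np.vectorize behavior described above (not the mlxtend code itself):
import numpy as np

frozenset_vect = np.vectorize(lambda x: frozenset(x))
print(frozenset_vect([{1, 2}, {3}]))   # fine: the output dtype is inferred from a test call on the first element

# frozenset_vect([])                   # raises: cannot call `vectorize` on size 0 inputs unless `otypes` is set

safe_vect = np.vectorize(lambda x: frozenset(x), otypes=[object])
print(safe_vect([]))                   # returns an empty object array instead of raising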
Workaround:
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

yourdataset_sets = yourdataset.applymap(encode_units)
frequent_itemsets = apriori(yourdataset_sets, min_support=0.001, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
Credit: saeedesmaili

Error: all arrays must be same length. But they ARE the same length

I am doing some work on sentiment analysis. I have three arrays: the content of the sentences, the sentiment scores, and the key words.
I want to display them as a DataFrame with pandas, but I get:
"ValueError: arrays must all be same length"
Here is some of my code:
print(len(text_sentences),len(score_list),len(keyinfo_list))
df = pd.DataFrame(text_sentences,score_list,keyinfo_list)
print(df)
Here are the results:
182 182 182
ValueError Traceback (most recent call last)
<ipython-input-15-cfb70aca07d1> in <module>()
21 print(len(text_sentences),len(score_list),len(keyinfo_list))
22
---> 23 df = pd.DataFrame(text_sentences,score_list,keyinfo_list)
24
25 print(df)
E:\learningsoft\anadonda\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
328 else:
329 mgr = self._init_ndarray(data, index, columns, dtype=dtype,
--> 330 copy=copy)
331 else:
332 mgr = self._init_dict({}, index, columns, dtype=dtype)
E:\learningsoft\anadonda\lib\site-packages\pandas\core\frame.py in _init_ndarray(self, values, index, columns, dtype, copy)
472 raise_with_traceback(e)
473
--> 474 index, columns = _get_axes(*values.shape)
475 values = values.T
476
E:\learningsoft\anadonda\lib\site-packages\pandas\core\frame.py in _get_axes(N, K, index, columns)
439 columns = _default_index(K)
440 else:
--> 441 columns = _ensure_index(columns)
442 return index, columns
443
E:\learningsoft\anadonda\lib\site-packages\pandas\core\indexes\base.py in _ensure_index(index_like, copy)
4015 if len(converted) > 0 and all_arrays:
4016 from .multi import MultiIndex
-> 4017 return MultiIndex.from_arrays(converted)
4018 else:
4019 index_like = converted
E:\learningsoft\anadonda\lib\site-packages\pandas\core\indexes\multi.py in from_arrays(cls, arrays, sortorder, names)
1094 for i in range(1, len(arrays)):
1095 if len(arrays[i]) != len(arrays[i - 1]):
-> 1096 raise ValueError('all arrays must be same length')
1097
1098 from pandas.core.categorical import _factorize_from_iterables
ValueError: all arrays must be same length
You can see that all three of my arrays contain 182 elements, so I don't understand why it says "all arrays must be same length".
You're passing the wrong data into pandas.DataFrame's initializer.
The way you're using it, you're essentially running:
pandas.DataFrame(data=text_sentences, index=score_list, columns=keyinfo_list)
This isn't what you want. You probably want to do something like this instead:
pd.DataFrame(data={
'sentences': text_sentences,
'scores': score_list,
'keyinfo': keyinfo_list
})
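A runnable sketch of the same pattern with small stand-in lists (the data here is hypothetical, just to show the keyword-argument form):
import pandas as pd

text_sentences = ['good service', 'slow delivery']
score_list = [0.9, -0.4]
keyinfo_list = ['service', 'delivery']

df = pd.DataFrame(data={
    'sentences': text_sentences,
    'scores': score_list,
    'keyinfo': keyinfo_list,
})
print(df)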

Type error on first steps with Apache Parquet

Rather confused by running into this type error while trying out the Apache Parquet file format for the first time. Shouldn't Parquet support all the data types that Pandas does? What am I missing?
import pandas
import pyarrow
import numpy
data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")
data_parquet = pyarrow.Table.from_pandas(data)
raises:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-9-90533507bcf2> in <module>()
----> 1 data_parquet = pyarrow.Table.from_pandas(data)
table.pxi in pyarrow.lib.Table.from_pandas()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads)
354 arrays = list(executor.map(convert_column,
355 columns_to_convert,
--> 356 convert_types))
357
358 types = [x.type for x in arrays]
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.time())
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
423 raise CancelledError()
424 elif self._state == FINISHED:
--> 425 return self.__get_result()
426
427 self._condition.wait(timeout)
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\thread.py in run(self)
54
55 try:
---> 56 result = self.fn(*self.args, **self.kwargs)
57 except BaseException as exc:
58 self.future.set_exception(exc)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in convert_column(col, ty)
343
344 def convert_column(col, ty):
--> 345 return pa.array(col, from_pandas=True, type=ty)
346
347 if nthreads == 1:
array.pxi in pyarrow.lib.array()
array.pxi in pyarrow.lib._ndarray_to_array()
error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer
data.dtypes is:
0 object
1 object
2 object
3 object
4 object
5 float64
6 float64
7 object
8 object
9 object
10 object
11 object
12 object
13 float64
14 object
15 float64
16 object
17 float64
...
In Apache Arrow, table columns must be homogeneous in their data types. pandas supports Python object columns where values can be different types. So you will need to do some data scrubbing before writing to Parquet format.
We've handled some rudimentary cases (like both bytes and unicode in a column) in the Arrow-Python bindings but we don't hazard any guesses about how to handle bad data. I opened the JIRA https://issues.apache.org/jira/browse/ARROW-2098 about adding an option to coerce unexpected values to null in situations like this, which might help in the future.
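A tiny sketch that reproduces this class of failure and one scrubbing option (the exact error message varies by pyarrow version; the column here is made up):
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'mixed': [1, 2, 'three']})   # object column holding both int and str values
try:
    pa.Table.from_pandas(df)
except pa.ArrowInvalid as exc:
    print(exc)

# one scrubbing option: force a single type before converting
df['mixed'] = df['mixed'].astype(str)
table = pa.Table.from_pandas(df)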
Had this same issue, and it took me a while to figure out a way to find the offending column. Here is what I came up with to find the mixed-type column, although I know there must be a more efficient way.
The last column printed before the exception happens is the mixed type column.
# method1: try saving the parquet file by removing 1 column at a time to
# isolate the mixed type column.
cat_cols = df.select_dtypes('object').columns
for col in cat_cols:
    drop = set(cat_cols) - set([col])
    print(col)
    df.drop(drop, axis=1).reset_index(drop=True).to_parquet("c:/temp/df.pq")
Another attempt - list the columns and each type based on the unique values.
# method2: list all columns and the types within
def col_types(col):
    types = set([type(x) for x in col.unique()])
    return types

df.select_dtypes("object").apply(col_types, axis=0)
I faced a similar situation. If possible, you can first convert all columns to the required data type and then try to convert to Parquet. Example:
import pandas as pd
column_list = df.columns
for col in column_list:
    df[col] = df[col].astype(str)
df.to_parquet('df.parquet.gzip', compression='gzip')
