I want to write a simple dataframe as an ORC file. The only column is of an integer type. If I set all of its values to None, an exception is raised on to_orc.
I understand that pyarrow cannot infer the datatype from None values, but what can I do to fix the datatype for the output? Attempts to use .astype() only brought TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'.
Bonus points if the solution also works for
empty dataframes
nested types
Script:
import pandas as pd

data = {'a': [1, 2]}
df = pd.DataFrame(data=data)
print(df)
df.to_orc('a.orc') # OK
df['a'] = None
print(df)
df.to_orc('a.orc') # fails
Output:
a
0 1
1 2
a
0 None
1 None
Traceback (most recent call last):
File ... line 9, in <module>
...
File "pyarrow/_orc.pyx", line 443, in pyarrow._orc.ORCWriter.write
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unknown or unsupported Arrow type: null
This is a known issue; see https://github.com/apache/arrow/issues/30317. The problem is that the ORC writer does not yet support writing a column of all nulls that has no specific dtype (i.e. a generic object column, which pyarrow converts to its null type). If you first cast the column to, for example, float, then writing works.
Using the df from your example:
>>> df.dtypes
a object
dtype: object
# the column has generic object dtype, cast to float
>>> df['a'] = df['a'].astype("float64")
>>> df.dtypes
a float64
dtype: object
# now writing to ORC and reading back works
>>> df.to_orc('a.orc')
>>> pd.read_orc('a.orc')
a
0 NaN
1 NaN
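If you need to keep the column integer-typed rather than float, casting to pandas' nullable "Int64" dtype (note the capital I) should also work: pyarrow maps it to an ordinary int64 Arrow column with nulls instead of the unsupported null type. A minimal sketch, assuming pandas >= 1.0 for the nullable dtype and pandas >= 2.0 for the dtype_backend argument to read_orc:
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df['a'] = None

# "Int64" is the nullable integer dtype; missing values become pd.NA,
# so no float cast is needed and no astype(int) TypeError occurs
df['a'] = df['a'].astype('Int64')

df.to_orc('a.orc')
# by default the round trip comes back as float64 with NaN; requesting
# the nullable backend preserves the Int64 dtype
print(pd.read_orc('a.orc', dtype_backend='numpy_nullable'))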
I've seen a couple of answers to the general question and used some of the suggested solutions, but I'm still getting stuck.
I have the following code:
name = ['Sepal-length', 'Sepal-width', 'Petal-length', 'Petal-width', 'Class']
iris_ds = pd.read_csv(url, names=name)
cols=iris_ds.columns.drop('Class')
iris_ds[cols]=iris_ds[cols].apply(pd.to_numeric, errors='coerce')
.......
iris_ds['Sepal-area'] = iris_ds.eval('Sepal-width' * 'Sepal-length')
print(iris_ds.head(20))
However, when I run the script for the second section, I get the following:
Traceback (most recent call last):
  File "Iris_Data_set1.py", line 67, in <module>
    iris_ds['Petal-area'] = iris_ds.eval('Petal-width' * 'Petal-length')
TypeError: can't multiply sequence by non-int of type 'str'
The data types are as follows:
Sepal-length float64
Sepal-width float64
Petal-length float64
Petal-width float64
Class object
dtype: object
Any suggestions on how to resolve this issue, so that I can do the multiplication?
Is there any reason why you can't just multiply the columns directly? Note that attribute access won't work with hyphenated names, because Python parses iris_ds.Sepal-width as iris_ds.Sepal minus width, so use bracket access:
iris_ds['Sepal-area'] = iris_ds['Sepal-width'] * iris_ds['Sepal-length']
There are really two problems here, though. You probably shouldn't be using Sepal-length as a column name in the first place; use Sepal_length instead (and apply this to your other columns), making the answer:
iris_ds['Sepal_area'] = iris_ds.Sepal_width * iris_ds.Sepal_length
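If you want to keep using eval, a sketch of the rename-first approach (column names taken from the question, untested against your exact data):
import pandas as pd

# hyphens make the names invalid Python identifiers, which breaks both
# attribute access and eval expressions; swap them for underscores once
iris_ds.columns = iris_ds.columns.str.replace('-', '_')

# eval evaluates a single expression string over the columns and
# returns a new DataFrame with the assigned column added
iris_ds = iris_ds.eval('Sepal_area = Sepal_width * Sepal_length')
print(iris_ds.head())
On pandas 0.25+ you should also be able to keep the hyphens and backtick-quote the names inside eval, e.g. iris_ds.eval('`Sepal-width` * `Sepal-length`').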
I'm quite new to programming (in Python) and I would like to create a new variable that is the logarithm of a column (from an imported Excel file). I have tried different solutions from this site, but I keep getting an error. My latest error is AttributeError: 'str' object has no attribute 'log'.
I have already dropped all the values that are not numbers, but I still don't know how to convert the values from strings to numbers (if that is even the issue, because int(neighborhood) doesn't work).
This is the code I have now:
import pandas as pd
import numpy as np
df=pd.read_excel("kwb-2016_del_col_del_row.xls")
df = df[df.m_woz != "."] # drop rows with values "."
neighborhood=df[df.recs=="Neighborhood"]
neighborhood=neighborhood["m_woz"]
print(neighborhood)
np.log(neighborhood)
and this is the error I'm getting:
AttributeError Traceback (most recent call last)
<ipython-input-66-46698de51811> in <module>()
12 print(neighborhood)
13
---> 14 np.log(neighborhood)
AttributeError: 'str' object has no attribute 'log'
Could someone help me please?
Perhaps you are not removing the data you think you are?
Try printing the data types to see what they are.
In a DataFrame, your column might be filled with objects instead of numbers.
print(df.dtypes)
Also, you might want to look at these two pages
Select row from a DataFrame based on the type of the object(i.e. str)
Pandas: convert dtype 'object' to int
Here's an example I constructed and ran interactively that correctly gets the logarithms (don't type >>>):
>>> raw_data = {'m_woz': [42, 'def', 1.23, 45.6, '.xyz'],
'recs': ['Neighborhood', 'Neighborhood',
'unknown', 'Neighborhood', 'whatever']}
>>> df = pd.DataFrame(raw_data, columns = ['m_woz', 'recs'])
>>> print(df.dtypes)
m_woz object
recs object
dtype: object
Note that the type is object, not float or int or str
Continuing on, here is what df and neighborhood look like:
>>> df
m_woz recs
0 42 Neighborhood
1 def Neighborhood
2 1.23 unknown
3 45.6 Neighborhood
4 .xyz whatever
>>> neighborhood=df[df.recs=="Neighborhood"]
>>> neighborhood
m_woz recs
0 42 Neighborhood
1 def Neighborhood
3 45.6 Neighborhood
And here are the tricks...
This line selects all rows in neighborhood whose m_woz value is an int or float (be careful to fix the indentation if you copy/paste this):
>>> df_num_strings = neighborhood[neighborhood['m_woz'].
apply(lambda x: type(x) in (int, float))]
>>> df_num_strings
m_woz recs
0 42 Neighborhood
3 45.6 Neighborhood
Almost there... convert the entries to floating point:
>>> df_float = df_num_strings['m_woz'].astype(str).astype(float)
>>> df_float
0    42.0
3    45.6
Name: m_woz, dtype: float64
Finally, compute logarithms:
>>> np.log(df_float)
0 3.737670
3 3.819908
Name: m_woz, dtype: float64
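For what it's worth, a shorter route is pd.to_numeric with errors='coerce', which turns anything non-numeric into NaN so no type-based filtering is needed (a sketch, continuing the session above):
>>> values = pd.to_numeric(neighborhood['m_woz'], errors='coerce').dropna()
>>> np.log(values)
0    3.737670
3    3.819908
Name: m_woz, dtype: float64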
Not sure what the problem is here... all I want is the first (and only) element in this series:
>>> a
1 0-5fffd6b57084003b1b582ff1e56855a6!1-AB8769635...
Name: id, dtype: object
>>> len (a)
1
>>> type(a)
<class 'pandas.core.series.Series'>
>>> a[0]
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
a[0]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\core\indexes\base.py", line 2477, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4404)
File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4087)
File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)
File "pandas\_libs\hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:14031)
File "pandas\_libs\hashtable_class_helper.pxi", line 765, in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:13975)
KeyError: 0L
Why isn't that working? And how do I get the first element?
When the index is integer, you cannot use positional indexers because the selection would be ambiguous (should it return based on label or position?). You need to either explicitly use a.iloc[0] or pass the label a[1].
The following works because the index type is object:
a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
a
Out:
a 1
b 2
c 3
dtype: int64
a[0]
Out: 1
But for integer index, things are different:
a = pd.Series([1, 2, 3], index=[2, 3, 4])
a[2] # returns the first entry - label based
Out: 1
a[1] # raises a KeyError
KeyError: 1
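To make the intent explicit regardless of index type, a quick sketch using iloc for positions and loc for labels:
a = pd.Series([1, 2, 3], index=[2, 3, 4])
a.iloc[0]  # always positional: the first element
Out: 1
a.loc[2]   # always label-based: the element labelled 2
Out: 1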
Look at the following code:
import pandas as pd
import numpy as np
data1 = pd.Series(['a','b','c'],index=['1','3','5'])
data2 = pd.Series(['a','b','c'],index=[1,3,5])
print('keys data1: '+str(data1.keys()))
print('keys data2: '+str(data2.keys()))
print('base data1: '+str(data1.index.base))
print('base data2: '+str(data2.index.base))
print(data1['1':'3']) # Here we use the dictionary-like slicing
print(data1[1:3]) # Here we use the integer-like slicing
print(data2[1:3]) # Here we use the integer-like slicing
keys data1: Index(['1', '3', '5'], dtype='object')
keys data2: Int64Index([1, 3, 5], dtype='int64')
base data1: ['1' '3' '5']
base data2: [1 3 5]
1 a
3 b
dtype: object
3 b
5 c
dtype: object
3 b
5 c
dtype: object
For data1 the dtype of the index is object; for data2 it is int64. In Jake VanderPlas's Data Science Handbook he writes: "a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary". Hence, if the index is of type object, as in the case of data1, we have two different ways to access the values:
1. By dictionary-like slicing/indexing:
data1['1':'3'] --> a, b
2. By integer-like (positional) slicing/indexing:
data1[1:3] --> b, c
If the index dtype is int64, as in the case of data2, pandas cannot decide whether we want label-based or positional slicing/indexing, so for slices it defaults to positional; consequently, data2[1:3] yields b, c, just as data1[1:3] does when we choose integer-like slicing.
Nevertheless, VanderPlas mentions one critical thing to keep in mind in this case:
"Notice that when you are slicing with an explicit index (i.e., data['a':'c']), the final index is included in
the slice, while when you’re slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.[...] These slicing and indexing conventions can be a source of confusion."
To overcome this confusion you can use loc for label-based slicing/indexing and iloc for position-based slicing/indexing, like:
import pandas as pd
import numpy as np
data1 = pd.Series(['a','b','c'],index=['1','3','5'])
data2 = pd.Series(['a','b','c'],index=[1,3,5])
print('data1.iloc[0:2]: ',str(data1.iloc[0:2]),sep='\n',end='\n\n')
# print(data1.loc[1:3]) --> throws an error because there are no integer labels 1 or 3 in the index (these are strings)
print('data1.loc["1":"3"]: ',str(data1.loc['1':'3']),sep='\n',end='\n\n')
print('data2.iloc[0:2]: ',str(data2.iloc[0:2]),sep='\n',end='\n\n')
print('data2.loc[1:3]: ',str(data2.loc[1:3]),sep='\n',end='\n\n') # Note that, contrary to usual Python slices, both the start and the stop are included
data1.iloc[0:2]:
1 a
3 b
dtype: object
data1.loc["1":"3"]:
1 a
3 b
dtype: object
data2.iloc[0:2]:
1 a
3 b
dtype: object
data2.loc[1:3]:
1 a
3 b
dtype: object
So data2.loc[1:3] searches explicitly for the labels 1 and 3 in the index and returns the values lying between them, inclusive of both ends, while data2.iloc[0:2] returns the values from the zeroth position up to, but excluding, the second position.
I have a dataset imported via Pandas that has a column full of arrays with strings in them, i.e.:
'Entry'
0 ['test', 'test1', 'test2']
.
.
.
[n] ['test', 'test1n', 'test2n']
What I would like to do is apply a function to ensure that there are no duplicate elements in each array. My method is as follows:
def remove_duplicates(test_id_list):
    new_test_ids = []
    for tags in test_id_list:
        if tags not in new_test_ids:
            new_test_ids.append(tags)
    return new_test_ids
I want to apply this to the 'Entry' column in my DataFrame via either apply() or map() to clean up each entry. I am doing this via
training_data['Entry'].apply(remove_duplicates(training_data['Entry']))
But I am getting the error:
Traceback (most recent call last):
File "/home/main.py", line 32, in <module>
training_data['Entry'].apply(remove_duplicates(training_data['Entry']))
File "/home/~~~~/.local/lib/python2.7/site-packages/pandas/core/series.py", line 2294, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer (pandas/lib.c:66124)
TypeError: 'list' object is not callable
If anybody can help point me in the right direction, that would be wonderful! I am a bit lost at this point/new to using Pandas for data manipulation.
If you decompose your expression a bit you can see what's wrong.
training_data['Entry'].apply(remove_duplicates(training_data['Entry']))
is functionally equivalent to
x = remove_duplicates(training_data['Entry'])
training_data['Entry'].apply(x)
x is a list because that's what your remove_duplicates function returns. The apply method wants a function, as Rauch points out, so you'd want x to simply be remove_duplicates.
Setup
df
Out[1190]:
Entry
0 [test, test, test2]
1 [test, test1n, test2n]
To make your code work, you can just do:
df.Entry.apply(func=remove_duplicates)
Out[1189]:
0 [test, test2]
1 [test, test1n, test2n]
Name: Entry, dtype: object
You can actually do this without a custom function in a one liner:
df.Entry.apply(lambda x: list(set(x)))
Out[1193]:
0 [test, test2]
1 [test, test2n, test1n]
Name: Entry, dtype: object
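One caveat: set() does not preserve the order of the elements. If order matters, a sketch using dict.fromkeys (dicts keep insertion order on Python 3.7+) retains the first occurrence of each element:
df.Entry.apply(lambda x: list(dict.fromkeys(x)))
Out:
0             [test, test2]
1    [test, test1n, test2n]
Name: Entry, dtype: object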
I was scratching my head over the past few days about why my method call in multiprocessing was failing. After drilling down further, I realized it was because some DataFrames being passed as arguments to the method weren't getting pickled. As an alternative, I wrote the DataFrame to csv in my parent method and passed the name of the csv file to the child method, where I read the csv to get the data back. But this is a pretty inefficient approach, so let me know where I went wrong. The following is my code.
There are two DataFrames; their dtypes are as follows.
DataFrame: a
FIELD_NAME object
DEST_LOC_NBR float64
APPT_NBR float64
SEQ_NBR float64
FIELD_NBR float64
CREATE_TS datetime64[ns]
CREATE_USERID_V object
BEFORE_FIELD_VALUE object
AFTER_FIELD_VALUE object
CHNG_ORIGIN_I float64
APPT_STAT_NBR float64
DESCRPTN object
LOC_NBR_x float64
LOCN_ABBR_x object
LOC_NBR_y object
LOCN_ABBR_y object
FRGT_TYP_NBR_x float64
FRGT_TYP_DESC_x object
FRGT_TYP_NBR_y object
FRGT_TYP_DESC_y object
DataFrame: d1
APPT_NBR float64
APPT_MERGE_F object
APPT_STAT_NBR object
DEST_LOC_NBR float64
APPT_TYP_NBR float64
LOC_NBR float64
FRGT_TYP_NBR float64
APPT_FRT_TYP_AFTER_ARVL_TS object
APPT_EMPTY_F object
APPT_ACTL_ARVL_TS datetime64[ns]
APPT_SCHD_ARVL_TS datetime64[ns]
APPT_NBR_OF_CNTRS float64
APPT_NBR_OF_GOHS float64
CREATE_TS datetime64[ns]
LAST_UPD_TS object
VERSION_NBR float64
APPT_CLOSE_TS object
APPT_OPEN_TS object
APPT_UNCNFRMD_TS object
APPT_ORG_SCH_AR_TS object
APPT_CNCL_TS object
APPT_SUSP_TS object
APPT_ARCHV_TS object
CARRIER object
VND_NBR float64
YARD_AREA_NBR float64
CREATE_USERID_V object
APPT_ACTL_ARVL_D object
APPT_SCHD_ARVL_D object
dtype: object
I even tried manually pickling in the parent method and unpickling in the child, like I do for csv files, but the same error is raised.
I first create random file names, since this is a multiprocessing job and the file names can't collide:
fname = ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(5))
fnames = dict()
fnames['a'] = '{}_a.pkl'.format(fname)
fnames['d1'] = '{}_d1.pkl'.format(fname)
fnames['file'] = fname
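A sketch of an alternative using the standard library's tempfile module, which guarantees unique names across processes without hand-rolled randomness:
import os
import tempfile

# mkstemp atomically creates a unique file and returns (fd, path);
# close the fd since pandas will reopen the file by path
fd, path_a = tempfile.mkstemp(suffix='_a.pkl')
os.close(fd)
fd, path_d1 = tempfile.mkstemp(suffix='_d1.pkl')
os.close(fd)
fnames = {'a': path_a, 'd1': path_d1}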
And, I pickle dataframes using the following:
d1.to_pickle(fnames['d1'])
a.to_pickle(fnames['a'])
And, the following is my method call:
p = multiprocessing.Process(target=ParallelLoopTest, args=(dd, fnames, return_list, final_col_dates, d, final_col_dates_mod, iter, self.DC, self.start_hour_of_day))
The definition of my method ParallelLoopTest is as follows:
def ParallelLoopTest(dd, fnames, days_out_vars, final_col_dates, d, final_col_dates_mod, iter, DC, start_hour_of_day, store):
a = pd.read_pickle(fnames['a'])
d1 = pd.read_pickle(fnames['d1'])
df = pd.read_pickle('{}_df.pkl'.format(fnames['file']))
And, I face the following error:
Traceback (most recent call last):
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\process.py", line 249, in _bootstrap
self.run()
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "E:\Projects\Predictive Inbound Cartoon Estimation-MLO\Python\dataprep\DataPrep.py", line 497, in ParallelLoopTest
a = pd.read_pickle(fnames['a'])
File "C:\Users\dkanhar\Anaconda3\lib\site-packages\pandas\io\pickle.py", line 63, in read_pickle
return try_read(path, encoding='latin1')
File "C:\Users\dkanhar\Anaconda3\lib\site-packages\pandas\io\pickle.py", line 57, in try_read
return pc.load(fh, encoding=encoding, compat=True)
File "C:\Users\dkanhar\Anaconda3\lib\site-packages\pandas\compat\pickle_compat.py", line 118, in load
return up.load()
File "C:\Users\dkanhar\Anaconda3\lib\pickle.py", line 1039, in load
dispatch[key[0]](self)
File "C:\Users\dkanhar\Anaconda3\lib\site-packages\pandas\compat\pickle_compat.py", line 73, in load_newobj
obj = cls.__new__(cls, *args)
TypeError: function takes at most 0 arguments (1 given)
So, as you can see, pandas fails to unpickle the DataFrame no matter what I try. I tried pickling manually using pickle.dump() and pickle.load(), but even that failed with a similar error (TypeError: function takes at most 0 arguments (1 given)).
I also suspect this is a problem with the DataFrames themselves, as I created a random DataFrame using:
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)), columns=list('ABCD'))
Pickling it and unpickling it on the other end worked.
So, what could be the issue with my DataFrames? In what scenarios can a DataFrame not be pickled?
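For anyone trying to reproduce this, a diagnostic sketch (not a fix) that isolates which column breaks the pickle round trip:
import pickle

# pickle each column of DataFrame `a` separately and report any column
# whose round trip fails, along with its dtype
for col in a.columns:
    try:
        pickle.loads(pickle.dumps(a[[col]]))
    except Exception as exc:
        print(col, a[col].dtype, repr(exc))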
Note that it's around a 200 MB pickle file for DataFrame a and 25 MB for the pickled DataFrame d1.
Also, is there a way to post the head of the whole DataFrame? It has around 50 columns, which won't print in full and hence can't be added here; in the meantime I'm including the dtypes of both DataFrames above.
Any help would be really helpful.
PS: The following link is to my previous post, which describes the error when passing the DataFrame directly to the method call instead of pickling and unpickling:
Similar errors in MultiProcessing. Mismatch number of arguments to function
Update:
The following is a link to a Google spreadsheet with the head (n = 50) of both DataFrames in question:
https://docs.google.com/spreadsheets/d/1bGkpmV0__aPVUtc0HSeuRQpufu1T4pmKB4YxsQsPO50/edit?usp=sharing