Not sure what the problem is here... all I want is the first and only element in this Series:
>>> a
1 0-5fffd6b57084003b1b582ff1e56855a6!1-AB8769635...
Name: id, dtype: object
>>> len (a)
1
>>> type(a)
<class 'pandas.core.series.Series'>
>>> a[0]
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
a[0]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\core\indexes\base.py", line 2477, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4404)
File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4087)
File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)
File "pandas\_libs\hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:14031)
File "pandas\_libs\hashtable_class_helper.pxi", line 765, in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:13975)
KeyError: 0L
Why isn't that working? And how do I get the first element?
When the index is an integer index, you cannot use [] with a positional integer because the selection would be ambiguous (should it return based on label or position?). You need to either use a.iloc[0] explicitly or pass the label, a[1].
The following works because the index type is object:
a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
a
Out:
a 1
b 2
c 3
dtype: int64
a[0]
Out: 1
But for integer index, things are different:
a = pd.Series([1, 2, 3], index=[2, 3, 4])
a[2] # returns the first entry - label based
Out: 1
a[1] # raises a KeyError
KeyError: 1
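For completeness, positional access with .iloc is always unambiguous, so for the integer-indexed Series above (and likewise for the single-element Series in the question) the first element is simply:
a.iloc[0]  # position based - always returns the first entry
Out: 1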
Look at the following code:
import pandas as pd
import numpy as np
data1 = pd.Series(['a','b','c'],index=['1','3','5'])
data2 = pd.Series(['a','b','c'],index=[1,3,5])
print('keys data1: '+str(data1.keys()))
print('keys data2: '+str(data2.keys()))
print('base data1: '+str(data1.index.base))
print('base data2: '+str(data2.index.base))
print(data1['1':'3']) # Here we use the dictionary like slicing
print(data1[1:3]) # Here we use the integer like slicing
print(data2[1:3]) # Here we use the integer like slicing
keys data1: Index(['1', '3', '5'], dtype='object')
keys data2: Int64Index([1, 3, 5], dtype='int64')
base data1: ['1' '3' '5']
base data2: [1 3 5]
1 a
3 b
dtype: object
3 b
5 c
dtype: object
3 b
5 c
dtype: object
For data1 the dtype of the index is object, for data2 it is int64. Taking a look at Jake VanderPlas's Data Science Handbook, he writes: "a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary". Hence, if the index is of type "object" as in the case of data1, we have two different ways to access the values:
1. By dictionary-like (label-based) slicing/indexing:
data1['1':'3'] --> a, b
2. By integer-like (positional) slicing/indexing:
data1[1:3] --> b, c
If the index dtype is int64, as in the case of data2, pandas cannot decide whether we want dictionary-like or integer-like slicing, and for slices it defaults to integer-like (positional) slicing; consequently for data2[1:3] we get b, c, just as for data1 when we choose integer-like slicing.
Nevertheless, VanderPlas mentions one critical thing to keep in mind in that case:
"Notice that when you are slicing with an explicit index (i.e., data['a':'c']), the final index is included in
the slice, while when you’re slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.[...] These slicing and indexing conventions can be a source of confusion."
To overcome this confusion you can use loc for label-based slicing/indexing and iloc for position-based slicing/indexing, like:
import pandas as pd
import numpy as np
data1 = pd.Series(['a','b','c'],index=['1','3','5'])
data2 = pd.Series(['a','b','c'],index=[1,3,5])
print('data1.iloc[0:2]: ',str(data1.iloc[0:2]),sep='\n',end='\n\n')
# print(data1.loc[1:3]) --> Throws an error bacause there is no integer index of 1 or 3 (these are strings)
print('data1.loc["1":"3"]: ',str(data1.loc['1':'3']),sep='\n',end='\n\n')
print('data2.iloc[0:2]: ',str(data2.iloc[0:2]),sep='\n',end='\n\n')
print('data2.loc[1:3]: ',str(data2.loc[1:3]),sep='\n',end='\n\n') #Note that contrary to usual python slices, both the start and the stop are included
data1.iloc[0:2]:
1 a
3 b
dtype: object
data1.loc["1":"3"]:
1 a
3 b
dtype: object
data2.iloc[0:2]:
1 a
3 b
dtype: object
data2.loc[1:3]:
1 a
3 b
dtype: object
So data2.loc[1:3] explicitly looks for the labels 1 and 3 in the index and returns the values that lie between them (both endpoints included), while data2.iloc[0:2] returns the values from position 0 up to, but excluding, position 2.
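The same distinction applies to single-element access, which is exactly what trips up the question above (a small sketch using data1 and data2 from this answer; note that an integer key on a non-integer index is treated positionally in the pandas versions discussed here):
data1['1']     # label based -> 'a'
data1[1]       # integer key on an object index -> positional -> 'b'
data2[1]       # integer index -> label based -> 'a'
data2[0]       # KeyError: 0 is not a label (the error from the question)
data2.iloc[0]  # positional, always works -> 'a'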
I want to write a simple dataframe as an ORC file. The only column is of an integer type. If I set all values to None, an exception is raised by to_orc.
I understand that pyarrow cannot infer datatype from None values but what can I do to fix the datatype for output? Attempts to use .astype() only brought TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Bonus points if the solution also works for
empty dataframes
nested types
Script:
import pandas as pd

data = {'a': [1, 2]}
df = pd.DataFrame(data=data)
print(df)
df.to_orc('a.orc') # OK
df['a'] = None
print(df)
df.to_orc('a.orc') # fails
Output:
a
0 1
1 2
a
0 None
1 None
Traceback (most recent call last):
File ... line 9, in <module>
...
File "pyarrow/_orc.pyx", line 443, in pyarrow._orc.ORCWriter.write
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unknown or unsupported Arrow type: null
This is a known issue, see https://github.com/apache/arrow/issues/30317. The problem is that the ORC writer does not yet support writing a column of all nulls that has no specific dtype (i.e. a generic object dtype). If you first cast the column to, for example, float, then the writing works.
Using the df from your example:
>>> df.dtypes
a object
dtype: object
# the column has generic object dtype, cast to float
>>> df['a'] = df['a'].astype("float64")
>>> df.dtypes
a float64
dtype: object
# now writing to ORC and reading back works
>>> df.to_orc('a.orc')
>>> pd.read_orc('a.orc')
a
0 NaN
1 NaN
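For the bonus cases, a small helper along these lines can cast any all-null object columns to a concrete dtype before writing. This is only a sketch and not part of the linked issue; the function name and the float64 default are my own choices, and nested types would still need an explicit dtype of their own:

import pandas as pd

def cast_all_null_object_columns(df, dtype="float64"):
    # cast columns that are entirely null and have generic object dtype
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object and out[col].isna().all():
            out[col] = out[col].astype(dtype)
    return out

df = pd.DataFrame({'a': [None, None]})
cast_all_null_object_columns(df).to_orc('a.orc')  # writes without the ArrowNotImplementedError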
Is there a reason why pandas raises a ValueError exception when setting a DataFrame column using a list, but doesn't do the same when using a Series? This results in superfluous Series values being ignored (e.g. 7 in the example below).
>>> import pandas as pd
>>> df = pd.DataFrame([[1],[2]])
>>> df
0
0 1
1 2
>>> df[0] = [5,6,7]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3655, in __setitem__
self._set_item(key, value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3832, in _set_item
value = self._sanitize_column(value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 4529, in _sanitize_column
com.require_length_match(value, self.index)
File "D:\Python310\lib\site-packages\pandas\core\common.py", line 557, in require_length_match
raise ValueError(
ValueError: Length of values (3) does not match length of index (2)
>>>
>>> df[0] = pd.Series([5,6,7])
>>> df
0
0 5
1 6
Tested using python 3.10.6 and pandas 1.5.3 on Windows 10.
You are right, the behaviour is different between a list and a Series, but it is expected.
If you take a look at the source code in the frame.py module, you will see that if the value is a list it checks the length, whereas for a Series it does not check the length: the Series is aligned on the DataFrame's index and, as you observed, if the Series is longer the extra values are dropped.
NOTE: The details of this alignment/truncation are here
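A quick illustration of that alignment (a sketch, using the same df as in the question): only the labels that exist in the DataFrame's index (0 and 1) are taken from the Series, so the value at label 2 is silently ignored:

import pandas as pd

df = pd.DataFrame([[1], [2]])
s = pd.Series([5, 6, 7])   # labels 0, 1, 2
df[0] = s                  # aligned on df.index -> only labels 0 and 1 are used
print(df)
#    0
# 0  5
# 1  6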
This is a very naive question, but after referring to multiple articles I am raising it here. I have a column in the dataset which has numeric/blank/null values.
I have data like below:
fund_value
Null
123
-10
I wrote a method to handle it, but it doesn't work and keeps giving me this error:
def values(x):
    if x:
        if int(x) > 0:
            return 'Positive'
        elif int(x) < 0:
            return 'Negative'
    else:
        return 'Zero'
df2 = pd.read_csv('/home/siddhesh/Downloads/s2s_results.csv') # Assuming it as query results
df2 = df2.astype(str)
df2['fund_value'] = df2.fund_value.apply(values)
Error:
Traceback (most recent call last):
File "/home/../Downloads/pyspark/src/sample/actual_dataset_testin.py", line 31, in <module>
df2['fund_value'] = df2.fund_value.apply(values)
File "/home/../.local/lib/python3.8/site-packages/pandas/core/series.py", line 4357, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "/home/../.local/lib/python3.8/site-packages/pandas/core/apply.py", line 1043, in apply
return self.apply_standard()
File "/home/../.local/lib/python3.8/site-packages/pandas/core/apply.py", line 1099, in apply_standard
mapped = lib.map_infer(
File "pandas/_libs/lib.pyx", line 2859, in pandas._libs.lib.map_infer
File "/home/../Downloads/pyspark/src/sample/actual_dataset_testin.py", line 16, in values
if int(x) > 0:
ValueError: invalid literal for int() with base 10: 'nan'
I even tried if x == "" or if not x:, but nothing worked.
Expected Output:
fund_value
Zero
Positive
Negative
Considering df to be:
In [1278]: df = pd.DataFrame({'fund_value': [np.nan, 123, '', 10]})
In [1279]: df
Out[1279]:
fund_value
0 NaN
1 123
2
3 10
Use numpy.select with pd.to_numeric:
In [1246]: import numpy as np
In [1283]: df['fund_value'] = pd.to_numeric(df.fund_value, errors='coerce')
In [1247]: conds = [df.fund_value.gt(0), df.fund_value.lt(0)]
In [1250]: choices = ['Positive', 'Negative']
In [1261]: df['fund_value'] = np.select(conds, choices, default='Zero')
In [1288]: df
Out[1288]:
fund_value
0 Zero
1 Positive
2 Zero
3 Positive
You are facing a problem of NaN support with int: that is something plain int simply cannot handle.
Your solution: fill your "missing" values using .fillna(), either with something (e.g. 0) or remove them. Just read the values as float, which has native NaN support, then fill or remove those NaN.
Background: the fact that you first cast the column to str, but then in your check function convert it back to int (which gives you the NaN error), looks like a workaround... Here is what causes the problem: reading the column directly as int won't work, as int does not understand NaN --> see Int with a capital I below.
Example: assume you have a 'dirty int' input that includes NaN, like this:
df = pd.DataFrame({'fund_value': [None, 123, 10]})
fund_value
0 NaN
1 123.0
2 10.0
Pandas will do you the courtesy of converting this to float, given that all values are numeric, and fills the "gaps" (None or np.nan) with NaN. You get something on screen, but in fact it is a column of float, not int.
Option 1: "convert" NaN values to integer 0 (for your case distinction between 'Positive' and 'Negative'):
df.fillna(0).astype('int')
Option 2: Directly cast a column with NaN values to Int:
df.astype('Int32')
You can then work with either one of the results, both of which truly contain integers (option 1 assumes all NaN == 0, option 2 keeps a true <NA>), not floats.
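A quick illustration of the two options (a sketch; the exact rendering of <NA> depends on your pandas version):

df.fillna(0).astype('int')
#    fund_value
# 0           0
# 1         123
# 2          10

df.astype('Int32')
#    fund_value
# 0        <NA>
# 1         123
# 2          10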
Your df2.astype(str) turns everything into strings, and when you apply values(...) to the contents of the column, which are all strings, the first if-check will only evaluate to False for an empty string, which is not the case for str(np.nan): converting np.nan into a string gives you the non-empty string 'nan'.
'nan' therefore passes your first if-check, and then in the second if-check it turns out not to be convertible into an int, so Python raises an error.
To take care of that:
x = df['fund_value'].replace('', np.nan).astype(float)
(x > 0).map({True: 'Positive'}).fillna( (x < 0).map({True: 'Negative'}) ).fillna('Zero')
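Putting that together, here is a minimal end-to-end sketch (assuming the sample data from the question and the column name fund_value):

import numpy as np
import pandas as pd

df = pd.DataFrame({'fund_value': [np.nan, 123, -10]})
x = df['fund_value'].replace('', np.nan).astype(float)
df['fund_value'] = (x > 0).map({True: 'Positive'}).fillna(
    (x < 0).map({True: 'Negative'})).fillna('Zero')
print(df)
#   fund_value
# 0       Zero
# 1   Positive
# 2   Negative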
I have two pandas.Series objects, say a and b, having the same index, and when performing the difference a - b I get the error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
and I don't understand where it is coming from.
The Series a is obtained as a slice of a DataFrame whose index is a MultiIndex, and when I do a renaming
a.name = 0
the operation works fine (but if I rename to a tuple I get the same error).
Unfortunately, I am not able to reproduce a minimal example of the phenomenon (the difference of ad-hoc Series whose name is a tuple seems to work fine).
Any ideas on why this is happening?
If relevant, pandas version is 0.22.0
EDIT
The full traceback of the error:
----------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-e4efbf202d3c> in <module>()
----> 1 one - two
~/venv/lib/python3.4/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
727
728 if isinstance(rvalues, ABCSeries):
--> 729 name = _maybe_match_name(left, rvalues)
730 lvalues = getattr(lvalues, 'values', lvalues)
731 rvalues = getattr(rvalues, 'values', rvalues)
~/venv/lib/python3.4/site-packages/pandas/core/common.py in _maybe_match_name(a, b)
137 b_has = hasattr(b, 'name')
138 if a_has and b_has:
--> 139 if a.name == b.name:
140 return a.name
141 else:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
EDIT 2
Some more details on how a and b are obtained:
I have a DataFrame df whose index is a MultiIndex (year, id_)
I have a Series factors whose index is the columns of df (something like the standard deviation of the columns)
Then:
tmp = df.loc[(year, id_)]
a = tmp[factors != 0]
b = factors[factors != 0]
diff = a - b
and executing the last line the error happens.
EDIT 3
And it keeps happening even if I reduce the data: the original df has around 1000 rows and columns, but after reducing it to the last 10 rows and 5 columns the problem persists!
For example, by doing
df = df.iloc[-10:][df.columns[-5:]]
line = df.iloc[-3]
factors = factors[df.columns]
a = line[factors != 0]
b = factors[factors != 0]
diff = a - b
I keep getting the same error, while printing a and b I obtain
a:
end_bin_68.750_100.000 0.002413
end_bin_75.000_100.000 0.002614
end_bin_81.250_100.000 0.001810
end_bin_87.500_100.000 0.002313
end_bin_93.750_100.000 0.001609
Name: (2015, 10000030), dtype: float64
b:
end_bin_68.750_100.000 0.001244
end_bin_75.000_100.000 0.001242
end_bin_81.250_100.000 0.000918
end_bin_87.500_100.000 0.000659
end_bin_93.750_100.000 0.000563
Name: 1, dtype: float64
While if I manually create df and factors with these same values (also in the indices) the error does not happen.
EDIT 4
While debugging, when one gets to the function _maybe_match_name one obtains the following:
ipdb> type(a.name)
<class 'tuple'>
ipdb> type(b.name)
<class 'numpy.int64'>
ipdb> a.name == b.name
a = end_bin_68.750_100.000 0.002413
end_bin_75.000_100.000 0.002614
end_bin_81.250_100.000 0.001810
end_bin_87.500_100.000 0.002313
end_bin_93.750_100.000 0.001609
Name: (2015, 10000030), dtype: float64
b = end_bin_68.750_100.000 0.001244
end_bin_75.000_100.000 0.001242
end_bin_81.250_100.000 0.000918
end_bin_87.500_100.000 0.000659
end_bin_93.750_100.000 0.000563
Name: 1, dtype: float64
ipdb> (a.name == b.name)
array([False, False])
EDIT 5
Finally I got to a minimal example:
a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)
a - b
this raises the error for me, with np.__version__ == 1.14.0 and pd.__version__ == 0.22.0.
When an operation is performed between two pandas Series, pandas tries to give a name to the resulting Series.
s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "hello"
s3 = s1-s2
s3.name
>>> "hello"
If the names are not the same, the resulting Series has no name.
s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "goodbye"
s3 = s1-s2
s3.name
>>>
This is done by comparing the Series names with the function _maybe_match_name(), which is here on GitHub.
In your case the comparison apparently pits a tuple against a NumPy scalar, which returns an array of booleans instead of a single boolean (I haven't been able to reproduce the error), and using that array in an if statement raises the ValueError exception.
I guess it is a bug; what is weird is that np.int64(42) == ("A", "B") doesn't raise an exception for me.
But I have a FutureWarning from numpy:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison.
Which makes me think that you are using an extremely recent numpy version (did you compile it from the master branch on GitHub?).
The bug will likely be corrected in the next pandas release, as it is the result of a future change in the behavior of numpy.
My guess is that the best thing to do is just to rename your Series before performing the operation (e.g. b.name = None, as you already did), or to change your numpy version (1.15.0 works well).
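A minimal sketch of the rename workaround, using the Series from EDIT 5 (dropping the tuple name, or giving both Series the same name, avoids the problematic comparison):

import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)

b.name = None     # drop the tuple name before the arithmetic
diff = a - b      # no ValueError anymore
print(diff)
# 0   -3
# 1   -3
# 2   -3
# dtype: int64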
I got this error message:
5205
(5219, 25)
5221
(5219, 25)
Traceback (most recent call last):
File "/Users/Chu/Documents/dssg2018/sa4.py", line 44, in <module>
df.loc[idx,word]=len(df.iloc[indices[idx]][df[word]==1])/\
IndexError: index 5221 is out of bounds for axis 0 with size 5219
When I'm traversing the data frame, the index comes from the iterator. I don't know how this is even possible; idx comes directly from the dataframe:
import numpy as np
from sklearn.neighbors import BallTree

bt = BallTree(df[['lat', 'lng']], metric="haversine")
indices = bt.query_radius(df[['lat', 'lng']], r=(float(10)/40000)*360)
for idx, row in df.iterrows():
    for word in bag_of_words:
        if word in row['caption']:
            print(idx)
            print(df.shape)
            df.loc[idx, word] = len(df.iloc[indices[idx]][df[word] == 1]) / \
                np.max([1, len(df.iloc[indices[idx]][df[word] != 1])])
Changing iloc to loc gives:
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/Chu/Documents/dssg2018/sa4.py
(-124.60334244261675, 49.36453144316216, -121.67106179949566, 50.863501888419826)
27
(5219, 25)
/Users/Chu/Documents/dssg2018/sa4.py:42: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
df.loc[idx,word]=len(df.loc[indices[idx]][df[word]==1])/\
/Users/Chu/Documents/dssg2018/sa4.py:42: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df.loc[idx,word]=len(df.loc[indices[idx]][df[word]==1])/\
Traceback (most recent call last):
File "/Users/Chu/Documents/dssg2018/sa4.py", line 42, in <module>
df.loc[idx,word]=len(df.loc[indices[idx]][df[word]==1])/\
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2173, in _getitem_array
key = check_bool_indexer(self.index, key)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 2023, in check_bool_indexer
raise IndexingError('Unalignable boolean Series provided as '
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match
Your index does not run from 0 to len(df)-1, so df.iloc[idx] can go out of bounds.
For example:
df = pd.DataFrame({'a': [0, 1]}, index=[1, 100])
for idx, row in df.iterrows():
    print(idx)
    print(row)
1
a 0
Name: 1, dtype: int64
100
a 1
Name: 100, dtype: int64
Then when you do
df.iloc[100]
IndexError: single positional indexer is out-of-bounds
But when you do .loc you get the expected output.
df.loc[100]
Out[23]:
a 1
Name: 100, dtype: int64
From the docs:
.iloc: .iloc[] is primarily integer position based
.loc: .loc[] is primarily label based
Solution:
Use .loc, or df = df.reset_index(drop=True)
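A minimal sketch of the reset_index option (toy data, not the original df): after the reset, the labels yielded by iterrows() are valid positions again, so .iloc works inside the loop:

import pandas as pd

df = pd.DataFrame({'a': [0, 1]}, index=[1, 100])
df = df.reset_index(drop=True)    # index is now 0..len(df)-1
for idx, row in df.iterrows():
    print(df.iloc[idx]['a'])      # positional access now matches idx
# 0
# 1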