I am new to Python. This may seem like a basic question, but I really want to understand what is happening here:
import numpy as np
import pandas as pd
tempdata = np.random.random(5)
myseries_one = pd.Series(tempdata)
myseries_two = pd.Series(data = tempdata, index = ['a','b','c','d','e'])
myseries_three = pd.Series(data = tempdata, index = [10,11,12,13,14])
myseries_one
Out[1]:
0 0.291293
1 0.381014
2 0.923360
3 0.271671
4 0.605989
dtype: float64
myseries_two
Out[2]:
a 0.291293
b 0.381014
c 0.923360
d 0.271671
e 0.605989
dtype: float64
myseries_three
Out[3]:
10 0.291293
11 0.381014
12 0.923360
13 0.271671
14 0.605989
dtype: float64
Indexing the first element of each Series:
myseries_one[0] #As expected
Out[74]: 0.29129291112626043
myseries_two[0] #As expected
Out[75]: 0.29129291112626043
myseries_three[0]
KeyError:0
Doubt 1: Why is this happening? Why does myseries_three[0] give me a KeyError?
What do we mean by calling myseries_one[0], myseries_two[0] or myseries_three[0]? Does calling this way mean we are selecting by row name?
Doubt 2: Do row names and row numbers in Python work differently from row names and row numbers in R?
myseries_one[0:2]
Out[78]:
0 0.291293
1 0.381014
dtype: float64
myseries_two[0:2]
Out[79]:
a 0.291293
b 0.381014
dtype: float64
myseries_three[0:2]
Out[80]:
10 0.291293
11 0.381014
dtype: float64
Doubt 3: If calling myseries_three[0] means calling by row name, then how does myseries_three[0:2] produce output? Does myseries_three[0:2] mean we are calling by row number? Please explain and guide. I am migrating from R to Python, so it's a bit confusing for me.
When you are attempting to slice with myseries[something], the something is often ambiguous. You are highlighting a case of that ambiguity. In your case, pandas is trying to help you out by guessing what you mean.
myseries_one[0] #As expected
Out[74]: 0.29129291112626043
myseries_one has integer labels. It makes sense that when you attempt to slice with an integer, you intend to get the element that is labeled with that integer. It turns out that you have an element labeled 0, and so that is returned to you.
myseries_two[0] #As expected
Out[75]: 0.29129291112626043
myseries_two has string labels. It's highly unlikely that you meant to slice this series with a label of 0 when labels are all strings. So, pandas assumes that you meant a position of 0 and returns the first element (thanks pandas, that was helpful).
myseries_three[0]
KeyError:0
myseries_three has integer labels and you are attempting to slice with an integer... perfect. Let's just get that value for you... KeyError. Whoops, that index label does not exist. In this case, it is safer for pandas to fail than to guess that maybe you meant to slice by position. The documentation even suggests that if you want to remove the ambiguity, use loc for label-based slicing and iloc for position-based slicing.
Let's try loc
myseries_one.loc[0]
0.29129291112626043
myseries_two.loc[0]
KeyError:0
myseries_three.loc[0]
KeyError:0
Only myseries_one has a label 0. The other two return KeyErrors.
Let's try iloc
myseries_one.iloc[0]
0.29129291112626043
myseries_two.iloc[0]
0.29129291112626043
myseries_three.iloc[0]
0.29129291112626043
They all have a position of 0 and return the first element accordingly.
For the range slicing, pandas decides to be less interpretive and sticks to positional slicing for the integer slice 0:2. Keep in mind that actual real people (the programmers writing pandas code) are the ones making these decisions. When you are attempting to do something that is ambiguous, you may get varying results. To remove the ambiguity, use loc and iloc.
iloc
myseries_one.iloc[0:2]
0 0.291293
1 0.381014
dtype: float64
myseries_two.iloc[0:2]
a 0.291293
b 0.381014
dtype: float64
myseries_three.iloc[0:2]
10 0.291293
11 0.381014
dtype: float64
loc
myseries_one.loc[0:2]
0 0.291293
1 0.381014
2 0.923360
dtype: float64
myseries_two.loc[0:2]
TypeError: cannot do slice indexing on <class 'pandas.indexes.base.Index'> with these indexers [0] of <type 'int'>
myseries_three.loc[0:2]
Series([], dtype: float64)
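For completeness (not shown in the outputs above, but it follows from the same series): with loc, an integer slice on myseries_three is interpreted by label, and label slices include both endpoints, which is also why myseries_three.loc[0:2] came back empty.
myseries_three.loc[10:12]
10    0.291293
11    0.381014
12    0.923360
dtype: float64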
Related
I am trying to replace every empty cell in a dataset with the mean of its column.
I use modifiedData = data.fillna(data.mean())
but it only works on the integer columns.
I also have a column with float values, and on it fillna does not work.
Why?
.fillna() works on values that are NaN, and the concept of NaN can't exist in an int column: the pandas int dtype does not support NaN.
If you have a column with what seem to be numbers, it is more likely an object column, perhaps even filled with strings, strings that are empty in some cases.
Empty strings are not filled by .fillna():
In [8]: pd.Series(["2", "1", ""]).fillna(0)
Out[8]:
0 2
1 1
2
dtype: object
An easy way to figure out what's going on is to use the df.Column.isna() method.
If that method gives you all False, you know there are no NaN values to fill.
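For instance, on the example series above (a quick illustrative check), none of the elements is NaN, not even the empty string, so isna() reports all False:
In [9]: pd.Series(["2", "1", ""]).isna()
Out[9]:
0    False
1    False
2    False
dtype: bool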
To turn the empty strings into NaN values:
In [11]: s = pd.Series(["2", "1", ""])
In [12]: empty_string_mask = s.str.len() == 0
In [21]: s.loc[empty_string_mask] = float('nan')
In [22]: s
Out[22]:
0 2
1 1
2 NaN
dtype: object
After that you can use .fillna():
In [23]: s.fillna(0)
Out[23]:
0 2
1 1
2 0
dtype: object
Another way of going about this problem is to check the dtype
df.column.dtype
If it says 'object', that confirms your issue.
You can then cast the column to a float column:
df.column = df.column.astype(float)
Though manipulating dtypes in pandas usually leads to pains, this may be an easier route to take for this particular problem.
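As a side note (not part of the answer above, just an alternative worth knowing): pd.to_numeric with errors='coerce' converts the strings and turns anything unparseable, including empty strings, into NaN in a single step, after which fillna works as expected.
In [24]: pd.to_numeric(pd.Series(["2", "1", ""]), errors='coerce').fillna(0)
Out[24]:
0    2.0
1    1.0
2    0.0
dtype: float64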
Consider this simple example
import pandas as pd
df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [1, 0, 0]})
df
Out[9]:
one two
0 1 1
1 2 0
2 3 0
I want to write a function that takes as inputs a dataframe df and a column mycol.
Now this works:
df.groupby('one').two.sum()
Out[10]:
one
1 1
2 0
3 0
Name: two, dtype: int64
this works too:
def okidoki(df, mycol):
    return df.groupby('one')[mycol].sum()
okidoki(df, 'two')
Out[11]:
one
1 1
2 0
3 0
Name: two, dtype: int64
but this FAILS
def megabug(df, mycol):
    return df.groupby('one').mycol.sum()
megabug(df, 'two')
AttributeError: 'DataFrameGroupBy' object has no attribute 'mycol'
What is wrong here?
I am worried that okidoki uses some chaining that might create some subtle bugs (https://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing).
How can I still keep the syntax groupby('one').mycol? Can the mycol string be converted to something that might work that way?
Thanks!
You pass a string as the second argument. In effect, you're trying to do something like:
df.'two'
Which is invalid syntax. If you're trying to dynamically access a column, you'll need to use the index notation, [...] because the dot/attribute accessor notation doesn't work for dynamic access.
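Applied to the function from the question, that looks something like this (a sketch using the same df and mycol):
def megabug(df, mycol):
    # mycol is a string, so look the column up with [] instead of attribute access
    return df.groupby('one')[mycol].sum()

megabug(df, 'two')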
Dynamic access on its own is possible. For example, you can use getattr (but I don't recommend this, it's an antipattern):
In [674]: df
Out[674]:
one two
0 1 1
1 2 0
2 3 0
In [675]: getattr(df, 'one')
Out[675]:
0 1
1 2
2 3
Name: one, dtype: int64
Dynamically selecting by attribute from a groupby call can be done, something like:
In [677]: getattr(df.groupby('one'), mycol).sum()
Out[677]:
one
1 1
2 0
3 0
Name: two, dtype: int64
But don't do it. It is a horrid anti-pattern, and much less readable than df.groupby('one')[mycol].sum().
I think you need [] for selecting a column by name, which is the general solution for selecting columns, because selecting by attribute has many exceptions:
You can use this access only if the index element is a valid python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
def megabug(df, mycol):
    return df.groupby('one')[mycol].sum()

print(megabug(df, 'two'))
one
1 1
2 0
3 0
Name: two, dtype: int64
I have a pandas dataframe. I have a column that could potentially have null values or an array of string values in it. But I'm having trouble working out how to store values in this column.
This is my code now:
df_completed = df[df.completed]
df['links'] = None

for i, row in df_completed.iterrows():
    results = get_links(row['nct_id'])
    if results:
        df[df.nct_id == row['nct_id']].links = results
        print df[df.nct_id == row['nct_id']].links
But this has two problems:
When results is an array of length 1, the printed output is None, rather than the array, so I think I must be saving the value wrong
When results is a longer array, the line where I save the value produces an error: ValueError: Length of values does not match length of index
What am I doing wrong?
I am not sure it's advisable to store arrays in pandas like this; have you considered serialising the array contents and then storing that instead?
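For example, a rough sketch of the serialisation idea (this is an assumption about your data, not code from your question; it stores a JSON string rather than a list):
import json

# assign one JSON string per matching row; 'links', 'nct_id', row and results come from your loop
df.loc[df.nct_id == row['nct_id'], 'links'] = json.dumps(results)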
If storing an array is what you're after anyway, then you can try the set_value() method, like this (make sure you take care of the dtype of the column you store into):
In [35]: df = pd.DataFrame(data=np.random.rand(5,5), columns=list('ABCDE'))
In [36]: df
Out[36]:
A B C D E
0 0.741268 0.482689 0.742200 0.210650 0.351758
1 0.798070 0.929576 0.522227 0.280713 0.168999
2 0.413417 0.481230 0.304180 0.894934 0.327243
3 0.797061 0.561387 0.247033 0.330608 0.294618
4 0.494038 0.065731 0.538588 0.095435 0.397751
In [38]: df.dtypes
Out[38]:
A float64
B float64
C float64
D float64
E float64
dtype: object
In [39]: df.A = df.A.astype(object)
In [40]: df.dtypes
Out[40]:
A object
B float64
C float64
D float64
E float64
dtype: object
In [41]: df.set_value(0, 'A', ['some','values','here'])
Out[41]:
A B C D E
0 [some, values, here] 0.482689 0.742200 0.210650 0.351758
1 0.79807 0.929576 0.522227 0.280713 0.168999
2 0.413417 0.481230 0.304180 0.894934 0.327243
3 0.797061 0.561387 0.247033 0.330608 0.294618
4 0.494038 0.065731 0.538588 0.095435 0.397751
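One caveat, not part of the original answer: set_value() has since been deprecated and removed in newer pandas releases. On a recent version, .at is the equivalent scalar accessor (a sketch, assuming the column has already been cast to object dtype as above):
In [42]: df.at[0, 'A'] = ['some', 'values', 'here']   # same effect as set_value above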
I hope this helps!
Let df be a DataFrame with an index of dtype x. Due to implicit type conversions, I end up having to call df.loc[n], where n was implicitly converted from x to a trivially different type y. Most commonly this happens with int converted to float. For example:
In [35]: df = DataFrame([[1, 2.0]]).set_index(0, drop=False); df
Out[35]:
0 1
0
1 1 2
In [36]: df[0]
Out[36]:
0
1 1
Name: 0, dtype: int64
In [37]: df.loc[1]
Out[37]:
0 1
1 2
Name: 1, dtype: float64
In [38]: df.loc[df.loc[1][0]]
[...]
KeyError: 1.0
As you can see, the value in the first column - which also serves as the index - was implicitly converted from int64 to float64 when it was returned from the df.loc[1] operation. Unfortunately once that happens, I can no longer use this value for further df.loc operations, since the index is still int64, and looking up a float64 value in an int64 index will always fail.
What's the best way to handle this?
You could explicitly recast when doing your .loc:
>>> df.loc[df.loc[1].astype('int64')[0]]
0 1
1 2
Name: 1, dtype: float64
Or, more generally:
df.loc[df.loc[1].astype(df[0].dtype)[0]]
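Another option, an observation rather than part of the answer above: the upcast only happens because df.loc[1] builds the whole row as a mixed-dtype Series. Pulling the scalar straight out of the column instead keeps the column's int64 dtype, so no recast is needed:
>>> df.loc[df.at[1, 0]]   # df.at[1, 0] (or df.loc[1, 0]) returns an int64 scalar
0    1
1    2
Name: 1, dtype: float64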
I am extracting a subset of my dataframe by index using either .xs or .loc (they seem to behave the same). When my condition retrieves multiple rows, the result stays a dataframe. When only a single row is retrieved, it is automatically converted to a series. I don't want that behavior, since that means I need to handle multiple cases downstream (different method sets available for series vs dataframe).
In [1]: df = pd.DataFrame({'a': range(7), 'b': ['one']*4 + ['two'] + ['three']*2,
                           'c': range(10, 17)})
In [2]: df.set_index('b', inplace=True)
In [3]: df.xs('one')
Out[3]:
a c
b
one 0 10
one 1 11
one 2 12
one 3 13
In [4]: df.xs('two')
Out[4]:
a 4
c 14
Name: two, dtype: int64
In [5]: type(df.xs('two'))
Out[5]: pandas.core.series.Series
I can manually convert that series back to a dataframe, but it seems cumbersome and will also require case testing to see if I should do that. Is there a cleaner way to just get a dataframe back to begin with?
IIUC, you can simply add brackets, [], and use .loc:
>>> df.loc["two"]
a 4
c 14
Name: two, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>
>>> df.loc[["two"]]
a c
b
two 4 14
[1 rows x 2 columns]
>>> type(_)
<class 'pandas.core.frame.DataFrame'>
This may remind you of how numpy advanced indexing works:
>>> a = np.arange(9).reshape(3,3)
>>> a[1]
array([3, 4, 5])
>>> a[[1]]
array([[3, 4, 5]])
Now this will probably require some refactoring of code so that you're always accessing with a list, even if the list only has one element, but it works well for me in practice.
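If reworking every call site is impractical, one option is a tiny wrapper that does the list-wrapping for you (a hypothetical helper, not from the answer above):
def rows_as_frame(df, key):
    # wrap scalar keys in a list so .loc always returns a DataFrame,
    # even when only a single row matches
    if not isinstance(key, list):
        key = [key]
    return df.loc[key]

rows_as_frame(df, 'two')    # 1-row DataFrame
rows_as_frame(df, 'one')    # 4-row DataFrame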