Populate Pandas Series with list - python

I would like to populate a pd.Series() with a list.
I tried doing the following:
series = pd.Series(index=['a','b','c','d'])
series['a'] = 2
series['b'] = [2,3]
This is the error that I get:
File "C:\Users\Sergej Shteriev\Anaconda3\lib\site-packages\pandas\core\internals.py", line 940, in setitem
values[indexer] = value
ValueError: setting an array element with a sequence.
How can I store a list in the pd.Series?

This is because the initial dtype is assumed to be float (as the series is filled with NaNs).
series.dtype
# dtype('float64')
Since lists can only be stored in object-dtype Series, you'd need to cast before assigning.
series = series.astype(object)
series['b'] = [2, 3]
series
a 2.0 # this is still a float
b [2, 3]
c NaN
d NaN
dtype: object
series.tolist()
# [2.0, [2, 3], nan, nan]
A better suggestion is to declare series as an object at the start if that's what you intend stuffing into it.
series = pd.Series(index=['a','b','c','d'], dtype=object)
series['a'] = 2
series['b'] = [2, 3]
series
a 2
b [2, 3]
c NaN
d NaN
dtype: object
series.tolist()
# [2, [2, 3], nan, nan]
Of course, for performance reasons, I don't recommend this. You're better off using plain Python lists -- they're usually faster than object Series.
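The performance caveat can be illustrated with a quick sketch (the exact timings are machine- and version-dependent, so treat the numbers as indicative only):

```python
import timeit

import pandas as pd

obj_ser = pd.Series(range(1000), dtype=object)
plain_list = list(range(1000))

# Summing an object-dtype Series falls back to Python-level iteration,
# so a plain list is typically at least as fast
t_series = timeit.timeit(lambda: sum(obj_ser), number=100)
t_list = timeit.timeit(lambda: sum(plain_list), number=100)
print(f'object Series: {t_series:.4f}s, list: {t_list:.4f}s')
```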


pandas.Series method that returns updated series

Is there a Series method that acts like update but returns the updated series instead of updating in place?
Put another way, is there a better way to do this:
# Original series
ser1 = Series([10, 9, 8, 7], index=[1,2,3,4])
# I want to change ser1 to be [10, 1, 2, 7]
adj_ser = Series([1, 2], index=[2,3])
adjusted = my_method(ser1, adj_ser)
# Is there a builtin that does this already?
def my_method(current_series, adjustments):
    x = current_series.copy()
    x.update(adjustments)
    return x
One possible solution is combine_first, but note that it works in the reverse direction (it updates adj_ser with values from ser1), and it also casts integers to floats:
adjusted = adj_ser.combine_first(ser1)
print (adjusted)
1 10.0
2 1.0
3 2.0
4 7.0
dtype: float64
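If every value involved is an integer, the float upcast from combine_first can be undone afterwards with astype; a sketch of the same example:

```python
import pandas as pd

ser1 = pd.Series([10, 9, 8, 7], index=[1, 2, 3, 4])
adj_ser = pd.Series([1, 2], index=[2, 3])

# combine_first keeps adj_ser's values where present and fills the rest
# from ser1; the NaN handling upcasts to float, so cast back to int
adjusted = adj_ser.combine_first(ser1).astype(int)
print(adjusted.tolist())  # [10, 1, 2, 7]
```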
@nixon is right that iloc and loc are good for this kind of thing.
import pandas as pd
# Original series
ser1 = pd.Series([10, 9, 8, 7], index=[1,2,3,4])
ser2 = ser1.copy()
ser3 = ser1.copy()
# I want to change ser1 to be [10, 1, 2, 7]
# One way
ser2.iloc[1:3] = [1,2]
ser2 # [10, 1, 2, 7]
# Another way
ser3.loc[[2, 3]] = [1, 2]
ser3 # [10, 1, 2, 7]
Why two different methods?
As this post explains quite well, the major difference between loc and iloc is labels vs position. My personal shorthand is if you're trying to make adjustments based on the zero-index position of a value use iloc otherwise loc. YMMV
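The label-vs-position distinction is easiest to see on a Series whose labels are not the zero-based positions:

```python
import pandas as pd

# Integer labels starting at 1, so label 1 != position 1
s = pd.Series([10, 9, 8, 7], index=[1, 2, 3, 4])

print(s.iloc[1])  # 9  -> second element, by position
print(s.loc[1])   # 10 -> element with label 1
```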
No built-in function other than update, but you can use mask with a Boolean series:
def my_method(current_series, adjustments):
    bools = current_series.index.isin(adjustments.index)
    return current_series.mask(bools, adjustments)
However, as the masking process introduces intermediary NaN values, your series will be upcast to float. So your update solution is best.
Here is another way:
adjusted = ser1.mask(ser1.index.isin(adj_ser.index), adj_ser)
adjusted
Output:
1 10
2 1
3 2
4 7
dtype: int64

How to get the position of certain columns in dataframe - Python [duplicate]

In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?
Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]
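A quick usage sketch; note that names absent from the frame are silently skipped, and modern pandas preserves the dict's insertion order for the columns:

```python
import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
cols = ['apple', 'banana', 'pear']  # 'banana' does not exist in df

# Missing names are filtered out by the `c in df` check
print([df.columns.get_loc(c) for c in cols if c in df])  # [1, 0]
```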
DSM's solution works, but if you wanted a direct equivalent to which you could do (df.columns == name).nonzero()
For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index (columns only support unique labels), use get_indexer_for. It takes the same args as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing, e.g. for float values, taking the nearest value within a tolerance. If two indices have the same distance to the specified label or are duplicates, the index with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
If you are looking to find multiple column matches, a vectorized solution using the searchsorted method can be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
apple banana pear orange peach
0 8 3 4 4 2
1 4 4 3 0 1
2 1 2 6 8 1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])
Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around from the OP's question), you can use (note that .values is an attribute, not a method):
>>> df.columns.values[location]
Using @DSM's example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location]  # (thanks to @roobie-nuby for pointing that out in the comments.)
To modify DSM's answer a bit: get_loc has some weird properties depending on the type of index in the current version of pandas (1.1.5), so depending on your Index type you might get back an index, a mask, or a slice. This is somewhat frustrating when all you want is one column's position. Much simpler is to avoid the method altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
How about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]
When the column might or might not exist, the following variant of the above works:
ix = None
try:
    ix = list(df.columns).index('Col_X')
except ValueError:
    pass
if ix is None:
    # do something
import random

def char_range(c1, c2):  # question 7001144
    for c in range(ord(c1), ord(c2) + 1):
        yield chr(c)

df = pd.DataFrame()
for c in char_range('a', 'z'):
    df[f'{c}'] = random.sample(range(10), 3)  # random data

rearranged = random.sample(range(26), 26)  # random order
df = df.iloc[:, rearranged]
print(df.iloc[:, :15])  # 15-column view

for col in df.columns:  # list of indices and columns
    print(str(df.columns.get_loc(col)) + '\t' + col)

np.maximum for scalar and pandas Series without np.nan

I have a list of pd.Series and scalar values (float and int) for which I'd like to find the element-wise maximum (the Series all have the same length). If there is a np.nan value, another value should be used (np.nan only if nothing but NaNs are available). This works fine as long as the Series or values in the list don't contain NaN values, but if they do, the NaNs dominate the resulting series.
rv = input_list[0]
for s in input_list[1:]:
    rv = np.maximum(s, rv)
As an example
input_list = [pd.Series([1, 2, 3, 1]), 2, pd.Series([3, 1, np.nan, 4])]
should return:
pd.Series([3, 2, 3, 4])
How can I modify this code to take care of nan values and ignore them if there are alternative values?
Solution using numpy.nanmax
You are looking for numpy.nanmax. From its documentation:
Return the maximum of an array or maximum along an axis, ignoring any
NaNs. When all-NaN slices are encountered a RuntimeWarning is raised
and NaN is returned for that slice.
So if you know that the length of the series is n:
n = 4
result = pd.Series(np.nanmax(
    [np.full(n, i) if np.isscalar(i) else i for i in input_list], axis=0))
Running it on the example:
input_list = [pd.Series([1, 2, 3, 1]), 2, pd.Series([3, 1, np.nan, 4])]
result = pd.Series(np.nanmax(
    [np.full(n, i) if np.isscalar(i) else i for i in input_list], axis=0))
Output:
0 3.0
1 2.0
2 3.0
3 4.0
dtype: float64
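For comparison, a pandas-only sketch of the same idea: broadcast the scalars to full-length Series, line everything up as columns, and take the row-wise max, which skips NaN by default:

```python
import numpy as np
import pandas as pd

input_list = [pd.Series([1, 2, 3, 1]), 2, pd.Series([3, 1, np.nan, 4])]

# Broadcast the scalars to full-length Series so everything aligns
n = max(len(s) for s in input_list if isinstance(s, pd.Series))
columns = [s if isinstance(s, pd.Series) else pd.Series(np.full(n, s))
           for s in input_list]

# Row-wise max across the aligned columns; skipna=True is the default
result = pd.concat(columns, axis=1).max(axis=1)
print(result.tolist())  # [3.0, 2.0, 3.0, 4.0]
```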

pandas Series idxmax() fails when dtype is dtype('O')

I believe a change with a recent version causes the call to idxmax() to fail in this case where it used to work before. I am not saying it is a regression, I'm trying to understand the reason and the correct call to issue.
type(sss)
<class 'pandas.core.series.Series'>
sss.dtype
dtype('O')
type(sss.index)
<class 'pandas.core.indexes.base.Index'>
sss.index = Index([...strings...], dtype='object', length=112)
The Series holds numeric values, with many NaNs and some valid numbers.
All indices are strings.
I am searching for the index of the maximum of the column.
How can I obtain that?
I can't replicate on pandas 0.19.2. You can convert to float and then use pd.Series.idxmax:
df = pd.DataFrame({'A': [0, 1.5, 1.0, np.nan, np.nan, 54, 19, np.nan]}, dtype=object,
index=list('abcdefgh'))
res = df['A'].astype(float).idxmax() # 'f'
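The same astype-then-idxmax approach works directly on a Series; here is a small sketch with a made-up object-dtype Series mirroring the question's setup:

```python
import numpy as np
import pandas as pd

# Hypothetical object-dtype Series: numbers mixed with NaNs, string labels
sss = pd.Series([0, 1.5, np.nan, 54, np.nan, 19],
                index=list('abcdef'), dtype=object)

# Cast to float so idxmax can skip the NaNs and compare numerically
print(sss.astype(float).idxmax())  # 'd'
```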
Option One
s.index[np.argmax(s.tolist())]
Option Two
max(s.index, key=s.get)
Numeric Demo
s = pd.Series([0, 8, 4, 3], list('WXYZ'), object)
s
W 0
X 8
Y 4
Z 3
dtype: object
s.index[np.argmax(s.tolist())]
'X'
max(s.index, key=s.get)
'X'
String Demo
s = pd.Series(list('5Z4A'), list('ABCD'), object)
s
A 5
B Z
C 4
D A
dtype: object
s.index[np.argmax(s.tolist())]
'B'
max(s.index, key=s.get)
'B'

