Is there a Series method that acts like update but returns the updated series instead of updating in place?
Put another way, is there a better way to do this:
from pandas import Series

# Original series
ser1 = Series([10, 9, 8, 7], index=[1, 2, 3, 4])
# I want to change ser1 to be [10, 1, 2, 7]
adj_ser = Series([1, 2], index=[2, 3])
adjusted = my_method(ser1, adj_ser)
# Is there a builtin that does this already?
def my_method(current_series, adjustments):
    x = current_series.copy()
    x.update(adjustments)
    return x
One possible solution is combine_first, but it updates adj_ser with ser1 (the priority is the reverse of update), and it casts the integers to floats:
adjusted = adj_ser.combine_first(ser1)
print (adjusted)
1 10.0
2 1.0
3 2.0
4 7.0
dtype: float64
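If the float upcast is the only remaining problem, one workaround (a sketch, assuming the combined result contains no NaN so the cast back is safe) is to restore the original dtype afterwards:
adjusted = adj_ser.combine_first(ser1).astype(ser1.dtype)
print(adjusted)
# 1    10
# 2     1
# 3     2
# 4     7
# dtype: int64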
@nixon is right that iloc and loc are good for this kind of thing.
import pandas as pd
# Original series
ser1 = pd.Series([10, 9, 8, 7], index=[1,2,3,4])
ser2 = ser1.copy()
ser3 = ser1.copy()
# I want to change ser1 to be [10, 1, 2, 7]
# One way
ser2.iloc[1:3] = [1,2]
ser2 # [10, 1, 2, 7]
# Another way
ser3.loc[[2, 3]] = [1, 2]
ser3 # [10, 1, 2, 7]
Why two different methods?
As this post explains quite well, the major difference between loc and iloc is labels vs. position. My personal shorthand: if you're making adjustments based on the zero-based position of a value, use iloc; otherwise use loc. YMMV.
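For example, with ser1 above (labels 1 through 4, but positions 0 through 3), the two select different elements:
ser1.iloc[1]  # 9  -- position 1, the second element (zero-based)
ser1.loc[1]   # 10 -- label 1, the first element of this index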
No built-in function other than update, but you can use mask with a Boolean series:
def my_method(current_series, adjustments):
    bools = current_series.index.isin(adjustments.index)
    return current_series.mask(bools, adjustments)
However, as the masking process introduces intermediary NaN values, your series will be upcast to float, so your update-based solution is best.
Here is another way:
adjusted = ser1.mask(ser1.index.isin(adj_ser.index), adj_ser)
adjusted
Output:
1 10
2 1
3 2
4 7
dtype: int64
Related
Sorry if the title is confusing; I am new to pandas and tried to be as concise as possible. Basically, I have a dataframe I'm reading in, and for each of the attributes I need to quantize the values to the nearest 2 by rounding. My approach is to turn them into bins with the values (-1.01, 1.00], (1.00, 3.00], ... and then, from the bins, find out how many values fall in each so I know what the quantized data is. I can see the values using value_counts(), but I want to be able to do something with the bins similar to df['Some_Attribute'].loc[df['Some_Attribute'] < 20]; if I replace 'Some_Attribute' with the bin column it errors.
I've tried using value_counts() and then turning it into a list to do it manually, but while I can get a list of the values, it's not sorted, and I'm not sure how I'd know which value in the array corresponds to which range. I've also tried experimenting and googling with .loc[] in case I got the syntax wrong, but I haven't been able to figure it out.
Edit: To provide better context
Sample_Input:
Age
1.9
2.0
2.4
5.9
6.0
6.4
df = pd.read_csv("Sample_Input.csv",names=attributes, header=0)
df['Age_Bins'] = pd.cut(df['Age'], two_bins)
df['Age_Bins'].loc[df['Age_Bins'] < 8.0]
df['Age_Bins'].loc[df['Age_Bins'] < 6.0]
If I run this I will get the error
TypeError: Invalid comparison between dtype=category and int
The output I would like for the last two lines is 6 and then 3. If I try this with a dataframe that wasn't cut, it works, so I'm assuming it's trying to compare with the actual ranges instead of the number of values in each range. Ideally I would like to get this working with .loc[], but if that's not possible, how do I get it into an array sorted by the ranges?
I might be wrong but I think you're looking for cumulative sum at each bin:
import pandas as pd
# sample you provided
df = pd.DataFrame({'age': [1.9, 2.0, 2.4, 5.9, 6.0, 6.4]})
# some bins to show how it works
pd.cut(df['age'], bins=[0, 2, 4, 6, 8], right=False).value_counts(sort=False).cumsum()
Output:
[0, 2) 1
[2, 4) 3
[4, 6) 4
[6, 8) 6
Name: age, dtype: int64
To cut into bins and then see how many each has:
df['age_bins'] = pd.cut(df['age'], bins=[0, 2, 4, 6, 8], right=False)
df.groupby('age_bins').agg('count')
Output:
age
age_bins
[0, 2) 1
[2, 4) 2
[4, 6) 1
[6, 8) 2
Again, .cumsum() is applicable here:
df.groupby('age_bins').agg('count').cumsum()
Output:
age
age_bins
[0, 2) 1
[2, 4) 3
[4, 6) 4
[6, 8) 6
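If the goal is to select bins below a threshold (the original .loc attempt), one option is to filter the counts by each bin's right edge. This is a sketch: it assumes the bin labels are pandas Interval objects, which expose a .right attribute.
counts = pd.cut(df['age'], bins=[0, 2, 4, 6, 8], right=False).value_counts(sort=False)

# Sum the counts of every bin that ends at or below the cutoff
below_6 = counts[[interval.right <= 6 for interval in counts.index]].sum()
below_6  # 4 with the sample data above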
In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?
Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]
DSM's solution works, but if you wanted a direct equivalent to which you could do (df.columns == name).nonzero()
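For example, with DSM's frame above (a small sketch; the returned position depends on your actual column order):
name = "pear"
(df.columns == name).nonzero()
# (array([2]),) if the columns are ordered apple, orange, pear as shown above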
For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index (columns only support unique labels), use get_indexer_for. It takes the same arguments as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing, e.g. for float values, taking the nearest value within a tolerance. If two indices have the same distance to the specified label, or are duplicates, the index with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
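As a sketch of that non-exact lookup (method and tolerance are the standard get_indexer keyword arguments; the tie-breaking follows the rule described above):
df.index.get_indexer([0, 1], method='nearest', tolerance=0.2)
# expected: array([0, 2]) -- 1 is equidistant from 0.9 and 1.1, so the larger label is taken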
If you are looking to find multiple column matches, a vectorized solution using the searchsorted method can be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
import numpy as np

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
apple banana pear orange peach
0 8 3 4 4 2
1 4 4 3 0 1
2 1 2 6 8 1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])
Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around to the OP question), you can use:
>>> df.columns.values[location]
Using @DSM's example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location]  # (thanks to @roobie-nuby for pointing that out in the comments)
To modify DSM's answer a bit: get_loc has some weird properties depending on the type of index in the current version of pandas (1.1.5), so depending on your index type you might get back an integer, a mask, or a slice. This is somewhat frustrating for me because I don't want to work with the entire columns object just to extract one variable's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
How about this:
import numpy as np
import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]
When the column might or might not exist, the following variant of the above works:
try:
    ix = list(df.columns).index('Col_X')
except ValueError:
    ix = None

if ix is None:
    # column not found; do something
    pass
import random
import pandas as pd

def char_range(c1, c2):  # question 7001144
    for c in range(ord(c1), ord(c2) + 1):
        yield chr(c)

df = pd.DataFrame()
for c in char_range('a', 'z'):
    df[f'{c}'] = random.sample(range(10), 3)  # random data

rearranged = random.sample(range(26), 26)  # random column order
df = df.iloc[:, rearranged]

print(df.iloc[:, :15])  # 15-column view

for col in df.columns:  # list of indices and columns
    print(str(df.columns.get_loc(col)) + '\t' + col)
Simple MATLAB code, e.g. A(5+(1:3)) gives [A(6), A(7), A(8)].
In the above, A is a vector or a matrix. For instance:
A = [1 2 3 4 5 6 7 8 9 10];
A(5+(1:3))
ans =
6 7 8
Note that MATLAB indexing starts at 1, not 0.
How can I do the same in Python?
You are looking for slicing behavior:
A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> A[5:8]
[6, 7, 8]
If A is some function that you want to call with parameters 6, 7, and 8, you could use a list comprehension.
answers = [A(6+i) for i in range(3)]
You want to do two things.
First, create a range of indices (5 + (1:3)), which in Python is range(6, 9) (range(start, stop) in general).
Second, apply a function to each range index. This could be done with map or a for loop.
The for loop solutions have been addressed, so here's a map based one:
result = list(map(A, your_range))  # wrap in list() on Python 3 to materialize the result
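A minimal runnable sketch, assuming A is a callable; the function body and the range below are hypothetical stand-ins:
def A(i):
    # hypothetical stand-in for whatever A computes
    return i * 10

result = list(map(A, range(6, 9)))  # indices 6, 7, 8
print(result)  # [60, 70, 80]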
Use a list comprehension:
x = 5
f = 1 # from
t = 3 # till
print([x + i for i in range(f, t + 1)])
If you are trying to use subscripts to create a list that is a subset of the whole list, remember that Python slices are zero-based and exclude the end index, so MATLAB's A(6:8) becomes:
subset_list = A[5:8]  # elements 6, 7, 8
In Python you can do it easily with A[5:5+3]. You can also refer to the 5 and the 3 through variables:
b = 5
c = 3
A[b:b+c]
So I have two pandas timeseries, and the indexes on both are timestamps. The thing is, not all of the timestamps exist in both timeseries. I want to perform a linear regression on the points that are matched up, ignoring those which have no 'pair'.
This is my current solution, but it seems somewhat verbose and ugly:
indexes_used = sorted(set(series1.index).intersection(series2.index))
perform_regression(series1.loc[indexes_used], series2.loc[indexes_used])
Alternatively, I was thinking of doing (but creating a temporary dataframe seems redundant):
temp_frame = pd.concat([series1, series2], axis=1).dropna()  # axis=1 keeps the timestamps as rows
perform_regression(blabla)
Is there a good way to do this?
How about Series.align:
import pandas as pd
a = pd.Series([4, 5, 6, 7], index=[1, 2, 3, 4])
b = pd.Series([49, 54, 62, 74], index=[2, 6, 4, 0])
a2, b2 = a.align(b, join="inner")
The output:
2 5
4 7
dtype: int64
2 49
4 62
dtype: int64
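Applied to the original problem, that becomes (perform_regression here is the asker's own, hypothetical function):
series1_aligned, series2_aligned = series1.align(series2, join="inner")
perform_regression(series1_aligned, series2_aligned)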