Is it possible to grab a given value in a Pandas column and change it to a previous row value?
For instance, I have this DataFrame:
Date Price Signal
2018-01-01 13380.00 1
2018-01-02 14675.11 0
2018-01-03 14919.51 0
2018-01-04 15059.54 0
2018-01-05 16960.39 0
2018-01-06 17069.79 -1
2018-01-07 16150.03 0
2018-01-08 14902.54 0
2018-01-09 14400.00 1
2018-01-10 14907.09 0
2018-01-11 13238.78 0
2018-01-12 13740.01 -1
2018-01-13 14210.00 0
I would like to replace the zeros in the Signal column with either 1 or -1. The final DataFrame should be this:
Date Price Signal
2018-01-01 13380.00 1
2018-01-02 14675.11 1
2018-01-03 14919.51 1
2018-01-04 15059.54 1
2018-01-05 16960.39 1
2018-01-06 17069.79 -1
2018-01-07 16150.03 -1
2018-01-08 14902.54 -1
2018-01-09 14400.00 1
2018-01-10 14907.09 1
2018-01-11 13238.78 1
2018-01-12 13740.01 -1
2018-01-13 14210.00 -1
Try this:
df['Signal'].replace(to_replace=0, method='ffill')
(assuming your DataFrame is called df)
If what you want is to propagate previous values to the following rows, use:
df["Signal"] = df["Signal"].ffill()
import pandas as pd
Prepare a dataframe
If you have a dataframe:
df = pd.DataFrame([1,0,1,0,1], columns=['col1'])
Use pure apply()
You can do:
def replace(num):
    if num == 1:
        return 1
    if num == 0:
        return -1
Then apply it to the column holding the values you want to replace:
df['new'] = df['col1'].apply(replace)
Use apply() with a lambda function
You can achieve the same with a lambda function:
df['col1'].apply(lambda num: 1 if num == 1 else -1)
Use Built-in methods
Using the DataFrame we prepared, you can do:
df['new'] = df['col1'].replace(to_replace=0, value=-1)
If you don't want to create a new column and just want to replace the values in the existing one, you can do it in place:
df['col1'].replace(to_replace=0, value=-1, inplace=True)
Clean up
If you created a new column and don't want to keep the old one, you can drop it:
df.drop('col1',axis=1)
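Note that drop returns a new DataFrame rather than modifying df in place, so assign the result back (or pass inplace=True) if you want to keep the change:
df = df.drop('col1', axis=1)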
Related
I have a master data frame and an auxiliary data frame. Both have the same timestamp index and columns, with the master having a few more columns. I want to copy a certain column's data from aux to master.
My code:
import numpy as np
import pandas as pd

maindf = pd.DataFrame({'A': [0.0, np.nan], 'B': [10, 20], 'C': [100, 200]}, index=pd.date_range(start='2020-05-04 08:00:00', freq='1h', periods=2))
auxdf = pd.DataFrame({'A': [1, 2], 'B': [30, 40]}, index=pd.date_range(start='2020-05-04 08:00:00', freq='1h', periods=2))
maindf =
A B C
2020-05-04 08:00:00 0.0 10 100
2020-05-04 09:00:00 NaN 20 200
auxdf =
A B
2020-05-04 08:00:00 1 30
2020-05-04 09:00:00 2 40
Expected answer: I want to take column A's data in auxdf and copy it to maindf by matching the index.
maindf =
A B C
2020-05-04 08:00:00 1 10 100
2020-05-04 09:00:00 2 20 200
My solution:
maindf['A'] = auxdf['A']
My solution is not correct because I am copying values directly without checking for a matching index. How do I achieve this?
You can use .update(), as follows:
maindf['A'].update(auxdf['A'])
.update() uses non-NA values from the passed Series to make updates, and it aligns on the index.
Note also that the original dtype of maindf['A'] is retained: it remains float even though auxdf['A'] is of int type.
Result:
print(maindf)
A B C
2020-05-04 08:00:00 1.0 10 100
2020-05-04 09:00:00 2.0 20 200
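To illustrate that .update() matches rows by index label rather than by position, here is a small sketch (aux_reversed is just a hypothetical name for a reordered copy):
# Reversing the row order of auxdf makes no difference: alignment is by index.
aux_reversed = auxdf.iloc[::-1]
maindf['A'].update(aux_reversed['A'])   # same result as above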
I have a column that calculates the duration in seconds from A to B, where A and B are in hh:mm:ss format. However, A and B may hold null values in the data.
Let's say A=05:15:00 and B=NaT; then the subtraction will return 5:15 as seconds, which is misleading and wrong in context due to B being effectively infinity! How can I specify to only subtract column B from A if both columns are NOT NULL?
This is the code I have:
df['A_to_B'] = (df.B - df.A).dt.total_seconds()
Python does not use null, but it does have None to represent the absence of a value. So you would check that df.B and df.A are both not None, perhaps like this:
if (df.A is not None) and (df.B is not None):
    df['A_to_B'] = (df.B - df.A).dt.total_seconds()
You can do:
df['A_to_B'] = np.where(df['A'].notna() & df['B'].notna(),
                        (df['A'] - df['B']).dt.total_seconds(),
                        np.nan)
Sample data
A B
0 05:15:00 NaT
1 NaT 00:00:15
2 05:15:00 00:15:00
Output:
A B A_to_B
0 05:15:00 NaT NaN
1 NaT 00:00:15 NaN
2 05:15:00 00:15:00 18000.0
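For reference, a self-contained sketch that rebuilds the sample data above as timedeltas (None becomes NaT) and applies the np.where approach:
import numpy as np
import pandas as pd

# Rebuild the sample data; pd.to_timedelta turns None into NaT for the missing entries.
df = pd.DataFrame({'A': pd.to_timedelta(['05:15:00', None, '05:15:00']),
                   'B': pd.to_timedelta([None, '00:00:15', '00:15:00'])})
df['A_to_B'] = np.where(df['A'].notna() & df['B'].notna(),
                        (df['A'] - df['B']).dt.total_seconds(),
                        np.nan)
print(df)   # row 2: 05:15:00 - 00:15:00 -> 18000.0 seconds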
Hello, I'm trying to merge/roll two DataFrames.
I would like to merge 'dfDates' and 'dfProducts', then roll the product group/members from 'dfProducts' forward until the date on which a new group/members becomes available.
I tried an outer join between both DataFrames, but I don't know how to roll the groups.
Below is what the DataFrames look like and what I would like 'dfFinal' to be:
dfProducts
Date Product
2018-01-01 A
2018-01-01 B
2018-01-01 C
2018-01-03 D
2018-01-03 E
2018-01-03 F
dfDates
Date
2018-01-01
2018-01-02
2018-01-03
2018-01-04
dfFinal
Date Product
2018-01-01 A
2018-01-01 B
2018-01-01 C
2018-01-02 A
2018-01-02 B
2018-01-02 C
2018-01-03 D
2018-01-03 E
2018-01-03 F
2018-01-04 D
2018-01-04 E
2018-01-04 F
The easiest option I can see is to group everything by date first, then reindex to your desired range, which leaves NaNs in the empty spots, then ffill those:
(
    dfProducts
    .groupby("Date")
    ['Product']
    .apply(list)
    .reindex(pd.date_range(start=dfDates['Date'].min(),
                           end=dfDates['Date'].max(), freq='D'))
    .ffill()
    .explode()
)
2018-01-01 A
2018-01-01 B
2018-01-01 C
2018-01-02 A
2018-01-02 B
2018-01-02 C
2018-01-03 D
2018-01-03 E
2018-01-03 F
2018-01-04 D
2018-01-04 E
2018-01-04 F
Name: Product, dtype: object
Define the following function:
def getLastDateRows(dat, df):
    rows = df.query('Date == @dat')
    n = rows.index.size
    if n == 0:
        lastDat = df.Date[df.Date < dat].iloc[-1]
        rows = df.query('Date == @lastDat')
    return pd.DataFrame({'Date': dat, 'Product': rows.Product})
Then apply it to each dfDates.Date and concat the results:
pd.concat(dfDates.Date.apply(getLastDateRows, df=dfProducts).tolist(),
          ignore_index=True)
The result is just as expected.
Appendix
The solution proposed by Randy can be a bit improved:
dfProducts.groupby('Date').Product.apply(list)\
.reindex(dfDates.Date).ffill().explode().reset_index()
Differences:
The reindex is on dfDates.Date (not the whole date range), so the result will contain only dates present in dfDates, which can include intentional "gaps", e.g. for weekends.
The final call to reset_index causes the result to be a DataFrame (not a Series).
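For completeness, here is a self-contained sketch of the improved version, rebuilding the sample frames from the question (names as in the question; dfFinal is the desired result):
import pandas as pd

dfProducts = pd.DataFrame({'Date': pd.to_datetime(['2018-01-01'] * 3 + ['2018-01-03'] * 3),
                           'Product': list('ABCDEF')})
dfDates = pd.DataFrame({'Date': pd.date_range('2018-01-01', '2018-01-04')})

# Group products per date, reindex on the dates of interest, roll forward, explode back to rows.
dfFinal = (dfProducts.groupby('Date').Product.apply(list)
           .reindex(dfDates.Date).ffill().explode().reset_index())
print(dfFinal)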
I have a df like,
stamp value
0 00:00:00 2
1 00:00:00 3
2 01:00:00 5
Converting to timedelta:
df['stamp']=pd.to_timedelta(df['stamp'])
Slicing only the odd indexes and adding 30 minutes:
odd_df=pd.to_timedelta(df[1::2]['stamp'])+pd.to_timedelta('30 min')
#print(odd_df)
1 00:30:00
Name: stamp, dtype: timedelta64[ns]
Now, updating df with odd_df. As per the documentation, it should give my expected output.
Expected output:
df.update(odd_df)
#print(df)
stamp value
0 00:00:00 2
1 00:30:00 3
2 01:00:00 5
What I am getting:
df.update(odd_df)
#print(df)
stamp value
0 00:30:00 00:30:00
1 00:30:00 00:30:00
2 00:30:00 00:30:00
Please help, what is wrong with this?
Try this instead:
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')
This ensures you update just the values selected by the .loc indexer while keeping the rest of your original DataFrame intact. To test, run df.shape; you will get (3, 2) with the method above.
In your code here:
odd_df=pd.to_timedelta(df[1::2]['stamp'])+pd.to_timedelta('30 min')
odd_df only contains the rows you sliced from your original DataFrame; its shape is (1,).
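Putting it together, a minimal sketch that reproduces the frame from the question and applies the .loc fix:
import pandas as pd

df = pd.DataFrame({'stamp': pd.to_timedelta(['00:00:00', '00:00:00', '01:00:00']),
                   'value': [2, 3, 5]})

# Add 30 minutes only to the odd-indexed rows, in place; the shape stays (3, 2).
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')
print(df)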
The data is given as following:
return
2010-01-04 0.016676
2010-01-05 0.003839
...
2010-01-05 0.003839
2010-01-29 0.001248
2010-02-01 0.000134
...
What I want is to extract, for each month, the value on the last day of that month that appears in the data:
2010-01-29 0.00134
2010-02-28 ......
If I directly use pandas resample, i.e. df.resample('M').last(), I would select the correct rows but with the wrong index (it automatically uses the calendar last day of the month as the index):
2010-01-31 0.00134
2010-02-28 ......
How can I get the correct answer in a Pythonic way?
An assumption made here is that your date data is part of the index. If not, I recommend setting it first.
Single Year
I don't think the resampling or grouper functions would do the job here. Let's group on the month number instead and call DataFrameGroupBy.tail.
df.groupby(df.index.month).tail(1)
Multiple Years
If your data spans multiple years, you'll need to group on the year and month. Using a single grouper created from strftime—
df.groupby(df.index.strftime('%Y-%m')).tail(1)
Or, using multiple groupers—
df.groupby([df.index.year, df.index.month]).tail(1)
Note—if your index is not a DatetimeIndex as assumed here, you'll need to replace df.index with pd.to_datetime(df.index, errors='coerce') above.
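As a small sketch, the multi-year grouping applied to the data from the question (assuming the dates are already the index):
import pandas as pd

df = pd.DataFrame({'return': [0.016676, 0.003839, 0.001248, 0.000134]},
                  index=pd.to_datetime(['2010-01-04', '2010-01-05',
                                        '2010-01-29', '2010-02-01']))
# Keep the last row that actually appears in each (year, month) group,
# preserving the original dates as the index.
print(df.groupby([df.index.year, df.index.month]).tail(1))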
Although this doesn't answer the question properly, I'll leave it here in case someone is interested.
An approach which only works if you are certain you have all days (important!) is to add 1 day with pd.Timedelta and check whether the resulting day == 1. I did a small running-time test and it is 6x faster than the groupby solution.
df[(df['dates'] + pd.Timedelta(days=1)).dt.day == 1]
Or if index:
df[(df.index + pd.Timedelta(days=1)).day == 1]
Full example:
import pandas as pd
df = pd.DataFrame({
    'dates': pd.date_range(start='2016-01-01', end='2017-12-31'),
    'i': 1
}).set_index('dates')
dfout = df[(df.index + pd.Timedelta(days=1)).day == 1]
print(dfout)
Returns:
i
dates
2016-01-31 1
2016-02-29 1
2016-03-31 1
2016-04-30 1
2016-05-31 1
2016-06-30 1
2016-07-31 1
2016-08-31 1
2016-09-30 1
2016-10-31 1
2016-11-30 1
2016-12-31 1
2017-01-31 1
2017-02-28 1
2017-03-31 1
2017-04-30 1
2017-05-31 1
2017-06-30 1
2017-07-31 1
2017-08-31 1
2017-09-30 1
2017-10-31 1
2017-11-30 1
2017-12-31 1