I have a dataframe like this,
ID 00:00 01:00 02:00 ... 23:00 avg_value
22 4.7 5.3 6 ... 8 5.5
37 0 9.2 4.5 ... 11.2 9.2
4469 2 9.8 11 ... 2 6.4
Can I use np.where to apply a condition to multiple columns at once?
I want to replace the values from 00:00 to 23:00 with 0s and 1s: if the value at that time of day is greater than avg_value, set it to 1, otherwise to 0.
I know how to do this for a single column:
np.where(df['00:00'] > df['avg_value'], 1, 0)
Can I extend it to multiple columns?
Output will be like,
ID 00:00 01:00 02:00 ... 23:00 avg_value
22 0 1 1 ... 1 5.5
37 0 0 0 ... 1 9.2
4469 0 1 1 ... 0 6.4
Select all columns except the last with DataFrame.iloc, compare them with DataFrame.gt, cast to integers, and finally add the avg_value column back with DataFrame.join:
df = df.iloc[:, :-1].gt(df['avg_value'], axis=0).astype(int).join(df['avg_value'])
print (df)
00:00 01:00 02:00 23:00 avg_value
ID
22 0 0 1 1 5.5
37 0 0 0 1 9.2
4469 0 1 1 0 6.4
Or use DataFrame.pop to extract the column first:
s = df.pop('avg_value')
df = df.gt(s, axis=0).astype(int).join(s)
print (df)
00:00 01:00 02:00 23:00 avg_value
ID
22 0 0 1 1 5.5
37 0 0 0 1 9.2
4469 0 1 1 0 6.4
This is needed because assigning back into the same columns converts the integers to floats (a pandas bug):
df.iloc[:, :-1] = df.iloc[:, :-1].gt(df['avg_value'], axis=0).astype(int)
print (df)
00:00 01:00 02:00 23:00 avg_value
ID
22 0.0 0.0 1.0 1.0 5.5
37 0.0 0.0 0.0 1.0 9.2
4469 0.0 1.0 1.0 0.0 6.4
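As for the literal np.where question: yes, the comparison can be done on the whole block of hour columns at once, since DataFrame.gt already broadcasts avg_value across them, and np.where then just swaps the booleans for 0/1. A minimal sketch, with a small frame rebuilt by hand from the question (only two hour columns shown):
import numpy as np
import pandas as pd

df = pd.DataFrame({'00:00': [4.7, 0, 2],
                   '01:00': [5.3, 9.2, 9.8],
                   'avg_value': [5.5, 9.2, 6.4]},
                  index=pd.Index([22, 37, 4469], name='ID'))

hours = df.columns[:-1]  # every hour column, i.e. everything except avg_value
# np.where returns a plain 2-D array, so wrap it back into a DataFrame
flags = pd.DataFrame(np.where(df[hours].gt(df['avg_value'], axis=0), 1, 0),
                     index=df.index, columns=hours)
out = flags.join(df['avg_value'])
print (out)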
I have data with date, time, and values and want to calculate a forward-looking rolling maximum for each date:
Date Time Value Output
01/01/2022 01:00 1.3 1.4
01/01/2022 02:00 1.4 1.2
01/01/2022 03:00 0.9 1.2
01/01/2022 04:00 1.2 NaN
01/02/2022 01:00 5 4
01/02/2022 02:00 4 3
01/02/2022 03:00 2 3
01/02/2022 04:00 3 NaN
I have tried this:
df = df.sort_values(by=['Date','Time'], ascending=True)
df['rollingmax'] = df.groupby(['Date'])['Value'].rolling(window=4,min_periods=0).max()
df = df.sort_values(by=['Date','Time'], ascending=False)
but that doesn't seem to work...
It looks like you want a shifted reverse rolling max:
n = 4
df['Output'] = (df[::-1]
    .groupby('Date')['Value']
    .apply(lambda g: g.rolling(n-1, min_periods=1).max().shift())
)
Output:
Date Time Value Output
0 01/01/2022 01:00 1.3 1.4
1 01/01/2022 02:00 1.4 1.2
2 01/01/2022 03:00 0.9 1.2
3 01/01/2022 04:00 1.2 NaN
4 01/02/2022 01:00 5.0 4.0
5 01/02/2022 02:00 4.0 3.0
6 01/02/2022 03:00 2.0 3.0
7 01/02/2022 04:00 3.0 NaN
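A self-contained way to try this out; the frame is rebuilt from the question's sample, and group_keys=False is added as a precaution so that on recent pandas versions apply keeps the original row index (otherwise the Date key gets prepended and the assignment no longer aligns):
import pandas as pd

df = pd.DataFrame({
    'Date': ['01/01/2022'] * 4 + ['01/02/2022'] * 4,
    'Time': ['01:00', '02:00', '03:00', '04:00'] * 2,
    'Value': [1.3, 1.4, 0.9, 1.2, 5, 4, 2, 3],
})

n = 4
# reverse the rows, roll a max of width n-1 within each date, then shift one
# step so each row ends up with the max of the n-1 rows that follow it in time
df['Output'] = (df[::-1]
    .groupby('Date', group_keys=False)['Value']
    .apply(lambda g: g.rolling(n - 1, min_periods=1).max().shift())
)
print (df)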
I have a dataframe, x_train, with three variables, a datetime index that takes a reading every 5 minutes, and an ID column:
x_train
Time ID var_1 var_2 var_3
2020-01-01 00:00:00 1 9.3 4.2 2.4
2020-01-02 00:00:05 1 3.5 4.5 7.6
2020-01-01 00:00:00 2 2.1 7.6 4.5
2020-01-02 00:00:05 2 3.9 7.5 7.0
and a second dataframe, y_train, with labels for the mode each ID is in:
y_train
Time ID mode label
2020-01-01 00:00:00 1 1 B
2020-01-02 00:00:05 1 1 B
2020-01-01 00:00:00 2 0 A
2020-01-02 00:00:05 2 0 A
I want to slice the data by ID and by time with a step size of 1 day, i.e. 288 rows, as this data is time-series dependent. So far I've managed to split the data by ID using groupby, but I'm not sure how to apply the time slicing.
Here's what I've tried:
FEATURE_COLUMNS = X_train.columns.to_list()
sequences = []
for ID, group in X_train.groupby("ID"):
    sequence_features = group[FEATURE_COLUMNS]
    label = y_train[y_train.ID == ID].iloc[0].label
    sequences.append((sequence_features, label))
This gives me one slice per ID, but nothing is sliced by time:
( ID var_1 var_2 var_3
Time
2016-01-09 01:55:00 2 0.402679 0.588398 0.560771
2016-03-22 11:40:00 2 0.382457 0.507188 0.450901
2016-02-29 09:40:00 2 0.344540 0.652963 0.607460
2016-01-06 01:00:00 2 0.384479 0.825977 0.499619
2016-01-19 18:10:00 2 0.437563 0.631526 0.479827
... ... ... ... ...
2016-01-10 23:30:00 2 0.366026 0.829760 0.636387
2016-01-22 18:25:00 2 0.976997 0.350567 0.674448
2016-01-28 06:30:00 2 0.975986 0.719546 0.727988
2016-02-27 04:15:00 2 0.451972 0.674149 0.470185
2016-03-10 19:15:00 2 0.354146 0.423203 0.487947
[17673 rows x 4 columns],
'b')
I feel I need to add a line that tells the loop to only look at 288 rows per ID at a time, but I'm not sure how to write it.
Edit: my sliced output also reorders the datetime index in a strange way; is there a way to fix this?
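A minimal sketch of one way to do the 288-row windowing per ID, using the names from the question (X_train, y_train); the non-overlapping window size and the sort_index call (which should also fix the odd ordering mentioned in the edit) are assumptions:
WINDOW = 288  # one day of 5-minute readings

sequences = []
for ID, group in X_train.groupby("ID"):
    group = group.sort_index()  # restore chronological order within each ID
    label = y_train.loc[y_train.ID == ID, "label"].iloc[0]
    # walk this ID's rows in consecutive, non-overlapping 288-row chunks,
    # dropping any trailing partial window
    for start in range(0, len(group) - WINDOW + 1, WINDOW):
        sequences.append((group.iloc[start:start + WINDOW], label))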
I have two dataframes of the same size, the same columns and same index.
df1:
symbol fund1 fund2 fund3 ... ... ...
id datetime
10 2012-10-19 09:05:00 -100 0 0 50 0 0
20 2012-10-19 09:10:00 0 300 0 0 0 0
df2:
symbol fund1 fund2 fund3 ... ... ...
id datetime
10 2012-10-19 09:05:00 -0.5 0 0 0.005 0 0
20 2012-10-19 09:10:00 0 -10 0 0 0 0
I would like to get a new dataframe that takes the values from df1 only where the sign of each element in df1 is NOT the same as (i.e. is the opposite of) the sign in df2.
So, the result for the example would be:
df_outcome:
symbol fund1 fund2 fund3 ... ... ...
id datetime
10 2012-10-19 09:05:00 0 0 0 0 0 0
20 2012-10-19 09:10:00 0 300 0 0 0 0
I've found the function np.sign(df). I think I should first apply it to both tables, but how do I then compare the two "sign" tables element by element and, where the signs are opposite, take the values from df1?
You can use DataFrame.where with np.sign and an inequality test:
df1.where(np.sign(df1) != np.sign(df2)).fillna(0)
Output:
fund1 fund2 fund3 fund4 fund5 fund6
id datetime
10 2012-10-19 09:05:00 0.0 0.0 0.0 0.0 0.0 0.0
20 2012-10-19 09:10:00 0.0 300.0 0.0 0.0 0.0 0.0
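The fill value can also be passed straight to where instead of chaining fillna; assuming 0 is indeed the value you want for the non-opposite cells, this skips the NaN intermediate (and the float upcast that comes with it):
result = df1.where(np.sign(df1) != np.sign(df2), 0)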
I have two dfs; one is longer than the other, but they both have one column that contains the same values.
Here is my first df called weather:
DATE AWND PRCP SNOW WT01 WT02 TAVG
0 2017-01-01 5.59 0.00 0.0 NaN NaN 46
1 2017-01-02 9.17 0.21 0.0 1.0 NaN 40
2 2017-01-03 10.74 0.58 0.0 1.0 NaN 42
3 2017-01-04 8.05 0.00 0.0 1.0 NaN 47
4 2017-01-05 7.83 0.00 0.0 NaN NaN 34
Here is my 2nd df called bike:
DATE LENGTH ID AMOUNT
0 2017-01-01 3 1 5
1 2017-01-01 6 2 10
2 2017-01-02 9 3 100
3 2017-01-02 12 4 250
4 2017-01-03 15 5 45
So I want to copy the matching columns from the weather df into the bike df based on the shared DATE column.
DATE LENGTH ID AMOUNT AWND SNOW TAVG
0 2017-01-01 3 1 5 5.59 0 46
1 2017-01-01 6 2 10 5.59 0 46
2 2017-01-02 9 3 100 9.17 0 40
3 2017-01-02 12 4 250 9.17 0 40
4 2017-01-03 15 5 45 10.74 0 42
Please help! Maybe some type of join can be used.
Use merge:
In [93]: bike.merge(weather[['DATE', 'AWND', 'SNOW', 'TAVG']], on='DATE')
Out[93]:
DATE LENGTH ID AMOUNT AWND SNOW TAVG
0 2017-01-01 3 1 5 5.59 0.0 46
1 2017-01-01 6 2 10 5.59 0.0 46
2 2017-01-02 9 3 100 9.17 0.0 40
3 2017-01-02 12 4 250 9.17 0.0 40
4 2017-01-03 15 5 45 10.74 0.0 42
Alternatively, set the same index on both frames and assign the columns directly:
bike = bike.set_index('DATE')
bike[['AWND', 'SNOW', 'TAVG']] = weather.set_index('DATE')[['AWND', 'SNOW', 'TAVG']]
If you check the pandas docs, they explain all the different types of "merges" (joins) that you can do between two dataframes.
The common syntax for a merge looks like: pd.merge(weather, bike, on='DATE')
You can also make the merge fancier by adding any of the arguments listed below to your merge call (e.g. specifying whether you want an inner vs. a right join).
Here are the arguments the function takes, based on the current pandas docs:
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Hope it helps!
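For example, if some dates in bike had no matching weather row and you still wanted to keep those rows, a left join would do it (a small sketch using the frames from the question):
merged = bike.merge(weather[['DATE', 'AWND', 'SNOW', 'TAVG']], on='DATE', how='left')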
I am querying my database to show records from the past week. I am then aggregating the data and transposing it in python and pandas into a DataFrame.
In this table I am attempting to show what occurred on each day of the past 7 days; however, on some days no events occur, and in those cases the date is missing altogether. I am looking for an approach to append the dates that are not present (but are part of the date range specified in the query) so that I can then fillna the other missing columns with any value I wish.
In some trials I have the dates as the index of the DataFrame and in others as a column. I would prefer to have the dates as the top index, so I can group by name, stack purchase and send_back, and have the dates as the 'columns'.
Here is an example of how the dataframe looks now and what I am looking for:
The query covers the dates 01.08.2016 - 08.08.2016, and the dataframe looks like so:
| dates | name | purchase | send_back
0 01.08.2016 Michael 120 0
1 02.08.2016 Sarah 100 40
2 04.08.2016 Sarah 55 0
3 05.08.2016 Michael 80 20
4 07.08.2016 Sarah 130 0
After:
| dates | name | purchase | send_back
0 01.08.2016 Michael 120 0
1 02.08.2016 Sarah 100 40
2 03.08.2016 - 0 0
3 04.08.2016 Sarah 55 0
4 05.08.2016 Michael 80 20
5 06.08.2016 - 0 0
6 07.08.2016 Sarah 130 0
7 08.08.2016 Sarah 0 35
8 08.08.2016 Michael 20 0
Printing df.index gives:
Index([u'dates', u'name', u'purchase', u'send_back'], dtype='object')
RangeIndex(start=0, stop=1, step=1)
I appreciate any guidance.
Assuming you have the following DF:
In [93]: df
Out[93]:
name purchase send_back
dates
2016-08-01 Michael 120 0
2016-08-02 Sarah 100 40
2016-08-04 Sarah 55 0
2016-08-05 Michael 80 20
2016-08-07 Sarah 130 0
you can resample and replace:
In [94]: df.resample('D').first().replace({'name':{np.nan:'-'}}).fillna(0)
Out[94]:
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 Sarah 100.0 40.0
2016-08-03 - 0.0 0.0
2016-08-04 Sarah 55.0 0.0
2016-08-05 Michael 80.0 20.0
2016-08-06 - 0.0 0.0
2016-08-07 Sarah 130.0 0.0
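Note that first() over the empty days introduces NaN, which turns purchase and send_back into floats (hence the 120.0 above). If you want integers back after filling, a cast at the end works; here the result is named out for illustration:
out = df.resample('D').first().replace({'name': {np.nan: '-'}}).fillna(0)
out[['purchase', 'send_back']] = out[['purchase', 'send_back']].astype(int)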
Your dates are of object (string) type and you must convert them to datetime format first.
from datetime import datetime

# Converting the object dates to datetime
df['dates'] = df['dates'].apply(lambda x: datetime.strptime(x, "%d.%m.%Y"))
# Setting the index column
df.set_index(['dates'], inplace=True)
# Choosing a date range extending from the first date to the last date with daily frequency
new_index = pd.date_range(start=df.index[0], end=df.index[-1], freq='D')
new_index.name = df.index.name
# Setting the new index
df = df.reindex(new_index)
# Making the required modifications: '-' for the name column, 0 for the numeric columns
df.iloc[:, 0] = df.iloc[:, 0].fillna('-')
df.iloc[:, 1:] = df.iloc[:, 1:].fillna(0)
print (df)
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 Sarah 100.0 40.0
2016-08-03 - 0.0 0.0
2016-08-04 Sarah 55.0 0.0
2016-08-05 Michael 80.0 20.0
2016-08-06 - 0.0 0.0
2016-08-07 Sarah 130.0 0.0
Let's suppose you have data for a single day (as mentioned in the comments section) and you would like to fill the other days of the week with null values:
Data Setup:
df = pd.DataFrame({'dates': ['01.08.2016'], 'name': ['Michael'],
                   'purchase': [120], 'send_back': [0]})
print (df)
dates name purchase send_back
0 01.08.2016 Michael 120 0
Operations:
df['dates'] = df['dates'].apply(lambda x: datetime.strptime(x, "%d.%m.%Y"))
df.set_index(['dates'], inplace=True)
# Setting periods as 7 to account for the end of the week
new_index = pd.date_range(start=df.index[0], periods=7, freq='D')
new_index.name = df.index.name
# Setting the new index
df = df.reindex(new_index)
print (df)
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 NaN NaN NaN
2016-08-03 NaN NaN NaN
2016-08-04 NaN NaN NaN
2016-08-05 NaN NaN NaN
2016-08-06 NaN NaN NaN
2016-08-07 NaN NaN NaN
In case you want to fill the null values with 0's, you could do:
df.fillna(0, inplace=True)
print (df)
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 0 0.0 0.0
2016-08-03 0 0.0 0.0
2016-08-04 0 0.0 0.0
2016-08-05 0 0.0 0.0
2016-08-06 0 0.0 0.0
2016-08-07 0 0.0 0.0
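If you would rather fill the name column with '-' instead of 0, fillna also accepts a per-column mapping; a small sketch, applied to the reindexed frame in place of the blanket fillna(0):
df = df.fillna({'name': '-', 'purchase': 0, 'send_back': 0})
print (df)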