Select values based on another dataframe - python

I have two dataframes of the same size, the same columns and same index.
df1:
symbol                  fund1  fund2  fund3    ...  ...  ...
id datetime
10 2012-10-19 09:05:00   -100      0      0     50    0    0
20 2012-10-19 09:10:00      0    300      0      0    0    0
df2:
symbol                  fund1  fund2  fund3     ...  ...  ...
id datetime
10 2012-10-19 09:05:00   -0.5      0      0   0.005    0    0
20 2012-10-19 09:10:00      0    -10      0       0    0    0
I would like to get a new dataframe that takes the values from df1 only where the sign of each element in df1 is NOT the same as (i.e. is the opposite of) the sign of the corresponding element in df2.
So the result for the example would be:
df_outcome:
symbol                  fund1  fund2  fund3  ...  ...  ...
id datetime
10 2012-10-19 09:05:00      0      0      0    0    0    0
20 2012-10-19 09:10:00      0    300      0    0    0    0
I've found that there is a function, np.sign(df). I think I should first apply it to both tables, but what should I do then to compare these "sign" tables element by element and, where they are opposite, take the values from df1?

You can use where with np.sign and an inequality test:
df1.where(np.sign(df1) != np.sign(df2)).fillna(0)
Output:
                        fund1  fund2  fund3  fund4  fund5  fund6
id datetime
10 2012-10-19 09:05:00    0.0    0.0    0.0    0.0    0.0    0.0
20 2012-10-19 09:10:00    0.0  300.0    0.0    0.0    0.0    0.0
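For reference, a minimal self-contained sketch of the same idea; note that passing 0 as the second argument to where replaces the fillna step (the fund4-fund6 column names are taken from the output above, the question itself elides them):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(10, "2012-10-19 09:05:00"), (20, "2012-10-19 09:10:00")],
    names=["id", "datetime"],
)
cols = [f"fund{i}" for i in range(1, 7)]
df1 = pd.DataFrame([[-100, 0, 0, 50, 0, 0],
                    [0, 300, 0, 0, 0, 0]], index=idx, columns=cols)
df2 = pd.DataFrame([[-0.5, 0, 0, 0.005, 0, 0],
                    [0, -10, 0, 0, 0, 0]], index=idx, columns=cols)

# keep df1's value only where the signs differ; use 0 everywhere else
df_outcome = df1.where(np.sign(df1) != np.sign(df2), 0)
print(df_outcome)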


Calculate average temperature/humidity between 2 dates pandas data frames

I have the following data frames:
df3:
  Harvest_date Starting_date
0   2022-10-06    2022-08-06
1   2022-02-22    2021-12-22
df (I have all temp and humid starting from 2021-01-01 till the present):
                  date  temp  humid
0  2022-10-06 00:30:00     2     30
1  2022-10-06 00:01:00     1     30
2  2022-10-06 00:01:30     0     30
3  2022-10-06 00:02:00     0     30
4  2022-10-06 00:02:30    -2     30
I would like to calculate the avg temperature and humidity between the starting_date and harvest_date. I tried this:
import pandas as pd

df = pd.read_csv(r'C:\climate.csv')
df3 = pd.read_csv(r'C:\Flower_weight_Seson.csv')
df['date'] = pd.to_datetime(df.date)
df3['Harvest_date'] = pd.to_datetime(df3.Harvest_date)
df3['Starting_date'] = pd.to_datetime(df3.Starting_date)
df.style.format({"date": lambda t: t.strftime("%Y-%m-%d")})
df3.style.format({"Harvest_date": lambda t: t.strftime("%Y-%m-%d")})
df3.style.format({"Starting_date": lambda t: t.strftime("%Y-%m-%d")})

for harvest_date, starting_date in zip(df3['Harvest_date'], df3['Starting_date']):
    df3["Season avg temp"] = df[df["date"].between(starting_date, harvest_date)]["temp"].mean()
    df3["Season avg humid"] = df[df["date"].between(starting_date, harvest_date)]["humid"].mean()
I get the same value for all dates. Can someone point out what I did wrong, please?
The loop assigns a scalar to the entire column on every iteration, so each pass overwrites the previous one and only the last pair of dates survives. Instead, use DataFrame.loc and write each row's result back by its index:
# data changed so it matches df3
print(df)
                 date  temp  humid
0 2022-10-06 00:30:00     2     30
1 2022-09-06 00:01:00     1     33
2 2022-09-06 00:01:30     0     23
3 2022-10-06 00:02:00     0     30
4 2022-01-06 00:02:30    -2     25
for i, harvest_date, starting_date in zip(df3.index, df3['Harvest_date'], df3['Starting_date']):
    mask = df["date"].between(starting_date, harvest_date)
    avg = df.loc[mask, ["temp", "humid"]].mean()
    df3.loc[i, ["Season avg temp", "Season avg humid"]] = avg.to_numpy()
print(df3)
  Harvest_date Starting_date  Season avg temp  Season avg humid
0   2022-10-06    2022-08-06              0.5              28.0
1   2022-02-22    2021-12-22             -2.0              25.0
EDIT: To add a further condition that also matches on a Room column, use:
for i, harvest_date, starting_date, room in zip(df3.index,
                                                df3['Harvest_date'],
                                                df3['Starting_date'],
                                                df3['Room']):
    mask = df["date"].between(starting_date, harvest_date) & df['Room'].eq(room)
    avg = df.loc[mask, ["temp", "humid"]].mean()
    df3.loc[i, ["Season avg temp", "Season avg humid"]] = avg.to_numpy()
print(df3)
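If you prefer not to manage the index by hand, the same logic can also be written with DataFrame.apply. This is a sketch assuming df and df3 as above; season_means is a hypothetical helper name:

# hypothetical helper: mean temp/humid between one row's start and harvest dates
def season_means(row):
    mask = df["date"].between(row["Starting_date"], row["Harvest_date"])
    return df.loc[mask, ["temp", "humid"]].mean()

df3[["Season avg temp", "Season avg humid"]] = df3.apply(season_means, axis=1).to_numpy()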

Update multiple column values based on condition in python

I have a dataframe like this,
  ID  00:00  01:00  02:00  ...  23:00  avg_value
  22    4.7    5.3      6  ...      8        5.5
  37      0    9.2    4.5  ...   11.2        9.2
4469      2    9.8     11  ...      2        6.4
Can I use np.where to apply conditions on multiple columns at once?
I want to update the values from 00:00 to 23:00 to 0 and 1. If the value at the time of day is greater than avg_value then I change it to 1, else to 0.
I know how to apply this method to one single column.
np.where(df['00:00']>df['avg_value'],1,0)
Can I change it to multiple columns?
Output will be like:
  ID  00:00  01:00  02:00  ...  23:00  avg_value
  22      0      1      1  ...      1        5.5
  37      0      0      0  ...      1        9.2
4469      0      1      1  ...      0        6.4
Select all columns except the last with DataFrame.iloc, compare with DataFrame.gt, cast to integers, and finally add the avg_value column back with DataFrame.join:
df = df.iloc[:, :-1].gt(df['avg_value'], axis=0).astype(int).join(df['avg_value'])
print(df)
      00:00  01:00  02:00  23:00  avg_value
ID
22        0      0      1      1        5.5
37        0      0      0      1        9.2
4469      0      1      1      0        6.4
Or use DataFrame.pop to extract the column:
s = df.pop('avg_value')
df = df.gt(s, axis=0).astype(int).join(s)
print(df)
      00:00  01:00  02:00  23:00  avg_value
ID
22        0      0      1      1        5.5
37        0      0      0      1        9.2
4469      0      1      1      0        6.4
This is because, if you assign back to the same columns, the integers are converted to floats (it is a bug):
df.iloc[:, :-1] = df.iloc[:, :-1].gt(df['avg_value'], axis=0).astype(int)
print(df)
      00:00  01:00  02:00  23:00  avg_value
ID
22      0.0    0.0    1.0    1.0        5.5
37      0.0    0.0    0.0    1.0        9.2
4469    0.0    1.0    1.0    0.0        6.4
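To answer the literal question: np.where does work on all the columns at once, since it accepts a boolean DataFrame. A minimal sketch with illustrative data (only a few of the hour columns are shown):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"00:00": [4.7, 0.0, 2.0],
     "01:00": [5.3, 9.2, 9.8],
     "23:00": [8.0, 11.2, 2.0],
     "avg_value": [5.5, 9.2, 6.4]},
    index=pd.Index([22, 37, 4469], name="ID"),
)

time_cols = df.columns[:-1]
# np.where returns a plain ndarray, so wrap it back into a DataFrame
out = pd.DataFrame(
    np.where(df[time_cols].gt(df["avg_value"], axis=0), 1, 0),
    index=df.index, columns=time_cols,
).join(df["avg_value"])
print(out)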

NaN values when adding two columns

I have two dataframes with different indices, and I want to sum the same column from the two dataframes.
I tried the following, but it gives NaN values:
result['Anomaly'] = df['Anomaly'] + tmp['Anomaly']
df
date Anomaly
0 2018-12-06 0
1 2019-01-07 0
2 2019-02-06 1
3 2019-03-06 0
4 2019-04-06 0
tmp
date Anomaly
0 2018-12-06 0
1 2019-01-07 1
4 2019-04-06 0
result
date Anomaly
0 2018-12-06 0.0
1 2019-01-07 NaN
2 2019-02-06 1.0
3 2019-03-06 0.0
4 2019-04-06 0.0
What I want is actually:
result
date Anomaly
0 2018-12-06 0
1 2019-01-07 1
2 2019-02-06 1
3 2019-03-06 0
4 2019-04-06 0
Here it is necessary to align by date, so first use DataFrame.set_index to build the date index and then use Series.add with fill_value=0:
df = df.set_index('date')
tmp = tmp.set_index('date')
result = df['Anomaly'].add(tmp['Anomaly'], fill_value=0).reset_index()
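For reference, a minimal self-contained version of that approach (the sample frames mirror the ones above; astype(int) is an addition that restores the integer dtype, since the aligned add returns floats):

import pandas as pd

df = pd.DataFrame({"date": ["2018-12-06", "2019-01-07", "2019-02-06", "2019-03-06", "2019-04-06"],
                   "Anomaly": [0, 0, 1, 0, 0]})
tmp = pd.DataFrame({"date": ["2018-12-06", "2019-01-07", "2019-04-06"],
                    "Anomaly": [0, 1, 0]})

# add aligns on the date index; fill_value=0 treats dates missing from tmp as 0
result = (df.set_index("date")["Anomaly"]
            .add(tmp.set_index("date")["Anomaly"], fill_value=0)
            .astype(int)
            .reset_index())
print(result)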
You can also try this:
pd.concat([df, tmp]).groupby('date', as_index=False)["Anomaly"].sum()
date Anomaly
0 2018-12-06 0
1 2019-01-07 1
2 2019-02-06 1
3 2019-03-06 0
4 2019-04-06 0
Or use combine_first() (note that this takes tmp's value wherever one exists rather than summing; with this data it happens to match the desired result):
res = pd.DataFrame({'date':df.date,'Anomaly':tmp.Anomaly.combine_first(df.Anomaly)})
print(res)
date Anomaly
0 2018-12-06 0.0
1 2019-01-07 1.0
2 2019-02-06 1.0
3 2019-03-06 0.0
4 2019-04-06 0.0
You can also first set matching indices on your dataframes and then add using the date indices (this assumes every date in tmp also appears in df):
tmp1 = tmp.set_index('date')
result = df.set_index('date')
result.loc[tmp1.index] += tmp1
result.reset_index(inplace=True)

Fill in missing dates pandas based off max and min

I have a dataframe like the one below. I was wondering how I can fill in the missing dates based on the max and min dates in the dataframe.
Day Movie Rating
2017-01-01 GreatGatsby 5
2017-01-02 TopGun 5
2017-01-03 Deadpool 1
2017-01-10 PlanetOfApes 2
How can I fill in the missing dates so the result looks like this:
Day Movie Rating
2017-01-01 GreatGatsby 5
2017-01-02 TopGun 5
2017-01-03 Deadpool 1
2017-01-04 0 0
2017-01-05 0 0
2017-01-06 0 0
2017-01-07 0 0
2017-01-08 0 0
2017-01-09 0 0
2017-01-10 PlanetOfApes 2
Use resample + first/last/min/max:
df.set_index('Day').resample('1D').first().fillna(0).reset_index()
Day Movie Rating
0 2017-01-01 GreatGatsby 5.0
1 2017-01-02 TopGun 5.0
2 2017-01-03 Deadpool 1.0
3 2017-01-04 0 0.0
4 2017-01-05 0 0.0
5 2017-01-06 0 0.0
6 2017-01-07 0 0.0
7 2017-01-08 0 0.0
8 2017-01-09 0 0.0
9 2017-01-10 PlanetOfApes 2.0
If Day isn't a datetime column, use pd.to_datetime to convert it first:
df['Day'] = pd.to_datetime(df['Day'])
An alternative from Wen, using asfreq:
df.set_index('Day').asfreq('D').fillna(0).reset_index()
Day Movie Rating
0 2017-01-01 GreatGatsby 5.0
1 2017-01-02 TopGun 5.0
2 2017-01-03 Deadpool 1.0
3 2017-01-04 0 0.0
4 2017-01-05 0 0.0
5 2017-01-06 0 0.0
6 2017-01-07 0 0.0
7 2017-01-08 0 0.0
8 2017-01-09 0 0.0
9 2017-01-10 PlanetOfApes 2.0
I believe you need reindex:
df = (df.set_index('Day')
        .reindex(pd.date_range(df['Day'].min(), df['Day'].max()), fill_value=0)
        .reset_index())
print(df)
index Movie Rating
0 2017-01-01 GreatGatsby 5
1 2017-01-02 TopGun 5
2 2017-01-03 Deadpool 1
3 2017-01-04 0 0
4 2017-01-05 0 0
5 2017-01-06 0 0
6 2017-01-07 0 0
7 2017-01-08 0 0
8 2017-01-09 0 0
9 2017-01-10 PlanetOfApes 2
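For reference, a self-contained version of the reindex approach; the rename_axis call is an addition so the index gets the Day name back after reset_index, instead of the generic "index" column seen above:

import pandas as pd

df = pd.DataFrame({
    "Day": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-10"]),
    "Movie": ["GreatGatsby", "TopGun", "Deadpool", "PlanetOfApes"],
    "Rating": [5, 5, 1, 2],
})

# reindex against the full daily range, filling the gaps with 0
df = (df.set_index("Day")
        .reindex(pd.date_range(df["Day"].min(), df["Day"].max()), fill_value=0)
        .rename_axis("Day")
        .reset_index())
print(df)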

Pandas: group a dataframe by datetime into different periods, ignoring the date part

I want to group the rows into groups based on a variable time interval.
However, when grouping, I want to ignore the date part and group based only on the time part.
Say I want to group every 5 minutes.
timestampe val
0 2016-08-11 11:03:00 0.1
1 2016-08-13 11:06:00 0.3
2 2016-08-09 11:04:00 0.5
3 2016-08-05 11:35:00 0.7
4 2016-08-19 11:09:00 0.8
5 2016-08-21 12:37:00 0.9
into
timestampe val
0 2016-08-11 11:03:00 0.1
2 2016-08-09 11:04:00 0.5
timestampe val
1 2016-08-13 11:06:00 0.3
4 2016-08-19 11:09:00 0.8
timestampe val
3 2016-08-05 11:35:00 0.7
timestampe val
5 2016-08-21 12:37:00 0.9
Notice as long as the time is within the same 5 minutes interval, the rows are grouped, regardless of the date.
This is assuming you split the day up into 5-minute windows:
df.groupby(df.timestampe.dt.hour.mul(60)
           .add(df.timestampe.dt.minute) // 5).apply(pd.DataFrame.reset_index)

for name, group in df.groupby(df.timestampe.dt.hour.mul(60).add(df.timestampe.dt.minute) // 5):
    print(name)
    print(group)
    print()
132
timestampe val
0 2016-08-11 11:03:00 0.1
2 2016-08-09 11:04:00 0.5
133
timestampe val
1 2016-08-13 11:06:00 0.3
4 2016-08-19 11:09:00 0.8
139
timestampe val
3 2016-08-05 11:35:00 0.7
151
timestampe val
5 2016-08-21 12:37:00 0.9
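For reference, a self-contained Python 3 script of the same idea. The group key is the minute of the day integer-divided into 5-minute buckets, e.g. 11:03 gives 11*60 + 3 = 663 and 663 // 5 = 132, which is the first group label above:

import pandas as pd

df = pd.DataFrame({
    "timestampe": pd.to_datetime([
        "2016-08-11 11:03:00", "2016-08-13 11:06:00", "2016-08-09 11:04:00",
        "2016-08-05 11:35:00", "2016-08-19 11:09:00", "2016-08-21 12:37:00",
    ]),
    "val": [0.1, 0.3, 0.5, 0.7, 0.8, 0.9],
})

# minute of the day, bucketed into 5-minute windows
bucket = (df.timestampe.dt.hour * 60 + df.timestampe.dt.minute) // 5
for name, group in df.groupby(bucket):
    print(name)
    print(group)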
Since you do not care about the date part of your datetime objects, I think making all the dates equal is a good trick:
df['time'] = df['timestamp'].apply(lambda x: x.replace(year=2000, month=1, day=1))
You get:
timestamp val time
0 2016-08-11 11:03:00 0.1 2000-01-01 11:03:00
1 2016-08-13 11:06:00 0.3 2000-01-01 11:06:00
2 2016-08-09 11:04:00 0.5 2000-01-01 11:04:00
3 2016-08-05 11:35:00 0.7 2000-01-01 11:35:00
4 2016-08-19 11:09:00 0.8 2000-01-01 11:09:00
5 2016-08-21 11:37:00 0.9 2000-01-01 11:37:00
Now you can do whatever you want on the time column. For example, group every 5 minutes:
grouped = df.groupby(pd.Grouper(key='time', freq='5min'))
grouped.count()
timestamp val
time
2000-01-01 11:00:00 2 2
2000-01-01 11:05:00 2 2
2000-01-01 11:10:00 0 0
2000-01-01 11:15:00 0 0
2000-01-01 11:20:00 0 0
2000-01-01 11:25:00 0 0
2000-01-01 11:30:00 0 0
2000-01-01 11:35:00 2 2
Hope this trick is suitable for your needs. Thanks!
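An alternative sketch, not from the answers above: floor each timestamp to a 5-minute boundary and group on just its clock time, which ignores the date directly (assuming the renamed timestamp column from this answer):

# floor to the nearest 5 minutes, then keep only the time-of-day part
key = df['timestamp'].dt.floor('5min').dt.time
print(df.groupby(key)['val'].count())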
