I have three dataframes:
df_1 =
Name Description Date Quant Value
0 B100 text123 2021-01-02 3 89.1
1 B101 text567 2021-01-03 2 90.1
2 A200 text820 2021-03-02 1 90.2
3 B101 text567 2021-03-02 6 90.2
4 A500 text758 2021-03-06 1 94.0
5 A500 text758 2021-03-06 2 94.0
6 A500 text758 2021-03-07 2 94.0
7 A200 text820 2021-04-02 1 90.2
8 A999 text583 2021-05-05 2 90.6
9 A998 text834 2021-05-09 1 99.9
df_2 = # the index is funny because I did some manipulations and dropped some NaNs before
Code Name Person
0 900 B100 600
1 901 B100 610
2 959 B101 670
3 979 A999 670
6 944 A200 388
7 921 A500 663
8 988 B300 794
df_3 =
Code StartDate EndDate RealValue
0 900 2000-01-01 2007-12-31 80.9
1 901 2008-01-01 2099-12-31 98.8
2 902 2000-01-01 2020-02-02 98.3
3 903 2000-01-01 2007-01-10 90.6
4 903 2007-01-11 2099-12-31 90.7
5 959 2020-04-09 2099-12-31 98.9
6 979 2000-01-01 2009-02-12 87.6
7 979 2009-02-13 2021-06-13 78.0
8 979 2021-06-15 2099-12-31 89.5
9 944 2020-04-09 2099-12-31 98.9
10 921 2020-04-09 2099-12-31 98.9
I want to do the following:
Start from df_1 and find the corresponding Code(s) in df_2 for each Name. Then, for every Date in df_1, I want to take the Value and Quant and compare the Value with the RealValue from the df_3 date range that contains that Date. The difficult part is selecting the right Code and then the right date range. So:
Name Date Code Value RealValue Quant
B100 2021-01-02 901 89.1 98.8 3
B101 2021-01-03 959 90.1 98.9 2
A200 2021-03-02 944 90.2 98.9 1
B101 2021-03-02 959 90.2 98.9 6
A500 2021-03-06 921 94.0 98.9 1
A500 2021-03-06 921 94.0 98.9 2
A500 2021-03-07 921 94.0 98.9 2
A200 2021-04-02 944 90.2 98.9 1
A999 2021-05-05 979 90.6 78.0 2
What I did was merge everything into one table, but since my real dataset is huge and many records do not appear in every dataframe, I might have lost some data or ended up with NaNs. So I would rather leave the dataframes as they are here and look up the matching rows for every record in df_1. Is that possible?
First, map the Name column from df2 onto df3, then merge df1 and df3 on the Name column. Finally, keep only the rows where Date is between StartDate and EndDate (df1, df2 and df3 below are the question's df_1, df_2 and df_3):
COLS = df1.columns.tolist() + ['RealValue']
df3['Name'] = df3['Code'].map(df2.set_index('Code')['Name'])
out = df1.merge(df3, on='Name', how='left') \
.query('Date.between(StartDate, EndDate)')[COLS]
Output:
>>> out
Name Description Date Quant Value RealValue
1 B100 text123 2021-01-02 3 89.1 98.8
2 B101 text567 2021-01-03 2 90.1 98.9
3 A200 text820 2021-03-02 1 90.2 98.9
4 B101 text567 2021-03-02 6 90.2 98.9
5 A500 text758 2021-03-06 1 94.0 98.9
6 A500 text758 2021-03-06 2 94.0 98.9
7 A500 text758 2021-03-07 2 94.0 98.9
8 A200 text820 2021-04-02 1 90.2 98.9
10 A999 text583 2021-05-05 2 90.6 78.0
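If the Code column should also appear in the result, as in the question's expected table, one possible sketch is to go through df_2 first and only then filter on the date range. The frames below are a small subset of the question's data, rebuilt here purely for illustration, and the date columns are assumed to be real datetimes rather than strings:

```python
import pandas as pd

# Small subset of the question's data (rebuilt here for illustration only)
df_1 = pd.DataFrame({
    "Name": ["B100", "B101", "A999"],
    "Date": pd.to_datetime(["2021-01-02", "2021-01-03", "2021-05-05"]),
    "Quant": [3, 2, 2],
    "Value": [89.1, 90.1, 90.6],
})
df_2 = pd.DataFrame({
    "Code": [900, 901, 959, 979],
    "Name": ["B100", "B100", "B101", "A999"],
})
df_3 = pd.DataFrame({
    "Code": [900, 901, 959, 979, 979, 979],
    "StartDate": pd.to_datetime(["2000-01-01", "2008-01-01", "2020-04-09",
                                 "2000-01-01", "2009-02-13", "2021-06-15"]),
    "EndDate": pd.to_datetime(["2007-12-31", "2099-12-31", "2099-12-31",
                               "2009-02-12", "2021-06-13", "2099-12-31"]),
    "RealValue": [80.9, 98.8, 98.9, 87.6, 78.0, 89.5],
})

# Name -> Code(s) -> date ranges; inner joins drop Names/Codes without a match
merged = df_1.merge(df_2, on="Name").merge(df_3, on="Code")

# Keep only the range that contains each Date (both ends inclusive)
in_range = merged["Date"].between(merged["StartDate"], merged["EndDate"])
out = merged.loc[in_range, ["Name", "Date", "Code", "Value", "RealValue", "Quant"]]
out = out.reset_index(drop=True)
print(out)
```

Because the joins are inner joins, a Name that has no Code in df_2 (such as A998 in the question) simply drops out instead of producing NaN rows.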
Example:
start_date = "2019-1-1"
end_date = "2019-1-31"
after_start_dates = df['date'] >= start_date
before_end_dates = df['date'] <= end_date
between_dates = after_start_dates & before_end_dates
filtered_dates = df.loc[between_dates]
print(filtered_dates)
try this:
try1 = pd.merge(df_1, df_2, on = 'Name', how = 'outer')
try2 = pd.merge(try1, df_3, on = 'Code', how = 'outer')
try2
Then select the columns you need from try2:
try2[['Name','Date','Code','Value','RealValue','Quant']]
Related
I'm trying to solve this issue. I have two dataframes. The first one looks like:
ID   start.date           end.date
272  2007-03-27 10:37:00  2007-03-27 15:09:00
290  2007-04-10 14:12:00  2007-04-10 15:51:00
268  2007-03-23 18:18:00  2007-03-23 18:24:00
264  2007-04-05 06:54:00  2007-04-09 06:45:00
105  2007-04-18 10:51:00  2007-04-18 13:37:00
280  2007-03-30 11:09:00  2007-04-02 06:27:00
99   2007-03-28 12:12:00  2007-03-28 15:22:00
268  2007-03-27 10:41:00  2007-03-27 10:54:00
263  2007-03-28 11:08:00  2007-03-28 12:45:00
264  2007-03-28 07:12:00  2007-03-28 11:08:00
While the second one looks like:
ID   date
266  2007-03-30 17:17:10
272  2007-03-30 14:23:39
268  2007-03-30 09:12:48
264  2007-03-30 18:57:57
276  2007-04-02 14:30:02
106  2007-03-28 11:35:49
276  2007-03-30 13:40:24
82   2007-03-27 17:29:28
104  2007-03-28 17:50:12
264  2007-03-29 14:41:16
I would like to add a column to the first dataframe with the count of the rows in the second dataframe with that ID and with a date value between the start.date and end.date of the first dataframe. How can I do it?
You can try apply on rows:
df1['start.date'] = pd.to_datetime(df1['start.date'])
df1['end.date'] = pd.to_datetime(df1['end.date'])
df2['date'] = pd.to_datetime(df2['date'])
# compare IDs with df2['ID'] (not df2['date']), then check the date window
df1['count'] = df1.apply(lambda row: (df2['ID'].eq(row['ID']) & (row['start.date'] < df2['date']) & (df2['date'] < row['end.date'])).sum(), axis=1)
# or
df1['count2'] = df1.apply(lambda row: (df2['ID'].eq(row['ID']) & df2['date'].between(row['start.date'], row['end.date'], inclusive='neither')).sum(), axis=1)
print(df1)
ID start.date end.date count count2
0 272 2007-03-27 10:37:00 2007-03-27 15:09:00 0 0
1 290 2007-04-10 14:12:00 2007-04-10 15:51:00 0 0
2 268 2007-03-23 18:18:00 2007-03-23 18:24:00 0 0
3 264 2007-04-05 06:54:00 2007-04-09 06:45:00 0 0
4 105 2007-04-18 10:51:00 2007-04-18 13:37:00 0 0
5 280 2007-03-30 11:09:00 2007-04-02 06:27:00 0 0
6 99 2007-03-28 12:12:00 2007-03-28 15:22:00 0 0
7 268 2007-03-27 10:41:00 2007-03-27 10:54:00 0 0
8 263 2007-03-28 11:08:00 2007-03-28 12:45:00 0 0
9 264 2007-03-28 07:12:00 2007-03-28 11:08:00 0 0
Perfect job for numpy broadcasting:
id1, start_date, end_date = [df1[[col]].to_numpy() for col in ["ID", "start.date", "end.date"]]
id2, date = [df2[col].to_numpy() for col in ["ID", "date"]]
# Check every row in df1 against every row in df2 for our criteria:
# matching id, and date between start.date and end.date
match = (id1 == id2) & (start_date < date) & (date < end_date)
df1["count"] = match.sum(axis=1)
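To see the broadcasting produce a non-zero count, here is a small self-contained sketch. The two frames are made up for illustration (the first window is invented so that something falls inside it); only the column names follow the question:

```python
import pandas as pd

# Made-up data: the first window was chosen so that two df2 rows fall inside it
df1 = pd.DataFrame({
    "ID": [264, 268],
    "start.date": pd.to_datetime(["2007-03-29 00:00:00", "2007-03-27 10:41:00"]),
    "end.date": pd.to_datetime(["2007-03-31 00:00:00", "2007-03-27 10:54:00"]),
})
df2 = pd.DataFrame({
    "ID": [264, 264, 268],
    "date": pd.to_datetime(["2007-03-30 18:57:57", "2007-03-29 14:41:16",
                            "2007-03-30 09:12:48"]),
})

# Column vectors (n, 1) from df1 and flat vectors (m,) from df2
id1 = df1["ID"].to_numpy()[:, None]
start = df1["start.date"].to_numpy()[:, None]
end = df1["end.date"].to_numpy()[:, None]
id2 = df2["ID"].to_numpy()
date = df2["date"].to_numpy()

# Broadcasting yields an (n, m) boolean matrix: one cell per (df1 row, df2 row)
match = (id1 == id2) & (start < date) & (date < end)
df1["count"] = match.sum(axis=1)
print(df1)
```

Note that the intermediate matrix has len(df1) x len(df2) cells, so for very large inputs memory can become the limiting factor.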
I've created a class which takes in minute data and returns the daily ohlc for that day. A simple version looks like so:
import pandas as pd
from datetime import time
from IPython.display import display
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

US_BUSINESS_DAY = CustomBusinessDay(calendar=USFederalHolidayCalendar())


class SessionData:
    def __init__(self, data, date):
        self.data = pd.read_csv(data)
        self.date = pd.to_datetime(date)
        df = self.data
        # get the minute data and return only the specified date (2022-04-18)
        df_current_day = df[(df['date'] >= date) & (df['date'] <= date)]
        display(df_current_day)
        self.previous_day = pd.to_datetime(date) - 2 * US_BUSINESS_DAY
        print(type(self.date), ' self.date type')
        print(type(self.previous_day), ' self.previous_day type')
        print(type(df['date']))
        df_previous_day = df[(df['date'] >= self.previous_day) & (df['date'] <= self.previous_day)]
Here's what my data originally looks like:
v vw o c h l t n date time
0 605.0 4.2036 4.2000 4.20 4.2000 4.20 2022-04-07 13:30:00 3 2022-04-07 13:30:00
1 809.0 4.2013 4.2026 4.20 4.2026 4.20 2022-04-07 13:41:00 12 2022-04-07 13:41:00
2 115.0 4.1739 4.1700 4.17 4.1700 4.17 2022-04-07 13:43:00 3 2022-04-07 13:43:00
3 170.0 4.1495 4.1500 4.15 4.1500 4.15 2022-04-07 13:53:00 6 2022-04-07 13:53:00
4 100.0 4.1600 4.1600 4.16 4.1600 4.16 2022-04-07 13:57:00 1 2022-04-07 13:57:00
... ... ... ... ... ... ... ... ... ... ...
1397 6260.0 6.5252 6.5300 6.53 6.5600 6.51 2022-04-18 23:55:00 32 2022-04-18 23:55:00
1398 8610.0 6.5399 6.5300 6.55 6.5500 6.52 2022-04-18 23:56:00 28 2022-04-18 23:56:00
1399 9035.0 6.5493 6.5500 6.55 6.5600 6.54 2022-04-18 23:57:00 24 2022-04-18 23:57:00
1400 30328.0 6.5188 6.5600 6.50 6.5600 6.50 2022-04-18 23:58:00 66 2022-04-18 23:58:00
1401 25403.0 6.5152 6.5000 6.52 6.5500 6.49 2022-04-18 23:59:00 62 2022-04-18 23:59:00
1402 rows × 10 columns
Here's when I display df_current_day:
v vw o c h l t n date time
687 852.0 4.1498 3.98 4.41 4.4100 3.98 2022-04-18 12:00:00 13 2022-04-18 12:00:00
688 2901.0 4.4839 4.13 4.75 4.7500 4.13 2022-04-18 12:01:00 24 2022-04-18 12:01:00
689 44063.0 4.9450 4.88 4.66 5.2599 4.60 2022-04-18 12:02:00 236 2022-04-18 12:02:00
690 46314.0 4.6890 4.70 4.62 4.8000 4.55 2022-04-18 12:03:00 225 2022-04-18 12:03:00
691 142991.0 4.8611 4.66 5.03 5.0900 4.61 2022-04-18 12:04:00 581 2022-04-18 12:04:00
... ... ... ... ... ... ... ... ... ... ...
1397 6260.0 6.5252 6.53 6.53 6.5600 6.51 2022-04-18 23:55:00 32 2022-04-18 23:55:00
1398 8610.0 6.5399 6.53 6.55 6.5500 6.52 2022-04-18 23:56:00 28 2022-04-18 23:56:00
1399 9035.0 6.5493 6.55 6.55 6.5600 6.54 2022-04-18 23:57:00 24 2022-04-18 23:57:00
1400 30328.0 6.5188 6.56 6.50 6.5600 6.50 2022-04-18 23:58:00 66 2022-04-18 23:58:00
1401 25403.0 6.5152 6.50 6.52 6.5500 6.49 2022-04-18 23:59:00 62 2022-04-18 23:59:00
715 rows × 10 columns
But when I go to create df_previous_day I get the following error:
TypeError: '>=' not supported between instances of 'str' and 'Timestamp'
If I print the types of my variables like so:
print(type(self.date), ' self.date type')
print(type(self.previous_day), ' self.previous_day type')
print(type(df['date']))
I get back:
<class 'pandas._libs.tslibs.timestamps.Timestamp'> self.date type
<class 'pandas._libs.tslibs.timestamps.Timestamp'> self.previous_day type
<class 'pandas.core.series.Series'>
So my question is why doesn't df_previous_day work if it shares the same types as the variables used in the df_current_day?
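Judging from the error message, a likely explanation is that pd.read_csv left the date column as plain strings. type(df['date']) only reports the container (a Series), not the type of its elements. The first filter compares those strings with the string argument date, which works, while the second compares them with a Timestamp, which raises the TypeError. A minimal sketch of one way around it, with a small made-up frame standing in for the CSV:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

US_BUSINESS_DAY = CustomBusinessDay(calendar=USFederalHolidayCalendar())

# Made-up stand-in for the CSV: the date column holds strings, as after read_csv
df = pd.DataFrame({
    "date": ["2022-04-14", "2022-04-14", "2022-04-18"],
    "c": [4.20, 4.17, 6.52],
})
print(df["date"].dtype)  # a string/object dtype here, not datetime64

# Convert once so that comparisons against Timestamps are well defined
df["date"] = pd.to_datetime(df["date"])

date = pd.to_datetime("2022-04-18")
previous_day = date - 2 * US_BUSINESS_DAY  # same offset as in the question

df_current_day = df[df["date"] == date]
df_previous_day = df[df["date"] == previous_day]
print(df_previous_day)
```

Passing parse_dates=['date'] to pd.read_csv should have the same effect as the explicit pd.to_datetime call.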
I have the data frame below (datetime index, with all working days in the US calendar):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import random
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt_rng = pd.date_range(start='1/1/2018', end='12/31/2018', freq=us_bd)
n1 = [round(random.uniform(20, 35),2) for _ in range(len(dt_rng))]
n2 = [random.randint(100, 200) for _ in range(len(dt_rng))]
df = pd.DataFrame(list(zip(n1,n2)), index=dt_rng, columns=['n1','n2'])
print(df)
n1 n2
2018-01-02 24.78 197
2018-01-03 23.33 176
2018-01-04 33.19 128
2018-01-05 32.49 110
... ... ...
2018-12-26 31.34 173
2018-12-27 29.72 166
2018-12-28 31.07 104
2018-12-31 33.52 184
[251 rows x 2 columns]
For each row in column n1, how can I get the value from the same column for the same day of the next month? If a value for that exact day is not available (due to weekends or holidays), it should take the value at the next available date. I tried using df.n1.shift(21), but it does not work because the number of working days differs from month to month.
Expected output as below
n1 n2 next_mnth_val
2018-01-02 25.97 184 28.14
2018-01-03 24.94 133 27.65 # three values below are same, because on Feb 2018, the next working day after 2nd is 5th
2018-01-04 23.99 143 27.65
2018-01-05 24.69 182 27.65
2018-01-08 28.43 186 28.45
2018-01-09 31.47 104 23.14
... ... ... ...
2018-12-26 29.06 194 20.45
2018-12-27 29.63 158 20.45
2018-12-28 30.60 148 20.45
2018-12-31 20.45 121 20.45
For December, the next month's value should be the last value of the data frame, i.e. the value at index 2018-12-31 (20.45).
Please help.
This is an interesting problem. I would shift the date by 1 month, then shift it again to the next business day:
df1 = df.copy().reset_index()
df1['new_date'] = df1['index'] + pd.DateOffset(months=1) + pd.offsets.BDay()
df.merge(df1, left_index=True, right_on='new_date')
Output (first 31 days):
n1_x n2_x index n1_y n2_y new_date
0 34.82 180 2018-01-02 29.83 129 2018-02-05
1 34.82 180 2018-01-03 24.28 166 2018-02-05
2 34.82 180 2018-01-04 27.88 110 2018-02-05
3 24.89 186 2018-01-05 25.34 111 2018-02-06
4 31.66 137 2018-01-08 26.28 138 2018-02-09
5 25.30 162 2018-01-09 32.71 139 2018-02-12
6 25.30 162 2018-01-10 34.39 159 2018-02-12
7 25.30 162 2018-01-11 20.89 132 2018-02-12
8 23.44 196 2018-01-12 29.27 167 2018-02-13
12 25.40 153 2018-01-19 28.52 185 2018-02-20
13 31.38 126 2018-01-22 23.49 141 2018-02-23
14 30.90 133 2018-01-23 25.56 145 2018-02-26
15 30.90 133 2018-01-24 23.06 155 2018-02-26
16 30.90 133 2018-01-25 24.95 174 2018-02-26
17 29.39 138 2018-01-26 21.28 157 2018-02-27
18 32.94 173 2018-01-29 20.26 189 2018-03-01
19 32.94 173 2018-01-30 22.41 196 2018-03-01
20 32.94 173 2018-01-31 27.32 149 2018-03-01
21 28.09 119 2018-02-01 31.39 192 2018-03-02
22 32.21 199 2018-02-02 28.22 151 2018-03-05
23 21.78 120 2018-02-05 34.82 180 2018-03-06
24 28.25 127 2018-02-06 24.89 186 2018-03-07
25 22.06 189 2018-02-07 32.85 125 2018-03-08
26 33.78 121 2018-02-08 30.12 102 2018-03-09
27 30.79 137 2018-02-09 31.66 137 2018-03-12
28 29.88 131 2018-02-12 25.30 162 2018-03-13
29 20.02 143 2018-02-13 23.44 196 2018-03-14
30 20.28 188 2018-02-14 20.04 102 2018-03-15
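An alternative sketch that follows the question's rule literally (same day next month, otherwise the next available date, otherwise the last row) is to look up positions with searchsorted. The n1 values here are placeholders, not the question's random data:

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
idx = pd.date_range(start="2018-01-01", end="2018-12-31", freq=us_bd)

# Placeholder values: each row simply holds its own position
df = pd.DataFrame({"n1": np.arange(len(idx), dtype=float)}, index=idx)

# Same calendar day in the following month (clamped to month end if needed)
target = df.index + pd.DateOffset(months=1)

# Position of the first index entry >= target, i.e. the next available date
pos = df.index.searchsorted(target, side="left")

# Targets past the end of the data fall back to the last row
pos = np.minimum(pos, len(df) - 1)

df["next_mnth_val"] = df["n1"].to_numpy()[pos]
print(df.head(10))
```

With this lookup, 2018-01-03, 2018-01-04 and 2018-01-05 all pick up the value from 2018-02-05, and every December row picks up the value from 2018-12-31.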
I have the following dataframe:
date forecast_price pool_price forecast_ail ail
1 2019-09-03 11:00:00 34.90 35.5 9964 9970
2 2019-09-03 12:00:00 34.95 36.6 10074 10078
3 2019-09-03 13:00:00 34.94 37.7 10130 10135
4 2019-09-03 14:00:00 50.90 NaN 9000 NaN
5 2019-09-03 15:00:00 60.95 NaN 10000 NaN
6 2019-09-03 16:00:00 70.94 NaN 12000 NaN
I would like to copy the contents of rows 1 to 3 onto rows 3 to 6, but I'd like to leave the forecast_price and forecast_ail column values the same for rows 4 to 6. How do I go about doing so?
Expected output:
date forecast_price pool_price forecast_ail ail
1 2019-09-03 11:00:00 34.90 35.5 9964 9970
2 2019-09-03 12:00:00 34.95 36.6 10074 10078
3 2019-09-03 13:00:00 34.94 37.7 10130 10135
4 2019-09-03 14:00:00 50.90 35.5 9000 9970
5 2019-09-03 15:00:00 60.95 36.6 10000 10078
6 2019-09-03 16:00:00 70.94 37.7 12000 10135
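I guess you meant to copy into rows 4 to 6.
You can use the line below. The .to_numpy() call matters: without it, pandas aligns the right-hand side on its index labels (1 to 3), finds no match for rows 4 to 6, and leaves them as NaN:
df.loc[[4,5,6],['pool_price','ail']] = df.loc[[1,2,3],['pool_price','ail']].to_numpy()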
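For a self-contained check of the copy, here is a sketch that rebuilds the question's frame (the date column is left out for brevity) and assigns the source values as a plain array so that they are placed by position rather than by index label:

```python
import numpy as np
import pandas as pd

# The question's frame without the date column; index labels run from 1 to 6
df = pd.DataFrame(
    {
        "forecast_price": [34.90, 34.95, 34.94, 50.90, 60.95, 70.94],
        "pool_price": [35.5, 36.6, 37.7, np.nan, np.nan, np.nan],
        "forecast_ail": [9964, 10074, 10130, 9000, 10000, 12000],
        "ail": [9970, 10078, 10135, np.nan, np.nan, np.nan],
    },
    index=range(1, 7),
)

cols = ["pool_price", "ail"]

# .to_numpy() drops the source labels, so the values are copied by position
df.loc[[4, 5, 6], cols] = df.loc[[1, 2, 3], cols].to_numpy()
print(df)
```

The forecast_price and forecast_ail columns are not part of cols, so rows 4 to 6 keep their own forecast values.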
I am analyzing data from an Excel file.
I want to create a data frame by parsing the Excel data with Python.
The data in my Excel file looks as follows:
The first row, highlighted in yellow, contains the match, which will be one of the columns in the data frame that I want to create.
The second and fourth rows hold the names of the columns that I want in the new data frame.
The third and fifth rows hold the values of each column.
The sample here is for one match only.
I have multiple matches in the Excel file.
I want to create a data frame that contains the column Match and all the names shown in blue in the file.
I have attached a sample file that contains multiple matches.
Download the file here.
My expected data frame is
Match 1-0 2-0 2-1 3-0 3-1 3-2 4-0 4-1 4-2 4-3.......
MOL Vidi -vs- Chelsea 14 42 20 170 85 85 225 225 225 .....
Can anyone advise me how to parse the excel data and convert to data frame?
Thanks,
Zep
Use:
import pandas as pd
from datetime import datetime
df = pd.read_excel('test_match.xlsx')
#mask for check a-z in column HOME -vs- AWAY
m1 = df['HOME -vs- AWAY'].str.contains('[a-z]', na=False)
#create index by matches
df.index = df['HOME -vs- AWAY'].where(m1).ffill()
df.index.name = 'Match'
#remove same index and HOME -vs- AWAY column rows
df = df[df.index != df['HOME -vs- AWAY']].copy()
#test if datetime or string
m2 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, datetime))
m3 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, str))
#select next rows and set new column names
df1 = df[m2.shift().fillna(False)]
df1.columns = df[m2].iloc[0]
#also remove only NaNs columns
df2 = df[m3.shift().fillna(False)].dropna(axis=1, how='all')
df2.columns = df[m3].iloc[0].dropna()
#join together
df = pd.concat([df1, df2], axis=1).astype(float).reset_index().rename_axis(None, axis=1)
print (df.head())
Match 2000-01-01 00:00:00 2000-02-01 00:00:00 \
0 MOL Vidi -vs- Chelsea 14.00 42.00
1 Lazio -vs- Eintracht Frankfurt 8.57 11.55
2 Sevilla -vs- FC Krasnodar 7.87 6.63
3 Villarreal -vs- Spartak Moscow 7.43 7.03
4 Rennes -vs- FC Astana 4.95 6.38
2018-02-01 00:00:00 2000-03-01 00:00:00 2018-03-01 00:00:00 \
0 20.00 170.00 85.00
1 7.87 23.80 15.55
2 7.87 8.72 8.65
3 7.07 10.00 9.43
4 7.33 12.00 13.20
2018-03-02 00:00:00 2000-04-01 00:00:00 2018-04-01 00:00:00 \
0 85.0 225.00 225.00
1 21.3 64.30 42.00
2 25.9 14.80 14.65
3 23.9 19.35 17.65
4 38.1 31.50 34.10
2018-04-02 00:00:00 ... 0-1 0-2 2018-01-02 00:00:00 \
0 225.0 ... 5.6 6.80 7.00
1 55.7 ... 11.0 19.05 10.45
2 38.1 ... 28.0 79.60 29.20
3 38.4 ... 20.9 58.50 22.70
4 81.4 ... 12.9 42.80 22.70
0-3 2018-01-03 00:00:00 2018-02-03 00:00:00 0-4 \
0 12.5 12.0 32.0 30.0
1 48.4 27.4 29.8 167.3
2 223.0 110.0 85.4 227.5
3 203.5 87.6 73.4 225.5
4 201.7 97.6 103.6 225.5
2018-01-04 00:00:00 2018-02-04 00:00:00 2018-03-04 00:00:00
0 29.0 60.0 220.0
1 91.8 102.5 168.3
2 227.5 227.5 227.5
3 225.5 225.5 225.5
4 225.5 225.5 225.5
[5 rows x 27 columns]