Here is the problem:
I want to select a dataframe (say, df3) containing the rows of df1 whose index1 falls in the range between d_reach and d_start in df2.
Below is the code to generate samples:
import numpy as np
import pandas as pd
import datetime
from datetime import timedelta
index1 = pd.date_range(datetime.datetime(2021, 1, 1, 1, 1), periods = 1000, freq = "3min")
df1 = pd.DataFrame(np.random.random(1000), index = index1, columns = ['r'])
d_start = pd.date_range(datetime.datetime(2021, 1, 1, 1, 1), periods = 500, freq = "5min")
d_reach = d_start + timedelta(seconds = np.random.randint(low = 4, high = 6))
value = {'id3': np.tile([0,1], 250)}
df2 = pd.DataFrame(value, index = [d_start,d_reach])
df2.index.names = ['d_start','d_reach']
df2 is MultiIndexed.
The expected output of df3 should be:
2021-01-01 01:07:00 0.011026
2021-01-01 01:10:00 0.423813
...
Here, the index1 value 2021-01-01 01:07:00 in df1 is >= 2021-01-01 01:06:05, which is one of the d_reach values in df2, and the next index1 value 2021-01-01 01:10:00 is < 2021-01-01 01:11:00, which is the next d_start in df2.
Below is the code I tried, but it failed:
df = pd.DataFrame()
for i in df1.index:
    for idx1, idx2 in zip(df2.index.get_level_values(0).tolist(),
                          df2.index.get_level_values(1).tolist()):
        if i >= idx1 and i <= idx2:
            df = df.append(df1.loc[i])
I'd really appreciate any advice on how to find df3 in Python. Thanks!
Here is one way: cross join, then find the matches and filter them out:
mdf = pd.merge(df1.reset_index(), df2.reset_index(), how='cross')
result = mdf.loc[mdf['index'].between(mdf['d_start'], mdf['d_reach']),['index','r']].set_index('index')
print(result.head())
output:
>>>
r
index
2021-01-01 01:01:00 0.415163
2021-01-01 01:16:00 0.729592
2021-01-01 01:31:00 0.411244
2021-01-01 01:46:00 0.524753
2021-01-01 02:01:00 0.105035
That's going to be memory-intensive, though. Another way is to load your dataframes into an in-memory database, join them based on the condition, and load the result back into a result dataframe; you will find a lot of examples of that method online.
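For illustration, here is a minimal sketch of that approach using the standard-library sqlite3 module (the table names t1/t2 and the column name index1 are my own choices, not from the question; any SQL engine that supports range joins would work similarly):
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
# pandas writes datetimes as ISO-8601 strings, which compare correctly in SQLite
df1.rename_axis('index1').reset_index().to_sql('t1', con, index=False)
df2.reset_index().to_sql('t2', con, index=False)

query = '''
SELECT t1.index1, t1.r
FROM t1
JOIN t2
  ON t1.index1 BETWEEN t2.d_start AND t2.d_reach
'''
result = pd.read_sql_query(query, con, parse_dates=['index1']).set_index('index1')
con.close()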
I am a beginner in Python. This seems like something that would have been asked before, but I have been trying to find the answer for 3 days at this point and can't find it.
I created a dataframe using pandas after running pytesseract on an image. Everything is fine except one 'minor' thing: when I print the dataframe, if the first series assigned is 'Date', it shows only the first row:
df['Date'] = pd.Series(date_date)
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
Date In Out
0 2022-05-31 0.0 7700.0
If I change the column sequence and put the 'Date' column in any other position, it comes out fine:
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = pd.Series(date_date)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
In Out Date
0 0.0 7700.0 2022-05-31
1 0.0 4232.0 2022-05-31
2 0.0 16056.0 2022-05-31
3 0.0 80000.0 2022-05-31
4 0.0 40000.0 2022-05-31
5 0.0 105805.0 2022-05-31
6 0.0 185500.0 2022-05-31
7 0.0 52188.0 2022-05-31
Can anyone explain why this is happening and how to fix it? I would like Date to remain the first column, but of course I want all rows!
Thank you in advance.
Here is the complete code if that helps:
import cv2
import pytesseract
import pandas as pd
from datetime import datetime
pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread("C:\\Users\\Fast Computer\\Documents\\Python test\\Images\\page-0.png")
thresh = 255
#Coordinates and ROI for Amount Out
x3,y3,w3,h3 = 577, 495, 172, 815
ROI_3 = img[y3:y3+h3,x3:x3+w3]
#Coordinates and ROI for Amount In
x4,y4,w4,h4 = 754, 495, 175, 815
ROI_4 = img[y4:y4+h4,x4:x4+w4]
#Coordinates and ROI for Date
x5,y5,w5,h5 = 833, 174, 80, 22
ROI_5 = img[y5:y5+h5,x5:x5+w5]
#OCR and convert to strings
text_amount_out = pytesseract.image_to_string(ROI_3)
text_amount_in = pytesseract.image_to_string(ROI_4)
text_date = pytesseract.image_to_string(ROI_5)
text_amount_out = text_amount_out.replace(',', '')
text_amount_in = text_amount_in.replace(',', '')
cv2.waitKey(0)
cv2.destroyAllWindows()
#Convert Strings to Lists
list_amount_out = text_amount_out.split()
list_amount_in = text_amount_in.split()
list_date = text_date.split()
float_out = []
for item in list_amount_out:
    float_out.append(float(item))
float_in = []
for item in list_amount_in:
    float_in.append(float(item))
date_date = datetime.strptime(text_date, '%d/%m/%Y ')
#Creating columns
df = pd.DataFrame()
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = pd.Series(date_date)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
Your problem lies with how you initialize and then update the pd.DataFrame().
import pandas as pd
from datetime import datetime
float_in = [0.0,0.5,1.0]
float_out = [0.0,0.5,1.0,1.5]
# this line just gives you 1 value:
date_date = datetime.strptime('01/01/2022 ', '%d/%m/%Y ')
# date_date = datetime.strptime(text_date, '%d/%m/%Y ')
# creates an empty df
df = pd.DataFrame()
print(df.shape)
# (0, 0)
Now, when you first fill the df only with a series that contains date_date, we get:
df['Date'] = pd.Series(date_date) # 1 row
print(df.shape)
# (1, 1)
print(df)
# Date
# 0 2022-01-01
Adding any other (longer) pd.Series() to this will not add rows to the df; rather, it will only add the first value of that series:
df['In'] = pd.Series(float_in)
print(df)
# Date In
# 0 2022-01-01 0.0
One way to avoid this is to initialize your df with an index that spans the length of your longest list:
max_length = max(map(len, [float_in, float_out])) # 4
df = pd.DataFrame(index=range(max_length))
print(df.shape)
# (4, 0), so now we start with 4 rows
df['Date'] = pd.Series(date_date)
print(df)
# Date
# 0 2022-01-01
# 1 NaT
# 2 NaT
# 3 NaT
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
Date In Out
0 2022-01-01 0.0 0.0
1 2022-01-01 0.5 0.5
2 2022-01-01 1.0 1.0
3 2022-01-01 0.0 1.5
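A compact alternative sketch (my own, reusing the float_in, float_out and date_date defined above): let pandas align the two value Series on a shared index first, then broadcast the scalar date:
# a dict of unequal-length Series -> the frame takes the union of their indexes
df = pd.DataFrame({'In': pd.Series(float_in), 'Out': pd.Series(float_out)})
df.insert(0, 'Date', date_date)  # a scalar value is repeated for every row
df = df.fillna({'In': 0, 'Out': 0})
print(df)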
You need to use an iterable with the date repeated rather than a single date; consider the following simple example:
import datetime
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.Series(datetime.date(1900,1,1))
df['Values'] = pd.Series([1.5,2.5,3.5])
print(df)
gives output
Date Values
0 1900-01-01 1.5
whilst
import datetime
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.Series([datetime.date(1900,1,1)]*3) # repeat 3 times
df['Values'] = pd.Series([1.5,2.5,3.5])
print(df)
gives output
Date Values
0 1900-01-01 1.5
1 1900-01-01 2.5
2 1900-01-01 3.5
import datetime
import yfinance as yf
import numpy as np
import pandas as pd
ETF_DB = ['QQQ', 'EGFIX']
fundsret = yf.download(ETF_DB, start=datetime.date(2020,12,31), end=datetime.date(2022,4,30), interval='1mo')['Adj Close'].pct_change()
df = pd.DataFrame(fundsret)
df
This gives me a dataframe of monthly percentage changes.
I'm trying to remove the rows in the dataframe that aren't month end, such as the row 2021-03-22. How do I have the dataframe remove the rows where the date doesn't end in '01'?
df.reset_index(inplace=True)
# Convert the date to datetime64
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
#select only day = 1
filtered = df.loc[df['Date'].dt.day == 1]
Did you mean month start?
You can use:
df = df[df.index.day==1]
reproducible example:
df = pd.DataFrame(columns=['A', 'B'],
index=['2021-01-01', '2021-02-01', '2021-03-01',
'2021-03-22', '2021-03-31'])
df.index = pd.to_datetime(df.index, dayfirst=False)
output:
A B
2021-01-01 NaN NaN
2021-02-01 NaN NaN
2021-03-01 NaN NaN
end of month
For the end of month, you can add 1 day and check whether this jumps to the next month:
end = (df.index+pd.Timedelta('1d')).month != df.index.month
df = df[end]
or add an offset and check if the value is unchanged:
end = df.index == (df.index + pd.offsets.MonthEnd(0))
df = df[end]
output:
A B
2021-03-31 NaN NaN
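pandas also exposes these checks directly on a DatetimeIndex, which should give the same selections as the approaches above:
# built-in boolean attributes on DatetimeIndex
month_start = df[df.index.is_month_start]
month_end = df[df.index.is_month_end]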
import pandas as pd
# Dummy dictionary
data = {
    'Date': ['2021-01-01', '2022-03-01', '2023-04-22', '2023-04-01'],
    'Name': ['A', 'B', 'C', 'D']
}
# Making a DataFrame
df = pd.DataFrame(data)
# Date pattern required (first day of the month)
pattern = r'(\d{4})-(\d{2})-01'
new_df = df[df['Date'].str.match(pattern)]
print(new_df)
I have a dataframe with a huge number of rows, and I want to apply a conditional groupby sum to this dataframe.
This is an example of my dataframe and code:
import pandas as pd
data = {'Case': [1, 1, 1, 1, 1, 1],
'Id': [1, 1, 1, 1, 2, 2],
'Date1': ['2020-01-01', '2020-01-01', '2020-02-01', '2020-02-01', '2020-01-01', '2020-01-01'],
'Date2': ['2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01'],
'Quantity': [50,100,150,20,30,35]
}
df = pd.DataFrame(data)
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
sum_list = []
for d in df['Date1'].unique():
    temp = df.groupby(['Case','Id']).apply(lambda x: x[(x['Date2'] == d) & (x['Date1'] < d)]['Quantity'].sum()).rename('sum').to_frame()
    temp['Date'] = d
    sum_list.append(temp)
output = pd.concat(sum_list, axis=0).reset_index()
When I apply this for loop to the real dataframe, it's extremely slow. I want to find a better way to do this conditional groupby sum operation. Here are my questions:
Is a for loop a good method to do what I need here?
Are there better ways to replace line 1 inside the for loop?
I feel line 2 inside the for loop is also time-consuming; how should I improve it?
Thanks for your help.
One option is a double merge and a groupby:
date = pd.Series(df.Date1.unique(), name='Date')
step1 = df.merge(date, left_on = 'Date2', right_on = 'Date', how = 'outer')
step2 = step1.loc[step1.Date1 < step1.Date]
step2 = step2.groupby(['Case', 'Id', 'Date']).agg(sum=('Quantity','sum'))
(df
.loc[:, ['Case', 'Id', 'Date2']]
.drop_duplicates()
.rename(columns={'Date2':'Date'})
.merge(step2, how = 'left', on = ['Case', 'Id', 'Date'])
.fillna({'sum': 0}, downcast='infer')
)
Case Id Date sum
0 1 1 2020-01-01 0
1 1 1 2020-02-01 100
2 1 2 2020-01-01 0
3 1 2 2020-02-01 35
apply is the slow one. Avoid it as much as you can.
I tested this with your small snippet and it gives the correct answer. You need to test more thoroughly with your real data:
case = df["Case"].unique()
id_= df["Id"].unique()
d = df["Date1"].unique()
index = pd.MultiIndex.from_product([case, id_, d], names=["Case", "Id", "Date"])
# Sum only rows whose Date2 belong to a specific list of dates
# This is equivalent to `x['Date2'] == d` in your original code
cond = df["Date2"].isin(d)
tmp = df[cond].groupby(["Case", "Id", "Date1", "Date2"], as_index=False).sum()
# Select only those sums where Date1 < Date2 and sum again
# This takes care of the `x['Date1'] < d` condition
cond = tmp["Date1"] < tmp["Date2"]
output = tmp[cond].groupby(["Case", "Id", "Date2"]).sum().reindex(index, fill_value=0).reset_index()
Another solution:
x = df.groupby(["Case", "Id", "Date1"], as_index=False).apply(
lambda x: x.loc[x["Date1"] < x["Date2"], "Quantity"].sum()
)
print(
x.pivot(index=["Case", "Id"], columns="Date1", values=None)
.fillna(0)
.melt(ignore_index=False)
.drop(columns=[None])
.reset_index()
.rename(columns={"Date1": "Date", "value":"sum"})
)
Prints:
Case Id Date sum
0 1 1 2020-01-01 100.0
1 1 2 2020-01-01 35.0
2 1 1 2020-02-01 0.0
3 1 2 2020-02-01 0.0
I have a dataframe with three columns, viz., date, commodity and value. I want to add another column, median_20, the rolling median of the last 20 days for each commodity in the df. I also want to add other columns which show the value n days before; for example, the lag_1 column shows the value 1 day before for a given commodity, lag_2 shows the value 2 days before, and so on. My df is quite big (>2 million rows).
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date')
date commodity value
0 2017-01-01 GOLD -1.239422
0 2017-01-01 SILVER -0.209840
1 2017-01-02 SILVER 0.146293
1 2017-01-02 GOLD 1.422454
2 2017-01-03 GOLD 0.453222
...
Try:
import pandas as pd
import numpy as np
# create dataframe
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort_values(by='date').reset_index(drop=True)
# create columns
# groupby().rolling() returns a (market, original index) MultiIndex, so drop the
# group level before assigning, otherwise it will not align with df's index
df['median_20_temp'] = (df.groupby('market')['commodity']
                          .rolling(20).median()
                          .reset_index(level=0, drop=True))
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
Edit:
The following should work with version 0.16.2:
import numpy as np
import pandas as pd
np.random.seed(123)
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date').reset_index(drop=True)
# create columns
df['median_20_temp'] = df.groupby('market')['commodity'].apply(lambda s: pd.rolling_median(s, 20))
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
I hope this helps.
I am sure there is a more efficient way; in the meantime, try this solution:
for commo in df.market.unique():
    df.loc[df.market==commo, 'lag_1'] = df.loc[df.market==commo, 'commodity'].shift(1)
    df.loc[df.market==commo, 'median_20'] = pd.rolling_median(df.loc[df.market==commo, 'commodity'], 20)
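For current pandas versions, where pd.rolling_median is no longer available, the same columns can be built with groupby().transform, which keeps the result aligned with the original index. This is a sketch of that variant, not part of the original answers; the shift(1) mirrors the first answer and excludes the current row from the median:
g = df.groupby('market')['commodity']
df['median_20'] = g.transform(lambda s: s.rolling(20).median().shift(1))  # rolling median of the previous 20 rows per market
df['lag_1'] = g.shift(1)   # value 1 row earlier within the same market
df['lag_2'] = g.shift(2)   # value 2 rows earlier within the same market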
I have two dataframes (df1, df2) and I would like to create a new column in df1 that indicates if there is a match in the code columns between each dataframe. The code column in df2 is made up of strings separated by a comma.
df1
Date Code
2016-01-01 LANH08
2016-01-01 LAOH07
2016-01-01 LAPH09
2016-01-01 LAQH06
2016-01-01 LARH03
df2
Date Code
2016-01-01 LANH08, LAOH07, LXA0EW, LAGRL1
2016-01-01 LAUH02, LAVH00, LAVH01, LAYH00
2016-01-01 LANH08
2016-01-01 AAH00, ABH00, XAH03
2016-01-01 ARH04, BA0BW, BMH01, DPH00
My Goal
df1
Date Code Match
2016-01-01 LANH08 Y
2016-01-01 LAOH07 Y
2016-01-01 LAPH09 N
2016-01-01 LAQH06 N
2016-01-01 LARH03 N
#Split df2['Code'] into an array
df2.Code = df2.Code.str.split(', ')
#Recreate df2 reshaped
df2 = pd.concat([pd.DataFrame({'Date': df2['Date'].iloc[i], 'Code': df2['Code'].iloc[i]},
                              index=range(len(df2['Code'].iloc[i]))) for i in range(len(df2.index))])
# default df2['Match'] to 'Y'
df2['Match'] = 'Y'
#Create new dataframe by left merging df1 with df2
df3 = df1.merge(df2, left_on = ['Date','Code'], right_on = ['Date','Code'], how = 'left')
#Fill NaN values in Match column with 'N' (because they weren't in df2)
df3['Match'] = df3['Match'].fillna('N')
See: Split pandas dataframe string entry to separate rows
Final Solution:
data1 = {'Date':['2016-01-01',
'2016-01-01',
'2016-01-01',
'2016-01-01',
'2016-01-01'],
'Code':['LANH08',
'LAOH07',
'LAPH09',
'LAQH06',
'LARH03']}
df1 = pd.DataFrame(data1)
data2 = {'Date':['2016-01-01',
'2016-01-01',
'2016-01-01',
'2016-01-01',
'2016-01-01'],
'Code':['LANH08, LAOH07, LXA0EW, LAGRL1',
'LAUH02, LAVH00, LAVH01, LAYH00',
'LANH08',
'AAH00, ABH00, XAH03',
'LAUH02, LAVH00']}
df2 = pd.DataFrame(data2)
df2 = pd.DataFrame(df2.Code.str.split(', ').tolist(), index=df2.Date).stack().drop_duplicates()
df2 = df2.reset_index()[[0, 'Date']] # Code variable is currently labeled 0
df2.columns = ['Code', 'Date'] # Renaming Code
# default df2['Match'] to 'Y'
df2['Match'] = 'Y'
# Create new dataframe by left merging df1 with df2
df3 = df1.merge(df2, left_on = ['Code', 'Date'], right_on = ['Code', 'Date'], how = 'left')
# Fill NaN values in Match column with 'N' (because they weren't in df2)
df3['Match'] = df3['Match'].fillna('N')
df3
Code Date Match
0 LANH08 2016-01-01 Y
1 LAOH07 2016-01-01 Y
2 LAPH09 2016-01-01 N
3 LAQH06 2016-01-01 N
4 LARH03 2016-01-01 N
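If a newer pandas is available (0.25 or later), the same split-and-merge logic can also be written with DataFrame.explode. This is a sketch of my own, not the solution above, and it starts again from the original comma-separated df2 built from data2:
# rebuild df2 from data2, since the solution above reassigned it
df2 = pd.DataFrame(data2)
# one row per (Date, Code) pair
df2_long = df2.assign(Code=df2['Code'].str.split(', ')).explode('Code').drop_duplicates()
df2_long['Match'] = 'Y'
df3 = df1.merge(df2_long, on=['Date', 'Code'], how='left')
df3['Match'] = df3['Match'].fillna('N')
print(df3)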