I am a beginner to Python. This seems like something that would have been asked before, but I have been searching for an answer for 3 days and can't find one.
I created a dataframe with pandas after running pytesseract on an image. Everything is fine except one 'minor' thing. When I print the dataframe, if the first column is 'Date', only the first row is shown:
df['Date'] = pd.Series(date_date)
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
Date In Out
0 2022-05-31 0.0 7700.0
If I change the column sequence and keep column 'Date' on any other position, it comes out fine:
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = pd.Series(date_date)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
In Out Date
0 0.0 7700.0 2022-05-31
1 0.0 4232.0 2022-05-31
2 0.0 16056.0 2022-05-31
3 0.0 80000.0 2022-05-31
4 0.0 40000.0 2022-05-31
5 0.0 105805.0 2022-05-31
6 0.0 185500.0 2022-05-31
7 0.0 52188.0 2022-05-31
Can anyone explain why this is happening and how to fix it? I would like Date to remain the first column, but of course I want all rows!
Thank you in advance.
Here is the complete code if that helps:
import cv2
import pytesseract
import pandas as pd
from datetime import datetime
pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread("C:\\Users\\Fast Computer\\Documents\\Python test\\Images\\page-0.png")
thresh = 255
#Coordinates and ROI for Amount Out
x3,y3,w3,h3 = 577, 495, 172, 815
ROI_3 = img[y3:y3+h3,x3:x3+w3]
#Coordinates and ROI for Amount In
x4,y4,w4,h4 = 754, 495, 175, 815
ROI_4 = img[y4:y4+h4,x4:x4+w4]
#Coordinates and ROI for Date
x5,y5,w5,h5 = 833, 174, 80, 22
ROI_5 = img[y5:y5+h5,x5:x5+w5]
#OCR and convert to strings
text_amount_out = pytesseract.image_to_string(ROI_3)
text_amount_in = pytesseract.image_to_string(ROI_4)
text_date = pytesseract.image_to_string(ROI_5)
text_amount_out = text_amount_out.replace(',', '')
text_amount_in = text_amount_in.replace(',', '')
cv2.waitKey(0)
cv2.destroyAllWindows()
#Convert Strings to Lists
list_amount_out = text_amount_out.split()
list_amount_in = text_amount_in.split()
list_date = text_date.split()
float_out = []
for item in list_amount_out:
    float_out.append(float(item))
float_in = []
for item in list_amount_in:
    float_in.append(float(item))
date_date = datetime.strptime(text_date, '%d/%m/%Y ')
#Creating columns
df = pd.DataFrame()
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = pd.Series(date_date)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
Your problem lies with how you initialize and then update the pd.DataFrame().
import pandas as pd
from datetime import datetime
float_in = [0.0,0.5,1.0]
float_out = [0.0,0.5,1.0,1.5]
# this line just gives you 1 value:
date_date = datetime.strptime('01/01/2022 ', '%d/%m/%Y ')
# date_date = datetime.strptime(text_date, '%d/%m/%Y ')
# creates an empty df
df = pd.DataFrame()
print(df.shape)
# (0, 0)
Now, when you first fill the df only with a series that contains date_date, we get:
df['Date'] = pd.Series(date_date) # 1 row
print(df.shape)
# (1, 1)
print(df)
# Date
# 0 2022-01-01
Adding any other (longer) pd.Series() to this will not add rows to the df. Column assignment aligns the new series on the df's existing index, which here contains only the label 0, so only the first value of that series is kept:
df['In'] = pd.Series(float_in)
print(df)
# Date In
# 0 2022-01-01 0.0
One way to avoid this is to initialize your df with an index as long as your longest list:
max_length = max(map(len, [float_in, float_out])) # 4
df = pd.DataFrame(index=range(max_length))
print(df.shape)
# (4, 0), so now we start with 4 rows
df['Date'] = pd.Series(date_date)
print(df)
# Date
# 0 2022-01-01
# 1 NaT
# 2 NaT
# 3 NaT
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
Date In Out
0 2022-01-01 0.0 0.0
1 2022-01-01 0.5 0.5
2 2022-01-01 1.0 1.0
3 2022-01-01 0.0 1.5
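As an alternative sketch (not part of the answer above), the same table can probably be built without pre-sizing the index: pandas aligns a dict of Series on the union of their indexes, and a scalar passed to DataFrame.insert is broadcast to every row. The sample lists are the ones used above; the hard-coded date is an assumption standing in for the OCR result.

```python
import pandas as pd
from datetime import datetime

float_in = [0.0, 0.5, 1.0]
float_out = [0.0, 0.5, 1.0, 1.5]
# Assumed stand-in for the date parsed from the OCR text
date_date = datetime.strptime('01/01/2022', '%d/%m/%Y')

# Series of unequal length are aligned on the union of their indexes,
# so the shorter column is padded with NaN, which fillna(0) then replaces
df = pd.DataFrame({'In': pd.Series(float_in),
                   'Out': pd.Series(float_out)}).fillna(0)

# A scalar is broadcast to every existing row; position 0 keeps Date first
df.insert(0, 'Date', date_date)
print(df)
```

Because the date is added last, the row count is already fixed by the amount columns, so the column order no longer affects how many rows survive.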
You need to use an iterable with the date repeated rather than a single date. Consider the following simple example:
import datetime
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.Series(datetime.date(1900,1,1))
df['Values'] = pd.Series([1.5,2.5,3.5])
print(df)
gives output
Date Values
0 1900-01-01 1.5
whilst
import datetime
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.Series([datetime.date(1900,1,1)]*3) # repeat 3 times
df['Values'] = pd.Series([1.5,2.5,3.5])
print(df)
gives output
Date Values
0 1900-01-01 1.5
1 1900-01-01 2.5
2 1900-01-01 3.5
import datetime
import yfinance as yf
import numpy as np
import pandas as pd
ETF_DB = ['QQQ', 'EGFIX']
fundsret = yf.download(ETF_DB, start=datetime.date(2020,12,31), end=datetime.date(2022,4,30), interval='1mo')['Adj Close'].pct_change()
df = pd.DataFrame(fundsret)
df
Gives me:
I'm trying to remove the rows in the dataframe that aren't month end such as the row 2021-03-22. How do I have the dataframe go through and remove the rows where the date doesn't end in '01'?
df.reset_index(inplace=True)
# Convert the date to datetime64
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
#select only day = 1
filtered = df.loc[df['Date'].dt.day == 1]
Did you mean month start?
You can use:
df = df[df.index.day==1]
reproducible example:
df = pd.DataFrame(columns=['A', 'B'],
                  index=['2021-01-01', '2021-02-01', '2021-03-01',
                         '2021-03-22', '2021-03-31'])
df.index = pd.to_datetime(df.index, dayfirst=False)
output:
A B
2021-01-01 NaN NaN
2021-02-01 NaN NaN
2021-03-01 NaN NaN
end of month
For the end of the month, you can add 1 day and check whether this jumps to the next month:
end = (df.index+pd.Timedelta('1d')).month != df.index.month
df = df[end]
or add an offset and check if the value is unchanged:
end = df.index == (df.index + pd.offsets.MonthEnd(0))
df = df[end]
output:
A B
2021-03-31 NaN NaN
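For comparison, a minimal sketch using the is_month_start and is_month_end attributes of a DatetimeIndex, which express the same two filters directly. The column A and its values are made up for illustration; the dates are the ones from the reproducible example.

```python
import pandas as pd

# Illustrative frame with the same dates as the example above
df = pd.DataFrame(
    {'A': range(5)},
    index=pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01',
                          '2021-03-22', '2021-03-31']),
)

# Boolean arrays, one flag per row of the index
month_start = df[df.index.is_month_start]
month_end = df[df.index.is_month_end]

print(month_start)
print(month_end)
```

Both attributes require the index to already be a DatetimeIndex, so a string index would need pd.to_datetime first, as in the example above.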
import pandas as pd
import re
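# Dummy dictionary (named `data` to avoid shadowing the built-in `dict`)
data = {
    'Date': ['2021-01-01', '2022-03-01', '2023-04-22', '2023-04-01'],
    'Name': ['A', 'B', 'C', 'D']
}
# Making a DataFrame
df = pd.DataFrame(data)
# Required date pattern: YYYY-MM-01 (raw string so that \d is not treated as an escape)
pattern = r'\d{4}-\d{2}-01'
# Keep only the rows whose Date string matches the pattern
new_df = df[df['Date'].str.match(pattern)]
print(new_df)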
Here is the problem:
I want to select the dataframe (say, df3) with each index1 in df1 to be in the range between d_reach and d_start in df2,
Below is the code to generate samples:
import numpy as np
import pandas as pd
import datetime
from datetime import timedelta
index1 = pd.date_range(datetime.datetime(2021, 1, 1, 1, 1), periods = 1000, freq = "3min")
df1 = pd.DataFrame(np.random.random(1000), index = index1, columns = ['r'])
d_start = pd.date_range(datetime.datetime(2021, 1, 1, 1, 1), periods = 500, freq = "5min")
d_reach = d_start + timedelta(seconds = np.random.randint(low = 4, high = 6))
value = {'id3': np.tile([0,1], 250)}
df2 = pd.DataFrame(value, index = [d_start,d_reach])
df2.index.names = ['d_start','d_reach']
df2 is MultiIndexed.
The expected output of df3 should be:
2021-01-01 01:07:00 0.011026
2021-01-01 01:10:00 0.423813
...
Here the index1 value 2021-01-01 01:07:00 in df1 is >= 2021-01-01 01:06:05, which is one of the d_reach values in df2,
and the next index1 value 2021-01-01 01:10:00 in df1 is < 2021-01-01 01:11:00, which is the next d_start in df2.
Below is the code I tried but failed:
df = pd.DataFrame()
for i in df1.index:
    df = df.append(df1.loc[i])
    for idx1, idx2 in zip(df2.index.get_level_values(0).tolist(),
                          df2.index.get_level_values(1).tolist()):
        if i >= idx1 and i <= idx2:
I would really appreciate any advice on how to find df3 in Python. Thanks!
Here is one way: cross join the two dataframes, then keep only the rows of df1 whose timestamp falls between d_start and d_reach:
mdf = pd.merge(df1.reset_index(), df2.reset_index() , how='cross', on=None)
result = mdf.loc[mdf['index'].between(mdf['d_start'], mdf['d_reach']),['index','r']].set_index('index')
print(result.head())
output:
>>>
r
index
2021-01-01 01:01:00 0.415163
2021-01-01 01:16:00 0.729592
2021-01-01 01:31:00 0.411244
2021-01-01 01:46:00 0.524753
2021-01-01 02:01:00 0.105035
That is going to be memory-intensive, though. Another way is to load your dataframes into an in-memory database, join them on the condition, and load the result back into a dataframe; you will find many samples of that method online.
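A less memory-hungry sketch, offered as an assumption rather than as part of the answer above: when the d_start values are sorted and the intervals do not overlap, each timestamp in df1 only needs to be compared with the interval that starts at or just before it, and np.searchsorted finds that interval without building the full cross product. The sample data mirrors the question, but uses a fixed 5-second offset and deterministic values for r so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Sample data shaped like the question's (values of r are illustrative)
index1 = pd.date_range('2021-01-01 01:01', periods=1000, freq='3min')
df1 = pd.DataFrame({'r': np.arange(1000) / 1000}, index=index1)
d_start = pd.date_range('2021-01-01 01:01', periods=500, freq='5min')
d_reach = d_start + pd.Timedelta(seconds=5)  # assumed fixed offset

times = df1.index.values
starts = d_start.values
reaches = d_reach.values

# Position of the last interval whose start is <= each timestamp
# (-1 means the timestamp lies before the first start)
pos = np.searchsorted(starts, times, side='right') - 1

# A timestamp matches if such an interval exists and has not ended yet
inside = (pos >= 0) & (times <= reaches[np.clip(pos, 0, None)])
df3 = df1[inside]
print(df3.head())
```

This keeps the same rows as the between(d_start, d_reach) filter above, and memory use grows with len(df1) rather than with len(df1) * len(df2).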
My data frame contains an IGN_DATE column in which the values are of the form 20080727142700; the format is YYYYMMDDHHMMSS.
The column type is float64.
How can I get separate columns for the time, date (without 00:00:00), day, and month?
What I tried:
Column name IGN_DATE
dataframe - df
df['IGN_DATE'] = df['IGN_DATE'].apply(str)
df['DATE'] = pd.to_datetime(df['IGN_DATE'].str.slice(start = 0, stop = 8))
df['MONTH'] = df['IGN_DATE'].str.slice(start = 4, stop = 6).astype(int)
df['DAY'] = df['IGN_DATE'].str.slice(start = 6, stop = 8).astype(int)
df['TIME'] = df['IGN_DATE'].str.slice(start = 8, stop = 13)
DATE is in the format YYYY-MM-DD 00:00:00. I don't want 00:00:00 in DATE.
How do I get the time, which has type string, into HH:MM:SS format?
Is there any simpler way to do this?
If NaN values are not important, you can dropna, then convert with to_datetime using a specified format, then use the dt accessor to access the desired values:
# Drop Rows with nan in IGN_DATE column
df = df.dropna(subset=['IGN_DATE'])
# Convert dtype to whole number then to `str`
df['IGN_DATE'] = df['IGN_DATE'].astype('int64').astype(str)
# Series of datetime values from Column
s = pd.to_datetime(df['IGN_DATE'], format='%Y%m%d%H%M%S')
# Extract out and add to DataFrame from `s`
df['DATE'] = s.dt.date
df['MONTH'] = s.dt.month
df['DAY'] = s.dt.day
df['TIME'] = s.dt.time
Otherwise, you can mask the notna values of IGN_DATE and assign only those rows:
# Mask not null values
m = df['IGN_DATE'].notna()
# Convert to String
df.loc[m, 'IGN_DATE'] = df.loc[m, 'IGN_DATE'].astype('int64').astype(str)
# Series of datetime values from Column
s = pd.to_datetime(df['IGN_DATE'], format='%Y%m%d%H%M%S')
# Extract out and add to DataFrame from `s`
df.loc[m, 'DATE'] = s.dt.date
df.loc[m, 'MONTH'] = s.dt.month
df.loc[m, 'DAY'] = s.dt.day
df.loc[m, 'TIME'] = s.dt.time
Sample DF:
import numpy as np
import pandas as pd
df = pd.DataFrame({'IGN_DATE': [20080727142700, np.nan, 20151015171807]})
Sample Output with dropna:
IGN_DATE DATE MONTH DAY TIME
0 20080727142700 2008-07-27 7 27 14:27:00
2 20151015171807 2015-10-15 10 15 17:18:07
Sample Output with mask:
IGN_DATE DATE MONTH DAY TIME
0 20080727142700 2008-07-27 7.0 27.0 14:27:00
1 NaN NaN NaN NaN NaN
2 20151015171807 2015-10-15 10.0 15.0 17:18:07
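If the date and time are wanted as plain strings (the question asks for the time as HH:MM:SS), a minimal sketch with dt.strftime could look like the following. It reuses the sample frame and the dropna variant from above; the name valid is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'IGN_DATE': [20080727142700, np.nan, 20151015171807]})

# Work on a copy of the rows that actually have a value
valid = df.dropna(subset=['IGN_DATE']).copy()

# float -> int -> str, then parse with the explicit YYYYMMDDHHMMSS format
stamps = pd.to_datetime(valid['IGN_DATE'].astype('int64').astype(str),
                        format='%Y%m%d%H%M%S')

# strftime returns strings, so no 00:00:00 part is attached to DATE
valid['DATE'] = stamps.dt.strftime('%Y-%m-%d')
valid['TIME'] = stamps.dt.strftime('%H:%M:%S')
print(valid)
```

The trade-off is that string columns no longer support date arithmetic, whereas s.dt.date and s.dt.time above keep Python date and time objects.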
I have trouble setting the correct datetime format with pandas, and I do not understand why my command does not work. Any solution?
date = ['01/10/2014 00:03:20']
value = [33.24]
df = pd.DataFrame({'value':value,'index':date})
df.index = pd.to_datetime(df.index,format='%d/%m/%y %H:%M:%S')
Solution for DatetimeIndex:
date = ['01/10/2014 00:03:20']
value = [33.24]
#create index by date list
df = pd.DataFrame({'value':value},index=date)
#use %Y to match 4-digit years (YYYY); %y matches 2-digit years (YY)
df.index = pd.to_datetime(df.index,format='%d/%m/%Y %H:%M:%S')
print (df)
value
2014-10-01 00:03:20 33.24
If a column named index is necessary, use [] to select it, because df.index selects the RangeIndex instead:
df = pd.DataFrame({'value':value,'index':date})
df['index'] = pd.to_datetime(df['index'],format='%d/%m/%Y %H:%M:%S')
print (df)
value index
0 33.24 2014-10-01 00:03:20
Calling a column 'index' is a bit confusing, so I changed it to 'index_date'.
import pandas as pd
date = ['01/10/2014 00:03:20']
value = [33.24]
df = pd.DataFrame({'value':value,'index_date':date})
df['index_date'] = pd.to_datetime(df["index_date"], errors="coerce")
Output of df (with no explicit format, '01/10/2014' is parsed month-first, i.e. as 10 January 2014):
value index_date
0 33.24 2014-01-10 00:03:20
And if you run df.dtypes
value float64
index_date datetime64[ns]
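The two answers above produce different dates from the same string, so a small comparison may help. This sketch parses the question's value three ways; the variable names are chosen here only for illustration.

```python
import pandas as pd

raw = '01/10/2014 00:03:20'

# Explicit day-first format, as in the first answer
explicit = pd.to_datetime(raw, format='%d/%m/%Y %H:%M:%S')

# Same interpretation without spelling out the whole format
day_first = pd.to_datetime(raw, dayfirst=True)

# Default parsing treats the first number as the month
default = pd.to_datetime(raw)

print(explicit, day_first, default)
```

Which one is right depends on how the data was written; if the source uses day/month/year, an explicit format is the least ambiguous choice.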
I have a data frame that has a DatetimeIndex. I would like to take a date as user input and have Python look up the row for the month before it.
Here's an example: df is the name of the dataframe
date = input('Enter a date in YYYY-MM-DD format: ')
Enter a date in YYYY-MM-DD format: 2017-01-31
I would like Python to do something like df[date-1] and then print the result, so that I get:
2016-12-31 8.257478e+04
It's possible if the input date is in the index already, but I'm looking for a way to do it when the input is not.
Any ideas ? Thanks in advance
It seems you need get_loc to find the position of the value in the index, and then iloc to select by position:
pos = df.index.get_loc(d)
print (df.iloc[[pos - 1]])
Sample:
start = pd.to_datetime('2016-11-30')
rng = pd.date_range(start, periods=10, freq='M')
df = pd.DataFrame({'a': range(10)}, index=rng)
print (df)
a
2016-11-30 0
2016-12-31 1
2017-01-31 2
2017-02-28 3
2017-03-31 4
2017-04-30 5
2017-05-31 6
2017-06-30 7
2017-07-31 8
2017-08-31 9
d = '2017-01-31'
pos = df.index.get_loc(d)
print (df.iloc[[pos - 1]])
a
2016-12-31 1
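If the date is not in the index, add method='nearest' (this argument of get_loc was deprecated in pandas 1.4 and removed in 2.0; on newer versions, Index.get_indexer accepts method='nearest' instead):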
d = '2017-01-20'
pos = df.index.get_loc(d, method='nearest')
print (df.iloc[[pos - 1]])
a
2016-12-31 1
But if you need a more general solution, you have to use some conditions, like:
d = '2017-11-30'
pos = df.index.get_loc(d, method='nearest')
if pos == 0:
    print ('Value less or same as minimal date in DatetimeIndex')
else:
    print ('Value nearest less or same as date', df.index[pos])
    print ('Previous value', df.iloc[[pos - 1]])
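On pandas 2.0 and later, get_loc no longer accepts method, so the same lookup can be sketched with Index.get_indexer, which still supports method='nearest' on a sorted index. The helper name previous_row and the shortened sample index are assumptions made for this illustration.

```python
import pandas as pd

# Shortened version of the month-end sample index used above
idx = pd.to_datetime(['2016-11-30', '2016-12-31', '2017-01-31',
                      '2017-02-28', '2017-03-31'])
df = pd.DataFrame({'a': range(5)}, index=idx)

def previous_row(frame, date_str):
    """Return the row before the index entry nearest to date_str, or None."""
    target = pd.DatetimeIndex([date_str])
    # get_indexer returns one position per target value
    pos = frame.index.get_indexer(target, method='nearest')[0]
    if pos == 0:
        return None  # nearest entry is the first row, so nothing precedes it
    return frame.iloc[[pos - 1]]

print(previous_row(df, '2017-01-20'))
```

As with the get_loc version, "nearest" can pick an index entry after the input date; method='pad' would instead pick the last entry at or before it, which may be closer to what the question intends.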