Date of YYYYMMDDHHMMSS in pandas data frame - python

My data frame contains an IGN_DATE column in which the values are of the form 20080727142700, i.e. the format is YYYYMMDDHHMMSS.
The column type is float64.
How can I get separate columns for the time, the date (without 00:00:00), the day, and the month?
What I tried:
Column name IGN_DATE
dataframe - df
df['IGN_DATE'] = df['IGN_DATE'].apply(str)
df['DATE'] = pd.to_datetime(df['IGN_DATE'].str.slice(start = 0, stop = 8))
df['MONTH'] = df['IGN_DATE'].str.slice(start = 4, stop = 6).astype(int)
df['DAY'] = df['IGN_DATE'].str.slice(start = 6, stop = 8).astype(int)
df['TIME'] = df['IGN_DATE'].str.slice(start = 8, stop = 14)
DATE is in the format YYYY-MM-DD 00:00:00. I don't want 00:00:00 in DATE.
How can I convert TIME, which is a string, to HH:MM:SS?
Is there any simpler way to do this?

If rows with NaN values are not important, you can dropna, then convert with to_datetime using a specified format, then use the .dt accessor to access the desired values:
# Drop Rows with nan in IGN_DATE column
df = df.dropna(subset=['IGN_DATE'])
# Convert dtype to whole number then to `str`
df['IGN_DATE'] = df['IGN_DATE'].astype('int64').astype(str)
# Series of datetime values from Column
s = pd.to_datetime(df['IGN_DATE'], format='%Y%m%d%H%M%S')
# Extract out and add to DataFrame from `s`
df['DATE'] = s.dt.date
df['MONTH'] = s.dt.month
df['DAY'] = s.dt.day
df['TIME'] = s.dt.time
Otherwise, you can mask the notna values of IGN_DATE and assign only those rows:
# Mask not null values
m = df['IGN_DATE'].notna()
# Convert to String
df.loc[m, 'IGN_DATE'] = df.loc[m, 'IGN_DATE'].astype('int64').astype(str)
# Series of datetime values from Column
s = pd.to_datetime(df['IGN_DATE'], format='%Y%m%d%H%M%S')
# Extract out and add to DataFrame from `s`
df.loc[m, 'DATE'] = s.dt.date
df.loc[m, 'MONTH'] = s.dt.month
df.loc[m, 'DAY'] = s.dt.day
df.loc[m, 'TIME'] = s.dt.time
Sample DF:
import numpy as np
import pandas as pd
df = pd.DataFrame({'IGN_DATE': [20080727142700, np.nan, 20151015171807]})
Sample Output with dropna:
IGN_DATE DATE MONTH DAY TIME
0 20080727142700 2008-07-27 7 27 14:27:00
2 20151015171807 2015-10-15 10 15 17:18:07
Sample Output with mask:
IGN_DATE DATE MONTH DAY TIME
0 20080727142700 2008-07-27 7.0 27.0 14:27:00
1 NaN NaN NaN NaN NaN
2 20151015171807 2015-10-15 10.0 15.0 17:18:07
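If plain strings are preferred over the date/time objects that .dt.date and .dt.time return, dt.strftime produces the same columns as formatted text. A minimal sketch, repeating the parsing step from above on a small sample:

```python
import pandas as pd

df = pd.DataFrame({'IGN_DATE': [20080727142700.0, 20151015171807.0]})
# Same parsing as above: whole number -> str -> datetime
s = pd.to_datetime(df['IGN_DATE'].astype('int64').astype(str),
                   format='%Y%m%d%H%M%S')
df['DATE'] = s.dt.strftime('%Y-%m-%d')  # 'YYYY-MM-DD', no 00:00:00
df['TIME'] = s.dt.strftime('%H:%M:%S')  # 'HH:MM:SS'
```

Note the trade-off: strftime columns are strings, so they format nicely but lose the datetime dtype for further arithmetic.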

Python - Print(df) Only Showing First Row

I am a beginner with Python. This seems like something that would have been asked before, but I have been trying to find the answer for 3 days at this point and can't.
I created a dataframe with pandas after running pytesseract on an image. Everything is fine except one 'minor' thing: when I print the dataframe, if the first column is 'Date', it shows only the first row:
df['Date'] = pd.Series(date_date)
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
Date In Out
0 2022-05-31 0.0 7700.0
If I change the column sequence and keep column 'Date' on any other position, it comes out fine:
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = pd.Series(date_date)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
In Out Date
0 0.0 7700.0 2022-05-31
1 0.0 4232.0 2022-05-31
2 0.0 16056.0 2022-05-31
3 0.0 80000.0 2022-05-31
4 0.0 40000.0 2022-05-31
5 0.0 105805.0 2022-05-31
6 0.0 185500.0 2022-05-31
7 0.0 52188.0 2022-05-31
Can anyone guide as to why this is happening and how to fix it? I would like the Date to remain the first column but of course I want all rows!
Thank you in advance.
Here is the complete code if that helps:
import cv2
import pytesseract
import pandas as pd
from datetime import datetime
pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread("C:\\Users\\Fast Computer\\Documents\\Python test\\Images\\page-0.png")
thresh = 255
#Coordinates and ROI for Amount Out
x3,y3,w3,h3 = 577, 495, 172, 815
ROI_3 = img[y3:y3+h3,x3:x3+w3]
#Coordinates and ROI for Amount In
x4,y4,w4,h4 = 754, 495, 175, 815
ROI_4 = img[y4:y4+h4,x4:x4+w4]
#Coordinates and ROI for Date
x5,y5,w5,h5 = 833, 174, 80, 22
ROI_5 = img[y5:y5+h5,x5:x5+w5]
#OCR and convert to strings
text_amount_out = pytesseract.image_to_string(ROI_3)
text_amount_in = pytesseract.image_to_string(ROI_4)
text_date = pytesseract.image_to_string(ROI_5)
text_amount_out = text_amount_out.replace(',', '')
text_amount_in = text_amount_in.replace(',', '')
cv2.waitKey(0)
cv2.destroyAllWindows()
#Convert Strings to Lists
list_amount_out = text_amount_out.split()
list_amount_in = text_amount_in.split()
list_date = text_date.split()
float_out = []
for item in list_amount_out:
    float_out.append(float(item))
float_in = []
for item in list_amount_in:
    float_in.append(float(item))
date_date = datetime.strptime(text_date, '%d/%m/%Y ')
#Creating columns
df = pd.DataFrame()
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = pd.Series(date_date)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
Your problem lies with how you initialize and then update the pd.DataFrame().
import pandas as pd
from datetime import datetime
float_in = [0.0,0.5,1.0]
float_out = [0.0,0.5,1.0,1.5]
# this line just gives you 1 value:
date_date = datetime.strptime('01/01/2022 ', '%d/%m/%Y ')
# date_date = datetime.strptime(text_date, '%d/%m/%Y ')
# creates an empty df
df = pd.DataFrame()
print(df.shape)
# (0, 0)
Now, when you first fill the df only with a series that contains date_date, we get:
df['Date'] = pd.Series(date_date) # 1 row
print(df.shape)
# (1, 1)
print(df)
# Date
# 0 2022-01-01
Adding any other (longer) pd.Series() to this will not add rows to the df; rather, it will only add the first value of that series:
df['In'] = pd.Series(float_in)
print(df)
# Date In
# 0 2022-01-01 0.0
One way to avoid this is to initialize your df with an index that spans the length of your longest list:
max_length = max(map(len, [float_in, float_out])) # 4
df = pd.DataFrame(index=range(max_length))
print(df.shape)
# (4, 0), so now we start with 4 rows
df['Date'] = pd.Series(date_date)
print(df)
# Date
# 0 2022-01-01
# 1 NaT
# 2 NaT
# 3 NaT
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)
print(df)
Date In Out
0 2022-01-01 0.0 0.0
1 2022-01-01 0.5 0.5
2 2022-01-01 1.0 1.0
3 2022-01-01 0.0 1.5
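Another way to sidestep the length problem, as a sketch: build the frame from the two lists with pd.concat, which aligns the Series on the union of their indexes, and only then assign the scalar date, which broadcasts to every existing row:

```python
import pandas as pd
from datetime import datetime

float_in = [0.0, 0.5, 1.0]
float_out = [0.0, 0.5, 1.0, 1.5]
date_date = datetime.strptime('01/01/2022', '%d/%m/%Y')

# concat aligns the two Series on the union of their indexes -> 4 rows
df = pd.concat({'In': pd.Series(float_in), 'Out': pd.Series(float_out)}, axis=1)
df['In'] = df['In'].fillna(0)
df['Date'] = date_date  # a scalar broadcasts to every existing row
```

This avoids having to compute the maximum length up front, since concat determines the row count itself.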
You need to use an iterable with the repeated date rather than a single date. Consider the following simple example:
import datetime
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.Series(datetime.date(1900,1,1))
df['Values'] = pd.Series([1.5,2.5,3.5])
print(df)
gives output
Date Values
0 1900-01-01 1.5
whilst
import datetime
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.Series([datetime.date(1900,1,1)]*3) # repeat 3 times
df['Values'] = pd.Series([1.5,2.5,3.5])
print(df)
gives output
Date Values
0 1900-01-01 1.5
1 1900-01-01 2.5
2 1900-01-01 3.5
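The same broadcasting also happens when the frame is built in a single constructor call: a scalar value in the dict is repeated to match the list-valued columns, so the explicit `* 3` repetition is not needed. A minimal sketch:

```python
import datetime
import pandas as pd

# the scalar 'Date' is broadcast to the length of the list-valued column
df = pd.DataFrame({'Date': datetime.date(1900, 1, 1),
                   'Values': [1.5, 2.5, 3.5]})
```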

How do I remove rows of a Pandas DataFrame based on a certain condition?

import datetime
import yfinance as yf
import numpy as np
import pandas as pd
ETF_DB = ['QQQ', 'EGFIX']
fundsret = yf.download(ETF_DB, start=datetime.date(2020,12,31), end=datetime.date(2022,4,30), interval='1mo')['Adj Close'].pct_change()
df = pd.DataFrame(fundsret)
df
This gives me a DataFrame (shown as a screenshot in the original post).
I'm trying to remove the rows in the dataframe that aren't month end, such as the row 2021-03-22. How do I have the dataframe go through and remove the rows where the date doesn't end in '01'?
df.reset_index(inplace=True)
# Convert the date to datetime64
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
#select only day = 1
filtered = df.loc[df['Date'].dt.day == 1]
Did you mean month start?
You can use:
df = df[df.index.day==1]
reproducible example:
df = pd.DataFrame(columns=['A', 'B'],
index=['2021-01-01', '2021-02-01', '2021-03-01',
'2021-03-22', '2021-03-31'])
df.index = pd.to_datetime(df.index, dayfirst=False)
output:
A B
2021-01-01 NaN NaN
2021-02-01 NaN NaN
2021-03-01 NaN NaN
end of month
for the end of month, you can add 1 day and check if this jumps to the next month:
end = (df.index+pd.Timedelta('1d')).month != df.index.month
df = df[end]
or add an offset and check if the value is unchanged:
end = df.index == (df.index + pd.offsets.MonthEnd(0))
df = df[end]
output:
A B
2021-03-31 NaN NaN
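As an alternative to the offset arithmetic above, DatetimeIndex exposes is_month_end and is_month_start flags directly, which can read more clearly. A sketch on the same example index:

```python
import pandas as pd

df = pd.DataFrame(columns=['A', 'B'],
                  index=pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01',
                                        '2021-03-22', '2021-03-31']))
month_end = df[df.index.is_month_end]      # keeps only 2021-03-31
month_start = df[df.index.is_month_start]  # keeps the three day-1 rows
```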
import pandas as pd
# Dummy dictionary
data = {
    'Date': ['2021-01-01', '2022-03-01', '2023-04-22', '2023-04-01'],
    'Name': ['A', 'B', 'C', 'D']
}
# Making a DataFrame
df = pd.DataFrame(data)
# Date pattern required: the day part must be 01
pattern = r'(\d{4})-(\d{2})-01'
new_df = df[df['Date'].str.match(pattern)]
print(new_df)

Find Pandas column largest/smallest values where dates don't overlap

I have a DataFrame like:
df = pd.DataFrame(index = [0,1,2,3,4,5])
df['XYZ'] = [2, 8, 6, 5, 9, 10]
df['Date2'] = ["2005-01-06", "2005-01-07", "2005-01-08", "1994-06-08", "1999-06-15", "2005-01-09"]
df['Date1'] = ["2005-01-02", "2005-01-03", "2005-01-04", "1994-06-04", "1999-06-12", "2005-01-05"]
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
I need to follow the 2 largest values of XYZ with dates that do not overlap. The expected output would be:
XYZ Date1 Date2
10 2005-01-05 2005-01-09
9 1999-06-12 1999-06-15
5 1994-06-04 1994-06-08
I tried to sort by "XYZ":
df.sort_values(by="XYZ", ascending=False, inplace=True)
And then compare dates:
df['overlap'] = (df['Date1'] <= df['Date2'].shift()) & (df['Date2'] >= df['Date1'].shift())
And then drop any True values in df['overlap'] and take the nlargest() values, however that results in cases that do overlap.
Any help would be much appreciated.
This is somewhat involved, but hopefully it will work for you. We introduce a mask indexed by every date between the min and max dates in your df, mark each date as 'used' once an accepted row's range covers it, and then use the mask to reject overlapping rows.
First we get the min and the max date (while also sorting the original df by 'XYZ')
df1 = df.sort_values('XYZ', ascending = False)
dmin, dmax = df1[['Date1', 'Date2']].unstack().agg([min,max])
then we create a mask populated with 0s initially
mask = pd.Series(index = pd.date_range(dmin,dmax), data = 0)
Then we iterate over rows marking those we want in the 'include' column
for idx, row in df1.iterrows():
    if sum(mask[row['Date1']:row['Date2']]) > 0:
        df1.loc[idx, 'include'] = False
        continue
    mask[row['Date1']:row['Date2']] = 1
    df1.loc[idx, 'include'] = True
finally filter on 'include'
df1[df1['include']].drop(columns = 'include')
output
XYZ Date1 Date2
5 10 2005-01-05 2005-01-09
4 9 1999-06-12 1999-06-15
3 5 1994-06-04 1994-06-08
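An equivalent greedy pass, sketched without the date mask: check each candidate row directly against the rows already kept, using the fact that two ranges overlap unless one ends before the other starts. This trades the O(date-range) mask for a comparison against each kept row:

```python
import pandas as pd

df = pd.DataFrame({'XYZ': [2, 8, 6, 5, 9, 10]})
df['Date2'] = pd.to_datetime(['2005-01-06', '2005-01-07', '2005-01-08',
                              '1994-06-08', '1999-06-15', '2005-01-09'])
df['Date1'] = pd.to_datetime(['2005-01-02', '2005-01-03', '2005-01-04',
                              '1994-06-04', '1999-06-12', '2005-01-05'])

kept = []
for _, row in df.sort_values('XYZ', ascending=False).iterrows():
    # accept the row only if it overlaps none of the rows kept so far
    if all(row['Date1'] > k['Date2'] or row['Date2'] < k['Date1'] for k in kept):
        kept.append(row)
result = pd.DataFrame(kept)
```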

Keep pandas DataFrame rows in df2 for each row in df1 with timedelta

I have two pandas dataframes. I would like to keep all rows in df2 where Type is equal to Type in df1 AND Date is between Date in df1 (- 1 day or + 1 day). How can I do this?
df1
IBSN Type Date
0 1 X 2014-08-17
1 1 Y 2019-09-22
df2
IBSN Type Date
0 2 X 2014-08-16
1 2 D 2019-09-22
2 9 X 2014-08-18
3 3 H 2019-09-22
4 3 Y 2019-09-23
5 5 G 2019-09-22
res
IBSN Type Date
0 2 X 2014-08-16 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] - 1
1 9 X 2014-08-18 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] + 1
2 3 Y 2019-09-23 <-- keep because Type = df1[1]['Type'] AND Date = df1[1]['Date'] + 1
This should do it:
import pandas as pd
from datetime import timedelta
# create dummy data
df1 = pd.DataFrame([[1, 'X', '2014-08-17'], [1, 'Y', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df1['Date'] = pd.to_datetime(df1['Date']) # might not be necessary if your Date column already contain datetime objects
df2 = pd.DataFrame([[2, 'X', '2014-08-16'], [2, 'D', '2019-09-22'], [9, 'X', '2014-08-18'], [3, 'H', '2019-09-22'], [3, 'Y', '2014-09-23'], [5, 'G', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df2['Date'] = pd.to_datetime(df2['Date']) # might not be necessary if your Date column already contain datetime objects
# add date boundaries to the first dataframe
df1['Date_from'] = df1['Date'].apply(lambda x: x - timedelta(days=1))
df1['Date_to'] = df1['Date'].apply(lambda x: x + timedelta(days=1))
# merge the date boundaries to df2 on 'Type'. Filter rows where date is between
# data_from and date_to (inclusive). Drop 'date_from' and 'date_to' columns
df2 = df2.merge(df1.loc[:, ['Type', 'Date_from', 'Date_to']], on='Type', how='left')
df2[(df2['Date'] >= df2['Date_from']) & (df2['Date'] <= df2['Date_to'])].\
drop(['Date_from', 'Date_to'], axis=1)
Note that according to your logic, row 4 in df2 (3 Y 2014-09-23) should not remain as its date (2014) is not in between the given dates in df1 (year 2019).
Assume the Date columns in both dataframes are already of dtype datetime. I would construct an IntervalIndex and assign it to the index of df1, map the Type column of df1 onto df2, and finally check equality to create a mask for slicing:
iix = pd.IntervalIndex.from_arrays(df1.Date + pd.Timedelta(days=-1),
df1.Date + pd.Timedelta(days=1), closed='both')
df1 = df1.set_index(iix)
s = df2['Date'].map(df1.Type)
df_final = df2[df2.Type == s]
Output:
IBSN Type Date
0 2 X 2014-08-16
2 9 X 2014-08-18
4 3 Y 2019-09-23

Set the correct datetime format with pandas

I have trouble setting the correct datetime format with pandas; I do not understand why my command does not work. Any solution?
date = ['01/10/2014 00:03:20']
value = [33.24]
df = pd.DataFrame({'value':value,'index':date})
df.index = pd.to_datetime(df.index,format='%d/%m/%y %H:%M:%S')
Solution for DatetimeIndex:
date = ['01/10/2014 00:03:20']
value = [33.24]
#create index by date list
df = pd.DataFrame({'value':value},index=date)
#use Y for match YYYY, y is for match YY years format
df.index = pd.to_datetime(df.index,format='%d/%m/%Y %H:%M:%S')
print (df)
value
2014-10-01 00:03:20 33.24
If want index column name is necessary use [] for avoid selecting RangeIndex:
df = pd.DataFrame({'value':value,'index':date})
df['index'] = pd.to_datetime(df['index'],format='%d/%m/%Y %H:%M:%S')
print (df)
value index
0 33.24 2014-10-01 00:03:20
Calling a column 'index' is a bit confusing, so I changed it to 'index_date'.
import pandas as pd
date = ['01/10/2014 00:03:20']
value = [33.24]
df = pd.DataFrame({'value':value,'index_date':date})
df['index_date'] = pd.to_datetime(df["index_date"], errors="coerce")
Output of df:
value index_date
0 33.24 2014-01-10 00:03:20
And if you run df.dtypes
value float64
index_date datetime64[ns]
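One caution about the errors="coerce" approach above: without an explicit format or dayfirst, pandas parses 01/10/2014 month-first as January 10, which is exactly what the output above shows. If the data is meant to be day-first, passing dayfirst=True (or the explicit format) keeps day and month in the right order. A sketch:

```python
import pandas as pd

# dayfirst=True makes '01/10/2014' parse as 1 October 2014, not 10 January
s = pd.to_datetime(pd.Series(['01/10/2014 00:03:20']), dayfirst=True)
```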
