I am trying to extract all of the previous day's data from a Google spreadsheet. When I hardcode the dates the data comes in perfectly, but when I try to make it more dynamic, so that I can automate the process, it does not work.
This is what I tried; any help is appreciated:
import pandas as pd
import re
import datetime
from dateutil import parser
sheet_id = "19SzfcL3muVeISycG5eFYUqwrwwReGETZsNtl-euGU"
sheet_name = "October-2022"
url=f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
ct = datetime.datetime.today()
pt= datetime.datetime.today() - datetime.timedelta(1)
#print(ct)
#print(pt)
df = pd.read_csv(url)
df['Timestamp1'] = pd.to_datetime(df['Timestamp'], format='%Y-%m-%d')
#filtered_df = df.loc[(df['Timestamp1'] > '2022-10-13') & (df['Timestamp1'] < '2022-10-14')]
filtered_df = df.loc[(df['Timestamp1'] > 'pt') & (df['Timestamp1'] < 'ct')]
filtered_df
When you filter the DataFrame you are comparing the timestamps with string literals ('pt' and 'ct'). Your approach is correct; just remove the quotation marks:
filtered_df = df.loc[(df['Timestamp1'] > pt) & (df['Timestamp1'] < ct)]
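As an aside, pt and ct from datetime.datetime.today() carry the current time of day, so the window does not start and end at midnight. A minimal sketch (reusing the df and Timestamp1 column from above) that normalizes both bounds so the filter covers the whole previous calendar day:

import pandas as pd

# midnight today and midnight yesterday
ct = pd.Timestamp.today().normalize()
pt = ct - pd.Timedelta(days=1)

# keep every row stamped on the previous calendar day
filtered_df = df.loc[(df['Timestamp1'] >= pt) & (df['Timestamp1'] < ct)]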
I need to find the difference between two dates where certain end dates are blank. I need to exclude the weekends as well as the holidays when calculating the difference, and I also need to take the blank end_dates into account.
I have a data frame which looks like:
start_date    end_date
01-01-2020    05-01-2020
30-10-2021    NaT
15-08-2019    NaT
29-06-2020    15-07-2020
The code I wrote for retrieving the holidays is the following:
df = read_excel(r'dates.xlsx')
df.head()
us_holidays = holidays.UnitesStates()
The following code works around the null values and excludes the weekends:
def business_days(start, end):
    mask = pd.notnull(start) & pd.notnull(end)
    start = start.values.astype('datetime64[D]')[mask]
    end = end.values.astype('datetime64[D]')[mask]
    holi = us_holidays.values.astype('datetime64[D]')[mask]
    result = np.empty(len(mask), dtype=float)
    result[mask] = np.busday_count(start, end, holidays=holi)
    result[~mask] = np.nan
    return result

df['count'] = business_days(df['start_date'], df['end_date'])
The error I get is:
AttributeError: 'builtin_function_or_method' object has no attribute 'astype'
How can I fix this error?
Any help will be greatly appreciated, thanks.
I'm not familiar with the holidays package, but holidays.UnitesStates() seems to return an object and not the needed list of dates. However, you can create a list of holiday dates for a certain range of years.
I'm not sure why you get "NaT"; usually you get NaNs. But you can handle both.
One way to do it:
import holidays
import pandas as pd
import numpy as np
import datetime
# Create dummy DataFrame:
df = pd.DataFrame(columns=['start_date', 'end_date'])
df['start_date'] = np.array(["2020-01-01", "2021-10-30", "2019-08-15", "2020-06-29"])
df['end_date'] = np.array(["2020-01-05", "NaT", "NaT", "2020-07-15"])

# Convert columns to datetime
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

# Convert datetime to date
df['start_date'] = df['start_date'].dt.date
df['end_date'] = df['end_date'].dt.date

# Not sure why you get NaT when you read the file with pandas, so replace it with today:
df = df.replace({'NaT': datetime.date.today()})
# In case you get a NaN:
df = df.fillna(datetime.date.today())

# Get first and last year
max_year = df.max().max().year
min_year = df.min().min().year

# holidays.US returns an object, so you have to create a list
us_holidays = list()
for date, name in sorted(holidays.US(years=list(range(min_year, max_year + 1))).items()):
    us_holidays.append(date)

start_dates = list(df['start_date'].values)
end_dates = list(df['end_date'].values)
df['count'] = np.busday_count(start_dates, end_dates, holidays=us_holidays)
How do I change a date format like 12-Mar-2022 to format='%d/%m/%Y' in Python?
So the problem is that I read data from a Google sheet where the data contains multiple date formats; some values are 12/03/2022 and some are 12-Mar-2022.
I tried the following and, of course, got an error because the format does not match 12-Mar-2022:
defectData_x['date'] = pd.to_datetime(defectData_x['date'], format='%d/%m/%Y')
Appreciate your help
defectData_x['date1'] = defectData_x['date'].dt.strftime('%d/%m/%Y')
Don't forget that date1's dtype is not datetime but object,
so it is better to use both the date column and the date1 column before producing the final result.
After the final result, you can drop the date column.
Here is my example:
import pandas as pd
df = pd.DataFrame(["12/03/2022", "12-Mar-2022"], columns=["date"])
df["date1"] = pd.to_datetime(df["date"])
df['date2'] = df['date1'].dt.strftime('%d/%m/%Y')
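One caveat, as an assumption on my part since only two sample values are shown: without an explicit format, pandas may read 12/03/2022 as December 3 rather than 12 March. If the slash-separated dates are day-first, passing dayfirst=True is a minimal way to steer the parser:

import pandas as pd

df = pd.DataFrame(["12/03/2022", "12-Mar-2022"], columns=["date"])
# dayfirst=True makes pandas read 12/03/2022 as 12 March 2022
# (on pandas 2.x you may also need format="mixed" for mixed-format columns)
df["date1"] = pd.to_datetime(df["date"], dayfirst=True)
df["date2"] = df["date1"].dt.strftime('%d/%m/%Y')
print(df)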
I have a csv file with a long timestamp column (years):
1990-05-12 14:01
.
.
1999-01-10 10:00
where the time is in hh:mm format. I'm trying to extract each day's worth of data into a new csv file. Here's my code:
import datetime
import pandas as pd

df = pd.read_csv("/home/parallels/Desktop/ewh_log/hpwh_log.csv", parse_dates=True)

# change timestamp column format
def extract_months_data(df):
    df = pd.to_datetime(df['timestamp'])
    print(df)

def write_o_csv(df):
    print('writing ..')
    # todo

x1 = pd.to_datetime(df['timestamp'], format='%m-%d %H:%M').notnull().all()
if (x1) == True:
    extract_months_data(df)
else:
    x2 = pd.to_datetime(df['timestamp'])
    x2 = x1.dt.strftime('%m-%d %H:%M')
    write_to_csv(df)
The issue is that when I get to the following line
def extract_months_data(df):
    df = pd.to_datetime(df['timestamp'])
I get the following error:
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime
Is there an alternative solution to do this with pandas without ignoring the rest of the data? I saw posts that suggested using errors='coerce', but that replaces the rest of the data with NaT.
Thanks
UPDATE:
This post answers half of the question, which is how to filter hours (or minutes) out of a timestamp column. The second part would be how to extract a full day to another csv file. I'll post updates here once I get to a solution.
You are converting to datetime twice, which is not needed.
Something like this should work:
import pandas as pd
df = pd.read_csv('data.csv')
df['month_data'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M')
df['month_data'] = df['month_data'].dt.strftime('%m-%d %H:%M')
# If you don't want rows where month_data is NaN
df = df[df['month_data'].notna()]
print(df)
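For the second half of the question (writing each day's rows to its own csv file), one possible sketch is to group on the date part of the parsed column; the column name and output file names here are assumptions:

import pandas as pd

df = pd.read_csv('data.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M')

# write one csv per calendar day, named after the date
for day, day_df in df.groupby(df['timestamp'].dt.date):
    day_df.to_csv(f'{day}.csv', index=False)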
For a current project, I am planning to filter a JSON file by time ranges, running several loops, each time with a slightly shifted range. The code below, however, yields the error TypeError: Invalid comparison between dtype=datetime64[ns] and date for the line after_start_date = df["Date"] >= start_date.
I have already tried to modify the formatting of the dates both within the Python code as well as the corresponding JSON file. Is there any smart tweak to align the date types/formats?
The JSON file has the following format:
[
{"No":"121","Stock Symbol":"A","Date":"05/11/2017","Text Main":"Sample text"}
]
And the corresponding code looks like this:
import string
import json
import pandas as pd
import datetime
from dateutil.relativedelta import *
# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'])
# Create an empty dictionary
d = dict()
# Filtering by date
start_date = datetime.date.fromisoformat('2017-01-01')
end_date = datetime.date.fromisoformat('2017-01-31')
for i in df.iterrows():
    start_date += relativedelta(months=+3)
    end_date += relativedelta(months=+3)
    print(start_date)
    print(end_date)
    after_start_date = df["Date"] >= start_date
    before_end_date = df["Date"] <= end_date
    between_two_dates = after_start_date & before_end_date
    filtered_dates = df.loc[between_two_dates]
    print(filtered_dates)
You can use pd.to_datetime('2017-01-31') instead of datetime.date.fromisoformat('2017-01-31').
I hope this helps!
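The underlying issue is that df['Date'] holds datetime64 values while datetime.date.fromisoformat returns a plain date, and pandas refuses that comparison. A minimal, self-contained sketch of the filtering with pandas timestamps on both sides (keeping the question's three-month step, and using the sample JSON row as dummy data):

import pandas as pd
from dateutil.relativedelta import relativedelta

# dummy row shaped like the question's JSON
df = pd.DataFrame([{"No": "121", "Stock Symbol": "A", "Date": "05/11/2017", "Text Main": "Sample text"}])
df['Date'] = pd.to_datetime(df['Date'])

start_date = pd.to_datetime('2017-01-01')
end_date = pd.to_datetime('2017-01-31')

start_date += relativedelta(months=+3)
end_date += relativedelta(months=+3)

# both sides are pandas timestamps now, so the comparison is valid
mask = (df["Date"] >= start_date) & (df["Date"] <= end_date)
filtered_dates = df.loc[mask]
print(filtered_dates)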
My general advice is not to use the datetime module. Rather, use built-in pandas methods / classes like pd.to_datetime and pd.DateOffset.
You should also close the input file as soon as it is no longer needed, e.g.:
with open('Glassdoor_A.json', 'r') as file:
    data = json.load(file)
Other weird points in your code are that:
- you wrote a loop iterating over rows, for i in df.iterrows():, but never use i (the control variable of this loop),
- your loop works in a time-step (not "row by row") mode, so it should rather be something like while end_date <= last_end_date:,
- the difference between start_date and end_date is just 1 month (actually they are the dates of the start and end of some month), but in the loop you increase both dates by 3 months.
Below you have an example of code that looks for rows in consecutive months, up to some final date, and prints the rows from the current month, if any:
start_date = pd.to_datetime('2017-01-01')
end_date = pd.to_datetime('2017-03-31')
last_end_date = pd.to_datetime('2017-12-31')
mnthBeg = pd.offsets.MonthBegin(3)
mnthEnd = pd.offsets.MonthEnd(3)
while end_date <= last_end_date:
    filtered_rows = df[df.Date.between(start_date, end_date)]
    n = len(filtered_rows.index)
    print(f'Date range: {start_date.strftime("%Y-%m-%d")} - {end_date.strftime("%Y-%m-%d")}, {n} rows.')
    if n > 0:
        print(filtered_rows)
    start_date += mnthBeg
    end_date += mnthEnd
You can compare your dates using the following method:
from datetime import datetime
df_subset = df.loc[(df['Start_Date'] > datetime.strptime('2018-12-31', '%Y-%m-%d'))]
I have the following date: 2019-11-20, which corresponds to week 47 of the calendar year. This is also what my Excel document says. However, when I do it in Python I get week 46 instead. I will upload my code, but I do not see what's wrong with it. I tried splitting the column into date and time separately, but I still get the same problem. Very odd; I do not know what's wrong, and the local time on my laptop is fine. Thanks for your help in advance!
Here is my code:
import pandas as pd
from datetime import datetime
import numpy as np
import re
df = pd.read_csv (r'C:\Users\user\document.csv')
df['startedAt'].replace(regex=True,inplace=True,to_replace=r'\+01:00',value=r'')
df['startedAt'].replace(regex=True,inplace=True,to_replace=r'\+02:00',value=r'')
df['startedAt'] = df['startedAt'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S').strftime('%d-%m-%y %H:%M:%S'))
df['endedAt'].replace(regex=True,inplace=True,to_replace=r'\+01:00',value=r'')
df['endedAt'].replace(regex=True,inplace=True,to_replace=r'\+02:00',value=r'')
df['endedAt'] = pd.to_datetime(df['endedAt'], format='%Y-%m-%d')
df['startedAt'] = pd.to_datetime(df['startedAt'])
df['Date_started'] = df['startedAt'].dt.strftime('%d/%m/%Y')
df['Time_started'] = df['startedAt'].dt.strftime('%H:%M:%S')
df['Date_started'] = pd.to_datetime(df['Date_started'], errors='coerce')
df['week'] = df['Date_started'].dt.strftime('%U')
print(df)
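No answer is included here, but a likely cause (my assumption) is the week-numbering convention: strftime('%U') counts weeks starting on Sunday and puts the days before the year's first Sunday into week 0, which for 2019 runs one week behind the ISO week number that Excel reports. A small sketch contrasting the two:

import pandas as pd

df = pd.DataFrame({'Date_started': pd.to_datetime(['2019-11-20'])})

# %U: Sunday-based weeks, days before the first Sunday fall in week 0 -> '46'
print(df['Date_started'].dt.strftime('%U'))

# ISO week number, which matches Excel here -> 47
print(df['Date_started'].dt.isocalendar().week)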