I have a result file generated from software and it looks like this:
,0,1,2,3,4,5,6,7,8,9
0,Month,Decade,Stage,Kc,ETc,ETc,Eff,rain,Irr.,Req.
1,coeff,mm/day,mm/dec,mm/dec,mm/dec,,,,,
2,Sep,1,Init,0.50,1.85,18.5,21.8,0.0,,
3,Sep,2,Init,0.50,1.77,17.7,30.3,0.0,,
4,Sep,3,Init,0.50,1.72,17.2,37.1,0.0,,
5,Oct,1,Deve,0.61,2.05,20.5,49.5,0.0,,
6,Oct,2,Deve,0.82,2.66,26.6,59.3,0.0,,
7,Oct,3,Deve,1.03,3.24,35.6,43.0,0.0,,
8,Nov,1,Mid,1.20,3.63,36.3,20.9,15.4,,
9,Nov,2,Mid,1.21,3.53,35.3,6.0,29.2,,
10,Nov,3,Mid,1.21,3.70,37.0,4.0,33.0,,
11,Dec,1,Mid,1.21,3.87,38.7,0.1,38.6,,
12,Dec,2,Late,1.18,3.92,39.2,0.0,39.2,,
13,Dec,3,Late,1.00,3.58,39.4,0.0,39.4,,
14,Jan,1,Late,0.88,3.36,10.1,0.0,10.1,,
15,,,,,,,,,,
16,372.1,272.2,204.9,,,,,,,
As you can see, the months run from September to January, and each month is divided into three decades (10-day periods). To be exact, the data runs from September 2017 to the first decade of January 2018. I need to generate the starting date of each decade in this format: 01-Sep-2017. So I should end up with 01-Sep-2017, 11-Sep-2017, 21-Sep-2017, ..., 01-Jan-2018. How can I generate these dates? Here is the code I have written so far:
    years = [2017, 2018, 2019]
    Dates = []
    temp = pd.read_csv(folder_link)  # Reading the particular result file
    Month = temp['0'][2:]  # First column = Month (Jul, Aug, ..)
    Decade = temp['1'][2:]
    for year in years:
        for j in range(2, len(Decade)):  # First two lines are headers, so removed them
            if int(Decade[j]) == 1:  # First decade = 1-10 days of month
                Date = "1" + "-" + Month[j] + "-" + str(year)  # Writing the date as 1-Jan-2017
                Dates.append(Date)
            if int(Decade[j]) == 2:  # Second decade = 11-20 days of month
                Date = "11" + "-" + Month[j] + "-" + str(year)
                Dates.append(Date)
            if int(Decade[j]) == 3:  # Third decade = 21-28 or 21-30 or 21-31 days of month
                Date = "21" + "-" + Month[j] + "-" + str(year)
                Dates.append(Date)
The problem with this code is that I get 01-Sep-2017, 11-Sep-2017, 21-Sep-2017, ..., 01-Jan-2017 (instead of 2018). I need a general solution that works for any month, not just January, because some of my results range from Sep 2017 to Aug 2018. Any help?
First, you could set your columns and index correctly while reading the CSV file. Then you can use a formula to derive the day from the decade.
Increment the year only when switching from December to January (you can extend the condition if there are cases where January and/or December are missing).
The code becomes much easier to read and understand once you apply these changes:
    import pandas as pd

    temp = pd.read_csv(folder_link, header=1, index_col=0)
    Dates = []
    year = 2017
    # skip the units row at the top and the two summary rows at the bottom
    for index, row in temp.iloc[1:-2].iterrows():
        month = row["Month"]
        if month == "Jan" and temp.at[index-1, "Month"] == "Dec":
            year += 1  # incrementing year if row is january while preceding row is december
        day = (int(row["Decade"]) - 1) * 10 + 1
        Dates.append(f"{day}-{month}-{year}")
    print(Dates)
Output:
['1-Sep-2017', '11-Sep-2017', '21-Sep-2017', '1-Oct-2017', '11-Oct-2017', '21-Oct-2017', '1-Nov-2017', '11-Nov-2017', '21-Nov-2017', '1-Dec-2017', '11-Dec-2017', '21-Dec-2017', '1-Jan-2018']
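The same idea can be sketched with real date objects, which also gives the zero-padded day (01-Sep-2017) requested in the question. This is only a sketch under a few assumptions: the function name `decade_start_dates` and its input (a list of `(month, decade)` pairs in chronological order) are hypothetical, and `%b` assumes an English locale. The year is incremented whenever the month number goes down, so it is not tied to January specifically.

```python
from datetime import date, datetime


def decade_start_dates(rows, start_year):
    """Return the start date of each decade as 'DD-Mon-YYYY' strings.

    rows: iterable of (month_abbreviation, decade) pairs, ordered in time.
    start_year: year of the first row (an assumption supplied by the caller).
    """
    dates = []
    year = start_year
    previous_month = None
    for month_abbr, decade in rows:
        month = datetime.strptime(month_abbr, "%b").month
        # A smaller month number than the previous row means a new year began.
        if previous_month is not None and month < previous_month:
            year += 1
        previous_month = month
        day = (int(decade) - 1) * 10 + 1  # decade 1 -> 1, 2 -> 11, 3 -> 21
        dates.append(date(year, month, day).strftime("%d-%b-%Y"))
    return dates


example = decade_start_dates(
    [("Sep", 1), ("Sep", 2), ("Dec", 3), ("Jan", 1)], start_year=2017
)
```

With the dataframe from the answer above, the pairs could be taken from the "Month" and "Decade" columns of the data rows.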
If you want to stay with the iteration approach (there may be a more efficient one using pandas functions), here is a simple way to do it:
    import pandas as pd

    dates = []
    year = 2017
    month_list = ['Jan', 'Sep', 'Oct', 'Nov', 'Dec']
    previous_month = None
    temp = pd.read_csv("data.csv")  # Reading the particular result file
    for index, row in temp.iterrows():
        month, decade = row.iloc[1], row.iloc[2]
        # First two lines are headers, so skip them. Same for last two lines.
        if index > 1 and month in month_list:
            # Increment the year only on the first Jan row after another month
            if month == 'Jan' and previous_month not in (None, 'Jan'):
                year += 1
            previous_month = month
            if int(decade) == 1:  # First decade = 1-10 days of month
                date = "1" + "-" + month + "-" + str(year)  # Writing the date as 1-Jan-2017
                dates.append(date)
            elif int(decade) == 2:  # Second decade = 11-20 days of month
                date = "11" + "-" + month + "-" + str(year)
                dates.append(date)
            elif int(decade) == 3:  # Third decade = 21-28 or 21-30 or 21-31 days of month
                date = "21" + "-" + month + "-" + str(year)
                dates.append(date)
            else:
                print("Unrecognized value for decade {}".format(decade))
    print(dates)
Explanation:
Use iterrows to iterate over the rows of your dataframe.
Then skip the headers and check that you are parsing actual data by looking at the month value (using a predefined list).
Finally, increment the year when the month value becomes Jan.
Note: this solution assumes that your data is a time series with rows ordered in time.
P.S.: by convention, capitalized names are used for classes in Python, not for variables.
I have made every attempt that I know of to make this work, but at this point I think I am just running in circles.
I am taking user input and using that to query a database. The caveat is that there are dates within the database that need to have days added to them, and to make sure that the user is seeing all the UPDATED information between the dates they chose, I changed the user's start date so that it includes two months beforehand.
At this point, the information is passed into a dataframe where it is then filtered to contain only relevant information as well as adjusting the dates that need to be adjusted. After that, it's passed through a mask on the dataframe to make sure that the user is seeing the updated information only, and not dates that are outside of their chosen range that originally weren't.
At a few points during this process my code was running properly, but I kept realizing there were changes that needed to be made. As is to be expected, those changes caused my code to break, and I have not been able to figure out how to fix it.
One issue is that the SQL queries are not returning the correct information. It seems that the chosen start date will allow any entries past that date, but the chosen end date will only include database entries if the end date is very near to the highest date in the database. The problem with that is that the user may not always know what the highest value in the database is, therefore they need to be able to choose an arbitrary value to query by.
There is also an issue where the query seems to work only some of the time. On two separate occasions I ran the exact same queries, and they worked one time but not the other.
Here is my code:
    self.StartDate = (self.StartMonth.get() + " " + self.StartDay.get() + "," + " " + self.StartYear.get())
    self.StartDate = datetime.strptime(self.StartDate, '%b %d, %Y').date()
    self.StartDate = self.StartDate - timedelta(days = 60)
    self.StartDate = self.StartDate.strftime('%b %d, %Y')
    self.EndDate = (self.EndMonth.get() + " " + self.EndDay.get() + "," + " " + self.EndYear.get())
    self.EndDate = datetime.strptime(self.EndDate, '%b %d, %Y').date()
    self.EndDate = self.EndDate.strftime('%b %d, %Y')
    JobType = self.JobType.get()
    if JobType == 'All':
        self.cursor.execute('''
            SELECT
                *
            FROM
                MainTable
            WHERE
                ETADate >= ? and
                ETADate <= ?
            ''',
            (self.StartDate, self.EndDate,)
        )
        self.data = self.cursor.fetchall()
    else:
        self.cursor.execute('''
            SELECT
                *
            FROM
                MainTable
            WHERE
                ETADate BETWEEN
                ? AND ?
                AND EndUse = ?
            ''',
            (self.StartDate, self.EndDate, JobType,)
        )
        self.data = self.cursor.fetchall()
    self.Data_Cleanup()

    def Data_Cleanup(self):
        self.df = pd.DataFrame (
            self.data,
            columns = [
                'id',
                'JobNumber',
                'ETADate',
                'Balance',
                'EndUse',
                'PayType'
            ]
        )
        remove = ['id', 'JobNumber']
        self.df = self.df.drop(columns = remove)
        self.df['ETADate'] = pd.to_datetime(self.df['ETADate'])
        self.df.loc[self.df['PayType'] == '14 Days', 'ETADate'] = self.df['ETADate'] + timedelta(days = 14)
        self.df.loc[self.df['PayType'] == '30 Days', 'ETADate'] = self.df['ETADate'] + timedelta(days = 30)
        self.df['ETADate'] = self.df['ETADate'].astype('category')
        self.df['EndUse'] = self.df['EndUse'].astype('category')
        self.df['PayType'] = self.df['PayType'].astype('category')
        mask = (self.df['ETADate'] >= self.StartDate) & (self.df['ETADate'] <= self.EndDate)
        print(self.df.loc[mask])
Ideally, the data would be updated before it is added to the database, but unfortunately the source of this data isn't capable of updating it correctly.
I appreciate any help.
You are storing dates as a string, formatted like Jan 02, 2021. That means you'll compare the month first, alphabetically, then the day numerically, then the year. Or, to take a few random dates, the sort order looks like this:
Dec 23, 2021
Jan 01, 2021
Nov 07, 2026
Nov 16, 2025
If you want a query that makes sense, you'll either need quite a bit of SQL logic to parse these dates on the SQLite side, or preferably, just store the dates using a format that sorts correctly as a string. If you use .strftime("%Y-%m-%d") those same dates will sort in order:
2021-01-01
2021-12-23
2025-11-16
2026-11-07
This will require changing the format of the columns in your database, of course.
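To make the difference concrete, here is a minimal sketch using an in-memory SQLite database. The one-column table layout and the helper name `to_iso` are assumptions for illustration only; your real MainTable has more columns. Once the dates are stored as `%Y-%m-%d` strings, a plain `BETWEEN` on the text column selects the expected range.

```python
import sqlite3
from datetime import datetime


def to_iso(text):
    """Convert a date like 'Jan 02, 2021' to the sortable form '2021-01-02'."""
    return datetime.strptime(text, "%b %d, %Y").strftime("%Y-%m-%d")


raw_dates = ["Dec 23, 2021", "Jan 01, 2021", "Nov 07, 2026", "Nov 16, 2025"]

# Hypothetical, simplified table: only the ETADate column matters here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MainTable (id INTEGER PRIMARY KEY, ETADate TEXT)")
conn.executemany(
    "INSERT INTO MainTable (ETADate) VALUES (?)",
    [(to_iso(d),) for d in raw_dates],
)

# String comparison now agrees with chronological order.
rows = conn.execute(
    "SELECT ETADate FROM MainTable WHERE ETADate BETWEEN ? AND ? ORDER BY ETADate",
    (to_iso("Jan 01, 2021"), to_iso("Dec 31, 2025")),
).fetchall()
in_range = [row[0] for row in rows]
conn.close()
```

The user-facing format can stay as it is: convert with `to_iso` just before querying, and format the results back with `strftime('%b %d, %Y')` for display.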
I want to extract dates written either as day, month, year or as month, day, year.
For example: 14 January, 2005 or Feb 29 1982
The code I'm using:
date = re.findall(r'\d{1,3} Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December \d{1,3}[, ]\d{4}',line)
Python interprets this as a few digits followed by Jan, or any one of the other month names on its own. So it matches only "Feb" or "12 Jan", but not the rest of the date.
So how do I group only the months, so that the | applies to the month names but not to the rest of the expression?
Answering your question directly, you can make two regexps for your "Day Month Year" and "Month Day Year" formats, then check them separately.
    import datetime

    # Make months using list comp
    months_shrt = [datetime.date(1, m, 1).strftime('%b') for m in range(1, 13)]
    months_long = [datetime.date(1, m, 1).strftime('%B') for m in range(1, 13)]
    # Join together
    months = months_shrt + months_long
    months_or = f'({"|".join(months)})'
    # Raw strings keep \d from being treated as a string escape
    expr_dmy = r'\d{1,3},? ' + months_or + r',? \d{4}'
    expr_mdy = months_or + r',? \d{1,3},? \d{4}'
You can try both out and see which one matches. However, you'll still need to inspect it and convert it to your favourite flavour of date format.
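As a sketch of how the grouping could look with `re.findall`: the month alternatives go inside a non-capturing group `(?:...)`, so the `|` applies only to the month names. The group is non-capturing on purpose, because `findall` returns only the group contents when the pattern contains a capturing group. The names `MONTHS`, `DATE_RE` and `find_dates` are made up for this example, and the day is limited to one or two digits, which is an assumption.

```python
import re

# Long names come first so "January" is tried before "Jan".
MONTHS = (
    "January|February|March|April|May|June|July|August|September|October|"
    "November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
)

# Day-month-year, e.g. "14 January, 2005"
DMY = rf"\d{{1,2}},? (?:{MONTHS}),? \d{{4}}"
# Month-day-year, e.g. "Feb 29 1982"
MDY = rf"(?:{MONTHS}),? \d{{1,2}},? \d{{4}}"

DATE_RE = re.compile(rf"\b(?:{DMY}|{MDY})\b")


def find_dates(text):
    """Return every substring of text that looks like one of the two date forms."""
    return DATE_RE.findall(text)


found = find_dates("Admitted 14 January, 2005 and discharged Feb 29 1982.")
```

The pattern only checks the shape of the text; something like "Feb 31 1982" still matches, so the results need to be validated separately.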
Instead, I would advise not using a regexp at all, and simply trying different date formats.
    # Separator after the first and second field: index 0 is a space, index 1 is a comma
    str_a = ' ,'
    str_b = ' ,'
    base_fmts = [('%d', '%b', '%Y'),
                 ('%d', '%B', '%Y'),
                 ('%b', '%d', '%Y'),
                 ('%B', '%d', '%Y')]

    def my_formatter(s):
        for o in base_fmts:
            for i in range(2):
                for j in range(2):
                    # Concatenate
                    fmt = f'{o[0]}{str_a[i]} '
                    fmt += f'{o[1]}{str_b[j]} '
                    fmt += f'{o[2]}'
                    try:
                        d = datetime.datetime.strptime(s, fmt)
                    except ValueError:
                        continue
                    else:
                        return d
The function above will take a string and return a datetime.datetime object. You can use standard datetime.datetime methods to get your day, month and year back.
>>> d = my_formatter('Jan 15, 2009')
>>> (d.month, d.day, d.year)
(1, 15, 2009)
Problem:
I have a bunch of files that were downloaded from an org. Halfway through their data directory the org changed the naming convention (reasons unknown). I am looking to create a script that will take the files in a directory and rename the file the same way, but simply "go back one day".
Here is a sample of how one file is named: org2015365_res_version.asc
What I need is logic to only change the year day (2015365) in this case to 2015364. This logic needs to span a few years so 2015001 would be 2014365.
I'm not sure this is possible, since it's not working with the current date, so using a module like datetime does not seem applicable.
Partial logic I came up with. I know it is rudimentary at best, but wanted to take a stab at it.
    # open all files
    all_data = glob.glob('/somedir/org*.asc')
    # empty array to be appended to
    day = []
    year = []
    # loop through all files
    for f in all_data:
        # get first part of string, renders org2015365
        f_split = f.split('_')[0]
        # get only year day - renders 2015365
        year_day = f_split.replace(f_split[:3], '')
        # get only day - renders 365
        days = year_day.replace(year_day[0:4], '')
        day.append(days)
        # get only year - renders 2015
        years = year_day.replace(year_day[4:], '')
        year.append(years)
    # convert to int for easier processing
    day = [int(i) for i in day]
    year = [int(i) for i in year]
    if day == 001 & year == 2016:
        day = 365
        year = 2015
    elif day == 001 & year == 2015:
        day = 365
        year = 2014
    else:
        day = day - 1
Apart from the logic above, I also came across the function below from this post. I am not sure what the best way would be to combine it with the partial logic above. Thoughts?
    import glob
    import os

    def rename(dir, pattern, titlePattern):
        for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):
            title, ext = os.path.splitext(os.path.basename(pathAndFilename))
            os.rename(pathAndFilename,
                      os.path.join(dir, titlePattern % title + ext))

    rename(r'c:\temp\xx', r'*.doc', r'new(%s)')
Help me, stackoverflow. You're my only hope.
You can use the datetime module:

    import datetime

    # First argument - string like 2015365, second argument - format
    dt = datetime.datetime.strptime(year_day, '%Y%j')
    # Time shift
    dt = dt + datetime.timedelta(days=-1)
    # Year with shift
    nyear = dt.year
    # Day in year with shift
    nday = dt.timetuple().tm_yday
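The steps above can be wrapped into small helpers. This is a sketch, not a finished renaming script: the function names `previous_year_day` and `shifted_name` are invented here, and the file name is assumed to start with "org" followed by a seven-digit year-day and an underscore, as in the sample from the question. Because the arithmetic goes through datetime, year boundaries and leap years are handled without special cases.

```python
import re
from datetime import datetime, timedelta


def previous_year_day(year_day):
    """Shift a 'YYYYDDD' string back by one day, e.g. '2015001' -> '2014365'."""
    shifted = datetime.strptime(year_day, "%Y%j") - timedelta(days=1)
    return shifted.strftime("%Y%j")  # %j is zero-padded to three digits


def shifted_name(filename):
    """Return filename with its year-day part moved back one day.

    Assumes names shaped like 'org2015365_res_version.asc'; other names
    are returned unchanged.
    """
    return re.sub(
        r"^(org)(\d{7})(?=_)",
        lambda m: m.group(1) + previous_year_day(m.group(2)),
        filename,
        count=1,
    )


new_name = shifted_name("org2015365_res_version.asc")
```

These functions only compute the new name; the actual renaming would still go through `os.rename` on each path.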
Based on feedback from the community, I was able to get the logic needed to fix the files downloaded from the org! The logic was the biggest hurdle. It turns out that the datetime module can be used after all; I need to read up more on that.
I combined the logic with the batch renaming using the os module. I have put the code below to help future users who may have a similar question!
    import datetime
    import glob
    import os

    # create a threshold where version changes its naming convention
    # only rename files greater than threshold
    threshold = '2014336'
    th = datetime.datetime.strptime(threshold, '%Y%j')

    # open all files, in ascending order so that each target name
    # has normally been vacated by the previous rename
    all_data = sorted(glob.glob('/some_dir/org*.asc'))

    # loop through
    for f in all_data:
        # get first part of the file name, renders org2015365
        f_split = os.path.basename(f).split('_')[0]
        # get only year day - renders 2015365
        year_day = f_split[3:]
        # first argument - string 2015365, second argument - format of the string
        dt = datetime.datetime.strptime(year_day, '%Y%j')
        if dt > th:
            # Time shift - go back one day
            dt = dt + datetime.timedelta(days=-1)
            # Year with shift
            nyear = dt.year
            # Day in year with shift
            nday = dt.timetuple().tm_yday
            # rename files correctly, keeping them in the same directory
            f_output = 'org' + str(nyear) + str(nday).zfill(3) + '_res_version.asc'
            os.rename(f, os.path.join(os.path.dirname(f), f_output))