Selecting multiple files according to timestamp with glob - python

I have a directory with 7000 daily .tif files, spanning from 2002 to 2022. Each file name encodes its timestamp in yyyymmdd format, as: My_file_20020513.tif
I'm trying to select the file paths according to their respective month, i.e. all files from January, all files from February, and so on.
I thought using a simple wildcard showing my month of interest would solve the problem, such as:
january_files = glob.glob('My_folder/My_file_*01*.tif')
But this ended up selecting every file with 01 anywhere in its name, such as the first day of each month, and even October days whose day-of-month starts with a 1.
Is there a way I can use glob for selecting the file paths for a specific month only?

Use ? to match a single character. Then you can use four ?s to match any four-digit year.
january_files = glob.glob('My_folder/My_file_????01*.tif')
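For completeness, a sketch of collecting every month's files this way (the folder and filename pattern are taken from the question):
import glob

# ? matches exactly one character, so ???? matches the four-digit year
files_by_month = {
    month: sorted(glob.glob(f'My_folder/My_file_????{month:02d}*.tif'))
    for month in range(1, 13)
}
january_files = files_by_month[1]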

Related

Pandas df.loc with regex

I'm working with a data set consisting of several csv files of nearly the same form. Each csv describes a particular date, and labels the data by state/province. However, the format of one of the column headers in the data set was altered from Province/State to Province_State, so that all csv's created before a certain date use the first format and all csv's created after that date use the second format.
I'm trying to sum up all the entries corresponding to a particular state. At present, the code I'm working with is as follows:
daily_data.loc[daily_data[areaLabel] == location].sum()
where daily_data is the dataframe containing the csv data, location is the name of the state I'm looking for, and areaLabel is a variable storing either 'Province/State' or 'Province_State' depending on the result of a date check. I would like to eliminate the date check, e.g. by matching the column against a regular expression like Province(/|_)State, but I'm having a lot of trouble finding a way to index into a pandas dataframe by regular expression. Is this doable (and in a way that would make the code more elegant rather than less)? If so, I'd appreciate it if someone could point me in the right direction.
Use filter to get the columns that match your regex
>>> df.filter(regex="Province(/|_)State").columns[0]
'Province/State'
Then use this to select only rows that match your location:
df[df[df.filter(regex="Province(/|_)State").columns[0]]==location].sum()
This however assumes that there are no other columns that would match the regex.
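Put together with the question's daily_data and location, and caching the matched column name so the regex only runs once:
# resolve whichever header variant this csv uses, then sum the matching rows
area_label = daily_data.filter(regex="Province(/|_)State").columns[0]
daily_data[daily_data[area_label] == location].sum()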

python remove rows with the same keys and keep the row with the most recent date stamp

I have a SharePoint Excel sheet, with a fixed file name and format, that updates every day with the most recent information. The rows contain order numbers (used as keys for another dataframe), the ordered qty, and the received qty for the current day.
Rows will be added if more orders are placed today, and old orders will be deleted several days after they are fulfilled to keep the size of this report relatively small. It looks like this
What I want is a Python or Power BI program that generates another Excel file and refreshes it automatically. This generated file will keep all the distinct PO numbers (like a GROUP BY in SQL or a pivot in Excel) but only the record from the most recent day.
For example, if the files on 1/2/2021 and 1/3/2021 look like this:
Then the generated file on 1/3 will look like:
It keeps only one row per distinct PO, namely the one from the most recent day in the report.
In Python you can compare strings by their lexicographic order, so the expression
'A' < 'B'
evaluates to True.
You could therefore write a function that, among rows sharing the same key, keeps the one with the biggest date, using this property.
If you format your dates as "2020-02-14" (YYYY-MM-DD), string comparison also tells you which date is older or newer, in other words smaller or bigger.
For reading and writing you could use Python's csv library, since as I understand it you are working with .csv files. In my opinion the library isn't strictly necessary, as you can implement the same functionality in plain Python quite easily, but it comes down to what you prefer.
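A minimal sketch of that idea, assuming hypothetical column names "PO" and "Date", with dates already written as YYYY-MM-DD:
import csv

# keep, for each PO, the row with the lexicographically biggest (most recent) date
latest = {}
with open('report.csv', newline='') as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    for row in reader:
        # ISO dates compare correctly as plain strings: '2021-01-03' > '2021-01-02'
        if row['PO'] not in latest or row['Date'] > latest[row['PO']]['Date']:
            latest[row['PO']] = row

with open('latest_per_po.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(latest.values())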

Need to somehow change the formatting of a column of data from text to date

I've been handed a few sheets of data that I need to compare and contrast for a Uni project.
I have 2 sheets where the date/time format is 01/08/2020 00:01. Which is perfect for what I need to do. The third sheet has the date formatted (for some inexplicable reason) as 2020-08-01T00:01:00.0000000.
It's in normal text instead of date format, there are 1000s of rows, and they don't follow the same increments, so I can't just start at the top and drag down. Obviously, I can't do it manually. I'm wondering if there's some Python code I could use that recognises the year, month and day, reformats them accordingly, replaces the T with a space, and gets rid of the 7 zeros at the end.
You can use a formula to extract the dates you need. Assuming your strings start in cell A1, you can use this formula:
=DATE(LEFT(A1,4),MID(A1,6,2),MID(A1,9,2))+TIME(MID(A1,12,2),MID(A1,15,2),MID(A1,18,2))
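If you'd rather do it in Python, as the question suggests, a minimal sketch with pandas (the file and column names here are hypothetical assumptions):
import pandas as pd

df = pd.read_excel('third_sheet.xlsx')
# pandas parses ISO-8601 strings such as 2020-08-01T00:01:00.0000000 directly
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
# reformat to match the other sheets, e.g. 01/08/2020 00:01
df['Timestamp'] = df['Timestamp'].dt.strftime('%d/%m/%Y %H:%M')
df.to_excel('third_sheet_fixed.xlsx', index=False)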

How do I slice a range that occur every nth time?

I have a NetCDF (.nc) file in which the time variable is a bit weird. It is not a Gregorian calendar but a simple 365-day-a-year calendar (i.e. leap years are not included). This means the unit is also a bit off, but nothing too worrisome.
<xarray.DataArray 'days' (time: 6570)>
array([730817., 730818., 730819., ..., 737384., 737385., 737386.])
Dimensions without coordinates: time
Attributes:
units: days_since_Jan11900
long_name: calendar_days
730817 represents 01-01-2001 and 737386 represents 31-12-2018
I want to obtain a certain time period of the data set for multiple years, just as you can do with cdo -selmon, -selday etc. But of course, with no date, I cannot use those otherwise brilliant options. My idea was to slice out the time ranges I need with numpy-style indexing, but I do not know how and cannot seem to find adequate answers on SO.
In my specific case, I need to slice a range going from May 30th (150th day of the year) to Aug 18th (229th day of the year) every year. I know the first slice should be something like:
ds = ds.loc[dict(time = slice(149,229))]
But, that will only give me the range for 2001, and not the following years.
I cannot do it with cdo -ntime as it does not recognize the time unit.
How do I make sure that I get the range for the following 17 years too, thereby skipping the 285 days between the ranges I need?
I fixed it through Python. It can probably be done in a smarter way, but I manually picked the ranges I needed with help from #dl.meteo and using np.r_.
ds = ds.sel(time=np.r_[149:229,514:594,879:959,1244:1324,1609:1689,1974:2054,2339:2419,2704:2784,3069:3149,3434:3514,3799:3879,4164:4244,4529:4609,4894:4974,5259:5339,5624:5704,5989:6069,6354:6434])
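For what it's worth, the same index list can be built programmatically instead of typed out, since the 80-step window simply repeats every 365 steps; a sketch:
import numpy as np

# 18 years (2001-2018) of a 365-day calendar, window indices 149:229 each year
idx = np.concatenate([np.arange(149, 229) + 365 * year for year in range(18)])
ds = ds.sel(time=idx)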
From your answer it seems you know the timeslices, so you could also extract them with cdo using
cdo seltimestep,149/229 in.nc out.nc
and so on. But if you want to do it (semi)automatically with cdo, that should also be possible, as cdo supports a 365-day calendar. I think you need to set the calendar to this type and then probably reset the time units and the reftime. Without an example file I can't test this, but I think something like this could work:
step 1: set the calendar type to 365_day and then set the reference date to your first date:
cdo setcalendar,365_day infile.nc out1.nc
cdo setreftime,2000-01-01,00:00:00 out1.nc out2.nc
You then need to see what the first date is in the file; you can pipe it to less:
cdo showdate out2.nc | less
step 2: you can then shift the time axis to the correct date using cdo shifttime
e.g. if the showdate gives the first day as 2302-04-03, then you can simply do
cdo shifttime,-302years -shifttime,-3months -shifttime,-2days out2.nc out3.nc
to correct the dates...
Then you should be able to use all the cdo functionality on the file to do the manipulation as you wish.

Calculate Number of Last Week in Isocalendar Year

I am writing a script to grab files from a directory based on the week number in the filenames. I need to grab files with week N and week N+1 in the filenames.
I have the basics down, but am now struggling to figure out the rollover into the new year (the file format uses the isocalendar standard). ISO calendar years can have either 52 or 53 weeks, so I need a way to figure out whether the current year has 52 or 53 weeks, so that I can compare against the result of datetime.now().isocalendar()[1] and see if I need to set N+1 back to 1.
Is there a built in python function for this?
Why not just use (datetime.now() + timedelta(days=7)).isocalendar()[1] rather than calculating it from the current week number at all?
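A short self-contained sketch of that suggestion:
from datetime import datetime, timedelta

# adding 7 days and asking for the ISO week handles the 52- vs 53-week
# rollover automatically
this_week = datetime.now().isocalendar()[1]
next_week = (datetime.now() + timedelta(days=7)).isocalendar()[1]
print(this_week, next_week)  # at the end of a 52-week ISO year: 52 1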
