Slice does not work after using .split() function - python

I'm trying to create a function that takes the "day" part of a Y-M-D date string.
For example:
Input: ["2022 November 23,2023 April 9"]
Output: 23
I have tried to do this by using the .split() function to split the string up at the comma, then slicing the last 2 indexes out to get the day. However, while I can get the last term of the new split string easily, I cannot get the 2nd-to-last term.
Ex:
y_m_d="2022 November 20,2023 April 9"
split_ymd=y_m_d.split(",")
first_value=split_ymd[0]
print(split_ymd[-1]) #This prints "0"
However, adding the 2nd argument to the slice command breaks it
y_m_d="2022 November 20,2023 April 9"
split_ymd=y_m_d.split(",")
first_value=split_ymd[0]
print(split_ymd[-1:-2]) #This prints "[]"
I understand that some of the terminology above might not be correct, as I am new to Python and to programming in general, and that the code above is very messy, but I just need help understanding why the slice above does not work. I am open to suggestions on improving the code itself, but I really just want to know why the slice does not work in this situation.
Thanks!

The slice doesn't work in this situation because the slice doesn't work in any situation. It's unrelated to .split() or anything else you're doing.
Consider this simpler test case:
>>> [1,2,3,4][-1]
4
>>> [1,2,3,4][-1:-2]
[]
This happens because -1 refers to index 3 and -2 refers to index 2, and the span [3,2) is backwards so it's treated as empty.
You can swap them if you actually want a range:
>>> [1,2,3,4][-2:-1]
[3]
Or you can just use -2 if you want the second-from-last element:
>>> [1,2,3,4][-2]
3
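To see how Python resolves those bounds, here is a small sketch using the built-in slice.indices method. The sample string is the one from the question; the variable names are only illustrative.

```python
values = [1, 2, 3, 4]

# slice.indices() resolves negative bounds against the sequence length.
start, stop, step = slice(-1, -2).indices(len(values))
print(start, stop, step)   # 3 2 1 -> the span [3, 2) is empty

# A negative step walks backwards, so the same bounds select the last item.
last_item_reversed = values[-1:-2:-1]
print(last_item_reversed)  # [4]

# Applied to the question: the last two characters of the first date.
first_date = "2022 November 20,2023 April 9".split(",")[0]
day = first_date[-2:]
print(day)                 # 20
```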

You can split your initial string on the comma to create a list containing multiple strings, each representing a date.
Then iterate over those dates, splitting each one on spaces. The last value in the resulting list is the day value you are looking for.
Let me know if this isn't clear.
It looks something like this:
list_of_dates = ["2022 November 23, 2023 April 9"]
# This separates all dates by splitting on the comma
dates = "".join(list_of_dates).split(",")
days = []
for d in dates:
    # This splits each date on the space
    temp = d.split(" ")
    days.append(temp[-1])
print(days)
# Output: ['23', '9']
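The same idea can be condensed into a single function. This is only a sketch: the name extract_days is made up for illustration, and it assumes every date ends with its day number.

```python
def extract_days(text):
    """Return the day of each comma-separated 'YYYY Month D' date as an int."""
    # str.split() with no argument splits on any run of whitespace,
    # so a space after the comma does not produce an empty first field.
    return [int(part.split()[-1]) for part in text.split(",")]

print(extract_days("2022 November 23, 2023 April 9"))  # [23, 9]
```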

So, I have two recommendations here:
Learn how Python slicing works (with both negative and positive indices).
For your actual solution, I see a list of dates. Comma separating it and then parsing the date into a datetime object might make things easier here.
# For example,
import datetime

date_str_list = "2022 November 20,2023 April 9"
for date_str in date_str_list.split(","):
    date = datetime.datetime.strptime(date_str, "%Y %B %d")
    day = date.day
See https://docs.python.org/3/library/datetime.html for more details on how you can control the string format of a datetime object.
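As a rough illustration of controlling the string format, the sketch below parses one date and formats it back in a couple of ways. It assumes English month names (the %B directive follows the current locale); the variable names are illustrative.

```python
from datetime import datetime

parsed = datetime.strptime("2022 November 20", "%Y %B %d")

day_number = parsed.day                  # integer day of the month
iso_text = parsed.strftime("%Y-%m-%d")   # ISO-style text
day_text = parsed.strftime("%d")         # zero-padded day as text

print(day_number, iso_text, day_text)    # 20 2022-11-20 20
```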

The Amazing datetime Module!
First lesson: Let the built-ins do the hard work for you!
Here is an example of your function which parses a date string and returns the day. Additionally, here is an example implementation of how you can use the function.
Hope this helps!
The Function:
from datetime import datetime as dt

def extract_day(date_string, mask='%Y %B %d'):
    """Extract and return the day.

    Args:
        date_string (str): Date text as a string.
        mask (str): Format of the provided date string,
            used for parsing.

    """
    day = dt.strptime(date_string, mask).day
    return day
Implementation:
dates1 = "2022 November 20,2023 April 9"
dates2 = "20 Nov 2022,9 Apr 2023"
date_strings1 = dates1.split(',')
date_strings2 = dates2.split(',')
for date in date_strings1:
    day = extract_day(date)
    print('\nOriginal string: {}'.format(date))
    print('Extracted day: {}'.format(day))
for date in date_strings2:
    day = extract_day(date, mask='%d %b %Y')
    print('\nOriginal string: {}'.format(date))
    print('Extracted day: {}'.format(day))
Output:
Original string: 2022 November 20
Extracted day: 20
Original string: 2023 April 9
Extracted day: 9
Original string: 20 Nov 2022
Extracted day: 20
Original string: 9 Apr 2023
Extracted day: 9

How about:
ymd = "2022 November 20,2023 April 9"
lchar = ymd.find(',')
fchar = lchar - 2
d_int = int(ymd[fchar:lchar])
print(d_int)
If there is only one comma after the date this should give you what you want.

You can tackle this in a few different ways:
Using the len function:
date = "2013 November 20,2023 April 10"
splitted = date.split(',')
splitted[0][len(splitted[0])-2:]
Use negative indexing:
date = "2013 November 20,2023 April 10"
splitted = date.split(',')
splitted[0][-2:]

Related

How to extract non-standard dates from text in Python?

I have a dataframe similar to the following one:
df = pd.DataFrame({'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']})
I would like to extract only dates from the text. The problem is that it is hard to find patterns. The only rule I can find there is: keep 2/3 objects before a four-digit number (i.e. the year).
I tried many convoluted solutions but I am not able to get what I need.
The result should look like this:
["12-13 December 2018"
"11-14 October 2019"
"10 January 2011"]
Can anyone help me?
Thanks!
If "keep 2/3 object before a four-digit number (i.e. the year)" is a reliable rule then you could use the following:
import re
data = {'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']}
date_strings = []
for string in data['Text']:  # loop through each string
    words = string.split()  # split string on whitespace
    for index in range(len(words)):
        if re.search(r'(\d){4}', words[index]):  # if the 'word' contains 4 consecutive digits
            date_strings.append(' '.join(words[index-2:index+1]))  # extract that word & the preceding 2
            break
print(date_strings)
To get:
['12-13 December 2018', '11-14 October 2019,', '10 January 2011.']
Some assumptions:
the dates are always 3 'words' long
the years are always at the end of the dates
as pointed out in the comments, the only 4-digit number in the text is the year
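Under the same assumptions, the trailing "," and "." in the output above can be dropped by stripping punctuation from each word before testing it. A minimal sketch follows; the function name extract_date is hypothetical.

```python
import re
import string

def extract_date(text):
    """Return the two words before the first 4-digit word, plus that word."""
    words = text.split()
    for index, word in enumerate(words):
        # Drop surrounding punctuation such as a trailing "," or ".".
        cleaned = word.strip(string.punctuation)
        if index >= 2 and re.fullmatch(r"\d{4}", cleaned):
            return " ".join(words[index - 2:index] + [cleaned])
    return None  # no 4-digit word found

print(extract_date("keep dates, e.g. 11-14 October 2019, and remove the rest."))
# 11-14 October 2019
```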
Here is a potential solution using a regex:
from calendar import month_name
months = '|'.join(list(month_name)[1:])
df['Text'].str.extract(r'([0-9-]+ (?:%s) \d{4})' % months)[0]
alternative regex: r'((?:\d+-)?\d+ (?:%s) \d{4})' % months
output:
0 12-13 December 2018
1 11-14 October 2019
2 10 January 2011
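For readers without pandas at hand, the alternative regex can be tried with the standard library alone. This sketch assumes English month names (calendar.month_name follows the current locale) and uses shortened versions of the sample texts.

```python
import re
from calendar import month_name

# month_name[0] is an empty string, so skip it.
months = "|".join(list(month_name)[1:])
date_pattern = re.compile(r"(?:\d{1,2}-)?\d{1,2} (?:%s) \d{4}" % months)

texts = [
    "the date which is 12-13 December 2018 amid this text.",
    "keep dates, e.g. 11-14 October 2019, and remove all the rest.",
    "delete everything but 10 January 2011. I found it hard.",
]
found = [date_pattern.search(text).group(0) for text in texts]
print(found)
# ['12-13 December 2018', '11-14 October 2019', '10 January 2011']
```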

Processing data with incorrect dates like 30th of February

In trying to process a large number of bank account statements given in CSV format I realized that some of the dates are incorrect (30th of February, which is not possible).
So this snippet fails [1] telling me that some dates are incorrect:
df_from_csv = pd.read_csv( csv_filename
, encoding='cp1252'
, sep=";"
, thousands='.', decimal=","
, dayfirst=True
, parse_dates=['Buchungstag', 'Wertstellung']
)
I could of course pre-process those CSV files and replace the 30th of Feb with 28th of Feb (or whatever the Feb ended in that year).
But is there a way to do this in Pandas, while importing? Like "If this column fails, set it to X"?
Sample row
775945;28.02.2018;30.02.2018;;901;"Zinsen"
As you can see, the date 30.02.2018 is not correct, because there ain't no 30th of Feb. But this seems to be a known problem in Germany. See [2].
[1] Here's the error message:
ValueError: day is out of range for month
[2] https://de.wikipedia.org/wiki/30._Februar
Here is how I solved it:
I added a custom date-parser:
import calendar
import datetime
import glob

import pandas as pd

def mydateparser(dat_str):
    """Given a date like `30.02.2018`, create the correct date `28.02.2018`."""
    if dat_str.startswith("30.02"):
        (d, m, y) = [int(el) for el in dat_str.split(".")]
        # monthrange returns the weekday of the first day and the number of days in the month:
        (first, last) = calendar.monthrange(y, m)
        # Use the correct last day (`last`) in creating a new datestring:
        dat_str = f"{last:02d}.{m:02d}.{y}"
    return datetime.datetime.strptime(dat_str, "%d.%m.%Y")

# and used it in `read_csv`
for csv_filename in glob.glob(f"{path}/*.csv"):
    # read csv into DataFrame
    # (note: `date_parser` is deprecated as of pandas 2.0)
    df_from_csv = pd.read_csv(csv_filename,
                              encoding='cp1252',
                              sep=";",
                              thousands='.', decimal=",",
                              dayfirst=True,
                              parse_dates=['Buchungstag', 'Wertstellung'],
                              date_parser=mydateparser
                              )
This allows me to fix those incorrect "30.02.XX" dates and allow pandas to convert those two date columns (['Buchungstag', 'Wertstellung']) into dates, instead of objects.
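The same calendar.monthrange idea can be generalised to any day that overshoots its month, not only "30.02". This is a sketch rather than part of the original answer: clamp_date is a hypothetical helper, and it assumes the DD.MM.YYYY layout of the sample row.

```python
import calendar
import datetime

def clamp_date(dat_str):
    """Parse 'DD.MM.YYYY', clamping the day to the last day of the month."""
    day, month, year = (int(part) for part in dat_str.split("."))
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, min(day, last_day))

print(clamp_date("30.02.2018"))  # 2018-02-28
print(clamp_date("30.02.2020"))  # 2020-02-29 (leap year)
print(clamp_date("28.02.2018"))  # 2018-02-28 (already valid)
```

One way to use such a helper would be to read the date columns as plain strings and map the function over them afterwards.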
You could load it all up as text, then run it through a regex to identify illegal dates, to which you could then apply some adjustment function.
A sample regex you might apply could be:
ok_date_pattern = re.compile(r"^(0[1-9]|[12][0-9]|3[01])[-](0[1-9]|1[012])[-](19|20|99)[0-9]{2}\b")
This finds dates in DD-MM-YYYY format where DD is constrained to 01 through 31 (i.e. a day of 42 would be considered illegal), MM is constrained to 01 through 12, and YYYY must start with 19, 20, or 99 (i.e. 1900 to 2099, plus 9900 to 9999). Note that it does not check the day against the month, so 30-02-2018 still passes.
There are other regexes that go into more depth - such as some of the inventive answers found here
What you then need is a working adjustment function - perhaps that parses the date as best it can and returns a nearest legal date. I'm not aware of anything that does that out of the box, but a function could be written to deal with the most common edge cases I guess.
Then it'd be a case of tagging legal and illegal dates using an appropriate regex, and assigning some date-conversion function to deal with these two classes of dates appropriately.
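As one possible way to do the tagging step, the sketch below swaps the regex for datetime.strptime, which rejects impossible days with a ValueError. The function name is_legal_date and the dotted DD.MM.YYYY format are assumptions taken from the question's sample row, not from this answer.

```python
from datetime import datetime

def is_legal_date(text, fmt="%d.%m.%Y"):
    """Return True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True

samples = ["28.02.2018", "30.02.2018", "31.04.2019"]
tagged = {text: is_legal_date(text) for text in samples}
print(tagged)
# {'28.02.2018': True, '30.02.2018': False, '31.04.2019': False}
```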

How to extract minimum and maximum dates from string with regex in Python?

I am trying to extract minimum and maximum dates from a string column in pandas. I have two string formats to extract dates.
First one is:
date_from_string = 'My date format is 7-20 November 2019'
And the second one is:
date_from_string_v2 = 'My date format is 7 October and 7 November 2019'
I want to extract minimum and maximum dates separately. For example, for the first case:
minimum_date = 20191107
maximum_date = 20191120
or for the second type:
minimum_date = 20191007
maximum_date = 20191107
I have tried a date_converter function code here. I also tried dateutils and datefinder modules. But I could not solve this yet. I need some help for this issue.
Thanks.
Based on your comments, if a string includes just one case and just a single range of dates, a regex could possibly be better than date parser. Date parsers are usually geared at producing a single date, not a range (maybe one of the modules Arkistarvh mentioned does ranges, but I doubt it).
A regex targeted at the strings you supplied would be something like this:
import re
re_month = r'(?:January|February|March|April|May|June|July|August|September|October|November|December)'
re_ranges = r'(?P<range1s>\d{1,2})-(?P<range1e>\d{1,2} +' + re_month + r' +\d{4})|(?P<range2s>\d{1,2} +' + re_month + r') +and +(?P<range2e>\d{1,2} +' + re_month + r' +\d{4})'
# which gives:
>>> re.search(re_ranges, date_from_string).groups()
('7', '20 November 2019', None, None)
>>> re.search(re_ranges, date_from_string_v2).groups()
(None, None, '7 October', '7 November 2019')
which can then be parsed by normal date parsers.
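To get from those groups to the YYYYMMDD values the question asks for, something along these lines might work. It is a sketch only: date_range is a hypothetical helper, it assumes English month names, and it assumes the start of a range shares the year (and, for the hyphenated form, the month) of the end.

```python
import re
from datetime import datetime

MONTH = r"(?:January|February|March|April|May|June|July|August|September|October|November|December)"
RANGES = re.compile(
    r"(?P<range1s>\d{1,2})-(?P<range1e>\d{1,2} +" + MONTH + r" +\d{4})"
    + r"|(?P<range2s>\d{1,2} +" + MONTH + r") +and +(?P<range2e>\d{1,2} +" + MONTH + r" +\d{4})"
)

def date_range(text):
    """Return (minimum, maximum) as YYYYMMDD integers, or None if no range is found."""
    match = RANGES.search(text)
    if match is None:
        return None
    if match["range1s"] is not None:
        # "7-20 November 2019": both days share the month and year of the end date.
        end = datetime.strptime(match["range1e"], "%d %B %Y")
        start = end.replace(day=int(match["range1s"]))
    else:
        # "7 October and 7 November 2019": the start borrows the year of the end date.
        end = datetime.strptime(match["range2e"], "%d %B %Y")
        start = datetime.strptime(f"{match['range2s']} {end.year}", "%d %B %Y")
    return int(start.strftime("%Y%m%d")), int(end.strftime("%Y%m%d"))

print(date_range("My date format is 7-20 November 2019"))
# (20191107, 20191120)
print(date_range("My date format is 7 October and 7 November 2019"))
# (20191007, 20191107)
```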

How do I parse a date without zero padding, in the format (1 or 2-digit year)-(Month abbreviation)?

I need to parse a few dates that are roughly in the format (1 or 2-digit year)-(Month abbreviation), for example:
5-Jun (June 2005)
13-Jan (January 2013)
I tried using strptime with the format %b-%y but it did not consistently produce the desired date. Per the documentation, this is because some years in my dataset are not zero-padded.
Further, when I tested the dateutil module (please see below for my code) on the string "5-Jun", I got "2019-06-05" instead of the desired result (June 2005), even though I set yearfirst=True when calling parse.
from dateutil.parser import parse
parsed = parse("5-Jun",yearfirst=True)
print(parsed)
It will be easier if single-digit years are zero-padded, as the string can then be converted directly using a format string. A regular expression is used here to replace any standalone single digit with its zero-padded value. I've used regex from here.
Sample code:
import re
from datetime import datetime
match_condn = r'\b([0-9])\b'
replace_str = r'0\1'
datetime.strptime(re.sub(match_condn, replace_str, '15-Jun'), '%y-%b').strftime("%B %Y")
Output:
June 2015
One approach is to use str.zfill
Ex:
import datetime
d = ["5-Jun", "13-Jan"]
for date in d:
    date, month = date.split("-")
    date = date.zfill(2)
    print(datetime.datetime.strptime(date+"-"+month, "%y-%b").strftime("%B %Y"))
Output:
June 2005
January 2013
Ah. I see from @Rakesh's answer what your data is about. I thought you needed to parse the full name of the month. So you had your two terms %b and %y backwards, but then you had the problem with the single-digit years. I get it now. Here's a much simpler way to get what you want if you can assume your dates are always in one of the two formats you indicate:
import time
inp = "5-Jun"
t = time.strptime(("0" + inp)[-6:], "%y-%b")
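A quick check of the padding trick on both sample inputs is sketched below (English month abbreviations assumed; the results list is only there for illustration). Keep in mind that %y maps 00 to 68 onto 2000 to 2068 and 69 to 99 onto 1969 to 1999.

```python
import time

results = []
for inp in ("5-Jun", "13-Jan"):
    padded = ("0" + inp)[-6:]                # "05-Jun", "13-Jan"
    parsed = time.strptime(padded, "%y-%b")
    results.append((parsed.tm_year, parsed.tm_mon))

print(results)  # [(2005, 6), (2013, 1)]
```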

How to add variables together into a new variable where you control the separation

Let's say I've declared three variables that make up a date. How can I combine them into a new variable so that I can print them in the correct 1/2/03 format simply by printing the new variable?
month = 1
day = 2
year = 03
date = month, day, year <<<<< What would this code have to be?
print(date)
I know I could set the sep='/' argument in the print statement if I pass all three variables individually, but then I can't add additional text to the print statement without it also being separated by a /. Therefore I need a single variable I can print.
The .join() method does what you want (assuming the inputs are strings):
>>> '/'.join((month, day, year))
'1/2/03'
As do all of Python's formatting options, e.g.:
>>> '%s/%s/%s' % (month, day, year)
'1/2/03'
But date formatting (and working with dates in general) is tricky, and there are existing tools to do it "right", namely the datetime module, see date.strftime().
>>> import datetime
>>> date = datetime.date(2003, 1, 2)
>>> date.strftime('%m/%d/%y')
'01/02/03'
>>> date.strftime('%-m/%-d/%y')
'1/2/03'
Note the - before the m and the d to suppress leading zeros on the month and day. This flag is a platform extension (it works with glibc on Linux and on macOS, but not on Windows, where %#m and %#d play the same role).
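Since the - flag is not available on every platform, a portable way to get the unpadded form is to format the date's integer attributes directly. A minimal sketch, with illustrative variable names:

```python
import datetime

d = datetime.date(2003, 1, 2)
# d.month and d.day are plain ints, so they print without padding;
# the %y format spec inside the f-string yields the two-digit year.
formatted = f"{d.month}/{d.day}/{d:%y}"
print(formatted)  # 1/2/03
```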
You can use the join method. You can also use a generator expression to format the values so they are each 2 digits wide.
>>> '/'.join('%02d' % i for i in [month, day, year])
'01/02/03'
You want to read about the str.format() method:
https://docs.python.org/3/library/stdtypes.html#str.format
Or if you're using Python 2:
https://docs.python.org/2/library/stdtypes.html#str.format
The join() function will also work in this case, but learning about str.format() will be more useful to you in the long run.
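A short sketch of the str.format() approach, assuming the three values are integers and the year is stored as 3 (the literal 03 from the question is not valid in Python 3):

```python
month, day, year = 1, 2, 3

# {:02d} pads the year to two digits; month and day are left as they are.
date = "{}/{}/{:02d}".format(month, day, year)
print("Today is " + date)  # Today is 1/2/03

# The same formatting written as an f-string (Python 3.6+).
date_f = f"{month}/{day}/{year:02d}"
```

Because date is an ordinary string, it can be embedded in any other text without affecting the separators.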
The correct answer is: use the datetime module:
import datetime
month = 1
day = 2
year = 2003
date = datetime.date(year, month, day)
print(date)
print(date.strftime("%m/%d/%Y"))
# etc
Trying to handle dates as tuples is just a PITA, so don't waste your time.
