PySpark split date string - Python

I have a dataframe and want to split the start_date column (a string containing the month and year) and keep just the year in a new column (column 4, start_year):
ID start_date End_date start_year
|01874938| August 2013| December 2014| 2013|
|00798252| March 2009| May 2015| 2009|
|02202785| July 2, 2014|January 15, 2016| 2, |
|01646125| November 2012| November 2015| 2012|
As you can see, I can split the date and keep the year. However, for dates like the one in row 3, "July 2, 2014", the result is "2, " instead of "2014".
This is my code:
from pyspark.sql import functions as fn

split_col = fn.split(df7_ct_map['start_date'], ' ')
df = df7_ct_map.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('start_year', split_col.getItem(1))

You could use a regular expression instead of splitting on spaces.
df.withColumn('start_year', regexp_extract(df['start_date'], '\\d{4}', 0))
This will match 4 consecutive digits, i.e. the year.
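A minimal, self-contained sketch of this approach (the SparkSession and sample rows below are assumptions for illustration, not the asker's actual setup):
from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()

# Sample rows modelled on the question's data
df7_ct_map = spark.createDataFrame(
    [("01874938", "August 2013"), ("02202785", "July 2, 2014")],
    ["ID", "start_date"],
)

# regexp_extract pulls the first run of exactly four digits, i.e. the year
df = df7_ct_map.withColumn("start_year", fn.regexp_extract("start_date", r"\d{4}", 0))
df.show()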

You could also extract the last 4 characters of your column start_date.
from pyspark.sql import functions as F
df.withColumn(
    'start_year',
    F.expr('substring(rtrim(start_date), length(rtrim(start_date)) - 3, 4)')
).show()
+-------------+----------+
| start_date|start_year|
+-------------+----------+
| August 2013| 2013|
| March 2009| 2009|
| July 2, 2014| 2014|
|November 2014| 2014|
+-------------+----------+
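If your Spark version supports negative start positions in substring (Spark SQL does), the same idea can be written more compactly by counting from the end of the string. A small sketch, assuming start_date has no trailing whitespace (apply rtrim first if it might):
from pyspark.sql import functions as F

# A negative start position counts from the end, so this keeps the last
# four characters of start_date, i.e. the year
df = df.withColumn("start_year", F.substring(F.col("start_date"), -4, 4))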

Related

Dropping unnecessary text from data using regex and applying it to the entire dataframe

I have a table with dates in multiple formats. Along with the dates there is some unwanted text which I want to drop so that I can process these date strings.
Data :
sr.no. col_1 col_2
1 'xper may 2022 - nov 2022' 'derh 06/2022 - 07/2022 ubj'
2 'sp# 2021 - 2022' 'zpt May 2022 - December 2022'
Expected Output :
sr.no. col_1 col_2
1 'may 2022 - nov 2022' '06/2022 - 07/2022'
2 '2021 - 2022' 'May 2022 - December 2022'
import re

def keep_valid_characters(string):
    return re.sub(r'(?i)\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\b|[^a-z0-9/-]', '', string)
I am using the above pattern to drop the unwanted text, but I'm stuck. Any other approach?
In complicated cases like this, you can split the pattern construction across multiple strings:
months = r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
pat = rf"(?i)((?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+)"
df[["col_1", "col_2"]] = df[["col_1", "col_2"]].transform(lambda x: x.str.extract(pat)[0])
print(df)
Prints:
sr.no. col_1 col_2
0 1 may 2022 - nov 2022 06/2022 - 07/2022
1 2 2021 - 2022 May 2022 - December 2022
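For reference, a self-contained sketch of the above, with the frame rebuilt from the sample data in the question:
import pandas as pd

df = pd.DataFrame({
    "sr.no.": [1, 2],
    "col_1": ["xper may 2022 - nov 2022", "sp# 2021 - 2022"],
    "col_2": ["derh 06/2022 - 07/2022 ubj", "zpt May 2022 - December 2022"],
})

months = r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
pat = rf"(?i)((?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+)"

# str.extract keeps only the first span that matches the date-range pattern
df[["col_1", "col_2"]] = df[["col_1", "col_2"]].transform(lambda x: x.str.extract(pat)[0])
print(df)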

How to add a new column based on different conditions on other columns in pandas

This is my dataframe:
Date Month
04/21/2019 April
07/03/2019 July
01/05/2018 January
09/23/2019 September
I want to add a column called fiscal year. A new fiscal year starts on 1st of July every year and ends on the last day of June. So for example if the year is 2019 and month is April, it is still fiscal year 2019. However, if the year is 2019 but month is anything after June, it will be fiscal year 2020. The resulting data frame should look like this:
Date Month FY
04/21/2019 April FY19
07/03/2019 July FY20
01/05/2019 January FY19
09/23/2019 September FY20
How do I achieve this?
One way, using pandas.DateOffset:
df["FY"] = (pd.to_datetime(df["Date"])
+ pd.DateOffset(months=6)).dt.strftime("FY%Y")
print(df)
Output:
Date Month FY
0 04/21/2019 April FY2019
1 07/03/2019 July FY2020
2 01/05/2019 January FY2019
3 09/23/2019 September FY2020
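The question's expected output uses two-digit years (FY19, FY20); swapping %Y for %y in the format string gives that:
df["FY"] = (pd.to_datetime(df["Date"]) + pd.DateOffset(months=6)).dt.strftime("FY%y")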
Try via pd.PeriodIndex() + pd.to_datetime():
df['Date']=pd.to_datetime(df['Date'])
df['FY']=pd.PeriodIndex(df['Date'],freq='A-JUN').strftime("FY%y")
output:
Date Month FY
0 2019-04-21 April FY19
1 2019-07-03 July FY20
2 2019-01-05 January FY19
3 2019-09-23 September FY20
Note: I suggest you convert your 'Date' column to datetime first and then do any operation on it. If you don't want to convert the 'Date' column, use the above code in a single step:
df['FY']=pd.PeriodIndex(pd.to_datetime(df['Date']),freq='A-JUN').strftime("FY%y")

Create a date from Year and Month in a SELECT query

I'm working on Vertica.
I have a problem that looks really easy, but I can't find a way to figure it out.
From a query, I can get 2 fields, "Month" and "Year". What I want is to automatically select another field, Date, that I'd build as '01/Month/Year' (in the SQL Date format). The goal is:
What I have
SELECT MONTH, YEAR FROM MyTable
Output :
01 2020
11 2019
09 2021
What I want
SELECT MONTH, YEAR, *answer* FROM MyTable
Output :
01 2020 01-01-2020
11 2019 01-11-2019
09 2021 01-09-2021
Sorry, it looks really dumb and easy, but I didn't find a good way to do it. Thanks in advance.
Don't use string operations to build dates; you can mess things up considerably:
Today could be 16.07.2021 or 07/16/2021, or also 2021-07-16, and, in France, for example, 16/07/2021. Then, you could also left-trim the zeroes, or have 'July' instead of 07....
Try:
WITH
my_table (mth,yr) AS (
SELECT 1, 2020
UNION ALL SELECT 11, 2019
UNION ALL SELECT 9, 2021
)
SELECT
yr
, mth
, ADD_MONTHS(DATE '0001-01-01',(yr-1)*12+(mth-1)) AS firstofmonth
FROM my_table
ORDER BY 1,2;
yr | mth | firstofmonth
------+-----+--------------
2019 | 11 | 2019-11-01
2020 | 1 | 2020-01-01
2021 | 9 | 2021-09-01
I finally found a way to do it:
SELECT MONTH, YEAR, CONCAT(CONCAT(YEAR, '-'), CONCAT(MONTH, '-01')) FROM MyTable
Try this:
SELECT [MONTH], [YEAR], CONCAT(CONCAT(CONCAT('01-',[MONTH]),'-'),[YEAR]) AS [DATE]
FROM MyTable
Output will be:
| MONTH | YEAR | DATE |
|-------|------|------------|
| 01 | 2020 | 01-01-2020 |
| 11 | 2019 | 01-11-2019 |
| 09 | 2021 | 01-09-2021 |

How do I dynamically capture two dates from one line of text with regex?

I have a text that will change weekly:
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
I'm looking for regex patterns for year 1, and year 2.
(Both will change weekly so I need the formula to capture all months, days, years)
My output should be the following:
2015 = November 5, 2015
2016 = November 3, 2016
The framework I'm using does not allow for regex capture groups or splits, so I need the formula to be specialized for this type of string.
Thanks!
Code
As per my original comments
See regex in use here
(\w+\s+\d+,\s*(\d+))
Note: The above regex and the regex on regex101 do not match. This is done purposely. Regex101 can only demonstrate the output of substitutions, thus I've prepended .*? to the regex in order to properly display the expected output.
Results
Input
Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015
Output
2016 = November 3, 2016
2015 = November 5, 2015
Usage
import re

regex = r"(\w+\s+\d+,\s*(\d+))"
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"

for (date, year) in re.findall(regex, text):
    print(year + ' = ' + date)
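If you need the results keyed by year, as in the expected output, a dict comprehension over the same findall also works; a small sketch:
by_year = {year: date for date, year in re.findall(regex, text)}
print(by_year)  # {'2016': 'November 3, 2016', '2015': 'November 5, 2015'}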
You can try this:
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
import re
final_data = sorted(
    ["{} = {}".format(re.findall(r"\d+$", i)[0], i) for i in re.findall(r"[a-zA-Z]+\s\d+,\s\d+", text)],
    key=lambda x: int(re.findall(r"^\d+", x)[0]))
Output:
['2015 = November 5, 2015', '2016 = November 3, 2016']
Using #ctwheels regex:
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
import re
result = [(date.split(",")[1].strip(), date) for date in re.findall(r'\w+\s+\d+,\s*\d+', text)]
print(result)
# [('2016', 'November 3, 2016'), ('2015', 'November 5, 2015')]

How to save split data in pandas in reverse order?

You can use this to create the dataframe:
import pandas as pd

dataframe = pd.DataFrame({'release': ['7 June 2013', '2012', '31 January 2013',
                                      'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save it into 3 columns named "day", "month" and "year", using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
The resulting dataframe shows that it works perfectly when a row has 3 strings, but whenever a row has fewer than 3 strings, the data ends up in the wrong place.
I have tried split and rsplit; both give the same result.
Any solution to get the data in the right place?
The last token is the year and it is present in every row; it should be the first one saved, then the month if it is present (otherwise nothing), and the day should be stored the same way.
You could do:
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
             lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
release year month day
0 7 June 2013 2013 June 7
1 2012 2012 NaN NaN
2 31 January 2013 2013 January 31
3 February 2008 2008 February NaN
4 17 June 2014 2014 June 17
5 2013 2013 NaN NaN
Try reversing each split result before assigning. A plain .reverse() on the split frame won't work (DataFrames have no such method); the reversal has to happen per row, for example:
dataframe[['year', 'month', 'day']] = pd.DataFrame([s.split()[::-1] for s in dataframe['release']])
