Dropping unnecessary text from data using regex and applying it to an entire dataframe - python

I have a table with dates in multiple formats. It also contains some unwanted text, which I want to drop so that I can process the date strings.
Data:
sr.no. col_1 col_2
1 'xper may 2022 - nov 2022' 'derh 06/2022 - 07/2022 ubj'
2 'sp# 2021 - 2022' 'zpt May 2022 - December 2022'
Expected Output :
sr.no. col_1 col_2
1 'may 2022 - nov 2022' '06/2022 - 07/2022'
2 '2021 - 2022' 'May 2022 - December 2022'
import re
def keep_valid_characters(string):
    return re.sub(r'(?i)\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\b|[^a-z0-9/-]', '', string)
I am using the above pattern to drop the text, but I am stuck. Is there another approach?

In a complicated case like this, you can try splitting the pattern construction into multiple strings:
months = r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
pat = rf"(?i)((?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+)"
df[["col_1", "col_2"]] = df[["col_1", "col_2"]].transform(lambda x: x.str.extract(pat)[0])
print(df)
Prints:
sr.no. col_1 col_2
0 1 may 2022 - nov 2022 06/2022 - 07/2022
1 2 2021 - 2022 May 2022 - December 2022
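To check the pattern outside pandas, here is a minimal sketch that uses only the standard `re` module on the sample strings from the question. The helper name `extract_range` is hypothetical. Because of the optional `\s*`, a match can begin at the space in front of a purely numeric date (as in `derh 06/2022 - ...`), so the sketch strips the result.

```python
import re

# Same month alternation as in the answer, split over two lines for readability.
months = (
    r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
)
pat = re.compile(rf"(?i)((?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+)")

def extract_range(text):
    """Return the first 'date - date' range found in text, or None (hypothetical helper)."""
    m = pat.search(text)
    # strip() removes whitespace that the optional \s* may have captured
    return m.group(1).strip() if m else None

samples = [
    "xper may 2022 - nov 2022",
    "derh 06/2022 - 07/2022 ubj",
    "sp# 2021 - 2022",
    "zpt May 2022 - December 2022",
]
for s in samples:
    print(extract_range(s))
```

With pandas, adding `.str.strip()` after the extract step should have the same effect.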

Related

dropping unnecessary text from list of string

Existing Df :
Id dates
01 ['ATIVE 04/2018 to 03/2020',' XYZ mar 2020 – Jul 2022','June 2021 - 2023 XYZ']
Expected Df :
Id dates
01 ['04/2018 to 03/2020','mar 2020 – Jul 2022','June 2021 - 2023']
I am looking to clean the list under the dates column. I tried the function below, but it doesn't serve the purpose. Any leads?
import re
def clean_dates_list(dates_list):
    cleaned_dates_list = []
    for date_str in dates_list:
        cleaned_date_str = re.sub(r'[^A-Za-z\s\d]+', '', date_str)
        cleaned_dates_list.append(cleaned_date_str)
    return cleaned_dates_list
ls = ['ATIVE 04/2018 to 03/2020', ' XYZ mar 2020 – Jul 2022', 'June 2021 - 2023 XYZ']
ls_to_remove = ['ATIVE', 'XYZ']
for item in ls:
    ls_str = item.split()
    new_item = str()
    # keep every word that is not in the removal list
    for word in ls_str:
        if word in ls_to_remove:
            continue
        new_item += " " + word
    print(new_item)
I don't know your full list of words to remove, and hard-coding them is not good practice, but in your case it works.
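The loop above only prints the cleaned strings. A minimal sketch that returns a cleaned list instead could look like this; the function name `drop_words` is hypothetical, and it still assumes the unwanted words are known in advance.

```python
def drop_words(items, words_to_remove):
    """Return a copy of items with the unwanted whole words removed (hypothetical helper)."""
    unwanted = set(words_to_remove)
    cleaned = []
    for item in items:
        # split() also discards leading/trailing whitespace such as in ' XYZ mar 2020 ...'
        kept = [word for word in item.split() if word not in unwanted]
        cleaned.append(" ".join(kept))
    return cleaned

ls = ['ATIVE 04/2018 to 03/2020', ' XYZ mar 2020 – Jul 2022', 'June 2021 - 2023 XYZ']
result = drop_words(ls, ['ATIVE', 'XYZ'])
print(result)
```

For a dataframe column holding such lists, the function could be passed to `Series.apply`.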

Split columns by space or dash - python

I have a pandas df with mixed formatting for a specific column. It contains the qtr and year. I'm hoping to split this column into separate columns. But the formatting contains a space or a second dash between qtr and year.
I'm hoping to include a function that splits the column by a blank space or a second dash.
df = pd.DataFrame({
'Qtr' : ['APR-JUN 2019','JAN-MAR 2019','JAN-MAR 2015','JUL-SEP-2020','OCT-DEC 2014','JUL-SEP-2015'],
})
out:
Qtr
0 APR-JUN 2019 # blank
1 JAN-MAR 2019 # blank
2 JAN-MAR 2015 # blank
3 JUL-SEP-2020 # second dash
4 OCT-DEC 2014 # blank
5 JUL-SEP-2015 # second dash
Split by blank:
df[['Qtr', 'Year']] = df['Qtr'].str.split(' ', n=1, expand=True)
Split by second dash:
df[['Qtr', 'Year']] = df['Qtr'].str.split('-', n=1, expand=True)
intended output:
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
You can use a regular expression with the extract function of the string accessor.
df[['Qtr', 'Year']] = df['Qtr'].str.extract(r'(\w{3}-\w{3}).(\d{4})')
print(df)
Result
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
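In the pattern above, `.` accepts any single character between the quarter and the year. A slightly stricter sketch, shown with the standard `re` module, accepts only a space or a dash there. The helper name `split_qtr` is hypothetical.

```python
import re

# Quarter (three word chars, dash, three word chars), then a space or dash, then a 4-digit year.
qtr_year = re.compile(r"(\w{3}-\w{3})[ -](\d{4})")

def split_qtr(value):
    """Return (quarter, year), or (None, None) if value has a different shape."""
    m = qtr_year.fullmatch(value)
    return m.groups() if m else (None, None)

for v in ["APR-JUN 2019", "JUL-SEP-2020"]:
    print(split_qtr(v))
```

The same pattern string can be passed to `str.extract` in place of the one in the answer.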
You can split with a regex that uses a positive lookbehind and a non-capturing group (?:..), then filter out the empty values, and build a pandas Series from the values:
>>> (df.Qtr.str.split(r'\s|(.+(?<=-).+)(?:-)')
...     .apply(lambda x: [i for i in x if i])
...     .apply(lambda x: pd.Series(x, index=['Qtr', 'Year']))
... )
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
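A simpler split is possible if the year is always the last four digits: split on a space or dash only when it is directly followed by the year at the end of the string. This is a sketch with the standard `re` module, assuming exactly that format; the helper name `split_on_last_separator` is hypothetical.

```python
import re

# The lookahead (?=\d{4}$) keeps the year out of the separator match,
# so only the space or dash right before the trailing year is consumed.
separator = re.compile(r"[\s-](?=\d{4}$)")

def split_on_last_separator(value):
    return separator.split(value)

for v in ["APR-JUN 2019", "JUL-SEP-2020"]:
    print(split_on_last_separator(v))
```

Because nothing is captured, there are no empty or None entries to filter out afterwards.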
If, and only if, the data is in the posted format you could use list slicing.
import pandas as pd
df = pd.DataFrame(
{
"Qtr": [
"APR-JUN 2019",
"JAN-MAR 2019",
"JAN-MAR 2015",
"JUL-SEP-2020",
"OCT-DEC 2014",
"JUL-SEP-2015",
],
}
)
df[['Qtr', 'Year']] = [(x[:7], x[8:12]) for x in df['Qtr']]
print(df)
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015

Regex to extract month and year combinations from a date

I am using regex to extract the month and year of pairs of dates in text:
regex = (
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})?\s?\s?((to)|[\|\-\–\—])\s?\s?"
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})|(Present|Now|till\s?(now|date|today)?|current)))"
)
When I test the regex with some inputs that contain the day of the month in some and not in others:
lst = [
'July 2014 - 28th August 2014',
'Jan 2012 - 3rd sep 2014',
'Jan 2008 - May 2012',
'Jan 2008 and May 2012'
]
for i in lst:
    word = re.finditer(regex,i,re.IGNORECASE)
    for match in word:
        print(match.group())
I get the following output:
Jan 2008 - May 2012
but my expected output is:
July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
What do I need to change to make the regex match text with an optional day in the date? When a date string includes the day, it is always an ordinal number with a st, nd, rd or th suffix.
You cannot "skip" part of a string during a single match operation, so if you have 26th August, you can't match or capture just 26 August. In these cases, you either need to capture parts of the match and then concatenate them, or replace the parts you do not need as a post-processing step.
So, here, I'd use the post-process replace approach with
import re
day = r'(?:((?:0?[1-9]|[12]\d|3[01])(?:\s*(?:st|[rn]d|th))?)\s*)?'
month = r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
year = r'(\d{2}(?:\d{2})?)'
rx_valid = re.compile( fr'\b{day}{month}\s*{year}\s*[-—–]\s*{day}{month}\s*{year}(?!\d)', re.IGNORECASE )
rx_ordinal = re.compile( r'\s*\d+\s*(?:st|[rn]d|th)', re.IGNORECASE )
lst = [
'July 2014 - 28th August 2014',
'Jan 2012 - 3rd sep 2014',
'Jan 2008 - May 2012',
'Jan 2008 and May 2012'
]
for i in lst:
    word = rx_valid.finditer(i)
    for match in word:
        print(rx_ordinal.sub("", match.group()))
Output:
July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
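The other option mentioned in this answer, capturing the wanted parts and concatenating them, could look like the sketch below. It is not the answer's code: the names `side`, `rx` and `month_year_range` are hypothetical, the optional day is matched but not captured, and the separator in the output is normalised to " - ".

```python
import re

month = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
         r"Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
# Optional ordinal day such as "28th " or "3rd "; matched but not captured.
day = r"(?:\d{1,2}\s*(?:st|nd|rd|th)\s+)?"
# One side of the range: skip the day, capture "month year" (2- or 4-digit year).
side = rf"{day}({month}\s*\d{{2}}(?:\d{{2}})?)"
# Separator: hyphen, en dash (\u2013) or em dash (\u2014).
rx = re.compile(rf"\b{side}\s*[-\u2013\u2014]\s*{side}(?!\d)", re.IGNORECASE)

def month_year_range(text):
    """Return 'month year - month year' built from the two captures, or None."""
    m = rx.search(text)
    return f"{m.group(1)} - {m.group(2)}" if m else None

lst = [
    'July 2014 - 28th August 2014',
    'Jan 2012 - 3rd sep 2014',
    'Jan 2008 - May 2012',
    'Jan 2008 and May 2012',
]
for text in lst:
    print(month_year_range(text))
```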

How to add two array lists to a single pandas dataframe with two separate column names

Hello, I have a program that takes two arrays, "Year_Array" and "Month_Array", and generates the output according to conditions.
I want to add both arrays to a single dataframe with the column names year and month, so that in future I can combine that dataframe with other dataframes.
Below is the sample code:
Year_Array=[2010,2011,2012,2013,2014]
Month_Array=['Jan','Feb','Mar','April','May','June','July','Aug','Sep','Oct','Nov','Dec']
segment=[1, 1, 3, 5, 2, 1, 1, 1, 2, 1, 6, 1]
p=0
for p in range(0, len(Year_Array), 1):
    c=0
    for i in range(0, len(segment),1):
        h = segment[i]
        # repeat the current month h times
        for j in range(0,int(h) , 1):
            print((Year_Array[p]) ,(Month_Array[c]))
        c += 1
Based on segment, the code generates output like this:
2010 Jan
2010 Feb
2010 Mar
2010 Mar
2010 Mar
2010 April
2010 April
2010 April
2010 April
2010 April
2010 May
2010 May
2010 June
2010 July
2010 Aug
2010 Sep
2010 Sep
2010 Oct
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2010 Dec
2011 Jan
2011 Feb
2011 Mar
2011 Mar
2011 Mar
2011 April
......
......
2012 Jan
2012 Feb
2012 Mar
2012 Mar
2012 Mar
2012 April
......
......so on till 2014
I want to store all of this output in a single dataframe. For this I tried the following:
df=pd.DataFrame(Year_Array[p])
print(df)
df.columns = ['year']
print("df-",df)
df1=pd.DataFrame(Month_Array[c])
df1.columns = ['month']
print(df1)
If I write the following, it also prints only the array values, not the output:
df=pd.DataFrame(Year_Array)
print(df)
But this is not working. I want the same output as the printed loop, stored in a dataframe with the column names "year" and "month". Please tell me how to do it. Thanks.
You can create an array with the expected output and create a dataframe from it.
Edit: added column names to the dataframe.
import pandas as pd
Year_Array=[2010,2011,2012,2013,2014]
Month_Array=['Jan','Feb','Mar','April','May','June','July','Aug','Sep','Oct','Nov','Dec']
final_Array=[]
segment=[1, 1, 3, 5, 2, 1, 1, 1, 2, 1, 6, 1]
p=0
for p in range(0, len(Year_Array), 1):
    c=0
    for i in range(0, len(segment),1):
        h = segment[i]
        # print(h)
        # append one (year, month) row per repetition
        for j in range(0,int(h) , 1):
            final_Array.append(((Year_Array[p], Month_Array[c])))
        c += 1
data = pd.DataFrame(final_Array,columns=['year','month'])
data.head()
Output :
year month
0 2010 Jan
1 2010 Feb
2 2010 Mar
3 2010 Mar
4 2010 Mar
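The nested loops can also be written as a single comprehension. This is a sketch in plain Python that only builds the list of rows; the name `rows` is hypothetical, and it assumes `segment` holds one repeat count per month.

```python
Year_Array = [2010, 2011, 2012, 2013, 2014]
Month_Array = ['Jan', 'Feb', 'Mar', 'April', 'May', 'June',
               'July', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
segment = [1, 1, 3, 5, 2, 1, 1, 1, 2, 1, 6, 1]

# One (year, month) tuple per repetition: pair each month with its
# count from segment and repeat it that many times, for every year.
rows = [
    (year, month)
    for year in Year_Array
    for month, count in zip(Month_Array, segment)
    for _ in range(count)
]

print(len(rows))
print(rows[:5])
```

`rows` can then be passed to `pd.DataFrame(rows, columns=['year', 'month'])`, as in the answer above.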

How to save split data in pandas in reverse order?

You can use this to create the dataframe:
dataframe = pd.DataFrame({'release' : ['7 June 2013', '2012', '31 January 2013',
                                       'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save it into 3 columns named "day", "month" and "year", using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
The resulting dataframe is:
[image of the resulting dataframe not included]
As you can see, it works perfectly when it gets 3 strings, but whenever it gets fewer than 3 strings, it saves the data in the wrong place.
I have tried split and rsplit; both give the same result.
Is there any solution to get the data in the right place?
The last item is the year, and it is present in every case, so it should be saved first, then the month if it is present (otherwise nothing), and the day in the same way.
You could reverse each split list before turning it into a Series:
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
    ...:     lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
release year month day
0 7 June 2013 2013 June 7
1 2012 2012 NaN NaN
2 31 January 2013 2013 January 31
3 February 2008 2008 February NaN
4 17 June 2014 2014 June 17
5 2013 2013 NaN NaN
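The reverse-then-fill idea can be sketched in plain Python to see what happens to each row. The helper name `split_release` is hypothetical, and `None` stands in for the `NaN` that pandas fills in.

```python
def split_release(value):
    """Return (year, month, day); missing parts become None (hypothetical helper)."""
    # Reversing puts the year first, since it is always the last word.
    parts = value.split()[::-1]
    # Pad on the right so month and day are None when absent.
    parts += [None] * (3 - len(parts))
    return tuple(parts[:3])

releases = ['7 June 2013', '2012', '31 January 2013',
            'February 2008', '17 June 2014', '2013']
for r in releases:
    print(split_release(r))
```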
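Try reversing each split list before expanding it. Note that a DataFrame has no reverse() method, and reversing the already expanded columns would leave the year in the wrong column for the shorter rows.
dataframe[['year','month','day']] = dataframe['release'].str.split().str[::-1].apply(pd.Series)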
