Dropping unnecessary text from a list of strings - Python

Existing Df :
Id dates
01 ['ATIVE 04/2018 to 03/2020',' XYZ mar 2020 – Jul 2022','June 2021 - 2023 XYZ']
Expected Df :
Id dates
01 ['04/2018 to 03/2020','mar 2020 – Jul 2022','June 2021 - 2023']
I am looking to clean the list in the dates column. I tried the function below, but it doesn't serve the purpose. Any leads?
import re

def clean_dates_list(dates_list):
    cleaned_dates_list = []
    for date_str in dates_list:
        # Strips every character that is not a letter, digit or whitespace
        cleaned_date_str = re.sub(r'[^A-Za-z\s\d]+', '', date_str)
        cleaned_dates_list.append(cleaned_date_str)
    return cleaned_dates_list

ls = ['ATIVE 04/2018 to 03/2020', ' XYZ mar 2020 – Jul 2022', 'June 2021 - 2023 XYZ']
ls_to_remove = ['ATIVE', 'XYZ']
for item in ls:
    # Split on whitespace and keep only the words that are not in the removal list
    kept = [word for word in item.split() if word not in ls_to_remove]
    # Joining with a single space avoids the leading space of the += approach
    new_item = " ".join(kept)
    print(new_item)
I don't know your full list of words to remove, and hard-coding them is not good practice, but it works in your case.
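If the list of unwanted words is not known in advance, one alternative is to extract the date range itself instead of removing the noise around it. The following is only a sketch under assumptions: the function name `extract_date_ranges` is hypothetical, and the pattern assumes each string holds one range whose parts look like `MM/YYYY`, `Month YYYY` or `YYYY`, separated by `to`, a hyphen or an en dash, as in the sample data.

```python
import re

# Optional month word (abbreviated or full), optional "MM/", then a 4-digit year
MONTH = r"\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"
DATE = rf"(?:{MONTH}\s+)?(?:\d{{1,2}}/)?\d{{4}}"
# Separators seen in the sample data: "to", "-" and the en dash
SEP = r"\s*(?:to|-|–)\s*"
RANGE = re.compile(rf"{DATE}{SEP}{DATE}", re.IGNORECASE)


def extract_date_ranges(dates_list):
    """Return the first date range found in each string (hypothetical helper)."""
    cleaned = []
    for text in dates_list:
        match = RANGE.search(text)
        # Fall back to the stripped original when no range is recognised
        cleaned.append(match.group(0) if match else text.strip())
    return cleaned


ls = ['ATIVE 04/2018 to 03/2020', ' XYZ mar 2020 – Jul 2022', 'June 2021 - 2023 XYZ']
print(extract_date_ranges(ls))
```

Strings in other layouts (for example with a day of the month) would need the `DATE` pattern extended.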

Related

Dropping unnecessary text from data using regex and applying it entire dataframe

I have a table with dates in multiple formats. It also contains some unwanted text that I want to drop so that I can process the date strings.
Data :
sr.no. col_1 col_2
1 'xper may 2022 - nov 2022' 'derh 06/2022 - 07/2022 ubj'
2 'sp# 2021 - 2022' 'zpt May 2022 - December 2022'
Expected Output :
sr.no. col_1 col_2
1 'may 2022 - nov 2022' '06/2022 - 07/2022'
2 '2021 - 2022' 'May 2022 - December 2022'
def keep_valid_characters(string):
    return re.sub(r'(?i)\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\b|[^a-z0-9/-]', '', string)
I am using the above pattern to drop the text, but I am stuck. Is there another approach?
In a complicated case like this, you can split the pattern construction into multiple strings:
months = r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
pat = rf"(?i)((?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+)"
df[["col_1", "col_2"]] = df[["col_1", "col_2"]].transform(lambda x: x.str.extract(pat)[0])
print(df)
Prints:
sr.no. col_1 col_2
0 1 may 2022 - nov 2022 06/2022 - 07/2022
1 2 2021 - 2022 May 2022 - December 2022
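To try the same pattern without a DataFrame, it can be applied to the sample strings with plain `re`. This is a sketch only: the helper name `extract_range` is hypothetical, and `re.IGNORECASE` stands in for the inline `(?i)` flag. Because the pattern's `\s*` can consume the space in front of the first number, the match is stripped before it is returned.

```python
import re

months = (
    r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
)
pat = re.compile(
    rf"((?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+)", re.IGNORECASE
)


def extract_range(text):
    """Return the date range found in text, or None (hypothetical helper)."""
    match = pat.search(text)
    # strip() drops a leading space that the optional \s* may have captured
    return match.group(1).strip() if match else None


samples = [
    'xper may 2022 - nov 2022',
    'derh 06/2022 - 07/2022 ubj',
    'sp# 2021 - 2022',
    'zpt May 2022 - December 2022',
]
for s in samples:
    print(extract_range(s))
```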

Regex for extract months and year combination in a date

I am using regex to extract the month and year of pairs of dates in text:
regex = (
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})?\s?\s?((to)|[\|\-\–\—])\s?\s?"
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})|(Present|Now|till\s?(now|date|today)?|current)))"
)
When I test the regex with some inputs that contain the day of the month in some and not in others:
lst = [
'July 2014 - 28th August 2014',
'Jan 2012 - 3rd sep 2014',
'Jan 2008 - May 2012',
'Jan 2008 and May 2012'
]
for i in lst:
    word = re.finditer(regex,i,re.IGNORECASE)
    for match in word:
        print(match.group())
I get the following output:
Jan 2008 - May 2012
but my expected output is:
July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
What do I need to change to make the regex match text with an optional day in the date? When a date string includes the day, it is always an ordinal number with a st, nd, rd or th suffix.
You cannot "skip" part of a string during a single match operation, so if you have 26th August, you can't match or capture just 26 August. In these cases, you either need to capture parts of the match and then concatenate them, or replace the parts you do not need as a post-processing step.
So here I'd use the post-processing replace approach:
import re
day = r'(?:((?:0?[1-9]|[12]\d|3[01])(?:\s*(?:st|[rn]d|th))?)\s*)?'
month = r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
year = r'(\d{2}(?:\d{2})?)'
rx_valid = re.compile( fr'\b{day}{month}\s*{year}\s*[-—–]\s*{day}{month}\s*{year}(?!\d)', re.IGNORECASE )
rx_ordinal = re.compile( r'\s*\d+\s*(?:st|[rn]d|th)', re.IGNORECASE )
lst = [
'July 2014 - 28th August 2014',
'Jan 2012 - 3rd sep 2014',
'Jan 2008 - May 2012',
'Jan 2008 and May 2012'
]
for i in lst:
    word = rx_valid.finditer(i)
    for match in word:
        print(rx_ordinal.sub("", match.group()))
Output:
July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
See the Python demo and the regex demo.
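The other option mentioned in this answer, capturing the parts you need and concatenating them, could look roughly like the sketch below. The group names `start` and `end` and the function name `month_year_ranges` are assumptions made for illustration; the day, month and year sub-patterns follow the ones used in the answer.

```python
import re

# Optional day of the month with an optional ordinal suffix; matched but not captured
day = r'(?:(?:0?[1-9]|[12]\d|3[01])(?:\s*(?:st|[rn]d|th))?\s*)?'
month = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|'
         r'Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')
year = r'\d{2}(?:\d{2})?'

rx_range = re.compile(
    rf'\b{day}(?P<start>{month}\s*{year})\s*[-—–]\s*{day}(?P<end>{month}\s*{year})(?!\d)',
    re.IGNORECASE,
)


def month_year_ranges(text):
    """Rebuild 'month year - month year' from the captured parts (hypothetical helper)."""
    return [f"{m.group('start')} - {m.group('end')}" for m in rx_range.finditer(text)]


lst = [
    'July 2014 - 28th August 2014',
    'Jan 2012 - 3rd sep 2014',
    'Jan 2008 - May 2012',
    'Jan 2008 and May 2012',
]
for i in lst:
    for rng in month_year_ranges(i):
        print(rng)
```

Note that rebuilding the string normalises the separator to " - ", whereas the replace approach keeps whatever dash the input used.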

How to get the values individually in a dictionary

I have this in my dictionary and need to get the values separately; for example, the date is Mar 5 06:49:10 2021 GMT. How do I print the values individually? Please help me with this.
dict = {'Ernst & Young Nederland LLP': ['Mar 5 06:49:10 2021 GMT', '2048']}
Iterate over the dict values and pick the first entry of each value:
d = {'Ernst & Young Nederland LLP': ['Mar 5 06:49:10 2021 GMT', '2048']}
for v in d.values(): print(v[0])
You can iterate your dictionary and pick only the first value:
d = {'Ernst & Young Nederland LLP': ['Mar 5 06:49:10 2021 GMT', '2048']}
for v in d.values(): print(v[0])
If you also need the keys, replace .values() with .items() like so:
for k,v in d.items():...
This will give the following output:
Mar 5 06:49:10 2021 GMT
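To get every piece individually, the key and both list entries can be unpacked in one loop. This is a minimal sketch: the names `unpack_entries`, `date_text` and `other` are hypothetical, and it assumes every value is a two-element list whose first entry looks like the date shown in the question. Parsing the date with `datetime.strptime` is an optional extra that the original answers do not cover.

```python
from datetime import datetime

d = {'Ernst & Young Nederland LLP': ['Mar 5 06:49:10 2021 GMT', '2048']}


def unpack_entries(mapping):
    """Return (key, date text, second value, parsed date) tuples (hypothetical helper)."""
    rows = []
    for name, (date_text, other) in mapping.items():
        # Assumed layout: abbreviated month, day, time, year, literal "GMT"
        parsed = datetime.strptime(date_text, "%b %d %H:%M:%S %Y GMT")
        rows.append((name, date_text, other, parsed))
    return rows


for name, date_text, other, parsed in unpack_entries(d):
    print(name)
    print(date_text)
    print(other)
    print(parsed.year, parsed.month, parsed.day)
```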

How to separate date values from a text column with special characters in a pandas dataframe? [closed]

I have a column with four values like the ones below in a DataFrame:
Input
India,Chennai - 24 Oct 1992
India,-Chennai, Oct 1992
(Asia) India,Chennai-22 Oct 1992
India,-Chennai, 1992
Output
Place
India Chennai
India Chennai
(Asia) India Chennai
India Chennai
Date
24 Oct 1992
Oct 1992
22 Oct 1992
1992
I need to split the date and year (e.g. 23 Oct 1992, or just 1992) into one column and the text (India, Chennai) into a separate column.
I'm a bit confused about how to extract the values. I tried the replace and split options but couldn't achieve the result.
I would appreciate it if somebody could help!
Apologies for the format of the input and output data.
Use:
import re
df['Date'] = df['col'].str.split("(-|,)").str[-1]
df['Place'] = df.apply(lambda x: x['col'].split(x['Date']), axis=1).str[0].str.replace(',', ' ').str.replace('-', '')
Input
col
0 India,Chennai - 24 Oct 1992
1 India,-Chennai,Oct 1992
2 India,-Chennai, 1992
3 (Asia) India,Chennai-22 Oct 1992
Output
col Place Date
0 India,Chennai - 24 Oct 1992 India Chennai 24 Oct 1992
1 India,-Chennai,Oct 1992 India Chennai Oct 1992
2 India,-Chennai, 1992 India Chennai 1992
3 (Asia) India,Chennai-22 Oct 1992 (Asia) India Chennai 22 Oct 1992
There are many ways to create columns with the pandas library in Python: from a list of lists, from a list of dictionaries, or from a dictionary of lists.
For simplicity, I am going to use lists here.
First, import pandas:
import pandas as pd
Create a list from the given data:
data = [['India','chennai', '24 Oct', 1992], ['India','chennai', '23 Oct', 1992],
        ['India','chennai', '23 Oct', 1992],['India','chennai', '21 Oct', 1992]]
Create a DataFrame from the list:
df = pd.DataFrame(data, columns = ['Country', 'City', 'Date','Year'], index=(0,1,2,3))
Print it:
print(df)
The output will be:
Country City Date Year
0 India chennai 24 Oct 1992
1 India chennai 23 Oct 1992
2 India chennai 23 Oct 1992
3 India chennai 21 Oct 1992
Hope this helps.
The following assumes that the first digit is where we always want to split the text. If the assumption fails then the code also fails!
>>> import re
>>> text_array
['India,Chennai - 24 Oct 1992', 'India,-Chennai,23 Oct 1992', '(Asia) India,Chennai-22 Oct 1992', 'India,-Chennai, 1992']
# split at the first digit, keep the digit, split at only the first digit
>>> tmp = [re.split("([0-9]){1}", t, maxsplit=1) for t in text_array]
>>> tmp
[['India,Chennai - ', '2', '4 Oct 1992'], ['India,-Chennai,', '2', '3 Oct 1992'], ['(Asia) India,Chennai-', '2', '2 Oct 1992'], ['India,-Chennai, ', '1', '992']]
# join the last two fields together to get the digit back.
>>> r = [(i[0], "".join(i[1:])) for i in tmp]
>>> r
[('India,Chennai - ', '24 Oct 1992'), ('India,-Chennai,', '23 Oct 1992'), ('(Asia) India,Chennai-', '22 Oct 1992'), ('India,-Chennai, ', '1992')]
If you have control over how the input is generated, I would suggest making
the input more consistent, so that it can be parsed with a tool like
pandas or directly with csv.
Hope this helps.
Regards,
Prasanth
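The same "split at the first digit" idea can also be written with `re.search`, which makes the no-digit case explicit instead of failing. This is only a sketch: the function name `split_place_date` is hypothetical, and it keeps the assumption stated in this answer, so a value such as "Oct 1992" (month first, no day) would still be split at the year.

```python
import re


def split_place_date(text):
    """Split text at its first digit into (place, date) (hypothetical helper)."""
    match = re.search(r"\d", text)
    # When there is no digit at all, treat the whole text as the place
    index = match.start() if match else len(text)
    # Turn commas, hyphens and repeated whitespace in the place part into single spaces
    place = re.sub(r"[,\-\s]+", " ", text[:index]).strip()
    date = text[index:].strip()
    return place, date


text_array = [
    'India,Chennai - 24 Oct 1992',
    'India,-Chennai,23 Oct 1992',
    '(Asia) India,Chennai-22 Oct 1992',
    'India,-Chennai, 1992',
]
for t in text_array:
    print(split_place_date(t))
```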
Python code:
import re
import pandas as pd
input_dir = '/content/drive/My Drive/TestData'
csv_file = '{}/test001.csv'.format(input_dir)
p = re.compile(r'(?:[0-9]|[0-2][0-9]|[3][0-1])\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:\d{4})', re.IGNORECASE)
places = []
dates = []
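with open(csv_file, encoding='utf-8', errors='ignore') as f:
    for line in f:
        # Replace commas and hyphens with spaces, then collapse repeated whitespace
        s = re.sub(r"[,-]", " ", line.strip())
        s = re.sub(r"\s+", " ", s)
        r = p.search(s)
        if r is None:
            continue  # skip lines that contain no recognizable date
        str_date = r.group()
        dates.append(str_date)
        # Everything before the date is the place
        place = s[0:s.find(str_date)]
        places.append(place)
# Named data rather than dict to avoid shadowing the built-in
data = {'Place': places,
        'Date': dates
        }
df = pd.DataFrame(data)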
print(df)
Output:
Place Date
0 India Chennai 24 Oct 1992
1 India Chennai Oct 1992
2 (Asia) India Chennai 22 Oct 1992
3 India Chennai 1992

Extract date from email using python 2.7 regex

I have tried many regex patterns to extract the dates from emails that have this format, but I couldn't:
Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM
This is how it looks in all the emails, and I want to extract both dates.
Thank you in advance.
You can do something like this using this kind of pattern:
Using Python3:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print("{0}, {1}".format(final[0][0], " ".join(final[0][1:])))
print(" ".join(final[0][1:]))
Using Python2:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print "%s, %s" % (final[0][0], " ".join(final[0][1:]))
print " ".join(final[0][1:])
Output:
Tue, 13 Nov 2001
13 Nov 2001
Edit:
A quick answer to the new update of your question, you can do something like this:
import re
email = '''Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM'''
data = email.split("\n")
pattern = r"(\w+: \w+, [0-9]+ \w+ [0-9]+)|(\w+: \w+, \w+ [0-9]+, [0-9]+)"
final = []
for k in data:
    final += re.findall(pattern, k)
final = [j.split(":") for k in final for j in k if j != '']
# Python3
print(final)
# Python2
# print final
Output:
[['Date', ' Tue, 13 Nov 2001'], ['Sent', ' Thursday, November 08, 2001']]
import re
my_email = 'Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)'
match = re.search(r': (\w{3,3}, \d{2,2} \w{3,3} \d{4,4})', my_email)
print(match.group(1))
I am not a regex expert, but here is a solution; you can write some tests for it:
import re
d = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
dates = [re.search(r'(\d+ \w+ \d+)', date).groups()[0] for date in re.search(r'(Date: \w+, \d+ \w+ \d+)', d).groups()]
print(dates)
Output:
['13 Nov 2001']
Instead of using regex, you can use split() if you only need to extract strings that follow this same format:
email_date = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
email_date = '%s %s %s %s' % (tuple(email_date.split(' ')[1:5]))
print(email_date)
Output:
Tue, 13 Nov 2001
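If real date objects are wanted rather than substrings, the standard library's parsers can be swapped in for the regex. The sketch below is Python 3 only (`email.utils.parsedate_to_datetime` does not exist in Python 2.7), the function name `parse_headers` is hypothetical, and the `strptime` format for the "Sent" line is an assumption based on the single example in the question; it also relies on English month and weekday names.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

email_text = """Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM"""


def parse_headers(text):
    """Parse the Date and Sent lines into datetime objects (hypothetical helper)."""
    parsed = {}
    for line in text.splitlines():
        name, _, value = line.partition(": ")
        if name == "Date":
            # RFC 2822 style date; the result carries the -0800 offset
            parsed[name] = parsedate_to_datetime(value)
        elif name == "Sent":
            # Assumed layout: weekday, month day, year, 12-hour time with AM/PM
            parsed[name] = datetime.strptime(value.strip(), "%A, %B %d, %Y %I:%M %p")
    return parsed


result = parse_headers(email_text)
print(result["Date"].strftime("%d %b %Y"))
print(result["Sent"].strftime("%d %b %Y"))
```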
