Remove words until a specific character is reached - python

I'm new to Python and am having difficulty removing words from a string.
9 - Saturday, 19 May 2012
Above is my string. I would like to remove everything before the date, leaving
19 May 2012
so I can easily convert it to an SQL date.
Here is the code that I tried:
new_s = re.sub(',', '', '9 - Saturday, 19 May 2012')
But it only removes the "," in the string. Any help?

You can use string.split(', ')
and you will get
['9 - Saturday', '19 May 2012']

You are missing the .* (matching any number of chars) before the , (and a space after it, which you probably also want to remove):
>>> new_s = re.sub('.*, ', '', '9 - Saturday, 19 May 2012')
>>> new_s
'19 May 2012'

Your regex matches only a single comma, hence that is the only thing it removes.
You may use a negated character class, i.e. [^,]*, to match everything up to the comma, and then match the comma and any trailing spaces so they are removed as well:
>>> print(re.sub('[^,]*, *', '', '9 - Saturday, 19 May 2012'))
19 May 2012

Regex is great, but for this you could also use .split()
test_string = "9 - Saturday, 19 May 2012"
splt_string = test_string.split(",")
out_String = splt_string[1]
print(out_String)
Outputs:
19 May 2012
If the leading ' ' is a problem, you can remove it with out_String.lstrip()

try this
a = "9 - Saturday, 19 May 2012"
f = a.find("19 May 2012")
b = a[f:]
print(b)
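Since the stated goal is an SQL date, here is a minimal sketch of the whole path from the raw string to an ISO-formatted date. It assumes the date always follows the first ', ' and that month names are in English (the %B directive is locale-dependent). The names raw, date_text and sql_date are only illustrative.

```python
from datetime import datetime

raw = '9 - Saturday, 19 May 2012'

# partition() splits on the first ', ' only and returns (before, separator, after)
_, _, date_text = raw.partition(', ')

# %d = day of month, %B = full month name (locale-dependent), %Y = four-digit year
sql_date = datetime.strptime(date_text, '%d %B %Y').date().isoformat()
print(sql_date)  # 2012-05-19
```

Unlike a.find("19 May 2012"), this does not need the date to be known in advance.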

Related

Splitting an item in a list based on words' order

I have a list, and this is one of its items:
'Sent: Saturday, January 13, 2022 8:55 AM'
I only need the part "January 13, 2022 8:55", split into two items: "January 13, 2022" and "8:55". I know I can use split, but the item's content changes, so I need code that works regardless of the values (example of a change: 'Sent: Tuesday, May 17, 2021 9:55 AM'). Most of the answers I found split the data by number of characters or on specific words, which won't work in my case because the data varies. I'm looking for a way to split the item by word position, but I have no clue how.
Any ideas?
text = 'Sent: Saturday, January 13, 2022 8:55 AM'
textAsList = text.split()
newText = textAsList[2]+ ' ' + textAsList[3] + ' ' + textAsList[4] + ' ' + textAsList[5] + ' ' + textAsList[6]
print(newText)
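The snippet above keeps the date and time in one string and also keeps the trailing AM. A small variation, sketched below, returns the two separate items the question asks for. It assumes every item has the same word layout (label, weekday, month, day, year, time, AM/PM); split_sent_line is a hypothetical helper name.

```python
def split_sent_line(text):
    """Return [date, time] from a line shaped like 'Sent: <weekday>, <month> <day>, <year> <time> <AM/PM>'."""
    words = text.split()
    # words[0] is 'Sent:' and words[1] is the weekday; the next three words form the date
    date_part = ' '.join(words[2:5])
    # the word after the year is the time; the AM/PM marker is dropped
    time_part = words[5]
    return [date_part, time_part]

print(split_sent_line('Sent: Saturday, January 13, 2022 8:55 AM'))
# ['January 13, 2022', '8:55']
```

Because it works on word positions instead of characters, the same call also handles 'Sent: Tuesday, May 17, 2021 9:55 AM'.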

Change whitespace to underscore at specific positions

I have a list of strings like this:
strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32', 'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
What I want is to replace the whitespace between the words of each cat breed with an underscore, while keeping the whitespace between .jpg and the first word of the breed, and the whitespace around the numbers.
Expected output:
['pic1.jpg siberian_cat 24 25', 'pic2.jpg siemese_cat 14 32', 'pic3.jpg american_bobtail cat 8 13', 'pic4.jpg cat 9 1']
I tried to construct patterns as follows:
[re.sub(r'(?<!jpg\s)([a-z])\s([a-z])\s([a-z])', r'\1_\2_\3', x) for x in strings ]
However, it adds an underscore between .jpg and the next word.
The problem is that "cat" is not always put at the end of the word combination.
Here is one approach using re.sub with a callback function:
strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32', 'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
output = [re.sub(r'(?<!\S)\w+(?: \w+)* cat\b', lambda x: x.group().replace(' ', '_'), x) for x in strings]
print(output)
This prints:
['pic1.jpg siberian_cat 24 25',
'pic2.jpg siemese_cat 14 32',
'pic3.jpg american_bobtail_cat 8 13',
'pic4.jpg cat 9 1']
Here is an explanation of the regex pattern used:
(?<!\S) assert what precedes first word is either whitespace or start of string
\w+ match a word, which is then followed by
(?: \w+)* a space and another word, zero or more times
[ ] match a single space
cat\b followed by 'cat'
In other words, taking the third list element as an example, the regex pattern matches american bobtail cat, then replaces all spaces by underscore in the lambda callback function.
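To see what the callback actually receives, the pattern can be tried on single elements first. This is only an illustrative check using the pattern from the answer; the names pattern, sample and match are assumptions.

```python
import re

pattern = re.compile(r'(?<!\S)\w+(?: \w+)* cat\b')

sample = 'pic3.jpg american bobtail cat 8 13'
match = pattern.search(sample)
# The lookbehind rejects 'jpg' (it follows a '.'), so the match starts at 'american'
print(match.group())  # american bobtail cat

# With no breed word before 'cat', nothing matches and the string is left unchanged
print(pattern.search('pic4.jpg cat 9 1'))  # None
```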
Try this:
[re.sub(r'jpg\s((\S+\s)+)cat', "jpg " + "_".join(x.split('jpg')[1].split('cat')[0].strip().split()) + "_cat", x) for x in strings ]

Split string using regular expression in python

I have a string like this:
Sale: \t\t\t5 Jan \u2013 10 Jan
I want to extract the start and the end of the sale. A very straightforward approach would be to make several splits, but I want to do that using regular expressions.
As the result I want to get
start = "5 Jan"
end = "10 Jan"
Is it possible to do that using regex?
This should help.
import re
s = "Sale: \t\t\t5 Jan \u2013 10 Jan"
f = re.findall(r"\d+ \w{3}", s)
print(f)
Output:
['5 Jan', '10 Jan']
This may not be an optimised one but works assuming the string pattern remains the same.
import re
s = 'Sale: \t\t\t5 Jan \u2013 10 Jan'
start, end = [part.strip() for part in re.search(r'Sale:(.*)', s).group(1).split('\u2013')]
# start <- 5 Jan
# end <- 10 Jan
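Another option is to let the regex capture both ends in one pass. The sketch below assumes Python 3 (where the string literal contains a real en dash and the re module understands \u2013 in a pattern) and that both dates look like "<day> <three-letter month>".

```python
import re

s = "Sale: \t\t\t5 Jan \u2013 10 Jan"

# Two capture groups, separated by the en dash with optional whitespace around it
match = re.search(r'(\d+ \w{3})\s*\u2013\s*(\d+ \w{3})', s)
start, end = match.groups()
print(start)  # 5 Jan
print(end)    # 10 Jan
```

If the separator can be missing, check that match is not None before unpacking.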

How to Extract Date From String Python

I have a string like
"Originally Posted on 09 May, 2016. By query 3 j...."
How can I extract the date using Python?
I tried this code:
dStr = "Originally Posted on 09 May, 2016. By query 3 j...."
date_st = re.findall("(\d+\ \w+,)", dStr)
Printing date_st, I get:
['09 May,']
What should I do for the year?
You forgot to match the year after the ','. You only need to add ' \d+\.' to the pattern:
re.findall(r"(\d+\ \w+, \d+\.)", dStr)
You will get this:
['09 May, 2016.']
You were almost there. Just add 4 digits for the year after the comma. It is better to use [a-z]+ instead of \w+ to match the month names, because \w also matches _ and 0-9 in addition to letters.
re.findall(r'\d+\s[a-z]+,\s\d{4}', dStr, re.I)
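Once the text is extracted, it can be turned into a real date object. This is a minimal sketch that assumes English month names (the %B directive is locale-dependent) and that a date is present in the string; the names match and parsed are illustrative.

```python
import re
from datetime import datetime

dStr = "Originally Posted on 09 May, 2016. By query 3 j...."

# Same pattern as above: day, month name, comma, four-digit year
match = re.search(r'\d+\s[a-z]+,\s\d{4}', dStr, re.I)

# strptime raises ValueError if the month name is not recognised
parsed = datetime.strptime(match.group(), '%d %B, %Y').date()
print(parsed)  # 2016-05-09
```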

How to write the grammar for this in pyparsing: match a set of words but not containing a given pattern

I am new to Python and pyparsing. I need to accomplish the following.
My sample line of text is like this:
12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt) 23 Mar 2009
I need to extract the item description and the period.
tok_date_in_ddmmmyyyy = Combine(Word(nums,min=1,max=2)+ " " + Word(alphas, exact=3) + " " + Word(nums,exact=4))
tok_period = Combine((tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy)|tok_date_in_ddmmmyyyy)
tok_desc = Word(alphanums+"-()")  # ...but this should stop before tok_period
How can I do this?
I would suggest looking at SkipTo as the pyparsing class that is most appropriate, since you have a good definition of the unwanted text, but will accept pretty much anything before that. Here are a couple of ways to use SkipTo:
text = """\
12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt) 23 Mar 2009"""
# using tok_period as defined in the OP
# parse each line separately
for tx in text.splitlines():
    print(SkipTo(tok_period).parseString(tx)[0])
# or have pyparsing search through the whole input string using searchString
for [[td,_]] in SkipTo(tok_period,include=True).searchString(text):
    print(td)
Both for loops print the following:
12 items - Ironing Service
Washing service (3 Shirt)
M K Saravanan, this particular parsing problem is not so hard to do with good 'ole re:
import re
text = '''
12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt) 23 Mar 2009
This line does not match
'''
date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')
for line in text.splitlines():
    if line:
        try:
            # split() keeps the captured period; strip the padding around both parts
            description, period = map(str.strip, date_pat.split(line)[:2])
            print((description, period))
        except ValueError:
            # The line does not match
            pass
yields
# ('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009')
# ('Washing service (3 Shirt)', '23 Mar 2009')
The main workhorse here is of course the re pattern. Let's break it apart:
\d{1,2}\s+[a-zA-Z]{3}\s+\d{4} is the regexp for a date, the equivalent of tok_date_in_ddmmmyyyy. \d{1,2} matches one or two digits, \s+ matches one or more whitespace characters, [a-zA-Z]{3} matches 3 letters, etc.
(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})? is a regexp surrounded by (?:...).
This indicates a non-capturing group: no group (e.g. match.group(2)) is assigned to this regexp. This matters because date_pat.split() returns a list in which each captured group is a member. By suppressing the capture, we keep the entire period 11 Mar 2009 to 10 Apr 2009 together. The question mark at the end indicates that this pattern may occur zero times or once. This allows the regexp to match both 23 Mar 2009 and 11 Mar 2009 to 10 Apr 2009.
text.splitlines() splits text on \n.
date_pat.split('12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009')
splits the string on the date_pat regexp. Because the whole pattern is wrapped in a capturing group, the match is included in the returned list.
Thus we get:
['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', '']
map(str.strip, date_pat.split(line)[:2]) strips the surrounding whitespace from both parts.
If line does not match date_pat, then date_pat.split(line) returns [line],
so
description, period = map(str.strip, date_pat.split(line)[:2])
raises a ValueError because we can't unpack a single element into two names. We catch this exception and simply move on to the next line.
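If relying on the ValueError feels indirect, the same idea can be sketched with search() and a named group, returning None for lines without a period. This is just one possible variation; DATE, PERIOD_PAT and split_description_period are hypothetical names, and the date format is assumed to be the same "<day> <three-letter month> <year>" as above.

```python
import re

DATE = r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}'
# A single date, optionally followed by ' to ' and a second date
PERIOD_PAT = re.compile(r'(?P<period>' + DATE + r'(?:\s+to\s+' + DATE + r')?)')

def split_description_period(line):
    """Return (description, period), or None if the line contains no period."""
    match = PERIOD_PAT.search(line)
    if match is None:
        return None
    # Everything before the match is the description
    return line[:match.start()].strip(), match.group('period')

print(split_description_period('Washing service (3 Shirt) 23 Mar 2009'))
# ('Washing service (3 Shirt)', '23 Mar 2009')
```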
