Extracting sub-string after the first space in Python

I need help with regex or Python to extract a substring from each of a set of strings. The strings are alphanumeric. I just want the substring that starts after the first space and ends before the last space, as in the example below.
Example 1:
A:01 What is the date of the election ?
BK:02 How long is the river Nile ?
Results:
What is the date of the election
How long is the river Nile
While I am at it, is there an easy way to extract the part of a string before or after a certain character? For example, I want to extract the date or the day from strings like the ones in Example 2.
Example 2:
Date:30/4/2013
Day:Tuesday
Results:
30/4/2013
Tuesday
I have actually read about regex but it's very alien to me. Thanks.

I recommend using split
>>> s="A:01 What is the date of the election ?"
>>> " ".join(s.split()[1:-1])
'What is the date of the election'
>>> s="BK:02 How long is the river Nile ?"
>>> " ".join(s.split()[1:-1])
'How long is the river Nile'
>>> s="Date:30/4/2013"
>>> s.split(":")[1:][0]
'30/4/2013'
>>> s="Day:Tuesday"
>>> s.split(":")[1:][0]
'Tuesday'

>>> s="A:01 What is the date of the election ?"
>>> s.split(" ", 1)[1].rsplit(" ", 1)[0]
'What is the date of the election'
>>>

There's no need to dig into regex if this is all you need; you can use str.partition
s = "A:01 What is the date of the election ?"
before,sep,after = s.partition(' ') # could be, eg, a ':' instead
If all you want is the last part, you can use _ as a placeholder for 'don't care':
_,_,theReallyAwesomeDay = s.partition(':')
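The partition approach can be wrapped into two small helpers. This is only a sketch: the function names middle_words and value_after are made up for illustration, and it assumes the first and last tokens are separated from the rest by single spaces, as in the examples.

```python
def middle_words(line):
    """Return the text between the first and the last space of line."""
    _, _, rest = line.partition(" ")    # drop everything up to the first space
    body, _, _ = rest.rpartition(" ")   # drop everything after the last space
    return body


def value_after(line, sep=":"):
    """Return the text after the first sep, or None if sep is absent."""
    _, found, value = line.partition(sep)
    return value if found else None


print(middle_words("A:01 What is the date of the election ?"))
print(value_after("Date:30/4/2013"))
```

Because partition only splits at the first separator, a value that itself contains a colon (such as a time) stays in one piece.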

Related

How to split an element in a list into two elements?

I want to split elements of list, each element is currently made up of a movie and a date, however I now need to separate them so I can add them to a database
This is what I've tried
movies=["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]
splitter=re.compile('(/(.+)').split
[part for img in movies for part in splitter(img) if part]
How do I solve this problem?
You were almost there ;D
import re
movies=["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]
matcher = re.compile(r"^(.*)\((.*?)\)$").match
print([matcher(movie).groups() for movie in movies])
I suggest using RegExr to learn and test regular expressions.
I am not sure what format you were hoping to get the elements into, but you could home in on similarities, for example that each date starts with "('".
movies = ["The Big Bad Fox and Other Tales (English subtitles) ('23rd','May')"]
titles,dates = [],[]
for i in range(len(movies)):
    newTitle,newDate,sign,count = "","",False,0
    for char in movies[i]:
        if char == "(":
            sign = True
        elif sign == True:
            if char == "'":
                newDate += "(" + movies[i][count:]
                break
        else:
            newTitle += char
        count += 1
    titles.append(newTitle)
    dates.append(newDate)
print(titles)
print(dates)
Output:
['The Big Bad Fox and Other Tales ']
["('23rd','May')"]
Hope this helped!
We can use three important python functions for this problem:
replace(pattern, replacement)
string[start_position:end_position] and string.index(pattern)
movies=["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]
First, make 2 patterns which denote the beginning and end of the date area:
date_start = "('"
date_end = "')"
Then, remove that part of the string for further analysis:
date_information = movies[0][movies[0].index(date_start):movies[0].index(date_end)]
At this point, "date information" should be ('23rd', 'May
Then, just trim the first 2 characters and replace the single quotations:
date_information = date_information[2:].replace("'", "")
This will give you a final string, "date_information" which should be the date and the month, separated by a comma:
23rd, May
Finally, you can split this string by comma (date_information.split(",")) to get it into a database.
Rather than using regex, you can use split
movies=["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]
splitter= movies[0].split(')(')
movie_name = f"{splitter[0]})"
date = f"({splitter[1]}"
This is plain string parsing, so keep in mind that it will only work when every entry follows this exact format.
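Another option, since the date part happens to look like a Python tuple literal, is to cut the string at the last "(" and let ast.literal_eval read the tuple. This is a sketch under that assumption; the helper name split_movie is hypothetical, and it will raise an error on entries that do not end in a two-item tuple of strings.

```python
import ast


def split_movie(entry):
    """Split 'Title('23rd', 'May')' into (title, day, month)."""
    # rpartition cuts at the LAST "(", so parentheses in the title survive
    title, _, date_part = entry.rpartition("(")
    # Put the "(" back and parse the text as a tuple literal
    day, month = ast.literal_eval("(" + date_part)
    return title.strip(), day, month


movies = ["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]
for movie in movies:
    print(split_movie(movie))
```

The three values come back as separate strings, which is usually the shape a database insert wants.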

Using regular expressions in python to extract location mentions in a sentence

I am writing Python code to extract the name of a road, street or highway from a sentence. For example, given "There is an accident along Uhuru Highway", I want to extract the name of the highway mentioned. I have written the code below.
sentence="there is an accident along uhuru highway"
listw=[word for word in sentence.lower().split()]
for i in range(len(listw)):
    if listw[i] == "highway":
        print(listw[i-1] + " "+ listw[i])
This works, but the code is not optimized. I am thinking of using regular expressions. Any help, please?
'uhuru highway' can be found as follows
import re
m = re.search(r'\S+ highway', sentence) # non-white-space followed by ' highway'
print(m.group())
# 'uhuru highway'
If the location you want to extract will always have highway after it, you can use:
>>> import re
>>> sentence = "there is an accident along uhuru highway"
>>> a = re.search(r'.* ([\w\s\d\-\_]+) highway', sentence)
>>> print(a.group(1))
uhuru
You can do the following without using regexes:
sentence.split("highway")[0].strip().split(' ')[-1]
First split on "highway". You'll get:
['there is an accident along uhuru ', '']
Then strip() removes the trailing space from the first part, and splitting on ' ' lets you take its last word.
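Since the question also mentions roads and streets, the regex idea can be extended to several keywords and made case-insensitive. A minimal sketch, assuming the name is the single word just before the keyword; ROAD_RE and find_roads are names invented for this example.

```python
import re

# One word, whitespace, then one of the road-type keywords (any letter case)
ROAD_RE = re.compile(r"\b\w+\s+(?:highway|road|street)\b", re.IGNORECASE)


def find_roads(sentence):
    """Return every '<name> <road type>' mention found in sentence."""
    return [m.group(0) for m in ROAD_RE.finditer(sentence)]


print(find_roads("There is an accident along Uhuru Highway"))
```

Multi-word names (for example "Thika Super Highway") would need a different pattern or a list of known names.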

Convert certain numbers in a sentence such as date, time, phone number from numbers to words in Python

I am fairly new to Python, so I apologize for any gaps in my knowledge. I have Python code, perfected with other users' help (thank you), that converts a date from numbers into words using dictionaries for days, months and years, like 3.6.2015 => march.third.two thousand fifteen, using:
date = raw_input("Give date: ")
I want to input a sentence such as "today is 3.6.2015, it is 10:00 o'clock and it's rainy", but I do not know how to search the sentence for the date, time, or phone number and apply the conversion to it.
If someone can please help, thank you.
You could use regular expressions:
import re
s = "today is 3.6.2015, it is 10:00 o'clock and it's rainy"
mat = re.search(r'(\d{1,2}\.\d{1,2}\.\d{4})', s)
date = mat.group(1)
print(date) # 3.6.2015
Note that if nothing in the input text matches this regular expression, re.search returns None and mat.group(1) raises an AttributeError, which you'll have to either prevent (e.g. with if mat:) or handle.
EDIT
Assuming you can turn your conversion code into a function, you could use re.sub:
import re
def your_function(match):
    # re.sub calls this with a match object; group(1) is the date text, e.g. "3.6.2015"
    num_string = match.group(1)
    # Whatever your conversion does with num_string
    words_string = "march.third.two thousand fifteen"
    return words_string
s = "today is 3.6.2015, it is 10:00 o'clock and it's rainy"
date = re.sub(r'(\d{1,2}\.\d{1,2}\.\d{4})', your_function, s)
print(date)
# today is march.third.two thousand fifteen, it is 10:00 o'clock and it's rainy
Just modify your_function to change the 3.6.2015 into march.third.two thousand fifteen.
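To make the callback idea concrete, here is a small sketch that only spells out the month and leaves the day and year as digits. It assumes the month.day.year order implied by the question's example (3.6.2015 => march...); MONTHS, date_to_words and convert_dates are illustrative names, not the asker's actual code.

```python
import re

# Hypothetical lookup table; a full converter would also spell out day and year
MONTHS = {
    1: "january", 2: "february", 3: "march", 4: "april",
    5: "may", 6: "june", 7: "july", 8: "august",
    9: "september", 10: "october", 11: "november", 12: "december",
}

DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def date_to_words(match):
    """Callback for re.sub: receives a match object, returns replacement text."""
    month, day, year = match.groups()
    name = MONTHS.get(int(month))
    if name is None:
        return match.group(0)  # not a valid month: leave the text untouched
    return "{}.{}.{}".format(name, day, year)


def convert_dates(text):
    return DATE_RE.sub(date_to_words, text)


print(convert_dates("today is 3.6.2015, it is 10:00 o'clock and it's rainy"))
```

Text that does not match the pattern, such as the time 10:00, passes through unchanged, so no explicit check for a missing match is needed with re.sub.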

Python regular expressions for simple questions

I wish to let the user ask a simple question, so I can extract a few standard elements from the string entered.
Examples of strings to be entered:
Who is the director of The Dark Knight?
What is the capital of China?
Who is the president of USA?
As you can see sometimes it is "Who", sometimes it is "What". I'm most likely looking for the "|" operator. I'll need to extract two things from these strings. The word after "the" and before "of", as well as the word after "of".
For example:
1st sentence: I wish to extract "director" and place it in a variable called Relation, and extract "The Dark Knight" and place it in a variable called Concept.
Desired output:
RelationVar = "director"
ConceptVar = "The Dark Knight"
2nd sentence: I wish to extract "capital", assign it to variable "Relation".....and extract "China" and place it in variable "Concept".
RelationVar = "capital"
ConceptVar = "China"
Any ideas on how to use the re.match function? or any other method?
You're correct that you want to use | for who/what. The rest of the regex is very simple. The group names are there for clarity, but you could use r"(?:Who|What) is the (.+) of (.+)[?]" instead.
>>> r = r"(?:Who|What) is the (?P<RelationVar>.+) of (?P<ConceptVar>.+)[?]"
>>> l = ['Who is the director of The Dark Knight?', 'What is the capital of China?', 'Who is the president of USA?']
>>> [re.match(r, i).groupdict() for i in l]
[{'RelationVar': 'director', 'ConceptVar': 'The Dark Knight'}, {'RelationVar': 'capital', 'ConceptVar': 'China'}, {'RelationVar': 'president', 'ConceptVar': 'USA'}]
Change (?:Who|What) to (Who|What) if you also want to capture whether the question uses who or what.
Actually extracting the data and assigning it to variables is very simple:
>>> m = re.match(r, "What is the capital of China?")
>>> d = m.groupdict()
>>> relation_var = d["RelationVar"]
>>> concept_var = d["ConceptVar"]
>>> relation_var
'capital'
>>> concept_var
'China'
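One thing to watch for: a greedy .+ for the relation splits at the last " of " in the question, which goes wrong when the concept itself contains "of". The sketch below uses a non-greedy relation group and returns None for questions that do not fit the template. The function name parse_question and the lower-case group names are made up for this example.

```python
import re

QUESTION_RE = re.compile(
    r"(?:Who|What) is the (?P<relation>.+?) of (?P<concept>.+)\?$"
)


def parse_question(question):
    """Return (relation, concept), or None if the question does not match."""
    m = QUESTION_RE.match(question.strip())
    if m is None:
        return None
    return m.group("relation"), m.group("concept")


print(parse_question("Who is the director of The Dark Knight?"))
print(parse_question("What is the capital of the Republic of Congo?"))
print(parse_question("How are you?"))
```

Checking for None before using the result avoids the AttributeError that calling groupdict() on a failed match would raise.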
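Here is a script. You can use | inside the parentheses to match either alternative. Note that findall returns one tuple per match with one item per capturing group, so the Who/What group is made non-capturing here to leave only the relation and the concept.
This worked fine for me
import re
questions = ['Who is the director of The Dark Knight?','What is the capital of China?','Who is the president of USA?']
# Compile once, outside the loop; \? keeps the question mark out of the concept
a = re.compile(r'(?:What|Who) is the (.+) of (.+)\?')
for string in questions:
    nodes = a.findall(string)
    Relation = nodes[0][0]
    Concept = nodes[0][1]
    print(Relation)
    print(Concept)
    print('----')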
Best Regards:)

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print(repr(item.title[0:-1]))
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudoPython, just notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, call group() on the info object (strip() removes the trailing space that the first two groups capture):
print(info.group(1)) # or info.groups()[0]
print('"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3)))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. Note that \w already covers letters, digits and underscores; if there can be dashes or other punctuation here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
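One caveat with the pattern above: because \w also matches digits, the venue group can swallow the first digit of a two-digit month. The sketch below shows the problem and one possible fix, which is to require a whitespace character right before the date. The names LOOSE and STRICT and the sample title are made up for illustration.

```python
import re

# Original pattern: nothing separates the venue group from the date group
LOOSE = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
# Variant: an explicit \s between venue and date
STRICT = re.compile(r'([\w\s]+)\(([\w\s]+)\s(\d+/\d+)\)')

title = 'randy travis (Billy Bobs 12/21)'

loose_groups = LOOSE.match(title).groups()
strict_groups = STRICT.match(title).groups()

print(loose_groups)   # the venue keeps the "1" and the date becomes "2/21"
print(strict_groups)  # the venue and the full date are separated cleanly
```

The explicit \s also removes the trailing space from the venue group, though the band group still ends with a space before the opening parenthesis.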
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
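The quoting rule can be checked directly by looking at the first character of the repr. This is a tiny sketch; quote_style is a name invented for the example.

```python
def quote_style(text):
    """Return the quote character that repr() uses to delimit text."""
    return repr(text)[0]


print(quote_style("Hello there"))   # plain string
print(quote_style("it's not its"))  # contains a single quote
print(quote_style('say "hi"'))      # contains a double quote
```

Either way the underlying string is the same; only its printed representation differs, so it has no effect on parsing item.title itself.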
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
from geopy import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # str() in case lat and lng come back as numbers, which join() rejects
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
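For the .csv output itself, the standard csv module takes care of quoting, for example when a venue name contains a comma. The sketch below leaves geocoding out entirely and writes to an in-memory buffer; titles_to_csv, TITLE_RE and the sample titles are assumptions for illustration, with the titles taken to include their closing parenthesis.

```python
import csv
import io
import re

TITLE_RE = re.compile(r"(.*) \((.*) (\d+/\d+)\)")


def titles_to_csv(titles, band):
    """Return CSV text with one row per title whose band equals band."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["band", "venue", "date"])
    for title in titles:
        m = TITLE_RE.match(title)
        if m and m.group(1) == band:
            writer.writerow(m.groups())  # (band, venue, date)
    return out.getvalue()


titles = [
    "randy travis (Billy Bobs 3/21)",
    "Michael Schenker Group (House of Blues Dallas 3/26)",
]
print(titles_to_csv(titles, "randy travis"))
```

To write a real file, the same writer can be pointed at a file opened with open(path, "w", newline="") instead of the StringIO buffer.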
