How do I print everything between two sentences? - python

Im very new to coding and only know the very basics. I am using python and trying to print everything between two sentences in a text. I only want the content between, not before or after. It`s probably very easy, but i couldnt figure it out.
Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)
I want to collect the bold text to use in website later. Everything except the italic text(the sentence before and after) is dynamic if that has anything to say.

You can use split to cut the string and access the parts that you are interested in.
If you know how to get the full text already, it's easy to get the bold sentence by removing the two constant sentences before and after.
full_text = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
s1 = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom)"
s2 = "Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
bold_text = full_text.split(s1)[1] # Remove the left part.
bold_text = bold_text.split(s2)[0] # Remove the right part.
bold_text = bold_text.strip() # Clean up spaces on each side if needed.
print(bold_text)

It looks like a job for regular expressions, there is the re module in Python.
You should:
Open the file
Read its content in a variable
Use search or match function in the re module
In particular, in the last step you should use your "surrounding" strings as "delimiters" and capture everything between them. You can achieve this using a regex pattern like str1 + "(.*)" + str2.
You can give a look at regex documentation, but just to give you an idea:
".*" captures everything
"()" allows you actually capture the content inside them and access it later with an index (e.g. re.search(pattern, original_string).group(1))

Related

Converting character-level spans to token-level spans

I'm currently trying to convert character-level spans to token-level spans and am wondering if there's a functionality in the library that I may not be taking advantage of.
The data that I'm currently using consists of "proper" text (I say "proper" as in it's written as if it's a normal document, not with things like extra whitespaces for easier split operations) and annotated entities. The entities are annotated at the character level but I would like to obtain the tokenized subword-level span.
My plan was to first convert character-level spans to word-level spans, then convert that to subword-level spans. A piece of code that I wrote looks like this:
new_text = []
for word in original_text.split():
if (len(word) > 1) and (word[-1] in ['.', ',', ';', ':']):
new_text.append(word[:-1] + ' ' + word[-1])
else:
new_text.append(word)
new_text = ' '.join(new_text).split()
word2char_span = {}
start_idx = 0
for idx, word in enumerate(new_text):
char_start = start_idx
char_end = char_start + len(word)
word2char_span[idx] = (char_start, char_end)
start_idx += len(word) + 1
This seems to work well but one edge case I didn't think of is parentheses. To give a more concrete example, one paragraph-entity pair looks like this:
>>> original_text = "RDH12, a retinol dehydrogenase causing Leber's congenital amaurosis, is also involved in \
steroid metabolism. Three retinol dehydrogenases (RDHs) were tested for steroid converting abilities: human and \
murine RDH 12 and human RDH13. RDH12 is involved in retinal degeneration in Leber's congenital amaurosis (LCA). \
We show that murine Rdh12 and human RDH13 do not reveal activity towards the checked steroids, but that human type \
12 RDH reduces dihydrotestosterone to androstanediol, and is thus also involved in steroid metabolism. Furthermore, \
we analyzed both expression and subcellular localization of these enzymes."
>>> entity_span = [139, 143]
>>> print(original_text[139:143])
'RDHs'
This example actually returns a KeyError when I try to refer to (139, 143) because the adjustment code I wrote takes (RDHs) as the entity rather than RDHs. I don't want to hardcode parentheses handling either because there are some entities where the parentheses are included.
I feel like there should be a simpler approach to this issue and I'm overthinking things a bit. Any feedback on how I could achieve what I want is appreciated.
Tokenization is tricky. I'd suggest using SpaCy to process your data as you can access the offset of each token in the source text at character level, which should make mapping of your character spans to tokens straightforward:
import spacy
nlp = spacy.load("en_core_web_sm")
original_text = "Three retinol dehydrogenases (RDHs) were tested for steroid converting abilities:"
# Process data
doc = nlp(original_text)
for token in doc:
print(token.idx, token, sep="\t")
Output:
0 Three
6 retinol
14 dehydrogenases
29 (
30 RDHs
34 )
36 were
41 tested
48 for
52 steroid
60 converting
71 abilities
80 :

How to use regex in string partition using python?

I have a string like as shown below from a pandas data frame column
string = "insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous) - Hypoglycaemia Protocol if Blood Glucose Level (mmol) < 4 - Call Doctor if Blood Glucose Level (mmol) > 22"
I am trying to get an output like as shown below (you can see everything before 2nd hyphen is returned)
insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous)
So, I tried the below code
string.partition(' -')[0] # though this produces the output, not reliable
Meaning, I always want everything before the 2nd Hyphen (-).
Instead of me manually assigning the spaces, I would like to write something like below. Not sure whether the below is right as well. can you help me get everything before the 2nd hyphen?
string.partition(r'\s{2,6}-')[0]
Can help me get the expected output using partition method and regex?
You could use re.sub here for a one-liner solution:
string = "insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous) - Hypoglycaemia Protocol if Blood Glucose Level (mmol) < 4 - Call Doctor if Blood Glucose Level (mmol) > 22"
output = re.sub(r'^([^-]+?-[^-]+?)(?=\s*-).*$', '\\1', string)
print(output)
This prints:
insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous)
Explanation of regex:
^ from the start of the input
( capture
[^-]+? all content up to
- the first hyphen
[^-]+? all content up, but not including
) end capture
(?=\s*-) zero or more whitespace characters followed by the second hyphen
.* then match the remainder of the input
$ end of the input
Try using re.split instead of string.partition:
re.split(r'\s{2,6}-', string)[0]
Simple solution with split and join:
"-".join(string.split("-")[0:2])

How to split a multi-line string using regular expressions?

I have been banging my beginner head for most of the day trying various things.
Here is the string
1 default active Eth2/45, Eth2/46, Eth2/47
Eth3/41, Eth3/42, Eth3/43
Eth4/41, Eth4/42, Eth4/43
47 Production active Po1, Po21, Po23, Po25, Po101
Po102, Eth2/1, Eth2/2, Eth2/3
Eth2/4, Eth3/29, Eth3/30
Eth3/31, Eth3/32, Eth3/33
Eth3/34, Eth3/35, Eth3/36
Eth3/37, Eth3/38, Eth3/39
Eth3/40, Eth3/44, Eth4/29
Eth4/30, Eth4/31, Eth4/32
Eth4/33, Eth4/34, Eth4/35
Eth4/36, Eth4/37, Eth4/38
Eth4/39, Eth4/40, Eth4/44
128 Test active Po1, Eth1/13, Eth2/1, Eth2/2
Eth2/3, Eth2/4
129 Backup active Po1, Eth1/14, Eth2/1, Eth2/2
Eth2/3, Eth2/4
What I need is to split like below. I have tried to use regex101.com to simulate various regex but I did not have much luck. I managed to isolate the delimiters with (\n\d+) and then I wanted to use lookbehind but I got an error saying that I need fixed string length.
Here is a link to the regex101 section:
1 default active Eth2/45, Eth2/46, Eth2/47
Eth3/41, Eth3/42, Eth3/43
Eth4/41, Eth4/42, Eth4/43
47 VLAN047 active Po1, Po21, Po23, Po25, Po101
Po102, Eth2/1, Eth2/2, Eth2/3
Eth2/4, Eth3/29, Eth3/30
Eth3/31, Eth3/32, Eth3/33
Eth3/34, Eth3/35, Eth3/36
Eth3/37, Eth3/38, Eth3/39
Eth3/40, Eth3/44, Eth4/29
Eth4/30, Eth4/31, Eth4/32
Eth4/33, Eth4/34, Eth4/35
Eth4/36, Eth4/37, Eth4/38
Eth4/39, Eth4/40, Eth4/44
128 Rogers-Refresh-MGT active Po1, Eth1/13, Eth2/1, Eth2/2
Eth2/3, Eth2/4
129 ManagementSegtNorthW active Po1, Eth1/14, Eth2/1, Eth2/2
Eth2/3, Eth2/4
Update: I update the regex101 example but it is not selecting what I want. The python code works. I wonder what is the problem with regex101
That's pretty simple - use lookahead instead of lookbehind:
parsed = re.split(r'\n(?=\d)', data)
In python there is always more than one way to skin a cat. Multiline regexes are usually very hard. The following is a lot simpler, and more importantly readable
for line in data.split("\n"):
if line[0].isdigit():
if section:
sections.append("\n".join(section))
section=[]
section.append(line)
sections.append("\n".join(section)) # grab the last one
print(sections)
Performance wise, I think this would probably be better, because we are not looking for a pattern in the entire string. we are only looking at the first character in a line.

Python Regex matching on unexpected parts of search string

I'm trying to parse a page using regex (Python 2.7; IPython QTConsole). The page is a .txt pulled from a web directory that I grabbed using urllib2
>>> import re
>>> Z = '[A-Z]{2}Z[0-9]{3}.*?\\$\\$'
>>> snippet = re.search(Z, page, re.DOTALL)
>>> snippet = snippet.group() # Only including the first part for brevity.
'PZZ570-122200-\nPOINT ARENA TO POINT REYES 10 TO 60 NM OFFSHORE-\n249 AM PDT FRI SEP 12 2014\n.TODAY...SW WINDS 5 KT. WIND WAVES 2 FT OR LESS.\nNW SWELL 3 TO 5 FT AT 12 SECONDS. PATCHY FOG IN THE MORNING.\n.TONIGHT...W WINDS 10 KT. WIND WAVES 2 FT OR LESS.'
I want to search for the newline followed by a period. I'd like to get the first and second occurrences as below. The objective is to parse the information between the first and second (and subsequent) \n\. delimiters. I know I could do look-around, but I'm having trouble making the lookahead greedy. Further, I can't figure out why the following doesn't work.
>>> pat = r"\n\."
>>> s = re.search(pat, snippet.group(), re.DOTALL)
>>> e = re.search(pat, snippet.group()[s.end():], re.DOTALL)
The s above works, but I get a strange result for e.
>>> [s.group(), s.start(), e.group(), e.end()]
['\n.', 90, '\n.', 110]
>>> snippet.group()[s.start():e.end()]
'\n.TODAY...SW WINDS 5'
>>> snippet.group()[e.start():e.end()]
' 5'
I guess there's some formatting in snippet.group() that's hidden? If that's the case, then it's strange that some newlines are explicit as if snippet.group() is raw, and others are hidden. Why are e.group(), and snippet.group()[e.start():e.end()] different?
I apologize if this question has already been addressed. I couldn't find anything related.
Thanks very much in advance.
To split a string in python, it might be easier to use str.split() or re.split().
e.g.:
"1\n.2\n.3".split("\n.")

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
parse item.title for bandRaw, venue, date
if bandRaw == str(band)
send venue name + ", Dallas, TX" to google for geocoding
return lat,long
list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
pat = re.compile('([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individual, just call them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
if m:
bandRaw, venue, date = m.groups()
if band == bandRaw:
place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
lines.append(",".join([band, venue, date, lat, lng]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.

Categories

Resources