I am trying to extract elements using a regex, and I also need to distinguish which lines end with "-External". The naming structure I am working with is:
<ServerName>: <Country>-<CountryCode>
or
<ServerName>: <Country>-<CountryCode>-External
For example:
test1 = 'Neo1: Brussels-BRU-External'
test2 = 'Neo1: Brussels-BRU'
import re

match = re.search(r'(?<=: ).+', test1)
print match.group(0)
This gives me everything after the colon ("Brussels-BRU-External" for test1, "Brussels-BRU" for test2). I am trying to extract "Brussels" and "BRU" separately, while not caring about anything to the left of the ":".
Afterwards, I need to know which lines have "-External". Is there a way I can treat the presence of "-External" as True and its absence as None?
I suggest that regexes are not needed here, and that a simple split or two can get you what you are after. Here is a way to split() the line into pieces, from which you can then select what you are interested in:
Code:
def split_it(a_string):
    on_colon = a_string.split(':')
    return on_colon[0], on_colon[1].strip().split('-')
Test Code:
tests = (
    'Neo1: Brussels-BRU-External',
    'Neo1: Brussels-BRU',
)

for test in tests:
    print(split_it(test))
Results:
('Neo1', ['Brussels', 'BRU', 'External'])
('Neo1', ['Brussels', 'BRU'])
Analysis:
The length of the list can be used to determine if the additional field 'External' is present.
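If you do want a single regex that also answers the True/None question, here is a minimal sketch using an optional capture group (the pattern itself is my suggestion, not from the post); group 3 is None whenever "-External" is absent:
import re

pattern = re.compile(r': (\w+)-(\w+)(-External)?$')

for line in ('Neo1: Brussels-BRU-External', 'Neo1: Brussels-BRU'):
    country, code, external = pattern.search(line).groups()
    print((country, code, external))  # external is '-External' or None
This prints ('Brussels', 'BRU', '-External') and ('Brussels', 'BRU', None), so bool(external) gives the True/False distinction directly.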
I have a string (as an output of a model that generates sequences) in the format --
<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>
Because this is a collection of elements generated as a sentence/sequence, I would like to reformat it to a list/dictionary (to evaluate the quality of responses) --
[ [ent1, rel1_ent1, rel2_ent1], [ent2, rel1_ent2] ] or
{ "ent1" : ["rel1_ent1", "rel2_ent1"], "ent2" : ["rel1_ent2"] }
So far, the way I have been looking at this is via splitting the string by <bos> and/or <eos> special tokens -- test_string.split("<bos>")[1].split("<eos>")[0].split("<rel>")[1:]. But I am not sure how to handle generality if I do this across a large set of sequences with varying length (i.e. # of rel_ents associated with a given ent).
Also, I feel there might be a more optimal way to do this (without ugly splitting and looping), maybe regex? Either way, I am entirely unsure and looking for a better solution.
Added note: the special tokens <bos>, <new_gen>, <gen>, <eos> can be entirely removed from the generated output if that helps.
Well, there may be a smoother way to avoid what you called "ugly splitting and looping", but re.finditer could be a good option here. Find each substring of interest with the pattern:
<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)
We can then use capture group 1 as our keys and capture group 2 as a substring that we split again into lists:
import re

s = '<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>'
result = re.finditer(r'<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)', s)

d = {}
for match_obj in result:
    d[match_obj.group(1)] = match_obj.group(2).split(' <gen> ')
print(d)
Prints:
{'ent1': ['rel1_ent1', 'rel2_ent1'], 'ent2': ['rel1_ent2']}
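Since you mentioned the special tokens can be removed entirely, a split-based sketch without any regex is also possible (assuming the token format is exactly as shown):
s = '<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>'
body = s.replace('<bos>', '').replace('<eos>', '')
d = {}
for chunk in body.split('<new_gen>'):
    if chunk.strip():
        ent, *rels = (part.strip() for part in chunk.split('<gen>'))
        d[ent] = rels
print(d)  # {'ent1': ['rel1_ent1', 'rel2_ent1'], 'ent2': ['rel1_ent2']}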
I'm trying to find and extract the date and time from a column that contains text sentences. The example data is below.
import pandas as pd

df = {'Id': ['001', '002', ...],
      'Description': ['THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM',
                      'today is 23/3/2013 #10:AM we have', ...],
      ...
     }
df = pd.DataFrame(df, columns=['Id', 'Description'])
I have tried the datefinder library below, but it gives today's date, which is wrong.
findDate = dtf.find_dates(df['Description'][0])
for dates in findDate:
    print(dates)
Does anyone know the best way to extract these and automatically put them into a new column? Or does anyone know of any library that can calculate the duration between the times and dates in a string of text? Thank you.
So you have two issues here:
1. you want to know how to apply a function on a DataFrame.
2. you want a function to extract a pattern from a bunch of text.
Here is how to apply a function on a Series (if you select only one column, as I did, you get a Series). Bonus points: read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!
def do_something(x):
    # some code that transforms x goes here
    return x

df['new_text_column'] = df['original_text_column'].apply(do_something)
And here is one way to extract patterns from a string using regexes. Read the regex doc (or follow a course) and play around with RegExr to become an omniscient god (that is, if you also use a command line on Linux, along with your regex knowledge).
Modified from: How to extract the substring between two markers?
import re

text = 'gfgfdAAA1234ZZZuijjk'

# Searching for numbers.
m = re.search(r'\d+', text)
if m:
    found = m.group(0)
# found: 1234
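Putting the two together for your case, here is a sketch with hand-rolled patterns for the date and time formats visible in your sample (the patterns are my guesses from the two example sentences, so check them against more of your data):
import re

date_pat = r'\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b'  # e.g. 27.1.2020 or 23/3/2013
time_pat = r'\b\d{1,2}[.:]?\d{0,2}\s*[AP]M\b'    # e.g. 9.6AM, 10.30AM, 10:AM

df['dates'] = df['Description'].apply(lambda s: re.findall(date_pat, s))
df['times'] = df['Description'].apply(lambda s: re.findall(time_pat, s, flags=re.IGNORECASE))
print(df[['Id', 'dates', 'times']])
This assumes the df from the question (with the ellipses filled in); each new column holds the list of matches found in that row's Description.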
I have a txt file with multiple lines, and the result I want spans multiple lines.
For example, my data can be simplified as the following:
target_str = """
x:-2.12343234
aaa:-3.05594480202
aaa:-3.01292995004
aaa:-2.383299
456:-2.232342
x:-2.53739230
aaa:-2.96875038099
aaa:-2.92326261448
aaa:-2.87628054847
bbb:-2.82755928961
456:-2.77678240323
x:-2.3433210
aaa:-2.72356707049
aaa:-2.6675072938
aaa:-2.60827106148
456:-2.3323232
x:-2.8743920
aaa:-2.433233
aaa:-2.9747893
aaa:-2.9747893
bbb:-2.43873
456:-2.43434
"""
I want to match blocks of the form:
x:.....
aaa:.....
aaa:.....
aaa:.....
bbb:.....
456:.....
meaning: if a block contains bbb, then I pick up the lines from x:... through 456:....
The expected results for the example data is:
x:-2.53739230
aaa:-2.96875038099
aaa:-2.92326261448
aaa:-2.87628054847
bbb:-2.82755928961
456:-2.77678240323
x:-2.8743920
aaa:-2.433233
aaa:-2.9747893
aaa:-2.9747893
bbb:-2.43873
456:-2.43434
I wrote:
import re

a = re.findall(r"x:(.*\n){4}bbb:.*\n456.*", target_str)
print(a)
But the result is:
['aaa:-2.87628054847\n', 'aaa:-2.9747893\n']
This is not correct. Can anyone help me? Thanks a lot.
Try the following regex:
(x:(?:.*\n){4}bbb:.*\n456.*)
(?:.*\n) — the ?: makes the group non-capturing, so it will not appear in the output. Your original (.*\n) was a capturing group, and findall only reports the last repetition of a repeated group, which is why you got those stray aaa: lines.
Adding parentheses around the whole regex makes it a capturing group, which is what you would like to see as output.
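A quick check with the sample data from the question (re.findall returns the outer capture group, i.e. the whole block):
import re

blocks = re.findall(r'(x:(?:.*\n){4}bbb:.*\n456.*)', target_str)
for block in blocks:
    print(block)
This prints exactly the two x:...456: blocks that contain a bbb line.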
I have a data set as follows:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"
and so on
I want to find out the most frequently occurring word-pairs e.g.
(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)
The two words could be in any order and at any distance from each other
Can someone suggest a possible solution in python? This is a very large data set.
Any suggestion is highly appreciated
This is what I tried after the suggestions from #275365, with input read from a file:
def collect_pairs(file):
    pair_counter = Counter()
    for line in open(file):
        unique_tokens = sorted(set(line))
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    print pair_counter

file = 'myfileComb.txt'
p = collect_pairs(file)
The text file has the same number of lines as the original one, but each line contains only unique tokens. When I run this, it splits the words into individual letters rather than giving combinations of words as output. I don't know where I am making a mistake.
You might start with something like this, depending on how large your corpus is:
>>> from itertools import combinations
>>> from collections import Counter
>>> def collect_pairs(lines):
...     pair_counter = Counter()
...     for line in lines:
...         unique_tokens = sorted(set(line))  # exclude duplicates in same line and sort to ensure one word is always before the other
...         combos = combinations(unique_tokens, 2)
...         pair_counter += Counter(combos)
...     return pair_counter
The result:
>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]
Do you want numbers included in these combinations or not? Since you didn't specifically mention excluding them, I have included them here.
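If you would rather leave the numeric IDs out, a small tweak inside the loop (assuming the IDs are the only all-digit tokens) would be:
unique_tokens = sorted(t for t in set(line) if not t.isdigit())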
EDIT: Working with a file object
The function that you posted as your first attempt above is very close to working. The only thing you need to do is change each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (with quotation marks around each term and commas separating the terms), I would suggest a simple fix: you can use ast.literal_eval. (Otherwise, you might need to use a regular expression of some kind.) See below for a modified version with ast.literal_eval:
from itertools import combinations
from collections import Counter
import ast

def collect_pairs(file_name):
    pair_counter = Counter()
    for line in open(file_name):  # these lines are each simply one long string; you need a list or tuple
        unique_tokens = sorted(set(ast.literal_eval(line)))  # literal_eval converts each line into a tuple before the tuple is converted to a set
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter  # return the actual Counter object
Now you can test it like this:
file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print p.most_common(10)  # for example
There is not that much you can do, except counting all pairs.
Obvious optimizations are to remove duplicate words and synonyms early, to perform stemming (anything that reduces the number of distinct tokens is good!), and to only count pairs (a,b) with a<b (in your example, count either statistics,narnia or narnia,statistics, but not both!).
If you run out of memory, perform two passes. In the first pass, use one or multiple hash functions to obtain a candidate filter. In the second pass, only count words that pass this filter (MinHash / LSH style filtering).
It's an embarrassingly parallel problem, so it is also easy to distribute across multiple threads or computers.
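As a rough illustration of that distribution idea (the chunking scheme and worker count are arbitrary choices, and the file format is assumed to be the quoted-and-comma-separated one from the question):
from itertools import combinations
from collections import Counter
from multiprocessing import Pool
import ast

def count_chunk(lines):
    c = Counter()
    for line in lines:
        tokens = sorted(set(ast.literal_eval(line)))  # sorting guarantees a < b in each pair
        c.update(combinations(tokens, 2))
    return c

if __name__ == '__main__':
    with open('myfileComb.txt') as f:
        lines = f.readlines()
    n = 4  # number of worker processes, tune to your machine
    pool = Pool(n)
    counters = pool.map(count_chunk, [lines[i::n] for i in range(n)])
    pool.close()
    print(sum(counters, Counter()).most_common(10))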
I have a list like this:
Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]
I want to strip the unwanted characters using python so the list would look like:
Tomato
Populus trichocarpa
I can do the following for the first one:
name = ">Tomato4439"
name = name.strip(">1234567890")
print name
Tomato
However, I am not sure what to do with the second one. Any suggestion would be appreciated.
given:
s='Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]'
this:
s = s.split()
[s[0].strip('0123456789,'), s[-2].replace('[',''), s[-1].replace(']','')]
will give you
['Tomato', 'Populus', 'trichocarpa']
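If you want the species name back as a single string, joining the last two pieces works (this just reuses the list from above):
parts = [s[0].strip('0123456789,'), s[-2].replace('[', ''), s[-1].replace(']', '')]
print parts[0], ' '.join(parts[1:])
# Tomato Populus trichocarpa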
It might be worth investigating regular expressions if you are going to do this frequently and the "rules" might not be that static, since regular expressions are much more flexible at dealing with the data in that case. For the sample problem you present, though, this will work.
import re
a = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
re.sub(r"^([A-Za-z]+).+\[([^]]+)\]$", r"\1 \2", a)
This gives
'Tomato Populus trichocarpa'
If the strings you're trying to parse are consistent semantically, then your best option might be classifying the different "types" of strings you have, and then creating regular expressions to parse them using python's re module.
>>> import re
>>> line = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
>>> match = re.match(r"^([a-zA-Z]+).*\[([a-zA-Z ]+)\].*", line)
>>> match.groups()
('Tomato', 'Populus trichocarpa')
Edited to not include the [] in the 2nd group. This should work for anything that matches the pattern of your query (e.g. starts with a name and ends with something in []), so it would also match, for example:
"Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa apples]"
Previous answers were simpler than mine, but here is one way to print the stuff that you don't want.
tag = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
import re

find = re.search(r'>(.+?) \[', tag).group(1)
print find
Gives you
gi|224089052|ref|XP_002308615.1| predicted protein
Then you can use the replace function to remove that from the original string, and the translate function to remove the extra unwanted characters.
tag2 = tag.replace(find, "")
tag3 = str.translate(tag2, None, ">[],")
print tag3
Gives you
Tomato4439 Populus trichocarpa
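If the digits should go too (the desired output in the question drops them, so this is presumably wanted), extending the translate character set and collapsing the leftover whitespace gives:
tag4 = str.translate(tag3, None, "0123456789")
print ' '.join(tag4.split())
Gives you
Tomato Populus trichocarpa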