Extract text between two span() iterators in Python

I am trying to extract the text between two iterators.
I have used the span() method on each match to find the start and end spans.
How do I proceed from here to extract the text between these spans?
start_matches = start_pattern.finditer(filter_lines)
end_matches = end_pattern.finditer(filter_lines)
for s_match in start_matches:
    s_cargo = s_match.span()
for e_match in end_matches:
    e_cargo = e_match.span()
Using the spans s_cargo and e_cargo, I want to extract the text between them from the string filter_lines.
I am relatively new to Python; any help is much appreciated.

You can try:
my_data = []
for s, e in zip(s_cargo, e_cargo):
    start, _ = s
    _, end = e
    my_data.append(your_text[start:end])
The variable your_text should be the text you are filtering with the regex (filter_lines above). Note that for zip() to work, s_cargo and e_cargo must be lists of spans, so append each span inside the loops instead of overwriting the variable on every iteration.
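For completeness, here is a minimal self-contained sketch of the whole flow (the sample text and patterns are made up): collect each match's span into a list, then slice the original string from the end of each start span to the start of the matching end span.

```python
import re

# Hypothetical text and delimiter patterns; substitute your own.
filter_lines = "BEGIN first chunk END junk BEGIN second chunk END"
start_pattern = re.compile(r'BEGIN')
end_pattern = re.compile(r'END')

# Collect every span instead of overwriting a single variable.
starts = [m.span() for m in start_pattern.finditer(filter_lines)]
ends = [m.span() for m in end_pattern.finditer(filter_lines)]

# Pair each start with its end and slice the text between them:
# use the END index of the start span and the START index of the end span.
chunks = [filter_lines[s_end:e_start].strip()
          for (s_start, s_end), (e_start, e_end) in zip(starts, ends)]
print(chunks)  # ['first chunk', 'second chunk']
```

This pairs spans positionally, so it assumes the start and end markers alternate cleanly in the text.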

Related

Generating multiple strings by replacing wildcards

So I have the following strings:
"xxxxxxx#FUS#xxxxxxxx#ACS#xxxxx"
"xxxxx#3#xxxxxx#FUS#xxxxx"
And I want to generate the following strings from this pattern (I'll use the second example),
assuming #FUS# represents the values 0-2:
"xxxxx0xxxxxx0xxxxx"
"xxxxx0xxxxxx1xxxxx"
"xxxxx0xxxxxx2xxxxx"
"xxxxx1xxxxxx0xxxxx"
"xxxxx1xxxxxx1xxxxx"
"xxxxx1xxxxxx2xxxxx"
"xxxxx2xxxxxx0xxxxx"
"xxxxx2xxxxxx1xxxxx"
"xxxxx2xxxxxx2xxxxx"
"xxxxx3xxxxxx0xxxxx"
"xxxxx3xxxxxx1xxxxx"
"xxxxx3xxxxxx2xxxxx"
Basically, given a string like the above, I want to generate multiple strings by replacing the wildcards, which can be #FUS#, #WHATEVER#, or a number like #20#, producing every combination in the ranges those wildcards represent.
I've managed to write a regex that finds the wildcards:
wildcardRegex = f"(#FUS#|#WHATEVER#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
which correctly finds the target wildcards.
With one wildcard present it's easy: re.sub(). With more it gets complicated. Or maybe it was a long day...
I think my logic is failing because I can't write code that actually generates the signals. I probably need some kind of recursive function, called once per wildcard present (up to maybe 4 can be present, e.g. xxxxx#2#xxx#2#xx#FUS#xx#2#x).
I need a list of the resulting signals.
Is there an easy way to do this that I'm completely missing?
Thanks.
import re

stringV1 = "xxx#FUS#xxxxi#3#xxx#5#xx"
stringV2 = "XXXXXXXXXX#FUS#XXXXXXXXXX#3#xxxxxx#5#xxxx"
regex = "(#FUS#|#DSP#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
WILDCARD_FUS = "#FUS#"
RANGE_FUS = 3

def getSignalsFromWildcards(app, can):
    sigList = list()
    if WILDCARD_FUS in app:
        for i in range(RANGE_FUS):
            outAppSig = app.replace(WILDCARD_FUS, str(i), 1)
            outCanSig = can.replace(WILDCARD_FUS, str(i), 1)
            if "#" in outAppSig:
                newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
                sigList += newSigList
            else:
                sigList.append((outAppSig, outCanSig))
    # test the string being processed, not the global stringV1
    elif re.search("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", app):
        wildcard = re.search("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", app).group()
        tarRange = int(wildcard.strip("#"))
        for i in range(tarRange):
            outAppSig = app.replace(wildcard, str(i), 1)
            outCanSig = can.replace(wildcard, str(i), 1)
            if "#" in outAppSig:
                newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
                sigList += newSigList
            else:
                sigList.append((outAppSig, outCanSig))
    return sigList

if "#" in stringV1:
    resultList = getSignalsFromWildcards(stringV1, stringV2)
    for item in resultList:
        print(item)
results in
('xxx0xxxxi0xxxxx', 'XXXXXXXXXX0XXXXXXXXXX0xxxxxxxxxx')
('xxx0xxxxi1xxxxx', 'XXXXXXXXXX0XXXXXXXXXX1xxxxxxxxxx')
('xxx0xxxxi2xxxxx', 'XXXXXXXXXX0XXXXXXXXXX2xxxxxxxxxx')
('xxx1xxxxi0xxxxx', 'XXXXXXXXXX1XXXXXXXXXX0xxxxxxxxxx')
('xxx1xxxxi1xxxxx', 'XXXXXXXXXX1XXXXXXXXXX1xxxxxxxxxx')
('xxx1xxxxi2xxxxx', 'XXXXXXXXXX1XXXXXXXXXX2xxxxxxxxxx')
('xxx2xxxxi0xxxxx', 'XXXXXXXXXX2XXXXXXXXXX0xxxxxxxxxx')
('xxx2xxxxi1xxxxx', 'XXXXXXXXXX2XXXXXXXXXX1xxxxxxxxxx')
('xxx2xxxxi2xxxxx', 'XXXXXXXXXX2XXXXXXXXXX2xxxxxxxxxx')
Long day after all...
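An alternative sketch without recursion (the RANGES table and helper names here are my own, not from the question): find all wildcards once, then take the cartesian product of their ranges with itertools.product.

```python
import re
from itertools import product

RANGES = {'#FUS#': 3, '#DSP#': 3}  # assumed ranges for the symbolic wildcards

def wildcard_range(w):
    # Symbolic wildcards use the table above; numeric ones like #3# use their value.
    return RANGES[w] if w in RANGES else int(w.strip('#'))

def expand(signal):
    wildcards = re.findall(r'#(?:FUS|DSP|\d{1,3})#', signal)
    results = []
    for combo in product(*(range(wildcard_range(w)) for w in wildcards)):
        out = signal
        for w, value in zip(wildcards, combo):
            out = out.replace(w, str(value), 1)  # replace leftmost occurrence only
        results.append(out)
    return results

print(expand('xxxxx#3#xxxxxx#FUS#xxxxx'))
# 9 strings: 'xxxxx0xxxxxx0xxxxx' ... 'xxxxx2xxxxxx2xxxxx'
```

Replacing with count 1 handles repeated wildcards (e.g. two #2#s) correctly, since each pass consumes the leftmost remaining occurrence.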

Extracting text between two strings in a DataFrame column in Python

I am new to Python and don't have much knowledge; I need help with a problem I'm having.
I have a DataFrame with a text column, say 'Issue', and I need to pull the text between two strings, say 'notify' and 'accordingly'. I tried the method below but get a blank output:
start = 'to notify'
end = 'accordingly'
data_1['match'] = data_1['Issue'].apply(lambda x: "".join(x for x in x.split() if re.search(('%s(.*)%s' % (start, end)), x)))
I also tried re.findall, but it complains about expecting string or bytes-like objects; I tried converting the column from object to string, but that didn't work either. It would be really helpful if someone could help me with this.
I'm having a bit of trouble reading your code, but this snippet should do what I understand you want (get the text between a start and an end string):
import pandas as pd
import re

start = 'to notify'
end = 'accordingly'

# An auxiliary function to better handle the errors
# when the pattern start - text - end is not found
def extract_between(x, start, end):
    try:
        return re.match(pattern=r'.*{}(.*){}.*'.format(start, end), string=x).group(1)
    except AttributeError:
        return None

# This is just an example; if it does not work for your purpose, please share some data
df = pd.DataFrame([('to notify TEXT accordingly'), ('this should not match')], columns=['issue'])
df['issue'] = df['issue'].apply(extract_between, **{'start': start, 'end': end})
print(df['issue'])
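For this particular shape of problem, pandas also has a vectorised alternative that avoids apply entirely: .str.extract with a non-greedy capture group (the sample data below is made up; the markers are the question's 'to notify' / 'accordingly').

```python
import pandas as pd

df = pd.DataFrame({'Issue': ['please to notify the team accordingly asap',
                             'no markers here']})
# expand=False returns a Series; rows without a match become NaN.
df['match'] = df['Issue'].str.extract(r'to notify(.*?)accordingly', expand=False)
print(df['match'])
```

The non-greedy (.*?) keeps the capture from running past the first 'accordingly' if the markers appear more than once in a row's text.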

Getting regular text from wikipedia page

I am trying to get the text, or the summary text, from a random Wikipedia page, and in the end I need it to be a list of lists of words (a list of sentences).
I am using the following code:
def get_random_pages_summary(pages=0):
    import wikipedia
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p, wikipedia.page(p).summary] for p in page_names]

def text_to_list_of_words_without_new_line(text):
    t = text.replace("\n", " ").strip()
    t1 = t.split()
    t2 = ["".join(w) for w in t1]
    return t2

text = get_random_pages_summary(1)
for i, row in enumerate(text):
    text[i][1] = text_to_list_of_words_without_new_line(row[1])
print(text[0][1])
I am getting weird tokens; I assume they are a relic of the markup of the Wikipedia page, e.g.
Russian:', u'\u0418\u0432\u0430\u043d
I found that this probably happens when there is a quote in another language inside the English page; it also happens when the page contains a range of years, e.g. 2015-2016.
I would like to convert all of these to regular words, and remove those that I cannot convert.
Thanks.
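A minimal sketch of the cleanup step being asked for (the function name and the exact character policy are my own assumptions): use a regex that keeps only tokens built from ASCII letters, which drops foreign-script fragments, punctuation residue, and year ranges like 2015-2016.

```python
import re

def text_to_word_list(text):
    # [A-Za-z] matches ASCII letters only, so Cyrillic etc. is dropped;
    # the optional group allows internal apostrophes and hyphens (don't, re-use).
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)

print(text_to_word_list("Ivan (Russian: \u0418\u0432\u0430\u043d) reigned 2015-2016"))
# ['Ivan', 'Russian', 'reigned']
```

If you want to keep accented Latin words (e.g. café), the character class would need widening; this strict version matches the "regular words only" requirement in the question.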

Slicing a string into a list based on recurring patterns

I have a long string variable full of hex values:
hexValues = 'AA08E3020202AA08E302AA1AA08E3020101'  # etc.
The first 2 bytes (AA08) are a signature for the start of a frame, and the rest of the data up to the next AA08 is the content of that frame.
I want to slice the string into a list based on the recurring start-of-frame signature, e.g.:
list = [AA08, E3020202, AA08, F25S1212, AA08, 42ABC82]  # etc.
I'm not sure how I can split the string up like this. Some of the frames are also corrupted, where the start of the frame won't be AA08 but maybe AA01, so I'd need some kind of regex to spot these.
If I do list = hexValues.split('AA08'), the list just loses all the frame starts...
So I'm a bit stuck.
Newbie to Python.
Thanks
For the case when you don't have "corrupted" data the following should do:
hex_values = 'AA08E3020202AA08E302AA1AA08E3020101'
delimiter = hex_values[:4]
hex_values = hex_values.replace(delimiter, ',' + delimiter + ',')
hex_list = hex_values.split(',')[1:]
print(hex_list)
['AA08', 'E3020202', 'AA08', 'E302AA1', 'AA08', 'E3020101']
Without considering corruptions, you may try this:
l = []
for s in hexValues.split('AA08'):
    if s:
        l += ['AA08', s]

Extract emoticons from a text

I need to extract text emoticons from text using Python. I've been looking for solutions, but most of them, like this or this, only cover simple emoticons; I need to parse all of them.
Currently I'm using a list of emoticons that I iterate over for every text I process, but this is inefficient. Do you know a better solution? Maybe a Python library that can handle this problem?
One of the most efficient solutions is the Aho-Corasick string-matching algorithm, a nontrivial algorithm designed for exactly this kind of problem (searching for multiple predefined strings in an unknown text).
There are packages available for this:
https://pypi.python.org/pypi/ahocorasick/0.9
https://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
Edit:
There are also more recent packages available (I haven't tried them):
https://pypi.python.org/pypi/pyahocorasick/1.0.0
Extra:
I have done some performance testing with pyahocorasick: it is faster than Python's re when searching for more than one word from the dict (2 or more).
Here is the code:
import re, ahocorasick, random, time

# search N words from dict
N = 3

# file from http://norvig.com/big.txt
with open("big.txt", "r") as f:
    text = f.read()

words = set(re.findall('[a-z]+', text.lower()))
search_words = random.sample([w for w in words], N)

A = ahocorasick.Automaton()
for i, w in enumerate(search_words):
    A.add_word(w, (i, w))
A.make_automaton()

# test time for ahocorasick
start = time.time()
print("aho matches", sum(1 for i in A.iter(text)))
print("aho done in", time.time() - start)

exp = re.compile('|'.join(search_words))

# test time for re
start = time.time()
m = exp.findall(text)
print("re matches", sum(1 for _ in m))
print("re done in", time.time() - start)
