fuzzy string split in Python 2.x - python

Input file:
    rep_origin 607..1720
               /label=Rep
    Region 2643..5020
           /label="region"
           extra_info and stuff
I'm trying to split on the first column-like entry. For example, I want to get a list that looks like this:
Desired Output:
['rep_origin 607..1720 /label=Rep', 'Region 2643..5020 /label="region" extra_info and stuff']
I tried splitting on ' ' but that gave me some crazy stuff. It would solve the problem if I could add a "fuzzy" search term at the end that matches any alphabet character but NOT a whitespace. I suppose you can do it with regex, with something like findall on ' [A-Z]', but I wasn't sure if there was a less complicated way.
Is there a way to add a "fuzzy" search term at the very end of a string.split identifier (i.e. original_string.split(' [alphabet_character]'))?

I'm not sure exactly what you're looking for, but the parse function below takes the text from your question and returns a list of sections, where each section is a list of the lines that belong to it (with leading and trailing whitespace removed).
#!/usr/bin/env python
import re

# This is the input from your question (section starts indented with
# 4 spaces, continuation lines with more)
INPUT_TEXT = '''\
    rep_origin 607..1720
               /label=Rep
    Region 2643..5020
           /label="region"
           extra_info and stuff'''

# A regular expression that matches the start of a section. A section
# start is a line that has 4 spaces before the first non-space
# character.
match_section_start = re.compile(r'^    [^ ]').match

def parse(text):
    sections = []
    section_lines = None

    def append_section_if_lines():
        if section_lines:
            sections.append(section_lines)

    for line in text.split('\n'):
        if match_section_start(line):
            # We've found the start of a new section. Unless this is
            # the first section, save the previous section.
            append_section_if_lines()
            section_lines = []
        section_lines.append(line.strip())
    # Save the last section.
    append_section_if_lines()
    return sections

sections = parse(INPUT_TEXT)
print(sections)
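To get the flat strings from your desired output, it should then just be a matter of joining each section's lines back together, e.g.:

flat = [' '.join(section) for section in sections]
print(flat)
# ['rep_origin 607..1720 /label=Rep',
#  'Region 2643..5020 /label="region" extra_info and stuff']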

Related

Splitting based on particular pattern and editing string

I am trying to split a string based on a particular pattern in an effort to rejoin it later after adding a few characters.
Here's a sample of my string: "123\babc\b:123" which I need to convert to "123\babc\\"b\":123". I need to do it several times in a long string. I have tried variations of the following:
regex = r"(\\b[a-zA-Z]+)\\b:"
test_str = "123\\babc\\b:123"
x = re.split(regex, test_str)
but it doesn't split at the right positions for me to join. Is there another way of doing this/another way of splitting and joining?
You're right, you can do it with re.split as suggested. You can split on \b and then rebuild your output with a specific separator (keeping the \b where you want it).
Here an example:
# Import module
import re

string = "123\\babc\\b:123"

# Split on "\b"
list_sliced = re.split(r'\\b', string)
print(list_sliced)
# ['123', 'abc', ':123']

# Define your custom separator
custom_sep = '\\\\"b\\"'

# Build your new output
output = list_sliced[0]
# Iterate over each word
for i, word in enumerate(list_sliced[1:]):
    # Choose the separator according to parity (since we don't want to
    # change the first "\b")
    sep = "\\\\b"
    if i % 2 == 1:
        sep = custom_sep
    # Update output
    output += sep + word

print(output)
# 123\\babc\\"b\":123
Maybe the following expression,
^([\\]*)([^\\]+)([\\]*)([^\\]+)([\\]*)([^:]+):(.*)$
with a replacement of
\1\2\3\4\5\\"\6\\":\7
used with re.sub, would return the desired output.
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it.
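As a rough sketch, with the pattern and replacement above applied to the sample string from the question:

import re

pattern = r'^([\\]*)([^\\]+)([\\]*)([^\\]+)([\\]*)([^:]+):(.*)$'
replacement = r'\1\2\3\4\5\\"\6\\":\7'

test_str = "123\\babc\\b:123"
print(re.sub(pattern, replacement, test_str))
# 123\babc\\"b\":123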

pandas read_table with regex header definition

For a data file formatted like this:
("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001
I can use pd.read_table(filename, sep=' ', header=0) and it will get everything correct except for the very first header, "Time Step".
Is there a way to specify a regex string for read_table() to use to parse out the header names?
I know a way to solve the issue is to just use regex to create a list of names for the read_table() function to use, but I figured there might/should be a way to directly express that in the import itself.
Edit: Here's what it returns as headers:
['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
So it doesn't appear to be possible to do this inside the pandas.read_table() function. Below is the actual solution I ended up using to fix the problem:
import re

def get_headers(file, headerline, regexstring, exclude):
    # Get string of selected headerline
    with file.open() as f:
        for i, line in enumerate(f):
            if i == headerline - 1:
                headerstring = line
            elif i > headerline - 1:
                break

    # Parse headerstring
    reglist = re.split(regexstring, headerstring)

    # Filter entries in reglist
    # filter out blank strs
    filteredlist = list(filter(None, reglist))

    # filter out items in exclude list
    headerslist = []
    if exclude:
        for entry in filteredlist:
            if entry not in exclude:
                headerslist.append(entry)
    else:
        headerslist = filteredlist

    return headerslist

get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
Code explanation:
get_headers():
Arguments: file is a pathlib.Path pointing to the file that contains the header (note the file.open() call). headerline is the line number (starting at 1) on which the header names appear. regexstring is the pattern that will be fed into re.split(); it is highly recommended to prefix the pattern with r. exclude is a list of miscellaneous strings that you want removed from the header list.
The regex pattern I used:
First up we have the pipe (|) symbol. It separates the "normal" split pattern (which is the " ") from the other stuff that needs to be removed (namely the parentheses).
Starting with the first group: (?:" "). We have the (...) since we want to match those characters in order. The " " is what we want to split around. The ?: says not to capture the contents of the group. This is important/useful because otherwise re.split() will keep any captured groups as separate items. See re.split() in the documentation.
The second group is simply the other characters. Without it, the first and last items would be '("Time Step' and 'flow-time)\n'. Note that this causes \n to be treated as a separate entry in the list, which is why we use the exclude argument to clean that up after the fact.
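For reference, here is roughly what that pattern does to the header line from the question, before the filtering steps drop the empty strings and the '\n':

import re

line = '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'
print(re.split(r'(?:" ")|["\)\(]', line))
# ['', '', 'Time Step', 'courantnumber_max', 'courantnumber_avg',
#  'flow-time', '', '\n']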

Python: Extract hashtags out of a text file

So, I've written the code below to extract hashtags (and also tags starting with '@'), append them to a list, and sort them in descending order. The problem is that the text might not be perfectly formatted, with no spaces between individual hashtags, in which case the following happens (as can be checked with the #print statement inside the for loop):
#socality#thisismycommunity#themoderndayexplorer#modernoutdoors#mountaincultureelevated
So the .split() method doesn't deal with those. What would be the best way to handle this issue?
Here is the .txt file
Grateful for your time.
name = input("Enter file:")
if len(name) < 1 : name = "tags.txt"
handle = open(name)
tags = dict()
lst = list()

for line in handle :
    hline = line.split()
    for word in hline:
        if word.startswith('#') : tags[word] = tags.get(word,0) + 1
        else :
            tags[word] = tags.get(word,0) + 1
            #print(word)

for k,v in tags.items() :
    tags_order = (v,k)
    lst.append(tags_order)

lst = sorted(lst, reverse=True)[:34]

print('Final Dictionary: ' , '\n')
for v,k in lst :
    print(k , v, '')
Use a regular expression. There are only a few limits: a tag must start with either # or @, and it may not contain any spaces or other whitespace characters.
This code

import re

tags = []
with open('../Downloads/tags.txt', 'Ur') as f:
    for line in f:
        tags += re.findall(r'[#@][^\s#@]+', line)

creates a list of all tags in the file. You can easily adjust it to store the found tags in your dictionary: instead of storing the result straight away in tags, loop over it and do with each item as you please.
The regex is built up from two custom character classes:
[#@] - either the single character # or @ at the start
[^\s#@]+ - a sequence of anything that is not a whitespace character (\s matches all whitespace such as space, tab, and newline), #, or @; at least one, and as many as possible.
So findall starts matching at the start of any tag and then grabs as much as it can, stopping only when it encounters one of the "not" characters.
findall returns a list of matching items, which you can immediately add to an existing list, or loop over the found items in turn:
for tag in re.findall(r'[#@][^\s#@]+', line):
    # process "tag" any way you want here
The source text file contains Windows-style \r\n line endings, and so I initially got a lot of empty "lines" on my Mac. Opening the text file in Universal newline mode makes sure that is handled transparently by the line reading part of Python.
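As a sketch, the findall result can feed the counting dictionary from your code directly (the file name and the top-10 cutoff here are my own choices):

import re

tags = {}
with open('tags.txt', 'Ur') as f:  # 'Ur' = universal-newline read mode
    for line in f:
        for tag in re.findall(r'[#@][^\s#@]+', line):
            tags[tag] = tags.get(tag, 0) + 1

# show the most frequent tags first
for count, tag in sorted(((v, k) for k, v in tags.items()), reverse=True)[:10]:
    print(tag, count)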

Using Python 2.4.3: Want to find the same regex multiple times in a text file

Super NOOB to Python (2.4.3): I am executing a function containing a regular expression which searches through a txt file that I'm importing. I am able to read the file and run re.search on it, and the output is correct. I need to run this for multiple occurrences (the regex occurs 48 times in the text). The code is as follows:
#!/usr/bin/python
import re

dataRead = open('pd_usage_14-04-23.txt', 'r')
dataWrite = open('test_write.txt', 'w')

text = (dataRead.read()) #reads and initializes text for conversion to string
s = str(text) #converts text to string for reading

def user(str):
    re1='((?:[a-z][a-z]+))' # Word 1
    re2='(\\s+)' # White Space 1
    re3='((?:[a-z][a-z]+))' # Word 2
    re4='(\\s+)' # White Space 2
    re5='((?:[a-z][a-z]*[0-9]+[a-z0-9]*))' # Alphanum 1
    rg = re.compile(re1+re2+re3+re4+re5,re.IGNORECASE|re.DOTALL)
    #alphanum1=rg.group(5)
    re.findall(rg, s, flags=0)
    #print "("+alphanum1+")"+"\n"
    #if m:
    #    word1=m.group(1)
    #    ws1=m.group(2)
    #    word2=m.group(3)
    #    ws2=m.group(4)
    #    alphanum1=m.group(5)
    #    print "("+alphanum1+")"+"\n"
    return

user(s)
dataRead.close()
dataWrite.close()
OUTPUT: g706454
THIS OUTPUT IS CORRECT! BUT...!
I need to run it multiple times, reading text that's further down.
I have 2 other functions that also need to be run multiple times. I need all 3 to run consecutively, and then run again starting from the next line or so, to search and output newer data. All the logic I tried to implement returns the same output.
So I have something like this:

for count in range (0,47):
    if stop_read:
        date(s)
        usage(s)
        user(s)

stop_read is a function that finds the next line after the data I'm looking for (date, usage, user). I figured I could call it to say: if you hit stop_read, read the next line and run the functions all over again.
Any help is greatly appreciated!
Here is what I do for a regex in Python 3; it should be similar in Python 2. This is for a multiline search.
regex = re.compile("\\w+-\\d+\\b", re.MULTILINE)
Then later on in code I have something like:
myset.update([m.group(0) for m in regex.finditer(logmsg.text)])
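Self-contained, that might look something like this (the sample log text is made up for illustration):

import re

regex = re.compile("\\w+-\\d+\\b", re.MULTILINE)

log_text = "job-12 started\njob-13 failed\njob-12 retried"
myset = set()
myset.update([m.group(0) for m in regex.finditer(log_text)])
print(myset)   # e.g. set(['job-13', 'job-12'])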
Maybe you might want to update your Python if you can, 2.4 is old, old, and stale.
It looks like re.findall would solve your problem:
re.findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
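Putting that together with the pattern from your code, a sketch for Python 2.4 might look like this (group 5 is the alphanumeric field, per the comments in your function):

import re

# The five groups from the question, compiled once
rg = re.compile('((?:[a-z][a-z]+))'                  # Word 1
                '(\\s+)'                             # White Space 1
                '((?:[a-z][a-z]+))'                  # Word 2
                '(\\s+)'                             # White Space 2
                '((?:[a-z][a-z]*[0-9]+[a-z0-9]*))',  # Alphanum 1
                re.IGNORECASE | re.DOTALL)

dataRead = open('pd_usage_14-04-23.txt', 'r')
s = dataRead.read()
dataRead.close()

# findall returns one 5-tuple per occurrence -- all 48 of them
for groups in rg.findall(s):
    print groups[4]   # the alphanumeric field, e.g. "g706454"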

Python: re.compile and re.sub

Question part 1
I got this file f1:
<something #37>
<name>George Washington</name>
<a23c>Joe Taylor</a23c>
</something #37>
and I want to use re.compile/re.sub on it so that it looks like this (with spaces):
George Washington Joe Taylor
I tried this code but it kind of deletes everything:

import re
file = open('f1.txt')
fixed = open('fnew.txt', 'w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ', text)
fixed.write(fixed_doc)
My guess is the re.compile line but I'm not quite sure what to do with it. I'm not supposed to use 3rd party extensions. Any ideas?
Question part 2
I had a different question about comparing 2 files I got this code from Alfe:
from collections import Counter

def test():
    with open('f1.txt') as f:
        contentsI = f.read()
    with open('f2.txt') as f:
        contentsO = f.read()
    tokensI = Counter(value for value in contentsI.split()
                      if value not in [])
    tokensO = Counter(value for value in contentsO.split()
                      if value not in [])
    return not (tokensI - tokensO) and not (set(tokensO) - set(tokensI))
Is it possible to implement the re.compile and re.sub in the 'if value not in []' section?
I will explain what happens with your code:
import re
file = open('f1.txt')
fixed = open('fnew.txt','w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ',text)
fixed.write(fixed_doc)
The instruction text = file.read() creates a string object and binds it to the name text.
As a consequence of the instruction for unwanted in text:, the identifier unwanted is successively bound to each character of that string.
Besides, re.compile('<.*>') creates an object of type RegexObject (which I personally call a compiled regex, or simply a regex; <.*> is only the regex pattern).
You assign this compiled regex object to the identifier match: that's a very bad practice, because match is already the name of a method of regex objects in general, and of the one you created in particular, so you could then write match.match without error.
match is also the name of a function in the re module.
Using this name for your particular need is very confusing. You must avoid that.
There's the same flaw with the use of file as a name for the file handler of file f1: file is already an identifier used by the language, so you must avoid it.
Well. Now that this badly named match object is defined, the instruction fixed_doc = match.sub(r' ',text) replaces all the occurrences found by the regex match in text with the replacement r' '.
Note that it's completely superfluous to write r' ' instead of just ' ', because there's absolutely nothing in ' ' that needs to be escaped. It's a fad of some anxious people to write raw strings every time they have to write a string in a regex problem.
Because of its pattern <.*>, in which the dot means "greedily eat every character situated between a < and a > unless it is a newline character", the occurrences caught in the text by match are each line up to the last > in it.
As the name unwanted doesn't appear in this instruction, the same operation is done for each character of the text, one after the other. That is to say: nothing interesting.
To analyze the execution of a program, you should put some printing instructions in your code to understand what happens. For example, if you do print repr(fixed_doc), you'll see the repeated printing of this: ' \n \n \n '. As I said: nothing interesting.
There's one more defect in your code: you open files, but you don't close them. It is mandatory to close files, otherwise some weird phenomena can happen, which I personally observed in some of my code before I realized this necessity. Some people claim it isn't mandatory, but that's false.
By the way, the better manner to open and close files is to use the with statement. It does all the work without you having to worry about it.
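For instance, a minimal with-based version of the file handling from the question might look like this (same file names as in the question):

with open('f1.txt') as source, open('fnew.txt', 'w') as fixed:
    # both files are closed automatically when the block exits,
    # even if an exception is raised inside it
    fixed.write(source.read())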
So, now I can propose a code for your first problem:

import re

def ripl(mat=None, li=[]):
    if mat == None:
        li[:] = []
        return
    if mat.group(1):
        li.append(mat.span(2))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()

r = re.compile('</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\1>))',
               re.DOTALL)

text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''

print '1------------------------------------1'
print text
print '2------------------------------------2'
ripl()
print r.sub(ripl, text)
print '3------------------------------------3'
print '3------------------------------------3'
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>Washington
Joe </zazaza>Taylor
3------------------------------------3
The principle is as follows:
When the regex detects a tag:
- if it's an end tag, it matches
- if it's a start tag, it matches only if there is a corresponding end tag somewhere further on in the text
For each match, the method sub() of the regex r calls the function ripl() to perform the replacement.
If the match is a start tag (which is necessarily followed somewhere in the text by its corresponding end tag, by construction of the regex), then ripl() returns ''.
If the match is an end tag, ripl() returns '' only if this end tag has previously been detected as the corresponding end tag of an earlier start tag. This is made possible by recording in a list li the span of the corresponding end tag each time a start tag is detected and matched.
The recording list li is defined as a default argument so that it is always the same list that is used at each call of the function ripl() (please refer to the functioning of default arguments to understand this, because it's subtle).
As a consequence of defining li as a parameter with a default argument, the list object li would retain all the spans recorded during earlier analyses if several texts were analyzed successively. To avoid li retaining the spans of past texts' matches, the list must be emptied. I wrote the function so that its first parameter defaults to None: that allows ripl() to be called without argument before any use of it in the regex's sub() method.
So one must remember to call ripl() with no argument before each such use.
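To illustrate the default-argument mechanism that ripl() relies on, here is a toy sketch (not part of the answer's code): the list bound to the parameter is created once, at definition time, and shared across calls.

def remember(x=None, seen=[]):
    if x is None:
        seen[:] = []        # empty the shared list in place
        return
    seen.append(x)
    return seen

remember(1)
remember(2)
print remember(3)   # [1, 2, 3] -- the same list survives across calls
remember()          # reset, just as calling ripl() with no argument does
print remember(4)   # [4]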
If you want to remove the newlines of the text in order to obtain the precise result you showed in your question, the code must be modified to:
import re
def ripl(mat=None, li=[]):
    if mat == None:
        li[:] = []
        return
    if mat.group(1):
        return ''
    elif mat.group(2):
        li.append(mat.span(3))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()

r = re.compile('( *\n *)'
               '|'
               '</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\2>)) *',
               re.DOTALL)
text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''
print '1------------------------------------1'
print text
print '2------------------------------------2'
ripl()
print r.sub(ripl,text)
print '3------------------------------------3'
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>WashingtonJoe </zazaza>Taylor
3------------------------------------3
You can use Beautiful Soup to do this easily:
from bs4 import BeautifulSoup
file = open('f1.txt')
fixed = open('fnew.txt','w')
#now for some soup
soup = BeautifulSoup(file)
fixed.write(str(soup.get_text()).replace('\n',' '))
The output of the above line will be:
George Washington Joe Taylor
(At least this works with the sample you gave me.)
Sorry, I don't understand part 2. Good luck!
You don't need re.compile:

import re

clean_string = ''
with open('f1.txt') as f1:
    for line in f1:
        match = re.search('.+>(.+)<.+', line)
        if match:
            clean_string += (match.group(1))
            clean_string += ' '

print(clean_string) # 'George Washington Joe Taylor'
Figured the first part out: it was the missing '?'.

match = re.compile('<.*?>')

does the trick.
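A quick illustration of the difference the '?' makes, using one line from the sample file: without it, .* greedily swallows everything up to the last '>' on the line.

import re

line = '<name>George Washington</name>'
print(re.sub('<.*>', ' ', line))    # greedy: matches the whole line -> ' '
print(re.sub('<.*?>', ' ', line))   # non-greedy -> ' George Washington '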
Anyway, still not sure about the second question. :/
For part 1, try the below code snippet. However, consider using a library like BeautifulSoup, as suggested by Moe Jan.

import re
import os

def main():
    f = open('sample_file.txt')
    fixed = open('fnew.txt','w')
    #pattern = re.compile(r'(?P<start_tag>\<.+?\>)(?P<content>.*?)(?P<end_tag>\</.+?\>)')
    pattern = re.compile(r'(?P<start><.+?>)(?P<content>.*?)(</.+?>)')
    output_text = []
    for text in f:
        match = pattern.match(text)
        if match is not None:
            output_text.append(match.group('content'))
    fixed_content = ' '.join(output_text)
    fixed.write(fixed_content)
    f.close()
    fixed.close()

if __name__ == '__main__':
    main()
For part 2:
I am not completely clear on what you are asking; my guess is that you want to do something like if re.sub(value) not in []. However, note that you would need to call re.compile only once, prior to initializing the Counter instance. It would be better if you clarified the second part of your question.
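That said, here is a sketch of one way to fold the tag-stripping into the comparison (the tag regex is my own guess at the intent; it simply removes anything that looks like a tag before tokenizing):

import re
from collections import Counter

strip_tags = re.compile(r'</?[^>]+>')   # compiled once, before any counting

def tokens(path):
    with open(path) as f:
        # remove the tags, then split the remainder into words
        return Counter(strip_tags.sub(' ', f.read()).split())

def test():
    tokensI = tokens('f1.txt')
    tokensO = tokens('f2.txt')
    return not (tokensI - tokensO) and not (set(tokensO) - set(tokensI))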
Actually, I would recommend using Python's built-in difflib module to find the differences between two files. That is better than rolling your own diff algorithm, since the diff logic is well tested and widely used, and is not vulnerable to logical or programming errors resulting from spurious newline, tab, and space characters.
