Match all names with exactly 5 digits at the end - python

I have a text file like this:
john123:
1
2
coconut_rum.zip
bob234513253:
0
jackdaniels.zip
nowater.zip
3
judy88009:
dontdrink.zip
9
tommi54321:
dontdrinkalso.zip
92
...
I have millions of entries like this.
I want to pick out the name and number where the number is exactly 5 digits long. I tried this:
matches = re.findall(r'\w*\d{5}:',filetext2)
but it's giving me results which have at least 5 digits.
['bob234513253:', 'judy88009:', 'tommi54321:']
Q1: How to find the names with exactly 5 digits?
Q2: I want to append the zip files which are associated with these names with 5 digits. How do I do that using regular expressions?

That's because \w includes digit characters:
>>> import re
>>> re.match('\w*', '12345')
<_sre.SRE_Match object at 0x021241E0>
>>> re.match('\w*', '12345').group()
'12345'
>>>
You need to be more specific and tell Python that you only want letters (a leading \b also stops findall from matching just the tail of a longer digit run):
matches = re.findall(r'\b[A-Za-z]*\d{5}:', filetext2)
Regarding your second question, you can use something like the following:
import re
# Dictionary to hold the results
results = {}
# Break up the file text to get the names and their associated data.
# filetext2.split('\n\n') breaks it up into individual data blocks (one per person).
# Mapping str.splitlines over the blocks breaks each one into single lines.
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    # See if the name matches our pattern.
    if re.match(r'[A-Za-z]*\d{5}:', name):
        # Add the name and the relevant data to the results.
        # [:-1] gets rid of the colon on the end of the name.
        # The list comprehension keeps only the file names from the data.
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]
Or, without all the comments:
import re
results = {}
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    if re.match(r'[A-Za-z]*\d{5}:', name):
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]
Below is a demonstration:
>>> import re
>>> filetext2 = '''\
... john123:
... 1
... 2
... coconut_rum.zip
...
... bob234513253:
... 0
... jackdaniels.zip
... nowater.zip
... 3
...
... judy88009:
... dontdrink.zip
... 9
...
... tommi54321:
... dontdrinkalso.zip
... 92
... '''
>>> results = {}
>>> for name, *data in map(str.splitlines, filetext2.split('\n\n')):
...     if re.match(r'[A-Za-z]*\d{5}:', name):
...         results[name[:-1]] = [x for x in data if x.endswith('.zip')]
...
>>> results
{'tommi54321': ['dontdrinkalso.zip'], 'judy88009': ['dontdrink.zip']}
>>>
Keep in mind, though, that it is not very efficient to read in all of the file's contents at once. Instead, consider writing a generator function that yields the data blocks one at a time. You can also increase performance by pre-compiling your regex patterns.
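For illustration, here is a minimal sketch of both ideas; the file name 'data.txt' and the blank-line-separated block format are assumptions based on the sample above:
import re
# pre-compiled pattern, reused for every block
NAME_RE = re.compile(r'[A-Za-z]*\d{5}:')
def blocks(path):
    # Yield one blank-line-separated data block at a time.
    block = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                block.append(line)
            elif block:
                yield block
                block = []
    if block:
        yield block
results = {}
for name, *data in blocks('data.txt'):
    if NAME_RE.match(name):
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]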

import re
results = {}
with open('datazip') as f:
    records = f.read().split('\n\n')
for record in records:
    lines = record.splitlines()
    header = lines[0]
    # note that you need a raw string
    if re.match(r"[^\d]\d{5}:", header[-7:]):
        # in general multiple hits are possible, so put them into a list
        results[header] = [l for l in lines[1:] if l[-3:] == "zip"]
print results
Output
{'tommi54321:': ['dontdrinkalso.zip'], 'judy88009:': ['dontdrink.zip']}
Comment
I tried to keep it very simple. If your input is very long you should, as suggested by iCodez, implement a generator that yields one record at a time. For the regex match, I tried a small optimization: searching only the last 7 characters of the header.
Addendum: a simplistic implementation of a record generator
import re
def records(f):
    record = []
    for l in f:
        l = l.strip()
        if l:
            record.append(l)
        else:
            if record:  # guard against consecutive blank lines
                yield record
            record = []
    if record:  # don't lose the last record if the file lacks a trailing blank line
        yield record
results = {}
for record in records(open('datazip')):
    head = record[0]
    if re.match(r"[^\d]\d{5}:", head[-7:]):
        results[head] = [r for r in record[1:] if r[-3:] == "zip"]
print results

You need to limit the regex at the end of the name so that it won't match any further, using \b:
[a-zA-Z]+\d{5}\b
see for example http://regex101.com/r/oC1yO6/1
The regex would match
judy88009:
tommi54321:
python code would be like
>>> re.findall(r'[a-zA-Z]+\d{5}\b', x)
['judy88009', 'tommi54321']
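Regarding Q2, here is one possible sketch that pairs each matching name with its .zip files, assuming the blocks are separated by blank lines as in the sample (filetext2 is the OP's variable):
import re
name_re = re.compile(r'[a-zA-Z]+\d{5}\b')
results = {}
for block in filetext2.split('\n\n'):
    lines = block.splitlines()
    if lines and name_re.match(lines[0]):
        # key the result by the name without the trailing colon
        results[lines[0].rstrip(':')] = [l for l in lines[1:] if l.endswith('.zip')]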

Related

Extract the lines between 2 specific tags

For a routine programming question, I need to extract some lines of text that are between 2 tags (delimiters, if I need to be more specific).
The file is something like this:
*some random text*
...
...
...
tag/delimiter 1
text 1 #extract
text 2 #extract
... #extract
... #extract
text n #extract
tag/ending_delimiter
*some random text*
...
...
...
tag/delimiter 2
text 1 #extract
text 2 #extract
... #extract
... #extract
text n #extract
tag/ending_delimiter
*some random text*
...
...
...
tag/delimiter n
text 1 #extract
text 2 #extract
... #extract
... #extract
text n #extract
tag/ending_delimiter
*some random text until the file ends*
The ending_delimiter is the same everywhere.
The starting delimiter, i.e delimiter 1, delimiter 2 upto n is taken from a list.
The catch is, in the file there are a few (fewer than 3) characters after each starting delimiter which, combined with the starting delimiter, work as an identifier for the lines of text until the ending_delimiter, a kind of "uid", technically.
So far, what I've tried is this:
data_file = open("file_name")
block = []
found = False
for elem in list_of_starting_delimiters:
    for line in data_file:
        if found:
            block.append(line)
            if re.match(attribute_end, line.strip()):
                break
        else:
            if re.match(elem, line.strip()):
                found = True
                block = elem
data_file.close()
I have also tried to implement the answers suggested in:
python - Read file from and to specific lines of text
but with no success.
The implementation I'm currently trying is one of the answers of the link above.
Any help is appreciated.
P.S: Using Python 2.7, on PyCharm, on Windows 10.
I suggest fixing your code the following way:
block = []
found = False
list_of_starting_delimiters = ['tag/delimiter']
attribute_end = 'tag/ending_delimiter'
curr = []
for elem in list_of_starting_delimiters:
    for line in data_file:
        if found:
            curr.append(line)
            if line.strip().startswith(attribute_end):
                found = False
                block.append("\n".join(curr))  # Add merged lines to final list
                curr = []  # Zero out current list
        else:
            if line.strip().startswith(elem):  # If line starts with a start delimiter
                found = True
                curr.append(line.strip())  # Append line to current list
if len(curr) > 0:  # If there are still lines in the current list
    block.append("\n".join(curr))  # Add them to the final list, joined as well
See the Python demo
There are quite a lot of issues with your current code:
block = elem made block a plain string, and the subsequent .append call raised an exception
You only grabbed one occurrence of the block, because upon finding one you had a break statement
All the lines were added as separate items, while you needed to collect them into a list and then join them with \n to get the strings to put into the resulting list
You need no regex to check if a string appears at the start of another string; use the str.startswith method.
By the time I figured this out there were already a fair number of good responses, but my approach would be to resolve this with:
import re
pattern = re.compile(r"(^tag\/delimiter) (.{0,3})\n\n((^[\w\d #\.]*$\n)+)^(tag\/ending_delimiter)", re.M)
You could then find all matches in your text by either doing:
for i in pattern.finditer(<target_text>):
    # do something with each match
pattern.findall(<target_text>) - returns a list of strings of all matches
This of course bears the stipulation that you need to specify different delimiters and compile a different regex pattern (re.compile) for each different delimiter, using variables and string concatenation as #SpghttCd shows in his answer.
For more info see the python re module
What about
import re
with open(file, 'r') as f:
    txt = f.read()
losd = '|'.join(list_of_starting_delimiters)
enddel = 'tag/ending_delimiter'
block = re.findall('(?:' + losd + r')([\s\S]*?)' + enddel, txt)
I would do that in the following way: for example purposes, let <d1>, <d2> and <d3> be our starting delimiters, <d> the ending delimiter, and string the text you are processing. Then the following line of code:
re.findall('(<d1>|<d2>|<d3>)(.+?)(<d>)', string, re.DOTALL)
will give a list of tuples, with each tuple containing the starting delimiter, the body and the ending delimiter. This code uses grouping inside the regular expression (the parentheses); the pipe (|) acts like or, the dot (.) combined with the DOTALL flag matches any character, the plus (+) means 1 or more, and the question mark (?) makes the match non-greedy (this is important here, as otherwise you would get a single match beginning at the first starting delimiter and ending at the last ending delimiter).
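For instance, a small self-contained sketch of this approach, with the question's tag names standing in for the real delimiters:
import re
string = """tag/delimiter 1
text 1 #extract
text 2 #extract
tag/ending_delimiter
some random text
tag/delimiter 2
text 3 #extract
tag/ending_delimiter"""
matches = re.findall('(tag/delimiter 1|tag/delimiter 2)(.+?)(tag/ending_delimiter)', string, re.DOTALL)
for start, body, end in matches:
    print(start)
    print(body.strip().splitlines())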
My re-less solution would be the following:
list_of_starting_delimiters = ['tag/delimiter 1', 'tag/delimiter 2', 'tag/delimiter n']
enddel = 'tag/ending_delimiter'
block ={}
section = ''
with open(file, 'r') as f:
    for line in f:
        if line.strip() == enddel:
            section = ''
        if section:
            block[section] = block.get(section, '') + line
        if line.strip() in list_of_starting_delimiters:
            section = line.strip()
print(block)
It extracts the blocks into a dictionary with start delimiter tags as keys and according sections as values.
It requires the start and end tags to be the only content of their respective lines.
Output:
{'tag/delimiter 1':
'\ntext 1 #extract\n\ntext 2 #extract\n\n... #extract\n\n... #extract\n\ntext n #extract\n\n',
'tag/delimiter 2':
'\ntext 1 #extract\n\ntext 2 #extract\n\n... #extract\n\n... #extract\n\ntext n #extract\n\n',
'tag/delimiter n':
'\ntext 1 #extract\n\ntext 2 #extract\n\n... #extract\n\n... #extract\n\ntext n #extract\n\n'}

Parsing block in file to Python list without newlines

I have a particular block of stuff inside a larger file of many contents; the block is arbitrarily long, can contain any character, and begins each line with a blank space. It has this form in some text file:
1\1\GINC-NODE9999\Scan\...
... ... ... ... ... ... ...
... ... ... ... ...\HF=-568
.8880019,-568.2343213, -568
.2343432, ... , -586.328492
1\RMSD=...
I'm interested in the particular sequence which lies between \HF= and \RMSD=. I want to put these numbers into a Python list. This sequence is simply a series of comma-separated numbers; however, these numbers can roll over onto a second line. ALSO, \HF= and \RMSD= themselves may be broken by rolling over onto a newline.
Current Efforts
I currently have the following:
with open(infile) as data:
    d1 = []
    start = '\\HF'
    end = 'RMSD'
    should_append = False
    for line in data:
        if start in line:
            data = line[len(start):]
            d1.append(data)
            should_append = True
        elif end in line:
            should_append = False
            break
        elif should_append:
            d1.append(line)
which spits out the following list
['.6184082129,7.5129238742\\\\Version=EM64L-G09RevC.01\\
State=1-A\\HF=-568\n', ' .8880019,-568.8879907,-568.8879686,
-568.887937,-\n']
The problem is not only do I have newlines throughout, I'm also keeping more data than I should. Furthermore, numbers that roll over onto other lines are given their own placement in the list. I need it to look like
['-568.8880019', '-568.8879907', ... ]
A multiline non-greedy regular expression can be used to extract the text that lies between \HF= and \RMSD=. Once the text is extracted, it should be trivially easy to tokenize it into the constituent numbers:
import re
import os
# the backslashes in the markers must themselves be escaped for the regex
pattern = r'\\HF=(.*?)\\RMSD='
pat = re.compile(pattern, re.DOTALL)
for number in pat.finditer(open('file.txt').read()):
    print number.group(1).replace(os.linesep, '').replace(' ', '').strip('\\')
Output
-568.8880019,-568.2343213,-568.2343432,...,-586.3284921
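To finish the tokenizing step mentioned above, something along these lines should work; the raw string below is the assumed result of the extraction:
raw = "-568.8880019,-568.2343213,-568.2343432,...,-586.3284921"
numbers = []
for tok in raw.split(','):
    try:
        numbers.append(float(tok))
    except ValueError:
        pass  # skip placeholders such as '...'
print numbers  # [-568.8880019, -568.2343213, -568.2343432, -586.3284921]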
For a fast solution, you can implement a naive string concatenation based on regular expressions. I implemented a short solution for your data format:
import re
def naiveDecimalExtractor(data):
    # Stitch together numbers broken across lines: optional sign and
    # integer part, the decimal part, and an optional digit run that
    # rolled over onto the next line. Naive: it assumes a digit run at
    # the start of a line is always a continuation.
    p = re.compile(r"(-?\d+)\s*(\.\d+)(?:\s*\n\s*(\d+))?")
    brokenNumbers = p.findall(data)
    return ["".join(n) for n in brokenNumbers]
# raw string so the backslashes in the sample data survive
data = r"""
1\1\GINC-NODE9999\Scan\...
... ... ... ... ... ... ...
... ... ... ... ...\HF=-568
.8880019,-568.2343213, -568
.2343432, ... , -586.328492
1\RMSD=...
"""
print naiveDecimalExtractor(data)
Use something like this to join everything in one line:
with open(infile) as data:
    joined = ''.join(data.read().splitlines())
And then parse that without worrying about newlines.
If your file is really large you may want to consider another approach to avoid having it all in memory.
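For example, a possible low-memory sketch streams the file line by line and accumulates only the span between the markers; note it assumes the markers themselves are not split across lines, which the question warns can happen:
def hf_span(path):
    buf = []
    capturing = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not capturing and '\\HF=' in line:
                capturing = True
                line = line.split('\\HF=', 1)[1]
            if capturing:
                if '\\RMSD=' in line:
                    buf.append(line.split('\\RMSD=', 1)[0])
                    break
                buf.append(line)
    return ''.join(buf)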
How about something like this:
# open the file to read
f = open("test.txt")
# read the whole file and join the lines into one big string
text = " ".join(f.readlines())
# get the substring between \HF= and \RMSD, then remove newlines and spaces
values = text[text.find("\HF=") + 4:text.find("\RMSD")].translate(None, "\n ")
# the string is now just numbers separated by commas, so split it into a list
# using the ',' delimiter
numbers = values.split(',')
Now numbers has:
['-568.8880019', '-568.2343213', '-568.2343432', '...', '-586.3284921']
I had something like this open and forgot to post - a "slightly" different answer that uses mmap'd files and re.finditer:
This has the advantage of dealing with larger files relatively efficiently as it allows the regex engine to see the file as one long string without it being in memory at once.
import mmap
import re
with open('/home/jon/blah.txt') as fin:
    mfin = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(r'\\HF=(.*?)\\RMSD=', mfin, re.DOTALL):
        print match.group(1).translate(None, '\n ').split(',')
# ['-568.8880019', '-568.2343213', '-568.2343432', '...', '-586.3284921']

Matching in Python lists when there are extra characters

I am trying to write a python code to match things from two lists in python.
One tab-delimited file looks like this:
COPB2
KLMND7
BLCA8
while the other file2 has a long list of similar looking "names", if you will. There should be some identical matches in the file, which I have succeeded in identifying and writing out to a new file. The problem is when there are additional characters at the end of one of the "names". For example, COPB2 from above should match COPB2A in file2, but it does not. Similarly KLMND7 should match KLMND79. Should I use regular expressions? Make them into strings? Any ideas are helpful, thank you!
What I have worked on so far, after the first response seen below:
with open(in_file1, "r") as names:
    for line in names:
        file1_list = [i.strip() for i in line.split()]
        file1_str = str(file1_list)
with open(in_file2, "r") as symbols:
    for line in symbols:
        items = line.split("\t")
        items = str(items)
        matches = items.startswith(file1_str)
        print matches
This code returns False when I know there should be some matches.
str.startswith() - no need for regex if it's only trailing characters:
>>> g = "COPB2A"
>>> f = "COPB2"
>>> g.startswith(f)
True
Here is a working piece of code:
file1_list = []
with open(in_file1, "r") as names:
    for line in names:
        line_items = line.split()
        for item in line_items:
            file1_list.append(item)
matches = []
with open(in_file2, "r") as symbols:
    for line in symbols:
        file2_items = line.split()
        for file2_item in file2_items:
            for file1_item in file1_list:
                if file2_item.startswith(file1_item):
                    matches.append(file2_item)
                    print file2_item
print matches
It may be quite slow for large files. If it's unacceptable, I could try to think about how to optimize it.
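One possible optimization, sketched under the assumption that file1_list and the file2 symbols are already loaded: put the names in a set and test each prefix of a symbol against it, so the cost per symbol depends on its length rather than on the number of names:
names = set(file1_list)
matches = []
for item in file2_items:
    # try every prefix of item, shortest to longest
    for i in range(1, len(item) + 1):
        if item[:i] in names:
            matches.append(item)
            break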
You might take a look at difflib if you need a more generic solution. Keep in mind it is a big import with lots of overhead so only use it if you really need to. Here is another question that is somewhat similar.
https://stackoverflow.com/questions/1209800/difference-between-two-strings-in-python-php
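For example, a quick sketch using difflib.get_close_matches, with the sample names from the question:
import difflib
names = ['COPB2', 'KLMND7', 'BLCA8']
print difflib.get_close_matches('COPB2A', names)  # ['COPB2']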
Assuming you loaded the files into lists X, Y:
## match if a or b is equal to or a prefix of the other, in a case-sensitive way
## (note: a.find(...) returns an index, not a boolean, so testing it directly
## inverts the logic; compare the common-length prefixes instead)
def Match(a, b):
    n = min(len(a), len(b))
    return a[:n] == b[:n]
common_words = {}
for a in X:
    common_words[a] = []
    for b in Y:
        if Match(a, b):
            common_words[a].append(b)
If you want to use regular expressions to do the matching, you want the "beginning of string" anchor "^".
import re
def MatchRe(a, b):
    # make sure the longer string is in 'a'
    if len(a) < len(b):
        a, b = b, a
    # re.escape protects any regex metacharacters in b
    q = re.match("^" + re.escape(b), a)
    if not q:
        return False  # no match
    return True  # access q.group(0) for the matched prefix

appending regex matches to a dictionary

I have a file in which there is the following info:
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
I'm looking for a way to match the colon and append whatever appears after it (the numbers) to a dictionary whose keys are the animal names at the beginning of each line.
Actually, regular expressions are unnecessary, provided that your data is well formatted and contains no surprises.
Assuming that data is a variable containing the string that you listed above:
dict(item.split(":") for item in data.split())
t = """
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
"""
import re
d = {}
for p, q in re.findall(r'^(.+?)_.+?:(.+)', t, re.M):
    d.setdefault(p, []).append(q)
print d
Why don't you use the Python find method to locate the index of the colon, which you can then use to slice the string?
>>> x='dogs_3351.txt:34.13559322033898'
>>> key_index = x.find(':')
>>> key = x[:key_index]
>>> key
'dogs_3351.txt'
>>> value = x[key_index+1:]
>>> value
'34.13559322033898'
>>>
Read in each line of the file and process the lines individually as above.
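Put together, a small sketch of that loop (the file name is an assumption):
d = {}
with open('animals.txt') as f:
    for x in f:
        x = x.strip()
        if ':' not in x:
            continue
        key_index = x.find(':')
        animal = x[:key_index].split('_')[0]  # 'dogs_3351.txt' -> 'dogs'
        d.setdefault(animal, []).append(x[key_index + 1:])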
Without regex and using defaultdict:
from collections import defaultdict
data = """dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478"""
dictionary = defaultdict(list)
for l in data.splitlines():
    animal = l.split('_')[0]
    number = l.split(':')[-1]
    dictionary[animal].append(number)
Just make sure your data is well formatted

Importing data from a text file using python

I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters.
The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8 and the last one is 5 - but again, some are spaces).
What I want to do is import the entries (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.
Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:
import struct
def parsefile(filename):
with open(filename) as myfile:
for line in myfile:
line = line.rstrip('\n')
fields = struct.unpack('11s11s8s8s5s', line)
if 'OW' in fields[1]:
yield (int(fields[3]), int(fields[4]))
Usage:
if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print field
Test data:
1234567890a1234567890a123456781234567812345
something maybe OW d 111111118888888855555
aaaaa bbbbb 1234 1212121233333
other thinganother OW 121212 6666666644444
Output:
(88888888, 55555)
(66666666, 44444)
In Python you can extract a substring at known positions using a slice - this is normally done with the list[start:end] syntax. However you can also create slice objects that you can use later to do the indexing.
So you can do something like this:
columns = [slice(11,22), slice(30,38), slice(38,44)]
myfile = open('some/file/path')
for line in myfile:
    fields = [line[column].strip() for column in columns]
    if "OW" in fields[0]:
        value1 = int(fields[1])
        value2 = int(fields[2])
        ....
Separating out the slices into a list makes it easy to change the code if the data format changes, or you need to do stuff with the other fields.
Here's a function which might help you:
def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size:  # EOF
                return
            row[key] = value
        yield row
for an example of how it's used:
from StringIO import StringIO
sample = StringIO("""aaabbbccc
d e f
g h i
""")
for row in rows(sample, [('first', 3),
                         ('second', 3),
                         ('third', 4)]):
    print repr(row)
Note that unlike the other answers, this example is not line-delimited (it uses the file purely as a provider of bytes, not an iterator of lines), since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either; the newline is taken into account specifically.
You can test if one string is a substring of another with the 'in' operator. For example,
>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True
So in this case, you might do
if 'OW' in row['third']:
    stuff()
but you can obviously test any field for any value as you see fit.
entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])
for num1, num2 in entries:
    pass  # whatever
entries = []
with open('my_file.txt', 'r') as f:
    for line in f.read().splitlines():
        line = line.split()
        if line[1].find('OW') >= 0:
            entries.append((int(line[-2]), int(line[-1])))
entries is a list containing tuples of the last two entries
