Python regular expression for re.findall

I am using findall to separate text.
I started with the expression r'(.*?)(\$.*?\$)' but it doesn't give me the data after the last piece of text found; I'm missing the '6\n\n'.
How do I get the last piece of text?
Here is my python code:
#!/usr/bin/env python
import re
allData = '''
1
2
3 here Some text in here
$file1.txt$
4 Some text in here and more $file2.txt$
5 Some text $file3.txt$ here
$file3.txt$
6
'''
for record in re.findall(r'(.*?)(\$.*?\$)|(.*?$)',allData,flags=re.DOTALL) :
    print repr(record)
The output I get for this is:
('\n1\n2\n3 here Some text in here \n', '$file1.txt$', '')
('\n4 Some text in here and more ', '$file2.txt$', '')
('\n5 Some text ', '$file3.txt$', '')
(' here \n', '$file3.txt$', '')
('', '', '\n6\n')
('', '', '')
('', '', '')
I really would like this output:
('\n1\n2\n3 here Some text in here \n', '$file1.txt$')
('\n4 Some text in here and more ', '$file2.txt$')
('\n5 Some text ', '$file3.txt$')
(' here \n', '$file3.txt$')
('\n6\n', '')
Background info in case you need to see the larger picture.
In case you are interested, I'm re-writing this in Python. I have the rest of the code under control; I am just getting too much stuff out of findall.
https://discussions.apple.com/message/21202021#21202021

If I understand correctly from that Apple link you want to do something like:
import re
allData = '''
1
2
3 here Some text in here
$file1.txt$
4 Some text in here and more $file2.txt$
5 Some text $file3.txt$ here
$file3.txt$
6
'''
def read_file(m):
    return open(m.group(1)).read()
# Sloppy matching :D
# print re.sub("\$(.*?)\$", read_file, allData)
# More precise.
print re.sub("\$(file\d+?\.txt)\$", read_file, allData)
EDIT: As Oscar suggests, the match has been made more precise.
i.e. take the filename between the $s and read that file for the data, which is what the above does.
Example output:
1
2
3 here Some text in here
I'am file1.txt
4 Some text in here and more
I'am file2.txt
5 Some text
I'am file3.txt
here
I'am file3.txt
6
Files:
==> file1.txt <==
I'am file1.txt
==> file2.txt <==
I'am file2.txt
==> file3.txt <==
I'am file3.txt

To achieve the output you want, you need to restrict your pattern to 2 capture groups (if you use 3 capture groups, you will have 3 elements in every "record").
You could make the second group optional, that should do the job:
r'([^$]*)(\$.*?\$)?'
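A minimal sketch of how that looks against the allData string from the question (assuming the import re from above; the trailing all-empty tuple that findall emits at the end of the string is skipped):
for record in re.findall(r'([^$]*)(\$.*?\$)?', allData):
    if any(record):  # drop the empty match produced at the end of the string
        print repr(record)
This should print something like:
('\n1\n2\n3 here Some text in here \n', '$file1.txt$')
('\n4 Some text in here and more ', '$file2.txt$')
('\n5 Some text ', '$file3.txt$')
(' here \n', '$file3.txt$')
('\n6\n', '')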

Here's one way to solve your substitution problem with findall.
def readfile(name):
    with open(name) as f:
        return f.read()
r = re.compile(r"\$(.+?)\$|(\$|[^$]+)")
print "".join(readfile(filename) if filename else text
              for filename, text in r.findall(allData))

This one partly solves your problem:
import re
allData = '''
1
2
3 here Some text in here
$file1.txt$
4 Some text in here and more $file2.txt$
5 Some text $file3.txt$ here
$file3.txt$
6
'''
for record in re.findall(r'(.*?)(\$.*?\$)|(.*?$)',allData.strip(),flags=re.DOTALL) :
    print [ x for x in record if x]
producing output
['1\n2\n3 here Some text in here \n', '$file1.txt$']
['\n4 Some text in here and more ', '$file2.txt$']
['\n5 Some text ', '$file3.txt$']
[' here \n', '$file3.txt$']
['\n6']
[]
Avoid the last empty list with:
for record in re.findall(r'(.*?)(\$.*?\$)|(.*?$)',allData.strip(),flags=re.DOTALL) :
    if [ x for x in record if x] != []:
        print [ x for x in record if x]

Related

How to add line numbers to multiline string

I have a multiline string like the following:
txt = """
some text
on several
lines
"""
How can I print this text such that each line starts with a line number?
I usually use a regex substitution with a function attribute:
import re

def repl(m):
    repl.cnt += 1
    return f'{repl.cnt:03d}: '

repl.cnt = 0
print(re.sub(r'(?m)^', repl, txt))
Prints:
001:
002: some text
003:
004: on several
005:
006: lines
007:
Which allows you to easily number only lines that have text:
def repl(m):
    if m.group(0).strip():
        repl.cnt += 1
        return f'{repl.cnt:03d}: {m.group(0)}'
    else:
        return '(blank)'

repl.cnt = 0
print(re.sub(r'(?m)^.*$', repl, txt))
Prints:
(blank)
001: some text
(blank)
002: on several
(blank)
003: lines
(blank)
This can be done with a combination of split("\n"), join("\n"), enumerate and a list comprehension:
def insert_line_numbers(txt):
return "\n".join([f"{n+1:03d} {line}" for n, line in enumerate(txt.split("\n"))])
print(insert_line_numbers(txt))
It produces the output:
001
002 some text
003
004 on several
005
006 lines
007
I did it like this: simply break the text into lines, add a line number, and use format() to print the line number and the string, with two placeholders for the '.' and the space after it.
count = 1
txt = '''Text
on
several
lines'''
txt = txt.splitlines()
for t in txt:
print("{}{}{}{}".format(count,"."," ",t))
count += 1
Output
1. Text
2. on
3. several
4. lines
for n, i in enumerate(txt.rstrip().split('\n')):
    print(n, i)
0
1 some text
2
3 on several
4
5 lines
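If you want the numbering to start at 1 instead of 0, enumerate also accepts a start argument:
for n, i in enumerate(txt.rstrip().split('\n'), start=1):
    print(n, i)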

Confused on how to extract multiple data values with a single regular expression

I have a text file with text that looks like this:
(49) Sat Jun/30 21:00 Uruguay 2-1 (1-0) Portugal # Fisht Stadium, Sochi (UTC+3)
[Edinson Cavani 7', 62'; Pepe 55']
I have to implement a generator function that extracts data from each line of text according to these rules:
- read from the input file the 2 lines of text for one game
- use 2 regular expressions (1 for each line) to extract the data (color coded here) for one Game object
- pass back the data (note that the extracted data are passed back; don't pass back a Game object)
I can extract 1 data value, but I'm having trouble extracting 5 data values (Game Number, Country 1/2, Score 1/2) from a line of text with just one regular expression.
Try this:
import re
with open('FileName.txt','r') as f:
    print([{'Game Number': line.split()[0], 'Country 1': line.split()[4], 'Country 2': line.split()[7],
            'Score 1': line.split()[5], 'Score 2': line.split()[6]}
           if not line[0] == '['
           else {'Player Name': [re.sub('[^a-zA-Z\s+]', '', i).strip() for i in line.split(';')],
                 'Minute': [[int(re.sub('\D', '', x)) for x in i.split()
                             if any(a for a in x if a.isdigit())] for i in line.split(';')]}
           for line in f])
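Alternatively, a single regular expression with five capture groups can pull the values straight out of the result line, which is closer to what the assignment asks for. A minimal sketch (the pattern is illustrative and tuned to the sample line shown above):
import re

line = "(49) Sat Jun/30 21:00 Uruguay 2-1 (1-0) Portugal # Fisht Stadium, Sochi (UTC+3)"

# groups: game number, country 1, score 1, score 2, country 2
game_re = re.compile(r'\((\d+)\)\s+\S+\s+\S+\s+\S+\s+(.+?)\s+(\d+)-(\d+)\s+\(\d+-\d+\)\s+(.+?)\s+#')

m = game_re.match(line)
if m:
    game_no, country1, score1, score2, country2 = m.groups()
    print(game_no, country1, score1, score2, country2)  # 49 Uruguay 2 1 Portugal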

Python print .psl format without quotes and commas

I am working on a Linux system using Python 3 with a file in .psl format, which is common in genetics. This is a tab-separated file that contains some cells with comma-separated values. A small example file with some of the features of a .psl is below.
input.psl
1 2 3 x read1 8,9, 2001,2002,
1 2 3 mt read2 8,9,10 3001,3002,3003
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
I need to filter this file to extract only regions of interest. Here, I extract only rows with a value of 9 in the fourth column.
import csv

def read_psl_transcripts():
    psl_transcripts = []
    with open("input.psl") as input_psl:
        csv_reader = csv.reader(input_psl, delimiter='\t')
        for line in csv_reader:
            # Extract only rows matching the chromosome of interest
            if '9' == line[3]:
                psl_transcripts.append(line)
    return psl_transcripts
I then need to be able to print or write these selected lines in a tab-delimited format matching the format of the input file, with no additional quotes or commas added. I can't seem to get this part right; additional brackets, quotes and commas are always added. Below is an attempt using print().
outF = open("output.psl", "w")
for line in read_psl_transcripts():
    print(str(line).strip('"\''), sep='\t')
Any help is much appreciated. Below is the desired output.
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
You might be able to solve your problem with a simple awk statement:
awk '$4 == 9' input.psl > output.psl
But with Python you could solve it like this:
write_psl = open("output.psl", "w")
with open("input.psl") as file:
    for line in file:
        splitted_line = line.split()
        if splitted_line[3] == '9':
            out_line = '\t'.join(splitted_line)
            write_psl.write(out_line + "\n")
write_psl.close()
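Since the question already uses the csv module, another option is csv.writer with a tab delimiter, which writes the rows back out without adding quotes or brackets. A small sketch building on the read_psl_transcripts() function from the question:
import csv

with open("output.psl", "w", newline='') as outF:
    writer = csv.writer(outF, delimiter='\t')
    # each row is a list of fields; they are joined with tabs, no extra quoting
    writer.writerows(read_psl_transcripts())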

How can I output the file name with its word content in this format in Python?

Say I have a file test.txt containing:
1:text1.txt
2:text2.txt
text1.txt contains:
I am a good person
text2.txt contains:
Bla bla
I would like to output :
I 1
Bla 2
am 1
bla 2
good 1
a 1
person 1
In other words, I want to output the file index with each word in the file. My code is ugly and far from the solution; I'm new to Python, so please be nice. There is no specified order for the output; the sample output above is in arbitrary order, just to give you an idea of what I'm looking for.
This is my code:
`with open("text.txt", "r") as f:
text=f.readlines()
for line in text:
splitted=line.split(":")
splitsplit=splitted[1].split("\n")
files=splitsplit[0]
splittedindicies=splitted[0].split("\n")
indicies=splittedindicies[0]
print indicies[0]
files_list=list(files)
files_l=files.split(" ")
for x in files_l:
fileshandle=open(x,"r")
read=fileshandle.readlines()
for y in read:
words=y.split(" ")
words.sort()
for j in words:
print j `
My output is:
1
I
am
a
good
person
2
Bla
bla
Again, please be nice; I'm an R programmer dealing with Python for the first time.
You could try a regex recipe here.
You asked in a comment:
how can I store the output?
Your output is in the values of the dict, so you can do further operations on them.
import re

track = {}
pattern = r'(\d):?(\w+\.txt)'

with open('test.txt','r') as file_name:
    for line in file_name:
        match = re.finditer(pattern, line)
        for finding in match:
            with open(finding.group(2)) as file_name_2:
                for item in file_name_2:
                    track[int(finding.group(1))] = item.split()

for key, value in track.items():
    for item in value:
        print(key, item)
output:
1 I
1 am
1 a
1 good
1 person
2 Bla
2 bla
Since the order of the words does not matter, why don't you just process the files in the order they appear in test.txt? There are a couple errors in your code, the first one on line 3 where you overwrite the content of splitted. I'm also particularly confused by your call to sort.
Anyway, here's one way to do it.
>>> with open('test.txt') as filenames:
...     for line in filenames:
...         file_no, filename = line.strip().split(':')
...         with open(filename) as f:
...             for line in f:
...                 for word in line.split():
...                     print '{} {}'.format(word, file_no)
...
I 1
am 1
a 1
good 1
person 1
Bla 2
bla 2

Separate lines in Python

I have a .txt file with 3 different columns. The first one is just numbers. The second one is numbers that start at 0 and go up to 7. The final one is a sentence-like string. I want to keep them in different lists so I can match them by their numbers, and I want to write a function for this. How can I separate them into different lists without disrupting them?
An example of the .txt:
1234 0 my name is
6789 2 I am coming
2346 1 are you new?
1234 2 Who are you?
1234 1 how's going on?
And I have to keep them like this:
----1----
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
----2----
2346 1 are you new?
----3-----
6789 2 I am coming
What I've tried so far:
inputfile=open('input.txt','r').read()
m_id=[]
p_id=[]
packet_mes=[]
input_file=inputfile.split(" ")
print(input_file)
input_file=line.split()
m_id=[int(x) for x in input_file if x.isdigit()]
p_id=[x for x in input_file if not x.isdigit()]
With your current approach, you are reading the entire file as a string and performing a split on whitespace (you'd much rather split on newlines instead, because each line is separated by a newline). Furthermore, you're not segregating your data into separate columns properly.
You have 3 columns. You can split each line into 3 parts using str.split(None, 2). The None means splitting on any whitespace. Each group will be stored as key-list pairs inside a dictionary. Here I use an OrderedDict in case you need to maintain order, but you can just as easily declare o = {} as a normal dictionary with the same grouping (but no order!).
from collections import OrderedDict

o = OrderedDict()
with open('input.txt', 'r') as f:
    for line in f:
        i, j, k = line.strip().split(None, 2)
        o.setdefault(i, []).append([int(i), int(j), k])

print(dict(o))
{'1234': [[1234, 0, 'my name is'],
[1234, 2, 'Who are you?'],
[1234, 1, "how's going on?"]],
'6789': [[6789, 2, 'I am coming']],
'2346': [[2346, 1, 'are you new?']]}
Always use the with...as context manager when working with file I/O - it makes for clean code. Also, note that for larger files, iterating over each line is more memory efficient.
Maybe you want something like this:
import re

# Collect data from input file
h = {}
with open('input.txt', 'r') as f:
    for line in f:
        res = re.match("^(\d+)\s+(\d+)\s+(.*)$", line)
        if res:
            if not res.group(1) in h:
                h[res.group(1)] = []
            h[res.group(1)].append((res.group(2), res.group(3)))

# Output result
for i, x in enumerate(sorted(h.keys())):
    print("-------- %s -----------" % (i+1))
    for y in sorted(h[x]):
        print("%s %s %s" % (x, y[0], y[1]))
The result is as follows (add more newlines if you like):
-------- 1 -----------
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
-------- 2 -----------
2346 1 are you new?
-------- 3 -----------
6789 2 I am coming
It's based on regexes (the re module in Python). This is a good tool when you want to match simple line-based patterns.
Here it relies on spaces as column separators, but it can just as easily be adapted for fixed-width columns (see the sketch below).
The results are collected in a dictionary of lists, each list containing tuples (pairs) of position and text.
The items are sorted just before they are printed.
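For illustration, a minimal sketch of the fixed-width variant mentioned above (the column offsets are made up to fit the sample data; adjust them to the real layout):
# Hypothetical fixed-width layout: 4-character id, 1-character position, then the text.
with open('input.txt', 'r') as f:
    for line in f:
        key = line[0:4].strip()    # e.g. "1234"
        pos = line[5:6].strip()    # e.g. "0"
        text = line[7:].rstrip()   # e.g. "my name is"
        print(key, pos, text)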
The code below is quite ugly, but it's easy to understand.
raw = []
with open("input.txt", "r") as file:
    for x in file:
        raw.append(x.strip().split(None, 2))
raw = sorted(raw)

title = raw[0][0]
refined = []
cluster = []
for x in raw:
    if x[0] == title:
        cluster.append(x)
    else:
        refined.append(cluster)
        cluster = []
        title = x[0]
        cluster.append(x)
refined.append(cluster)

for number, group in enumerate(refined):
    print("-"*10 + str(number) + "-"*10)
    for line in group:
        print(*line)
