Python regex question

I'm having some problems figuring out a solution to this problem.
I want to read from a file on a per line basis and analyze whether that line has one of two characters (1 or 0). I then need to sum up the value of the line and also find the index value (location) of each of the "1" character instances.
so for example:
1001
would result in:
line 1=(count:2, pos:[0,3])
I tried a lot of variations of something like this:
r = urllib.urlopen(remote-resouce)
list = []
for line in lines:
    for m in re.finditer(r'1', line):
        list.append((m.start()))
I'm having two issues:
1) I thought that the best solution would be to iterate through each line and then use a regex finditer function. My issue here is that I keep failing to write a for loop that works. Despite my best efforts, I keep returning the results as one long list, rather than a multidimensional array of dictionaries.
Is this approach the right one? If so, how do I write the correct for loop?
If not, what else should I try?

Perhaps do it without regex:
import urllib
url = 'http://stackoverflow.com/questions/5158168/python-regex-question/5158341'
f = urllib.urlopen(url)
for linenum, line in enumerate(f):
    print(line)
    locations = [pos for pos, char in enumerate(line) if char == '1']
    print('line {n}=(count:{c}, pos:{l})'.format(
        n=linenum,
        c=len(locations),
        l=locations
    ))

Using regexes here is probably a bad idea. You can see if a 1 or 0 is in a line of text with '0' in line or '1' in line, and you can get the count with line.count('1').
Finding all of the locations of 1s does require iterating through the string, I believe.
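A minimal sketch of the idea above, assuming each line is a plain string of 0s and 1s. The helper name `ones_info` is made up for illustration; it uses `line.count('1')` for the total and repeated `str.find` calls to walk through the positions.

```python
def ones_info(line):
    """Return (count, positions) of '1' characters in a line."""
    positions = []
    start = line.find('1')
    while start != -1:  # str.find returns -1 once no further '1' exists
        positions.append(start)
        start = line.find('1', start + 1)
    return line.count('1'), positions


print(ones_info('1001\n'))  # (2, [0, 3])
```

The list comprehension over `enumerate(line)` shown in the other answer does the same job in one line; this version only makes the iteration explicit.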

Unubtu's code works fine. I tested it on a sample file which also has all 0's for a particular line. Here is the complete code -
#!/usr/bin/python

# Write a program to read a text file which has 1's and 0's on each line
# For each line count the number of 1's and their position and print it

import sys

def countones(infile):
    f = open(infile, 'r')
    for linenum, line in enumerate(f):
        locations = [pos for pos, char in enumerate(line) if char == '1']
        print('line {n}=(count:{c}, pos:{l})'.format(n=linenum, c=len(locations), l=locations))


def main():
    infile = './countones.txt'
    countones(infile)

# Standard boilerplate to call the main() function to begin the program
if __name__ == '__main__':
    main()
Input file -
1001
110001
111111
00001
010101
00000
Result -
line 0=(count:2, pos:[0, 3])
line 1=(count:3, pos:[0, 1, 5])
line 2=(count:6, pos:[0, 1, 2, 3, 4, 5])
line 3=(count:1, pos:[4])
line 4=(count:3, pos:[1, 3, 5])
line 5=(count:0, pos:[])
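If the regex route from the question is still wanted, the "one long list" problem probably comes from appending every match to a single list shared by all lines. A sketch that builds a fresh list per line and stores one dictionary per line; the function name `ones_per_line` and the dictionary keys are assumptions for illustration, not from the original posts:

```python
import re


def ones_per_line(lines):
    """Build one dict per line instead of one flat list."""
    results = []
    for line in lines:
        # A new list for every line keeps positions from different lines apart.
        positions = [m.start() for m in re.finditer(r'1', line)]
        results.append({'count': len(positions), 'pos': positions})
    return results


sample = ['1001\n', '110001\n', '00000\n']
for linenum, info in enumerate(ones_per_line(sample)):
    print('line {n}=(count:{c}, pos:{p})'.format(
        n=linenum, c=info['count'], p=info['pos']))
```

Any iterable of strings works as `lines`, including an open file object.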

Related

Python Regex - Reference first line on every match, until the start of a new group

Sample text:
This is HeaderA
Line 1
Line 2
Line 3
Line 4
Line 5
This is HeaderB
Line 1
Line 2
Intended result:
HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5
HeaderB1, HeaderB2
Regex Attempts:
(?:^This is (?P<H>HeaderB)\s) (Line (?P<L>\d)\s)*?
Matches only the Header 'H' and 1st 'L' Line
(?:^This is (?P<H>HeaderB)\s)? (Line (?P<L>\d)\s)*?
This manages to match multiple 'L' lines; however, only the first two lines belong to the same match, and the subsequent 'L' lines do not reference the Header capture group.
I tried other attempts to adjust the regex but ended up screwing up the expression. I have limited experience with regex, so I am not entirely sure if it is possible to get the desired output.

A mix of regex substitution and format.
It is assumed that below a Header you always have a Line i.
import re
text = """This is HeaderA
Line 1
Line 2
Line 3
Line 4
Line 5
This is HeaderB
Line 1
Line 2"""
ordered_matches = [] # global
def custom_match(m, all_matches=ordered_matches):
    p = m.group(0)
    if p.isdigit():
        all_matches[-1] += [p]
    else:
        all_matches += [[p]]
    return '' # doesn't matter
r = re.sub(r'([A-Z0-9]+)$', custom_match, text, flags=re.M)
for m in ordered_matches:
    print(('Header{}{{}} '.format(m[0]) * (len(m)-1)).format(*m[1:]))
Output
HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5
HeaderB1 HeaderB2

IIUC you're trying to combine the Header(A|B) with the integers in the following lines. With the given output, it's probably easier to work with simple split() operations instead of re.
for group in text.split('This is ')[1:]:
    header, *lines = group.splitlines()
    print(*[header+line.split()[-1] for line in lines])
Output:
HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5
HeaderB1 HeaderB2

Formatting output of CSV data?

I'm fairly new to Python and made something that had this output:
(The text is in a CSV file like so:
1,A
2,B
3,C etc)
Number Letter
1 A
2 B
3 C
26 Z
Unfortunately, I spent a good amount of time making it using a complicated method in which I manually made spaces like this:
Updated code:
fx = int(input('Number?\n'))
f = open('nums.txt', 'r')
lines = f.readlines()
line = lines[fx - 1]
with open('nums.txt', 'r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        NUM, LTR, SMB = line.rsplit(',', 1)
        print(NUM.ljust(13) + LTR.ljust(13) + SMB)
How do I get it to make 3 columns? Right now it comes up with a
ValueError: not enough values to unpack (expected 3, got 2)
So is there a simpler method of achieving this that doesn't move the strings around like this:
Number Letter
1 A
2 B
3 C
26 Z #< string moves with spaces.

For simple alignment, you can use ljust or rjust. There is also no need to read the entire file for each line you want to process:
with open('numberletter', 'r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        number, letter = line.rsplit(',', 1)
        print(number.ljust(13) + letter)
For more complex output formatting, look at str.format() and the formatting syntax
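As a rough illustration of the `str.format()` route, the sketch below pads the first column to a fixed width so the second column always starts in the same place. The helper `format_row`, the sample rows, and the width of 13 (borrowed from the `ljust(13)` call above) are assumptions for illustration.

```python
rows = [('1', 'A'), ('2', 'B'), ('26', 'Z')]


def format_row(number, letter, width=13):
    # '<' left-aligns the value and pads it with spaces up to `width`.
    return '{:<{w}}{}'.format(number, letter, w=width)


print(format_row('Number', 'Letter'))
for number, letter in rows:
    print(format_row(number, letter))
```

Because the padding depends only on `width`, a two-digit number such as 26 no longer pushes its letter out of line.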

You can use the sys module for that.
import sys
a=[1,"A"]
sys.stdout.write("%-6s %-50s " % (a[0],a[1]))

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
    # open all 4 files with a meaningful name
    file=[open(outputfile.txt","w")
    for line in a:

Without regex:
for line in file:
    keep = []
    line = line.strip()
    if line.startswith('FRAME'):
        continue
    first, second, *_ = line.split()
    keep.append(first)
    first, second = second.split('=')
    keep.extend(second.split(','))
    print(' '.join(keep))

My advice? Since I don't write many regexes, I avoid writing big ones all at once. Since you've already done that, I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()
    for line in a.readlines():
        result = r.match(line)
        if result:
            print (result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.

Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
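To make that concrete, here is a small sketch that applies the pattern to the sample lines from the question. The helper name `parse_frame_line` and the leading `\s*` (to tolerate indentation) are additions for illustration, not part of the explanation above.

```python
import re

# Row number, then the two values that follow "J=" separated by a comma.
pattern = re.compile(r'\s*(\d+)\s+J=(\d+),(\d+)')


def parse_frame_line(line):
    """Return 'n a b' for a data line, or None for headers such as FRAME."""
    match = pattern.match(line)
    if match is None:
        return None
    return ' '.join(match.groups())


sample = [
    'FRAME',
    '1 J=1,8 SEC=CL1 NSEG=2 ANG=0',
    '2 J=8,15 SEC=CL2 NSEG=2 ANG=0',
]
for line in sample:
    parsed = parse_frame_line(line)
    if parsed is not None:
        print(parsed)
```

The rest of each line (SEC, NSEG, ANG) is simply never matched, so it does not need its own groups.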

Find and copy a line using regex in Python

I am new to this forum and to programming and apologize in advance if I violate any of the forum rules. I have researched this extensively, but I couldn't find a solution for my problem.
So I have a very long file that has this general structure:
data="""
20.020001 563410 9
20.520001 577410 20
21.022001 591466 9
21.522001 605466 120
23.196001 652338 2
25.278001 710634 7
25.780001 724690 144
26.280001 738690 9
26.782001 752746 40
27.282001 766746 9
27.784001 780802 140
29.372001 825266 2
31.458001 883674 7
31.958002 897674 8
32.458002 911674 9
32.958002 925674 10
"""
I imported the file using
with open("C:\blablabla\text.txt", 'r+') as infile:
    data = infile.read()
Now I am trying to use a regular expression to find all lines that end with 140 through 146, so I did this:
items = re.findall('.......................14[0-6]\n', data, re.MULTILINE)
for x in items:
    print x
This works, but when I now try to copy those lines that contain the regular expression,
for x in items:
    if items in data:
        data.write(items)
I get the following error:
if items in data:
TypeError: 'in <string>' requires string as left operand, not list
I understand what the problem is, but I don't know how to solve it. How can I feed the left operand a string when the outcome of my regex is a list?
Any help is much appreciated!

You should simply handle each line separately:
data = infile.readlines()
for line in data:
    if re.match('.......................14[0-6]\n', line):
        print line[:-1]
The last character of the line is a trailing newline, which would be duplicated by the one the print statement includes.

You can read the file line by line:
data = ""
with open("file.txt", 'r+') as infile:
    for line in infile:
        if (146 >= int(line.split()[-1]) >= 140):
            data = data + line
print data

Your regex can be simplified further:
re.findall('.*?14[0-6]\n', data)
To overcome your further problems:
items = re.findall('.*?14[0-6]\n', data)
result = ""
for x in items:
    result += str(x)
print result
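One caveat with a loose pattern like this is that it would also accept a last column such as 1140, because nothing pins down where the number starts. A hedged variant, assuming the columns are separated by spaces or tabs and lines have no trailing whitespace; the helper name `lines_ending_140_to_146` and the 1140 row in the sample are made up for illustration:

```python
import re


def lines_ending_140_to_146(data):
    """Return the lines whose last column is a number from 140 to 146."""
    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    # [ \t] requires a column separator directly before the 14x value.
    return re.findall(r'^.*[ \t]14[0-6]$', data, re.MULTILINE)


data = (
    "25.278001 710634 7\n"
    "25.780001 724690 144\n"
    "27.784001 780802 140\n"
    "28.000001 790000 1140\n"  # hypothetical row, not in the original data
)
for line in lines_ending_140_to_146(data):
    print(line)
```

The result is a list of strings, so each element can be written to an output file individually instead of passing the whole list at once.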

Matching and displaying specific lines through Python

I have 15 lines in a log file and I want to read the 4th and 10th lines, for example, through Python and display them in the output, saying that this string was found:
abc
def
aaa
aaa
aasd
dsfsfs
dssfsd
sdfsds
sfdsf
ssddfs
sdsf
f
dsf
s
d
Please suggest, through code, how to achieve this in Python.
Just to elaborate on this example: the first string (String A) is unique and can be found easily in the log file. The next string, String B, comes within 40 lines of the first one, but it also occurs in lots of other places in the log file. So I need to look for String B only within the 40 lines after String A and print that these strings were found.
Also, I can't use Python's with statement, as it gives me errors like "'with' will become a reserved keyword in Python 2.6". I am using Python 2.5.

You can use this:
fp = open("file")
for i, line in enumerate(fp):
    if i == 3:
        print line
    elif i == 9:
        print line
        break
fp.close()

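def bar(start, end, search_term):
    with open("foo.txt") as fil:
        # Slice with start:end and strip newlines so whole lines compare equal
        lines = [line.strip() for line in fil.readlines()[start:end]]
        if search_term in lines:
            print search_term + " has been found"
>>> bar(4, 10, "dsfsfs")
dsfsfs has been found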

#list of random characters
from random import randint
a = list(chr(randint(0,100)) for x in xrange(100))
#look for this
lookfor = 'b'
for element in xrange(100):
    if lookfor == a[element]:
        print a[element], 'on', element
#b on 33
#b on 34
is one easy to read and simple way to do it. Can you give part of your log file as an example? There are other ways that may work better :).
after edits by author:
The easiest thing you can do then is:
looking_for = 'findthis'
i = 1
for line in open('filename.txt', 'r'):
    # Strip the trailing newline, otherwise the comparison never succeeds
    if looking_for == line.rstrip('\n'):
        print i, line
    i += 1
it's efficient and easy :)
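For the elaborated version of the question (a unique String A, followed by String B somewhere in the next 40 lines), one possible sketch works on a list of lines that has already been read. The function name `find_pair`, its parameters, and the sample list are assumptions for illustration; it is written in current Python syntax rather than Python 2.5.

```python
def find_pair(lines, first, second, window=40):
    """Return (index of first, index of second), or None if not found.

    `second` only counts if it appears within `window` lines after `first`.
    """
    for i, line in enumerate(lines):
        if first in line:
            # Only inspect the lines that directly follow the unique marker.
            for j in range(i + 1, min(i + 1 + window, len(lines))):
                if second in lines[j]:
                    return i, j
            return None  # `first` is unique, so scanning further is pointless
    return None


log = ['abc', 'def', 'aaa', 'aaa', 'aasd', 'dsfsfs']
print(find_pair(log, 'def', 'aasd'))  # (1, 4)
```

The lines can come from a plain `open()` / `readlines()` / `close()` sequence, so the `with` statement is not needed to use this approach.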
