How to use regex to escape some info in python? - python

I have a text file that I read using Python. It starts with a web address, followed by lines that start with (y) or (n). There may be a few blank lines between them. For example, the text file can look like this:
http://usatoday30.usatoday.com/money/industries/energy/2005-12-28-enron-participants_x.htm
(y) Lay, Kenneth
(y) Skilling, Jeffrey
(n) Howard, Kevin
(n) Krautz, Michael
I would like to get the names on the lines that start with (y), returned as a list. For this example, the returned list would be:
result = ["Lay, Kenneth", "Skilling, Jeffrey"]
I read the data as follows:
poi_names_data = open("../final_project/poi_names.txt", "r")
for row in poi_names_data:
    print row, "\n"
How can I extract the right information from each row?

As suggested in the comments, you can use startswith to decide whether to process the row, and re.sub to remove (y), the whitespace after it, and the line break \n. That gives the expected output:
import re
result = []
with open("test.txt") as text:
    for row in text:
        if row.startswith("(y)"):
            result.append(re.sub(r"\(y\)\s+|\n", "", row))
print(result)
# ['Lay, Kenneth', 'Skilling, Jeffrey']

I'd recommend reading the file line by line and processing each line as you go. If the file is very large, this performs better and uses less memory than reading it all at once.
import re
result = []
# the lookbehind keeps "(y)" out of the match; the rest of the line is captured
rx = re.compile(r'(?<=\(y\)).*')
with open('data.txt', 'r') as f:
    for line in f:
        match = rx.search(line)
        if match:
            result.append(match.group(0).strip())
print(result)
I get the following output from your sample data (assuming the data is stored in the file data.txt):
['Lay, Kenneth', 'Skilling, Jeffrey']
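If the file may also contain indented rows or blank lines, the same result can be obtained without a regex. This is only a sketch: the function name `names_flagged_yes` and the in-memory `sample` list are illustrative assumptions, not part of the answers above.

```python
def names_flagged_yes(lines):
    """Return the text that follows a leading "(y)" marker on each line."""
    result = []
    for row in lines:
        row = row.strip()              # drop surrounding whitespace and the newline
        if row.startswith("(y)"):      # blank lines, the URL and "(n)" rows are skipped
            result.append(row[len("(y)"):].strip())
    return result


# Stand-in for the file contents; a real file object can be passed the same way.
sample = [
    "http://usatoday30.usatoday.com/money/industries/energy/2005-12-28-enron-participants_x.htm\n",
    "\n",
    "(y) Lay, Kenneth\n",
    "(y) Skilling, Jeffrey\n",
    "(n) Howard, Kevin\n",
    "(n) Krautz, Michael\n",
]
print(names_flagged_yes(sample))
```

Because a file object yields its lines when iterated, `names_flagged_yes(open(path))` should work the same way as with the list used here.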

Related

'regular expression in <string>' requires string as left operand, not list

I am new to python and I don't seem to find why the second script
does not work when using regular expressions.
Use case:
I want to extract entries starting with "crypto map IPSEC xx ipsec-isakmp" from a
Cisco running configuration file and print this line and the next 4.
I have managed to print the lines after the match but not the matched line itself.
My workaround for this is to print the text "crypto map IPSEC" statically first.
The script will then print me the next 4 lines using "islice".
As this is not perfect I wanted to use regular expression. This does not work at all.
>>>>>>
from itertools import islice
import re
#This works
print('Crypto map configurations: \n')
with open('show_run.txt', 'r') as f:
    for line in f:
        if 'crypto map IPSEC' and 'ipsec-isakmp' in line:
            print('crypto map IPSEC')
            print(''.join(islice(f, 4)))
    f.close()
# The following does not work.
# Here I would like to use regular expressions to fetch the lines
# with "crypto map IPSEC xx ipsec-isakmp"
#
'''
print('Crypto map configurations: \n')
with open('show_run.txt', 'r') as f:
    for line in f:
        pattern = r"crypto\smap\sIPSEC\s\d+\s.+"
        matched = re.findall(pattern, line)
        if str(matched) in line:
            print(str(matched))
            print(''.join(islice(f, 4)))
    f.close()
'''
if 'crypto map IPSEC' and 'ipsec-isakmp' in line:
should be:
if 'crypto map IPSEC' in line and 'ipsec-isakmp' in line:
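The original condition misbehaves because `in` binds more tightly than `and`: Python reads it as `'crypto map IPSEC' and ('ipsec-isakmp' in line)`, and a non-empty string literal is always truthy, so only the second test decides the result. A small illustration follows; both sample lines and the helper names `buggy` and `fixed` are made up for the demonstration.

```python
line_a = "crypto map IPSEC 10 ipsec-isakmp"
line_b = "some other line mentioning ipsec-isakmp"   # lacks "crypto map IPSEC"


def buggy(line):
    # Parsed as: 'crypto map IPSEC' and ('ipsec-isakmp' in line)
    return bool('crypto map IPSEC' and 'ipsec-isakmp' in line)


def fixed(line):
    # Each substring is tested against the line separately.
    return 'crypto map IPSEC' in line and 'ipsec-isakmp' in line


print(buggy(line_a), buggy(line_b))
print(fixed(line_a), fixed(line_b))
```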
Another alternative (if the line looks like what you described in the question; rstrip() is needed because a line read from a file still ends with a newline):
if line.startswith('crypto map IPSEC') and line.rstrip().endswith('ipsec-isakmp'): ...
And in:
print(''.join(islice(f, 4)))
You probably want to parse the line not f.
As for your question about regex: there is no need for one here (see the previous parts of this answer), since a regex is typically slower than plain string methods and often harder to maintain. That said, if this question is for learning, you can do:
import re
line = 'crypto map IPSEC 12345 ipsec-isakmp'
pattern = r'crypto map IPSEC (\d+) ipsec-isakmp'
matched = re.findall(pattern, line)
if matched:
    print(matched[0])
I want to extract entries starting with "crypto map IPSEC xx ipsec-isakmp" from a Cisco running configuration file and print this line and the next 4.
Then you're making it much more complicated than it has to be:
for line in f:
    if line.startswith("crypto map IPSEC") and "ipsec-isakmp" in line:
        print(line.strip())
        for i in range(4):
            try:
                print(next(f).strip())
            except StopIteration:
                # we've reached the end of the file and there weren't 4 lines left
                # after the last "crypto map IPSEC" line
                break
nb: if you really insist on using regexps, replace the second line with
if re.match(r"^crypto map IPSEC \d+ ipsec-isakmp", line):
(assuming this is the correct pattern of course - hard to tell for sure without seeing your real data)
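For reference, here is a sketch that combines the regex with `islice` so that the matched line and the following four lines are collected together. It rests on assumptions: the helper name `crypto_map_blocks` is hypothetical, and the sample configuration is invented purely so the example can run without a file.

```python
import re
from itertools import islice

CRYPTO_MAP = re.compile(r"crypto map IPSEC \d+ ipsec-isakmp")


def crypto_map_blocks(lines, following=4):
    """Yield each matching line together with the lines that follow it."""
    it = iter(lines)
    for line in it:
        if CRYPTO_MAP.match(line):
            block = [line.rstrip("\n")]
            # islice pulls up to `following` more lines from the same iterator,
            # so the outer loop resumes after the block.
            block.extend(nxt.rstrip("\n") for nxt in islice(it, following))
            yield block


# Invented sample input standing in for show_run.txt
sample = """\
hostname demo
crypto map IPSEC 10 ipsec-isakmp
 set peer 192.0.2.1
 set transform-set TS
 match address 101
 set pfs group2
interface Gi0/0
""".splitlines(keepends=True)

for block in crypto_map_blocks(sample):
    print("\n".join(block))
```

If fewer than four lines remain after a match, `islice` simply stops at the end of the input, so no `StopIteration` handling is needed in this variant.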

Can't replace string with variable

I came up with the below which finds a string in a row and copies that row to a new file. I want to replace Foo23 with something more dynamic (i.e. [0-9], etc.), but I cannot get this, or variables or regex, to work. It doesn't fail, but I also get no results. Help? Thanks.
with open('C:/path/to/file/input.csv') as f:
    with open('C:/path/to/file/output.csv', "w") as f1:
        for line in f:
            if "Foo23" in line:
                f1.write(line)
Based on your comment, you want to match lines whenever any three letters followed by two digits are present, e.g. foo12 and bar54. Use regex!
import re
pattern = r'([a-zA-Z]{3}\d{2})\b'
for line in f:
    if re.findall(pattern, line):
        f1.write(line)
This will match lines like 'some line foo12' and 'another foo54 line', but not 'a third line foo' or 'something bar123'.
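The four example lines can be checked directly. This is a minimal sketch that uses an in-memory list (the name `examples` is an assumption) in place of the input file:

```python
import re

pattern = re.compile(r'([a-zA-Z]{3}\d{2})\b')

examples = [
    'some line foo12',
    'another foo54 line',
    'a third line foo',
    'something bar123',
]

# Keep only the lines in which the pattern occurs at least once.
kept = [line for line in examples if pattern.search(line)]
print(kept)
```

`pattern.search` is used here instead of `re.findall` because only a yes/no answer per line is needed; either works for the filtering step.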
Breaking it down:
pattern = re.compile(r'''
    (            # start capture group, not needed here, but nice if you want the actual match back
    [a-zA-Z]{3}  # any three letters in a row, any case
    \d{2}        # any two digits
    )            # end capture group
    \b           # a word boundary (e.g. before white space, punctuation or the end of the line)
''', re.VERBOSE)
If all you really need is to write all of the matches in the file to f1, you can use:
matches = re.findall(pattern, f.read()) # finds all matches in f
f1.write('\n'.join(matches)) # writes each match to a new line in f1
In essence, your question boils down to: "I want to determine whether the string matches pattern X, and if so, output it to the file". The best way to accomplish this is to use a reg-ex. In Python, the standard reg-ex library is re. So,
import re
matches = re.findall(r'([a-zA-Z]{3}\d{2})', line)
Combining this with file IO operations, we have:
data = []
with open('C:/path/to/file/input.csv', 'r') as f:
    data = list(f)
# keep only the lines that contain a match
data = [x for x in data if re.findall(r'([a-zA-Z]{3}\d{2})\b', x)]
with open('C:/path/to/file/output.csv', 'w') as f1:
    for line in data:
        f1.write(line)
Notice that I split up your file IO operations to reduce nesting. I also moved the filtering outside of your IO. In general, each portion of your code should do "one thing" for ease of testing and maintenance.

Concatenate lines with previous line based on number of letters in first column

New to coding and trying to figure out how to fix a broken csv file so that I can work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that some notes contain newlines, and when exporting the csv the tooling does not wrap the field in quotation marks to mark it as a single string.
See the example below:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re
with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line #keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line): #regex to match 3 letters
            print(previous.join(line))
Script shows no output just finishes silently, any thoughts?
I think I would go a slightly different way:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub(r"\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and ;, take off the newline before appending the line to all_the_data. That way you don't have to look back 'up' the file. Note that the header line does not end in a date either, so with this approach it is joined to the first data row; skip it or handle it separately if that matters. Again, there are lots of ways to do this. If you would rather use the logic of "starts with 3 letters and a ;" and look back, this works:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub(r"\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
print("results:")
print(all_the_data)
Pretty much what was asked for. The logic is: if the current line doesn't start the right way, take out the previous line's newline from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches every line in the file, because the continuation lines also start with three letters. The if condition is therefore never true, and nothing is printed.
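A quick check makes this visible. The sketch below compares the pattern from the question with a stricter one that also requires the `;` field separator after the three letters; the variable names `loose` and `strict` are illustrative.

```python
import re

record = "tnn;125;3;I am writing a comment"
continuation = "that contains new lines"

loose = re.compile(r'^[A-Za-z]{3}')      # pattern from the question
strict = re.compile(r'^[A-Za-z]{3};')    # additionally requires the separator

for text in (record, continuation):
    # Print whether each pattern matches at the start of the text.
    print(repr(text), bool(loose.match(text)), bool(strict.match(text)))
```

With the stricter pattern, only lines that really begin with a three-letter username followed by `;` count as the start of a record.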
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            # a new record starts: print the previous one, joined with spaces
            print(' '.join(join_words))
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    print(' '.join(join_words))
I've tried not to use regex here to keep it a little clearer, though a regex may be the better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter concatenates a line to the previous one if its 4th character is not a semicolon (;). The code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3:4] == ';':  # slicing avoids an IndexError on short lines
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget last line!
You could then use:
import csv
with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=';')
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time the next function is applied to it, so a generator is appropriate.
But this is only a workaround; the correct fix would be for the previous step to produce a valid CSV file directly.

Print out lines that begin with two different string outputs?

I am trying to scan an input file and print out parts of lines that begin with a certain string. The text file is 10000+ lines, but I am only concerned with the beginning line, and more specifically the data within that line. For clarification, here are two lines of code which explain what I am trying to say.
inst "N69" "IOB",placed BIOB_X11Y0 R8 ,
inst "n0975" "SLICEX",placed CLEXL_X20Y5 SLICE_X32Y5 ,
Here is the code that I have gotten to so far:
searchfile = open("C:\PATH\TO\FILE.txt","r")
for line in searchfile:
    if "inst " in line:
        print line
searchfile.close()
Now this is great if I am looking for all lines that start with "inst", but I am specifically looking for lines that start with "inst "N"" or "inst "n"". From there, I wanted to extract just the string starting with N or n.
My idea was to first extract those lines (as shown above) to a new .txt file, then run another script to get only the portions of the lines that have N or n. In the example above, I am only concerned with N69 and n0975. Is there an easier method of doing this?
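Yes, with the re module.
re.finditer(r'^inst\s+\"n(\d+)\"', the_whole_file, re.I | re.M)
This returns an iterator over all the matches. The re.M flag is needed so that ^ matches at the start of every line rather than only at the start of the whole string.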
For each match you will need to do .group(1) to get those numbers you wanted.
Notice that you don't need to filter the file first using this method. You can do this for the whole file.
The output in your case will be:
69
0975
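As a self-contained sketch of this approach, the snippet below runs the pattern over a small in-memory string standing in for the file contents. Note that `^` only matches at the start of each line when `re.MULTILINE` is set, so that flag is combined with `re.IGNORECASE` here.

```python
import re

# Stand-in for the_whole_file, i.e. what f.read() would return
the_whole_file = '''inst "N69" "IOB",placed BIOB_X11Y0 R8 ,
some text
inst "n0975" "SLICEX",placed CLEXL_X20Y5 SLICE_X32Y5 ,
'''

# re.MULTILINE makes ^ match at the start of every line, not just the string;
# re.IGNORECASE lets "n" match both "n" and "N".
pattern = re.compile(r'^inst\s+"n(\d+)"', re.IGNORECASE | re.MULTILINE)

numbers = [m.group(1) for m in pattern.finditer(the_whole_file)]
print(numbers)
```

The captured group contains only the digits; to keep the leading letter as well, the group could be widened to `(n\d+)`.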
With re.search() function:
Sample file.txt content:
inst "N69" "IOB",placed BIOB_X11Y0 R8 ,
some text
inst "n0975" "SLICEX",placed CLEXL_X20Y5 SLICE_X32Y5 ,
text
another text
import re
with open('file.txt', 'r') as f:
    for l in f.read().splitlines():
        m = re.search(r'^inst "([Nn][^"]+)"', l)
        if m:
            print(m.group(1))
The output:
N69
n0975
Here is one solution:
with open('nfile.txt','r') as f:
    for line in f:
        if line.startswith('inst "n') or line.startswith('inst "N'):
            # the second whitespace-separated field is "N69"; strip its quotes
            print(line.split()[1].strip('"'))
For each line in the file, the startswith part checks whether the line starts with one of your target patterns. If it does, the line is split using split, and the second component, which is the part with n or N, is printed without its surrounding quotes.

reading data from multiple lines as a single item

I have a set of data from a file as such
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"="gotwastedatthehouse"
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
[mattplayhouse\wherecanwego\tothepoolhall]
How can I read/reference the text for each "johnnyboy"=splice(23) entry as a single line, like this:
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
I am currently matching the regex based on splice(23): with a search as follows:
re_johnny = re.compile('splice')
with open("file.txt", 'r') as file:
    read = file.readlines()
    for line in read:
        if re_johnny.match(line):
            print(line)
I think I need to take and remove the backslashes and the spaces to merge the lines but am unfamiliar with how to do that and not obtain the blank lines or the new line that is not like my regex. When trying the first solution attempt, my last row was pulled inappropriately. Any assistance would be great.
Input file: fin
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"="gotwastedatthehouse"
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
[mattplayhouse\wherecanwego\tothepoolhall]
Adding to tigerhawk's suggestion you can try something like this:
Code:
import re
with open('fin', 'r') as f:
    for l in [''.join([b.strip('\\') for b in a.split()]) for a in f.read().split('\n\n')]:
        if 'splice' in l:
            print(l)
Output:
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
With regex you have multiplied your problems. Instead, keep it simple:
If a line starts with ", it begins a record.
Else, append it to the previous record.
You can implement parsing for such a scheme in just a few lines in Python. And you don't need regex.
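A minimal sketch of that scheme is shown below. It makes two assumptions beyond the rule as stated: a line starting with `[` is also treated as the start of a record, so that the section header is not glued onto the last entry, and trailing backslashes are continuation markers to be dropped. The function name `join_records` is hypothetical and the sample data is abbreviated.

```python
def join_records(lines):
    """Merge continuation lines into the record they belong to."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                          # ignore blank lines
        if line.startswith(('"', '[')) or not records:
            records.append(line.rstrip('\\'))     # begin a new record
        else:
            records[-1] += line.rstrip('\\')      # continue the previous one
    return records


# Abbreviated sample; each '\\\n' is a literal backslash followed by a newline.
sample = [
    '"johnnyboy"=splice(23):15,00,30,00,\\\n',
    '  00,5c,00,6d,\\\n',
    '  00,2e,00\n',
    '"johnnyboy"="gotwastedatthehouse"\n',
    '[mattplayhouse\\wherecanwego\\tothepoolhall]\n',
]

for record in join_records(sample):
    if 'splice' in record:
        print(record)
```

A file object can be passed in place of the list, since it yields one line per iteration.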
