Using regex patterns stored in a file - Python

I'm working on a simple Python script that checks whether any line of an input file matches any of the patterns stored in a CSV file.
The following code prints nothing:
# -*- coding: utf8 -*-
import re
import csv

csvfile = open('errors.csv', 'r')
errorsreader = csv.reader(csvfile, delimiter="\t")
log = open('gcc.log', 'r')
for line in log:
    for row in errorsreader:
        matchObj = re.match(row[0], line)
        if matchObj:
            print(line)
While the same code works when the following pattern is used in place of row[0]:
.* error: expected ‘;’ before ‘}’ token .*
I have been looking for workarounds, but none of them seem to work. Any guesses?

The problem is that on the first line of log you read all rows from errorsreader, after which the reader is exhausted and nothing is left for subsequent lines. You can change
errorsreader = csv.reader(csvfile, delimiter="\t")
to
errorsreader = list(csv.reader(csvfile, delimiter="\t"))
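Putting it together, a minimal sketch of the fixed loop. The file contents here are in-memory stand-ins (assumptions for illustration) for errors.csv and gcc.log; a real script would open the files as above:

```python
import csv
import io
import re

# Stand-ins for errors.csv (one pattern per row) and gcc.log.
errors_csv = ".* error: expected .;. before .}. token.*\n.* warning: unused variable.*\n"
log_text = "main.c:3:1: error: expected ';' before '}' token\nall good here\n"

# Materialize the reader into a list so the patterns survive more than one pass.
patterns = list(csv.reader(io.StringIO(errors_csv), delimiter="\t"))

matches = []
for line in log_text.splitlines():
    for row in patterns:
        if re.match(row[0], line):
            matches.append(line)
            break  # one match per line is enough

print(matches)
```

Alternatively, csvfile.seek(0) before each inner loop would also work, but reading the patterns once into a list is simpler and avoids re-parsing the CSV for every log line.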

Related

Python Regex to find CRLF

I'm trying to write a regex that will find any CRLF in Python.
I am able to open the file and use f.newlines to determine which newlines it uses, CRLF or LF. My numerous regex attempts have failed:
with open('test.txt', 'rU') as f:
    text = f.read()
    print repr(f.newlines)
    regex = re.compile(r"[^\r\n]+", re.MULTILINE)
    print(regex.match(text))
I've done numerous iterations on the regex, and in every case it will either detect \n as \r\n or not work at all.
You could try using the re library to search for the \r and \n patterns:
import re

with open("test.txt", "rU") as f:
    for line in f:
        if re.search(r"\r\n", line):
            print("Found CRLF")
            regex = re.compile(r"\r\n")
            line = regex.sub("\n", line)
        if re.search(r"\r", line):
            print("Found CR")
            regex = re.compile(r"\r")
            line = regex.sub("\n", line)
        if re.search(r"\n", line):
            print("Found LF")
            regex = re.compile(r"\n")
            line = regex.sub("\n", line)
        print(line)
Assuming your test.txt file looks something like this:
This is a test file
with a line break
at the end of the file.
As I mentioned in a comment, you're opening the file with universal newlines, which means that Python will automatically perform newline conversion when reading from or writing to the file. Your program therefore will not see CR-LF sequences; they will be converted to just LF.
Generally, if you want to portably observe all bytes from a file unchanged, then you must open the file in binary mode:
In Python 2:
from __future__ import print_function
import re

with open('test.txt', 'rb') as f:
    text = f.read()
    regex = re.compile(r"[^\r\n]+", re.MULTILINE)
    print(regex.match(text))
In Python 3:
import re

with open('test.txt', 'rb') as f:
    text = f.read()
    regex = re.compile(rb"[^\r\n]+", re.MULTILINE)
    print(regex.match(text))
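Once the file is read in binary mode, the two line-ending styles can be told apart directly on the bytes. A minimal sketch, with a bytes literal standing in for the raw contents of test.txt (the data is an assumption for illustration):

```python
import re

# Stand-in for the raw bytes of test.txt, mixing CRLF and bare LF endings.
raw = b"line one\r\nline two\nline three\r\n"

crlf = len(re.findall(rb"\r\n", raw))
# Negative lookbehind so the \n inside a CRLF is not also counted as a bare LF.
lf = len(re.findall(rb"(?<!\r)\n", raw))

print("CRLF:", crlf, "LF:", lf)  # CRLF: 2 LF: 1
```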

\ufeff is appearing while reading csv using unicodecsv module

I have following code
import unicodecsv

CSV_PARAMS = dict(delimiter=",", quotechar='"', lineterminator='\n')
unireader = unicodecsv.reader(open('sample.csv', 'rb'), **CSV_PARAMS)
for line in unireader:
    print(line)
and it prints
['\ufeff"003', 'word one"']
['003,word two']
['003,word three']
The CSV looks like this
"003,word one"
"003,word two"
"003,word three"
I am unable to figure out why the first row has \ufeff (which I believe is a byte order mark). Moreover, there is a " at the beginning of the first row.
The CSV file is coming from a client, so I can't dictate how they save it. I'm looking to fix my code so that it can handle the encoding.
Note: I have already tried passing encoding='utf8' in CSV_PARAMS, and it didn't solve the problem.
encoding='utf-8-sig' will remove the UTF-8-encoded BOM (byte order mark) used as a UTF-8 signature in some files:
import unicodecsv

with open('sample.csv', 'rb') as f:
    r = unicodecsv.reader(f, encoding='utf-8-sig')
    for line in r:
        print(line)
Output:
['003,word one']
['003,word two']
['003,word three']
But why are you using the third-party unicodecsv with Python 3? The built-in csv module handles Unicode correctly:
import csv

# Note: newline='' is a documented requirement for the csv module
# when reading and writing CSV files.
with open('sample.csv', encoding='utf-8-sig', newline='') as f:
    r = csv.reader(f)
    for line in r:
        print(line)
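The difference between the two codecs is easy to see on a small byte string. A sketch, with a BOM-prefixed payload standing in for the start of the client's file (the data is an assumption):

```python
# A BOM-prefixed UTF-8 payload, as a client-saved file like sample.csv might begin.
data = "\ufeff003,word one".encode("utf-8")  # b'\xef\xbb\xbf003,word one'

# Plain utf-8 keeps the BOM as a leading \ufeff character; utf-8-sig strips it.
print(repr(data.decode("utf-8")))      # '\ufeff003,word one'
print(repr(data.decode("utf-8-sig")))  # '003,word one'
```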

Extra blank line is getting printed at the end of the output in Python

I am trying to read a file given on the command line and replace all the commas in that file with nothing. Below is my code:
import sys

datafile = sys.argv[1]
with open(datafile, 'r') as data:
    plaintext = data.read()
    plaintext = plaintext.replace(',', '')
    print(plaintext)
But when printing plaintext I get one extra blank line at the end. Why is that happening, and how can I get rid of it?
The file most likely ends with a newline of its own, and print then adds another. You can strip the trailing newline before printing:
plaintext = plaintext.rstrip('\n')
Note that rstrip returns a new string, so the result must be assigned back.
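A minimal sketch of both fixes, with a string literal standing in for the file contents (the data is an assumption for illustration):

```python
# Stand-in for the file contents; note the final newline.
plaintext = "a,b,c\nd,e,f\n"

plaintext = plaintext.replace(",", "")
# rstrip returns a new string, so the result must be reassigned.
stripped = plaintext.rstrip("\n")
print(stripped)

# Alternatively, keep the text intact and stop print from adding its own newline:
# print(plaintext, end="")
```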

Use csv file with multiple newline characters in Python 3

I'm trying to import a CSV file which has # as the delimiter and \r\n as the line break. Inside one column there is data which also contains newlines, but as \n.
I can read the file line by line without problems, but with the csv lib (Python 3) I've got stuck.
The example below throws a
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Is it possible to use the csv lib with multiple newline characters?
Thanks!
import csv

with open('../database.csv', newline='\r\n') as csvfile:
    file = csv.reader(csvfile, delimiter='#', quotechar='"')
    for row in file:
        print(row[3])
database.csv:
2202187#"645cc14115dbfcc4defb916280e8b3a1"#"cd2d3e434fb587db2e5c2134740b8192"#"{
Age = 22;
Salary = 242;
}
Please try this code. According to the Python 3.5.4 documentation, with newline=None, common line endings like '\r\n' are replaced by '\n'.
import csv

with open('../database.csv', newline=None) as csvfile:
    file = csv.reader(csvfile, delimiter='#', quotechar='"')
    for row in file:
        print(row[3])
I've replaced newline='\r\n' with newline=None.
You could also use the 'rU' mode modifier, but it is deprecated.
...
with open('../database.csv', 'rU') as csvfile:
...
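For the csv module, the documented recommendation is newline='', which leaves line endings untranslated and lets the reader itself handle newlines inside quoted fields. A sketch with an in-memory sample shaped like database.csv (the data is an assumption):

```python
import csv
import io

# In-memory stand-in for database.csv: '#'-delimited, \r\n record terminator,
# with a quoted field containing embedded \n newlines.
sample = ('2202187#"645cc14115dbfcc4defb916280e8b3a1"'
          '#"cd2d3e434fb587db2e5c2134740b8192"'
          '#"{\nAge = 22;\nSalary = 242;\n}"\r\n')

# newline='' is what the csv docs recommend when opening a real file;
# io.StringIO(sample, newline='') behaves the same way here.
rows = list(csv.reader(io.StringIO(sample, newline=""),
                       delimiter="#", quotechar='"'))
print(rows[0][3])
```

The quoted fourth field comes back with its embedded newlines intact, while the \r\n only terminates the record.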

Processing Russian text file fails

I have this code:
# -*- coding: utf-8 -*-
import codecs

prefix = u"а"
rus_file = "rus_names.txt"
output = "rus_surnames.txt"
with codecs.open(rus_file, 'r', 'utf-8') as infile:
    with codecs.open(output, 'a', 'utf-8') as outfile:
        for line in infile.readlines():
            outfile.write(line + prefix)
And it gives me something that looks like Chinese text in the output file. Even when I just outfile.write(line) I get the same garbage in the output. I just don't get it.
The purpose: I have a huge file with male surnames, and I need to produce the same file with female surnames. In Russian it looks like this: Ivanov - Ivanova | Иванов - Иванова
Try
lastname = str(line+prefix, 'utf-8')
outfile.write(lastname)
So @AndreyAtapin was partially right. I had been appending to a file that already contained my earlier mistakes with the Chinese-looking characters; even flushing the file didn't help. But when I deleted it and let the script create it again, it worked! Thanks.
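One more bug worth noting: writing line + prefix puts the suffix after the line's trailing newline, i.e. at the start of the next surname. A minimal Python 3 sketch of the intended transformation, with an in-memory list standing in for the lines of rus_names.txt (the names are assumptions; a real script would use open(..., encoding='utf-8')):

```python
prefix = "а"  # Cyrillic suffix for the feminine form

# Stand-ins for lines read from rus_names.txt.
lines = ["Иванов\n", "Петров\n"]

out = []
for line in lines:
    # Strip the newline first so the suffix lands at the end of the surname,
    # not after the line break.
    out.append(line.rstrip("\n") + prefix + "\n")

print("".join(out), end="")  # Иванова / Петрова, one per line
```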
