I'm having issues with .replace(). My XML parser does not like '&', but will accept '&amp;'. I'd like to use .replace('&','&amp;'), but this does not seem to be working. I keep getting the error:
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 51, column 41
So far I have tried just a straightforward file=file.replace('&','&amp;'), but this doesn't work. I've also tried:
xml_file = infile
file=xml_file.readlines()
for line in file:
    for char in line:
        char.replace('&','&amp;')
infile=open('a','w')
file='\n'.join(file)
infile.write(file)
infile.close()
infile=open('a','r')
xml_file=infile
What would be the best way to fix my issue?
str.replace creates and returns a new string. It can't alter strings in-place - they're immutable. Try replacing:
file=xml_file.readlines()
with
file = [line.replace('&','&amp;') for line in xml_file]
This uses a list comprehension to build a list equivalent to .readlines() but with the replacement already made.
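A minimal end-to-end sketch under the question's setup (assuming xml_file is the open input file and 'a' is the output filename, both taken from the question):

file = [line.replace('&', '&amp;') for line in xml_file]

outfile = open('a', 'w')
outfile.write(''.join(file))   # lines from a file keep their '\n', so join with nothing
outfile.close()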
As pointed out in the comments, if there were already &amp;s in the string, they'd be turned into &amp;amp;, likely not what you want. To avoid that, you could use a negative lookahead in a regular expression to replace only the ampersands not already followed by amp;:
import re
file = [re.sub("&(?!amp;)", "&amp;", line) for line in xml_file]
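For example, the lookahead leaves ampersands that are already escaped alone:

>>> import re
>>> re.sub("&(?!amp;)", "&amp;", "Tom & Jerry &amp; friends")
'Tom &amp; Jerry &amp; friends'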
str.replace() returns a new string object with the change made. It does not change the data in-place; you are ignoring the return value.
You want to apply it to each line instead:
file = [line.replace('&', '&amp;') for line in file]
You could use the fileinput module to do the transformation, and have it handle replacing the original file (a backup will be made):
import fileinput
import sys
for line in fileinput.input('filename', inplace=True):
    sys.stdout.write(line.replace('&', '&amp;'))
Oh...
You need to decode HTML notation for special symbols. Python has a module to deal with it - HTMLParser; see its docs.
Here is example:
import HTMLParser

htmlparser = HTMLParser.HTMLParser()   # unescape lives on an instance
out_file = ....
file = xml_file.readlines()
parsed_lines = []
for line in file:
    parsed_lines.append(htmlparser.unescape(line))
Slightly off topic, but it might be good to use some escaping?
I often use urllib's quote, which will put URL (percent) escaping in and out:
>>> import urllib
>>> result = urllib.quote("filename&fileextension")
>>> result
'filename%26fileextension'
>>> urllib.unquote(result)
'filename&fileextension'
Might help for consistency?
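Note that quote/unquote is URL escaping rather than XML escaping, so an XML parser won't accept %26 as an ampersand. If the goal is XML-safe text specifically, the standard library also has xml.sax.saxutils.escape, which handles &, < and >:

>>> from xml.sax.saxutils import escape, unescape
>>> escape('filename&fileextension')
'filename&amp;fileextension'
>>> unescape('filename&amp;fileextension')
'filename&fileextension'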
Related
Friends,
I have a situation where I need to grep a word from a string
[MBeanServerInvocationHandler]com.bea:Name=itms2md01,Location=hello,Type=ServerRuntime
What I want to grep is the word assigned to the variable Name in the above string, which is itms2md01.
In my case I have to grep whichever string is assigned to Name=, so there is no particular string I have to search for.
Tried:
import re
import sys
file = open(sys.argv[2], "r")
for line in file:
    if re.search(sys.argv[1], line):
        print line,
Deak is right. As I don't have enough reputation to comment, I am showing it below. I am not going down to the file level; just see this as an instance:
import re
str1 = "[MBeanServerInvocationHandler]com.bea:Name=itms2md01,Location=hello,Type=ServerRuntime"
pat = r'(?<=Name=)\w+(?=,)'
print re.search(pat, str1).group()
Accordingly, you can apply your logic to the file content with this pattern, as sketched below.
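A sketch applying the same pattern line by line to a file (the filename here is hypothetical):

import re

pat = r'(?<=Name=)\w+(?=,)'
for line in open('server.log'):   # hypothetical filename
    m = re.search(pat, line)
    if m:
        print m.group()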
I like to use named groups, because I'm often searching for more than one thing. But even for one item in the search, it still works nicely, and I can remember very easily what I was searching for.
I'm not certain that I fully understand the question, but if you are saying that the user can pass a key to search the value for and also a file from which to search, you can do that like this:
So, for this case, I might do:
import re
import sys
file = open(sys.argv[2], "r")
for line in file:
    match = re.search(r"%s=(?P<item>[^,]+)" % sys.argv[1], line)
    if match is not None:
        print match.group('item')
I am assuming that is the purpose, as you have included sys.argv[1] into the search, though you didn't mention why you did so in your question.
I'm currently trying to write a function that takes two inputs:
1 - The URL for a web page
2 - The name of a text file containing some regular expressions
My function should read the text file line by line (each line being a different regex) and then execute the given regex on the web page source code. However, I've run into trouble doing this:
example
Suppose I want the address contained on a Yelp page with URL = http://www.yelp.com/biz/liberty-grill-cork
where the regex is \<address\>\s*([^<]*)\\b\s*<. In Python, I then run:
address = re.search('\<address\>\s*([^<]*)\\b\s*<', web_page_source_code)
The above will work. However, if I just write the regex in a text file as-is and then read it from the text file, it won't work. So reading the regex from a text file is what is causing the problem; how can I rectify this?
EDIT: This is how I'm reading the regexes from the text file:
with open("test_file.txt","r") as file:
for regex in file:
address = re.search(regex, web_page_source_code)
Just to add, the reason I want to read regexes from a text file is so that my function code can stay the same and I can alter my list of regexes easily. If anyone can suggest any other alternatives that would be great.
Your string has some backslashes and other things escaped to avoid their special meaning in a Python string literal, not only in the regex itself.
You can easily verify what happens by printing the string you load from the file. If your backslashes are doubled, you did it wrong.
The text you want in the file is:
File
\<address\>\s*([^<]*)\b\s*<
Here's how you can check it
In [1]: a = open('testfile.txt')
In [2]: line = a.readline()
-- this is the line as you'd see it in Python code when properly escaped
In [3]: line
Out[3]: '\\<address\\>\\s*([^<]*)\\b\\s*<\n'
-- this is what it actually means (what re will use)
In [4]: print(line)
\<address\>\s*([^<]*)\b\s*<
OK, I managed to get it working. For anyone who wants to read regular expressions from text files, you need to do the following:
Ensure that the regex in the text file is entered in the right format (thanks to MightyPork for pointing that out)
You also need to remove the newline '\n' character at the end
So overall, your code should look something like:
import re

a = open("test_file.txt","r")
line = a.readline()
line = line.strip('\n')
result = re.search(line, page_source_code)
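If you want to run every pattern in the file rather than just the first line, the same idea extends to a loop (a sketch; it assumes one regex per line and that, like the example pattern, group 1 holds the match of interest):

import re

with open("test_file.txt", "r") as f:
    for raw in f:
        pattern = raw.strip('\n')      # drop the trailing newline
        if not pattern:                # skip blank lines
            continue
        result = re.search(pattern, page_source_code)
        if result:
            print result.group(1)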
This code, borrowed from another place on Stack Overflow, removes all the places where the csv has "None" written in it. However, it also adds an extra line to the csv. How can I change this code to remove that extra line? I think the problem is caused by inplace, but when I take inplace away, the file is no longer altered by running the code.
import fileinput

def cleanOutputFile(filename):
    for line in fileinput.FileInput(filename, inplace=1):
        line = line.replace('None', "")
        print line
Thanks!
If you want to replace all the None's:
with open(filename) as f:
    lines = f.read().replace("None","")
with open(filename,"w") as f1:
    f1.write(lines)
Using rstrip with fileinput should also work:
import fileinput

for line in fileinput.FileInput(filename, inplace=1):
    print line.replace('None', "").rstrip()  # remove the newline character to avoid adding extra lines in the output
The problem here has nothing to do with fileinput, or with the replace.
Lines read from a file always end in a newline.
print adds a newline, even if the thing you're printing already ends with a newline.
You can see this without even a file involved:
>>> a = 'abc'
>>> print a
abc
>>> a = 'abc\n'
>>> print a
abc
>>>
The solution is any of the following:
rstrip the newlines off the input: print line.rstrip('\n') (or do the strip earlier in your processing)
Use the "magic comma" to prevent print from adding the newline: print line,
Use Python 3-style print with from __future__ import print_function, so you can use the more flexible keyword arguments: print(line, end='')
Use sys.stdout.write instead of print (see the sketch below).
Completely reorganize your code so you're no longer writing to stdout at all, but instead writing directly to a temporary file, or reading the whole file into memory and then writing it back out, etc.
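For example, the sys.stdout.write variant applied to the original function (a sketch):

import fileinput
import sys

def cleanOutputFile(filename):
    for line in fileinput.FileInput(filename, inplace=1):
        # write() adds nothing, so the line's own newline is the only newline
        sys.stdout.write(line.replace('None', ''))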
I have a CSV file that contains information about people:
name,age,height
Maria,25,172
George,45,180,
Peter,23,179,
The problem is that some lines contain an extra comma at the end, and some don't (this happens because the information was fetched from the internet using urlopen in another Python script which processes the raw data).
I tried to write some code to fix this, but I couldn't get a result. What I've written:
import re
data = open('file.csv').read()
new_data = re.sub('\W$', '', data)
print(new_data)
But this code substitutes only the last comma in the whole document. I tried to write a loop which counts all lines and then analyses each line, but maybe my coding skills are not great and I didn't succeed. Please tell me what I'm doing wrong.
The problem is the whole file is handled as a string, and $ matches only the end of the string.
You would be better off using re.sub('\W\n', '\n', data)
You can also do that without regexp: new_data = data.replace(',\n', '\n'), which is probably faster.
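Alternatively, keep the $ anchor but tell re to match it at the end of every line with the MULTILINE flag (a sketch; ',$' is used here since it is stricter than '\W$'):

import re

data = open('file.csv').read()
new_data = re.sub(r',$', '', data, flags=re.M)
print(new_data)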
This is simple enough that you don't really need regex (and it's probably faster not to use it).
Here's what I would do:
with open("file.csv", 'r') as f:
newLines = [line[:-1] if line.endswith(",") else line for line in f.readlines()]
Then all you need to do is write it back to the file
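For example (the lines above keep their '\n', so writelines is enough):

with open("file.csv", 'w') as f:
    f.writelines(newLines)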
I have a file with contents in list form, such as:
[1,'ab','fgf','ssd']
[2,'eb','ghf','hhsd']
[3,'ag','rtf','ssfdd']
I want to read that file line by line using f.readline and assign each line to a list.
I tried doing this:
k = []
k = f.readline()
print k[1]
I expected the result to show the 2nd element of the list in the first line, but it indexed into the string instead and gave the output '1'.
How to get the expected output?
If all you want is to take the input format shown and store it as a list, attempting to execute the input file (with eval()) is not a good idea. It leaves your program open to all sorts of accidentally or intentionally harmful input. You are better advised to just parse the input file:
s = f.readline().strip()[1:-1]   # strip the newline, then drop the surrounding brackets
k = s.split(',')
print k[1]
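Note that the split leaves the quote characters on the string fields (k[1] is "'ab'" here); a sketch that strips them as well:

s = f.readline().strip()[1:-1]
k = [item.strip().strip("'") for item in s.split(',')]
print k[1]   # -> ab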
readline just returns strings. You need to cast it to what you want. eval does the job; be warned, however, that it executes everything inside the string, so this is only an option if you trust the input (i.e. you've saved it yourself).
If you need to save data from your program to a file, you might want to use pickle.
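A minimal sketch of the pickle round trip (the filename is made up):

import pickle

data = [1, 'ab', 'fgf', 'ssd']

with open('data.pkl', 'wb') as f:   # save
    pickle.dump(data, f)

with open('data.pkl', 'rb') as f:   # load
    restored = pickle.load(f)

print restored[1]   # -> ab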
If the sample posted is the actual content of your file (which I highly doubt), here is what you could do starting with Python 2.6 (see the ast.literal_eval docs):
>>> import ast
>>> for line in open(fname):
...     print(ast.literal_eval(line)[1])
ab
eb
ag
You could use eval on each line; this would evaluate the expression in the line and should yield your expected list, if the formatting is correct.
A safer solution would be a simple CSV parser. For that your input could look something like this (comma-separated):
123,321,12,123,321,'asd',ewr,'afdg','et al',213
Maybe this is feasible.
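A sketch with the csv module, assuming the input were reformatted as plain comma-separated lines like the one above:

import csv

with open('data.txt') as f:        # hypothetical filename
    for row in csv.reader(f):
        print row[1]               # second field of each line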
Maybe you can use eval as suggested, but I'm just curious: is there any reason not to use JSON as the file format?
You can use the json module:
import json
with open('lists.txt', 'r') as f:
    lines = f.readlines()

for line in lines:
    line = line.replace("'", '"')   # JSON requires double quotes
    l = json.loads(line)
    print l[1]
Outputs:
ab
eb
ag