I wrote a script to gather information from an XML file. Inside, there are ENTITY declarations, and I need a regex to get the value out of each one.
<!ENTITY ABC "123">
<!ENTITY BCD "234">
<!ENTITY CDE "345">
First, I open the XML file and save its contents in a variable.
xml = open("file.xml", "r")
lines = xml.readlines()
Then I got a for loop:
result = "ABC"
var_search_result_list = []
var_searcher = "ENTITY\s" + result + '.*"[^"]*"\>'
for line in lines:
    var_search_result = re.match(var_searcher, line)
    if var_search_result != None:
        var_search_result_list += list(var_search_result.groups())
print(var_search_result_list)
I really want to have the value 123 inside of my var_search_result_list list. Instead, I get an empty list every time I use this. Has anybody got a solution?
Thanks in Advance - Toki
There are a few issues in the code.
You are using re.match, which only matches at the start of the string.
Your pattern is ENTITY\sABC.*"[^"]*"\> which cannot match at the start of the example lines, because each line begins with <!.
If you want to add only 123, you have to put a capture group around [^"]* and append var_search_result.group(1) to the result list. Without a capture group, groups() returns an empty tuple, so nothing is ever added.
For example:
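import re

# Read all lines; the with block closes the file afterwards.
with open("file.xml", "r") as xml:
    lines = xml.readlines()

result = "ABC"
var_search_result_list = []
# Raw strings keep \s intact; the parentheses capture the quoted value.
var_searcher = r"ENTITY\s" + result + r'.*"([^"]*)"\>'

for line in lines:
    # re.search scans the whole line; re.match only tries position 0.
    var_search_result = re.search(var_searcher, line)
    if var_search_result:
        var_search_result_list.append(var_search_result.group(1))

print(var_search_result_list)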
Output
['123']
A bit more precise pattern could be
<!ENTITY\sABC\s+"([^"]*)"\>
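As a minimal sketch of the difference between re.match and re.search, and of the stricter pattern above, here is a self-contained version that works on an in-memory list instead of file.xml. The helper name entity_value is made up for illustration, and re.escape is my own addition to guard against entity names that contain regex metacharacters.

```python
import re

# The three sample declarations from the question, held in memory.
lines = ['<!ENTITY ABC "123">', '<!ENTITY BCD "234">', '<!ENTITY CDE "345">']


def entity_value(name, lines):
    """Return the quoted value of the first ENTITY called `name`, or None."""
    # Hypothetical helper: builds the stricter pattern around the given name.
    pattern = re.compile(r'<!ENTITY\s+' + re.escape(name) + r'\s+"([^"]*)">')
    for line in lines:
        m = pattern.search(line)
        if m:
            return m.group(1)
    return None


# re.match only tries position 0, and every line starts with "<!".
match_result = re.match(r'ENTITY\sABC', lines[0])
# re.search scans the whole line, so it finds the text.
search_result = re.search(r'ENTITY\sABC', lines[0])

value = entity_value("ABC", lines)
print(match_result, bool(search_result), value)
```

If the entity is not declared, entity_value returns None, so the caller can tell a missing entity apart from an empty value.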
I have some code (shown further down) that reads a specific part of a text file. The problem is that the values are numbers, not strings, so I want to convert them to ints and read them into a list of some sort.
A sample of the data from the text file is shown below.
It is not wholly representative, so I have uploaded the full set of data as a text file here: http://s000.tinyupload.com/?file_id=08754130146692169643
*NSET, NSET=Nodes_Pushed_Back_IB
99915527, 99915529, 99915530, 99915532, 99915533, 99915548, 99915549, 99915550,
99915551, 99915552, 99915553, 99915554, 99915555, 99915556, 99915557, 99915558,
99915562, 99915563, 99915564, 99915656, 99915657, 99915658, 99915659, 99915660,
99915661, 99915662, 99915663, 99915664, 99915665, 99915666, 99915667, 99915668,
99915669, 99915670, 99915885, 99915886, 99915887, 99915888, 99915889, 99915890,
99915891, 99915892, 99915893, 99915894, 99915895, 99915896, 99915897, 99915898,
99915899, 99915900, 99916042, 99916043, 99916044, 99916045, 99916046, 99916047,
99916048, 99916049, 99916050
*NSET, NSET=Nodes_Pushed_Back_OB
Any help would be much appreciated.
Hi, I am still stuck on this issue. Any more suggestions? My latest code and the error message are below. Thanks!
import tkinter as tk
from tkinter import filedialog
file_path = filedialog.askopenfilename()
print(file_path)
data = []
data2 = []
data3 = []
flag= False
with open(file_path,'r') as f:
    for line in f:
        if line.strip().startswith('*NSET, NSET=Nodes_Pushed_Back_IB'):
            flag= True
        elif line.strip().endswith('*NSET, NSET=Nodes_Pushed_Back_OB'):
            flag= False #loop stops when condition is false i.e if false do nothing
        elif flag: # as long as flag is true append
            data.append([int(x) for x in line.strip().split(',')])
The result is the following error:
ValueError: invalid literal for int() with base 10: ''
Instead of reading these as strings, I would like each to be a number in a list, i.e. [98932850 98932852 98932853 98932855 98932856 98932871 98932872 98932873]
In such cases I use regular expressions together with string methods. I would solve this problem like so:
import re

with open(file_path) as f:
    txt = f.read()

# Capture lazily from the header up to the next '*' keyword line (or end of file).
g = re.search(r'NSET=Nodes_Pushed_Back_IB(.*?)(?=\*|\Z)', txt, re.S)
snums = g.group(1).replace(',', ' ').split()
numbers = [int(num) for num in snums]
I read the entire text into txt.
Next I use a regular expression with the last portion of your header as an anchor, and capture everything after it with capturing parentheses, stopping before the next line that starts a new set with * (the re.S flag means that a dot also matches newlines). Without that stop, the capture would run on into the Nodes_Pushed_Back_OB header and int() would fail on it. I access all the numbers as one unit of text via g.group(1).
Next, I remove all the commas (actually, I replace them with spaces), because I then call split() on the resulting text. split() is an excellent function for items separated by whitespace: the amount of whitespace doesn't matter, it just splits the text as you would intend.
The rest is just converting the text to numbers using a list comprehension.
Your line contains more than one number, and some separating characters. You could parse that format by judicious application of split and perhaps strip, or you could minimize string handling by having re extract specifically the fields you care about:
ints = list(map(int, re.findall(r'-?\d+', line)))
This regular expression will find each group of digits, optionally prefixed by a minus sign, and then map will apply int to each such group found.
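To show how that one-liner could fit into the flag-based loop from the question, here is a rough sketch. It assumes that every line starting with * begins a new keyword block, which matches the sample but may not hold for the full file. The function name read_node_set and the shortened in-memory sample are made up for illustration.

```python
import re


def read_node_set(lines, name):
    """Collect the integers listed under '*NSET, NSET=<name>'."""
    header = '*NSET, NSET=' + name
    numbers = []
    inside = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith('*'):
            # Assumption: any '*' line opens a new block, so only stay
            # "inside" when it is exactly the header we are looking for.
            inside = (stripped == header)
        elif inside:
            # findall ignores commas, spaces and a trailing comma alike.
            numbers.extend(int(tok) for tok in re.findall(r'-?\d+', stripped))
    return numbers


# Shortened stand-in for the file contents (the last block is invented).
sample = [
    '*NSET, NSET=Nodes_Pushed_Back_IB\n',
    ' 99915527, 99915529, 99915530,\n',
    ' 99916048, 99916049, 99916050\n',
    '*NSET, NSET=Nodes_Pushed_Back_OB\n',
    ' 1, 2, 3\n',
]
ib_nodes = read_node_set(sample, 'Nodes_Pushed_Back_IB')
print(ib_nodes)
```

Because the numbers are pulled out with findall rather than split(','), the empty field after a trailing comma never reaches int().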
Using a sample of your string:
strings = ' 98932850, 98932852, 98932853, 98932855, 98932856, 98932871, 98932872, 98932873,\n'
I'd just split the string, strip the commas, and return a list of numbers:
numbers = [ int(s.strip(',')) for s in strings.split() ]
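For comparison, here is a small sketch of why the original split(',') raised the ValueError while the whitespace split does not. The sample line is a shortened version of the one above.

```python
line = ' 98932850, 98932852, 98932853,\n'

# Splitting on commas leaves an empty string after the trailing comma,
# and int('') is what raises "invalid literal for int() with base 10: ''".
fields = line.strip().split(',')

# Splitting on whitespace yields only non-empty tokens; strip(',') then
# removes the comma that is still attached to each one.
numbers = [int(s.strip(',')) for s in line.split()]

print(fields)
print(numbers)
```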
Based on your comment, and looking at the larger context of your code, I'd suggest a few things:
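from itertools import groupby

number_groups = []
with open('data.txt', 'r') as f:
    # groupby alternates between runs of header lines (k is True)
    # and runs of data lines (k is False).
    for k, g in groupby(f, key=lambda x: x.startswith('*NSET')):
        if not k:
            # keep the data lines, dropping blank ones
            number_groups += [line for line in g if line.strip()]

data = []
for group in number_groups:
    for str_num in group.strip().split(','):
        # a trailing comma leaves an empty field; skip it
        if str_num.strip():
            data.append(int(str_num))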
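If you would rather keep each set separate instead of one flat list, a variation on the same groupby idea might look like this. It is only a sketch on invented in-memory lines, and the names blocks and parsed are placeholders.

```python
from itertools import groupby

# Invented stand-in for the file: two header lines, each followed by data.
lines = ['*NSET, NSET=A\n', ' 1, 2,\n', ' 3\n', '*NSET, NSET=B\n', ' 4, 5\n']

# Each run of consecutive non-header lines becomes one block.
blocks = [
    list(g)
    for is_header, g in groupby(lines, key=lambda s: s.startswith('*NSET'))
    if not is_header
]

# Turn every block into its own list of ints, skipping empty fields.
parsed = [
    [int(tok) for line in block for tok in line.split(',') if tok.strip()]
    for block in blocks
]
print(parsed)
```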
I have a file with a bunch of numbers that have white spaces and colons and I am trying to remove them. As I have seen on this forum the function line.strip.split() works well to achieve this. Is there a way of removing the white space and colon all in one go? Using the method posted by Lorenzo I have this:
train = []
with open('C:/Users/Morgan Weiss/Desktop/STA5635/DataSets/dexter/dexter_train.data') as train_data:
    train.append(train_data.read().replace(' ','').replace(':',''))
size_of_train = np.shape(train)
for i in range(size_of_train[0]):
    for j in range(size_of_train[1]):
        train[i][j] = int(train[i][j])
print(train)
Although I get this error:
File "C:/Users/Morgan Weiss/Desktop/STA5635/Homework/Homework_1/HW1_Dexter.py", line 11, in <module>
for j in range(size_of_train[1]):
IndexError: tuple index out of range
I think the above syntax is not correct, but anyway, as per your question, you can do this with Python's built-in string methods.
After reading the file contents as a string, you can do something like this:
import numpy as np

train = []
with open('/Users/sushant.moon/Downloads/dexter_train.data') as f:
    tokens = f.read().split()  # split on any whitespace
    for x in tokens:
        data = x.split(':')  # separate the two numbers around the colon
        train.append([int(data[0]), int(data[1])])
# this part becomes redundant as i have already converted str to int before i append data to train
size_of_train = np.shape(train)
for i in range(size_of_train[0]):
    for j in range(size_of_train[1]):
        train[i][j] = int(train[i][j])
Here I am using split() to break the text apart on whitespace, and then split(':') to separate the two numbers on either side of each colon.
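To handle the whitespace and the colons "all in one go", as the question puts it, one option is re.split with a character class. This is only a sketch: the question never shows the file contents, so the sample text below is an assumption, as is the helper name parse_numbers.

```python
import re


def parse_numbers(text):
    """Split on any run of whitespace and/or colons and convert to int."""
    # A leading or trailing separator produces an empty token; drop those.
    tokens = [t for t in re.split(r'[\s:]+', text) if t]
    return [int(t) for t in tokens]


# Assumed layout: pairs of integers joined by a colon, space separated.
sample = '12:3 45:1\n7:2 \n'
numbers = parse_numbers(sample)
print(numbers)
```

Note that this flattens everything into one list. If the pairing matters, splitting on whitespace first and on the colon second keeps each pair together.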
You did not provide an example of what your input file looks like so we can only speculate what solution you need. I'm going to suppose that you need to extract integers from your input text file and print their values.
Here's how I would do it:
Instead of trying to eliminate whitespace characters and colons, I would search for digits using a regular expression.
Consecutive digits constitute a number.
I would convert each such number to an integer.
And here's what it would look like:
import re
input_filename = "/home/evens/Temporaire/Stack Exchange/StackOverflow/Input_file-39359816.txt"
matcher = re.compile(r"\d+")
with open(input_filename) as input_file:
    for line in input_file:
        for digits_found in matcher.finditer(line):
            number_in_string_form = digits_found.group()
            number = int(number_in_string_form)
            print(number)
But before you run away with this code, you should continue to learn Python because you don't seem to grasp its basic elements yet.
This is the python script:
f = open('csvdata.csv','rb')
fo = open('out6.csv','wb')
for line in f:
    bits = line.split(',')
    bits[1] = '"input"'
    fo.write( ','.join(bits) )
f.close()
fo.close()
I have a CSV file and I'm replacing the content of the 2nd column with the string "input". However, I need to grab some information from that column content first.
The content might look like this:
failurelog_wl","inputfile/source/XXXXXXXX"; "**X_CORD2**"; "Invoice_2M";
"**Y_CORD42**"; "SIZE_ID37""
As you can see, the data is oddly formatted. In particular, it has two double quotes at the end of the line instead of the single one you would expect.
I need to extract the XCORD and YCORD information, like XCORD = 2 and YCORD = 42, before replacing the column value. I then want to insert an extra column, named X_Y, which represents (2_42).
How can I modify my script to do that?
If I understand your question correctly, you can use a simple regular expression to pull out the numbers you want:
import re

# Text mode, so that split and join operate on str.
with open('csvdata.csv', 'r') as f, open('out6.csv', 'w') as fo:
    for line in f:
        # Drop the newline first so the new column lands on the same line.
        bits = line.rstrip('\n').split(',')
        # Capture the digits that follow X_CORD and Y_CORD.
        x_y_matches = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
        assert x_y_matches is not None, 'Line had unexpected format: {0}'.format(bits[1])
        x_y = '({0}_{1})'.format(x_y_matches.group(1), x_y_matches.group(2))
        bits[1] = '"input"'
        bits.append(x_y)
        fo.write(','.join(bits) + '\n')
Note that this will only work if column 2 always says 'X_CORD' and 'Y_CORD' immediately before the numbers. If it is sometimes a slightly different format, you'll need to adjust the regular expression to allow for that. I added the assert to give a more useful error message if that happens.
You mentioned wanting the column to be named X_Y. Your script appears to assume that there is no header, and my modified version definitely makes this assumption. Again, you'd need to adjust for that if there is a header line.
And, yes, I agree with the other commenters that using the csv module would be cleaner, in general, for reading and writing csv files.
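For what it's worth, here is a rough sketch of the csv-module approach mentioned above. It assumes the second column is a properly quoted field, which the sample in the question is not (note the doubled quote at the end), so the real file may need cleaning first. The sample row, the helper name add_x_y_column, and the use of io.StringIO in place of real files are all assumptions for illustration.

```python
import csv
import io
import re

# Digits after X_CORD and Y_CORD; re.S in case the field spans lines.
COORD_RE = re.compile(r'X_CORD(\d+).*?Y_CORD(\d+)', re.S)


def add_x_y_column(rows):
    """Yield each row with column 2 replaced and an X_Y value appended."""
    for row in rows:
        m = COORD_RE.search(row[1])
        if m is None:
            raise ValueError('Line had unexpected format: {0!r}'.format(row[1]))
        x_y = '({0}_{1})'.format(m.group(1), m.group(2))
        yield [row[0], 'input'] + row[2:] + [x_y]


# Hypothetical, well-formed row standing in for csvdata.csv.
sample = ('failurelog_wl,"inputfile/source/X; X_CORD2; Invoice_2M; '
          'Y_CORD42; SIZE_ID37",other\n')

reader = csv.reader(io.StringIO(sample))
out = io.StringIO()
csv.writer(out, lineterminator='\n').writerows(add_x_y_column(reader))
result = out.getvalue()
print(result)
```

The csv module takes care of quoting on the way in and out, so the code never has to deal with the quote characters itself.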
I am trying to increment a version number using regex, but I can't seem to get the hang of regex at all. I'm having trouble with the symbols in the string I am trying to read and change. The code I have so far is:
version_file = "AssemblyInfo.cs"
read_file = open(version_file).readlines()
write_file = open(version_file, "w")
r = re.compile(r'(AssemblyFileVersion\s*(\s*"\s*)(\S+))\s*"\s*')
for l in read_file:
    m1 = r.match(l)
    if m1:
        VERSION_ID=map(int,m1.group(2).split("."))
        VERSION_ID[2]+=1 # increment version
        l = r.sub(r'\g<1>' + '.'.join(['%s' % (v) for v in VERSION_ID]), l)
    write_file.write(l)
write_file.close()
The string I am trying to read and change is:
[assembly: AssemblyFileVersion("1.0.0.0")]
What I would like written to the file is:
[assembly: AssemblyFileVersion("1.0.0.1")]
So basically I want to increment the build number by one.
Can anyone help me fix my regular expression? I seem to have trouble getting to grips with regular expressions that have to work around symbols.
Thanks for any help.
If you specify the version as "1.0.0.*" then AFAIK it gets updated on each build automagically, at least if you're using Visual Studio.NET.
I'm not sure regex is your best bet, but one way of doing it would be this:
import re
# Don't bother matching everything, just the bits that matter.
pat = re.compile(r'AssemblyFileVersion.*\.(\d+)"')
# ... lines omitted which set up read_file, write_file etc.
for line in read_file:
    m = pat.search(line)
    if m:
        start, end = m.span(1)
        line = line[:start] + str(int(line[start:end]) + 1) + line[end:]
    write_file.write(line)
Good luck with regex.
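If you do want to stay with regex, another way to sketch it is re.sub with a replacement function, so the brackets and quotes are matched literally and only the last number changes. The helper name bump_build is made up, and the pattern assumes the version always has exactly four numeric parts.

```python
import re

# Group 1: everything up to the last dot, group 2: the build number,
# group 3: the closing quote and parenthesis. "(" and "." are escaped.
VERSION_RE = re.compile(r'(AssemblyFileVersion\("\d+\.\d+\.\d+\.)(\d+)("\))')


def bump_build(line):
    """Return the line with the fourth version component incremented."""
    def repl(m):
        return m.group(1) + str(int(m.group(2)) + 1) + m.group(3)
    return VERSION_RE.sub(repl, line)


bumped = bump_build('[assembly: AssemblyFileVersion("1.0.0.0")]')
print(bumped)
```

Lines that don't contain a matching AssemblyFileVersion come back unchanged, so the function can be applied to every line of the file before writing it out.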
If I had to do the same, I'd convert the string to int by removing the dots, add one and convert back to string.
Well, I'd also have used an integer version number in the first place.