Python: Modifying text file, deleting substring

I have a text file which has a line like this -
time time B2CAT_INLET_T\CAN-Monitoring:1 B1CAT_MIDBED_T\CAN-Monitoring:1 B1CAT_INLET_T\CAN-Monitoring:1 B1CAT_OUTLET_T\CAN-Monitoring:1 time APEPFRPP\CCP:1 KDFILRAW\CCP:1
When I read it using
lines = txtfile.readlines()
I get lines =
'time\ttime\tB2CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_MIDBED_T\\CAN-Monitoring:1\tB1CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_OUTLET_T\\CAN-Monitoring:1\ttime\tAPEPFRPP\\CCP:1\tKDFILRAW\\CCP:1\t\t'
So each '\' shows as a double '\\' and each tab shows as '\t'.
From this I want to delete all instances of '\CAN-Monitoring:1' and '\CCP:1' and preserve the tabs as they are.
I have code that walks through each element of 'lines' and gets the index of each double '\\' and each '\t'.
Then I tried to use lines.replace(index of double '\\':index of '\t','')
But this does not seem to work as I want.
Following is my code so far:
# Reading from text file
txtfile = open('filename.txt', 'r')
lines = txtfile.readlines()
textToModify = lines
# This gives indices of all '\\' and '\t'
doubleslash = []
tab = []
for i, item in enumerate(textToModify):
    if textToModify[i] == '\\':
        doubleslash.append(i)
for i, item in enumerate(textToModify):
    if textToModify[i] == '\t':
        tab.append(i)
# Should find text beginning with '\\' until '\t' only
itemSlashBegin = []
itemTabBegin = []
for itemSlash in doubleslash:
    for itemTab in tab:
        if itemSlash < itemTab:
            break
    itemSlashBegin.append(itemSlash)
    itemTabBegin.append(itemTab)
# Trying to replace '\\'text'\t' in the original text
for i, item in enumerate(itemSlashBegin):
    ModifiedTxt = textToModify.replace([item:itemTabBegin[i]], "")
I am sure there is a more elegant way too; but I cannot find it.
Please give me some solution.
Thank you
R

If you don't want to import anything, then use this:
f = 'time\ttime\tB2CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_MIDBED_T\\CAN-Monitoring:1\tB1CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_OUTLET_T\\CAN-Monitoring:1\ttime\tAPEPFRPP\\CCP:1\tKDFILRAW\\CCP:1\t\t'
s = ('\\CAN-Monitoring:1', '\\CCP:1')
for i in s:
    f = f.replace(i, '')
print(f)
time time B2CAT_INLET_T B1CAT_MIDBED_T B1CAT_INLET_T B1CAT_OUTLET_T time APEPFRPP KDFILRAW
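To apply the same replacement to the file on disk rather than a hard-coded string, a minimal round-trip sketch might look like this (assuming the whole file fits in memory; 'filename.txt' is the name from the question):
# Read the file, strip the unwanted tags, and write the result back
with open('filename.txt', 'r') as txtfile:
    text = txtfile.read()
for tag in ('\\CAN-Monitoring:1', '\\CCP:1'):
    text = text.replace(tag, '')  # tabs are untouched, only the tags are removed
with open('filename.txt', 'w') as txtfile:
    txtfile.write(text)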

Just use re.sub here (inp being the line read from the file):
import re

out = re.sub(r'\\CAN-Monitoring:1|\\CCP:1', '', inp)
print(out)
This prints:
time time B2CAT_INLET_T B1CAT_MIDBED_T B1CAT_INLET_T B1CAT_OUTLET_T time APEPFRPP KDFILRAW
Note that double backslash and \t are simply how a literal backslash and tab character are represented in a Python string.
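A quick way to see this for yourself is to compare a string's repr (what the interactive prompt and readlines() display) with its printed form; a small demonstration:
s = 'B2CAT_INLET_T\\CAN-Monitoring:1\tdone'
print(repr(s))  # shows the escaped form with '\\' and '\t'
print(s)        # shows one real backslash and a real tab character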

Related

How to read strings as integers when reading from a file in python

I have the following line of code reading in a specific part of a text file. The problem is these are numbers, not strings, so I want to convert them to ints and read them into a list of some sort.
A sample of the data from the text file is as follows:
However, this is not wholly representative; I have uploaded the full set of data as a text file here: http://s000.tinyupload.com/?file_id=08754130146692169643
*NSET, NSET=Nodes_Pushed_Back_IB
99915527, 99915529, 99915530, 99915532, 99915533, 99915548, 99915549, 99915550,
99915551, 99915552, 99915553, 99915554, 99915555, 99915556, 99915557, 99915558,
99915562, 99915563, 99915564, 99915656, 99915657, 99915658, 99915659, 99915660,
99915661, 99915662, 99915663, 99915664, 99915665, 99915666, 99915667, 99915668,
99915669, 99915670, 99915885, 99915886, 99915887, 99915888, 99915889, 99915890,
99915891, 99915892, 99915893, 99915894, 99915895, 99915896, 99915897, 99915898,
99915899, 99915900, 99916042, 99916043, 99916044, 99916045, 99916046, 99916047,
99916048, 99916049, 99916050
*NSET, NSET=Nodes_Pushed_Back_OB
Any help would be much appreciated.
Hi, I am still stuck with this issue; any more suggestions? The latest code and error message are below. Thanks!
import tkinter as tk
from tkinter import filedialog

file_path = filedialog.askopenfilename()
print(file_path)
data = []
data2 = []
data3 = []
flag = False
with open(file_path, 'r') as f:
    for line in f:
        if line.strip().startswith('*NSET, NSET=Nodes_Pushed_Back_IB'):
            flag = True
        elif line.strip().endswith('*NSET, NSET=Nodes_Pushed_Back_OB'):
            flag = False  # loop stops when condition is false, i.e. if false do nothing
        elif flag:  # as long as flag is true, append
            data.append([int(x) for x in line.strip().split(',')])
The result is the following error:
ValueError: invalid literal for int() with base 10: ''
Instead of reading these as strings I would like each to be a number in a list, i.e [98932850 98932852 98932853 98932855 98932856 98932871 98932872 98932873]
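As an aside, the ValueError comes from the trailing comma on each data line: splitting on ',' leaves an empty string at the end, and int('') fails. A guard in the comprehension is one way around it (a sketch against the loop above):
# skip empty tokens produced by the trailing comma on each line
data.append([int(x) for x in line.strip().split(',') if x.strip()])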
In such cases I use regular expressions together with string methods. I would solve this problem like so:
import re

with open(filepath) as f:
    txt = f.read()
g = re.search(r'NSET=Nodes_Pushed_Back_IB(.*)', txt, re.S)
snums = g.group(1).replace(',', ' ').split()
numbers = [int(num) for num in snums]
I read the entire text into txt.
Next I use a regular expression, with the last portion of your header as an anchor, and capture all the rest with capturing parentheses (the re.S flag means that the dot also matches newlines). I access all the numbers as one unit of text via g.group(1).
Next, I replace all the commas with spaces, because on the resulting text I use split(), which is an excellent function for whitespace-separated items: the number of spaces doesn't matter, it just splits the text as you would intend.
The rest is just converting the text to numbers using a list comprehension.
Your line contains more than one number, and some separating characters. You could parse that format by judicious application of split and perhaps strip, or you could minimize string handling by having re extract specifically the fields you care about:
import re

ints = list(map(int, re.findall(r'-?\d+', line)))
This regular expression will find each group of digits, optionally prefixed by a minus sign, and then map will apply int to each such group found.
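For example, applied to one of the sample lines from the question (a quick sketch):
import re

line = '99915527, 99915529, 99915530, 99915532,'
print(list(map(int, re.findall(r'-?\d+', line))))
# [99915527, 99915529, 99915530, 99915532]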
Using a sample of your string:
strings = ' 98932850, 98932852, 98932853, 98932855, 98932856, 98932871, 98932872, 98932873,\n'
I'd just split the string, strip the commas, and return a list of numbers:
numbers = [ int(s.strip(',')) for s in strings.split() ]
Based on your comment and regarding the larger context of your code. I'd suggest a few things:
from itertools import groupby

number_groups = []
with open('data.txt', 'r') as f:
    for k, g in groupby(f, key=lambda x: x.startswith('*NSET')):
        if k:
            pass
        else:
            number_groups += list(filter('\n'.__ne__, list(g)))  # remove newlines in list

data = []
for group in number_groups:
    for str_num in group.strip('\n').split(','):
        if str_num.strip():  # skip the empty token left by a trailing comma
            data.append(int(str_num))

Extract the lines between 2 specific tags

For a routine programming question, I need to extract some lines of text that are between 2 tags (delimiters, if I need to be more specific).
The file is something like this:
*some random text*
...
...
...
tag/delimiter 1
text 1 #extract
text 2 #extract
... #extract
... #extract
text n #extract
tag/ending_delimiter
*some random text*
...
...
...
tag/delimiter 2
text 1 #extract
text 2 #extract
... #extract
... #extract
text n #extract
tag/ending_delimiter
*some random text*
...
...
...
tag/delimiter n
text 1 #extract
text 2 #extract
... #extract
... #extract
text n #extract
tag/ending_delimiter
*some random text until the file ends*
The ending_delimiter is the same everywhere.
The starting delimiter, i.e. delimiter 1, delimiter 2, up to n, is taken from a list.
The catch is that in the file there are a few (less than 3) characters after each starting delimiter which, combined with the starting delimiter, work as an identifier for the lines of text until the ending_delimiter, a kind of "uid", technically.
So far, what I've tried is this:
data_file = open("file_name")
block = []
found = False
for elem in list_of_starting_delimiters:
    for line in data_file:
        if found:
            block.append(line)
            if re.match(attribute_end, line.strip()):
                break
        else:
            if re.match(elem, line.strip()):
                found = True
                block = elem
data_file.close()
I have also tried to implement the answers suggested in:
python - Read file from and to specific lines of text
but with no success.
The implementation I'm currently trying is one of the answers from the link above.
Any help is appreciated.
P.S: Using Python 2.7, on PyCharm, on Windows 10.
I suggest fixing your code the following way:
block = []
found = False
list_of_starting_delimiters = ['tag/delimiter']
attribute_end = 'tag/ending_delimiter'
curr = []
for elem in list_of_starting_delimiters:
    for line in data_file:
        if found:
            curr.append(line)
            if line.strip().startswith(attribute_end):
                found = False
                block.append("\n".join(curr))  # Add merged list to final list
                curr = []  # Zero out current list
        else:
            if line.strip().startswith(elem):  # If line starts with start delimiter
                found = True
                curr.append(line.strip())  # Append line to current list
if len(curr) > 0:  # If there are still lines in the current list
    block.append(curr)  # Add them to the final list
There are quite a lot of issues with your current code:
block = elem made block a string, and the subsequent .append caused an exception
You only grabbed one occurrence of the block, because upon finding one you had a break statement
All the lines were added as separate items, while you needed to collect them into a list and then join them with \n to get the strings to put into the resulting list
You need no regex to check if a string appears at the start of another string; use the str.startswith method.
By the time I figured this out there were already a fair number of good responses, but my approach would be that you could resolve this with:
import re
pattern = re.compile(r"(^tag\/delimiter) (.{0,3})\n\n((^[\w\d #\.]*$\n)+)^(tag\/ending_delimiter)", re.M)
You could then find all matches in your text by iterating over the match objects (a sketch of consuming them follows after this answer):
for i in pattern.finditer(<target_text>):
    # do something with each match
or by calling pattern.findall(<target_text>), which, since the pattern contains groups, returns a list of tuples of the captured groups for each match.
This of course bears the stipulation that you need to specify different delimiters and compile a different regex pattern (re.compile) for each different delimiter, using variables and string concatenation as #SpghttCd shows in his answer
For more info see the python re module
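For instance, the groups of each match could be unpacked like this (a sketch; target_text stands for the file contents read beforehand, and the group numbers follow the pattern above):
for m in pattern.finditer(target_text):
    uid = m.group(2)   # the few characters after the start delimiter
    body = m.group(3)  # the extracted lines up to the ending delimiter
    print(uid, body)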
What about
import re

with open(file, 'r') as f:
    txt = f.read()
losd = '|'.join(list_of_starting_delimiters)
enddel = 'attribute_end'
block = re.findall('(?:' + losd + r')([\s\S]*?)' + enddel, txt)
I would do that in the following way: for example purposes, let <d1>, <d2> and <d3> be our starting delimiters, <d> the ending delimiter, and string the text you are processing. Then the following line of code:
re.findall('(<d1>|<d2>|<d3>)(.+?)(<d>)', string, re.DOTALL)
will give a list of tuples, each tuple containing the starting delimiter, the body and the ending delimiter. This code uses grouping inside the regular expression (the parentheses); the pipe (|) acts like an or; the dot (.) combined with the DOTALL flag matches any character, including newlines; the plus (+) means 1 or more; and the question mark (?) makes the match non-greedy (important in this case, as otherwise you would get a single match beginning at the first starting delimiter and ending at the last ending delimiter).
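A tiny demonstration with placeholder delimiters (a sketch; the real tags come from the question's list):
import re

string = "<d1> alpha <d> junk <d2> beta <d>"
print(re.findall('(<d1>|<d2>|<d3>)(.+?)(<d>)', string, re.DOTALL))
# [('<d1>', ' alpha ', '<d>'), ('<d2>', ' beta ', '<d>')]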
My re-less solution would be the following:
list_of_starting_delimiters = ['tag/delimiter 1', 'tag/delimiter 2', 'tag/delimiter n']
enddel = 'tag/ending_delimiter'

block = {}
section = ''
with open(file, 'r') as f:
    for line in f:
        if line.strip() == enddel:
            section = ''
        if section:
            block[section] = block.get(section, '') + line
        if line.strip() in list_of_starting_delimiters:
            section = line.strip()

print(block)
It extracts the blocks into a dictionary with start delimiter tags as keys and according sections as values.
It requires the start and end tags to be the only content of their respective lines.
Output:
{'tag/delimiter 1':
'\ntext 1 #extract\n\ntext 2 #extract\n\n... #extract\n\n... #extract\n\ntext n #extract\n\n',
'tag/delimiter 2':
'\ntext 1 #extract\n\ntext 2 #extract\n\n... #extract\n\n... #extract\n\ntext n #extract\n\n',
'tag/delimiter n':
'\ntext 1 #extract\n\ntext 2 #extract\n\n... #extract\n\n... #extract\n\ntext n #extract\n\n'}

Delete tab character from config files

I am using this function to read a config file.
import numpy as np
stream = np.genfromtxt(filepath, delimiter = '\n', comments='#', dtype= 'str')
It works pretty well but I have a problem: the tab character.
I.e.
output
['\tvalue1 ', ' 1'] ['\t'] ['value2 ', ' 2']
Is there a way to ignore this special char?
My solution is something like this (it works for my purposes, but it's a bit "ugly"):
result = {}
for el in stream:
    row = el.split('=', 1)
    try:
        if len(row) == 2:
            row[0] = row[0].replace(' ', '').replace('\t', '')  # clean the elements of unneeded spaces
            row[1] = row[1].replace(' ', '').replace('\t', '')
            result[row[0]] = eval(row[1])
    except:
        print >> sys.stderr, "FATAL ERROR: '" + filepath + "' missetted"
        logging.exception(sys.stderr)
        sys.exit('')
To replace the tabs with nothing:
stream = [x.replace('\t','') for x in stream]
Or to replace tabs with a single space, and then remove duplicate spaces:
stream = [' '.join(x.replace('\t',' ').split()) for x in stream]
To remove empty strings:
stream = filter(None, stream)
There doesn't seem to be a way to assign multiple delimiters or comments using numpy's genfromtxt. I would recommend looking elsewhere; try https://docs.python.org/2/library/configparser.html. Here's a link with a quick example so you can get a feel for how to work with the module: https://wiki.python.org/moin/ConfigParserExamples
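A minimal sketch of that approach (configparser is the Python 3 module name, ConfigParser in Python 2; the config.ini file and its [settings] section are hypothetical):
import configparser

config = configparser.ConfigParser()
config.read('config.ini')
# whitespace around '=' is stripped automatically, no manual replace() needed
value1 = config.getint('settings', 'value1')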

python - matching string and replacing

I have a file and I am trying to replace parts of a line with another word.
It looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
I need to delete everything but bob123#bobscarshop.com, but I need to match 23rh32o3hro2rh2 with 23rh32o3hro2rh2:poniacvibe from a different text file and place poniacvibe in front of bob123#bobscarshop.com,
so it would look like this: bob123#bobscarshop.com:poniacvibe
I've had a hard time trying to go about doing this, but I think I would have to split bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":"), but some of the lines have a ':' in a spot where I don't want the line to be split, if that makes any sense...
If anyone could help I would really appreciate it.
OK, it looks to me like you are using a colon (:) to separate your strings.
In this case you can use .split(":") to break your strings into their component substrings.
E.g.:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
and assuming your substrings will always be in the same order, with the same number of substrings in the main string, you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")
secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")
if firstdata[3] == seconddata[0]:
    outputdata = firstdata
    outputdata.insert(1, seconddata[1])
    outputstring = ""
    for item in outputdata:
        if outputstring == "":
            outputstring = item
        else:
            outputstring = outputstring + ":" + item
what this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string using the colon as the separator
the reason your strings need to be the same length is that the index is being used to find the relevant strings, rather than some form of string matching (which gets much more complex)
if you can keep your data in this form it gets much simpler.
to protect against malformed data (lists too short) you can explicitly test for them before you start, using len(list) to see how many elements are in it.
or you could let it run and catch the exception; however, in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
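For example, the explicit length test could look like this (a sketch reusing the variable names from the snippet above):
if len(firstdata) >= 4 and len(seconddata) >= 2:  # enough fields to compare safely
    if firstdata[3] == seconddata[0]:
        pass  # proceed with the merge as above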
hope this helps
James
EDIT:
OK, so if you are trying to match up a long list of strings from files, you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode="r")
secondfile = open("secondfile.txt", mode="r")
first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()

first_data = []
for item in first_raw_data:
    first_data.append(item.replace("\n", "").split(":"))
second_data = []
for item in second_raw_data:
    second_data.append(item.replace("\n", "").split(":"))

output_strings = []
for item in first_data:
    searchstring = item[3]
    for entry in second_data:
        if searchstring == entry[0]:
            output_data = item
            output_string = ""
            output_data.insert(1, entry[1])
            for data in output_data:
                if output_string == "":
                    output_string = data
                else:
                    output_string = output_string + ":" + data
            output_strings.append(output_string)
            break

for entry in output_strings:
    print(entry)
this should achieve what you're after, and as a proof of concept it will print the resulting list of strings for you.
if you have any questions feel free to ask.
James
Second edit:
to make this output the results into a file, change the last two lines to:
outputfile = open("outputfile.txt", mode="w")
for entry in output_strings:
    outputfile.write(entry + "\n")
outputfile.close()

Help parsing text file in python

Really been struggling with this one for some time now. I have many text files with a specific format, from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing, ensuring I get all the info correctly.
the format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) defining the fields from each other; see my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''', (tmp_PA, tmp_K, tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however my DETAILS is only pulling one word. I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
K
You are splitting the whole line into words. You need to split it into the first word, the second word and the rest, like line.split(None, 2).
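Applied to one of the sample lines from the question (a quick sketch):
line = '2 4565434 i need this sentace as one DB record'
k, pa, details = line.split(None, 2)
print(details)  # i need this sentace as one DB record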
I would probably use regular expressions, with the opposite logic: if the line starts with a number 1 through 5, use it; otherwise pass. Like:
import re

pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r')  # no readlines() call; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statements. Parsing SQL is orders of magnitude slower than executing it.
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = file(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K', 'PA', 'DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K, PA, tokens[2]))
f.close()
for r in records:
    print r  # replace this with your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re:
import re

stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print result.group('third')
        print result.group('second')
        print result.group('first')
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script:
import re

stuff = open("source", "r").readlines()
# create a regular expression using subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result:  # do stuff with the data here, just use the tags we declared earlier
        print result.group('third')
        print result.group('second')
        print result.group('first')
import re

reg = re.compile('K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3*'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3), (5,4,6), (8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s, because \s matches the following characters:
blank, '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol representing more than what is to be matched, with the risk of matching erratic newlines in places where they shouldn't be.
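A quick illustration of the difference (a small sketch):
import re

s = 'a\nb'
print(re.findall(r'a\sb', s))     # ['a\nb'] -- \s matches the newline
print(re.findall(r'a[ \t]b', s))  # []      -- [ \t] does not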
Edit
It may be sufficient to do:
import re

reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$', re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2, 1, 3))
