I am using this function to read a config file.
import numpy as np
stream = np.genfromtxt(filepath, delimiter='\n', comments='#', dtype='str')
It works pretty well, but I have a problem: tab characters. For example, the output looks like this:
['\tvalue1 ', ' 1'] ['\t'] ['value2 ', ' 2']
Is there a way to ignore this special char?
My solution is something like the following. (It works for my purposes, but it's a bit "ugly".)
import sys
import logging

result = {}
for el in stream:
    row = el.split('=', 1)
    try:
        if len(row) == 2:
            # clean the elements of unneeded spaces and tabs
            row[0] = row[0].replace(' ', '').replace('\t', '')
            row[1] = row[1].replace(' ', '').replace('\t', '')
            result[row[0]] = eval(row[1])
    except:
        print >> sys.stderr, "FATAL ERROR: '" + filepath + "' misconfigured"
        logging.exception(sys.stderr)
        sys.exit('')
To replace the tabs with nothing:
stream = [x.replace('\t','') for x in stream]
Or to replace tabs with a single space, and then remove duplicate spaces:
stream = [' '.join(x.replace('\t',' ').split()) for x in stream]
To remove empty strings:
stream = list(filter(None, stream))  # wrap in list() on Python 3, where filter() is lazy
There doesn't seem to be a way to assign multiple delimiters or comment characters using numpy's genfromtxt, so I would recommend looking elsewhere. Try the standard-library ConfigParser module: https://docs.python.org/2/library/configparser.html. Here's a page with quick examples so you can get a feel for how to work with the module: https://wiki.python.org/moin/ConfigParserExamples
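For example, a minimal sketch of what that could look like (on Python 3 the module is spelled configparser; the [defaults] section header is invented here, because configparser requires one and the question's file has none):

```python
# Minimal sketch of the ConfigParser suggestion, assuming the config file
# holds simple "key = value" lines, as in the question.
import configparser

raw = "# a comment line\nvalue1 = 1\nvalue2 = 2\n"
parser = configparser.ConfigParser()
parser.read_string("[defaults]\n" + raw)  # use parser.read(path) for a real file

# configparser returns strings; convert to the types you need yourself
result = {key: int(value) for key, value in parser["defaults"].items()}
print(result)
```

configparser also strips the whitespace around keys and values for you, which is what the original loop did by hand with replace().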
I have a text file which has a line like this -
time time B2CAT_INLET_T\CAN-Monitoring:1 B1CAT_MIDBED_T\CAN-Monitoring:1 B1CAT_INLET_T\CAN-Monitoring:1 B1CAT_OUTLET_T\CAN-Monitoring:1 time APEPFRPP\CCP:1 KDFILRAW\CCP:1
When I read it using
lines = txtfile.readlines()
I get lines =
'time\ttime\tB2CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_MIDBED_T\\CAN-Monitoring:1\tB1CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_OUTLET_T\\CAN-Monitoring:1\ttime\tAPEPFRPP\\CCP:1\tKDFILRAW\\CCP:1\t\t'
So a literal '\' shows as a double backslash and a tab shows as '\t'.
From this I want to delete all instances of '\CAN-Monitoring:1' and '\CCP:1' and preserve the tabs as they are.
I have code that walks through each element of 'lines' and gets the index of each double backslash and each '\t'.
Then I tried to use lines.replace() with those positions, replacing everything from the index of the double backslash to the index of the '\t' with ''.
But this does not seem to work the way I want.
Following is my code so far:
# Reading from text file
txtfile = open('filename.txt', 'r')
lines = txtfile.readlines()
textToModify = lines

# This gives indices of all '\\' and '\t'
doubleslash = []
tab = []
for i, item in enumerate(textToModify):
    if textToModify[i] == '\\':
        doubleslash.append(i)
for i, item in enumerate(textToModify):
    if textToModify[i] == '\t':
        tab.append(i)

# Should find text beginning with '\\' until '\t' only
itemSlashBegin = []
itemTabBegin = []
for itemSlash in doubleslash:
    for itemTab in tab:
        if itemSlash < itemTab:
            break
    itemSlashBegin.append(itemSlash)
    itemTabBegin.append(itemTab)

# Trying to replace '\\'text'\t' in the original text
for i, item in enumerate(itemSlashBegin):
    ModifiedTxt = textToModify.replace([item:itemTabBegin[i]], "")
I am sure there is a more elegant way too; but I cannot find it.
Please give me some solution.
Thank you
R
If you don't want to import anything, then use this (raw strings avoid the invalid '\C' escape sequences):
f = 'time\ttime\tB2CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_MIDBED_T\\CAN-Monitoring:1\tB1CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_OUTLET_T\\CAN-Monitoring:1\ttime\tAPEPFRPP\\CCP:1\tKDFILRAW\\CCP:1\t\t'
s = (r'\CAN-Monitoring:1', r'\CCP:1')
for i in s:
    f = f.replace(i, '')
print(f)
time time B2CAT_INLET_T B1CAT_MIDBED_T B1CAT_INLET_T B1CAT_OUTLET_T time APEPFRPP KDFILRAW
Just use re.sub here (inp holds the input line):
import re

out = re.sub(r'\\CAN-Monitoring:1|\\CCP:1', '', inp)
print(out)
This prints:
time time B2CAT_INLET_T B1CAT_MIDBED_T B1CAT_INLET_T B1CAT_OUTLET_T time APEPFRPP KDFILRAW
Note that double backslash and \t are simply how a literal backslash and tab character are represented in a Python string.
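A small demo of that point (not from the original posts):

```python
# The string below is typed with escape sequences, but each escape is a
# single character in memory: a, \, b, tab, c.
s = 'a\\b\tc'
print(len(s))   # 5
print(repr(s))  # repr() re-escapes them, which is what readlines() output shows
```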
I have a file where the lines have the form #nr = name(#nr, (#nr), different vars, and names).
I would like to keep only the #nr items inside the parentheses, to get the form #nr = name(#nr, #nr).
I have tried to solve this in different ways like using regex, startswith() and lists but nothing has worked so far.
Any help is much appreciated.
Edit: Code
for line in f.split():
    start = line.find('(')
    end = line.find(')')
    if start != -1 and end != -1:
        line = ''.join(i for i in x if not i.startswith('#'))
        print(line)
Edit 2:
As example I have:
#304= IFCRELDEFINESBYPROPERTIES('0FZ0hKNanFNAQpJ_Iqh4zM',#42,$,$,(#142),#301);
Afterwards I want to have:
#304= IFCRELDEFINESBYPROPERTIES(#42,#142,#301);
This can be solved using regex, though trying to do it with a single find/replace would be more complicated. Instead, you can do it in two steps:
import re

def sub_func(match):
    nums = re.findall(r'#\d+', match.group(2))
    return match.group(1) + '(' + ','.join(nums) + ');'

text = "#304= IFCRELDEFINESBYPROPERTIES('0FZ0hKNanFNAQpJ_Iqh4zM',#42,$,$,(#142),#301);"
result = re.sub(r'(^[^(]+)\((.*)\);', sub_func, text)
print(result)
# '#304= IFCRELDEFINESBYPROPERTIES(#42,#142,#301);'
So instead of passing a string as the second argument to re.sub, we pass a function, which lets us process the results of the match with some more regex and reformat them before passing the replacement back.
I have a delimited file that's causing me a bit of grief. It's pipe-delimited, with six fields, but field 4 can be split over several lines or contain nothing. I need a way to remove the newlines from field 4.
Here's what I've got
import csv

# header is constant
# fieldone|fieldtwo|three|four|five|six
content = """"asfdd|b|c|defg
ijklmnopque2
|record|sadfe
1324|b|c|defg
ijklmnopqu
dafdsasfde2asdf
dsfdsf
dsfadfadse2fdsase2
asdfasdfasfe2
|record|afasde
3243243|b|c|defg
ijklmnopque2
|record|adf
startrecord4|b|c||record|adf
"""

def extract():
    y = []
    x = content.split('|')
    for item in x:
        if len(item) > 4:
            y.append(item.replace('\n', '').replace('\r', ' '))
        else:
            y.append(item)
    print(y)

if __name__ == '__main__':
    extract()
This runs, but the problem is that it outputs everything in one row. I still need it to output individual records (4 in this case) without the newlines, but I'm not sure how.
Can I read the whole file with pandas.read_csv? Is there a better solution?
The header is constant across all records.
Would it be a solution for you to simply replace all double newlines with a placeholder, then explicitly remove the single newlines, and finally restore newlines at the placeholder positions?
You can try
sth_unique = '#%##'
c = content.replace('\n\n', sth_unique).replace('\n', '').replace(sth_unique, '\n')
print(c)
#"asfdd|b|c|defgijklmnopque2|record|sadfe
#1324|b|c|defgijklmnopqudafdsasfde2asdfdsfdsfdsfadfadse2fdsase2asdfasdfasfe2|record|afasde
#3243243|b|c|defgijklmnopque2|record|adf
#startrecord4|b|c||record|adf
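Once the stray newlines are gone, every remaining line is one complete record, so a follow-up step could split them into fields. A sketch (the two sample records are adapted from the question, with the stray leading quote dropped so csv doesn't treat it as a quoted field):

```python
import csv

# Cleaned text: one record per line, pipe-delimited, six fields each.
cleaned = ('asfdd|b|c|defgijklmnopque2|record|sadfe\n'
           'startrecord4|b|c||record|adf\n')

# csv.reader accepts any iterable of lines, so splitlines() is enough here.
rows = list(csv.reader(cleaned.splitlines(), delimiter='|'))
for row in rows:
    print(row)  # each row is a list of six fields; an empty field 4 stays as ''
```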
I need to parse a multi-line string into a data structure containing (1) the identifier and (2) the text after the identifier (but before the next > symbol). The identifier always comes on its own line, but the text can take up multiple lines.
>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
After execution I might have the data structured something like this:
id = ['identifier1', 'identifier2', 'identifier3']
and
txt =
['lalalalalalalalalalalalalalalalala',
'bababababababababababababababababa',
'wawawawawawawawawawawawawawawawawa']
It seems I would want to use regex to find (1) things after > but before the end of the line, and (2) things between >'s, having temporarily deleted the identifier string and EOL, replacing them with "".
The thing is, I will have hundreds of these identifiers, so I need to run the regex sequentially. Any ideas on how to attack this problem? I am working in Python, but feel free to use whatever language you want in your response.
*Update 1: code from slater is getting closer, but things are still not partitioned sequentially into id, text, id, text, etc.*
teststring = '''>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
    split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
but the output was:
['', 'identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
3
3
Note: it needs to work for a multi-line string, dealing with all the \n's. A better test case might be:
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
    split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
current output:
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
4
4
Personally, I feel that you should use regex as little as possible. It's slow, difficult to maintain, and generally unreadable.
That said, solving this in Python is extremely straightforward. I'm a little unclear on what exactly you mean by running this "sequentially", but let me know if this solution doesn't fit your needs.
# First, split the text into relevant chunks
split_text = text.split('>')
id = [text.partition('\n')[0] for text in split_text]
txt = [text.partition('\n')[2] for text in split_text]
Obviously, you could make the code more efficient, but if you're only dealing with hundreds of identifiers it really shouldn't be needed.
If you want to remove any blank entries that might occur, you can do the following:
list_with_blanks = ['', 'hello', '', '', 'world']
filter(None, list_with_blanks)
>>> ['hello', 'world']
Let me know if you have any more questions.
Unless I misunderstood the question, it's as easy as
for line in your_file:
    if line.startswith('>'):
        id.append(line[1:].strip())
    else:
        text.append(line.strip())
Edit: to concatenate multiple lines:
ids, text = [], []
for line in teststring.splitlines():
    if line.startswith('>'):
        ids.append(line[1:])
        text.append('')
    elif text:
        text[-1] += line
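Applying that loop to a shortened version of the question's multi-line test string shows the continuation lines being concatenated per identifier:

```python
# Shortened test input: two identifiers, the first with two text lines.
teststring = (">identifier1\n"
              "lalala\n"
              "lalala\n"
              ">identifier2\n"
              "bababa\n")

ids, text = [], []
for line in teststring.splitlines():
    if line.startswith('>'):
        ids.append(line[1:])   # identifier without the leading '>'
        text.append('')        # start a fresh text entry for it
    elif text:
        text[-1] += line       # append continuation lines to the last entry

print(ids)   # ['identifier1', 'identifier2']
print(text)  # ['lalalalalala', 'bababa']
```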
I found a solution. It's certainly not very pythonic but it works.
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala\n
lalalalalalalalalalalalalalalalala\n
>identifier2
bababababababababababababababababa\n
bababababababababababababababababa\n
>identifier3
wawawawawawawawawawawawawawawawawa\n
wawawawawawawawawawawawawawawawawa\n'''
i = 0
j = 0
# split the multiline string by line
dsplit = teststring.split('\n')
# the indices of identifiers
index = list()
for line in dsplit:
    if line.startswith('>'):
        print line
        index.append(i)
        j = j + 1
    i = i + 1
index.append(i)  # add this so you get the last block of text

# the text corresponding to each index
thetext = list()
# the names corresponding to each gene
thenames = list()
for n in range(0, len(index)-1):
    thetext.append("")
    for k in range(index[n]+1, index[n+1]):
        thetext[n] = thetext[n] + dsplit[k]
    thenames.append(dsplit[index[n]][1:])  # the [1:] removes the first character (>) from the line

print "the indicies", index
print "the text: ", thetext
print "the names", thenames
print "this many text entries: ", len(thetext)
print "this many index entries: ", j
This gives the following output:
>identifier1
>identifier2
>identifier3
the indicies [1, 6, 11, 16]
the text: ['lalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalala', 'babababababababababababababababababababababababababababababababababa', 'wawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawa']
the names ['identifier1', 'identifier2', 'identifier3']
this many text entries: 3
this many index entries: 3
I have a large file with several lines as given below. I want to read in only those lines which have the _INIT pattern in them, then strip the _INIT off the name and save only the OSD_MODE_15_H part in a variable. Then I need to read the corresponding hex value, 8'h00 in this case, strip the 8'h off it, replace it with 0x, and save it in a variable.
I have been trying to strip off the _INIT, the spaces and the =, and the code is becoming really messy.
localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00
Can you suggest a lean and clean method to do this?
Thanks!
The following solution uses a regular expression (compiled to speed searching up) to match the relevant lines and extract the needed information. The expression uses named groups "id" and "hexValue" to identify the data we want to extract from the matching line.
import re

expression = r"(?P<id>\w+?)_INIT\s*?=.*?'h(?P<hexValue>[0-9a-fA-F]*)"
regex = re.compile(expression)

def getIdAndValueFromInitLine(line):
    mm = regex.search(line)
    if mm is None:
        return None  # not the ..._INIT parameter, or the line was empty, or another mismatch happened
    else:
        return (mm.groupdict()["id"], "0x" + mm.groupdict()["hexValue"])
EDIT: If I understood the next task correctly, you need to find the hexvalues of those INIT and ADDR lines whose IDs match and make a dictionary of the INIT hexvalue to the ADDR hexvalue.
regex = r"(?P<init_id>\w+?)_INIT\s*?=.*?'h(?P<initValue>[0-9a-fA-F]*)"
init_dict = {}
for x in re.finditer(regex, lines):  # finditer, because findall would return plain tuples, not match objects
    init_dict[x.groupdict()["init_id"]] = "0x" + x.groupdict()["initValue"]

regex = r"(?P<addr_id>\w+?)_ADDR\s*?=.*?'h(?P<addrValue>[0-9a-fA-F]*)"
addr_dict = {}
for y in re.finditer(regex, lines):
    addr_dict[y.groupdict()["addr_id"]] = "0x" + y.groupdict()["addrValue"]

init_to_addr_hexvalue_dict = {init_dict[x]: addr_dict[x] for x in init_dict.keys() if x in addr_dict}
Even if this is not what you actually need, having init and addr dictionaries might help to achieve your goal easier. If there are several _INIT (or _ADDR) lines with the same ID and different hexvalues then the above dict approach will not work in a straight forward way.
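As a self-contained check of that idea on the two example lines from the question (using re.finditer rather than re.findall, since findall returns plain tuples instead of match objects):

```python
import re

# Two sample lines from the question, joined into one blob of text.
lines = ("localparam OSD_MODE_15_H_ADDR = 16'h038d;\n"
         "localparam OSD_MODE_15_H_INIT = 8'h00\n")

init_re = r"(?P<init_id>\w+?)_INIT\s*?=.*?'h(?P<initValue>[0-9a-fA-F]*)"
addr_re = r"(?P<addr_id>\w+?)_ADDR\s*?=.*?'h(?P<addrValue>[0-9a-fA-F]*)"

# Build ID -> "0x..." dictionaries from the match objects.
init_dict = {m.group("init_id"): "0x" + m.group("initValue")
             for m in re.finditer(init_re, lines)}
addr_dict = {m.group("addr_id"): "0x" + m.group("addrValue")
             for m in re.finditer(addr_re, lines)}

# Map each INIT hex value to the ADDR hex value of the matching ID.
init_to_addr = {init_dict[k]: addr_dict[k]
                for k in init_dict if k in addr_dict}
print(init_to_addr)  # {'0x00': '0x038d'}
```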
Try something like this; I'm not sure what all your requirements are, but it should get you close:
with open(someFile, 'r') as infile:
    for line in infile:
        if '_INIT' in line:
            apostropheIndex = line.find("'h")
            clean_hex = '0x' + line[apostropheIndex + 2:]
In the case of "16'h038d;", clean_hex would be "0x038d;" (need to remove the ";" somehow) and in the case of "8'h00", clean_hex would be "0x00"
Edit: if you want to guard against characters like ";" you could do this and test if a character is alphanumeric:
clean_hex = '0x' + ''.join([s for s in line[apostropheIndex + 2:] if s.isalnum()])
You can use a regular expression and the re.findall() function. For example, to generate a list of tuples with the data you want, just try:
import re

lines = open("your_file").read()
regex = r"([\w]+?)_INIT\s*=\s*\d+'h([\da-fA-F]*)"
res = [(x[0], "0x" + x[1]) for x in re.findall(regex, lines)]
print res
The regular expression is very specific for your input example. If the other lines in the file are slightly different you may need to change it a bit.