Replace column based strings with multiple with pre-defined values - Python

Replace column based strings with multiple with pre-defined values - Python - python

I am bit confused with approach to implement the below logic in python. I would need expert advice in choosing a method.
I have to replace strings with predefined values in certain columns.
For e.g.
| is delimiter
Input :
ABCD|NewYork|800|TU
XYA|England|589|IA
Output :
QWER|NewYork|800|PL
NHQ|England|589|DQ
Predefined dictionary :
Actual Value : ABCDEFGHIJKLMNOPQRSTUVWXYZ
Replace Value : QWERTYASDFGHNBVCXZOPLKMNHY
So, If value is ABCD, I should get QWER. If it is TU then it should replace it with PL. The values can be random.
My approach would be like below
Read a line and then go to column 1
read each character and replace one by one by using replace values
Go to column 4 and then read each character and replace one by one
go to next line and so on....
I feel this might be poor way of coding. Is there any different way than above approach? Please suggest a method.
Column's may be different for different files. It should be dynmaic

You can make use of str.translate and str.maketrans to make your life a lot easier here:
In [1]: fnd = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
...: rpl = 'QWERTYASDFGHNBVCXZOPLKMNHY'
...: trns = str.maketrans(fnd, rpl)
In [2]: 'ABCD'.translate(trns)
Out[2]: 'QWER'
In [4]: 'UV'.translate(trns)
Out[4]: 'LK'

This is one way using a list comprehensions with str.join.
The trick is to convert your dictionary to a Python dict.
x = ['ABCD|NewYork|800|TU',
'XYA|England|589|IA']
d = dict(zip('ABCDEFGHIJKLMNOPQRSTUVWXYZ',
'QWERTYASDFGHNBVCXZOPLKMNHY'))
res = ['|'.join([''.join(list(map(d.get, i[0])))]+i[1:]) \
for i in map(lambda y: y.split('|'), x)]
Result:
['QWER|NewYork|800|TU',
'NHQ|England|589|IA']

This should do it:
from string import maketrans
actual = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
replace = 'QWERTYASDFGHNBVCXZOPLKMNHY'
with open('infile.txt') as inf, open('outfile.txt', 'w') as outf:
toBeWritten = []
for line in inf:
items = line.strip().split('|')
items[0] = items[0].translate(maketrans( actual, replace))
items[3] = items[3].translate(maketrans( actual, replace))
print items
toBeWritten.append('|'.join(items))
outf.writelines(toBeWritten)

Related

How to split string with multiple delimiters in Python?

My First String
xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv
But I want to result like this below
bonding_err_bond0-if_eth2
I try some code but seems not work correctly
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
x = csv.rsplit('.', 4)[2]
print(x)
But Result that I get is com-bonding_err_bond0-if_eth2-d But my purpose is bonding_err_bond0-if_eth2

If you are allowed to use the solution apart from regex,
You can break the solution into a smaller part to understand better and learn about join if you are not aware of it. It will come in handy.
solution= '-'.join(csv.split('.', 4)[2].split('-')[1:3])
Thanks,
Shashank

Probably you got the answer, but if you want a generic method for any string data you can do this:
In this way you wont be restricted to one string and you can loop the data as well.
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
first_index = csv.find("-")
second_index = csv.find("-d")
result = csv[first_index+1:second_index]
print(result)
# OUTPUT:
# bonding_err_bond0-if_eth2

You can just separate the string with -, remove the beginning and end, and then join them back into a string.
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
x = '-'.join(csv.split('-')[1:-1])
Output
>>> csv
>>> bonding_err_bond0-if_eth2

python script not joining strings as expected

I have a list of lists of sequences, and a corresponding list of lists of names.
testSequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
I also have a list of all the identifying parts of the names:
taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
I am trying to produce a new list, where each item in the list will correspond to one of the "identifying parts of the names", and the string will be made up of all the sequences for that name.
If a name and sequence does not appear in one of the lists in the lists (i.e. no redFish or blueFish in the first list of testNames) I want to add in a string of hyphens the same length as the sequences in that list. This would give me this output:
['aaaa--AAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGG']
I have this piece of code to do this.
complete = [''] * len(taxonNames)
for i in range(len(testSequences)):
for j in range(len(taxonNames)):
sequenceLength = len(testSequences[i][0])
for k in range(len(testSequences[i])):
if taxonNames[j] in testNames[i][k]:
complete[j].join(testSequences[i][k])
if taxonNames[j] not in testNames[i][k]:
hyphenString = "-" * sequenceLength
complete[j].join(hyphenString)
print complete
"complete" should give my final output as explained above, but it comes out looking like this:
['', '', '', '']
How can I fix my code to give me the correct answer?

The main issue with your code, which makes it very hard to understand, is you're not really leveraging the language elements that make Python so strong.
Here's a solution to your problem that works:
test_sequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']
def get_seqs(taxon_name, sequences_list, names_list):
for seqs, names in zip(sequences_list, names_list):
found_seq = None
for seq, name in zip(seqs, names):
if taxon_name in name:
found_seq = seq
break
yield found_seq if found_seq else '-' * len(seqs[0])
result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
for taxon_name in taxon_names]
print(result)
The generator get_seqs pairs up lists from test_sequences and test_names and for each pair, tries to find the sequence (seq) for the name (name) that matches and yields it, or yields a string of the right number of hyphens for that list of sequences.
The generator (a function that yields multiple values) has code that quite literally follows the explanation above.
The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.
You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)
Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't to avoid further deviating from your example.

Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures).
The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []
for file in files:
d = dict()
with open(file, 'r') as fin:
for line in fin:
line = line.rstrip()
if line.startswith('>'): # for ex., >xx_oneFish |xxx
underscore = line.index('_')
space = line.index(' ')
key = line[underscore+1:space]
else:
d[key] = line
for i, key in enumerate(differentNames):
data[i].append(d.get(key, '-' * 4))
for array in data:
final.append(''.join(array))
print(final)
Prints:
['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']

How do i format the ouput of a list of list into a textfile properly?

I am really new to python and now I am struggeling with some problems while working on a student project. Basically I try to read data from a text file which is formatted in columns. I store the data in a list of list and sort and manipulate the data and write them into a file again. My problem is to align the written data in proper columns. I found some approaches like
"%i, %f, %e" % (1000, 1000, 1000)
but I don't know how many columns there will be. So I wonder if there is a way to set all columns to a fixed width.
This is how the input data looks like:
2 232.248E-09 74.6825 2.5 5.00008 499.482
5 10. 74.6825 2.5 -16.4304 -12.3
This is how I store the data in a list of list:
filename = getInput('MyPath', workdir)
lines = []
f = open(filename, 'r')
while 1:
line = f.readline()
if line == '':
break
splitted = line.split()
lines.append(splitted)
f.close()
To write the data I first put all the row elements of the list of list into one string with a free fixed space between the elements. But instead i need a fixed total space including the element. But also I don't know the number of columns in the file.
for k in xrange(len(lines)):
stringlist=""
for i in lines[k]:
stringlist = stringlist+str(i)+' '
lines[k] = stringlist+'\n'
f = open(workdir2, 'w')
for i in range(len(lines)):
f.write(lines[i])
f.close()
This code works basically, but sadly the output isn't formatted properly.
Thank you very much in advance for any help on this issue!

You are absolutely right about begin able to format widths as you have above using string formatting. But as you correctly point out, the tricky bit is doing this for a variable sized output list. Instead, you could use the join() function:
output = ['a', 'b', 'c', 'd', 'e',]
# format each column (len(a)) with a width of 10 spaces
width = [10]*len(a)
# write it out, using the join() function
with open('output_example', 'w') as f:
f.write(''.join('%*s' % i for i in zip(width, output)))
will write out:
' a b c d e'
As you can see, the length of the format array width is determined by the length of the output, len(a). This is flexible enough that you can generate it on the fly.
Hope this helps!

String formatting might be the way to go:
>>> print("%10s%9s" % ("test1", "test2"))
test1 test2
Though you might want to first create strings from those numbers and then format them as I showed above.
I cannot fully comprehend your writing code, but try working on it somehow like that:
from itertools import enumerate
with open(workdir2, 'w') as datei:
for key, item in enumerate(zeilen):
line = "%4i %6.6" % key, item
datei.write(item)

Matching in Python lists when there are extra characters

I am trying to write a python code to match things from two lists in python.
One tab-delimited file looks like this:
COPB2
KLMND7
BLCA8
while the other file2 has a long list of similar looking "names", if you will. There should be some identical matches in the file, which I have succeeded in identifying and writing out to a new file. The problem is when there are additional characters at the end of one of the "names". For example, COPB2 from above should match COPB2A in file2, but it does not. Similarly KLMND7 should match KLMND79. Should I use regular expressions? Make them into strings? Any ideas are helpful, thank you!
What I have worked on so far, after the first response seen below:
with open(in_file1, "r") as names:
for line in names:
file1_list = [i.strip() for i in line.split()]
file1_str = str(file1_list)
with open(in_file2, "r") as symbols:
for line in symbols:
items = line.split("\t")
items = str(items)
matches = items.startswith(file1_str)
print matches
This code returns False when I know there should be some matches.

string.startswith() No need for regex, if it's only trailing characters
>>> g = "COPB2A"
>>> f = "COPB2"
>>> g.startswith(f)
True
Here is a working piece of code:
file1_list = []
with open(in_file1, "r") as names:
for line in names:
line_items = line.split()
for item in line_items:
file1_list.append(item)
matches = []
with open(in_file2, "r") as symbols:
for line in symbols:
file2_items = line.split()
for file2_item in file2_items:
for file1_item in file1_list:
if file2_item.startswith(file1_item):
matches.append(file2_item)
print file2_item
print matches
It may be quite slow for large files. If it's unacceptable, I could try to think about how to optimize it.

You might take a look at difflib if you need a more generic solution. Keep in mind it is a big import with lots of overhead so only use it if you really need to. Here is another question that is somewhat similar.
https://stackoverflow.com/questions/1209800/difference-between-two-strings-in-python-php

Assuming you loaded the files into lists X, Y.
## match if a or b is equal to or substring of one another in a case-sensitive way
def Match( a, b):
return a.find(b[0:min(len(a),len(b))-1])
common_words = {};
for a in X:
common_words[a]=[];
for b in Y:
if ( Match( a, b ) ):
common_words[a].append(b);
If you want to use regular expressions to do the matching, you want to use "beginning of word match" operator "^".
import re
def MatchRe( a, b ):
# make sure longer string is in 'a'.
if ( len(a) < len(b) ):
a, b = b, a;
exp = "^"+b;
q = re.match(exp,a);
if ( not q ):
return False; #no match
return True; #access q.group(0) for matches

Importing data from a text file using python

I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters.
The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8 and the last one is 5 - but again, some are spaces).
What I want to do is import the entires (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.

Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:
import struct
def parsefile(filename):
with open(filename) as myfile:
for line in myfile:
line = line.rstrip('\n')
fields = struct.unpack('11s11s8s8s5s', line)
if 'OW' in fields[1]:
yield (int(fields[3]), int(fields[4]))
Usage:
if __name__ == '__main__':
for field in parsefile('file.txt'):
print field
Test data:
1234567890a1234567890a123456781234567812345
something maybe OW d 111111118888888855555
aaaaa bbbbb 1234 1212121233333
other thinganother OW 121212 6666666644444
Output:
(88888888, 55555)
(66666666, 44444)

In Python you can extract a substring at known positions using a slice - this is normally done with the list[start:end] syntax. However you can also create slice objects that you can use later to do the indexing.
So you can do something like this:
columns = [slice(11,22), slice(30,38), slice(38,44)]
myfile = open('some/file/path')
for line in myfile:
fields = [line[column].strip() for column in columns]
if "OW" in fields[0]:
value1 = int(fields[1])
value12 = int(fields[2])
....
Separating out the slices into a list makes it easy to change the code if the data format changes, or you need to do stuff with the other fields.

Here's a function which might help you:
def rows(f, columnSizes):
while True:
row = {}
for (key, size) in columnSizes:
value = f.read(size)
if len(value) < size: # EOF
return
row[key] = value
yield row
for an example of how it's used:
from StringIO import StringIO
sample = StringIO("""aaabbbccc
d e f
g h i
""")
for row in rows(sample, [('first', 3),
('second', 3),
('third', 4)]):
print repr(row)
Note that unlike the other answers, this example is not line-delimited (it uses the file purely as a provider of bytes, not an iterator of lines), since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either; the newline is taken into account specifically.
You can test if one string is a substring of another with the 'in' operator. For example,
>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True
So in this case, you might do
if 'OW' in row['third']:
stuff()
but you can obviously test any field for any value as you see fit.

entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])
for num1, num2 in entries:
# whatever

entries = []
with open('my_file.txt', 'r') as f:
for line in f.read().splitlines()
line = line.split()
if line[1].find('OW') >= 0
entries.append( ( int(line[-2]) , int(line[-1]) ) )
entries is an array containing tuples of the last two entries
edit: oops

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Replace column based strings with multiple with pre-defined values - Python - python

You can make use of str.translate and str.maketrans to make your life a lot easier here: In [1]: fnd = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' ...: rpl = 'QWERTYASDFGHNBVCXZOPLKMNHY' ...: trns = str.maketrans(fnd, rpl) In [2]: 'ABCD'.translate(trns) Out[2]: 'QWER' In [4]: 'UV'.translate(trns) Out[4]: 'LK'

Related

How to split string with multiple delimiters in Python?

python script not joining strings as expected

How do i format the ouput of a list of list into a textfile properly?

Matching in Python lists when there are extra characters

Importing data from a text file using python

Categories

Resources