I have a list of lists of sequences, and a corresponding list of lists of names.
testSequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
I also have a list of all the identifying parts of the names:
taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
I am trying to produce a new list in which each item corresponds to one of the "identifying parts of the names", and the string is made up of all the sequences for that name.
If a name and its sequence do not appear in one of the inner lists (e.g. there is no redFish or blueFish in the first list of testNames), I want to add a string of hyphens the same length as the sequences in that list. This would give me this output:
['aaaa--AAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGG']
I have this piece of code to do this.
complete = [''] * len(taxonNames)
for i in range(len(testSequences)):
    for j in range(len(taxonNames)):
        sequenceLength = len(testSequences[i][0])
        for k in range(len(testSequences[i])):
            if taxonNames[j] in testNames[i][k]:
                complete[j].join(testSequences[i][k])
            if taxonNames[j] not in testNames[i][k]:
                hyphenString = "-" * sequenceLength
                complete[j].join(hyphenString)
print complete
"complete" should give my final output as explained above, but it comes out looking like this:
['', '', '', '']
How can I fix my code to give me the correct answer?
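The main issue with your code, which also makes it very hard to understand, is that you're not really leveraging the language elements that make Python so strong. As for the empty strings: str.join returns a new string rather than changing complete[j] in place, so every result in your loops is thrown away.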
Here's a solution to your problem that works:
test_sequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']
def get_seqs(taxon_name, sequences_list, names_list):
    for seqs, names in zip(sequences_list, names_list):
        found_seq = None
        for seq, name in zip(seqs, names):
            if taxon_name in name:
                found_seq = seq
                break
        yield found_seq if found_seq else '-' * len(seqs[0])
result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
          for taxon_name in taxon_names]
print(result)
The generator get_seqs pairs up lists from test_sequences and test_names and for each pair, tries to find the sequence (seq) for the name (name) that matches and yields it, or yields a string of the right number of hyphens for that list of sequences.
The generator (a function that yields multiple values) has code that quite literally follows the explanation above.
The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.
You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)
Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't to avoid further deviating from your example.
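For comparison, here is a minimal sketch of a loop-based version that stays closer to the original code. It is not the generator solution above, just one way the loops could be repaired, assuming the same three input lists. Because strings are immutable, the result is accumulated with += instead of calling join and discarding what it returns.

```python
test_sequences = [
    ['aaaa', 'cccc'],
    ['tt', 'gg'],
    ['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
    ['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
    ['>xx_redFish |zxx', '>xx_blueFish |zxx'],
    ['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']

complete = [''] * len(taxon_names)
for seqs, names in zip(test_sequences, test_names):
    # Filler for taxa missing from this group, sized like its first sequence.
    filler = '-' * len(seqs[0])
    for j, taxon in enumerate(taxon_names):
        # First sequence whose name contains the taxon, else the filler.
        match = next((s for s, n in zip(seqs, names) if taxon in n), filler)
        # Strings are immutable, so build the result by reassignment.
        complete[j] += match

print(complete)
```

Note that the filler is added once per group, not once per non-matching name, which is the other thing the original nested loops got wrong.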
Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures).
The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []
for file in files:
    d = dict()
    with open(file, 'r') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'): # for ex., >xx_oneFish |xxx
                underscore = line.index('_')
                space = line.index(' ')
                key = line[underscore+1:space]
            else:
                d[key] = line
    for i, key in enumerate(differentNames):
        data[i].append(d.get(key, '-' * 4))
for array in data:
    final.append(''.join(array))
print(final)
Prints:
['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']
I currently have several lists with one element inside each. Is there a way Python can combine the first two lists into one? My attempt is the last for loop in the code below. As the actual output shows, it only duplicates the element and doesn't get the second one. I need the first and second elements to be listed together. Please note that the post I made earlier is not a duplicate of the post mentioned there: the post the moderator suggested answers a question on how to split a SINGLE list into even chunks, whereas I am asking how to group many lists together in twos. Basically, I am opening a file, looking for the values between the strings 'cdc or dcc\s(space)' and returning those values. I then want to compare each one to the string that comes next.
text.txt
^random binary characters d1234 d0123456789d 1234c null null null d34 dc49416494949 c3456
output:
['d1234d0123456789d1234c']
['d34dc49416494949c3456']
expected output:
['d1234d0123456789d1234c','d34dc49416494949c3456']
code:
with open('text.txt', 'r', encoding="ISO-8859-1") as in_file:
    data = in_file.readlines()
    for row in data:
        micr_ocr_line = re.findall(r'd[^d]*d[^d]*c[0-9]+|d[^d]*d[^d]*c\s+[0-9]+', row)
        for r in micr_ocr_line:
            rmve_spcl_char = re.sub (r'([^a-zA-Z-0-9]+?)', '', r)
            rmve_spcl_char = re.sub(r'(c\d{4,}).*', r'\1', rmve_spcl_char)
            a = [l for l in rmve_spcl_char.split('\n')]
            for previous, current in zip_longest(a, a[::1]):
                print(previous, current)
                micr_ocr_dat = [previous, current]
                print(micr_ocr_dat)
                micr_ocr_dat_l.append(micr_ocr_dat)
You've almost got it, I think; you just need to slice slightly differently:
a = [l for l in rmve_spcl_char.split('\n')]
for previous, current in zip(a[::2], a[1::2]):
    print(previous, current)
    micr_ocr_dat = [previous, current]
    print(micr_ocr_dat)
    micr_ocr_dat_l.append(micr_ocr_dat)
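The slicing idea can be tried in isolation. The sketch below assumes the cleaned strings have first been collected into one flat list (in the question, `a` is rebuilt for every match, so it may only ever hold one string). The helper name `pair_up` is hypothetical, and using zip_longest so that an odd trailing item is kept with a fill value is a design choice of this sketch, not something the answer above does.

```python
from itertools import zip_longest

def pair_up(items, fill=None):
    """Group consecutive items in twos: [a, b, c, d] -> [[a, b], [c, d]]."""
    # items[::2] holds the 1st, 3rd, ... items; items[1::2] the 2nd, 4th, ...
    return [list(pair) for pair in zip_longest(items[::2], items[1::2], fillvalue=fill)]

lines = ['d1234d0123456789d1234c', 'd34dc49416494949c3456']
pairs = pair_up(lines)
print(pairs)
```

With plain zip instead of zip_longest, a leftover item at the end of an odd-length list would be silently dropped.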
By the way, you can use len() to get the length of the list and range() with a step of 2 in the for loop.
Here is a way to split it into pairs:
for fp in dat_filepath:
    with open(fp, 'r', encoding="ISO-8859-1") as in_file:
        data = in_file.readlines()
    listresult = []
    for a in range(0, len(data), 2):
        # slicing avoids an IndexError when the number of lines is odd
        listresult.append(data[a:a + 2])
    print(listresult)
The resulting list should be the data you expected.
Let's say I have a ton of HTML with no newlines. I want to get each element into a list.
input = "<head><title>Example Title</title></head>"
a_list = ["<head>", "<title>Example Title</title>", "</head>"]
Something like such. Splitting between each ><.
But in Python, I don't know of a way to do that. I can only split at that string, which removes it from the output. I want to keep it, and split between the two angle brackets.
How can this be done?
Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.
# initial input
a = "<head><title>Example Title</title></head>"
# split list
b = a.split('><')
# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]
# initialize new list
a_list = []
# fill new list with formatted elements
for i in range(len(b)):
    a_list.append('<{}>'.format(b[i]))
This will output the given list in python 2.7.2, but it should work in python 3 as well.
You can try this:
import re
a = "<head><title>Example Title</title></head>"
data = re.split("><", a)
new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]
Output:
['<head>', '<title>Example Title</title>', '</head>']
The shortest approach using re.findall() function on extended example:
import re
# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)
The output:
['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
Based on the answers by other people, I made this.
It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.
Here, I got rid of one extra argument by combining the two characters into a string. Anyways,
def split_between(string, chars):
    if len(chars) != 2: raise IndexError("Argument chars must contain two characters.")
    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]
    return result_list
Credit goes to #cforeman and #Ajax1234.
Or even simpler, this:
input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])
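Since the question asked for a way to split without re-adding characters, one more option is a zero-width split. This is only a sketch: it assumes Python 3.7 or later, where re.split accepts patterns that match the empty string, and the function name `split_between_tags` is made up for illustration. The pattern matches the empty position that has a `>` just before it and a `<` just after it, so nothing is consumed.

```python
import re

def split_between_tags(html):
    """Split at every position between '>' and '<' without removing either."""
    # (?<=>) is a lookbehind for '>', (?=<) a lookahead for '<';
    # together they match an empty string, so both characters survive.
    return re.split(r'(?<=>)(?=<)', html)

a = "<head><title>Example Title</title></head>"
a_list = split_between_tags(a)
print(a_list)
```

Like the other answers here, this is plain string splitting rather than HTML parsing, so it knows nothing about nesting or attributes.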
I am trying to match elements from two lists and write them to a file: match rows from both files on col[0] and print certain columns to a new file.
with open('~/gf_out.txt', 'w') as w:
    w.write('\t'.join(headers) + '\n')
    for i in d1: #list1
        for j in d2: # list2
            if i[0] == j[0]:
                out = ((j[0:10]),i[1],i[2],j[11],j[12])
                # print out
                w.write('\t'.join(out) + '\n')
TypeError: sequence item 0: expected string, list found
If out is changed to
out = (str(j[0:10]),i[1],i[2],j[11],j[12])
the final output has [ ] around the first 10 columns. How can this be fixed?
ANALYSIS
Your problem is right where the error message (certainly) told you, and just what it described ... once you're comfortable enough with Python to interpret the description.
out = ((j[0:10]),i[1],i[2],j[11],j[12])
w.write('\t'.join(out) + '\n')
join operates on a sequence of strings. You gave it a sequence, but the first element of that sequence is the list j[0:10]; the extra parentheses around it do not change that.
REMEDY
You have nested lists, so you need nested joins.
sep = '\t' # separator
out_0 = sep.join(j[0:10]) # join the first ten columns into one string
out_line = sep.join([out_0, i[1], i[2], j[11], j[12]]) # join takes a single iterable
w.write(out_line + '\n')
Yes, you can recombine this to a single-line write; I broke it down to make the logic clear.
If this doesn't match your needs, then please provide the required MCVE to clarify the problems.
What exactly do you want it to do? j[0:10] is a list, so if you convert it to a string, it will have square brackets. If you want those elements to be joined by tabs as well, you need to either do that explicitly or concatenate it with the other values instead of embedding it.
out = ('\t'.join(j[0:10]),i[1],i[2],j[11],j[12])
or
out = j[0:10] + [i[1],i[2],j[11],j[12]]
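To see the concatenation variant work end to end, here is a small self-contained sketch. The sample rows and the helper name `build_row` are invented for illustration; only the column positions (0 to 9, 11 and 12 from j, 1 and 2 from i) come from the question.

```python
def build_row(i, j, sep='\t'):
    """Flatten the selected columns into one list of strings, then join once."""
    # list(...) also covers the case where j is a tuple rather than a list.
    fields = list(j[0:10]) + [i[1], i[2], j[11], j[12]]
    return sep.join(fields)

# Made-up rows: i has 3 columns, j has 13 (indices 0..12).
i = ['id1', 'x', 'y']
j = ['id1'] + ['c%d' % n for n in range(1, 13)]

row = build_row(i, j)
print(row)
```

Because every element handed to join is already a string, there is no TypeError and no square brackets in the output.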
I have a file in which I am trying to replace parts of a line with another word.
It looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
I need to delete everything but bob123#bobscarshop.com, but I also need to match 23rh32o3hro2rh2 with 23rh32o3hro2rh2:poniacvibe from a different text file and place poniacvibe in front of bob123#bobscarshop.com,
so it would look like this: bob123#bobscarshop.com:poniacvibe
I've had a hard time trying to do this. I think I would have to split bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":"), but some of the lines have a colon (:) in a spot where I don't want the line to be split, if that makes any sense...
If anyone could help, I would really appreciate it.
ok, it looks to me like you are using a colon : to separate your strings.
in this case you can use .split(":") to break your strings into their component substrings
eg:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
and assuming your substrings will always be in the same order, and the same number of substrings in the main string you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")
secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")
if firstdata[3] == seconddata[0]:
    outputdata = firstdata
    outputdata.insert(1,seconddata[1])
    outputstring = ""
    for item in outputdata:
        if outputstring == "":
            outputstring = item
        else:
            outputstring = outputstring + ":" + item
what this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string using the colon as the separator
the reason your strings need to be the same length is because the index is being used to find the relevant strings rather than trying to use some form of string type matching (which gets much more complex)
if you can keep your data in this form it gets much simpler.
to protect against malformed data (lists too short) you can explicitly test for them before you start using len(list) to see how many elements are in it.
or you could let it run and catch the exception, however in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
hope this helps
James
EDIT:
ok so if you are trying to match up a long list of strings from files you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode = "r")
secondfile= open("secondfile.txt",mode = "r")
first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()
first_data = []
for item in first_raw_data:
    first_data.append(item.replace("\n","").split(":"))
second_data = []
for item in second_raw_data:
    second_data.append(item.replace("\n","").split(":"))
output_strings = []
for item in first_data:
    searchstring = item[3]
    for entry in second_data:
        if searchstring == entry[0]:
            output_data = item
            output_string = ""
            output_data.insert(1,entry[1])
            for data in output_data:
                if output_string == "":
                    output_string = data
                else:
                    output_string = output_string + ":" + data
            output_strings.append(output_string)
            break
for entry in output_strings:
    print(entry)
This should achieve what you're after and, as a proof of concept, will print the resulting list of strings for you.
if you have any questions feel free to ask.
James
Second edit:
to make this output the results into a file change the last two lines to:
outputfile = open("outputfile.txt", mode = "w")
for entry in output_strings:
    outputfile.write(entry+"\n")
outputfile.close()
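A variation on the same matching idea, sketched under a few assumptions: the wanted output is exactly email:value (as in the question's example), the email is always the second field, and the key is always the second-to-last field. Indexing the key from the end is meant to tolerate an extra colon in the middle fields. The names `build_lookup` and `rewrite_line` are hypothetical, and a dict replaces the inner loop over the second file.

```python
def build_lookup(pair_lines):
    """Map each key to its value from lines shaped like 'key:value'."""
    lookup = {}
    for line in pair_lines:
        key, sep, value = line.strip().partition(":")
        if sep:  # skip lines that contain no colon at all
            lookup[key] = value
    return lookup

def rewrite_line(line, lookup):
    """Return 'email:value' for a matching line, or None when there is no match."""
    fields = line.strip().split(":")
    if len(fields) < 4:
        return None  # malformed line, too few fields
    email, key = fields[1], fields[-2]
    value = lookup.get(key)
    if value is None:
        return None
    return email + ":" + value

first_lines = ["bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"]
second_lines = ["23rh32o3hro2rh2:poniacvibe"]

lookup = build_lookup(second_lines)
results = [r for r in (rewrite_line(l, lookup) for l in first_lines) if r is not None]
print(results)
```

If the stray colons can also appear inside the first two fields, this positional approach would need a different rule for locating the email.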
I'm building an XML parser in python for an SVG file. It will eventually become specific instructions for stepper motors.
SVG files contain commands such as 'M', 'C' and 'L.' The path data might look like this:
[M199.66, 0.50C199.6, 0.50...0.50Z]
When I extracted the path data, it's a list of one item (which is a string). I split the long string into multiple strings:
[u'M199.6', u'0.50C199.66', u'0.50']
The 'M, C and L' commands are important - I'm having difficulty splitting '0.5C199.6' into '0.5' and 'C199.6' because it only exists for certain items in the list, and I'd like to retain the C and not discard it. This is what I have so far:
for item in path_strings[0]:
    s=string.split(path_strings[0], ',')
    print s
    break
for i in range(len(s)):
    coordinates=string.split(s[i],'C')
    print coordinates
    break
You could try breaking it into substrings like this:
whole = "0.5C199.66"
start = whole[0:whole.find("C")]
end = whole[whole.find("C"):]
That should give you start == "0.5" and end == "C199.66"
Alternatively, you could use the index method instead of find. Unlike find, which returns -1 when the substring can't be found, index raises a ValueError. That gives you an easy way to determine that no 'C' command is present in the current string.
http://docs.python.org/2/library/string.html#string-functions
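A small sketch of the slicing idea wrapped in a helper. The name `split_at_command` and the decision to return the token unchanged when there is nothing to split are assumptions of this sketch. The explicit check matters because find returns -1 on a miss, and slicing with -1 would silently cut off the last character instead of failing.

```python
def split_at_command(token, command="C"):
    """Split 'xxxCyyy' into ['xxx', 'Cyyy'], keeping the command letter."""
    pos = token.find(command)
    # pos == -1: command absent; pos == 0: token already starts with it.
    if pos <= 0:
        return [token]
    return [token[:pos], token[pos:]]

print(split_at_command("0.5C199.66"))
print(split_at_command("0.50"))
```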
Use a regex to search for the commands ([MCL]).
import re
lst = [u'M199.6', u'0.50C199.66', u'0.50']
for i, j in enumerate(lst):
    m = re.search('(.+?)([MCL].+)', j)
    if m:
        print [m.group(1), m.group(2)] # = coordinates from your example
        lst[i:i+1] = [m.group(1), m.group(2)] # replace the item in lst with the two split parts
        # or do something else with the coordinates, whatever you want.
print lst
This splits your list into:
[u'M199.6', u'0.50', u'C199.66', u'0.50']
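Another way to look at the problem is to tokenize the whole path string in one pass instead of splitting on commas first. The following Python 3 sketch is deliberately narrow: it assumes only the uppercase commands M, C, L and Z and plain decimal numbers, whereas real SVG path data also allows lowercase (relative) commands and other number forms. The function name `tokenize_path` is made up for illustration.

```python
import re

# One token is either a single command letter or a (possibly negative) decimal.
TOKEN = re.compile(r'[MCLZ]|-?\d+(?:\.\d+)?')

def tokenize_path(path):
    """Return command letters and numbers as separate strings, in order."""
    return TOKEN.findall(path)

tokens = tokenize_path("M199.66,0.50C199.6,0.50Z")
print(tokens)
```

Keeping commands and coordinates as separate tokens may make the later translation into motor instructions simpler, since each command letter is followed directly by its numeric arguments.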