I currently have several lists, each containing one element. Is there a way for Python to combine the first two lists into one? My attempt is the last for loop in the code below. As the actual output shows, it only duplicates the first element and never picks up the second; I need the first and second elements listed together. Please note that my earlier post is not a duplicate of the post the moderator linked: that post answers how to split a SINGLE list into even chunks, whereas I am asking how to group many lists together in twos. Basically, I am opening a file, looking for the values between the strings 'cdc or dcc\s(space)', and returning those values. I then want to compare each value to the string that comes next.
text.txt
^random binary characters d1234 d0123456789d 1234c null null null d34 dc49416494949 c3456
output:
['d1234d0123456789d1234c']
['d34dc49416494949c3456']
expected output:
['d1234d0123456789d1234c','d34dc49416494949c3456']
code:
import re
from itertools import zip_longest

micr_ocr_dat_l = []
with open('text.txt', 'r', encoding="ISO-8859-1") as in_file:
    data = in_file.readlines()
    for row in data:
        micr_ocr_line = re.findall(r'd[^d]*d[^d]*c[0-9]+|d[^d]*d[^d]*c\s+[0-9]+', row)
        for r in micr_ocr_line:
            rmve_spcl_char = re.sub(r'([^a-zA-Z-0-9]+?)', '', r)
            rmve_spcl_char = re.sub(r'(c\d{4,}).*', r'\1', rmve_spcl_char)
            a = [l for l in rmve_spcl_char.split('\n')]
            for previous, current in zip_longest(a, a[::1]):
                print(previous, current)
                micr_ocr_dat = [previous, current]
                print(micr_ocr_dat)
                micr_ocr_dat_l.append(micr_ocr_dat)
You've almost got it, I think. Just slice slightly differently:
a = [l for l in rmve_spcl_char.split('\n')]
for previous, current in zip(a[::2], a[1::2]):
    print(previous, current)
    micr_ocr_dat = [previous, current]
    print(micr_ocr_dat)
    micr_ocr_dat_l.append(micr_ocr_dat)
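One thing that seems worth checking: in the question's code, `a` is rebuilt for every single regex match, so it only ever holds one string and there is nothing inside it to pair up. A minimal sketch, assuming the cleaned matches are first collected into one flat list (the names `pair_up` and `matches` are made up for illustration), might look like this:

```python
def pair_up(items):
    """Group a flat list into consecutive pairs: [a, b, c, d] -> [[a, b], [c, d]]."""
    # items[::2] is every even-indexed element, items[1::2] every odd-indexed one.
    # zip stops at the shorter input, so a trailing unpaired element is dropped.
    return [list(pair) for pair in zip(items[::2], items[1::2])]


# Hypothetical input: the two cleaned strings from the question's output.
matches = ['d1234d0123456789d1234c', 'd34dc49416494949c3456']
print(pair_up(matches))
# [['d1234d0123456789d1234c', 'd34dc49416494949c3456']]
```

If a leftover element should be kept rather than dropped, `itertools.zip_longest` could replace `zip` here.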
By the way, you should use len() to get the length of the list and pass it to range() in the for loop.
Here is a way to split it into pairs:
for fp in dat_filepath:
    with open(fp, 'r', encoding="ISO-8859-1") as in_file:
        data = in_file.readlines()
    listresult = []
    # step through the lines two at a time; slicing avoids an IndexError
    # when the file has an odd number of lines
    for a in range(0, len(data), 2):
        listresult.append(data[a:a + 2])
    print(listresult)
listresult should then hold the data you expected.
I have a list of lists of sequences, and a corresponding list of lists of names.
testSequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
I also have a list of all the identifying parts of the names:
taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
I am trying to produce a new list, where each item in the list will correspond to one of the "identifying parts of the names", and the string will be made up of all the sequences for that name.
If a name and its sequence do not appear in one of the inner lists (e.g. there is no redFish or blueFish in the first list of testNames), I want to add a string of hyphens the same length as the sequences in that list. This would give me this output:
['aaaa--AAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGG']
I have this piece of code to do this.
complete = [''] * len(taxonNames)
for i in range(len(testSequences)):
    for j in range(len(taxonNames)):
        sequenceLength = len(testSequences[i][0])
        for k in range(len(testSequences[i])):
            if taxonNames[j] in testNames[i][k]:
                complete[j].join(testSequences[i][k])
            if taxonNames[j] not in testNames[i][k]:
                hyphenString = "-" * sequenceLength
                complete[j].join(hyphenString)
print complete
"complete" should give my final output as explained above, but it comes out looking like this:
['', '', '', '']
How can I fix my code to give me the correct answer?
The main issue with your code, which makes it very hard to understand, is that you're not really leveraging the language elements that make Python so strong.
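As for why the question's code prints only empty strings: str.join returns a new string and never changes the string it is called on, so every complete[j].join(...) result is simply discarded. A tiny illustration (the two-element list is just an example):

```python
complete = ['', '']

# str.join builds and returns a NEW string; the result here is thrown away.
complete[0].join('aaaa')
print(complete)   # ['', '']

# Assigning the result back is what actually updates the list.
complete[0] += 'aaaa'
complete[1] += '-' * 4
print(complete)   # ['aaaa', '----']
```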
Here's a solution to your problem that works:
test_sequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']
def get_seqs(taxon_name, sequences_list, names_list):
    for seqs, names in zip(sequences_list, names_list):
        found_seq = None
        for seq, name in zip(seqs, names):
            if taxon_name in name:
                found_seq = seq
                break
        yield found_seq if found_seq else '-' * len(seqs[0])


result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
          for taxon_name in taxon_names]
print(result)
The generator get_seqs pairs up lists from test_sequences and test_names and for each pair, tries to find the sequence (seq) for the name (name) that matches and yields it, or yields a string of the right number of hyphens for that list of sequences.
The generator (a function that yields multiple values) has code that quite literally follows the explanation above.
The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.
You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)
Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't to avoid further deviating from your example.
Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures).
The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []
for file in files:
    d = dict()
    with open(file, 'r') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'):  # for ex., >xx_oneFish |xxx
                underscore = line.index('_')
                space = line.index(' ')
                key = line[underscore+1:space]
            else:
                d[key] = line
    for i, key in enumerate(differentNames):
        data[i].append(d.get(key, '-' * 4))
for array in data:
    final.append(''.join(array))
print(final)
Prints:
['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']
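On the open point about how many hyphens to use: one possible choice, assuming every sequence within a single file has the same length, is to take the length of any sequence found in that file. The sketch below works on in-memory dicts instead of files, and the names `pad_missing` and `per_file` are invented for illustration:

```python
def pad_missing(per_file, names):
    """per_file: one dict per file, mapping taxon name -> sequence."""
    columns = []
    for d in per_file:
        # Assumption: all sequences in one file share a length, so any one
        # of them tells us how many hyphens a missing entry needs.
        width = len(next(iter(d.values()))) if d else 0
        columns.append([d.get(name, '-' * width) for name in names])
    # Transpose so each taxon gets its pieces joined in file order.
    return [''.join(parts) for parts in zip(*columns)]


per_file = [
    {'oneFish': 'aaaa', 'twoFish': 'cccc'},
    {'redFish': 'tt', 'blueFish': 'gg'},
]
names = ['oneFish', 'twoFish', 'redFish', 'blueFish']
print(pad_missing(per_file, names))
# ['aaaa--', 'cccc--', '----tt', '----gg']
```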
Let's say I have a ton of HTML with no newlines. I want to get each element into a list.
input = "<head><title>Example Title</title></head>"
a_list = ["<head>", "<title>Example Title</title>", "</head>"]
Something like that: splitting between each ><.
But in Python, I don't know of a way to do that. I can only split at that string, which removes it from the output. I want to keep it and split between the two angle brackets.
How can this be done?
Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.
# initial input
a = "<head><title>Example Title</title></head>"
# split list
b = a.split('><')
# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]
# initialize new list
a_list = []
# fill new list with formatted elements
for i in range(len(b)):
    a_list.append('<{}>'.format(b[i]))
This produces the given list in Python 2.7.2, and it should work in Python 3 as well.
You can try this:
import re
a = "<head><title>Example Title</title></head>"
data = re.split("><", a)
new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]
Output:
['<head>', '<title>Example Title</title>', '</head>']
The shortest approach uses re.findall(), shown here on an extended example:
import re
# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)
The output (note that the bare text "hello, " is not captured by this pattern):
['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
Based on the answers by other people, I made this.
It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.
Here, I got rid of one extra argument by combining the two characters into a string. Anyways,
def split_between(string, chars):
    if len(chars) != 2:
        raise IndexError("Argument chars must contain two characters.")
    # re-add the delimiter characters that split() removed
    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
    # the first and last items gained one extra character each; trim them
    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]
    return result_list
Credit goes to @cforeman and @Ajax1234.
Or even simpler, this:
input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])
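For the original wish of not re-adding any characters, a zero-width split may be worth a look. This is only a sketch, and it relies on re.split accepting patterns that match the empty string, which needs Python 3.7 or later; the function name `split_between_tags` is made up here:

```python
import re


def split_between_tags(html):
    # Zero-width pattern: it matches the empty position that has '>' just
    # before it and '<' just after it, so the split consumes no characters.
    return re.split(r'(?<=>)(?=<)', html)


print(split_between_tags("<head><title>Example Title</title></head>"))
# ['<head>', '<title>Example Title</title>', '</head>']
```

Like the other answers, this treats the input as plain text; for real-world HTML a proper parser is usually the safer route.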
I have a long string variable full of hex values:
hexValues = 'AA08E3020202AA08E302AA1AA08E3020101' etc..
The first 2 bytes (AA08) are a signature for the start of a frame, and the rest of the data up to the next AA08 is the content of that frame.
I want to slice the string into a list based on the reoccurring start of frame sign, e.g:
list = [AA08, E3020202, AA08, F25S1212, AA08, 42ABC82] etc...
I'm not sure how I can split the string up like this. Some of the frames are also corrupted, where the start of the frame won't have AA08 but maybe AA01, so I'd need some kind of regex to spot these.
If I do list = hexValues.split('AA08'), the list just loses all the frame starts.
So I'm a bit stuck.
Newbie to python.
Thanks
For the case when you don't have "corrupted" data the following should do:
hex_values = 'AA08E3020202AA08E302AA1AA08E3020101'
delimiter = hex_values[:4]
hex_values = hex_values.replace(delimiter, ',' + delimiter + ',')
hex_list = hex_values.split(',')[1:]
print(hex_list)
['AA08', 'E3020202', 'AA08', 'E302AA1', 'AA08', 'E3020101']
Without considering corruptions, you may try this.
l = []
for s in hexValues.split('AA08'):
    if s:
        l += ['AA08', s]
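For the corrupted frames mentioned in the question, a regex split with a capturing group might do: re.split keeps whatever the group matched. The pattern below is purely an assumption (that a damaged signature still looks like AA0 followed by one hex digit), and `split_frames` is a made-up name:

```python
import re


def split_frames(hex_string, signature=r'AA0[0-9A-F]'):
    # Wrapping the signature in a capturing group makes re.split return
    # each matched signature as its own list item.
    parts = re.split('(' + signature + ')', hex_string)
    # Drop the empty string that appears before the very first signature.
    return [p for p in parts if p]


print(split_frames('AA08E3020202AA08E302AA1AA08E3020101'))
# ['AA08', 'E3020202', 'AA08', 'E302AA1', 'AA08', 'E3020101']
print(split_frames('AA08E3020202AA01E302'))
# ['AA08', 'E3020202', 'AA01', 'E302']
```

Note that any payload which happens to contain the same four characters would also be split there, so this is a starting point rather than a robust frame parser.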
I'm trying to read 200 txt files and do some preprocessing.
1) How could I write simpler code instead of repeating the same code for each of the txt files?
2) Can I combine a regular expression with rstrip?
-> Mainly, I want to get rid of "\n", but sometimes it is stuck to other characters. So what I want is to remove every \n as well as words that are combined with \n (e.g. "\n?", "!\n" and so on).
3) In the last line, is there a way to add all the lists into one list with simpler code?
data = open("job (0).txt", 'r').read()
rows0 = data.split(" ")
rows0 = [item.rstrip('\n?, \n') for item in rows0]
data = open("job (1).txt", 'r').read()
rows1 = data.split(" ")
rows1 = [item.rstrip('\n?, \n') for item in rows1]
.....(up to 200th file)
data = open("job (199).txt", 'r').read()
rows199 = data.split(" ")
rows199 = [item.rstrip('\n?, \n') for item in rows199]
ds_l = rows0 + rows1 + ... rows199
First of all, I'm not a Python expert. But since the question has been around for a while already... (At least I'm safe from downvotes if no one looks at this^^)
1) Use loops, and read a programming tutorial.
See for example this post How do I read a file line-by-line into a list? on how to get a list of all rows. Then you can loop over the list.
2) No idea whether it's possible to use regexes with strip, this brought me here, so tell me if you find out.
It's not clear what exactly you are asking for: do you want to get rid of all (space-separated) words that contain any "\n", or just cut the "\n", "\n?", ... parts out of the words?
In the first case, a simple, inelegant solution would be to loop over the rows and filter the words in each row, something like
# loop over rows with i as index
row = rows[i].split(" ")
# keep only the words without a newline (deleting items from a list
# while looping over its indices would skip elements)
row = [word for word in row if "\n" not in word]
rows[i] = " ".join(row)
In the latter case, if there aren't many expressions you want to remove, you can probably use re.sub() somehow. Google helps ;)
3) If you have the rows as a list "rows" of strings, you can use join:
ds_1 = "".join(rows)
(For join: Python join: why is it string.join(list) instead of list.join(string)?)
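To make the three points concrete, here is a minimal sketch. It assumes that "combined with \n" means punctuation directly touching a newline, and that the files follow the job (0).txt ... job (199).txt naming from the question; `clean_words` and `load_all` are hypothetical helper names:

```python
import re
from pathlib import Path


def clean_words(text):
    # Assumption: replace each newline, together with any ?!,. characters
    # directly attached to it, by a single space, then split on whitespace.
    text = re.sub(r'[?!,.]*\n[?!,.]*', ' ', text)
    return text.split()


def load_all(paths):
    # One loop instead of 200 copies of the same three lines; extend()
    # builds the combined list as we go.
    words = []
    for path in paths:
        words.extend(clean_words(Path(path).read_text()))
    return words


print(clean_words("hello\n? world!\nagain, ok"))
# ['hello', 'world', 'again,', 'ok']

# Hypothetical usage with the question's file names:
# ds_l = load_all("job (%d).txt" % i for i in range(200))
```

The comma in 'again,' survives because it is not next to a newline; widening the character class, or a second re.sub, would be needed to strip punctuation everywhere.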
I am really new to Python and I am struggling with some problems while working on a student project. Basically, I am trying to read data from a text file that is formatted in columns. I store the data in a list of lists, sort and manipulate it, and write it to a file again. My problem is aligning the written data in proper columns. I found some approaches like
"%i, %f, %e" % (1000, 1000, 1000)
but I don't know how many columns there will be. So I wonder if there is a way to set all columns to a fixed width.
This is what the input data looks like:
2 232.248E-09 74.6825 2.5 5.00008 499.482
5 10. 74.6825 2.5 -16.4304 -12.3
This is how I store the data in a list of lists:
filename = getInput('MyPath', workdir)
lines = []
f = open(filename, 'r')
while 1:
    line = f.readline()
    if line == '':
        break
    splitted = line.split()
    lines.append(splitted)
f.close()
To write the data, I first put all the elements of each row into one string with a fixed gap between the elements. What I need instead is a fixed total width per column, including the element itself. I also don't know the number of columns in the file.
for k in xrange(len(lines)):
    stringlist=""
    for i in lines[k]:
        stringlist = stringlist+str(i)+' '
    lines[k] = stringlist+'\n'
f = open(workdir2, 'w')
for i in range(len(lines)):
    f.write(lines[i])
f.close()
This code basically works, but sadly the output isn't formatted properly.
Thank you very much in advance for any help on this issue!
You are absolutely right about being able to format widths as you have above using string formatting. But as you correctly point out, the tricky bit is doing this for a variable-sized output list. Instead, you could use the join() function:
output = ['a', 'b', 'c', 'd', 'e']
# format each column (len(output) of them) with a width of 10 spaces
width = [10] * len(output)
# write it out, using the join() function
with open('output_example', 'w') as f:
    f.write(''.join('%*s' % i for i in zip(width, output)))
will write out:
'         a         b         c         d         e'
As you can see, the length of the format array width is determined by the length of the output, len(output). This is flexible enough that you can generate it on the fly.
Hope this helps!
String formatting might be the way to go:
>>> print("%10s%9s" % ("test1", "test2"))
     test1    test2
Though you might want to first create strings from those numbers and then format them as I showed above.
I cannot fully follow your writing code, but try something along these lines:
# enumerate is a built-in, so no import is needed
with open(workdir2, 'w') as datei:
    for key, item in enumerate(zeilen):
        # the original format string was cut off after "%6.6"; a float
        # conversion is assumed here
        line = "%4i %6.6f\n" % (key, float(item))
        datei.write(line)
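Pulling the ideas in this thread together, a fixed column width for an unknown number of columns could be sketched as follows. The width of 14 and the name `format_rows` are arbitrary choices for illustration, and the rows are the two sample lines from the question:

```python
def format_rows(rows, width=14):
    # Right-align every cell to the same width, however many columns a
    # row has; str() lets the cells be strings or numbers.
    return [''.join(str(cell).rjust(width) for cell in row) for row in rows]


rows = [
    ['2', '232.248E-09', '74.6825', '2.5', '5.00008', '499.482'],
    ['5', '10.', '74.6825', '2.5', '-16.4304', '-12.3'],
]
for line in format_rows(rows):
    print(line)
```

If the widest cell is not known in advance, the width could be computed first, for example as the maximum cell length plus some padding.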