I'm having some trouble removing items from a list. I'm looking for a more elegant solution, preferably one that uses a single for-loop or filter.
The objective of the piece of code: remove all empty entries and all entries starting with a '#' from the config handle.
At the moment I'm using:
# Read the config file and put every line in a separate entry in a list
configHandle = [item.rstrip('\n') for item in open('config.conf')]
# Strip comment items from the configHandle
for item in configHandle:
    if item.startswith('#'):
        configHandle.remove(item)
# remove all empty items in handle
configHandle = filter(lambda a: a != '', configHandle)
print configHandle
This works but I think it is a bit of a nasty solution.
When I try:
# Read the config file and put every line in a separate entry in a list
configHandle = [item.rstrip('\n') for item in open('config.conf')]
# Strip comment items and empty items from the configHandle
for item in configHandle:
    if item.startswith('#'):
        configHandle.remove(item)
    elif len(item) == 0:
        configHandle.remove(item)
This, however, fails. I cannot figure out why.
Can someone push me in the right direction?
Because you're changing the list while iterating over it. You can use a list comprehension to get rid of this problem:
configHandle = [i for i in configHandle if i and not i.startswith('#')]
Also, when opening a file it is better to use a with statement, which closes the file automatically at the end of the block [1]:
with open('config.conf') as infile:
    configHandle = infile.read().splitlines()
configHandle = [line for line in configHandle if line and not line.startswith('#')]
1. There is no guarantee that external resources such as open files are released promptly by the garbage collector, so you need to close them explicitly, either by calling the file object's close() method or, more Pythonically, by using a with statement.
Don't remove items while you are iterating; it's a common pitfall.
You shouldn't modify a list that you're iterating over.
Instead you should use things like filter or list comprehensions.
configHandle = filter(lambda a: (a != '') and not a.startswith('#'), configHandle)
Your filter expression is good; just include the additional condition you're looking for:
configHandle = filter(lambda a: a != '' and not a.startswith('#'), configHandle)
There are other options if you don't want to use filter, but, as has been stated in other answers, it is a very bad idea to attempt to modify a list while you are iterating through it. The answers to this Stack Overflow question provide alternatives to using filter to remove from a list based on a condition.
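To see why the loop in the question fails, here is a minimal sketch. The sample lines and the helper name strip_config_lines are made up for illustration and are not taken from the original config file. Removing an element shifts the remaining ones to the left, so the loop skips whatever element follows each removal.

```python
# Hypothetical sample data standing in for the lines of config.conf.
sample = ['#first comment', '#second comment', 'key=value', '', 'other=1']

# Buggy version: removing while iterating skips the item after each removal.
buggy = list(sample)
for item in buggy:
    if item.startswith('#'):
        buggy.remove(item)
# '#second comment' survives because it slid into the slot that was just visited.


def strip_config_lines(lines):
    """Return a new list without empty lines and '#' comment lines."""
    return [line for line in lines if line and not line.startswith('#')]


cleaned = strip_config_lines(sample)
```

Building a new list leaves the original untouched, which is what makes the comprehension safe here.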
Related
I have a file of paths called test.txt
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
Notice that the number of lines is always even. My final goal is to parse this file and create a new one by looping through these paths two at a time. I am trying the enumerate function, but it does not step through the lines in pairs. Furthermore, I'm going out of range because the way I'm indexing is wrong. It would also be great if someone could tell me how to index properly with enumerate.
with open('./src/test.txt') as f:
    for index,line in enumerate(f):
        sample = re.search(r'pfg[\dGT]+',line)
        sample_string = sample.group(0)
        #print(sample_string)
        print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(sample_string,line,line[index+1]))
The result is something like this:
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"g","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"r","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"o","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"u","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"p","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"s","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/","library":"pfg002T"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz","fastq_2":"c","library":"pfg002T"},
Clearly the indexing is wrong, since it goes through the individual characters of my path (g, r, etc.) instead of printing the next path. For the first iteration the next path printed should be: "fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz".
I believe the problem can be tackled more elegantly with itertools; I just don't know how to do it. It would also be great if someone could tell me whether indexing with enumerate could work.
One problem is that you are trying to access the data from the second line of the pair before you have read it. Additionally, you cannot access the second line with line[index + 1], because that refers to a character in the current line, not to the next line, which hasn't been read yet.
So you need to keep track of pairs of lines. You can use the index provided by enumerate() to determine whether the current line is the first (because it is an even number) or the second (because it's odd). Store the name and path for fastq_1 when you read the first line. Only write the output on the second line. Like this:
import re
with open('test.txt') as f:
    for index, line in enumerate(f):
        if index % 2 == 0: # even, so this is the first line of a pair
            name = re.search(r'pfg[\dGT]+',line).group(0)
            fastq_1 = line.rstrip()
        else: # odd, so second line. Emit result
            fastq_2 = line.rstrip()
            print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(name, fastq_1, fastq_2))
line.rstrip() is required to remove the trailing newline character at the end of each line.
@mhawke already provided a good solution, but to give another approach, "looping through these ... on a two by two basis" can be done with the more_itertools.chunked function from the more_itertools library or with the grouper() recipe from the itertools documentation.
This also gives options for what should happen when the last line is an odd one; whether that should raise an error or pair it with a default value.
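As a rough sketch of the pairing idea using only the standard library, the same iterator can be passed to zip twice. The helper name pair_lines and the shortened sample paths are hypothetical. Note the assumption spelled out in the comments: with an odd number of lines, this version silently drops the last one.

```python
def pair_lines(lines):
    """Yield (first, second) tuples from an iterable of lines, two at a time."""
    iterator = iter(lines)
    # zip pulls from the same iterator twice per step, so items come out in pairs.
    # With an odd number of lines the final unpaired line is silently dropped.
    for first, second in zip(iterator, iterator):
        yield first.rstrip('\n'), second.rstrip('\n')


# Hypothetical shortened paths used only to show the pairing.
sample = ['a_1.fastq.gz\n', 'a_2.fastq.gz\n', 'b_1.fastq.gz\n', 'b_2.fastq.gz\n']
pairs = list(pair_lines(sample))
```

An open file object can be passed in directly, for example pair_lines(f) inside the with block.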
Note that line[index + 1] indexes a character within the current line's string; it does not give you the next line.
What you can do is read the file into a list and then use the index to access any line you want.
I still don't fully understand the goal: do you want fastq_1 and fastq_2 to step through consecutive lines, or should each path be assigned according to its key?
Code Syntax
import re
with open(path) as f:
    lis = list(f)  # read all lines so the next line can be reached by index
    for index, line in enumerate(lis):
        try:
            sample = re.search(r'pfg[\dGT]+',line)
            sample_string = sample.group(0)
            print(f'{{"name":"{sample_string}","readgroup":"{sample_string}","platform_unit":"{sample_string}","fastq_1":"{line}","fastq_2":"{lis[index+1]}","library":"{sample_string}"}},')
        except IndexError:  # the last line has no following line
            break
Output
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
","library":"pfg002T"},
I need to check whether the variable is in a list of lists, and if it is return the sublist. I've tried a number of different solutions but none of them work...
My code:
addressList = [['JohnSmith', 'NR87TYH', 'PE26RE', '1EnglandRoad', '67'],['JaneSmithe', 'UY34DSF', 'SW147EG', '23SouthDrive', '82'], ['JimmyJones', 'PL20DCH', 'NW33EX', '145EastRidings', '54']]
numPlate = "UY34DSF"
for sublist in addressList:
    if numPlate in sublist:
        print("Ding dong the witch is dead!")
I should clarify a few things here. The "addressList" variable is a CSV file stripped into a list that displays as it is written above.
This, or the other 15 or so different methods of doing this haven't worked but I feel this is the closest as the logic makes sense. Am I missing something obvious?
Thanks!
EDIT: OK, thanks for the answers. They made me double-check everything in my code, but it's still not working, so I have pasted the entire code below.
'line1' returns the value I want from the file.
'reading2' returns the list of lists as it should be.
From this, the iteration I included in my original question should work fine, but it doesn't. Any ideas?
import csv
fopen = open("StandardUKReg.txt","r")
line1 = fopen.readline()
fopen.close()
with open("OwnerInfoCSV.csv", "r") as inf:
    reading2 = list(csv.reader(inf, skipinitialspace=True))
for sublist in reading2:
    if line1 in sublist:
        print("yay")
print(line1) # This displays "UY34DSF"
print(reading2) # This displays the below list of lists:
[['Name', 'Reg', 'Postcode', 'Address', 'Speed'], ['JohnSmith', 'NR87TYH', 'PE26RE', '1EnglandRoad', '67'], ['JaneSmithe', 'UY34DSF', 'SW147EG', '23SouthDrive', '82'], ['JimmyJones', 'PL20DCH', 'NW33EX', '145EastRidings', '54'], ['VinnyJones', 'TD53BFC', 'NG167YT', '95BirdRoad', '79'], ['ClarkKent', 'FH45NFH', 'SE89YG', '8NorthAvenue', '56']]
This will create a list of all lists that contain the string.
matching_lists = [ls for ls in addressList if numPlate in ls]
When you read e.g. line1, it may have an invisible carriage return, newline, or space attached. Check this by e.g. printing the length of line1, or printing something immediately before/after it. Such non-visible characters will prevent line1 from matching anything in the list. Use strip() to clean it up, e.g.:
line1 = fopen.readline().strip()
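The following sketch combines both answers: it strips the value read from the file and returns the first matching sublist. The helper name find_row is hypothetical, and the rows are a subset of the data shown in the question.

```python
def find_row(rows, value):
    """Return the first sublist containing value (whitespace-stripped), or None."""
    wanted = value.strip()
    for row in rows:
        if wanted in row:
            return row
    return None


rows = [
    ['JohnSmith', 'NR87TYH', 'PE26RE', '1EnglandRoad', '67'],
    ['JaneSmithe', 'UY34DSF', 'SW147EG', '23SouthDrive', '82'],
]
plate_from_file = "UY34DSF\n"  # readline() keeps the trailing newline
match = find_row(rows, plate_from_file)
```

Without the strip() call the membership test compares "UY34DSF\n" against "UY34DSF" and never matches, which is consistent with the behaviour described in the question.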
I'm trying to sort through a file line by line, comparing the beginning of each line with a string from a list, like so:
for line in lines:
    skip_line = True
    for tag in tags:
        if line.startswith(tag) is False:
            continue
        else:
            skip_line = False
            break
    if skip_line is False:
        #do stuff
While the code works just fine, I'm wondering if there's a neater way to check for this condition. I have looked at any(), but it seems to only let me check whether any of my lines start with a fixed tag (without eliminating the for loop needed to go through my list).
So, essentially I'm asking this:
Is there a better, sleeker option than using a for loop to iterate over my tags list to check if the current line starts with one of its elements?
As Paradox pointed out in his answer:
Using a dictionary to lookup if the string exists has O(1) complexity and actually makes the entire code look a lot cleaner, while being faster than looping through a list. Like so:
tags = {'ticker':0, 'orderBook':0, 'tradeHistory':0}
for line in lines:
    if line.split('\t')[0] in tags:
        #do stuff
If you're determined to pull this down into a one-liner, you can use a generator:
tagged_lines = (line for line in lines if any(line.startswith(tag) for tag in tags))
for line in tagged_lines:
    # Do something with line here
Of course, how readable this is is a different question.
You've probably seen syntax like [x*x for x in range(10)] before, but by swapping the [] for (), we instead generate each item only when it's asked for.
Instead of iterating over your tags list, you can put all your tags inside a hash map and do a simple lookup like myMap.exists("word"). This would be much faster than iterating through your tags list, since a lookup has O(1) complexity. In Python this is the dictionary data structure. http://progzoo.net/wiki/Python:Hash_Maps
This has been asked before. Take a look at this post for more solutions. I would flag this post as a duplicate but I still do not have the reputation.
https://stackoverflow.com/a/10477481/5016492
You'll need to modify the regular expression so that it looks at the start of the line. Something like this should work for you '^tag' .
How about a combination of any() and filter(), like in this example:
# use your data here ...
mytags = ('hello', 'world')
mylines = ('hello friend', 'you are great', 'world is cruel')
result = filter(lambda line: any(map(lambda tag: line.startswith(tag), mytags)), mylines)
print result
In fact any() will do the job, as long as you pass it a generator expression rather than a lambda.
Looping over each line
for line in lines:
    tagged = any(line.startswith(y) for y in tags)
Check whether any line starts with any tag
any(any(x.startswith(y) for y in tags) for x in lines)
Filter tagged lines
filter(lambda x: any(x.startswith(y) for y in tags), lines)
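One more option, sketched here as an addition to the answers above: str.startswith accepts a tuple of prefixes, so the inner loop can disappear entirely. The tag names come from the dictionary example earlier in the thread; the sample lines and their tab-separated layout are assumptions made for illustration.

```python
tags = ['ticker', 'orderBook', 'tradeHistory']
# Hypothetical sample lines; the tab-separated layout is an assumption.
lines = ['ticker\t1.0', 'heartbeat\t0', 'orderBook\t2.5']

prefixes = tuple(tags)  # startswith needs a tuple of prefixes, not a list
tagged_lines = [line for line in lines if line.startswith(prefixes)]
```

Unlike the dictionary lookup, this still matches on the start of the line rather than on a whole tab-delimited field, so it keeps the semantics of the original loop.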
My code below is extracting some portion from a file and displaying the result in separate lists.
I want to form a list of all these lists which were filtered out. I tried to form it in my code but when I am trying to print it out, I am getting an empty list.
import re
hand = open('mbox.txt')
for line in hand:
    my_list = list()
    line = line.rstrip()
    #Extracting out the data from file
    x=re.findall('^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])', line)
    #checking the length and checking if the data is not present to the list
    if len(x) != 0 and x not in my_list:
        my_list.append(x[0])
        print my_list
Filtered list is:
['15:46:24']
['15:03:18']
['14:50:18']
['11:37:30']
['11:35:08']
['11:12:37']
and so on.
A couple of things to note. If you are repeatedly doing regex matching, I suggest you compile the pattern first and then do the matching. Also, you don't need to check length of a container manually to get its bool value - just do if container:. Use builtin filter to remove empty items. Or you can use a set that avoids duplicates automatically. I am also not sure why you are stripping the space characters before doing the regex match. Is that necessary?
import re
pattern = re.compile(r"^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])")
data = []  # list of lists, created once before the loop
with open("mbox.txt") as f:
    for line in f:
        times = list(filter(None, pattern.findall(line)))
        if times:  # skip lines without a match
            data.append(times)
print(data)
This is all you need to get that list of lists. Using filter keeps the code compact.
Just move my_list = list() out of the for loop, so the list is created once instead of being reset on every line.
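A minimal sketch of that fix, written against an in-memory list so it can be run without mbox.txt. The helper name extract_times and the sample lines are hypothetical; only the regular expression is taken from the question.

```python
import re

TIME_PATTERN = re.compile(r'^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])')


def extract_times(lines):
    """Collect each distinct time stamp as its own one-element list."""
    collected = []  # created once, before the loop
    for line in lines:
        found = TIME_PATTERN.findall(line.rstrip())
        if found and found not in collected:
            collected.append(found)
    return collected


# Hypothetical lines in the shape of an mbox file.
sample = [
    'From alice@example.com Sat Jan  5 09:14:16 2008',
    'Subject: hello',
    'From bob@example.com Fri Jan  4 18:10:48 2008',
    'From alice@example.com Sat Jan  5 09:14:16 2008',
]
times = extract_times(sample)
```

Because the list now lives outside the loop, the duplicate check actually has earlier entries to compare against.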
I am having trouble splitting on '&' in a list of URLs. I know it is because I cannot split a list directly, but I cannot figure out how to get around this error. I am open to any suggestions.
def nestForLoop():
    lines = open("URL_leftof_qm.txt", 'r').readlines()
    for l in lines:
        toke1 = l.split("?")
        toke2 = toke1.split("&")
        for t in toke2:
            with open("ampersand_right_split.txt".format(), 'a') as f:
                f.write
    lines.close()
nestForLoop()
NO. STOP.
qs = urlparse.urlparse(url).query
qsl = urlparse.parse_qsl(qs)
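The two lines above use the Python 2 urlparse module. In Python 3 the same functions live in urllib.parse; a small sketch follows, where the helper name query_pairs and the example URL are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def query_pairs(url):
    """Return the query string of url as a list of (name, value) tuples."""
    return parse_qsl(urlsplit(url).query)


# Hypothetical URL used only to show the result shape.
pairs = query_pairs('http://example.com/search?q=python&page=2')
```

Letting the library parse the query string also takes care of percent-decoding, which a manual split on '&' would not do.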
As Ignacio points out, you should not be doing this in the first place. But I'll explain where you're going wrong, and how to fix it:
toke1 is a list of two strings: the main URL before the ?, and the query string after the ?. You don't want to split that list, or everything in that list; you just want to split the query string. So:
mainurl, query = l.split("?")
queryvars = query.split("&")
What if you did want to split everything in the first list? There are two different things that could mean, which are of course done differently. But both require a loop (explicit, or inside a list comprehension) over the first list. Either this:
tokens = [toke2.split("&") for toke2 in l.split("?")]
or
tokens = [token for toke2 in l.split("?")
for token in toke2.split("&")]
Try them both out to see the different outputs, and hopefully you'll understand what they're doing.
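For reference, here is what the two comprehensions produce for one made-up URL (the URL is an assumption for illustration, not from the question's input file):

```python
l = 'http://example.com/page?a=1&b=2'  # hypothetical input line

# One sublist per piece around the "?", each split on "&".
nested = [toke2.split("&") for toke2 in l.split("?")]

# The same tokens flattened into a single list.
flat = [token for toke2 in l.split("?")
        for token in toke2.split("&")]
```

The first form keeps the grouping by "?" piece, while the second discards it.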