I am having trouble splitting on '&' in a list of URLs. I know it is because I cannot call split on a list directly, but I cannot figure out how to get around this error. I am open to any suggestions.
def nestForLoop():
    lines = open("URL_leftof_qm.txt", 'r').readlines()
    for l in lines:
        toke1 = l.split("?")
        toke2 = toke1.split("&")
        for t in toke2:
            with open("ampersand_right_split.txt".format(), 'a') as f:
                f.write
    lines.close()
nestForLoop()
NO. STOP.
qs = urlparse.urlparse(url).query
qsl = urlparse.parse_qsl(qs)
As Ignacio points out, you should not be doing this in the first place. But I'll explain where you're going wrong, and how to fix it:
toke1 is a list of two strings: the main URL before the ?, and the query string after the ?. You don't want to split that list, or everything in that list; you just want to split the query string. So:
mainurl, query = l.split("?")
queryvars = query.split("&")
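Both approaches can be compared side by side. This is a minimal sketch for Python 3, where the urlparse module lives in urllib.parse; the sample URL is made up for illustration and is not from the original question.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical example URL, used only for illustration.
url = "http://example.com/search?q=python&page=2"

# Manual approach: split once on "?", then split the query string on "&".
main_url, query = url.split("?", 1)
query_vars = query.split("&")
print(main_url)    # http://example.com/search
print(query_vars)  # ['q=python', 'page=2']

# Library approach: let urllib.parse extract and decode the query string.
pairs = parse_qsl(urlsplit(url).query)
print(pairs)       # [('q', 'python'), ('page', '2')]
```

The library version also handles percent-encoded values, which the manual split leaves untouched.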
What if you did want to split everything in the first list? There are two different things that could mean, which are of course done differently. But both require a loop (explicit, or inside a list comprehension) over the first list. Either this:
tokens = [toke2.split("&") for toke2 in l.split("?")]
or
tokens = [token for toke2 in l.split("?")
          for token in toke2.split("&")]
Try them both out to see the different outputs, and hopefully you'll understand what they're doing.
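For reference, here is what the two comprehensions produce on one sample line. The line itself is a made-up example, and the names nested and flat are only used here to tell the two results apart.

```python
l = "http://example.com/page?a=1&b=2"  # hypothetical sample line

# First form: one sub-list per "?"-separated part.
nested = [toke2.split("&") for toke2 in l.split("?")]

# Second form: every token in a single flat list.
flat = [token for toke2 in l.split("?")
        for token in toke2.split("&")]

print(nested)  # [['http://example.com/page'], ['a=1', 'b=2']]
print(flat)    # ['http://example.com/page', 'a=1', 'b=2']
```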
My string is:
xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv
But I want a result like this:
bonding_err_bond0-if_eth2
I tried some code, but it does not work correctly:
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
x = csv.rsplit('.', 4)[2]
print(x)
But the result I get is com-bonding_err_bond0-if_eth2-d, whereas I want bonding_err_bond0-if_eth2.
If you are allowed to use a solution other than regex, you can do it with split and join.
Break the solution into smaller parts to understand it better, and learn about join if you are not already aware of it. It will come in handy.
solution= '-'.join(csv.split('.', 4)[2].split('-')[1:3])
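Broken into smaller parts, the one-liner does the following. The intermediate variable names are only illustrative; the comments show the value at each step.

```python
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"

parts = csv.split('.', 4)    # ['xxx', 'xxx', 'com-bonding_err_bond0-if_eth2-d', 'rrd', 'csv']
middle = parts[2]            # 'com-bonding_err_bond0-if_eth2-d'
pieces = middle.split('-')   # ['com', 'bonding_err_bond0', 'if_eth2', 'd']
wanted = pieces[1:3]         # ['bonding_err_bond0', 'if_eth2']
solution = '-'.join(wanted)  # 'bonding_err_bond0-if_eth2'
print(solution)
```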
Thanks,
Shashank
You have probably got your answer already, but if you want a more generic method for any string data, you can do this:
This way you won't be restricted to one string, and you can loop over the data as well.
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
first_index = csv.find("-")
second_index = csv.find("-d")
result = csv[first_index+1:second_index]
print(result)
# OUTPUT:
# bonding_err_bond0-if_eth2
You can just separate the string with -, remove the beginning and end, and then join them back into a string.
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
x = '-'.join(csv.split('-')[1:-1])
Output
>>> x
'bonding_err_bond0-if_eth2'
Let's say I have a ton of HTML with no newlines. I want to get each element into a list.
input = "<head><title>Example Title</title></head>"
a_list = ["<head>", "<title>Example Title</title>", "</head>"]
Something like that, splitting between each > and <.
But in Python, I don't know of a way to do that. I can only split on that string, which removes it from the output. I want to keep it, and split between the two angle brackets.
How can this be done?
Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.
# initial input
a = "<head><title>Example Title</title></head>"
# split list
b = a.split('><')
# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]
# initialize new list
a_list = []
# fill new list with formatted elements
for i in range(len(b)):
    a_list.append('<{}>'.format(b[i]))
This will output the given list in python 2.7.2, but it should work in python 3 as well.
You can try this:
import re
a = "<head><title>Example Title</title></head>"
data = re.split("><", a)
new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]
Output:
['<head>', '<title>Example Title</title>', '</head>']
The shortest approach, using the re.findall() function on an extended example:
import re
# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)
The output:
['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
Based on the answers by other people, I made this.
It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.
Here, I got rid of one extra argument by combining the two characters into a string. Anyways,
def split_between(string, chars):
    if len(chars) != 2:
        raise IndexError("Argument chars must contain two characters.")
    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]
    return result_list
Credit goes to @cforeman and @Ajax1234.
Or even simpler, this:
input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])
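If the goal is to split between > and < without adding the characters back afterwards, one possible alternative is a zero-width split with lookarounds. This is a sketch that assumes Python 3.7 or newer, where re.split() accepts patterns that match the empty string; it is not one of the approaches posted above.

```python
import re

html = "<head><title>Example Title</title></head>"

# Split at every empty position that has ">" on its left and "<" on its right.
# Nothing is consumed, so no characters need to be re-added afterwards.
parts = re.split(r"(?<=>)(?=<)", html)
print(parts)  # ['<head>', '<title>Example Title</title>', '</head>']
```

Like the split('><') answers, this only looks at adjacent angle brackets; it does not understand HTML nesting.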
I'm trying to read 200 txt files and do some preprocessing.
1) How could I write simpler code instead of repeating the same code for each of the txt files?
2) Can I combine a regular expression with rstrip?
-> Mainly, I want to get rid of "\n", but sometimes it is stuck to other characters. So what I want is to remove every \n as well as the words that are combined with \n (i.e. "\n?", "!\n" and so on).
3) In the last line, is there a way to add all the lists into one list with simpler code?
data = open("job (0).txt", 'r').read()
rows0 = data.split(" ")
rows0 = [item.rstrip('\n?, \n') for item in rows0]
data = open("job (1).txt", 'r').read()
rows1 = data.split(" ")
rows1 = [item.rstrip('\n?, \n') for item in rows1]
.....(up to 200th file)
data = open("job (199).txt", 'r').read()
rows199 = data.split(" ")
rows199 = [item.rstrip('\n?, \n') for item in rows199]
ds_l = rows0 + rows1 + ... rows199
First of all, I'm not a Python expert. But since the question has been around for a while already... (At least I'm safe from downvotes if no one looks at this^^)
1) Use loops, and read a programming tutorial.
See for example this post How do I read a file line-by-line into a list? on how to get a list of all rows. Then you can loop over the list.
2) I have no idea whether it's possible to use regexes with strip. That question is what brought me here, so tell me if you find out.
It's not clear what exactly you are asking for. Do you want to get rid of all (space-separated) words that contain any "\n", or just cut the "\n", "\n?", ... parts out of the words?
In the first case, a simple, inelegant solution would be to loop over the rows and over all words in each row, and do something like
# loop over rows with i as index
for i in range(len(rows)):
    # keep only the words that do not contain a newline
    row = [word for word in rows[i].split(" ") if "\n" not in word]
    rows[i] = " ".join(row)
In the latter case, if there are not many expressions you want to remove, you can probably use re.sub() somehow. Google helps ;)
3) If you have the rows as a list "rows" of strings, you can use join:
ds_1 = "".join(rows)
(For join: Python join: why is it string.join(list) instead of list.join(string)?)
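Putting the three points together, a rough sketch might look like the following. The function names clean_tokens and load_all are hypothetical. It assumes that "words combined with \n" means punctuation such as ?, !, comma, or period directly touching a newline, and it collects a flat list of words (as in the question's last line) rather than one joined string. The demo writes two small temporary files so the example can run on its own.

```python
import re
import tempfile
from pathlib import Path

def clean_tokens(text):
    """Split text into words, dropping newlines and punctuation touching them."""
    # Assumption: only ?, !, comma and period next to a newline are removed.
    text = re.sub(r"[?!,.]*\n[?!,.]*", " ", text)
    return text.split()

def load_all(folder, count):
    """Read 'job (0).txt' .. 'job (count-1).txt' and return one flat list."""
    tokens = []
    for i in range(count):
        path = Path(folder) / "job ({}).txt".format(i)
        tokens.extend(clean_tokens(path.read_text()))
    return tokens

# Demo with two throwaway files instead of the real 200.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "job (0).txt").write_text("hello world!\nfoo bar\n")
    Path(folder, "job (1).txt").write_text("is it?\nyes\n")
    ds_l = load_all(folder, 2)

print(ds_l)  # ['hello', 'world', 'foo', 'bar', 'is', 'it', 'yes']
```

Using extend() inside the loop replaces the long rows0 + rows1 + ... expression from the question.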
My code below is extracting some portion from a file and displaying the result in separate lists.
I want to form a list of all these lists which were filtered out. I tried to form it in my code but when I am trying to print it out, I am getting an empty list.
import re
hand = open('mbox.txt')
for line in hand:
    my_list = list()
    line = line.rstrip()
    #Extracting out the data from file
    x=re.findall('^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])', line)
    #checking the length and checking if the data is not present to the list
    if len(x) != 0 and x not in my_list:
        my_list.append(x[0])
        print my_list
Filtered list is:
['15:46:24']
['15:03:18']
['14:50:18']
['11:37:30']
['11:35:08']
['11:12:37']
and so on.
A couple of things to note. If you are repeatedly doing regex matching, I suggest you compile the pattern first and then do the matching. Also, you don't need to check length of a container manually to get its bool value - just do if container:. Use builtin filter to remove empty items. Or you can use a set that avoids duplicates automatically. I am also not sure why you are stripping the space characters before doing the regex match. Is that necessary?
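import re
# compile the pattern once, outside the loop
pattern = re.compile(r"^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])")
data = []
with open("mbox.txt") as f:
    for line in f:
        # filter(None, ...) drops empty matches
        matches = list(filter(None, pattern.findall(line)))
        if matches:
            data.append(matches)
print(data)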
This is all you need to get that list of lists. Using filter keeps the code compact.
Just move my_list = list() out of the for loop, so that the list is created once before the loop starts instead of being reset on every line.
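A minimal sketch of that fix, written for Python 3 and wrapped in a hypothetical collect_times function so it can be tried without mbox.txt. The sample lines are invented in the style of an mbox file.

```python
import re

# Compiled once and reused for every line.
TIME_PATTERN = re.compile(r"^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])")

def collect_times(lines):
    """Return one single-item list per distinct time, in order of appearance."""
    my_list = []  # created once, before the loop
    for line in lines:
        found = TIME_PATTERN.findall(line.rstrip())
        if found and found not in my_list:
            my_list.append(found)
    return my_list

# Hypothetical sample lines; the third "From" line repeats the first.
sample = [
    "From alice@example.com Sat Jan  5 09:14:16 2008\n",
    "Subject: hello\n",
    "From bob@example.com Fri Jan  4 15:46:24 2008\n",
    "From alice@example.com Sat Jan  5 09:14:16 2008\n",
]
print(collect_times(sample))  # [['09:14:16'], ['15:46:24']]
```

With the real file, the same function could be called as collect_times(open('mbox.txt')).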
I am running a server with CherryPy and a Python script. Currently, there is a web page containing data on a list of people, which I need to get. The format of the web page is as follows:
www.url1.com, firstName_1, lastName_1
www.url2.com, firstName_2, lastName_2
www.url3.com, firstName_3, lastName_3
I wish to display the list of names on my own webpage, with each name hyperlinked to their corresponding website.
I have read the webpage into a list with the following method:
@cherrypy.expose
def receiveData(self):
    """ Get a list, one per line, of currently known online addresses,
    separated by commas.
    """
    method = "whoonline"
    fptr = urllib2.urlopen("%s/%s" % (masterServer, method))
    data = fptr.readlines()
    fptr.close()
    return data
But I don't know how to break the list into a list of lists at the commas. Each smaller list should have three elements: URL, first name, and last name. So I was wondering if anyone could help.
Thank you in advance!
You can iterate over fptr, no need to call readlines()
data = [line.split(', ') for line in fptr]
You need the split(',') method on each string:
data = [ line.split(',') for line in fptr.readlines() ]
lists = []
for line in data:
    lists.append([x.strip() for x in line.split(',')])
If your data is a big ol' string (potentially with leading or trailing spaces), do it this way:
lines=""" www.url1.com, firstName_1, lastName_1
www.url2.com, firstName_2 , lastName_2
www.url3.com, firstName_3, lastName_3 """
data=[]
for line in lines.split('\n'):
    t=[e.strip() for e in line.split(',')]
    data.append(t)
print data
Out:
[['www.url1.com', 'firstName_1', 'lastName_1'], ['www.url2.com', 'firstName_2',
'lastName_2'], ['www.url3.com', 'firstName_3', 'lastName_3']]
Notice the leading and trailing spaces are removed.
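To get from the list of lists to the hyperlinked names the question asks for, something along these lines could work. It is a Python 3 sketch with hypothetical helper names (parse_people, to_links); it assumes the lines are already text (bytes read from a response would need decoding first) and that the addresses should be served over plain http.

```python
from html import escape

def parse_people(lines):
    """Turn 'url, first, last' lines into [url, first, last] lists."""
    people = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        people.append([field.strip() for field in line.split(",")])
    return people

def to_links(people):
    """Build one HTML anchor per person (assumes an http:// prefix is wanted)."""
    return ['<a href="http://{0}">{1} {2}</a>'.format(
                escape(url), escape(first), escape(last))
            for url, first, last in people]

sample = [
    "www.url1.com, firstName_1, lastName_1\n",
    " www.url2.com, firstName_2 , lastName_2 \n",
]
people = parse_people(sample)
print(people)
# [['www.url1.com', 'firstName_1', 'lastName_1'], ['www.url2.com', 'firstName_2', 'lastName_2']]
print(to_links(people)[0])
# <a href="http://www.url1.com">firstName_1 lastName_1</a>
```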