How to remove characters in a string AFTER a given character? - python

I have a list of strings that I want to remove the URL extensions from. Here's what it looks like:
['google.com', 'google.ru', 'google.ca']
Basically, I want to remove everything after the "." in each one so that I'm returned with something like this
['google', 'google', 'google']
My instructions specifically tell me to use the split() function, but I'm confused with that as well. If it's also possible, I need to remove duplicates, so my final result would be:
['google']
Thanks for the help, sorry if my specifications are odd.

This function removes the URL extensions:
def removeurlextensions(L):
    L2 = []
    for item in L:
        # keep only the text before the first '.'
        L2.append(item.split('.')[0])
    return L2
To print your list:
L = ['google.com', 'google.ru', 'google.ca']
print(removeurlextensions(L))
#prints ['google', 'google', 'google']
To remove duplicates you can use list(set()):
L = ['google.com', 'google.ru', 'google.ca']
print(list(set(removeurlextensions(L))))
#prints ['google']

You can simply use split.
ls = ['google.com', 'google.ru', 'google.ca']
print([i.split('.', 1)[0] for i in ls])
# result = ['google', 'google', 'google']
And to remove the duplicate, you might want to use set.
mod = [i.split('.', 1)[0] for i in ls]
print(list(set(mod)))
# result = ['google']
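One caveat with list(set(...)): a set does not keep the original order, which starts to matter once the list holds more than one distinct name. Here is a minimal sketch of an order-preserving variant, assuming Python 3.7+ (where dict keys keep insertion order). The helper name strip_extensions and the 'yahoo.co.uk' entry are made up for illustration and are not from the question.

```python
def strip_extensions(urls):
    """Return the text before the first '.' of each string, de-duplicated in first-seen order."""
    # str.partition returns (head, sep, tail); head is the whole string when '.' is absent.
    names = [url.partition('.')[0] for url in urls]
    # dict keys keep insertion order (Python 3.7+), so repeats are dropped without reordering.
    return list(dict.fromkeys(names))

print(strip_extensions(['google.com', 'yahoo.co.uk', 'google.ru', 'google.ca']))
# prints ['google', 'yahoo']
```

partition is used here instead of split only because it never raises and always returns three parts; split('.', 1)[0] gives the same result.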

This will only work if all items are strings:
for i in range(len(my_list)):
    my_list[i] = my_list[i].split('.')[0]
# collect each item the first time it is seen, rather than popping while iterating
already_in_list = []
for item in my_list:
    if item not in already_in_list:
        already_in_list.append(item)
my_list = already_in_list
print(my_list)
I did do this from memory so if there is a bug please let me know.

Related

Replace substring inside a list

I have a list of strings with a few unclean entries and I want to replace the unclean entries with clean entries
list = ['created_DATE', 'column1(case', 'timestamp', 'location(case']
I want to get a list that is like this
cleanList = ['created_DATE', 'column1', 'timestamp', 'location']
I tried the following:
str_match = [s for s in list if "(case" in s]  # find the matching elements
print(str_match)
new = []
for k in str_match:
    a = k.replace("(case", "")
    new.append(a)  # make a list of the words without the substring
print(new)
I am not sure how do I now replace the entries from the new list into the original list. Can someone please help.
Thank you
If you want to remove all occurrences of "(case" from your list's elements, then you could write it like this:
list = ['created_DATE', 'column1(case', 'timestamp', 'location(case']
clean = []
for n in list:
    clean.append(n.replace("(case", ""))
print(clean)
You can either create a new list clean, as suggested by @alani:
import re
myList = ['created_DATE', 'column1(case', 'timestamp', 'location(case']
# raw string, so the backslash reaches the regex engine unchanged
clean = [re.sub(r"\(.*", "", s) for s in myList]
print(clean)
or iterate over the elements of myList and update them in place:
for i in range(len(myList)):
    if "(case" in myList[i]:
        myList[i] = myList[i].replace("(case", "")
print(myList)
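If the goal is to cut each entry at the marker (as the regex version does) while still updating the original list object, slice assignment is one option. This is only a sketch: cut_at and columns are hypothetical names. Note that it drops everything from the marker onward, whereas replace() removes only the marker itself.

```python
def cut_at(items, marker):
    """Return a new list where each string is truncated at the first occurrence of marker."""
    # partition leaves a string untouched when the marker is absent
    return [s.partition(marker)[0] for s in items]

columns = ['created_DATE', 'column1(case', 'timestamp', 'location(case']
# Slice assignment replaces the contents of the existing list object in place.
columns[:] = cut_at(columns, '(case')
print(columns)
# prints ['created_DATE', 'column1', 'timestamp', 'location']
```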

How to extract strings between two markers for each object of a list in python

I have a list of strings, each of which contains the same two markers. I would like to extract the substring between those two markers for each string in the list.
example:
markers 'XXX' and 'YYY' --> therefore i want to extract 78665786 and 6866
['XXX78665786YYYjajk', 'XXX6866YYYz6767'....]
You can just loop over your list and grab the substring. You can do something like:
import re
my_list = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']
output = []
for item in my_list:
    output.append(re.search('XXX(.*)YYY', item).group(1))
print(output)
Output:
['78665786', '6866']
import re
l = ['XXX78665786YYYjajk', 'XXX6866YYYz6767'....]
l = [re.search(r'XXX(.*)YYY', i).group(1) for i in l]
This should work
Another solution would be:
import re
test_string=['XXX78665786YYYjajk','XXX78665783336YYYjajk']
int_val=[int(re.search(r'\d+', x).group()) for x in test_string]
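The split() method splits a string into parts at a separator, so splitting on each marker in turn isolates the text between them:
list1 = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']
list2 = []
for i in list1:
    after_start = i.split("XXX")[1]  # text after the first marker
    list2.append(after_start.split("YYY")[0])  # text before the second marker
print(list2)
The extracted values are saved into list2.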
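The answers above assume both markers are present in every string; re.search(...) returns None otherwise, and calling .group(1) on it raises AttributeError. A small sketch without regular expressions that tolerates missing markers could look like this. The function name between and the third sample string are assumptions added for illustration.

```python
def between(text, start, end):
    """Return the text between the first start marker and the next end marker, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker missing
    middle, found_end, _ = rest.partition(end)
    return middle if found_end else None  # None when the end marker is missing

samples = ['XXX78665786YYYjajk', 'XXX6866YYYz6767', 'no markers here']
print([between(s, 'XXX', 'YYY') for s in samples])
# prints ['78665786', '6866', None]
```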

how to remove "\n" from a list of strings

I have a list that is read from a text file that outputs:
['/Users/myname/Documents/test1.txt\n', '/Users/myname/Documents/test2.txt\n', '/Users/myname/Documents/test3.txt\n']
I want to remove the \n from each element, but using .split() does not work on lists only strings (which is annoying as this is a list of strings).
How do I remove the \n from each element so I can get the following output:
['/Users/myname/Documents/test1.txt', '/Users/myname/Documents/test2.txt', '/Users/myname/Documents/test3.txt']
old_list = [x.strip() for x in old_list]
old_list refers to the list you want to remove the \n from.
Or if you want something more readable:
for x in range(len(old_list)):
    old_list[x] = old_list[x].strip()
Does the same thing, without list comprehension.
The strip() method removes all leading and trailing whitespace, including \n.
But if you are not ok with the idea of removing whitespaces from start and end, you can do:
old_list = [x.replace("\n", "") for x in old_list]
or
for x in range(len(old_list)):
    old_list[x] = old_list[x].replace("\n", "")
Use strip(), but keep in mind that it returns new strings rather than modifying the original list, so you need to reassign the result if required:
a = ['/Users/myname/Documents/test1.txt\n', '/Users/myname/Documents/test2.txt\n', '/Users/myname/Documents/test3.txt\n']
a = [path.strip() for path in a]
print(a)
Give this code a try:
lst = ['/Users/myname/Documents/test1.txt\n', '/Users/myname/Documents/test2.txt\n', '/Users/myname/Documents/test3.txt\n']
for n, element in enumerate(lst):
    element = element.replace('\n', '')
    lst[n] = element
print(lst)
Use:
[i.strip() for i in lines]
if you don't mind losing the spaces and tabs at the beginning and at the end of the lines.
You can read the whole file and split lines using str.splitlines:
temp = file.read().splitlines()
If you still have problems, see the question where I got this answer from:
How to read a file without newlines?
answered Sep 8 '12 at 11:57 Bakuriu
There are many ways to achieve your result.
Method 1: using split() method
l = ['/Users/myname/Documents/test1.txt\n', '/Users/myname/Documents/test2.txt\n', '/Users/myname/Documents/test3.txt\n']
result = [i.split('\n')[0] for i in l]
print(result) # ['/Users/myname/Documents/test1.txt', '/Users/myname/Documents/test2.txt', '/Users/myname/Documents/test3.txt']
Method 2: using strip() method that removes leading and trailing whitespace
l = ['/Users/myname/Documents/test1.txt\n', '/Users/myname/Documents/test2.txt\n', '/Users/myname/Documents/test3.txt\n']
result = [i.strip() for i in l]
print(result) # ['/Users/myname/Documents/test1.txt', '/Users/myname/Documents/test2.txt', '/Users/myname/Documents/test3.txt']
Method 3: using rstrip() method that removes trailing whitespace
l = ['/Users/myname/Documents/test1.txt\n', '/Users/myname/Documents/test2.txt\n', '/Users/myname/Documents/test3.txt\n']
result = [i.rstrip() for i in l]
print(result) # ['/Users/myname/Documents/test1.txt', '/Users/myname/Documents/test2.txt', '/Users/myname/Documents/test3.txt']
Method 4: using the method replace
l = ['/Users/myname/Documents/test1.txt\n', '/Users/myname/Documents/test2.txt\n', '/Users/myname/Documents/test3.txt\n']
result = [i.replace('\n', '') for i in l]
print(result) # ['/Users/myname/Documents/test1.txt', '/Users/myname/Documents/test2.txt', '/Users/myname/Documents/test3.txt']
Here is another way to do it with lambda:
# list() is needed on Python 3, where map() returns an iterator rather than a list
cleannewline = lambda somelist: list(map(lambda element: element.strip(), somelist))
Then you can just call it as:
cleannewline(yourlist)
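The methods above differ in what else they remove, which only shows up when a line has other leading or trailing whitespace. A short sketch of the difference follows; io.StringIO stands in for a real file here, and the "  indented.txt" line is invented sample data.

```python
import io

# io.StringIO stands in for an open text file so the example is self-contained.
fake_file = io.StringIO("  indented.txt\n/Users/myname/Documents/test1.txt\n")
lines = fake_file.readlines()

stripped = [line.strip() for line in lines]       # also drops the leading spaces
trimmed = [line.rstrip('\n') for line in lines]   # drops only the trailing newline

fake_file.seek(0)
split_up = fake_file.read().splitlines()          # newlines removed while splitting

print(stripped)
print(trimmed)
print(split_up)
```

Here stripped loses the two leading spaces of the first entry, while trimmed and split_up keep them.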

duplicate list items when using .replace() function

lst = ['123,456', '"hello"', '345,678', '"bye"']
def main():
    new_lst = []
    for item in lst:
        #print item
        new_lst.append(item.replace(',','***'))
        new_lst.append(item.replace('\"', ''))
    return new_lst
print(main())
This is quite puzzling to me. I don't know what I'm doing wrong here. I know it's a very stupid mistake but it's not clicking for me. I don't know why I get an output of:
['123***456', '123,456', '"hello"', 'hello', '345***678', '345,678', '"bye"', 'bye']
What I was hoping was for:
['123***456', 'hello', '345***678', 'bye']
Any help is greatly appreciated!
You are appending the same string twice, with two different replacements. You should chain the replaces like this
new_lst.append(item.replace(',','***').replace('\"', ''))
Even better, you can use a list comprehension here, like this
return [item.replace(',','***').replace('\"', '') for item in lst]
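When the number of replacements grows, chaining .replace() calls gets hard to read. One possible sketch keeps the (old, new) pairs in a list and applies them in order; replace_all and rules are hypothetical names, not part of the original code.

```python
def replace_all(text, replacements):
    """Apply each (old, new) pair to text in order and return the result."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text

lst = ['123,456', '"hello"', '345,678', '"bye"']
rules = [(',', '***'), ('"', '')]
# each item is appended once, after all replacements have been applied
print([replace_all(item, rules) for item in lst])
# prints ['123***456', 'hello', '345***678', 'bye']
```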

Remove duplicates in a list while keeping its order (Python)

This is actually an extension of this question. The answers of that question did not keep the "order" of the list after removing duplicates. How to remove these duplicates in a list (python)
biglist = [
    {'title':'U2 Band','link':'u2.com'},
    {'title':'Live Concert by U2','link':'u2.com'},
    {'title':'ABC Station','link':'abc.com'}
]
In this case, the 2nd element should be removed because a previous "u2.com" element already exists. However, the order should be kept.
use set(), then re-sort using the index of the original list.
>>> mylist = ['c','a','a','b','a','b','c']
>>> sorted(set(mylist), key=lambda x: mylist.index(x))
['c', 'a', 'b']
My answer to your other question, which you completely ignored!, shows you're wrong in claiming that
"The answers of that question did not keep the 'order'"
my answer did keep order, and it clearly said it did. Here it is again, with added emphasis to see if you can just keep ignoring it...:
Probably the fastest approach, for a really big list, if you want to preserve the exact order of the items that remain, is the following...:
biglist = [
    {'title':'U2 Band','link':'u2.com'},
    {'title':'ABC Station','link':'abc.com'},
    {'title':'Live Concert by U2','link':'u2.com'}
]
known_links = set()
newlist = []
for d in biglist:
    link = d['link']
    if link in known_links: continue
    newlist.append(d)
    known_links.add(link)
biglist[:] = newlist
Generators are great.
def unique( seq ):
    seen = set()
    for item in seq:
        if item not in seen:
            seen.add( item )
            yield item
biglist[:] = unique( biglist )
This page discusses different methods and their speeds:
http://www.peterbe.com/plog/uniqifiers-benchmark
The recommended* method:
def f5(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
f5(biglist, lambda x: x['link'])
*by that page
This is an elegant and compact way, with list comprehension (but not as efficient as with dictionary):
mylist = ['aaa','aba','aaa','aea','baa','aaa','aac','aaa',]
[ v for (i,v) in enumerate(mylist) if v not in mylist[0:i] ]
And in the context of the answer:
[ v for (i,v) in enumerate(biglist) if v['link'] not in map(lambda d: d['link'], biglist[0:i]) ]
dups = {}
newlist = []
for x in biglist:
    if x['link'] not in dups:
        newlist.append(x)
        dups[x['link']] = None
print(newlist)
produces (on Python 3.7+, where dicts print in insertion order)
[{'title': 'U2 Band', 'link': 'u2.com'}, {'title': 'ABC Station', 'link': 'abc.com'}]
Note that here I used a dictionary. This makes the test not in dups much more efficient than using a list.
Try this :
mylist = ['aaa','aba','aaa','aea','baa','aaa','aac','aaa',]
uniq = []
for i in mylist:
    if i not in uniq:
        uniq.append(i)
print(mylist)
print(uniq)
output will be :
['aaa', 'aba', 'aaa', 'aea', 'baa', 'aaa', 'aac', 'aaa']
['aaa', 'aba', 'aea', 'baa', 'aac']
A super easy way to do this is:
def uniq(a):
    if len(a) == 0:
        return []
    else:
        return [a[0]] + uniq([x for x in a if x != a[0]])
This is not the most efficient way, because:
it searches through the whole list for every element in the list, so it's O(n^2)
it's recursive so uses a stack depth equal to the length of the list
However, for simple uses (no more than a few hundred items, not performance critical) it is sufficient.
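I think using a set should be pretty efficient.
seen_links = set()
index = 0
while index < len(biglist):
    link = biglist[index]['link']
    if link in seen_links:
        del biglist[index]  # the next item shifts into this index, so do not advance
    else:
        seen_links.add(link)
        index += 1
The set lookups are O(1) on average, but each del shifts the rest of the list, so I think this comes in at O(n^2) in the worst case rather than O(n log n).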
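For comparison, here is a sketch that avoids deleting from the list while walking it, by building the result in a dict keyed on the link. It assumes Python 3.7+ (dicts keep insertion order); unique_by is a hypothetical helper name, and the data is the list from the question.

```python
def unique_by(items, key):
    """Return items with duplicates (as judged by key(item)) removed, keeping the first of each."""
    first_seen = {}
    for item in items:
        # setdefault stores the item only if this key has not been seen yet
        first_seen.setdefault(key(item), item)
    # dict values come back in insertion order on Python 3.7+
    return list(first_seen.values())

biglist = [
    {'title': 'U2 Band', 'link': 'u2.com'},
    {'title': 'Live Concert by U2', 'link': 'u2.com'},
    {'title': 'ABC Station', 'link': 'abc.com'},
]
print([d['title'] for d in unique_by(biglist, key=lambda d: d['link'])])
# prints ['U2 Band', 'ABC Station']
```

This makes a single pass over the list and does one dict operation per item, so it runs in roughly linear time.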
