How to find unique starts of strings?

If I have a list of strings (e.g. 'blah 1', 'blah 2', 'xyz fg', 'xyz penguin'), what would be the best way of finding the unique starts of strings ('xyz' and 'blah' in this case)? The starts of strings can be multiple words.

Your question is confusing, as it is not clear what you really want. So I'll give three answers and hope that one of them at least partially answers your question.
To get all unique prefixes of a given list of strings, you can do:
>>> l = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
>>> set(s[:i] for s in l for i in range(len(s) + 1))
{'', 'xyz pe', 'xyz penguin', 'b', 'xyz fg', 'xyz peng', 'xyz pengui', 'bl', 'blah 2', 'blah 1', 'blah', 'xyz f', 'xy', 'xyz pengu', 'xyz p', 'x', 'blah ', 'xyz pen', 'bla', 'xyz', 'xyz '}
This code generates all initial slices of every string in the list and passes these to a set to remove duplicates.
To get, for each string, the longest initial word sequence that is shorter than the full string, you could go with:
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(s.rsplit(' ', 1)[0] for s in l)
{'a', 'a b', 'b'}
This code creates a set by splitting each string at its rightmost space, if there is one (otherwise the whole string is returned).
On the other hand, to get all unique initial word sequences without considering full strings, you could go for:
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(' '.join(w[:i]) for s in l for w in (s.split(),) for i in range(len(w)))
{'', 'a', 'b', 'a b'}
This code splits each string on whitespace and joins all initial slices of the resulting word list, except the longest one. It has a pitfall: because the words are rejoined with single spaces, other whitespace such as tabs is converted to spaces. This may or may not be an issue in your case.
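A fourth reading of the question, in case it is the intended one: keep only the word prefixes that are shared by at least two strings. The following is a minimal sketch under that assumption; the function name shared_word_prefixes and the min_count parameter are made up for illustration.

```python
from collections import Counter

def shared_word_prefixes(strings, min_count=2):
    """Return proper word prefixes that start at least `min_count` strings.

    Hypothetical helper: the name and the `min_count` threshold are
    assumptions, not something taken from the question.
    """
    counts = Counter()
    for s in strings:
        words = s.split()
        # Count every proper word prefix (never the full string) once per string.
        for i in range(1, len(words)):
            counts[' '.join(words[:i])] += 1
    return {prefix for prefix, n in counts.items() if n >= min_count}

strings = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
print(sorted(shared_word_prefixes(strings)))  # ['blah', 'xyz']
```

Like the previous snippet, this normalizes whitespace when it rejoins the words.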

If you mean unique first words of strings (words being separated by space), this would be:
arr = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
unique = list(set(x.split(' ')[0] for x in arr))
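A set does not keep the order in which the first words appear. If that order matters, one possible variation (a sketch, assuming Python 3.7+ where dicts preserve insertion order) is to deduplicate with dict.fromkeys instead:

```python
arr = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique = list(dict.fromkeys(s.split(' ', 1)[0] for s in arr))
print(unique)  # ['blah', 'xyz']
```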

Related

Removing specific character from the end of strings in list in python

Let's say I have a list that is something like this:
lst = ['Joe C', 'Jill', 'Chad', 'Cassie C']
I want to remove the last character from each string if that character is a 'C'. At the moment I'm stuck at this impasse:
no_c_list = [i[:-1] for i in lst if i[-1:] == 'C']
However, this drops the strings that don't end in 'C' and returns only:
['Joe ', 'Cassie ']

Use rstrip:
lst = ['Joe C', 'Jill', 'Chad', 'Cassie C']
result = [e.rstrip('C') for e in lst]
print(result)
Output
['Joe ', 'Jill', 'Chad', 'Cassie ']
From the documentation:
Return a copy of the string with trailing characters removed. The
chars argument is a string specifying the set of characters to be
removed.
Also, as mentioned by @dawg, if you want to remove the trailing whitespace as well:
result = [e.rstrip(' C') for e in lst]
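One thing to watch with rstrip: the argument is a set of characters, not a suffix, so it keeps stripping as long as the last character is in that set. The sketch below shows the difference on a made-up name ('DOC C' is not from the question) and compares it with a small hypothetical helper, strip_suffix, that removes the suffix at most once.

```python
def strip_suffix(s, suffix):
    """Remove `suffix` from the end of `s` at most once (hypothetical helper)."""
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s

# 'DOC C' is an invented example whose name itself ends in 'C'.
lst = ['Joe C', 'Jill', 'Chad', 'DOC C']

# rstrip treats ' C' as the character set {' ', 'C'} and strips repeatedly.
print([e.rstrip(' C') for e in lst])         # ['Joe', 'Jill', 'Chad', 'DO']

# The helper only removes the exact trailing ' C'.
print([strip_suffix(e, ' C') for e in lst])  # ['Joe', 'Jill', 'Chad', 'DOC']
```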

Try this:
lst = ['Joe C', 'Jill', 'Chad', 'Cassie C']
new_list = [ i[:-2] if i[-2:] == " C" else i for i in lst ]
print(new_list)

You could use a regex:
>>> lst = ['Joe C', 'Jill', 'Chad', 'Cassie C']
>>> import re
>>> [re.sub(r' C$', '', s) for s in lst]
['Joe', 'Jill', 'Chad', 'Cassie']

How to get correct output from regex.split()?

import re
number_with_both_parantheses = "(\(*([\d+\.]+)\))"
def process_numerals(text):
    k = re.split(number_with_both_parantheses, text)
    k = list(filter(None, k))
    for elem in k:
        print(elem)
INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'
expected_output = ['Statement 1', '(1)' , 'Statement 2', '(1.1)', 'Statement 3']
current_output = ['Statement 1', '(1)' , '1', 'Statement 2', '(1.1)', '1.1' , 'Statement 3']
INPUT is my input. I get current_output when I call process_numerals with the input text. How do I get the expected output instead?

Your regex seems off. You realize that \(* checks for zero or more left parentheses?
>>> import re
>>> INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'
>>> re.split(r'\((\d+(?:\.\d+)?)\)', INPUT)
['Statement 1 ', '1', ' Statement 2 ', '1.1', ' Statement 3']
If you really want the literal parentheses to be included, put them inside the capturing parentheses.
The non-capturing parentheses (?:...) allow you to group without capturing. I guess that's what you are mainly looking for.
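To illustrate that last point, here is a sketch that moves the literal parentheses inside the capturing group and strips the surrounding whitespace, which seems to match the expected output from the question. The variable names parts and result are just for illustration.

```python
import re

INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'

# One capturing group around the whole "(number)" token, so re.split keeps it.
# The inner (?:...) groups the optional decimal part without capturing it.
parts = re.split(r'(\(\d+(?:\.\d+)?\))', INPUT)

# Trim whitespace and drop any empty pieces.
result = [p.strip() for p in parts if p.strip()]
print(result)
# ['Statement 1', '(1)', 'Statement 2', '(1.1)', 'Statement 3']
```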

Moving elements to the end of a list

I want to solve two problems regarding sorting my list in python.
1) In my list, there is an element that starts with "noname" followed by a number, like "noname3" or "noname4" (each list contains only one noname+number element).
This element aggregates all the nonames, and the number after it shows how many nonames there are.
My question is: how can I send this noname+integer element to the end?
2) As you can see below, the sorted function sorts English first, then Korean. Is there any way to sort Korean first, then English? Of course, with 'noname' at the end.
names = ['Z', 'C', 'A B', 'noname3', 'ㄴ', 'ㄱ', 'D A', 'A A' , 'ㄷ']
sorted(names)
# Output
['A A', 'A B', 'C', 'D A', 'Z', 'noname3', 'ㄱ', 'ㄴ', 'ㄷ']
# Desired Output
[ 'ㄱ', 'ㄴ', 'ㄷ', 'A A', 'A B', 'C', 'D A', 'Z', 'noname3']

Use a key function that sorts the noname items higher than the non-noname items.
sorted(names, key=lambda x: (x.startswith("noname"), x))

Without knowing exactly how Korean characters are alphabetized, here's my attempt (based on @kindall's start). Note that you can pass a custom function as the key parameter of sorted:
def sorter(char):
    # Place English characters after Korean
    if ord(char[0]) > 122:
        return ord(char[0]) - 12000
    else:
        return ord(char[0]) + 12000
lst=['Z', 'C', 'A B', 'noname3', 'ㄴ', 'ㄱ', 'D A', 'A A' , 'ㄷ']
sorted(lst, key=lambda x: (x.startswith('noname'),sorter(x)))
['ㄱ', 'ㄴ', 'ㄷ', 'A B', 'A A', 'C', 'D A', 'Z', 'noname3']
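Note that sorter only looks at the first character, so 'A B' and 'A A' tie and keep their original order. A possible variation, as a sketch: use a three-part tuple key with the full string as the final tie-breaker. The names is_hangul and sort_key are made up here, the code-point ranges are only a rough assumption about what counts as Korean, and strings are assumed to be non-empty. Ordering within the Korean group is by code point, which may not match dictionary order.

```python
def is_hangul(ch):
    """Rough check (an assumption): Hangul Jamo, Compatibility Jamo, or Syllables."""
    cp = ord(ch)
    return (0x1100 <= cp <= 0x11FF
            or 0x3130 <= cp <= 0x318F
            or 0xAC00 <= cp <= 0xD7A3)

def sort_key(name):
    # Tuples compare element by element, and False sorts before True:
    # 1) non-noname before noname, 2) Korean before everything else,
    # 3) the full string as the tie-breaker.
    return (name.startswith('noname'), not is_hangul(name[0]), name)

names = ['Z', 'C', 'A B', 'noname3', 'ㄴ', 'ㄱ', 'D A', 'A A', 'ㄷ']
print(sorted(names, key=sort_key))
# ['ㄱ', 'ㄴ', 'ㄷ', 'A A', 'A B', 'C', 'D A', 'Z', 'noname3']
```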

sorting using python for complex strings [closed]

I have an array that contains numbers and characters, e.g. ['A 3', 'C 1', 'B 2'], and I want to sort it using the numbers in each element.
I tried the code below, but it did not work:
def getKey(item):
    item.split(' ')
    return item[1]
x = ['A 3', 'C 1', 'B 2']
print(sorted(x, key=getKey(x)))

To be safe, I'd recommend you to strip everything but the digits.
>>> import re
>>> x = ['A 3', 'C 1', 'B 2', 'E']
>>> print(sorted(x, key=lambda n: int(re.sub(r'\D', '', n) or 0)))
['E', 'C 1', 'B 2', 'A 3']
With your method:
>>> def getKey(item):
...     return int(re.sub(r'\D', '', item) or 0)
...
>>> print(sorted(x, key=getKey))
['E', 'C 1', 'B 2', 'A 3']

What you have, plus comments on what's not working :P
def getKey(item):
    item.split(' ')  # The result is not assigned to anything, so this doesn't change item.
                     # Also, split() with no arguments splits on whitespace by default.
    return item[1]   # Returns a string, which will not sort correctly.
x = ['A 3', 'C 1', 'B 2']
print(sorted(x, key=getKey(x)))  # You are setting key to the result of getKey(x) rather than to the function itself, which is nonsensical.
What it should be:
print(sorted(x, key=lambda i: int(i.split()[1])))
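To see why the int() conversion matters, here is a small sketch with a two-digit number added ('A 10' is an invented value, not from the question). Sorting on the string part compares character by character, so '10' lands before '2'.

```python
# 'A 10' is a made-up element to show the multi-digit case.
x = ['A 10', 'C 1', 'B 2']

# String keys: '1' < '10' < '2' lexicographically.
print(sorted(x, key=lambda i: i.split()[1]))       # ['C 1', 'A 10', 'B 2']

# Integer keys: 1 < 2 < 10 numerically.
print(sorted(x, key=lambda i: int(i.split()[1])))  # ['C 1', 'B 2', 'A 10']
```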

This is one way to do it:
>>> x = ['A 3', 'C 1', 'B 2']
>>> y = [i[::-1] for i in sorted(x)]
>>> y.sort()
>>> y = [i[::-1] for i in y]
>>> y
['C 1', 'B 2', 'A 3']

Determining the position of sub-string in list of strings

I have a list of words (strings), say:
word_lst = ['This','is','a','great','programming','language']
And a second list with sub-strings, say:
subs_lst= ['This is', 'language', 'a great']
Let's suppose each sub-string in subs_lst appears only once in word_lst (sub-strings can be of any length).
I want an easy way to find the position of each sub-string in word_lst.
So what I want is to order subs_lst according to their order of appearance in word_lst.
In the previous example, the output would be:
out = ['This is', 'a great', 'language']
Does anyone know an easy way to do this?

There's probably a faster way to do this, but this works, at least:
word_lst = ['This','is','a','great','programming','language']
subs_lst= ['This is', 'language', 'a great']
substr_lst = [' '.join(word_lst[i:j]) for i in range(len(word_lst)) for j in range(i+1, len(word_lst)+1)]
sorted_subs_list = sorted(subs_lst, key=lambda x:substr_lst.index(x))
print(sorted_subs_list)
Output:
['This is', 'a great', 'language']
The idea is to build a list of every substring in word_lst, ordered so that all the entries that start with "This" come first, followed by all the entries starting with "is", etc. We store that in substr_lst.
>>> print(substr_lst)
['This', 'This is', 'This is a', 'This is a great', 'This is a great programming', 'This is a great programming language', 'is', 'is a', 'is a great', 'is a great programming', 'is a great programming language', 'a', 'a great', 'a great programming', 'a great programming language', 'great', 'great programming', 'great programming language', 'programming', 'programming language', 'language']
Once we have that list, we sort subs_lst, using the index of each entry in substr_lst as the key to sort by:
>>> substr_lst.index("This is")
1
>>> substr_lst.index("language")
20
>>> substr_lst.index("a great")
12

The intermediate step seems unneeded to me. Why not just make the word list a single string and find the substrings in that?
sorted(subs_lst, key = lambda x : ' '.join(word_lst).index(x))
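One caveat with searching the joined string: str.index can match inside a word (for example, 'is' is found inside 'This'). If that matters for your data, a possible alternative is to compare whole words. This is only a sketch; word_position is a made-up helper name, and the second word list is an invented example chosen to show the difference.

```python
def word_position(sub, words):
    """Index of the first word at which `sub` matches on whole words.

    Hypothetical helper; raises ValueError if `sub` does not occur.
    """
    target = sub.split()
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return i
    raise ValueError(f'{sub!r} not found')

word_lst = ['This', 'is', 'a', 'great', 'programming', 'language']
subs_lst = ['This is', 'language', 'a great']
print(sorted(subs_lst, key=lambda s: word_position(s, word_lst)))
# ['This is', 'a great', 'language']

# Invented example where matching inside a word changes the order:
words = ['This', 'a', 'is']
subs = ['a', 'is']
print(sorted(subs, key=' '.join(words).index))               # ['is', 'a']
print(sorted(subs, key=lambda s: word_position(s, words)))   # ['a', 'is']
```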
