How to split a string by spaces and remove non-ASCII characters?

Given a string like "Ready[[[, steady, go!", I want to turn it into a list like this: ['Ready', 'steady', 'go!']. The best I have managed so far is two list comprehensions, but I can't figure out how to combine them.
text_list = [i for i in text.split()]
output: ['Ready[[[,', 'steady,', 'go!']
clean_list = [x for x in list(text) if x in string.ascii_letters]
output: ['R', 'e', 'a', 'd', 'y', 's', 't', 'e', 'a', 'd', 'y', 'g', 'o']
clean_list does succeed in removing the characters that are not ASCII letters, but it turns every single character into a separate list element. text_list keeps the words intact but does not remove the unwanted characters. How do I combine the two approaches to get the output that I want?

This should work:
import re, string
# filter out all unwanted characters using regex
pattern = re.compile(f"[^{string.ascii_letters} !]")
filtered = pattern.sub('', "Ready[[[, steady, go!")
# split
result = filtered.split()
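
For comparison, the two comprehensions from the question can also be combined directly: split first, then filter the characters of each token. This is only a minimal sketch; the function name `clean_tokens` and the decision to keep `!` (so the result matches the desired output) are assumptions, not part of the original answer.

```python
import string

# Characters to keep: ASCII letters plus "!" (assumed, to match the desired output).
ALLOWED = set(string.ascii_letters + "!")

def clean_tokens(text):
    """Split on whitespace, then drop disallowed characters from each token."""
    cleaned = ("".join(ch for ch in token if ch in ALLOWED) for token in text.split())
    # Skip tokens that become empty after filtering.
    return [token for token in cleaned if token]

print(clean_tokens("Ready[[[, steady, go!"))  # ['Ready', 'steady', 'go!']
```

Unlike the regex version, this filters token by token, so a token made up entirely of unwanted characters simply disappears from the result.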

Related

Getting strings from list using python

Hi, I am new to Python. I am trying to delete some unwanted characters and tidy up the format.
My lists are:
List =
['2', '4a.', 'D', '__|5.', 'E|6.', 'F', '|7.', 'G', '—|8.']
['9', '10.', "QRS(q,r", 's)', '11.', 'TUV/', '12.', "XYZ:"]
I want to get the following lists:
['D', 'E', 'F', 'G']
["QRS(q,r,s)", 'TUV/', "XYZ:"]
I want to delete the numbers and the alphanumeric items.
There are two challenges here:
In the first list I have 'E|6.' and I want only the string 'E'.
In the second list I have "QRS(q,r", 's)' and I want it as the single string "QRS(q,r,s)".
Can anyone please help me out? Thanks in advance.

First you need to distinguish special characters from alphabetic characters. For this you need a list of the alphabetic characters:
import string
string.ascii_lowercase  # all ASCII letters in lowercase
Then iterate through every string in the lists and keep the characters that are letters, comparing in uppercase so that both cases match.
import string
List = ['2', '4a.', 'D', '__|5.', 'E|6.', 'F', '|7.', 'G', '—|8.'], ['9', '10.', "QRS(q,r", 's)', '11.', 'TUV/', '12.', "XYZ:"]
what_you_need = []
for h in List:          # each of the two lists
    for i in h:         # each string in the list
        for j in i:     # each character in the string
            if j.upper() in string.ascii_lowercase.upper():
                # collects every ASCII letter, one character at a time
                what_you_need.append(j)
print(what_you_need)

You can try a regular expression.
See the example below:
import re
l1 = []
a = ['2', '4a.', 'D', '__|5.', 'E|16.', 'F', '|7.', 'G', '—|8.']
for idx, ele in enumerate(a):
    if '(' in ele:
        # join an item that opens a parenthesis with the item that follows it
        l1.append(ele + ',' + a[idx+1])
        continue
    elif ')' in ele:
        # already merged into the previous item
        pass
    elif any([i.isdigit() for i in ele]):
        # item contains a digit: keep only its first uppercase letter, if any
        g = re.findall(r"([A-Z])", ele)
        if g:
            l1.append(g[0])
    else:
        l1.append(ele)
In the same way you can prepare a regular expression for the other list.
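
One possible way to handle the second list is sketched below. It assumes that purely numeric items such as '9' or '10.' should be dropped and that an item with an unclosed '(' should be joined with the following items using ',' until the ')' appears. The function name `clean_second_list` is hypothetical, and these rules are inferred from the example in the question rather than stated by the asker.

```python
import re

def clean_second_list(items):
    """Drop numeric labels and merge items split inside parentheses."""
    result = []
    pending = None  # holds an item whose "(" has not been closed yet
    for item in items:
        if pending is not None:
            # Still inside parentheses: glue this item onto the pending one.
            pending = pending + "," + item
            if ")" in item:
                result.append(pending)
                pending = None
            continue
        if re.fullmatch(r"\d+\.?", item):
            # Items like '9' or '10.' are treated as numbering and skipped.
            continue
        if "(" in item and ")" not in item:
            pending = item
            continue
        result.append(item)
    if pending is not None:
        # Unclosed parenthesis at the end of the input: keep what we have.
        result.append(pending)
    return result

second = ['9', '10.', "QRS(q,r", 's)', '11.', 'TUV/', '12.', "XYZ:"]
print(clean_second_list(second))  # ['QRS(q,r,s)', 'TUV/', 'XYZ:']
```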

How do I create a new list with a nested list comprehension?

Say I have a list of words
word_list = ['cat','dog','rabbit']
and I want to end up with a list of letters (not including any repeated letters), like this:
['c', 'a', 't', 'd', 'o', 'g', 'r', 'b', 'i']
Without a list comprehension the code would look like this:
letter_list = []
for a_word in word_list:
    for a_letter in a_word:
        if a_letter not in letter_list:
            letter_list.append(a_letter)
print(letter_list)
is there a way to do this with a list comprehension?
I have tried
letter_list = [a_letter for a_letter in a_word for a_word in word_list]
but I get a
NameError: name 'a_word' is not defined
error. I have seen answers to similar problems, but they usually iterate over a nested collection (list or tuple). Is there a way to do this starting from a non-nested list like word_list?
Trying
letter_list = [a_letter for a_letter in [a_word for a_word in word_list]]
Results in the initial list: ['cat','dog','rabbit']
And trying
letter_list = [[a_letter for a_letter in a_word] for a_word in word_list]
Results in:[['c', 'a', 't'], ['d', 'o', 'g'], ['r', 'a', 'b', 'b', 'i', 't']], which is closer to what I want except it's nested lists. Is there a way to do this and have just the letters be in letter_list?

Update. How about this:
word_list = ['cat','dog','rabbit']
new_list = [letter for letter in ''.join(word_list)]
new_list = sorted(set(new_list), key=new_list.index)
print(new_list)
Output:
['c', 'a', 't', 'd', 'o', 'g', 'r', 'b', 'i']

word_list = ['cat','dog','rabbit']
letter_list = list(set([letter for word in word_list for letter in word]))
This works and removes the duplicate letters, but the order is not preserved. If you want to keep the order you can do this.
from collections import OrderedDict
word_list = ['cat','dog','rabbit']
letter_list = list(OrderedDict.fromkeys("".join(word_list)))
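
On Python 3.7 and later a plain dict also preserves insertion order, so the same idea works without the import. A short sketch under that version assumption:

```python
word_list = ['cat', 'dog', 'rabbit']

# dict.fromkeys keeps the first occurrence of each key, in insertion order
# (guaranteed for plain dicts from Python 3.7 onward).
letter_list = list(dict.fromkeys(letter for word in word_list for letter in word))

print(letter_list)  # ['c', 'a', 't', 'd', 'o', 'g', 'r', 'b', 'i']
```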

you can do it by using list comprehension
l=[j for i in word_list for j in i ]
print(l)
output:
['c', 'a', 't', 'd', 'o', 'g', 'r', 'a', 'b', 'b', 'i', 't']

You can use a list comprehension. It is faster than an explicit loop in cases like yours, where you call .append on each iteration, as explained by this answer.
But if you want to keep only unique letters (i.e. without repeating any letter), you can use a set comprehension by changing the square brackets [] to curly braces {}, as in
letter_set = {letter for word in word_list for letter in word}
This way you avoid scanning the partial list on every iteration to see whether the letter is already present. Instead you rely on Python's built-in hashing, which makes the code a lot faster.
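
The NameError in the question comes from the order of the for clauses: in a comprehension they are written in the same order as the equivalent nested loops, outermost first. A small sketch to illustrate (the variable names `flat` and `nested` are only for this example):

```python
word_list = ['cat', 'dog', 'rabbit']

# Outer loop first, inner loop second, exactly as in the nested-loop version.
flat = [letter for word in word_list for letter in word]

nested = []
for word in word_list:
    for letter in word:
        nested.append(letter)

assert flat == nested

# The same clause order applies to a set comprehension.
letter_set = {letter for word in word_list for letter in word}
print(sorted(letter_set))  # sorted only to get a stable display order
```

Keep in mind that a set has no defined order, so this variant fits only when the order of the letters does not matter.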

Another solution:
>>> s = set()
>>> word_list = ['cat', 'dog', 'rabbit']
>>> [c for word in word_list for c in word if (c not in s, s.add(c))[0]]
['c', 'a', 't', 'd', 'o', 'g', 'r', 'b', 'i']
This will test whether the letter is already in the set or not, and it will unconditionally add it to the set (having no effect if it is already present). The None returned from s.add is stored in the temporary tuple but otherwise ignored. The first element of the temporary tuple (that is, the result of the c not in s) is used to filter the items.
This relies on the fact that the elements of the temporary tuple are evaluated from left to right.
Could be considered a bit hacky :-)
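
If the tuple trick feels too clever, the same seen-set idea can be written as an explicit generator. This is a sketch of an alternative, not the answer's own code; the name `unique_letters` is hypothetical.

```python
def unique_letters(words):
    """Yield each letter the first time it appears, preserving order."""
    seen = set()
    for word in words:
        for c in word:
            if c not in seen:
                seen.add(c)
                yield c

word_list = ['cat', 'dog', 'rabbit']
print(list(unique_letters(word_list)))  # ['c', 'a', 't', 'd', 'o', 'g', 'r', 'b', 'i']
```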

i want to remove empty string from a list

Can someone please tell me why I can't remove the empty strings using the following code?
numlist = list()
tim = "s x s f f f"
timo = tim.strip()
for line in timo:
    numlist.append(line)
list(filter(None, numlist))
print(numlist)
output: ['s', ' ', 'x', ' ', 's', ' ', 'f', ' ', 'f', ' ', 'f']
desired output: ['s', 'x', 's', 'f', 'f', 'f']

Use split, not strip. strip only removes leading and trailing characters.
In [35]: tim = "s x s f f f"
In [36]: tim.split()
Out[36]: ['s', 'x', 's', 'f', 'f', 'f']

You forgot to assign the result of the filtering back to numlist, so it made the new list and discarded it. Just make the line:
numlist = list(filter(None, numlist))
That said, it wouldn't have done what you wanted, because a string consisting of a single space is still truthy. If you want to drop whitespace-only strings as well, a simple tweak would be:
numlist = list(filter(str.strip, numlist))
Or simplifying further (but with different behavior if the input isn't always single characters with space separation), replace the entirety of your code with just:
tim = "s x s f f f"
numlist = tim.split()
print(numlist)
as no-arg split will split on whitespace, remove leading and trailing whitespace, and return the list of non-whitespace components as a single efficient action.
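
The difference between the two filters can be checked with a short sketch (the variable names here are only illustrative):

```python
chars = list("s x s f f f")

# filter(None, ...) drops only falsy items; ' ' is a non-empty string, so it stays.
kept_by_none = list(filter(None, chars))

# str.strip(' ') returns '', which is falsy, so single spaces are dropped.
kept_by_strip = list(filter(str.strip, chars))

print(kept_by_none == chars)  # True
print(kept_by_strip)          # ['s', 'x', 's', 'f', 'f', 'f']
```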

Converting to lower-case: every letter gets tokenized

I have a text document that I want to convert to lower case, but when I do it in the following way every letter of my document gets tokenized. Why does it happen?
with open('assign_1.txt') as g:
    assign_1 = g.read()
assign_new = [word.lower() for word in assign_1]
What I get:
assign_new
['b',
'a',
'n',
'g',
'l',
'a',
'd',
'e',
's',
'h',]

You iterated through the entire input one character at a time, converted each character to lower case, and collected the results in a list. It's simpler than that:
assign_lower = assign_1.lower()
Naming the loop variable "word" doesn't make you iterate over words; assign_1 is still a sequence of characters.
If you want to break the text into words, use the split method, which is independent of the lower-case operation.
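
A small sketch of the difference, using an in-memory string in place of the file contents (the sample text is made up for illustration):

```python
# Stand-in for the text read from the file; the content is an assumption.
assign_1 = "Bangladesh Is A Country"

# Iterating over a string yields characters, whatever the loop variable is called.
per_char = [word.lower() for word in assign_1]

# Lower-case the whole string once, then split it into words.
assign_lower = assign_1.lower()
words = assign_lower.split()

print(per_char[:4])  # ['b', 'a', 'n', 'g']
print(words)         # ['bangladesh', 'is', 'a', 'country']
```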

find all characters NOT in regex pattern

Let's say I have a regex of legal characters
legals = re.compile("[abc]")
I can return a list of legal characters in a string like this:
finder = re.finditer(legals, "abcdefg")
[match.group() for match in finder]
>>>['a', 'b', 'c']
How can I use regex to find a list of the characters NOT matched by the regex? I.e. in my case it would return
['d','e','f','g']
Edit: To clarify, I'm hoping to find a way to do this without modifying the regex itself.

Negate the character class:
>>> illegals = re.compile("[^abc]")
>>> finder = re.finditer(illegals, "abcdefg")
>>> [match.group() for match in finder]
['d', 'e', 'f', 'g']
If you can't do that (and you're only dealing with one-character matches), you could remove the legal characters and keep what remains:
>>> legals = re.compile("[abc]")
>>> remains = legals.sub("", "abcdefg")
>>> [char for char in remains]
['d', 'e', 'f', 'g']
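
Another option that leaves the pattern untouched is to test each character against it. This sketch makes the same assumption as above, namely that the pattern matches exactly one character at a time; the function name `chars_not_matching` is hypothetical.

```python
import re

legals = re.compile("[abc]")

def chars_not_matching(pattern, text):
    """Return the characters of text that the compiled pattern does not match."""
    # fullmatch succeeds only if the pattern matches the whole one-character string.
    return [ch for ch in text if not pattern.fullmatch(ch)]

print(chars_not_matching(legals, "abcdefg"))  # ['d', 'e', 'f', 'g']
```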
