Can someone please tell me why I can't remove the empty strings using the following code?
numlist = list()
tim = "s x s f f f"
timo = tim.strip()
for line in timo:
    numlist.append(line)
list(filter(None, numlist))
print(numlist)
output: ['s', ' ', 'x', ' ', 's', ' ', 'f', ' ', 'f', ' ', 'f']
desired output: ['s', 'x', 's', 'f', 'f', 'f']
Use split, not strip. strip only removes leading and trailing characters (whitespace by default).
In [35]: tim = "s x s f f f"
In [36]: tim.split()
Out[36]: ['s', 'x', 's', 'f', 'f', 'f']
You forgot to assign the result of the filtering back to numlist, so it made the new list and discarded it. Just make the line:
numlist = list(filter(None, numlist))
That said, it wouldn't have done what you wanted, because a string consisting of a single space is still truthy. If you want whitespace-only strings to be treated as falsy and dropped, a simple tweak would be:
numlist = list(filter(str.strip, numlist))
Or simplifying further (but with different behavior if the input isn't always single characters with space separation), replace the entirety of your code with just:
tim = "s x s f f f"
numlist = tim.split()
print(numlist)
as no-arg split will split on whitespace, remove leading and trailing whitespace, and return the list of non-whitespace components as a single efficient action.
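To compare the three approaches side by side, here is a minimal sketch. The variable names are only illustrative and are not part of the original code.

```python
tim = "s x s f f f"
chars = list(tim)  # one element per character, spaces included

# filter(None, ...) drops only falsy items; ' ' is a non-empty string, so it stays
kept_spaces = list(filter(None, chars))

# str.strip(' ') returns '', which is falsy, so whitespace-only items are dropped
no_spaces = list(filter(str.strip, chars))

# no-argument split() does the whole job in one call
words = tim.split()

print(kept_spaces)
print(no_spaces)
print(words)
```

The first list still contains the spaces, while the second and third should both match the desired output.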
Related
I'm trying to make a custom tokenizer in python that works with inline tags. The goal is to take a string input like this:
'This is *tag1* a test *tag2*.'
and have it output a list separated by tag and character:
['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
Without the tags, I would just use list(), and I think I found a solution for dealing with a single tag type, but there are multiple. There are also other multi-character segments, such as ellipses, that are supposed to be encoded as a single feature.
One thing I tried is replacing the tag with a single unused character with regex and then using list() on the string:
import re
text = 'This is *tag1* a test *tag2*.'
tidx = re.findall(r'\*.*?\*', text)  # all tags, in order (re.match would only look at the start of the string)
text = re.sub(r'\*.*?\*', r'#', text)
text = list(text)
Then I would iterate over it and replace each '#' with the extracted tags. However, I have multiple different features I am trying to extract, and repeating the process with a different placeholder character for each one before splitting the string seems like poor practice. Is there an easier way to do something like this? I'm still quite new to this, so there are a lot of common methods I am unaware of. I guess I could also use a larger regex that encompasses all of the features I'm trying to extract, but it still feels hacky, and I would prefer something more modular that can be used to find other features without writing a new expression every time.
You can use the following regex with re.findall:
\*[^*]*\*|.
The re.S (or re.DOTALL) flag can be used with this pattern so that . also matches line-break characters, which it does not match by default.
Details
\*[^*]*\* - a * char, followed by zero or more chars other than *, and then a *
| - or
. - any one char (with re.S).
See the Python demo:
import re
s = 'This is *tag1* a test *tag2*.'
print( re.findall(r'\*[^*]*\*|.', s, re.S) )
# => ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
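The same alternation idea can be kept modular by building the pattern from a list of feature regexes. The sketch below is only one possible arrangement: FEATURES, TOKEN_RE and tokenize are hypothetical names, and the three-dot ellipsis pattern is an assumption about what the other features look like.

```python
import re

# Hypothetical feature list: each entry is a regex for one multi-character token.
FEATURES = [
    r'\*[^*]*\*',  # inline tags such as *tag1*
    r'\.\.\.',     # an ellipsis written as three dots (assumed format)
]

# Try every feature first, then fall back to any single character.
TOKEN_RE = re.compile('|'.join(FEATURES) + '|.', re.S)

def tokenize(text):
    """Return a list of feature tokens and single characters."""
    return TOKEN_RE.findall(text)

print(tokenize('Wait... *tag1* ok'))
```

Adding a feature then means appending one entry to the list. Feature patterns should avoid capturing groups (use (?:...) instead), because re.findall returns the groups rather than the whole match when a pattern contains them.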
I'm not sure exactly what would be best for you, but you should be able to use the split() method or the .format() method showcased below to get what you want.
# you can use this to get what you need
txt = 'This is *tag1* a test *tag2*.'
x = txt.split("*") #Splits up at *
x = txt.split() #Splits all the words up at the spaces
print(x)
# also, you may be looking for something like this to format a string
mystring = 'This is {} a test {}.'.format('*tag1*', '*tag2*')
print(mystring)
# using split to get ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
txt = 'This is *tag1* a test *tag2*.'
pieces = txt.split("*")  # ['This is ', 'tag1', ' a test ', 'tag2', '.']
finallist = []  # initialize the list
for index, piece in enumerate(pieces):
    if index % 2 == 1:
        # odd-indexed pieces sat between two '*' characters, so they are tags;
        # split removed the asterisks, so put them back
        finallist.append('*' + piece + '*')
    else:
        # everything else is plain text: add it one character at a time
        finallist.extend(piece)
print(finallist)
When I am given a string like "Ready[[[, steady, go!", I want to turn it into a list like this: [Ready, steady, go!]. Currently, the best I could do is two list comprehensions, but I couldn't figure out a way to combine them.
text_list = [i for i in text.split()]
output: ['Ready[[[,', 'steady,', 'go!']
clean_list = [x for x in list(text) if x in string.ascii_letters]
output: ['R', 'e', 'a', 'd', 'y', 's', 't', 'e', 'a', 'd', 'y', 'g', 'o']
clean_list does succeed in removing the characters that are not ASCII letters, but it turns every single character into a list element. text_list keeps the words intact but does not remove the unwanted characters. How do I combine the two approaches to get the output that I want?
This should work:
import re, string
# filter out all unwanted characters using regex
pattern = re.compile(f"[^{string.ascii_letters} !]")
filtered = pattern.sub('', "Ready[[[, steady, go!")
# split
result = filtered.split()
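If you would rather stay close to the two comprehensions from the question, they can be combined without a regex: split into words first, then filter the characters of each word. This is a sketch under the assumption that '!' should survive, as it does in the desired output; the name allowed is illustrative.

```python
import string

text = "Ready[[[, steady, go!"
# assumption: keep ASCII letters plus '!', matching the desired output
allowed = set(string.ascii_letters + "!")

# split into words first, then keep only the allowed characters of each word
result = ["".join(ch for ch in word if ch in allowed) for word in text.split()]
# drop any word that became empty after filtering
result = [word for word in result if word]

print(result)
```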
I wrote some code that needs tabs in its output, but I cannot figure out how to add the tabs in the right places. See below for my code and the docstring showing what it should return, and what it returns instead. Maybe I should rethink my whole approach?
def display_game(guesses, clues):
    '''(list, list) -> str
    Given two lists of single character strings, return a string
    that displays the current state of the game
    >>> display_game([['Y', 'P', 'G', 'G'], ['O', 'O', 'G', 'G']], [['b', 'b'], ['b','b', 'b', 'b']])
    'Guess\tClues\n****************\nY P G G\tb b\nO O G G\tb b b b\n'
    '''
    display = 'Guess\tClues\n****************\n'
    for i in range(len(guesses)):
        for letter in guesses[i]:
            display += letter + ' '
        for letter in clues[i]:
            display += letter + ' '
        display += '\n'
    return display
When I use it (using the doc string example), I get:
display_game([['Y', 'P', 'G', 'G'], ['O', 'O', 'G', 'G']], [['b', 'b'], ['b','b', 'b', 'b']])
'Guess\tClues\n****************\nY P G G b b \nO O G G b b b b \n'
Any attempt to put \t in the code has it turning out wrong (ex: with \t between each string instead of where they should be as per the doc string). Is anyone able to suggest how I may change things around? Thanks!
Your code does not add a tab in between the guess and the clue. You could simply add
display += '\t'
in between the first and second nested for loops, however, you then need to ensure that a trailing space is not added at the end of the first loop.
str.join() is a better way to handle this as it only adds delimiter strings in between the items of a sequence:
>>> ' '.join(['a', 'b', 'c'])
'a b c'
Notice that there is no trailing space character in the above. Applying that to your function:
def display_game(guesses, clues):
    display = 'Guess\tClues\n****************\n'
    for guess, clue in zip(guesses, clues):
        display += '{}\t{}\n'.format(' '.join(guess), ' '.join(clue))
    return display
zip() is also used here to pair each guess and clue. Then it's simply a matter of using str.join() on the guess and clue, and building the string with the tab in the required place.
>>> assert(display_game([['Y', 'P', 'G', 'G'], ['O', 'O', 'G', 'G']], [['b', 'b'], ['b','b', 'b', 'b']]) == 'Guess\tClues\n****************\nY P G G\tb b\nO O G G\tb b b b\n')
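The trailing-space difference between manual concatenation and str.join() can be checked directly. This is a small sketch using one guess and one clue from the docstring example; the names manual, joined and row are illustrative.

```python
guess = ['Y', 'P', 'G', 'G']
clue = ['b', 'b']

# manual concatenation leaves a space after the last item
manual = ''
for letter in guess:
    manual += letter + ' '

# join only places the separator between items
joined = ' '.join(guess)

# one finished row: guess, tab, clue, newline
row = '{}\t{}\n'.format(joined, ' '.join(clue))

print(repr(manual))
print(repr(joined))
print(repr(row))
```

Printing with repr() makes the trailing space and the tab visible, which is easy to miss in normal output.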
You can just add it in between the for loops:
for i in range(len(guesses)):
    for letter in guesses[i]:
        display += letter + ' '
    display += '\t' # right here
    for letter in clues[i]:
        display += letter + ' '
    display += '\n'
return display
This worked for me. Just add the tab between those two for loops of guesses and clues.
def display_game(guesses, clues):
    display = 'Guess \t Clues \n **************** \n'
    for i in range(len(guesses)):
        for letter in guesses[i]:
            display += letter + ' '
        display += '\t'
        for letter in clues[i]:
            display += letter + ' '
        display += '\n'
    return display
print(display_game('at', 'yk'))
This gave output:
Guess Clues
****************
a y
t k
If I have lists, for example:
['6'] #Number
['!'] #Punctuation
['r'] #Alphabet
['8'] #Number
['/'] #Punctuation
['e'] #Alphabet
['5'] #Number
[':'] #Punctuation
['l'] #Alphabet
I use data = line.strip().split(' ') to convert it into this form from a csv file.
I am trying to assign the elements in the lists to their respective variables.
For example, number will contain the lists that have numbers in them, punctuation will contain the lists that have punctuation in them, and alphabet will contain the lists that have letters in them.
What I can't understand is if I do something like
number = data[0], punc = data[1], alpha = data[2]
I get an error:
List index out of range.
So how can I solve this problem?
My code:
for line in new_file:
    text = [line.strip() for line in line.split(' ')]
This part of your code appears to be fine
for line in new_file:
    text = [line.strip() for line in line.split(' ')]
however if you are doing the following
for line in new_file:
    text = [line.strip() for line in line.split(' ')]
    number, punc, alpha = text[0], text[1], text[2]
You may run into problems. Take, for example, the following line in your file:
"hello world"
If you split this line, you get the list ["hello", "world"], which contains two elements.
If you assign this result as text = ["hello", "world"]
and then try to read a third element with
alpha = text[2]
you will certainly receive "List index out of range". Why?
Because text[2] does not exist!
Some lines may contain fewer than 3 words (as in this example).
Revise your approach
Try using a dictionary approach
alpha={"alphabet":[]}
numb={"number":[]}
punc={"punctuation":[]}
Then iterate through the file and use list comprehensions to select all punctuation, letters, etc., and add them to the respective dictionary entries. If you are having trouble, post your revised code.
EDIT: A working example of how I would tackle this
Let's say I have a file named new_file with the content below
hello my name is repzERO
AND THIS IS my age: 100 years
A python script I tried
import re

alpha = {"alphabet": []}
numb = {"number": []}
punc = {"punctuation": []}

with open("new_file", "r") as new_file:
    for line in new_file:
        # letters and spaces
        alpha["alphabet"] += [c for c in line if re.search(r"[a-zA-Z ]", c)]
        # digits
        numb["number"] += [c for c in line if re.search(r"[0-9]", c)]
        # anything that is neither a word character nor whitespace
        punc["punctuation"] += [c for c in line if re.search(r"[^\w\s]", c)]

print(alpha)
print(numb)
print(punc)
output
{'alphabet': ['h', 'e', 'l', 'l', 'o', ' ', 'm', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 'r', 'e', 'p', 'z', 'E', 'R', 'O', 'A', 'N', 'D', ' ', 'T', 'H', 'I', 'S', ' ', 'I', 'S', ' ', 'm', 'y', ' ', 'a', 'g', 'e', ' ', ' ', 'y', 'e', 'a', 'r', 's']}
{'number': ['1', '0', '0']}
{'punctuation': [':']}
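The same classification can also be sketched without regular expressions, using the built-in string methods and string.punctuation. The function name classify is hypothetical, and the sketch takes a list of lines rather than a file so that it runs on its own. Note two differences from the regex version above: spaces are not stored in the alphabet list, and str.isalpha() and str.isdigit() also accept non-ASCII letters and digits.

```python
import string

def classify(lines):
    """Sort every character of the given lines into three lists."""
    groups = {"alphabet": [], "number": [], "punctuation": []}
    for line in lines:
        for c in line:
            if c.isalpha():
                groups["alphabet"].append(c)
            elif c.isdigit():
                groups["number"].append(c)
            elif c in string.punctuation:
                groups["punctuation"].append(c)
            # whitespace and anything else is ignored
    return groups

groups = classify(["my age: 100 years\n"])
print(groups)
```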
Your lists seem to have fewer elements than you expect.
Something like this:
yourVariableName = ["what", "ever", "elements", "are", "here"]
is called a list. The list above has 5 elements. You can access the elements with a numeric index i:
yourVariableName[i]
where i is in this case either 0, 1, 2, 3 or 4 (or a negative number when you want to count from the end). When you try
yourVariableName[5]
or even higher, you get an "index out of range" error.
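A short sketch of the indexing rules described above, plus one way to guard against the error by checking the length first. The guard at the end reuses the variable names from the question purely for illustration.

```python
words = ["what", "ever", "elements", "are", "here"]

print(words[0])    # first element
print(words[-1])   # last element, counting from the end

try:
    words[5]       # one past the last valid index
except IndexError as exc:
    print("IndexError:", exc)

# checking the length first avoids the exception
text = "hello world".split(' ')
if len(text) >= 3:
    number, punc, alpha = text[0], text[1], text[2]
else:
    print("line has only", len(text), "fields")
```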
This is my current code:
key = input("Enter the key: ")
sent = input("Enter a sentence: ")
print()# for turnin
print()
print("With a key of:",key)
print("Original sentence:",sent)
print()
#split = sent.split()
blank = [ ]
for word in sent:
    for ch in word:
        blank = blank + ch.split()
print(blank)
print()
What I have now gives me a list of all the letters in my sentence, but no spaces. If I use this...
for word in sent:
    for ch in word:
        print(ch.split())
It gives me a list of all characters including the spaces. Is there a way to get this result and assign it to a variable?
If you just want a list of all characters in the sentence, use
chars = list(sent)
What you're doing is definitely not what you think you're doing.
for word in sent:
This doesn't loop over the words. This loops over the characters. This:
for word in sent.split():
would loop over the words.
for ch in word:
Since word is a character already, this loops over a single character. If it weren't for the fact that characters are represented as length-1 strings, this would throw some kind of error.
sent is of type string, and when you iterate over a string this way:
for word in sent:
you get the individual characters, not the words.
Then you iterate over a single char:
for ch in word:
and get that very same char (!).
And then with that split() call you convert a non-blank character, say 'x', into a list with itself as its only element (['x']) and a blank character into the empty list.
You probably want something along the lines of:
for word in sent.split():
....
But if what you want is to build a list of words, no need to iterate, that's exactly what sent.split() will get you!
And if what you want is a list of chars, do list(sent).
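The behaviour described above can be reproduced in a few lines. This is a minimal sketch with an illustrative sentence; pieces, words and chars are not names from the question.

```python
sent = "hi there"

# iterating over a string yields characters; splitting a single character
# gives ['x'] for a non-blank character and [] for a blank one
pieces = [ch.split() for ch in sent]

words = sent.split()   # list of words
chars = list(sent)     # list of characters, spaces included

print(pieces)
print(words)
print(chars)
```

The empty list in pieces is why concatenating the results of ch.split() silently loses the spaces.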
From help(str.split):
split(...)
S.split(sep=None, maxsplit=-1) -> list of strings
Return a list of the words in S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are
removed from the result.
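The difference between the default separator and an explicit ' ' is easy to miss, so here is a small sketch of what the help text describes. The input strings are made up for illustration.

```python
line = "  a  b c "

# sep=None (the default): any run of whitespace separates, no empty strings
default_split = line.split()

# sep=' ': every single space separates, so empty strings are kept
space_split = line.split(' ')

# maxsplit=1: at most one split is done, starting from the left
limited = "a b c d".split(maxsplit=1)

print(default_split)
print(space_split)
print(limited)
```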
If you want individual characters of a string, pass it to list.
>>> list('This is a string.')
['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', '.']
I'm not 100% sure what you're asking, but it seems like....
blank = [ch for ch in sent]
...that's all you need....
Let me give you some sample Ins and Outs and see if that's what you want.
IN = "Hello world!"
OUT =>
['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']
Is that right?
>>> string = "This here is a string"
>>> x = list(string)  # I think list() is what you are looking for... It's not clear
>>> print(x)
['T', 'h', 'i', 's', ' ', 'h', 'e', 'r', 'e', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g']
>>> print(string.split())  # with no argument, splits on any run of whitespace
['This', 'here', 'is', 'a', 'string']