Removing specific characters from string - python

I am new to NLP and trying to do some pre-processing steps on my data for a classification task. I have already done most of the cleaning but there still are some special characters within the text that I am now trying to remove.
The text is in a Dataframe and is already tokenized and lemmatized, converted to lowercase, with no stopwords and no punctuation.
Each text record is represented by a list of words.
['​‘the', 'redwood', 'massacre’', 'five', 'adventurous', 'friend', 'visiting', 'legendary', 'murder', 'site', 'redwood', 'hallmark', 'exciting', 'thrilling', 'camping', 'weekend', 'away', 'soon', 'discover', 'they’re', 'people', 'mysterious', 'location', 'fun', 'camping', 'expedition', 'soon', 'turn', 'nightmare', 'sadistically', 'stalked', 'mysterious', 'unseen', 'killer']
I tried the following code and other solutions as well but I can't understand why the output splits the words into single letters instead of just removing the special character, leaving the words in a compact format.
def remove_character(text):
    new_text=[word.replace('€','') for word in text]
    return new_text
df["Column_name"]=df["Column_name"].apply(lambda x:remove_character(x))
After applying the function this is the output on the same text record:
"['[', ""'"", 'â', '', '‹', 'â', '', '˜', 't', 'h', 'e', ""'"", ',', ' ', ""'"", 'r', 'e', 'd', 'w', 'o', 'o', 'd', ""'"", ',', ' ', ""'"", 'm', 'a', 's', 's', 'a', 'c', 'r', 'e', 'â', '', '™', ""'"", ',', ' ', ""'"", 'f', 'i', 'v', 'e', ""'"", ',', ' ', ""'"", 'a', 'd', 'v', 'e', 'n', 't', 'u', 'r', 'o', 'u', 's', ""'"", ',', ' ', ""'"", 'f', 'r', 'i', 'e', 'n', 'd', ""'"", ',', ' ', ""'"", 'v', 'i', 's', 'i', 't', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 'l', 'e', 'g', 'e', 'n', 'd', 'a', 'r', 'y', ""'"", ',', ' ', ""'"", 'm', 'u', 'r', 'd', 'e', 'r', ""'"", ',', ' ', ""'"", 's', 'i', 't', 'e', ""'"", ',', ' ', ""'"", 'r', 'e', 'd', 'w', 'o', 'o', 'd', ""'"", ',', ' ', ""'"", 'h', 'a', 'l', 'l', 'm', 'a', 'r', 'k', ""'"", ',', ' ', ""'"", 'e', 'x', 'c', 'i', 't', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 't', 'h', 'r', 'i', 'l', 'l', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 'c', 'a', 'm', 'p', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 'w', 'e', 'e', 'k', 'e', 'n', 'd', ""'"", ',', ' ', ""'"", 'a', 'w', 'a', 'y', ""'"", ',', ' ', ""'"", 's', 'o', 'o', 'n', ""'"", ',', ' ', ""'"", 'd', 'i', 's', 'c', 'o', 'v', 'e', 'r', ""'"", ',', ' ', ""'"", 't', 'h', 'e', 'y', 'â', '', '™', 'r', 'e', ""'"", ',', ' ', ""'"", 'p', 'e', 'o', 'p', 'l', 'e', ""'"", ',', ' ', ""'"", 'm', 'y', 's', 't', 'e', 'r', 'i', 'o', 'u', 's', ""'"", ',', ' ', ""'"", 'l', 'o', 'c', 'a', 't', 'i', 'o', 'n', ""'"", ',', ' ', ""'"", 'f', 'u', 'n', ""'"", ',', ' ', ""'"", 'c', 'a', 'm', 'p', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 'e', 'x', 'p', 'e', 'd', 'i', 't', 'i', 'o', 'n', ""'"", ',', ' ', ""'"", 's', 'o', 'o', 'n', ""'"", ',', ' ', ""'"", 't', 'u', 'r', 'n', ""'"", ',', ' ', ""'"", 'n', 'i', 'g', 'h', 't', 'm', 'a', 'r', 'e', ""'"", ',', ' ', ""'"", 's', 'a', 'd', 'i', 's', 't', 'i', 'c', 'a', 'l', 'l', 'y', ""'"", ',', ' ', ""'"", 's', 't', 'a', 'l', 'k', 'e', 'd', ""'"", ',', ' ', ""'"", 'm', 'y', 's', 't', 'e', 'r', 'i', 'o', 'u', 's', ""'"", ',', ' ', ""'"", 'u', 'n', 's', 'e', 'e', 'n', ""'"", ',', 
' ', ""'"", 'k', 'i', 'l', 'l', 'e', 'r', ""'"", ']']"

It seems you have single words in cells like this
$ df.head()
Column_name
0 ​‘the
1 redwood
2 massacre’
3 five
4 adventurous
so you shouldn't use for word in text. When text is a single string, that loop walks over its characters, exactly like for char in text, which is why each word gets split into letters.
You should call only replace() inside apply(), which runs it on every cell (similar to a for-loop)
df["Column_name"] = df["Column_name"].apply(lambda word: word.replace('€',''))
Minimal working example (so everyone can copy and run it)
import pandas as pd
def remove_character(text):
    return [word.replace('€', '') for word in text]
df = pd.DataFrame({'Column_name': ['​‘the', 'redwood', 'massacre’', 'five', 'adventurous', 'friend', 'visiting', 'legendary', 'murder', 'site', 'redwood', 'hallmark', 'exciting', 'thrilling', 'camping', 'weekend', 'away', 'soon', 'discover', 'they’re', 'people', 'mysterious', 'location', 'fun', 'camping', 'expedition', 'soon', 'turn', 'nightmare', 'sadistically', 'stalked', 'mysterious', 'unseen', 'killer']})
print(df.head())
df["Column_name"] = df["Column_name"].apply(lambda word: word.replace('€',''))
#df["Column_name"] = df["Column_name"].apply(lambda x:remove_character(x))
print(df.head())

Your remove_character function should return a string rather than a list. However, pandas includes the str accessor on a Series to perform operations on strings so another option you can use is
df["Column_name"] = df["Column_name"].str.replace('€','')
(no need to use apply)
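Which fix applies depends on what a cell actually holds. The sketch below is a hedged illustration, not the asker's exact data: the sample values are made up, the char parameter is an added convenience, and the last part assumes the lists were stored as text (which the '[' and "'" entries in the output above suggest, but the question does not confirm).

```python
import ast

def remove_character(cell, char="€"):
    """Remove `char` from a cell that is one string or a list of strings."""
    if isinstance(cell, str):
        # A plain string: replace directly, never loop over it.
        return cell.replace(char, "")
    # A list of words: replace inside each word.
    return [word.replace(char, "") for word in cell]

single_word = remove_character("mass€acre")
word_list = remove_character(["mass€acre", "five"])

# Assumption: the list was saved as text, e.g. after a CSV round trip.
stored = "['mass€acre', 'five']"
parsed = ast.literal_eval(stored)      # turn the text back into a real list
from_text = remove_character(parsed)

print(single_word, word_list, from_text)
```

A function like this can be passed straight to apply, as in df["Column_name"].apply(remove_character).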

Related

Joining individual elements of an array

I have an array consisting of labels but each label has been broken down by individual characters. For example, this is the first 2 elements of the array:
array([['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', ''], ...
I would like it to be formatted as such:
array(['1. Identifying, Assessing and Improving Care',
'9. Non-Pharmacological Interventions', ...
I want to be able to iterate through and concatenate the label output so it is as shown above.
Any help in achieving this would be much appreciated :) Many thanks!
import numpy as np
k=np.array([['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', '']])
for x in k:
    print(''.join(x))
#output
1. Identifying, Assessing and Improving Care
9. Non-Pharmacological Interventions
Using List comprehension:
[''.join(x) for x in k]
#output
['1. Identifying, Assessing and Improving Care',
'9. Non-Pharmacological Interventions']
Considering the array as a list of lists, you could join all characters by looping through the list:
r = [['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', '']]
t = ["".join(i) for i in r]
print(t)
Output:
['1. Identifying, Assessing and Improving Care',
'9. Non-Pharmacological Interventions']
array = [['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', '']]
# array(['1. Identifying, Assessing and Improving Care',
# '9. Non-Pharmacological Interventions', ...
array = [''.join(i) for i in array]
print(array) #['1. Identifying, Assessing and Improving Care', '9. Non-Pharmacological Interventions']
Assuming from array([...]) that you are using numpy, here's a solution
import numpy as np
a = np.array([['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', '']])
b = np.empty(a.shape[0], dtype=object)
for i, x in enumerate(a): b[i] = ''.join(x)
If you loop over each element of your array, you can then use str.join to get what you are looking for.
Something like:
arr = [['1', '.', ' ', 'I', ...], ...]
output = list()
for x in arr:
    output.append(''.join(x))
output
>>>
['1. Identifying, Assessing and Improving Care', ...]

Python encoding using ASCII

I am designing a program that will encode messages imported from a CSV file. It will do this by converting each character to its ASCII value, adding 2, and then converting the result back to a character.
My current problem is that while my code will encode each character in each string the messages are now no longer joined together.
Any help would be appreciated.
My code:
#importing csv file and allowing it to be read from
import csv
ifile = open("messages.csv","rb")
reader= csv.reader(ifile)
#creating lists
plain_text=[]
plain_ascii=[]
encrypted_ascii=[]
encrypted_text=[]
latitude=[]
longitude=[]
#appending csv data to separate lists
for row in reader:
    latitude.append(row[0])
    longitude.append(row[1])
    plain_text.append(row[2])
#encoding messages
encrypted_text=[[chr(ord(ch)+2) for ch in string] for string in plain_text]
print plain_text
print encrypted_text
ifile.close()
The current output:
['A famous Scottish victory in the First War of Scottish Independence - bit.ly/1yIAb8Q', "How high is Scotland's tallest mountain? - bit.ly/1q3Rj6D", 'What is the traditional instrument most often linked with Scotland? - http://#bit.ly/1lNdrk3', "A prickly problem Scotland's national symbol - bit.ly/1q3REpQ", 'Name the largest city in Scotland - bit.ly/T4OEuU']
[['C', '"', 'h', 'c', 'o', 'q', 'w', 'u', '"', 'U', 'e', 'q', 'v', 'v', 'k', 'u', 'j', '"', 'x', 'k', 'e', 'v', 'q', 't', '{', '"', 'k', 'p', '"', 'v', 'j', 'g', '"', 'H', 'k', 't', 'u', 'v', '"', 'Y', 'c', 't', '"', 'q', 'h', '"', 'U', 'e', 'q', 'v', 'v', 'k', 'u', 'j', '"', 'K', 'p', 'f', 'g', 'r', 'g', 'p', 'f', 'g', 'p', 'e', 'g', '"', '/', '"', 'd', 'k', 'v', '0', 'n', '{', '1', '3', '{', 'K', 'C', 'd', ':', 'S'], ['J', 'q', 'y', '"', 'j', 'k', 'i', 'j', '"', 'k', 'u', '"', 'U', 'e', 'q', 'v', 'n', 'c', 'p', 'f', ')', 'u', '"', 'v', 'c', 'n', 'n', 'g', 'u', 'v', '"', 'o', 'q', 'w', 'p', 'v', 'c', 'k', 'p', 'A', '"', '/', '"', 'd', 'k', 'v', '0', 'n', '{', '1', '3', 's', '5', 'T', 'l', '8', 'F'], ['Y', 'j', 'c', 'v', '"', 'k', 'u', '"', 'v', 'j', 'g', '"', 'v', 't', 'c', 'f', 'k', 'v', 'k', 'q', 'p', 'c', 'n', '"', 'k', 'p', 'u', 'v', 't', 'w', 'o', 'g', 'p', 'v', '"', 'o', 'q', 'u', 'v', '"', 'q', 'h', 'v', 'g', 'p', '"', 'n', 'k', 'p', 'm', 'g', 'f', '"', 'y', 'k', 'v', 'j', '"', 'U', 'e', 'q', 'v', 'n', 'c', 'p', 'f', 'A', '"', '/', '"', 'j', 'v', 'v', 'r', '<', '1', '1', 'd', 'k', 'v', '0', 'n', '{', '1', '3', 'n', 'P', 'f', 't', 'm', '5'], ['C', '"', 'r', 't', 'k', 'e', 'm', 'n', '{', '"', 'r', 't', 'q', 'd', 'n', 'g', 'o', '"', 'U', 'e', 'q', 'v', 'n', 'c', 'p', 'f', ')', 'u', '"', 'p', 'c', 'v', 'k', 'q', 'p', 'c', 'n', '"', 'u', '{', 'o', 'd', 'q', 'n', '"', '/', '"', 'd', 'k', 'v', '0', 'n', '{', '1', '3', 's', '5', 'T', 'G', 'r', 'S'], ['P', 'c', 'o', 'g', '"', 'v', 'j', 'g', '"', 'n', 'c', 't', 'i', 'g', 'u', 'v', '"', 'e', 'k', 'v', '{', '"', 'k', 'p', '"', 'U', 'e', 'q', 'v', 'n', 'c', 'p', 'f', '"', '/', '"', 'd', 'k', 'v', '0', 'n', '{', '1', 'V', '6', 'Q', 'G', 'w', 'W']]
You need to join the inner lists.
encrypted_text = [''.join(chr(ord(ch)+2) for ch in string) for string in plain_text]
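As a sketch of the same idea wrapped in a reusable function (shift_text and its offset parameter are hypothetical names, not from the question), joining inside the function keeps each message as one string, and shifting by the negative offset decodes it again. The sample messages are shortened versions of the ones in the question.

```python
def shift_text(message, offset):
    """Shift every character's code point by `offset` and return one string."""
    return ''.join(chr(ord(ch) + offset) for ch in message)

# Shortened sample messages; the real ones would come from the CSV file.
plain_text = ["A famous Scottish victory", "Name the largest city in Scotland"]

encrypted_text = [shift_text(message, 2) for message in plain_text]
decrypted_text = [shift_text(message, -2) for message in encrypted_text]

print(encrypted_text[0])
print(decrypted_text == plain_text)
```

Each element of encrypted_text is a single string, so the result has the same shape as plain_text.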

How to print a list without [ , '' in python

I want to print
*IBM is a trademark of the International Business Machine Corporation.
in python instead of this
['*', 'I', 'B', 'M', ' ', 'i', 's', ' ', 'a', ' ', 't', 'r', 'a', 'd', 'e', 'm', 'a', 'r', 'k', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'I', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', ' ', 'B', 'u', 's', 'i', 'n', 'e', 's', 's', ' ', 'M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'C', 'o', 'r', 'p', 'o', 'r', 'a', 't', 'i', 'o', 'n', '.']
My code:
n=str(input())
l=len(n)
m=[' ']*l
for i in range(l):
    m[i]=chr(ord(n[i])-7)
print(m)
Assuming that your list is:
this_is_a_list = ['*', 'I', 'B', 'M', ' ', 'i', 's', ' ', 'a', ' ', 't', 'r', 'a', 'd', 'e', 'm', 'a', 'r', 'k', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'I', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', ' ', 'B', 'u', 's', 'i', 'n', 'e', 's', 's', ' ', 'M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'C', 'o', 'r', 'p', 'o', 'r', 'a', 't', 'i', 'o', 'n', '.']
use join:
''.join(this_is_a_list)
extended:
in case you plan to use the string in the future: This method is extremely inefficient, but I'm going to leave it here as a showcase of what not to do: (Thanks to @PM 2Ring)
# BAD EXAMPLE, AVOID THIS METHOD
final_word = ""
for i in range(len(this_is_a_list)):
    final_word = final_word + this_is_a_list[i]
print(final_word)
further edited, thanks to @kuro
final_word = ''.join(this_is_a_list)
Use join
x = ['*', 'I', 'B', 'M', ' ', 'i', 's', ' ', 'a', ' ', 't', 'r', 'a', 'd', 'e', 'm', 'a', 'r', 'k', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'I', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', ' ', 'B', 'u', 's', 'i', 'n', 'e', 's', 's', ' ', 'M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'C', 'o', 'r', 'p', 'o', 'r', 'a', 't', 'i', 'o', 'n', '.']
print(''.join(x))
*IBM is a trademark of the International Business Machine Corporation.
The sensible way to do this is to use .join. And to perform the decoding operation you can loop directly over the chars of the input string rather than using indices.
s = input('> ')
a = []
for u in s:
    c = chr(ord(u) - 7)
    a.append(c)
print(''.join(a))
demo
> 1PIT'pz'h'{yhklthyr'vm'{ol'Pu{lyuh{pvuhs'I|zpulzz'Thjopul'Jvywvyh{pvu5
*IBM is a trademark of the International Business Machine Corporation.
We can make this much more compact by using a list comprehension.
s = input('> ')
print(''.join([chr(ord(u)-7) for u in s]))
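Two small variations that may be handy, shown as a sketch with made-up sample data: print() can unpack the list itself when no string is needed afterwards, and join only accepts strings, so items of other types have to be converted first.

```python
chars = ['*', 'I', 'B', 'M', ' ', 'i', 's']

# Build one string from the characters, then print it.
joined = ''.join(chars)
print(joined)

# Or let print() unpack the list and use an empty separator.
print(*chars, sep='')

# join() raises TypeError on non-string items, so convert them first.
mixed = ['v', 3, '.', 11]
joined_mixed = ''.join(str(item) for item in mixed)
print(joined_mixed)
```

The join form is the one to prefer when the resulting string is used again later.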
You can try this one
to_print = ['*', 'I', 'B', 'M', ' ', 'i', 's', ' ', 'a', ' ', 't', 'r', 'a', 'd', 'e', 'm', 'a', 'r', 'k', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'I', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', ' ', 'B', 'u', 's', 'i', 'n', 'e', 's', 's', ' ', 'M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'C', 'o', 'r', 'p', 'o', 'r', 'a', 't', 'i', 'o', 'n', '.']
word = ''
for i in range(len(to_print)):
    word = word + to_print[i]
print (word)

Get the characters from a list of lists

I have this example :
example=[["hello i am adolf","hi my name is "],["this is a test","i like to play"]]
So , I want to get the following array:
chars2=[['h', 'e', 'l', 'l', 'o', ' ', 'i', ' ', 'a', 'm', ' ', 'a', 'd', 'o', 'l', 'f','h', 'i', ' ', 'm', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's'],['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', 'i', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'o', ' ', 'p', 'l', 'a', 'y']]
I tried this:
chars2=[]
for list in example:
    for string in list:
        chars2.extend(string)
but i get the following:
['h', 'e', 'l', 'l', 'o', ' ', 'i', ' ', 'a', 'm', ' ', 'a', 'd', 'o', 'l', 'f', 'h', 'i', ' ', 'm', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 't', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', 'i', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'o', ' ', 'p', 'l', 'a', 'y']
For each list in example you need to add another list inside chars2 , currently you are just extending chars2 directly with each character.
Example -
chars2=[]
for list in example:
    a = []
    chars2.append(a)
    for string in list:
        a.extend(string)
Example/Demo -
>>> example=[["hello i am adolf","hi my name is "],["this is a test","i like to play"]]
>>> chars2=[]
>>> for list in example:
...     a = []
...     chars2.append(a)
...     for string in list:
...         a.extend(string)
...
>>> chars2
[['h', 'e', 'l', 'l', 'o', ' ', 'i', ' ', 'a', 'm', ' ', 'a', 'd', 'o', 'l', 'f', 'h', 'i', ' ', 'm', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' '], ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', 'i', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'o', ' ', 'p', 'l', 'a', 'y']]
Try using a nested list comprehension, which keeps one inner list per sublist:
chars2 = [[ch for item in sub for ch in item] for sub in example]
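Where the brackets go decides whether the grouping by sublist survives. A small sketch on the question's data, comparing a flat comprehension with a nested one (the names per_string and grouped are only illustrative):

```python
example = [["hello i am adolf", "hi my name is "],
           ["this is a test", "i like to play"]]

# One list of characters per string: the sublist grouping is lost.
per_string = [list(item) for sub in example for item in sub]

# One list of characters per sublist: the inner comprehension walks the
# strings of a sublist, the outer one keeps a separate list for each.
grouped = [[ch for item in sub for ch in item] for sub in example]

print(len(per_string), len(grouped))
```

Only the nested form matches the two-list shape asked for in the question.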

Complex Python List Comprehension

Can somebody explain how exactly is list comprehension working here?
page = 'one two one three\n' * 10
unique_words = list(word for line in page for word in line.split())
print unique_words
OUTPUT
['o', 'n', 'e', 't', 'w', 'o', 'o', 'n', 'e', 't', 'h', 'r', 'e', 'e', 'o', 'n', 'e', 't', 'w', 'o', 'o', 'n', 'e', 't', 'h', 'r', 'e', 'e', 'o', 'n', 'e', 't', 'w', 'o', 'o', 'n', 'e', 't', 'h', 'r', 'e', 'e']
I am confused over where the variables are declared and where they are used?
e.g. Initially we only know about page as a string,
line in page -> should return each character from the string.
word in line.split() -> is removing '\n' and whitespaces and returning each character
and hence the output. But I still don't understand the way of writing it so that the compiler understands what I want.
QUESTION: How exactly is word for line in page for word in line.split() processed by the compiler step by step?
You need to see the double for loops as nested, from left to right:
for line in page:
    for word in line.split():
        word
You have one long string going in, so for line in page loops over each individual character; line is one character at a time. Splitting that character gives you a list with just that one character, unless that character is whitespace (space, newline, tab, etc.):
>>> page = 'one two one three\n' * 10
>>> list(page)
['o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n']
>>> page[0].split()
['o']
>>> page[3].split()
[]
so the end result is a list with individual characters.
Note that technically speaking you have a generator expression feeding a list() call; the output is the same as a list comprehension, however. You'd get a list comprehension if you replaced list(...) with [...].
If you wanted unique words, use a set() instead and just a simple str.split() call, no need for looping:
unique_words = set(page.split())
str.split() will already split your sentences into words on all whitespace, including the newlines; set() removes any duplicates:
>>> set(page.split())
{'two', 'one', 'three'}
You read that left to right:
[word for line in page for word in line.split()]
is the same as:
mylist=[]
for line in page:
    for word in line.split():
        mylist.append(word)
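If the intent was to loop over lines first and words second, the outer loop has to iterate over something that yields lines rather than characters. A hedged sketch, assuming str.splitlines() is an acceptable way to get them (the names chars and words are illustrative):

```python
page = 'one two one three\n' * 10

# Iterating over the string itself yields characters, so "line" is one
# character and split() returns it alone (or nothing for whitespace).
chars = [word for line in page for word in line.split()]

# Iterating over splitlines() yields real lines, so split() yields words.
words = [word for line in page.splitlines() for word in line.split()]

unique_words = set(words)
print(len(chars), len(words), sorted(unique_words))
```

Both comprehensions read left to right in the same way; only the object the first for iterates over differs.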
