I have an array consisting of labels but each label has been broken down by individual characters. For example, this is the first 2 elements of the array:
array([['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', ''], ...
I would like it to be formatted as such:
array(['1. Identifying, Assessing and Improving Care',
'9. Non-Pharmacological Interventions', ...
I want to be able to iterate through the array and concatenate each label so that the output is as shown above.
Any help in achieving this would be much appreciated :) Many thanks!
import numpy as np
k=np.array([['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', '']], dtype=object)  # the two sample rows differ in length, so store them as objects
for x in k:
    print(''.join(x))
#output
1. Identifying, Assessing and Improving Care
9. Non-Pharmacological Interventions
Using List comprehension:
[''.join(x) for x in k]
#output
['1. Identifying, Assessing and Improving Care',
'9. Non-Pharmacological Interventions']
Considering the array as a list of lists, you could join all characters by looping through the list:
r = [['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', '']]
t = ["".join(i) for i in r]
print(t)
Output:
['1. Identifying, Assessing and Improving Care',
'9. Non-Pharmacological Interventions']
array = [['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', '']]
# array(['1. Identifying, Assessing and Improving Care',
# '9. Non-Pharmacological Interventions', ...
array = [''.join(i) for i in array]
print(array) #['1. Identifying, Assessing and Improving Care', '9. Non-Pharmacological Interventions']
Assuming from array([...]) that you are using NumPy, here's a solution:
import numpy as np
a = np.array([['1', '.', ' ', 'I', 'd', 'e', 'n', 't', 'i', 'f', 'y', 'i', 'n',
'g', ',', ' ', 'A', 's', 's', 'e', 's', 's', 'i', 'n', 'g', ' ',
'a', 'n', 'd', ' ', 'I', 'm', 'p', 'r', 'o', 'v', 'i', 'n', 'g',
' ', 'C', 'a', 'r', 'e', '', ''],
['9', '.', ' ', 'N', 'o', 'n', '-', 'P', 'h', 'a', 'r', 'm', 'a',
'c', 'o', 'l', 'o', 'g', 'i', 'c', 'a', 'l', ' ', 'I', 'n', 't',
'e', 'r', 'v', 'e', 'n', 't', 'i', 'o', 'n', 's', '', '', '',
'', '']], dtype=object)  # the two sample rows differ in length, so store them as objects
b = np.empty(a.shape[0], dtype=object)
for i, x in enumerate(a):
    b[i] = ''.join(x)
If you loop over each element of your array, you can use str.join to get what you are looking for.
Something like:
arr = [['1', '.', ' ', 'I', ...], ...]
output = list()
for x in arr:
    output.append(''.join(x))
output
>>>
['1. Identifying, Assessing and Improving Care', ...]
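The explicit loop can also be written with map, since ''.join is itself a callable that takes one iterable. A minimal sketch, using a shortened, made-up sample in place of the full label data:

```python
# Shortened, hypothetical sample; the trailing '' entries mimic the padding in the question.
arr = [
    ['1', '.', ' ', 'C', 'a', 'r', 'e', '', ''],
    ['9', '.', ' ', 'N', 'o', 'n', '', '', ''],
]

# ''.join is applied to each inner list; empty strings contribute nothing to the result.
output = list(map(''.join, arr))
print(output)  # ['1. Care', '9. Non']
```

Whether map or a list comprehension reads better is mostly a matter of taste; both build the same list.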
I have a variable like this:
message = "hello world"
and I want to split it into lists of two letters each, like this:
list = [['h', 'e'], ['l', 'l'], ...]
I have tried this method:
message = "hello world"
x, test = 0, [[]] * len(message)
for i in message:
    if len(test[x]) >= 2:
        x += 1
        test[x].append(i)
    else:
        test[x].append(i)
but the result was that every inner list ended up containing all the characters of hello world.
The problem here is that your outer list contains a reference to a single inner list, just repeated. You can see what I mean by taking your resulting test and reassigning the value of one of the elements:
>>> test[1][0] = 9999
>>> test
[[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
[9999, 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']]
So even though x is incrementing, you're still just appending to a single list object because your test variable is a list of repeated references to the same object.
You can get around this by using a comprehension to initialize your test variable:
test = [[] for _ in range(len(message))]
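The difference between the two initializations can be checked directly with the is operator. This is only a small illustration; the names shared and fresh are made up for the example:

```python
message = "hello world"

shared = [[]] * len(message)                # one list object, referenced 11 times
fresh = [[] for _ in range(len(message))]   # 11 separate list objects

shared[0].append('h')
fresh[0].append('h')

print(shared[0] is shared[1])  # True: both slots point at the same list
print(fresh[0] is fresh[1])    # False: each slot has its own list
print(shared[1])               # ['h'], the append shows up in every slot
print(fresh[1])                # [], only slot 0 was changed
```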
You can also use zip and slicing to get what you want in a single line of code (note that zip drops the final character when the string has an odd length):
[[*z] for z in zip(message[0::2], message[1::2])]
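For comparison, stepping through the string two indices at a time keeps a trailing single character as its own one-element list. A minimal sketch, assuming that is the desired behaviour for odd-length input:

```python
message = "hello world"

# Take slices of width 2 starting at every even index.
# The last slice is shorter if the length is odd.
pairs = [list(message[i:i + 2]) for i in range(0, len(message), 2)]
print(pairs)
# [['h', 'e'], ['l', 'l'], ['o', ' '], ['w', 'o'], ['r', 'l'], ['d']]
```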
I am new to NLP and trying to do some pre-processing steps on my data for a classification task. I have already done most of the cleaning but there still are some special characters within the text that I am now trying to remove.
The text is in a Dataframe and is already tokenized and lemmatized, converted to lowercase, with no stopwords and no punctuation.
Each text record is represented by a list of words.
['​‘the', 'redwood', 'massacre’', 'five', 'adventurous', 'friend', 'visiting', 'legendary', 'murder', 'site', 'redwood', 'hallmark', 'exciting', 'thrilling', 'camping', 'weekend', 'away', 'soon', 'discover', 'they’re', 'people', 'mysterious', 'location', 'fun', 'camping', 'expedition', 'soon', 'turn', 'nightmare', 'sadistically', 'stalked', 'mysterious', 'unseen', 'killer']
I tried the following code and other solutions as well but I can't understand why the output splits the words into single letters instead of just removing the special character, leaving the words in a compact format.
def remove_character(text):
    new_text = [word.replace('€', '') for word in text]
    return new_text
df["Column_name"]=df["Column_name"].apply(lambda x:remove_character(x))
After applying the function this is the output on the same text record:
"['[', ""'"", 'â', '', '‹', 'â', '', '˜', 't', 'h', 'e', ""'"", ',', ' ', ""'"", 'r', 'e', 'd', 'w', 'o', 'o', 'd', ""'"", ',', ' ', ""'"", 'm', 'a', 's', 's', 'a', 'c', 'r', 'e', 'â', '', '™', ""'"", ',', ' ', ""'"", 'f', 'i', 'v', 'e', ""'"", ',', ' ', ""'"", 'a', 'd', 'v', 'e', 'n', 't', 'u', 'r', 'o', 'u', 's', ""'"", ',', ' ', ""'"", 'f', 'r', 'i', 'e', 'n', 'd', ""'"", ',', ' ', ""'"", 'v', 'i', 's', 'i', 't', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 'l', 'e', 'g', 'e', 'n', 'd', 'a', 'r', 'y', ""'"", ',', ' ', ""'"", 'm', 'u', 'r', 'd', 'e', 'r', ""'"", ',', ' ', ""'"", 's', 'i', 't', 'e', ""'"", ',', ' ', ""'"", 'r', 'e', 'd', 'w', 'o', 'o', 'd', ""'"", ',', ' ', ""'"", 'h', 'a', 'l', 'l', 'm', 'a', 'r', 'k', ""'"", ',', ' ', ""'"", 'e', 'x', 'c', 'i', 't', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 't', 'h', 'r', 'i', 'l', 'l', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 'c', 'a', 'm', 'p', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 'w', 'e', 'e', 'k', 'e', 'n', 'd', ""'"", ',', ' ', ""'"", 'a', 'w', 'a', 'y', ""'"", ',', ' ', ""'"", 's', 'o', 'o', 'n', ""'"", ',', ' ', ""'"", 'd', 'i', 's', 'c', 'o', 'v', 'e', 'r', ""'"", ',', ' ', ""'"", 't', 'h', 'e', 'y', 'â', '', '™', 'r', 'e', ""'"", ',', ' ', ""'"", 'p', 'e', 'o', 'p', 'l', 'e', ""'"", ',', ' ', ""'"", 'm', 'y', 's', 't', 'e', 'r', 'i', 'o', 'u', 's', ""'"", ',', ' ', ""'"", 'l', 'o', 'c', 'a', 't', 'i', 'o', 'n', ""'"", ',', ' ', ""'"", 'f', 'u', 'n', ""'"", ',', ' ', ""'"", 'c', 'a', 'm', 'p', 'i', 'n', 'g', ""'"", ',', ' ', ""'"", 'e', 'x', 'p', 'e', 'd', 'i', 't', 'i', 'o', 'n', ""'"", ',', ' ', ""'"", 's', 'o', 'o', 'n', ""'"", ',', ' ', ""'"", 't', 'u', 'r', 'n', ""'"", ',', ' ', ""'"", 'n', 'i', 'g', 'h', 't', 'm', 'a', 'r', 'e', ""'"", ',', ' ', ""'"", 's', 'a', 'd', 'i', 's', 't', 'i', 'c', 'a', 'l', 'l', 'y', ""'"", ',', ' ', ""'"", 's', 't', 'a', 'l', 'k', 'e', 'd', ""'"", ',', ' ', ""'"", 'm', 'y', 's', 't', 'e', 'r', 'i', 'o', 'u', 's', ""'"", ',', ' ', ""'"", 'u', 'n', 's', 'e', 'e', 'n', ""'"", ',', 
' ', ""'"", 'k', 'i', 'l', 'l', 'e', 'r', ""'"", ']']"
It seems you have single words in cells like this
$ df.head()
Column_name
0 ​‘the
1 redwood
2 massacre’
3 five
4 adventurous
so you shouldn't use for word in text: when text is a single string, that loop iterates over its characters, exactly like for char in text.
You should call only replace() inside apply(), which runs it on every cell (similar to a for-loop):
df["Column_name"] = df["Column_name"].apply(lambda word: word.replace('€',''))
Minimal working example (so everyone can copy and run it)
import pandas as pd
def remove_character(text):
    return [word.replace('€', '') for word in text]
df = pd.DataFrame({'Column_name': ['​‘the', 'redwood', 'massacre’', 'five', 'adventurous', 'friend', 'visiting', 'legendary', 'murder', 'site', 'redwood', 'hallmark', 'exciting', 'thrilling', 'camping', 'weekend', 'away', 'soon', 'discover', 'they’re', 'people', 'mysterious', 'location', 'fun', 'camping', 'expedition', 'soon', 'turn', 'nightmare', 'sadistically', 'stalked', 'mysterious', 'unseen', 'killer']})
print(df.head())
df["Column_name"] = df["Column_name"].apply(lambda word: word.replace('€',''))
#df["Column_name"] = df["Column_name"].apply(lambda x:remove_character(x))
print(df.head())
Your remove_character function should return a string rather than a list. However, pandas provides the str accessor on a Series to perform operations on strings, so another option is:
df["Column_name"] = df["Column_name"].str.replace('€','')
(no need to use apply)
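The character-by-character output in the question would also appear if each cell holds the text of a list (for instance after a round trip through a CSV file) rather than an actual list. That is only an assumption about the data. Under it, a sketch of the cause and one possible fix, using a made-up cell value and a hypothetical helper remove_characters:

```python
import ast

# Hypothetical cell value: a string that only looks like a list of words.
cell = "['‘the', 'redwood', 'massacre’']"

def remove_characters(words, unwanted="‘’"):
    """Strip every character in `unwanted` from each item of `words`."""
    table = str.maketrans('', '', unwanted)
    return [word.translate(table) for word in words]

# Iterating over a string yields single characters, as in the question.
print(remove_characters(cell)[:4])   # ['[', "'", '', 't']

# Parsing the text first gives a real list, so the items are whole words.
words = ast.literal_eval(cell)
print(remove_characters(words))      # ['the', 'redwood', 'massacre']
```

With a DataFrame, the parsing step could be applied per cell, for example through apply(ast.literal_eval), before cleaning. Whether that is needed depends on how the column was loaded.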
I want to print
*IBM is a trademark of the International Business Machine Corporation.
in python instead of this
['*', 'I', 'B', 'M', ' ', 'i', 's', ' ', 'a', ' ', 't', 'r', 'a', 'd', 'e', 'm', 'a', 'r', 'k', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'I', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', ' ', 'B', 'u', 's', 'i', 'n', 'e', 's', 's', ' ', 'M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'C', 'o', 'r', 'p', 'o', 'r', 'a', 't', 'i', 'o', 'n', '.']
My code:
n=str(input())
l=len(n)
m=[' ']*l
for i in range(l):
    m[i] = chr(ord(n[i]) - 7)
print(m)
Assuming that your list is:
this_is_a_list = ['*', 'I', 'B', 'M', ' ', 'i', 's', ' ', 'a', ' ', 't', 'r', 'a', 'd', 'e', 'm', 'a', 'r', 'k', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'I', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', ' ', 'B', 'u', 's', 'i', 'n', 'e', 's', 's', ' ', 'M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'C', 'o', 'r', 'p', 'o', 'r', 'a', 't', 'i', 'o', 'n', '.']
use join:
''.join(this_is_a_list)
Extended:
In case you plan to use the string later, you might be tempted to build it up in a loop. That method is extremely inefficient, but I'm going to leave it here as a showcase of what not to do (thanks to @PM 2Ring):
# BAD EXAMPLE, AVOID THIS METHOD
final_word = ""
for i in range(len(this_is_a_list)):
    final_word = final_word + this_is_a_list[i]
print(final_word)
Further edited, thanks to @kuro:
final_word = ''.join(this_is_a_list)
Use join
x = ['*', 'I', 'B', 'M', ' ', 'i', 's', ' ', 'a', ' ', 't', 'r', 'a', 'd', 'e', 'm', 'a', 'r', 'k', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'I', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', ' ', 'B', 'u', 's', 'i', 'n', 'e', 's', 's', ' ', 'M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'C', 'o', 'r', 'p', 'o', 'r', 'a', 't', 'i', 'o', 'n', '.']
print(''.join(x))
*IBM is a trademark of the International Business Machine Corporation.
The sensible way to do this is to use .join. And to perform the decoding operation you can loop directly over the chars of the input string rather than using indices.
s = input('> ')
a = []
for u in s:
    c = chr(ord(u) - 7)
    a.append(c)
print(''.join(a))
demo
> 1PIT'pz'h'{yhklthyr'vm'{ol'Pu{lyuh{pvuhs'I|zpulzz'Thjopul'Jvywvyh{pvu5
*IBM is a trademark of the International Business Machine Corporation.
We can make this much more compact by using a list comprehension.
s = input('> ')
print(''.join([chr(ord(u)-7) for u in s]))
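The same idea can be wrapped in a small function so that the offset is a parameter and no interactive input is needed. This is just a sketch; the name shift and the shortened sample string are assumptions made for the example:

```python
def shift(text, offset):
    """Return text with every character's code point moved by offset."""
    return ''.join(chr(ord(ch) + offset) for ch in text)

# First part of the encoded demo input shown above.
encoded = "1PIT'pz'h'{yhklthyr"

decoded = shift(encoded, -7)
print(decoded)            # *IBM is a trademark
print(shift(decoded, 7))  # 1PIT'pz'h'{yhklthyr
```

Passing a generator expression to join avoids naming an intermediate list, though join still collects the items internally before building the string.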
You can try this one
to_print = ['*', 'I', 'B', 'M', ' ', 'i', 's', ' ', 'a', ' ', 't', 'r', 'a', 'd', 'e', 'm', 'a', 'r', 'k', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'I', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', ' ', 'B', 'u', 's', 'i', 'n', 'e', 's', 's', ' ', 'M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'C', 'o', 'r', 'p', 'o', 'r', 'a', 't', 'i', 'o', 'n', '.']
word = ''
for i in range(len(to_print)):
    word = word + to_print[i]
print(word)
Can somebody explain how exactly the list comprehension is working here?
page = 'one two one three\n' * 10
unique_words = list(word for line in page for word in line.split())
print(unique_words)
OUTPUT
['o', 'n', 'e', 't', 'w', 'o', 'o', 'n', 'e', 't', 'h', 'r', 'e', 'e', 'o', 'n', 'e', 't', 'w', 'o', 'o', 'n', 'e', 't', 'h', 'r', 'e', 'e', 'o', 'n', 'e', 't', 'w', 'o', 'o', 'n', 'e', 't', 'h', 'r', 'e', 'e']
I am confused over where the variables are declared and where they are used?
e.g. Initially we only know about page as a string,
line in page -> should return each character from the string.
word in line.split() -> is removing '\n' and whitespaces and returning each character
and hence the output. But I still don't understand the way of writing it so that the compiler understands what I want.
QUESTION: How exactly is word for line in page for word in line.split() processed by the compiler step by step?
You need to see the double for loops as nested, from left to right:
for line in page:
    for word in line.split():
        word
You have one long string going in, so for line in page loops over each individual character; line is one character at a time. Splitting that character gives you a list with just that one character, unless that character is whitespace (space, newline, tab, etc.):
>>> page = 'one two one three\n' * 10
>>> list(page)
['o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n', 'o', 'n', 'e', ' ', 't', 'w', 'o', ' ', 'o', 'n', 'e', ' ', 't', 'h', 'r', 'e', 'e', '\n']
>>> page[0].split()
['o']
>>> page[3].split()
[]
so the end result is a list with individual characters.
Note that technically speaking you have a generator expression feeding a list() call; the output is the same as a list comprehension's, however. You'd get a list comprehension if you replaced list(...) with [...].
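If the intent was for the outer loop to run over lines rather than characters, it needs an iterable of lines. A minimal sketch using str.splitlines(), with a shorter page than in the question to keep the output small:

```python
page = 'one two one three\n' * 2

# Outer loop over the string itself: each "line" is a single character.
by_char = [word for line in page for word in line.split()]

# Outer loop over real lines: each "line" is a full line of text.
by_line = [word for line in page.splitlines() for word in line.split()]

print(by_char[:6])  # ['o', 'n', 'e', 't', 'w', 'o']
print(by_line)      # ['one', 'two', 'one', 'three', 'one', 'two', 'one', 'three']
```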
If you wanted unique words, use a set() instead and just a simple str.split() call, no need for looping:
unique_words = set(page.split())
str.split() will already split your sentences into words on all whitespace, including the newlines; set() removes any duplicates:
>>> set(page.split())
{'two', 'one', 'three'}
You read that left to right:
[word for line in page for word in line.split()]
is the same as:
mylist = []
for line in page:
    for word in line.split():
        mylist.append(word)