I have a dataframe column that looks like:
I'm looking into removing special characters. I'm hoping to attach the tags (as a list of lists) so that I can append the column to an existing df.
This is what I have gathered so far, but it doesn't seem to work. Regex in particular is causing me a lot of pain, as it always returns "expected string or bytes-like object".
df = pd.read_csv('flickr_tags_participation_inequality_omit.csv')
#df.dropna(inplace=True) and tokenise
tokens = df["tags"].astype(str).apply(nltk.word_tokenize)
filter_words = ['.',',',':',';','?','#','-','...','!','=', 'edinburgh', 'ecosse', 'écosse', 'scotland']
filtered = [i for i in tokens if i not in filter_words]
#filtered = [re.sub("[.,!?:;-=...##_]", '', w) for w in tokens]
#the above line didn't work
tokenised_tags= []
for i in filtered:
    tokenised_tags.append(i)  # this turns the single lists of tags into lists of lists
print(tokenised_tags)
The above code doesn't remove the custom-defined stopwords.
Any help is very much appreciated! Thanks!
You need to use:
df['filtered'] = df['tags'].apply(lambda x: [t for t in nltk.word_tokenize(x) if t not in filter_words])
Note that nltk.word_tokenize(x) outputs a list of strings, so you can apply a regular list comprehension to it.
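For example, on a couple of made-up rows (the CSV contents here are only an assumption to illustrate the shape of the result), the whole pipeline might look like:
import nltk
import pandas as pd
# nltk.download('punkt')  # uncomment if the tokenizer data isn't installed yet

# hypothetical stand-in for flickr_tags_participation_inequality_omit.csv
df = pd.DataFrame({"tags": ["edinburgh castle # sunset", "loch ; écosse hills"]})
filter_words = ['.', ',', ':', ';', '?', '#', '-', '...', '!', '=',
                'edinburgh', 'ecosse', 'écosse', 'scotland']
df['filtered'] = df['tags'].astype(str).apply(
    lambda x: [t for t in nltk.word_tokenize(x) if t not in filter_words])
print(df['filtered'].tolist())
# [['castle', 'sunset'], ['loch', 'hills']]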
How do I convert data into comma-separated values? I have this data in Excel in a single cell:
"ABCD x3 ABC, BAC x 3"
I want to convert it to:
ABCD,ABCD,ABCD,ABC,BAC,BAC,BAC
I can't find an easy way to do that.
I am trying to solve it in Python so I can get structured data.
Hi Zeeshan, trying to sort the string into usable data while also multiplying certain parts of the string is kind of tricky for me.
The best solution I can think of is kind of gross, but it seems to work. Hopefully my comments aren't too confusing <3
import re
data = "ABCD x3 AB BAC x2"
# this will split the string into a list that you can iterate through.
Datalist = re.findall(r'(\w+)', data)
# create a new list for the final result
newlist = []
for item in Datalist:
    # for each item in the Datalist list:
    # if the item looks like a multiplier (contains an 'x')
    if re.search("x.*", item):
        # split the 'x' from the multiplier number string
        xvalue = item.split('x')
        # grab and remove the last item added to newlist because it hasn't been multiplied yet.
        lastitem = newlist.pop()
        # now we can add the last item back in as many times as the x value says
        newlist.extend([lastitem] * int(xvalue[1]))
    else:
        # if the item isn't a multiplier then we can just add it to the list.
        newlist.append(item)
# print result
print(newlist)
# re.search() - looks for a match in a string
# .split() - splits a string into multiple substrings
# .pop() - removes the last item from a list and returns that item.
# .extend() - adds every item of an iterable to the end of a list
Keep in mind that to find the multiplier it's looking for x followed by a number (x1). If there is a space, for example (x 1), then it will match the x but it won't return a usable value because of the space.
There might be multiple ways around this issue, and I think the best fix will be to restructure how the data is formatted into the cell.
Here are a couple of ways you can work with the data. It won't directly solve your issue, but I hope it will help you think about how you approach it (not being rude, I don't actually have a good way to handle your example <3 )
split() will split your string at a given character and return a list of substrings you can iterate over.
data = 'ABCD ABCD ABCD ABC BAC BAC BAC'
splitdata = data.split(' ')
print(splitdata)
#prints - ['ABCD', 'ABCD', 'ABCD', 'ABC', 'BAC', 'BAC', 'BAC']
You could also try to match strings from the data:
import re
data2 = "ABCD x3 ABC BAC x3"
result = []
for match in re.finditer(r'(\w+) x(\d+)', data2):
    substring, count = match.groups()
    result.extend([substring] * int(count))
print(result)
Use re.finditer to go through the string and match the data against the format '(\w+) x(\d+)'.
Each match then gets added to the list.
'\w' is used to match a word character (a letter, digit or underscore).
'\d' is used to match a digit.
'+' is the quantifier; it means one or more.
So we are matching '(\w+) x(\d+)', which broken down means: (\w+) one or more word characters, followed by a space, then 'x', followed by (\d+) one or more digits.
So, because your cell data is essentially a string followed by a multiplier, then a string, then another string followed by another multiplier, the data just feels too random for a general solution. I think this requires a direct solution that can only work if you know exactly what data is already in the cell. That's why I think the best way to fix it is to rework the data in the cell first. I'm in no way an expert, and this answer is meant to help you think of ways around the problem and to add to the discussion :) If someone wants to correct me and offer a better solution, I would love to know myself.
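That said, for the exact example in the question ("ABCD x3 ABC, BAC x 3", with a comma and a space before the 3), one possible sketch is a regex with an optional multiplier group. This is only a rough idea that assumes the codes are plain letters and the multipliers are always an 'x' followed by an optional space and a number:
import re
data = "ABCD x3 ABC, BAC x 3"
result = []
# each match is a letter code, optionally followed by 'x' and a count
# ('x3' or 'x 3' both work; a code with no multiplier counts once)
for code, count in re.findall(r'([A-Za-z]+)\s*(?:x\s*(\d+))?', data):
    result.extend([code] * (int(count) if count else 1))
print(','.join(result))
# ABCD,ABCD,ABCD,ABC,BAC,BAC,BAC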
I'm trying to remove punctuations from a tokenized text in python like so:
word_tokens = nltk.word_tokenize(text)
w = word_tokens
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)
This works somewhat: I manage to remove a lot of the punctuation marks, but for some reason a lot of the punctuation marks in word_tokens are still left.
If I run the code another time, it again removes some more of the punctuation. After running the same code 3 times, all the marks are removed. Why does this happen?
It doesn't seem to matter whether punctuation_marks is a list, a string or a dictionary. I've also tried to iterate over word_tokens.copy(), which does a bit better: it removes almost all marks the first time, and all of them the second time.
Is there a simple way to fix this problem so that it is sufficient to run the code only once?
You are removing elements from the same list that you are iterating. It seems that you are aware of the potential problem, that's why you added the line:
w = word_tokens
However, that line doesn't actually create a copy of the object referenced by word_tokens, it only makes w reference the same object. In order to create a copy you can use the slicing operator, replacing the above line by:
w = word_tokens[:]
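Putting that together, the question's loop with just that one change would look like:
word_tokens = nltk.word_tokenize(text)
w = word_tokens[:]  # a real copy, so removing from w doesn't disturb the iteration
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)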
Why don't you add the tokens that are not punctuation instead?
word_tokens = nltk.word_tokenize(text)
w = list()
for e in word_tokens:
    if e not in punctuation_marks:
        w.append(e)
Suggestions:
I see you are creating word tokens. If that's the case, I would suggest you remove the punctuation before tokenizing the text. You may use the translate function (together with string.punctuation from the string library), which is already available.
# Import the library
import string
# Initialize the translation table to remove punctuation
tr = str.maketrans("", "", string.punctuation)
# Remove punctuation
text = text.translate(tr)
# Get the word tokens
word_tokens = nltk.word_tokenize(text)
If you want to do sentence tokenization, then you may do something like the below:
from nltk.tokenize import sent_tokenize
texts = sent_tokenize(text)
for i in range(0, len(texts)):
    texts[i] = texts[i].translate(tr)
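As a quick sanity check, on a made-up sentence the translation table strips the punctuation like this:
import string
tr = str.maketrans("", "", string.punctuation)
print("Hello, world! How are you?".translate(tr))
# Hello world How are you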
I suggest you try regex and append your results to a new list instead of directly manipulating word_tokens:
import re
word_tokens = nltk.word_tokenize(text)
w_ = list()
for e in word_tokens:
    w_.append(re.sub(r'[.!?\-]', '', e))
You are modifying the actual word_tokens, which is wrong.
For instance, say you have something like A?!B, indexed as A:0, ?:1, !:2, B:3. Your for loop has a counter (say i) that increases at each iteration. Say you remove the ? (so i=1); that makes the list indexes shift back (the new indexes are A:0, !:1, B:2) while your counter increments to i=2. So you missed the ! character here!
It's best not to mess with the original list and simply copy the kept tokens to a new one.
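A quick way to see that skip happen with a toy list:
tokens = ['A', '?', '!', 'B']
for t in tokens:
    if t in '?!':
        tokens.remove(t)
print(tokens)
# ['A', '!', 'B']  - the '!' was skipped because the indexes shifted after removing '?'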
I have a List of Parts of Speech tagged words (each element is in the format of "word|tag") and I am trying to find a way to delete the corresponding "tag" after I delete a certain "word." More specifically, my algorithm can only deal with the "word" portion of each element, so I first split my current "word"|"tag" list into two separate lists of words and tags. After I remove certain unnecessary words from the Words list though, I want to concatenate the corresponding tags. How can I effectively delete the corresponding tag from a different list? Or is there a better way to do this? I tried running my cleaning algorithm with the tagged words initially, but couldn't find a way to ignore the tags from each word.
My issue may be more clear by showing my code:
my_list = ['I|PN', 'am|V', 'very|ADV', 'happy|ADJ']
tags = []
words = []
for x in my_list:
    front, mid, end = x.partition('|')
    words.append(front)
    tags.append(end)
Current Output (after I run the words list through my cleaning algorithm):
words = ['I', 'very', 'happy']
tags = ['PN', 'V', 'ADV', 'ADJ']
Clearly, I can not concatenate these lists element-wise anymore because I did not delete the corresponding tag from the removed word.
Desired Output:
words = ['I', 'very', 'happy']
tags = ['PN', 'ADV', 'ADJ']
How can I achieve the above output?
I suggest you follow this method:
Split your input into tuples of (word, tag)
Filter the list of tuples based on your needs
Convert the remaining list of tuples into two lists of words / tags
Here is an untested implementation:
word_list = ['I|PN', 'am|V', 'very|ADV', 'happy|ADJ']
def my_word_filter(pair):
    word, tag = pair
    # ... your word-removal logic here. Return True if the word is OK,
    # or False if you want it deleted. For example:
    return word != 'am'

word_pairs = filter(my_word_filter, [w.split('|') for w in word_list])
words, tags = zip(*word_pairs)
# Now do whatever you want from the corresponding lists of words, tags
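With the example filter above (which just drops 'am'), the two lists come out aligned:
print(list(words))  # ['I', 'very', 'happy']
print(list(tags))   # ['PN', 'ADV', 'ADJ']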
Why don't you try a Python dictionary?
my_list={"I":"PN","am":"V","very":"ADV","happy":"ADJ"}
del my_list["am"]
print(my_list)
Output:
{'I': 'PN', 'very': 'ADV', 'happy': 'ADJ'}
stopwords is a list of strings, tokentext is a list of lists of strings. (Each inner list is a sentence; the list of lists is a text document.)
I am simply trying to take out all the strings in tokentext that also occur in stopwords.
for element in tokentext:
    for word in element:
        if word.lower() in stopwords:
            element.remove(word)
print(tokentext)
I was hoping for someone to point out some fundamental flaw in the way I am iterating over the list.
Here is a data set where it fails:
http://pastebin.com/p9ezh2nA
Altering a list while iterating on it will always create issues. Try instead something like:
stopwords = ["some", "strings"]
tokentext = [ ["some", "lists"], ["of", "strings"] ]
new_tokentext = [[word for word in lst if word not in stopwords] for lst in tokentext]
# creates a new list of words, filtering out from stopwords
Or using filter:
new_tokentext = [list(filter(lambda x: x not in stopwords, lst)) for lst in tokentext]
# the call to `list` here is unnecessary in Python2
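For the toy lists above, either version leaves only the non-stopword tokens:
print(new_tokentext)
# [['lists'], ['of']]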
You could just do something simple like:
for element in tokentext:
    if element in stopwords:
        stopwords.remove(element)
It's kinda like yours, but without the extra for loop. I am not sure if this works or if that's what you are trying to achieve, but it's an idea, and I hope it helps!
I have a list of strings that all follow a format of parts of the name divided by underscores. Here is the format:
string="somethingX_somethingY_one_two"
What I want to know is how to extract "one_two" from each string in the list and rebuild the list so that each entry only has "somethingX_somethingY". I know that in C there is a strtok function that is useful for splitting a string into tokens, but I'm not sure if there is a method like that, or a strategy to get the same effect, in Python. Help me please?
You can use split and a list comprehension:
l = ['_'.join(s.split('_')[:2]) for s in l]
If you're literally trying to remove "_one_two" from the end of the strings, then you can do this:
tail_len = len("_one_two")
strs = [s[:-tail_len] for s in strs]
If you want to remove the last two underscore-separated components, then you can do this:
strs = ["_".join(s.split("_")[:-2]) for s in strs]
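For example, with strings in the format from the question (the second entry here is just a made-up extra), the second approach gives:
strs = ["somethingX_somethingY_one_two", "somethingA_somethingB_one_two"]
print(["_".join(s.split("_")[:-2]) for s in strs])
# ['somethingX_somethingY', 'somethingA_somethingB']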
If neither of these is what you want, then please update the question with more details.
I think this does what you're asking for.
s = "somethingX_somethingY_one_two"
splitted = s.split("_")
splitted = [x for x in splitted if "something" in x]
print("_".join(splitted))