I want to write a function that removes English stop words from a tweet.
Function Specifications:
It should take a pandas dataframe as input.
Should tokenise the sentences according to the definition in function 6. Note that function 6 cannot be called within this function.
Should remove all stop words in the tokenised list. The stopwords are defined in the stop_words_dict variable defined at the top of this notebook.
The resulting tokenised list should be placed in a column named "Without Stop Words".
The function should modify the input dataframe.
The function should return the modified dataframe.
Here is the twitter dataframe:
twitter_url = 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/twitter_nov_2019.csv'
twitter_df = pd.read_csv(twitter_url)
twitter_df.head()
Here are the 'stop_words' in a dictionary:
stop_words_dict = {
'stopwords':[
'where', 'done', 'if', 'before', 'll', 'very', 'keep', 'something', 'nothing', 'thereupon',
'may', 'why', '’s', 'therefore', 'you', 'with', 'towards', 'make', 'really', 'few', 'former',
'during', 'mine', 'do', 'would', 'of', 'off', 'six', 'yourself', 'becoming', 'through',
'seeming', 'hence', 'us', 'anywhere', 'regarding', 'whole', 'down', 'seem', 'whereas', 'to',
'their', 'various', 'thereafter', '‘d', 'above', 'put', 'sometime', 'moreover', 'whoever', 'although',
'at', 'four', 'each', 'among', 'whatever', 'any', 'anyhow', 'herein', 'become', 'last', 'between', 'still',
'was', 'almost', 'twelve', 'used', 'who', 'go', 'not', 'enough', 'well', '’ve', 'might', 'see', 'whose',
'everywhere', 'yourselves', 'across', 'myself', 'further', 'did', 'then', 'is', 'except', 'up', 'take',
'became', 'however', 'many', 'thence', 'onto', '‘m', 'my', 'own', 'must', 'wherein', 'elsewhere', 'behind',
'becomes', 'alone', 'due', 'being', 'neither', 'a', 'over', 'beside', 'fifteen', 'meanwhile', 'upon', 'next',
'forty', 'what', 'less', 'and', 'please', 'toward', 'about', 'below', 'hereafter', 'whether', 'yet', 'nor',
'against', 'whereupon', 'top', 'first', 'three', 'show', 'per', 'five', 'two', 'ourselves', 'whenever',
'get', 'thereby', 'noone', 'had', 'now', 'everyone', 'everything', 'nowhere', 'ca', 'though', 'least',
'so', 'both', 'otherwise', 'whereby', 'unless', 'somewhere', 'give', 'formerly', '’d', 'under',
'while', 'empty', 'doing', 'besides', 'thus', 'this', 'anyone', 'its', 'after', 'bottom', 'call',
'n’t', 'name', 'even', 'eleven', 'by', 'from', 'when', 'or', 'anyway', 'how', 'the', 'all',
'much', 'another', 'since', 'hundred', 'serious', '‘ve', 'ever', 'out', 'full', 'themselves',
'been', 'in', "'d", 'wherever', 'part', 'someone', 'therein', 'can', 'seemed', 'hereby', 'others',
"'s", "'re", 'most', 'one', "n't", 'into', 'some', 'will', 'these', 'twenty', 'here', 'as', 'nobody',
'also', 'along', 'than', 'anything', 'he', 'there', 'does', 'we', '’ll', 'latterly', 'are', 'ten',
'hers', 'should', 'they', '‘s', 'either', 'am', 'be', 'perhaps', '’re', 'only', 'namely', 'sixty',
'made', "'m", 'always', 'those', 'have', 'again', 'her', 'once', 'ours', 'herself', 'else', 'has', 'nine',
'more', 'sometimes', 'your', 'yours', 'that', 'around', 'his', 'indeed', 'mostly', 'cannot', '‘ll', 'too',
'seems', '’m', 'himself', 'latter', 'whither', 'amount', 'other', 'nevertheless', 'whom', 'for', 'somehow',
'beforehand', 'just', 'an', 'beyond', 'amongst', 'none', "'ve", 'say', 'via', 'but', 'often', 're', 'our',
'because', 'rather', 'using', 'without', 'throughout', 'on', 'she', 'never', 'eight', 'no', 'hereupon',
'them', 'whereafter', 'quite', 'which', 'move', 'thru', 'until', 'afterwards', 'fifty', 'i', 'itself', 'n‘t',
'him', 'could', 'front', 'within', '‘re', 'back', 'such', 'already', 'several', 'side', 'whence', 'me',
'same', 'were', 'it', 'every', 'third', 'together'
]
}
Here is the code I have tried writing:
def stop_words_remover(df):
    df['With Stop Words'] = df['Tweets'].str.split()
    df['With Stop Words']
    stop_words = stop_words_dict.values()
    stop_words
    df['Without Stop Words'] = df['With Stop Words'].replace(stop_words, '')
    df = df[['Tweets', 'Date', 'Without Stop Words']]
    return df
stop_words_remover(twitter_df.copy())
This is the output I got:
TypeError Traceback (most recent call last)
C:\Users\DATASC~1\AppData\Local\Temp/ipykernel_5696/4217028502.py in <module>
15
16
---> 17 stop_words_remover(twitter_df.copy())
18 ### END FUNCTION
C:\Users\DATASC~1\AppData\Local\Temp/ipykernel_5696/4217028502.py in stop_words_remover(df)
4 stop_words = stop_words_dict.values()
5
----> 6 df['Without Stop Words'] = df['With Stop Words'].replace(stop_words, '', stop_words())
7
8 df = df[['Tweets', 'Date', 'Without Stop Words']]
TypeError: 'dict_values' object is not callable
This is the expected output
stop_words_remover(twitter_df.copy())
Tweets Date Without Stop Words
0 #BongaDlulane Please send an email to mediades... 2019-11-29 12:50:54 [#bongadlulane, send, email, mediadesk#eskom.c...
1 #saucy_mamiie Pls log a call on 0860037566 2019-11-29 12:46:53 [#saucy_mamiie, pls, log, 0860037566]
2 #BongaDlulane Query escalated to media desk. 2019-11-29 12:46:10 [#bongadlulane, query, escalated, media, desk.]
3 Before leaving the office this afternoon, head... 2019-11-29 12:33:36 [leaving, office, afternoon,, heading, weekend...
4 #ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN... 2019-11-29 12:17:43 [#eskomfreestate, #mediastatement, :, eskom, s...
... ... ... ...
195 Eskom's Visitors Centres’ facilities include i... 2019-11-20 10:29:07 [eskom's, visitors, centres’, facilities, incl...
196 #Eskom connected 400 houses and in the process... 2019-11-20 10:25:20 [#eskom, connected, 400, houses, process, conn...
197 #ArthurGodbeer Is the power restored as yet? 2019-11-20 10:07:59 [#arthurgodbeer, power, restored, yet?]
198 #MuthambiPaulina #SABCNewsOnline #IOL #eNCA #e... 2019-11-20 10:07:41 [#muthambipaulina, #sabcnewsonline, #iol, #enc...
199 RT #GP_DHS: The #GautengProvince made a commit... 2019-11-20 10:00:09 [rt, #gp_dhs:, #gautengprovince, commitment, e...
Please can someone help me?
There is a simple way to do this in a single command, using apply with a lambda:
twitter_df["Tweets"].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words_dict["stopwords"]]))
If you prefer to create a function for this, it could be:
def remove_stop_words(tweet, stop_words_dict):
    sentence = tweet.split()
    output = []
    for word in sentence:
        if word not in stop_words_dict["stopwords"]:
            output.append(word)
    return " ".join(output)
twitter_df["Tweets"].apply(lambda x: remove_stop_words(x, stop_words_dict))
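Note that both snippets above return a joined string and do not lowercase the words, while the expected output in the question shows a list of lowercase tokens in a "Without Stop Words" column. The TypeError in the traceback comes from calling stop_words() on the result of stop_words_dict.values(), which is not callable. Below is a minimal sketch of a function that follows the stated specification. It assumes that "function 6" tokenises by lowercasing and splitting on whitespace (the question does not show it), and it uses a shortened, made-up stop_words_dict and sample_df in place of the real data.

```python
import pandas as pd

# Shortened stand-in for the question's stop_words_dict (assumption: same shape).
stop_words_dict = {"stopwords": ["please", "an", "to", "a", "on", "call", "the"]}


def stop_words_remover(df):
    # A set makes each membership test O(1) instead of scanning the list.
    stop_words = set(stop_words_dict["stopwords"])
    # Assumed tokenisation: lowercase, then split on whitespace.
    df["Without Stop Words"] = (
        df["Tweets"]
        .str.lower()
        .str.split()
        .apply(lambda tokens: [t for t in tokens if t not in stop_words])
    )
    return df


# Hypothetical sample rows standing in for twitter_df.
sample_df = pd.DataFrame({
    "Tweets": [
        "#BongaDlulane Please send an email to the media desk",
        "#saucy_mamiie Pls log a call on 0860037566",
    ]
})
result = stop_words_remover(sample_df.copy())
print(result["Without Stop Words"].tolist())
```

With the real data the call would be stop_words_remover(twitter_df.copy()), as in the question.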
I want to create a way to optimize my resume using a python script. To do this, I am trying to find keywords used in the job listing that I can add to my resume to make it stand out when it is run through ATS. Currently, I am using the following code to find what percent match my resume is for the job. How can I use this comparison and find how to improve my resume with specific keywords from the job listing?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
resume = open("resume.txt", encoding='latin-1')
reference = open("reference.txt", encoding='latin-1')
compare = [resume.read(),reference.read()]
cMatrix = CountVectorizer().fit_transform(compare)
#prints how well the resume matches as a percentage
matPercent = cosine_similarity(cMatrix)[0][1] * 100
matPercent = round(matPercent, 2) # round to two decimal
print("Resume is a "+ str(matPercent)+ "% match to the job.")
I am using the following to generate keywords. However, it omits important words and produces a long list that I think could be optimized better using sklearn. Instead of using FindKeywords(), how can I access this information from CountVectorizer().fit_transform(compare)?
def FindKeywords():
file = open("reference.txt", encoding='latin-1')
string = file.read().replace("\n", " ").replace("\t", " ").lower()
kwDict = {}
avoidables = set(['skilled','skills','skill','minimum','tools','work','features','looking','highly','', ' ','','the', 'of', 'to', 'and', 'a', 'in', 'is', 'it', 'you', 'that', 'he', 'was', 'for', 'on', 'are', 'with', 'as', 'I', 'his', 'they', 'be', 'at', 'one', 'have', 'this', 'from', 'or', 'had', 'by', 'not', 'word', 'but', 'what', 'some', 'we', 'can', 'out', 'other', 'were', 'all', 'there', 'when', 'up', 'use', 'your', 'how', 'said', 'an', 'each', 'she', 'which', 'do', 'their', 'time', 'if', 'will', 'way', 'about', 'many', 'then', 'them', 'write', 'would', 'like', 'so', 'these', 'her', 'long', 'make', 'thing', 'see', 'him', 'two', 'has', 'look', 'more', 'day', 'could', 'go', 'come', 'did', 'number', 'sound', 'no', 'most', 'people', 'my', 'over', 'know', 'water', 'than', 'call', 'first', 'who', 'may', 'down', 'side', 'been', 'now', 'find', 'any', 'new', 'work', 'part', 'take', 'get', 'place', 'made', 'live', 'where', 'after', 'back', 'little', 'only', 'round', 'man', 'year', 'came', 'show', 'every', 'good', 'me', 'give', 'our', 'under', 'name', 'very', 'through', 'just', 'form', 'sentence', 'great', 'think', 'say', 'help', 'low', 'line', 'differ', 'turn', 'cause', 'much', 'mean', 'before', 'move', 'right', 'boy', 'old', 'too', 'same', 'tell', 'does', 'set', 'three', 'want', 'air', 'well', 'also', 'play', 'small', 'end', 'put', 'home', 'read', 'hand', 'port', 'large', 'spell', 'add', 'even', 'land', 'here', 'must', 'big', 'high', 'such', 'follow', 'act', 'why', 'ask', 'men', 'change', 'went', 'light', 'kind', 'off', 'need', 'house', 'picture', 'try', 'us', 'again', 'animal', 'point', 'mother', 'world', 'near', 'build', 'self', 'earth', 'father', 'head', 'stand', 'own', 'page', 'should', 'country', 'found', 'answer', 'school', 'grow', 'study', 'still', 'learn', 'plant', 'cover', 'food', 'sun', 'four', 'between', 'state', 'keep', 'eye', 'never', 'last', 'let', 'thought', 'city', 'tree', 'cross', 'farm', 'hard', 'start', 'might', 'story', 'saw', 'far', 'sea', 'draw', 
'left', 'late', 'run', "don't", 'while', 'press', 'close', 'night', 'real', 'life', 'few', 'north', 'open', 'seem', 'together', 'next', 'white', 'children', 'begin', 'got', 'walk', 'example', 'ease', 'paper', 'group', 'always', 'music', 'those', 'both', 'mark', 'often', 'letter', 'until', 'mile', 'river', 'car', 'feet', 'care', 'second', 'book', 'carry', 'took', 'science', 'eat', 'room', 'friend', 'began', 'idea', 'fish', 'mountain', 'stop', 'once', 'base', 'hear', 'horse', 'cut', 'sure', 'watch', 'color', 'face', 'wood', 'main', 'enough', 'plain', 'girl', 'usual', 'young', 'ready', 'above', 'ever', 'red', 'list', 'though', 'feel', 'talk', 'bird', 'soon', 'body', 'dog', 'family', 'direct', 'pose', 'leave', 'song', 'measure', 'door', 'product', 'black', 'short', 'numeral', 'class', 'wind', 'question', 'happen', 'complete', 'ship', 'area', 'half', 'rock', 'order', 'fire', 'south', 'problem', 'piece', 'told', 'knew', 'pass', 'since', 'top', 'whole', 'king', 'space', 'heard', 'best', 'hour', 'better', 'true', 'during', 'hundred', 'five', 'remember', 'step', 'early', 'hold', 'west', 'ground', 'interest', 'reach', 'fast', 'verb', 'sing', 'listen', 'six', 'table', 'travel', 'less', 'morning', 'ten', 'simple', 'several', 'vowel', 'toward', 'war', 'lay', 'against', 'pattern', 'slow', 'center', 'love', 'person', 'money', 'serve', 'appear', 'road', 'map', 'rain', 'rule', 'govern', 'pull', 'cold', 'notice', 'voice', 'unit', 'power', 'town', 'fine', 'certain', 'fly', 'fall', 'lead', 'cry', 'dark', 'machine', 'note', 'wait', 'plan', 'figure', 'star', 'box', 'noun', 'field', 'rest', 'correct', 'able', 'pound', 'done', 'beauty', 'drive', 'stood', 'contain', 'front', 'teach', 'week', 'final', 'gave', 'green', 'oh', 'quick', 'develop', 'ocean', 'warm', 'free', 'minute', 'strong', 'special', 'mind', 'behind', 'clear', 'tail', 'produce', 'fact', 'street', 'inch', 'multiply', 'nothing', 'course', 'stay', 'wheel', 'full', 'force', 'blue', 'object', 'decide', 'surface', 'deep', 'moon', 
'island', 'foot', 'system', 'busy', 'test', 'record', 'boat', 'common', 'gold', 'possible', 'plane', 'stead', 'dry', 'wonder', 'laugh', 'thousand', 'ago', 'ran', 'check', 'game', 'shape', 'equate', 'hot', 'miss', 'brought', 'heat', 'snow', 'tire', 'bring', 'yes', 'distant', 'fill', 'east', 'paint', 'language', 'among', 'grand', 'ball', 'yet', 'wave', 'drop', 'heart', 'am', 'present', 'heavy', 'dance', 'engine', 'position', 'arm', 'wide', 'sail', 'material', 'size', 'vary', 'settle', 'speak', 'weight', 'general', 'ice', 'matter', 'circle', 'pair', 'include', 'divide', 'syllable', 'felt', 'perhaps', 'pick', 'sudden', 'count', 'square', 'reason', 'length', 'represent', 'art', 'subject', 'region', 'energy', 'hunt', 'probable', 'bed', 'brother', 'egg', 'ride', 'cell', 'believe', 'fraction', 'forest', 'sit', 'race', 'window', 'store', 'summer', 'train', 'sleep', 'prove', 'lone', 'leg', 'exercise', 'wall', 'catch', 'mount', 'wish', 'sky', 'board', 'joy', 'winter', 'sat', 'written', 'wild', 'instrument', 'kept', 'glass', 'grass', 'cow', 'job', 'edge', 'sign', 'visit', 'past', 'soft', 'fun', 'bright', 'gas', 'weather', 'month', 'million', 'bear', 'finish', 'happy', 'hope', 'flower', 'clothe', 'strange', 'gone', 'jump', 'baby', 'eight', 'village', 'meet', 'root', 'buy', 'raise', 'solve', 'metal', 'whether', 'push', 'seven', 'paragraph', 'third', 'shall', 'held', 'hair', 'describe', 'cook', 'floor', 'either', 'result', 'burn', 'hill', 'safe', 'cat', 'century', 'consider', 'type', 'law', 'bit', 'coast', 'copy', 'phrase', 'silent', 'tall', 'sand', 'soil', 'roll', 'temperature', 'finger', 'industry', 'value', 'fight', 'lie', 'beat', 'excite', 'natural', 'view', 'sense', 'ear', 'else', 'quite', 'broke', 'case', 'middle', 'kill', 'son', 'lake', 'moment', 'scale', 'loud', 'spring', 'observe', 'child', 'straight', 'consonant', 'nation', 'dictionary', 'milk', 'speed', 'method', 'organ', 'pay', 'age', 'section', 'dress', 'cloud', 'surprise', 'quiet', 'stone', 'tiny', 'climb', 'cool', 
'design', 'poor', 'lot', 'experiment', 'bottom', 'key', 'iron', 'single', 'stick', 'flat', 'twenty', 'skin', 'smile', 'crease', 'hole', 'trade', 'melody', 'trip', 'office', 'receive', 'row', 'mouth', 'exact', 'symbol', 'die', 'least', 'trouble', 'shout', 'except', 'wrote', 'seed', 'tone', 'join', 'suggest', 'clean', 'break', 'lady', 'yard', 'rise', 'bad', 'blow', 'oil', 'blood', 'touch', 'grew', 'cent', 'mix', 'team', 'wire', 'cost', 'lost', 'brown', 'wear', 'garden', 'equal', 'sent', 'choose', 'fell', 'fit', 'flow', 'fair', 'bank', 'collect', 'save', 'control', 'decimal', 'gentle', 'woman', 'captain', 'practice', 'separate', 'difficult', 'doctor', 'please', 'protect', 'noon', 'whose', 'locate', 'ring', 'character', 'insect', 'caught', 'period', 'indicate', 'radio', 'spoke', 'atom', 'human', 'history', 'effect', 'electric', 'expect', 'crop', 'modern', 'element', 'hit', 'student', 'corner', 'party', 'supply', 'bone', 'rail', 'imagine', 'provide', 'agree', 'thus', 'capital', "won't", 'chair', 'danger', 'fruit', 'rich', 'thick', 'soldier', 'process', 'operate', 'guess', 'necessary', 'sharp', 'wing', 'create', 'neighbor', 'wash', 'bat', 'rather', 'crowd', 'corn', 'compare', 'poem', 'string', 'bell', 'depend', 'meat', 'rub', 'tube', 'famous', 'dollar', 'stream', 'fear', 'sight', 'thin', 'triangle', 'planet', 'hurry', 'chief', 'colony', 'clock', 'mine', 'tie', 'enter', 'major', 'fresh', 'search', 'send', 'yellow', 'gun', 'allow', 'print', 'dead', 'spot', 'desert', 'suit', 'current', 'lift', 'rose', 'continue', 'block', 'chart', 'hat', 'sell', 'success', 'company', 'subtract', 'event', 'particular', 'deal', 'swim', 'term', 'opposite', 'wife', 'shoe', 'shoulder', 'spread', 'arrange', 'camp', 'invent', 'cotton', 'born', 'determine', 'quart', 'nine', 'truck', 'noise', 'level', 'chance', 'gather', 'shop', 'stretch', 'throw', 'shine', 'property', 'column', 'molecule', 'select', 'wrong', 'gray', 'repeat', 'require', 'broad', 'prepare', 'salt', 'nose', 'plural', 'anger', 
'claim', 'continent', 'oxygen', 'sugar', 'death', 'pretty', 'skill', 'women', 'season', 'solution', 'magnet', 'silver', 'thank', 'branch', 'match', 'suffix', 'especially', 'fig', 'afraid', 'huge', 'sister', 'steel', 'discuss', 'forward', 'similar', 'guide', 'experience', 'score', 'apple', 'bought', 'led', 'pitch', 'coat', 'mass', 'card', 'band', 'rope', 'slip', 'win', 'dream', 'evening', 'condition', 'feed', 'tool', 'total', 'basic', 'smell', 'valley', 'nor', 'double', 'seat', 'arrive', 'master', 'track', 'parent', 'shore', 'division', 'sheet', 'substance', 'favor', 'connect', 'post', 'spend', 'chord', 'fat', 'glad', 'original', 'share', 'station', 'dad', 'bread', 'charge', 'proper', 'bar', 'offer', 'segment', 'slave', 'duck', 'instant', 'market', 'degree', 'populate', 'chick', 'dear', 'enemy', 'reply', 'drink', 'occur', 'support', 'speech', 'nature', 'range', 'steam', 'motion', 'path', 'liquid', 'log', 'meant', 'quotient', 'teeth', 'shell', 'neck'])
for word in string.split(' '):
if word not in kwDict and word not in avoidables:
kwDict[word] = 1
elif word not in avoidables:
kwDict[word] += 1
returns = [key for key in kwDict.keys() if kwDict[key]>0]
return [kw for kw in returns if kw not in avoidables]
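You can use the get_feature_names() method of CountVectorizer, as described in the scikit-learn documentation. In scikit-learn 1.0 and later the method is named get_feature_names_out(), and get_feature_names() was removed in version 1.2.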
So with a concrete example from your code (adjusted a bit), it could look like this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
resume = "This is an example resume for a job"
reference = "This is an example reference for a job advertisement"
compare = [resume,reference]
cVect = CountVectorizer()
cMatrix = cVect.fit_transform(compare)
#prints how well the resume matches as a percentage
matPercent = cosine_similarity(cMatrix)[0][1] * 100
matPercent = round(matPercent, 2) # round to two decimal
print("Resume is a "+ str(matPercent)+ "% match to the job.")
Returns:
Resume is a 80.18% match to the job.
Then to get the keywords:
cVect.get_feature_names()
The returned keywords:
['advertisement',
'an',
'example',
'for',
'is',
'job',
'reference',
'resume',
'this']
If you want only the keywords from your resume or only those from your reference, you can fit_transform() another CountVectorizer() on just that text and then get the keywords from it.
The important thing to keep in mind is that you need to 'save' your trained CountVectorizer, so instead of
CountVectorizer().fit_transform(compare)
You need to use
cVect = CountVectorizer()
cVect.fit_transform(compare)
So that you can later still access your CountVectorizer() instance.
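Once the vectorizer instance is kept, the count matrix can also be used to list the words that appear in the job listing but not in the resume, which is closer to what the question asks for. The following is a rough sketch, not a definitive method: the variable names are illustrative, stop_words="english" is an optional choice that uses scikit-learn's built-in stop word list in place of the hand-written avoidables set, and the getattr fallback is only there to cope with scikit-learn versions on either side of the get_feature_names_out() rename.

```python
from sklearn.feature_extraction.text import CountVectorizer

resume = "This is an example resume for a job"
reference = "This is an example reference for a job advertisement"

# Keep a handle on the vectorizer so the vocabulary can be read back later.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform([resume, reference]).toarray()

# get_feature_names_out() exists in scikit-learn >= 1.0; older versions
# only provide get_feature_names().
get_names = (getattr(vectorizer, "get_feature_names_out", None)
             or vectorizer.get_feature_names)
terms = list(get_names())

resume_counts, reference_counts = counts[0], counts[1]

# Terms used in the job listing (row 1) that never occur in the resume (row 0).
missing = [
    (term, int(reference_counts[i]))
    for i, term in enumerate(terms)
    if reference_counts[i] > 0 and resume_counts[i] == 0
]
# Most frequent in the listing first, ties broken alphabetically.
missing.sort(key=lambda pair: (-pair[1], pair[0]))
missing_keywords = [term for term, _ in missing]
print(missing_keywords)
```

Sorting by how often a term occurs in the listing is only one possible ranking; a TF-IDF weighting would be another option.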
Been struggling with this problem for a long time. I have a dataframe that looks like this:
[image: screenshot of the dataframe]
I'm trying to divide the length of each 'counter' by the length of each 'content'. I thought this would be fairly straightforward. So far I've tried:
reviews['diversity'] = reviews['counter'].apply(lambda x: 0 if len(x) == 0 else float(len(x)) / float(len(reviews['content'][x])))
as well as using x['content']. I get the massive error message KeyError: "None of [['aberfeldy', 'recorded', 'their', 'debut', 'young', 'forever', 'using', 'a', 'single', 'microphone', 'good', 'for', 'them', 'in', 'that', 'spirit', 'i', 'cut', 'short', 'my', 'obligatory', 'introduction', 'and', 'bring', 'you', 'straight', 'to', 'the', 'edinburgh', 'group', 'lovelorn', 'unfortunately', 'still', 'heart', 'exposed', 'by', 'oh', 'production', 'love', 'is', 'verb', 'noun', 'as', 'well', 'find', 'it', 'dictionary', 'under', 'l', 'little', 'witticism', 'comes', 'from', 'an', 'arrow', 'written', 'sung', 'riley', 'briggs', 'based', 'on', 'one', 'photo', 'looks', 'like', 'anthony', 'michael', 'hall', 'though', 'his', 'vocals', 'chart', 'fairly', 'standard', 'indie', 'course', 'borrowing', 'neil', 'friend', 'ben', 'gibbard', 'what', 'do', 'plain', 'sensitive', 'guys', 'everywhere', 'listen', 'some', 'of', 'best', 'friends', 'are', 'favorite', 'albums', 'consist', 'campfire', 'singalongs', 'bands', 'with', 'modest', 'acoustic', 'guitar', 'chops', 'cute', 'names', 'accents', 'but', 'those', 'lyrics', 'no', 'band', 'would', 'sing', 'such', 'words', 'deserves', 'easily', 'made', 'comparisons', 'fellow', 'scots', 'belle', '', 'sebastian', 'or', 'even', 'camera', 'obscura', 'let', 'alone', 'earnest', 'aussies', 'lucksmiths', 'compare', 'twee', 'progenitors', 'pastels', 'talulah', 'gosh', 'owe', 'me', 'your', 'cardigan', 'moniker', 'nipped', 'scottish', 'vacation', 'destination', 'practically', 'beg', 'name', 'there', 'need', 'encourage', 'throughout', 'record', 'shows', 'predisposition', 'toward', 'bungling', 'old', 'english', 'teachers', 'motto', 'show', 'not', 'tell', 'this', 'may', 'be', 'result', 'medical', 'condition', 'dyslexia', 'which', 'case', 'we', 'should', 'hold', 'our', 'snark', 'seems', 'guy', 'can', 'open', 'mouth', 'without', 'saying', 'nothing', 'so', 'sad', 'leaving', 'he', 'sings', 'out', 'lonely', 'now', 'she', 'gone', 'adds', 'tie', 'teems', 'vivid', 
'storytelling', 'goes', 'rhyme', 'sacred', 'wasted', 'reasons', 'until', 'somewhere', 'editor', 'rhyming', 'loses', 'her', 'job', 'often', 'at', 'when', 'they', 'stumble', 'beyond', 'trite', 'infantilism', 'first', 'vegetarian', 'restaurant', 'lopes', 'along', 'winning', 'tangled', 'up', 'blue', 'strums', 'accented', 'subtle', 'fiddles', 'lovely', 'boy', 'harmonies', 'seemingly', 'aiming', 'album', 'cheerful', 'unpretentious', 'look', 'everyday', 'here', 'finally', 'makes', 'interesting', 'way', 'dance', 'kitchen', 'says', 'willing', 'see', 'where', 'takes', 'him', 'then', 'proclaims', 'sometimes', 'believe', 'human', 'duck', 'cover', 'speaking', 'aliens', 'heliopolis', 'night', 'next', 'track', 'incidentally', 'its', 'second', 'whimsical', 'spaceship', 'song', 'complete', 'nose', 'perfect', 'unique', 'yeah', 'was', 'means', 'warm', 'pop', 'heats', 'headphones', 'veritable', 'help', 'root', 'begins', 'everyone', 'because', 'last', 'thing', 'world', 'needs', 'another', 'batch', 'sullen', 'scenesters', 'yet', 'any', 'relationship', 'just', 'someone', 'doesn', 'mean', 'back', 'beautiful', 'gibbs', 'tells', 'us', 'tender', 'moment', 'probably', 'if', 'hope', 'gets', 'laid']] are in the [index]".
I've tried:
def diverse(x):
    if len(x) == 0:
        return 0
    else:
        return float(len(x)) / float(len(reviews['clean'][x]))
reviews['diverse'] = reviews['counter'].apply(diverse)
and get the same thing.
I've tried using applymap with reviews['diversity'] = reviews.applymap(lambda x: 0 if len(x) == 0 else float(len(reviews['counter'][x])) / float(len(reviews['content'][x])))
and get ("object of type 'int' has no len()", 'occurred at index Unnamed: 0').
And yet if I just do float(len(reviews['counter'][4])) / float(len(reviews['clean'][4])), I get 0.634375.
Any help is much appreciated.
edit: I tried:
def test(x, y):
for row, item in x.iteritems():
x = float(len(item))
for row, item in y.iteritems():
if len(item) == 0:
return (0)
else:
y = float(len(item))
return (x/y)
When I used "print" instead of "return", it gave me all the values. But return only divides the length of the first row, which seems really weird?
Here is a toy example I constructed to show how to do what you are asking:
import pandas as pd
from collections import Counter
df = pd.DataFrame([['hello world i am a computer'],
['hello i am a computer too hello computer']],
columns=['content'])
df['counter'] = df.content.str.split().apply(Counter)
df
# returns:
content counter
hello world i am a computer {'am': 1, 'hello': 1, 'computer': 1, 'world': ...
hello i am a computer too hello computer {'am': 1, 'hello': 2, 'computer': 2, 'a': 1, '...
This line answers the question as you phrased it (length of each counter divided by the character length of each content string):
df['diversity'] = df.counter.apply(len) / df.content.str.len()
But I think what you really wanted was to break the strings in content into a list of words by splitting on whitespace, and to divide by the number of words. In that case, you probably want:
df['diversity'] = df.counter.apply(len) / df.content.str.split().apply(len)
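The attempts in the question fail because Series.apply passes each cell's value to the function, not a row label, so reviews['content'][x] ends up indexing with the whole word list. If a row-wise function with the zero-length guard from the question is preferred, one way to sketch it is with DataFrame.apply(axis=1), where the function receives the full row. The column name words, the helper name diversity, and the empty third row are assumptions added for illustration.

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({
    "content": [
        "hello world i am a computer",
        "hello i am a computer too hello computer",
        "",  # empty row to exercise the zero-length guard
    ]
})
df["words"] = df["content"].str.split()
df["counter"] = df["words"].apply(Counter)


def diversity(row):
    # With axis=1 the function sees one whole row, so both columns
    # are available by name.
    total = len(row["words"])
    return len(row["counter"]) / total if total else 0.0


df["diversity"] = df.apply(diversity, axis=1)
print(df["diversity"].tolist())
```

The vectorised division shown earlier is shorter, but it yields NaN rather than 0 for rows with no words, which is why the explicit guard may be worth keeping.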