How to remove all the elements that contain special characters and strings?

I'm trying to remove all the elements that contain special characters or strings, but some of those elements are still there.
description_list = ['$', '2,850', 'door', '.', 'sale', '...', 'trades', '.', 'pay', 'pp', 'fees', 'shipping', 'cost', 'desirable', '\x932', 'liner', 'dial\x94', 'eta', 'movement', 'watch', '\x93safe', 'queen\x94', ',', 'pristine', 'condition', '.', 'i\x92m', 'original', 'owner', 'worn', 'watch', 'gently', 'handful', 'times', '.', 'protective', 'plastics', 'still', 'intact', 'case', 'back', ',', 'parts', 'clasp', 'full', 'original', 'kit', 'you\x92ll', 'see', 'pics', '.', 'includes', 'original', 'boxes', ',', 'manuals', ',', 'warranty', 'card', 'ad', ',', 'spare', 'bracelet', 'links', ',', 'dive', 'strap', '&', 'extension', ',', 'etc', 'payment', 'paypal', ',', 'due', 'quickly', 'upon', 'agreement', 'purchase', 'watch', '.', 'holds', ',', 'delays', ',', 'games', '.', 'pay', 'pp', 'fees', 'shipping', 'us', 'postal', 'service', 'priority', 'mail', 'w/signature', 'confirmation', ',', 'paypal', 'verified', 'address', 'inside', 'usa', '.', 'please', 'don\x92t', 'ask', 'ship', 'outside', 'usa', '.', 'exceptions', 'made', '.', 'please', 'e-mail', '[', 'email', 'protected', ']', '.', 'also', 'text', 'call', '210-705-3383.', 'name', 'james', 'crockett', 'thank', ',', 'james', 'crockett', '$', '2,850', 'door', '.', 'sale', '...', 'trades', '.', 'pay', 'pp', 'fees', 'shipping', 'cost', 'desirable', '\x932', 'liner', 'dial\x94', 'eta', 'movement', 'watch', '\x93safe', 'queen\x94', ',', 'pristine', 'condition', '.', 'i\x92m', 'original', 'owner', 'worn', 'watch', 'gently', 'handful', 'times', '.', 'protective', 'plastics', 'still', 'intact', 'case', 'back', ',', 'parts', 'clasp', 'full', 'original', 'kit', 'you\x92ll', 'see', 'pics', '.', 'includes', 'original', 'boxes', ',', 'manuals', ',', 'warranty', 'card', 'ad', ',', 'spare', 'bracelet', 'links', ',', 'dive', 'strap', '&', 'extension', ',', 'etc', 'payment', 'paypal', ',', 'due', 'quickly', 'upon', 'agreement', 'purchase', 'watch', '.', 'holds', ',', 'delays', ',', 'games', '.', 'pay', 'pp', 'fees', 'shipping', 'us', 
'postal', 'service', 'priority', 'mail', 'w/signature', 'confirmation', ',', 'paypal', 'verified', 'address', 'inside', 'usa', '.', 'please', 'don\x92t', 'ask', 'ship', 'outside', 'usa', '.', 'exceptions', 'made', '.', 'please', 'e-mail', '[', 'email', 'protected', ']', '.', 'also', 'text', 'call', '210-705-3383.', 'name', 'james', 'crockett', 'thank', ',', 'james', 'crockett']
price_list = [x for x in description_list if any(c.isdigit() for c in x)]
Output
# price_list
['2,850', '\x932', '210-705-3383.', '2,850', '\x932', '210-705-3383.']
It should look like this (the comma is acceptable because I want to extract the price):
['2,850', '2,850']

You can use an all() check inside the list comprehension to keep only the strings made up entirely of digits and commas, and then exclude the strings that are just a comma:
price_list = [x for x in description_list if all(c.isdigit() or c == ',' for c in x) and x != ',']
# ['2,850', '2,850']

Regex answer
import re
price_list = [x for x in description_list if re.match(r'\d+(,*\d+)?$', x)]

You were close, assuming you want to retain data that contains digits or digits with commas. The current list comprehension for price_list returns every string that contains at least one digit. Instead, strip the commas and keep the strings that are then all digits:
price_list = [str(x) for x in description_list if str(x).replace(',', '').isdigit()]
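The checks above accept any mix of digits and commas, so a token such as '12,34' would also pass. If that matters, a stricter pattern can require commas to separate groups of three digits. The following is only a sketch: the shortened sample list and the PRICE name are made up for illustration, and it assumes prices use US-style thousands separators.

```python
import re

# A shortened, partly invented sample of the question's tokens.
description_list = ['$', '2,850', 'door', '.', '\x932', ',',
                    '210-705-3383.', '1,234,567', '12,34']

# Either digits grouped in threes by commas, or a plain run of digits.
PRICE = re.compile(r'\d{1,3}(?:,\d{3})+|\d+')

# fullmatch() only succeeds when the whole token fits the pattern,
# so tokens with stray quotes, dashes or dots are rejected.
price_list = [x for x in description_list if PRICE.fullmatch(x)]
print(price_list)  # ['2,850', '1,234,567']
```

Using fullmatch() avoids having to remember the trailing $ that re.match() needs.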

Related

Write a function which removes english stop words from a tweet

I want to write a function that removes English stop words from a tweet.
Function Specifications:
It should take a pandas dataframe as input.
Should tokenise the sentences according to the definition in function 6. Note that function 6 cannot be called within this function.
Should remove all stop words in the tokenised list. The stopwords are defined in the stop_words_dict variable defined at the top of this notebook.
The resulting tokenised list should be placed in a column named "Without Stop Words".
The function should modify the input dataframe.
The function should return the modified dataframe.
Here is the twitter dataframe:
import pandas as pd
twitter_url = 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/twitter_nov_2019.csv'
twitter_df = pd.read_csv(twitter_url)
twitter_df.head()
Here are the 'stop_words' in a dictionary:
stop_words_dict = {
'stopwords':[
'where', 'done', 'if', 'before', 'll', 'very', 'keep', 'something', 'nothing', 'thereupon',
'may', 'why', '’s', 'therefore', 'you', 'with', 'towards', 'make', 'really', 'few', 'former',
'during', 'mine', 'do', 'would', 'of', 'off', 'six', 'yourself', 'becoming', 'through',
'seeming', 'hence', 'us', 'anywhere', 'regarding', 'whole', 'down', 'seem', 'whereas', 'to',
'their', 'various', 'thereafter', '‘d', 'above', 'put', 'sometime', 'moreover', 'whoever', 'although',
'at', 'four', 'each', 'among', 'whatever', 'any', 'anyhow', 'herein', 'become', 'last', 'between', 'still',
'was', 'almost', 'twelve', 'used', 'who', 'go', 'not', 'enough', 'well', '’ve', 'might', 'see', 'whose',
'everywhere', 'yourselves', 'across', 'myself', 'further', 'did', 'then', 'is', 'except', 'up', 'take',
'became', 'however', 'many', 'thence', 'onto', '‘m', 'my', 'own', 'must', 'wherein', 'elsewhere', 'behind',
'becomes', 'alone', 'due', 'being', 'neither', 'a', 'over', 'beside', 'fifteen', 'meanwhile', 'upon', 'next',
'forty', 'what', 'less', 'and', 'please', 'toward', 'about', 'below', 'hereafter', 'whether', 'yet', 'nor',
'against', 'whereupon', 'top', 'first', 'three', 'show', 'per', 'five', 'two', 'ourselves', 'whenever',
'get', 'thereby', 'noone', 'had', 'now', 'everyone', 'everything', 'nowhere', 'ca', 'though', 'least',
'so', 'both', 'otherwise', 'whereby', 'unless', 'somewhere', 'give', 'formerly', '’d', 'under',
'while', 'empty', 'doing', 'besides', 'thus', 'this', 'anyone', 'its', 'after', 'bottom', 'call',
'n’t', 'name', 'even', 'eleven', 'by', 'from', 'when', 'or', 'anyway', 'how', 'the', 'all',
'much', 'another', 'since', 'hundred', 'serious', '‘ve', 'ever', 'out', 'full', 'themselves',
'been', 'in', "'d", 'wherever', 'part', 'someone', 'therein', 'can', 'seemed', 'hereby', 'others',
"'s", "'re", 'most', 'one', "n't", 'into', 'some', 'will', 'these', 'twenty', 'here', 'as', 'nobody',
'also', 'along', 'than', 'anything', 'he', 'there', 'does', 'we', '’ll', 'latterly', 'are', 'ten',
'hers', 'should', 'they', '‘s', 'either', 'am', 'be', 'perhaps', '’re', 'only', 'namely', 'sixty',
'made', "'m", 'always', 'those', 'have', 'again', 'her', 'once', 'ours', 'herself', 'else', 'has', 'nine',
'more', 'sometimes', 'your', 'yours', 'that', 'around', 'his', 'indeed', 'mostly', 'cannot', '‘ll', 'too',
'seems', '’m', 'himself', 'latter', 'whither', 'amount', 'other', 'nevertheless', 'whom', 'for', 'somehow',
'beforehand', 'just', 'an', 'beyond', 'amongst', 'none', "'ve", 'say', 'via', 'but', 'often', 're', 'our',
'because', 'rather', 'using', 'without', 'throughout', 'on', 'she', 'never', 'eight', 'no', 'hereupon',
'them', 'whereafter', 'quite', 'which', 'move', 'thru', 'until', 'afterwards', 'fifty', 'i', 'itself', 'n‘t',
'him', 'could', 'front', 'within', '‘re', 'back', 'such', 'already', 'several', 'side', 'whence', 'me',
'same', 'were', 'it', 'every', 'third', 'together'
]
}
Here is the code I have tried writing:
def stop_words_remover(df):
    df['With Stop Words'] = df['Tweets'].str.split()
    df['With Stop Words']
    stop_words = stop_words_dict.values()
    stop_words
    df['Without Stop Words'] = df['With Stop Words'].replace(stop_words, '')
    df = df[['Tweets', 'Date', 'Without Stop Words']]
    return df
stop_words_remover(twitter_df.copy())
This is the output I got:
TypeError Traceback (most recent call last)
C:\Users\DATASC~1\AppData\Local\Temp/ipykernel_5696/4217028502.py in <module>
15
16
---> 17 stop_words_remover(twitter_df.copy())
18 ### END FUNCTION
C:\Users\DATASC~1\AppData\Local\Temp/ipykernel_5696/4217028502.py in stop_words_remover(df)
4 stop_words = stop_words_dict.values()
5
----> 6 df['Without Stop Words'] = df['With Stop Words'].replace(stop_words, '', stop_words())
7
8 df = df[['Tweets', 'Date', 'Without Stop Words']]
TypeError: 'dict_values' object is not callable
This is the expected output
stop_words_remover(twitter_df.copy())
Tweets Date Without Stop Words
0 #BongaDlulane Please send an email to mediades... 2019-11-29 12:50:54 [#bongadlulane, send, email, mediadesk#eskom.c...
1 #saucy_mamiie Pls log a call on 0860037566 2019-11-29 12:46:53 [#saucy_mamiie, pls, log, 0860037566]
2 #BongaDlulane Query escalated to media desk. 2019-11-29 12:46:10 [#bongadlulane, query, escalated, media, desk.]
3 Before leaving the office this afternoon, head... 2019-11-29 12:33:36 [leaving, office, afternoon,, heading, weekend...
4 #ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN... 2019-11-29 12:17:43 [#eskomfreestate, #mediastatement, :, eskom, s...
... ... ... ...
195 Eskom's Visitors Centres’ facilities include i... 2019-11-20 10:29:07 [eskom's, visitors, centres’, facilities, incl...
196 #Eskom connected 400 houses and in the process... 2019-11-20 10:25:20 [#eskom, connected, 400, houses, process, conn...
197 #ArthurGodbeer Is the power restored as yet? 2019-11-20 10:07:59 [#arthurgodbeer, power, restored, yet?]
198 #MuthambiPaulina #SABCNewsOnline #IOL #eNCA #e... 2019-11-20 10:07:41 [#muthambipaulina, #sabcnewsonline, #iol, #enc...
199 RT #GP_DHS: The #GautengProvince made a commit... 2019-11-20 10:00:09 [rt, #gp_dhs:, #gautengprovince, commitment, e...
Please can someone help me?

There is a simple way to do this in a single command using apply with a lambda:
twitter_df["Tweets"].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words_dict["stopwords"]]))
If you prefer to create a function for this, it could be:
def remove_stop_words(tweet, stop_words_dict):
    sentence = tweet.split()
    output = []
    for word in sentence:
        if word not in stop_words_dict["stopwords"]:
            output.append(word)
    return " ".join(output)
twitter_df["Tweets"].apply(lambda x: remove_stop_words(x, stop_words_dict))
Both versions return each tweet as one space-joined string and compare words case-sensitively, so drop the " ".join(...) and lowercase the tweet first if you need the lowercase token list shown in the expected output.
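For a function that follows the specification in the question (modify the dataframe, store a token list in "Without Stop Words", return the dataframe), something along these lines might work. It is a sketch under assumptions: "function 6" is not shown, so tokenising is assumed to mean lowercasing and splitting on whitespace, which is what the expected output suggests. The tiny stop_words_dict and twitter_df below are made-up stand-ins for the real ones.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the question's data.
stop_words_dict = {'stopwords': ['please', 'an', 'to', 'a', 'on', 'the', 'call']}
twitter_df = pd.DataFrame({'Tweets': ['Please send an email to the media desk',
                                      'Pls log a call on 0860037566']})

def stop_words_remover(df):
    # A set makes each membership test cheap compared with scanning a list.
    stop_words = set(stop_words_dict['stopwords'])
    # Assumed tokenisation: lowercase, then split on whitespace.
    tokens = df['Tweets'].str.lower().str.split()
    # Keep only the tokens that are not stop words, one list per tweet.
    df['Without Stop Words'] = tokens.apply(
        lambda words: [w for w in words if w not in stop_words])
    return df

result = stop_words_remover(twitter_df.copy())
print(result['Without Stop Words'].tolist())
# [['send', 'email', 'media', 'desk'], ['pls', 'log', '0860037566']]
```

The TypeError in the question comes from calling stop_words() as if it were a function; indexing stop_words_dict['stopwords'] gives the list directly.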

How to find keywords in a text file using python's sklearn

I want to create a way to optimize my resume using a python script. To do this, I am trying to find keywords used in the job listing that I can add to my resume to make it stand out when it is run through ATS. Currently, I am using the following code to find what percent match my resume is for the job. How can I use this comparison and find how to improve my resume with specific keywords from the job listing?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
resume = open("resume.txt", encoding='latin-1')
reference = open("reference.txt", encoding='latin-1')
compare = [resume.read(),reference.read()]
cMatrix = CountVectorizer().fit_transform(compare)
#prints how well the resume matches as a percentage
matPercent = cosine_similarity(cMatrix)[0][1] * 100
matPercent = round(matPercent, 2) # round to two decimal
print("Resume is a "+ str(matPercent)+ "% match to the job.")
I am using the following to generate keywords. However, it omits important words and produces a long list that I think could be optimized using sklearn. Instead of using FindKeywords(), how can I access the information from CountVectorizer().fit_transform(compare)?
def FindKeywords():
    file = open("reference.txt", encoding='latin-1')
    string = file.read().replace("\n", " ").replace("\t", " ").lower()
    kwDict = {}
    avoidables = set(['skilled','skills','skill','minimum','tools','work','features','looking','highly','', ' ','','the', 'of', 'to', 'and', 'a', 'in', 'is', 'it', 'you', 'that', 'he', 'was', 'for', 'on', 'are', 'with', 'as', 'I', 'his', 'they', 'be', 'at', 'one', 'have', 'this', 'from', 'or', 'had', 'by', 'not', 'word', 'but', 'what', 'some', 'we', 'can', 'out', 'other', 'were', 'all', 'there', 'when', 'up', 'use', 'your', 'how', 'said', 'an', 'each', 'she', 'which', 'do', 'their', 'time', 'if', 'will', 'way', 'about', 'many', 'then', 'them', 'write', 'would', 'like', 'so', 'these', 'her', 'long', 'make', 'thing', 'see', 'him', 'two', 'has', 'look', 'more', 'day', 'could', 'go', 'come', 'did', 'number', 'sound', 'no', 'most', 'people', 'my', 'over', 'know', 'water', 'than', 'call', 'first', 'who', 'may', 'down', 'side', 'been', 'now', 'find', 'any', 'new', 'work', 'part', 'take', 'get', 'place', 'made', 'live', 'where', 'after', 'back', 'little', 'only', 'round', 'man', 'year', 'came', 'show', 'every', 'good', 'me', 'give', 'our', 'under', 'name', 'very', 'through', 'just', 'form', 'sentence', 'great', 'think', 'say', 'help', 'low', 'line', 'differ', 'turn', 'cause', 'much', 'mean', 'before', 'move', 'right', 'boy', 'old', 'too', 'same', 'tell', 'does', 'set', 'three', 'want', 'air', 'well', 'also', 'play', 'small', 'end', 'put', 'home', 'read', 'hand', 'port', 'large', 'spell', 'add', 'even', 'land', 'here', 'must', 'big', 'high', 'such', 'follow', 'act', 'why', 'ask', 'men', 'change', 'went', 'light', 'kind', 'off', 'need', 'house', 'picture', 'try', 'us', 'again', 'animal', 'point', 'mother', 'world', 'near', 'build', 'self', 'earth', 'father', 'head', 'stand', 'own', 'page', 'should', 'country', 'found', 'answer', 'school', 'grow', 'study', 'still', 'learn', 'plant', 'cover', 'food', 'sun', 'four', 'between', 'state', 'keep', 'eye', 'never', 'last', 'let', 'thought', 'city', 'tree', 'cross', 'farm', 'hard', 'start', 'might', 'story', 'saw', 'far', 'sea', 'draw',
'left', 'late', 'run', "don't", 'while', 'press', 'close', 'night', 'real', 'life', 'few', 'north', 'open', 'seem', 'together', 'next', 'white', 'children', 'begin', 'got', 'walk', 'example', 'ease', 'paper', 'group', 'always', 'music', 'those', 'both', 'mark', 'often', 'letter', 'until', 'mile', 'river', 'car', 'feet', 'care', 'second', 'book', 'carry', 'took', 'science', 'eat', 'room', 'friend', 'began', 'idea', 'fish', 'mountain', 'stop', 'once', 'base', 'hear', 'horse', 'cut', 'sure', 'watch', 'color', 'face', 'wood', 'main', 'enough', 'plain', 'girl', 'usual', 'young', 'ready', 'above', 'ever', 'red', 'list', 'though', 'feel', 'talk', 'bird', 'soon', 'body', 'dog', 'family', 'direct', 'pose', 'leave', 'song', 'measure', 'door', 'product', 'black', 'short', 'numeral', 'class', 'wind', 'question', 'happen', 'complete', 'ship', 'area', 'half', 'rock', 'order', 'fire', 'south', 'problem', 'piece', 'told', 'knew', 'pass', 'since', 'top', 'whole', 'king', 'space', 'heard', 'best', 'hour', 'better', 'true', 'during', 'hundred', 'five', 'remember', 'step', 'early', 'hold', 'west', 'ground', 'interest', 'reach', 'fast', 'verb', 'sing', 'listen', 'six', 'table', 'travel', 'less', 'morning', 'ten', 'simple', 'several', 'vowel', 'toward', 'war', 'lay', 'against', 'pattern', 'slow', 'center', 'love', 'person', 'money', 'serve', 'appear', 'road', 'map', 'rain', 'rule', 'govern', 'pull', 'cold', 'notice', 'voice', 'unit', 'power', 'town', 'fine', 'certain', 'fly', 'fall', 'lead', 'cry', 'dark', 'machine', 'note', 'wait', 'plan', 'figure', 'star', 'box', 'noun', 'field', 'rest', 'correct', 'able', 'pound', 'done', 'beauty', 'drive', 'stood', 'contain', 'front', 'teach', 'week', 'final', 'gave', 'green', 'oh', 'quick', 'develop', 'ocean', 'warm', 'free', 'minute', 'strong', 'special', 'mind', 'behind', 'clear', 'tail', 'produce', 'fact', 'street', 'inch', 'multiply', 'nothing', 'course', 'stay', 'wheel', 'full', 'force', 'blue', 'object', 'decide', 'surface', 'deep', 'moon', 
'island', 'foot', 'system', 'busy', 'test', 'record', 'boat', 'common', 'gold', 'possible', 'plane', 'stead', 'dry', 'wonder', 'laugh', 'thousand', 'ago', 'ran', 'check', 'game', 'shape', 'equate', 'hot', 'miss', 'brought', 'heat', 'snow', 'tire', 'bring', 'yes', 'distant', 'fill', 'east', 'paint', 'language', 'among', 'grand', 'ball', 'yet', 'wave', 'drop', 'heart', 'am', 'present', 'heavy', 'dance', 'engine', 'position', 'arm', 'wide', 'sail', 'material', 'size', 'vary', 'settle', 'speak', 'weight', 'general', 'ice', 'matter', 'circle', 'pair', 'include', 'divide', 'syllable', 'felt', 'perhaps', 'pick', 'sudden', 'count', 'square', 'reason', 'length', 'represent', 'art', 'subject', 'region', 'energy', 'hunt', 'probable', 'bed', 'brother', 'egg', 'ride', 'cell', 'believe', 'fraction', 'forest', 'sit', 'race', 'window', 'store', 'summer', 'train', 'sleep', 'prove', 'lone', 'leg', 'exercise', 'wall', 'catch', 'mount', 'wish', 'sky', 'board', 'joy', 'winter', 'sat', 'written', 'wild', 'instrument', 'kept', 'glass', 'grass', 'cow', 'job', 'edge', 'sign', 'visit', 'past', 'soft', 'fun', 'bright', 'gas', 'weather', 'month', 'million', 'bear', 'finish', 'happy', 'hope', 'flower', 'clothe', 'strange', 'gone', 'jump', 'baby', 'eight', 'village', 'meet', 'root', 'buy', 'raise', 'solve', 'metal', 'whether', 'push', 'seven', 'paragraph', 'third', 'shall', 'held', 'hair', 'describe', 'cook', 'floor', 'either', 'result', 'burn', 'hill', 'safe', 'cat', 'century', 'consider', 'type', 'law', 'bit', 'coast', 'copy', 'phrase', 'silent', 'tall', 'sand', 'soil', 'roll', 'temperature', 'finger', 'industry', 'value', 'fight', 'lie', 'beat', 'excite', 'natural', 'view', 'sense', 'ear', 'else', 'quite', 'broke', 'case', 'middle', 'kill', 'son', 'lake', 'moment', 'scale', 'loud', 'spring', 'observe', 'child', 'straight', 'consonant', 'nation', 'dictionary', 'milk', 'speed', 'method', 'organ', 'pay', 'age', 'section', 'dress', 'cloud', 'surprise', 'quiet', 'stone', 'tiny', 'climb', 'cool', 
'design', 'poor', 'lot', 'experiment', 'bottom', 'key', 'iron', 'single', 'stick', 'flat', 'twenty', 'skin', 'smile', 'crease', 'hole', 'trade', 'melody', 'trip', 'office', 'receive', 'row', 'mouth', 'exact', 'symbol', 'die', 'least', 'trouble', 'shout', 'except', 'wrote', 'seed', 'tone', 'join', 'suggest', 'clean', 'break', 'lady', 'yard', 'rise', 'bad', 'blow', 'oil', 'blood', 'touch', 'grew', 'cent', 'mix', 'team', 'wire', 'cost', 'lost', 'brown', 'wear', 'garden', 'equal', 'sent', 'choose', 'fell', 'fit', 'flow', 'fair', 'bank', 'collect', 'save', 'control', 'decimal', 'gentle', 'woman', 'captain', 'practice', 'separate', 'difficult', 'doctor', 'please', 'protect', 'noon', 'whose', 'locate', 'ring', 'character', 'insect', 'caught', 'period', 'indicate', 'radio', 'spoke', 'atom', 'human', 'history', 'effect', 'electric', 'expect', 'crop', 'modern', 'element', 'hit', 'student', 'corner', 'party', 'supply', 'bone', 'rail', 'imagine', 'provide', 'agree', 'thus', 'capital', "won't", 'chair', 'danger', 'fruit', 'rich', 'thick', 'soldier', 'process', 'operate', 'guess', 'necessary', 'sharp', 'wing', 'create', 'neighbor', 'wash', 'bat', 'rather', 'crowd', 'corn', 'compare', 'poem', 'string', 'bell', 'depend', 'meat', 'rub', 'tube', 'famous', 'dollar', 'stream', 'fear', 'sight', 'thin', 'triangle', 'planet', 'hurry', 'chief', 'colony', 'clock', 'mine', 'tie', 'enter', 'major', 'fresh', 'search', 'send', 'yellow', 'gun', 'allow', 'print', 'dead', 'spot', 'desert', 'suit', 'current', 'lift', 'rose', 'continue', 'block', 'chart', 'hat', 'sell', 'success', 'company', 'subtract', 'event', 'particular', 'deal', 'swim', 'term', 'opposite', 'wife', 'shoe', 'shoulder', 'spread', 'arrange', 'camp', 'invent', 'cotton', 'born', 'determine', 'quart', 'nine', 'truck', 'noise', 'level', 'chance', 'gather', 'shop', 'stretch', 'throw', 'shine', 'property', 'column', 'molecule', 'select', 'wrong', 'gray', 'repeat', 'require', 'broad', 'prepare', 'salt', 'nose', 'plural', 'anger', 
'claim', 'continent', 'oxygen', 'sugar', 'death', 'pretty', 'skill', 'women', 'season', 'solution', 'magnet', 'silver', 'thank', 'branch', 'match', 'suffix', 'especially', 'fig', 'afraid', 'huge', 'sister', 'steel', 'discuss', 'forward', 'similar', 'guide', 'experience', 'score', 'apple', 'bought', 'led', 'pitch', 'coat', 'mass', 'card', 'band', 'rope', 'slip', 'win', 'dream', 'evening', 'condition', 'feed', 'tool', 'total', 'basic', 'smell', 'valley', 'nor', 'double', 'seat', 'arrive', 'master', 'track', 'parent', 'shore', 'division', 'sheet', 'substance', 'favor', 'connect', 'post', 'spend', 'chord', 'fat', 'glad', 'original', 'share', 'station', 'dad', 'bread', 'charge', 'proper', 'bar', 'offer', 'segment', 'slave', 'duck', 'instant', 'market', 'degree', 'populate', 'chick', 'dear', 'enemy', 'reply', 'drink', 'occur', 'support', 'speech', 'nature', 'range', 'steam', 'motion', 'path', 'liquid', 'log', 'meant', 'quotient', 'teeth', 'shell', 'neck'])
    for word in string.split(' '):
        if word not in kwDict and word not in avoidables:
            kwDict[word] = 1
        elif word not in avoidables:
            kwDict[word] += 1
    returns = [key for key in kwDict.keys() if kwDict[key]>0]
    return [kw for kw in returns if kw not in avoidables]

You can use the get_feature_names_out() method of CountVectorizer (named get_feature_names() in scikit-learn versions before 1.0), as described in the scikit-learn documentation.
So with a concrete example from your code (adjusted a bit), it could look like this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
resume = "This is an example resume for a job"
reference = "This is an example reference for a job advertisement"
compare = [resume,reference]
cVect = CountVectorizer()
cMatrix = cVect.fit_transform(compare)
#prints how well the resume matches as a percentage
matPercent = cosine_similarity(cMatrix)[0][1] * 100
matPercent = round(matPercent, 2) # round to two decimal
print("Resume is a "+ str(matPercent)+ "% match to the job.")
Returns:
Resume is a 80.18% match to the job.
Then to get the keywords:
cVect.get_feature_names_out().tolist()  # on scikit-learn < 1.0: cVect.get_feature_names()
The returned keywords:
['advertisement',
'an',
'example',
'for',
'is',
'job',
'reference',
'resume',
'this']
If you want keywords from only your resume or only the reference, fit_transform() a separate CountVectorizer() on just that text and read the keywords from it.
The important thing to keep in mind is that you need to keep a reference to your fitted CountVectorizer, so instead of
CountVectorizer().fit_transform(compare)
you need to use
cVect = CountVectorizer()
cVect.fit_transform(compare)
so that you can still access your CountVectorizer() instance later.

How to print a list of tokenized text into a file

from urllib import request
from redditscore.tokenizer import CrazyTokenizer
tokenizer = CrazyTokenizer()
url = "http://www.site.uottawa.ca/~diana/csi5386/A1_2020/microblog2011.txt"
for line in request.urlopen(url):
    tokens = tokenizer.tokenize(line.decode('utf-8'))
    #print(tokens)
    with open('your_file.txt', 'a') as f:
        print(tokens)
        for item in tokens:
            f.write("%s\n" % item)
In the above code, my output is in the variable tokens, in the form of a list.
Output sample:
['\ufeffsave', 'bbc', 'world', 'service', 'from', 'savage', 'cuts']
['a', 'lot', 'of', 'people', 'always', 'make', 'fun', 'about', 'the', 'end', 'of', 'the', 'world', 'but', 'the', 'question', 'is', '"are', 'u', 'ready', 'for', 'it']
['rethink', 'group', 'positive', 'in', 'outlook', 'technology', 'staffing', 'specialist', 'the', 'rethink', 'group', 'expects', 'revenues', 'to', 'be']
Now I'm trying to print this output into a text file.
How can I do that? Please help.

Open the file once, outside the loop, and write each token inside it:
with open('your_file.txt', 'a') as f:
    for line in request.urlopen(url):
        tokens = tokenizer.tokenize(line.decode('utf-8'))
        #print(tokens)
        for item in tokens:
            f.write("%s\n" % item)

Just use ' '.join on each list of tokens, like the following (I am assuming that I already have the data in an array):
tokens = [
['\ufeffsave', 'bbc', 'world', 'service', 'from', 'savage', 'cuts'],
['a', 'lot', 'of', 'people', 'always', 'make', 'fun', 'about', 'the', 'end',
'of', 'the', 'world', 'but', 'the', 'question', 'is', '"are', 'u', 'ready',
'for', 'it'],
['rethink', 'group', 'positive', 'in', 'outlook', 'technology', 'staffing',
'specialist', 'the', 'rethink', 'group', 'expects', 'revenues', 'to', 'be']
]
with open('your_file.txt', 'a') as f:
    print(tokens)
    for item in tokens:
        f.write("%s\n" % ' '.join(item))
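As a self-contained variant of the same idea, the sketch below writes one line per tweet and reads the file back to check the result. The temporary output path and the shortened token lists are assumptions made for the example; it also strips the '\ufeff' byte-order mark that is stuck to the first token in the sample output.

```python
import os
import tempfile

# Shortened sample of the tokenizer output from the question.
token_lists = [
    ['\ufeffsave', 'bbc', 'world', 'service', 'from', 'savage', 'cuts'],
    ['rethink', 'group', 'positive', 'in', 'outlook'],
]

# Hypothetical output location; a temp directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), 'tokens.txt')

# Open the file once and write one space-joined line per token list.
with open(out_path, 'w', encoding='utf-8') as f:
    for tokens in token_lists:
        # Remove the byte-order mark inherited from the start of the source file.
        cleaned = [t.lstrip('\ufeff') for t in tokens]
        f.write(' '.join(cleaned) + '\n')

# Read the file back to confirm what was written.
with open(out_path, encoding='utf-8') as f:
    written = f.read().splitlines()
print(written)
```

Mode 'w' starts a fresh file on each run, whereas the 'a' mode used above keeps appending to whatever is already there.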

Pandas not dividing length of cells

Been struggling with this problem for a long time. I have a dataframe that looks like this:
[image: dataframe pic]
I'm trying to divide the length of each 'counter' by the length of each 'content'. I thought this would be fairly straightforward. So far I've tried:
reviews['diversity'] = reviews['counter'].apply(lambda x: 0 if len(x) == 0 else float(len(x)) / float(len(reviews['content'][x])))
as well as using x['content']. I get the massive error message KeyError: "None of [['aberfeldy', 'recorded', 'their', 'debut', 'young', 'forever', 'using', 'a', 'single', 'microphone', 'good', 'for', 'them', 'in', 'that', 'spirit', 'i', 'cut', 'short', 'my', 'obligatory', 'introduction', 'and', 'bring', 'you', 'straight', 'to', 'the', 'edinburgh', 'group', 'lovelorn', 'unfortunately', 'still', 'heart', 'exposed', 'by', 'oh', 'production', 'love', 'is', 'verb', 'noun', 'as', 'well', 'find', 'it', 'dictionary', 'under', 'l', 'little', 'witticism', 'comes', 'from', 'an', 'arrow', 'written', 'sung', 'riley', 'briggs', 'based', 'on', 'one', 'photo', 'looks', 'like', 'anthony', 'michael', 'hall', 'though', 'his', 'vocals', 'chart', 'fairly', 'standard', 'indie', 'course', 'borrowing', 'neil', 'friend', 'ben', 'gibbard', 'what', 'do', 'plain', 'sensitive', 'guys', 'everywhere', 'listen', 'some', 'of', 'best', 'friends', 'are', 'favorite', 'albums', 'consist', 'campfire', 'singalongs', 'bands', 'with', 'modest', 'acoustic', 'guitar', 'chops', 'cute', 'names', 'accents', 'but', 'those', 'lyrics', 'no', 'band', 'would', 'sing', 'such', 'words', 'deserves', 'easily', 'made', 'comparisons', 'fellow', 'scots', 'belle', '', 'sebastian', 'or', 'even', 'camera', 'obscura', 'let', 'alone', 'earnest', 'aussies', 'lucksmiths', 'compare', 'twee', 'progenitors', 'pastels', 'talulah', 'gosh', 'owe', 'me', 'your', 'cardigan', 'moniker', 'nipped', 'scottish', 'vacation', 'destination', 'practically', 'beg', 'name', 'there', 'need', 'encourage', 'throughout', 'record', 'shows', 'predisposition', 'toward', 'bungling', 'old', 'english', 'teachers', 'motto', 'show', 'not', 'tell', 'this', 'may', 'be', 'result', 'medical', 'condition', 'dyslexia', 'which', 'case', 'we', 'should', 'hold', 'our', 'snark', 'seems', 'guy', 'can', 'open', 'mouth', 'without', 'saying', 'nothing', 'so', 'sad', 'leaving', 'he', 'sings', 'out', 'lonely', 'now', 'she', 'gone', 'adds', 'tie', 'teems', 'vivid', 
'storytelling', 'goes', 'rhyme', 'sacred', 'wasted', 'reasons', 'until', 'somewhere', 'editor', 'rhyming', 'loses', 'her', 'job', 'often', 'at', 'when', 'they', 'stumble', 'beyond', 'trite', 'infantilism', 'first', 'vegetarian', 'restaurant', 'lopes', 'along', 'winning', 'tangled', 'up', 'blue', 'strums', 'accented', 'subtle', 'fiddles', 'lovely', 'boy', 'harmonies', 'seemingly', 'aiming', 'album', 'cheerful', 'unpretentious', 'look', 'everyday', 'here', 'finally', 'makes', 'interesting', 'way', 'dance', 'kitchen', 'says', 'willing', 'see', 'where', 'takes', 'him', 'then', 'proclaims', 'sometimes', 'believe', 'human', 'duck', 'cover', 'speaking', 'aliens', 'heliopolis', 'night', 'next', 'track', 'incidentally', 'its', 'second', 'whimsical', 'spaceship', 'song', 'complete', 'nose', 'perfect', 'unique', 'yeah', 'was', 'means', 'warm', 'pop', 'heats', 'headphones', 'veritable', 'help', 'root', 'begins', 'everyone', 'because', 'last', 'thing', 'world', 'needs', 'another', 'batch', 'sullen', 'scenesters', 'yet', 'any', 'relationship', 'just', 'someone', 'doesn', 'mean', 'back', 'beautiful', 'gibbs', 'tells', 'us', 'tender', 'moment', 'probably', 'if', 'hope', 'gets', 'laid']] are in the [index]".
I've tried:
def diverse(x):
    if len(x) == 0:
        return 0
    else:
        return float(len(x)) / float(len(reviews['clean'][x]))
reviews['diverse'] = reviews['counter'].apply(diverse)
and get the same thing.
I've tried using applymap with reviews['diversity'] = reviews.applymap(lambda x: 0 if len(x) == 0 else float(len(reviews['counter'][x])) / float(len(reviews['content'][x])))
and get ("object of type 'int' has no len()", 'occurred at index Unnamed: 0').
And yet if I just do float(len(reviews['counter'][4])) / float(len(reviews['clean'][4])), I get 0.634375.
Any help is much appreciated.
edit: I tried:
def test(x, y):
    for row, item in x.iteritems():
        x = float(len(item))
    for row, item in y.iteritems():
        if len(item) == 0:
            return (0)
        else:
            y = float(len(item))
        return (x/y)
When I used "print" instead of "return", it gave me all the values. But return only divides the length of the first row, which seems really weird?

Here is a toy example I constructed to show how to do what you are asking:
import pandas as pd
from collections import Counter
df = pd.DataFrame([['hello world i am a computer'],
['hello i am a computer too hello computer']],
columns=['content'])
df['counter'] = df.content.str.split().apply(Counter)
df
# returns:
content counter
hello world i am a computer {'am': 1, 'hello': 1, 'computer': 1, 'world': ...
hello i am a computer too hello computer {'am': 1, 'hello': 2, 'computer': 2, 'a': 1, '...
This line answers the question as you phrased it (the length of each counter divided by the length of each content, here the number of characters):
df['diversity'] = df.counter.apply(len) / df.content.str.len()
But I think what you really wanted was to break the strings in content into a list of words by splitting on whitespace. In that case, you probably want:
df['diversity'] = df.counter.apply(len) / df.content.str.split().apply(len)
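The KeyError in the question arises because reviews['content'][x] uses the Counter x as an index label; inside apply, x is the cell value, not a row position. Dividing two aligned Series avoids the lookup altogether. Here is a runnable sketch on the toy data, dividing the length of each counter by the number of words in each content as the question asks. The helper names n_words and n_unique are made up, and treating empty rows as 0 is an assumption carried over from the question's own code.

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({'content': ['hello world i am a computer',
                               'hello i am a computer too hello computer']})
words = df['content'].str.split()
df['counter'] = words.apply(Counter)

# Per-row lengths as two Series that share the same index.
n_words = words.apply(len)            # total words in each row
n_unique = df['counter'].apply(len)   # distinct words in each row

# Element-wise division; where() replaces rows with no words by 0.
df['diversity'] = (n_unique / n_words).where(n_words > 0, 0.0)
print(df['diversity'].tolist())  # [1.0, 0.75]
```

The test function in the edit returns inside its loop, which is why it only ever handles the first row.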

Python convert string in array

Hello, I have a string that looks like this:
el-gu-en-tr-ca-it-eu-ca#valencia-ar-eo-cs-et-th_TH-gl-id-es-bn_IN-ru-he-nl-pt-no-nb-id_ID-lv-lt-pa-te-pl-ta-bg_BG-be-fr-de-bn_BD-uk-pt_BR-ast-hr-jv-zh_TW-sr#latin-da-fa-hi-tr_TR-fi-hu-ja-fo-bs_BA-ro-fa_IR-zh_CN-sr-sq-mn-ko-sv-km-sk-km_KH-en_GB-ms-sc-ug-bal
How can I split it on - and place the items in an array, like this?
array[0]->el
array[1]->gu
.....

Use the .split() method on your string:
>>> example = 'el-gu-en-tr-ca-it-eu-ca#valencia-ar-eo-cs-et-th_TH-gl-id-es-bn_IN-ru-he-nl-pt-no-nb-id_ID-lv-lt-pa-te-pl-ta-bg_BG-be-fr-de-bn_BD-uk-pt_BR-ast-hr-jv-zh_TW-sr#latin-da-fa-hi-tr_TR-fi-hu-ja-fo-bs_BA-ro-fa_IR-zh_CN-sr-sq-mn-ko-sv-km-sk-km_KH-en_GB-ms-sc-ug-bal'
>>> example.split('-')
['el', 'gu', 'en', 'tr', 'ca', 'it', 'eu', 'ca#valencia', 'ar', 'eo', 'cs', 'et', 'th_TH', 'gl', 'id', 'es', 'bn_IN', 'ru', 'he', 'nl', 'pt', 'no', 'nb', 'id_ID', 'lv', 'lt', 'pa', 'te', 'pl', 'ta', 'bg_BG', 'be', 'fr', 'de', 'bn_BD', 'uk', 'pt_BR', 'ast', 'hr', 'jv', 'zh_TW', 'sr#latin', 'da', 'fa', 'hi', 'tr_TR', 'fi', 'hu', 'ja', 'fo', 'bs_BA', 'ro', 'fa_IR', 'zh_CN', 'sr', 'sq', 'mn', 'ko', 'sv', 'km', 'sk', 'km_KH', 'en_GB', 'ms', 'sc', 'ug', 'bal']

Call str.split():
s = "el-gu-en-tr-ca-it-eu-ca#valencia-ar-eo-cs-et-th_TH-gl-id-es-bn_IN-ru-he-nl-pt-no-nb-id_ID-lv-lt-pa-te-pl-ta-bg_BG-be-fr-de-bn_BD-uk-pt_BR-ast-hr-jv-zh_TW-sr#latin-da-fa-hi-tr_TR-fi-hu-ja-fo-bs_BA-ro-fa_IR-zh_CN-sr-sq-mn-ko-sv-km-sk-km_KH-en_GB-ms-sc-ug-bal"
locales = s.split("-")
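The result of split() is an ordinary Python list, so the array[0], array[1] access from the question works directly. A short illustration on a truncated copy of the string (the truncation is only to keep the example small):

```python
# First ten items of the string from the question.
s = "el-gu-en-tr-ca-it-eu-ca#valencia-ar-eo"
locales = s.split("-")

# Index the list just like the array in the question.
print(locales[0])    # el
print(locales[1])    # gu
print(len(locales))  # 10

# enumerate() pairs every item with its position.
for index, code in enumerate(locales):
    print(index, code)
```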
