Specific words from a text file with PLY - python
I am making a lexical analyzer for specific words that are in a .txt file. To do this I declare certain words as reserved and try to print only those selected words on the screen, but the result I get is that it takes all the words in the txt file and prints them. I've been following the tutorial and the official PLY documentation at http://www.dabeaz.com/ply/ply.html#ply_nn6 but I still haven't achieved my goal. Could someone help me with this? Thank you very much.
import ply.lex as lex
import re
import os
import sys
reservadas = {
'if' : 'if',
'then' : 'then',
'else' : 'else',
'while' : 'while',
}
tokens = ['ID','NUMBER','PLUS','MINUS','TIMES','DIVIDE',
'ODD','ASSIGN','NE','LT','LTE','GT','GTE',
'LPARENT', 'RPARENT','COMMA','SEMMICOLOM',
'DOT','UPDATE'
] + list(reservadas.values())
#tokens = tokens+reservadas
# reservadas = {
# 'begin':'BEGIN',
# 'end':'END',
# 'if':'IF',
# 'then':'THEN',
# 'while':'WHILE',
# 'do':'DO',
# 'call':'CALL',
# 'const':'CONST',
# 'int':'VAR',
# 'procedure':'PROCEDURE',
# 'out':'OUT',
# 'in':'IN',
# 'else':'ELSE'
# }
#tokens = tokens+list(reservadas.values())
t_ignore = '\t '
t_ignore_PLUS = r'\+'
t_ignore_MINUS = r'\-'
t_ignore_TIMES = r'\*'
t_ignore_DIVIDE = r'/'
t_ignore_ODD = r'ODD'
t_ignore_ASSIGN = r'='
t_ignore_NE = r'<>'
t_ignore_LT = r'<'
t_ignore_LTE = r'<='
t_ignore_GT = r'>'
t_ignore_GTE = r'>='
t_ignore_LPARENT = r'\('
t_ignore_RPARENT = r'\)'
t_ignore_COMMA = r','
t_ignore_SEMMICOLOM = r';'
t_ignore_DOT = r'\.'
t_ignore_UPDATE = r':='
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reservadas.get(t.value,'ID')    # Check for reserved words
    return t
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
#dsfjksdlgjklsdgjsdgslxcvjlk-,.
def t_COMMENT(t):
    r'\//.*'
    r'\/*.*'
    r'\*/.*'
    pass
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    pass
def t_error(t):
    print ("----- '%s'" % t.value[0])
    t.lexer.skip(1)

# (the post omits the lines that build the lexer and feed it the .txt file,
#  presumably something like: analizador = lex.lex() followed by analizador.input(data))

while True:
    tok = analizador.token()
    if not tok : break
    print (tok)
the output I get with the above code is:
LexToken(ID,'FSR',1,3)
LexToken(ID,'testing',1,7)
LexToken(ID,'sketch',1,15)
'---- '
'---- '
LexToken(ID,'Connect',3,28)
LexToken(ID,'one',3,36)
LexToken(ID,'end',3,40)
LexToken(ID,'of',3,44)
LexToken(ID,'FSR',3,47)
LexToken(ID,'to',3,51)
LexToken(ID,'V',3,55)
LexToken(ID,'the',3,58)
LexToken(ID,'other',3,62)
LexToken(ID,'end',3,68)
LexToken(ID,'to',3,72)
LexToken(ID,'Analog',3,75)
'---- '
.
.
.
.
LexToken(ID,'Serial',21,694)
LexToken(ID,'print',21,701)
----- '"'
LexToken(ID,'Analog',21,708)
LexToken(ID,'reading',21,715)
----- '"'
'---- '
LexToken(ID,'Serial',22,732)
LexToken(ID,'println',22,739)
LexToken(ID,'fsrReading',22,747)
'---- '
'---- '
LexToken(ID,'LEDbrightness',26,898)
LexToken(ID,'map',26,914)
LexToken(ID,'fsrReading',26,918)
'---- '
LexToken(ID,'analogWrite',28,996)
LexToken(ID,'LEDpin',28,1008)
LexToken(ID,'LEDbrightness',28,1016)
'---- '
LexToken(ID,'IF',29,1034)
'---- '
LexToken(if,'if',30,1038)
'---- '
LexToken(ID,'delay',31,1044)
'---- '
----- '}'
Press any key to continue . . .
My expected output would be something like this:
LexToken(ID,'IF',29,1034)
'---- '
LexToken(if,'if',30,1038)
I am analyzing Arduino code, and all of those words come from comments. I only need it to look for the conditionals if or IF, or other reserved words like for; the main idea is that, given a list of reserved words, the lexer identifies them and shows me only the selected ones.
If you want to discard tokens that are not in your 'reserved' list, adjust the t_ID function like so:
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    reserved_type = reservadas.get(t.value, False)
    if reserved_type:
        t.type = reserved_type
        return t      # Return token with reserved type
    return None       # Discard non-reserved tokens
Additionally, your comment token function is probably misapplied here.
def t_COMMENT(t):
    r'\//.*'
    r'\/*.*'
    r'\*/.*'
    pass
You can't use multiple rules or span a rule over multiple strings like this, because the docstring (which PLY uses to get the regex) will only contain the very first string.
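You can see this directly in Python, independent of PLY (a minimal sketch): only the first string literal in a function body becomes the docstring, and PLY takes the rule's pattern from that docstring, so the second and third regexes are silently ignored.

def rule(t):
    r'first'
    r'second'
    r'third'
    pass

print(rule.__doc__)    # prints 'first' -- the other two strings are discarded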
Secondly, I think the regex needs adjusting for comments, assuming you're tokenizing C or a C-like language. Particularly, it needs to account for the possibility that comments span multiple lines.
To fix, apply the following for dealing with comments:
def t_block_comment(tok):
    r'/\*((.|\n))*?\*/'
    tok.lexer.lineno += tok.value.count('\n')
    return None    # Discard block comments "/* comment */"

t_ignore_comment = r'//.*'    # ignore inline comments "// comment"
You may also need to apply the regex multiline flag:
analizador = lex.lex(reflags=re.MULTILINE)
Lastly, your t_ignore_DIVIDE = r'/' may be preventing your comment rules from applying, too. Consider ordering this after the comment rules.
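Putting those fixes together, here is a minimal, self-contained sketch of the lexer (not your original file: the uppercase token names, the .lower() lookup so that both if and IF are recognised, and the small Arduino-like input string are assumptions made purely for illustration):

import ply.lex as lex

# reserved words we actually want to see in the output
reservadas = {'if': 'IF', 'then': 'THEN', 'else': 'ELSE', 'while': 'WHILE', 'for': 'FOR'}

tokens = ['ID'] + list(reservadas.values())

t_ignore = ' \t'
t_ignore_comment = r'//.*'                 # discard inline comments

def t_block_comment(t):
    r'/\*(.|\n)*?\*/'
    t.lexer.lineno += t.value.count('\n')  # keep line numbers accurate
    # returning nothing discards the whole /* ... */ block

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    reserved_type = reservadas.get(t.value.lower(), False)
    if reserved_type:
        t.type = reserved_type
        return t                           # keep reserved words only
    # every other identifier is silently discarded

def t_error(t):
    t.lexer.skip(1)                        # skip digits, punctuation, etc.

analizador = lex.lex()
analizador.input('// FSR testing sketch\nif (x > 3) then delay(30); else x = 0;\n')

while True:
    tok = analizador.token()
    if not tok:
        break
    print(tok)                             # only the IF, THEN and ELSE tokens are printed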
Related
regex expression for delete python comment [duplicate]
This question already has answers here: Script to remove Python comments/docstrings (7 answers). Closed 2 years ago.

I want to delete all comments in a python file. The file is like this:

--------------- comment.py ---------------
# this is comment line.
age = 18  # comment in line
msg1 = "I'm #1."  # comment. there's a # in code.
msg2 = 'you are #2. ' + 'He is #3'  # strange sign ' # ' in comment.
print('Waiting your answer')

I wrote many regexes to extract all the comments, some like this:

(?(?<=['"])(?<=['"])\s*#.*$|\s*#.*$)    get: #1." # comment. there's a # in code.
(?<=('|")[^\1]*\1)\s*#.*$|\s*#.*$       wrong. it's not 0-width in lookaround (?<=..)

But it doesn't work right. What's the right regex? Could you help me, please?
You can try using tokenize instead of regex; as @OlvinRoght said, parsing code with regex may be a bad idea in this case. As you can see here, you can try something like this to detect the comments:

import tokenize

fileObj = open('yourpath\comment.py', 'r')
for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    # we can also use token.tok_name[toktype] instead of 'COMMENT'
    # from the token module
    if toktype == tokenize.COMMENT:
        print('COMMENT' + " " + tok)

Output:

COMMENT # -*- coding: utf-8 -*-
COMMENT # this is comment line.
COMMENT # comment in line
COMMENT # comment. there's a # in code.
COMMENT # strange sign ' # ' in comment.

Then, to get the expected result, that is the python file without comments, you can try this:

nocomments = []
for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    if toktype != tokenize.COMMENT:
        nocomments.append(tok)
print(' '.join(nocomments))

Output:

age = 18 msg1 = "I'm #1." msg2 = 'you are #2. ' + 'He is #3' print ( 'Waiting your answer' )
Credit: https://gist.github.com/BroHui/aca2b8e6e6bdf3cb4af4b246c9837fa3

This will do. It uses tokenize. You can modify this code as per your use.

""" Strip comments and docstrings from a file. """

import sys, token, tokenize

def do_file(fname):
    """ Run on just one file. """
    source = open(fname)
    mod = open(fname + ",strip", "w")
    prev_toktype = token.INDENT
    first_line = None
    last_lineno = -1
    last_col = 0
    tokgen = tokenize.generate_tokens(source.readline)
    for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
        if 0:   # Change to if 1 to see the tokens fly by.
            print("%10s %-14s %-20r %r" % (
                tokenize.tok_name.get(toktype, toktype),
                "%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
                ttext, ltext
                ))
        if slineno > last_lineno:
            last_col = 0
        if scol > last_col:
            mod.write(" " * (scol - last_col))
        if toktype == token.STRING and prev_toktype == token.INDENT:
            # Docstring
            mod.write("#--")
        elif toktype == tokenize.COMMENT:
            # Comment
            mod.write("\n")
        else:
            mod.write(ttext)
        prev_toktype = toktype
        last_col = ecol
        last_lineno = elineno

if __name__ == '__main__':
    do_file("text.txt")

text.txt:

# this is comment line.
age = 18  # comment in line
msg1 = "I'm #1."  # comment. there's a # in code.
msg2 = 'you are #2. ' + 'He is #3'  # strange sign ' # ' in comment.
print('Waiting your answer')

Output:

age = 18
msg1 = "I'm #1."
msg2 = 'you are #2. ' + 'He is #3'
print('Waiting your answer')
How NOT to print emojis from comments or submission when using praw
Getting error messages when I am trying to print out comments or submissions with emojis in them. How can I just disregard them and print only letters and numbers? I am using Praw to webscrape.

top_posts2 = page.top(limit = 25)
for post in top_posts2:
    outputFile.write(post.title)
    outputFile.write(' ')
    outputFile.write(str(post.score))
    outputFile.write('\n')
    outputFile.write(post.selftext)
    outputFile.write('\n')
    submissions = reddit.submission(id = post.id)
    comment_page = submissions.comments
    top_comment = comment_page[0]    # by default, this will be the best comment of the post
    commentBody = top_comment.body
    outputFile.write(top_comment.body)
    outputFile.write('\n')

I want to output only letters and numbers, and maybe some special characters (or all).
There's a couple of ways you can do this. I would recommend creating kind of a "text cleaning" function

def cleanText(text):
    new_text = ""
    for c in text:        # for each character in the text
        if c.isalnum():   # check if it is either a letter or number (alphanumeric)
            new_text += c
    return new_text

or, if you want to include specific non-alphanumeric characters,

def cleanText(text):
    valid_symbols = "!@#$%^&*()"    # <-- add whatever symbols you want here
    new_text = ""
    for c in text:                  # for each character in the text
        if c.isalnum() or c in valid_symbols:   # check if alphanumeric or a valid symbol
            new_text += c
    return new_text

so then in your script you can do something like

commentBody = cleanText(top_comment.body)
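If you prefer a one-liner, the same kind of cleaning can be done with re.sub; this is just a sketch, and the character class (i.e. which symbols are kept) is an assumption you would adjust to taste:

import re

def clean_text(text):
    # keep letters, digits, whitespace and a few common symbols; strip everything else
    return re.sub(r"[^0-9A-Za-z\s!#$%^&*()]", "", text)

commentBody = clean_text(top_comment.body)   # emojis and other stray symbols are removed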
parsing a string in python for #hashtag
I am wondering how I could make an algorithm that parses a string for the hashtag symbol ' # ' and returns the full string, but wherever a word starts with a '#' symbol, it becomes a link. I am using python with Google app engine: webapp2 and Jinja2 and I am building a blog. Thanks
A more efficient and complete way to find the "hashwords":

import functools

def hash_position(string):
    return string.find('#')

def delimiter_position(string, delimiters):
    positions = filter(lambda x: x >= 0, map(lambda delimiter: string.find(delimiter), delimiters))
    try:
        return functools.reduce(min, positions)
    except TypeError:
        return -1

def get_hashed_words(string, delimiters):
    maximum_length = len(string)
    current_hash_position = hash_position(string)
    string = string[current_hash_position:]
    results = []
    counter = 0
    while current_hash_position != -1:
        current_delimiter_position = delimiter_position(string, delimiters)
        if current_delimiter_position == -1:
            results.append(string)
        else:
            results.append(string[0:current_delimiter_position])
        # Update offsets and the haystack
        string = string[current_delimiter_position:]
        current_hash_position = hash_position(string)
        string = string[current_hash_position:]
    return results

if __name__ == "__main__":
    string = "Please #clarify: What do you #mean with returning somthing as a #link. #herp"
    delimiters = [' ', '.', ',', ':']
    print(get_hashed_words(string, delimiters))

Imperative code with updates of the haystack looks a little bit ugly, but hey, that's what we get for (ab-)using mutable variables. And I still have no idea what you mean by "returning something as a link". Hope that helps.
Not sure where you get the data for the link from, but maybe something like:

[('%s' % word) for word in input.split() if word[0]=='#']
Are you talking about twitter? Maybe this?

def get_hashtag_link(hashtag):
    if hashtag.startswith("#"):
        return '%s' % (hashtag[1:], hashtag)

>>> get_hashtag_link("#stackoverflow")
'#stackoverflow'

It will return None if hashtag is not a hashtag.
Find strings that begins with a '#' and create link
I want to check whether a string (a tweet) begins with a '#' (i.e. is a hashtag) or not, and if so create a link. Below is what I've tried so far but it doesn't work (error on the last line). How can I fix this, and will the code work for the purpose?

tag_regex = re.compile(r"""
    [\b#\w\w+]    # hashtag found!""", re.VERBOSE)

message = raw_message

for tag in tag_regex.findall(raw_message):
    message = message.replace(url, '' + message + '')
>>> msg = '#my_tag the rest of my tweet'
>>> re.sub('^#(\w+) (.*)', r'\2', msg)
'the rest of my tweet'
>>>
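For the actual link creation, a sketch of the usual re.sub approach to turn each #word into an anchor (the /tags/... href format here is a made-up placeholder; substitute whatever your site uses):

import re

def link_hashtags(text):
    # wrap every #word in an anchor tag; the URL scheme is hypothetical
    return re.sub(r"#(\w+)", r'<a href="/tags/\1">#\1</a>', text)

print(link_hashtags("Please #clarify what you #mean"))
# Please <a href="/tags/clarify">#clarify</a> what you <a href="/tags/mean">#mean</a>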
Lexer for Parsing to the end of a line
If I have a keyword, how can I get it, once it encounters that keyword, to just grab the rest of the line and return it as a string? Once it encounters an end of line, it should return everything on that line. Here is the line I'm looking at:

description here is the rest of my text to collect

Thus, when the lexer encounters description, I would like "here is the rest of my text to collect" returned as a string. I have the following defined, but it seems to be throwing an error:

states = (
    ('bcdescription', 'exclusive'),
)

def t_bcdescription(t):
    r'description '
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.level = 1
    t.lexer.begin('bcdescription')

def t_bcdescription_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
    t.type="BCDESCRIPTION"
    t.lexer.lineno += t.valiue.count('\n')
    t.lexer.begin('INITIAL')
    return t

This is part of the error being returned:

File "/Users/me/Coding/wm/wm_parser/ply/lex.py", line 393, in token
    raise LexError("Illegal character '%s' at index %d" % (lexdata[lexpos],lexpos), lexdata[lexpos:])
ply.lex.LexError: Illegal character ' ' at index 40

Finally, if I wanted this functionality for more than one token, how could I accomplish that?

Thanks for your time
There is no big problem with your code; in fact, I just copied your code and ran it, and it works well:

import ply.lex as lex

states = (
    ('bcdescription', 'exclusive'),
)

tokens = ("BCDESCRIPTION",)

def t_bcdescription(t):
    r'\bdescription\b'
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.level = 1
    t.lexer.begin('bcdescription')

def t_bcdescription_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
    t.type="BCDESCRIPTION"
    t.lexer.lineno += t.value.count('\n')
    t.lexer.begin('INITIAL')
    return t

def t_bcdescription_content(t):
    r'[^\n]+'

lexer = lex.lex()

data = 'description here is the rest of my text to collect\n'

lexer.input(data)

while True:
    tok = lexer.token()
    if not tok:
        break
    print tok

and the result is:

LexToken(BCDESCRIPTION,' here is the rest of my text to collect\n',1,50)

So maybe you can check the other parts of your code.

And if you wanted this functionality for more than one token, you can simply capture words, and when a word that appears in those tokens comes up, start to capture the rest of the content with the code above.
It is not obvious why you need to use a lexer/parser for this without further information.

>>> x = 'description here is the rest of my text to collect'
>>> a, b = x.split(' ', 1)
>>> a
'description'
>>> b
'here is the rest of my text to collect'