Input an item and get its regex in Python

I'm trying to make a standalone application using Python and Tkinter.
My task is to get all similar-looking product IDs from an Excel sheet using Python. I already extracted the similar-looking product IDs for a particular company, XYZ.
The code goes like this:
IDs = df1['A'].str.extract(r'\b(\d{8}s\d{2})\b', expand=False).dropna().tolist()
This extracts all items that have 8 digits followed by s followed by 2 more digits, like 01234567s12 or 98765432s23.
But I want to do the opposite: input a product ID and get its regex.
The product ID can be anything, say ABC123456 or C234-D456.
So is there code that can help me generate the regex?

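What you could do is generate the regex by recognizing the pattern of character types in the ID.
For example, 6 numbers, 2 letters, 2 symbols and 4 numbers would be:
\d{6} .{2} \S{2} \d{4}
(Here .{2} matches any two characters; [A-Za-z]{2} would restrict them to letters.)
I don't know whether this is good practice,
but at least you will have a regex that gets generated.
The regex: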
https://regex101.com/r/HPPAAm/1

I used the re module to do this.
import re

text = "12345678S00"
y = ""
for ch in text:
    # A letter is written as the literal 's', mirroring the pattern in the question
    if re.match(r'[a-zA-Z]', ch):
        y += 's'
    # A digit becomes \d
    if re.match(r'[0-9]', ch):
        y += r'\d'
    # A listed symbol becomes \S (any non-whitespace character)
    if re.match(r'[.,_=&*()%^#$!-]', ch):
        y += r'\S'
print(y)
# Output: \d\d\d\d\d\d\d\ds\d\d
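A more general version of the same idea, as a rough sketch: classify each character, then collapse runs of the same class into a counted token. The names char_token and infer_pattern are made up for this example, and mapping every letter to [A-Za-z] is an assumption; a single sample ID cannot tell you which characters are fixed and which vary.

```python
import re
import string
from itertools import groupby


def char_token(ch):
    """Map one character to a regex token (hypothetical helper)."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    # Anything else is kept as a literal, escaped so it has no special meaning
    return re.escape(ch)


def infer_pattern(sample):
    """Build a regex from one sample ID, collapsing runs of the same token."""
    parts = []
    for token, run in groupby(char_token(ch) for ch in sample):
        count = len(list(run))
        parts.append(token if count == 1 else f"{token}{{{count}}}")
    return "".join(parts)


print(infer_pattern("ABC123456"))  # [A-Za-z]{3}\d{6}
pattern = infer_pattern("C234-D456")
print(bool(re.fullmatch(pattern, "Z999-Q000")))  # True
```

The generated pattern can then be wrapped in \b(...)\b and passed to str.extract the same way as the hand-written one in the question.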

Related

python3.6 How do I regex a url from a .txt?

I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = character up until the . in '.net'?
Could I use the re module?
import re

text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com"""
# The pattern below catches all URLs in text and returns them as a list.
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I'll try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?). The dot means any character (except a newline), and the star means zero or more occurrences. A ? placed directly after the star makes the repetition non-greedy, so it matches as few characters as possible. The parentheses place it all into a group, and the \. matches a literal dot. All together, this finds everything between URL= and the first dot.
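To see what a ? after the star changes, here is a small comparison of the lazy and greedy forms. The sample line is made up for illustration; it has two dots so that the difference shows.

```python
import re

line = "URL=http://www.example.net"  # hypothetical sample line

lazy = re.findall(r"URL=(.*?)\.", line)   # stops at the first dot
greedy = re.findall(r"URL=(.*)\.", line)  # runs to the last dot

print(lazy)    # ['http://www']
print(greedy)  # ['http://www.example']
```

With only one dot in the line, as in 'URL=http://example.net', both forms return the same result.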
You don't need RegEx'es (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is only going to extract one URL. Assuming that you're working with a large text from which you want to extract all URLs, you'll want to put this logic into a function so that you can reuse it and build around it (iterate with a while or for loop and, depending on how you're iterating, keep track of the position of the last extracted URL, and so on).
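One way this could look as a reusable function, as a minimal sketch: the name value_before_dot, the key parameter and the sample lines are all assumptions made for this example. It uses str.partition instead of two find() calls, and applies the function to each line in turn.

```python
def value_before_dot(line, key="URL"):
    """Return the text between 'key=' and the first dot, or None."""
    name, sep, rest = line.partition("=")
    if not sep or name.strip() != key:
        return None  # the line has no 'key=' prefix
    end = rest.find(".")
    # If there is no dot at all, keep everything after the '='
    return rest if end == -1 else rest[:end]


lines = ["URL=http://example.net", "just some text", "URL=http://example.org"]
found = [v for v in (value_before_dot(line) for line in lines) if v is not None]
print(found)  # ['http://example', 'http://example']
```

When reading from a file, the same list comprehension could iterate over the open file object instead of the lines list.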
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.
Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])

How to extract questions from a word doc with Python using regex

I am using the docx library to read a Word document, and I am trying to extract only the questions using regex search and match. I have tried many ways of doing it, but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the word doc into a different type of file, that'll be great to know for feedback. Thank you
I am using regex 101, I've tried the following regex expressions to match only the sentences that end in a question mark
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document

wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")
I expect to save the matching patterns into directories so I can export the data to a csv file
Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as second parameter of the findall() method. If I remember correctly, this is done by first extracting all the paragraphs, and then extracting the text of the individual paragraphs. Refer to this question.
FYI, the way you would do this for a simple text file is the following:
# Open the file; the with block closes it automatically
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct. Although it makes sense logically to match only sentences that end in a ?, one of your matches is "place to pay your rent. Will my financial aid pay for housing?", for example. Only the second part of that match is an actual question. So the match should start at a capital letter rather than at any character. Your regex should be something like:
[A-Z].*\?$
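Putting the two parts together, here is a rough sketch that runs a question pattern over paragraph strings. It assumes the paragraphs have already been pulled out of the document; with python-docx that would be something like [p.text for p in Document('botDoc.docx').paragraphs]. The pattern used here is a variation, not the one above: [^.?!]* keeps a match from running across a sentence boundary, which is an assumption that breaks on abbreviations containing periods. The function name extract_questions is made up.

```python
import re

# A capital letter, then anything that is not sentence punctuation, then '?'
QUESTION = re.compile(r"[A-Z][^.?!]*\?")


def extract_questions(paragraphs):
    """Return the distinct questions found in a list of paragraph strings."""
    questions = []
    for text in paragraphs:
        for match in QUESTION.findall(text):
            if match not in questions:  # skip repeats, keep first-seen order
                questions.append(match)
    return questions


sample = [
    "Will my financial aid pay for housing?",
    "It is important to have a plan. Will my financial aid pay for housing?",
    '"financial" "help" "house"',
    "How do I pay for my housing?",
]
print(extract_questions(sample))
```

The resulting list can be written out with the csv module, one question per row.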

Python - Regex - Match anything except

I'm trying to get my regular expression to work but can't figure out what I'm doing wrong. I am trying to find any file that is NOT in a specific format. For example all files are dates that are in this format MM-DD-YY.pdf (ex. 05-13-17.pdf). I want to be able to find any files that are not written in that format.
I can create a regex to find those with:
(\d\d-\d\d-\d\d\.pdf)
I tried using the negative lookahead so it looked like this:
(?!\d\d-\d\d-\d\d\.pdf)
That works in not finding those anymore but it doesn't find the files that are not like it.
I also tried adding a .* after the group but then that finds the whole list.
(?!\d\d-\d\d-\d\d\.pdf).*
I'm searching through a small list right now for testing:
05-17-17.pdf Test.pdf 05-48-2017.pdf 03-14-17.pdf
Is there a way to accomplish what I'm looking for?
Thanks!
You can try this:
import re
s = "Test.docx 04-05-2017.docx 04-04-17.pdf secondtest.pdf"
new_data = re.findall(r"[a-zA-Z]+\.[a-zA-Z]+|\d{1,}-\d{1,}-\d{4}\.[a-zA-Z]+", s)
Output:
['Test.docx', '04-05-2017.docx', 'secondtest.pdf']
First find all the names that match, then remove them from your list separately. The firstFindtheMatching function first finds the matching names using the re library:
import re

def firstFindtheMatching(listoffiles):
    """
    :listoffiles: space-separated string of the file names to check against the format
    :final_string: every name that matches the format 01-01-17.pdf (MM-DD-YY.pdf), joined into one str. listoffiles is also returned so that you can see the whole output in one place, but you really won't need that.
    """
    matchednames = re.findall(r"\d{1,2}-\d{1,2}-\d{1,2}\.pdf", listoffiles)
    # join all matches into one string for simpler handling with sets
    final_string = ' '.join(matchednames)
    return (final_string, listoffiles)
Here is the output:
('05-08-17.pdf 04-08-17.pdf 08-09-16.pdf', '05-08-17.pdf Test.pdf 04-08-17.pdf 08-09-16.pdf 08-09-2016.pdf some-all-letters.pdf')
set(['08-09-2016.pdf', 'some-all-letters.pdf', 'Test.pdf'])
I used the main below, in case you'd like to reproduce the results. The good thing about doing it this way is that you can add more regexes to firstFindtheMatching(). It helps you keep things separate.
def main():
    filenames = "05-08-17.pdf Test.pdf 04-08-17.pdf 08-09-16.pdf 08-09-2016.pdf some-all-letters.pdf"
    matchednames, alllist = firstFindtheMatching(filenames)
    print(matchednames, alllist)
    notcommon = set(filenames.split()) - set(matchednames.split())
    print(notcommon)

if __name__ == '__main__':
    main()
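For comparison, here is a short sketch of two other ways to get the non-matching names, using the list from the question. The first tests each name separately with re.fullmatch and keeps the ones that fail. The second keeps the negative lookahead but anchors it: (?!...) on its own only says "the format does not start here", so .* can still match from any later position, which is why the unanchored attempt found the whole list. The function names are made up for this example.

```python
import re

DATE_NAME = re.compile(r"\d\d-\d\d-\d\d\.pdf")
# Lookahead variant: reject a name only if the whole name is in the date format
NOT_DATE_NAME = re.compile(r"(?!\d\d-\d\d-\d\d\.pdf$).+")


def non_matching(names):
    """Keep the names that are not exactly MM-DD-YY.pdf."""
    return [name for name in names if not DATE_NAME.fullmatch(name)]


def non_matching_lookahead(names):
    """Same result, using the anchored negative lookahead."""
    return [name for name in names if NOT_DATE_NAME.fullmatch(name)]


names = "05-17-17.pdf Test.pdf 05-48-2017.pdf 03-14-17.pdf".split()
print(non_matching(names))            # ['Test.pdf', '05-48-2017.pdf']
print(non_matching_lookahead(names))  # ['Test.pdf', '05-48-2017.pdf']
```

Neither version checks that the month and day are valid numbers; both only check the shape of the name.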

Regular expression for data scraping?

I'm over-complicating this simple project, but I am trying to learn more about Python, so I thought of this simple app that involves scraping the movie times of all current movies based on the movies listed on google showtimes.
The location is irrelevant, because it pulls up all current movies. I have the code to scrape all the data in the <span class=info></span> tag, but it obviously extracts the length of the movie along with a ton of other HTML data. I only want the movie times.
I am assuming to extract just the movie times, I need some sort of regular expression.
Here is a small snippet of what part of the text information looks like
<span class=info>‎2hr 3min‎‎ - Rated PG-13&#8
I need the hour and the min, nothing else. What is the best way to go about parsing this data from this line of text?
You could use a regular expression here, yes. BeautifulSoup will give you a unicode value when you extract the tag text:
>>> soup = BeautifulSoup('''<span class=info>‎2hr 3min‎‎ - Rated PG-13&#8''')
>>> soup.span.get_text()
u'\u200e2hr 3min\u200e\u200e - Rated PG-13'
The U+200e LEFT-TO-RIGHT MARK codepoints can be ignored; a regular expression can pick out the time easily enough:
import re
time_pattern = re.compile(r'(\d+)hr\s*(\d+)min')
hours, minutes = time_pattern.search(soup.span.get_text()).groups()
where the two \d+ groups match the digits in front of the hr and min text respectively, with optional whitespace between the two parts.
This produces:
>>> time_pattern = re.compile(r'(\d+)hr\s*(\d+)min')
>>> hours, minutes = time_pattern.search(soup.span.get_text()).groups()
>>> hours
u'2'
>>> minutes
u'3'
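If some of the spans might not contain a runtime at all, search() returns None and calling .groups() on it raises an AttributeError. A small wrapper, sketched here with the made-up name parse_runtime, guards against that and converts the strings to integers. It assumes the text always has both an hr and a min part when a runtime is present.

```python
import re

RUNTIME = re.compile(r"(\d+)hr\s*(\d+)min")


def parse_runtime(text):
    """Return (hours, minutes) as ints, or None if no runtime is found."""
    match = RUNTIME.search(text)
    if match is None:
        return None
    hours, minutes = match.groups()
    return int(hours), int(minutes)


print(parse_runtime("\u200e2hr 3min\u200e\u200e - Rated PG-13"))  # (2, 3)
print(parse_runtime("Rated PG-13"))  # None
```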

Maximum capacity on re module? Python

I've been using Python for web scraping. Everything worked like a well-oiled gear until I used it to get the description of a product, which is a really large description.
So it's not working at all, as if my regex were incorrect. Sadly, I cannot tell you which website I'm scraping in order to show you the real example, but I know that the regex is OK. It's something like this:
descriptionRegex = r'id="this_id">(.*)</div>\s*<div\ id="another_id"'
for found in re.findall(descriptionRegex, response):
    print(found)
The thing is that (.*) is 25000+ characters long.
Is there a limit on the number of characters a re.findall() match can contain? Is there any way I can achieve this?
You need to specify re.DOTALL in your call to .findall(). By default, . does not match a newline, so (.*) cannot span a description that runs over several lines.
If you run this program, it will behave as you request:
import re

response = '''id="this_id">
blah
</div> <div id="another_id"'''
descriptionRegex = r'id="this_id">(.*)</div>\s*<div\ id="another_id"'
for found in re.findall(descriptionRegex, response, re.DOTALL):
    print(found)
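To make the effect of the flag visible, this sketch runs the same pattern with and without re.DOTALL on a made-up multi-line response. It also uses the non-greedy (.*?), which is an addition not in the code above: if the page held several such blocks, the greedy form would capture from the first opening to the last closing.

```python
import re

# Hypothetical response text with a description that spans several lines
response = 'id="this_id">\nline one\nline two\n</div> <div id="another_id"'
pattern = r'id="this_id">(.*?)</div>\s*<div id="another_id"'

without_flag = re.findall(pattern, response)
with_flag = re.findall(pattern, response, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['\nline one\nline two\n']
```

The length of the description is not the issue here; the newlines inside it are.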
