Regular expression for data scraping?

Regular expression for data scraping? - python

I'm over-complicating this simple project, but I am trying to learn more about Python, so I thought of this simple app that involves scraping the movie times of all current movies based on the movies listed on google showtimes.
The location is irrelevant, because it pulls up all current movies. I have the code to scrap all the data in the <span class=info></span> tag, but it obviously extracts the length of the movie along with a ton of other html data. I only want the movie times.
I am assuming to extract just the movie times, I need some sort of regular expression.
Here is a small snippet of what part of the text information looks like
<span class=info>‎2hr 3min‎‎ - Rated PG-13&#8
I need the hour and the min, nothing else. What is the best way to go about parsing this data from this line of text?

You could use a regular expression here, yes. BeautifulSoup will give you a unicode value when you extract the tag text:
>>> soup = BeautifulSoup('''<span class=info>‎2hr 3min‎‎ - Rated PG-13&#8''')
>>> soup.span.get_text()
u'\u200e2hr 3min\u200e\u200e - Rated PG-13'
The U+200e LEFT-TO-RIGHT MARK codepoints can be ignored, a regular expression can pick out the time easy enough:
import re
time_pattern = re.compile(r'(\d+)hr\s*(\d+)min')
hours, minutes = time_pattern.search(soup.span.get_text()).groups()
where the two \d+ groups match digits followed by hr and min text respectively, separated by whitespace.
This produces:
>>> time_pattern = re.compile(r'(\d+)hr\s*(\d+)min')
>>> hours, minutes = time_pattern.search(soup.span.get_text()).groups()
>>> hours
u'2'
>>> minutes
u'3'

Related

python3.6 How do I regex a url from a .txt?

I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there anyway I could grab everything after the = char up until the . in '.net'?
Could I use the re module?

text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com
code below catches all urls in text and returns urls in list."""
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]

i dont have much information but i will try to help with what i got im assuming that URL= is part of the string in that case you can do this
re.findall(r'URL=(.*?).', STRINGNAMEHERE)
Let me go more into detail about (.*?) the dot means Any character (except newline character) the star means zero or more occurences and the ? is hard to explain but heres an example from the docs "Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’." the brackets place it all into a group. All this togethear basicallly means it will find everything inbettween URL= and .

You don't need RegEx'es (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course that this is just going to extract one URL. Assuming that you're working with a large text, where you'd want to extract all URLs, you'll want to put this logic into a function so that you can reuse it, and build around it (achieve iteration via the while or for loops, and, depending on how you're iterating, keep track of the position of the last extracted URL and so on).
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.

Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])

How to extract questions from a word doc with Python using regex

I am using docx library to read files from a word doc, I am trying to extract only the questions using regex search and match. I found infinite ways of doing it but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the word doc into a different type of file, that'll be great to know for feedback. Thank you
I am using regex 101, I've tried the following regex expressions to match only the sentences that end in a question mark
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document
wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
print(result.group(0))
for table in wordDoc.tables:
for row in table.rows:
for cell in row.cells:
print("test")
I expect to save the matching patterns into directories so I can export the data to a csv file

Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as second parameter of the findall() method. If I remember correctly, this is done by first extracting all the paragraphs, and then extracting the text of the individual paragraphs. Refer to this question.
FYI, the way you would do this for a simple text file is the following:
# Open file
f = open('test.txt', 'r')
# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct, because although logically it makes sense to match only sentences that end on a ?, one of your matches is place to pay your rent. Will my financial aid pay for housing?, for example. Only the second part of that sentence is an actual question. So discard any lower case letters. Your regex should be something like:
[A-Z].*\?$

How to search for multiple multi-word phrases in pandas?

I have some JSON data converted into a Pandas DataFrame. I am looking to find all columns whose string content matches a list of multi word phrases.
I am working with a massive amount of Twitter JSON data already downloaded for public use (so Twitter API usage is not applicable). This JSON is converted into a Pandas DataFrame. One of the columns available is, text which the body of the tweet. An example is
We’re kicking off the first portion of a citywide traffic calming project to make residential streets more safe & pedestrian-friendly, next week!
Tuesday, July 30 at 10:30 AM
Nautilus Drive and 42 Street
I want to be able to have a list of phrases, phrases = ["We're kicking off", "we're starting", "we're initiating"] and do something like pd[pd['text'].str.contains(phrases)]] to ensure that I can obtain pandas DataFrame rows whose text column contains one of the phrases.
This is perhaps asking too much, but ideally I would also be able to match something like phrases = ["(We're| we are) kicking off", "(we're | we are) starting", "(we're| we are) initiating"]

Make a list with keywords or phrases you want to match, i have put on logic for perfect match, you can change it by changing regex. Also it will capture by which keywords was the text caught.
Here is the code -
for i in range(len(mustkeywords)):
for index in range(len(text)):
result = re.search(r'\s*\b'+mustkeywords[i]+r'\W\s*', text[index])
if result:
commentlist.append(text[index])
keywordlist.append(mustkeywords[i])
tempmustkeywordsdf=pd.DataFrame(columns={"Comments"},data=commentlist) #temp df for keywords
tempmustkeywordsdf["Keywords"]=keywordlist #adding keywords column to this df
Here mustkeywords is a list that contains your phrases or keywords
.text is a string that contains all the data/phrases that you want to check keywords into.
and tempmustkeywordsdf is that contains matched strings and keywords that matched them.
I hope this helps.

Input an item and gets its REGEX in python

I'm trying to make a stand alone application using Python and Tkinter.
My work is to get all similar looking product IDs from a excel sheet using Python. I got similar looking products for a particular company XYZ.
The code goes like this
IDs = df1['A'].str.extract(r'\b(\d{8}s\d{2})\b' , expand = False).dropna().tolist()
This helps extract all items which have "8 Number followed by s followed by 2 more numbers" like 01234567s12 or 98765432s23
But i want to do something opposite that is input the product ID and get its regex.
The product ID can be anything say ABC123456 or C234-D456
So is there a code which can help me get the regex ?

what you could do is generate regex according to pattern recognition:
6numbers 2letter 2symbol 4 numbers would be :
\d{6} .{2} \S{2} \d{4}
i do not know if this a good practice like this
but atleast you will have regex thats get generated.
the regex :
https://regex101.com/r/HPPAAm/1

I used re module to do this .
import re
text ="12345678S00"
y=""
for i in range(0,len(text)):
r=re.match('[a-zA-Z]',text[i])
if r!=None:
y+='s'
r=re.match('[0-9]',text[i])
if r!=None:
y+='\d'
r=re.match('[.,_=&*()%^#$!#-]',text[i])
if r!=None:
y+='\S'
\d\d\d\d\d\d\d\ds\d\d #output

Get num of page with beautifulsoup

i want to get the number of pages in the next code html:
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ" class="outputText marginLeft0punto5">1</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ" class="outputText marginLeft0punto5">37</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ" class="outputText marginLeft0punto5">736</span>
The goal is get the number 1, 37 and 736
My problem is that i don't know how define the line to extract the numbers, for example for the number 1:
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
first_page = int(soup.find('span', {'id': 'viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ'}).getText())
Thanks so much
EDIT: Finally i found a solution with Selenium:
numpag = int(driver.find_element_by_xpath('//*[#id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ"]').text)
pagtotal = int(driver.find_element_by_xpath('//*[#id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ"]').text)
totaltotal = int(driver.find_element_by_xpath('//*[#id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ"]').text)
Thanks #abarnert, sorry for the caos in my question, it was my first post =)

The code you provided already works for the example you provided.
My guess is that your problem is that it doesn't work for any other page, probably because those id values are different each time.
If that's the case, you need to look at (or show us) multiple different outputs to figure out if there's a recognizable pattern that you can match with a regular expression or a function full of string operations or whatever. See Searching the tree in the docs for the different kinds of filters you can use.
As a wild guess, that Z7 and AVEQAI930OBRD02JPMTPG21004 are replaced by different strings of capitals and digits each time, but the rest of the format is always the same? If so, there are some pretty obvious regular expressions you can use:
rnumpag = re.compile(r'.*:form1:textfooterInfoNumPagMAQ')
rtotalpagina = re.compile(r'.*:form1:textfooterInfoTotalPaginaMAQ')
rtotaltotal = re.compile(r'.*:form1:textfooterTotalTotalMAQ')
numpag = int(soup.find('span', id=rnumpag).string)
totalpagina = int(soup.find('span', id=rtotalpagina).string)
totaltotal = int(soup.find('span', id=rtotaltotal).string)
This works on your provided example, and would also work on a different page that had different strings of characters within the part we're matching with .*.
And, even if my wild guess was wrong, this should show you how to write a search for whatever you actually do have to search for.
As a side note, you were using the undocumented legacy function getText(). This implies that you're copying and pasting ancient BS3 code. Don't do that. Some of it will work with BS4, even when it isn't documented to (as in this case), but it's still a bad idea. It's like trying to run Python 2 source code with Python 3 without understanding the differences.
What you want here is either get_text(), string, or text, and you should look at what all three of these mean in the docs to understand the difference—but here, the only thing within the tag is a text string, so they all happen to do the same thing.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regular expression for data scraping? - python

Related

python3.6 How do I regex a url from a .txt?

How to extract questions from a word doc with Python using regex

How to search for multiple multi-word phrases in pandas?

Input an item and gets its REGEX in python

Get num of page with beautifulsoup

Categories

Resources