Extract string from <script> with BeautifulSoup - python

I'm trying to create a Python script to extract some information from a webmail site, and I want to follow a redirection.
My code:
import mechanize
import cookielib

cj = cookielib.LWPCookieJar()
br1 = mechanize.Browser()
br1.set_handle_robots(False)
br1.set_cookiejar(cj)
br1.open("LOGIN URL")
br1.select_form(nr=0)
br1.form['username'] = mail_site
br1.form['password'] = pw_site
res1 = br1.submit()
html = res1.read()
print html
The result is not what I expect: it contains only a redirection script.
I've seen that I have to extract information from this script to follow the redirection.
So, in my case, I have to extract the jsessionid from a script.
The script is:
<script>
function redir(){
window.self.location.replace('/webmail/en_EN/continue.html;jsessionid=1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA?MESSAGE=NO_COOKIE&DT=1&URL_VALID=welcome.html');
return true;
}
</script>
If I'm not wrong, I have to build a regex.
I've tried many things, but with no results.
Does anyone have an idea?

import re

get_jsession = re.search(r'jsessionid=([A-Za-z0-9.]+)', script_)
print(get_jsession.group(1))
# 1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA
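To actually follow the redirection, you can also pull the whole relative URL out of the replace('...') call, not just the session id. A minimal sketch, using the script text quoted above (the final br1.open call is only indicated, since the real host isn't known here):

```python
import re

# the redirection script returned by the webmail (quoted above)
script_ = """<script>
function redir(){
window.self.location.replace('/webmail/en_EN/continue.html;jsessionid=1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA?MESSAGE=NO_COOKIE&DT=1&URL_VALID=welcome.html');
return true;
}
</script>"""

# grab the full relative URL inside replace('...')
redirect_path = re.search(r"replace\('([^']+)'\)", script_).group(1)
# grab just the session id
jsessionid = re.search(r'jsessionid=([A-Za-z0-9.]+)', script_).group(1)

print(redirect_path)
print(jsessionid)
# the path could then be followed with br1.open('https://WEBMAIL_HOST' + redirect_path)
```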

Related

How can I enable PDF page breaks from HTML, maybe using a marker in the source HTML file?

I am using pdfkit to create a PDF from an HTML file, like so:
import pdfkit
pdfkit.from_file([source], target + '.pdf')
I create the HTML file myself before doing this conversion.
What I'm now trying to do is find a way to implement a page break.
The HTML file doesn't use page breaks because, well, it's basic HTML.
But PDFs are paged structures.
So how can I pick up something in the HTML as a marker, and then use that to implement a page break in the PDF?
Of course, pdfkit.from_file([source], target + '.pdf') is a single line with no parsing of the content, so I don't see how I could tell it what to look for.
Any ideas?
EDIT
With some advice from @Nathanial below, I've added this to my CSS:
@media print {
    h2 {
        page-break-before: always;
    }
}
But pdfkit.from_file([source], target + '.pdf') doesn't seem to pick it up.
Opening the html file in the browser and printing to PDF works perfectly. so this is more of a pdfkit issue.
Found a similar question here:
How to insert a page break in HTML so wkhtmltopdf parses it?
I think the pdfkit wrapper for wkhtmltopdf is limited.
On the command line, this works perfectly:
wkhtmltopdf --print-media-type 10100005.html 10100005.pdf
But how do I replicate that in Python? It's not my first choice to shell out to the OS... :/
After some fiddling, this worked for me. I'm putting this here to help the next person.
Thanks @Nathaniel Flick for pointing me to media print and print-only styles.
Example 11 on this page also helped
https://www.programcreek.com/python/example/100586/pdfkit.from_file
In the style sheet
@media print {
    h2 {
        page-break-before: always;
    }
}
Then in the python code
pdfkit_options = {
    'print-media-type': '',
}
>>> print (source)
c:/users/maxcot/desktop/Reports/10100001.html
>>> print (target)
c:/users/maxcot/desktop/Reports/10100001.pdf
>>> print (pdfkit_options)
{'print-media-type': ''}
pdfkit.from_file(source, target, options=pdfkit_options)
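If the stylesheet can't be edited separately, the @media print rule can also be injected into the HTML string before conversion. A minimal sketch, where the HTML string and file name are hypothetical, and the final pdfkit call needs wkhtmltopdf installed so it is left commented out:

```python
# inject the @media print rule into the <head> before converting;
# the HTML below is a made-up stand-in for the generated report
page_break_css = """<style>
@media print {
  h2 { page-break-before: always; }
}
</style>
"""

html = ("<html><head><title>Report</title></head>"
        "<body><h2>Section 1</h2><h2>Section 2</h2></body></html>")
html_with_css = html.replace("</head>", page_break_css + "</head>")

# with wkhtmltopdf installed, the conversion would then be:
# import pdfkit
# pdfkit.from_string(html_with_css, 'report.pdf', options={'print-media-type': ''})
```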

How to see the output after getting the data in web scraping?

I typed some code to get some data from a site (scraping) and I got the result I want (the results are just numbers).
The question is: how can I show the results as output on my website? I am using Python and HTML in VS Code. Here is the scraping code:
import requests
from bs4 import BeautifulSoup

getpage = requests.get('https://www.worldometers.info/coronavirus/country/austria/')
getpage_soup = BeautifulSoup(getpage.text, 'html.parser')

Get_Total_Deaths_Recoverd_Cases = getpage_soup.findAll('div', {'class': 'maincounter-number'})
for para in Get_Total_Deaths_Recoverd_Cases:
    print(para.text)

Get_Active_And_Closed = BeautifulSoup(getpage.text, 'html.parser')
All_Numbers2 = Get_Active_And_Closed.findAll('div', {'class': 'number-table-main'})
for para2 in All_Numbers2:
    print(para2.text)
and these are the results that I want to show on the website:
577,007
9,687
535,798
31,522
545,485
I don't know how to describe this solution but it is a solution nonetheless:
import os

lst = [
    577.007, 9.687, 535.798, 31.522, 545.485
]

p = ''
for num in lst:
    p += f'<p>{num}</p>\n'

html = f"""
<!doctype html>
<html>
<head>
<title>Test</title>
</head>
<body>
{p}
</body>
</html>
"""

with open('web.html', 'w') as file:
    file.write(html)

os.startfile('web.html')
You don't necessarily have to use os.startfile, but I hope you get the idea: you now have a web.html file that you can display on your website. You can also take a normal HTML document, set it as the html variable, and put {p} wherever you need it. Alternatively, put a placeholder such as {%%} somewhere in your HTML (which is what Django uses), read the whole file, replace the placeholder with your value, and write it back.
In the end this is a solution, but depending on what you use, e.g. a framework such as Django or Flask, there are easier ways to do it there.
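The placeholder idea from the last paragraph can be sketched with nothing but the standard library; the template string and the {numbers} placeholder name here are made up for illustration:

```python
# hypothetical template with a {numbers} placeholder (str.format style)
template = """<!doctype html>
<html>
<head><title>Covid numbers</title></head>
<body>
{numbers}
</body>
</html>"""

# the scraped values, kept as the formatted strings the scraper printed
scraped = ['577,007', '9,687', '535,798', '31,522', '545,485']
paragraphs = '\n'.join(f'<p>{n}</p>' for n in scraped)

# fill the placeholder and write the finished page
page = template.format(numbers=paragraphs)
with open('web.html', 'w') as f:
    f.write(page)
```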

Scraping csv file from url with React script

I want to scrape the sample_info.csv file from https://depmap.org/portal/download/.
Since there is a React script on the website, it's not that straightforward with BeautifulSoup and accessing the file via an appropriate tag. I approached this from many angles; the one that gave me the best results looks like this, and it returns the executed script where all downloadable files are listed together with other data. My idea then was to strip the tags and store the information as JSON. However, I think there must be some kind of mistake in the data, because it is impossible to store it as JSON.
url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
all_scripts = soup.find_all('script')
script = str(all_scripts[32])
last_char_index = script.rfind("}]")
first_char_index = script.find("[{")
script_cleaned = script[first_char_index:last_char_index+2]
script_json = json.loads(script_cleaned)
This code gives me an error
JSONDecodeError: Extra data: line 1 column 7250 (char 7249)
I know that my solution might not be elegant, but it took me closest to the goal, i.e. downloading the sample_info.csv file from the website. I'm not sure how to proceed here. Are there other options? I tried selenium, but that solution won't be feasible for the end user of my script due to the driver path declaration.
It is probably easier in this context to use regular expressions, since the string is invalid JSON.
This RegEx tool (https://pythex.org/) can be useful for testing expressions.
import re
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', script_cleaned)
#[
# ('https://ndownloader.figshare.com/files/26261524', 'CCLE_gene_cn.csv'),
# ('https://ndownloader.figshare.com/files/26261527', 'CCLE_mutations.csv'),
# ('https://ndownloader.figshare.com/files/26261293', 'Achilles_gene_effect.csv'),
# ('https://ndownloader.figshare.com/files/26261569', 'sample_info.csv'),
# ('https://ndownloader.figshare.com/files/26261476', 'CCLE_expression.csv'),
# ('https://ndownloader.figshare.com/files/17741420', 'primary_replicate_collapsed_logfold_change_v2.csv'),
# ('https://gygi.med.harvard.edu/publications/ccle', 'protein_quant_current_normalized.csv'),
# ('https://ndownloader.figshare.com/files/13515395', 'D2_combined_gene_dep_scores.csv')
# ]
Edit: This also works by passing html_content directly (no need for BeautifulSoup):
url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', html_content)
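Once the (downloadUrl, fileName) pairs are extracted, picking out sample_info.csv is a one-liner. A sketch against a stand-in snippet of the script text (the real page content is longer, but the two fields matched by the regex are the same):

```python
import re

# stand-in for the real script text; only the matched fields matter here
script_cleaned = (
    '[{"downloadUrl": "https://ndownloader.figshare.com/files/26261524", '
    '"size": "30 MB", "fileName": "CCLE_gene_cn.csv"}, '
    '{"downloadUrl": "https://ndownloader.figshare.com/files/26261569", '
    '"size": "1 MB", "fileName": "sample_info.csv"}]'
)

pairs = re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', script_cleaned)
sample_info_url = next(url for url, name in pairs if name == 'sample_info.csv')
print(sample_info_url)
# the file itself could then be fetched with requests.get(sample_info_url)
```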

Executing Javascript on a webpage with python requests

I am not sure if this is possible, but let me try to explain.
I am trying to post data from a form, but before my data gets posted the website encrypts some of it with a public key that I am able to retrieve from response.text.
I found the JavaScript that is used:
var myVal = 123;
n = ClassName.create(publicKey);
n.encrypt(myVal);
The .encrypt call returns the string that is passed to the form. My question is: can I somehow bring that JavaScript into my script, so I can execute the .encrypt method and pass the result properly to the form?
If the script is simple, I would use PyExecJS:
import execjs

js_cmd = '''
function add(x, y){
    return x + y;
}
'''
cxt = execjs.compile(js_cmd)
print(cxt.eval("add(3,4)"))

Python find tag in XML using wildcard

I have this line in my python script:
url = tree.find("//video/products/product/read_only_info/read_only_value[@key='storeURL-GB']")
but sometimes the storeURL-GB key changes its last two country-code letters, so I am trying something like this, but it doesn't work:
url = tree.find("//video/products/product/read_only_info/read_only_value[@key='storeURL-\.*']")
Any suggestions please?
You should probably use .xpath() and starts-with():
urls = tree.xpath("//video/products/product/read_only_info/read_only_value[starts-with(@key, 'storeURL-')]")
if urls:
    url = urls[0]
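If lxml (and hence full XPath) isn't available, the same wildcard match can be done with the standard library's ElementTree by filtering attributes in Python, since ElementTree's limited XPath has no starts-with(). The XML below is a made-up skeleton shaped like the path in the question:

```python
import xml.etree.ElementTree as ET

# hypothetical document matching the path from the question
xml = """<root>
<video><products><product><read_only_info>
  <read_only_value key="storeURL-GB">https://store.example/gb</read_only_value>
  <read_only_value key="storeURL-FR">https://store.example/fr</read_only_value>
  <read_only_value key="other">ignored</read_only_value>
</read_only_info></product></products></video>
</root>"""

tree = ET.fromstring(xml)
# ElementTree can't do starts-with() in a predicate, so filter in Python
urls = [el.text for el in tree.iter('read_only_value')
        if el.get('key', '').startswith('storeURL-')]
print(urls)
```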
