How to scrape data from JSON/JavaScript of a web page? - python

I'm new to Python; I just got started with it today.
My environment is Python 3.5 with some libraries on Windows 10.
I want to extract football player data from the site below as a CSV file.
Problem: I cannot extract the data from soup.find_all('script')[17] into my expected CSV format. How can I extract that data?
My code is shown below.
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, 'html.parser')  # not sure if I need to use lxml
soup.find_all('script')[17]  # My target data is in the 17th script tag
My expected output would be similar to this:
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik

As @Josiah Swain said, it's not going to be pretty. For this sort of thing JavaScript is usually the recommendation, since it can natively evaluate what you have.
That said, Python is awesome, and here is your solution!
# Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
# And one more
import json

# The code you had
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, 'html.parser')

# Store the script
script = soup.find_all('script')[17]

# Extract the one line that stores all that JSON
uncleanJson = [line for line in script.text.split('\n')
               if line.lstrip().startswith('squad.register_players($.parseJSON')][0]

# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
    .replace('squad.register_players($.parseJSON(\'', '') \
    .replace('\'));', '')

# Extract the useful info
data = [[p['position'], p['data']['slot_position'], p['data']['slug']]
        for p in json.loads(cleanJSON)
        if p['player'] is not None]

print('position,slot_position,slug')
for line in data:
    print(','.join(line))
The result I get from copying and pasting this into Python is:
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
Edit: On reflection, this is not the easiest code for a beginner to read. Here is an easier-to-read version:
# ... All that previous code
script = soup.find_all('script')[17]

allScriptLines = script.text.split('\n')

uncleanJson = None
for line in allScriptLines:
    # Remove left whitespace (makes it easier to parse)
    cleaner_line = line.lstrip()
    if cleaner_line.startswith('squad.register_players($.parseJSON'):
        uncleanJson = cleaner_line

cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));', '')

print('position,slot_position,slug')
for player in json.loads(cleanJSON):
    if player['player'] is not None:
        # sep=',' keeps the output comma-separated like the CSV above
        print(player['position'], player['data']['slot_position'],
              player['data']['slug'], sep=',')
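Since the goal was a CSV file rather than printed lines, here is a minimal sketch that writes the same rows with Python's csv module; the filename players.csv is my own choice:
import csv

# Assumes cleanJSON from the code above; 'players.csv' is a hypothetical name.
with open('players.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['position', 'slot_position', 'slug'])
    for player in json.loads(cleanJSON):
        if player['player'] is not None:
            writer.writerow([player['position'],
                             player['data']['slot_position'],
                             player['data']['slug']])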

My understanding is that BeautifulSoup is better for HTML parsing, but here you are trying to parse JavaScript nested inside the HTML.
So you have two options:
Simply create a function that takes the result of soup.find_all('script')[17], then loops over and searches the string manually for the data and extracts it. You can even use ast.literal_eval(string_thats_really_a_dictionary) to make it even easier (see the sketch after this list). This may not be the best approach, but if you are new to Python you might want to do it this way just for practice.
Use the json library, as in the answer above. This is probably the better way to go.
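For illustration, a minimal sketch of the first option; the sample line is a made-up stand-in for what the real script tag contains:
import ast

# Hypothetical line, standing in for what the real <script> tag holds.
line = 'squad.register_players($.parseJSON(\'[{"position": "ST"}]\'));'
start = line.find("('") + 2   # just past the opening ('
end = line.rfind("')")        # the closing ')
players = ast.literal_eval(line[start:end])
print(players[0]['position'])  # ST
# Note: literal_eval chokes on JSON's true/false/null, so json.loads is
# the safer choice for real-world payloads.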

Related

Scraping csv file from url with React script

I want to scrape the sample_info.csv file from https://depmap.org/portal/download/.
Since there is a React script on the website, it's not straightforward to access the file via an appropriate tag with BeautifulSoup. I approached this from many angles, and the one that gave me the best results looks like the code below: it returns the executed script in which all downloadable files are listed together with other data. My idea was then to strip the tags and store the information as JSON. However, I think there must be some kind of mistake in the data, because it is impossible to parse it as JSON.
import json
import requests
from bs4 import BeautifulSoup

url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
all_scripts = soup.find_all('script')
script = str(all_scripts[32])
last_char_index = script.rfind("}]")
first_char_index = script.find("[{")
script_cleaned = script[first_char_index:last_char_index + 2]
script_json = json.loads(script_cleaned)
This code gives me an error:
JSONDecodeError: Extra data: line 1 column 7250 (char 7249)
I know my solution might not be elegant, but it got me closest to the goal, i.e. downloading the sample_info.csv file from the website. I am not sure how to proceed here. Are there other options? I tried Selenium, but that solution is not feasible for the end user of my script because of the driver path declaration.
It is probably easier in this context to use regular expressions, since the string is invalid JSON.
This RegEx tool (https://pythex.org/) can be useful for testing expressions.
import re
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', script_cleaned)
#[
# ('https://ndownloader.figshare.com/files/26261524', 'CCLE_gene_cn.csv'),
# ('https://ndownloader.figshare.com/files/26261527', 'CCLE_mutations.csv'),
# ('https://ndownloader.figshare.com/files/26261293', 'Achilles_gene_effect.csv'),
# ('https://ndownloader.figshare.com/files/26261569', 'sample_info.csv'),
# ('https://ndownloader.figshare.com/files/26261476', 'CCLE_expression.csv'),
# ('https://ndownloader.figshare.com/files/17741420', 'primary_replicate_collapsed_logfold_change_v2.csv'),
# ('https://gygi.med.harvard.edu/publications/ccle', 'protein_quant_current_normalized.csv'),
# ('https://ndownloader.figshare.com/files/13515395', 'D2_combined_gene_dep_scores.csv')
# ]
Edit: This also works by passing html_content directly (no need for BeautifulSoup):
url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', html_content)
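To actually fetch sample_info.csv from there, a minimal sketch that filters the matches and streams the file to disk, assuming the (downloadUrl, fileName) tuple order shown above:
import re
import requests

url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
matches = re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', html_content)

# Pick out sample_info.csv and stream it to disk.
for download_url, file_name in matches:
    if file_name == 'sample_info.csv':
        with requests.get(download_url, stream=True) as resp:
            resp.raise_for_status()
            with open(file_name, 'wb') as f:
                for chunk in resp.iter_content(chunk_size=8192):
                    f.write(chunk)
        break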

Scraping data from JSON after using requests

I am trying to extract specific data from a requested JSON file.
After passing Authorization and using requests.get, I got my response. I think it is called a dictionary by Python coders and JSON by JavaScript coders.
It contains much more information than I need, and I would like to extract only one or two fields,
for example {"bio": "hello world"}.
The JSON file contains more than one "bio": for example, I am scraping 100 accounts and I would like to extract all the "bio" values in one go.
So I tried this:
from bs4 import BeautifulSoup
import requests

headers = {"Authorization": "xxxx"}
req = requests.get('website', headers=headers)
data = req.text
soup = BeautifulSoup(data, 'html.parser')
titles = soup.find_all('span', {'class': 'bio'})
for title in titles:
    print(title.text)
This didn't work, and I tried multiple other ideas with no success.
If possible, please write code that I can understand, since I am trying to learn from my mistakes.
Thanks.
The Aphid library I created is perfect for this.
From the command prompt:
py -m pip install Aphid
Then it's just as easy as loading your JSON data and searching it with Aphid.
import json
import requests
import Aphid

resp = requests.get(yoururl)
data = json.loads(resp.text)
results = Aphid.findall(data, 'bio')
results is now a list of (key, value) tuples, one for every occurrence of the 'bio' key.
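So printing just the bios is a matter of:
for key, value in results:
    print(value)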
After you get your response, either:
you get a plain JSON file (in which case you load it into Python using the json library), or
you get an HTML file from which you can extract the JSON code (using BeautifulSoup), which in turn you parse with the json library.
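For the first case, if you would rather not depend on a third-party library, here is a minimal sketch of the same kind of recursive key search; the URL and token are placeholders:
import requests

def find_all_values(obj, target_key):
    # Walk dicts and lists recursively, collecting every value under target_key.
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target_key:
                found.append(value)
            found.extend(find_all_values(value, target_key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_all_values(item, target_key))
    return found

resp = requests.get('https://example.com/api',  # placeholder URL
                    headers={'Authorization': 'xxxx'})
bios = find_all_values(resp.json(), 'bio')
print(bios)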

How do I get the HTML of a website using Python 3?

I've been trying to do this with repl.it and have tried several solutions from this site, but none of them work. Right now, my code looks like this:
import urllib
url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
print (urllib.urlopen(url).read())
but it just says "AttributeError: module 'urllib' has no attribute 'urlopen'".
If I add import urllib.urlopen, it tells me there's no module named that. How can I fix my problem?
The syntax you are using for the urllib library is from Python v2. The library has changed somewhat for Python v3. The new notation would look something more like:
import urllib.request
response = urllib.request.urlopen("http://www.google.com")
html = response.read()
The html object is the returned HTML of the site as a bytes object (call html.decode('utf-8') if you need a str). Much like with the original urllib library, you should not expect images or other data files to be included in this returned object.
The confusing part here is that, in Python 3, this would fail if you did:
import urllib
response = urllib.request.urlopen("http://www.google.com")
html = response.read()
This strange module-importing behavior is, I am told, as intended and working. BUT it is non-intuitive and awkward. More importantly, for you, it makes the situation harder to debug. Enjoy.
Python 3:
import urllib
import requests  # importing requests happens to load urllib.request as a side effect, which is why the next line works

url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
r = urllib.request.urlopen(url).read()
print(r)
or
import urllib.request
url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
r = urllib.request.urlopen(url).read()
print(r)
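Since requests is imported in the first snippet anyway, the same fetch is simpler with it directly (a sketch):
import requests

url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
r = requests.get(url)
print(r.text)  # .text is the decoded body; use r.content for the raw bytes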

Problems crawling wordreference

I am trying to crawl wordreference, but I am not succeeding.
The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract the first two meanings of a given word, so from this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is using curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much.
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
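A minimal sketch of that urllib2 route, which is just the original code plus the header:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # spoofed UA
doc = lh.parse(urllib2.urlopen(req))
trans = doc.xpath('//td[@class="ToWrd"]/text()')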
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera

How do I print a line following a line containing certain text in a saved file in Python?

I have written a Python program to find the carrier of a cell phone given its number. It downloads the source of http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1 (where 1112223333 is the phone number to look up) and saves it as carrier.html. In the source, the carrier is on the line after the <div class="carrier_result"> tag.
My program currently searches the file and finds the line containing the div tag, but now I need a way to store the next line after that as a string. My current code is: http://pastebin.com/MSDN0vbC
What you really want to be doing is parsing the HTML properly. Use the BeautifulSoup library - it's wonderful at doing so.
Sample code:
import urllib2, BeautifulSoup
opener = urllib2.build_opener()
opener.addheaders[0] = ('User-agent', 'Mozilla/5.1')
response = opener.open('http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1').read()
bs = BeautifulSoup.BeautifulSoup(response)
print bs.findAll('div', attrs={'class': 'carrier_result'})[0].next.strip()
You should be using an HTML parser such as BeautifulSoup or lxml instead.
To get the next line, you can use:
htmlsource = open('carrier.html', 'r')
for line in htmlsource:
    if '<div class="carrier_result">' in line:
        nextline = htmlsource.next()
        print nextline
A "better" way is to split on </div>, then get the things you want, as sometimes the stuff you want can occur all in one line. So using next() if give wrong result.eg
data=open("carrier.html").read().split("</div>")
for item in data:
if '<div class="carrier_result">' in item:
print item.split('<div class="carrier_result">')[-1].strip()
By the way, if possible, try to use Python's own web modules, like urllib or urllib2, instead of calling an external wget.
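For example, a minimal sketch that replaces the wget step with urllib2; the variable names are my own:
import urllib2

number = '1112223333'  # the phone number to look up
url = ('http://www.whitepages.com/carrier_lookup?carrier=other'
       '&number_0=' + number + '&response=1')
html = urllib2.urlopen(url).read()
with open('carrier.html', 'w') as f:
    f.write(html)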
