Save every scraped element in a loop to a json file - python

I am web scraping a website. When I scrape one URL, I write the results to a dict. What I want to do is write every dictionary to a single JSON file. With the loop below, the file is saved not as a list but as back-to-back objects ({} {}), which is not valid JSON.
df_price_m = {}
with open(r"C:\Users\USER\Desktop\diploma\information.json", 'w', encoding='utf8') as fout:
    row = 0
    for url in data:
        row += 1
        driver.get(url)
        user_name_xpath = "//h1[@itemprop='name' and @data-shmid='profilePrepName']"
        user_name = get_elements(user_name_xpath)
        user_about_xpath = "//*[@class='desktop-profile-page__about-text']"
        user_about = get_elements(user_about_xpath)
        df_info['id'] = url
        df_info['user_name'] = user_name[0]
        df_info['user_about'] = user_about[0]
        json.dump(df_price_m, fout, ensure_ascii=False)
I get the following JSON:
{"id": "www.aina.com", user_name: "Aina Nurma", "user_about": "I am a student"}
{"id": "www.aina.ru", user_name: "Aina Nur", "user_about": "I am a teacher"}

It looks like you're missing some code, but I'd suggest collecting all of the data as a list of dicts and dumping it once at the end, rather than dumping to the file after processing each URL. See the sketch below.
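A minimal sketch of that approach, assuming the question's data iterable, Selenium driver, and get_elements helper are defined as in the original code:

import json

results = []  # one dict per scraped URL
for url in data:
    driver.get(url)
    user_name = get_elements("//h1[@itemprop='name' and @data-shmid='profilePrepName']")
    user_about = get_elements("//*[@class='desktop-profile-page__about-text']")
    results.append({
        'id': url,
        'user_name': user_name[0],
        'user_about': user_about[0],
    })

# One dump at the end produces a single valid JSON array.
with open(r"C:\Users\USER\Desktop\diploma\information.json", 'w', encoding='utf8') as fout:
    json.dump(results, fout, ensure_ascii=False, indent=4)

With a single json.dump of the whole list, the file contains one JSON array instead of concatenated objects.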

Related

Python loop through table of URL's and get data from those websites

I'm trying to loop through a table that has all the websites I want to get JSON data from.
import urllib.request
import json

def getResponse(url):
    operUrl = urllib.request.urlopen(url)
    if operUrl.getcode() == 200:
        data = operUrl.read()
        jsonData = json.loads(data)
    else:
        print("Error receiving data", operUrl.getcode())
    return jsonData

def main():
    urlData = ("site1.com")
    # Needs to loop all the URLs inside
    # urlData = ["site1.com", "site2.com"] and so on
    jsonData = getResponse(urlData)
    for i in jsonData["descriptions"]:
        description = f'{i["groups"][0]["variables"][0]["content"]}'
        data = {'mushrooms': [{'description': description}]}
        with open('data.json', 'w') as f:
            json.dump(data, f, ensure_ascii=False)
        print(json.dumps(data, indent=4, ensure_ascii=False))
After running, it saves the data into data.json, and here is what that looks like:
{
    "mushrooms": [
        {
            "description": "example how it looks"
        }
    ]
}
It does get the data from the one site, but I want it to loop through multiple URLs stored in a table (list).
EDIT:
I got it working by looping like this:
for url in urlData:
with all my website links in a list urlData, appending the data found from each site to another list, and dumping that list to a JSON file once the loop is done. A sketch of this follows.
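A minimal sketch of that working approach, reusing the question's getResponse function; the URLs here are placeholders from the question's comment:

import urllib.request
import json

def getResponse(url):
    operUrl = urllib.request.urlopen(url)
    if operUrl.getcode() == 200:
        return json.loads(operUrl.read())
    print("Error receiving data", operUrl.getcode())
    return None

urlData = ["site1.com", "site2.com"]  # placeholder list of URLs
out = []
for url in urlData:
    jsonData = getResponse(url)
    if jsonData is None:
        continue
    for i in jsonData["descriptions"]:
        description = i["groups"][0]["variables"][0]["content"]
        out.append({'mushrooms': [{'description': description}]})

# Dump everything once, after the loop, so the file holds one JSON array.
with open('data.json', 'w') as f:
    json.dump(out, f, ensure_ascii=False, indent=4)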

Output non JSON data from regex web scraping to a JSON file

I'm using requests and regex to scrape data from an entire website and then save it to a JSON file hosted on GitHub, so I and anyone else can access the data from other devices.
The first thing I tried was simply opening every single page on the website and getting all the data I want, but I found that to be unnecessary. So I decided to make two scripts: the first finds the URL of every page on the site, and the second scrapes a given URL when called. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
    "Console": "/neo-geo-aes",
    "Call ID": "62815",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
    "Console": "/neo-geo-cd",
    "Call ID": "62817",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
    "Console": "/neo-geo-pocket-color",
    "Call ID": "62578",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
    "Console": "/playstation",
    "Call ID": "62580",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution, here's the code in question:
import re
import requests
import json

## The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text

## Find all system URLs
dataUrl = re.findall(r'(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)

## For each item (number of consoles) find games
for i in range(len(dataUrl)):
    ## Make console URL
    newUrl = "https://www.pricecharting.com/console" + dataUrl[i]
    req = requests.get(newUrl)
    newHtml = req.text
    ## Get item URLs
    urlOne = re.findall(r'(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall(r'(?<=tr id="product-).*(?=" data)', newHtml)
    ## For every item in list (items per console)
    out_list = []
    for i in range(len(urlOne)):
        ## Make item URL
        itemUrl = "https://www.pricecharting.com/game" + urlOne[i]
        callId = itemId[i]
        ## Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)
    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
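The pattern that produces the concatenated }{ output is the repeated dump in append ('a') mode, once per console. One possible fix (a sketch, keeping the question's variable names, not necessarily the poster's eventual solution) is to accumulate a single list across all consoles and write it once; it also renames the inner loop variable so the console index i is not clobbered:

import re
import requests
import json

URL = "https://www.pricecharting.com/"
dataUrl = re.findall(r'(?<=<li><a href="\/console).*(?=">)', requests.get(URL).text)

out_list = []  # one list shared across all consoles
for i in range(len(dataUrl)):
    newHtml = requests.get("https://www.pricecharting.com/console" + dataUrl[i]).text
    urlOne = re.findall(r'(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall(r'(?<=tr id="product-).*(?=" data)', newHtml)
    for j in range(len(urlOne)):  # j, so the console index i stays intact
        out_list.append({
            'Console': dataUrl[i],
            'Call ID': itemId[j],
            'URL': "https://www.pricecharting.com/game" + urlOne[j],
        })

# 'w' instead of 'a': the file then holds exactly one JSON array.
with open('docs/result.json', 'w') as data_json_file:
    json.dump(out_list, data_json_file, indent=4)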

Loop that's supposed to append data into separate text files for each API query is duplicating the data across those files

I have a list of websites that I'm running an API query on. I also have a separate list that stores a shortened version of these websites, which I want to use as the name of the text file that the data from the API query gets appended to.
What I want is, for each website, to append the data from the query into a file named website1.txt, and so on. Instead, all the data from website1, website2, and website3 is getting appended to website1.txt, website2.txt, etc. All the text files end up with the same data, instead of each file getting its own.
Here's my code:
import json
import requests

list_of_websites = ['api.cats.com/animaldata', 'api.elephants.com/animaldata', 'api.dogs.com/animaldata']
name_of_websites = ['cats data', 'elephants data', 'dogs data']

for website in list_of_websites:
    counter = 1
    while True:
        response = requests.get(f"https://api.superComputer.info?page={counter}",
                                headers={"Accept": "application/.v3+json", "Authorization": "Bearer 123456"},
                                params={"site": f"{website}"})
        if response.json():
            for site in name_of_websites:
                file_name = f"{site}.txt"
                f = open(file_name, "a")
                f.write(json.dumps(response.json(), indent=4))
            counter += 1
        else:
            break
Since there is a 1-1 mapping between website URLs and website names, you can zip them and iterate over the pairs.
And since all the pages for one website go into a single file, you can open that file just once, outside the while loop.
import json
import requests

list_of_websites = ['api.cats.com/animaldata', 'api.elephants.com/animaldata', 'api.dogs.com/animaldata']
name_of_websites = ['cats data', 'elephants data', 'dogs data']

for website, website_name in zip(list_of_websites, name_of_websites):
    counter = 1
    file_name = f"{website_name}.txt"
    with open(file_name, 'a') as f:
        while True:
            response = requests.get("https://api.superComputer.info",
                                    headers={"Accept": "application/.v3+json", "Authorization": "Bearer 123456"},
                                    params={"page": counter, "site": f"{website}"})
            if response.ok:
                f.write(json.dumps(response.json(), indent=4))
                counter += 1
            else:
                break
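Note that the success check also changed from if response.json(): to if response.ok:, so paging stops at the first non-2xx status code rather than at the first empty JSON body.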

JSON file not exporting correctly from Python string array in a for loop

I am making a website scraper that scrapes websites and looks for specific keywords; if it finds a keyword, it marks the website as productive or unproductive, and then exports that info into a JSON file so I can read it from C# later. The problem is that the JSON export is not working correctly, and I am new to both Python and JSON.
I have tried every syntax I could find, but nothing works the way I want it to.
This is my Python code:
from bs4 import BeautifulSoup
import requests
import json
import os
import pandas as pd
import numpy as np

# this scrapes the websites that i give it
def scrap_website():
    pages = ['https://www.youtube.com/watch?v=tHI2NIaNrGk',
             'https://aljazeera.com', 'https://www.svt.se']
    for site in pages:
        page = requests.get(site)
        soup = BeautifulSoup(page.content, 'html.parser')
        if 'Game' in soup.getText():
            is_productive = False
            json_map = {}
            json_map["websiteLink"] = site
            json_map["isProductive"] = is_productive
            json_text = json.dumps(json_map)
        else:
            is_productive = True
            json_map = {}
            json_map["websiteLink"] = site
            json_map["isProductive"] = is_productive
            json_text = json.dumps(json_map)
        data = []
        data.append(json_text)
        with open('data\\data.json', 'a') as json_file:
            json.dump(data, json_file, indent=2, separators=(", ", " "), sort_keys=True)

scrap_website()
This is the JSON output I am getting:
[
"{\"websiteLink\": \"https://www.youtube.com/watch?v=tHI2NIaNrGk\", \"isProductive\": false}"
][
"{\"websiteLink\": \"https://aljazeera.com\", \"isProductive\": true}"
][
"{\"websiteLink\": \"https://www.svt.se\", \"isProductive\": true}"
]
You can put all the nodes in the same JSON file by declaring an array and appending each website as a node of that array.
Declare the array before the loop:
json_map = []
For each website, build a node and append it:
site_node = {}
site_node["websiteLink"] = site
site_node["isProductive"] = is_productive
json_map.append(site_node)
Finally, save the JSON once, outside of the loop:
with open('data.json', 'w') as outFile:
    json.dump(json_map, outFile)
Afterwards you can load the JSON array back and loop over it with a simple for.
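Putting those pieces together with the question's scraper, a minimal sketch (keeping the question's page list and keyword check):

from bs4 import BeautifulSoup
import requests
import json

def scrap_website():
    pages = ['https://www.youtube.com/watch?v=tHI2NIaNrGk',
             'https://aljazeera.com', 'https://www.svt.se']
    json_map = []  # the array that collects one node per site
    for site in pages:
        page = requests.get(site)
        soup = BeautifulSoup(page.content, 'html.parser')
        site_node = {
            "websiteLink": site,
            "isProductive": 'Game' not in soup.getText(),  # same logic as the if/else
        }
        json_map.append(site_node)
    # dump the dicts themselves, once, in 'w' mode
    with open('data.json', 'w') as out_file:
        json.dump(json_map, out_file, indent=2)

scrap_website()

Dumping the dicts themselves, rather than strings already produced by an earlier json.dumps, is what removes the escaped quotes (\") from the file.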

How to save scraped JSON data with key-value pairs into a JSON file using Python

I scraped a site for data and was able to print the desired output in JSON format, but containing only values. What I actually need is the data with both key and value pairs, saved into output.json format so I can insert it into my Django database. Here is what I have done so far:
import requests
import json
URL = 'http://tfda.go.tz/portal/en/trader_module/trader_module/getRegisteredDrugs_products'
payload = "draw=1&columns%5B0%5D%5Bdata%5D=no&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=True&columns%5B0%5D%5Borderable%5D=True&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B1%5D%5Bdata%5D=certificate_no&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=True&columns%5B1%5D%5Borderable%5D=True&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B2%5D%5Bdata%5D=brand_name&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=True&columns%5B2%5D%5Borderable%5D=True&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B3%5D%5Bdata%5D=classification_name&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=True&columns%5B3%5D%5Borderable%5D=True&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B4%5D%5Bdata%5D=common_name&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=True&columns%5B4%5D%5Borderable%5D=True&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B5%5D%5Bdata%5D=dosage_form&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=True&columns%5B5%5D%5Borderable%5D=True&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B6%5D%5Bdata%5D=product_strength&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=True&columns%5B6%5D%5Borderable%5D=True&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B7%5D%5Bdata%5D=registrant&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=True&columns%5B7%5D%5Borderable%5D=True&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B8%5D%5Bdata%5D=registrant_country&columns%5B8%5D%5Bname%5D=&columns%5B8%5D%5Bsearchable%5D=True&columns%5B8%5D%5Borderable%5D=True&columns%5B8%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B8%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B9%5D%5Bdata%5D=manufacturer&columns%5B9%5D%5Bname%5D=&columns%5B9%5D%5Bsearchable%5D=True&columns%5B9%5D%5Borderable%5D=True&columns%5B9%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B9%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B10%5D%5Bdata%5D=manufacturer_country&columns%5B10%5D%5Bname%5D=&columns%5B10%5D%5Bsearchable%5D=True&columns%5B10%5D%5Borderable%5D=True&columns%5B10%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B10%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B11%5D%5Bdata%5D=expiry_date&columns%5B11%5D%5Bname%5D=&columns%5B11%5D%5Bsearchable%5D=True&columns%5B11%5D%5Borderable%5D=True&columns%5B11%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B11%5D%5Bsearch%5D%5Bregex%5D=False&columns%5B12%5D%5Bdata%5D=id&columns%5B12%5D%5Bname%5D=&columns%5B12%5D%5Bsearchable%5D=True&columns%5B12%5D%5Borderable%5D=True&columns%5B12%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B12%5D%5Bsearch%5D%5Bregex%5D=False&order%5B0%5D%5Bcolumn%5D=0&order%5B0%5D%5Bdir%5D=asc&start=0&length=3911&search%5Bvalue%5D=&search%5Bregex%5D=False"
with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    s.headers.update({'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'})
    res = s.post(URL, data=payload)
    for data in res.json()['data']:
        serial = data['no']
        certno = data['certificate_no']
        brndname = data['brand_name']
        clssification = data['classification_name']
        common_name = data['common_name']
        dosage_form = data['dosage_form']
        expiry_date = data['expiry_date']
        manufacturer = data['manufacturer']
        manufacturer_country = data['manufacturer_country']
        product_strength = data['product_strength']
        registrant = data['registrant']
        registrant_country = data['registrant_country']
        output = (serial, certno, brndname, clssification, common_name, dosage_form, expiry_date,
                  manufacturer, manufacturer_country, product_strength, registrant, registrant_country)
        my_list = output
        json_str = json.dumps(my_list)
        print(json_str)
My output (screenshot omitted) showed JSON arrays of bare values, without the keys.
So how do I approach this?
Use json.dump:
with open(path, 'w') as file:
    [...]
    json.dump(myPythonList, file)
    file.write('\n')
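To actually keep the keys, build a list of dicts rather than a tuple of bare values before dumping. A sketch using the question's field names (URL and payload as defined in the question):

import requests
import json

records = []
with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0"
    s.headers["Content-Type"] = "application/x-www-form-urlencoded; charset=UTF-8"
    res = s.post(URL, data=payload)
    for data in res.json()['data']:
        # a dict comprehension keeps every field name as a JSON key
        records.append({key: data[key] for key in (
            'no', 'certificate_no', 'brand_name', 'classification_name',
            'common_name', 'dosage_form', 'expiry_date', 'manufacturer',
            'manufacturer_country', 'product_strength', 'registrant',
            'registrant_country')})

# one dump produces a single JSON array of key/value records
with open('output.json', 'w') as out_file:
    json.dump(records, out_file, indent=4)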
