I'm trying to scrape data from a website into HDFS. At first the scraping itself was working well, but after I added the lines for storing the data into HDFS it stopped working:
import requests
from pathlib import Path
import os
from datetime import date
from hdfs import InsecureClient
date = date.today()
date

def downloadFile(link, destfolder):
    r = requests.get(link, stream=True)
    filename = "datanew1" + str(date) + ".xls"
    downloaded_file = open(os.path.join(destfolder, filename), 'wb')
    client = InsecureClient('http://hdfs-namenode.default.svc.cluster.local:50070', user='hdfs')
    with client.download('/data/test.csv')
    for chunk in r.iter_content(chunk_size=256):
        if chunk:
            downloaded_file.write(chunk)

link = "https://api.worldbank.org/v2/fr/indicator/FP.CPI.TOTL.ZG?downloadformat=excel"
Path('http://hdfs-namenode.default.svc.cluster.local:50070/data').mkdir(parents=True, exist_ok=True)
downloadFile(link, 'http://hdfs-namenode.default.svc.cluster.local:50070/data')
There is no error in the code, but I can't find the scraped data anywhere!
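For what it's worth, Path(...).mkdir() and open(os.path.join(...)) only ever touch the local filesystem, so passing the namenode URL as a destination folder never reaches HDFS, and client.download() copies files out of HDFS rather than into it. Below is a minimal sketch of one way to stream the download straight into HDFS, assuming the hdfs (HdfsCLI) library's client.write() API and the endpoint from the question; the download_to_hdfs name is just for illustration:

import requests
from datetime import date
from hdfs import InsecureClient

# Assumptions: the WebHDFS endpoint and user from the question, and HdfsCLI's
# streaming client.write() API; adjust the path and chunk size as needed.
client = InsecureClient('http://hdfs-namenode.default.svc.cluster.local:50070', user='hdfs')

def download_to_hdfs(link, destfolder):
    filename = "datanew1" + str(date.today()) + ".xls"
    hdfs_path = destfolder + "/" + filename
    client.makedirs(destfolder)              # create the directory in HDFS, not locally
    r = requests.get(link, stream=True)
    with client.write(hdfs_path, overwrite=True) as writer:
        for chunk in r.iter_content(chunk_size=256):
            if chunk:
                writer.write(chunk)

link = "https://api.worldbank.org/v2/fr/indicator/FP.CPI.TOTL.ZG?downloadformat=excel"
download_to_hdfs(link, '/data')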
I am trying to write a short Python program that downloads a copy of the XML jail roster for the local county, saves that file, scrapes and saves all the names and image links in a CSV file, then downloads each of the photos with the file name being the person's name.
I've managed to get the XML file, save it locally, and create the csv file. I was briefly able to write the full xml tag (tag and attribute) to the csv file, but can't seem to get just the attribute, or the image links.
from datetime import datetime
from datetime import date
import requests
import csv
import bs4 as bs
from bs4 import BeautifulSoup

# get current date
today = date.today()
# convert date to date-sort format
d1 = today.strftime("%Y-%m-%d")

# create filename variable
roster = 'jailroster' + '-' + d1 + '-dev' + '.xml'

# grab xml file from server
url = "fakepath.xml"
print("ATTEMPTING TO GET XML FILE FROM SERVER")
req_xml = requests.get(url)
print("Response code:", req_xml.status_code)
if req_xml.status_code == 200:
    print("XML file downloaded at ", datetime.now())
    soup = BeautifulSoup(req_xml.content, 'lxml')

# save xml file from get locally
with open(roster, 'wb') as file:
    file.write(req_xml.content)
    print('Saving local copy of XML as:', roster)

# read xml data from saved copy
infile = open(roster, 'r')
contents = infile.read()
soup = bs.BeautifulSoup(contents, 'lxml')

# variables needed for image list
images = soup.findAll('image1')
fname = soup.findAll('nf')
mname = soup.findAll('nm')
lname = soup.findAll('nl')
baseurl = 'fakepath.com'

with open('image-list.csv', 'w', newline='') as csvfile:
    imagelist = csv.writer(csvfile, delimiter=',')
    print('Image list being created')
    imagelist.writerows(images['src'])
I've gone through about a half dozen tutorials trying to figure all this out, but I think this is at the edge of what I've been able to learn so far, and I haven't even started trying to figure out how to save the list of images as files. Can anyone help out with a pointer or two, or point me towards tutorials on this?
Update: No this is not for a mugshot site or any unethical purposes. This data is for a private data project for a non-public public safety project.
This should get you the data you need:
from datetime import date

import requests
from bs4 import BeautifulSoup
import pandas as pd


def extractor(tag: str) -> list:
    return [i.getText() for i in soup.find_all(tag)]


url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
soup = BeautifulSoup(requests.get(url).text, features="lxml")

images = [
    f"{'https://legacyweb.randolphcountync.gov'}{i['src'].lstrip('..')}"
    for i in soup.find_all('image1')
]

df = pd.DataFrame(
    zip(extractor("nf"), extractor("nm"), extractor("nl"), images),
    columns=['First Name', 'Middle Name', 'Last Name', 'Mugshot'],
)
df.to_csv(
    f"jailroster-{date.today().strftime('%Y-%m-%d')}-dev.csv",
    index=False,
)
Sample output is a .csv file with the four columns above.
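The question also asked about saving each photo with the person's name as the file name. A minimal sketch building on the DataFrame produced above, assuming the Mugshot URLs resolve directly to image files (the mugshots folder name and .jpg extension are illustrative):

from pathlib import Path
import requests

outdir = Path("mugshots")
outdir.mkdir(exist_ok=True)

for _, row in df.iterrows():
    # build "First-Middle-Last", skipping empty middle names
    name = "-".join(filter(None, [row['First Name'], row['Middle Name'], row['Last Name']]))
    resp = requests.get(row['Mugshot'])
    if resp.status_code == 200:
        (outdir / (name + ".jpg")).write_bytes(resp.content)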
I have a script to download a PDF from the internet and save it to a specific directory; how can I go about appending the date and time to the file name?
# Import all needed modules and tools
from fileinput import filename
import os
import os.path
from datetime import datetime
import urllib.request
import requests

# Disable SSL and HTTPS Certificate Warnings
import urllib3
urllib3.disable_warnings()

resp = requests.get('url.org', verify=False)

# Get current date and time
current_datetime = datetime.now()
print("Current date & time :", current_datetime)

# Convert datetime obj to string
str_current_datetime = str(current_datetime)

# Download and name the PDF file from the URL
response = urllib.request.urlretrieve('url.pdf',
                                      filename='my directory\civil.pdf')

# Save to the preferred directory
with open("my directory\civil.pdf", 'wb') as f:
    f.write(resp.content)
Use f-strings:
open(f"file - {datetime.now().strftime('%Y-%m-%D')}.txt", "w")
# will create a new file with the title: "file - Year-Month-Date.txt"
# then you can do whatever you want with it
f-string docs
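Applied to the PDF script above, one way to fold a timestamp into the file name; this is a sketch, and the %Y-%m-%d_%H-%M-%S format is just one choice that avoids characters Windows rejects in file names:

from datetime import datetime

stamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')   # e.g. 2024-01-31_14-05-09
pdf_name = f"civil - {stamp}.pdf"

with open(pdf_name, 'wb') as f:
    f.write(resp.content)   # resp is the requests.get(...) response from the script above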
I have the below code that takes my standardized .txt file and converts it into a JSON file perfectly. The only problem is that sometimes I have over 300 files, and doing this manually (i.e. changing the number at the end of the file name and running the script) takes too long. I want to automate this. The files, as you can see, reside in one folder/directory, and I am placing the JSON files in a different folder/directory, keeping the naming convention standardized except that the output ends with .json instead of .txt; the prefixes/file names stay the same. An example would be: CRAZY_CAT_FINAL1.TXT, CRAZY_CAT_FINAL2.TXT, and so on, all the way to file 300. How can I automate this, keep the file naming convention in place, and read and output the files to different folders/directories? I have tried, but can't seem to get this to iterate. Any help would be greatly appreciated.
import glob
import time
from glob import glob
import pandas as pd
import numpy as np
import csv
import json
csvfile = open(r'C:\Users\...\...\...\Dog\CRAZY_CAT_FINAL1.txt', 'r')
jsonfile = open(r'C:\Users\...\...\...\Rat\CRAZY_CAT_FINAL1.json', 'w')
reader = csv.DictReader(csvfile)
out = json.dumps([row for row in reader])
jsonfile.write(out)
****************************************************************************
I also have this code using the Python library "requests". How do I change this code so that it uploads multiple JSON files with the standard naming convention? The files end with a number...
import requests

#function to post to api
def postData(xactData):
    url = 'http link'
    headers = {
        'Content-Type': 'application/json',
        'Content-Length': str(len(xactData)),
        'Request-Timeout': '60000'
    }
    return requests.post(url, headers=headers, data=xactData)

#read data
f = open(r'filepath/file/file.json', 'r')
data = f.read()
print(data)

# post data
result = postData(data)
print(result)
Use f-strings?
for i in range(1, 301):
    csvfile = open(rf'C:\Users\...\...\...\Dog\CRAZY_CAT_FINAL{i}.txt', 'r')
    jsonfile = open(rf'C:\Users\...\...\...\Rat\CRAZY_CAT_FINAL{i}.json', 'w')
import time
from glob import glob
import csv
import json
import os

INPATH = r'C:\Users\...\...\...\Dog'
OUTPATH = r'C:\Users\...\...\...\Rat'

for csvname in glob(INPATH + r'\*.txt'):
    jsonname = OUTPATH + '/' + os.path.basename(csvname[:-3] + 'json')
    reader = csv.DictReader(open(csvname, 'r'))
    json.dump(list(reader), open(jsonname, 'w'))
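For the second half of the question (posting the converted files), a sketch that reuses the postData function and the OUTPATH folder from the snippets above, assuming every CRAZY_CAT_FINAL*.json in that folder should be uploaded:

from glob import glob

# Post each converted JSON file; postData and OUTPATH are defined above.
for jsonname in glob(OUTPATH + r'\CRAZY_CAT_FINAL*.json'):
    with open(jsonname, 'r') as f:
        data = f.read()
    result = postData(data)
    print(jsonname, result.status_code)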
I am downloading pdf files from different URLs using a built-in API.
My end result should be to download files from each unique link (identified as links in the code below) to unique folders (folder_location in the code) on the desktop.
I am quite puzzled about how I should arrange the code to do this, as I am still a novice. So far I have tried the following.
import os
import requests
from glob import glob
import time
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = ["P167897", "P173997", "P166309"]
folder_location = "/pdf/"

for link, folder in zip(links, folder_location):
    time.sleep(10)
    end_point = f"https://search.worldbank.org/api/v2/wds?" \
                f"format=json&includepublicdocs=1&" \
                f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&" \
                f"os=0&rows=20&proid={link}&apilang=en"
    documents = requests.get(end_point).json()["documents"]
    for document_data in documents.values():
        try:
            pdf_url = document_data["pdfurl"]
            filename = os.path.join(folder, pdf_url.split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get(pdf_url).content)
EDIT: To clarify, the items in links are IDs based on which the links to the PDF files are identified via the API.
You could try using the pathlib module.
Here's how:
import os
import time
from pathlib import Path

import requests

links = ["P167897", "P173997", "P166309"]

for link in links:
    end_point = f"https://search.worldbank.org/api/v2/wds?" \
                f"format=json&includepublicdocs=1&" \
                f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&" \
                f"os=0&rows=20&proid={link}&apilang=en"
    documents = requests.get(end_point).json()["documents"]
    for document_data in documents.values():
        try:
            pdf_url = document_data["pdfurl"]
            file_path = Path(f"pdf/{link}/{pdf_url.rsplit('/')[-1]}")
            file_path.parent.mkdir(parents=True, exist_ok=True)
            with file_path.open("wb") as f:
                f.write(requests.get(pdf_url).content)
            time.sleep(10)
        except KeyError:
            continue
This outputs files to:
pdf/
└── P167897
├── Official-Documents-First-Restatement-to-the-Disbursement-Letter-for-Grant-D6810-SL-and-for-Additional-Financing-Grant-TF0B4694.pdf
└── Official-Documents-Grant-Agreement-for-Additional-Financing-Grant-TF0B4694.pdf
...
My current python script:
import ftplib
import hashlib
import httplib
import pytz
from datetime import datetime
import urllib
from pytz import timezone
import os.path, time
import glob

def ftphttp(cam_name):
    for image in glob.glob(os.path.join('/tmp/image/*.png')):
        ts = os.path.getmtime(image)
        dt = datetime.fromtimestamp(ts, pytz.utc)
        timeZone = timezone('Asia/Singapore')
        localtime = dt.astimezone(timeZone).isoformat()
        camid = cam_name(cam_name)
        tscam = camid + localtime

        ftp = ftplib.FTP('10.217.137.121', 'kevin403', 'S$ip1234')
        ftp.cwd('/var/www/html/image')

        m = hashlib.md5()
        m.update(tscam)
        dd = m.hexdigest()

        x = httplib.HTTPConnection('10.217.137.121', 8086)
        x.connect()
        f = {'ts': localtime}
        x.request('GET', '/camera/store?fn=' + dd + '&' + urllib.urlencode(f) + '&cam=' + cam_name(cam_name))
        y = x.getresponse()
        z = y.read()
        x.close()

        with open(image, 'rb') as file:
            ftp.storbinary('STOR ' + dd + '.png', file)
        ftp.quit()
Right now I'm able to send multiple files into another folder, but the data that is stored in the database is duplicated. For example, when I store 3 files into the folder, my database ends up with 6 rows via httplib. Does anybody have any idea why the data is duplicated? Help needed!
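Nothing in the posted code obviously fires twice per image, so as a first step it may help to log every outgoing store request and every FTP upload, then compare the counts with the rows in the database; a small sketch (the log path is arbitrary) that would show whether the duplication happens inside this script or in whatever calls ftphttp / on the server side:

import logging

logging.basicConfig(filename='/tmp/ftphttp.log', level=logging.INFO,
                    format='%(asctime)s %(message)s')

# inside the loop, just before x.request(...) and ftp.storbinary(...):
logging.info('HTTP store request for %s (hash %s)', image, dd)
logging.info('FTP upload for %s as %s.png', image, dd)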