Pulling data from xml page into .txt - python

I'm trying to pull just the keywords from the XML output at:
http://clients1.google.com/complete/search?hl=en&output=toolbar&q=test+a
I have put together the code below, but I don't get any errors or any output. Any ideas?
import urllib2 as ur
import re

f = ur.urlopen(u'http://clients1.google.com/complete/search?hl=en&output=toolbar&q=test+a')
res = f.readlines()
for d in res:
    data = re.findall('<CompleteSuggestion><\/CompleteSuggestion>', d)
    for i in data:
        print i
        file = open("keywords.txt", "a")
        file.write(i + '\n')
        file.close()
I am trying to:
Fetch the XML from the given URL
Store the list of keywords from the XML, parsed using regex
Thanks,

from urllib2 import urlopen
import re

xml_url = u'http://clients1.google.com/complete/search?hl=en&output=toolbar&q=test+a'
xml_file_contents = urlopen(xml_url).readlines()

keywords_file = open("keywords.txt", "a")
for entry in xml_file_contents:
    output = "\n".join(re.findall('data=\"([^\"]*)', entry))
    print output
    keywords_file.write(output + '\n')
keywords_file.close()
output:
test anxiety
test america
test adobe flash
test automation
test act
test alternator
test and set
test adblock
test adobe shockwave
test automation tools
Let me know in case of any doubt
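If you'd rather avoid the regex, a minimal sketch using the standard-library XML parser would be the following (it assumes, as the regex above does, that the keyword text lives in the data attribute of each suggestion element):

import urllib2
import xml.etree.ElementTree as ET

xml_url = u'http://clients1.google.com/complete/search?hl=en&output=toolbar&q=test+a'
root = ET.fromstring(urllib2.urlopen(xml_url).read())

# Each CompleteSuggestion wraps a suggestion element carrying a data attribute
with open("keywords.txt", "a") as keywords_file:
    for suggestion in root.iter('suggestion'):
        keyword = suggestion.get('data')
        print keyword
        keywords_file.write(keyword.encode('utf-8') + '\n')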

Related

Pulling info from an api url

I'm trying to pull the average of temperatures from this API for a bunch of different ZIP codes. I can currently do so by manually changing the ZIP code in the API URL, but I was hoping to loop through a list of ZIP codes, or ask for input, and use those instead.
However, I'm rather new and have no idea how to add variables to a link, or maybe I'm overcomplicating it. Basically, I'm looking for a way to add a variable to the link, or something to the same effect, so I can change it whenever I want.
import urllib.request
import json
out = open("output.txt", "w")
link = "http://api.openweathermap.org/data/2.5/weather?zip={zip-code},us&appid={api-key}"
print(link)
x = urllib.request.urlopen(link)
url = x.read()
out.write(str(url, 'utf-8'))
returnJson = json.loads(url)
print('\n')
print(returnJson["main"]["temp"])
import urllib.request
import json

zipCodes = ['123', '231', '121']

out = open("output.txt", "w")
for i in zipCodes:
    link = "http://api.openweathermap.org/data/2.5/weather?zip=" + i + ",us&appid={api-key}"
    x = urllib.request.urlopen(link)
    url = x.read()
    out.write(str(url, 'utf-8'))
    returnJson = json.loads(url)
    print(returnJson["main"]["temp"])
out.close()
You can achieve what you want by looping through a list of zipcodes and creating a new URL from them.
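If the string concatenation feels awkward, the URL can also be built with an f-string; api_key below is a placeholder you would replace with your own key:

import urllib.request
import json

api_key = "{api-key}"  # placeholder: substitute your own OpenWeatherMap API key

for zip_code in ['123', '231', '121']:
    # f-string interpolation keeps the query string readable
    link = f"http://api.openweathermap.org/data/2.5/weather?zip={zip_code},us&appid={api_key}"
    with urllib.request.urlopen(link) as response:
        data = json.loads(response.read())
    print(data["main"]["temp"])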

PyPDF4 not reading certain characters

I'm compiling some data for a project and I've been using PyPDF4 to read this data from its source PDF file, but I've been having trouble with certain characters not showing up correctly. Here's my code:
from PyPDF4 import PdfFileReader
import pandas as pd
import numpy as np
import os
import xml.etree.cElementTree as ET

# File name
pdf_path = "PV-9-2020-10-23-RCV_FR.pdf"
# Results storage
results = {}
# Start page
page = 5
# Lambda to assign votes
serify = lambda voters, vote: pd.Series({voter.strip(): vote for voter in voters})

with open(pdf_path, 'rb') as f:
    # Get PDF reader for PDF file f
    pdf = PdfFileReader(f)
    while page < pdf.numPages:
        # Get text of page in PDF
        text = pdf.getPage(page).extractText()
        proposal = text.split("\n+\n")[0].split("\n")[3]
        # Collect all relevant pages
        while text.find("\n0\n") == -1:
            page += 1
            text += "\n".join(pdf.getPage(page).extractText().split("\n")[3:])
        # Remove corrections
        text, corrections = text.split("CORRECCIONES")
        # Grab relevant text !!! This is where the missing characters show up.
        text = "\n, ".join([n[:n.rindex("\n")] for n in text.split("\n:")])
        for_list = "".join(text[text.index("\n+\n")+3:text.index("\n-\n")].split("\n")[:-1]).split(", ")
        nay_list = "".join(text[text.index("\n-\n")+3:text.index("\n0\n")].split("\n")[:-1]).split(", ")
        abs_list = "".join(text[text.index("\n0\n")+3:].split("\n")[:-1]).split(", ")
        # Store data in results
        results.update({proposal: dict(pd.concat([serify(for_list, 1), serify(nay_list, -1), serify(abs_list, 0)]).items())})
        page += 1
        print(page)

results = pd.DataFrame(results)
The characters I'm having difficulty with don't show up in the text extracted using extractText. Ždanoka, for instance, becomes "danoka, and Štefanec becomes -tefanc. Most of the affected characters seem to be Eastern European, which makes me think I need one of the Latin decoders.
I've looked through some of PyPDF4's capabilities, and it seems to have plenty of relevant codecs, including latin1. I've attempted decoding the file using different functions from the PyPDF4.generic.codecs module, and either the characters still don't show up, or the code throws an error at an unrecognised byte.
I haven't yet attempted using multiple codecs on different bytes from the same file, since that seems like it would take some time. Am I missing something in my code that could easily fix this? Or is it more likely that I'll have to tailor a solution using PyPDF4's functions?
Use pypdf instead of PyPDF2/PyPDF3/PyPDF4. You will need to apply the migrations.
pypdf received a lot of updates in December 2022, especially to its text extraction.
To give you a minimal full example for text extraction:
from pypdf import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())
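If you want to keep the page-by-page structure of your original script, a rough sketch of the same loop on top of pypdf (file name and start page taken from your snippet; the parsing logic itself stays as you wrote it) could look like:

from pypdf import PdfReader

pdf_path = "PV-9-2020-10-23-RCV_FR.pdf"  # file name from the original snippet
reader = PdfReader(pdf_path)

# Iterate from page 5 onwards, as in the original script
for page_number in range(5, len(reader.pages)):
    text = reader.pages[page_number].extract_text()
    # ...apply the same splitting/parsing logic as before to `text`
    print(page_number, len(text))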

After reading urls from a text file, how can I save all the responses into separate files?

I have a script that reads URLs from a text file, performs a request, and then saves all the responses in one text file. How can I save each response in a different text file instead? For example, if my text file labeled input.txt has 20 URLs, I would like to save the responses in 20 different .txt files like output1.txt, output2.txt, and so on, so that the response for each request is saved in a new .txt file. Thank you
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
                print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue
        ...
        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
            print(data)
You can make a new file by using open('something.txt', 'w'). If the file already exists, it'll erase its content; otherwise, it'll create a new file named 'something.txt'. Then you can use file.write() to write your info!
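For example:

with open('something.txt', 'w') as file:
    # 'w' truncates an existing file or creates a new one
    file.write('your info here\n')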
I'm not sure if I understood your problem right.
I would create a list, create an object for each URL request and response, add the objects to the list, and then write each object to a different file.
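A minimal sketch of that idea, with the collection step separated from the writing step (the url/body field names are just illustrative):

import requests

# Collect one small record per URL
results = []
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        results.append({'url': line, 'body': response.text})

# Write each record to its own file
for i, result in enumerate(results, start=1):
    with open(f'output{i}.txt', 'w') as f_out:
        f_out.write(result['url'] + "\n" + result['body'])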
There are at least two ways you could generate a file name for each URL. One, shown below, is to create a hash of some unique piece of data from the page. In this case I chose the category text, but you could also hash the whole contents of the page. This creates a unique string to use as a file name, so that two links with the same category text don't overwrite each other when saved.
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves, since data on the Internet should not be trusted.
Here's your code with a hash used for the filename (the snippet uses SHA-256; MD5 is not a secure hashing function for passwords, but either is safe for creating unique filenames).
Updated Snippet
import hashlib
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
                print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about capturing the contents of a page at different points in time, hash the whole contents of the page. That way, if anything within the page changes, the previous contents aren't lost. In this case, I hash both the URL and the page contents and concatenate the hashes, URL hash first and contents hash second, so that all versions of a page sit next to each other when the directory is sorted.
for category in categories:
    data = line + "," + category.text
    # Hash of the link URL
    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))
    # Hash of the page contents at this point in time
    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))
    # URL hash first, contents hash second, so versions of a page sort together
    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
        print(data)

Python : Extract exact word from a url

I started learning Python 2 days ago, and I'm trying to write a script that extracts some data from a URL and saves it. The problem is that I only want to extract a specific value from a long line.
Example:
{"2019-11-19":{"period":"2019-11-19T00:00:00+00:00","uniqs":"344627","hits":"0","clicked":"4922","pay":126.52971186,"currency":"RON","subs":0},"2019-11-20":{"period":"2019-11-20T00:00:00+00:00","uniqs":"1569983","hits":"0","clicked":"15621","pay":358.43100342,"currency":"RON","subs":0},"2019-11-21":{"period":"2019-11-21T00:00:00+00:00","uniqs":"1699844","hits":"0","clicked":"16172","pay":363.15667371,"currency":"RON","subs":0},"2019-11-22":{"period":"2019-11-22T00:00:00+00:00","uniqs":"1779319","hits":"0","clicked":"17865","pay":384.67092962,"currency":"RON","subs":0},"2019-11-23":{"period":"2019-11-23T00:00:00+00:00","uniqs":"1825346","hits":"0","clicked":"17740","pay":356.72833095,"currency":"RON","subs":0},"2019-11-24":{"period":"2019-11-24T00:00:00+00:00","uniqs":"1732639","hits":"0","clicked":"16870","pay":308.4201041,"currency":"RON","subs":0},"2019-11-25":{"period":"2019-11-25T00:00:00+00:00","uniqs":"1826060","hits":"0","clicked":"17991","pay":346.29137133,"currency":"RON","subs":0},"2019-11-26":{"period":"2019-11-26T00:00:00+00:00","uniqs":"1873961","hits":"0","clicked":"18645","pay":379.17652358,"currency":"RON","subs":0},"2019-11-27":{"period":"2019-11-27T00:00:00+00:00","uniqs":"1734207","hits":"0","clicked":"16187","pay":251.91152953,"currency":"RON","subs":0},"2019-11-28":{"period":"2019-11-28T00:00:00+00:00","uniqs":"1611611","hits":"0","clicked":"12056","pay":158.96447829,"currency":"RON","subs":0},"2019-11-29":{"period":"2019-11-29T00:00:00+00:00","uniqs":"712011","hits":"0","clicked":"6242","pay":85.70053418,"currency":"RON","subs":0},"2019-11-30":{"period":"2019-11-30T00:00:00+00:00","uniqs":"47957","hits":"0","clicked":"427","pay":8.32775435,"currency":"RON","subs":0},"2019-12-01":{"period":"2019-12-01T00:00:00+00:00","uniqs":"1268892","hits":"0","clicked":"11779","pay":217.42321168,"currency":"RON","subs":0},"2019-12-02":{"period":"2019-12-02T00:00:00+00:00","uniqs":"1130724","hits":"0","clicked":"10694","pay":195.44476902,"currency":"RON","subs":0},"2019-12-03":{"period":"2019-12-03T00:00:00+00:00","uniqs":"1058965","hits":"0","clicked":"8123","pay":151.05243751,"currency":"RON","subs":0},"2019-12-04":{"period":"2019-12-04T00:00:00+00:00","uniqs":"1228326","hits":"0","clicked":"12230","pay":230.84154581,"currency":"RON","subs":0},"2019-12-05":{"period":"2019-12-05T00:00:00+00:00","uniqs":"1181029","hits":"0","clicked":"11467","pay":196.21644271,"currency":"RON","subs":0},"2019-12-06":{"period":"2019-12-06T00:00:00+00:00","uniqs":"951828","hits":"0","clicked":"9379","pay":153.35155293,"currency":"RON","subs":0},"2019-12-07":{"period":"2019-12-07T00:00:00+00:00","uniqs":"1172156","hits":"0","clicked":"11776","pay":181.65819439,"currency":"RON","subs":0},"2019-12-08":{"period":"2019-12-08T00:00:00+00:00","uniqs":"912109","hits":"0","clicked":"9240","pay":147.6364827,"currency":"RON","subs":0}}
I'm trying to extract the value after "pay": and save it to a file; after that I'll write the code that calculates the total amount and gives me the result :D I've spent a day on this so far :D
I use this code to extract and save the data from the link:
from urllib.request import urlopen as uReq
url1 = 'http://link.com'
page = uReq(url1).read().decode()
f = open("dataNEW.txt", "w")
f.write(page)
f.close()
but the problem is that it writes all the details there; I want to save only what comes after "pay".
That string is in JSON format, which can easily be converted to a Python data structure using the json package. Here is an example:
import json
from urllib.request import urlopen as uReq

url1 = 'http://link.com'
page = uReq(url1).read().decode()
data = json.loads(page)

with open("dataNEW.txt", "w") as f:
    for sub_dict in data.values():
        f.write("{}\n".format(sub_dict["pay"]))
Your dataNEW.txt should then look like the following:
358.43100342
363.15667371
384.67092962
356.72833095
126.52971186
346.29137133
379.17652358
251.91152953
158.96447829
85.70053418
8.32775435
147.6364827
153.35155293
181.65819439
308.4201041
196.21644271
230.84154581
151.05243751
195.44476902
217.42321168
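Since you also want to calculate the total afterwards, a minimal sketch of that step (reusing the data dictionary from the snippet above) could be:

# Sum the "pay" values across all dates and report the currency given in the data
total = sum(day["pay"] for day in data.values())
currency = next(iter(data.values()))["currency"]
print("Total: {:.2f} {}".format(total, currency))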

Is there a way to get the exact data needed from my Python Script

My Python script fetches data from the website 'http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448' into a text file.
Now my aim is to filter out: 'Header', 'Details', 'FromDateTime', 'UpToDateTime' and 'Updated'.
I have tried BeautifulSoup with a text-specific search, but I'm not there yet; the code below shows that attempt. Any help would be much appreciated :) Sorry if I missed something obvious.
'''
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import csv
import operator
from numpy import *

# Collect and parse first page
page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup)

for script in soup(["Header", "Details", "Updated", "UpToDateTime", "FromDateTime"]):
    script.extract()

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

f1 = open("data.txt", "r")
resultFile = open("out.csv", "wb")
wr = csv.writer(resultFile, quotechar=',')
'''
I expect a CSV with the columns "Header", "Details", "Updated", "UpToDateTime", "FromDateTime".
You are going about this the wrong way. You don't need BeautifulSoup for this task: your API returns data as JSON, and BeautifulSoup is best suited to HTML. For your purpose you can use the pandas and json libraries.
Pandas can read directly from a web resource as well, but since you only want the ResponseData part of the JSON, you need both libraries.
Here is a snippet you can use:
import pandas as pd
import requests
import json
page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
data = json.loads(page.text)
df = pd.DataFrame(data["ResponseData"])
df.to_csv("file path")
Change the file path and you will get the whole data set inside the CSV.
But if you want to drop a column or do any other manipulation of the data, you can do that with the pandas DataFrame as well; it is a very powerful library, and it's well worth reading up on.
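For example, a minimal sketch that keeps only the fields you mentioned (column names assumed from the question; they must match the keys inside ResponseData exactly):

import json
import pandas as pd
import requests

page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
data = json.loads(page.text)

df = pd.DataFrame(data["ResponseData"])
# Keep only the columns of interest
df = df[["Header", "Details", "Updated", "UpToDateTime", "FromDateTime"]]
df.to_csv("out.csv", index=False)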
