Python: Extract exact word from a URL
I started learning Python two days ago, and I'm trying to write a script that extracts some data from a URL and saves it. The problem is that I only want to extract one specific piece of data from a long line.
Example:
{"2019-11-19":{"period":"2019-11-19T00:00:00+00:00","uniqs":"344627","hits":"0","clicked":"4922","pay":126.52971186,"currency":"RON","subs":0},"2019-11-20":{"period":"2019-11-20T00:00:00+00:00","uniqs":"1569983","hits":"0","clicked":"15621","pay":358.43100342,"currency":"RON","subs":0},"2019-11-21":{"period":"2019-11-21T00:00:00+00:00","uniqs":"1699844","hits":"0","clicked":"16172","pay":363.15667371,"currency":"RON","subs":0},"2019-11-22":{"period":"2019-11-22T00:00:00+00:00","uniqs":"1779319","hits":"0","clicked":"17865","pay":384.67092962,"currency":"RON","subs":0},"2019-11-23":{"period":"2019-11-23T00:00:00+00:00","uniqs":"1825346","hits":"0","clicked":"17740","pay":356.72833095,"currency":"RON","subs":0},"2019-11-24":{"period":"2019-11-24T00:00:00+00:00","uniqs":"1732639","hits":"0","clicked":"16870","pay":308.4201041,"currency":"RON","subs":0},"2019-11-25":{"period":"2019-11-25T00:00:00+00:00","uniqs":"1826060","hits":"0","clicked":"17991","pay":346.29137133,"currency":"RON","subs":0},"2019-11-26":{"period":"2019-11-26T00:00:00+00:00","uniqs":"1873961","hits":"0","clicked":"18645","pay":379.17652358,"currency":"RON","subs":0},"2019-11-27":{"period":"2019-11-27T00:00:00+00:00","uniqs":"1734207","hits":"0","clicked":"16187","pay":251.91152953,"currency":"RON","subs":0},"2019-11-28":{"period":"2019-11-28T00:00:00+00:00","uniqs":"1611611","hits":"0","clicked":"12056","pay":158.96447829,"currency":"RON","subs":0},"2019-11-29":{"period":"2019-11-29T00:00:00+00:00","uniqs":"712011","hits":"0","clicked":"6242","pay":85.70053418,"currency":"RON","subs":0},"2019-11-30":{"period":"2019-11-30T00:00:00+00:00","uniqs":"47957","hits":"0","clicked":"427","pay":8.32775435,"currency":"RON","subs":0},"2019-12-01":{"period":"2019-12-01T00:00:00+00:00","uniqs":"1268892","hits":"0","clicked":"11779","pay":217.42321168,"currency":"RON","subs":0},"2019-12-02":{"period":"2019-12-02T00:00:00+00:00","uniqs":"1130724","hits":"0","clicked":"10694","pay":195.44476902,"currency":"RON","subs":0},"2019-12-03":{"period":"2019-12-03T00:00:00+00:00","uniqs":"1058965","hits":"0","clicked":"8123","pay":151.05243751,"currency":"RON","subs":0},"2019-12-04":{"period":"2019-12-04T00:00:00+00:00","uniqs":"1228326","hits":"0","clicked":"12230","pay":230.84154581,"currency":"RON","subs":0},"2019-12-05":{"period":"2019-12-05T00:00:00+00:00","uniqs":"1181029","hits":"0","clicked":"11467","pay":196.21644271,"currency":"RON","subs":0},"2019-12-06":{"period":"2019-12-06T00:00:00+00:00","uniqs":"951828","hits":"0","clicked":"9379","pay":153.35155293,"currency":"RON","subs":0},"2019-12-07":{"period":"2019-12-07T00:00:00+00:00","uniqs":"1172156","hits":"0","clicked":"11776","pay":181.65819439,"currency":"RON","subs":0},"2019-12-08":{"period":"2019-12-08T00:00:00+00:00","uniqs":"912109","hits":"0","clicked":"9240","pay":147.6364827,"currency":"RON","subs":0}}
I'm trying to extract the value after "pay": and save it to a file; after that I'll write the code that calculates the total amount and gives me the result :D I've worked on this for a day.
I use this code to fetch and save the data from the link:
from urllib.request import urlopen as uReq
url1 = 'http://link.com'
page = uReq(url1).read().decode()
f = open("dataNEW.txt", "w")
f.write(page)
f.close()
But the problem is that this writes everything to the file; I want to save only what comes after "pay".
That string is in JSON format, which can easily be converted to a Python data structure using the json module. Here is an example:
import json
from urllib.request import urlopen as uReq
url1 = 'http://link.com'
page = uReq(url1).read().decode()
data = json.loads(page)
with open("dataNEW.txt", "w") as f:
for sub_dict in data.values():
f.write("{}\n".format(sub_dict["pay"]))
Your dataNEW.txt should then look like the following:
358.43100342
363.15667371
384.67092962
356.72833095
126.52971186
346.29137133
379.17652358
251.91152953
158.96447829
85.70053418
8.32775435
147.6364827
153.35155293
181.65819439
308.4201041
196.21644271
230.84154581
151.05243751
195.44476902
217.42321168
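Since the end goal is to total those amounts, here is a minimal sketch of how you could sum the "pay" values directly instead of writing them to a file first; it reuses the placeholder URL from the question:

import json
from urllib.request import urlopen as uReq

url1 = 'http://link.com'  # placeholder URL from the question
data = json.loads(uReq(url1).read().decode())

# "pay" is a number in the sample JSON; float() also covers string values
total = sum(float(day["pay"]) for day in data.values())
print("Total pay: {:.2f} RON".format(total))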
Related
Pulling info from an API URL
I'm trying to pull the average of temperatures from this API for a bunch of different ZIP codes. I can currently do so by manually changing the ZIP code in the URL for the API, but I was hoping to be able to loop through a list of ZIP codes, or ask for input and use those ZIP codes. However, I'm rather new and have no idea how to add variables and such to a link, or maybe I'm overcomplicating it. So basically I was searching for a way to add a variable to the link, or something to the same effect, so I can change it whenever I want.

import urllib.request
import json

out = open("output.txt", "w")

link = "http://api.openweathermap.org/data/2.5/weather?zip={zip-code},us&appid={api-key}"
print(link)

x = urllib.request.urlopen(link)
url = x.read()
out.write(str(url, 'utf-8'))

returnJson = json.loads(url)
print('\n')
print(returnJson["main"]["temp"])
You can achieve what you want by looping through a list of ZIP codes and creating a new URL from each of them:

import urllib.request
import json

zipCodes = ['123', '231', '121']

out = open("output.txt", "w")

for i in zipCodes:
    link = "http://api.openweathermap.org/data/2.5/weather?zip=" + i + ",us&appid={api-key}"
    x = urllib.request.urlopen(link)
    url = x.read()
    out.write(str(url, 'utf-8'))
    returnJson = json.loads(url)
    print(returnJson["main"]["temp"])

out.close()
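The question also asked for the average of the temperatures. A minimal sketch of one way to extend the loop, assuming the same {api-key} placeholder and the default OpenWeatherMap response shape:

import urllib.request
import json

zipCodes = ['123', '231', '121']
temps = []

for z in zipCodes:
    link = "http://api.openweathermap.org/data/2.5/weather?zip=" + z + ",us&appid={api-key}"
    data = json.loads(urllib.request.urlopen(link).read())
    temps.append(data["main"]["temp"])

# temperatures come back in Kelvin unless a units parameter is added to the URL
print("Average temperature:", sum(temps) / len(temps))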
How do I update and save data to a CSV file each day?
I'm trying to log COVID data from a website and update it each day with new cases. So far I have managed to put the numbers of cases in the file through scraping, but each day I have to manually enter the dates and run the file to get the updated statistics. How would I go about writing a script that updates the CSV each day, with new dates and the new number of cases, while keeping the old ones for future use? I wrote this and run it in Visual Studio Code.

import csv
import bs4
import urllib
from urllib.request import urlopen as uReq
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

# For sites that can't be opened due to Urllib blocker, use a Mozilla User agent to get access
pageRequest = Request('https://coronavirusbellcurve.com/', headers = {'User-Agent': 'Mozilla/5.0'})
htmlPage = urlopen(pageRequest).read()
page_soup = soup(htmlPage, 'html.parser')

specificDiv = page_soup.find("div", {"class": "table-responsive-xl"})
TbodyStats = specificDiv.table.tbody.tr.contents
TbodyDates = specificDiv.table.thead.tr.contents

def writeCSV():
    with open('CovidHTML.csv', 'w', newline='') as file:
        theWriter = csv.writer(file)
        theWriter.writerow(['5/8', ' 5/9', ' 5/10', ' 5/11', ' 5/12'])
        row = []
        for i in range(3, len(TbodyStats), 2):
            row.append([TbodyStats[i].text])
        theWriter.writerow(row)

writeCSV()
If you want to preserve the older contents of the CSV file, then open the file in append mode (as correctly pointed out by @bfris):

with open('CovidHTML.csv', 'a', newline='') as file:

If you are using Linux, you can set up a cron job to invoke the Python script every day at a specific time. First, locate the path to Python using the which command:

$ which python3

This gave me /usr/bin/python3. Then the cron job will look like:

10 14 * * * /usr/bin/python3 /path/to/python/file.py

Add this line to the crontab file. This will call the Python script at 2:10 PM every day. You can take a look here for details. In case you are using Windows, you can take a look at this question.
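To tie that together with the scraper above, a minimal sketch of a daily append might look like the following; new_cases is a hypothetical placeholder for whatever value the BeautifulSoup code extracts on a given run:

import csv
from datetime import date

def append_daily_row(new_cases, path='CovidHTML.csv'):
    # 'a' keeps all previously written rows and adds one new line per run
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([date.today().isoformat(), new_cases])

# e.g. call append_daily_row(TbodyStats[3].text) after the scraping step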
Python: List of URLs to text list in Python (Excel)
I have a list of URLs to tweets in Excel. Is it possible to take out the text from these tweets (URLs) in Python and later save it in the Excel sheet? I saw someone use the code below, but it only works for one URL.

from lxml import html
import requests

page = requests.get('https://twitter.com/realDonaldTrump/status/1237448419284783105')
tree = html.fromstring(page.content)
tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')

The Excel file ('twitter.xlsx') contains the columns Author and URL and looks like this:

Author      URL
realDon..   https://twitter.com/realDon..
...         ...

I tried this code:

import pandas as pd
from lxml import html
import requests

input_data = pd.read_excel('twitter.xlsx')
input_data1 = input_data[['URL']]

tweets = []
for url in input_data1.values:
    x = requests.get(url)
    tree = html.fromstring(x.content)
    i = tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
    tweets.append(i)

Error:

InvalidSchema: No connection adapters were found for '['https://twitter.com/realDonaldTrump/status/1237448419284783105']'
Short answer: yes. Long answer: yes, it's possible, but I suggest you do some reading on the topic. https://automatetheboringstuff.com/chapter12/ covers how to manage and manipulate Excel files. The openpyxl library is your friend here - here's their documentation. requests is a great library for getting access to websites! Here is their documentation. Here's a pseudo-code mock-up of what your program logic could look like:

input_data = read(excel_file)
tweets = []
for url in input_data:
    x = get(url)
    tweets.append(x)
for tweet in tweets:
    write(tweet, excel_file)
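As a more concrete sketch of that pseudo-code, using the pandas and requests calls already shown in the question: iterating over the URL column as plain strings avoids the InvalidSchema error, which came from passing a whole NumPy row to requests.get. The column names and the Twitter markup are assumptions taken from the question and may well have changed since:

import pandas as pd
import requests
from lxml import html

input_data = pd.read_excel('twitter.xlsx')

tweets = []
for url in input_data['URL']:          # each url is now a plain string
    page = requests.get(url)
    tree = html.fromstring(page.content)
    text = tree.xpath('//div[contains(@class, "permalink-tweet-container")]'
                      '//p[contains(@class, "tweet-text")]//text()')
    tweets.append(' '.join(text))

input_data['Tweet'] = tweets
input_data.to_excel('twitter.xlsx', index=False)  # write the text back next to each URL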
Is there a way to get the exact data needed from my Python Script
My Python script fetches data from the website below into a text file:

http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448

Now my aim is to filter out 'Header', 'Details', 'FromDateTime', 'UpToDateTime' and 'Updated'. I have tried BeautifulSoup with a text-specific search, but I'm not there yet. The code below shows that. Any help would be indeed helpful :) Sorry if I missed something very natural.

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import operator
from numpy import *

# Collect and parse first page
page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup)

for script in soup(["Header", "Details", "Updated", "UpToDateTime", "FromDateTime"]):
    script.extract()

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())

# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))

# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

f1 = open("data.txt", "r")
resultFile = open("out.csv", "wb")
wr = csv.writer(resultFile, quotechar=',')

I expect a CSV with columns "Header", "Details", "Updated", "UpToDateTime", "FromDateTime".
You are doing it the wrong way. You don't need BeautifulSoup for this task: your API returns data as JSON, and BeautifulSoup is best for HTML. For your purpose you can use the pandas and json libraries. pandas can read directly from a web resource as well, but since you only want the request data from the JSON, you need both libraries. Here is a snippet which you can use:

import pandas as pd
import requests
import json

page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
data = json.loads(page.text)
df = pd.DataFrame(data["ResponseData"])
df.to_csv("file path")

Change the file path and you get the whole data set inside the CSV. If you want to remove any column or do any other manipulation of the data, you can do that with the pandas DataFrame as well. It is a very powerful library; you can learn more about it using Google.
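Since the goal was a CSV containing only those five fields, one way to narrow the DataFrame before saving is sketched below; it assumes those keys actually appear in the ResponseData entries returned by the API:

import pandas as pd
import requests

page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
df = pd.DataFrame(page.json()["ResponseData"])

# keep only the requested columns before writing the CSV
wanted = ["Header", "Details", "Updated", "UpToDateTime", "FromDateTime"]
df[wanted].to_csv("out.csv", index=False)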
Download the content of a url and save it
I have this format of data from a URL:

[{"column1":"something","column2":"something 2","column1":"something3","column2":"something4"}, etc]

I want to download this content from the URL and save it into a CSV file in this format:

column1      column2
something    something2
something3   something4

Can I do it with Python using urllib2? Or any other library?
Requests is probably the easiest for the download. You'd just use:

import codecs, requests

req = requests.get(url)
text = req.text  # requests already decodes the body using the detected encoding
with codecs.open(filename, 'w', encoding='utf-8') as f:
    f.write(text)

As yedpodtrzitko mentioned, however, parsing that particular input as JSON will leave you with only two key/value pairs, because each repeated key gets overwritten with the last value encountered. You could split and parse it manually into rows, but it shouldn't be valid JSON in the first place.
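If the URL actually returns a well-formed JSON array with one object per row, for example [{"column1": "something", "column2": "something 2"}, {"column1": "something3", "column2": "something4"}], a minimal sketch for writing it straight to CSV could look like this; the URL and output filename are placeholders:

import csv
import requests

url = 'http://example.com/data.json'  # placeholder URL
rows = requests.get(url).json()       # expects a list of dicts, one per row

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['column1', 'column2'])
    writer.writeheader()
    writer.writerows(rows)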