How to convert URL data to CSV using Python

I am trying to download the data from the following URL and trying to save it as CSV, but the output I am getting is a text file. Can anyone please help with what I am doing wrong here? Also, is it possible to add multiple URLs to the same script and download multiple CSV files?
import csv
import pandas as pd
import requests
from datetime import datetime

CSV_URL = ('https://dsv-ops-toolkit.ihsmvals.com/ftp?config=fenics-bgc&file=IRSDATA_20211129_1700_Intra.csv&directory=%2FIRS%2FIntraday%2FDaily')

with requests.Session() as s:
    download = s.get(CSV_URL)
    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    date = datetime.now().strftime('%y%m%d')
    my_list = list(cr)
    df = pd.DataFrame(my_list)
    df.to_csv(f'RFR_{date}')

You can create a list of your necessary URLs, for example:

urls = ['http://url1.com', 'http://url2.com', 'http://url3.com']

Then iterate through the list, keeping your request code as it is for each URL:

for each_url in urls:
    with requests.Session() as s:
        # your_code_here

Hope you'll find this helpful.
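Putting it together, a minimal sketch of the multi-URL version (the URLs are placeholders). Note also that passing a name without a .csv extension to to_csv, as in RFR_{date} above, is why the output shows up as a plain text file:

import csv
import pandas as pd
import requests
from datetime import datetime

urls = ['http://url1.com', 'http://url2.com', 'http://url3.com']  # placeholder URLs
date = datetime.now().strftime('%y%m%d')

with requests.Session() as s:  # one session reused across all downloads
    for i, each_url in enumerate(urls):
        decoded_content = s.get(each_url).content.decode('utf-8')
        cr = csv.reader(decoded_content.splitlines(), delimiter=',')
        df = pd.DataFrame(list(cr))
        # include the .csv extension so the output is recognised as CSV
        df.to_csv(f'RFR_{date}_{i}.csv', index=False)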

Related

Use Python to scrape images from xml tags

I am trying to write a short Python program to download a copy of the XML jail roster for the local county, save that file, scrape and save all the names and image links in a CSV file, and then download each of the photos with the file name being the person's name.
I've managed to get the XML file, save it locally, and create the CSV file. I was briefly able to write the full XML tag (tag and attribute) to the CSV file, but can't seem to get just the attribute, or the image links.
from datetime import datetime
from datetime import date
import requests
import csv
import bs4 as bs
from bs4 import BeautifulSoup

# get current date
today = date.today()
# convert date to date-sort format
d1 = today.strftime("%Y-%m-%d")
# create filename variable
roster = 'jailroster' + '-' + d1 + '-dev' + '.xml'
# grab xml file from server
url = "fakepath.xml"
print("ATTEMPTING TO GET XML FILE FROM SERVER")
req_xml = requests.get(url)
print("Response code:", req_xml.status_code)
if req_xml.status_code == 200:
    print("XML file downloaded at ", datetime.now())
    soup = BeautifulSoup(req_xml.content, 'lxml')
# save xml file from get locally
with open(roster, 'wb') as file:
    file.write(req_xml.content)
    print('Saving local copy of XML as:', roster)
# read xml data from saved copy
infile = open(roster, 'r')
contents = infile.read()
soup = bs.BeautifulSoup(contents, 'lxml')
# variables needed for image list
images = soup.findAll('image1')
fname = soup.findAll('nf')
mname = soup.findAll('nm')
lname = soup.findAll('nl')
baseurl = 'fakepath.com'
with open('image-list.csv', 'w', newline='') as csvfile:
    imagelist = csv.writer(csvfile, delimiter=',')
    print('Image list being created')
    imagelist.writerows(images['src'])
I've gone through about half a dozen tutorials trying to figure all this out, but I think this is the edge of what I've been able to learn so far, and I haven't even started to figure out how to save the list of images as files. Can anyone help with a pointer or two, or point me towards tutorials on this?
Update: No, this is not for a mugshot site or any unethical purpose. The data is for a private, non-public public-safety project.
This should get you the data you need:
from datetime import date
import requests
from bs4 import BeautifulSoup
import pandas as pd
def extractor(tag: str) -> list:
    return [i.getText() for i in soup.find_all(tag)]

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
soup = BeautifulSoup(requests.get(url).text, features="lxml")

images = [
    f"{'https://legacyweb.randolphcountync.gov'}{i['src'].lstrip('..')}"
    for i in soup.find_all('image1')
]

df = pd.DataFrame(
    zip(extractor("nf"), extractor("nm"), extractor("nl"), images),
    columns=['First Name', 'Middle Name', 'Last Name', 'Mugshot'],
)

df.to_csv(
    f"jailroster-{date.today().strftime('%Y-%m-%d')}-dev.csv",
    index=False,
)
Sample output: a .csv file with the columns First Name, Middle Name, Last Name and Mugshot.
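To save the photos themselves, which the question hadn't got to yet, a short follow-on sketch, assuming the Mugshot URLs resolve to JPEG images (the .jpg extension is an assumption):

import requests

# iterate over the dataframe built above and download each image
for _, row in df.iterrows():
    name = '-'.join(part for part in (row['First Name'], row['Middle Name'], row['Last Name']) if part)
    resp = requests.get(row['Mugshot'])
    if resp.status_code == 200:
        # assumes the images are JPEGs; adjust the extension if not
        with open(f'{name}.jpg', 'wb') as f:
            f.write(resp.content)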

Download xlsx file with Python

I want to download to a local directory. This code works for CSV but not XLSX: it writes a file, but the file cannot be opened in Excel.
Any help will be appreciated.
import requests

url = 'https://some_url'
resp = requests.get(url)
open('some_filename.xlsx', 'wb').write(resp.content)
You could create a DataFrame from the response data and then use the DataFrame.to_excel() method to produce the xlsx file. This is a tested solution, and it worked for me.
import requests
import pandas as pd
import io

url = 'https://www.google.com'  # as an example
urlData = requests.get(url).content  # get the raw content from the url
dataframe = pd.read_csv(io.StringIO(urlData.decode('latin-1')))
filename = "data.xlsx"
dataframe.to_excel(filename)
In pandas you could just do:
import pandas as pd
url = 'https://some_url'
df = pd.read_csv(url)
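If the file behind the URL really is xlsx rather than CSV, pandas can also read it directly; a sketch, assuming an Excel engine such as openpyxl is installed (the URL is a placeholder, as in the question):

import pandas as pd

url = 'https://some_url'  # placeholder
df = pd.read_excel(url)  # requires openpyxl or another Excel engine
df.to_excel('some_filename.xlsx', index=False)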

Python: List of URLS to text list in Python (Excel)

I have a list of URLs to tweets in Excel. Is it possible to extract the text from these tweets (URLs) in Python and later save it in the Excel sheet?
I saw someone use the code below, but this is only for one URL.
from lxml import html
import requests
page = requests.get('https://twitter.com/realDonaldTrump/status/1237448419284783105')
tree = html.fromstring(page.content)
tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
The Excel file ('twitter.xlsx') contains the columns Author and URL and looks like this:

Author      URL
realDon..   https://twitter.com/realDon..
...         ...
I tried this code:
import pandas as pd
from lxml import html
import requests
input_data = pd.read_excel('twitter.xlsx')
input_data1 = input_data[['URL']]
tweets = []
for url in input_data1.values:
    x = requests.get(url)
    tree = html.fromstring(x.content)
    i = tree.xpath('//div[contains(@class, "permalink-tweet container")]//p[contains(@class, "tweet-text")]//text()')
    tweets.append(i)
Error:
InvalidSchema: No connection adapters were found for '['https://twitter.com/realDonaldTrump/status/1237448419284783105']'
Short answer - yes.
Long answer - yes, it's possible. I suggest you do some reading on the topic.
https://automatetheboringstuff.com/chapter12/ covers how to manage and manipulate Excel files; the openpyxl library is your friend here, and its documentation is worth reading.
requests is a great library for getting access to websites, and its documentation is also worth a look.
Here's a pseudo code mock up of what your program logic could look like:
input_data = read(excel_file)
tweets = []
for url in input_data:
    x = get(url)
    tweets.append(x)
for tweet in tweets:
    write(tweet, excel_file)
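A more concrete version of that pseudocode with pandas, requests and lxml; a sketch reusing the XPath from the question (which may no longer match Twitter's current markup). Note that iterating over input_data['URL'] yields plain strings, which avoids the InvalidSchema error above: that error comes from passing a one-element array from input_data1.values to requests.get. The output filename is an assumption:

import pandas as pd
import requests
from lxml import html

input_data = pd.read_excel('twitter.xlsx')
tweets = []
for url in input_data['URL']:  # plain strings, not one-element arrays
    page = requests.get(url)
    tree = html.fromstring(page.content)
    text = tree.xpath('//div[contains(@class, "permalink-tweet-container")]'
                      '//p[contains(@class, "tweet-text")]//text()')
    tweets.append(' '.join(text))

input_data['Tweet'] = tweets
input_data.to_excel('twitter-with-text.xlsx', index=False)  # hypothetical output name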

How to download CSV data from a website using Python

I'm trying to automatically download data from the following website; however, I just get the HTML and no data:
http://tcplus.com/GTN/OperationalCapacity#filter.GasDay=02/02/19&filter.CycleType=1&page=1&sort=LocationName&sort_direction=ascending
import csv
import urllib2

downloaded_data = urllib2.urlopen('http://tcplus.com/GTN/OperationalCapacity#filter.GasDay=02/02/19&filter.CycleType=1&page=1&sort=LocationName&sort_direction=ascending')
csv_data = csv.reader(downloaded_data)
for row in csv_data:
    print row
The code below will only fetch data from the provided url, but if you tweak the parameters you can get other reports as well.

import requests

parameters = {
    'serviceTypeName': 'Ganesha.InfoPost.Service.OperationalCapacity.OperationalCapacityService, Ganesha.InfoPost.Service',
    'filterTypeName': 'Ganesha.InfoPost.ViewModels.GasDayAndCycleTypeFilterViewModel, Ganesha.InfoPost',
    'templateType': 6,
    'exportType': 1,
    'filter.GasDay': '02/02/19',
    'filter.CycleType': 1,
}

response = requests.post('http://tcplus.com/GTN/Export/Generate', data=parameters)

with open('result.csv', 'w') as f:
    f.write(response.text)
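If you would rather have the data in memory than on disk, the same response can go straight into pandas; a small sketch, assuming the endpoint still returns CSV text:

import io
import pandas as pd

df = pd.read_csv(io.StringIO(response.text))
print(df.head())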

Convert text data from requests object to dataframe with pandas

Using requests I am creating an object which is in .csv format. How can I then write that object to a DataFrame with pandas?
To get the requests object in text format:
import requests
import pandas as pd
url = r'http://test.url'
r = requests.get(url)
r.text  # this will return the data as text in CSV format
I tried (doesn't work):
pd.read_csv(r.text)
pd.DataFrame.from_csv(r.text)
Try this:

import requests
import pandas as pd
import io

urlData = requests.get(url).content  # url as defined in the question
rawData = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
I think you can use read_csv with the url directly:
pd.read_csv(url)
From the filepath_or_buffer section of the docs:
filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
import pandas as pd
import io
import requests
url = r'http://...'
r = requests.get(url)
df = pd.read_csv(io.StringIO(r))
If that doesn't work (passing the Response object r straight to StringIO raises a TypeError), update the last line to use r.text:
import pandas as pd
import io
import requests
url = r'http://...'
r = requests.get(url)
df = pd.read_csv(io.StringIO(r.text))
Using "read_csv with url" worked:
import pandas as pd

url = 'https://arte.folha.uol.com.br/ciencia/2020/coronavirus/csv/mundo/dados-bra.csv'
corona_bra = pd.read_csv(url)
print(corona_bra.head())
If the url has no authentication, then you can directly use read_csv(url).
If it does require authentication, you can use requests to fetch the content, check that the result really is CSV, and then load it with pandas.
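For instance, a sketch of the authenticated case (the URL and credentials are placeholders):

import io
import pandas as pd
import requests

url = 'https://example.com/protected/data.csv'  # placeholder
r = requests.get(url, auth=('user', 'password'))  # placeholder HTTP basic auth credentials
df = pd.read_csv(io.StringIO(r.text))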
You can also work with the built-in csv module directly:
import csv
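For example, a minimal sketch using the csv module on top of requests (the URL is a placeholder):

import csv
import requests

url = 'https://example.com/data.csv'  # placeholder
lines = requests.get(url).text.splitlines()
for row in csv.reader(lines):
    print(row)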
