Open URL stored in a csv file - python

I'm almost an absolute beginner in Python, but I have been asked to manage a difficult task. I have read many tutorials and found some very useful tips on this website, but I don't think this question has been asked before, at least not the way I phrased it in the search engine.
I have managed to write some URLs to a CSV file. Now I would like to write a script that opens this file, opens the URLs, and writes their content to a dictionary. But I have failed: my script can print the addresses, but cannot process the file.
Interestingly, my script did not produce the same error message each time. Here is the latest:
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
So I think my script has several problems:
1 - is my method of opening the URLs the right one?
2 - what is wrong with the way I build the dictionary?
Here is my attempt below. Thanks in advance to anyone who can help!
import csv
import urllib

dict = {}
test = csv.reader(open("read.csv","rb"))
for z in test:
    sock = urllib.urlopen(z)
    source = sock.read()
    dict[z] = source
    sock.close()
print dict

First, don't shadow built-ins. Rename your dictionary to something else, since dict is the built-in used to create new dictionaries.
Second, the csv reader yields a list per line containing all the columns. Either reference the column explicitly with urllib.urlopen(z[0]) # first column in the line, or open the file with a plain open() and iterate over its lines.
Apart from that, it works for me.
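For reference, a minimal corrected sketch applying both fixes (Python 2, to match the question; it assumes read.csv holds one URL per row in the first column):
import csv
import urllib

pages = {}  # renamed so the built-in dict is not shadowed
reader = csv.reader(open("read.csv", "rb"))
for row in reader:
    url = row[0]              # csv.reader yields a list per row; take the first column
    sock = urllib.urlopen(url)
    pages[url] = sock.read()  # use the string URL as the key; a list is not hashable
    sock.close()
print pages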


When I import an array from another file, do I just get its data, or is the array rebuilt the way the original file built it?

Sorry if the question is not well formulated; I will reformulate it if necessary.
I have a file with an array that I filled with data from an online JSON database, and I imported this array into another file to use its data.
# file1
import json
from urllib.request import urlopen

response = urlopen(url1)  # url1: the online JSON db (not shown in the question)
a = []
data = json.loads(response.read())
for i in range(len(data)):
    a.append(data[i]['name'])
    i += 1  # note: this has no effect; the for loop already advances i

# file2
from file1 import a
'''do something with "a"'''
Does importing the array mean I'm filling the array each time I use it in file2?
If that is the case, what can I do to just keep the data from the array without "building" it each time?
If you saved a to a file and then read it back in, you would not need to rebuild a -- you could just open the file. For example, here's one way to open a text file and get the text from it:
# set a variable to be the open file
OpenFile = open(file_path, "r")
# set a variable to be everything read from the file, then you can act on that variable
file_guts = OpenFile.read()
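As a concrete sketch of that idea using json (the file name names.json is my own choice, not from the answer):
import json

# in file1, after a has been filled: save the list once
with open("names.json", "w") as f:
    json.dump(a, f)

# in file2: load the saved data instead of importing (and re-running) file1
with open("names.json", "r") as f:
    a = json.load(f)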
From the Python docs on the Modules section - link - you can read:
When you run a Python module with
python fibo.py <arguments>
the code in the module will be executed, just as if you imported it
This means that importing a module behaves the same as running it as a regular Python script, unless you use the __name__ guard mentioned right after that quotation.
Also, if you think about it, you are opening something, reading from it, and then doing some operations on it. How can you be sure that the content you read now is the same as the content you read the first time?
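A minimal sketch of that guard, assuming file1 is laid out like the question's code:
# file1
a = []
# ... code that downloads the JSON data and fills a ...

if __name__ == "__main__":
    # this block runs only when file1 is executed directly
    # (python file1.py), not when file2 does "from file1 import a"
    print(a)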

RedVox Python SDK | Not Reading in .rdvxz Files

I'm attempting to read in a series of files contained in a single directory for processing with RedVox:
input_directory = "/home/ben/Documents/Data/F1D1/21"  # file location
rdvx_data = DataWindow(input_dir=input_directory, apply_correction=False, debug=True)  # using RedVox to read in the files
print(os.listdir(input_directory))  # verifying the files actually exist...
# returns "['file1.rdvxz', 'file2.rdvxz', 'file3.rdvxz', ...etc]", they exist

# write audio portion to file
rdvx_data.to_json_file(base_dir=output_rpd_directory,
                       file_name=output_filename)

# this never runs, because rdvx_data.stations = [] (verified through debugging)
for station in rdvx_data.stations:
    # some code here
Enabling debugging through the arguments, as seen above, does not provide any extra detail. In fact, there is no error message whatsoever. It writes the JSON file and pickle to disk, but the JSON file is full of null values and the pickle object is just a shell with no contents. So the files definitely exist and os.listdir() sees them, but RedVox does not.
I assume this is some very silly error or a lack of understanding on my part. Any help is greatly appreciated. I have not worked with RedVox previously, nor do I have much understanding of what these files contain, other than some audio data and some other data. I've simply been tasked with opening them to work on a model to analyze the data within.
SOLVED: I'm not sure why the previous code doesn't work (it was handed to me), but I worked around the DataWindow call and went straight to the redvox.api900.reader module:
from glob import glob
from redvox.api900 import reader

dataset_dir = "/home/*****/Documents/Data/F1D1/21/"
rdvx_files = glob(dataset_dir + "*.rdvxz")
for file in rdvx_files:
    wrapped_packet = reader.read_rdvxz_file(file)
From here I can view all of the sensor data within:
if wrapped_packet.has_microphone_sensor():
    microphone_sensor = wrapped_packet.microphone_sensor()
    print("sample_rate_hz", microphone_sensor.sample_rate_hz())
Hope this helps anyone else who's confused.

When I run the code, it runs without errors, but the CSV file is not created. Why?

I found a tutorial and I'm trying to run this script; I have not worked with Python before.
tutorial
I've already traced what is happening with logging.debug, checked that it connects to Google, and tried creating a CSV file with other scripts.
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv

def scrape_run():
    with open('/Users/Work/Desktop/searches.txt') as searches:
        for search in searches:
            userQuery = search
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            links = page.cssselect('.r a')
            csvfile = '/Users/Work/Desktop/data.csv'
            for row in links:
                raw_url = row.get('href')
                title = row.text_content()
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    csvRow = [userQuery, url[0], title]
                    with open(csvfile, 'a') as data:
                        writer = csv.writer(data)
                        writer.writerow(csvRow)
            print(links)

scrape_run()
The TL;DR of this script is that it does three basic things:
1. Locates and opens your searches.txt file.
2. Uses those keywords and searches the first page of Google for each result.
3. Creates a new CSV file and prints the results (keyword, URLs, and page titles).
Solved
Google added a captcha because I was making too many requests.
It works when I use mobile internet.
Assuming the links variable is populated and contains data - please verify.
If it is empty, test the request itself; maybe it returns something different than you expected.
Other than that, I think you just need to tweak your file handling a little.
https://www.guru99.com/reading-and-writing-files-in-python.html
Here you can find some guidelines on file handling in Python.
In my view, you need to make sure the file gets created first: start with a script that just creates a file, then enhance it to write to and append to the file. From there you should be good to continue with your script.
Also, you would be better off opening the file only once rather than on every loop iteration; that could mean much faster execution, as in the sketch below.
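A minimal sketch of that change, reworking the question's function so both files are opened once (the .strip() on the query and the newline='' argument are my additions, not part of the original):
from urllib.parse import urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv

def scrape_run():
    # open both files once, before the loops, instead of re-opening
    # data.csv for every row that gets written
    with open('/Users/Work/Desktop/searches.txt') as searches, \
         open('/Users/Work/Desktop/data.csv', 'a', newline='') as data:
        writer = csv.writer(data)
        for search in searches:
            userQuery = search.strip()  # drop the trailing newline from each query
            raw = get("https://www.google.com/search?q=" + userQuery).text
            for row in fromstring(raw).cssselect('.r a'):
                raw_url = row.get('href')
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    writer.writerow([userQuery, url[0], row.text_content()])

scrape_run()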
Let me know if something is not clear.

Python JSON dictionary key error

I'm trying to collect data from a JSON file using Python. I was able to access several chunks of text, but when I get to the third object in the JSON file I get a key error. The first three lines below work fine, but the last line gives me a key error.
import json
import urllib

response = urllib.urlopen("http://asn.desire2learn.com/resources/D2740436.json")
data = json.loads(response.read())
title = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/dc/elements/1.1/title"][0]["value"]
description = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/dc/terms/description"][0]["value"]
topics = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/gem/qualifiers/hasChild"]
topicDesc = data["http://asn.desire2learn.com/resources/S2743916"]
Here is the JSON file I'm using: http://s3.amazonaws.com/asnstaticd2l/data/rdf/D2742493.json. I went through all the braces and can't figure out why I'm getting this error. Does anyone know why I might be getting it?
topics = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/gem/qualifiers/hasChild"]
I don't see the key "http://asn.desire2learn.com/resources/D2740436" anywhere in your source file. You didn't include your stack trace, but my first thought would be a typo resulting in a bad key, giving you an error like:
KeyError: "http://asn.desire2learn.com/resources/D2740436"
which means that value does not exist in the data you are referencing.
The link in your code and your AWS link go to very different files. Open the link from your code in a web browser and you will find that it's much shorter than the file on AWS; it doesn't actually contain the key you're looking for.
You say that you are using the linked file, in which the key "http://asn.desire2learn.com/resources/S2743916" turns up once.
However, your code downloads a different file, one in which the key does not appear.
Try using the file you linked in your code, and the key should work.
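Either way, a membership check makes this kind of mismatch visible instead of raising; a minimal sketch (Python 2, matching the question's code) using the URL and key from the question:
import json
import urllib

response = urllib.urlopen("http://asn.desire2learn.com/resources/D2740436.json")
data = json.loads(response.read())

key = "http://asn.desire2learn.com/resources/S2743916"
if key in data:
    topicDesc = data[key]
else:
    # the key is missing from this particular file; show what is there
    print "missing key; available keys:", data.keys()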

Having an issue with parsing XML in Python

I am really new to Python. This is actually my first script with it, and most of it is a copied example. I have an XML file that I need to parse for a particular element. I have that part figured out, but my issue is that the element does not always exist in the XML file. Here is my code:
#!/usr/bin/python
#import library to do http requests:
import urllib2
import os
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#download the history:
history = urllib2.urlopen('http://192.168.1.1/example.xml')
#convert to string:
historydata = history.read()
history.close()
#parse the xml you downloaded
dom = parseString(historydata)
xmlTagHistory = dom.getElementsByTagName('loaded')[0].toxml()
xmlDataHistory=xmlTagHistory.replace('<loaded>','').replace('</loaded>','')
print xmlDataHistory
When the element doesn't exist, I get "IndexError: list index out of range". What I am attempting to do is run a command if the element doesn't exist or is false. The other issue I will probably have is that there will be times when the element appears more than once, so I would also need to account for that scenario by NOT running the command if there is even one instance of "loaded" being true. As I said, I am really new at this, so I could use all the help I can get. Much appreciated.
Since dom.getElementsByTagName('loaded') returns a list, you can just check the list size with the len() function. Only if the list length is above 0 is it valid to dereference element [0].
An alternative is to wrap the code in a try/except pair and catch the IndexError.
http://docs.python.org/tutorial/errors.html
Using try and except, you should be able to handle everything you need.
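A minimal sketch of both approaches against the question's snippet; run_command() is a hypothetical stand-in for whatever should run when 'loaded' is absent:
# dom comes from parseString(historydata) in the question's code
loaded_elements = dom.getElementsByTagName('loaded')

# option 1: check the list length before indexing into it
if len(loaded_elements) == 0:
    run_command()  # hypothetical: the command to run when 'loaded' is absent

# option 2: attempt the indexing and catch the failure
try:
    xmlTagHistory = loaded_elements[0].toxml()
except IndexError:
    run_command()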
