I don't quite understand how to download all the data files from a dataset. My code downloads only one file, but there are several of them. How can I solve this problem?
I am using the HDX API library. There is a small example in the documentation: a list of resources is returned, and I use the download method. But only the first file from the list is downloaded, not all of them.
My code:
from hdx.hdx_configuration import Configuration
from hdx.data.dataset import Dataset
Configuration.create(hdx_site='prod', user_agent='A_Quick_Example', hdx_read_only=True)
dataset = Dataset.read_from_hdx('novel-coronavirus-2019-ncov-cases')
resources = dataset.get_resources()
print(resources)
url, path = resources[0].download()
print('Resource URL %s downloaded to %s' % (url, path))
I tried several approaches, but this is the only one that worked. It seems there is some kind of error in the loop, but I do not understand how to solve it.
Result
Resource URL https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_confirmed_global.csv&filename=time_series_covid19_confirmed_global.csv downloaded to C:\Users\tred1\AppData\Local\Temp\time_series_covid19_confirmed_global.csv.CSV
I forgot to add that I get a list of resources, each of which contains a download URL value. The problem is probably in the loop. When I use a for loop I get this:
for res in resources:
    print(res)
    res[0].download()
Traceback (most recent call last):
File "C:/Users/tred1/PycharmProjects/pythonProject2/HDXapi.py", line 31, in <module>
main()
File "C:/Users/tred1/PycharmProjects/pythonProject2/HDXapi.py", line 21, in main
res[0].download()
File "C:\Users\tred1\AppData\Local\Programs\Python\Python38\lib\collections\__init__.py", line 1010, in __getitem__
raise KeyError(key)
KeyError: 0
You can get the download link as follows:
dataset = Dataset.read_from_hdx('acled-conflict-data-for-africa-1997-lastyear')
lista_resources = dataset.get_resources()
dictio = lista_resources[1]
url = dictio['download_url']
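As for the KeyError in your loop: res is already a single Resource object, and Resource supports dictionary-style lookup, so res[0] tries to fetch the key 0 and raises KeyError: 0. Call download() on each resource directly instead. A minimal sketch of the full loop, reusing the dataset from the question (the prints are just illustrative):
from hdx.hdx_configuration import Configuration
from hdx.data.dataset import Dataset
Configuration.create(hdx_site='prod', user_agent='A_Quick_Example', hdx_read_only=True)
dataset = Dataset.read_from_hdx('novel-coronavirus-2019-ncov-cases')
# Download every resource in the dataset, not just the first one.
for resource in dataset.get_resources():
    url, path = resource.download()  # returns (download URL, local file path)
    print('Resource URL %s downloaded to %s' % (url, path))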
Up until this morning I successfully used snscrape to scrape Twitter tweets via Python.
The code looks like this:
import snscrape.modules.twitter as sntwitter
query = "from:annewilltalk"
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    # do stuff
But this morning without any changes I got the error:
Traceback (most recent call last):
File "twitter.py", line 568, in <module>
Scraper = TwitterScraper()
File "twitter.py", line 66, in __init__
self.get_tweets_talkshow(username = "annewilltalk")
File "twitter.py", line 271, in get_tweets_talkshow
for i,tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
File "/home/pi/.local/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 1455, in get_items
for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', _TwitterAPIType.V2, params, paginationParams, cursor = self._cursor):
File "/home/pi/.local/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 721, in _iter_api_data
obj = self._get_api_data(endpoint, apiType, reqParams)
File "/home/pi/.local/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 691, in _get_api_data
r = self._get(endpoint, params = params, headers = self._apiHeaders, responseOkCallback = self._check_api_response)
File "/home/pi/.local/lib/python3.8/site-packages/snscrape/base.py", line 221, in _get
return self._request('GET', *args, **kwargs)
File "/home/pi/.local/lib/python3.8/site-packages/snscrape/base.py", line 217, in _request
raise ScraperException(msg)
snscrape.base.ScraperException: 4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=from%3Aannewilltalk&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo failed, giving up.
I found online that the encoded URL must not exceed a length of 500 characters, and the one from the error message is about 800 characters long. Could that be the problem? Why did that change overnight?
How can I fix that?
I have the same problem here. It was working just fine and suddenly stopped. I get the same error code.
This is the code I used:
import snscrape.modules.twitter as sntwitter
import pandas as pd
# Creating list to append tweet data to
tweets_list2 = []
# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('xxxx since:2022-06-01 until:2022-06-30').get_items()):
    if i > 100000:
        break
    tweets_list2.append([tweet.date, tweet.id, tweet.content, tweet.user.username])
# Creating a dataframe from the tweets list above
tweets_df2 = pd.DataFrame(tweets_list2, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])
print(tweets_df2)
tweets_df2.to_csv('xxxxx.csv')
Twitter changed something that broke the snscrape requests several days back. It looks like they just reverted back to the configuration that worked about 15 minutes ago, though. If you are using the latest snscrape version, everything should be working for you now!
Here is a thread with more info:
https://github.com/JustAnotherArchivist/snscrape/issues/671
There was a lot of discussion in the snscrape GitHub issues, but none of the answers worked for me.
I now have a workaround using the shell client: executing the queries via terminal works for some reason.
So I execute the shell command via subprocess, save the result in a JSON file, and load that JSON back into Python, as sketched below.
Example shell command: snscrape --jsonl --max-results 10 twitter-search "from:maybritillner" > test_twitter.json
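A minimal sketch of that workaround (the username and file name are just the examples from the command above):
import json
import subprocess
# Run snscrape via the shell; --jsonl writes one JSON object per line.
subprocess.run(
    'snscrape --jsonl --max-results 10 twitter-search "from:maybritillner" > test_twitter.json',
    shell=True, check=True,
)
# Load the JSONL output back into Python as a list of dicts.
with open('test_twitter.json') as f:
    tweets = [json.loads(line) for line in f if line.strip()]
print(len(tweets), 'tweets loaded')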
I'm using notion-py and I'm new to Python. I want to get a page title from one page and post it to another page, but when I try I'm getting an error:
Traceback (most recent call last):
File "auto_notion_read.py", line 16, in <module>
page_read = client.get_block(list_url_read)
File "/home/lotfi/.local/lib/python3.6/site-packages/notion/client.py", line 169, in get_block
block = self.get_record_data("block", block_id, force_refresh=force_refresh)
File "/home/lotfi/.local/lib/python3.6/site-packages/notion/client.py", line 162, in get_record_data
return self._store.get(table, id, force_refresh=force_refresh)
File "/home/lotfi/.local/lib/python3.6/site-packages/notion/store.py", line 184, in get
self.call_load_page_chunk(id)
File "/home/lotfi/.local/lib/python3.6/site-packages/notion/store.py", line 286, in call_load_page_chunk
recordmap = self._client.post("loadPageChunk", data).json()["recordMap"]
File "/home/lotfi/.local/lib/python3.6/site-packages/notion/client.py", line 262, in post
"message", "There was an error (400) submitting the request."
requests.exceptions.HTTPError: Invalid input.
The code I'm using is:
from notion.client import NotionClient
import time
token_v2 = "my page token"
client = NotionClient(token_v2=token_v2)
list_url_read = 'the url of the page to read'
page_read = client.get_block(list_url_read)
list_url_post = 'the url of the page to post to'
page_post = client.get_block(list_url_post)
print(page_read.title)
It isn't recommended to edit the source code of dependencies, as you will almost certainly cause a conflict when updating them in the future.
Fix PR 294 has been open since the 6th of March 2021 and has not been merged.
To fix this issue with the currently open PR (pull request) on GitHub, do the following:
pip uninstall notion
Then either:
pip install git+https://github.com/jamalex/notion-py.git#refs/pull/294/merge
OR in your requirements.txt add
git+https://github.com/jamalex/notion-py.git#refs/pull/294/merge
Source for PR 294 fix
You can find the fix here
In a nutshell, you need to modify two files in the library itself:
store.py
client.py
Find the "limit" value and change it to 100 in both; a sketch of what that looks like follows.
I am currently logged on to my Bloomberg Anywhere (web login) on my Mac. So my first question is: would I still be able to extract data using tia, given that I am not actually on my terminal?
import pdblp
con = pdblp.BCon(debug=True, port=8194, timeout=5000)
con.start()
I got this error
pdblp.pdblp:WARNING:Message Received:
SessionStartupFailure = {
reason = {
source = "Session"
category = "IO_ERROR"
errorCode = 9
description = "Connection failed"
}
}
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/prasadkamath/anaconda2/envs/Pk36/lib/python3.6/site-packages/pdblp/pdblp.py", line 147, in start
raise ConnectionError('Could not start blpapi.Session')
ConnectionError: Could not start blpapi.Session
I am assuming that I need to be on the terminal to be able to extract data, but wanted to confirm that.
This is a duplicate of this issue here on SO. It is not an issue with pdblp per se, but with blpapi not finding a connection. You mention that you are logged in via the web, which only allows you to use the terminal (or Excel add-in) within the browser, but not outside of it, since this way of accessing Bloomberg lacks a data feed and an API. More details and alternatives can be found here.
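As a quick sanity check, you can test whether anything is listening on the Desktop API port at all before involving pdblp (a minimal sketch; 8194 is the default port from the question):
import socket
try:
    # The Desktop API only listens on localhost when a full terminal is running.
    socket.create_connection(("localhost", 8194), timeout=5).close()
    print("Something is listening on 8194 - blpapi should be able to connect.")
except OSError as exc:
    print("No listener on 8194 (expected with a web-only login):", exc)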
I have code for a view which serves a file download, and it works fine in the browser. Now I am trying to write a test for it, using Django's internal test client:
response = self.client.get("/compile-book/", {'id': book.id})
self.assertEqual(response.status_code, 200)
self.assertEquals(response.get('Content-Disposition'),
                  "attachment; filename=book.zip")
So far so good. Now I would like to test whether the downloaded file is the one I expect. So I start with:
f = cStringIO.StringIO(response.content)
Now my test runner responds with:
Traceback (most recent call last):
File ".../tests.py", line 154, in test_download
f = cStringIO.StringIO(response.content)
File "/home/epub/projects/epub-env/lib/python2.7/site-packages/django/http/response.py", line 282, in content
self._consume_content()
File "/home/epub/projects/epub-env/lib/python2.7/site-packages/django/http/response.py", line 278, in _consume_content
self.content = b''.join(self.make_bytes(e) for e in self._container)
File "/home/epub/projects/epub-env/lib/python2.7/site-packages/django/http/response.py", line 278, in <genexpr>
self.content = b''.join(self.make_bytes(e) for e in self._container)
File "/usr/lib/python2.7/wsgiref/util.py", line 30, in next
data = self.filelike.read(self.blksize)
ValueError: I/O operation on closed file
Even when I simply do self.assertIsNotNone(response.content), I get the same ValueError.
The only topic on the entire internet (including the Django docs) I could find about testing downloads was this Stack Overflow topic: Django Unit Test for testing a file download. Trying that solution led to these results. It is old and rare enough for me to open a new question.
Does anybody know how the testing of downloads is supposed to be handled in Django? (btw, running Django 1.5 on Python 2.7)
This works for us. We return rest_framework.response.Response but it should work with regular Django responses, as well.
import io
response = self.client.get(download_url, {'id': archive_id})
downloaded_file = io.BytesIO(b"".join(response.streaming_content))
Note:
streaming_content is only available for StreamingHttpResponse (also Django 1.10):
https://docs.djangoproject.com/en/1.10/ref/request-response/#django.http.StreamingHttpResponse.streaming_content
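To then assert something about the payload itself, you can open it with zipfile, since the download in the question is a zip (a sketch; the expected member name is just an assumption):
import io
import zipfile
response = self.client.get(download_url, {'id': archive_id})
downloaded_file = io.BytesIO(b"".join(response.streaming_content))
# Inspect the zip in memory, without writing it to disk.
with zipfile.ZipFile(downloaded_file) as zf:
    self.assertIsNone(zf.testzip())            # no corrupt members
    self.assertIn('book.html', zf.namelist())  # hypothetical expected member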
I had some file download code and a corresponding test that worked with Django 1.4. The test failed when I upgraded to Django 1.5 (with the same ValueError: I/O operation on closed file error that you encountered).
I fixed it by changing my non-test code to use a StreamingHttpResponse instead of a standard HttpResponse. My test code used response.content, so I first migrated to CompatibleStreamingHttpResponse, then changed my test code to use response.streaming_content instead, which allowed me to drop CompatibleStreamingHttpResponse in favour of StreamingHttpResponse.
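A minimal sketch of the streaming view after that migration (the view name and file path are hypothetical):
from django.http import StreamingHttpResponse
def compile_book(request):
    # Stream the file rather than buffering it into a regular HttpResponse;
    # tests can then read it via b''.join(response.streaming_content).
    book_file = open('/tmp/book.zip', 'rb')  # hypothetical path
    response = StreamingHttpResponse(book_file, content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename=book.zip'
    return response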
I am sorry if this is considered a duplicate, but I've tried all the Python modules that can communicate with the Amazon API, and sadly all of them seem to require the product ID to get the exact price! What I need is the price from a product name.
Lastly, I've tried an extension of Bottlenose named python-amazon-simple-product-api, except that it has the same problem: how do I get only the price from the name of a product?
Here is what I've tried:
product = api.search(Keyword="playstation", SearchIndex='All')
for i, produ in enumerate(product):
    print "{0}. '{1}'".format(i, produ.title)
(I get the same result when using produ.price_and_currency, which in the documentation example is used with a product ID.)
It then gives me this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build\bdist.win-amd64\egg\amazon\api.py", line 174, in __iter__
File "build\bdist.win-amd64\egg\amazon\api.py", line 189, in iterate_pages
File "build\bdist.win-amd64\egg\amazon\api.py", line 211, in _query amazon.api.SearchException: Amazon Search Error: 'AWS.MinimumParameterRequirement', 'Your request should have atleast 1 of the following parameters: 'Keywords','Title','Power','BrowseNode','Artist','Author','Actor','Director','AudienceRati g','Manufacturer','MusicLabel','Composer','Publisher','Brand','Conductor','Orchestra','Tex Stream','Cuisine','City','Neighborhood'.'
Edit: after correcting Keyword to Keywords, I get a response that takes a very long time (it looks like an infinite loop; I tried it several times)! It does not just return the whole XML; when using only Bottlenose, I only get tags that don't contain a price, such as:
<ItemLink>
<Description>Technical Details</Description>
<URL>http://www.amazon.com/*****</URL>
</ItemLink>
Update 2: it seems that Amazon returns ALL results, so how do I limit this to only the first bucket (it returns results in groups of 10)?
Without having experience with the Amazon API: it's a matter of performing the search properly and intelligently. Think about it carefully, and read through
http://docs.amazonwebservices.com/AWSECommerceService/2011-08-01/DG/ItemSearch.html
so that you don't miss any important search feature.
The response contains somewhere between zero and a zillion items, depending on how intelligent your search query was. In any case, the items in the response identify themselves via their ASIN, the product ID. Example: <ASIN>B00021HBN6</ASIN>
After having collected ASINs via ItemSearch, you can perform an ItemLookup on those items to find further details, like the price.
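With python-amazon-simple-product-api, that two-step flow looks roughly like this (a sketch; the credentials are placeholders and the ASIN is the example one above):
from itertools import islice
from amazon.api import AmazonAPI
api = AmazonAPI('ACCESS_KEY', 'SECRET_KEY', 'ASSOC_TAG')  # placeholder credentials
# Step 1: search by name and collect ASINs, capping the iteration so the
# paginated search cannot run on forever.
asins = [p.asin for p in islice(api.search(Keywords='playstation', SearchIndex='All'), 10)]
# Step 2: look an item up by its ASIN and read the price.
product = api.lookup(ItemId='B00021HBN6')
price, currency = product.price_and_currency
print price, currency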
Sorry for the delay; solved it:
Pagination is done using search_n:
test = api.search_n(10, Keywords='example name', SearchIndex='All')  # this will return only 10 results