Basically, I loop over a datetime, performing a scan with a per-day date range, like:
from boto3.dynamodb.conditions import Key
from dateutil import parser

table_hook = dynamodb_resource.Table('table1')
date_filter = Key('date_column').between('2021-01-01T00:00:00+00:00', '2021-01-01T23:59:59+00:00')

response = table_hook.scan(FilterExpression=date_filter)
incoming_data = response['Items']

if response['Count'] == 0:
    return

_counter = 1
while 'LastEvaluatedKey' in response:
    response = table_hook.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    if (
        parser.parse(response['Items'][0]['date_column']).replace(tzinfo=None) < parser.parse('2021-01-01T00:00:00+00:00').replace(tzinfo=None)
        or
        parser.parse(response['Items'][0]['date_column']).replace(tzinfo=None) > parser.parse('2021-06-07T23:59:59+00:00').replace(tzinfo=None)
    ):
        break
    incoming_data.extend(response['Items'])
    _counter += 1
    print("|-> Getting page %s" % _counter)
At the end of the Day1-to-Day2 loop it retrieves X rows.
But if I perform the same scan the same way (paginating), with the same range (Day1 to Day2), without the loop, it retrieves Y rows.
And to make things more confusing, when I call describe_table(TableName='table1'), the ItemCount field reports Z rows. I literally don't understand what is going on!
With the help of the folks above, I found my error: I wasn't passing the filter again when paginating, so every scan page after the first came back unfiltered. (The mismatch with Z is expected anyway, since DescribeTable's ItemCount is only updated periodically, roughly every six hours.) The fixed code is:
table_hook = dynamodb_resource.Table('table1')
date_filter = Key('date_column').between('2021-01-01T00:00:00+00:00', '2021-01-01T23:59:59+00:00')

response = table_hook.scan(FilterExpression=date_filter)
incoming_data = response['Items']

_counter = 1
while 'LastEvaluatedKey' in response:
    # Pass the same FilterExpression on every page, not just the first scan.
    response = table_hook.scan(FilterExpression=date_filter,
                               ExclusiveStartKey=response['LastEvaluatedKey'])
    incoming_data.extend(response['Items'])
    _counter += 1
    print("|-> Getting page %s" % _counter)
I am working with a real estate API pulling rental listings. I'd like to loop through a list of zip codes to pull the data. The API returns at most 500 rows per request, so you page through results with an offset. The code below works fine until the while loop hits the second zip code. The issue is that after the first zip code has run successfully, I need the offset variable to reset to 500 and begin counting up again until the while loop breaks for the second zip code in the list.
import requests

# This just formats your token for the requests library.
headers = {
    "X-RapidAPI-Key": "your-key-here",
    "X-RapidAPI-Host": "realty-mole-property-api.p.rapidapi.com"
}

# Initial Limit and Offset values.
limit = 500
offset = 0
zipCode = [77449, 77008]

# This will be an array of all the listing records.
texas_listings = []

# We loop until we get no results.
for i in zipCode:
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)

        # Did we find any listings?
        if len(data) == 0:
            # If not, exit the loop
            break

        # If we did find listings, add them to the data
        # and then move onto the next offset.
        texas_listings.extend(data)
        offset = offset + 500
Here is a snippet of the final printed output. As you can see, zip code 77008 gets passed to the zipCode variable successfully after the 77449 zip code returns an empty list and breaks the loop at offset 5500. However, you can also see that the 77008 offset starts at 5500, and there don't appear to be that many listings in that zip code. How do I reset the offset variable to 500 and begin counting again?
You can reset the offset variable back to 500 before starting the loop for the next zip code:
for i in zipCode:
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        if len(data) == 0:
            break
        texas_listings.extend(data)
        offset = offset + 500
    offset = 500  # reset offset to 500 for the next zip code
Update: putting the offset and limit initialization inside the for loop makes it work the way I expect.
# We loop until we get no results.
for i in zipCode:
    limit = 500
    offset = 0
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)

        # Did we find any listings?
        if len(data) == 0:
            # If not, exit the loop
            break

        # If we did find listings, add them to the data
        # and then move onto the next offset.
        texas_listings.extend(data)
        offset = offset + 500

# Append this pull to the existing `texas` DataFrame
# (pd.concat replaces the deprecated DataFrame.append).
texas = pd.concat([pd.DataFrame(texas_listings), texas], ignore_index=True)
texas['date_pulled'] = pd.Timestamp.now().normalize()
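Another way to sidestep the reset bookkeeping entirely is to scope the offset inside a helper function, so every zip code starts from the first page by construction. A minimal sketch under the question's assumptions (fetch_zip_listings is a name I made up; headers and the endpoint are from the question):

import requests

def fetch_zip_listings(zip_code, headers, limit=500):
    # Yield every listing for one zip code, paging by `limit`.
    offset = 0  # fresh offset for each zip code
    while True:
        url = (f"https://realty-mole-property-api.p.rapidapi.com/rentalListings"
               f"?offset={offset}&limit={limit}&zipCode={zip_code}")
        data = requests.get(url, headers=headers).json()
        if not data:
            return
        yield from data
        offset += limit

texas_listings = []
for zip_code in [77449, 77008]:
    texas_listings.extend(fetch_zip_listings(zip_code, headers))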
I am trying to retrieve about 1000 tweets for a search term like 'NFL' using tweepy, storing the tweets in a DataFrame with pandas. My issue is that I can't find a way to remove duplicated tweets; I have tried df.drop_duplicates, but it leaves me only about 100 tweets to work with. Help would be appreciated!
num_needed = 1000
tweet_list = []  # Lists to be added as columns (tweets, usernames, and screen names) in our dataframe
user_list = []
screen_name_list = []
last_id = -1  # ID of last tweet seen

while len(tweet_list) < num_needed:
    try:
        # Criteria for collecting the tweets I want; the results should be
        # as accurate as possible for the final analysis.
        new_tweets = api.search(q='NFL', count=num_needed, max_id=str(last_id - 1),
                                lang='en', tweet_mode='extended')
    except tweepy.TweepError as e:
        print("Error", e)
        break
    else:
        if not new_tweets:
            print("Could not find any more tweets!")
            break
        else:
            for tweet in new_tweets:
                # Fetching the screen name and username
                screen_name = tweet.author.screen_name
                user_name = tweet.author.name
                tweet_text = tweet.full_text
                tweet_list.append(tweet_text)
                user_list.append(user_name)
                screen_name_list.append(screen_name)

df = pd.DataFrame()  # Create a new dataframe (df) with new columns
df['Screen name'] = screen_name_list
df['Username'] = user_list
df['Tweets'] = tweet_list
Well, yes: when you use .drop_duplicates() you're left with only about 100 tweets, because that's how many distinct ones there are. It doesn't matter what technique you use here; with the way your code runs, roughly 900 of the rows are duplicates.
So you might be asking: why? api.search returns at most 100 tweets per call, which I assume you're aware of, since you loop and try to page further back with the max_id parameter. However, your last_id is always -1 here: you never read any tweet ids, so you never change that parameter, and every call returns the same first page. One fix is to collect the ids while you iterate through the tweets; after each page, store the minimum id as last_id, and the paging will work (the query below also adds -filter:retweets, which removes a common source of duplicate text):
Code:

num_needed = 1000
tweet_list = []  # Lists to be added as columns (tweets, usernames, and screen names) in our dataframe
user_list = []
screen_name_list = []
tw_id = []  # <-- ADDED THIS
last_id = -1  # ID of last tweet seen

while len(tweet_list) < num_needed:
    try:
        new_tweets = api.search(q='NFL -filter:retweets', count=num_needed, max_id=str(last_id - 1),
                                lang='en', tweet_mode='extended')
    except tweepy.TweepError as e:
        print("Error", e)
        break
    else:
        if not new_tweets:
            print("Could not find any more tweets!")
            break
        else:
            for tweet in new_tweets:
                # Fetching the screen name and username
                screen_name = tweet.author.screen_name
                user_name = tweet.author.name
                tweet_text = tweet.full_text
                tweet_list.append(tweet_text)
                user_list.append(user_name)
                screen_name_list.append(screen_name)
                tw_id.append(tweet.id)  # <-- ADDED THIS
            last_id = min(tw_id)  # <-- ADDED THIS

df = pd.DataFrame({'Screen name': screen_name_list,
                   'Username': user_list,
                   'Tweets': tweet_list})
df = df.drop_duplicates()
This returns approximately 1000 tweets for me.
Output:
print(len(df))
1084
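If dropping duplicates by text ever removes distinct tweets that merely share the same wording, a small variant of the final step deduplicates on the tweet id collected above instead (this reuses the tw_id list from the answer's code):

import pandas as pd

df = pd.DataFrame({'Screen name': screen_name_list,
                   'Username': user_list,
                   'Tweets': tweet_list,
                   'Id': tw_id})
# The id is unique per tweet, so this keeps distinct tweets with identical text.
df = df.drop_duplicates(subset='Id').drop(columns='Id')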
I'm using a Medium API to get some information, but after a number of API calls the Python script ends with this error:
IndexError: list index out of range
Here is my Python code:
def get_post_responses(posts):
    #start = time.time()
    count = 0
    print('Retrieving the post responses...')
    responses = []
    for post in posts:
        url = MEDIUM + '/_/api/posts/' + post + '/responses'
        count = count + 1
        print("number of times api called", count)
        response = requests.get(url)
        response_dict = clean_json_response(response)
        responses += response_dict['payload']['value']
        #end = time.time()
        #four = end - start
        #global time_cal
        #time_cal.append(four)
    return responses

def check_if_high_recommends(response, recommend_min):
    if response['virtuals']['recommends'] >= recommend_min:
        return True

def check_if_recent(response):
    limit_date = datetime.now() - timedelta(days=360)
    creation_epoch_time = response['createdAt'] / 1000
    creation_date = datetime.fromtimestamp(creation_epoch_time)
    if creation_date >= limit_date:
        return True
It needs to work for users with more than 10,000 followers.
I found an answer to my question: I just needed a try/except (Python has no try/catch), skipping the posts whose responses can't be read:

    response_dict = clean_json_response(response)
    try:
        responses += response_dict['payload']['value']
    except (KeyError, IndexError):
        continue
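For completeness, here is a sketch of the whole loop with that guard folded in, plus a short pause after a failure in case the errors come from throttling (that cause is an assumption on my part; MEDIUM and clean_json_response are from the question):

import time
import requests

def get_post_responses(posts):
    print('Retrieving the post responses...')
    responses = []
    for count, post in enumerate(posts, start=1):
        url = MEDIUM + '/_/api/posts/' + post + '/responses'
        print("number of times api called", count)
        response = requests.get(url)
        try:
            responses += clean_json_response(response)['payload']['value']
        except (KeyError, IndexError, ValueError):
            time.sleep(1)  # back off briefly before the next call
    return responses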
I'm making a bot that compares the last buy and sell orders received by polling a cryptocurrency exchange and prints the difference.
My problem right now is that it prints the last order received over and over; I think it's because of the while loop. Is there a way to make it print only the last two orders, without printing the same thing multiple times? I was thinking of using OrderedDict, but I don't know how to use it on the JSON. Here is the code involved:
import time, requests, json

while True:
    BU = requests.session()
    URL = 'https://bittrex.com/api/v1.1/public/getmarkethistory?market=BTC-DOGE'
    r = BU.get(URL, timeout=(15, 10))
    time.sleep(1)
    MarketPairs = json.loads(r.content)
    for element in MarketPairs['result']:
        id = element['Id']
        price = element['Price']
        tot = element['Total']
        time = element['TimeStamp']  # note: this rebinds `time`, shadowing the time module
        type = element['OrderType']

        if time > '2017-12-11T21:37:01.103':
            print(type, id, tot, price, time)
            time.sleep(1)  # fails once `time` has been rebound to a string (see above)
I guess this is what you want...
It will print only the last price, and only if it's different from the previous one:
import requests as req
import time

previous = None
while 1:
    url = 'https://bittrex.com/api/v1.1/public/getmarkethistory?market=BTC-DOGE'
    response = req.get(url, timeout=(15, 10)).json()
    result = response["result"]
    last_price_dict = result[0]
    id = last_price_dict["Id"]
    price = last_price_dict["Price"]
    total = last_price_dict["Total"]
    timestamp = last_price_dict["TimeStamp"]
    order_type = last_price_dict["OrderType"]
    this_one = (id, total, price, timestamp)
    if id != previous:
        print(this_one)
        previous = id
    time.sleep(3)
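Since the original goal was to compare the most recent orders and print the difference, here is a small variant sketch that also remembers the previous price; reading "difference" as the price change between consecutive distinct orders is my assumption:

import requests
import time

url = 'https://bittrex.com/api/v1.1/public/getmarkethistory?market=BTC-DOGE'
previous_id = None
previous_price = None

while True:
    last = requests.get(url, timeout=(15, 10)).json()['result'][0]
    if last['Id'] != previous_id:
        if previous_price is not None:
            # Price change between the two most recent distinct orders.
            print(last['OrderType'], 'price moved by', last['Price'] - previous_price)
        previous_id = last['Id']
        previous_price = last['Price']
    time.sleep(3)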
I'm trying to loop an API call that returns JSON, since each call is limited to 200 rows. When I tried the code below, the loop didn't seem to end even after I left it running for an hour or so. The maximum I'm looking to pull is about ~200k rows from the API.
bookmark = ''
urlbase = 'https://..../?'
alldata = []

while True:
    url = urlbase  # first request has no bookmark yet
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    if len(data['rows']) < 200:
        break
Also, I'm looking to filter the loop so it only outputs rows whose JSON value 'pet.type' is "Puppies" or "Kittens". I haven't been able to figure out the syntax.
Any ideas?
Thanks
The break condition for your loop is incorrect. Notice that it's checking len(data["rows"]), where data only includes the rows from the most recent request.
Instead, you should be looking at the total number of rows you've collected so far: len(alldata).
bookmark = ''
urlbase = 'https://..../?'
alldata = []

while True:
    url = urlbase  # first request has no bookmark yet
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    # Check `alldata` instead of `data["rows"]`,
    # and set the limit to 200k instead of 200.
    if len(alldata) >= 200000:
        break
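As for the filter, a minimal sketch under an assumption about the data's shape: treating 'pet.type' as a 'type' key nested under a 'pet' object in each row (the question doesn't show the actual JSON), you can keep only the rows you want after collecting them:

wanted = {'Puppies', 'Kittens'}
# Keep only the rows whose nested pet type matches.
filtered = [row for row in alldata
            if row.get('pet', {}).get('type') in wanted]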