What is the data format returned by the AdWords API TargetingIdeaPage service?

When I query the AdWords API to get search volume data and trends through their TargetingIdeaSelector using the Python client library, the returned data looks like this:
(TargetingIdeaPage){
   totalNumEntries = 1
   entries[] =
      (TargetingIdea){
         data[] =
            (Type_AttributeMapEntry){
               key = "KEYWORD_TEXT"
               value =
                  (StringAttribute){
                     Attribute.Type = "StringAttribute"
                     value = "keyword phrase"
                  }
            },
            (Type_AttributeMapEntry){
               key = "TARGETED_MONTHLY_SEARCHES"
               value =
                  (MonthlySearchVolumeAttribute){
                     Attribute.Type = "MonthlySearchVolumeAttribute"
                     value[] =
                        (MonthlySearchVolume){
                           year = 2016
                           month = 2
                           count = 2900
                        },
                        ...
                        (MonthlySearchVolume){
                           year = 2015
                           month = 3
                           count = 2900
                        },
                  }
            },
      },
}
This isn't JSON and appears to just be a messy Python list. What's the easiest way to flatten the monthly data into a Pandas dataframe with a structure like this?
Keyword        | Year | Month | Count
keyword phrase | 2016 | 2     | 10

The output is a suds object. I found that this code does the trick:
import suds.sudsobject as sudsobject
import pandas as pd

a = [sudsobject.asdict(x) for x in output]  # convert each suds object into a plain dict
df = pd.DataFrame(a)
Addendum: this was once correct, but newer versions of the API (I tested v201802) now return zeep objects. However, zeep.helpers.serialize_object should do the same trick.
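A minimal sketch of that conversion for the zeep-based versions, assuming output is the same list of entries as above:

from zeep.helpers import serialize_object
import pandas as pd

# serialize_object recursively turns zeep objects into OrderedDicts
a = [serialize_object(x) for x in output]
df = pd.DataFrame(a)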

Here's the complete code that I used to query the TargetingIdeaSelector with requestType STATS, along with the method I used to parse the data into a usable dataframe. Note the section starting "Parse results to pandas dataframe": it takes the output shown in the question above and converts it to a dataframe. Probably not the fastest or best approach, but it works! Tested with Python 2.7.
"""This code pulls trends for a set of keywords, and parses into a dataframe.
The LoadFromStorage method is pulling credentials and properties from a
"googleads.yaml" file. By default, it looks for this file in your home
directory. For more information, see the "Caching authentication information"
section of our README.
"""
from googleads import adwords
import pandas as pd
adwords_client = adwords.AdWordsClient.LoadFromStorage()
PAGE_SIZE = 10
# Initialize appropriate service.
targeting_idea_service = adwords_client.GetService(
'TargetingIdeaService', version='v201601')
# Construct selector object and retrieve related keywords.
offset = 0
stats_selector = {
'searchParameters': [
{
'xsi_type': 'RelatedToQuerySearchParameter',
'queries': ['donald trump', 'bernie sanders']
},
{
# Language setting (optional).
# The ID can be found in the documentation:
# https://developers.google.com/adwords/api/docs/appendix/languagecodes
'xsi_type': 'LanguageSearchParameter',
'languages': [{'id': '1000'}],
},
{
# Location setting
'xsi_type': 'LocationSearchParameter',
'locations': [{'id': '1027363'}] # Burlington,Vermont
}
],
'ideaType': 'KEYWORD',
'requestType': 'STATS',
'requestedAttributeTypes': ['KEYWORD_TEXT', 'TARGETED_MONTHLY_SEARCHES'],
'paging': {
'startIndex': str(offset),
'numberResults': str(PAGE_SIZE)
}
}
stats_page = targeting_idea_service.get(stats_selector)
##########################################################################
# Parse results to pandas dataframe
stats_pd = pd.DataFrame()
if 'entries' in stats_page:
for stats_result in stats_page['entries']:
stats_attributes = {}
for stats_attribute in stats_result['data']:
#print (stats_attribute)
if stats_attribute['key'] == 'KEYWORD_TEXT':
kt = stats_attribute['value']['value']
else:
for i, val in enumerate(stats_attribute['value'][1]):
data = {'keyword': kt,
'year': val['year'],
'month': val['month'],
'count': val['count']}
data = pd.DataFrame(data, index = [i])
stats_pd = stats_pd.append(data, ignore_index=True)
print(stats_pd)
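As an aside, growing a dataframe row by row is slow, and DataFrame.append has since been removed from pandas. On recent versions, a sketch like this, collecting plain dicts first and building the frame once, should give the same result, assuming the same stats_page structure as above:

rows = []
if 'entries' in stats_page:
    for stats_result in stats_page['entries']:
        kt = None
        for stats_attribute in stats_result['data']:
            if stats_attribute['key'] == 'KEYWORD_TEXT':
                kt = stats_attribute['value']['value']
            else:
                for val in stats_attribute['value'][1]:
                    rows.append({'keyword': kt,
                                 'year': val['year'],
                                 'month': val['month'],
                                 'count': val['count']})

stats_pd = pd.DataFrame(rows)
print(stats_pd)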

Related

How do I access specific data in a nested JSON file with Python and Pandas

I am still a newbie with Python and working on my first REST API. I have a JSON file that has a few levels. When I create the data frame with pandas, no matter what I try I cannot access the level I need.
The API is built with Flask and has the correct parameters for the book, chapter and verse.
Below is a small example of the JSON data.
{
    "book": "Python",
    "chapters": [
        {
            "chapter": "1",
            "verses": [
                {
                    "verse": "1",
                    "text": "Testing"
                },
                {
                    "verse": "2",
                    "text": "Testing 2"
                }
            ]
        }
    ]
}
Here is my code:
@app.route("/api/v1/<book>/<chapter>/<verse>/")
def api(book, chapter, verse):
    book = book.replace(" ", "").title()
    df = pd.read_json(f"Python/{book}.json")
    filt = (df['chapters']['chapter'] == chapter) & (df['chapters']['verses']['verse'] == verse)
    text = df.loc[filt].to_json()
    result_dictionary = {'Book': book, 'Chapter': chapter, "Verse": verse, "Text": text}
    return result_dictionary
Here is the error I am getting:
KeyError: 'chapter'
I have tried normalizing the data, using df.loc to filter and just trying to access the data directly.
I expect the API endpoint to allow the user to supply the book, chapter, and verse as arguments, and then return the text for the position given by those parameters.
You can first create a dataframe of the JSON and then query it.
import json
import pandas as pd

def api(book, chapter, verse):
    # Read the JSON file
    with open(f"Python/{book}.json", "r") as f:
        data = json.load(f)

    # Convert it into a DataFrame
    df = pd.json_normalize(data, record_path=["chapters", "verses"], meta=["book", ["chapters", "chapter"]])
    df.columns = ["Verse", "Text", "Book", "Chapter"]  # rename columns

    # Query the required content
    query = f"Book == '{book}' and Chapter == '{chapter}' and Verse == '{verse}'"
    result = df.query(query).to_dict(orient="records")[0]
    return result
Here df would look like this after json_normalize:
  Verse       Text    Book Chapter
0     1    Testing  Python       1
1     2  Testing 2  Python       1
2     1    Testing  Python       2
3     2  Testing 2  Python       2
And result is:
{'Verse': '2', 'Text': 'Testing 2', 'Book': 'Python', 'Chapter': '1'}
You are trying to access a list in a dict with a dict key?
filt = (df['chapters'][0]['chapter'] == "chapter") & (df['chapters'][0]['verses'][0]['verse'] == "verse")
will get a result, but df.loc[filt] requires a list of (boolean) filters, and the above only generates a single True or False, so you can't filter with that.
You can filter like:
df.from_dict(df['chapters'][0]['verses']).query("verse =='1'")
One of the issues here is that "chapters" is a list
"chapters": [
This is why ["chapters"]["chapter"] won't work as you intend.
If you're new to this, it may be helpful to "normalize" the data yourself:
import json

with open("book.json") as f:
    book = json.load(f)

for chapter in book["chapters"]:
    for verse in chapter["verses"]:
        row = book["book"], chapter["chapter"], verse["verse"], verse["text"]
        print(repr(row))
('Python', '1', '1', 'Testing')
('Python', '1', '2', 'Testing 2')
It is possible to pass this to pd.DataFrame()
df = pd.DataFrame(
    ([book["book"], chapter["chapter"], verse["verse"], verse["text"]]
     for chapter in book["chapters"]
     for verse in chapter["verses"]),
    columns=["Book", "Chapter", "Verse", "Text"]
)
     Book Chapter Verse       Text
0  Python       1     1    Testing
1  Python       1     2  Testing 2
Although it's not clear if you need a dataframe here at all.
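For instance, if all you need is one verse's text, a plain nested loop over the parsed JSON does the job; a minimal sketch using the book dict loaded above:

def find_text(book, chapter, verse):
    # walk the nested structure directly, no dataframe involved
    for ch in book["chapters"]:
        if ch["chapter"] == chapter:
            for v in ch["verses"]:
                if v["verse"] == verse:
                    return v["text"]
    return None  # no match found

print(find_text(book, "1", "2"))  # Testing 2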

How to calculate EC2 instance price using Python script [duplicate]

Now that AWS has a Pricing API, how could one use Boto3 to fetch the current hourly price for a given on-demand EC2 instance type (e.g. t2.micro), region (e.g. eu-west-1), and operating system (e.g. Linux)? I only want the price returned. Based on my understanding, those pieces of information should be enough to filter down to a single result.
However, all the examples I've seen fetch huge lists of data from the API that would have to be post-processed in order to get what I want. I would like to filter the data on the API side, before it's being returned.
Here is the solution I ended up with. Using Boto3's own Pricing API with a filter for the instance type, region and operating system. The API still returns a lot of information, so I needed to do a bit of post-processing.
import boto3
import json
from pkg_resources import resource_filename

# Search product filter. This will reduce the amount of data returned by the
# get_products function of the Pricing API
FLT = '[{{"Field": "tenancy", "Value": "shared", "Type": "TERM_MATCH"}},'\
      '{{"Field": "operatingSystem", "Value": "{o}", "Type": "TERM_MATCH"}},'\
      '{{"Field": "preInstalledSw", "Value": "NA", "Type": "TERM_MATCH"}},'\
      '{{"Field": "instanceType", "Value": "{t}", "Type": "TERM_MATCH"}},'\
      '{{"Field": "location", "Value": "{r}", "Type": "TERM_MATCH"}},'\
      '{{"Field": "capacitystatus", "Value": "Used", "Type": "TERM_MATCH"}}]'

# Get current AWS price for an on-demand instance
def get_price(region, instance, os):
    f = FLT.format(r=region, t=instance, o=os)
    data = client.get_products(ServiceCode='AmazonEC2', Filters=json.loads(f))
    od = json.loads(data['PriceList'][0])['terms']['OnDemand']
    id1 = list(od)[0]
    id2 = list(od[id1]['priceDimensions'])[0]
    return od[id1]['priceDimensions'][id2]['pricePerUnit']['USD']

# Translate region code to region name. Even though the API data contains
# regionCode field, it will not return accurate data. However using the location
# field will, but then we need to translate the region code into a region name.
# You could skip this by using the region names in your code directly, but most
# other APIs are using the region code.
def get_region_name(region_code):
    default_region = 'US East (N. Virginia)'
    endpoint_file = resource_filename('botocore', 'data/endpoints.json')
    try:
        with open(endpoint_file, 'r') as f:
            data = json.load(f)
        # Botocore is using Europe while Pricing API using EU...sigh...
        return data['partitions'][0]['regions'][region_code]['description'].replace('Europe', 'EU')
    except IOError:
        return default_region

# Use AWS Pricing API through Boto3
# API only has us-east-1 and ap-south-1 as valid endpoints.
# It doesn't have any impact on your selected region for your instance.
client = boto3.client('pricing', region_name='us-east-1')

# Get current price for a given instance, region and os
price = get_price(get_region_name('eu-west-1'), 't3.micro', 'Linux')
print(price)
This example outputs 0.0114000000 (hourly price in USD) fairly quickly. (This number was verified to match the current value listed here at the date of this writing)
If you don't like the native function, then look at Lyft's awspricing library for Python. Here's an example:
import awspricing

ec2_offer = awspricing.offer('AmazonEC2')

p = ec2_offer.ondemand_hourly(
    't2.micro',
    operating_system='Linux',
    region='eu-west-1'
)
print(p)  # 0.0126
I'd recommend enabling caching (see AWSPRICING_USE_CACHE); otherwise it will be slow.
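For example, a minimal sketch of switching the cache on through that environment variable (the exact caching behavior and defaults are documented in the library's README; treat the details here as an assumption):

import os

# Assumption: awspricing reads this variable when the offer file is loaded
os.environ['AWSPRICING_USE_CACHE'] = '1'

import awspricing

ec2_offer = awspricing.offer('AmazonEC2')  # subsequent runs can reuse the cached offer data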
I have updated toringe's solution a bit to handle different key errors
def price_information(self, instance_type, os, region):
    # Search product filter
    FLT = '[{{"Field": "operatingSystem", "Value": "{o}", "Type": "TERM_MATCH"}},' \
          '{{"Field": "instanceType", "Value": "{t}", "Type": "TERM_MATCH"}}]'
    f = FLT.format(t=instance_type, o=os)
    try:
        data = self.pricing_client.get_products(ServiceCode='AmazonEC2', Filters=json.loads(f))
        instance_price = 0
        for price in data['PriceList']:
            try:
                # Each PriceList entry is a JSON string
                price = json.loads(price)
                first_id = list(price['terms']['OnDemand'].keys())[0]
                price_data = price['terms']['OnDemand'][first_id]
                second_id = list(price_data['priceDimensions'].keys())[0]
                instance_price = price_data['priceDimensions'][second_id]['pricePerUnit']['USD']
                if float(instance_price) > 0:
                    break
            except Exception as e:
                print(e)
        print(instance_price)
        return instance_price
    except Exception as e:
        print(e)
        return 0
Based on other answers, here's some code that returns the On Demand prices for all instance types (or for a given instance type, if you add the search filter), gets some relevant attributes for each instance type, and pretty-prints the data.
It assumes pricing is the AWS Pricing client.
import json

def ec2_get_ondemand_prices(Filters):
    data = []
    reply = pricing.get_products(ServiceCode='AmazonEC2', Filters=Filters, MaxResults=100)
    data.extend([json.loads(r) for r in reply['PriceList']])
    while 'NextToken' in reply.keys():
        reply = pricing.get_products(ServiceCode='AmazonEC2', Filters=Filters, MaxResults=100, NextToken=reply['NextToken'])
        data.extend([json.loads(r) for r in reply['PriceList']])
        print(f"\x1b[33mGET \x1b[0m{len(reply['PriceList']):3} \x1b[94m{len(data):4}\x1b[0m")

    instances = {}
    for d in data:
        attr = d['product']['attributes']
        type = attr['instanceType']
        if type in instances: continue  # skip duplicate instance types

        region = attr.get('location', '')
        clock = attr.get('clockSpeed', '')
        type = attr.get('instanceType', '')
        market = attr.get('marketoption', '')
        ram = attr.get('memory', '')
        os = attr.get('operatingSystem', '')
        arch = attr.get('processorArchitecture', '')
        region = attr.get('regionCode', '')
        storage = attr.get('storage', '')
        tenancy = attr.get('tenancy', '')
        usage = attr.get('usagetype', '')
        vcpu = attr.get('vcpu', '')

        terms = d['terms']
        ondemand = terms['OnDemand']
        ins = ondemand[next(iter(ondemand))]
        pricedim = ins['priceDimensions']
        price = pricedim[next(iter(pricedim))]
        desc = price['description']
        p = float(price['pricePerUnit']['USD'])
        unit = price['unit'].lower()

        if 'GiB' not in ram: print('\x1b[31mWARN\x1b[0m')
        if 'hrs' != unit: print('\x1b[31mWARN\x1b[0m')
        if p == 0.: continue
        instances[type] = {'type': type, 'market': market, 'vcpu': vcpu, 'ram': float(ram.replace('GiB', '')),
                           'ondm': p, 'unit': unit, 'terms': list(terms.keys()), 'desc': desc}

    instances = {k: v for k, v in sorted(instances.items(), key=lambda e: e[1]['ondm'])}
    for ins in instances.values():
        p = ins['ondm']
        print(f"{ins['type']:32} {ins['market'].lower()}\x1b[91m: \x1b[0m{ins['vcpu']:3} vcores\x1b[91m, \x1b[0m{ins['ram']:7.1f} GB, \x1b[0m{p:7.4f} \x1b[95m$/h\x1b[0m, \x1b[0m\x1b[0m{p*720:8,.1f} \x1b[95m$/m\x1b[0m, \x1b[0m\x1b[0m{p*720*12:7,.0f} \x1b[95m$/y\x1b[0m, \x1b[0m{ins['unit']}\x1b[91m, \x1b[0m{ins['terms']}\x1b[0m")
        # print(desc, sep='\n')
    print(f'\x1b[92m{len(instances)}\x1b[0m')

flt = [
    # {'Field': 'instanceType', 'Value': 't4g.nano', 'Type': 'TERM_MATCH'},  # enable this filter to select only 1 instance type
    {'Field': 'regionCode', 'Value': 'us-east-2', 'Type': 'TERM_MATCH'},  # alternative notation?: {'Field': 'location', 'Value': 'US East (Ohio)', 'Type': 'TERM_MATCH'},
    {'Field': 'operatingSystem', 'Value': 'Linux', 'Type': 'TERM_MATCH'},
    {'Field': 'tenancy', 'Value': 'shared', 'Type': 'TERM_MATCH'},
    {'Field': 'capacitystatus', 'Value': 'Used', 'Type': 'TERM_MATCH'},
]

ec2_get_ondemand_prices(Filters=flt)

Creating multiple dataframes using a loop or function

I'm trying to extract the hash rate for 3 cryptocurrencies, and I have attached my code below. I want to pass three URLs and in return get three different dictionaries holding the values. I'm stuck and don't understand how I should go about it. I have tried using loops, but it is not working out for me.
url = {'Bitcoin': 'https://bitinfocharts.com/comparison/bitcoin-hashrate.html#3y',
       'Ethereum': 'https://bitinfocharts.com/comparison/ethereum-hashrate.html#3y',
       'Litecoin': 'https://bitinfocharts.com/comparison/litecoin-hashrate.html'}

for ele in url:
    #### requesting the page and extracting the script which has date and values
    session = requests.Session()
    page = session.get(ele[i])
    soup = BeautifulSoup(page.content, 'html.parser')
    values = str(soup.find_all('script')[4])
    values = values.split('d = new Dygraph(document.getElementById("container"),')[1]
    #create an empty dict to append date and hashrates
    dict([("crypto_1 %s" % i, []) for i in range(len(url))])
    #run a loop over all the dates and adding to dictionary
    for i in range(values.count('new Date')):
        date = values.split('new Date("')[i+1].split('"')[0]
        value = values.split('"),')[i+1].split(']')[0]
        dict([("crypto_1 %s" % i)[date] = value
The following example shows how to get the data from all 3 URLs and create a dataframe/dictionary from it:
import re
import requests
import pandas as pd

url = {
    "Bitcoin": "https://bitinfocharts.com/comparison/bitcoin-hashrate.html#3y",
    "Ethereum": "https://bitinfocharts.com/comparison/ethereum-hashrate.html#3y",
    "Litecoin": "https://bitinfocharts.com/comparison/litecoin-hashrate.html",
}

data = []
for name, u in url.items():
    html_doc = requests.get(u).text
    for date, hash_rate in re.findall(
        r'\[new Date\("(.*?)"\),(.*?)\]', html_doc
    ):
        data.append(
            {
                "Name": name,
                "Date": date,
                "Hash Rate": float("nan")
                if hash_rate == "null"
                else float(hash_rate),
            }
        )

df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])

# here save df to CSV

# this will create a dictionary, where the keys are crypto names and values
# are dicts with keys Date/HashRate:
out = {}
for name, g in df.groupby("Name"):
    out[name] = g[["Date", "Hash Rate"]].to_dict(orient="list")

print(out)
Prints:
{
    "Bitcoin": {
        "Date": [
            Timestamp("2009-01-03 00:00:00"),
            Timestamp("2009-01-04 00:00:00"),
            Timestamp("2009-01-05 00:00:00"),
...
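If you also want the CSV that the "# here save df to CSV" comment alludes to, one file per coin is a short loop (a sketch; the file names are just an assumption):

for name, g in df.groupby("Name"):
    g.to_csv(f"{name.lower()}_hashrate.csv", index=False)  # e.g. bitcoin_hashrate.csv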

Find coordinates in Wikipedia pages iterating over a list

This is probably a simple question, but my experience with for loops is very limited.
I was trying to adapt the solution on this page https://www.mediawiki.org/wiki/API:Geosearch to some simple examples that I have, but the result is not what I expected.
For example:
I have this simple data frame:
df = pd.DataFrame({'City': ['Sesimbra', 'Ciudad Juárez', '31100 Treviso', 'Ramada Portugal', 'Olhão'],
                   'Country': ['Portugal', 'México', 'Itália', 'Portugal', 'Portugal']})
I created a list based on cities:
lista_cidades = list(df['City'])
and I would like to iterate over this list to get the coordinates (decimal, preferably).
So far I have tried this approach:
import requests

lng_dict = {}
lat_dict = {}

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "format": "json",
    "titles": [lista_cidades],
    "prop": "coordinates"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()
PAGES = DATA['query']['pages']

for i in range(len(lista_cidades)):
    for k, v in PAGES.items():
        try:
            lat_dict[lista_cidades[i]] = str(v['coordinates'][0]['lat'])
            lng_dict[lista_cidades[i]] = str(v['coordinates'][0]['lon'])
        except:
            pass
but it looks like the code doesn't iterate over the list and always returns the same coordinate.
For example, when I call the dictionary with the longitude coordinates, this is what I get:
lng_dict
{'Sesimbra': '-7.84166667',
'Ciudad Juárez': '-7.84166667',
'31100 Treviso': '-7.84166667',
'Ramada Portugal': '-7.84166667',
'Olhão': '-7.84166667'}
What should I do to solve this?
Thanks in advance
I think the query returns only one result; it will take only the last city from your list (in your case, the "Olhão" coordinates).
You can check this by logging the DATA content.
I do not know the Wikipedia API well, but either your call lacks a parameter (the documentation should give you the information) or you have to call the API for each city, like:
import pandas as pd
import requests

df = pd.DataFrame({'City': ['Sesimbra', 'Ciudad Juárez', '31100 Treviso', 'Ramada Portugal', 'Olhão'],
                   'Country': ['Portugal', 'México', 'Itália', 'Portugal', 'Portugal']})
lista_cidades = list(df['City'])

lng_dict = {}
lat_dict = {}

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"

for city in lista_cidades:
    PARAMS = {
        "action": "query",
        "format": "json",
        "titles": city,
        "prop": "coordinates"
    }
    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    PAGES = DATA['query']['pages']
    for k, v in PAGES.items():
        try:
            lat_dict[city] = str(v['coordinates'][0]['lat'])
            lng_dict[city] = str(v['coordinates'][0]['lon'])
        except:
            pass
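As a side note, the MediaWiki API does accept several titles in one call when they are joined with a pipe character (up to 50 per request), so a single batched query is also possible; a sketch, matching results back by page title (the API may normalize titles, so the keys can differ slightly from your originals):

PARAMS = {
    "action": "query",
    "format": "json",
    "titles": "|".join(lista_cidades),  # one request for all cities
    "prop": "coordinates"
}
R = S.get(url=URL, params=PARAMS)
PAGES = R.json()['query']['pages']

for v in PAGES.values():
    # pages without coordinates simply lack the key
    if 'coordinates' in v:
        lat_dict[v['title']] = str(v['coordinates'][0]['lat'])
        lng_dict[v['title']] = str(v['coordinates'][0]['lon'])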

Extending Google's HelloAnalytics Python script to pull out the data in a CSV [duplicate]

I am trying to build an API for my Google Analytics account to export the data into a CSV. I have the authentication code working, but I am now struggling with printing the data in the format I would like.
For the time being, I am only pulling dimension country, dimension city, and metric session. (However these will change when I get this working.) Right now, it prints:
Date Range(0)
ga:sessions: 2
ga:country:United States
ga:city:Los Angeles
...
However, I would like to have this in a line:
date Range | sessions | country | city
0          | 2        | USA     | Los Angeles
...
What code in Python do I need to use? Below is what I have.
def initialize_analyticsreporting():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.RawDescriptionHelpFormatter,
        parents=[tools.argparser])
    flags = parser.parse_args([])
    http = credentials.authorize(httplib2.Http())
    service = build('analytics', 'v4', http=http,
                    discoveryServiceUrl=('https://analyticsreporting.googleapis.com/$discovery/rest'))

def get_report(service):
    return service.reports().batchGet(
        body={
            'reportRequests': [
                {
                    "viewId": "ga:52783868",
                    "dimensions": [{"name": "ga:country"},
                                   {"name": "ga:city"}],
                    "metrics": [{"expression": "ga:sessions"}],
                    "dateRanges": [{"startDate": "2017-04-10",
                                    "endDate": "2017-04-12"}]
                }
            ]
        }
    ).execute()

countries = []
cities = []
val = []

def print_reponse(response):
    for report in response.get('reports', []):
        columnHeader = report.get('columnHeader', {})
        dimensionHeaders = columnHeader.get('columnHeader', [])
        metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
        rows = report.get('data', {}).get('rows', [])
        for row in rows:
            dimensions = row.get('dimensions', [])
            dateRangeValues = row.get('metrics', [])
            for header, dimension in zip(dimensionHeaders, dimensions):
                print(header + ':' + dimension)
            for i, values in enumerate(dateRangeValues):
                for metricHeader, value in zip(metricHeaders, values.get('values')):
                    print(metricHeader.get('name') + ':' + value)

def main():
    analytics = initialize_analyticsreporting()
    response = get_report(service)
    print_reponse(response)

if __name__ == '__main__':
    main()
Following lrnzcig's suggestion, we can parse the data with pandas and then export it to a CSV file.
First, import pandas and json_normalize
import pandas as pd
from pandas.io.json import json_normalize
Use this function to parse data
def parse_data(response):
    reports = response['reports'][0]
    columnHeader = reports['columnHeader']['dimensions']
    metricHeader = reports['columnHeader']['metricHeader']['metricHeaderEntries']

    columns = columnHeader
    for metric in metricHeader:
        columns.append(metric['name'])

    data = json_normalize(reports['data']['rows'])
    data_dimensions = pd.DataFrame(data['dimensions'].tolist())
    data_metrics = pd.DataFrame(data['metrics'].tolist())
    data_metrics = data_metrics.applymap(lambda x: x['values'])
    data_metrics = pd.DataFrame(data_metrics[0].tolist())
    result = pd.concat([data_dimensions, data_metrics], axis=1, ignore_index=True)
    result.columns = columns  # label the columns with the dimension and metric names
    return result
The result will look like ...
  ga:country   ga:city  ga:sessions
0  (not set) (not set)           64
1  Argentina (not set)            1
2  Australia  Adelaide            3
3  Australia  Brisbane            9
...
Finally, call to_csv to save the data as a CSV file:
result.to_csv('result.csv')
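If you would rather have the plain column names from the question, you can strip the ga: prefix before saving (a small sketch on top of the result above):

result.columns = [c.replace('ga:', '') for c in result.columns]
result.to_csv('result.csv', index=False)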
