I am using the following code to pull data from the Foursquare API, which works fine. How can I write the JSON output as a table in Databricks? I can't use show/display functions on the raw response.
import json, requests
url = 'https://api.foursquare.com/v2/venues/explore'
params = dict(
    client_id='CLIENT_ID',
    client_secret='CLIENT_SECRET',
    v='20180323',
    ll='40.7243,-74.0018',
    query='coffee',
    limit=1
)
resp = requests.get(url=url, params=params)
data = json.loads(resp.text)
You could read and write the data received as follows:
df = spark.read.json(sc.parallelize([resp.text]))  # wrap the JSON string in an RDD; passing resp.text directly would be treated as a file path
location = 'dbfs:/tmp/test.json'
df.write.json(location)
and then create a table from the file you just wrote:
spark.sql(f'''
CREATE TABLE IF NOT EXISTS foursquare
USING JSON
LOCATION "{location}"
''')
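Once the table exists, you can check it with a plain SQL query and work with the result as a DataFrame (a quick sketch, reusing the table name created above):
# Read the registered table back and inspect the parsed schema and a few rows
check_df = spark.sql("SELECT * FROM foursquare")
check_df.printSchema()
check_df.show(5, truncate=False)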
I'm querying an API with GCP Cloud Functions and would like to write the result to BigQuery. I'm getting this error:
Got unexpected source_format: 'NEWLINE_DELIMITED_JSON'. Currently, only PARQUET and CSV are supported
This is my code:
from google.cloud import bigquery
import pandas as pd
import requests
import datetime
def hello_pubsub(event, context):
    response = requests.get("https://api.openweathermap.org/data/2.5/weather?q=berlin&appid=12345&units=metric&lang=de")
    responseJson = response.json()

    # Creates DataFrame
    df = pd.DataFrame({'datetime': pd.to_datetime(format(datetime.datetime.now())),
                       'name': str(responseJson['name']),
                       'temp': float(responseJson['main']['temp']),
                       'windspeed': float(responseJson['wind']['speed']),
                       'winddeg': int(responseJson['wind']['deg'])
                       }, index=[0])

    project_id = 'myproj'
    client = bigquery.Client(project=project_id)
    dataset_id = 'weather'
    dataset_ref = client.dataset(dataset_id)
    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True
    job_config.write_disposition = "WRITE_APPEND"
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    load_job = client.load_table_from_dataframe(df, dataset_ref.table("weather_de"), job_config=job_config)
What's the best way to do this?
The BigQuery client library reference states that this is intended behavior when loading into a table from a dataframe using load_table_from_dataframe():
By default, this method uses the parquet source format. To override this, supply a value for source_format with the format name. Currently only CSV and PARQUET are supported.
Something you can try is replacing that method with load_table_from_json(), which is also available and uses NEWLINE_DELIMITED_JSON as the source format. This method will not accept a dataframe as input, so I would recommend building a plain JSON object (a list of dicts) from the data you need in the API response. Otherwise, you can convert the dataframe you already created to JSON using pandas' to_json() method.
You can read more about how the BigQuery client works in the reference, and you can also see the supported source formats.
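As an illustration only (not tested), here is a minimal sketch of the load_table_from_json() route, keeping the project, dataset and table names from the question; the function name load_weather_row and the isoformat() timestamp are my own assumptions:
import datetime
import requests
from google.cloud import bigquery

def load_weather_row(event, context):
    response = requests.get("https://api.openweathermap.org/data/2.5/weather?q=berlin&appid=12345&units=metric&lang=de")
    responseJson = response.json()

    # One JSON-serializable row instead of a pandas dataframe
    rows = [{
        "datetime": datetime.datetime.now().isoformat(),
        "name": str(responseJson["name"]),
        "temp": float(responseJson["main"]["temp"]),
        "windspeed": float(responseJson["wind"]["speed"]),
        "winddeg": int(responseJson["wind"]["deg"]),
    }]

    client = bigquery.Client(project="myproj")
    table_ref = client.dataset("weather").table("weather_de")

    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True
    job_config.write_disposition = "WRITE_APPEND"
    # load_table_from_json uses NEWLINE_DELIMITED_JSON internally,
    # so no source_format override is needed here
    load_job = client.load_table_from_json(rows, table_ref, job_config=job_config)
    load_job.result()  # wait for the load job to finish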
I am trying to bulk read data from Salesforce using Python. This creates an output JSON file; however, the file doesn't seem to contain all the data. It has some records but not all of them.
I confirmed the record Id exists in Salesforce but not in the JSON file. If I narrow the WHERE condition to a window right around the missing record's ModifiedDate, the record shows up in the JSON file. I think there is some kind of size limit on the response here, but I can't find anything documented.
Has anyone come across this kind of issue? TIA.
MissingSFData.py
...
sf_object = 'Account'
sf_conn = SalesforceOauthHook(self.sf_conn_id_client, self.sf_conn_id_user).sign_in()
bulk_query = 'select Id,IsDeleted from Account WHERE ModifiedDate >= 2021-06-17T23:10:00+00:00 AND ModifiedDate < 2021-06-21T23:15:00+00:00'
query_results = sf_conn.bulk.__getattr__(sf_object).query(bulk_query)  # bulk.py slightly different from default
...
SalesforceOauthHook.py
import requests

from simple_salesforce.api import Salesforce  # api.py slightly different from default
from airflow.hooks.base_hook import BaseHook
class SalesforceOauthHook(BaseHook):
    ...
    def sign_in(self):
        ...
        url = "https://{}.my.salesforce.com/services/oauth2/token".format(instance)
        payload = "&".join([
            "client_id={}".format(client_id),
            "client_secret={}".format(client_secret),
            "grant_type=password",
            "username={}".format(username),
            "password={}".format(password)
        ])
        headers = {
            'content-type': "application/x-www-form-urlencoded"
        }
        response = requests.request("POST", url, data=payload, headers=headers)
        credentials = response.json()
        sf = Salesforce(instance_url=credentials["instance_url"],
                        session_id=credentials["access_token"],
                        version="47.0")
        return sf
I found that my issue is with bulk.py in simple_salesforce: it was only reading the first batch. Here is the discussion with the solution for reading multiple batches:
https://github.com/simple-salesforce/simple-salesforce/issues/280
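For anyone hitting the same thing: after upgrading to a simple_salesforce release that includes the fix from that issue, the same query call should return records from every batch. A small sanity-check sketch (reusing sf_conn, sf_object and bulk_query from above; comparing against a SOQL COUNT() is just one assumed way to verify it):
# Re-run the bulk query with the patched bulk.py / upgraded simple_salesforce
query_results = sf_conn.bulk.__getattr__(sf_object).query(bulk_query)

# Compare against a plain SOQL COUNT() over the same window to confirm nothing is dropped
count_result = sf_conn.query("SELECT COUNT(Id) total FROM Account "
                             "WHERE ModifiedDate >= 2021-06-17T23:10:00+00:00 "
                             "AND ModifiedDate < 2021-06-21T23:15:00+00:00")
expected = count_result["records"][0]["total"]
print(len(query_results), expected)  # the two numbers should now match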
I'm currently writing a script to find emails based on a domain name using the Hunter.io API.
The thing is that my script returns a JSON with lots of details, and I only want the email address in it.
Here's my code:
"""script to find emails based on the domain name using Hunter.io"""
import requests # To make get and post requests to APIs
import json # To deal with json responses to APIs
from pprint import pprint # To pretty print json info in a more readable format
from pyhunter import PyHunter # Using the hunter module
# Global variables
hunter_api_key = "API KEY"  # API KEY
contacts = []  # list where we'll store our contacts
contact_info = {}  # dictionary where we'll store each contact's information
limit = 1  # limit on the number of email addresses pulled by the request
value = "value"  # <- seems to be the key in Hunter's API that holds the email address found
# TODO: Section A - Ask Hunter to find emails based on the domain name
def get_email_from_hunter(domain_name, limit):
    url = "https://api.hunter.io/v2/domain-search"
    params = {
        "domain": domain_name,
        "limit": limit,
        "api_key": hunter_api_key,
    }
    response = requests.get(url, params=params)
    json_data = response.json()
    email_address = json_data["data"]["emails"]  # <- I have to find which is the right key in order to return only the email address
    # pprint(email_address)
    contact_info["email_address"] = email_address
    contact_info["domain_name"] = domain_name
    pprint(contact_info)
    return contact_info

get_email_from_hunter("intercom.io", 1)
and here's the JSON returned:
JSON example extracted from the documentation (posted as an image)
Thanks in advance for the help provided :)
email_addresses = [item['value'] for item in json_data["data"]["emails"]]
Note: this is not tested, since you posted an image rather than the JSON data as text.
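Plugged into the function from the question, the relevant lines become something like this (also untested, for the same reason; "value" is the key shown in the documentation screenshot):
json_data = response.json()
# each entry in data.emails is an object; the address itself sits under "value"
contact_info["email_addresses"] = [item["value"] for item in json_data["data"]["emails"]]
contact_info["domain_name"] = domain_name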
I have a new CSV file each week in the same format, which I need to append to a BigQuery table using the Python client. I successfully created the table using the first CSV, but I am unsure how to append subsequent CSVs going forward. The only way I have found is the google.cloud.bigquery.client.Client().insert_rows() method. See api link here. This would require me to first read the CSV in as a list of dictionaries. Is there a better way to append data from a CSV to a BigQuery table?
See the simple example below:
# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = "gs://your_bucket/path/your_file.csv"
load_job = client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print("Job finished.")
destination_table = client.get_table(table_ref)
print("Loaded {} rows.".format(destination_table.num_rows))
See more details in the BigQuery documentation.
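If the weekly CSV sits on a local disk rather than in Cloud Storage, a variant using load_table_from_file() with the same job_config should work as well (a sketch; the file path is a placeholder):
# Reuses client, table_ref and job_config from the example above
with open("/path/to/weekly_file.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config
    )
load_job.result()  # waits for the load to complete
print("Loaded {} rows.".format(client.get_table(table_ref).num_rows))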
I managed to import queries into another account. I used the POST endpoint given by Redash, but it really only covers "modifying/replacing" an existing query: https://github.com/getredash/redash/blob/5aa620d1ec7af09c8a1b590fc2a2adf4b6b78faa/redash/handlers/queries.py#L178
So if I want to import a new query, what should I do? I want to create a new query that doesn't exist on my account. I'm looking at https://github.com/getredash/redash/blob/5aa620d1ec7af09c8a1b590fc2a2adf4b6b78faa/redash/handlers/queries.py#L84
The following is the function I made to create new queries when the query_id doesn't exist.
url = path, api = user API key, f = filename, query_id = query_id of the file on the local desktop
import json
import requests

def new_query(url, api, f, query_id):
    headers = {'Authorization': 'Key {}'.format(api), 'Content-Type': 'application/json'}
    path = "{}/api/queries".format(url)
    query_content = get_query_content(f)  # helper that reads the query text from the local file
    query_info = {'query': query_content}
    print(json.dumps(query_info))
    response = requests.post(path, headers=headers, data=json.dumps(query_info))
    print(response.status_code)
I am getting response.status_code 500. Is there anything wrong with my code? How should I fix it?
For future reference :-) here's a python POST that creates a new query:
payload = {
    "query": query,  ## the select query
    "name": "new query name",
    "data_source_id": 1,  ## can be determined from the /api/data_sources end point
    "schedule": None,
    "options": {"parameters": []}
}
res = requests.post(redash_url + '/api/queries',
                    headers={'Authorization': 'Key YOUR KEY'},
                    json=payload)
(solution found thanks to an offline discussion with #JohnDenver)
TL;DR:
...
query_info = {'query':query_content,'data_source_id':<find this number>}
...
Verbose:
I had a similar problem. I checked the Redash source code; it looks for data_source_id. I added data_source_id to my data payload, which worked.
You can find the appropriate data_source_id by looking at the response from a 'get query' call:
import json
import requests

def find_data_source_id(url, query_number, api):
    path = "{}/api/queries/{}".format(url, query_number)
    headers = {'Authorization': 'Key {}'.format(api), 'Content-Type': 'application/json'}
    response = requests.get(path, headers=headers)
    return json.loads(response.text)['data_source_id']
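Putting the two together, the body of new_query() from the question only needs the extra field (a sketch, not tested; existing_query_id is assumed to be the id of any query on the target account that already uses the data source you want):
# inside new_query(), after building query_content
data_source_id = find_data_source_id(url, existing_query_id, api)
query_info = {'query': query_content, 'data_source_id': data_source_id}
response = requests.post(path, headers=headers, data=json.dumps(query_info))
print(response.status_code)  # should return 200 instead of 500 once data_source_id is present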
The official Redash API documentation is pretty lame: it doesn't give any examples for the documented "Common Endpoints", and I had no idea how I should use the API key.
Instead, check out this saviour: https://github.com/damienzeng73/redash-api-client