I'm attempting to load Excel data into a Pandas DataFrame, push each ip_address from the DataFrame to an API that returns information in JSON format, and then append the results from the JSON back to the original DataFrame. How would I do this, iterating over each row in the DataFrame and appending the results each time?
Initial dataframe:
date | ip | name
date1 | ip1 | name1
date2 | ip2 | name2
json:
{
    "status": "ok",
    "8.8.8.8": {
        "proxy": "yes",
        "type": "VPN",
        "provider": "Google Inc",
        "risk": 66
    }
}
Code:
import json
import urllib.request

import pandas as pd
import requests

df = pd.read_excel(r'test_data.xlsx')

def query_api(ip_address):
    risk_score = None
    vpn_score = None
    api_key = "xxx"
    base_url = "http://falseacc.com/"
    endpoint = f"{base_url}{ip_address}?key={api_key}&risk=1&vpn=1"
    r = requests.get(endpoint)
    if r.status_code not in range(200, 299):
        return None, None
    try:
        with urllib.request.urlopen(endpoint) as url:
            data = json.loads(url.read().decode())
            proxy = data[ip_address]['proxy']
            type = data[ip_address]['type']
            risk = data[ip_address]['risk']
            df2 = pd.DataFrame({"ip": [ip_address],
                                "proxy": [proxy],
                                "type": [type],
                                "risk": [risk]})
            print(df2)
    except:
        print("No data")
Expected output:
Dataframe:
date | ip | name | proxy | type | risk
date1 | ip1 | name1 | proxy1 | type1 | risk1
date2 | ip2 | name2 | proxy2 | type2 | risk2
You can use the pandas Series.apply method to pick each ip from your DataFrame and get the corresponding proxy, type, and risk values from your query_api function, then assign them to the new columns at the end:
import pandas as pd
import requests

df = pd.read_excel(r'test_data.xlsx')

def query_api(ip_address):
    api_key = "xxx"
    base_url = "http://falseacc.com/"
    endpoint = f"{base_url}{ip_address}?key={api_key}&risk=1&vpn=1"
    r = requests.get(endpoint)
    if r.status_code not in range(200, 299):
        return pd.Series([None] * 3)
    try:
        # reuse the requests response instead of fetching the same URL again with urllib
        data = r.json()
        proxy = data[ip_address]['proxy']
        type = data[ip_address]['type']
        risk = data[ip_address]['risk']
        return pd.Series([proxy, type, risk])
    except Exception:
        return pd.Series([None] * 3)

df[['proxy', 'type', 'risk']] = df.ip.apply(query_api)
Have a look at the official docs to see how pandas.Series.apply works.
I would also recommend not using type as a variable name, since it shadows Python's built-in type function.
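For reference, here is a self-contained sketch of the same pattern with the API call stubbed out; the lookup data below is made up purely for illustration:
import pandas as pd

df = pd.DataFrame({
    "date": ["date1", "date2"],
    "ip": ["8.8.8.8", "1.1.1.1"],
    "name": ["name1", "name2"],
})

# Stand-in for the real API: a hard-coded lookup keyed by IP address.
fake_api = {
    "8.8.8.8": {"proxy": "yes", "type": "VPN", "risk": 66},
    "1.1.1.1": {"proxy": "no", "type": None, "risk": 0},
}

def lookup(ip):
    info = fake_api.get(ip, {})
    # Returning a Series makes apply() expand the result into one column per element.
    return pd.Series([info.get("proxy"), info.get("type"), info.get("risk")])

df[["proxy", "type", "risk"]] = df["ip"].apply(lookup)
print(df)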
I am working on processing CDC data received via Kafka topics and loading it into Databricks Delta tables. I have it all working except for a nested JSON string, which does not get loaded when using from_json / spark.read.json.
When I try to fetch the schema of the JSON at level 1 using spark.read.json(df.rdd.map(lambda row: row.value)).schema, the column INPUT_DATA is treated as a plain string. Below are a sample JSON string, the code I tried, and the expected results.
I have many topics to process, and each topic has a different schema, so I would like to process them dynamically. I prefer not to store the schemas, since they may change over time, and I would like my code to handle the changes automatically.
Appreciate any help; I have spent the whole day trying to figure this out. Thanks in advance.
Sample Json with nested tree:
after = {
    "id_transaction": "121",
    "product_id": 25,
    "transaction_dt": 1662076800000000,
    "creation_date": 1662112153959000,
    "product_account": "40012",
    "input_data": "{\"amount\":[{\"type\":\"CASH\",\"amount\":1000.00}],\"currency\":\"USD\",\"coreData\":{\"CustId\":11021,\"Cust_Currency\":\"USD\",\"custCategory\":\"Premium\"},\"context\":{\"authRequired\":false,\"waitForConfirmation\":false,\"productAccount\":\"CA12001\"},\"brandId\":\"TOYO-2201\",\"dealerId\":\"1\",\"operationInfo\":{\"trans_Id\":\"3ED23-89DKS-001AA-2321\",\"transactionDate\":1613420860087},\"ip_address\":null,\"last_executed_step\":\"PURCHASE_ORDER_CREATED\",\"last_result\":\"OK\",\"output_dataholder\":\"{\"DISCOUNT_AMOUNT\":\"0\",\"BONUS_AMOUNT_APPLIED\":\"10000\"}",
    "dealer_id": 1,
    "dealer_currency": "USD",
    "Cust_id": 11021,
    "process_status": "IN_PROGRESS",
    "tot_amount": 10000,
    "validation_result_code": "OK_SAVE_AND_PROCESS",
    "operation": "Create",
    "timestamp_ms": 1675673484042
}
I have created the following script to get all the columns of the JSON structure:
import json

# table_column_schema = {}
json_keys = {}
child_members = []
table_column_schema = {}
column_schema = []
dbname = "mydb"
tbl_name = "tbl_name"

def get_table_keys(dbname):
    table_values_extracted = f"select value from {dbname}.{tbl_name} limit 1"
    cmd_key_pair_data = spark.sql(table_values_extracted)
    jsonkeys = cmd_key_pair_data.collect()[0][0]
    json_keys = json.loads(jsonkeys)
    column_names_as_keys = json_keys["after"].keys()
    value_column_data = json_keys["after"].values()
    column_schema = list(column_names_as_keys)
    for i in value_column_data:
        if "{" in str(i) and "}" in str(i):
            a = json.loads(i)
            for i2 in a.values():
                if str(i2).startswith("{") and str(i2).endswith("}"):
                    column_schema = column_schema + list(i2.keys())
    table_column_schema['temp_table1'] = column_schema
    return 0

get_table_keys("dbname")
The following code processes the JSON and creates a dataframe with all the nested JSON fields as columns:
from pyspark.sql.functions import from_json, to_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType, MapType
import time

dbname = "mydb"
tbl_name = "tbl_name"
start = time.time()

df = spark.sql(f'select value from {dbname}.{tbl_name} limit 2')
tbl_columns = table_column_schema[tbl_name]

data = []
for i in tbl_columns:
    if i == 'input_data':
        # print('FOUND !!!!')
        data.append(StructField(f'{i}', MapType(StringType(), StringType()), True))
    else:
        data.append(StructField(f'{i}', StringType(), True))

schema2 = spark.read.json(df.rdd.map(lambda row: row.value)).schema
print(type(schema2))
df2 = df.withColumn("value", from_json("value", schema2)).select(col('value.after.*'), col('value.op'))
Note: The VALUE is a column in my delta table (bronze layer)
Current dataframe output:
Expected dataframe output:
You can use rdd to get the schema and from_json to read the value column as JSON.
import pyspark.sql.functions as f

schema = spark.read.json(df.rdd.map(lambda r: r.input_data)).schema
df = df.withColumn('input_data', f.from_json('input_data', schema))

new_cols = df.columns + df.select('input_data.*').columns
df = df.select('*', 'input_data.*').toDF(*new_cols).drop('input_data')
df.show(truncate=False)
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
|Cust_id|creation_date |dealer_currency|dealer_id|id_transaction|operation|process_status|product_account|product_id|timestamp_ms |tot_amount|transaction_dt |validation_result_code|amount |brandId |context |coreData |currency|dealerId|ip_address|last_executed_step |last_result|operationInfo |output_dataholder|
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
|11021 |1662112153959000|USD |1 |121 |Create |IN_PROGRESS |40012 |25 |1675673484042|10000 |1662076800000000|OK_SAVE_AND_PROCESS |[{1000.0, CASH}]|TOYO-2201|{false, CA12001, false}|{11021, USD, Premium}|USD |1 |null |PURCHASE_ORDER_CREATED|OK |{3ED23-89DKS-001AA-2321, 1613420860087}|{10000, 0} |
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
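For reference, here is a minimal, self-contained sketch of the same pattern on toy data; the SparkSession setup and the sample row are made up for illustration only:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# A single toy row with a JSON string column standing in for input_data.
df = spark.createDataFrame(
    [("121", '{"currency": "USD", "coreData": {"CustId": 11021, "custCategory": "Premium"}}')],
    ["id_transaction", "input_data"],
)

# Infer the schema from the string column itself, then parse and flatten it.
schema = spark.read.json(df.rdd.map(lambda r: r.input_data)).schema
df = df.withColumn("input_data", f.from_json("input_data", schema))
df.select("id_transaction", "input_data.*").show(truncate=False)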
I am trying to insert data from an S3 bucket into a MySQL RDS instance in AWS using a Lambda function. I have connected to the MySQL endpoint using SQLAlchemy. I want to make a few modifications to the data: I have changed the column names and then reindexed them so that I can map them to the table in the RDS instance. The issue is in the df.columns line. Instead of getting the column names as strings, I am getting them as tuples.
+--------------+----------------------+-------------+-------------+-----------------+
| ('col_a',)   | ('date_timestamp',)  | ('col_b',)  | ('col_c',)  | ('vehicle_id',) |
+--------------+----------------------+-------------+-------------+-----------------+
| 0.180008333  | 2017-09-28T20:36:00Z | -6.1487501  | 38.35       | 1004            |
| 0.809708333  | 2017-06-17T14:16:00Z | 8.189424    | -6.8732784  | NominalValue    |
+--------------+----------------------+-------------+-------------+-----------------+
Below is the code -
from __future__ import print_function
import boto3
import json
import logging
import pymysql
from sqlalchemy import create_engine
from pandas.io import sql
from pandas.io.json import json_normalize
from datetime import datetime

print('Loading function')
s3 = boto3.client('s3')

def getEngine(endpoint):
    engine_ = None
    try:
        engine_ = create_engine(endpoint)
    except Exception as e:
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e
    return engine_

engine = getEngine('mysql+pymysql://username:password@endpoint/database')
configuration = {
    "aTable": {
        "from": ['col_1', 'col_2', 'date_timestamp', 'operator_id'],
        "to": ['date_timestamp', 'operator_id', 'col_1', 'col_2'],
        "sql_table_name": 'sql_table_a'
    },
    "bTable": {
        "from": ['col_a', 'date_timestamp', 'col_b', 'col_c', 'vehicle_id'],
        "to": ['date_timestamp', 'col_a', 'col_b', 'vehicle_id', 'col_c'],
        "sql_table_name": 'sql_table_b'
    }
}
def handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    s3_object_key = event['Records'][0]['s3']['object']['key']
    obj = s3.get_object(Bucket=bucket, Key=s3_object_key)
    data = json.loads(obj['Body'].read())
    for _key in data:
        if not _key in configuration:
            print("No configuration found for {0}".format(_key))
        df = json_normalize(data[str(_key)])
        df.columns=[configuration[_key]['from']]
        #df = df.reindex(indexlist,axis="columns")
        #df['date_timestamp'] = df['date_timestamp'].apply(lambda x: datetime.strptime(x, "%Y-%m-%dT%H:%M:%SZ"))
        df.to_sql(name=configuration[_key]['sql_table_name'], con=engine, if_exists='append', index=False)
        print(df)
    return "Loaded data in RDS"
You should remove the outer [ ] in this line:
df.columns=[configuration[_key]['from']]
The correct code is
df.columns=configuration[_key]['from']
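To see why the brackets produce tuples: with recent pandas versions a nested list assigned to columns is treated as a MultiIndex specification, so every label becomes a one-element tuple. A quick sketch (column names made up) shows the difference:
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=["x", "y"])

df.columns = [["col_a", "col_b"]]   # nested list -> one-level MultiIndex
print(df.columns.tolist())          # [('col_a',), ('col_b',)]

df.columns = ["col_a", "col_b"]     # plain list -> ordinary Index
print(df.columns.tolist())          # ['col_a', 'col_b']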
I am looking for a way to turn the headings of Cucumber's data table sideways, so that the feature file stays readable.
Ordinary way:
| Name | Email | Phone No. | ......... |
| John | i#g.net | 098765644 | ......... |
It can be a very wide data table and I would have to scroll back and forth.
Desired way:
| Name | John |
| Email | i#g.net |
| Phone No. | 098765444 |
.
.
.
There are a few examples in Java and Ruby, but I am working in Python.
I have tried many different things, like numpy.transpose() and converting the table to a list, but it doesn't work because the data table's format is:
[<Row['Name','John'],...]
You can implement this behaviour quite simply yourself, here is my version:
def tabledict(table, defaults, aliases={}):
    """
    Converts a behave context.table to a dictionary.
    Throws NotImplementedError if the table references an unknown key.

    defaults should contain a dictionary with the (surprise) default values.
    aliases makes it possible to map alternative names to the keys of the defaults.

    All keys of the table will be converted to lowercase; you should make sure that
    your defaults and aliases dictionaries also use lowercase.

    Example:

        Given the book
            | Property                           | Value              |
            | Title                              | The Tragedy of Man |
            | Author                             | Madach, Imre       |
            | International Standard Book Number | 9631527395         |

        defaults = { "title": "Untitled", "author": "Anonymous", "isbn": None, "publisher": None }
        aliases = { "international standard book number": "isbn" }
        givenBook = tabledict(context.table, defaults, aliases)

        will give you:

        givenBook == {
            "title": "The Tragedy of Man",
            "author": "Madach, Imre",
            "isbn": 9631527395,
            "publisher": None
        }
    """
    initParams = defaults.copy()
    validKeys = list(aliases.keys()) + list(defaults.keys())
    for row in table:
        name, value = row[0].lower(), row[1]
        if name not in validKeys:
            raise NotImplementedError(u'%s property is not supported.' % name)
        if name in aliases:
            name = aliases[name]
        initParams[name] = value
    return initParams
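For instance, a hypothetical step implementation (the step text, defaults, and aliases below are only illustrative) could use it like this:
from behave import given

@given("the book")
def step_given_the_book(context):
    defaults = {"title": "Untitled", "author": "Anonymous", "isbn": None, "publisher": None}
    aliases = {"international standard book number": "isbn"}
    # tabledict is the helper defined above; table keys are matched case-insensitively.
    context.book = tabledict(context.table, defaults, aliases)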
This doesn't look like it's related to numpy.
Pivoting a list of lists is usually done with zip(*the_list).
The following returns a pivoted behave table:
import unittest

from behave.model import Table

class TurnTable(unittest.TestCase):

    def test_transpose(self):
        table = Table(
            ['Name', 'John', 'Mary'],
            rows=[
                ['Email', "john#example.com", "mary#example.com"],
                ['Phone', "0123456789", "9876543210"],
            ])

        aggregate = [table.headings[:]]
        aggregate.extend(table.rows)
        pivoted = list(zip(*aggregate))

        self.assertListEqual(pivoted,
            [('Name', 'Email', 'Phone'),
             ('John', 'john#example.com', '0123456789'),
             ('Mary', 'mary#example.com', '9876543210')])

        pivoted_table = Table(
            pivoted[0],
            rows=pivoted[1:])
        mary = pivoted_table.rows[1]
        self.assertEqual(mary['Name'], 'Mary')
        self.assertEqual(mary['Phone'], '9876543210')
You can also have a look at https://pypi.python.org/pypi/pivottable.
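If you prefer to do the pivot directly inside a step, a small sketch along these lines should work (the step text and field names are made up):
from behave import given

@given("these customers")
def step_given_these_customers(context):
    # First column holds the field names, remaining columns hold one record each.
    rows = [list(context.table.headings)] + [list(r.cells) for r in context.table.rows]
    headings, *records = zip(*rows)
    context.customers = [dict(zip(headings, rec)) for rec in records]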
I am trying to send an ASCII table by email, but the table I receive has an unexpected format.
The format shown in my Python script is:
from tabulate import tabulate

message_launch = ['value1', 'value2', 'value3', 'value4', 'value5', 'value6']
headers = ['data_a', 'data_b', 'data_c', 'data_d', 'data_e', 'data_f']

t = tabulate([message_launch], headers=headers, missingval='?', stralign='center', tablefmt='grid').encode('utf-8')
(Pdb) type(t)
<type 'str'>
+------------+----------+------------+----------+----------------+--------------+
| data_a | data_b | data_c | data_d | data_e | data_f |
+============+==========+============+==========+================+==============+
| value1 | value2 | value3 | value4 | value5 | value6 |
+------------+----------+------------+----------+----------------+--------------+
The table I receive in the email is garbled.
How can I send the table by email so that it keeps the correct formatting?
Most email clients support html rendering. I think that would be the easiest way.
Pass 'html' to tablefmt:
t = tabulate([message_launch], headers=headers, missingval='?', stralign='center', tablefmt='html').encode('utf-8')
Then send t as html with whatever email library you are using.
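For example, a minimal sketch with the standard library (the SMTP host and addresses below are placeholders, not from the original post):
import smtplib
from email.mime.text import MIMEText
from tabulate import tabulate

rows = [["value1", "value2", "value3", "value4", "value5", "value6"]]
headers = ["data_a", "data_b", "data_c", "data_d", "data_e", "data_f"]
html = tabulate(rows, headers=headers, missingval="?", stralign="center", tablefmt="html")

msg = MIMEText(html, "html")  # tell the client to render the body as HTML
msg["Subject"] = "Report"
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"

with smtplib.SMTP("localhost") as smtp:
    smtp.send_message(msg)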
I want to convert CSV to JSON in Python. I was able to convert simple CSV files to JSON, but I am not able to join two CSV files into one nested JSON.
emp.csv:
empid | empname | empemail
e123 | adam | adam#gmail.com
e124 | steve | steve#gmail.com
e125 | brian | brain#yahoo.com
e126 | mark | mark#msn.com
items.csv:
empid | itemid | itemname | itemqty
e123 | itm128 | glass | 25
e124 | itm130 | bowl | 15
e123 | itm116 | book | 50
e126 | itm118 | plate | 10
e126 | itm128 | glass | 15
e125 | itm132 | pen | 10
The output should look like:
[{
    "empid": "e123",
    "empname": "adam",
    "empemail": "adam#gmail.com",
    "items": [{
        "itemid": "itm128",
        "itmname": "glass",
        "itemqty": 25
    }, {
        "itemid": "itm116",
        "itmname": "book",
        "itemqty": 50
    }]
},
and similar for others]
The code that I have written:
import csv
import json

empcsvfile = open('emp.csv', 'r')
jsonfile = open('datamodel.json', 'w')
itemcsvfile = open('items.csv', 'r')

empfieldnames = ("empid", "name", "phone", "email")
itemfieldnames = ("empid", "itemid", "itemname", "itemdesc", "itemqty")

empreader = csv.DictReader(empcsvfile, empfieldnames)
itemreader = csv.DictReader(itemcsvfile, itemfieldnames)

output = []
empcount = 0
for emprow in empreader:
    output.append(emprow)
    for itemrow in itemreader:
        if itemrow["empid"] == emprow["empid"]:
            output.append(itemrow)
    empcount = empcount + 1

print output
json.dump(output, jsonfile, sort_keys=True)
and it does not work.
Help needed, thanks.
Okay, so you have a few problems. The first is that you need to specify the delimiter for your CSV file. You're using the | character and by default python is probably going to expect ,. So you need to do this:
empreader = csv.DictReader( empcsvfile, empfieldnames, delimiter='|')
Second, you aren't appending the items to the employee dictionary. You probably should create a key called 'items' on each employee dictionary object and append the items to that list. Like this:
for emprow in empreader:
    emprow['items'] = []  # create a list to hold items for this employee
    ...
    for itemrow in itemreader:
        ...
        emprow['items'].append(itemrow)  # append an item for this employee
Third, each time you loop through an employee, you need to go back to the top of the item csv file. You have to realize that once python reads to the bottom of a file it won't just go back to the top of it on the next loop. You have to tell it to do that. Right now, your code reads through the item.csv file after the first employee is processed then stays there at the bottom of the file for all the other employees. You have to use seek(0) to tell it to go back to the top of the file for each employee.
for emprow in empreader:
    emprow['items'] = []
    output.append(emprow)
    itemcsvfile.seek(0)
    for itemrow in itemreader:
        if itemrow["empid"] == emprow["empid"]:
            emprow['items'].append(itemrow)
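For reference, here is a compact sketch of the whole join with the items read into a list up front, which avoids the seek(0) dance; the inline data and the Python 3 syntax are illustrative, not from the original post:
import csv
import io
import json

emp_csv = "empid|empname|empemail\ne123|adam|adam#gmail.com\ne124|steve|steve#gmail.com\n"
item_csv = "empid|itemid|itemname|itemqty\ne123|itm128|glass|25\ne123|itm116|book|50\ne124|itm130|bowl|15\n"

employees = list(csv.DictReader(io.StringIO(emp_csv), delimiter="|"))
items = list(csv.DictReader(io.StringIO(item_csv), delimiter="|"))

# Attach each employee's items as a nested list.
for emp in employees:
    emp["items"] = [item for item in items if item["empid"] == emp["empid"]]

print(json.dumps(employees, indent=2))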
The field names in your code do not match the columns in the CSV files:
empid | empname | empemail
empfieldnames = ("empid","name","phone","email")
empid | itemid | itemname | itemqty
itemfieldnames = ("empid","itemid","itemname","itemdesc","itemqty")
Also, CSV files normally use , rather than | as the delimiter.
What's more, JSON uses double quotes ("), not single quotes (').
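On that last point: printing a Python dict gives single quotes, which is not valid JSON, whereas json.dump/json.dumps always writes double quotes. A quick illustration:
import json

record = {"empid": "e123", "items": [{"itemid": "itm128", "itemqty": 25}]}
print(record)              # {'empid': 'e123', ...} -- single quotes, not valid JSON
print(json.dumps(record))  # {"empid": "e123", ...} -- valid JSON with double quotes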