Remove duplicate ids based on status without pandas - python

I need to write output from an AWS account to an Excel sheet. I am using GraphQL, and jmespath.search to map the expressions and store the results in an Excel sheet. I am facing an issue with duplicate ids getting stored. I am using a filter to merge two columns' values into a single column value such as "active" or "inactive". While storing the data I am getting duplicate ids as well. I need to remove the duplicate id based on "inactive" status and add the active ids alone into the sheet.
For example let us take that response is coming in list format as below.
data = [
    {"id": 1, "deregistered": True, "deactivated": True, "location": True},
    {"id": 1, "deregistered": False, "deactivated": False, "location": True},
    {"id": 2, "deregistered": False, "deactivated": False, "location": True},
]
Now I need to write this into the Excel sheet, removing the duplicate id based on status. I need only the values below from the data, i.e. drop id 1's inactive duplicate and keep only the active record for each id:
output = [{'id': 1, 'status': 'active', 'location': True}, {'id': 2, 'status': 'active', 'location': True}]
How can I achieve this in Python, without pandas, while still using jmespath.search to map the values?
I tried the following, but I can't get the logic right:
import jmespath

results = []  # collect the mapped rows in a list
for val in data:
    loc_enabled = val.get("location")
    if loc_enabled:
        search = """
        {
            id: id,
            status: ((deregistered == `true` || deactivated == `true`) && 'inactive') || 'active',
            location: location
        }"""
        test = jmespath.search(search, val)
        if test:
            results.append(test)
print(results)
I need to know how to handle the deduplication logic here, without pandas, to get the desired result.
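One way to handle this without pandas is to run the jmespath mapping per record, then deduplicate with a plain dict keyed on id, preferring the active row whenever the same id appears more than once. A minimal sketch under those assumptions (the by_id dict and the preference rule are scaffolding added here, not part of the original code):

import jmespath

data = [
    {"id": 1, "deregistered": True, "deactivated": True, "location": True},
    {"id": 1, "deregistered": False, "deactivated": False, "location": True},
    {"id": 2, "deregistered": False, "deactivated": False, "location": True},
]

# Same mapping expression as above, using raw string literals for the labels
search = """{
    id: id,
    status: ((deregistered == `true` || deactivated == `true`) && 'inactive') || 'active',
    location: location
}"""

# Map every record, then deduplicate by id, keeping the active record
# whenever an id appears more than once
by_id = {}
for val in data:
    if not val.get("location"):
        continue
    row = jmespath.search(search, val)
    prev = by_id.get(row["id"])
    if prev is None or (prev["status"] == "inactive" and row["status"] == "active"):
        by_id[row["id"]] = row

output = list(by_id.values())
print(output)
# [{'id': 1, 'status': 'active', 'location': True},
#  {'id': 2, 'status': 'active', 'location': True}]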

Related

How do I access nested elements inside a json array in python

I want to iterate over the JSON array below to extract all the referenceValues and their corresponding paymentIDs into one list:
{
    "payments": [{
        "paymentID": "xxx",
        "externalReferences": [{
            "referenceKind": "TRADE_ID",
            "referenceValue": "xxx"
        }, {
            "referenceKind": "ID",
            "referenceValue": "xxx"
        }]
    }, {
        "paymentID": "xxx",
        "externalReferences": [{
            "referenceKind": "ID",
            "referenceValue": "xxx"
        }]
    }]
}
The piece below only extracts values in the case of a single payment with a single externalReferences entry. I want it to work for multiple payments and multiple externalReferences as well.
payment_ids = []
for notification in notifications:
    payments = [(payment[0], payment["externalReferences"][0]["referenceValue"])
                for payment in notification[0][0]]
    if payments[0][1] in invoice_ids:
        payment_ids.extend([payment[0] for payment in payments])
Looking at your structure, you first have to iterate through every dictionary in payments, then iterate through its externalReferences. The code below should extract all reference values and their payment IDs into dictionaries and append them to a list:
refVals = []  # List of all reference values
for payment in data["payments"]:
    for reference in payment["externalReferences"]:
        refVals.append({  # Dictionary of the current values
            "referenceValue": reference["referenceValue"],  # The current reference value
            "paymentID": payment["paymentID"]  # The current payment ID
        })
print(refVals)
This code outputs a list of dictionaries with all reference values and their payment IDs found in the data dictionary (assuming you read your JSON into the data variable).
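For what it's worth, the same extraction can be written as a single nested list comprehension; it is functionally identical to the loop above:

refVals = [
    {"referenceValue": reference["referenceValue"], "paymentID": payment["paymentID"]}
    for payment in data["payments"]
    for reference in payment["externalReferences"]
]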

Multi-level Python Dict to Pandas DataFrame only processes one level out of many

I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
    "general_info": {
        "name": "xxx",
        "description": "xxx",
        "language": "xxx",
        "prefix": "xxx",
        "version": "xxx"
    },
    "element_count": {
        "folders": 23,
        "conditions": 72,
        "listeners": 1,
        "outputs": 47
    },
    "external_resource_count": {
        "total": 9,
        "extensions": {
            "jar": 8,
            "json": 1
        },
        "paths": {
            "/lib": 9
        }
    },
    "complexity": {
        "over_1_transition": {
            "number": 4,
            "percentage": 30.769
        },
        "over_1_trigger": {
            "number": 2,
            "percentage": 15.385
        },
        "over_1_output": {
            "number": 4,
            "percentage": 30.769
        }
    }
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct. While the first and second levels render correctly, categories with a sub-sub-category get written as a string in the cell rather than as a further column. I've also tried using stack(level=1), but it raises "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series, with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values?
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to some meaningful column name.
import pandas as pd

# This function extracts the given keys from the nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# Let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# Here we concat the exploded series to a frame
exploded_df = pd.concat(expp, axis=1).stack().to_frame(name='somecol2').reset_index(level=2) \
    .rename(columns={'level_2': 'somecol'})

# And now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The resulting out dataframe contains the exploded sub-sub-category rows alongside the original non-dict rows.
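As an aside, if a fully flattened table is acceptable, pandas can unnest arbitrarily deep dictionaries in one call with json_normalize. A sketch, assuming the same extracted_metrics dict as above (columns come out as dot-separated paths rather than stacked index levels):

import pandas as pd

flat = pd.json_normalize(extracted_metrics, sep='.')
# Produces a single row with columns such as
# 'external_resource_count.extensions.jar' and 'complexity.over_1_trigger.number';
# transposing gives one metric per row instead
print(flat.T)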

How to check whether a comma-separated value in the database is present in JSON data using python

I have to check the JSON data against the comma-separated e_code values in the table. How do I filter only the data where the user's e_codes are available?
In the database:
id  email      age  e_codes
1   abc#gmail  19   123456,234567,345678
2   xyz#gmail  31   234567,345678,456789
This is my JSON data
[
{
"ct": 1,
"e_code": 123456,
},
{
"ct": 2,
"e_code": 234567,
},
{
"ct": 3,
"e_code": 345678,
},
{
"ct": 4,
"e_code": 456789,
},
{
"ct": 5,
"e_code": 456710,
}
]
If efficiency is not an issue, you could loop through the table, split the values into a list using case['e_codes'].split(','), and then, for each code, loop through the JSON to see whether it is present.
This can be inefficient if your table, JSON, or number of values is large.
It might be better to first build a lookup dictionary in which the codes are the keys:
lookup = {}
for e in my_json:
    lookup[e['e_code']] = 1
You can then check how many of the codes in your table are actually in the JSON:
## Let's assume that the "e_codes" cell of the
## current line is data['e_codes'][i], where i is the line number
for i in lines:
    match = [0, 0]
    for code in data['e_codes'][i].split(','):
        try:
            # The JSON codes are ints while split() yields strings,
            # so convert before the lookup
            match[0] += lookup[int(code)]
            match[1] += 1
        except KeyError:
            match[1] += 1
    if match[1] > 0:
        share_present = match[0] / match[1]
For each case you get a share_present, which is 1.0 if all codes appear in the JSON, 0.0 if none of them do, and a value in between indicating the share of codes that were present. Depending on your threshold for keeping a case, you can set a filter to True or False based on this value.
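A more compact variant of the same idea builds a set of the JSON codes once and intersects it with each row's codes. A sketch, where rows stands in for however you iterate over the database table:

# Codes present in the JSON (ints)
json_codes = {entry["e_code"] for entry in my_json}

for row in rows:
    # Database codes are a comma-separated string, so split and convert
    row_codes = {int(code) for code in row["e_codes"].split(",")}
    share_present = len(row_codes & json_codes) / len(row_codes)
    keep_row = share_present == 1.0  # e.g. require every code to be present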

Required columns are not present in the data frame, but column headers get created in the csv and all rows get populated with null

I am working on getting data from an API using Python. The API returns data as JSON, which is normalised and written to a data frame, which is then written to a csv file.
The API can return any number of columns, which differs between records. I need only a fixed set of columns, which I define in the code.
In the scenario where a required column is not returned, my code fails.
I need a solution where, even if required columns are not present in the data frame, the column header still gets created in the csv and all its rows get populated with null.
Required csv structure:
name  address  phone
abc   bcd      1214
bcd   null     null
I'm not sure if I understood you correctly, but I hope the following code solves your problem:
import json
import pandas as pd

# Declare json with missing values:
# - First element doesn't contain the "phone" field
# - Second element doesn't contain the "married" field
api_data = """
{ "sentences" :
    [
        { "name": "abc", "address": "bcd", "married": true},
        { "name": "def", "address": "ghi", "phone" : 7687 }
    ]
}
"""

json_data = json.loads(api_data)
df = pd.DataFrame(
    data=json_data["sentences"],
    # Explicitly declare which columns should be present in the DataFrame.
    # If the value for a given column is absent it will be populated with NaN.
    columns=["name", "address", "married", "phone"]
)

# Save result to csv:
df.to_csv("tmp.csv", index=False)

The content of the resulting csv:
name,address,married,phone
abc,bcd,True,
def,ghi,,7687.0
P.S.:
It should work even if columns are absent in all the records. Here is another example:
# Both elements lack the "married" and "phone" fields
api_data = """
{ "sentences" :
    [
        { "name": "abc", "address": "bcd"},
        { "name": "def", "address": "ghi"}
    ]
}
"""

json_data = json.loads(api_data)
df = pd.DataFrame(
    data=json_data["sentences"],
    # Explicitly declare which columns should be present in the DataFrame.
    # If the value for a given column is absent it will be populated with NaN.
    columns=["name", "address", "married", "phone"]
)

# Print the first rows of the DataFrame:
df.head()
# Expected output:
#   name address  married  phone
# 0  abc     bcd      NaN    NaN
# 1  def     ghi      NaN    NaN

df.to_csv("tmp.csv", index=False)
In this case the resulting csv file will contain the following text:
name,address,married,phone
abc,bcd,,
def,ghi,,
The trailing commas in the 2nd and 3rd lines mean "an empty/missing value", and if you create a DataFrame from the resulting csv with pd.read_csv, the "married" and "phone" columns will be populated with NaN values.
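If the DataFrame has already been built without the columns argument (say, from pandas.json_normalize on the raw API response), reindex gives the same guarantee after the fact. A sketch:

required = ["name", "address", "married", "phone"]
# Missing columns are created and filled with NaN; unlisted columns are dropped
df = df.reindex(columns=required)
df.to_csv("tmp.csv", index=False)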

Update single row formatting for entire sheet

I want to apply formatting from a JSON entry. The first thing I did was set up my desired format on the spreadsheet for the second row of all columns. I then retrieved it with a .get request (from A2 to AO3).
request = google_api.service.spreadsheets().get(
    spreadsheetId=ss_id,
    ranges="Tab1!A2:AO3",
    includeGridData=True).execute()
The next thing I did was collect each of the formats for each column and record them in a dictionary.
my_dictionary_of_formats = {}
row_values = request['sheets'][0]['data'][0]['rowData'][0]['values']
for column in range(0, len(row_values)):
    my_dictionary_of_formats[column] = row_values[column]['effectiveFormat']
Now I have a dictionary of all the effective formats for all my columns. I'm having trouble applying that format to all rows in each column. I tried a batchUpdate request:
cell_data = {
    "effectiveFormat": my_dictionary_of_formats[0]}
row_data = {
    "values": [
        cell_data
    ]
}
update_cell = {
    "rows": [
        row_data
    ],
    "fields": "*",
    "range": {
        "sheetId": input_master.tab_id,
        "startRowIndex": 2,
        "startColumnIndex": 0,
        "endColumnIndex": 1
    }
}
request_body = {
    "requests": [
        {"updateCells": update_cell}],
    "includeSpreadsheetInResponse": True,
    "responseIncludeGridData": True}
service.spreadsheets().batchUpdate(spreadsheetId=my_id, body=request_body).execute()
This wiped out everything, and I'm not sure why. I don't think I understand the fields='*' attribute.
TL;DR
I want to apply a format to all rows in a single column, much like using the "Paint Format" tool on the second row, first column and dragging it all the way down to the last row.
----- Update
Thanks to the comments, this was my solution:
### Part 1: collect all formats from the second row
import json

row_2 = google_api.service.spreadsheets().get(
    spreadsheetId=spreadsheet_id,
    ranges="tab1!A2:AO2",
    includeGridData=True).execute()

my_dictionary = {}
row_values = row_2['sheets'][0]['data'][0]['rowData'][0]['values']
for column in range(0, len(row_values)):
    my_dictionary[column] = row_values[column]
# json.dump (not dumps) writes to a file object
json.dump(my_dictionary, open('config/format.json', 'w'))

### Part 2: apply the formats
requests = []
my_dict = json.load(open('config/format.json'))
for column in my_dict:  # JSON keys are loaded back as strings
    requests.append(
        {
            "repeatCell": {
                "range": {
                    "sheetId": tab_id,
                    "startRowIndex": 1,
                    "startColumnIndex": int(column),
                    "endColumnIndex": int(column) + 1
                },
                "cell": {
                    "userEnteredFormat": my_dict[column]
                },
                "fields": "userEnteredFormat({})".format(",".join(my_dict[column].keys()))
            }
        })
body = {"requests": requests}
google_api.service.spreadsheets().batchUpdate(spreadsheetId=s.spreadsheet_id, body=body).execute()
When you include fields as part of the request, you indicate to the API endpoint that it should overwrite the specified fields in the targeted range with the information found in your uploaded resource. fields="*" is correspondingly interpreted as "this request specifies the entire data and metadata of the given range; remove any previous data and metadata from the range and use what is supplied instead".
Thus, anything not specified in your updateCells request will be removed from the range supplied in the request (e.g. values, formulas, data validation, etc.).
You can learn more in the guide to batchUpdate.
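For illustration, a hypothetical version of the updateCells request from the question that writes only formatting would pair the cell data with an equally narrow mask, leaving values, notes, and data validation in the range untouched (it writes userEnteredFormat, the writable counterpart of the read-back effectiveFormat, as the accepted solution above does; the variable names are reused from the question):

update_cell = {
    "rows": [{"values": [{"userEnteredFormat": my_dictionary_of_formats[0]}]}],
    # Only the named field is overwritten; all other cell data is preserved
    "fields": "userEnteredFormat",
    "range": {
        "sheetId": input_master.tab_id,
        "startRowIndex": 2,
        "startColumnIndex": 0,
        "endColumnIndex": 1
    }
}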
For an updateCells request, the fields parameter is described as:
The fields of CellData that should be updated. At least one field must be specified. The root is the CellData; 'row.values.' should not be specified. A single "*" can be used as short-hand for listing every field.
If you then view the resource description of CellData, you observe the following fields:
"userEnteredValue"
"effectiveValue"
"formattedValue"
"userEnteredFormat"
"effectiveFormat"
"hyperlink"
"note"
"textFormatRuns"
"dataValidation"
"pivotTable"
Thus, the proper fields specification for your request is likely to be fields="effectiveFormat", since this is the only field you supply in your row_data property.
Consider also using the repeatCell request if you are just specifying a single format.
