Pandas: select rows from a DataFrame based on column values - python

I have the JSON string below loaded into a dataframe. Now I want to filter the records based on ossId.
The condition I am using gives an error message. What is the correct way to filter by ossId?
import pandas as pd
data = """
{
"components": [
{
"ossId": 3946,
"project": "OALX",
"licenses": [
{
"name": "BSD 3",
"status": "APPROVED"
}
]
},
{
"ossId": 3946,
"project": "OALX",
"version": "OALX.client.ALL",
"licenses": [
{
"name": "GNU Lesser General Public License v2.1 or later",
"status": "APPROVED"
}
]
},
{
"ossId": 2550,
"project": "OALX",
"version": "OALX.webservice.ALL" ,
"licenses": [
{
"name": "MIT License",
"status": "APPROVED"
}
]
}
]
}
"""
df = pd.read_json(data)
print(df)
df1 = df[df["components"]["ossId"] == 2550]

I think your issue is due to the JSON structure. You are loading the whole components list into a single column, so each cell of df["components"] holds one raw dict rather than separate ossId / project / licenses columns.
You should instead pass the list of records to the DataFrame constructor. Something like:
import json

json_data = json.loads(data)
df = pd.DataFrame(json_data["components"])
filtered_data = df[df["ossId"] == 2550]
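If you prefer to keep the parsing in pandas, here is a minimal alternative sketch (assuming pandas 1.0+ for pd.json_normalize, and reusing the data string from the question):
import json
import pandas as pd

# build one row per entry in "components"; the "licenses" list stays as a list column
json_data = json.loads(data)
df = pd.json_normalize(json_data["components"])
print(df[df["ossId"] == 2550])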

You need to go into the cell's data and get the correct key:
df[df['components'].apply(lambda x: x.get('ossId')==2550)]

Use the .str accessor:
df[df.components.str['ossId']==2550]
Out[89]:
components
2 {'ossId': 2550, 'project': 'OALX', 'version': ...

Related

Mapping pandas df to JSON Schema

Here is my df:
                    text                  date    channel sentiment product segment
0  I like the new layout  2021-08-30T18:15:22Z  Snowflake   predict  Skills    EMEA
I need to convert this to JSON output that matches the following:
[
{
"text": "I like the new layout",
"date": "2021-08-30T18:15:22Z",
"channel": "Snowflake",
"sentiment": "predict",
"fields": [
{
"field": "product",
"value": "Skills"
},
{
"field": "segment",
"value": "EMEA"
}
]
}
]
I'm getting stuck with mapping the keys of the columns to the values in the first dict and mapping the column and row to new keys in the final dict. I've tried various options using df.groupby with .apply() but am coming up short.
Samples of what I've tried:
df.groupby(['text', 'date','channel','sentiment','product','segment']).apply(
lambda r: r[['27cf2f]].to_dict(orient='records')).unstack('text').apply(lambda s: [
{s.index.name: idx, 'fields': value}
for idx, value in s.items()]
).to_json(orient='records')
Any and all help is appreciated!
One option is to use a nested list comprehension:
import pandas as pd

# Start with your example data
d = {'text': ['I like the new layout'],
'date': ['2021-08-30T18:15:22Z'],
'channel': ['Snowflake'],
'sentiment': ['predict'],
'product': ['Skills'],
'segment': ['EMEA']}
df = pd.DataFrame(d)
# Specify field column names
fieldcols = ['product', 'segment']
# Build a dict for each group as a Series named `fields`
res = (df.groupby(['text', 'date', 'channel', 'sentiment'])
         .apply(lambda s: [{'field': field, 'value': value}
                           for field in fieldcols
                           for value in s[field].values])
       ).rename('fields')
# Convert Series to DataFrame and then to_json
res = res.reset_index().to_json(orient='records')
# Print result
import json
print(json.dumps(json.loads(res), indent=2))
[
{
"text": "I like the new layout",
"date": "2021-08-30T18:15:22Z",
"channel": "Snowflake",
"sentiment": "predict",
"fields": [
{
"field": "product",
"value": "Skills"
},
{
"field": "segment",
"value": "EMEA"
}
]
}
]
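A hedged alternative sketch without groupby: build the nested records directly from to_dict(orient='records'), reusing the df and fieldcols defined above:
records = [
    {
        "text": row["text"],
        "date": row["date"],
        "channel": row["channel"],
        "sentiment": row["sentiment"],
        # one {"field", "value"} dict per field column
        "fields": [{"field": f, "value": row[f]} for f in fieldcols],
    }
    for row in df.to_dict(orient="records")
]
print(json.dumps(records, indent=2))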

How to create JSON structure from a pyspark dataframe?

I'm trying to create a JSON structure from a pyspark dataframe. I have the following columns in my dataframe: batch_id, batch_run_id, table_name, column_name, column_datatype, last_refresh_time, refresh_frequency, owner.
I want it in the JSON structure below:
{
"GeneralInfo": {
"DataSetID": "xxxx1234Abcsd",
"Owner" : ["test1#email.com", "test2#email.com", "test3#email.com"]
"Description": "",
"BuisnessFunction": "",
"Source": "",
"RefreshRate": "Weekly",
"LastUpdate": "2020/10/15",
"InfoSource": "TemplateInfo"
},
"Tables": [
{
"TableName": "Employee",
"Columns" : [
{ "ColumnName" : "EmployeeID",
"ColumnDataType": "int"
},
{ "ColumnName" : "EmployeeName",
"ColumnDataType": "string"
}
]
}
]
}
I'm trying to assign the values in the JSON string through dataframe column indexes, but it gives me the error "Object of type Column is not JSON serializable". I have used it like below:
{
"GeneralInfo": {
"DataSetID": df["batch_id"],
"Owner" : list(df["owner"])
"Description": "",
"BuisnessFunction": "",
"Source": "",
"RefreshRate": df["refresh_frequency"],
"LastUpdate": df["last_update_time"],
"InfoSource": "TemplateInfo"
},
"Tables": [
{
"TableName": df["table_name"],
"Columns" : [
{ "ColumnName" : df["table_name"]["column_name"],
"ColumnDataType": df["table_name"]["column_datatype"]
}
]
}
]
}
Sample Data -
Please help me with this; I have only recently started coding in PySpark.
I tried producing the JSON format from the sample data you provided; the output format does not match your expected structure exactly, but you can refine the code below further.
We can use the toJSON function to convert the dataframe to JSON format. Before calling toJSON, we build the nested shape by passing the required columns to struct() and aggregating with collect_set/collect_list so the output matches the desired JSON structure.
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master('local[*]').getOrCreate()
in_values = [
    (123, '123abc', 'Employee', 'Employee_id', 'int', '21/05/15', 'Weekly',
     ['test1#gmail.com', 'test1#gmail.com', 'test3#gmail.com']),
    (123, '123abc', 'Employee', 'Employee_name', 'string', '21/05/15', 'Weekly',
     ['test1#gmail.com', 'test1#gmail.com', 'test3#gmail.com'])
]
cols = ["batch_id", "batch_run_id", "table_name", "column_name", "column_datatype",
        "last_update_time", "refresh_frequency", "Owner"]

df = spark.createDataFrame(in_values).toDF(*cols) \
    .selectExpr("*", "'' Description", "'' BusinessFunction", "'TemplateInfo' InfoSource", "'' Source")

list1 = [df["batch_id"].alias("DataSetID"), df["Owner"], df["refresh_frequency"].alias("RefreshRate"),
         df["last_update_time"].alias("LastUpdate"), "Description", "BusinessFunction", "InfoSource", "Source"]
list2 = [df["table_name"].alias("TableName"), df["column_name"].alias("ColumnName"),
         df["column_datatype"].alias("ColumnDataType")]

df.groupBy("batch_id") \
    .agg(collect_set(struct(*list1))[0].alias("GeneralInfo"),
         collect_list(struct(*list2)).alias("Tables")).drop("batch_id") \
    .toJSON().foreach(print)
# outputs JSON --->
'''
{
"GeneralInfo":{
"DataSetID":123,
"Owner":[
"test1#gmail.com",
"test1#gmail.com",
"test3#gmail.com"
],
"RefreshRate":"Weekly",
"LastUpdate":"21/05/15",
"Description":"",
"BusinessFunction":"",
"InfoSource":"TemplateInfo",
"Source":""
},
"Tables":[
{
"TableName":"Employee",
"ColumnName":"Employee_id",
"ColumnDataType":"int"
},
{
"TableName":"Employee",
"ColumnName":"Employee_name",
"ColumnDataType":"string"
}
]
}
'''
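Note that foreach(print) runs on the executors, so it only happens to show up in the console with a local master. For a driver-side result you could collect the JSON strings instead; a small sketch reusing list1 and list2 from above:
result = (df.groupBy("batch_id")
            .agg(collect_set(struct(*list1))[0].alias("GeneralInfo"),
                 collect_list(struct(*list2)).alias("Tables"))
            .drop("batch_id"))

# bring the JSON strings back to the driver and print them there
for js in result.toJSON().collect():
    print(js)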

Convert PANDAS dataframe to nested JSON + add array name

I've been wrestling with this for many days now and would appreciate any help.
I'm importing an Excel file into a Pandas data frame, resulting in the following dataframe [record]:
account_id    name       timestamp            value
A0001C        Fund_1     1588618800000000000  1
B0001B        Dev_2      1601578800000000000  1
I'm looking to produce a nested JSON output (it will be used to submit data to an API), including adding "records" and "metrics" labels for the arrays.
Here is the output I'm looking for:
{
"records": [
{
"name": "Fund_1",
"account_id": "A0001C",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"name": "Dev_2",
"account_id": "B0001B",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
I've gotten an output of a non-nested JSON data set, but I am not able to split out the timestamp and value to add the metrics part.
for record in df.to_dict(orient='records'):
    record_data = {'records': [record]}
    payload_json = json.dumps(record_data)
    print(payload_json)
I get the following output:
{"records": [{"account_id": "A0001C", "name": "Fund_1", "Date Completed": 1588618800000000000, "Count": "1"}]}
{"records": [{"account_id": "B0001B", "name": "Dev_2", "Date Completed": 1601578800000000000, "Count": "1"}]}
Any help on how I can modify my code to add the metrics label and nest the data would be appreciated.
Thanks in advance.
One approach is to use DataFrame.apply. This allows you to apply a function to series (either column- or row-wise) in your dataframe.
In your particular case, you want to apply the function row-by-row, so you have to use apply with axis=1:
records = list(df.apply(lambda row: {"name": row["name"],
                                     "account_id": row["account_id"],
                                     "metrics": [{"timestamp": row["timestamp"],
                                                  "value": row["value"]}]},
               axis=1).values)
payload = {"records": records}
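To turn option 1's payload into the JSON string you would send onwards, a small usage sketch (json imported here for self-containment):
import json

payload_json = json.dumps(payload)  # serialize the nested structure for submission
print(payload_json)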
Alternatively, you could introduce an auxiliary column "metrics" in which you store your metrics (converting to records afterwards with to_dict and json.dumps, as in the full example below):
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
Here's a full example applying option 2:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep="\t")
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
Edit: The second approach also makes grouping by accounts (in case you want to do that) rather easy. Below is a small example and output:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
A0001C Fund_1 1588618900000000000 2
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep="\t")
# adding the metrics column as above
df["metrics"] = df.apply(lambda e: {"timestamp": e.timestamp,
"value": e.value},
axis=1)
# group metrics by account
df_grouped = df.groupby(by=["name", "account_id"]).metrics.agg(list).reset_index()
records = df_grouped[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
},
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
},
{
"timestamp": 1588618900000000000,
"value": 2
}
]
}
]
}
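Since the payload is ultimately meant to be submitted to an API, here is a hedged sketch of posting it with the requests library; the endpoint URL below is purely hypothetical:
import json
import requests  # assumed to be installed

response = requests.post(
    "https://example.com/api/records",  # hypothetical endpoint, replace with the real one
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(response.status_code)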

How to get only required columns in a Python script while parsing data from a JSON file

I am trying to write a Python script. As per the requirement, I have around 400 columns that come from multiple arrays in a JSON file.
I am using the Pandas library and Python 3.6, and I may get more than 400 columns from the JSON file. How can I exclude the unwanted columns and get only the specified columns in my Python output file?
I am using the code below to get the data for the specified columns.
Issue: in my output file I am getting all of the columns, not just the ones listed in my column list file. How can I restrict the output to only the required columns?
import json
import pandas as pd

with open('Columns.txt') as c:
    columns_list = c.readlines()
with open('JsonFile.json') as f:
    json_file = json.load(f)
df = pd.DataFrame(columns=columns_list)
And I have one more scenario. Currently I have data like the sample below.
In 70% of cases I have data like [attributes][ABC][Values][Value], and in the remaining cases I have [attributes][Xdfghgjgjgj][grp] (here I have some 2 records inside). Can you help me with a solution for handling this type of multi-valued attribute?
{
  "entities": [
    {
      "id": "XXXXXXXXXXXXXXX",
      "data": {
        "attributes": {
          "ABC": {
            "values": [
              {
                "value": 00000000000000
              }
            ]
          },
          "Xdfghgjgjgj": {
            "grp": [
              {
                "SUPP": {
                  "values": [
                    {
                      "value": "000000000000000000"
                    }
                  ]
                },
                "yfyfyfyfyfy": {
                  "values": [
                    {
                      "value": "909000090099090"
                    }
                  ]
                }
              },
              {
                "SUPP": {
                  "values": [
                    {
                      "value": "000000000000000000"
                    }
                  ]
                },
                "yfyfyfyfyfy": {
                  "values": [
                    {
                      "value": "909000090099090"
                    }
                  ]
                }
              }
            ]
          }
        }
      }
    }
  ]
}
There is a way to read specific columns from a CSV using pandas:
import pandas as pd

cols = ['col1', 'col2', 'col3']
df = pd.read_csv('JsonFile.csv', skipinitialspace=True, usecols=cols)
# save to output
df.to_csv('output.csv', index=False)
Or you could specify the columns when you are saving your file:
df = pd.read_csv('JsonFile.csv')
df[column_names].to_csv('output.csv',index=False)
Edit:
import json
import pandas as pd

with open('Columns.txt') as c:
    # strip the trailing newlines so the names match the dataframe's columns
    columns_list = [line.strip() for line in c if line.strip()]
with open('JsonFile.json') as f:
    json_file = json.load(f)
#df = pd.DataFrame.from_dict(json_file, orient='columns')
df = pd.DataFrame(json_file)
df[columns_list].to_csv('output.csv', index=False)
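The follow-up about the nested [attributes][...][values][value] structure is not covered above. A hedged sketch (assuming pandas 1.0+ and that Columns.txt lists the flattened, dot-separated column names) could flatten each entity with pd.json_normalize first:
import json
import pandas as pd

with open('Columns.txt') as c:
    columns_list = [line.strip() for line in c if line.strip()]
with open('JsonFile.json') as f:
    json_file = json.load(f)

# one row per entity; nested dicts become dot-separated columns such as
# "data.attributes.ABC.values", while list values (e.g. the "grp" array) stay as list cells
df = pd.json_normalize(json_file["entities"], sep=".")

# keep only the wanted columns that actually exist in this file
wanted = [col for col in columns_list if col in df.columns]
df[wanted].to_csv('output.csv', index=False)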

Django filtering based on list value

My JSON data is in this format:
[
{
"id": "532befe4ee434047ff968a6e",
"company": "528458c4bbe7823947b6d2a3",
"values" : [
{
"Value":"11",
"uniqueId":true
},
{
"Value":"14",
"uniqueId":true
}
]
},
{
"id": "532befe4ee434047ff968a",
"company": "528458c4bbe7823947b6d",
"values" : [
{
"Value":"1111",
"uniqueId":true
},
{
"Value":"10",
"uniqueId":true
}
]
}
]
If I want to filter based on the company field, it is possible this way:
qaresults = QAResult.objects.filter(company= comapnyId)
and it gives me the first dictionary of the list.
But what if I want to filter based on the "Value" key inside the values list of the first dictionary?
I am not 100% sure what you want, but from what I understand of your question,
try this solution:
import json
json_dict = json.loads('[{"id": "532befe4ee434047ff968a6e","company": "528458c4bbe7823947b6d2a3","values": [{"Value": "11","uniqueId": true},{"Value": "14","uniqueId": true}]},{"id": "532befe4ee434047ff968a","company": "528458c4bbe7823947b6d","values": [{"Value": "1111","uniqueId": true},{"Value": "10","uniqueId": true}]}]')
expected_values = []
js = json_dict[0]
for key, value in js.items():
    if key == 'values':
        expected_values.append(value[0]['Value'])
And then
qaresults = QAResult.objects.filter(company__id__in = expected_values)
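If you actually want every Value from the values list (not just the first one), a small sketch of the same idea as a list comprehension:
# collect all "Value" entries from the first record's "values" list
expected_values = [item['Value'] for item in json_dict[0].get('values', [])]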
