Python, how to create a table from JSON data

I am trying to create a table from JSON data. I have already used json.dumps on my data.
This is what I am trying to export to the table:
label3 = json.dumps({'class': CLASSES[idx],
                     "confidence": str(round(confidence * 100, 1)) + "%",
                     "startX": str(startX),
                     "startY": str(startY),
                     "EndX": str(endX),
                     "EndY": str(endY),
                     "Timestamp": now.strftime("%d/%m/%Y, %H:%M")})
I have tried with:
val1 = json.loads(label3)
df = pd.DataFrame(val1)
print(df.T)
The system gives me an error that I must pass an index.
And also with:
val = ast.literal_eval(label3)
val1 = json.loads(json.dumps(val))
print(val1)
val2 = val1["class"][0]["confidence"][0]["startX"][0]["startY"][0]["endX"][0]["endY"][0]["Timestamp"][0]
df = pd.DataFrame(data=val2, columns=["class", "confidence", "startX", "startY", "EndX", "EndY", "Timestamp"])
print(df)
When I try this, the error it gives is that string indices must be integers.
How can I create the index?
Thank you,

There are two ways we can tackle this issue.
Do as directed by the error and pass an index to the DataFrame constructor:
pd.DataFrame(val1, index=list(range(number_of_rows)))  # number_of_rows is 1 in your case
Alternatively, when dumping the data with json.dumps, dump a dictionary that maps each key to a list of values instead of to a single value. For example:
json.dumps({'class': [CLASSES[idx]], "confidence": ['some confidence']})
I have shortened your example; note that every value is passed as a list, even if there is only one value per key.
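For reference, here is a minimal sketch of both approaches; the concrete values are placeholders standing in for CLASSES[idx], the confidence, and the coordinates from the question:
import json
import pandas as pd

# Placeholder values standing in for the detection results in the question
label3 = json.dumps({'class': 'car', 'confidence': '97.2%',
                     'startX': '10', 'startY': '20',
                     'EndX': '110', 'EndY': '220',
                     'Timestamp': '01/01/2024, 12:00'})

# Approach 1: pass an index, since every value is a scalar
val1 = json.loads(label3)
df = pd.DataFrame(val1, index=[0])

# Approach 2: wrap each value in a list so pandas can infer a single row
df = pd.DataFrame({k: [v] for k, v in json.loads(label3).items()})
print(df)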

Related

Python - Get the top 5 items from dictionary type column pandas dataframe

I have a dataframe in which one of the columns is a dictionary. I am getting a huge number of items inside that dictionary, which is causing memory problems. The solution was to keep only the first 10 items from that dictionary. I already have the code, but it gives an error:
TypeError: '<' not supported between instances of 'dict' and 'dict'
I made a small code sample just to show you my problem:
import pandas as pd
import datetime
res = pd.DataFrame([])
res_tmp = pd.DataFrame([])
d = {'club': ['A1', 'B1'], 'score': [3, 4]}
df = pd.DataFrame(data=d)
for index, row in df.iterrows():
    total = int(row['score']) * -1
    res_tmp = res_tmp.append({'today': str(datetime.datetime.now()), 'total': total}, ignore_index=True)
    res = res.append({'club': row['club'], 'details': res_tmp.to_dict('dict')}, ignore_index=True)
res['details'] = res['details'].apply(lambda y: (sorted(y.items(), key=lambda x: x[1]))[:1])
What am I doing wrong? Note: in the example I have only two rows, which is why I take the top 1 instead of the top 10.
Thanks!
As the error message tells you, there is no defined ordering for dicts. If you want to sort the dicts, you must provide a key function that defines the sort order. You extracted the values, but you also have to convert each dict to some type that does have < defined. For instance:
key = lambda x: list(x[1].values())
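As a small, self-contained illustration of supplying such a key function (toy data with purely numeric inner values, so that < is well defined):
# Toy dict of dicts; each inner dict's values are numeric, so lists of them compare cleanly
details = {'a': {0: 3, 1: 5}, 'b': {0: 1, 1: 2}, 'c': {0: 4, 1: 0}}

key = lambda x: list(x[1].values())
top_items = sorted(details.items(), key=key)[:1]
print(top_items)  # [('b', {0: 1, 1: 2})]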

Convert list of strings to list of json objects in pyspark

def tranform_doc(docs):
    json_list = []
    print(docs)
    for doc in docs:
        json_doc = {}
        json_doc["customKey"] = doc
        json_list.append(json_doc)
    return json_list

df.groupBy("colA") \
  .agg(custom_udf(collect_list(col("colB"))).alias("customCol"))
First Hurdle:
Input: ["str1","str2","str3"]
Output: [{"customKey":"str1"},{"customKey":"str2"},{"customKey":"str3"}]
Second Hurdle:
The columns in the agg collect_list change dynamically, so how can I adjust the schema dynamically?
When the number of elements in the list changes, I receive the error:
Input row doesn't have expected number of values required by the schema. 1 fields are required while 3 values are provided
What I did:
def tranform_doc(agg_docs):
    return json_list
## When I failed to get a list of JSON, I tried to just return the original list of strings as the list of json

schema = StructType([StructField("col1", StringType()),
                     StructField("col2", StringType()),
                     StructField("col3", StringType())])
custom_udf = udf(tranform_doc, schema)

df.groupBy("colA") \
  .agg(custom_udf(collect_list(col("colB"))).alias("customCol"))
Output I got:
{"col2":"str1","col1":"str2","col3":"str3"}
I am struggling to get the required list of JSON strings and to make it dynamic with respect to the number of elements in the list.
No UDF needed. You can convert colB to a struct before collect_list.
import pyspark.sql.functions as F

df2 = df.groupBy('colA').agg(
    F.to_json(
        F.collect_list(
            F.struct(F.col('colB').alias('customKey'))
        )
    ).alias('output')
)
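As a quick, minimal check with made-up data (column names as in the question; the ordering inside collect_list is not guaranteed):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the question's column names
df = spark.createDataFrame([('a', 'str1'), ('a', 'str2'), ('a', 'str3')], ['colA', 'colB'])

df2 = df.groupBy('colA').agg(
    F.to_json(
        F.collect_list(F.struct(F.col('colB').alias('customKey')))
    ).alias('output')
)
df2.show(truncate=False)
# output column: [{"customKey":"str1"},{"customKey":"str2"},{"customKey":"str3"}]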

Create Json file from Python dataframe with grouping on one col and making column name as key with unique values as a list inside the key

#Create the pandas DataFrame#
My data frame is like this:
data = [[6, 1, "False","var_1"], [6, 1, "False","var_2"], [7, 1, "False","var_3"]]
df = pd.DataFrame(data, columns =['CONSTRAINT_ID','CONSTRAINT_NODE_ID','PRODUCT_GRAIN','LEFT_SIDE_TYPE'])
##Expected Output Json##
I want to group by the CONSTRAINT_ID column, and the key should be natural numbers (an index). The LEFT_SIDE_TYPE column values should come as a list.
{
  "1": {"CONSTRAINT_NODE_ID": [1],
        "product_grain": False,
        "left_side_type": ["Variable_1", "Variable_2"]
       },
  "2": {"CONSTRAINT_NODE_ID": [2],
        "product_grain": False,
        "left_side_type": ["Variable_3"]
       }
}
It is likely not the most efficient solution. However, given a df in the format specified in your original question, the function below will return a str consisting of a valid JSON string with the desired structure and values.
It filters the df by CONSTRAINT_ID, iterating over each unique value and building a JSON object keyed 1...n with the desired values from your original question in the response variable. The implementation uses sets to store values during iteration, which avoids duplicates, and converts them to lists before adding them to the response.
import json

def generate_response(df):
    response = dict()
    constraints = df['CONSTRAINT_ID'].unique()
    for i, c in enumerate(constraints):
        temp = {'CONSTRAINT_NODE_ID': set(), 'PRODUCT_GRAIN': None, 'LEFT_SIDE_TYPE': set()}
        for _, row in df[df['CONSTRAINT_ID'] == c].iterrows():
            temp['CONSTRAINT_NODE_ID'].add(row['CONSTRAINT_NODE_ID'])
            temp['PRODUCT_GRAIN'] = row['PRODUCT_GRAIN']
            temp['LEFT_SIDE_TYPE'].add(row['LEFT_SIDE_TYPE'])
        temp['CONSTRAINT_NODE_ID'] = list(temp['CONSTRAINT_NODE_ID'])
        temp['LEFT_SIDE_TYPE'] = list(temp['LEFT_SIDE_TYPE'])
        response[str(i + 1)] = temp
    return json.dumps(response, indent=4)
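For instance, applied to the sample frame from the question:
import pandas as pd

data = [[6, 1, "False", "var_1"], [6, 1, "False", "var_2"], [7, 1, "False", "var_3"]]
df = pd.DataFrame(data, columns=['CONSTRAINT_ID', 'CONSTRAINT_NODE_ID', 'PRODUCT_GRAIN', 'LEFT_SIDE_TYPE'])

# Produces keys "1" and "2", one per unique CONSTRAINT_ID, with LEFT_SIDE_TYPE values collected into lists
print(generate_response(df))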

Optimize row access and transformation in pyspark

I have a large dataset (5 GB) of JSON in an S3 bucket.
I need to transform the schema of the data and write the transformed data back to S3 using an ETL script.
So I use a crawler to detect the schema, load the data into a pyspark dataframe, and change the schema. Then I iterate over every row in the dataframe, convert it to a dictionary, remove the null columns, convert the dictionary to a string, and write it back to S3. The following is the code:
# df is the pyspark dataframe
columns = df.columns
print(columns)
s3 = boto3.resource('s3')
cnt = 1
for row in df.rdd.toLocalIterator():
    data = row.asDict(True)
    for col_name in columns:
        if data[col_name] is None:
            del data[col_name]
    content = json.dumps(data)
    object = s3.Object('write-test-transaction-transformed', str(cnt)).put(Body=content)
    cnt = cnt + 1
    print(cnt)
I have used toLocalIterator.
Does the above code execute serially? If yes, how can I optimize it? Is there a better approach for executing the above logic?
Assuming each row in the dataset is in JSON string format:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def drop_null_cols(data):
    import json
    content = json.loads(data)
    for key, value in list(content.items()):
        if value is None:
            del content[key]
    return json.dumps(content)

drop_null_cols_udf = F.udf(drop_null_cols, StringType())

df = spark.createDataFrame(
    ["{\"name\":\"Ranga\", \"age\":25, \"city\":\"Hyderabad\"}",
     "{\"name\":\"John\", \"age\":null, \"city\":\"New York\"}",
     "{\"name\":null, \"age\":31, \"city\":\"London\"}"],
    "string"
).toDF("data")

df.select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)
If the input dataframe already has the columns and the output only needs to be JSON of the non-null columns:
df = spark.createDataFrame(
    [('Ranga', 25, 'Hyderabad'),
     ('John', None, 'New York'),
     (None, 31, 'London')],
    ['name', 'age', 'city']
)

df.withColumn(
    "data", F.to_json(F.struct([x for x in df.columns]))
).select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)

# df.write.format("csv").save("s3://path/to/file/")  -- save to s3
which results in:
+-------------------------------------------------+
|data |
+-------------------------------------------------+
|{"name": "Ranga", "age": 25, "city": "Hyderabad"}|
|{"name": "John", "city": "New York"} |
|{"age": 31, "city": "London"} |
+-------------------------------------------------+
I would follow the approach below (written in Scala, but it can be implemented in Python with minimal changes):
Find the dataset count and name it totalCount
val totalcount = inputDF.count()
Find count(col) for all the dataframe columns and get a map of fields to their counts
Here the count is computed for every column of the input dataframe.
Please note that count(anycol) returns the number of rows for which the supplied column is non-null. For example, if a column has 10 rows and 5 of the values are null, then count(column) is 5.
Fetch the first row as a Map[colName, count(colName)], referred to as fieldToCount
val cols = inputDF.columns.map { inputCol =>
  functions.count(col(inputCol)).as(inputCol)
}
// Returns the number of rows for which the supplied column are all non-null.
// count(null) returns 0
val row = dataset.select(cols: _*).head()
val fieldToCount = row.getValuesMap[Long]($(inputCols))
Get the columns to be removed
Use the map created in step 2 and mark any column whose count is less than totalCount as a column to be removed.
Select all the columns whose count == totalCount from the input dataframe, and save the processed output dataframe anywhere, in any format, as required.
Please note that this approach will remove every column that has at least one null value.
val fieldToBool = fieldToCount.mapValues(_ < totalcount)
val processedDF = inputDF.select(fieldToBool.filterNot(_._2).keys.toSeq.map(col): _*)
// save this processedDF anywhere in any format as per requirement
I believe this approach will perform better than the approach you currently have.
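For completeness, a rough Python/pyspark sketch of the same idea (assuming input_df is the source dataframe; this is an illustration, not a drop-in replacement for the Scala code above):
import pyspark.sql.functions as F

total_count = input_df.count()

# count(col) counts only non-null values, so any column containing a null ends up with count < total_count
field_to_count = input_df.select(
    [F.count(F.col(c)).alias(c) for c in input_df.columns]
).head().asDict()

cols_to_keep = [c for c, cnt in field_to_count.items() if cnt == total_count]
processed_df = input_df.select(*cols_to_keep)
# save processed_df anywhere in any format as per requirement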
I solved the above problem.
We can simply query the dataframe for null values.
df = df.filter(df.column.isNotNull()) thereby removing all rows where a null is present.
So if there are n columns, we need 2^n queries to filter out all possible combinations. In my case there were 10 columns, so a total of 1024 queries, which is acceptable as SQL queries are parallelized.
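A minimal sketch of that combination idea, assuming df is the pyspark dataframe (each null/non-null pattern across the columns gets its own filter, which is where the 2^n comes from):
from itertools import product
import pyspark.sql.functions as F

cols = df.columns
for pattern in product([True, False], repeat=len(cols)):
    cond = F.lit(True)
    for c, must_be_non_null in zip(cols, pattern):
        cond = cond & (F.col(c).isNotNull() if must_be_non_null else F.col(c).isNull())
    subset = df.filter(cond)
    # rows in `subset` share the same set of non-null columns, so they can be written out together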

DataFrame constructor not properly called! error

I am new to Python and I am facing a problem creating a DataFrame in a key/value format, i.e.
data = [{'key': '[GlobalProgramSizeInThousands]', 'value': '1000'},]
Here is my code:
columnsss = ['key', 'value'];
query = "select * from bparst_tags where tag_type = 1 ";
result = database.cursor(db.cursors.DictCursor);
result.execute(query);
result_set = result.fetchall();
data = "[";
for row in result_set:
    data += "{'value': %s , 'key': %s }," % (row["tag_expression"], row["tag_name"])
data += "]";
df = DataFrame(data, columns=columnsss);
But when I pass the data to DataFrame, it shows me
pandas.core.common.PandasError: DataFrame constructor not properly called!
while if I print the data and assign the same value to the data variable manually, it works.
You are providing a string representation of a dict to the DataFrame constructor, and not a dict itself. So this is the reason you get that error.
So if you want to use your code, you could do:
df = DataFrame(eval(data))
But it would be better not to create the string in the first place, and instead build a list of dicts directly. Something roughly like:
data = []
for row in result_set:
    data.append({'value': row["tag_expression"], 'key': row["tag_name"]})
But probably even this is not needed; depending on what exactly is in your result_set, you could probably:
provide this directly to a DataFrame: DataFrame(result_set)
or use the pandas read_sql_query function to do this for you (see docs on this)
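For example, a minimal read_sql_query sketch (using SQLite purely for illustration; the table and column names follow the question, and the connection object should be adapted to your actual database):
import sqlite3
import pandas as pd

conn = sqlite3.connect("tags.db")  # hypothetical database file
df = pd.read_sql_query(
    "select tag_name, tag_expression from bparst_tags where tag_type = 1",
    conn,
)
df = df.rename(columns={"tag_name": "key", "tag_expression": "value"})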
Just ran into the same error, but the above answer could not help me.
My code, which was like this, worked fine on my computer:
test_dict = {'x': '123', 'y': '456', 'z': '456'}
df=pd.DataFrame(test_dict.items(),columns=['col1','col2'])
However, it did not work on another platform. It gave me the same error as mentioned in the original question. I tried the code below, simply adding list() around the dictionary items, and it worked smoothly afterwards:
df=pd.DataFrame(list(test_dict.items()),columns=['col1','col2'])
Hopefully, this answer can help whoever ran into a similar situation like me.
import json
import pandas as pd

# Opening JSON file
with open('data.json') as f:
    # returns the JSON object as a dictionary
    data1 = json.load(f)

# converting it into a dataframe
# (pd.read_json expects a path or JSON string rather than a dict, so build the frame from the dict)
df = pd.DataFrame.from_dict(data1, orient='index')
