Convert JSON column in dataframe to simple array of values - python

I am trying to convert the JSON in the bbox (bounding box) column into a simple array of values for a DL project in Python in a Jupyter notebook.
The possible labels are the following categories: [glass, cardboard, trash, metal, paper].
[{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]
to
([191 70 183 311], 0)
I'm looking for help converting the bbox column out of its JSON form, for a single CSV that contains all the image names and their related bboxes.
UPDATE
The current column is a Series, so I keep getting "TypeError: the JSON object must be str, bytes or bytearray, not 'Series'" whenever I try to apply JSON operations to the column. So far I have tried to convert the column into a JSON object and then pull the values out of the keys.
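A minimal reproduction of the error (the filename here is a placeholder, assuming the CSV has been read into a DataFrame):
import json
import pandas as pd

df = pd.read_csv('bboxes.csv')  # placeholder path to the CSV of image names and bboxes
json.loads(df['bbox'])  # TypeError: the JSON object must be str, bytes or bytearray, not 'Series'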

You'll want to use a JSON decoder: https://docs.python.org/3/library/json.html
import json
li = json.loads('''[{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]''')
d = li[0]  # the list contains a single dictionary
result = ([d[key] for key in "left top width height".split()], 0)
print(result)
Edit:
If you want to map the operation of extracting the values from the dictionary over all elements of the list, you can do:
extracted = []
for element in li:
    result = ([element[key] for key in "left top width height".split()], 0)
    extracted.append(result)

print(extracted[:10])
# `[:10]` is there to limit the number of items displayed to 10
Similarly, as per my comment, if you do not want commas between the extracted numbers in the list, you can use:
without_comma = []
for element, zero in extracted:
    result_string = "([{}], 0)".format(" ".join([str(value) for value in element]))
    without_comma.append(result_string)
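Printing the first item then gives a string in the desired shape:
print(without_comma[0])
# ([191 70 183 311], 0)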

It looks like each row of your bbox column contains a dictionary inside of a list. I've tried to replicate your problem as follows. Edit: Clarifying that the below solution assumes that what you're referring to as a "JSON object" is represented as a list containing a single dictionary, which is what it appears to be per your example and screenshot.
import pandas as pd

# Create an empty sample DataFrame with one row
df = pd.DataFrame([None], columns=['bbox'])
# Assign your sample item to the first row
df['bbox'][0] = [{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]
Now, to simply unpack the row you can do:
df['bbox_unpacked'] = df['bbox'].map(lambda x: x[0].values())
Which will get you a new column with a tuple of 5 items.
If you want to go further and apply your labels, you'll likely want to create a dictionary to contain your labeling logic. Per the example you've given in the comments, I've done:
labels = {
    'cardboard': 1,
    'trash': 2,
    'glass': 3
}
This should get you your desired layout if you want a one-line solution without writing your own function.
df['bbox_unpacked'] = df['bbox'].map(lambda x: (list(x[0].values())[:4],labels.get(list(x[0].values())[-1])))
A more readable solution would be to define your own function using the .apply() method. Edit: Since it looks like your JSON object is being stored as a str inside your DataFrame rows, I added json.loads(row) to process the string first before retrieving the keys. You'll need to import json to run.
import json

def unpack_bbox(row, labels):
    # load the string into a JSON object (in this
    # case a list of length one containing the dictionary);
    # index the list to its first item [0] and use the .values()
    # dictionary method to access the values only
    keys = list(json.loads(row)[0].values())
    bbox_values = keys[:4]
    bbox_label = keys[-1]
    label_value = labels.get(bbox_label)
    return bbox_values, label_value

df['bbox_unpacked'] = df['bbox'].apply(unpack_bbox, args=(labels,))
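Putting it all together, a minimal end-to-end sketch (assuming the bbox column holds the JSON strings, as in your CSV):
import json
import pandas as pd

labels = {'cardboard': 1, 'trash': 2, 'glass': 3}
df = pd.DataFrame({'bbox': ['[{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]']})
df['bbox_unpacked'] = df['bbox'].apply(unpack_bbox, args=(labels,))
print(df['bbox_unpacked'][0])
# ([191, 70, 183, 311], 3)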

Related

Extract element of exploded JSON via name of list element

I have a JSON that I have read in using
data_fields = spark.read.json(json_files)
where json_files is the path to the json files. To extract the data from the JSON I then use:
data_fields = data_fields.select('datarecords.fields')
I then give each record its own row via:
input_data = data_fields.select(explode("fields").alias('fields'))
Resulting in data in the fields column that looks like:
fields
[[ID,, 101],[other_var,, 'some_value']]
[[other_var,,"some_value"],[ID,, 102],[other_var_2,, 'some_value_2']]
Each sub-list element can be referred to using "name", "status" and "value" as the components. For example:
input_data = input_data.withColumn('new_col', col('fields.name'))
Will extract the name of each element. So in the above example, "ID" and "other_var". I am trying to extract the id for each record to its own column, to end with:
id   fields
101  [[ID,, 101],[other_var,, 'some_value']]
102  [[other_var,,"some_value"],[ID,, 102],[other_var_2,, 'some_value_2']]
For those cases where the id is the first element in the fields column, row 1 above, I can do this via:
input_data = input_data.withColumn('id', col('fields')[0].value)
However, as shown, the "id" is not always the first element in the list in the fields column, and there are many hundreds of potential sub-list elements. I have therefore been trying to extract the "id" via its name rather than its position in the list, but have drawn a blank. The nearest I have come is to use the below to identify which element it exists in:
input_data = input_data.withColumn('id', array_position(col('fields.name'),"ID"))
Which returns the position, but I am not sure how to then get the value, unless I do something like:
result = input_data.withColumn('id',
    when(col('fields.name')[0] == 'ID', col('fields')[0].value)
    .when(col('fields.name')[1] == 'ID', col('fields')[1].value)
    .when(col('fields.name')[2] == 'ID', col('fields')[2].value))
And of course the above is impractical with potentially hundreds of sub-list elements in the fields column.
Any help to extract the id efficiently, regardless of its position in the list, would be appreciated.
Hopefully the above minimum example is clear.
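One way to avoid the when-chain is Spark's filter higher-order SQL function (available from Spark 2.4; a sketch, untested against your exact schema):
from pyspark.sql.functions import expr

# Keep only the struct whose name is "ID", then take its value.
# If no element matches, the result is null for that row.
input_data = input_data.withColumn(
    'id',
    expr("filter(fields, x -> x.name = 'ID')[0].value")
)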

Multiple data types in a python list

I have a list of birth data; each record has 3 columns, [DOB, weight, height], as such:
bd = [['10/03/2021 00:00', '6.2', '33.3'],
      ['12/04/2021 00:00', '6.2', '33.3'],
      ['13/05/2021 00:00', '6.2', '33.3']]
I need to change the data types, as they are all strings in the list: I want the first item in each record to be a datetime and the rest to be floats. I have:
newdata = [i[0] for i in bd]
p = []
for x in bd:
    xy = datetime.datetime.strptime(x, '%d/%m/%Y %H:%M')  # this works to change the data type
    p.append(xy)  # it fails to update this to the new list
I get an attribute error:
AttributeError: 'str' object has no attribute 'append'
I would like to achieve this just by utilizing Python's file IO operations. I also want to keep each record of data together in a list within the main list; I just want to update the datatypes.
Your code is incomplete and a variable may be getting overwritten unexpectedly; you can use a list comprehension directly.
import datetime

converted = [[datetime.datetime.strptime(i[0], '%d/%m/%Y %H:%M'), float(i[1]), float(i[2])] for i in bd]
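For the sample bd above, the first converted record then looks like:
print(converted[0])
# [datetime.datetime(2021, 3, 10, 0, 0), 6.2, 33.3]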

unhashable type: 'dict'

I am new here and want to ask something about removing duplicate data entries. Right now I am doing a project on face recognition and am stuck removing the duplicate entries that I send to Google Sheets. This is the code that I use:
if confidence < 100:
    id = names[id]
    confidence = "{0}%".format(round(100 - confidence))
    row = (id, datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    index = 2
    sheet.insert_row(row, index)
    data = sheet.get_all_records()
    result = list(set(data))
    print(result)
The error message is "unhashable type: 'dict'".
I want each result to be posted to the Google Sheet only once.
You can't add dictionaries to sets.
What you can do is add each dictionary's items, cast to a tuple, to a set, like so:
s = {tuple(d.items()) for d in data}
If you need to convert these back to dictionaries afterwards, you can do:
restored = [dict(t) for t in s]
According to the gspread documentation, get_all_records() returns a list of dicts, where each dict has a head-row value as key and the cell value as value. So you need to iterate through this list and compare the ids to find and remove repeating items. Sample code:
visited = []
filtered = []
for row in data:
    if row['id'] not in visited:
        visited.append(row['id'])
        filtered.append(row)
Now, filtered should contain only the unique items. But instead of id you should put the name of the column that contains the repeating value.
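Since the not in test on a list is linear, a set makes the lookup constant-time; the same logic with a different container:
seen = set()
filtered = []
for row in data:
    if row['id'] not in seen:
        seen.add(row['id'])
        filtered.append(row)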

Extract a string from a CSV cell containing special characters in Python

I'm writing a Python program to extract specific values from each cell in a .CSV file column and then make all the extracted values new columns.
Sample column cell (this is actually a small part; the real cell contains much more data):
AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,"PossibleDataMissing":false,"StreamDirection":"FROM-to-
One value I'm trying to extract is the number 10 between "JitterInterArrival": and ,"JitterInterArrivalMax". But since each cell contains relatively long strings and special characters around it (such as ""), opener=re.escape(r"***") and closer=re.escape(r"***") wouldn't work.
Does anyone know a better solution? Thanks a lot!
IIUC, you have a json string and wish to get values from its attributes. So, given
s = '''
{"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,
"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,
"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,
"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,
"PossibleDataMissing":false}]}
'''
You can do
>>> import json
>>> data = json.loads(s)
>>> data['AudioStreams'][0]['JitterInterArrival']
10
In a DataFrame scenario, if you have a column col of strings such as the above, e.g.
import pandas as pd

df = pd.DataFrame({"col": [s]})
You can use transform, passing json.loads as the argument,
df.col.transform(json.loads)
to get a Series of dictionaries. Then, you can manipulate these dicts or just access the data as done above.
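For example, to pull one attribute out of every parsed row into its own column (a sketch assuming each cell holds JSON shaped like s above):
parsed = df.col.transform(json.loads)
df['JitterInterArrival'] = parsed.map(lambda d: d['AudioStreams'][0]['JitterInterArrival'])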

Getting all column values from google sheet using Gspread and Python

So I have a problem with Gspread for Python 3.
When I do something like:
x = worksheet.cell(1,1).value
print(x)
Then I get the value of cell 1,1, which in my case is:
Nice
But when I do:
x = worksheet.col_values(1)
print(x)
Then I get all the results, as in:
'Nice', 'Cool','','','','','','','','','','','','','',''
And all the empty cells as well, which I don't understand, since I am asking just for values. Why do I get all the '' empty strings, and why are the results wrapped in a list? I would expect something like:
Nice
Cool
when I call for the values of a column and those are the only values. Does anyone know how to get such results?
According to the documentation at https://github.com/burnash/gspread it should work, but it does not.
You are getting all of the column data, contained in a list. It starts at row one and gives you all rows in that column to the bottom of the spreadsheet (1000 rows by default), including empty cells. The documentation tells you this:
col_values(col) Returns a list of all values in column col.
Empty cells in this list will be rendered as None.
This seems to have been changed to return empty strings instead, but the principle is the same.
To get just values, use a list comprehension:
x = [item for item in worksheet.col_values(1) if item]
Noting that the above will remove blank rows between items, which might cause misalignment if you try to work with multiple columns where row number is important. Since it's a list, individual items are accessed with:
for item in x:
    print(item)
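If row alignment matters later, a variation that keeps the 1-based row numbers alongside the non-empty values (a sketch):
x = [(row, item) for row, item in enumerate(worksheet.col_values(1), start=1) if item]
# e.g. [(1, 'Nice'), (2, 'Cool')]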
Looking again at the gspread documentation, I was able to create a DataFrame and then obtain the column values:
import gspread
import pandas as pd
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())
sht2 = gc.open_by_url('https://docs.google.com/spreadsheets/d/<id>')
worksheet = sht2.worksheet("Sheet-name")
dataframe = pd.DataFrame(worksheet.get_all_records())
dataframe.head(3)
Note: Don't forget to set your sheet's sharing settings to "Anyone with the link", to be able to access the sheet from e.g. Google Colab.
You can also create a while loop and make something like this.
Let's say you want columns E to G; you can start the loop at x = 5 and end it at x = 7. Just make sure that you transpose the DataFrame at the end before printing it.
import pandas as pd

columns = []
x = 5
while x < 8:
    data = sheet.col_values(x)[1:]
    x += 1
    columns.append(data)

df = pd.DataFrame(columns).T
print(df)
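The same loop can also be written with range and a list comprehension, which reads a bit more naturally (same sheet assumptions as above):
columns = [sheet.col_values(x)[1:] for x in range(5, 8)]  # columns E to G
df = pd.DataFrame(columns).T
print(df)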
