Extract element of exploded JSON via name of list element - python

I have a JSON that I have read in using
data_fields = spark.read.json(json_files)
where json_files is the path to the JSON files. To extract the data from the JSON I then use:
input_data = data_fields.select('datarecords.fields')
I then give each record its own row via:
input_data = input_data.select(explode("fields").alias('fields'))
Resulting in data in the fields column that looks like:
fields
[[ID,, 101], [other_var,, 'some_value']]
[[other_var,, 'some_value'], [ID,, 102], [other_var_2,, 'some_value_2']]
Each sub-list element can be referred to using "name", "status" and "value" as its components. For example:
input_data = input_data.withColumn('new_col', col('fields.name'))
will extract the names of the elements, so in the first row above, "ID" and "other_var". I am trying to extract the id for each record into its own column, to end with:
id    fields
101   [[ID,, 101], [other_var,, 'some_value']]
102   [[other_var,, 'some_value'], [ID,, 102], [other_var_2,, 'some_value_2']]
For those cases where the id is the first element in the fields column (row 1 above), I can do this via:
input_data = input_data.withColumn('id', col('fields')[0].value)
However, as shown, the "id" is not always the first element of the list in the fields column, and there are many hundreds of potential sub-list elements. I have therefore been trying to extract the "id" via its name rather than its position in the list, but have drawn a blank. The nearest I have come is to use the below to identify which element it exists in:
input_data = input_data.withColumn('id', array_position(col('fields.name'),"ID"))
This returns the position, but I am not sure how to get from there to the value, unless I do something like:
result = input_data.withColumn('id',
    when(col('fields.name')[0] == 'ID', col('fields')[0].value)
    .when(col('fields.name')[1] == 'ID', col('fields')[1].value)
    .when(col('fields.name')[2] == 'ID', col('fields')[2].value))
Of course, the above is impractical with potentially hundreds of sub-list elements in the fields column.
Any help to extract the id efficiently, regardless of its position in the list, would be appreciated. Hopefully the above minimal example is clear.
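A sketch of one possible approach on Spark 2.4+, where SQL higher-order functions are available (this assumes each element of fields is a struct with name, status and value members, as described above):
from pyspark.sql.functions import expr

# keep only the sub-elements whose name is 'ID', then take the value
# of the first match (null if no element is named 'ID')
input_data = input_data.withColumn(
    'id',
    expr("filter(fields, x -> x.name = 'ID')[0].value")
)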

Related

unhashable type: 'dict'

I am new here and want to ask something about removing duplicate data entries. Right now I am doing a project on face recognition and am stuck removing duplicate entries from the data I send to Google Sheets. This is the code I use:
if confidence < 100:
    id = names[id]
    confidence = "{0}%".format(round(100 - confidence))
    row = (id, datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    index = 2
    sheet.insert_row(row, index)
data = sheet.get_all_records()
result = list(set(data))
print(result)
The error message is "unhashable type: 'dict'".
I want each result posted to the Google Sheet only once.
You can't add dictionaries to sets.
What you can do is add each dictionary's items to the set as a tuple of key/value pairs, like so:
s = {tuple(d.items()) for d in data}
If you need to convert these back to dictionaries afterwards, you can do:
for t in s:
    new_dict = dict(t)
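For instance, on some made-up rows the round trip de-duplicates like this:
data = [{'id': 1, 'name': 'a'}, {'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
s = {tuple(d.items()) for d in data}
deduped = [dict(t) for t in s]
print(deduped)  # only the two distinct dicts remain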
According to the documentation of gspread, get_all_records() returns a list of dicts, where each dict has the head-row value as key and the cell value as value. So you need to iterate through this list, comparing your ids to find and remove repeating items. Sample code:
visited = []
filtered = []
for row in data:
    if row['id'] not in visited:
        visited.append(row['id'])
        filtered.append(row)
Now filtered should contain only unique items. Instead of id, put the name of the column which contains the repeating value.

Convert JSON column in dataframe to simple array of values

I am trying to convert the JSON in the bbox (bounding box) column into a simple array of values for a DL project in Python in a Jupyter notebook.
The possible labels are the following categories: [glass, cardboard, trash, metal, paper].
[{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]
TO
([191 70 183 311], 0)
I'm looking for help converting the bbox column from the JSON object, for a single CSV that contains all the image names and their related bboxes.
UPDATE
The current column is a Series, so I keep getting "TypeError: the JSON object must be str, bytes or bytearray, not 'Series'" any time I try to apply JSON operations to the column. So far I have tried to convert the column into a JSON object and then pull out the values from the keys.
BB_CSV
You'll want to use a JSON decoder: https://docs.python.org/3/library/json.html
import json

li = json.loads('[{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]')
d = li[0]  # the single dictionary in the list
result = ([d[key] for key in "left top width height".split()], 0)
print(result)
Edit:
If you want to map the operation of extracting the values from the dictionary over all elements of the list, you can do:
extracted = []
for element in li:
    result = ([element[key] for key in "left top width height".split()], 0)
    extracted.append(result)
print(extracted[:10])  # `[:10]` limits the number of items displayed to 10
Similarly, as per my comment, if you do not want commas between the extracted numbers in the list, you can use:
without_comma = []
for element, zero in extracted:
    result_string = "([{}], 0)".format(" ".join(str(value) for value in element))
    without_comma.append(result_string)
It looks like each row of your bbox column contains a dictionary inside a list. I've tried to replicate your problem as follows. Edit: clarifying that the solution below assumes that what you're referring to as a "JSON object" is represented as a list containing a single dictionary, which is what it appears to be per your example and screenshot.
import pandas as pd

# Create an empty sample DataFrame with one row
df = pd.DataFrame([None], columns=['bbox'])
# Assign the sample item to the first row
df['bbox'][0] = [{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]
Now, to simply unpack the row you can do:
df['bbox_unpacked'] = df['bbox'].map(lambda x: x[0].values())
This will get you a new column holding the five dictionary values.
If you want to go further and apply your labels, you'll likely want to create a dictionary to contain your labeling logic. Per the example you're given in the comments, I've done:
labels = {
    'cardboard': 1,
    'trash': 2,
    'glass': 3
}
This should get you your desired layout if you want a one-line solution without writing your own function.
df['bbox_unpacked'] = df['bbox'].map(lambda x: (list(x[0].values())[:4],labels.get(list(x[0].values())[-1])))
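With the sample row and the labels dict above, this yields ([191, 70, 183, 311], 3).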
A more readable solution is to define your own function and use the .apply() method. Edit: since it looks like your JSON object is stored as a str inside your DataFrame rows, I added json.loads(row) to process the string before retrieving the keys. You'll need to import json to run it.
import json

def unpack_bbox(row, labels):
    # load the string into a JSON object (in this case a list of length
    # one containing the dictionary); index the list to its first item [0]
    # and use the .values() dictionary method to access the values only
    keys = list(json.loads(row)[0].values())
    bbox_values = keys[:4]
    bbox_label = keys[-1]
    label_value = labels.get(bbox_label)
    return bbox_values, label_value

df['bbox_unpacked'] = df['bbox'].apply(unpack_bbox, args=(labels,))

python, dictionary in a data frame, sorting

I have a python data frame called wiki, with the wikipedia information for some people.
Each row is a different person, and the columns are: 'name', 'text' and 'word_count'. The information in 'text' has been put into dictionary form (keys, values) to create the information in the column 'word_count'.
If I want to extract the row related to Barack Obama, then:
row = wiki[wiki['name'] == 'Barack Obama']
Now, I would like the most popular word. When I do:
adf = row[['word_count']]
I get another data frame because I see that:
type(adf)=<class 'pandas.core.frame.DataFrame'>
and if I do
adf.values
I get:
array([[ {u'operations': 1, u'represent': 1, u'office': 2, ..., u'began': 1}], dtype=object)
However, what is very confusing to me is that the size is 1:
adf.size = 1
Therefore, I do not know how to actually extract the keys and values; things like adf.values[1] do not work.
Ultimately, what I need to do is sort the information in word_count so that the most frequent words appear first.
But I would like to understand how to access the information that is inside a dictionary, inside a data frame... I am lost about the types here. I am not new to programming, but I am relatively new to Python.
Any help would be very much appreciated.
If the name column is unique, then you can make it the index of the DataFrame object: wiki.set_index("name", inplace=True). Then you can get the value with: wiki.at['Barack Obama', 'word_count'].
With your code:
row = wiki[wiki['name'] == 'Barack Obama']
adf = row[['word_count']]
The first line uses a boolean array to get the data; here is the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
wiki is a DataFrame object, and row is also a DataFrame object, with only one row if the name column is unique.
The second line gets a list of columns from the row; here is the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
You get a DataFrame with only one row and one column.
And here is the documentation for .at[]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#fast-scalar-value-getting-and-setting
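Putting it together, a minimal sketch of the whole flow (assuming name is unique and the word_count cells hold plain Python dicts):
wiki.set_index("name", inplace=True)
wc = wiki.at['Barack Obama', 'word_count']  # a plain Python dict

# sort by count, most frequent words first
top_words = sorted(wc.items(), key=lambda kv: kv[1], reverse=True)
print(top_words[:5])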

Dynamically parsing research data in python

The long (winded) version:
I'm gathering research data using Python. My initial parsing is ugly (but functional) code which gives me some basic information and turns my raw data into a format suitable for heavy duty statistical analysis using SPSS. However, every time I modify the experiment, I have to dive into the analysis code.
For a typical experiment, I'll have 30 files, each for a unique user. Field count is fixed for each experiment (but can vary from one to another 10-20). Files are typically 700-1000 records long with a header row. Record format is tab separated (see sample which is 4 integers, 3 strings, and 10 floats).
I need to sort my list into categories. In a 1000 line file, I could have 4-256 categories. Rather than trying to pre-determine how many categories each file has, I'm using the code below to count them. The integers at the beginning of each line dictate what category the float values in the row correspond to. Integer combinations can be modified by the string values to produce wildly different results, and multiple combinations can sometimes be lumped together.
Once they're in categories, number crunching begins. I get statistical info (mean, sd, etc. for each category for each file).
The essentials:
I need to parse data like the sample below into categories. Categories are combos of the non-floats in each record. I'm also trying to come up with a dynamic (graphical) way to associate column combinations with categories. Will make a new post for this.
I'm looking for suggestions on how to do both.
# data is a list of tab separated records
# fields is a list of my field names
# get a list of field types via gettype on our first row
# gettype is a function to get the type from a string without changing the data
fieldtype = [gettype(n) for n in data[1].split('\t')]
# get the indexes of the fields that aren't floats
mask = [i for i, field in enumerate(fieldtype) if field != "float"]
# for each row of data [skipping the first and last empty lists] we split (on tabs)
# and take the ith element of that split, where i is taken from the list mask,
# which tells us which fields are not floats
records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]
# we now get a unique set of combos
# since set doesn't happily take a list of lists, we join each row of values
# together into a comma separated string, so we end up with a list of strings
uniquerecs = set(",".join(row) for row in records)
print len(uniquerecs)
quit()

def gettype(s):
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"
Sample Data:
field0 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13 field14 field15
10 0 2 1 Right Right Right 5.76765674196 0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3 1 3 0 Left Left Right 8.00982745764 0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5 19 1 0 Right Left Left 4.69440026591 0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3 1 4 2 Left Right Left 9.58648184552 0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9 0 0 7 Left Left Left 7.65374257547 0.030318719717 0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397
Not sure if I understand your question, but here are a few thoughts:
For parsing the data files, you usually use the Python csv module.
For categorizing the data you could use a defaultdict with the non-float fields joined as a key for the dict. Example:
from collections import defaultdict
import csv

reader = csv.reader(open('data.file', 'rb'), delimiter='\t')
data_of_category = defaultdict(list)
lines = [line for line in reader]
mask = [i for i, n in enumerate(lines[1]) if gettype(n) != "float"]
for line in lines[1:]:
    category = ','.join(line[i] for i in mask)
    data_of_category[category].append(line)
This way you don't have to calculate the categories in the first place and can process the data in one pass.
And I didn't understand the part about "a dynamic (graphical) way to associate column combinations with categories".
For at least part of your question, have a look at Named Tuples
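For instance, a tiny sketch (with hypothetical field names) of what named tuples buy you:
from collections import namedtuple

# give tab-separated records readable attribute names
Record = namedtuple('Record', ['trial', 'block', 'condition'])
r = Record(10, 0, 2)
print(r.trial)  # 10, instead of the opaque r[0]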
Step 1: Use something like csv.DictReader to turn the text file into an iterable of rows.
Step 2: Turn that into a dict of first entry: rest of entries.
with open("...", "rb") as data_file:
lines = csv.Reader(data_file, some_custom_dialect)
categories = {line[0]: line[1:] for line in lines}
Step 3: Iterate over the items() of the data and do something with each line.
for category, line in categories.items():
    do_stats_to_line(line)
Some useful answers have already been given, but I'll throw mine in as well. Key points:
Use the csv module
Use collections.namedtuple for each row
Group the rows using a tuple of int field values as the key
If your source rows are sorted by the keys (the integer column values), you could use itertools.groupby. This would likely reduce memory consumption. Given your example data, and the fact that your files contain >= 1000 rows, this is probably not an issue to worry about.
import csv
from collections import defaultdict, namedtuple

def coerce_to_type(value):
    _types = (int, float)
    for _type in _types:
        try:
            return _type(value)
        except ValueError:
            continue
    return value

def parse_row(row):
    return [coerce_to_type(field) for field in row]

with open(datafile) as srcfile:
    data = csv.reader(srcfile, delimiter='\t')
    ## Read headers, create namedtuple
    headers = srcfile.next().strip().split('\t')
    datarow = namedtuple('datarow', headers)
    ## Wrap with parser and namedtuple
    data = (parse_row(row) for row in data)
    data = (datarow(*row) for row in data)
    ## Group by the leading integer columns
    grouped_rows = defaultdict(list)
    for row in data:
        integer_fields = [field for field in row if isinstance(field, int)]
        grouped_rows[tuple(integer_fields)].append(row)

## DO SOMETHING INTERESTING WITH THE GROUPS
import pprint
pprint.pprint(dict(grouped_rows))
EDIT You may find the code at https://gist.github.com/985882 useful.
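Regarding the itertools.groupby note above, a minimal sketch (assuming the rows really do arrive sorted by the integer key):
import itertools

def integer_key(row):
    # the leading integer columns form the grouping key
    return tuple(field for field in row if isinstance(field, int))

# `data` is the generator of namedtuple rows from the code above;
# groupby only merges adjacent rows, so this is correct only if the
# file is already sorted by this key
for key, group in itertools.groupby(data, key=integer_key):
    rows = list(group)  # one category in memory at a time
    # ... compute mean, sd, etc. for this category here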

How to store two different values returned from a query into list data types to be used later (plpy, Python)

I need to store two values, "id" and "name", returned from a SQL query in a variable which I can use later. Can I use a list for this purpose? I want to store the values from SQL at once and then only refer to the stored values. I was able to do so with only one value (id), but now I need to store both id and name together. The purpose is to do string comparison and, based on it, assign the corresponding id.
For example, first I tried to retrieve data from the database with:
rv = plpy.execute("select id, name from aa")
Now I need to store these two values somewhere in two variables, say id in storevalueID and name in storevalueName, so that later I can do things like:
if someXname = Replace(storevalueName, "hello", "") then
    assign its corresponding id to some variable, like xID = storevalueID
I am not sure if we can do this, but I need to do something like this.
Any help will be appreciated.
I'm not sure I understand your question completely. But if you were previously storing a list of "id"s:
mylist = []
mylist.append(id1)  # or however you get your id values
mylist.append(id2)
# ...
so mylist is something like [1, 2, 3], then you can simply use tuples to store more than one element that are associated together:
mylist = []
mylist.append((id1, name1))
mylist.append((id2, name2))
# ...
Now mylist is something like [ (1, 'Bob'), (2, 'Alice'), (3, 'Carol')]. You can perform string comparisons on the second element of each tuple in your list:
mylist[0][1] == 'Bob'   # True
mylist[1][1] == 'Alice' # True
Update: I just saw the updated question. In plpy, you should be able to access the variables like this:
for row in rv:
    the_id = row['id']
    name = row['name']
using the column names. See this page for more information.
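Tying this back to the question, a sketch (assuming rv comes from the plpy.execute call above, and someXname is the search string from the question's pseudocode):
# build a list of (id, name) tuples from the query result
stored = [(row['id'], row['name']) for row in rv]

# later: find the id whose stored name matches someXname
# after stripping "hello" from it
xID = None
for the_id, name in stored:
    if someXname == name.replace("hello", ""):
        xID = the_id
        break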
