Extract information from a string of data in a CSV file - python

I'm trying to run some analysis on some data and ran into some questions while parsing the data in a CSV file.
This is the raw data in one cell:
{"completed": true, "attempts": 1, "item_state": {"1": {"correct": true, "zone": "zone-7"}, "0": {"correct": true, "zone": "zone-2"}, "2": {"correct": true, "zone": "zone-12"}}, "raw_earned": 1.0}
Formatted for clarity:
{
  "completed": true,
  "attempts": 1,
  "item_state": {
    "1": {
      "correct": true,
      "zone": "zone-7"
    },
    "0": {
      "correct": true,
      "zone": "zone-2"
    },
    "2": {
      "correct": true,
      "zone": "zone-12"
    }
  },
  "raw_earned": 1.0
}
I want to extract only the zone information after each number (1, 0, 2) and put the results (zone-7, zone-2, zone-12) in separate columns. How can I do that using R or Python?

It looks like a dictionary, and when it is stored as an element in a CSV it is stored as a string. In Python, ast.literal_eval() parses a string containing Python literals (lists, dictionaries and so on) into the corresponding objects. Note that your sample uses JSON's lowercase true, which literal_eval() rejects, so the code below uses json.loads() instead.
If the cell you mentioned is at position [i,j] (row i, column j),
import json
import pandas as pd
df = pd.read_csv(filename)
a = json.loads(df.iloc[i, j])
b = pd.json_normalize(a)  # pandas >= 1.0; formerly pd.io.json.json_normalize
output = []
for n in range(df.shape[0]):
    c = json.loads(df.iloc[n, j])
    temp = pd.DataFrame({'key': list(c['item_state'].keys()), 'zone': [x['zone'] for x in c['item_state'].values()]})
    temp['row_n'] = n
    output.append(temp)
output2 = pd.concat(output)
If [i,j] is your cell,
a in the above code is the dictionary as given in your example.
b is a flattened dictionary and contains all key,value pairs in output.
The rest of the code is to extract only the zone values.
If you are looking to apply this for more than one cell, use the loop, else only use the content inside the loop.
output is a list of data frames, each of which has the item_state key and zone value as columns, plus a row number for identification.
output2 is concatenated data frame.
ast - Abstract Syntax Trees

In Python, you can use the json library to do something like this:
import json
d = json.loads(raw_cell_data) # Load the data into a Python dict
results = {}
for key, value in d['item_state'].items():
    results[key] = value['zone']
And then you can use results to print to a CSV.
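To get the zones into separate columns, as the question asks, one option is to write one column per item_state key. This is only a minimal sketch using the standard library. The column names zone_0, zone_1, ... and the in-memory sample cell are assumptions made for illustration, not part of the original data layout.

```python
import csv
import io
import json

# Hypothetical sample: each string stands in for one JSON cell read from the CSV.
cells = [
    '{"completed": true, "attempts": 1, "item_state": {"1": {"correct": true, "zone": "zone-7"}, "0": {"correct": true, "zone": "zone-2"}, "2": {"correct": true, "zone": "zone-12"}}, "raw_earned": 1.0}',
]

def zones_by_item(cell):
    """Map each item_state key to its zone, e.g. {'1': 'zone-7', ...}."""
    state = json.loads(cell).get("item_state", {})
    return {key: value["zone"] for key, value in state.items()}

rows = [zones_by_item(cell) for cell in cells]

# Collect every key seen in any row and sort numerically for a stable column order.
keys = sorted({key for row in rows for key in row}, key=int)
header = [f"zone_{key}" for key in keys]

buffer = io.StringIO()  # swap for open("zones.csv", "w", newline="") to write a file
writer = csv.writer(buffer)
writer.writerow(header)
for row in rows:
    # Rows that lack a key get an empty cell.
    writer.writerow([row.get(key, "") for key in keys])

print(buffer.getvalue())
```

Sorting the keys numerically keeps the columns in the same order for every row, even though the keys appear as 1, 0, 2 in the raw cell.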

The initial situation is a bit unclear: what you show looks like JSON, but you mention it is in a CSV.
Assuming you have a CSV where the individual fields are strings containing JSON data, you can extract the zone information using the csv and json packages.
Set up a for loop to iterate over the rows of the CSV (see the csv docs for more detail),
and then use the json module to extract the zone from the string.
import csv
import json
# to get ss from a csv:
# my_csv = csv.reader( ... )
# for row in my_csv:
# ss = row[N]
ss = '{"completed": true, "attempts": 1, "item_state": {"1": {"correct": true, "zone": "zone-7"}, "0": {"correct": true, "zone": "zone-2"}, "2": {"correct": true, "zone": "zone-12"}}, "raw_earned": 1.0}'
jj = json.loads(ss)
for vv in jj['item_state'].values():
    print(vv['zone'])

Parse the cell value as JSON; you can then access any element, like so:
import csv
import json
column_index = 0
state_keys = ['1', '0', '2']
with open('data.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        obj = json.loads(row[column_index])  # named obj to avoid shadowing the built-in object
        state = obj['item_state']
        # Show all values under item_state in the order they appear:
        for key, value in state.items():
            print(value['zone'])
        # Show only the keys listed in state_keys, in the order they are defined in the list
        for key in state_keys:
            print(state[key]['zone'])

Something like this. Not tested, as you have not provided a sufficient sample.
import csv
import json
with open('data.csv', newline='') as fr:
    rows = list(csv.reader(fr))
for row in rows:
    data = json.loads(row[0])
    new_col_data = [v['zone'] for v in data['item_state'].values()]
    row.append(", ".join(new_col_data))
with open('new_data.csv', 'w', newline='') as fw:
    writer = csv.writer(fw)
    writer.writerows(rows)

In R package rjson, function fromJSON is simple to use.
Any of the following ways of reading the JSON string will produce the same result.
library("rjson")
x <- '{"completed": true, "attempts": 1, "item_state": {"1": {"correct": true, "zone": "zone-7"}, "0": {"correct": true, "zone": "zone-2"}, "2": {"correct": true, "zone": "zone-12"}}, "raw_earned": 1.0}'
json <- fromJSON(json_str = x)
# if the string is in a file, say, "so.json"
#json <- fromJSON(file = "so.json")
json is an object of class "list"; make a data frame out of it.
result <- data.frame(zone_num = names(json$item_state))
result <- cbind(result, do.call(rbind.data.frame, json$item_state)[2])
result
# zone_num zone
#1 1 zone-7
#0 0 zone-2
#2 2 zone-12

Get item_state, read the zone from each entry, append each key and zone to two lists, and finally create new columns from those lists. Here d is the parsed cell and DF is your data frame, which is assumed to have one row per item:
zone_val = []
zone_key = []
for k, v in d['item_state'].items():
    zone_val.append(v['zone'])
    zone_key.append(k)
DF['zone_key'] = zone_key
DF['zone_val'] = zone_val

In Python, it looks like each cell's data is a dictionary that also contains dictionaries, i.e. nested dictionaries.
If this cell's data has been parsed (for example with json.loads) and is referenced by a variable cell_data, then you can get into the inner "item_state" dictionary with:
cell_data["item_state"]
This will return:
{'1': {'correct': True, 'zone': 'zone-7'}, '0': {'correct': True, 'zone': 'zone-2'}, '2': {'correct': True, 'zone': 'zone-12'}}
Then you can do the same operation one level deeper by asking for the "1" dictionary:
cell_data["item_state"]["1"]
returns:
{'correct': True, 'zone': 'zone-7'}
Then once more:
cell_data["item_state"]["1"]["zone"]
returns
'zone-7'
So to bring it all together, you could get what you want with the following:
your_list = list( cell_data["item_state"][i]['zone'] for i in ["1","0","2"] )
returns:
['zone-7', 'zone-2', 'zone-12']

Related

Mapping key and values in JSON

I am trying to map keys to values and write them as JSON, but I am unable to produce the required template below:
{Pregnancies : [], Glucose : [], BloodPressure : [], SkinThickness : [579], Insulin : [8,13,111,153,...so on]
Below is the code I am currently working on (names is a list with the values BloodPressure, SkinThickness, ... and Outlier_records has the values [], [], [579], [8,13,111,153,...]).
Outlier_records
names
joinedlist = names + Outlier_records
joinedlist
json.dumps(joinedlist)
os.chdir(Output)
with open('Outlier_Records.txt', 'w') as json_file:
    json.dump(joinedlist, json_file)
The output that I am getting now is attached in the image below whereas I am actually expecting the output to be mapped as above
{Pregnancies: [], BloodPressure: [], SkinThickness: [579], Insulin: [8,13,111,153,...so on]
The template you provided is in JSON format, which corresponds to a dict in Python, so in this case you need to create a dictionary and add the corresponding data to it as key-value pairs, as in the code below.
import json
names = ["blood", "test", "test1", "ntek"]
outliner_records = [
    [],
    [],
    [579],
    [8,13,111,153]
]
joinedDict = {}
for i in range(len(names)):
    joinedDict[names[i]] = outliner_records[i]
with open("tt.json", "w") as json_file:
    json.dump(joinedDict, json_file)
Instead of joining your lists you can make a dict from them to pass to json.dumps:
import json
keys = ['apples','bananas','fish']
values = [1,2,[1,2,3]]
out = dict(zip(keys,values))
print(json.dumps(out))
outputs:
{"apples": 1, "bananas": 2, "fish": [1, 2, 3]}

How to clear an array and reset the values in a for loop while building a Json string?

I am looping through each row in an Excel sheet using openpyxl to ultimately build a large JSON string that I can feed to an API.
For each row I build out my JSON structure. I need to split a cell value by " || ", and each resulting value needs to be added to a nested array inside a JSON section. My problem is that I build my list object in my for loop and append the JSON chunk to a larger array, and the list keeps accumulating values on each loop. So I used the .clear() method on the list to clear it after each loop, but then when I compile my final output the list is empty. It's as if it does not keep its values once it has been added to the larger list. It's almost like each loop needs its own unique array to use and keep the values. The tags section of the JSON is empty in the final output for every row, when it should hold the values for each unique iteration. I am new to Python and gave it a good whirl. Any suggestions in the right direction would be appreciated.
My data set (I have 3 rows in Excel). You can see that I have values that I want to split in the 7th column. That is the column I am looping through to split the values, as they will be nested in my JSON.
Row 1 (cells) = "ABC","Testing","Testing Again","DATE","DATE",Empty,"A || B || C".
Row 2 (cells) = "ABC 2","Testing 2","Testing Again 2","DATE","DATE",Empty,"X || Y || Z".
Row 3 (cells) = "ABC 3","Testing 3","Testing Again 3","DATE","DATE",Empty,Empty.
My Code.
#from openpyxl import Workbook
import json
from openpyxl import load_workbook
output_table = input_table.copy()
var_path_excel_file = flow_variables['Location']
workbook = load_workbook(filename=var_path_excel_file)
sheet = workbook.active
#create a null value to be used
emptyString = "Null"
#list out all of the sections of the json that we want to print out - these are based on the inputs
jsonFull = []
jsondata = {}
tags = []
for value in sheet.iter_rows(min_row=2,min_col=0,max_col=40,values_only=True):
    #I add my split values to an array so that way when i add the array to the json it will have the proper brackets i need for the API to run correctly
    if value[6] is not None:
        data = value[6].split(" || ")
        for temp in data:
            tags.append(temp)
    #I build out the json structure here that will be added for each excel row basically
    jsondata = {
        "name": value[0],
        "short_description": value[1],
        "long_description": value[2],
        "effective_start_date": value[3],
        "effective_end_date": value[4],
        "workflow_state": emptyString,
        "tags": tags
    }
    #Add the jsondata row to the larger collection
    jsonFull.append(jsondata)
    tags.clear()
print(json.dumps(jsonFull))
And then my desired outcome would be something like this. I just need to figure out the proper syntax for the list handling...and can't seem to find an example to base off of.
[
  {
    "name": "ABC",
    "short_description": "Testing",
    "long_description": "Testing Again",
    "effective_start_date": "2020-03-04T14:45:22Z",
    "effective_end_date": "2020-03-04T14:45:22Z",
    "workflow_state": "Null",
    "tags": [
      "A",
      "B",
      "C"
    ]
  },
  {
    "name": "ABC 2",
    "short_description": "Testing 2",
    "long_description": "Testing Again 2",
    "effective_start_date": "2020-03-04T14:45:22Z",
    "effective_end_date": "2020-03-04T14:45:22Z",
    "workflow_state": "Null",
    "tags": [
      "X",
      "Y",
      "Z"
    ]
  },
  {
    "name": "ABC 3",
    "short_description": "Testing 3",
    "long_description": "Testing Again 3",
    "effective_start_date": "2020-03-04T14:45:22Z",
    "effective_end_date": "2020-03-04T14:45:22Z",
    "workflow_state": "Null",
    "tags": [
    ]
  }
]
When you put tags into the dictionary you are not making a copy; you are storing a reference to the same list, so calling tags.clear() empties it in every dictionary that holds it. You need to create a new list at the beginning of each loop iteration, not reuse the same list.
for value in sheet.iter_rows(min_row=2,min_col=0,max_col=40,values_only=True):
    #I add my split values to an array so that way when i add the array to the json it will have the proper brackets i need for the API to run correctly
    if value[6] is not None:
        tags = value[6].split(" || ")
    else:
        tags = []
    #I build out the json structure here that will be added for each excel row basically
    jsondata = {
        "name": value[0],
        "short_description": value[1],
        "long_description": value[2],
        "effective_start_date": value[3],
        "effective_end_date": value[4],
        "workflow_state": emptyString,
        "tags": tags
    }
    #Add the jsondata row to the larger collection
    jsonFull.append(jsondata)
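The aliasing behaviour can be reproduced without Excel or openpyxl. Here is a small sketch comparing one shared list that is cleared on each pass with a fresh list per iteration. The rows data is made up for illustration and only mimics the seventh column from the question.

```python
import json

# Hypothetical stand-ins for the seventh column of each spreadsheet row.
rows = ["A || B || C", "X || Y || Z", None]

# Buggy version: every dict stores a reference to the same list object.
shared = []
buggy = []
for cell in rows:
    if cell is not None:
        shared.extend(cell.split(" || "))
    buggy.append({"tags": shared})
    shared.clear()  # empties the one list that all dicts point at

# Fixed version: a new list is created on every iteration.
fixed = []
for cell in rows:
    tags = cell.split(" || ") if cell is not None else []
    fixed.append({"tags": tags})

print(json.dumps(buggy))
print(json.dumps(fixed))
```

In the buggy version all three dictionaries end up with empty tags, because they share a single list. In the fixed version each dictionary keeps its own values.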

List key values for Json data file

I have a very long JSON file that I need to make sense of in order to query the data related to what I am interested in. To do this, I would like to extract all of the keys so that I know what is available to query. Is there a quick way of doing this, or should I just write a parser that traverses the JSON file and extracts anything in between either { and : or , and : ?
given the example:
[{"Name": "key1", "Value": "value1"}, {"Name": "key2", "Value": "value2"}]
I am looking for the values:
"Name"
"Value"
That will depend on whether there's any nesting. But the basic pattern is something like this:
import json
with open("foo.json", "r") as fh:
    data = json.load(fh)
all_keys = set()
for datum in data:
    keys = set(datum.keys())
    all_keys.update(keys)
This:
records = [{"Name": "key1", "Value": "value1"}, {"Name": "key2", "Value": "value2"}]
for val in records:
    print(val.keys())
gives you:
dict_keys(['Name', 'Value'])
dict_keys(['Name', 'Value'])
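If the file does contain nesting, the same idea can be extended with a recursive walk. The following is only a sketch: collect_keys is a hypothetical helper name, and the nested sample string is invented for illustration.

```python
import json

def collect_keys(node, found=None):
    """Recursively gather every dictionary key in a parsed JSON value."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            found.add(key)
            collect_keys(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_keys(item, found)
    return found

# Hypothetical nested sample; replace with json.load(...) on a real file.
text = '[{"Name": "key1", "Value": {"Unit": "m", "Amount": 3}}, {"Name": "key2", "Tags": [{"Label": "x"}]}]'
print(sorted(collect_keys(json.loads(text))))
```

Letting json parse the file first avoids hand-written scanning for { and : characters, which would break on braces or colons that appear inside string values.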

Convert CSV into JSON. How do I keep values with the same Index?

I am using this database: https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&localOrForeign%5B%5D=Foreign&start_year=1992&end_year=2019&group_by=year
I have preprocessed it into this csv (showing only 2 lines of 159):
year,combinedStatus,fullName,sortName,primaryNationality,secondaryNationality,tertiaryNationality,gender,photoUrl,photoCredit,type,lastStatus,typeOfDeath,status,employedAs,organizations,jobs,coverage,mediums,country,location,region,state,locality,province,localOrForeign,sourcesOfFire,motiveConfirmed,accountabilityCrossfire,accountabilityAssignment,impunityMurder,tortured,captive,threatened,charges,motive,lengthOfSentence,healthProblems,impCountry,entry,sentenceDate,sentence,locationImprisoned
1994,Confirmed,Abdelkader Hireche,,,,,Male,,,Journalist,,Murder,Killed,Staff,Algerian Television (ENTV),Broadcast Reporter,Politics,Television,Algeria,Algiers,,,Algiers,,Foreign,,Confirmed,,,Partial Impunity,No,No,No,,,,,,,,,
2014,Confirmed,Ahmed Hasan Ahmed,,,,,Male,,,Journalist,,Dangerous Assignment,Killed,Staff,Xinhua News Agency,"Camera Operator,Photographer","Human Rights,Politics,War",Internet,Syria,Damascus,,,Damascus,,Foreign,,Confirmed,,,,,,,,,,,,,,,
And I want to make this type of JSON out of it:
"Afghanistan": {"year": 2001, "fullName": "Volker Handloik", "gender": "Male", "typeOfDeath": "Crossfire", "employedAs": "Freelance", "organizations": "freelance reporter", "jobs": "Print Reporter", "coverage": "War", "mediums": "Print", "photoUrl": NaN}, "Somalia": {"year": 1994, "fullName": "Pierre Anceaux", "gender": "Male", "typeOfDeath": "Murder", "employedAs": "Freelance", "organizations": "freelance", "jobs": "Broadcast Reporter", "coverage": "Human Rights", "mediums": "Television", "photoUrl": NaN}
The problem is that Afghanistan (as you can see in the link) has had many journalist deaths. I want to list all these killings under the Index 'Afghanistan'. However, as I currently do it, only the last case (Volker Handloik) in the csv file shows up. How can I get it so every case shows up?
This is my code at the moment:
import pandas as pd
import pprint as pp
import json
# list with stand-ins for empty cells
missing_values = ["n/a", "na", "unknown", "-", ""]
# set missing values to NaN
df = pd.read_csv("data_journalists.csv", na_values = missing_values, skipinitialspace = True, error_bad_lines=False)
# columns
columns_keep = ['year', 'fullName', 'gender', 'typeOfDeath', 'employedAs', 'organizations', 'jobs', 'coverage', 'mediums', 'country', 'photoUrl']
small_df = df[columns_keep]
with pd.option_context('display.max_rows', None, 'display.max_columns', None): # more options can be specified also
    print(small_df)
# create dict with country-column as index
df_dict = small_df.set_index('country').T.to_dict('dict')
print(df_dict)
# make json file from the dict
with open('result.json', 'w') as fp:
    json.dump(df_dict, fp)
# use pretty print to see if dict matches the json example in the exercise
pp.pprint(df_dict)
I want to include all of these names (and more) in the JSON under the index Afghanistan
I think I will need a list of objects attached to each country's index, so that every country shows all of its cases instead of only one (each case currently being replaced by the next one in the CSV). I hope this is clear enough.
I'll keep your code up to the definition of small_df.
After that, we perform a groupby on the 'country' column and call DataFrame.to_json on each group:
# errors='ignore' keeps this working on pandas versions that already exclude the grouping column
country_series = small_df.groupby('country').apply(lambda r : r.drop(columns=['country'], errors='ignore').to_json())
country_series is a pd.Series with the countries as index.
After that, we create a nested dictionary, so that we have a valid JSON object:
fullDict = {}
for ind, a in country_series.items():  # Series.iteritems() was removed in pandas 2.0
    b = json.loads(a)
    c = b['fullName']
    smallDict = {}
    for index, journalist in c.items():
        smallDict[journalist] = {}
        for i in b.keys():
            smallDict[journalist][i] = b[i][index]
    fullDict[ind] = smallDict
The nomenclature in my part of the code is pretty bad, but I tried to write all the steps explicitly so that things should be clear.
Finally, we write the results to a file:
with open('result.json', 'w') as f:
    json.dump(fullDict, f)
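The question also mentions wanting a list of objects per country. A minimal sketch of that shape using only the standard library is shown below. This is an alternative to the pandas approach above, not a reconstruction of it. The sample CSV is heavily trimmed and the "Example Person" row is a made-up placeholder.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical, trimmed sample; "Example Person" is a placeholder row.
sample = """year,fullName,country
2001,Volker Handloik,Afghanistan
2001,Example Person,Afghanistan
1994,Pierre Anceaux,Somalia
"""

by_country = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    country = row.pop("country")      # use the country as the grouping key
    by_country[country].append(row)   # append instead of overwrite

print(json.dumps(by_country, indent=2))
```

Because each country maps to a list, a later row for the same country is appended rather than replacing the earlier one. Note that csv.DictReader returns every value as a string, so year would need converting if a number is required.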

How can i create the json object on python?

I tried to create a JSON object but I made a mistake somewhere. I'm getting some data from a CSV file (center is a string, lat and lng are floats).
My code:
data = []
data.append({
    'id': 'id',
    'is_city': False,
    'name': center,
    'county': center,
    'cluster': i,
    'cluster2': i,
    'avaible': True,
    'is_deleted': False,
    'coordinates': ('{%s,%s}' %(lat,lng))
})
json_data = json.dumps(data)
print(json_data)
It produces this:
[{
    "county": "County",
    "is_city": false,
    "is_deleted": false,
    "name": "name",
    "cluster": 99,
    "cluster2": 99,
    "id": "id",
    "coordinates": "{41.0063945,28.9048234}",
    "avaible": true
}]
This is what I want:
{
    "id" : "id",
    "is_city" : false,
    "name" : "name",
    "county" : "county",
    "cluster" : 99,
    "cluster2" : 99,
    "coordinates" : [
        41.0870185,
        29.0235126
    ],
    "available" : true,
    "isDeleted" : false
}
You are defining coordinates to be a string of the specified format. There is no way json can encode that as a list; you are saying one thing when you want another.
Similarly, if you don't want the top-level dictionary to be the only element in a list, don't define it to be an element in a list.
data = {
    'id': 'id',
    'is_city': False,  # Python booleans are True/False; json.dumps writes them as true/false
    'name': name,
    'county': county,
    'cluster': i,
    'cluster2': i,
    'available': True,
    'is_deleted': False,
    'coordinates': [lat, lng]
}
I don't know how you defined center, or how you expected it to have the value 'name' and the value 'county' at basically the same time. I have declared two new variables to hold these values; you will need to adapt your code to take care of this detail. I also fixed the typo in "available" where apparently you expected Python to somehow take care of this.
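As a runnable illustration of the points above, here is a sketch with placeholder values. The values assigned to name, county, lat, lng and i are assumptions standing in for one row of the CSV, which the question does not show.

```python
import json

# Hypothetical values standing in for one row read from the CSV.
name, county = "name", "county"
lat, lng = 41.0870185, 29.0235126
i = 99

data = {
    "id": "id",
    "is_city": False,           # Python's False becomes JSON false
    "name": name,
    "county": county,
    "cluster": i,
    "cluster2": i,
    "coordinates": [lat, lng],  # a list is encoded as a JSON array
    "available": True,
    "is_deleted": False,
}

json_data = json.dumps(data, indent=4)
print(json_data)
```

Passing indent=4 to json.dumps also gives readable, multi-line output directly, without a separate pretty-printing step.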
You can use pprint for pretty-printing in Python, but it should be applied to an object, not a string.
In your case json_data is a string that represents a JSON object, so you need to load it back into an object before you pprint it (or use the data variable itself, since it already holds this object in your example).
For example, try running:
import pprint
pprint.pprint(json.loads(json_data))
