Error in loading json with os and pandas package [duplicate] - python

I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the University of Washington machine learning course on Coursera. You can find the file here.

Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key-value pairs whose values are scalars. You can convert it to a dataframe with ser.to_frame('count').
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
    data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
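Either route ends in the same place; for reference, a compact sketch of the series route (column name 'count' as suggested above):
import pandas as pd

ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
df = ser.to_frame('count')                  # index = the words, one 'count' column
df = df.rename_axis('word').reset_index()   # optional: make the words a regular column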

You can do as #ayhan mentions, which will give you a column-based format.
Or you can enclose the object in [ ] (source), as shown below, to get a row format that is convenient if you are loading multiple values and planning on using a matrix for your machine learning models.
df = pd.DataFrame([data])
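For a quick feel for the difference, a sketch on a trimmed sample of the dict from the question:
import pandas as pd

data = {"biennials": 522004, "lb915": 116290}  # trimmed sample
pd.DataFrame({'count': data})   # column format: one row per key
pd.DataFrame([data])            # row format: one row, one column per key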

I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, pandas wants you to pass an index. You have to convert the string to a dict, which is exactly what the other response is doing.
The best way is to do a json.loads on the string to convert it to a dict and load that into pandas:
with open('people_wiki_map_index_to_word.json') as f:
    myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame([jsonData])  # wrap in a list: the dict's values are all scalars

Suppose values.json contains only scalar values:
{
    "biennials": 522004,
    "lb915": 116290
}
df = pd.read_json('values.json')
pd.read_json expects a list for each key, i.e. something like
{
    "biennials": [522004],
    "lb915": [116290]
}
so with scalar values it returns an error saying
If using all scalar values, you must pass an index.
So you can resolve this by specifying the typ arg in pd.read_json (its documented values are 'frame' and 'series'):
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')

For newer pandas, 0.19.0 and later, use the lines parameter and set it to True.
The file is then read as one JSON object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the JSON files contained only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
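For illustration, this is the shape lines=True expects; a hypothetical records.jsonl with one JSON object per line:
import pandas as pd

# contents of the hypothetical records.jsonl:
# {"word": "biennials", "count": 522004}
# {"word": "lb915", "count": 116290}
df = pd.read_json('records.jsonl', lines=True)  # one row per line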

For example, with
cat values.json
{
    "name": "Snow",
    "age": "31"
}
df = pd.read_json('values.json')
chances are you might end up with this:
Error: if using all scalar values, you must pass an index
Pandas looks for a list or dictionary in each value, something like
cat values.json
{
    "name": ["Snow"],
    "age": ["31"]
}
So try the following; later, to convert to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()

I solved this by converting the object into an array, like so:
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]

Related

How can I handle missing values in the dictionary when I use the function eval(String dictionary) -> dictionary PYTHON?

I need to convert the 'content' column from a string dictionary to a dictionary in Python. After that I will use the following line of code:
df['content'].apply(pd.Series)
to have the dictionary keys as the column names and the dictionary values in the cells.
I can't do this now because there are missing values in the dictionary string.
How can I handle missing values in the dictionary when I use the function eval(String dictionary) -> dictionary?
I'm working on the 'content' column, which I want to convert to the correct format first. I tried the eval() function, but it doesn't work because there are missing values. This is JSON data. My goal is to have the content column's keys as the column titles and its values in the cells (a screenshot of the data accompanied the original question).
You can use json.loads in a lambda function: if the row value is NaN, keep it; if not, apply json.loads:
import json
import numpy as np
df['content'] = df['content'].apply(lambda x: json.loads(x) if pd.notna(x) else np.nan)
Now you can use pd.Series:
v1 = df['content'].apply(pd.Series)
df = df.drop(['content'], axis=1).join(v1)
If you have missing values in the string dictionaries:
def check_json(x):
    import ast
    import json
    if pd.isna(x):
        return np.nan
    try:
        return json.loads(x)
    except Exception:
        try:
            mask = x.replace('{', '').replace('}', '')  # strip the braces
            mask = mask.split(",")
            for i in range(len(mask)):
                if not len(mask[i].partition(":")[-1]) > 0:  # key with a missing value
                    print(mask[i])
                    mask[i] = mask[i] + '"None"'  # ---> you can replace None with what you want
            return json.loads('{' + ','.join(mask) + '}')
        except Exception:
            try:
                x = x.replace("'", '"')
                mask = x.replace('{', '').replace('}', '')  # strip the braces
                mask = mask.split(",")
                for i in range(len(mask)):
                    if not len(mask[i].partition(":")[-1]) > 0:  # key with a missing value
                        print(mask[i])
                        mask[i] = mask[i] + '"None"'  # ---> you can replace None with what you want
                return ast.literal_eval('{' + ','.join(mask) + '}')
            except Exception:
                print("Could not parse json object. Returning nan")
                return np.nan

df['content'] = df['content'].apply(check_json)
v1 = df['content'].apply(pd.Series)
df = df.drop(['content'], axis=1).join(v1)
I cannot see what the missing values look like in your screenshot, but I tested the following code and got what seems to be a good result. The simple explanation is to use str.replace() to fix the null values before parsing the string to a dict.
import pandas as pd
import numpy as np
import json
## setting up an example dataframe. note that row2 has a null value
json_example = [
    '{"row1_key1":"row1_value1","row1_key2":"row1_value2"}',
    '{"row2_key1":"row2_value1","row2_key2": null}'
]
df = pd.DataFrame()
df['Content'] = json_example
## using string replace on the string representation of the json to clean it up
df['Content'] = df['Content'].apply(lambda x: x.replace('null', '"0"'))
## using lambda x to first load the string into a dict, then applying pd.Series()
result = df['Content'].apply(lambda x: pd.Series(json.loads(x)))
Output: each JSON key becomes its own column, with the null replaced by "0".

Store string in a column as nested JSON to a JSON file - Pyspark

I have a pyspark dataframe, this is what it looks like
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|member_uuid |Timestamp |updated |member_id |easy_id |
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|027130fe-584d-4d8e-9fb0-b87c984a0c20|2020-02-11 19:15:32|password_hash|ajuypjtnlzmk4na047cgav27jma6_STG|993269700|
I transformed the above dataframe to this,
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params |timestamp |
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile |UPDATE |{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"}|2020-02-11 19:15:32|
Using the following code,
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, struct

ll = ['member_uuid', 'member_id', 'easy_id', 'field']
df = df.withColumn('timestamp', col('Timestamp')).withColumn('attribute', lit('profile')).withColumn('operation', lit(col_name)) \
    .withColumn('field', col('updated')).withColumn('params', F.to_json(struct([x for x in ll])))
df = df.select('attribute', 'operation', 'params', 'timestamp')
I have to save this dataframe df to a text file after converting it to JSON.
I tried using the following code to do so,
df_final.toJSON().coalesce(1).saveAsTextFile('file')
The file contains,
{"attribute":"profile","operation":"UPDATE","params":"{\"member_uuid\":\"027130fe-584d-4d8e-9fb0-b87c984a0c20\",\"member_id\":\"ajuypjtnlzmk4na047cgav27jma6_STG\",\"easy_id\":993269700,\"field\":\"password_hash\"}","timestamp":"2020-02-11T19:15:32.000Z"}
I want it to save in this format,
{"attribute":"profile","operation":"UPDATE","params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"},"timestamp":"2020-02-11T19:15:32.000Z"}
to_json saves the value in the params column as a string; is there a way to keep the JSON structure here so I can save it in the desired format?
Don't use to_json to create the params column in the dataframe.
The trick is to create the struct and write it to the file (using .saveAsTextFile() or .write.json()); Spark will create the JSON for the struct field itself.
If we create the JSON string first and then write in JSON format, Spark adds \ to escape the quotes that already exist in the JSON string.
Example:
from pyspark.sql.functions import *
#sample data
df=spark.createDataFrame([("027130fe-584d-4d8e-9fb0-b87c984a0c20","2020-02-11 19:15:32","password_hash","ajuypjtnlzmk4na047cgav27jma6_STG","993269700")],["member_uuid","Timestamp","updated","member_id","easy_id"])
df1=df.withColumn("attribute",lit("profile")).withColumn("operation",lit("UPDATE"))
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").write.format("json").mode("overwrite").save("<path>")
#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").toJSON().saveAsTextFile("<path>")
#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}
A simple way to handle it is to just do a replace operation on the file
sourceData = open('file').read().replace('"{', '{').replace('}"', '}').replace('\\', '')
with open('file', 'w') as final:
    final.write(sourceData)
This might not be what you are looking for, but will achieve the end result.
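A quick sanity check (a sketch, assuming the same 'file' path as in the snippet above) that every line now parses with params as a nested object:
import json

with open('file') as f:
    for line in f:
        rec = json.loads(line)
        assert isinstance(rec['params'], dict)  # nested object, not an escaped string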

Pandas unexpected scientific notation

I have a large csv file that contains some bus network information.
The stop codes are made of a large number with a certain letter at the end; however, some of them are only numbers. When I read them into pandas, the large numbers are shown in scientific notation, like:
code_o lat_o lon_o code_d
490016444HN 51.56878 0.1811568 490013271R
490013271R 51.57493 0.1781319 490009721A
490009721A 51.57708 0.1769355 490010407C
490010407C 51.57947 0.1775409 490011659G
490011659G 51.5806 0.1831088 490009810M
490009810M 51.57947 0.1848733 490014448S
490014448S 51.57751 0.185111 490001243Y
490001243Y 51.57379 0.1839945 490013654S
490013654S 51.57143 0.184776 490013482E
490013482E 51.57107 0.187039 490015118E
490015118E 51.5724 0.1923417 490011214E
490011214E 51.57362 0.1959939 490006980E
490006980E 51.57433 0.1999537 4.90E+09
4.90E+09 51.57071 0.2087701 490003049E
490003049E 51.5631 0.2146196 490004001A
490004001A 51.56314 0.2165552 490015350F
Their dtype is object; however, I need them as normal numbers in order to join with other tables.
Since the column is not 'int' or 'float', I cannot modify it as a whole column.
Any suggestions?
I attached the file from dropbox
https://www.dropbox.com/s/jhbxsncd97rq1z4/gtfs_OD_links_L.csv?dl=0
IIUC, try forcing object type for the code_d column on import:
import numpy as np
import pandas as pd
df = pd.read_csv('your_original_file.csv', dtype={'code_d': 'object'})
You can then parse that column, discarding the letter at the end and casting the result to integer type:
df['code_d'] = df['code_d'].str[:-1].astype(np.int64)
Keep it simple: df = pd.read_csv('myfile.csv', dtype=str) will read everything in as strings. Or, as #Alberto posted earlier, specify that column only: df = pd.read_csv('myfile.csv', dtype={'code_o': str}).
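Putting both suggestions together, a minimal sketch (column names as in the question's file; the regex strips trailing letters only when present, since some codes are purely numeric):
import pandas as pd

df = pd.read_csv('gtfs_OD_links_L.csv', dtype={'code_o': str, 'code_d': str})
# drop any trailing letters and cast the numeric part for joining
df['code_d_num'] = df['code_d'].str.replace(r'[A-Z]+$', '', regex=True).astype('int64')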

How to replace comma with dash using python pandas?

I have a file like this:
name|count_dic
name1 |{'x1':123,'x2,bv.':435,'x3':4}
name2|{'x2,bv.':435,'x5':98}
etc.
I am trying to load the data into a dataframe and count the number of keys in count_dic. The problem is that the dict items are separated by commas and some of the keys also contain commas. I am looking for a way to replace commas in the keys with '-' so that I can then separate the key-value pairs in count_dic, something like this:
name|count_dic
name1 |{'x1':123,'x2-bv.':435,'x3':4}
name2|{'x2-bv.':435,'x5':98}
etc.
This is what I have done.
df = pd.read_csv('file', names=['name', 'count_dic'], delimiter='|')
data = json.loads(df.count_dic)
and I get the following error:
TypeError: the JSON object must be str, not 'Series'
Does any body have any suggestions?
You can use ast.literal_eval as a converter when loading the dataframe, as it appears your data is more Python dict-like than JSON (JSON uses double quotes), eg:
import pandas as pd
import ast
df = pd.read_csv('file', delimiter='|', converters={'count_dic': ast.literal_eval})
Gives you a DF of:
name count_dic
0 name1 {'x2,bv.': 435, 'x3': 4, 'x1': 123}
1 name2 {'x5': 98, 'x2,bv.': 435}
Since count_dic is actually a dict, you can apply len to get the number of keys, eg:
df.count_dic.apply(len)
Results in:
0 3
1 2
Name: count_dic, dtype: int64
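Note that the converter parses the dicts but does not rename the keys; if you also want the commas inside keys turned into dashes, as in the desired output, a small follow-up sketch:
df['count_dic'] = df['count_dic'].apply(
    lambda d: {k.replace(',', '-'): v for k, v in d.items()})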
Once df is defined as above (but without the literal_eval converter, so count_dic is still a string):
# get a value to play around with
td = df.iloc[0].count_dic
td
# that looks like a dict definition... evaluate it?
eval(td)
eval(td).keys()  # yup!
# apply to the whole df (wrap map in list under Python 3)
df['count_dic'] = list(map(eval, df['count_dic']))
# and a hint towards your key-counting
list(map(lambda i: list(i.keys()), df['count_dic']))

DataFrame constructor not properly called! error

I am new to Python and I am facing problem in creating the Dataframe in the format of key and value i.e.
data = [{'key': '[GlobalProgramSizeInThousands]', 'value': '1000'},]
Here is my code:
columnsss = ['key', 'value']
query = "select * from bparst_tags where tag_type = 1"
result = database.cursor(db.cursors.DictCursor)
result.execute(query)
result_set = result.fetchall()
data = "["
for row in result_set:
    data += "{'value': %s , 'key': %s }," % (row["tag_expression"], row["tag_name"])
data += "]"
df = DataFrame(data, columns=columnsss)
But when I pass the data in DataFrame it shows me
pandas.core.common.PandasError: DataFrame constructor not properly called!
while if I print the data and assign the same value to data variable then it works.
You are providing a string representation of a dict to the DataFrame constructor, not a dict itself. That is why you get this error.
So if you want to use your code, you could do:
df = DataFrame(eval(data))
But it would be better not to create the string in the first place, and instead build a dict directly. Something roughly like:
data = []
for row in result_set:
    data.append({'value': row["tag_expression"], 'key': row["tag_name"]})
But probably even this is not needed, as depending on what exactly is in your result_set you could probably:
provide this directly to a DataFrame: DataFrame(result_set)
or use the pandas read_sql_query function to do this for you (see docs; a sketch follows below)
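For the second option, a sketch assuming conn is an open DB-API connection to the same database:
import pandas as pd

df = pd.read_sql_query(
    "select tag_expression, tag_name from bparst_tags where tag_type = 1", conn)
df = df.rename(columns={'tag_name': 'key', 'tag_expression': 'value'})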
Just ran into the same error, but the above answer could not help me.
My code worked fine on my computer which was like this:
test_dict = {'x': '123', 'y': '456', 'z': '456'}
df=pd.DataFrame(test_dict.items(),columns=['col1','col2'])
However, it did not work on another platform; it gave me the same error as mentioned in the original question. I tried the code below, simply adding list() around the dictionary items, and it worked smoothly after that:
df=pd.DataFrame(list(test_dict.items()),columns=['col1','col2'])
Hopefully this answer can help anyone who runs into a similar situation.
import json
import pandas as pd

# Opening the JSON file; json.load returns a dictionary
with open('data.json') as f:
    data1 = json.load(f)
# data1 is already a dict, so build the frame directly
# (pd.read_json expects a path or JSON string, not a dict)
df = pd.DataFrame.from_dict(data1, orient='index')
