I need to analyze metadata from here: http://jmcauley.ucsd.edu/data/amazon/links.html
However, the metadata JSON files there are nested and use single quotes instead of double quotes, so I can't use json_normalize to flatten the data into a Pandas dataframe.
Example:
{'A':'1', 'B':{'c':['1','2'], 'd':['3','4']}}
I need to flatten this into a Pandas dataframe with columns A, B.c, B.d.
Following the guideline given at the link, I used eval to get A and B, but I can't get B.c and B.d.
Could you please suggest a way to do this?
That's a Python dict, not JSON. If you want to convert it to a DataFrame, just do this:
d = {'A':'1', 'B':{'c':['1','2'], 'd':['3','4']}}
df = pd.DataFrame(d)
   A       B
c  1  [1, 2]
d  1  [3, 4]
If your problem is loading this text into a Python dict, you can try a couple of things (see the sketch below):
replace the single quotes -> json.loads(data.replace("'", '"'))
read it as a Python dict -> eval(data)
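Here is a minimal sketch of both options on the example record above, then flattening with pd.json_normalize to get the A, B.c and B.d columns (pd.json_normalize assumes pandas 1.0+; on older versions use pandas.io.json.json_normalize):

import ast
import json
import pandas as pd

data = "{'A':'1', 'B':{'c':['1','2'], 'd':['3','4']}}"

# Option 1: swap the quotes and parse as JSON (breaks if any value itself contains quotes)
record = json.loads(data.replace("'", '"'))

# Option 2: parse the Python-literal syntax directly (safer than eval)
record = ast.literal_eval(data)

# Flatten the nested dict; nested keys become dotted column names
df = pd.json_normalize(record)
print(df.columns.tolist())  # ['A', 'B.c', 'B.d']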
JSON cannot have keys or values enclosed in single quotes.
If you have to parse a string with single quotes as a dict, then you can use ast.literal_eval:
import ast
from pandas.io.json import json_normalize  # pd.json_normalize in pandas 1.0+

data = str({'A':'1', 'B':{'c':['1','2'], 'd':['3','4']}})  # single-quoted string
data_dict = ast.literal_eval(data)            # parse the Python-literal string into a dict
data_normalized = json_normalize(data_dict)   # flatten into columns A, B.c, B.d
https://stackoverflow.com/a/21154138/13561487
Related
I have a dataframe that contains JSON-style dictionaries in each row, and I need to extract the values of the second and fourth key-value pairs into new columns, i.e. 'a' for row 1 and 'a' for row 2 (corresponding to Group), and '10786' for row 1 and '38971' for row 2 (corresponding to Code). The expected output is below.
dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
'{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})
expected output
a Group Code
0 {"Number":1,"Group":"a","First":"Yes","Second"... a 10786
1 {"Number":2,"Group":"a","First":"Yes","Second"... a 38971
I have tried indexing by location, but it's printing only characters rather than fields, e.g.
tuple(dat['a'].items())[0][1][4]
I also can't seem to normalize the data with json_normalize, and I'm not sure why - perhaps the JSON string is stored suboptimally. I am quite confused, and any tips would be great. Thanks so much.
The reason pd.json_normalize is not working is that it operates on dictionaries, and your strings are not JSON compatible.
Your strings are not JSON compatible because the "Labs" values contain nested quotes which aren't escaped.
It's possible to write a quick function to make your strings json compatible, then parse them as dictionaries, and finally use pd.json_normalize.
import pandas as pd
import json
import re
jsonify = lambda x: re.sub(pattern='"Labs":"(.*)"', repl=r'"Labs":\g<1>', string=x)  # Remove the unnecessary quotes around the Labs list
dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
'{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})
json_series = dat['col1'].apply(jsonify) # Remove unnecessary quotes
json_series = json_series.apply(json.loads) # Convert string to dictionary
output = pd.json_normalize(json_series) # "Flatten" the dictionary into columns
output.insert(loc=0, column='a', value=dat['col1']) # Add the original column as a column named "a" because that's what the question requests.
                                                                                           a  Number Group First   Code  Desc             Labs
0  {"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}       1     a   Yes  10786  True  ['a', 'b', 'c']
1  {"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}       2     a   Yes  38971  True  ['a', 'b', 'c']
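To match the expected output from the question, you could then keep just the original column plus Group and Code (this continues from the output frame built above):

result = output[['a', 'Group', 'Code']]
print(result)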
Although I referred to many sources, like How to convert string representation of list to a list?, I couldn't solve my problem below.
My list looked like the one below; I added it to the dataframe as a column and saved the dataframe.
ls = [['abc'],['a"bcd"e', "ab'cde'"]]
df['list_col'] = ls
df.to_csv('path')
Afterwards, I reopened the dataframe and confirmed with the code below that the lists had been converted to string representations of lists.
type(df.list_col[0]) # str
So I tried to convert the string representations back to lists using the code below.
import ast
df.list_col = [ast.literal_eval(ls) for ls in df.list_col]
# SyntaxError: EOL while scanning string literal
Is there any way I can solve this problem?
Use the converters parameter of pandas.read_csv when reading the file in.
import pandas as pd
from ast import literal_eval
# test dataframe
ls = [['abc'],['a"bcd"e', "ab'cde'"]]
df = pd.DataFrame({'test': ls})
# save to csv
df.to_csv('test2.csv', index=False)
# read file in with converters
df2 = pd.read_csv('test2.csv', converters={'test': literal_eval})
print(type(df2.iloc[0, 0]))
[out]: <class 'list'>
Is this what you want?
>>> ls = [['abc'],['a"bcd"e', "ab'cde'"]]
>>> l = [i for a in ls for i in a]
['abc','a"bcd"e', "ab'cde'"]
My code works perfectly fine for one dataframe using to_json.
However, now I would like to have a second dataframe in this result.
So I thought creating a dictionary would be the answer.
However, it produces the result below, which is not practical.
Any help please.
I was hoping to produce something a lot prettier, without all the "\".
A simple good example
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
df.to_json(orient='records')
A simple bad example
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
{"result_1": df.to_json(orient='records')}
I also tried
jsonify({"result_1": df.to_json(orient='records')})
and
{"result_1": [df.to_json(orient='records')]}
Hi, I think you are on the right track.
My advice is to also use json.loads to decode the JSON and create a list of dictionaries.
As you said, we can create a pandas dataframe and then use df.to_json to convert it.
Then use json.loads on that JSON string and insert the resulting list into a dictionary, e.g.:
import json

data = {}
jsdf = df.to_json(orient="records")  # JSON string
data["result"] = json.loads(jsdf)    # back to a list of dicts, no escaped quotes
Adding more dataframes to the dictionary the same way, you will end up with something like this:
{"result1": [{...}], "result2": [{...}]}
PS:
If you want to generate random values for different dataframes, you can use the faker library for Python, e.g.:
from faker import Faker

faker = Faker()
rows = []
for n in range(5):
    rows.append(list(faker.profile().values()))  # one fake profile per row
df = pd.DataFrame(rows, columns=faker.profile().keys())
I have a dataframe with one column named "metadata" in unicode format, as can be seen below:
print(df.metadata[1])
u'{"vehicle_year":2010,"issue_state":"RS",...,"type":4}'
type(df.metadata[1])
unicode
I have another column in this dataframe named 'issue_state_update', and I need to replace the 'issue_state' value inside each row's metadata with that row's value from the 'issue_state_update' column.
I have tried the following:
for i in range(len(df_final['metadata'])):
    df_final['metadata'][i] = json.loads(df_final['metadata'][i])
    json.dumps(df_final['metadata'][i].update({'issue_state': df_final['issue_state_update'][i]}), ensure_ascii=False).encode('utf-8')
However what I get is an error:
TypeError: expected string or buffer
What I need is to have exactly the same format as before doing this change, but with the new info associated with 'issue_state'
For example:
u'{"vehicle_year":2010,"issue_state":"NO STATE",...,"type":4}'
I'm assuming you have a DataFrame (DF) that looks something like:
screenshot of a DF I mocked up
Since you're working with a DF you should manipulate the data as a vector instead of iterating over it like in standard Python. One way to do this is by defining a function and then "applying" it to your data. Something like:
def parse_dict(x):
    # x is a row; mutate the dict stored in the 'metadata' cell in place
    x['metadata']['issue_state'] = x['issue_state_update']
Then you could apply it to every row in your DataFrame using:
some_df.apply(parse_dict, axis=1)
After running that code I get an updated DF that looks like:
updated DF where dict now has value from 'issue_state_update'
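If the metadata column actually holds JSON strings (as in the question) rather than dicts, a minimal sketch of the same apply-based idea, with a made-up row for illustration, first parses the string and then serializes it again so the column keeps its string format:

import json
import pandas as pd

df_final = pd.DataFrame({
    'metadata': ['{"vehicle_year":2010,"issue_state":"RS","type":4}'],  # made-up example row
    'issue_state_update': ['NO STATE'],
})

def replace_issue_state(row):
    meta = json.loads(row['metadata'])               # str -> dict
    meta['issue_state'] = row['issue_state_update']  # overwrite the value
    return json.dumps(meta, ensure_ascii=False)      # dict -> str again

df_final['metadata'] = df_final.apply(replace_issue_state, axis=1)
print(df_final['metadata'][0])
# {"vehicle_year": 2010, "issue_state": "NO STATE", "type": 4}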
Actually I have found the answer. I don't know how efficient it is, but it works. Here it goes:
import json
import unicodedata

def replacer(df):
    df_final = df
    df_final['issue_state_upd'] = ""
    for i in range(len(df_final['metadata'])):
        # From unicode to string
        df_final['issue_state_upd'][i] = unicodedata.normalize('NFKD', df_final['metadata'][i]).encode('ascii', 'ignore')
        # From string to dict
        df_final['issue_state_upd'][i] = json.loads(df_final['issue_state_upd'][i])
        # Replace the value of the 'issue_state' key
        df_final['issue_state_upd'][i].update({'issue_state': df_final['issue_state_update'][i]})
        # From dict to str
        df_final['issue_state_upd'][i] = json.dumps(df_final['issue_state_upd'][i])
        # From str to unicode (Python 2)
        df_final['issue_state_upd'][i] = unicode(df_final['issue_state_upd'][i], "utf-8")
    return df_final
I'm using Pandas to load an Excel spreadsheet which contains zip codes (e.g. 32771). The zip codes are stored as 5-digit strings in the spreadsheet. When they are pulled into a DataFrame using the command...
xls = pd.ExcelFile("5-Digit-Zip-Codes.xlsx")
dfz = xls.parse('Zip Codes')
they are converted into numbers. So '00501' becomes 501.
So my questions are, how do I:
a. Load the DataFrame and keep the string type of the zip codes stored in the Excel file?
b. Convert the numbers in the DataFrame into a five digit string e.g. "501" becomes "00501"?
As a workaround, you could convert the ints to 0-padded strings of length 5 using Series.str.zfill:
df['zipcode'] = df['zipcode'].astype(str).str.zfill(5)
Demo:
import pandas as pd
df = pd.DataFrame({'zipcode':['00501']})
df.to_excel('/tmp/out.xlsx')
xl = pd.ExcelFile('/tmp/out.xlsx')
df = xl.parse('Sheet1')
df['zipcode'] = df['zipcode'].astype(str).str.zfill(5)
print(df)
yields
zipcode
0 00501
You can avoid pandas' type inference with a custom converter, e.g. if 'zipcode' were the header of the column with zip codes:
dfz = xls.parse('Zip Codes', converters={'zipcode': lambda x:x})
This is arguably a bug, since the column was originally string encoded; I made an issue here.
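A minimal end-to-end sketch of that idea, assuming the workbook from the question and a column header of 'zipcode':

import pandas as pd

xls = pd.ExcelFile("5-Digit-Zip-Codes.xlsx")
# The identity converter bypasses type inference, so '00501' stays a string
dfz = xls.parse('Zip Codes', converters={'zipcode': lambda x: x})
print(dfz['zipcode'].head())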
str(my_zip).zfill(5)
or
print("{0:>05s}".format(str(my_zip)))
are two of many ways to do this.
The previous answers have correctly suggested using zfill(5). However, if your zip codes are already in float dtype for some reason (I recently encountered data like this), you first need to convert them to int. Then you can use zfill(5).
df = pd.DataFrame({'zipcode':[11.0, 11013.0]})
zipcode
0 11.0
1 11013.0
df['zipcode'] = df['zipcode'].astype(int).astype(str).str.zfill(5)
zipcode
0 00011
1 11013