Pandas: read_json Changes Data

I have a json file that I'm trying to read into Pandas. The file looks like this:
{"0": {"a": 0, "b": "some_text", "c": "other_text"},
"1": {"a": 1, "b": "some_text1", "c": "other_text1"},
"2": {"a": 2, "b": "some_text2", "c": "other_text2"}}
When I do:
df = pd.read_json("my_file.json")
df = df.transpose()
df.head()
I see:
     a           b            c
0    0   some_text   other_text
1    1  some_text1  other_text1
10  10  some_text2  other_text2
So the dataframe's index and column a have somehow gotten mangled in the process. What am I doing incorrectly?
Thanks!
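One workaround (a minimal sketch, assuming the real file contains more rows whose numeric string keys like "10" get sorted lexicographically): load the file with the stdlib json module and rebuild the numeric order yourself, bypassing read_json's axis handling.
import json
import pandas as pd

with open("my_file.json") as fh:
    raw = json.load(fh)

# orient="index" makes each top-level key a row; the keys arrive as strings,
# so cast them to int before sorting to get 0, 1, 2, ... instead of 0, 1, 10, ...
df = pd.DataFrame.from_dict(raw, orient="index")
df.index = df.index.astype(int)
df = df.sort_index()
print(df.head())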

Related

JSON file with duplicate keys to a dataframe or excel file

I have a huge JSON file with duplicate keys in each object, simplified example:
[
    {
        "a": 3,
        "b": "Banana",
        "c": 45,
        "a": 3,
        "a": 8
    }
]
Of course, my data has many more keys and objects, but this is a representative snippet.
I'd like it to look like this:
| a | b      | c  |
|---|--------|----|
| 3 | Banana | 45 |
| 3 | Banana | 45 |
| 8 | Banana | 45 |
I'm not picky; anything in Excel, R, Python... but none of the JSON parsers I've seen allow duplicates like this.
I've searched a lot, but I haven't found an answer. Is there any way I can do this without doing it manually? The dataset is HUGE.
P.S. I know duplicate keys are bad form in JSON. Both the key names and the values have duplicates, and I need all of them, but I was given the file this way.
Here's an R solution.
Premise: partially un-JSON-ify into lists with duplicate names, convert each into a frame individually, then aggregate into one frame.
I'll augment the data slightly to demonstrate more than one dictionary:
json <- '[
    {
        "a": 3,
        "b": "Banana",
        "c": 45,
        "a": 3,
        "a": 8
    },
    {
        "a": 4,
        "b": "Pear",
        "c": 46,
        "a": 4,
        "a": 9
    }
]'
Here's the code:
# simplifyDataFrame=FALSE keeps each JSON object as a named list,
# preserving the duplicated names that a data.frame would reject
L <- jsonlite::fromJSON(json, simplifyDataFrame=FALSE)
# split() groups each object's values by name, so repeated names ("a") become
# a vector while the singletons ("b", "c") are recycled by as.data.frame()
L2 <- lapply(L, function(z) as.data.frame(split(unlist(z, use.names=FALSE), names(z))))
# stack the per-object frames into one
do.call(rbind, L2)
#   a      b  c
# 1 3 Banana 45
# 2 3 Banana 45
# 3 8 Banana 45
# 4 4   Pear 46
# 5 4   Pear 46
# 6 9   Pear 46
Maybe I can help with the duplicate keys issue; they are the main problem, IMO.
In Python, there is a way to deal with duplicate keys in JSON: you can define your own "hook" that processes key:value pairs.
In your example, the key "a" is present 3 times. Here is a demo that gives all such repeated keys unique names by appending consecutive numbers "_1", "_2", "_3", etc. (If there is a chance of a name clash with an existing key like "a_1", change the naming format.)
The result is a valid dict you can process as you like.
import collections
import json

data = """
[
    {
        "a": 3,
        "b": "Banana",
        "c": 45,
        "a": 3,
        "a": 8
    }
]
"""

def object_pairs(pairs):
    # keys that occur more than once start their counter at 1
    dups = {d: 1 for d, i in collections.Counter(pair[0] for pair in pairs).items() if i > 1}
    # ^^^ change to d: 0 for zero-based counting
    dedup = {}
    for k, v in pairs:
        try:
            num = dups[k]
            dups[k] += 1
            k = f"{k}_{num}"
        except KeyError:
            pass  # not a duplicated key; keep the name as-is
        dedup[k] = v
    return dedup

result = json.loads(data, object_pairs_hook=object_pairs)
print(result)  # [{'a_1': 3, 'b': 'Banana', 'c': 45, 'a_2': 3, 'a_3': 8}]
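As a follow-up sketch (hedged: it assumes, as in the sample, that "a" is the only duplicated key), the renamed keys can be expanded back into one row per value to produce the table from the question:
import pandas as pd

rows = []
for obj in result:
    # values of the renamed duplicates ("a_1", "a_2", ...)
    dup_vals = [v for k, v in obj.items() if k.startswith("a_")]
    # the non-duplicated fields are repeated on every row
    base = {k: v for k, v in obj.items() if not k.startswith("a_")}
    for val in dup_vals:
        rows.append({"a": val, **base})

df = pd.DataFrame(rows)
print(df)
#    a       b   c
# 0  3  Banana  45
# 1  3  Banana  45
# 2  8  Banana  45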

Is it possible to merge two pandas dataframes based on indices and column names?

I have two dataframes:
left = pd.DataFrame(
    {
        "Col": ["D", "C", "B", "A"],
    },
    index=[0, 1, 2, 3],
)
right = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
Is it possible to merge them based on the indices and Col of the left and the column names of the right?
I need to get the following result:
result = pd.DataFrame(
    {
        "Col": ["D", "C", "B", "A"],
        "Val": ["D0", "C1", "B2", "A3"],
    },
)
Try with:
import numpy as np
left['new'] = right.values[np.arange(len(left)), right.columns.get_indexer(left.Col)]
left
Out[129]:
  Col new
0   D  D0
1   C  C1
2   B  B2
3   A  A3
Notice: we used to have lookup, but it was deprecated; the above is one of the NumPy alternatives to lookup.
The reason I do not use the index here: NumPy has no notion of an index, so we need positions to pick out the correct values. Most of the time the index is the same as the position, but the two can differ.
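A small illustration of that point, using left and right from the question (the relabeled index below is hypothetical, purely to show labels and positions diverging):
import numpy as np

left2 = left.set_axis([10, 11, 12, 13])  # labels no longer equal positions
# np.arange(len(left2)) still yields positions 0..3, so the lookup is unaffected:
left2['new'] = right.values[np.arange(len(left2)), right.columns.get_indexer(left2.Col)]
print(left2)
#    Col new
# 10   D  D0
# 11   C  C1
# 12   B  B2
# 13   A  A3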
Another solution:
left["new"] = right.apply(lambda x: x[left.loc[x.name, "Col"]], axis=1)
print(left)
Prints:
  Col new
0   D  D0
1   C  C1
2   B  B2
3   A  A3
Alternative approach (reshape the columns into long Col/Val pairs with melt, then merge):
left['id'] = left.index
m = right.melt(ignore_index=False, var_name="Col", value_name="Val")
m['id'] = m.index
result = pd.merge(left, m, on=["id", "Col"])[["Col", "Val"]]
It is faster than using apply but slower than the accepted answer.

Is it possible to merge two pandas dataframes based on column name?

I have two dataframes:
left = pd.DataFrame(
    {
        "Col": ["D", "C", "B", "A"],
    },
)
right = pd.DataFrame(
    {
        "A": ["A0"],
        "B": ["B0"],
        "C": ["C0"],
        "D": ["D0"],
    },
)
Is it possible to merge them based on Col of the left and the column names of the right?
I need to get the following result:
result = pd.DataFrame(
    {
        "Col": ["D", "C", "B", "A"],
        "Val": ["D0", "C0", "B0", "A0"],
    },
)
You can do it with a pretty straightforward .map:
In [319]: left['Val'] = left['Col'].map(right.T[0])
In [320]: left
Out[320]:
  Col Val
0   D  D0
1   C  C0
2   B  B0
3   A  A0
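For clarity, right.T[0] is a Series keyed by the original column names, which is what makes the .map lookup work:
print(right.T[0])
# A    A0
# B    B0
# C    C0
# D    D0
# Name: 0, dtype: object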
Try join with the transposition (T or transpose()):
import pandas as pd
left = pd.DataFrame({
    "Col": ["D", "C", "B", "A"],
})
right = pd.DataFrame({
    "A": ["A0"],
    "B": ["B0"],
    "C": ["C0"],
    "D": ["D0"],
})
new_df = left.join(right.T, on='Col').rename(columns={0: 'Val'})
print(new_df)
new_df:
  Col Val
0   D  D0
1   C  C0
2   B  B0
3   A  A0

Python Count JSON key values

I have a dataframe df with the column 'ColA'. How do I count the keys in this column using Python?
import pandas as pd

df = pd.DataFrame({
    'ColA': [{
        "a": 10,
        "b": 5,
        "c": [1, 2, 3],
        "d": 20
    }, {
        "f": 1,
        "b": 3,
        "c": [0],
        "x": 71
    }, {
        "a": 1,
        "m": 99,
        "w": [8, 6],
        "x": 88
    }, {
        "a": 9,
        "m": 99,
        "c": [3],
        "x": 55
    }]
})
Here I want to calculate the count for each key, like this, and then visualise the frequency using a chart.
Expected Answers :
a=3,
b=2,
c=3,
d=1,
f=1,
x=3,
m=2,
w=1
Try this: Series.explode transforms each list-like element into a row, Series.value_counts gets the counts of unique values, and Series.plot creates a plot from the resulting series.
df.ColA.apply(lambda x: list(x.keys())).explode().value_counts()
a    3
c    3
x    3
b    2
m    2
f    1
d    1
w    1
Name: ColA, dtype: int64
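To finish the visualisation step (a hedged sketch; assumes matplotlib is installed and df is the frame from the question):
import matplotlib.pyplot as plt

counts = df.ColA.apply(lambda x: list(x.keys())).explode().value_counts()
counts.plot(kind="bar", title="Key frequency")  # Series.plot dispatches to matplotlib
plt.show()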

Json data ordering PANDAS, python

I have JSON like this:
json_text = """{
    "b": 22,
    "x": 12,
    "a": 2,
    "c": 4
}"""
When I generate an Excel file from this JSON like this:
import pandas as pd
df = pd.read_json(json_text)
file_name = 'test.xls'
file_path = "/tmp/" + file_name
df.to_excel(file_path, index=False)
print("path to excel " + file_path)
Pandas does its own ordering in the Excel file like this:
pandas_json = {
    "a": 2,
    "b": 22,
    "c": 4,
    "x": 12
}
I don't want this. I need the ordering that exists in the JSON. Please give me some advice on how to do this.
UPDATE:
If I have JSON like this:
json = [
    {"b": 22, "x": 12, "a": 2, "c": 4},
    {"b": 22, "x": 12, "a": 2, "c": 2},
    {"b": 22, "x": 12, "a": 4, "c": 4},
]
pandas will generate its own ordering, like this:
pandas_json = [
    {"a": 2, "b": 22, "c": 4, "x": 12},
    {"a": 2, "b": 22, "c": 2, "x": 12},
    {"a": 4, "b": 22, "c": 4, "x": 12},
]
How can I make pandas preserve my own ordering?
You can read the JSON as an OrderedDict, which will retain the original order:
import json
from collections import OrderedDict
import pandas as pd
json_ = """{
    "b": 22,
    "x": 12,
    "a": 2,
    "c": 4
}"""

data = json.loads(json_, object_pairs_hook=OrderedDict)
pd.DataFrame.from_dict(data, orient='index')
    0
b  22
x  12
a   2
c   4
Edit: the updated JSON also works:
j = """[{"b": 22, "x": 12, "a": 2, "c": 4},
        {"b": 22, "x": 12, "a": 2, "c": 2},
        {"b": 22, "x": 12, "a": 4, "c": 4}]"""
data = json.loads(j, object_pairs_hook=OrderedDict)
pd.DataFrame.from_dict(data).to_json(orient='records')
'[{"b":22,"x":12,"a":2,"c":4},{"b":22,"x":12,"a":2,"c":2},{"b":22,"x":12,"a":4,"c":4}]'
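To tie this back to the original Excel export (a hedged sketch; assumes Python 3.7+, where plain dicts already preserve insertion order, and an Excel writer such as openpyxl installed):
import json
import pandas as pd

j = '[{"b": 22, "x": 12, "a": 2, "c": 4}]'
df = pd.DataFrame(json.loads(j))  # column order follows the JSON key order: b, x, a, c
df.to_excel("/tmp/test.xlsx", index=False)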
