Python: json_normalize "String indices must be integers" error

I am getting "TypeError: string indices must be integers" from the following code.
import pandas as pd
import json
from pandas.io.json import json_normalize
full_json_df = pd.read_json('data/world_bank_projects.json')
json_nor = json_normalize(full_json_df, 'mjtheme_namecode')
json_nor.groupby('name')['code'].count().sort_values(ascending=False).head(10)
Output:
TypeError
Traceback (most recent call last)
<ipython-input-28-9401e8bf5427> in <module>()
1 # Find the top 10 major project themes (using column 'mjtheme_namecode')
2
----> 3 json_nor = json_normalize(full_json_df, 'mjtheme_namecode')
4 #json_nor.groupby('name')['code'].count().sort_values(ascending = False).head(10)
TypeError: string indices must be integers

According to the pandas documentation, the data argument of json_normalize must be:
data : dict or list of dicts Unserialized JSON objects
In the code above, pd.read_json returns a DataFrame, not a dict, so json_normalize ends up iterating over strings, which raises the TypeError.
You can try converting the DataFrame to a dictionary using .to_dict(), which also accepts various orient options.
Maybe something like below:
json_normalize(full_json_df.to_dict(), ......)
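To illustrate, here is a minimal sketch using made-up records shaped like the World Bank data (the ids and theme names are invented, not from the real file); the key point is that json_normalize wants the parsed JSON, a list of dicts, rather than a DataFrame:

```python
import pandas as pd

# Made-up records shaped like the World Bank projects file: each record
# carries a list of dicts under 'mjtheme_namecode'. With the real file you
# would get this list from json.load() instead of pd.read_json().
records = [
    {"id": "P1", "mjtheme_namecode": [{"code": "8", "name": "Human development"}]},
    {"id": "P2", "mjtheme_namecode": [{"code": "1", "name": "Economic management"},
                                      {"code": "8", "name": "Human development"}]},
]

# On pandas >= 1.0 json_normalize is available as pd.json_normalize;
# the pandas.io.json import is deprecated.
flat = pd.json_normalize(records, 'mjtheme_namecode')
top = flat.groupby('name')['code'].count().sort_values(ascending=False)
```

The same groupby/count/sort chain from the question then works on the flattened frame.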


Error in loading json with os and pandas package [duplicate]

I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the University of Washington machine-learning course on Coursera.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key-value pairs whose values are scalars, so pandas cannot build a DataFrame from it directly. You can convert the resulting Series to a DataFrame with ser.to_frame('count').
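A self-contained sketch of that flow, using an in-memory stand-in for the file (trimmed to two key-value pairs for brevity):

```python
import pandas as pd
from io import StringIO

# Stand-in for people_wiki_map_index_to_word.json: a single JSON object
# whose values are all scalars.
raw = StringIO('{"biennials": 522004, "lb915": 116290}')

# typ='series' builds a Series from the key/value pairs, avoiding the
# "all scalar values" DataFrame error; to_frame() then names the column.
ser = pd.read_json(raw, typ='series')
df = ser.to_frame('count')
```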
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
    data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentions, which will give you a column-based format.
Or you can enclose the object in [ ] (source), as shown below, to get a row-based format that is convenient if you are loading multiple values and planning to use a matrix for your machine-learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, pandas asks you to pass an index. You have to convert the string to a dict, which is exactly what the other response is doing.
The best way is to call json.loads on the string to convert it to a dict and load that into pandas:
import json
with open('people_wiki_map_index_to_word.json') as f:
    myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame([jsonData])
Given a file values.json containing
{
    "biennials": 522004,
    "lb915": 116290
}
df = pd.read_json('values.json')
fails, because pd.read_json expects a list for each key, like
{
    "biennials": [522004],
    "lb915": [116290]
}
and otherwise returns the error
If using all scalar values, you must pass an index.
So you can resolve this by specifying the typ argument of pd.read_json:
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas, 0.19.0 and later, use the lines parameter and set it to True. The file is then read as one JSON object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the JSON files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
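A small sketch of what lines=True does, with made-up JSON-lines input fed from memory instead of a file:

```python
import pandas as pd
from io import StringIO

# Sketch of JSON-lines input: one JSON object per line (invented values).
raw = StringIO('{"biennials": 522004}\n{"biennials": 116290}\n')

# lines=True tells read_json to parse each line as its own record, which
# sidesteps the scalar-index, decode, and trailing-data errors listed above.
df = pd.read_json(raw, lines=True)
```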
For example, given
cat values.json
{
    "name": "Snow",
    "age": "31"
}
df = pd.read_json('values.json')
chances are you will end up with
ValueError: If using all scalar values, you must pass an index
because pandas looks for a list or dictionary in the values, something like
cat values.json
{
    "name": ["Snow"],
    "age": ["31"]
}
So try doing this instead; later on you can convert the result to HTML with to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array like so
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]

Pyspark and Python - Column is not iterable

I am using Python-3 with Azure data bricks.
I have a dataframe. The column 'BodyJson' is a JSON string that contains one occurrence of 'vmedwifi/' within it. I have added the constant string literal 'vmedwifi/' as a column named 'email_type'.
I want to find the start position of the text 'vmedwifi/' within the column 'BodyJson'; all the columns are in the same dataframe. My code is below.
I get the error 'Column is not iterable' on the second line of code. Any ideas what I am doing wrong?
# Weak logic to try and identify email addresses
emailDf = inputDf.select('BodyJson').where("BodyJson like('%vmedwifi%#%.%')").withColumn('email_type', lit('vmedwifi'))
b=emailDf.withColumn('BodyJson_Cutdown', substring(emailDf.BodyJson, expr('locate(emailDf.email_type, emailDf.BodyJson)'), 20))
TypeError: Column is not iterable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-536715104422314> in <module>()
12 #emailDf1 = inputDf.select('BodyJson').where("BodyJson like('%#xxx.abc.uk%')")
13
---> 14 b=emailDf.withColumn('BodyJson_Cutdown', substring(emailDf.BodyJson, expr('locate(emailDf.email_type, emailDf.BodyJson)'), 20))
15
16 #inputDf.unpersist()
The issue was with the literal passed to expr: expr parses a SQL expression, so Python references like emailDf.email_type inside the string cannot be resolved; the expression has to use the SQL column names. In addition, substring() takes plain integers for its pos and len arguments, so the locate call needs to live inside the SQL expression as well. An untested sketch of that fix:
b = emailDf.withColumn('BodyJson_Cutdown', expr("substring(BodyJson, locate('vmedwifi', BodyJson), 20)"))
In the end I decided to tackle this problem a different way, which got around this issue.

Why is this error occurring when I am using filter in pandas: TypeError: 'int' object is not iterable

When I want to remove some elements which satisfy a particular condition, Python throws the following error:
TypeError Traceback (most recent call last)
<ipython-input-25-93addf38c9f9> in <module>()
4
5 df = pd.read_csv('fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv;
----> 6 df = filter(df,~('-02-29' in df['Date']))
7 '''tmax = []; tmin = []
8 for dates in df['Date']:
TypeError: 'int' object is not iterable
The following is the code :
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv');
df = filter(df,~('-02-29' in df['Date']))
What wrong could I be doing?
Following is sample data
Sample Data
Use df.filter() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html)
Also please attach the csv so we can run it locally.
Another way to do this is to use one of pandas' string methods for Boolean indexing:
df = df[~ df['Date'].str.contains('-02-29')]
You will still have to make sure that all the dates are actually strings first.
Edit:
Seeing the picture of your data, maybe this is what you want (slashes instead of hyphens):
df = df[~ df['Date'].str.contains('/02/29')]
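To make the Boolean-indexing idea concrete, here is a tiny sketch with an invented three-row frame standing in for the weather CSV:

```python
import pandas as pd

# Made-up frame standing in for the weather CSV (dates stored as strings).
df = pd.DataFrame({'Date': ['2012/02/28', '2012/02/29', '2012/03/01'],
                   'Temp': [1.0, 2.0, 3.0]})

# Boolean indexing: keep rows whose Date does NOT contain the leap day.
# regex=False treats the pattern as a literal substring.
df = df[~df['Date'].str.contains('/02/29', regex=False)]
```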

'expected string or buffer' when using re.match with pandas

I am trying to clean some data from a csv file. I need to make sure that whatever is in the 'Duration' category matches a certain format. This is how I went about that:
import re
import pandas as pd
data_path = './ufos.csv'
ufos = pd.read_csv(data_path)
valid_duration = re.compile('^[0-9]+ (seconds|minutes|hours|days)$')
ufos_clean = ufos[valid_duration.match(ufos.Duration)]
ufos_clean.head()
This gives me the following error:
TypeError Traceback (most recent call last)
<ipython-input-4-5ebeaec39a83> in <module>()
6
7 valid_duration = re.compile('^[0-9]+ (seconds|minutes|hours|days)$')
----> 8 ufos_clean = ufos[valid_duration.match(ufos.Duration)]
9
10 ufos_clean.head()
TypeError: expected string or buffer
I used a similar method to clean data before without the regular expressions. What am I doing wrong?
Edit:
MaxU got me the closest, but what ended up working was:
valid_duration_RE = '^[0-9]+ (seconds|minutes|hours|days)$'
ufos_clean = ufos
ufos_clean = ufos_clean[ufos.Duration.str.contains(valid_duration_RE)]
There's probably a lot of redundancy in there, I'm pretty new to python, but it worked.
You can use the vectorized .str.match() method (na=False treats missing values as non-matches):
valid_duration_RE = '^[0-9]+ (seconds|minutes|hours|days)$'
ufos_clean = ufos[ufos.Duration.str.match(valid_duration_RE, na=False)]
I guess you want it the other way round (not tested):
import re
import pandas as pd
data_path = './ufos.csv'
ufos = pd.read_csv(data_path)
def cleanit(val):
    # your regex solution here
    pass
ufos['ufos_clean'] = ufos['Duration'].apply(cleanit)
After all, ufos is a DataFrame.
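One hypothetical way to fill in that sketch (the sample rows are invented): validate each Duration cell against the question's regex inside the applied function, guarding against non-string cells, then use the result as a Boolean mask:

```python
import re
import pandas as pd

# The regex from the question.
valid_duration = re.compile(r'^[0-9]+ (seconds|minutes|hours|days)$')

def is_valid(val):
    # re.match needs a string, so guard against NaN/non-string cells first.
    return isinstance(val, str) and valid_duration.match(val) is not None

# Invented sample rows standing in for ufos.csv.
ufos = pd.DataFrame({'Duration': ['5 minutes', 'about an hour', '2 days', None]})
ufos_clean = ufos[ufos['Duration'].apply(is_valid)]
```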

Too many values to unpack when creating a data frame from two lists

I have two lists, c and p, both with 35300 elements. I try to create a pandas DataFrame but there's an error message when I run the code. How can I fix this?
import pandas as pd
e=pd.DataFrame.from_items(['Company',c],['ID',p])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-284-89427a7d8af3> in <module>()
1 import pandas as pd
2
----> 3 e=pd.DataFrame.from_items(['Company',c],['ID',p])
C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\frame.pyc in from_items(cls, items, columns, orient)
1195 frame : DataFrame
1196 """
-> 1197 keys, values = zip(*items)
1198
1199 if orient == 'columns':
ValueError: too many values to unpack
Since c and p are lists, it sounds like you want to define a DataFrame with two columns, Company and ID:
e = pd.DataFrame({'Company':c, 'ID':p})
As behzad.nouri suggests,
e = pd.DataFrame.from_items([('Company',c), ('ID',p)])
would also work, and unlike my first suggestion, would fix the order of the columns.
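A runnable sketch of the dict form, with short stand-in lists (the names and ids are invented; the real c and p hold 35300 items each). Note that from_items was deprecated in pandas 0.23 and removed in 1.0, so the dict construction is the portable spelling today, and on Python 3.7+ it preserves column order as well:

```python
import pandas as pd

# Stand-in lists; the real c and p have 35300 elements each.
c = ['Acme', 'Globex']
p = [101, 102]

# Two equal-length lists become two columns, one row per paired element.
e = pd.DataFrame({'Company': c, 'ID': p})
```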
