I tried converting the Excel file into a dictionary in Python using the pandas library, as follows:
import pandas as pd
my_dic = pd.read_excel('zoho.xlsx', index_col=0).to_dict()
When I try to retrieve a value from the dictionary by key, as in
print my_dic['Password']
it prints the data wrapped in an extra layer:
{1L: 991253376L} instead of 991253376
How do I strip that extra layer?
my_dic['Password'] is itself a dictionary. Try my_dic['Password'][1].
As an example:
my_dic = {"red" : "rot", "green" : "grün", "blue" : "blau", "password":{0:654651054}}
my_dic['password']
Out[27]: {0: 654651054}
my_dic['password'][0]
Out[28]: 654651054
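To see where the nesting comes from, a minimal sketch (using a made-up one-column frame):
import pandas as pd

# to_dict() nests one inner dict per column, keyed by the index labels,
# which is why my_dic['Password'] is itself a dict like {1: 991253376}
df = pd.DataFrame({'Password': {1: 991253376}})
print(df.to_dict())              # {'Password': {1: 991253376}}
print(df['Password'].to_dict())  # {1: 991253376}
print(df['Password'][1])         # 991253376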
I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key-value pairs whose values are scalars. You can convert it to a dataframe with ser.to_frame('count').
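A runnable sketch of the same idea, with a couple of the sample pairs inlined so it works without the file (StringIO stands in for the file handle):
from io import StringIO
import pandas as pd

raw = '{"biennials": 522004, "lb915": 116290}'
ser = pd.read_json(StringIO(raw), typ='series')
df = ser.to_frame('count')
print(df)
#             count
# biennials  522004
# lb915      116290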
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentioned, which will give you a column-based format.
Or you can enclose the object in [ ] (source) as shown below to get a row format, which is convenient if you are loading multiple values and planning on using a matrix for your machine learning models.
df = pd.DataFrame([data])
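To see the difference between the two orientations, a quick sketch with two of the keys:
import pandas as pd

data = {"biennials": 522004, "lb915": 116290}
print(pd.DataFrame({'count': data}).shape)  # (2, 1): one row per key
print(pd.DataFrame([data]).shape)           # (1, 2): one column per key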
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, pandas wants you to pass an index when loading it. You have to convert it to a dict first, which is exactly what the other response is doing.
The best way is to do a json.loads on the string to convert it to a dict and then load that into pandas:
import json
import pandas as pd

with open('people_wiki_map_index_to_word.json') as f:
    myfile = f.read()
json_data = json.loads(myfile)
df = pd.DataFrame([json_data])  # wrap in a list so the scalar values get a row index
{
"biennials": 522004,
"lb915": 116290
}
df = pd.read_json('values.json')
This fails because pd.read_json expects a list of values for each key, like
{
"biennials": [522004],
"lb915": [116290]
}
Without the lists, it returns an error saying
If using all scalar values, you must pass an index.
So you can resolve this by specifying typ='series' in pd.read_json:
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas, 0.19.0 and later, use the lines parameter and set it to True.
The file is then read as one JSON object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the JSON files contain only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
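A sketch of what lines=True does with the single-line sample (each line becomes one row; StringIO stands in for the file):
from io import StringIO
import pandas as pd

raw = '{"biennials": 522004, "lb915": 116290}'
df = pd.read_json(StringIO(raw), lines=True)
print(df)
#    biennials   lb915
# 0     522004  116290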
For example
cat values.json
{
name: "Snow",
age: "31"
}
df = pd.read_json('values.json')
Chances are you might end up with this error:
ValueError: If using all scalar values, you must pass an index
Pandas looks for a list or dictionary in the values. Something like
cat values.json
{
name: ["Snow"],
age: ["31"]
}
So try doing this instead. Later on, to convert the result to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting the object into an array, like so:
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]
I need to convert the 'content' column from a string dictionary to a dictionary in Python. After that I will use the following line of code:
df['content'].apply(pd.Series)
to have the dictionary keys as column names and the dictionary values in the cells.
I can't do this now because there are missing values in the dictionary strings.
How can I handle missing values in the dictionary when I use eval(string dictionary) -> dictionary?
I'm working on the 'content' column, which I want to convert to the correct format first. I tried the eval() function, but it doesn't work because there are missing values. This is JSON data. My goal is to have the content column's keys as the column titles and its values in the cells (screenshot: https://i.stack.imgur.com/1CsIl.png).
You can use json.loads in a lambda function: if the row value is NaN, pass it through; if not, apply json.loads:
import json
import numpy as np
df['content']=df['content'].apply(lambda x: json.loads(x) if pd.notna(x) else np.nan)
Now you can use pd.Series:
v1 = df['content'].apply(pd.Series)
df = df.drop(['content'], axis=1).join(v1)
If you have missing values in the string dictionaries:
import ast
import json

import numpy as np
import pandas as pd

def check_json(x):
    if pd.isna(x):
        return np.nan
    try:
        return json.loads(x)
    except (ValueError, TypeError):
        try:
            # patch up entries with a missing value after the ':'
            mask = x.replace('{', '').replace('}', '').split(',')
            for i in range(len(mask)):
                if not len(mask[i].partition(':')[-1]) > 0:
                    mask[i] = mask[i] + '"None"'  # ---> you can replace None with whatever you want
            return json.loads('{' + ','.join(mask) + '}')
        except ValueError:
            try:
                # retry after converting single quotes to double quotes
                x = x.replace("'", '"')
                mask = x.replace('{', '').replace('}', '').split(',')
                for i in range(len(mask)):
                    if not len(mask[i].partition(':')[-1]) > 0:
                        mask[i] = mask[i] + '"None"'
                return ast.literal_eval('{' + ','.join(mask) + '}')
            except (ValueError, SyntaxError):
                print("Could not parse json object. Returning nan")
                return np.nan

df['content'] = df['content'].apply(check_json)
v1 = df['content'].apply(pd.Series)
df = df.drop(['content'], axis=1).join(v1)
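For instance, a string like '{"x1":123,"x2":}' (value missing after the colon) would come back as {'x1': 123, 'x2': 'None'} under this scheme.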
I cannot see what the missing values look like in your screenshot, but I tested the following code and got what seems to be a good result. The simple explanation is to use str.replace() to fix the null values before parsing the string to a dict.
import pandas as pd
import numpy as np
import json
## setting up an example dataframe. note that row2 has a null value
json_example = [
'{"row1_key1":"row1_value1","row1_key2":"row1_value2"}',
'{"row2_key1":"row2_value1","row2_key2": null}'
]
df= pd.DataFrame()
df['Content'] = json_example
## using string replace on the string representation of the json to clean it up
df['Content'] = df['Content'].apply(lambda x: x.replace('null', '"0"'))
## using lambda x to first load the string into a dict, then applying pd.Series()
df['Content'].apply(lambda x: pd.Series(json.loads(x)))
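To keep the expanded columns on the frame, a short follow-up sketch (assigning the parsed result back rather than just displaying it):
parsed = df['Content'].apply(lambda x: pd.Series(json.loads(x)))
df = df.drop(columns='Content').join(parsed)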
I have one dictionary named column_types with values as below.
column_types = {'A': 'pa.int32()',
'B': 'pa.string()'
}
I want to pass the dictionary to pyarrow read csv function as below
from pyarrow import csv
table = csv.read_csv(file_name,
convert_options=csv.ConvertOptions(column_types=column_types)
)
But it gives an error because the values in the dictionary are strings.
The below statement will work without any issues.
import pyarrow as pa
from pyarrow import csv
table = csv.read_csv(file_name, convert_options=csv.ConvertOptions(column_types = {
'A':pa.int32(),
'B':pa.string()
}))
How can I change dictionary values to executable statements and pass it into the csv.ConvertOptions ?
There are two approaches that worked for me. You can use either, but I would recommend the second one, since the first uses eval(), and eval() is risky when the strings come from user input. If the strings are not user-supplied, method 1 is fine too.
1) USING eval()
import pyarrow as pa
column_types={}
column_types['A'] = 'pa.'+'string'+'()'
column_types['B'] = 'pa.'+'int32'+'()'
final_col_types={key:eval(val) for key,val in column_types.items()} # calling eval() to parse each string as a function and creating a new dict containing 'col':function()
from pyarrow import csv
table = csv.read_csv(filename,convert_options=csv.ConvertOptions(column_types=final_col_types))
print(table)
2) By creating a master dictionary dict_dtypes that maps each string to the corresponding callable function, and then using dict_dtypes to map the strings to their functions.
import pyarrow as pa
column_types={}
column_types['A'] = 'pa.'+'string'+'()'
column_types['B'] = 'pa.'+'int32'+'()'
dict_dtypes={'pa.string()':pa.string(),'pa.int32()':pa.int32()} # master dict containing callable function for a string
final_col_types={key:dict_dtypes[val] for key,val in column_types.items() } # final column_types dictionary created after mapping master dict and the column_types dict
from pyarrow import csv
table = csv.read_csv(filename,convert_options=csv.ConvertOptions(column_types=final_col_types))
print(table)
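If the set of type strings can't be enumerated ahead of time, a middle ground (a sketch, assuming every value follows the 'pa.<name>()' pattern) is to resolve the attribute on the pyarrow module with getattr instead of eval:
import pyarrow as pa

column_types = {'A': 'pa.int32()', 'B': 'pa.string()'}

def to_arrow_type(s):
    # 'pa.int32()' -> the pa.int32 factory, then call it to get the DataType
    name = s[len('pa.'):-len('()')]
    return getattr(pa, name)()

final_col_types = {key: to_arrow_type(val) for key, val in column_types.items()}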
Why don't we use something like this:
column_types = {'A': pa.int32(),
'B': pa.string()}
table = csv.read_csv(file_name,
convert_options=csv.ConvertOptions(column_types=column_types))
I need to capture the contents of a field from a table in order to append it to a filename. I have sorted out the renaming process. Is there any way I can save the output of the following in order to append it to the renamed file? I can't use Scala; it has to be in Python.
df = sqlContext.sql("select replace(value,'-','') from dbsmets1mig02_technical_build.tbl_Tech_admin_data where type = 'Week_Starting'")
df.show()
Have you tried using indexing?
df = pd.DataFrame({
'Name':['Peter', 'Peter', 'Peter', 'Peter'],
'Planned_Start':['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
'Duration':[2, 3, 5, 6],
'Hrs':[0.6, 1, 1.2, 0.3]})
df.iloc[3][1]
The syntax is df.iloc[<index of row containing desired entry>][<position of your entry>]
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
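Note that df.iloc[3, 1] is equivalent and avoids the chained lookup:
df.iloc[3, 1]  # '1/2/2019', same entry in one step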
You could convert the df object into a pandas DataFrame/Series object; then you can use other Python commands on it more easily;
pandasdf = df.toPandas()
Once you have this as a pandas data frame - say it looks something like this;
import pandas as pd
pandasdf = pd.DataFrame({"col1" : ["20191122"]})
Then you can pull out the string and use f-strings to join it into a filename;
date = pandasdf.iloc[0, 0]
filename = f"my_file_{date}.csv"
Then we have the filename object as 'my_file_20191122.csv'
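If you only need that single value, you can also pull it straight from the Spark DataFrame without the pandas conversion (a sketch, assuming the query returns exactly one row and one column):
date = df.collect()[0][0]
filename = f"my_file_{date}.csv"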
I have a file like this:
name|count_dic
name1 |{'x1':123,'x2,bv.':435,'x3':4}
name2|{'x2,bv.':435,'x5':98}
etc.
I am trying to load the data into a dataframe and count the number of keys in count_dic. The problem is that the dict items are separated by commas, and some of the keys also contain commas. I am looking for a way to replace the commas in the keys with '-' so that I can then separate the different key-value pairs in count_dic, something like this:
name|count_dic
name1 |{'x1':123,'x2-bv.':435,'x3':4}
name2|{'x2-bv.':435,'x5':98}
etc.
This is what I have done.
df = pd.read_csv('file', names=['name', 'count_dic'], delimiter='|')
data = json.loads(df.count_dic)
and I get the following error:
TypeError: the JSON object must be str, not 'Series'
Does any body have any suggestions?
You can use ast.literal_eval as a converter when loading the dataframe, as it appears your data is more Python-dict-like than JSON (JSON uses double quotes), e.g.:
import pandas as pd
import ast
df = pd.read_csv('file', delimiter='|', converters={'count_dic': ast.literal_eval})
Gives you a DF of:
name count_dic
0 name1 {'x2,bv.': 435, 'x3': 4, 'x1': 123}
1 name2 {'x5': 98, 'x2,bv.': 435}
Since count_dic is now actually a dict, you can apply len to get the number of keys, e.g.:
df.count_dic.apply(len)
Results in:
0 3
1 2
Name: count_dic, dtype: int64
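And if you still want the commas inside the keys replaced with '-', once the column holds real dicts that's a plain dict comprehension:
df['count_dic'] = df['count_dic'].apply(
    lambda d: {k.replace(',', '-'): v for k, v in d.items()}
)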
Once df is defined as in the question (a plain read_csv, so count_dic still holds strings):
# get a value to play around with
td = df.iloc[0].count_dic
td
# that looks like a dict definition... evaluate it?
eval(td)
eval(td).keys() #yup!
#apply to the whole df (wrap map in list(); in Python 3, map is lazy)
df.count_dic = list(map(eval, df.count_dic))
#and a hint towards your key-counting
list(map(lambda i: list(i.keys()), df.count_dic))