Python - Parsing JSON data through a user-defined function

I have a JSON text file.
Inside it there are fields like id, title, context, question, is_impossible, answer_start and text.
I am trying to read this into a Pandas DataFrame. I am new to Python and JSON, so I am giving it a go with a function definition.
Here is my code:
import json
import pandas as pd
from pandas import json_normalize

def squad_json_pd_df(json_dict):
    mylistsize = len((list(json_normalize(json_dict,'data')['title'])))
    row = []
    for i in range(0,mylistsize):
        data = [c for c in json_dict['data']][i]
        df = pd.DataFrame()
        data_paragraphs = data['paragraphs']
        mytitle = data['title']
        for article_dict in data_paragraphs:
            for answers_dict in article_dict['qas']:
                for answer in answers_dict['answers']:
                    row.append((
                        answers_dict['id'],
                        mytitle,
                        article_dict['context'],
                        answers_dict['question'],
                        answers_dict['is_impossible'],
                        answer['answer_start'],
                        answer['text']
                    ))
        df = pd.concat([df, pd.DataFrame.from_records(row, columns=['id', 'title','context', 'question','is_impossible', 'answer_start', 'answer'])], axis=0, ignore_index=True)
        df.drop_duplicates(inplace=True)
    return df

with open(dev_datapath) as file:
    dev_dict = json.load(file)
dev_df = squad_json_pd_df(dev_dict)
So the problem is this: the is_impossible column has both true and false values in the text file, but after I load it into the Pandas DataFrame I can see only false records.
My understanding is that the JSON structure could be different for the true records and that I am not parsing it correctly in Python.
The is_impossible false structure looks as below,
The is_impossible true structure looks as below,

The reason you don't get the true records back is that their answers sit under a different JSON key: "plausible_answers" instead of "answers", I think. Your code only loops over answers_dict['answers'], and in SQuAD 2.0 that list is empty for impossible questions, so the innermost loop never appends a row where is_impossible is true.
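A minimal sketch of that fix, assuming the file follows the SQuAD 2.0 layout (data, paragraphs, qas, with "plausible_answers" on impossible questions). The function name squad_to_df and the tiny sample dict are made up for illustration and are not from the question.

```python
import pandas as pd


def squad_to_df(json_dict):
    """Flatten a SQuAD 2.0-style dict into one row per (question, answer)."""
    rows = []
    for article in json_dict["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                impossible = qa.get("is_impossible", False)
                # Impossible questions list their candidates under "plausible_answers".
                key = "plausible_answers" if impossible else "answers"
                # Fall back to one empty answer so the question still yields a row.
                answers = qa.get(key) or [{"answer_start": None, "text": None}]
                for answer in answers:
                    rows.append({
                        "id": qa["id"],
                        "title": article["title"],
                        "context": paragraph["context"],
                        "question": qa["question"],
                        "is_impossible": impossible,
                        "answer_start": answer["answer_start"],
                        "answer": answer["text"],
                    })
    return pd.DataFrame(rows).drop_duplicates(ignore_index=True)


# Tiny made-up sample in the same shape, used only to exercise the function.
sample = {"data": [{"title": "Demo", "paragraphs": [{
    "context": "Paris is the capital of France.",
    "qas": [
        {"id": "q1", "question": "What is the capital of France?",
         "is_impossible": False,
         "answers": [{"text": "Paris", "answer_start": 0}]},
        {"id": "q2", "question": "What is the capital of Spain?",
         "is_impossible": True, "answers": [],
         "plausible_answers": [{"text": "Paris", "answer_start": 0}]},
    ],
}]}]}

demo_df = squad_to_df(sample)
print(demo_df[["id", "is_impossible", "answer"]])
```

Building the rows in a plain list and creating the DataFrame once at the end also avoids the repeated pd.concat inside the loop.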

Related

How to create a new csv from a csv that separated cell

I created a function to convert a CSV file.
The goal is to take a CSV file like:
,features,corr_dropped,var_dropped,uv_dropped
0,AghEnt,False,False,False
and I want to convert it to another CSV file:
   features  corr_dropped  var_dropped  uv_dropped
0  AghEnt    False         False        False
I created a function for that, but it is not working. The output is the same as the input file.
Function:
import os
import pandas as pd

def convert_file():
    input_file = "../input.csv"
    output_file = os.path.splitext(input_file)[0] + "_converted.csv"
    df = pd.read_table(input_file, sep=',')
    df.to_csv(output_file, index=False, header=True, sep=',')
You could use
df = pd.read_csv(input_file)
This works with your data. There is not much difference, though: the only change in the output is that the empty header cell before the first delimiter is now named Unnamed: 0.
Is that what you wanted? (I am still not entirely sure what you are trying to achieve, as you are importing a CSV and exporting the same data as a CSV without doing anything with it. The output example you showed is just a formatted version of your initial data, but formatting is not something CSV can do.)
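If the Unnamed: 0 column is unwanted, one option is to tell pandas that the first column is the index. The sketch below uses the two lines from the question held in memory (io.StringIO stands in for ../input.csv); it is an illustration of index_col, not a claim about what the asker ultimately needs.

```python
import io
import pandas as pd

# The two lines from the question, held in memory instead of ../input.csv.
csv_text = ",features,corr_dropped,var_dropped,uv_dropped\n0,AghEnt,False,False,False\n"

# Default read: the blank header cell becomes a column named "Unnamed: 0".
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns[0])

# Treat the first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))

# Writing it back with the index reproduces the original layout.
round_trip = indexed.to_csv()
```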

How to turn multiple rows of dictionaries from a file into a dataframe

I have a script that fires orders from a CSV file to an exchange using a for loop.
import sys
import pandas as pd

data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)
for i in range(len(df)):
    order = Client.new_order(...
    ...)
    file = open('orderData.txt', 'a')
    original_stdout = sys.stdout
    with file as f:
        sys.stdout = f
        print(order)
    file.close()
    sys.stdout = original_stdout
I put the response from the exchange in a txt file like this...
I want to turn the multiple responses into a single dataframe. I would hope it would look something like...
(I did that manually).
I tried:
data = pd.read_csv('orderData.txt', header=None)
dfData = pd.DataFrame(data)
print(dfData)
but I got:
I have also tried
data = pd.read_csv('orderData.txt', header=None)
organised = data.apply(pd.Series)
print(organised)
but I got the same output.
I can print order['symbol'] within the loop etc.
I'm not certain whether I should be populating this dataframe within the loop, or by capturing and writing the response and processing it afterwards. Appreciate your advice.
It looks like you are getting JSON strings back. You could parse the JSON objects into dictionaries and then create a dataframe from them. Perhaps try something like this (it no longer needs a file):
import json
import pandas as pd

data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)
response_data = []
for i in range(len(df)):
    order_json = Client.new_order(...
    ...)
    # json.loads is safer than eval and understands true/false/null.
    # If new_order already returns a dict, append order_json directly instead.
    response_data.append(json.loads(order_json))
response_dataframe = pd.DataFrame(response_data)
If I understand your question correctly, you can simply do the following:
import pandas as pd
orders = pd.read_csv('orderparameters.csv')
responses = pd.DataFrame(Client.new_order(...) for _ in range(len(orders)))
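For responses that have already been written to orderData.txt with print(order), a possible sketch is to parse each line back into a dict. This assumes every line is the printed form of one Python dict; the field names in the sample (symbol, orderId, status) are hypothetical, since the actual file contents are not shown, and responses_to_frame is a made-up helper name.

```python
import ast
import io
import pandas as pd


def responses_to_frame(lines):
    """Parse one printed dict per line into a DataFrame, skipping blank lines."""
    records = [ast.literal_eval(line) for line in lines if line.strip()]
    return pd.DataFrame(records)


# Hypothetical stand-in for the contents of orderData.txt.
sample = io.StringIO(
    "{'symbol': 'BTCUSDT', 'orderId': 1, 'status': 'FILLED'}\n"
    "{'symbol': 'ETHUSDT', 'orderId': 2, 'status': 'NEW'}\n"
)
responses = responses_to_frame(sample)
print(responses)
```

With the real file, the same helper could be called as responses_to_frame(open('orderData.txt')). ast.literal_eval only evaluates literals, so it is a safer choice than eval for text read from disk.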

Pandas error reading csv with double quotes

I've read all related topics - like this, this and this - but couldn't get a solution to work.
I have an input csv file like this:
ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:
ItemId Content
-------- -------------------------------------------------------------------------------
i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
With the following code (Python 3.9):
df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')
As far as I understand, the commas inside the dictionary column, including those inside quotation marks, are being treated as regular separators, so it raises the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
Is it possible to produce desired result? Thanks.
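The problem is that the commas in the Content column are interpreted as separators. You can solve this by using pd.read_fwf to manually set the character positions on which to split. Note that this relies on every ItemId being exactly 8 characters wide and on each Content value fitting inside the second column range: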
df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])
Result:
   ItemId    Content
0  i0000008  {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I don't think you'll be able to read it normally with pandas because it has the delimiter used multiple times for a single value; however, reading it with python and doing some processing, you should be able to convert it to pandas dataframe:
import pandas as pd

def splitValues(x):
    # Split on the first comma only, so commas inside Content are preserved.
    index = x.find(',')
    return x[:index], x[index+1:].strip()

with open('file.csv') as data:
    columns = next(data)
    columns = columns.strip().split(',')
    df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data))
OUTPUT:
ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
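Once the rows are split on the first comma, the Content strings are valid JSON and can be expanded into their own columns. The sketch below is one way to do that, using the two rows from the question held in memory (io.StringIO stands in for the CSV file); the variable names are illustrative.

```python
import io
import json
import pandas as pd

# The sample rows from the question, held in memory instead of a file.
raw = io.StringIO(
    'ItemId,Content\n'
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n'
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n'
)

header = next(raw).strip().split(',')
# maxsplit=1 keeps every comma after the first inside the Content field.
rows = [line.rstrip('\n').split(',', 1) for line in raw if line.strip()]
df = pd.DataFrame(rows, columns=header)

# Parse each Content string as JSON and spread its keys into columns.
parsed = pd.json_normalize(df['Content'].map(json.loads).tolist())
result = pd.concat([df[['ItemId']], parsed], axis=1)
print(result)
```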

How to name dataframes with a for loop?

I want to read several JSON files and write each of them to a dataframe with a for loop.
review_categories = ["beauty", "pet"]
for i in review_categories:
    filename = "D:\\Library\\reviews_{}.json".format(i)
    output = pd.read_json(path_or_buf=filename, lines=True)
    return output
The problem is I want each review category to have its own variable, like a dataframe called "beauty_reviews", and another called "pet_reviews", containing the data read from reviews_beauty.json and reviews_pet.json respectively.
I think it is easier to handle the dataframes in a dictionary. Try the code below:
review_categories = ["beauty", "pet"]
reviews = {}
for review in review_categories:
    df_name = review + '_reviews' # the name for the dataframe
    filename = "D:\\Library\\reviews_{}.json".format(review)
    reviews[df_name] = pd.read_json(path_or_buf=filename, lines=True)
In reviews, you will have a key with the respective dataframe to store the data. If you want to retrieve the data, just call:
reviews["beauty_reviews"]
Hope it helps.
You can first pack the files into a list:
reviews = []
review_categories = ["beauty", "pet"]
for i in review_categories:
    filename = "D:\\Library\\reviews_{}.json".format(i)
    reviews.append(pd.read_json(path_or_buf=filename, lines=True))
and then unpack your results into the variable names you wanted:
beauty_reviews, pet_reviews = reviews

Convert Json data to Python DataFrame

This is my first time accessing an API / working with json data so if anyone can point me towards a good resource for understanding how to work with it I'd really appreciate it.
Specifically though, I have json data in this form:
{"result": { "code": "OK", "msg": "" },"report_name":"DAILY","columns":["ad","ad_impressions","cpm_cost_per_ad","cost"],"data":[{"row":["CP_CARS10_LR_774470","966","6.002019","5.797950"]}],"total":["966","6.002019","5.797950"],"row_count":1}
I understand this structure but I don't know how to get it into a DataFrame properly.
Looking at the structure of your json, presumably you will have several rows for your data and in my opinion it will make more sense to build the dataframe yourself.
This code uses columns and data to build a dataframe:
In [12]:
import json
import pandas as pd

with open('... path to your json file ...') as fp:
    for line in fp:
        obj = json.loads(line)
        columns = obj['columns']
        data = obj['data']
        l = []
        for d in data:
            l += [d['row']]
        df = pd.DataFrame(l, index=None, columns=columns)
df
Out[12]:
ad ad_impressions cpm_cost_per_ad cost
0 CP_CARS10_LR_774470 966 6.002019 5.797950
As for the rest of the data, in your json, I guess you could e.g. use the totals for checking your dataframe,
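In [14]:
# The JSON values are strings, so convert to numbers before summing.
sums = df[columns[1:]].astype(float).sum(axis=0)
for i, total in enumerate(obj['total']):
    if abs(float(total) - sums.iloc[i]) > 1e-6:
        print("error in total")
In [15]:
if obj['row_count'] != len(df.index):
    print("error in row count")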
As for the rest of the data in the json, it is difficult for me to know if anything else should be done.
Hope it helps.
Check pandas documentation. Specifically,
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Pandas supports reading json to dataframe
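As a compact sketch of the same idea, the snippet below builds the dataframe directly from the JSON string shown in the question and converts the numeric columns, which arrive as strings. It assumes the first column (ad) is the only non-numeric one, which holds for this sample but is not confirmed for other reports.

```python
import json
import pandas as pd

# The JSON string from the question.
raw = '{"result": { "code": "OK", "msg": "" },"report_name":"DAILY","columns":["ad","ad_impressions","cpm_cost_per_ad","cost"],"data":[{"row":["CP_CARS10_LR_774470","966","6.002019","5.797950"]}],"total":["966","6.002019","5.797950"],"row_count":1}'

obj = json.loads(raw)

# Each entry in "data" wraps one row of values under the "row" key.
df = pd.DataFrame([item["row"] for item in obj["data"]], columns=obj["columns"])

# Assumption: every column after the first holds numbers encoded as strings.
numeric_cols = obj["columns"][1:]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
print(df.dtypes)
```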
