Sorry for the long post! I'm a bit Python-illiterate, so please bear with me:
I am working on a project that uses extracted Fitbit resting heart-rate data to compare heart-rate values across a series of years.
The Fitbit data exports as a .json file that I am attempting to convert to .csv for further analysis.
I pulled a script from GitHub that converts .json files to .csv-formatted files; however, when feeding it the resting heart-rate data I am running into a few problems.
Sample lines from .json:
[{
  "dateTime" : "09/30/16 00:00:00",
  "value" : {
    "date" : "09/30/16",
    "value" : 76.83736383927637,
    "error" : 2.737363838373737
  }
}]
Section of the GitHub code that transforms the nested frame into columns:
# reading json into dataframes
resting_hr_df = get_json_to_df(file_list=resting_hr_file_list).reset_index()
# Heart rate contains a sub-JSON that is explicitly converted into columns
resting_hr_df['date'] = resting_hr_df['value'].transform(lambda x: make_new_df_value(x, 'date'))
resting_hr_df['value'] = resting_hr_df['value'].transform(lambda x: make_new_df_value(x, 'value'))
resting_hr_df['error'] = resting_hr_df['value'].transform(lambda x: make_new_df_value(x, 'error'))
resting_hr_df = resting_hr_df.drop(['value', 'index'], axis=1)
There are two fields named 'value', one nested inside the other, and I think this is causing the issue.
When using the transform function in pandas to assign column names for the nested dataframe keys, the values from the second 'value' are stored as 0 in the .csv file.
How should I store the values?
The problem is that this is a nested JSON file. The solution is to load the file with the json module and then flatten it into pandas with json_normalize. (Note, too, that in the GitHub snippet the second transform overwrites the 'value' column with the inner float before the third transform tries to read 'error' out of it, which is likely why those values come out wrong.)
import json
import pandas as pd

with open('filename.json') as data_file:
    data = json.load(data_file)

resting_hr_df = pd.json_normalize(data)
resting_hr_df
Output resting_hr_df:
| | dateTime | value.date | value.value | value.error |
|---:|:------------------|:-------------|--------------:|--------------:|
| 0 | 09/30/16 00:00:00 | 09/30/16 | 76.8374 | 2.73736 |
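From there, writing the flattened frame out to CSV (the output filename here is just an example) is one line:
resting_hr_df.to_csv('resting_hr.csv', index=False)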
You can also solve this with the built-in pandas function read_json:
df = pd.read_json('abc.json')
data = df.to_csv(index=False)
print(data)
Note that the nested value field will come through as a dict column, so you may still need to flatten it (for example with pd.json_normalize, as above) before the CSV is useful.
Related
I'm having trouble loading a big JSON lines file in pandas, mainly because I need to "flatten" one of the resulting columns after using pd.read_json
For example, for this JSON line:
{"user_history": [{"event_info": 248595, "event_timestamp": "2019-10-01T12:46:03.145-0400", "event_type": "view"}, {"event_info": 248595, "event_timestamp": "2019-10-01T13:21:50.697-0400", "event_type": "view"}], "item_bought": 1909110}
I'd need to load 2 rows with 4 columns in pandas like this:
+--------------+--------------------------------+--------------+---------------+
| "event_info" | "event_timestamp" | "event_type" | "item_bought" |
+--------------+--------------------------------+--------------+---------------+
| 248595 | "2019-10-01T12:46:03.145-0400" | "view" | 1909110 |
| 248595 | "2019-10-01T13:21:50.697-0400" | "view" | 1909110 |
+--------------+--------------------------------+--------------+---------------+
The thing is, given the size of the file (413000+ lines, over 1GB), none of the ways that I managed to do it is fast enough for me. I was trying a rather rudimentary way, iterating over the loaded dataframe, creating a dictionary and appending the values to an empty dataframe:
history_df = pd.read_json('data/train_dataset.jl', lines=True)
history_df['index1'] = history_df.index
normalized_history = pd.DataFrame()
for index, row in history_df.iterrows():
    for dic in row['user_history']:
        dic['index1'] = row['index1']
        dic['item_bought'] = row['item_bought']
        normalized_history = normalized_history.append(dic, ignore_index=True)
So the question is which would be the fastest way to accomplish this? Is there any way without iterating the whole history_df dataframe?
Thank you in advance
Maybe try this:
import pandas as pd
import json

data = []
# assuming each line from data/train_dataset.jl
# is a json object like the one you posted above:
with open('data/train_dataset.jl') as f:
    for line in f:
        data.append(json.loads(line))

normalized_history = pd.json_normalize(data, 'user_history', 'item_bought')
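If pd.read_json already loads the file for you, an alternative sketch avoids the Python-level loop entirely by exploding the list column (this assumes every user_history is a non-empty list):
import pandas as pd

history_df = pd.read_json('data/train_dataset.jl', lines=True)
exploded = history_df.explode('user_history')  # one row per event dict
flat = pd.json_normalize(exploded['user_history'].tolist())
flat['item_bought'] = exploded['item_bought'].to_numpy()  # align positionally
Either way, the flattening happens inside pandas rather than in a per-row append, which is what makes the loop version so slow.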
I have to do a machine-learning project and my data set is in the form of JSON files. I have a feature position with 3 values (x, y, z). When I convert the JSON files to CSV files with Python, the values of the position feature end up as a single array.
How can I generate three features, pos_x, pos_y and pos_z, from that one feature?
JSON: "pos":[3838.387671754935,5853.151423739182,1.895]
CSV: "pos": "[3838.387671754935,5853.151423739182,1.895]"
But I need 3 separate features, pos_x: 3838.387671754935, pos_y: 5853.151423739182, pos_z: 1.895.
The code I used:
import pandas as pd
import json

data = []
with open('JSONfile.json') as fh:
    for line in fh:
        data.append(json.loads(line))

df = pd.DataFrame.from_dict(data)
df.to_csv('csvFile.csv', index=False)
You can use Miller (mlr) to produce all kinds of different output:
mlr --ijson --ocsv cat multivalued.json
which will return the following CSV
pos.1,pos.2,pos.3
3838.387671754935,5853.151423739182,1.895
or for a table formatting
+-------------------+-------------------+-------+
| pos.1 | pos.2 | pos.3 |
+-------------------+-------------------+-------+
| 3838.387671754935 | 5853.151423739182 | 1.895 |
+-------------------+-------------------+-------+
using this alternative command:
mlr --ijson --opprint --barred cat multivalued.json
And if you don't specify an output format (CSV, pretty print, ...), it will use DKVP (key-value pairs):
pos.1=3838.387671754935,pos.2=5853.151423739182,pos.3=1.895
NB: the index is numeric and the separator is a dot (.1, .2, ... instead of _x, _y, etc.)
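If you'd rather stay in pandas, here is a minimal sketch building on your code above (assuming every pos list has exactly three elements) that splits the array into separate columns:
import pandas as pd
import json

data = []
with open('JSONfile.json') as fh:
    for line in fh:
        data.append(json.loads(line))

df = pd.DataFrame.from_dict(data)
# expand the 3-element pos list into three new columns
df[['pos_x', 'pos_y', 'pos_z']] = pd.DataFrame(df['pos'].tolist(), index=df.index)
df = df.drop(columns=['pos'])
df.to_csv('csvFile.csv', index=False)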
The goal is to scrape pokemonDB, create a DataFrame of the Pokemon data (number, name, primary type, and secondary type), separate the two types into their own columns, and export it as a CSV file.
I'm stuck on accessing the 'dex' dataframe, specifically the contents of the columns. Am I using ".loc" correctly?
And then there is separating the two types into their own columns. I know I must use a space " " as a delimiter, but I'm not sure how. I'm new to Pandas.
This is what I have:
import pandas as pd
import requests
page = requests.get("https://pokemondb.net/pokedex/all")
dex = pd.read_html(page.text, attrs = {'id': 'pokedex'}, index_col = '#')
column_label_list = (list(dex[0].columns))
NationalNo = column_label_list[0];
Name = column_label_list[1];
Type = column_label_list[2];
numbers_list = dex.loc[ "#"]
names_list = dex.loc[ "Name"]
types1_list = dex.loc[ "Type"]
pokemon_list = pd.DataFrame(
{
NationalNo: numbers_list,
Name: names_list,
Type: types1_list,
#'Type2': types2_list,
})
print(pokemon_list)
#pokemon_list.to_csv('output.csv',encoding='utf-8-sig')
The result should look this like:
output.csv
# | Name | Type1 | Type2 |
__|_________|_______|_______|
0 |Bulbasaur|Grass |Poison |
__|_________|_______|_______|
.
.
.
etc...
I hope what I'm trying to accomplish makes sense.
dex is a list of all the tables found in that HTML; since there is only one table, select the first one. You then don't need to map the columns into a new dataframe manually, because it is already a dataframe, so you can export it directly. Please consider using the code below:
import pandas as pd
import requests
page = requests.get("https://pokemondb.net/pokedex/all")
dex = pd.read_html(page.text, attrs = {'id': 'pokedex'}, index_col = '#')
dex[0].to_csv("output.csv", encoding='utf-8')
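If you also want the Type1/Type2 split from your expected output, here is a sketch (assuming read_html joins the two types with a space, which matches the delimiter you mentioned):
df = dex[0]
# Type2 comes out as None for single-type Pokemon
df[['Type1', 'Type2']] = df['Type'].str.split(' ', n=1, expand=True)
df = df.drop(columns=['Type'])
df.to_csv('output.csv', encoding='utf-8-sig')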
I have a pyspark dataframe, this is what it looks like
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|member_uuid |Timestamp |updated |member_id |easy_id |
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|027130fe-584d-4d8e-9fb0-b87c984a0c20|2020-02-11 19:15:32|password_hash|ajuypjtnlzmk4na047cgav27jma6_STG|993269700|
I transformed the above dataframe to this,
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params |timestamp |
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile |UPDATE |{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"}|2020-02-11 19:15:32|
Using the following code,
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, struct

ll = ['member_uuid', 'member_id', 'easy_id', 'field']
# col_name is defined earlier in my code
df = df.withColumn('timestamp', col('Timestamp')).withColumn('attribute', lit('profile')).withColumn('operation', lit(col_name)) \
    .withColumn('field', col('updated')).withColumn('params', F.to_json(struct([x for x in ll])))
df = df.select('attribute', 'operation', 'params', 'timestamp')
I have to save this dataframe df to a text file after converting it to JSON.
I tried using the following code to do that:
df_final.toJSON().coalesce(1).saveAsTextFile('file')
The file contains,
{"attribute":"profile","operation":"UPDATE","params":"{\"member_uuid\":\"027130fe-584d-4d8e-9fb0-b87c984a0c20\",\"member_id\":\"ajuypjtnlzmk4na047cgav27jma6_STG\",\"easy_id\":993269700,\"field\":\"password_hash\"}","timestamp":"2020-02-11T19:15:32.000Z"}
I want it saved in this format:
{"attribute":"profile","operation":"UPDATE","params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"},"timestamp":"2020-02-11T19:15:32.000Z"}
to_json saves the value in the params column as a string; is there a way to keep the JSON structure here, so I can save it in the desired format?
Don't use to_json to create the params column in the dataframe.
The trick here is to just create a struct column and write it to the file (using .saveAsTextFile or .write.json()); Spark will create the JSON for the struct field itself.
If we have already created a JSON string and then write in JSON format, Spark will add \ to escape the quotes that already exist in the JSON string.
Example:
from pyspark.sql.functions import *
#sample data
df=spark.createDataFrame([("027130fe-584d-4d8e-9fb0-b87c984a0c20","2020-02-11 19:15:32","password_hash","ajuypjtnlzmk4na047cgav27jma6_STG","993269700")],["member_uuid","Timestamp","updated","member_id","easy_id"])
df1=df.withColumn("attribute",lit("profile")).withColumn("operation",lit("UPDATE"))
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").write.format("json").mode("overwrite").save("<path>")
#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").toJSON().saveAsTextFile("<path>")
#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}
A simple way to handle it is to just do a replace operation on the file
sourceData = open('file').read().replace('"{', '{').replace('}"', '}').replace('\\', '')
with open('file', 'w') as final:
    final.write(sourceData)
This might not be what you are looking for, but will achieve the end result.
How can I convert that JSON data to a pandas DataFrame so that it populates the "keys" as columns and the "values" as rows in a Grafana simple-JSON table, dynamically? Actually, it's not in a proper array format; how do I manipulate it to work as a dataframe? Help would be greatly appreciated.
data = {"name":"john","class":"fifth"}
{"name":"emma","class":"sixth"}
I want to populate keys as columns and values as rows dynamically, no matter how many JSON objects we have.
You could use the pandas.DataFrame.from_dict(data) method (see the pandas docs).
An example could look like this:
import pandas as pd
data = [{"name":"john","class":"fifth"},
{"name":"emma","class":"sixth"}]
df = pd.DataFrame.from_dict(data)
Result:
class name
0 fifth john
1 sixth emma
In this case data is already a list of dictionaries (I assume this is what you mean in your question). If you have the data as JSON strings/files, you can use json.loads(data_string) / json.load(data_file) from the json module.
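If each record sits on its own line, as in your snippet, here is a minimal sketch to parse them first (the filename is hypothetical):
import json
import pandas as pd

records = []
with open('data.json') as fh:  # hypothetical file with one JSON object per line
    for line in fh:
        line = line.strip()
        if line:
            records.append(json.loads(line))

df = pd.DataFrame.from_dict(records)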
Update
For the grafana table data structure like this:
data = [
{
"columns":[
{"text":"Time","type":"time"},
{"text":"Country","type":"string"},
{"text":"Number","type":"number"}
],
"rows":[
[1234567,"SE",123],
[1234567,"DE",231],
[1234567,"US",321]
],
"type":"table"
}
]
A pandas dataframe can be created:
keys = [d['text'] for d in data[0]['columns']]
pd.DataFrame(data=data[0]['rows'], columns=keys)
For a result like:
      Time Country  Number
0  1234567      SE     123
1  1234567      DE     231
2  1234567      US     321