I have to do a machine learning project and my data set is in the form of JSON files. I have a feature position with 3 values (x, y, z). When I convert the JSON files to CSV files with Python, the values of the position feature end up in the form of an array.
How can I generate three features pos_x, pos_y and pos_z from this one feature?
JSON: "pos":[3838.387671754935,5853.151423739182,1.895]
CSV: "pos": "[3838.387671754935,5853.151423739182,1.895]"
But I need 3 separate features: pos_x: 3838.387671754935, pos_y: 5853.151423739182, pos_z: 1.895
The code I used:
import pandas as pd
import json

data = []
with open('JSONfile.json') as fh:
    for line in fh:
        data.append(json.loads(line))

df = pd.DataFrame.from_dict(data)
df.to_csv('csvFile.csv', index=False)
You can use Miller (mlr) to produce all kinds of output:
mlr --ijson --ocsv cat multivalued.json
which will return the following CSV
pos.1,pos.2,pos.3
3838.387671754935,5853.151423739182,1.895
or for a table formatting
+-------------------+-------------------+-------+
| pos.1 | pos.2 | pos.3 |
+-------------------+-------------------+-------+
| 3838.387671754935 | 5853.151423739182 | 1.895 |
+-------------------+-------------------+-------+
with this alternative:
mlr --ijson --opprint --barred cat multivalued.json
and if you don't specify an output format (CSV, pretty-print, ...), it will use DKVP (key-value pairs):
pos.1=3838.387671754935,pos.2=5853.151423739182,pos.3=1.895
NB: the indices are numeric and the separator is a dot (.1, .2, ... instead of _a, _b, etc.).
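If you'd rather stay in Python/pandas, here is a minimal sketch of the same split done after loading, assuming every record has a 3-element pos list (column and file names taken from your code):

import pandas as pd
import json

data = []
with open('JSONfile.json') as fh:
    for line in fh:
        data.append(json.loads(line))

df = pd.DataFrame.from_dict(data)
# Expand the 3-element "pos" list into three separate feature columns
df[['pos_x', 'pos_y', 'pos_z']] = pd.DataFrame(df['pos'].tolist(), index=df.index)
df = df.drop(columns=['pos'])
df.to_csv('csvFile.csv', index=False)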
Sorry for the long post! I'm a bit Python-illiterate, so please bear with me.
I am working on a project that uses extracted Fitbit resting heart-rate data to compare heart-rate values across a series of years.
The Fitbit data exports as a .json file that I am attempting to convert to .csv for further analysis.
I pulled a script from GitHub that converts .json files to .csv-formatted files; however, when feeding in the resting heart-rate data I am running into a few troubles.
Sample lines from the .json:
[{
  "dateTime" : "09/30/16 00:00:00",
  "value" : {
    "date" : "09/30/16",
    "value" : 76.83736383927637,
    "error" : 2.737363838373737
  }
}]
Section of GitHub code that transforms nested frame into columns:
# reading json into dataframes
resting_hr_df = get_json_to_df(file_list=resting_hr_file_list).reset_index()
# Heart rate contains a sub json that are explicitly converted into column
resting_hr_df['date'] = resting_hr_df['value'].transform(lambda x: make_new_df_value(x, 'date'))
resting_hr_df['value'] = resting_hr_df['value'].transform(lambda x: make_new_df_value(x, 'value'))
resting_hr_df['error'] = resting_hr_df['value'].transform(lambda x: make_new_df_value(x, 'error'))
resting_hr_df = resting_hr_df.drop(['value', 'index'], axis=1)
There are two fields named 'value' and I think this is causing the issue.
When using the transform function in pandas to assign variable names for the nested dataframe keys, the second 'value' values are stored as 0 in the .csv file.
How should I store the values?
The problem is that this is a nested JSON file. The solution is to load the JSON file with the json module and then flatten it into pandas with json_normalize:
import json
import pandas as pd

with open('filename.json') as data_file:
    data = json.load(data_file)

resting_hr_df = pd.json_normalize(data)
resting_hr_df
Output resting_hr_df:
| | dateTime | value.date | value.value | value.error |
|---:|:------------------|:-------------|--------------:|--------------:|
| 0 | 09/30/16 00:00:00 | 09/30/16 | 76.8374 | 2.73736 |
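If you then want the flat column names from your original script and a CSV on disk, a small follow-up sketch (the output filename is an assumption):

# Strip the "value." prefix produced by json_normalize, then export
resting_hr_df = resting_hr_df.rename(columns={
    'value.date': 'date',
    'value.value': 'value',
    'value.error': 'error',
})
resting_hr_df.to_csv('resting_hr.csv', index=False)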
You can also solve this easily with the built-in pandas function read_json:
df = pd.read_json('abc.json', orient='index')
data = df.to_csv(index=False)
print(data)
It can be an easy and helpful way to solve this problem by converting the JSON straight to a CSV file.
I'm having trouble loading a big JSON Lines file in pandas, mainly because I need to "flatten" one of the resulting columns after using pd.read_json.
For example, for this JSON line:
{"user_history": [{"event_info": 248595, "event_timestamp": "2019-10-01T12:46:03.145-0400", "event_type": "view"}, {"event_info": 248595, "event_timestamp": "2019-10-01T13:21:50.697-0400", "event_type": "view"}], "item_bought": 1909110}
I'd need to load 2 rows with 4 columns in pandas like this:
+--------------+--------------------------------+--------------+---------------+
| "event_info" | "event_timestamp" | "event_type" | "item_bought" |
+--------------+--------------------------------+--------------+---------------+
| 248595 | "2019-10-01T12:46:03.145-0400" | "view" | 1909110 |
| 248595 | "2019-10-01T13:21:50.697-0400" | "view" | 1909110 |
+--------------+--------------------------------+--------------+---------------+
The thing is, given the size of the file (413000+ lines, over 1GB), none of the ways that I managed to do it is fast enough for me. I was trying a rather rudimentary way, iterating over the loaded dataframe, creating a dictionary and appending the values to an empty dataframe:
history_df = pd.read_json('data/train_dataset.jl', lines=True)
history_df['index1'] = history_df.index
normalized_history = pd.DataFrame()
for index, row in history_df.iterrows():
    for dic in row['user_history']:
        dic['index1'] = row['index1']
        dic['item_bought'] = row['item_bought']
        normalized_history = normalized_history.append(dic, ignore_index=True)
So the question is which would be the fastest way to accomplish this? Is there any way without iterating the whole history_df dataframe?
Thank you in advance
Maybe try this:
import pandas as pd
import json

data = []
# assuming each line from data/train_dataset.jl
# is a json object like the one you posted above:
with open('data/train_dataset.jl') as f:
    for line in f:
        data.append(json.loads(line))

normalized_history = pd.json_normalize(data, 'user_history', 'item_bought')
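Either way, the big win is avoiding append inside a loop. If you already have the dataframe from pd.read_json, a vectorized alternative sketch (assuming pandas >= 0.25 for explode, and that every user_history entry is a non-empty list):

import pandas as pd

history_df = pd.read_json('data/train_dataset.jl', lines=True)
# explode gives one row per history event, keeping item_bought aligned
exploded = history_df.explode('user_history').reset_index(drop=True)
normalized = pd.json_normalize(exploded['user_history'].tolist())
normalized['item_bought'] = exploded['item_bought']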
The goal is to scrape PokemonDB, create a DataFrame of the Pokemon data (Number, Name, Primary type, and Secondary type), separate the two types into their own columns, and export it as a CSV file.
I'm stuck on accessing the 'dex' dataframe, specifically the contents of the columns. Am I using ".loc" correctly?
And then there is separating the two types into their own columns. I know I must use a space " " as a delimiter, but I'm not sure how. I'm new to pandas.
This is what I have:
import pandas as pd
import requests
page = requests.get("https://pokemondb.net/pokedex/all")
dex = pd.read_html(page.text, attrs = {'id': 'pokedex'}, index_col = '#')
column_label_list = list(dex[0].columns)
NationalNo = column_label_list[0]
Name = column_label_list[1]
Type = column_label_list[2]
numbers_list = dex.loc["#"]
names_list = dex.loc["Name"]
types1_list = dex.loc["Type"]
pokemon_list = pd.DataFrame(
    {
        NationalNo: numbers_list,
        Name: names_list,
        Type: types1_list,
        #'Type2': types2_list,
    })
print(pokemon_list)
#pokemon_list.to_csv('output.csv',encoding='utf-8-sig')
The result should look like this:
output.csv

#  | Name      | Type1 | Type2  |
---|-----------|-------|--------|
0  | Bulbasaur | Grass | Poison |
...
etc.
I hope what I'm trying to accomplish makes sense.
dex is a list of all the tables that exist in that HTML. Since there is only one table, select the first one; you then don't need to map it into a dataframe anymore, just export it directly, since it is already a dataframe. Please consider using the code below:
import pandas as pd
import requests
page = requests.get("https://pokemondb.net/pokedex/all")
dex = pd.read_html(page.text, attrs = {'id': 'pokedex'}, index_col = '#')
dex[0].to_csv("output.csv", encoding='utf-8')
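For the type-splitting part of the question, a hedged sketch on top of that dataframe (assuming the scraped column is named 'Type' and multi-type cells come through space-separated, e.g. 'Grass Poison'):

df = dex[0]
# Split on the first space; single-type rows get None in Type2
df[['Type1', 'Type2']] = df['Type'].str.split(' ', n=1, expand=True)
df = df.drop(columns=['Type'])
df.to_csv('output.csv', encoding='utf-8-sig')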
I can easily build a pandas dataframe from a string that contains only one JSON object. For example:
import json
import pandas as pd

string1 = '{"Country":"USA","Name":"Ryan"}'
dict1 = json.loads(string1)
df = pd.DataFrame([dict1])
print(df)
However, when I use a string that has more than one JSON object:
string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
dict2 = json.loads(string2)
I get the following error:
raise JSONDecodeError("Extra data", s, end)
I am aware that string2 is not a valid JSON.
What modifications can I do on string2 programmatically so that I can convert it to a valid JSON and then get a dataframe output which is as follows:
| Country | Name |
|---------|------|
| USA | Ryan |
| Sweden | Sam |
| Brazil | Ralf |
Your error
The error says it all: the JSON is not valid. Where did you get that string2? Are you typing it in yourself?
In that case you should surround the items with brackets [] and separate them with commas ,.
Working example:
import pandas as pd
import json
string2 = '[{"Country":"USA","Name":"Ryan"},{"Country":"Sweden","Name":"Sam"},{"Country":"Brazil","Name":"Ralf"}]'
df = pd.DataFrame(json.loads(string2))
print(df)
Returns:
Country Name
0 USA Ryan
1 Sweden Sam
2 Brazil Ralf
Interestingly, if you are extra observant, in this line here df=pd.DataFrame([dict1]) you are actually putting your dictionary inside an array with brackets []. This is because a pandas DataFrame accepts arrays of data. What you actually have in your first example is a single item, in which case a Series would make more sense: df = pd.Series(dict1).to_frame().T.
Or:
string1 = '[{"Country":"USA","Name":"Ryan"}]' # <--- brackets here to read json as arr
dict1 = json.loads(string1)
df=pd.DataFrame(dict1)
print(df)
And if you understood this, I think it becomes easier to see that we need , to separate the elements.
Alternative inputs
But let's say you are creating this dataset yourself, then you could go ahead and do this:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
dict1 = [{"Country":i, "Name":y} for i,y in data] # <-- dictionaries inside arr
df = pd.DataFrame(dict1)
Or:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
df = pd.DataFrame(data, columns=['Country','Name'])
Or, what I would prefer, use a CSV structure:
from io import StringIO

data = '''\
Country,Name
USA,Ryan
Sweden,Sam
Brazil,Ralf'''
df = pd.read_csv(StringIO(data))
On the off chance that you are getting data from elsewhere in the weird format that you described, the following regular-expression-based substitutions can fix your JSON, and thereafter you can proceed as per @Anton vBR's solution.
import pandas as pd
import json
import re

string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'

# create a dict of substitutions
rd = {'^{': '[{',   # substitute the starting char with [
      '}$': '}]',   # substitute the ending char with ]
      '}{': '},{'}  # add , in between two dicts

# replace as per the dict (items(), not the Python 2 iteritems())
for k, v in rd.items():
    string2 = re.sub(k, v, string2)

df = pd.DataFrame(json.loads(string2))
print(df)
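Alternatively, you can avoid the regex entirely with json.JSONDecoder.raw_decode, which parses one value at a time and reports where it stopped; a sketch, assuming the input really is back-to-back JSON objects with no separators:

import json
import pandas as pd

def iter_json_objects(s):
    # raw_decode returns (object, end_index) for the JSON value starting at idx
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(s):
        obj, idx = decoder.raw_decode(s, idx)
        yield obj

string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
df = pd.DataFrame(list(iter_json_objects(string2)))
print(df)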
I'm using PySpark to load data from Google BigQuery.
I've loaded data by using:
dfRates = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
Where conf is defined as in https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example.
I need this data as a DataFrame, so I tried,
row = Row(['userId','accoId','rating']) # or row = Row(('userId','accoId','rating'))
dataRDD = dfRates.map(row).toDF()
and
dataRDD = sqlContext.createDataFrame(dfRates,['userId','accoId','rating'])
But it does not convert the data into a DataFrame. Is there a way to convert it into a DataFrame?
As long as the types can be represented using Spark SQL types, there is no reason it couldn't be. The only problem here seems to be your code.
newAPIHadoopRDD returns an RDD of pairs (tuples of length two). In this particular context it looks like you'll get (int, str) in Python, which clearly cannot be unpacked into ['userId','accoId','rating'].
According to the doc you've linked, com.google.gson.JsonObject is represented as a JSON string, which can be parsed either on the Python side using standard Python utils (the json module):
import json
from pyspark.sql import Row

def parse(v, fields=["userId", "accoId", "rating"]):
    row = Row(*fields)
    try:
        parsed = json.loads(v)
    except json.JSONDecodeError:
        parsed = {}
    return row(*[parsed.get(x) for x in fields])

# parse the JSON string half of each (key, value) pair
dfRates.values().map(parse).toDF()
or on the Scala / DataFrame side using get_json_object:
from pyspark.sql.functions import col, get_json_object

dfRates.toDF(["id", "json_string"]).select(
    # This assumes you expect a userId field
    get_json_object(col("json_string"), "$.userId"),
    ...
)
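A small usage note: get_json_object names the resulting column after the whole expression, so you will probably want to alias each field (field names taken from the question):

dfRates.toDF(["id", "json_string"]).select(
    get_json_object(col("json_string"), "$.userId").alias("userId"),
    get_json_object(col("json_string"), "$.accoId").alias("accoId"),
    get_json_object(col("json_string"), "$.rating").alias("rating"),
)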
Please note the differences in the syntax I've used to define and create rows.
hbase table rows:
hbase(main):008:0> scan 'test_hbase_table'
ROW        COLUMN+CELL
 dalin     column=cf:age, timestamp=1464101679601, value=40
 tangtang  column=cf:age, timestamp=1464101649704, value=9
 tangtang  column=cf:name, timestamp=1464108263191, value=zitang
2 row(s) in 0.0170 seconds
Here we go:
import json
host = '172.23.18.139'
table = 'test_hbase_table'
conf = {"hbase.zookeeper.quorum": host, "zookeeper.znode.parent": "/hbase-unsecure", "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
hbase_rdd1 = hbase_rdd.flatMapValues(lambda v: v.split("\n"))
and here are the results:
tt = sqlContext.jsonRDD(hbase_rdd1.values())
In [113]: tt.show()
+------------+---------+--------+-------------+----+------+
|columnFamily|qualifier| row| timestamp|type| value|
+------------+---------+--------+-------------+----+------+
| cf| age| dalin|1464101679601| Put| 40|
| cf| age|tangtang|1464101649704| Put| 9|
| cf| name|tangtang|1464108263191| Put|zitang|
+------------+---------+--------+-------------+----+------+
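Note that on newer Spark versions (an assumption about your setup: Spark 1.4+), jsonRDD is deprecated in favor of the reader API; the equivalent call would be:

tt = sqlContext.read.json(hbase_rdd1.values())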