I have a very messy JSON file (lists inside of lists) that I'm trying to convert to an R dataframe (part of the reason to convert the file is that I need to export it into a .csv file). Here is a sample of the data (https://www.dropbox.com/s/ikb4znhpaavyc9z/20140909-20141010_10zdfxhqf0_2014_10_09_23_50_activities.json?dl=0). I tried this solution (Parse nested JSON to Data Frame in R), but that got rid of many of my columns. Below is the code I have so far:
library("twitteR")
library("streamR")
library("rjson")

json_file <- "20140909-20141010_10zdfxhqf0_2014_09_09_01_00_activities.json"
json_data <- fromJSON(file = json_file) # convert to an R list
str(json_data) # list of 16 objects

# unlist elements
tweets.i <- lapply(json_data, function(x) unlist(x))
tweets <- do.call("rbind", tweets.i)
tweets <- as.data.frame(tweets)

library(plyr)
tweets <- rbind.fill(lapply(tweets.i,
  function(x) do.call("data.frame", as.list(x))
))
Anyone have a way to convert the file to an R dataframe without losing all the info? I'm open to using Python to do this work too; I just don't have the expertise to figure out how to code it.
This is not very efficient, but it may work for you:
download.file("https://www.dropbox.com/s/ikb4znhpaavyc9z/20140909-20141010_10zdfxhqf0_2014_10_09_23_50_activities.json?dl=1", destfile = tf <- tempfile(fileext = ".json"))
txt <- readLines(tf)
library(jsonlite)
library(plyr)
df <- do.call(plyr::rbind.fill, lapply(txt[txt != ""], function(x) as.data.frame(t(unlist(fromJSON(x))))))
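If the file is newline-delimited JSON (one document per line), jsonlite::stream_in() plus flatten() is another route that keeps the nested fields as prefixed columns. A minimal sketch along the same lines, assuming the blank lines are dropped first (the sample file contains some):

library(jsonlite)
txt <- readLines(tf) # tf is the temp file downloaded above
con <- textConnection(txt[nzchar(txt)]) # drop blank lines before streaming
df2 <- flatten(stream_in(con)) # one row per JSON document, nested objects flattened
str(df2, max.level = 1)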
I like the answer provided above as a really quick way to get everything. You could try tidyjson, but it also will not be efficient, since it requires knowing the structure in advance. listviewer::jsonedit might help you visualize what you are working with.
#devtools::install_github("timelyportfolio/listviewer")
library(listviewer)
jsonedit(readLines(
  "https://www.dropbox.com/s/ikb4znhpaavyc9z/20140909-20141010_10zdfxhqf0_2014_10_09_23_50_activities.json?dl=1"
)[2])
Perhaps a data.frame really isn't the best structure, but it really depends on what you are trying to accomplish.
This is just a sample to hopefully show you how it might look.
library(tidyjson)
library(dplyr)
json <- readLines(
  "https://www.dropbox.com/s/ikb4znhpaavyc9z/20140909-20141010_10zdfxhqf0_2014_10_09_23_50_activities.json?dl=1"
)

json %>%
  {
    Filter(
      function(x){ nchar(x) != 0 }
      , .
    )
  } %>%
  as.tbl_json() %>%
  spread_values(
    id = jstring("id")
    ,objectType = jstring("objectType")
    ,link = jstring("link")
    ,body = jstring("body")
    ,favoritesCount = jstring("favoritesCount")
    ,twitter_filter_level = jstring("twitter_filter_level")
    ,twitter_lang = jstring("twitter_lang")
    ,retweetCount = jnumber("retweetCount")
    ,verb = jstring("verb")
    ,postedTime = jstring("postedTime")
    # from the actor object in the JSON
    ,actor_objectType = jstring("actor","objectType")
    ,actor_id = jstring("actor","id")
    ,actor_link = jstring("actor","link")
    ,actor_displayName = jstring("actor","displayName")
    ,actor_image = jstring("actor","image")
    ,actor_summary = jstring("actor","summary")
    ,actor_friendsCount = jnumber("actor","friendsCount")
    ,actor_followersCount = jnumber("actor","followersCount")
  ) %>%
  # careful: once you enter an object you can't go back up
  enter_object("actor","links") %>%
  gather_array() %>%
  spread_values(
    actor_href = jstring("href")
  )
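Since the stated end goal is a .csv, whichever flattened data frame you end up with can be written out directly; a one-liner (the file name is just an example):

# assuming `df` holds whichever flattened data frame you settled on
write.csv(df, "activities.csv", row.names = FALSE)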
Related
I have the following data:
#1. Dates at a 15-day frequency:
dates = seq(as.Date("2017-01-01"), as.Date("2020-12-30"), by=15)
#2. A dataframe containing the dates on which a certain observation is recorded per variable, as:
#3. Values corresponding to the dates in #2, as:
What I am trying to do is assign values to their respective dates, keep NaN for the dates that have no observation, and save the result as a text file. The output should look something like below. I'd appreciate your help; the solution can be in R or in Python.
This code works on the example data you provided.
Because it iterates row by row, it will not be the fastest approach out there, but it does the job.
The date DataFrame contains the dates shown in #2, and data is the DataFrame containing the values shown in #3.
# IMPORT PACKAGES AND LOAD DATA
import pandas as pd
import numpy as np
data = pd.read_csv("./data.csv")
date = pd.read_csv("./date.csv")
# GET THE UNIQUE DATES
date_unique = pd.Series(np.concatenate([date[col_name].dropna().unique() for col_name in date.columns]).flat).unique()
# GET THE END DATA FRAME READY
data_reform = pd.DataFrame(data=[], columns=["date"])
for col in data.columns:
    data_reform.insert(len(data_reform.columns), col, [])
data_reform["date"] = date_unique
data_reform.sort_values(by=["date"], inplace=True)
# ITER THROUGH THE DATA AND ALLOCATE IT TO THE FINAL DATA FRAME
for row_id, row_val in data.iterrows():
    for col_id, col_val in row_val.dropna().items():
        # .loc avoids chained-indexing assignment, which can fail silently
        data_reform.loc[data_reform["date"] == date[col_id].iloc[row_id], col_id] = col_val
# SAVE AS A TEXT FILE, NaN FOR MISSING (separator is an assumption)
data_reform.to_csv("data_reform.txt", sep=" ", index=False, na_rep="NaN")
You can use stack, then merge, and finally spread to get back to your column-style matrix. However, this can create a sparse result (not too memory-efficient for large datasets).
library(pivottabler)
library(dplyr)
library(tidyr)
dates_df <- data.frame(read.csv("date.txt"))
dates_df$dates_formatted <- as.Date(dates_df$dates, format = "%d/%m/%Y")
dates_df <- data.frame(dates_df[,2])
names(dates_df) <- c("dates")
valid_observations <- read.csv("cp.txt", sep = " " ,na.strings=c('NaN'))
observations <- read.csv("cpAC.txt", sep = " ")
# Idea is to produce an EAV table
# date var_name value
EAV_Table <- cbind(stack(valid_observations), stack(observations))
complete_EAV <- EAV_Table[complete.cases(EAV_Table), 1:3]
complete_EAV$date <- as.Date(complete_EAV$values, format = "%Y-%m-%d")
complete_EAV <- complete_EAV[, 2:4]
names(complete_EAV) <- c("Variable", "Value", "dates")
complete_EAV$Variable <- as.factor(complete_EAV$Variable)
dates_measures <- merge(dates_df, complete_EAV, all.x = TRUE)
result <- spread(dates_measures, Variable, Value)
write.csv(result, "data_measurements.csv", row.names = FALSE)
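If you are on a newer tidyr, spread() is superseded by pivot_wider(); the equivalent last step would be:

# same reshape with the newer tidyr verb
result <- pivot_wider(dates_measures, names_from = Variable, values_from = Value)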
Issue
I have a data set I am trying to use to create an H5 file that I can pass to a Python package. I am populating the attributes using h5attr(object, attr_name) <- value. However, when I load the attributes for each object in the group within my h5 file, it appears the data class is not being preserved: when I open the file in Python with {h5py} and look at the attributes, every one comes back as dtype=object. Does anyone know if this is a default of the h5attr() function? If so, should I try to use create_attr() instead?
Thanks for any and all help! I recommend running this in R Markdown so you can make one R chunk and one Python chunk for each of my blocks of code.
Reprex - edited
I am providing a simplified version of the code, with sample data for three objects/events within the first group.
These events house a 3x6000 matrix each.
Each matrix should have 3 attributes - a numeric, a char, and a list.
Edits
The reticulate package will be used at the end of the R chunk to pass the path to the h5 file you created in R to the Python chunk.
Removed the list format from the purrr functions; the functions are working cleanly now.
R Code for Creating the File
library(hdf5r)
library(dplyr)
library(reticulate)
h5_file_path = here::here() # path to where you are creating the h5 file
# This line creates the empty file
NMTSO_trainer.h5 <- H5File$new(filename = sprintf("%s/NMTSO_trainer.h5", h5_file_path), mode = "a")
# This creates a group within the file; think of the file as a directory tree where each group is like a folder
data.grp <- NMTSO_trainer.h5$create_group("data")
# Items to populate attributes
trace_name = c("sample_event1", "sample_event2", "sample_event3")
col_names = c("att1", "att2", "att3")
value = list(runif(n = 1, -100, 100), "SC", list(c(0,0,runif(n = 1, 0, 5))))
# Placeholder for the matrices per event
x = list()
events = length(trace_name)
# Populates the event matrices
for (i in 1:events){
  x[[i]] <- runif(n = 6000, -1, 1) %>% matrix(nrow = 1)
  x[[i]] <- rep(0, (2 * 6000)) %>% matrix(nrow = 2) %>% rbind(x[[i]])
}
# Puts each matrix within the corresponding "folder" in the h5 file
purrr::map2(trace_name, x, function(trace_name, x){
  data.grp[[trace_name]] <- x
})
# Puts the corresponding attributes with each matrix - there should be 3 per matrix.
# This is where I am wondering if I should use create_attr() rather than h5attr()
purrr::walk(trace_name, function(trace_name){
  purrr::walk2(col_names, value, function(col_names, value){
    h5attr(data.grp[[trace_name]], col_names) <- value
  })
})
# Shows the class of the objects populated in the h5 file according to R
h5attr(data.grp[[trace_name[1]]], col_names[1]) %>% class()
h5attr(data.grp[[trace_name[1]]], col_names[2]) %>% class()
h5attr(data.grp[[trace_name[1]]], col_names[3]) %>% class()
# The file must be closed for all data to be written to the file
NMTSO_trainer.h5$close_all(close_self = TRUE)
py$file_path = sprintf("%s/NMTSO_trainer.h5", h5_file_path)
Python Chunk for Evaluating Attr Format
Edits
Passed the file_path object created in R into the Python environment using the reticulate package. This no longer requires any user file-path manipulation, as long as the code is run in RStudio's R Markdown files to take advantage of its Python engine.
import h5py
import pandas as pd
import numpy as np
e = h5py.File(file_path, 'r')
# Shows the users what groups are in the file
list(e.keys())
group = e['data']
# Shows the user what events are in the group
list(group.keys())
# Shows the user what is in the attributes
group['sample_event1'].attrs['att1']
group['sample_event1'].attrs['att2']
group['sample_event1'].attrs['att3']
# Shows the user what format the data is in
type(group['sample_event1'].attrs['att1'])
type(group['sample_event1'].attrs['att2'])
type(group['sample_event1'].attrs['att3'])
Half Solution/Workaround
So I found that if I use {reticulate} to create the HDF5 file in Python rather than R, the attributes I create retain their formatting after closing the file and reopening it. As I have built my pipeline in R, this is not the most ideal solution. If anyone knows how to do this with {hdf5r} rather than {h5py}, I would love to learn.
Remaining Question
The first attribute's type returns as float64, which I believe is the standard for R. Is there a simple way to convert to float32? Is it even necessary to convert between them? I have an exemplary file that goes to a program, and its float attributes are float32.
library(dplyr)
library(reticulate)
h5_file_path = sprintf("%s/NMTSO_test.h5", here::here())
# Items to populate attributes
trace_name = c("sample_event1", "sample_event2", "sample_event3")
col_names = c("att1", "att2", "att3")
value = list(runif(n = 1, -100, 100), "SC", list(c(0,0,runif(n = 1, 0, 5))))
# Placeholder for the matrices per event
x = list()
events = length(trace_name)
# Populates the event matrices
for (i in 1:events){
  x[[i]] <- runif(n = 6000, -1, 1) %>% matrix(nrow = 1)
  x[[i]] <- rep(0, (2 * 6000)) %>% matrix(nrow = 2) %>% rbind(x[[i]])
}
rm(events, i) # clean up the loop helpers after, not inside, the loop
reticulate::repl_python()
import h5py
import pandas as pd
import numpy as np
# the file path is passed in from R via reticulate (r.h5_file_path)
e = h5py.File(r.h5_file_path, 'w')
# loop to create and fill each event in the group with values from the R object 'x'
for i in np.r_[0:len(r.trace_name)]:
    e.create_dataset("data/" + r.trace_name[i], r.x[i].shape, data=r.x[i], dtype=np.float32)
# loops to populate the attributes in each sample_event from the R object 'value'
for i in np.r_[0:len(r.trace_name)]:
    for j in np.r_[0:len(r.col_names)]:
        e['data/' + r.trace_name[i]].attrs[r.col_names[j]] = r.value[j]
# Shows the user what is in the attributes
e['data/sample_event1'].attrs['att1']
e['data/sample_event1'].attrs['att2']
e['data/sample_event1'].attrs['att3']
# Shows the user what format the data is in
type(e['data/sample_event1'].attrs['att1'])
type(e['data/sample_event1'].attrs['att2'])
type(e['data/sample_event1'].attrs['att3'])
e.close()
exit
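Back on the {hdf5r} side, create_attr() looks like the way to pin the stored type, since it accepts an explicit dtype, whereas h5attr()<- infers the type from the R object (an R double becomes a 64-bit float). A minimal, untested sketch - treating h5types$H5T_IEEE_F32LE as an assumption about the predefined types your hdf5r version exposes:

library(hdf5r)
f <- H5File$new("NMTSO_attr_test.h5", mode = "w")
grp <- f$create_group("data")
grp[["sample_event1"]] <- matrix(runif(18), nrow = 3)
# create_attr() with an explicit dtype should store the attribute as a 32-bit float
grp[["sample_event1"]]$create_attr(
  "att1",
  robj = runif(n = 1, -100, 100),
  dtype = h5types$H5T_IEEE_F32LE
)
f$close_all()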
I am trying to create a partial dependence plot using the following code:
rf_pdp = rf_model.partial_plot(data=htest, cols=['var1', 'var2', 'var3'], plot=True)
rf_pdp
Is there a way I can save the output, such as mean_resp, into a data frame?
The partial_plot() method returns a list whose elements are of type h2o.two_dim_table.H2OTwoDimTable - or a list plus a plot if you set the plot parameter to True (see the API docs to learn more about the parameters and return types).
To see this, do:
type(rf_pdp) # should return a list
type(rf_pdp[0]) # should return h2o.two_dim_table.H2OTwoDimTable
Once you have selected the H2OTwoDimTable corresponding to the PDP column of interest, you can either select the "mean_response" column directly or convert the H2OTwoDimTable to a pandas dataframe and select mean_response from there.
So, to get the mean_response column for "var1", for example, you can do
rf_pdp[0]["mean_response"]
or
rf_pdp[0].as_data_frame()['mean_response']
Reproducible example:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(h2o, tidyverse, modeldata)
h2o.init()
# load HR data from modeldata pkg
data("attrition", package = "modeldata")
df <-
  attrition %>%
  mutate_if(
    .predicate = is.ordered,
    .funs = factor,
    ordered = FALSE
  ) %>%
  mutate(
    Attrition = factor(Attrition, levels = c("Yes", "No"))
  )
index <- 1:5
train.obs <- df[-index,]
loc.obs <- df[index,]
# create h2o objects for modeling
y <- "Attrition"
x <- setdiff(names(train.obs), y)
train.obs.h2o <- as.h2o(train.obs)
loc.obs.h2o <- as.h2o(loc.obs)
# create h2o RF model
h2o_rf <- h2o.randomForest(x, y, training_frame = train.obs.h2o)
# create Partial Dependence Plots (PDP)
h2o_pdp <-
  h2o.partialPlot(
    object = h2o_rf,
    data = train.obs.h2o,
    cols = c("MonthlyIncome", "Age")
  )
Option 1: Export the PDP dataframe for each of MonthlyIncome and Age directly to Excel in separate named sheets
writexl::write_xlsx(
  list(
    MonthlyIncome = h2o_pdp[[1]],
    Age = h2o_pdp[[2]]
  ),
  path = "pdp-metrics.xlsx"
)
Option 2: Store the H2O PDP locally as a dataframe object for each variable
pdpMonIncTbl <- as.data.frame(h2o_pdp[[1]])
pdpAgeTbl <- as.data.frame(h2o_pdp[[2]])
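From there, pulling just the mean response per grid point is plain data-frame subsetting - for example (the column names follow the usual h2o.partialPlot output; adjust if your version differs):

# mean_response alongside the grid values for MonthlyIncome
pdpMonIncTbl[, c("MonthlyIncome", "mean_response")]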
I'm moving to R from Python and am trying to use my Python skills to become familiar with scraping JSON with R. I am having some issues viewing and scraping what I would like to. I'm pretty sure I have the for loops down, but I am unsure how to select keys and return their content. I have read some documents, but being new to R, it's a little tough to understand. To illustrate, I created a quick script with Python to show what I am trying to do in RStudio.
import requests
from pprint import pprint
start = '2018-10-03'
end = '2018-10-10'
req = requests.get('https://statsapi.web.nhl.com/api/v1/schedule?startDate=' + str(start) + '&endDate=' + str(end) + '&hydrate=team(leaders(categories=[points,goals,assists],gameTypes=[P])),linescore,broadcasts(all),tickets,game(content(media(epg),highlights(scoreboard)),seriesSummary),radioBroadcasts,metadata,seriesSummary(series),decisions,scoringplays&leaderCategories=&site=en_nhl&teamId=&gameType=&timecode=')
data = req.json()['dates']
for info in data:
    date = info['date']
    games = info['games']
    for game in games:
        gamePk = game['gamePk']
        print(date, gamePk)
Below is what I have started with, but I am having an issue understanding where I can view my JSON other than printing data, which locks up R. I would like to be able to view the dictionaries and keys as I go. The other question is how I would then add the key-values to a vector or data frame and write them out. I am familiar with exporting to a file, but curious as to how I add the values to a data frame. Would that be a bind operation, or would I not have to do that?
library(jsonlite)
start <- as.Date(c('2018-10-03'))
end <- as.Date(c('2019-04-15'))
url <- paste0('https://statsapi.web.nhl.com/api/v1/schedule?startDate=', start,'&endDate=', end,'&hydrate=team(leaders(categories=[points,goals,assists],gameTypes=[P])),linescore,broadcasts(all),tickets,game(content(media(epg),highlights(scoreboard)),seriesSummary),radioBroadcasts,metadata,seriesSummary(series),decisions,scoringplays&leaderCategories=&site=en_nhl&teamId=&gameType=&timecode=')
data <- fromJSON(url)
To expand on my issue, here is a further sample of where the struggle lies.
library(jsonlite)
start <- as.Date(c('2018-10-03'))
end <- as.Date(c('2018-10-04'))
url <- paste0('https://statsapi.web.nhl.com/api/v1/schedule?startDate=', start,'&endDate=', end,'&hydrate=team(leaders(categories=[points,goals,assists],gameTypes=[P])),linescore,broadcasts(all),tickets,game(content(media(epg),highlights(scoreboard)),seriesSummary),radioBroadcasts,metadata,seriesSummary(series),decisions,scoringplays&leaderCategories=&site=en_nhl&teamId=&gameType=&timecode=')
data <- fromJSON(url)
date <- data$dates$date
game_id <- data$dates$games
game <- NULL
for (ids in game_id) {
  pk <- ids$gamePk
  game <- rbind(game, pk)
}
I figured the pk values would end up in 1 column, but they are spread across multiple columns, and I receive a warning: In rbind(game, pk): number of columns of result is not a multiple of vector length.
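A sketch of one way through both issues, assuming jsonlite's default simplification: str() with a depth cap lets you inspect the parsed structure without printing everything, and data$dates$games comes back as a list of data frames (one per date), so the ids can be pulled per element and the dates repeated alongside, instead of rbind-ing vectors of different lengths (this reuses url and data from the snippet above):

# peek at the structure without flooding the console
str(data, max.level = 2)

# one row per game: repeat each date for however many games it has
games_per_date <- vapply(data$dates$games, nrow, integer(1))
schedule <- data.frame(
  date = rep(data$dates$date, games_per_date),
  gamePk = unlist(lapply(data$dates$games, `[[`, "gamePk"))
)
head(schedule)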
I can make numpy record arrays with recfromcsv,
data = recfromcsv(dataset1, names=True)
xvars = ['exp','exp_sqr','wks','occ','ind','south','smsa','ms','union','ed','fem','blk']
y = data['lwage']
X = data[xvars]
c = ones_like(data['lwage'])
X = add_field(X, 'constant', c)
But, I have no idea how to take this into an R data frame usable by Rpy2,
p = roptim(theta,robjects.r['ols'],method="BFGS",hessian=True ,y= robjects.FloatVector(y),X = base.matrix(X))
ValueError: Nothing can be done for the type <class 'numpy.core.records.recarray'> at the moment.
p = roptim(theta,robjects.r['ols'],method="BFGS",hessian=True ,y= robjects.FloatVector(y),X = base.matrix(array(X)))
ValueError: Nothing can be done for the type <type 'numpy.ndarray'> at the moment.
Just to get an rpy2 DataFrame from a csv file, in rpy2 2.3, you can just do:
df = robjects.DataFrame.from_csvfile('filename.csv')
I'm not 100% sure I understand your issue, but a couple of things:
1) if it's OK, you can read a csv into R directly, that is:
robjects.r('name <- read.csv("filename.csv")')
After which you can refer to the resulting data frame in later functions.
Or 2) you can convert a numpy array into a data frame. To do this, you need to import and activate the module 'rpy2.robjects.numpy2ri'.
Then you could do something like:
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import numpy2ri
numpy2ri.activate()  # enables automatic numpy -> R conversion

array_ex = np.array([[4,3],[3,2],[1,5]])
rmatrix = robjects.r('matrix')
rdf = robjects.r('data.frame')
rlm = robjects.r('lm')
mat_ex = rmatrix(array_ex, ncol = 2)
df_ex = rdf(mat_ex)
fit_ex = rlm('X1 ~ X2', data = df_ex)
Or whatever other functions you wanted.
There may be a more direct way - I get frustrated going between the two data types and so I am much more likely to use option 1) if possible.
Would either of these methods get you to where you need to be?