I am trying to rewrite a JSON file to add missing data values, but I can't seem to get the code to write the filled-in data back to the JSON file. Here is the code that fills in the missing data:
import pandas as pd
import json
data_df = pd.read_json("Data_test.json")
#replacing empty strings with nan
df2 = data_df.mask(data_df == "")
#filling the nan with data from above.
df2["Food_cat"] = df2["Food_cat"].ffill()
"Data_test.json" is the file containing the list of dictionaries, and I am trying to either edit this JSON file or create a new one with the missing data filled in.
I have tried using
with open('complete_data', 'w') as f:
    json.dump(df2, f)
but it does not seem to work. Is there a way to edit the current data, or to create a new JSON file with the completed data?
This is the original data; I would like to keep this format.
Try this:
import pandas as pd
import json
data_df = pd.read_json("Data_test.json")
#replacing empty strings with nan
df2 = data_df.mask(data_df == "")
#filling the nan with data from above.
df2["Food_cat"] = df2["Food_cat"].ffill()
df2.to_json('path_of_file.json')
Tell me if it works.
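Since you want to keep the list-of-dictionaries layout, note that to_json defaults to a column-oriented dict. Here is a minimal sketch (with made-up rows standing in for Data_test.json) showing that orient="records" preserves the [{...}, {...}] shape:

```python
import pandas as pd

# hypothetical data standing in for Data_test.json
df2 = pd.DataFrame([
    {"Food_cat": "fruit", "item": "apple"},
    {"Food_cat": "fruit", "item": "pear"},
])

# orient="records" writes a list of dictionaries, matching the
# original [{...}, {...}] layout of the input file
json_str = df2.to_json(orient="records")
print(json_str)
```

Passing the same orient to df2.to_json('path_of_file.json', orient='records') writes that shape straight to disk.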
I have a CSV file, shown below, that I read with pandas. In the symbol column I want to replace all the BTC/USD values with BTCUSD. How would I be able to do that?
Code:
# read_csv function which is used to read the required CSV file
data = pd.read_csv("sample.txt")
csv file:
unix,date,symbol,open,high,low,close,Volume BTC
1544217600,2018-12-07 21:20:00,BTC/USD,3348.77,3350.41,3345.07,3345.12,3.11919918
1544217540,2018-12-07 21:19:00,BTC/USD,3342.24,3351.14,3342.24,3346.37,21.11950697
1544217480,2018-12-07 21:18:00,BTC/USD,3336.02,3336.02,3336.02,3336.02,0.0
1544217420,2018-12-07 21:17:00,BTC/USD,3332.26,3336.02,3330.69,3336.02,3.28495056
Expected Output:
unix,date,symbol,open,high,low,close,Volume BTC
1544217600,2018-12-07 21:20:00,BTCUSD,3348.77,3350.41,3345.07,3345.12,3.11919918
1544217540,2018-12-07 21:19:00,BTCUSD,3342.24,3351.14,3342.24,3346.37,21.11950697
1544217480,2018-12-07 21:18:00,BTCUSD,3336.02,3336.02,3336.02,3336.02,0.0
1544217420,2018-12-07 21:17:00,BTCUSD,3332.26,3336.02,3330.69,3336.02,3.28495056
# importing pandas module
import pandas as pd

# reading the csv file
data = pd.read_csv("sample.txt")

# overwriting the column with the replaced value;
# regex=False treats the pattern as a literal string
data["symbol"] = data["symbol"].str.replace("BTC/USD", "BTCUSD", regex=False)
You can use str.replace like this:
df['symbol'] = df['symbol'].str.replace('BTC/USD','BTCUSD')
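As a small self-contained check (with a two-row stand-in for the real CSV), passing regex=False makes the replacement a literal string match, so the "/" needs no escaping:

```python
import pandas as pd

# small stand-in for the CSV's symbol column
df = pd.DataFrame({"symbol": ["BTC/USD", "BTC/USD"]})

# regex=False treats the pattern as a literal string; older pandas
# defaulted to regex=True, which also works here but warns
df["symbol"] = df["symbol"].str.replace("BTC/USD", "BTCUSD", regex=False)
print(df["symbol"].tolist())
```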
Is it possible to use pandas after rewriting the data into a CSV using something like this?
import csv
headers = []
cleaned_data = open('cleaned_data.csv', 'w')
writer = csv.writer(cleaned_data)
for row in open("house_prices.csv"):
    # <-- Some body code here to filter out the headers
This is where I want to continue cleaning the data and get rid of rows that contain missing values. I've been told that pandas is the way to go, but I'm not sure if it's OK to use it, since the first step would be to write this code:
import pandas as pd
df = pd.read_csv('house_prices.csv')
which conflicts with my first code, right? So is it possible to remove rows with missing values using this method, or is there another way without importing anything?
Or would it be possible to combine both? i.e.:
import csv
import pandas as pd
headers = []
cleaned_data = open('cleaned_data.csv', 'w')
writer = csv.writer(cleaned_data)
df = pd.read_csv('house_prices.csv')
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
for row in open("house_prices.csv"):
    # <-- Some body code here to filter out the headers
Would that work? This is the first time I'm using pandas.
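A pandas-only sketch of the cleanup (using an inline stand-in for house_prices.csv, since I don't have the real file) would look like this. Note that dropna returns a new frame unless you pass inplace=True, so the result has to be assigned; the csv module and the manual loop aren't needed at all:

```python
import pandas as pd
from io import StringIO

# stand-in for house_prices.csv; the real call would be
# pd.read_csv('house_prices.csv')
csv_text = "price,rooms\n100000,3\n,2\n250000,\n300000,4\n"
df = pd.read_csv(StringIO(csv_text))

# drop any row containing a missing value, then write the result;
# read_csv already handles the header row for you
cleaned = df.dropna(axis=0, how="any")
cleaned.to_csv("cleaned_data.csv", index=False)
print(len(cleaned))
```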
I am trying to read this JSON file in Python using the code below (I want to have all the data in a data frame):
import numpy as np
import pandas as pd
import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
df = pd.read_json('short_desc.json')
df.head()
Data frame head screenshot
using this code I am able to convert only the first row to separated columns:
json_normalize(df.short_desc.iloc[0])
First row screenshot
I want to do the same for whole df using this code:
df.apply(lambda x : json_normalize(x.iloc[0]))
but I get this error:
ValueError: If using all scalar values, you must pass an index
What am I doing wrong?
Thank you in advance
After reading the json file with json.load, you can use pd.DataFrame.from_records. This should create the DataFrame you are looking for.
import json
import pandas as pd

with open('short_desc.json') as f:
    d = json.load(f)

df = pd.DataFrame.from_records(d)
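As a minimal sketch (with made-up records standing in for the parsed short_desc.json), from_records turns a list of dicts into a frame with one row per dict and one column per key:

```python
import pandas as pd

# hypothetical records mimicking the parsed short_desc.json
d = [
    {"id": 1, "short_desc": "first bug"},
    {"id": 2, "short_desc": "second bug"},
]

# from_records builds one row per dict, one column per key
df = pd.DataFrame.from_records(d)
print(df.shape)
```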
I am using Python 3.6 and trying to load a JSON file (350 MB) into a pandas dataframe using the code below. However, I get the following error:
data_json_str = "[" + ",".join(data) + "]"
TypeError: sequence item 0: expected str instance, bytes found
How can I fix the error?
import pandas as pd
# read the entire file into a python array
with open('C:/Users/Alberto/nutrients.json', 'rb') as f:
    data = f.readlines()
# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)
# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ",".join(data) + "]"
# now, load it into pandas
data_df = pd.read_json(data_json_str)
From your code, it looks like you're loading a JSON file which has JSON data on each separate line. read_json supports a lines argument for data like this:
data_df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
Note
Remove lines=True if you have a single JSON object instead of individual JSON objects on each line.
Using the json module you can parse the json into a python object, then create a dataframe from that:
import json
import pandas as pd
with open('C:/Users/Alberto/nutrients.json', 'r') as f:
    data = json.load(f)

df = pd.DataFrame(data)
If you open the file in binary mode ('rb'), you will get bytes. How about opening it in text mode instead ('rU' is deprecated in Python 3; plain 'r' already handles universal newlines):
with open('C:/Users/Alberto/nutrients.json', 'r') as f:
Also as noted in this answer you can also use pandas directly like:
df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
If you want to convert it into an array of JSON objects, I think this will do what you want:
import json

data = []
with open('nutrients.json', errors='ignore') as f:
    for line in f:
        data.append(json.loads(line))

print(data[0])
The easiest way to read a line-delimited JSON file using pandas is:
pd.read_json("sample.json", lines=True, orient='columns')
To deal with nested JSON like this:
[[{"Value1": 1}, {"value2": 2}], [{"value3": 3}, {"value4": 4}], ...]
use basic Python indexing; note the key must be a quoted string:
value1 = df['column_name'][0][0].get('Value1')
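For example, with a hypothetical nested column shaped like the example above, the chain indexes into the cell (a list), then into the dict, then looks up the key:

```python
import pandas as pd

# hypothetical nested structure like the [[{...}], [{...}]] example
nested = [[{"Value1": 1}, {"value2": 2}], [{"value3": 3}, {"value4": 4}]]
df = pd.DataFrame({"column_name": nested})

# cell -> first dict in the list -> value for the quoted key
value1 = df["column_name"][0][0].get("Value1")
print(value1)
```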
Please see the code below:
#call the pandas library
import pandas as pd
#set the file location as URL or filepath of the json file
url = 'https://www.something.com/data.json'
#load the json data from the file to a pandas dataframe
df = pd.read_json(url, orient='columns')
#display the top 10 rows from the dataframe (this is to test only)
df.head(10)
Please review the code and modify it based on your needs. I have added comments to explain each line of code. Hope this helps!
CSV file, saved as stack.csv:
PROBLEM_CODE;OWNER_EMAIL;CALENDAR_YEAR;CALENDAR_QUARTER
CONFIG_ASSISTANCE;dalangle#gmail.com;2014;2014Q3
ERROR_MESSAGES;aganju#gmail.com;2014;2014Q3
PASSWORD_RECOV;dalangle#gmail.com;2014;2014Q3
ERROR_MESSAGES;biyma#gmail.com;2014;2014Q3
ERROR_MESSAGES;derrlee#gmail.com;2014;2014Q3
SOFTWARE_FAILURE;dalangle#gmail.com;2014;2014Q3
ERROR_MESSAGES;maariano#gmail.com;2014;2014Q3
SOFTWARE_FAILURE;dalangle#gmail.com;2014;2014Q3
My Code:
import pandas as pd
import csv
data = pd.read_csv('stack.csv', sep='delimiter')
min_indices = (data['OWNER_EMAIL'] == dalangle#gmail.com)
data = data[min_indices]
data.to_csv('isabevdata.csv')
Error:
KeyError: 'OWNER_EMAIL'
I need help with this code using pandas. Later on I want to remove some columns from the result, isabevdata.csv, using the petl module, and then send the table to the front end for display.
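As a hedged sketch of a fix (using an inline stand-in for stack.csv): the file is semicolon-delimited, so pass sep=';' — sep='delimiter' is treated as a regex that never matches, which leaves the whole header in one column and is why data['OWNER_EMAIL'] raised a KeyError. The email also has to be a quoted string, since a bare dalangle#gmail.com starts a Python comment at the #:

```python
import pandas as pd
from io import StringIO

# stand-in for stack.csv; the real call would be
# pd.read_csv('stack.csv', sep=';')
csv_text = (
    "PROBLEM_CODE;OWNER_EMAIL;CALENDAR_YEAR;CALENDAR_QUARTER\n"
    "CONFIG_ASSISTANCE;dalangle#gmail.com;2014;2014Q3\n"
    "ERROR_MESSAGES;aganju#gmail.com;2014;2014Q3\n"
)

# sep=';' splits the semicolon-delimited fields into real columns
data = pd.read_csv(StringIO(csv_text), sep=";")

# the email must be a quoted string literal
min_indices = data["OWNER_EMAIL"] == "dalangle#gmail.com"
data = data[min_indices]
data.to_csv("isabevdata.csv", index=False)
print(len(data))
```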