I am trying to iterate through JSON files in a folder and append them all into one pandas DataFrame.
If I say
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
import os
directory_in_str = 'building_data'
directory = os.fsencode(directory_in_str)
df_all = pd.DataFrame()
with open("building_data/rooms.json") as file:
data = json.load(file)
df = json_normalize(data['rooms'])
df_y.append(df, ignore_index=True)
I get a DataFrame with the data from the one file. If I turn this into a for loop, I have tried:
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
import os
directory_in_str = 'building_data'
directory = os.fsencode(directory_in_str)
df_all = pd.DataFrame()
for file in os.listdir(directory):
    filename = os.fsdecode(file)  # decode the bytes name returned by os.listdir on an fsencode'd path
    with open(directory_in_str + '/' + filename) as file:
        data = json.load(file)
    df = json_normalize(data['rooms'])
    df_all.append(df, ignore_index=True)
print(df_all)
This returns an empty dataframe. Does anyone know why this is happening? If I print df before appending it, it prints the correct values, so I am not sure why it is not appending.
Thank you!
Instead of appending the next DataFrame, I would try to join them like this:
if df_all.empty:
    df_all = df
else:
    df_all = df_all.join(df)
When joining DataFrames, you can specify what they should be joined on - the index or a specific (key) column - as well as how the join should be performed (the default is 'left').
Here are the docs for pandas.DataFrame.join.
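For illustration, here is a minimal sketch of those two parameters; the frames below are made up for the example and are not taken from the question:
import pandas as pd

# Hypothetical frames, only to demonstrate `on` and `how`.
rooms = pd.DataFrame({'room_id': [1, 2], 'name': ['Lab', 'Office']})
sizes = pd.DataFrame({'area': [40, 25]}, index=[1, 2])  # indexed by room_id

# Match the caller's 'room_id' column against the other frame's index;
# how='left' (the default) keeps every row of `rooms`.
merged = rooms.join(sizes, on='room_id', how='left')
print(merged)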
In these instances I load everything from the JSON files into a list by appending each file's returned dicts onto that list, then pass the list to pandas.DataFrame.from_records (docs).
In this case the source would become something like...
import pandas as pd
import json
import os

directory_in_str = 'building_data'
directory = os.fsencode(directory_in_str)

json_data = []
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(directory_in_str + '/' + filename) as f:
        data = json.load(f)
    json_data.extend(data['rooms'])  # collect the raw room dicts from each file
df_all = pd.DataFrame.from_records(json_data)
print(df_all)
I am writing a Python script that will read all the CSV files in the current location and merge them into a single CSV file. Below is my code:-
import os
import numpy as np
import pandas as pd
import glob

path = os.getcwd()
extension = 'csv'
os.chdir(path)
tables = glob.glob('*.{}'.format(extension))
data = pd.DataFrame()
for i in tables:
    try:
        df = pd.read_csv(r'' + path + '/' + i + '')
        # Here I want to create an index column with the name of the file and leave that column empty
        df[i] = np.NaN
        df.set_index(i, inplace=True)
        # Below line appends an empty row for easy differentiation
        df.loc[df.iloc[-1].name + 1, :] = np.NaN
        data = data.append(df)
    except Exception as e:
        print(e)
data.to_csv('final_output.csv', index=False, header=None)
If I remove the below lines of code then it works:-
df[i] = np.NaN
df.set_index(i, inplace=True)
But I want to have the first column name as the name of the file and its values NaN or empty.
I want the output to look something like this:-
I tend to avoid the .append method in favor of pandas.concat
Try this:
import os
from pathlib import Path
import pandas as pd
files = Path(os.getcwd()).glob('*.csv')
df = pd.concat([
    pd.read_csv(f).assign(filename=f.name)
    for f in files
], ignore_index=True)
df.to_csv('alldata.csv', index=False)
I'm having a hard time loading multiple line delimited JSON files into a single pandas dataframe. This is the code I'm using:
import os, json
import pandas as pd
import numpy as np
import glob
pd.set_option('display.max_columns', None)
temp = pd.DataFrame()
path_to_json = '/Users/XXX/Desktop/Facebook Data/*'
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)
for file in file_list:
    data = pd.read_json(file, lines=True)
    temp.append(data, ignore_index = True)
It looks like all the files are loading when I look through file_list, but cannot figure out how to get each file into a dataframe. There are about 50 files with a couple lines in each file.
Change the last line to:
temp = temp.append(data, ignore_index = True)
The reason we have to do this is that the append doesn't happen in place: the append method does not modify the original DataFrame, it returns a new DataFrame containing the result of the append.
Edit:
Since writing this answer I have learned that you should never use DataFrame.append inside a loop because it leads to quadratic copying (see this answer).
What you should do instead is first create a list of data frames and then use pd.concat to concatenate them all in a single operation. Like this:
dfs = [] # an empty list to store the data frames
for file in file_list:
    data = pd.read_json(file, lines=True) # read data frame from json file
    dfs.append(data) # append the data frame to the list
temp = pd.concat(dfs, ignore_index=True) # concatenate all the data frames in the list.
This alternative should be considerably faster.
If you need to flatten the JSON, Juan Estevez’s approach won’t work as is. Here is an alternative:
import json
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        json_data = pd.json_normalize(json.loads(f.read()))
        dfs.append(json_data)
df = pd.concat(dfs, sort=False)  # or sort=True depending on your needs
Or if your JSON files are line-delimited (not tested):
import json
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        for line in f.readlines():
            json_data = pd.json_normalize(json.loads(line))
            dfs.append(json_data)
df = pd.concat(dfs, sort=False)  # or sort=True depending on your needs
from pathlib import Path
import pandas as pd
paths = Path("/home/data").glob("*.json")
df = pd.DataFrame([pd.read_json(p, typ="series") for p in paths])
I combined Juan Estevez's answer with glob. Thanks a lot.
import pandas as pd
import glob

def readFiles(path):
    files = glob.glob(path)
    dfs = []  # an empty list to store the data frames
    for file in files:
        data = pd.read_json(file, lines=True)  # read data frame from json file
        dfs.append(data)  # append the data frame to the list
    df = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list
    return df
Maybe you should state whether the JSON files were created with pandas' pd.to_json() or in some other way.
I used data that was not created with pd.to_json(), and I don't think pd.read_json() can be used in my case. Instead, I programmed a customized for-each loop that reads each file and writes everything into the DataFrame, roughly as sketched below.
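For what it's worth, a rough sketch of that kind of custom loop might look like the following; the folder name and the 'records' key are assumptions made for the illustration, not details of my data:
import glob
import json
import pandas as pd

rows = []
for path in glob.glob('data/*.json'):           # hypothetical folder
    with open(path) as f:
        payload = json.load(f)
    for record in payload.get('records', []):   # hypothetical top-level key
        record['source_file'] = path            # remember which file the row came from
        rows.append(record)

df = pd.DataFrame(rows)  # build the frame once, instead of appending inside the loop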
I have an xls file which contains multiple sheets; I want to merge all of these sheets into one single sheet.
import numpy as np
import pandas as pd
import glob
import os
import xlrd
df = pd.concat(map(pd.read_excel, glob.glob(os.path.join('', "bank.xls"))))
I tried this and got a warning:
WARNING *** file size (25526815) not 512 + multiple of sector size (512)
And nothing happened. I want to concatenate all of these sheets.
This works for me (just tested).
import pandas as pd

input_file = 'C:\\your_path\\Book1.xlsx'
output_file = 'C:\\your_path\\BookFinal.xlsx'

df = pd.read_excel(input_file, None)  # sheet_name=None loads every sheet into a dict of DataFrames
all_df = []
for key in df.keys():
    all_df.append(df[key])
data_concatenated = pd.concat(all_df, axis=0, ignore_index=True)

writer = pd.ExcelWriter(output_file)
data_concatenated.to_excel(writer, sheet_name='merged', index=False)
writer.save()
I am currently trying to change the headings of the file I am creating. The code I am using is as follows:
import pandas as pd
import os, sys
import glob

path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_ = []
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, low_memory=False)
    output = (df['logid'].value_counts())
    list_.append(output)
df1 = pd.DataFrame()
df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')
Basically I am looping through a file directory and extracting data from each file. Using this outputs the following image:
http://imgur.com/a/LE7OS
All I want to do is change the column name from 'logid' to the name of the file currently being read, but I am not sure how to do this. Any help is great! Thanks.
Instead of appending the raw value counts, create a DataFrame from them and set its column name based on the file, i.e.
output = pd.DataFrame(df['logid'].value_counts())
output.columns = [os.path.basename(fname).split('.')[0]]
list_.append(output)
The corresponding changes applied to the code in the question:
import pandas as pd
import os, sys
import glob

path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_ = []
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    output = pd.DataFrame(df['logid'].value_counts())
    output.columns = [os.path.basename(fname).split('.')[0]]
    list_.append(output)
df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')
Hope it helps
Probably a very easy fix, but my English isn't good enough to search for the right answer.
Python/pandas is changing the numbers that I'm writing, from 6570631401430749 to something like 6.17063140131e+15.
I'm merging hundreds of csv files, and this one column comes out all wrong. The name of the column is "serialnumber" and it's the 3rd column.
import pandas as pd
import glob
import os

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    frame = pd.read_csv(filename)
    print(os.path.basename(filename))
    frame['filename'] = os.path.basename(filename)
    df_list.append(frame)

full_df = pd.concat(df_list)
full_df.to_csv('output.csv', encoding='utf-8-sig')
You can use dtype=object when you read the CSV if you want to preserve the data in its original form. You can change your code to:
import pandas as pd
import glob
import os

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    frame = pd.read_csv(filename, dtype=object)  # read every column as object so the values are kept verbatim
    print(os.path.basename(filename))
    frame['filename'] = os.path.basename(filename)
    df_list.append(frame)

full_df = pd.concat(df_list)
full_df.to_csv('output.csv', encoding='utf-8-sig')
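If only the serial-number column needs protecting, a per-column dtype may also work; here is a small sketch, assuming the column really is named 'serialnumber' as in the question (the file name below is just a placeholder):
import pandas as pd

# Parse every other column normally, but keep the serial numbers as text.
frame = pd.read_csv('some_file.csv', dtype={'serialnumber': str})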