This is not about reading large JSON files; instead, it is about reading a large number of JSON files in the most efficient way.
Question
I am working with the last.fm dataset from the Million Song Dataset.
The data is available as a set of JSON-encoded text files where the keys are: track_id, artist, title, timestamp, similars and tags.
Currently I'm reading them into pandas in the following way, having gone through a few options, as this was the fastest (as shown here):
import os
import pandas as pd
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json
# Path to the dataset
path = "../lastfm_train/"
# Getting list of all json files in dataset
all_files = [os.path.join(root, file) for root, dirs, files in os.walk(path) for file in files if file.endswith('.json')]
data_list = [json.load(open(file)) for file in all_files]
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)
The current method reads a subset (1% of the full dataset) in less than a second. However, reading the full train set is far too slow (I have waited a couple of hours) and has become a bottleneck for further tasks, such as the one shown in the question here.
I'm also using ujson for speed when parsing the JSON files, as can be seen from this question here.
UPDATE 1
Using a generator expression instead of a list comprehension:
data_list = (json.load(open(file)) for file in all_files)
If you need to read and write the dataset multiple times, you could try converting the .json files into a faster format. For example, in pandas 0.20+ you could use the .feather format.
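A minimal sketch of such a caching step (assuming the DataFrame has already been built from the JSON files, has a default RangeIndex, and that the pyarrow/feather dependency is installed; the file name is only illustrative):
import pandas as pd
# toy stand-in for the DataFrame built from the JSON files
df = pd.DataFrame({"track_id": ["TR0001"], "title": ["some track"]})
# one-time conversion to Feather (needs a default RangeIndex)
df.to_feather("lastfm_train.feather")
# later runs can load the cached copy instead of re-parsing the JSON
df = pd.read_feather("lastfm_train.feather")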
I would build an iterator over the files and just yield the two columns you want.
Then you can instantiate a DataFrame with that iterator.
import os
import json
import pandas as pd
# Path to the dataset
path = "../lastfm_train/"
def data_iterator(path):
    for root, dirs, files in os.walk(path):
        for f in files:
            if f.endswith('.json'):
                fp = os.path.join(root, f)
                with open(fp) as o:
                    data = json.load(o)
                yield {"similars": data["similars"], "track_id": data["track_id"]}
df = pd.DataFrame(data_iterator(path))
df.set_index('track_id', inplace=True)
This way you only go over your file list once, and you won't duplicate the data before and after passing it to the DataFrame.
Related
This is the link from which I want the CSV files: http://archive.ics.uci.edu/ml/datasets/selfBACK
My approach right now is to download it locally by simply clicking it. But this folder has a lot of different folders with many CSVs in them. How do I import them in an efficient manner?
I know how to do it one by one but I feel there has to be a more efficient way.
You can first read all paths in that folder and filter for CSV files (or add other filters, e.g. for specific file names). After that, combine the files; here I use pandas, assuming the data is tabular and structured in the same way.
import os
import pandas as pd
path = 'your_folder_path'
dfs = [pd.read_csv(os.path.join(path, f)) for f in os.listdir(path) if f.endswith('.csv')]
# combine them (if they have the same format) like this:
df = pd.concat(dfs)
Note: you could also make a dictionary instead (key=filename, value=dataframe) and then access the data by using the filename.
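A quick sketch of that dictionary variant, under the same assumption that the folder contains flat CSV files with the same structure:
import os
import pandas as pd
path = 'your_folder_path'
# key = file name, value = the DataFrame read from that file
dfs_by_name = {f: pd.read_csv(os.path.join(path, f))
               for f in os.listdir(path) if f.endswith('.csv')}
# access a single file's data by its file name, e.g.
# dfs_by_name['some_file.csv'].head()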
I've been searching for a way to merge all CSV files in a folder. They all have the same headers, but different names. I've found some videos on YouTube about merging and some questions here on Stack Overflow that touch on the matter. The problem is that these tutorials are focused on files with similar names, such as: sales1, sales2, etc.
In my case, all files in the directory are CSVs and are located in 'D:\XXXX\XXXX\output'
The code I have used is:
import pandas as pd
# set files path
amazon = r'D:\XXXX\XXXX\output\amazonbooks.csv'
bookcrossing = r'D:\XXXX\XXXX\output\bookcrossing.csv'
# merge files
dataFrame = pd.concat(
    map(pd.read_csv, [amazon, bookcrossing]), ignore_index=True)
print(dataFrame)
It would be better if the code could merge all the files that are in the output folder (since all of them are .csv), instead of naming each one of them.
I'd be glad if anyone can help me with this problem or guide me on how to solve it.
If the goal is to append the files into a single result, you don't really need any CSV processing at all: just write each file's contents minus the header line (except for the first file). glob will return the file names (with path) that match the pattern "*.csv".
from glob import glob
import os
import shutil
csv_dir = r'D:\XXXX\XXXX\output'
result_csv = r'd:\XXXX\XXXX\combined.csv'
first_hdr = True
# all .csv files in the directory have the same header
with open(result_csv, "w", newline="") as result_file:
    for filename in glob(os.path.join(csv_dir, "*.csv")):
        with open(filename) as in_file:
            header = in_file.readline()
            if first_hdr:
                result_file.write(header)
                first_hdr = False
            shutil.copyfileobj(in_file, result_file)
(assuming all CSVs have an equal number of columns)
Try something like this:
import os
import pandas as pd
path = r'D:\XXXX\XXXX\output'
csvs = [file for file in os.listdir(path) if file.endswith('.csv')]
result_df = pd.concat([pd.read_csv(os.path.join(path, file)) for file in csvs])
I'm working with JSON file types and I've created some code that will open a single file and add it to a pandas dataframe, performing some procedures on the data within. A snippet of this code follows:
response_dic = first_response.json()
print(response_dic)
base_df = pd.DataFrame(response_dic)
base_df.head()
The code then goes on to extract parts of the JSON data into dataframes, before merging and printing to CSV.
Where I want to develop the code is to have it iterate through a folder first, find filenames that match my list of filenames that I want to work on, and then perform the functions on those files. For example, I have a folder with 1000 docs and will only need to perform the function on a sample of these.
I've created a CSV list of the account codes that I want to work on; I've then imported the CSV details and created a list of account codes as follows:
import csv

csv_file = open(r'C:\filepath', 'r')
cikas = []
cikbs = []
csv_file.readline()
for a, b, c in csv.reader(csv_file, delimiter=','):
    cikas.append(a)
    cikbs.append(b)
midstring = [s for s in cikbs]
print(midstring)
My account names are then stored in midstring, for example ['12345', '2468', '56789']. This means I can control which account codes are worked on by amending my CSV file in the future. These names will vary at different stages, hence I don't want to hard-code them at this stage.
What I would like the code to do is check the working directory, see if there is a file that matches, for example, C:\Users*12345.json. If there is, perform the pandas procedures on it, then move to the next file. Is this possible? I've tried a number of tutorials involving glob, iglob, fnmatch etc., but I'm struggling to come up with a workable solution.
You can list all the files with a .json extension in the current directory first.
import os, json
import pandas as pd
path_to_json = 'currentdir/'
json_files = [json_file for json_file in os.listdir(path_to_json) if json_file.endswith('.json')]
print(json_files)
Now iterate over the list of json_files and perform a check:
# example list: json_files = ['12345.json', '2468.json', '56789.json']
# midstring = ['12345', '2468', '56789']
for file in json_files:
    if file.split('.')[0] in midstring:
        with open(os.path.join(path_to_json, file)) as f:
            data = json.load(f)
        df = pd.DataFrame(data)
        # perform pandas functions
    else:
        continue
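Alternatively, since you mentioned glob, you could build a pattern from each account code and let glob do the matching; a rough sketch (the folder path and pattern are placeholders, and the DataFrame is built the same way as in your single-file code):
import glob
import json
import os
import pandas as pd
path_to_json = 'currentdir/'
for code in midstring:
    # matches e.g. '12345.json' or anything ending in '12345.json'
    for fp in glob.glob(os.path.join(path_to_json, '*{}.json'.format(code))):
        with open(fp) as f:
            df = pd.DataFrame(json.load(f))
        # perform the pandas procedures on df here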
I have multiple zip files in a folder, and within the zip files are multiple CSV files.
Not all the CSV files have all the columns, but a few of them do.
How can I use a file that has all the columns as a template, loop through the rest to extract all the data into one dataframe, and save it as a single CSV for further use?
The code I am following right now is as below:
import glob
import zipfile
import pandas as pd
dfs = []
for zip_file in glob.glob(r"C:\Users\harsh\Desktop\Temp\*.zip"):
    zf = zipfile.ZipFile(zip_file)
    dfs += [pd.read_csv(zf.open(f), sep=";", encoding='latin1') for f in zf.namelist()]
df = pd.concat(dfs,ignore_index=True)
print(df)
However, I am not getting the columns and headers at all. I am stuck at this stage.
If you'd like to know the file structure, please find the output of the code here and the example CSV file here.
If you would like to see my project files for this code, please find the shared Google Drive link here.
Also, at the risk of sounding redundant, why am I required to use the sep=";", encoding='latin1' part? The code gives me an error without it.
I want to open multiple CSV files in Python, collate them, and have Python create a new file with the data from the multiple files reorganised...
Is there a way for me to read all the files from a single directory on my desktop and read them in python like this?
Thanks a lot
If you have a directory containing your CSV files, and they all have the extension .csv, then you could use, for example, glob and pandas to read them all in and concatenate them into one CSV file. For example, say you have a directory like this:
csvfiles/one.csv
csvfiles/two.csv
where one.csv contains:
name,age
Keith,23
Jane,25
and two.csv contains:
name,age
Kylie,35
Jake,42
Then you could do the following in Python (you will need to install pandas with, e.g., pip install pandas):
import glob
import os
import pandas as pd
# the path to your csv file directory
mycsvdir = 'csvdir'
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
# print out to a new csv file
result.to_csv('all.csv')
Note that the output csv file will have an additional column at the front containing the index of the row. To avoid this you could instead use:
result.to_csv('all.csv', index=False)
You can see the documentation for the to_csv() method here.
Hope that helps.
Here is a very simple way to do what you want to do.
import pandas as pd
import glob, os
os.chdir("C:\\your_path\\")
results = pd.DataFrame([])
for counter, file in enumerate(glob.glob("1*")):
    namedf = pd.read_csv(file, skiprows=0, usecols=[1, 2, 3])
    results = results.append(namedf)
results.to_csv('C:\\your_path\\combinedfile.csv')
Notice this part: glob("1*")
This will look only for files that start with '1' in the name (1, 10, 100, etc). If you want everything, change it to this: glob("*")
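One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same combining step can be written with a list and pd.concat; a minimal equivalent sketch:
import glob, os
import pandas as pd
os.chdir("C:\\your_path\\")
# read every matching file, then concatenate once at the end
frames = [pd.read_csv(file, skiprows=0, usecols=[1, 2, 3])
          for file in glob.glob("1*")]
results = pd.concat(frames, ignore_index=True)
results.to_csv('C:\\your_path\\combinedfile.csv')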
Sometimes it's necessary to merge all CSV files into a single CSV file, and sometimes you just want to merge some files that match a certain naming convention. It's nice to have this feature!
I know that the post is a little bit old, but using glob can be quite expensive in terms of memory if you are trying to read large CSV files, because you will store all that data in a list, and then you'll still have to have enough memory to concatenate the dataframes in that list into a dataframe with all the data. Sometimes this is not possible.
import pandas as pd

dir = 'directory path'
df = pd.DataFrame()
for i in range(0, 24):
    csvfile = pd.read_csv(dir + '/file name{i}.csv'.format(i=i), encoding='utf8')
    df = df.append(csvfile)
    del csvfile
So, in case your CSV files have the same name except for some kind of number or string that differentiates them, you could just do a for loop through the files and delete each one after it is stored in a dataframe variable using DataFrame.append. In this case all my CSV files have the same name except that they are numbered in a range that goes from 0 to 23.
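If even the combined dataframe is too big for memory, a variation on the same idea is to append each file straight to one output CSV and never hold more than a single file in memory; a rough sketch under the same naming assumptions (the output file name is illustrative and should not already exist):
import pandas as pd
dir = 'directory path'
out_path = dir + '/combined.csv'
for i in range(0, 24):
    chunk = pd.read_csv(dir + '/file name{i}.csv'.format(i=i), encoding='utf8')
    # write the header only for the first file, then append without it
    chunk.to_csv(out_path, mode='a', header=(i == 0), index=False)
    del chunk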