I have a folder structure that looks like this:
data
    ABC
        monday_data
            monday.json
        tuesday_data
            tuesday.json
    XYZ
        wednesday_data
            wednesday.json
        etc.
I want to unpack all of these JSON files into a pandas dataframe in Python.
I have spent a lot of time trying to get this to work, but without success.
What would be the most efficient way to do this?
You can use rglob from pathlib.Path to get the paths of all files under a directory that end with a given extension:
from pathlib import Path
for path in Path('data').rglob('*.json'):
    print(path)
Output:
data\ABC\monday_data\monday.json
data\ABC\tuesday_data\tuesday.json
data\XYZ\wednesday_data\wednesday.json
Now you can simply read this data into a dataframe according to your requirements:
import os
import glob
import pandas as pd
# set the path to the directory where the JSON files are located
path = 'data/'
# use glob to find all the JSON files in the directory + its subdirectories
json_files = glob.glob(os.path.join(path, '**/*.json'), recursive=True)
This is how you can get all the paths to your JSON files. I am not sure exactly how you want to load them into a dataframe, but you can try something like this:
# create an empty list to store the dataframes
dfs = []
# loop over the JSON files and read each file into a dataframe
for file in json_files:
    df = pd.read_json(file)
    dfs.append(df)
# concatenate the dataframes into a single dataframe
df = pd.concat(dfs, ignore_index=True)
Related
I have a folder which contains files like ANGOSTURA_U1_20220901.csv, ANGOSTURA_U1_20220902.csv, ANGOSTURA_U1_20220903.csv.
I want to read all the files, concatenate them into one dataframe, and write the result out as ANGOSTURA_U1_202209_month.csv.
Take into consideration that the files could instead be called Colbun_U1_20220801.csv, Colbun_U1_20220802.csv, Colbun_U1_20220803.csv. The output file name should always be the base name plus the month: in that case it would be Colbun_U1_202208_month.csv. If the files are ANGOSTURA_U1_XXXX01.csv, the output file name is ANGOSTURA_U1_XXXX_month.csv; if the files are Colbun_U2_XXXX01.csv, the output file name is Colbun_U2_XXXX_month.csv. The folder always contains either Colbun or Angostura files, never both.
This is my code (I tried both os.listdir and glob.glob):
import pandas as pd
import numpy as np
import glob
import os
import csv
all_files = glob.glob("C:/Users/ep_irojaso/Desktop/PROGRAMA DESEMPEÑO/saturnmensual/*.csv")
file_list = []
for f in all_files:
    data = pd.read_csv(f, usecols=["t", "f"])
    file_list.append(data)
df = pd.concat(file_list, ignore_index=True)
df.to_csv(f'C:/Users/ep_irojaso/Desktop/PROGRAMA DESEMPEÑO/Saturn2mensual/{os.path.basename(f).split(".")[0]}_mensual.csv')
You could try the following:
from itertools import groupby
from pathlib import Path

import pandas as pd

def key(file_path):
    return file_path.stem[:-2]

base = Path("C:/Users/ep_irojaso/Desktop/PROGRAMA DESEMPEÑO/saturnmensual/")
all_files = sorted(base.glob("*.csv"))
for month, files in groupby(all_files, key=key):
    pd.concat(
        [pd.read_csv(file, usecols=["t", "f"]) for file in files]
    ).to_csv(base / f"{month}_month.csv", index=False)
Use pathlib from the standard library instead of os: set the base path to the folder that contains the CSV files.
Glob all CSV files in it and sort them into a list all_files.
Now group the files into monthly buckets with groupby from the standard library module itertools. The grouping key is the file name without the extension and without the last two characters (the day, according to your specification).
Then concat all the dataframes from one month and write the resulting dataframe to a new CSV file.
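The grouping key can be illustrated on a few hypothetical file names (made up here to follow the naming pattern from the question):

```python
from itertools import groupby
from pathlib import Path

# Hypothetical file names following the pattern from the question
files = [Path("ANGOSTURA_U1_20220901.csv"),
         Path("ANGOSTURA_U1_20220902.csv"),
         Path("ANGOSTURA_U1_20221001.csv")]

# .stem drops the ".csv" extension; [:-2] drops the two day digits,
# leaving the base name plus year and month
for month, group in groupby(files, key=lambda p: p.stem[:-2]):
    print(month, [p.name for p in group])
# ANGOSTURA_U1_202209 ['ANGOSTURA_U1_20220901.csv', 'ANGOSTURA_U1_20220902.csv']
# ANGOSTURA_U1_202210 ['ANGOSTURA_U1_20221001.csv']
```

Note that groupby only merges consecutive items with the same key, which is why the files are sorted first in the answer above.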
I currently have several csv files in a folder. I want to use Python to loop over the files in the folder and make small changes to each csv file. Please see my code below, which is not currently working:
import os
import pandas as pd
folder_to_view = "C:/path"
for file in os.listdir(folder_to_view):
    df = pd.read_csv(file)
    df.columns = ['Location','Subscriber','Speed','IP','Start','End','Bytes','Test Status','Comment']
    df.to_csv(file, index=False)
I imagine the issue is that the path is not being formed correctly; the column renaming itself should be fine. os.listdir() returns a list of the file names within that directory without the directory name prepended, so try this:
import os
import pandas as pd
folder_to_view = "C:/path"
for file in os.listdir(folder_to_view):
    full_path = f'{folder_to_view}/{file}'
    df = pd.read_csv(full_path)
    df.columns = ['Location','Subscriber','Speed','IP','Start','End','Bytes','Test Status','Comment']
    df.to_csv(full_path, index=False)
I found here how to import multiple text files into one dataframe. However, it gives an error. The files are named footballseason1, footballseason2, footballseason3, ... (up to footballseason5000).
import pandas as pd
import datetime as dt
import os, glob
os.chdir("~/Downloads/data")
filenames = [i for i in glob.glob("*.txt")]
FileNotFoundError: [Errno 2] No such file or directory: '~/Downloads/data'
However, if I try to import one file, everything is working and the directory is found
df = pd.read_csv("~/Downloads/data/footballseason1.txt", sep=",")
Could you help fix the problem? And is there a way to do it without changing the directory, simply doing all the steps using the path where the files are located?
Python's os module does not understand ~ by default, so it needs to be expanded manually:
filenames = [i for i in glob.glob(os.path.expanduser("~/Downloads/data/*.txt"))]
You can use a list comprehension and pd.concat like below. Note that glob does not expand ~ either, so it needs the same os.path.expanduser treatment:
df = pd.concat([pd.read_csv(i, sep=',') for i in glob.glob(os.path.expanduser("~/Downloads/data/*.txt"))])
Via pathlib ->
import pandas as pd
from pathlib import Path

# expanduser() resolves the leading "~" to the home directory
inp_path = Path("~/Downloads/data").expanduser()
df = pd.concat([
    pd.read_csv(txt_file, sep=',') for txt_file in inp_path.glob('*.txt')
])
With added check ->
import pandas as pd
from pathlib import Path

inp_path = Path("~/Downloads/data").expanduser()
if inp_path.exists():
    df = pd.concat([
        pd.read_csv(txt_file, sep=',') for txt_file in inp_path.glob('*.txt')
    ])
else:
    print("input dir doesn't exist, please check the path")
Importing Data from Multiple Files
Now let’s see how we can import data from multiple files in a specific directory. There are many ways to do so, but I personally believe this is a simple way that is easy to use and understand, especially for beginners.
1) First, we are going to import the os and glob libraries. We need them to navigate through different working directories and get their paths.
import os
import glob
2) We also need to import the pandas library as we need to work with data frames.
import pandas as pd
3) Let’s change our working directory to the directory where we have all the data files.
os.chdir(r"C:\Users\HARISH\Path_for_our_files")
4) Now we build a list comprehension that collects all the .csv files in the current working directory:
filenames = [i for i in glob.glob("*.csv")]
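The tutorial stops at collecting the file names, so here is a minimal sketch of the final step: reading each collected file and stacking them into one dataframe. The sample files and the column name "a" are made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Two small sample CSV files standing in for the real data files
workdir = tempfile.mkdtemp()
os.chdir(workdir)
pd.DataFrame({"a": [1, 2]}).to_csv("one.csv", index=False)
pd.DataFrame({"a": [3]}).to_csv("two.csv", index=False)

# Step 4 from the text: collect every .csv in the working directory
filenames = [i for i in glob.glob("*.csv")]

# Read each file and concatenate into a single dataframe
combined = pd.concat((pd.read_csv(f) for f in filenames), ignore_index=True)
```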
I would like to automatically import all csv files that are in one folder as dataframes and set the dataframe's variable name to the respective filename.
For example, in the folder are the following three files: data1.csv, data2.csv and data3.csv
How can I automatically import all three files having three dataframes (data1, data2 and data3) as the result?
If you want to save each dataframe as a variable named after its file name, you can use exec. But this is not secure: it could allow code injection.
import pandas
import os
path = "path_of_directory"
files = os.listdir(path) # Returns list of files in the folder which is specifed path
for file in files:
    if file.endswith(".csv"):  # check whether the file ends with .csv
        # os.path.join builds the full path; note the quotes around the
        # path inside the exec string, which the original was missing
        exec(f"{file[:-4]} = pandas.read_csv(r'{os.path.join(path, file)}')")
You can loop over the directory using pathlib and build a dictionary of name->DataFrame, eg:
import pathlib
import pandas as pd
dfs = {path.stem: pd.read_csv(path) for path in pathlib.Path('thepath/').glob('*.csv')}
Then access as dfs['test1'] etc...
Since the answer that was given uses an exec command, and munir.aygun already warned you what could go wrong with that approach, I want to show the way to do it that Justin Ezequiel and munir.aygun already suggested:
import os
import glob
import pandas as pd
# Path to your data
path = r'D:\This\is\your\path'
# Get all .csv files at your path
allFiles = glob.glob(path + "/*.csv")
# Read in the data from the files and save it to a dictionary
dataStorage = {}
for filename in allFiles:
    name = os.path.basename(filename).split(".")[0]
    dataStorage[name] = pd.read_csv(filename)
# Can then be used like this (for printing here)
if "data1" in dataStorage:
    print(dataStorage["data1"])
Hope this can still be helpful.
I want to open multiple csv files in python, collate them and have python create a new file with the data from the multiple files reorganised...
Is there a way for me to read all the files from a single directory on my desktop and read them in python like this?
Thanks a lot
If you have a directory containing your csv files, and they all have the extension .csv, then you can use, for example, glob and pandas to read them all in and concatenate them into one csv file. For example, say you have a directory like this:
csvfiles/one.csv
csvfiles/two.csv
where one.csv contains:
name,age
Keith,23
Jane,25
and two.csv contains:
name,age
Kylie,35
Jake,42
Then you could do the following in Python (you will need to install pandas with, e.g., pip install pandas):
import glob
import os
import pandas as pd
# the path to your csv file directory
mycsvdir = 'csvdir'
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
# print out to a new csv file
result.to_csv('all.csv')
Note that the output csv file will have an additional column at the front containing the index of the row. To avoid this you could instead use:
result.to_csv('all.csv', index=False)
You can see the documentation for the to_csv() method here.
Hope that helps.
Here is a very simple way to do what you want to do.
import pandas as pd
import glob, os
os.chdir("C:\\your_path\\")
results = []
for file in glob.glob("1*"):
    namedf = pd.read_csv(file, skiprows=0, usecols=[1, 2, 3])
    results.append(namedf)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
results = pd.concat(results)
results.to_csv('C:\\your_path\\combinedfile.csv')
Notice this part: glob("1*")
This will look only for files that start with '1' in the name (1, 10, 100, etc). If you want everything, change it to this: glob("*")
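As a quick illustration, glob-style patterns can be tried out on plain strings with the standard library's fnmatch module, which implements the same matching rules glob uses for file names:

```python
import fnmatch

names = ["1.csv", "10.csv", "100.csv", "2.csv", "report.csv"]

# "1*" keeps only names starting with "1"
print(fnmatch.filter(names, "1*"))  # ['1.csv', '10.csv', '100.csv']

# "*" matches everything
print(fnmatch.filter(names, "*"))   # all five names
```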
Sometimes it's necessary to merge all CSV files into a single CSV file, and sometimes you just want to merge some files that match a certain naming convention. It's nice to have this feature!
I know that the post is a little bit old, but using glob can be quite expensive in terms of memory if you are reading large csv files: you store all that data in a list, and then you still need enough memory to concatenate the dataframes in that list into one dataframe holding all the data. Sometimes this is not possible.
import pandas as pd

dir_path = 'directory path'
df = pd.DataFrame()
for i in range(0, 24):
    # an f-string fills in the file number (the original's .format(i)
    # never filled the named {i} placeholder)
    csvfile = pd.read_csv(f'{dir_path}/file name{i}.csv', encoding='utf8')
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, csvfile])
    del csvfile
So, if your csv files have the same name except for some number or string that differentiates them, you can just loop through the files, concatenate each one into a single dataframe variable, and delete it after it has been stored. In this case all my csv files have the same name except that they are numbered in a range from 0 to 23.