I want to know if there is a way in Python to read multiple CSV files from a folder and assign each one to a separate dataframe named after the file. The code below throws an error, but I'm pasting it to show the idea:
import glob

for filename in glob.glob('*.csv'):
    index = filename.find(".csv")
    if "test" in filename:
        filename[:index] = pd.read_csv(filename)  # invalid: you can't assign to a slice of a string
I believe you need to create a dictionary of DataFrames, keyed by filename:
d = {}
for filename in glob.glob('*.csv'):
    if "test" in filename:
        d[filename[:-4]] = pd.read_csv(filename)
Which is the same as:
d = {f[:-4]: pd.read_csv(f) for f in glob.glob('*.csv') if "test" in f}
If you want only the base name of the file (without the directory), you can use:
d = {os.path.basename(f).split('.')[0]: pd.read_csv(f) for f in glob.glob('*.csv') if "test" in f}
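You can then pull out any single dataframe by its key, the filename without the .csv extension. For example, assuming a file named test_data.csv exists in the folder:

df_test = d['test_data']   # hypothetical file test_data.csv
print(list(d.keys()))      # every filename that was loaded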
Here I am trying to load multiple CSV files and raise an exception if I mistakenly include a CSV with different column headers.
I have done the first part, loading all the CSV files, but I am stuck on the exception part. Can anyone help me with this?
Here is my code:
# import necessary libraries
import pandas as pd
import os
import glob

# use glob to get all the csv files in the folder
path = os.getcwd()  # my folder path with csv files
csv_files = glob.glob(os.path.join(path, "*.csv"))

# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f)

    # print the location and filename
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])

    # print the content
    print('Content:')
    display(df)
    print()
I want an exception to be raised when a CSV file is wrongly entered. For example, if I load 10 CSV files from the /data folder and 9 of them have the column headers id, name, address but the 10th has id, road, street, lane, the code should throw an error saying the CSV files are not identical; otherwise it should concatenate them all into one CSV file.
If you know, or can define, which columns are appropriate, then you can do the following:
appropriate_columns = ['id', 'name', 'address']

for f in csv_files:
    df = pd.read_csv(f)
    # do a set difference of column names and check if length is 0
    if len(set(df.columns) - set(appropriate_columns)) != 0:
        print("Inappropriate Columns exist in file")
    else:
        print('Location:', f)
        print('File Name:', f.split("\\")[-1])
        print('Content:')
        display(df)
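If you actually want the loop to stop with an exception and, when every header matches, to combine everything into one CSV (as the question describes), a minimal sketch could be the following; expected_columns and the combined.csv output name are assumptions, not from the original post:

import glob
import os
import pandas as pd

expected_columns = ['id', 'name', 'address']   # assumed header set
csv_files = glob.glob(os.path.join(os.getcwd(), "*.csv"))

frames = []
for f in csv_files:
    df = pd.read_csv(f)
    # comparing sets catches both extra and missing columns
    if set(df.columns) != set(expected_columns):
        raise ValueError("csv files are not identical: {} has columns {}".format(f, list(df.columns)))
    frames.append(df)

pd.concat(frames, ignore_index=True).to_csv("combined.csv", index=False)   # assumed output name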
I am importing a large list of JSON files. They come from one folder for each year.
The files are imported properly and stored as dataframes in a list. My plan was to concatenate the list and export one CSV per year. The problem is that the concatenation is not working because the list of dataframes is too long (it works when I try with only a few files). I think I should either build a separate list for each folder so that I can concatenate each list and then export it, or find a way to concatenate only the dataframes in the list that share the same year (every dataframe has a column with the year value). I can't do either, so I need help.
My code looks like this:
os.chdir('C:\\Users\\User\\Documents\\Local\\hs')
rootdir = 'C:\\Users\\User\\Documents\\Local\\hs'

data_df = []
files_notloading = []

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        print(os.path.join(subdir, file))
        if 'json' in os.path.join(subdir, file):
            try:
                with open(os.path.join(subdir, file), 'r') as f:
                    data = json.loads(f.read())
                    if not data['search']:
                        data['search'] = [{'R': 0}]
                    # Normalizing data
                    df = pd.json_normalize(data, record_path=['search'],
                                           meta=['month', 'type', 'day', 'year'],
                                           errors='ignore')
                    data_df.append(df)
            except:
                files_notloading.append(file)

data_df = pd.concat(data_df)
files_notloading = pd.DataFrame(files_notloading)

for year in data_df['year'].unique():
    file_name = '/Users/User/Documents/data/hs_{0}.csv'.format(year)
    data_df[data_df['year'] == year].to_csv(file_name, index=False)

files_notloading.to_csv(path_or_buf='/Users/User/Documents/data/filesnotloading_hs.csv', index=False)
I was able to find a way to build a list for each folder, concatenate each list, and then export it.
Code:
import os
import json
import pandas as pd

os.chdir('C:\\Users\\User\\Documents\\Local\\hs')
working_dir = "C:\\Users\\User\\Documents\\Local\\hs"
output_dir = "C:\\Users\\User\\Documents\\Local\\hs"

files_notloading = []

for root, dirs, files in os.walk(working_dir):
    file_list = []
    df_list = []
    for filename in files:
        print(os.path.join(root, filename))
        if filename.endswith('.json'):
            file_list.append(os.path.join(root, filename))
    for file in file_list:
        try:
            with open(file, 'r') as f:
                data = json.loads(f.read())
                if not data['search']:
                    data['search'] = [{'R': 0}]
                df = pd.json_normalize(data, record_path=['search'],
                                       meta=['month', 'type', 'day', 'year'],
                                       errors='ignore')
                df_list.append(df)
        except:
            files_notloading.append(file)
    if df_list:
        final_df = pd.concat(df_list)
        final_file = 'hs_{0}.csv'.format(final_df['year'].iloc[0])
        final_df.to_csv(os.path.join(output_dir, final_file), index=False)

files_notloading = pd.DataFrame(files_notloading, columns=['file'])
files_notloading.to_csv(os.path.join(output_dir, 'hs_files_notloading.csv'), index=False)
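If one folder can mix years, or you prefer to key on the year column, the alternative mentioned in the question, concatenating only the dataframes that share a year, could look roughly like this. It is only a sketch: all_frames stands for an assumed list of per-file dataframes, and it reuses pd, os and output_dir from the code above:

from collections import defaultdict

frames_by_year = defaultdict(list)
for frame in all_frames:   # all_frames: assumed list of per-file dataframes, each with a 'year' column
    frames_by_year[frame['year'].iloc[0]].append(frame)

for year, frames in frames_by_year.items():
    # concatenate only the frames for this year, then write one CSV per year
    pd.concat(frames, ignore_index=True).to_csv(
        os.path.join(output_dir, 'hs_{0}.csv'.format(year)), index=False)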
inp_file = os.getcwd()
files_comp = pd.read_csv(inp_file, "B00234*.csv", na_values=missing_values, nrows=10)

for f in files_comp:
    df_calculated = pd.read_csv(f, na_values=missing_values, nrows=10)
    col_length = len(df.columns) - 1
Hi folks, how can I read 4 CSV files in a for loop? I am getting an error when reading the CSVs in the above format. Kindly help me.
You basically need this:
Get a list of all target files with filesnames = os.listdir(path), then keep only the names that start with your pattern and end with .csv.
You could also improve this with regular expressions (import the re library for more sophistication) or use glob.glob.
filesnames = os.listdir(path)
filesnames = [f for f in filesnames if (f.startswith("B00234") and f.lower().endswith(".csv"))]
Read in files using a for loop:
dfs = list()
for filename in filesnames:
    df = pd.read_csv(filename)
    dfs.append(df)
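If you then want everything in one dataframe rather than a list, the usual follow-up (a small sketch, not part of the original steps) is a single concat:

df_all = pd.concat(dfs, ignore_index=True)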
Complete Example
We will first make some dummy data and save it to several .csv and .txt files. Some of these .csv files will begin with "B00234" and some will not. We will write the dummy data to these files, and then selectively read only the matching .csv files into a list of dataframes, dfs.
import os
import shutil
import numpy as np
import pandas as pd
from IPython.display import display

# Define Temporary Output Folder
path = './temp_output'

# Clean Temporary Output Folder
reset = True
if os.path.exists(path) and reset:
    shutil.rmtree(path, ignore_errors=True)

# Create Content
df0 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
display(df0)

# Make Path
if not os.path.exists(path):
    os.makedirs(path)
else:
    print('Path Exists: {}'.format(path))

# Make Filenames
filenames = list()
for i in range(10):
    if i < 5:
        # Create Files starting with "B00234"
        filenames.append("B00234_{}.csv".format(i))
        filenames.append("B00234_{}.txt".format(i))
    else:
        # Create Files starting with "B00678"
        filenames.append("B00678_{}.csv".format(i))
        filenames.append("B00678_{}.txt".format(i))

# Create files with extensions .csv and .txt,
# with names starting with and without "B00234"
for filename in filenames:
    fpath = path + '/' + filename
    if filename.lower().endswith(".csv"):
        df0.to_csv(fpath, index=False)
    else:
        with open(fpath, 'w') as f:
            f.write(df0.to_string())

# Get list of target files
files = os.listdir(path)
files = [f for f in files if (f.startswith("B00234") and f.lower().endswith(".csv"))]
print('\nList of target files: \n\t{}\n'.format(files))

# Read each csv file into a dataframe
dfs = list()  # a list of dataframes
for csvfile in files:
    fpath = path + '/' + csvfile
    print("Reading file: {}".format(csvfile))
    df = pd.read_csv(fpath)
    dfs.append(df)
The list dfs should have five elements, where each is a dataframe read from one of the files.
Output:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
List of target files:
['B00234_3.csv', 'B00234_4.csv', 'B00234_0.csv', 'B00234_2.csv', 'B00234_1.csv']
Reading file: B00234_3.csv
Reading file: B00234_4.csv
Reading file: B00234_0.csv
Reading file: B00234_2.csv
Reading file: B00234_1.csv
I have a list of files stored in a directory, such as:
filenames=[
abc_1.txt
abc_2.txt
abc_3.txt
bcd_1.txt
bcd_2.txt
bcd_3.txt
]
pattern=[abc]
I want to read multiple txt files into dataframes such that all files starting with abc end up in one dataframe, then all filenames starting with bcd in another, and so on.
My code:
file_path = '/home/iolie/Downloads/test/'
filenames = os.listdir(file_path)
prefixes = list(set(i.split('_')[0] for i in filenames))

for prefix in prefixes:
    print('Reading files with prefix:', prefix)
    for file in filenames:
        if file.startswith(prefix):
            print('Reading files:', file)
            list_of_dfs = [pd.concat([pd.read_csv(os.path.join(file_path, file), header=None)], ignore_index=True)]
            final = pd.concat(list_of_dfs)
This code doesn't append but overwrites the dataframe. Can someone help with this?
A better idea than creating an arbitrary number of unlinked dataframes is to output a dictionary of dataframes, where the key is the prefix:
import pandas as pd
from collections import defaultdict

filenames = ['abc_1.txt', 'abc_2.txt', 'abc_3.txt',
             'bcd_1.txt', 'bcd_2.txt', 'bcd_3.txt']

dd = defaultdict(list)
for fn in filenames:
    dd[fn.split('_')[0]].append(fn)

dict_of_dfs = {}
for k, v in dd.items():
    dict_of_dfs[k] = pd.concat([pd.read_csv(fn) for fn in v], ignore_index=True)
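Each prefix then maps to one combined dataframe; using the prefixes from the example filenames above, access looks like this:

abc_df = dict_of_dfs['abc']   # all abc_*.txt files stacked into one dataframe
bcd_df = dict_of_dfs['bcd']   # all bcd_*.txt files stacked into one dataframe
print(list(dict_of_dfs.keys()))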
I have the following structure of text files in folders and subfolders.
I want to read them all and create a dataframe. I am using this code, but it doesn't work well for me: the text is not what I expected and the number of files does not match my count.
l = [pd.read_csv(filename, header=None, encoding='iso-8859-1') for filename in glob.glob("2018_01_01/*.txt")]
main_df = pd.concat(l, axis=1)
main_df = main_df.T

for i in range(2):
    l = [pd.read_csv(filename, header=None, encoding='iso-8859-1', quoting=csv.QUOTE_NONE)
         for filename in glob.glob(str(foldernames[i+1]) + '/' + '*.txt')]
    df = pd.concat(l, axis=1)
    df = df.T
    main_df = pd.merge(main_df, df)
Assuming those directories contain txt files whose information has the same structure in all of them:
import os
import pandas as pd

path = '/path/to/directory/of/directories/'

frames = []
for directory in os.listdir(path):
    dir_path = os.path.join(path, directory)   # join with the parent path before testing
    if os.path.isdir(dir_path):
        for filename in os.listdir(dir_path):
            with open(os.path.join(dir_path, filename)) as f:
                observation = f.read()
            frames.append(pd.DataFrame({'observation': [observation]}))

# build the final dataframe in one go (DataFrame.append is deprecated in recent pandas)
df = pd.concat(frames, ignore_index=True)
Once all your files have been iterated, df should be the DataFrame containing all the information in your different txt files.
You can do that using a for loop. But before that, you need to give a sequenced name to all the files like 'fil_0' within 'fol_0', 'fil_1' within 'fol_1', 'fil_2' within 'fol_2' and so on. That would facilitate the use of a for loop:
import pandas as pd

dataframes = []
for var in range(1000):
    name = "fol_" + str(var) + "/fil_" + str(var) + ".txt"
    dataframes.append(pd.read_csv(name))  # if you need to use all the files at once
    # otherwise, read and use each file one by one:
    # df = pd.read_csv(name)
It will automatically create a dataframe for each file.
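If you later want all of them in a single dataframe, a minimal sketch using the dataframes list built above is:

combined = pd.concat(dataframes, ignore_index=True)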