Concatenating CSVs after reading in Python

I am new to python and am trying to read all the files in a directory that end with the extension ".txt" and spit them back out into a CSV with headers. So far I have been able to successfully do that except it is iterating through my list twice instead of just once and I can't seem to figure out where it is reading it for the second time.
I am using this code:
import pandas as pd
import os, shutil

# Get current directory path and list of file names
path = os.getcwd()
file_names = os.listdir(path)
col_names = [A, B, C,...]
merge_list = []

# Reads and concatenates files into a single CSV
def read_and_concat_files(file_name):
    read_file = pd.read_csv(file_name, delimiter='|', names=col_names)
    merge_list.append(read_file)
    merge_file = pd.concat(merge_list)
    merge_file.to_csv('Combined_EDDs.csv', index=False)

# Iterate through each file_name in the list of file_names, calling the function on each item
for root, dirs, files in os.walk(path, topdown=True):
    for file_name in file_names:
        if os.path.exists(file_name):
            if file_name.endswith(".txt"):  # reads only files with .txt extension
                read_and_concat_files(file_name)
                print(f'\nFormatting file {file_name}...\n')
And I am expecting a CSV with about 165 lines and instead I end up with 330. I suspected it was in the loop somewhere but everything I have tried hasn't helped. TIA
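A likely cause: `os.walk` yields one tuple per directory under `path`, so the inner loop over `file_names` runs once for every directory it visits; with one subdirectory present, every file is processed twice, which would explain 330 rows instead of 165. The function also re-concatenates and rewrites the CSV on every call. A minimal single-pass sketch (the column names and function name here are placeholders, not the asker's):

```python
import glob
import os

import pandas as pd

def merge_txt_files(folder, col_names, out_csv="Combined_EDDs.csv"):
    """Read every .txt file in `folder` exactly once, then concatenate
    and write the combined CSV a single time at the end."""
    frames = []
    for file_name in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        frames.append(pd.read_csv(file_name, delimiter="|", names=col_names))
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_csv, index=False)
    return merged
```

Dropping `os.walk` entirely (or iterating over its `files` value instead of the outer `file_names` list) guarantees each file is read once.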

Related

How to read pairwise csv and json files having same names inside a folder using python?

Consider my folder structure having files in this fashion:
abc.csv
abc.json
bcd.csv
bcd.json
efg.csv
efg.json
and so on, i.e. pairs of CSV and JSON files having the same names. I have to read each same-named pair, do some operation, and proceed to the next pair of files. How do I go about this?
Basically what I have in mind as pseudocode is:
for files in folder_name:
    df_csv = pd.read_csv('abc.csv')
    df_json = pd.read_json('abc.json')
    # some script to execute
    # now read the next pair and repeat for all files
Did you think of something like this?
import os

# collect filenames in the folder
filelist = os.listdir()

# iterate through filenames in the folder
for file in filelist:
    # pair each .csv file with the .json file of the same name
    if file.endswith(".csv"):
        with open(file) as csv_file:
            pre, ext = os.path.splitext(file)
            secondfile = pre + ".json"
            with open(secondfile) as json_file:
                # do something
                ...
You can use the glob module to extract the file names matching a pattern:
import glob
import os.path

for csvfile in glob.iglob('*.csv'):
    jsonfile = csvfile[:-3] + 'json'
    # optionally check that the paired file exists
    if not os.path.exists(jsonfile):
        # show error message
        ...
        continue
    # do something with csvfile
    # do something else with jsonfile
    # and proceed to the next pair
If the directory structure is consistent you could do the following:
import os

import pandas as pd

dir_path = './path/to/dir'
for f_name in {x.split('.')[0] for x in os.listdir(dir_path)}:
    df_csv = pd.read_csv(os.path.join(dir_path, f"{f_name}.csv"))
    df_json = pd.read_json(os.path.join(dir_path, f"{f_name}.json"))
    # execute the rest

Saving CSV files to a new directory

I am trying to use this code to write my edited csv files to a new directory. Does anyone know how I specify the directory?
I have tried this but it doesn't seem to be working.
dir = r'C:/Users/PycharmProjects/pythonProject1'  # raw string for windows.
csv_files = [f for f in Path(dir).glob('*.csv')]  # finds all csvs in your folder.
cols = ['Temperature']

for csv in csv_files:  # iterate list
    df = pd.read_csv(csv)  # read csv
    df[cols].to_csv('C:/Users/Desktop', csv.name, index=False)
    print(f'{csv.name} saved.')
I think your only problem is the way you're calling to_csv(), passing a directory and a filename. I tried that and got this error:
IsADirectoryError: [Errno 21] Is a directory: '/Users/zyoung/Desktop/processed'
because to_csv() is expecting a path to a file, not a directory path and a file name.
You need to join the output directory and CSV's file name, and pass that, like:
out_dir = PurePath(base_dir, r"processed")
# ...
# ...
csv_out = PurePath(out_dir, csv_in)
df[cols].to_csv(csv_out, index=False)
I'm writing to the subdirectory processed, in my current dir ("."), and using PurePath() to do smart joins of the path components.
Here's the complete program I wrote for myself to test this:
import os
from pathlib import Path, PurePath

import pandas as pd

base_dir = r"."
out_dir = PurePath(base_dir, r"processed")
csv_files = [x for x in Path(base_dir).glob("*.csv")]

if not os.path.exists(out_dir):
    os.mkdir(out_dir)

cols = ["Temperature"]
for csv_in in csv_files:
    df = pd.read_csv(csv_in)
    csv_out = PurePath(out_dir, csv_in)
    df[cols].to_csv(csv_out, index=False)
    print(f"Saved {csv_out.name}")

How to recursively generate and save csv files to a specific directory?

I am working on a code that gets as input a zip file that contains excel files, extract them in a folder, convert them in dataframes and load all these dataframes files in a list. I would like to create a new folder, convert those dataframes in csv files and save them in the above-mentioned folder. The goal is to be able to download as a zip file a folder of csv files.
The main problem for me is to make sure that every csv file has the name of the excel file it was originated from.
I'm adding my code below: the first block contains the first part of the code, while the second contains the part where I have a problem.
Running this last part of the code I get this error:
"XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf;data'"
%%capture
import os
import numpy as np
import pandas as pd
import glob
import os.path
!pip install xlrd==1.2.0
from google.colab import files
uploaded = files.upload()
%%capture
zipname = list(uploaded.keys())[0]
destination_path = 'files'
infolder = os.path.join('/content/', destination_path)
!unzip -o $zipname -d $destination_path
# Load an excel file, return events dataframe + file header dataframe
def load_xlsx(fullpath):
    return events, meta

tasks = [os.path.join(dp, fname)
         for dp, dn, filenames in os.walk(infolder)
         for fname in filenames
         if fname.lower().endswith('.xls')]

dfs = []
metas = []
for fname in tasks:
    df, meta = load_xlsx(fname)
    dfs.append(df)
    metas.append(meta)

newpath = 'csv2021'
if not os.path.exists(newpath):
    os.makedirs(newpath)
filepath = os.path.join('/content/files/', newpath)

for fname in tasks:
    filename = load_xlsx(fname)
    my_csv = filename.to_csv(os.path.join(filepath, filename), encoding="utf-8-sig", sep=';')
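The final loop has two problems: `load_xlsx` returns an `(events, meta)` tuple, and that tuple is then passed to `os.path.join` in place of a file name. A sketch of a fix (the function and variable names here are mine, and it assumes `load_xlsx` returns two values as in the snippet above):

```python
import os

def export_events_to_csv(tasks, load_xlsx, out_dir):
    """For each source file, unpack (events, meta) and save the events frame
    as <original base name>.csv inside out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for fname in tasks:
        events, _meta = load_xlsx(fname)  # unpack the tuple
        base = os.path.splitext(os.path.basename(fname))[0]
        out_path = os.path.join(out_dir, base + ".csv")
        events.to_csv(out_path, encoding="utf-8-sig", sep=";")
        written.append(out_path)
    return written
```

Called as `export_events_to_csv(tasks, load_xlsx, os.path.join('/content/files/', 'csv2021'))`, each CSV keeps the name of the Excel file it was generated from, which is the requirement stated in the question.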

What am I missing on how to change the df.to_csv outfile name for looping through a folder?

I have a folder of 16 files and instead of doing it manually every quarter (names will change) I want to write a script that reads in space delimited data and outputs comma delimited. The input files are .out and the output I want are .csv but with some of the name removed for example:
bls_2.out ---> bls.csv
import pandas as pd
import os
import csv

directory = r'C:\Projects'
for filename in os.listdir(directory):
    if filename.endswith(".out"):
        df = pd.read_csv(filename, sep="\s+", header=None, skiprows=15)
        df.to_csv(filename + '.csv', sep=",", index=False, quoting=csv.QUOTE_NONE,
                  quotechar="", escapechar=" ",
                  header=["East", "North", "Elevation", "HoleId", "aufaf", "aucnf",
                          "aufacomf", "aucncomf", "prvf", "ocf",
                          "ccf", "ssf", "Domain_Code"])
    else:
        continue
It looks like you have to remove the file extension before writing a new one. This can be done with os.path.splitext. Also, to get the .csv in the original directory, you'll have to join those paths.
out_file = os.path.join(directory, os.path.splitext(filename)[0] + ".csv")
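Putting that together with the renaming from the example (`bls_2.out` → `bls.csv`), here is a sketch; it assumes the part to drop is always a trailing underscore plus digits, which is my reading of the single example given:

```python
import os
import re

def out_to_csv_name(filename):
    """Map e.g. 'bls_2.out' to 'bls.csv': drop the extension, then drop a
    trailing _<number> suffix if present (assumed naming pattern)."""
    stem = os.path.splitext(filename)[0]
    stem = re.sub(r"_\d+$", "", stem)
    return stem + ".csv"
```

Inside the loop the write then becomes `df.to_csv(os.path.join(directory, out_to_csv_name(filename)), ...)`, so the output lands next to the input with the trimmed name.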

Opening multiple CSV files

I am trying to open multiple excel files. My program throws error message "FileNotFoundError". The file is present in the directory.
Here is the code:
import os
import pandas as pd

path = "C:\\GPA Calculations for CSM\\twentyfourteen"
files = os.listdir(path)
print(files)

df = pd.DataFrame()
for f in files:
    df = pd.read_excel(f, 'Internal', skiprows=7)
    print("file name is " + f)
    print(df.loc[0][1])
    print(df.loc[1][1])
    print(df.loc[2][1])
Program gives error on df = pd.read_excel(f,'Internal', skiprows = 7).
I opened the same file on another program (which opens single file) and that worked fine. Any suggestions or advice would be highly appreciated.
os.listdir lists filenames relative to the directory (path) you're giving as an argument, so you need to join the path and the filename to get the full path for each file. In your loop:
for filename in files:
    abspath = os.path.join(path, filename)
    # etc., replace f with abspath
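As a minimal sketch of that join (the helper name is mine, not from the answer):

```python
import os

def absolute_paths(path):
    """os.listdir returns bare names; join each with its directory so the
    result can be opened regardless of the current working directory."""
    return [os.path.join(path, f) for f in os.listdir(path)]
```

With this, `pd.read_excel(abspath, 'Internal', skiprows=7)` receives a path that actually resolves to the file, which avoids the FileNotFoundError.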
