I have written a script which works but is not very elegant. It merges csv files, outputs a new file, filters that file to the required conditions, then outputs the filtered file, which is the file I want. I then repeat the process for every month.
Rather than altering this code to process every month (I have five more years' worth of data to go), I would like to automate the directory paths and the exported CSV file names, which change from one month (and year) to the next.
See snippet of Jan and Feb below:
import os
import glob
import pandas as pd
import shutil
path = r"C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\20xx01"
os.chdir(path)
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_csv.to_csv("201401.csv", index=False, encoding='utf-8-sig')
grab1 = r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\20xx01\201401.csv'
move1 = r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\2014\2014-01.csv'
shutil.move(grab1,move1)
fd = pd.read_csv(r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\2014\2014-01.csv')
df = pd.DataFrame(fd)
irishsea = df[(df.lat_bin >= 5300) & (df.lat_bin <= 5500) & (df.lon_bin >= -650) & (df.lon_bin <= -250)]
irishsea.to_csv("2014-01_irishsea.csv", index=False, encoding='utf-8-sig')
grab2 = r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\20xx01\2014-01_irishsea.csv'
move2 = r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\2014\2014-01-IrishSea.csv'
shutil.move(grab2,move2)
I then repeat it for Feb data but have to update the path locations.
#process feb data
path = r"C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\20xx02"
os.chdir(path)
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_csv.to_csv("201402.csv", index=False, encoding='utf-8-sig')
grab1 = r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\20xx02\201402.csv'
move1 = r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\2014\2014-02.csv'
shutil.move(grab1,move1)
fd = pd.read_csv(r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\2014\2014-02.csv')
df = pd.DataFrame(fd)
irishsea = df[(df.lat_bin >= 5300) & (df.lat_bin <= 5500) & (df.lon_bin >= -650) & (df.lon_bin <= -250)]
irishsea.to_csv("2014-02_irishsea.csv", index=False, encoding='utf-8-sig')
grab2 = r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\20xx02\2014-02_irishsea.csv'
move2 = r'C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs\2014\2014-02-IrishSea.csv'
shutil.move(grab2,move2)
You can do something like the following. Keep in mind that the second number passed to range (the stop value) is exclusive, so it needs to be one higher than the last value you intend to use.
for year in range(2014, 2020):
    for month in range(1, 13):
        if month < 10:
            month_as_string = "0" + str(month)
        else:
            month_as_string = str(month)
        date = "%s\%s-%s" % (year, year, month_as_string)
        pathname = 'YOUR\FILEPATH\HERE' + date + 'irishsea.csv'
You can learn more about string formatting here https://www.learnpython.org/en/String_Formatting
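To tie that into the pipeline above, one option is to wrap the per-month steps in a function and call it from the loop. The sketch below is only an outline: it assumes the monthly folders are actually named like 201401, 201402, ... under daily_csvs (the 20xx01 in the question looks like a placeholder), the BASE constant and the process_month name are mine, and it writes the combined and filtered files straight into the year folder instead of writing them in place and moving them with shutil.

import glob
import os
import pandas as pd

BASE = r"C:\Users\jonathan.capanda\Documents\Fishing_DataBase\gfw_data\100_deg_data\daily_csvs"

def process_month(year, month):
    month_str = "%02d" % month                                   # zero-pad: 1 -> "01"
    src_dir = os.path.join(BASE, "%d%s" % (year, month_str))     # e.g. ...\daily_csvs\201401
    out_dir = os.path.join(BASE, str(year))                      # e.g. ...\daily_csvs\2014
    os.makedirs(out_dir, exist_ok=True)

    # merge all daily CSVs for the month
    daily_files = glob.glob(os.path.join(src_dir, "*.csv"))
    combined = pd.concat(pd.read_csv(f) for f in daily_files)
    combined.to_csv(os.path.join(out_dir, "%d-%s.csv" % (year, month_str)),
                    index=False, encoding="utf-8-sig")

    # filter to the Irish Sea bounding box from the question
    irishsea = combined[(combined.lat_bin >= 5300) & (combined.lat_bin <= 5500) &
                        (combined.lon_bin >= -650) & (combined.lon_bin <= -250)]
    irishsea.to_csv(os.path.join(out_dir, "%d-%s-IrishSea.csv" % (year, month_str)),
                    index=False, encoding="utf-8-sig")

for year in range(2014, 2020):
    for month in range(1, 13):
        process_month(year, month)

Writing directly to the destination folder removes the need for the grab/move steps, and processing another year only means extending the range.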
Related
I have over 100K CSV files (total size north of 150 GB) which I need to join. All have standard column names, although the sequence of columns may not match and some CSVs have a few columns missing.
Now I just created a dataframe and kept concatenating the dataframe from each CSV on every iteration, to end up with a standard dataframe containing all columns, which I eventually intended to save as CSV.
I tried building a dataframe from 1,000 sample CSVs and noticed that as the dataframe grew, the iteration rate dropped from 10 to 1.5 per second, which probably means it would follow a similar trend if I went all-in with 100K CSVs, taking days if not months to combine them.
Is there a better way of combining a huge number of CSV files?
Here is my code
df_t1 = pd.DataFrame()
for i in tqdm(range(len(excelNames))):
    thisCSV = str(excelNames[i]).lower().strip()
    df = pd.read_csv(pathxl + "\\" + thisCSV, error_bad_lines=False, warn_bad_lines=False, low_memory=False)
    df["File Name"] = pd.Series([thisCSV for x in range(len(df.index))])
    if thisCSV.endswith('type1.csv'):
        df_t1 = pd.concat([df_t1, df], axis=0, ignore_index=True)
df_t1.to_csv(outpath + "df_t1.csv", index=None, header=True, encoding='utf-8')
print("df_t1.csv generated")
Possible improvement
Method 1: Using Pandas
#df_t1 = pd.DataFrame()
df_t1_lst = []
for i in tqdm(range(len(excelNames))):
    thisCSV = str(excelNames[i]).lower().strip()
    if thisCSV.endswith('type1.csv'):
        df = pd.read_csv(pathxl + "\\" + thisCSV, error_bad_lines=False, warn_bad_lines=False, low_memory=False)
        #df["File Name"] = pd.Series([thisCSV for x in range(len(df.index))])  # unnecessary to loop; use the next line instead
        df["File Name"] = thisCSV  # places thisCSV in every row
        #df_t1 = pd.concat([df_t1, df], axis=0, ignore_index=True)  # concat in the loop is slow; append to a list instead
        df_t1_lst.append(df)

df_t1 = pd.concat(df_t1_lst, ignore_index=True)  # form the dataframe from the list (much faster than pd.concat in a loop)
df_t1.to_csv(outpath + "df_t1.csv", index=None, header=True, encoding='utf-8')
print("df_t1.csv generated")
Method 1a
Using Pandas to continuously append to CSV output file
import os
import pandas as pd
from tqdm import tqdm

def str_to_bytes(s):
    ' String to byte array '
    result = bytearray()
    result.extend(map(ord, s))
    return result

def good_file(file_path):
    """ Check that the file exists and is not empty """
    return os.path.exists(file_path) and os.stat(file_path).st_size > 0

SEPARATOR = ','      # separator used by the CSV files
write_header = True

pathxl = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
outpath = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
excelNames = ["xxx.csv", "xxxxx.csv"]

pathxl = r"C:\\Users\\darryl\\OneDrive\\Python"
outpath = pathxl + r"\\"
excelNames = ["test1_type1.csv", "test2_type1.csv"]

output_file = outpath + "df_t1.csv"
with open(output_file, "w") as ofile:
    pass  # create empty output file

for i in tqdm(range(len(excelNames))):
    thisCSV = str(excelNames[i]).lower().strip()
    input_file = pathxl + "\\" + thisCSV
    if thisCSV.endswith('type1.csv') and good_file(input_file):
        df = pd.read_csv(input_file)
        if df.shape[0] > 0:
            df['File Name'] = thisCSV       # add the source file name as a column
            df = df.sort_index(axis=1)      # sort columns in ascending order
            # append to the output file
            df.to_csv(output_file, mode='a',
                      index=False,
                      header=write_header)
            write_header = False            # only write the header once
            del df
Method 2: Binary Files
Reading and writing the files as binary, with a memory-mapped input file, should be faster.
from tqdm import tqdm
import os
import mmap

def str_to_bytes(s):
    ' String to byte array '
    result = bytearray()
    result.extend(map(ord, s))
    return result

def good_file(file_path):
    """ Check that the file exists and is not empty """
    return os.path.exists(file_path) and os.stat(file_path).st_size > 0

SEPARATOR = ','   # separator used by the CSV files
header = None

pathxl = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
outpath = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
excelNames = ["xxx.csv", "xxxxx.csv"]

with open(outpath + "df_t1.csv", "wb") as ofile:
    for i in tqdm(range(len(excelNames))):
        thisCSV = str(excelNames[i]).lower().strip()
        input_file = pathxl + "\\" + thisCSV
        if thisCSV.endswith('type1.csv') and good_file(input_file):
            with open(input_file, "rb") as ifile:
                print('file ', thisCSV)
                # memory-map the file, size 0 means whole file
                with mmap.mmap(ifile.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
                    text_iter = iter(mmap_obj.read().split(b'\n'))
                    if header is None:
                        header = next(text_iter)
                        header = header.rstrip() + str_to_bytes(SEPARATOR + "File Name\n")
                        ofile.write(header)   # write the header once
                    else:
                        next(text_iter)       # skip the header row
                    # write data rows to the output file, appending the source file name
                    file_value = str_to_bytes(SEPARATOR + f"{thisCSV}\n")
                    for line in text_iter:
                        if line.strip():      # skip blank lines
                            ofile.write(line.rstrip() + file_value)
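One caveat: the question says the column order differs between files and some columns are missing, which the binary method cannot reconcile because it copies rows verbatim. A hedged variant of Method 1a (a sketch only, reusing the excelNames, pathxl and output_file names from above) locks the column order once and reindexes each file against it before appending:

import pandas as pd

all_columns = None
write_header = True

for name in excelNames:
    df = pd.read_csv(pathxl + "\\" + name)
    df["File Name"] = name
    if all_columns is None:
        all_columns = sorted(df.columns)      # fix the output column order once
    df = df.reindex(columns=all_columns)      # missing columns become NaN; order is normalised
    df.to_csv(output_file, mode='a', index=False, header=write_header)
    write_header = False                      # header only on the first append

If later files can contain columns that the first file lacks, the column list would need to be built from a first pass over all headers instead of from the first file alone.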
I have a python script for generating 1 upload file from 1 input file.
The thing is that the input files have started coming in batches, 30-50 at one time.
e.g.:
1111.xlsx --> upload.xlsx
1125.xlsx --> upload.xlsx
1176.xlsx --> upload.xlsx
1322.xlsx --> upload.xlsx
The code just converts the input files into the upload format.
Here's what I have done so far (1 input file -> 1 output file):
import pandas as pd

def main():
    initial_workbook = 'C:/files/1111.xlsx'
    temp_df = pd.ExcelFile(initial_workbook)
    initial_df = pd.read_excel(initial_workbook, sheet_name="default")

    # drop the first rows to set the header
    new_header = initial_df.iloc[2]
    initial_df = initial_df.iloc[3:]
    initial_df.columns = new_header

    # drop all rows with no data
    indexNames = initial_df[initial_df['grade'] == 'select'].index
    initial_df.drop(indexNames, inplace=True)
    initial_df.dropna(axis=1, how='all')
    output = initial_df.to_excel('C:/files/upload_file.xlsx', index=False)
Is there a way to generate one upload file for all the files in the input folder? And once the input files have been processed, rename them by prefixing an x, e.g. x1111.xlsx.
Here is how I would approach it, for a given batch:
import os
from datetime import datetime
from pathlib import Path

import pandas as pd

proj_path = Path("C:/files/")

def main(f):
    initial_workbook = proj_path / f
    temp_df = pd.ExcelFile(initial_workbook)
    initial_df = pd.read_excel(initial_workbook, sheet_name="default")

    # drop the first rows to set the header
    new_header = initial_df.iloc[2]
    initial_df = initial_df.iloc[3:]
    initial_df.columns = new_header

    # drop all rows with no data
    indexNames = initial_df[initial_df['grade'] == 'select'].index
    initial_df.drop(indexNames, inplace=True)
    initial_df.dropna(axis=1, how='all', inplace=True)
    return initial_df

all_dfs = []
for f in os.listdir(proj_path):
    if f.endswith(".xlsx"):
        print(f"processing {f}...")
        df_tmp = main(f)
        df_tmp["file_name"] = f
        all_dfs.append(df_tmp)

df_all = pd.concat(all_dfs, axis=0)
# strftime keeps the timestamp free of characters that are invalid in Windows file names
df_all.to_excel(proj_path / f"{datetime.now().strftime('%Y%m%d_%H%M%S')}_batch.xlsx", index=False)
You can potentially enclose the logic for a batch in a function.
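The renaming part of the question is not covered above. A minimal sketch, assuming the same proj_path layout, that prefixes each processed workbook with x after the batch file has been written:

from pathlib import Path

proj_path = Path("C:/files/")

for src in proj_path.glob("*.xlsx"):
    if not src.name.startswith("x"):                  # skip files already marked as processed
        src.rename(src.with_name("x" + src.name))     # 1111.xlsx -> x1111.xlsx

If the batch output is written into the same folder, you would also want to exclude it (for example, skip names ending in _batch.xlsx) before renaming.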
I have a datafile which is the result of combining several sources that contain name information. Each name has a unique ID (column ID).
Sorting by the ID column, I would like to remove the second/third source findings in the Source column.
My output today:
(all the red rows are "duplicates" since we already got them from the first source (blue rows))
What I would like to achieve:
How can I achieve this result?
Is there a way to iterate row by row and drop duplicate IDs already while iterating in the "for file in files:" part of the code?
Or is it easier to do it on df_merged before I output the dataframe to an Excel file?
Code:
import pandas as pd
import os
from datetime import datetime
from shutil import copyfile
from functools import reduce
import numpy as np
#Path
base_path = "G:/Till/"
# Def
def get_files(folder, filetype):
    list_files = []
    directory = os.fsencode(folder)
    for file in os.listdir(directory):
        filename = os.fsdecode(file)
        if filename.endswith("." + filetype.strip().lower()):
            list_files.append(filename)
    return list_files

# export files
df_result_e = pd.DataFrame()
files = get_files(base_path + "datasource/" + "export", "xlsx")
df_append_e = pd.DataFrame()
for file in files:
    df_temp = pd.read_excel(base_path + "datasource/" + "export/" + file, "Results", dtype=str, index=False)
    df_temp["Source"] = file
    df_append_e = pd.concat([df_append_e, df_temp])
df_result_e = pd.concat([df_result_e, df_append_e])
print(df_result_e)

# match files
df_result_m = pd.DataFrame()
files = get_files(base_path + "datasource/" + "match", "xlsx")
df_append_m = pd.DataFrame()
for file in files:
    df_temp = pd.read_excel(base_path + "datasource/" + "match/" + file, "Page 1", dtype=str, index=False)
    df_append_m = pd.concat([df_append_m, df_temp])
df_result_m = pd.concat([df_result_m, df_append_m])
df_result_m = df_result_m[['ID_Our','Name_Our','Ext ID']]
df_result_m.rename(columns={'ID_Our' : 'ID', 'Name_Our' : 'Name' , 'Ext ID' : 'Match ID'}, inplace=True)
df_result_m.dropna(subset=["Match ID"], inplace=True) # Drop all NA
data_frames = [df_result_e, df_result_m]
# Join files
df_merged = reduce(lambda left,right: pd.merge(left, right, on=["Match ID"], how='outer'), data_frames)
#Output of files
df_merged.to_excel(base_path + "Total datasource Export/" + datetime.now().strftime("%Y-%m-%d_%H%M") + ".xlsx", index=False)
To remove them you can try transform with factorize:
newdf=df[df.groupby('ID')['Source'].transform(lambda x : x.factorize()[0])==0]
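To see what the transform does, here is a small example with made-up data (the values are illustrative only). factorize numbers the sources in order of first appearance within each ID, so keeping the rows where the result is 0 keeps only the rows from the first source seen for that ID:

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Name": ["Anna", "Anna", "Anna", "Bo", "Bo"],
    "Source": ["a.xlsx", "a.xlsx", "b.xlsx", "b.xlsx", "c.xlsx"],
})

keep = df.groupby("ID")["Source"].transform(lambda x: x.factorize()[0]) == 0
print(df[keep])
#    ID  Name  Source
# 0   1  Anna  a.xlsx
# 1   1  Anna  a.xlsx
# 3   2    Bo  b.xlsx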
Could anyone advise me how to apply this code to several CSVs in one folder, and then save each modified CSV separately to another folder? In short, I need to automate it.
I need to automatically load a CSV file, run the code, save the newly modified CSV file, and then repeat this for the next CSV file in the folder.
import pandas as pd
import datetime as dt
import numpy as np
from numpy import nan as Nan
path = "C://Users//Zemi4//Desktop//csv//A-001.csv"
df = pd.read_csv(path,delimiter=";")
df['ta'] = pd.to_numeric(df['ta'])
df['tw'] = pd.to_numeric(df['tw'])
df["time_str"] = [dt.datetime.strptime(d, "%d.%m.%Y %H:%M:%S") for d in df["time"]]
df["time_str"] = [d.date() for d in df["time_str"]]
df["time_str"] = pd.to_datetime(df["time_str"])
df["time_zaokrouhleny"]=df["time_str"]
def analyza(pozadovane_data):
    new_list = []
    new_df = pd.DataFrame(new_list)
    new_df = df.loc[df["time_str"] == pozadovane_data, ["ta", "tw", "zone", "time_zaokrouhleny"]]
    counter = new_df.ta.count()
    if counter < 24:
        for i in range(counter, 24):
            new_df.loc[i] = [Nan for n in range(4)]
        new_df["ta"] = new_df.ta.fillna(0)
        new_df["tw"] = new_df.tw.fillna(0)
        new_df["zone"] = new_df.zone.fillna(0)
        new_df["time_zaokrouhleny"] = new_df.time_zaokrouhleny.fillna(new_df.time_zaokrouhleny.min())
    elif counter > 24:
        counter_list = list(range(24, counter))
        new_df = new_df.drop(new_df.index[counter_list])
    new_df["time_oprava"] = [dt.datetime.combine(d.date(), dt.time(1, 0)) for d in new_df["time_zaokrouhleny"]]
    s = 0
    cas_list = []
    for d in new_df["time_oprava"]:
        d = d + dt.timedelta(hours=s)
        #print(d)
        #print(s)
        cas_list.append(d)
        s = s + 1
    se = pd.Series(cas_list)
    new_df['time_oprava'] = se.values
    new_df['Validace'] = (new_df['ta'] != 0) & (new_df['tw'] != 0)
    new_df['Rozdil'] = new_df['ta'] - new_df['tw']
    new_df.rename(columns={"ta": "Skutecna teplota", "tw": "Pozadovana teplota", "time_oprava": "Cas", "zone": "Mistnost"}, inplace=True)
    new_df.index = new_df['Cas']
    return new_df
start = dt.datetime(2010,10,6)
end = dt.datetime(2010,12,27)
date_range = []
date_range = [start + dt.timedelta(days=x) for x in range(0,(end-start).days)]
new_list = []
vysledek_df =pd.DataFrame(new_list)
for d in date_range:
    pom = analyza(d)
    vysledek_df = vysledek_df.append(pom, ignore_index=True)
vysledek_df.pop('time_zaokrouhleny')
vysledek_df.to_csv('C://Users//Zemi4//Desktop//zpr//A-001.csv', encoding='utf-8', index=False)
The code itself works correctly. Thank you for your advice.
The simplest way is to use glob. Just set folder_path and output_path to match your requirements and use the sample code below; I have commented the code to help you understand it.
import os
import glob
import pandas as pd

folder_path = 'path/to/folder/'          # path to the folder containing the .csv files
output_path = 'path/to/output/folder/'   # path to the output folder

for file in glob.glob(folder_path + '*.csv'):   # only loads .csv files from the folder
    df = pd.read_csv(file, delimiter=";")       # read the .csv file
    # Do something
    df.to_csv(output_path + 'modified_' + str(os.path.basename(file)), encoding='utf-8', index=False)   # save the modified .csv file to output_path
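If the output folder may not exist yet, you can create it before the loop (a small addition, not part of the original snippet):

os.makedirs(output_path, exist_ok=True)   # create the output folder if it is missing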
You want to use os.listdir() to find the contents of the directory, then parameterize the file path in a new function. You can then loop over a list of directories retrieved via os.walk() and run the function for each one.
import os
import pandas as pd

def run(file_directory):
    filelist = os.listdir(file_directory)
    for path in filelist:
        df = pd.read_csv(os.path.join(file_directory, path), delimiter=";")  # join so the full path is used
        # etc.
        df.to_csv(os.path.join(file_directory, 'output.csv'))
If you need to create a new directory, you can use os.mkdir(newpath)
Can you still advise on how to parameterize the function?
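To answer the follow-up, here is a minimal sketch of one way to parameterize both the input and output folders (the folder paths are taken from the question and the per-file processing is elided):

import os
import pandas as pd

def run(input_directory, output_directory):
    os.makedirs(output_directory, exist_ok=True)        # make sure the output folder exists
    for name in os.listdir(input_directory):
        if not name.endswith(".csv"):
            continue
        df = pd.read_csv(os.path.join(input_directory, name), delimiter=";")
        # ... apply the per-file processing from the question here ...
        df.to_csv(os.path.join(output_directory, name), encoding='utf-8', index=False)

run("C://Users//Zemi4//Desktop//csv", "C://Users//Zemi4//Desktop//zpr")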
I am trying to combine multiple .csv files into one .csv file using a pandas dataframe. The tricky part is that I need to grab multiple files from multiple days. Please let me know if this does not make sense. As it currently stands, I cannot figure out how to loop through the directory. Could you offer some assistance?
import csv
import pandas as pd
import datetime as dt
import glob, os

startDate = 20160613
endDate = 20160614
dateRange = endDate - startDate
dateRange = dateRange + 1
todaysDateFilePath = startDate

for x in xrange(dateRange):
    print startDate
    startDate = startDate + 1
    filePath = os.path.join(r"\\export\path", startDate, "preprocessed")
    os.chdir(filePath)
    interesting_files = glob.glob("trade" + "*.csv")
    print interesting_files
    df_list = []
    for filename in sorted(interesting_files):
        df_list.append(pd.read_csv(filename))
        full_df = pd.concat(df_list)

saveFilepath = r"U:\Chris\Test_Daily_Fails"
fileList = []
full_df.to_csv(saveFilepath + '\\Files_For_IN' + "_0613_" + ".csv", index=False)
IIUC you can create a list all_files and, in the loop, append the output from glob to all_files:
all_files = []
for x in xrange(dateRange):
    print startDate
    startDate = startDate + 1
    filePath = os.path.join(r"\\export\path", str(startDate), "preprocessed")  # str() because os.path.join needs strings
    os.chdir(filePath)
    all_files = all_files + glob.glob("trade" + "*.csv")
    print all_files
Also, you need to first append all values to df_list and only then concat once (I changed the indentation of the concat):
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
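Putting the two parts together, a sketch of how the whole thing might look. It keeps full paths in all_files so no os.chdir is needed, and it takes the integer date arithmetic from the question as-is, which only holds while the range stays within a single month:

import glob
import os
import pandas as pd

base_dir = r"\\export\path"
saveFilepath = r"U:\Chris\Test_Daily_Fails"

startDate = 20160613
endDate = 20160614
dateRange = endDate - startDate + 1

all_files = []
for x in range(dateRange):
    day_dir = os.path.join(base_dir, str(startDate), "preprocessed")
    all_files.extend(glob.glob(os.path.join(day_dir, "trade*.csv")))  # keep full paths
    startDate = startDate + 1

df_list = [pd.read_csv(f) for f in sorted(all_files)]
full_df = pd.concat(df_list)
full_df.to_csv(os.path.join(saveFilepath, "Files_For_IN_0613_.csv"), index=False)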