I am writing a Python script to import a large number of files, manipulate them, and then output to .csv. That part is cake in Pandas; however, I have no control over the files coming in, so I am trying to write the script with an exception for how to handle files that come in the "wrong" way.
Anyway, I am using a try/except to show the user that there is a KeyError in one of the files (basically there is a " in a cell where the datatype should be int).
My question is: is there a way to have the except block report the file name of the file that caused the error?
for csv in csvList:
    df = pd.read_csv(csv, header=0, skip_blank_lines=True, skipinitialspace=True)\
           .dropna(how='all')
    try:
        df[0] = df[0].astype(int)
        df[1] = df[1].astype(int)
        df[2] = df[2].astype(int)
        df[3] = df[3].astype(int)
        report_path = 'UPC_Ready_for_Import'
        if not os.path.exists(report_path):
            os.makedirs(report_path)
        df.to_csv(os.path.join(report_path, csv + '_import.csv'), index=False)
    except KeyError:
        print('Error within file, please review files')
Assuming csvList contains the list of input file paths:
for csv in csvList:
    ....
    try:
        ...
    except KeyError:
        print('Error within file {}, please review files'.format(csv))
You could write something like this, I guess:
for csv in csvList:
    df = pd.read_csv(csv, header=0, skip_blank_lines=True, skipinitialspace=True)\
           .dropna(how='all')
    report_path = 'UPC_Ready_for_Import'
    file_name = os.path.join(report_path, csv + '_import.csv')
    try:
        df[0] = df[0].astype(int)
        df[1] = df[1].astype(int)
        df[2] = df[2].astype(int)
        df[3] = df[3].astype(int)
        if not os.path.exists(report_path):
            os.makedirs(report_path)
        df.to_csv(file_name, index=False)
    except KeyError:
        print('Error within file', file_name, ', please review files')
The main idea is to store the file name in a variable file_name and use it in the except block. Note that the assignment has to happen before the try block; otherwise, if one of the astype conversions raises the KeyError, file_name is never defined and the except block itself fails with a NameError.
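A small extension of the same idea, as a sketch (not part of the original answer): catching the exception object as well, so the message names the offending column in addition to the file:

for csv in csvList:
    df = pd.read_csv(csv, header=0, skip_blank_lines=True, skipinitialspace=True)\
           .dropna(how='all')
    try:
        df[0] = df[0].astype(int)
    except KeyError as e:
        # the exception object carries the missing key, so both file and column can be reported
        print('Error within file {}: missing column {}, please review files'.format(csv, e))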
I have the following code that reads some JSON files from a directory and returns them after some preprocessing. However, some of them are dicts, so they do not have the desired columns. As a result, I get back this error:
KeyError: "None of [Index(['aaa', 'xxx'], dtype='object')] are in the [columns]"
How can I ignore them and continue with the other JSON files? Perhaps with a try-except procedure?
import os, json
import pandas as pd

path_to_json = 'C:/Users/aaa/Desktop/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

def func(s):
    try:
        return eval(s)
    except:
        return dict()

list_of_df = []
for i in range(len(json_files)):
    file_name = json_files[i]
    df = pd.read_json(os.path.join(path_to_json, file_name), lines=True)
    df = df[['columnx']]
    df = df['columnx'].apply(func)
    df = pd.json_normalize(df)
    df = pd.DataFrame(df[["xxx", "aaa"]])
    list_of_df.append(df)

df = pd.concat(list_of_df)
df = df[['Index', 'xxx', 'aaa']]
df.head()
You have to add a try-except block inside your for loop that iterates over the JSON files.
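For example, a minimal sketch reusing the loop from the question (the column names and the func helper are the ones shown there):

list_of_df = []
for file_name in json_files:
    df = pd.read_json(os.path.join(path_to_json, file_name), lines=True)
    try:
        df = df[['columnx']]
        df = df['columnx'].apply(func)
        df = pd.json_normalize(df)
        df = pd.DataFrame(df[["xxx", "aaa"]])
    except KeyError:
        continue  # this file lacks the expected columns; skip it
    list_of_df.append(df)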
Dataset:
https://github.com/Bene939/newsheadlinedatasets
With my program I am labeling my dataset of news headlines. It worked fine until today.
For some reason it won't write the csv file anymore. As far as I can see, the data frame gets updated, though.
At around row 4469 of my csv, it started to not overwrite the csv file. Then it did. Then it didn't again, until it stopped overwriting completely at row 4474. It worked fine until now, and if I create a new csv it will overwrite that.
I am using Jupyter Notebook. Is there some kind of limit to this? The labeled dataset is around 300 KB.
!pip install pandas
!pip install pathlib
import pandas as pd
from pathlib import Path
# takes a data frame and file name & appends it to the given csv
def append_df(df, file_name):
    my_file = Path(file_name)
    if my_file.exists():
        print("Appending to existing file named " + file_name)
        orig_df = pd.read_csv(file_name)
        print("Old Data Frame: ")
        print(orig_df)
        new_df = pd.concat([orig_df, df], ignore_index=True).drop_duplicates()
        print("New Data Frame: ")
        print(new_df)
        new_df.to_csv(file_name, index=False, header=True, encoding='utf-8-sig')
    else:
        print("Creating new file named " + file_name)
        df.to_csv(file_name, index=False, header=True, encoding='utf-8-sig')

# takes a data frame and file name & overwrites the given csv
def update_csv(df, file_name):
    print("Overwriting " + file_name)
    df.to_csv(file_name, index=False, header=True, encoding='utf-8-sig')
# shows sentence by sentence, labels it according to input and saves it in a new csv file
print("WARNING: EDITING CSV FILE WITH EXCEL MAY CORRUPT FILE\n")
file_name = "news_headlines.csv"
new_file = "news_headlines_sentiment.csv"
news_sentiment_df = pd.DataFrame(columns=["news", "sentiment"])
my_file = Path(file_name)
if my_file.exists():
    df = pd.read_csv(file_name, encoding='utf-8-sig', error_bad_lines=False)
    print("Loaded " + file_name)
    for index, row in df.iterrows():
        user_input = -1
        valid_inputs = [0, 1, 2]  # renamed from `range`, which shadowed the builtin
        while user_input not in valid_inputs:
            print("####################################################################")
            print(row["news"])
            try:
                user_input = int(input("Negative: 0\nNeutral: 1\nPositive: 2\n"))
            except ValueError:
                print("\nPlease enter an Integer!\n")
        new_element = 0
        # label sentiment according to input
        if user_input == 0:
            new_element = [row["news"], 0]
        elif user_input == 1:
            new_element = [row["news"], 1]
        elif user_input == 2:
            new_element = [row["news"], 2]
        # save labeled sentence to new file
        news_sentiment_df.loc[len(news_sentiment_df)] = new_element
        append_df(news_sentiment_df, new_file)
        # delete data point from original data frame
        index_name = df[df["news"] == row["news"]].index
        df.drop(index_name, inplace=True)
        # update old csv file
        update_csv(df, file_name)
else:
    print("File not Found")
It turned out I was adding duplicate rows while using the drop_duplicates function without noticing it.
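A minimal sketch of what was happening, with hypothetical data: because the whole labeled DataFrame is re-appended on every iteration, concat followed by drop_duplicates collapses the re-added rows, so the csv on disk looks like it was never updated.

import pandas as pd

orig = pd.DataFrame({"news": ["a", "b"], "sentiment": [0, 1]})
# rows that are already in the file get appended again...
new = pd.DataFrame({"news": ["a", "b"], "sentiment": [0, 1]})
merged = pd.concat([orig, new], ignore_index=True).drop_duplicates()
print(merged)  # ...and dropped again, so merged is identical to orig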
I wrote a Python script to bulk upload files from a folder into PostgreSQL. While the script works, I do not think it's super efficient. Can anyone tell me how to improve it?
It takes a very long time for the files to actually be uploaded.
(Spacing/indentation is slightly off in this posting; it is not an issue in the actual script.)
def addFilesToDatabase(directory):
    uploadedFiles = []
    errorFiles = []
    rows_to_chunk_by = 1000
    for filename in os.listdir(directory):
        try:
            filename_used = filename.upper()
            if filename_used.endswith(".CSV"):
                file_directory = os.path.join(directory, filename)
                tableName = filename_used.replace('.CSV', '')
                df = pd.read_csv(file_directory, header=0, nrows=1)
                columns = df.columns
                while 1 == 1:
                    for skiprows in range(100000000):
                        if skiprows == 0:
                            df = pd.read_csv(file_directory, header=0, nrows=rows_to_chunk_by, skiprows=skiprows * rows_to_chunk_by)
                            df.to_sql(name=tableName, con=engine, if_exists='append', schema=None, index=False)
                        else:
                            df = pd.read_csv(file_directory, header=None, nrows=rows_to_chunk_by, skiprows=skiprows * rows_to_chunk_by)
                            df.columns = columns
                            df.to_sql(name=tableName, con=engine, if_exists='append', schema=None, index=False)
                        if len(df) < rows_to_chunk_by:
                            break
                    uploadedFiles.append(filename)
                    break
        except Exception as e:
            if str(e) == "No columns to parse from file":
                uploadedFiles.append(filename)
            elif str(e)[0:16] == "Length mismatch:":
                uploadedFiles.append(filename)
            else:
                errorFiles.append(filename)
                print('Error with ' + filename)
                print(e)
            continue
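One likely improvement, sketched below under the assumption that re-opening the file and re-skipping rows for every chunk is the bottleneck: let pandas do the chunking with read_csv's chunksize parameter, which returns an iterator of DataFrames and reads the file only once. file_directory, tableName, engine, and rows_to_chunk_by are reused from the question.

# a sketch: one pass over the file instead of one read per chunk
for chunk in pd.read_csv(file_directory, header=0, chunksize=rows_to_chunk_by):
    # method='multi' batches rows into fewer INSERT statements; worth benchmarking
    chunk.to_sql(name=tableName, con=engine, if_exists='append', index=False, method='multi')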
I am moving an application from a classic Tkinter GUI to a Django cloud-based application and am receiving a
ValueError: Invalid file path or buffer object type: <class 'bool'>
when trying to run a function which calls on pandas.
Exception Location: C:\Users\alfor\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\common.py in get_filepath_or_buffer, line 232
I have not tried much because I cannot find this same error in searches.
I do not believe this function even runs AT ALL, because my media folder is not getting a new directory where the file would be saved... but I could be wrong.
The beginning of the function that is having issues looks like this:
def runpayroll():
    man_name = 'Jessica Jones'
    sar_file = os.path.isfile('media/reports/Stylist_Analysis.xls')
    sar_file2 = os.path.isfile('media/reports/Stylist_Analysis.xls')
    tips_file = os.path.isfile('media/reports/Tips_By_Employee_Report.xls')
    hours_wk1_file = os.path.isfile('media/reports/Employee_Hours1.xls')
    hours_wk2_file = os.path.isfile('media/reports/Employee_Hours2.xls')
    retention_file = os.path.isfile('media/reports/SC_Client_Retention_Report.xls')
    efficiency_file = os.path.isfile('media/reports/Employee_Service_Efficiency.xls')
    df_sar = pd.read_excel(sar_file,
                           sheet_name=0, header=None, skiprows=4)
    df_sar2 = pd.read_excel(sar_file2,
                            sheet_name=0, header=None, skiprows=4)
    df_tips = pd.read_excel(tips_file,
                            sheet_name=0, header=None, skiprows=0)
    df_hours1 = pd.read_excel(hours_wk1_file,
                              header=None, skiprows=5)
    df_hours2 = pd.read_excel(hours_wk2_file,
                              header=None, skiprows=5)
    df_retention = pd.read_excel(retention_file, sheet_name=0,
                                 header=None, skiprows=8)
    df_efficiency = pd.read_excel(efficiency_file, sheet_name=0,
                                  header=None, skiprows=5)
The only code I have changed from the rest of this function is this, which I am assuming does not matter because it is only a file location:
writer = pd.ExcelWriter('/media/payroll.xlsx', engine='xlsxwriter')
and instead of asking the user for a file save location using tkinter, I used:
with open(file_path, 'rb') as f:
    response = HttpResponse(f, content_type=guess_type(file_path)[0])
    response['Content-Length'] = len(response.content)
    return response
Expected results are to open a few excel sheets, do some work on the dataframes, and spit out an excel sheet to the user.
I believe you need to change each file from:
sar_file = os.path.isfile('media/reports/Stylist_Analysis.xls')
to:
sar_file = 'media/reports/Stylist_Analysis.xls'
because, as the docs for os.path.isfile say:
Return True if path is an existing regular file. This follows symbolic links, so both islink() and isfile() can be true for the same path.
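If the existence check is still wanted, a small sketch (path reused from the question) that keeps the string path separate from the boolean:

sar_file = 'media/reports/Stylist_Analysis.xls'
if os.path.isfile(sar_file):
    # pass the string path to pandas, not the result of isfile()
    df_sar = pd.read_excel(sar_file, sheet_name=0, header=None, skiprows=4)
else:
    print('Missing report: ' + sar_file)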
I have a set of files that do not have any extension. They are currently stored in a folder that is referenced by this variable "allFiles".
allFiles = glob.glob(base2 + "/*")
I am trying to add an extension to each of the files in allFiles, appending .csv to the file name. I do it using the code below:
for file in allFiles:
    os.rename(os.path.join(base2, file), os.path.join(base2, file + '.csv'))
Next I try to append each of these csv files into one, as per the code below.
list_ = []
for file_ in allFiles:
    try:
        df = pd.read_csv(file_, index_col=None, header=None, delim_whitespace=True, error_bad_lines=False)
        list_.append(df)
    except pd.errors.EmptyDataError:
        continue
When I run the above code, I get an error stating that one of the files does not exist.
Error: FileNotFoundError: File b'/Users/base2/file1' does not exist
But file1 has now been renamed to file1.csv.
Could anyone advise as to where I am going wrong in the above? Thanks.
Update:
allFiles = glob.glob(base2 + "/*")
print(allFiles)
list_ = []
print(list_)
allFiles = [x + '.csv' for x in allFiles]
print(allFiles)
for file_ in allFiles:
    try:
        df = pd.read_csv(file_, index_col=None, header=None)
        list_.append(df)
    except pd.errors.EmptyDataError:
        continue
Error : FileNotFoundError: File b'/Users/base2/file1.csv' does not exist
Before running your loop, do:
allFiles = [x + '.csv' for x in allFiles]
EDIT for clarity:
for file in allFiles:
    os.rename(os.path.join(base2, file), os.path.join(base2, file + '.csv'))

###What you're adding###
allFiles = [x + '.csv' for x in allFiles]
########################

for file_ in allFiles:
    try:
Basically, the problem is that you're changing the file names on disk, but you're not changing the strings in your list to reflect the new file names. You can see this if you print allFiles. The line above makes the necessary change for you.
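An alternative sketch with the same effect, assuming nothing else in the folder ends in .csv: re-glob the directory after renaming, so the list always reflects what is actually on disk:

for file in allFiles:
    os.rename(os.path.join(base2, file), os.path.join(base2, file + '.csv'))

# re-scan the directory so the list matches the renamed files
allFiles = glob.glob(base2 + "/*.csv")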