How do I make a dataset that shows historic data from snapshots?
I have a CSV file that is updated and overwritten with new snapshot data once a day. I would like to write a Python script that regularly appends the current snapshot to a historic dataset.
One way I thought of was the following:
import pandas as pd
# Read csv-file
snapshot = pd.read_csv('C:/source/snapshot_data.csv')
# Try to read potential trend-data
try:
    historic = pd.read_csv('C:/merged/historic_data.csv')
    # Merge the two dfs and write back to historic file-path
    historic.merge(snapshot).to_csv('C:/merged/historic_data.csv')
except:
    snapshot.to_csv('C:/merged/historic_data.csv')
However, I don't like that I use a try/except block to read the historic data if the file path exists, or write the snapshot data to the historic path if it doesn't.
Does anyone know a better way of creating a trend dataset?
You can use the os module to check whether the file exists, and the mode argument of to_csv to append data to the file.
The code below will:
Read from snapshot.csv.
Check whether the historic.csv file exists.
Write the header only if historic.csv does not exist yet; otherwise the header row would be duplicated on every append.
Save the file. If the file already exists, the new data is appended to it instead of overwriting it.
import os
import pandas as pd
# Read snapshot file
snapshot = pd.read_csv("snapshot.csv")
# Check if historic data file exists
file_path = "historic.csv"
header = not os.path.exists(file_path)  # whether the header needs to be written
# Create or append to the historic data file
snapshot.to_csv(file_path, header=header, index=False, mode="a")
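If you run this script once a day (for example via cron or the Windows Task Scheduler), historic.csv will accumulate one snapshot per run.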
You could easily one-line it by utilising the mode parameter in `to_csv`:
import pandas as pd
pd.read_csv('snapshot.csv').to_csv('historic.csv', mode='a', index=False)
It will create the file if it doesn't already exist, or append to it if it does. Note that this also appends the header row on every run; if that matters, handle the header as in the previous answer.
What happens if you don't have a new snapshot file? You might want to wrap that in a try...except block; the Pythonic way is typically to ask forgiveness rather than permission.
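A minimal sketch of that, assuming the snapshot file may occasionally be missing:
import pandas as pd

try:
    pd.read_csv('snapshot.csv').to_csv('historic.csv', mode='a', index=False)
except FileNotFoundError:
    pass  # no new snapshot this run, so there is nothing to append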
I wouldn't even bother with an external library like pandas, as the standard library has all you need to append to a file.
with open('snapshot.csv', 'r') as snapshot_file:
    with open('historic.csv', 'a') as historic_file:
        for line in snapshot_file:
            historic_file.write(line)
I am creating an Azure Function. The function has a local CSV file, which I read via the code below:
import pandas as pd

csv = f'{context.function_directory}/new_csv.csv'
df = pd.read_csv(csv)
After reading the data, I want to make some changes to this CSV (like adding some columns). Please suggest how I can write the updated CSV/dataframe back to the same directory with the same name.
You just have to write back to the same file (or a new file):
csv = f'{context.function_directory}/new_csv.csv'
df = pd.read_csv(csv)
# make edits to the dataframe
# write to the existing csv, or pass a different path to create a new one
df.to_csv(csv, index=False)
Here is a link:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
I am trying to use openpyxl to open an Excel file, create a dataframe from filtered data in one of its sheets, and then write that data to an existing sheet in another file. I keep getting a "permission denied" error, I think because the way I'm referencing the dataframe in the append step somehow reopens the source file after I've closed it. So I'm wondering if there's a way to get the dataframe of source data into Python, close the source file, then open the destination file and write the dataframe to it. I apologize if that doesn't make sense; I'm pretty new to Python.
My code is below, and any suggestions or simplifications are welcome.
import glob
import os
import pandas as pd
import xlwings as xw
from xlwings.constants import DeleteShiftDirection

# Get latest source report using list
list_of_files_source = glob.glob(r'C:[my_path]/*')
latest_file_source = max(list_of_files_source, key=os.path.getctime)
# Load "Employee OT Data" sheet from workbook
file_source = pd.ExcelFile(latest_file_source)
df_source_Employee_OT = pd.read_excel(latest_file_source, 'Employee OT Data')
# Identify 6 most recent weeks (based on week ending date)
wk_end_source = pd.DataFrame(df_source_Employee_OT, columns = ['WEEK_ENDING']).drop_duplicates().apply(pd.to_datetime)
recent_wk_end_source = wk_end_source.sort_values(['WEEK_ENDING'], ascending=False).groupby('WEEK_ENDING').head(1)
recent_wk_end_source = recent_wk_end_source.head(6)
print(recent_wk_end_source)
# Filter source employee data for only 6 most recent weeks
df_source_Employee_OT = recent_wk_end_source.merge(df_source_Employee_OT, on='WEEK_ENDING', how='inner')
file_source.close()
# Make sure Excel instances are closed
os.system("taskkill /f /im EXCEL.exe")
# Load destination workbook, targeting 'SOURCEDATA' sheet
dst = r'C:[my_other_path]/Pivots.xlsm'
pivots = xw.Book(dst)
pivots_source_sheet = pivots.sheets['SOURCEDATA']
# Clear out old data from sheet
pivots_source_sheet.range('2:100000').api.Delete(DeleteShiftDirection.xlShiftUp)
# Save report and close
pivots.save(dst)
# Append with source data
with pd.ExcelWriter(dst, engine='openpyxl', mode='a') as writer:
    df_source_Employee_OT.to_excel(writer, sheet_name='SOURCEDATA', startrow=2)
pivots.save(dst)
pivots.close()
This part should not be needed, since the Excel app is never opened; Python just reads the data from the Excel file, not from the software itself.
os.system("taskkill /f /im EXCEL.exe")
You can pass the name of the excel file directly to pandas.read_excel and pandas.DataFrame.to_excel.
Check out the official documentation for pandas.read_excel and pandas.DataFrame.to_excel. The first returns a dataframe given the name of an Excel file, while the second is called on a dataframe and saves it to an Excel file given the target file name. That should be all you need for file I/O. If these functions do not work for some reason, please include the error message you are getting.
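For illustration, a minimal sketch of that round trip, assuming plain .xlsx files and hypothetical file names (an .xlsm macro workbook may need extra care, and the if_sheet_exists option needs a reasonably recent pandas):
import pandas as pd

# Read the source sheet into a dataframe; pandas opens and closes
# the file itself, so nothing stays open afterwards
df = pd.read_excel('source_report.xlsx', sheet_name='Employee OT Data')

# ...filter df down to the rows you want...

# Write the dataframe into the destination sheet; sheet_name must be
# a string, not a sheet object
with pd.ExcelWriter('Pivots.xlsx', engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='SOURCEDATA', index=False)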
I'm having trouble dropping columns and saving the new data frame as a CSV file.
Code:
import pandas as pd
file_path = 'Downloads/editor_events.csv'
df = pd.read_csv(file_path, index_col=False, nrows=1000)
df.to_csv(file_path, index=False)
df.to_csv(file_path)
The code executes without giving any errors, but I've looked in my root directory and can't see any new CSV file.
Check the file in the folder from which you are running the Python script. You are saving with the same name, so you can check the modified time to confirm it. Also, you are not dropping any columns in the posted code; you are just taking 1000 rows and saving them. A minimal sketch of dropping columns follows.
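For reference, a sketch of actually dropping columns and saving the result under a new name (the column names here are placeholders):
import pandas as pd

df = pd.read_csv('Downloads/editor_events.csv', nrows=1000)

# drop the unwanted columns ('col_a' and 'col_b' are placeholder names)
df = df.drop(columns=['col_a', 'col_b'])

# save under a new name so the result is easy to spot
df.to_csv('Downloads/editor_events_trimmed.csv', index=False)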
First: you are saving to the same file that you are reading from, so you won't see any new CSV file. All you are doing right now is rewriting the same file.
But since I can guess you just showed it as a simple example of what you want to do, I will move on to the second point:
Make sure that your path is correct. Try writing the full path, like 'C:\Users\AwesomeUser\Downloads\editor_events.csv', instead of just 'Downloads/editor_events.csv'.
I have data for N invoices in Excel, and I want to create a CSV of that data so that it can be imported whenever needed. How can I achieve this?
Assuming you have a folder "excel" full of Excel files within your project directory, and another folder "csv" where you intend to put the generated CSV files, you can easily batch-convert all the Excel files in the "excel" directory to "csv" using Pandas.
It is assumed that you already have Pandas installed on your system; otherwise, install it via pip install pandas. The commented snippet below illustrates the process:
import os
import pandas as pd

# Folder of source Excel files, and folder for the generated CSVs
excelDir = "excel"
csvDir = "csv"

# Loop through the folder excelDir; at each iteration, check whether the
# current file is an Excel file, and if it is, convert it to CSV and save it
for fileName in os.listdir(excelDir):
    # Do we have an Excel file?
    if fileName.endswith(".xls") or fileName.endswith(".xlsx"):
        # If we do, then we do the conversion using Pandas
        targetXLFile = os.path.join(excelDir, fileName)
        targetCSVFile = os.path.join(csvDir, fileName) + ".csv"
        # Now, we read in the Excel file
        dFrame = pd.read_excel(targetXLFile)
        # Once we are done reading, we can simply save the data to CSV
        dFrame.to_csv(targetCSVFile, index=False)
Hope this does the trick for you.
Cheers and good luck.
Instead of putting the whole output into one CSV, you could go with the following steps (a rough sketch follows):
Convert your Excel content to CSV files or CSV objects.
Tag each object with its invoice id and save it into a dictionary; the data structure could look like {'invoice-id': csv-object, 'invoice-id2': csv-object2, ...}.
Write a custom function which reads a CSV object and gives you name, product-id, qty, etc.
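A rough sketch of that idea, assuming the files live in an "excel" folder and that each sheet carries its invoice id in an 'invoice-id' column (both are assumptions; the actual layout depends on your sheets):
import os
import pandas as pd

# Build a dictionary of {'invoice-id': dataframe} from the Excel files
invoices = {}
for fileName in os.listdir('excel'):
    if fileName.endswith(('.xls', '.xlsx')):
        frame = pd.read_excel(os.path.join('excel', fileName))
        # assumed: every row of a file carries the same invoice id
        invoices[frame['invoice-id'].iloc[0]] = frame

def invoice_field(invoice_id, column):
    """Read a field such as 'name', 'product-id' or 'qty' from one invoice."""
    return invoices[invoice_id][column].tolist()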
Hope this helps.
I'm new to Python and having trouble dealing with Excel manipulation in Python.
So here's my situation: I'm using requests to get a .xls file from a web server. After that, I save the content to a temporary Excel file and read it with xlrd. I'm only interested in one value from each file, and there are thousands of files I'm retrieving from different URL addresses.
I want to know how I could handle the content I get from the request in some other way, rather than creating a new file each time.
I've included comments in my code on how I could improve it. As posted, it doesn't work, since I'm trying to save new content to an already created Excel file (and I couldn't figure out how to delete the contents of that file to make my code work, even inefficiently).
import requests
import xlrd
d = {}
for year in string_of_years:
    for month in string_of_months:
        dls = "http://.../name_year_month.xls"
        resp = requests.get(dls)
        output = open('temp.xls', 'wb')
        output.write(resp.content)
        output.close()
        workbook = xlrd.open_workbook('temp.xls')
        worksheet = workbook.sheet_by_name(mysheet_name)
        num_rows = worksheet.nrows
        for k in range(num_rows):
            if condition:  # the condition I'm looking for
                w = {key_year_month: worksheet.cell_value(k, 0)}
                d.update(w)
                break
xlrd.open_workbook can accept the raw file contents (via its file_contents argument) instead of a file name. Your code could pass the content of the XLS directly, rather than creating a file and passing its name.
Try this:
# UNTESTED
resp = requests.get(dls)
workbook = xlrd.open_workbook(file_contents=resp.content)
Reference: xlrd.open_workbook documentation
Or save it, and then delete the temp file at the end of each loop iteration with os:
import os

# ...your download and xlrd processing here...
os.remove('temp.xls')  # the temp file the loop writes on each pass