Why can't I get all looped dataframes into 1 dataframe? - python

I have a listing of 5 stock symbols in a .csv file. I am using the loop below to get options data for each of the symbols. Ideally, the output of all 5 symbols gets saved in an .xlsx file.
If I execute print(df_puts) I see all symbols in the dataframe. However, the output .xlsx file only has data from the last symbol in the .csv file. Basically it writes data from the last looped symbol and not from all symbols within the loop.
I'm new to pandas and Python in general. I'd like to understand why this happens, for future projects.
stocklist = pd.read_excel(filePath)
for i in stocklist.index:
    stock = str(stocklist["Symbols"][i])
    # df = pdr.get_data_yahoo(stock, start, now, threads=False)
    option_dict = options.get_options_chain(stock)
    # print(option_dict)
    df_puts = pd.DataFrame.from_dict(option_dict.get("puts"))
    df_calls = pd.DataFrame.from_dict(option_dict.get("calls"))

newFile = os.path.dirname(filePath1) + "/OptionsOutput.xlsx"
writer = ExcelWriter(newFile)
df_puts.to_excel(writer, "puts", float_format="%.3f")
df_calls.to_excel(writer, "calls", float_format="%.3f")
writer.save()

You need to save all the dataframes in a list and then concat them into a final dataframe. After that you can save it to an excel file.
stocklist = pd.read_excel(filePath)

call_df_arr = []  # Created new lists to save the dataframe for each stock
put_df_arr = []

for i in stocklist.index:
    stock = str(stocklist["Symbols"][i])
    # df = pdr.get_data_yahoo(stock, start, now, threads=False)
    option_dict = options.get_options_chain(stock)
    df_puts = pd.DataFrame.from_dict(option_dict.get("puts"))
    df_calls = pd.DataFrame.from_dict(option_dict.get("calls"))
    call_df_arr.append(df_calls)  # Append the per-stock DFs
    put_df_arr.append(df_puts)

final_call_df = pd.concat(call_df_arr)  # Concat the DFs
final_put_df = pd.concat(put_df_arr)

newFile = os.path.dirname(filePath1) + "/OptionsOutput.xlsx"
writer = ExcelWriter(newFile)
final_put_df.to_excel(writer, "puts", float_format="%.3f")  # Changed name of df to final_put_df
final_call_df.to_excel(writer, "calls", float_format="%.3f")
writer.save()
Edit: added comments explaining the code changes.
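A variation on the same idea, in case it helps: you can also tag every row with the symbol it came from, so the combined sheets stay traceable. A minimal sketch, assuming the same yahoo_fin-style options helper and pandas ExcelWriter as in the question:
all_puts, all_calls = [], []
for i in stocklist.index:
    stock = str(stocklist["Symbols"][i])
    option_dict = options.get_options_chain(stock)
    # .assign adds a Symbol column so each row records which stock it belongs to
    all_puts.append(pd.DataFrame.from_dict(option_dict.get("puts")).assign(Symbol=stock))
    all_calls.append(pd.DataFrame.from_dict(option_dict.get("calls")).assign(Symbol=stock))

final_put_df = pd.concat(all_puts, ignore_index=True)
final_call_df = pd.concat(all_calls, ignore_index=True)

# the with-block closes (and saves) the file on exit, so no explicit save() call
with pd.ExcelWriter(newFile) as writer:
    final_put_df.to_excel(writer, sheet_name="puts", float_format="%.3f")
    final_call_df.to_excel(writer, sheet_name="calls", float_format="%.3f")
Note that newer pandas versions deprecate writer.save() in favour of writer.close(), which the context manager handles for you.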

Related

Python: use the arguments in enumerate as variable names for the nested loop

I would like to generate two data frames (and subsequently export them to CSV) from two CSV files. I came up with the following (incomplete) code, which focuses on dealing with a.csv. I create an empty data frame (df_a) to store rows from the iterrows iteration (df_b is missing).
The problem is I do not know how to process b.csv without manually declaring all the variables for the empty dataframes in advance (i.e. df_a = pd.DataFrame(columns=['start', 'end']) and df_b = pd.DataFrame(columns=['start', 'end'])).
I hope I can use the arguments of enumerate (i.e. the content of file) as variable names (i.e. something like df_file) for the data frames (instead of df_a and df_b).
list_files = ["a.csv", "b.csv"]
for i, file in enumerate(list_files):
    df = pd.read_csv(file)
    # Create empty data frame to store data for each iteration below
    df_a = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_a = pd.concat([df_a, df_dicts], ignore_index=True)
    df_a_csv = df_a.to_csv('df_a.csv')
Ideally, it could look a bit like this (note: file is used as part of the variable name df_file):
list_files = ["a.csv", "b.csv"]
for i, file in enumerate(list_files):
    df = pd.read_csv(file)
    # Create empty data frame to store data for each iteration below
    df_file = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_file = pd.concat([df_file, df_dicts], ignore_index=True)
    df_file_csv = df_file.to_csv('df_' + file + '.csv')
Different approaches are also welcome. I just need to save the dataframe outcome for each input file. Many Thanks!
SomeFunction(var) aside, can you get the result you seek without pandas for the most part?
import csv

## -----------
## mocked
## -----------
def SomeFunction(var):
    # return a dict-like result so the sketch below actually runs
    return {"column1": None, "column2": None}
## -----------

list_files = ["a.csv", "b.csv"]

for file_path in list_files:
    with open(file_path, "r", newline="") as file_in:
        results = []
        for row in csv.DictReader(file_in):
            df_new = SomeFunction(row['name'])
            start, end = df_new['column1'], df_new['column2']
            results.append({"start": start, "end": end})

    with open(f"df_{file_path}", "w", newline="") as file_out:
        writer = csv.DictWriter(file_out, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)
Note that you can also stream rows from the input to the output if you would rather not read them all into memory.
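A sketch of that streaming variant, under the same mocked SomeFunction assumption: each input row is transformed and written immediately, so only one row needs to be in memory at a time.
import csv

for file_path in ["a.csv", "b.csv"]:
    with open(file_path, "r", newline="") as file_in, \
         open(f"df_{file_path}", "w", newline="") as file_out:
        writer = csv.DictWriter(file_out, fieldnames=["start", "end"])
        writer.writeheader()
        for row in csv.DictReader(file_in):
            df_new = SomeFunction(row["name"])  # mocked above
            writer.writerow({"start": df_new["column1"], "end": df_new["column2"]})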
There are many things we could comment on, but I understand that your concern is avoiding a separate hand-written loop for a and for b, given that you already list them in list_files.
If this is the issue, what about doing something like this?
# CHANGED: list only the stem of the base name; we will use it for many things
file_name_stems = ["a", "b"]
# CHANGED: we save the dataframes in a dictionary
dataframes = {}
# CHANGED: did you really need the enumerate?
for file_stem in file_name_stems:
    filename = file_stem + ".csv"
    df = pd.read_csv(filename)
    # Create empty data frame to store data for each iteration below
    # CHANGED: let's use df_x as a generic name. Knowing your code, you will surely find better names
    df_x = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_x = pd.concat([df_x, df_dicts], ignore_index=True)
    # CHANGED: and now, we write it to a file
    df_x.to_csv(f'df_{file_stem}.csv')
    # CHANGED: and keep the dataframe in a dictionary in case you need it later
    dataframes[file_stem] = df_x
So, instead of listing the exact filenames, you can list the stems of their names, and then compose the source filename and the output one.
Another option could be to list the source filenames and replace some part of the filename to generate the output filename:
list_files = ["a.csv", "b.csv"]
for filename in list_files:
# ...
output_file_name = filename.replace(".csv", "_df.csv")
# this produces "a_df.csv" and "b_df.csv"
Does any of this look to solve your problem? :)
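As a small usage note: the dictionary is also the direct answer to the "variable named after the file" wish. Instead of conjuring up a df_a variable dynamically, you look the frame up by its stem once the loop above has run:
# hypothetical usage after the loop has finished
df_for_a = dataframes["a"]
print(df_for_a.head())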

Reading multiple excel files in pyspark/pandas

I have a scenario where I need to read an excel spreadsheet with multiple sheets inside and process each sheet separately.
The sheets inside the excel workbook are named something like [sheet1, data1, data2, data3, summary, reference, other_info, old_records].
I need to read only the sheets [reference, data1, data2, data3].
I can hardcode the name reference, which is static every time, but the names data1, data2, data3 are not static: there may be data1 only, or data1 and data2, or it can be (e.g.) data1, data2, ... data(n).
Whatever the count of the sheets is, it will remain the same across all files (e.g. it is not allowed to have Data1, Data2 in one file and Data1, Data2, Data3 in another; just to clarify the requirement).
I can check the names using the following code:
readall = [key for key in pd.read_excel(path, sheet_name=None) if 'Data' in key]
for n in range(0, len(readall)):
    sheetname = readall[n]
    dfname = df_list[n]  # trying to create dynamic dataframes so that we can create separate tables at the end

for s in allsheets:
    sheetname = s
    data_df = readfile(path, s, "'Data1!C5'")  # function to read the excel file into a dataframe
    df_ref = readreference(path, s, "'Reference!A1'")
df_ref is the same for all sheets in a workbook, and data_df is joined with the reference. (Just adding this as info: there is some other processing that needs to be done as well, which I have already done.)
The above is sample code to read a particular sheet.
My problem is:
I have multiple excel files (around 100 files) to read.
Matching sheets from all files should be combined together, e.g. 'Data1' from file1 should be combined with Data1 from file2, Data1 from file3, and so on. Similarly, Data2 from all files should be combined into a separate dataframe (all sheets have the same columns).
Separate delta tables should be created for each tab, e.g. the table for Data1 should be something like Customers_Data1, the table for Data2 should be Customers_Data2, and so on.
Any help on this please?
Thanks
Resolved my issue with the following code.
final_dflist = []
sheet_names = [['Data1', 'CUSTOMER_DATA1'], ['Data2', 'CUSTOMER_DATA2']]
for shname in sheet_names:
    final_df = spark.createDataFrame([], final_schema)
    print(f'Initializing final df - record count: {final_df.count()}')
    sheetname = shname[0]
    dfname = shname[1]
    print(f'Processing: {sheetname}')
    for file in historic_files:
        fn = file.rsplit('/', 1)[-1]
        fpath = '/dbfs' + file
        print(f'Reading file: {fn} --> {sheetname}')
        indx_df = pd.read_excel(fpath, sheet_name='Index', skiprows=None)
        for index, row in indx_df.iterrows():
            if row[0] == 'Data Item Description':
                row_index = 'A' + str(index + 2)
        df_index = read_index(file, 'Index', row_index)
        df_index = df_index.selectExpr('`Col 1` as co_1', 'Col2', 'Col3', 'Col4')
        df_data = read_data(file, sheetname, 'A10')
        # Join Data with index here
        # Drop null columns from final df
        df = combine_df.na.drop("all")
    exec(f'{dfname} = final_df.select("*")')
    final_dflist.append([dfname, final_df])
    print(f'Data Frame created: {dfname}')
    print(f'final df - record count: {final_df.count()}')
Any suggestions to improve this?
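One suggestion: the exec() call that creates a variable named after dfname can be replaced with a dictionary keyed by dfname, which is easier to debug and avoids exec entirely. A sketch only, since read_index/read_data and the join are your own helpers, and the delta write assumes a Databricks-style environment:
final_dfs = {}
for sheetname, dfname in sheet_names:  # unpack the [sheet, table] pairs directly
    final_df = spark.createDataFrame([], final_schema)
    # ... same per-file reading and joining as above ...
    final_dfs[dfname] = final_df

# later, when creating the separate delta tables:
for dfname, df in final_dfs.items():
    df.write.format("delta").mode("overwrite").saveAsTable(dfname)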

Is there a way to export a list of 100+ dataframes to excel?

So this is kind of weird, but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to excel just overwrites the data in the file. I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, and I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
wb.save(outfile_path)  # without a save, the appended rows never reach the file on disk
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will list the entirety of the program below, so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what good Python practice actually looks like.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing the lists that I will store looped data into
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read the files into a dataframe; main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # lists to store where the headers are located in the df
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare: if str and matches search params, then do...
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)  # store the row number where it is in that data frame
                        column_list.append(name)  # store the column name where it is in that data frame
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turns the dataframe into a set of booleans where it is True if
    # there is something there
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            # I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store the actual dataframe values in my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into excel stacked one below the other.
So the comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concat'd to a dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export it using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to excel ended up with all of the dataframes combined into that single 569x52 sheet.
Hopefully this can help someone else out too.
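One small refinement on the final step, if it helps someone: growing a dataframe inside the loop re-copies everything accumulated so far on every iteration, while pd.concat can take the whole list in one call.
# one concat instead of 100+; drop ignore_index=True to keep each frame's own row index
to_xlsx_df = pd.concat(main_df, ignore_index=True)
to_xlsx_df.to_excel(outfile_path)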

Dropping columns in multiple excel spreadsheets

Is there a way in Python I can drop columns in multiple excel files? I.e. I have a folder with several xlsx files. Each file has about 5 columns (date, value, latitude, longitude, region). I want to drop all columns except date and value in each excel file.
Let's say you have a folder with multiple excel files:
import pandas as pd
from pathlib import Path

folder = Path('excel_files')
xlsx_only_files = list(folder.rglob('*.xlsx'))

def process_files(xls_file):
    # stem is an attribute in pathlib
    # that gets just the filename without the parent or the suffix
    filename = xls_file.stem
    # sheet_name=None ensures the data is read in as a dictionary;
    # this sets the sheet name as the key
    # usecols allows you to read in only the relevant columns
    df = pd.read_excel(xls_file, usecols=['date', 'value'], sheet_name=None)
    df_cleaned = [data.assign(sheetname=sheetname, filename=filename)
                  for sheetname, data in df.items()]
    return df_cleaned

combo = [process_files(xlsx) for xlsx in xlsx_only_files]
# process_files returns a list per file, so flatten before concatenating
final = pd.concat([frame for frames in combo for frame in frames], ignore_index=True)
Let me know how it goes
I suggest you define the columns you want to keep as a list and then select them as a new dataframe.
# after opening the excel file as
df = pd.read_excel(...)
keep_cols = ['date', 'value']
df = df[keep_cols]  # keep only the selected columns; this returns a dataframe
df.to_excel(...)
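To apply that to the whole folder from the question, a minimal sketch (assuming the files sit in a folder named excel_files, as in the other answer; point the output at a different path if you want to keep the originals):
import pandas as pd
from pathlib import Path

keep_cols = ['date', 'value']
for xls_file in Path('excel_files').glob('*.xlsx'):
    df = pd.read_excel(xls_file, usecols=keep_cols)  # read only date and value
    df.to_excel(xls_file, index=False)  # overwrites the original file in place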

How do I add a new column to a new .xlsx file when I combine 2 of them

Priority: I would like to create a new column when I combine 2 .xlsx files. I'm pretty new to Python, please help.
Secondly: I would also like to know how I can loop through the files in a folder? I am doing this hard-coded, but I would like to improve it and loop through every .xlsx file to create the result I want.
I tried to look for resources online, but couldn't find any.
excel1 = '1.xlsx'
excel2 = '2.xlsx'
excel3 = '3.xlsx'

df1 = pd.read_excel(excel1)
df2 = pd.read_excel(excel2)
df3 = pd.read_excel(excel3)

cols = ['Purchasing Document', 'Material', 'Quantity Received',
        'Still to be delivered (qty)', 'invoice', 'cancel']
values1 = df1[cols]
values2 = df2[cols]
values3 = df3[cols]

dataframes = [values1, values2, values3]
join = pd.concat(dataframes)
join.to_excel("testing123.xlsx")
The actual result right now only shows 4 columns, Purchasing Document through Qty; invoice and cancel give me an error.
I expect the result to show 6 columns: 4 of them filled with data, and invoice and cancel blank.
For reading multiple files from a folder and storing their data in an excel file with multiple sheets, you can try the code below:
import os
import pandas as pd

dirpath = "C:\\Users\\Path\\TO\\Your XLS folder\\data\\"
fileNames = os.listdir(dirpath)

writer = pd.ExcelWriter(dirpath + 'combined.xlsx', engine='xlsxwriter')
for fname in fileNames:
    df = pd.read_excel(dirpath + fname)
    print(df)
    df.to_excel(writer, sheet_name=fname)
writer.save()
I hope this helps with your second point.
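For the first point, a common pattern is to tag each frame with a new column identifying its source before concatenating. (If invoice and cancel raise a KeyError, the usual cause is that those headers are missing or spelled differently in one of the files.) A sketch, reusing the df1/df2/df3 names from the question:
# assumes df1, df2, df3 from the question are already loaded
dataframes = [df1.assign(source='1.xlsx'),
              df2.assign(source='2.xlsx'),
              df3.assign(source='3.xlsx')]
join = pd.concat(dataframes, ignore_index=True)
join.to_excel("testing123.xlsx", index=False)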
