Returning in a DataFrame - Python

Good morning.
I have a question regarding Python. I have an if/else where the else branch reads more than one file, and I need to save all the information it reads into a single DataFrame. Is there a way to do this?
The code I am using:
for idx, folder in enumerate(fileLista):
    if folder == 'filename_for_treatment':
        df1 = pd.read_excel(folder, sheet_name=sheetName[idx], skiprows=1)
        df1.columns = df1.columns.str.strip()
        tratativaUm = df1[[column information to be used]]
    else:
        df2 = pd.read_excel(folder, sheet_name=sheetName[idx], skiprows=1)
        df2.columns = df2.columns.str.strip()
        tratativaDois = df2[[column information to be used]]
#### assign result of each file received in the else
frames = [tratativaUm, tratativaDois]
titEmpresa = pd.concat(frames)
Can someone help me, is it possible to do this? Thanks

You can do it by appending your dataframes to a list, for example:
list_df_tratativaDois = []
for idx, folder in enumerate(fileLista):
    df = pd.read_excel(folder, sheet_name=sheetName[idx], skiprows=1)
    df.columns = df.columns.str.strip()
    if folder == 'filename_for_treatment':
        tratativaUm = df[[column information to be used]]
    else:
        list_df_tratativaDois.append(df[[column information to be used]])
titEmpresa = pd.concat([tratativaUm] + list_df_tratativaDois)
Note that instead of df1 and df2 you can create a single df, since the read_excel call is the same in both branches, and then do a different action on df depending on whether folder is the right one.

Related

How to iterate with For loops using Excel and Pandas

I am working on combining two Excel files that have the same columns but different values. I would like to convert all numbers into currency form ($ and commas). I've been able to do this but would like to find a simpler way to write the code.
Also, I need help with the output file. I cannot open it unless I close Python; it says "Cannot access this file" and is always syncing. Does anyone know any solutions?
Here is my code
import pandas as pd
import openpyxl
import xlsxwriter
outputfile = "Outputfile.xlsx"
excel_files = ["File1.xlsx",
"File2.xlsx"]
def combine_excel(excel_files, sheet_name):
sheet_frames = [pd.read_excel(x, sheet_name=sheet_name) for x in excel_files]
combined_df = pd.concat(sheet_frames).reset_index(drop=True)
return combined_df
df1 = combine_excel(excel_files, 0)
df2 = combine_excel(excel_files, 1)
df3 = combine_excel(excel_files, 2)
df4 = combine_excel(excel_files, 3)
df5 = combine_excel(excel_files, 4)
df6 = combine_excel(excel_files, 5)
df7 = combine_excel(excel_files, 6)
for x in df1.iloc[:,[10,11,12,13,14,15,16,17,18,19,20,26,27,28,29,30]]:
    df1[x] = df1[x].apply(lambda x: f"${x:,.0f}")
for x in df2.iloc[:,[10,11,12,13,14,15,16,17,18,19,20,26,27,28,29,30]]:
    df2[x] = df2[x].apply(lambda x: f"${x:,.0f}")
.
.
.
.
.
.
writer = pd.ExcelWriter(outputfile, engine='xlsxwriter')
df1.to_excel(writer, sheet_name ='Column1', index = False)
df2.to_excel(writer, sheet_name='Column2', index = False)
df3.to_excel(writer, sheet_name='Column3', index = False)
df4.to_excel(writer, sheet_name='Column4', index = False)
df5.to_excel(writer, sheet_name='Column5', index = False)
df6.to_excel(writer, sheet_name='Column6', index = False)
df7.to_excel(writer, sheet_name ='Column7', index = False)
writer.save()
As you can see, I would like to make this part simpler to read and write:
for x in df1.iloc[:,[10,11,12,13,14,15,16,17,18,19,20,26,27,28,29,30]]:
    df1[x] = df1[x].apply(lambda x: f"${x:,.0f}")
for x in df2.iloc[:,[10,11,12,13,14,15,16,17,18,19,20,26,27,28,29,30]]:
    df2[x] = df2[x].apply(lambda x: f"${x:,.0f}")
.
.
.
.
.
.
There is a total of 12 lines of code just to convert a number of columns into currency form. Is there a way to do this with 2 lines of code? Also, the reason there are multiple df(s) is that I am combining 6 sheets within each Excel file.
I can't test this, but this simplification by refactoring should work:
# instead of df1 = ..., df2 = ..., etc., store them in a list
combined_frames = [combine_excel(excel_files, i) for i in range(7)]
# instead of explicitly enumerating every column index, build the list from ranges;
# instead of applying to each column individually, use applymap to apply to
# all cells in that slice of the dataframe
currency_cols = list(range(10, 21)) + list(range(26, 31))
for i, df in enumerate(combined_frames):
    combined_frames[i].iloc[:, currency_cols] = df.iloc[:, currency_cols].applymap(lambda x: f"${x:,.0f}")
writer = pd.ExcelWriter(outputfile, engine='xlsxwriter')
# instead of exporting each individual df, export them in a loop,
# dynamically setting the sheet_name
for i, df in enumerate(combined_frames, start=1):
    df.to_excel(writer, sheet_name=f'Column{i}', index=False)
writer.save()
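On the second issue (the output file that cannot be opened until Python is closed): one common cause is that the writer is never closed, for example if an error occurs before writer.save() runs, so the Python process still holds the file. A small sketch, using the same frames and sheet names as above, where pd.ExcelWriter is used as a context manager so the file is always saved and released:
# writing inside a with-block guarantees the writer is saved and closed,
# even if one of the exports raises an error
with pd.ExcelWriter(outputfile, engine='xlsxwriter') as writer:
    for i, df in enumerate(combined_frames, start=1):
        df.to_excel(writer, sheet_name=f'Column{i}', index=False)
# once the block exits, Outputfile.xlsx is fully written and no longer locked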

Loop through old and new versions of files

I am trying to create a .csv containing records that are different between old and new csv files. I have successfully accomplished this with a single such pair using
old_df = 'file1_old.csv'
new_df = 'file1_new.csv'
df1 = pd.read_csv(old_df)
df2 = pd.read_csv(new_df)
df1['flag'] = 'old'
df2['flag'] = 'new'
df = pd.concat([df1, df2])
dups_dropped = df.drop_duplicates(df.columns.difference(['flag']), keep=False)
dups_dropped.to_csv('difference.csv', index=False)
I am struggling to wrap my mind around how to scale this with a loop to output a csv for each new/old pairing, given that the input file names follow the same convention, for instance:
file1_new, file1_old
file2_new, file2_old
file3_new, file3_old
so that the output is
file1_difference.csv
file2_difference.csv
file3_difference.csv
Thoughts? Much appreciated
Using a simple for loop with f-strings to help format the filenames should work:
for i in range(1, 11):  # replace 11 with the number of file pairs you have + 1
    old_df = f'file{i}_old.csv'
    new_df = f'file{i}_new.csv'
    df1 = pd.read_csv(old_df)
    df2 = pd.read_csv(new_df)
    df1['flag'] = 'old'
    df2['flag'] = 'new'
    df = pd.concat([df1, df2])
    dups_dropped = df.drop_duplicates(df.columns.difference(['flag']), keep=False)
    dups_dropped.to_csv(f'file{i}_difference.csv', index=False)
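If the number of pairs is not known in advance, a glob-based variant of the same loop can derive the pairs from the files that actually exist; a sketch, assuming the *_old.csv / *_new.csv files live in the working directory:
import glob
import pandas as pd

for old_name in glob.glob('file*_old.csv'):
    new_name = old_name.replace('_old.csv', '_new.csv')
    df1 = pd.read_csv(old_name)
    df2 = pd.read_csv(new_name)
    df1['flag'] = 'old'
    df2['flag'] = 'new'
    df = pd.concat([df1, df2])
    dups_dropped = df.drop_duplicates(df.columns.difference(['flag']), keep=False)
    dups_dropped.to_csv(old_name.replace('_old.csv', '_difference.csv'), index=False)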

Pandas - Loop through sheets

I have 5 sheets and created a script that does numerous formatting steps; I tested it per sheet, and it works perfectly.
import numpy as np
import pandas as pd
FileLoc = r'C:\T.xlsx'
Sheets = ['Alex','Elvin','Gerwin','Jeff','Joshua',]
df = pd.read_excel(FileLoc, sheet_name= 'Alex', skiprows=6)
df = df[df['ENDING'] != 0]
df = df.head(30).T
df = df[~df.index.isin(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'ENDING', 3])]
df.index.rename('STORE', inplace=True)
df['index'] = df.index
df2 = df.melt(id_vars=['index', 2, 0, 1], value_name='SKU')
df2 = df2[df2['variable']!= 3]
df2['SKU2'] = np.where(df2['SKU'].astype(str).fillna('0').str.contains('ALF|NOB|MET'),df2.SKU, None)
df2['SKU2'] = df2['SKU2'].ffill()
df2 = df2[~df2[0].isnull()]
df2 = df2[df2['SKU'] != 0]
df2[1] = pd.to_datetime(df2[1]).dt.date
df2.to_excel(r'C:\test.xlsx', index=False)
but when I assigned the list via sheet_name=Sheets it always produced the error KeyError: 'ENDING'. This is the part of the code:
Sheets = ['Alex','Elvin','Gerwin','Jeff','Joshua',]
df = pd.read_excel(FileLoc,sheet_name='Sheets',skiprows=6)
Is there a proper way to do this, like looping?
My expected result is to execute the formatting that I have created and consolidate it into one excel file.
NOTE: All sheets have the same format.
When using the read_excel method, if you give the parameter sheet_name=None, it will give you an OrderedDict with the sheet names as keys and the corresponding DataFrames as the values. So you can apply this and loop through the dictionary using .items().
The code would look something like this,
dfs = pd.read_excel('your-excel.xlsx', sheet_name=None)
for key, value in dfs.items():
    # apply logic to value
If you wish to combine the data in the sheets, you could use .append(). We can append the data after the logic has been applied to the data in each sheet. That would look something like this,
combined_df = pd.DataFrame()
dfs = pd.read_excel('your-excel.xlsx', sheet_name=None)
for key, value in dfs.items():
    # apply logic to value, which is a DataFrame
    combined_df = combined_df.append(value)
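Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea is usually written by collecting the per-sheet frames in a list and concatenating once:
frames = []
dfs = pd.read_excel('your-excel.xlsx', sheet_name=None)
for key, value in dfs.items():
    # apply logic to value, which is a DataFrame
    frames.append(value)
combined_df = pd.concat(frames, ignore_index=True)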

Pandas Dataframe Styles are not working with Jupyter Notebook

I have a dataframe which is constructed from a list and another dataframe that I read from an Excel file.
What I want to do is apply a background color to the first row of the dataframe, which I then export to Excel.
The code below does the job correctly as expected (there is no issue with the data).
The issue is that the style I have applied to the dataframe is not reflected in the Excel sheet. I am using Jupyter Notebook.
Please suggest a way to get the styles into Excel.
import pandas as pd
sheet1 = r'D:\dinesh\input.xlsx'
sheet2 = "D:\dinesh\lookup.xlsx"
sheet3 = "D:\dinesh\Output.xlsx"
sheetname = 'Dashboard Índice (INPUT)'
print('Started extracting the Crime Type!')
df1 = pd.read_excel(sheet1,sheet_name = 'Dashboard Índice (INPUT)',skiprows=10, usecols = 'B,C,D,F,H,J,L,N,P', encoding = 'unicode_escape')
crime_type = list(df1.iloc[:0])[3:]
print(f'crime_types : {crime_type}')
df1 = (df1.drop(columns=crime_type,axis=1))
cols = list(df1.iloc[0])
print(f'Columns : {cols}')
df1.columns = cols
df1 = (df1[1:]).dropna()
final_data = []
for index, row in df1.iterrows():
    sheetname = (f'{row[cols[1]]:0>2d}. {row[cols[0]]}')
    cnty_cd = [row[cols[0]], row[cols[1]], row[cols[2]]]
    wb = pd.ExcelFile(sheet2)
    workbook = ''.join([workbook for workbook in wb.sheet_names if workbook.upper() == sheetname])
    if workbook:
        df2 = pd.read_excel(sheet2, sheet_name=workbook, skiprows=7, usecols='C,D,H:T', encoding='unicode_escape')
        df2_cols = list(df2.columns)
        final_cols = cols + df2_cols
        df2 = df2.iloc[2:]
        df2 = df2.dropna(subset=[df2_cols[1]])
        for index2, row2 in df2.iterrows():
            if row2[df2_cols[1]].upper() in crime_type:
                s1 = pd.Series(cnty_cd)
                df_rows = (pd.concat([s1, row2], axis=0)).to_frame().transpose()
                final_data.append(df_rows)
                break
    else:
        print(f'{sheetname} does not exist!')
df3 = pd.concat(final_data)
df3.columns = final_cols
df_cols = (pd.Series(final_cols, index=final_cols)).to_frame().transpose()
df_final = (pd.concat([df_cols,df3], axis=0, ignore_index=True, sort=False))
df_final.style.apply(lambda x: ['background: blue' if x.name==0 else '' for i in x], axis=1)
df_final.to_excel(sheet3, sheet_name='Crime Details',index=False,header = None)
print(f'Successfully created the output file at {sheet3}!')
You need to export the styled dataframe to Excel, not the unstyled one, so you either need to chain the styling and the export together, similar to what is shown in the documentation here, or assign the styled dataframe to a variable and use that for the export.
The latter could look like this based on your code:
df_styled = df_final.style.apply(lambda x: ['background: blue' if x.name == 0 else '' for i in x], axis=1)
df_styled.to_excel(sheet3, sheet_name='Crime Details', index=False, header=None, engine='openpyxl')
As described here you need either the OpenPyXL or XlsxWriter engines for export.
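For completeness, the chained form mentioned above would look roughly like this (same styling function, exporting straight from the Styler):
df_final.style.apply(
    lambda x: ['background: blue' if x.name == 0 else '' for i in x], axis=1
).to_excel(sheet3, sheet_name='Crime Details', index=False, header=None, engine='openpyxl')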

How to create variables and read several excel files in a loop with pandas?

L = [('X1', "A"), ('X2', "B"), ('X3', "C")]
for i in range(len(L)):
    path = os.path.join(L[i][1] + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    ''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
File "<ipython-input-1-6220ffd8958b>", line 6
''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
^
SyntaxError: can't assign to function call
I have a problem with pandas: I want to create several dataframes from several Excel files, but I don't know how to create the variables.
I need a result that looks like this:
X1 will have dataframe of A.xlsx
X2 will have dataframe of B.xlsx
.
.
.
Solved:
d = {}
for i, value in L:
    path = os.path.join(value + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    df = pd.read_excel(xls, 'Sheet1')
    key = 'df-' + str(i)
    d[key] = df
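With the keys built that way, each frame can be pulled back out by name, for example (based on the tuples in L above):
X1 = d['df-X1']   # the DataFrame read from A.xlsx
X2 = d['df-X2']   # the DataFrame read from B.xlsx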
Main pull:
I would approach this by reading everything into 1 dataframe (loop over files, and concat):
import os
import pandas as pd
files = []  # generate list for files to go into
path_of_directory = "path/to/folder/"
for dirname, dirnames, filenames in os.walk(path_of_directory):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))
output_data = []  # blank list for building up dfs
for name in files:
    df = pd.read_excel(name)
    df['name'] = os.path.basename(name)
    output_data.append(df)
total = pd.concat(output_data, ignore_index=True, sort=True)
Then:
From there you can interrogate the combined frame with total.loc[total['name'] == 'choice']
Or (in keeping with your question):
You could then split it into a dictionary of dataframes based on this column. This is the best approach...
import copy

dictionary = {}
df = total            # the combined frame built above
column = 'name'       # the filename column added in the loop above
df[column] = df[column].astype(str)
col_values = df[column].unique()
for value in col_values:
    key_name = 'df' + str(value)
    dictionary[key_name] = copy.deepcopy(df)
    dictionary[key_name] = dictionary[key_name][df[column] == value]
    dictionary[key_name].reset_index(inplace=True, drop=True)
The reason for this approach is discussed here:
Create new dataframe in pandas with dynamic names also add new column, which basically says that dynamic naming of dataframes is bad and this dict approach is best.
This might help.
files_xls = ['all your excel filenames go here']
df = pd.DataFrame()
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
print(df)
