How to append a row to a dataframe with a loop? - python

I've written a function to transform an Excel sheet and take only one row from the monthly data. Each month I'll have the data on a new Excel sheet.
I've made this:
import pandas as pd

def bocapago(nombre):
    path = '/content/drive/MyDrive/Fundacion Frontera Economica/Muni/python/inputs/BOCAS DE PAGO'
    filename = path + "/" + nombre.upper() + '.xlsx'
    input_cols = [0, 1, 2, 3]  # Columns to import
    df = pd.read_excel(filename,
                       header=0,
                       usecols=input_cols,
                       index_col=False,
                       )
    df.columns = ['n_tasa', 'Fecha', 'Lugar', 'Importe']
    df['Fecha'] = pd.to_datetime(df['Fecha'])  # assign the result, otherwise the conversion is discarded
    df['Periodo'] = nombre
    df['Periodo'] = df['Periodo'].str[:3] + "-" + df['Periodo'].str[-4:]
    df = pd.pivot_table(df, values='Importe', index='Periodo', columns='Lugar', aggfunc='sum')
    df = df.assign(Total=df.sum(axis=1))
    df = df.rename(columns={'Total': 'TOTAL GENERAL'})
    return df
That is the function to read and process the sheet. Then I did this as a second step:
ENERO1 = bocapago('ENERO2021')
FEBRERO1 = bocapago('FEBRERO2021')
MARZO1 = bocapago('MARZO2021')
MAYO1 = bocapago('MAYO2021')
ingxboca = [ENERO1, FEBRERO1, MARZO1, MAYO1]
ingxboca = pd.concat(ingxboca)
ingxboca = ingxboca.merge(ingresos['TOTAL IACM'], how='left', on='Periodo')
ingxboca['DIFERENCIA'] = ingxboca['TOTAL IACM'] - ingxboca['TOTAL GENERAL']
ingxboca.head()
I use another dataframe called "ingresos" in this case for the merge.
My doubt is how I can write a for or while loop for the second step, so I can include all of it inside the "bocapago" function or make another function like "finishing".

I would keep bocapago as its own function, just like you've done, and have the second function call it. That keeps the complexity of each function lower and makes code reuse easier in the future. If I understood your question correctly, would this work?
def new_function(file_list: list):
    ingxboca = pd.concat([bocapago(f) for f in file_list])
    ingxboca = ingxboca.merge(ingresos['TOTAL IACM'], how='left', on='Periodo')
    ingxboca['DIFERENCIA'] = ingxboca['TOTAL IACM'] - ingxboca['TOTAL GENERAL']
    return ingxboca  # return the full frame; call .head() at the call site if you just want a preview
I'm not sure if that answers the question or not. If so, I imagine the list comprehension in the first line did it. Keep in mind you can add if statements to a list comprehension. You can also pass in a string and use something like glob to give you a file list with rules in it.
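For instance, a minimal sketch of the glob idea (the folder argument and the new_function_from_folder name are placeholders, not from the original post):

import glob
import os

def new_function_from_folder(folder: str):
    # bocapago() builds the full path from a bare name, so strip
    # the directory and the .xlsx extension from each match.
    paths = glob.glob(os.path.join(folder, '*.xlsx'))
    names = [os.path.splitext(os.path.basename(p))[0] for p in paths]
    return new_function(names)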

Is there a "cleaner" way to write this code?

So, just messing around with Pandas for the first time. Curious, specifically about the variables in my code: does it make sense to keep iterating with "df#", or should I just keep rewriting "df"? Or is there a more elegant way that I'm missing?
def func(csvfile):
    df = pd.read_csv(csvfile)
    df.columns = df.columns.str.replace(" ", "_")
    df2 = df.assign(column3=df.column3.str.split(",")).explode("column3")
    df3 = df2.assign(column2=df.column2.str.split("; ")).explode("column2")
    df3["column2"] = df3["column2"].str.replace(r"\(\d+\)", "", regex=True)
    df4 = df3[df3["column2"].str.contains("value2") == False]
    print(df4)
Taking a complete shot in the dark since you're unable to provide anything to work with, but I'd bet that this does the same:
def func(csvfile):
    df = pd.read_csv(csvfile)
    df.columns = df.columns.str.replace(" ", "_")
    df.column2 = df.column2.str.split("; ")
    df.column3 = df.column3.str.split(",")
    df = df.explode(['column2', 'column3'])  # Or maybe explode them one at a time? I have no idea what you're doing.
    df.column2 = df.column2.str.replace(r"\(\d+\)", "", regex=True)
    df = df[~df.column2.str.contains("value2")]
    return df

df = func(csvfile)
print(df)
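One caveat: exploding two columns in a single call requires pandas 1.3 or newer, and the lists in both columns must have the same length in every row. If they can differ, explode them one at a time as the original code does.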

How to use a function twice?

I have to use the same function twice: the first time the argument is df, the second time it is df3. How do I do that? The function:
def add(df, df3):
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.groupby(pd.Grouper(key="timestamp", freq="h")).agg("mean")
    price = df["price"]
    amount = df["amount"]
    return (price * amount) // amount
The two uses:
out = []
# This loop will use the add(df) function for every csv and append the result to a list
for f in csv_files:
    df = pd.read_csv(f, header=0)
    # Replace empty values with numpy NaN; not sure if useful, maybe pandas can handle this
    df = df.replace("", np.nan)  # assign the result, since replace() is not in-place by default
    # add the aggregated DataFrame with the new column to the list of DataFrames
    out.append(add(df))

out2 = []
df3 = pd.Series(dtype=np.float64)
for f in csv_files:
    df2 = pd.read_csv(f, header=0)
    df3 = pd.concat([df3, df2], ignore_index=True)
out2 = pd.DataFrame(add(df=df3))
out2
I got the error:
TypeError: add() missing 1 required positional argument: 'df3'
The parameter names of the add function have nothing to do with the variable names df and df3 in the rest of the script.
As @garagnoth has stated, you only need one parameter in add. You can call it df, foo or myvariablename: it is related to neither df nor df3.
In your case, you can change the add function to the following:
def add(a_dataframe):
    # I set the argument name to "a_dataframe" so you can
    # see its name is not linked to outside variables
    a_dataframe["timestamp"] = pd.to_datetime(a_dataframe["timestamp"])
    a_dataframe = a_dataframe.groupby(pd.Grouper(key="timestamp", freq="h")).agg("mean")
    price = a_dataframe["price"]
    amount = a_dataframe["amount"]
    return (price * amount) // amount
You can now call this function with df or df3 as the rest of the script already does.
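For example, both call sites then look like this (a sketch reusing the names from your script):

out.append(add(df))            # first loop: one aggregated result per csv
out2 = pd.DataFrame(add(df3))  # second part: the concatenated frame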

Python pandas DF question: trying to drop a column but nothing happens (df.drop) - code also runs without any errors

I am trying to delete a column called Rank but nothing happens. The remaining code all executes without any issue but the column itself remains in the output file. I've highlighted the part of the code that is not working.
def read_csv():
    file = "\mona" + yday + ".csv"
    #df=[]
    df = pd.read_csv(save_path + file, skiprows=3, encoding="ISO-8859-1", error_bad_lines=False)
    return df

# replace . with / in column EPIC
def tickerchange():
    df = read_csv()
    df['EPIC'] = df['EPIC'].str.replace('.', '/')
    return df

def consolidate_AB_listings():
    df = tickerchange()
    Aline = df.loc[(df['EPIC'] == 'RDSA'), 'Mkt Cap (àm)']
    Bline = df.loc[(df['EPIC'] == 'RDSB'), 'Mkt Cap (àm)']
    df.loc[(df['EPIC'] == 'RDSA'), 'Mkt Cap (àm)'] = float(Aline) + float(Bline)
    df = df.loc[(df.Ind != 'I/E')]
    df = df.loc[(df.Ind != 'FL')]
    df = df.loc[(df.Ind != 'M')]
    df = df.loc[(df.EPIC != 'RDSB')]
    return df

def ranking_mktcap():
    df = consolidate_AB_listings()
    df['Rank'] = df['Mkt Cap (àm)'].rank(ascending=False)
    df = df.loc[(df.Rank != 1)]
    df['Rank1'] = df['Mkt Cap (Em)'].rank(ascending=False)
    ## This doesn't seem to work
    df = df.drop(df['Security'], 1)
    return df

def save_outputfile():
    #df = drop()
    df = ranking_mktcap()
    df.to_csv(r'S:\Index_Analytics\UK\Index Methodology\FTSE\Py_file_download\MonitoredList.csv', index=False)
    print("finished")

if __name__ == "__main__":
    read_csv()
    tickerchange()
    consolidate_AB_listings()
    ranking_mktcap()
    save_outputfile()
DataFrame.drop() takes the following: DataFrame.drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise').
When you call df = df.drop(df['Security'], 1), the contents of df['Security'] are used as the labels to drop, and the 1 is passed through as the axis parameter.
If you want to drop the column 'Security' then you'd want to do:
df = df.drop('Security', axis=1)
# this is same as
df = df.drop(labels='Security', axis=1)
# you can also specify the column name directly, like this
df = df.drop(columns='Security')
Note: the columns= parameter can take a single label (str) like above, or a list of column names.
Try replacing
df = df.drop(df['Security'], 1)
with
df.drop(['Security'], axis=1, inplace=True)
I had the same issue and all I did was add inplace = True.
So it will be df.drop('Security', axis=1, inplace=True). Note that with inplace=True the call returns None, so don't assign the result back to df.

Script keeps showing "SettingWithCopyWarning"

Hello, my problem is that my script keeps showing the message below:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
downcast=downcast
I searched Google for a while regarding this, and it seems like my code is somehow assigning a sliced dataframe to a new variable, which is problematic.
The problem is I can't find where my code gets problematic.
I tried the copy function and separated the nested functions, but it is not working.
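For context, the pattern that typically triggers this warning looks something like this (a minimal sketch, not my actual data):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
subset = df[df["a"] > 1]       # subset may be a view of df
subset["b"] = 0                # warning: writing to a possible copy
safe = df[df["a"] > 1].copy()  # an explicit copy avoids the warning
safe["b"] = 0                  # no warning: safe owns its data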
I attached my code below.
from operator import gt, lt
import numpy as np

def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
    ops = {">": gt, "<": lt}
    col_get = str(col_get)
    value_get = int(value_get)
    if methods_get == "|x|":  # use == for string comparison, not "is"
        new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
    else:
        new_file = file_get[ops[operator_get](file_get[col_get], np.percentile(file_get[col_get], value_get))]
    return new_file
Basically, what I was about to do was make a Flask API that gets an Excel file as input and returns a CSV file with some filtering. So I defined some functions first.
def get_brandlist(df_input, brand_input):
    if brand_input == "default":
        final_list = (pd.unique(df_input["브랜드"])).tolist()
    else:
        final_list = brand_input.split("/")
    if '브랜드' in final_list:
        final_list.remove('브랜드')
    final_list = [x for x in final_list if str(x) != 'nan']
    return final_list
Then I defined the main function
def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
    # // 2-1 // remove unnecessary rows and columns with na values
    # (axis=0 & 1 evaluates to axis=0, so drop along each axis explicitly)
    df_data = df_data.dropna(axis=0, how='all')
    df_data = df_data.dropna(axis=1, how='all')
    df_data.fillna(method='pad', inplace=True)
    # // 2-2 // iterate over all rows to find which row contains the brand value
    default_number = 0
    for row in df_data.itertuples():
        if '브랜드' in row:
            df_data.columns = df_data.iloc[default_number, :]
            break
        else:
            default_number = default_number + 1
    # // 2-3 // create the list that contains all the target brand names
    brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)
    # // 2-4 // subset the target brands into another dataframe
    df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]
    # // 2-5 // split the dataframe based on the brand name, and apply the input condition
    df_per_brand = {}
    df_per_brand_modified = {}
    for brand_each in brand_list:
        df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
        file = df_per_brand[brand_each].copy()
        df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name, methods_get=methods,
                                                         operator_get=operator, value_get=value)
    # // 2-6 // merge all the remaining dataframes
    # (DataFrame.append is deprecated, so collect the frames and concat instead)
    df_merged = pd.concat([df_per_brand_modified[brand_each] for brand_each in brand_list],
                          ignore_index=True)
    final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
    return final_df
I am going to import this function in my app.py later.
I am quite new to coding, so I am really sorry if my code is hard to understand, but I just really want to get rid of this annoying warning message. Thanks for the help in advance :)

How to open the Excel file created from pandas faster?

The Excel file created from Python is extremely slow to open, even though the file size is only about 50 MB.
I have tried both pandas and openpyxl.
def to_file(list_report, list_sheet, strip_columns, Name):
    i = 0
    wb = ExcelWriter(path_output + '\\' + Name + dateformat + '.xlsx')
    while i <= len(list_report) - 1:
        try:
            df = pd.DataFrame(pd.read_csv(path_input + '\\' + list_report[i] + reportdate + '.csv'))
            for column in strip_column:
                try:
                    df[column] = df[column].str.strip('=("")')
                except:
                    pass
            df = adjust_report(df, list_report[i])
            df = df.apply(pd.to_numeric, errors='ignore', downcast='integer')
            df.to_excel(wb, sheet_name=list_sheet[i], index=False)
        except:
            print('Missing report: ' + list_report[i])
        i += 1
    wb.save()
Is there any way to speed it up?
idiom
Let us rename list_report to reports.
Then your while loop is usually expressed as simply: for i in range(len(reports)):
You access the i-th element several times. The loop could bind that for you, with: for i, report in enumerate(reports):.
But it turns out you never even need i. So most folks would write this as: for report in reports:
code organization
This bit of code is very nice:
for column in strip_column:
try:
df[column] = df[column].str.strip('=("")')
except:
pass
I recommend you bury it in a helper function, using def strip_punctuation.
(The list should be plural, I think? strip_columns?)
Then you would have a simple sequence of df assignments.
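A sketch of what that might look like (strip_punctuation is the hypothetical helper named above; the other names are taken from your code):

def strip_punctuation(df, strip_columns):
    # Strip the '=("")' wrapper from each requested column, if present.
    for column in strip_columns:
        try:
            df[column] = df[column].str.strip('=("")')
        except (KeyError, AttributeError):
            pass
    return df

def to_file(reports, sheets, strip_columns, name):
    wb = ExcelWriter(path_output + '\\' + name + dateformat + '.xlsx')
    for report, sheet in zip(reports, sheets):
        try:
            df = pd.read_csv(path_input + '\\' + report + reportdate + '.csv')
            df = strip_punctuation(df, strip_columns)
            df = adjust_report(df, report)
            df = df.apply(pd.to_numeric, errors='ignore', downcast='integer')
            df.to_excel(wb, sheet_name=sheet, index=False)
        except Exception:
            print('Missing report: ' + report)
    wb.save()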
timing
Profile elapsed time with time(). Surround each df assignment with code like this:
from time import time

t0 = time()
df = ...
print(time() - t0)
That will show you which part of your processing pipeline takes the longest and therefore should receive the most effort for speeding it up.
I suspect adjust_report() uses the bulk of the time,
but without seeing it that's hard to say.
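If several steps need timing, a small context manager keeps the measurements tidy (a sketch, nothing specific to your pipeline):

from contextlib import contextmanager
from time import time

@contextmanager
def timed(label):
    # Print how long the enclosed block took.
    t0 = time()
    yield
    print(f'{label}: {time() - t0:.2f}s')

# usage:
# with timed('adjust_report'):
#     df = adjust_report(df, report)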
