How to open an Excel file created from pandas faster? - python

The Excel file created from Python is extremely slow to open, even though the file is only about 50 MB.
I have tried both pandas and openpyxl.
def to_file(list_report, list_sheet, strip_columns, Name):
    i = 0
    wb = ExcelWriter(path_output + '\\' + Name + dateformat + '.xlsx')
    while i <= len(list_report)-1:
        try:
            df = pd.DataFrame(pd.read_csv(path_input + '\\' + list_report[i] + reportdate + '.csv'))
            for column in strip_column:
                try:
                    df[column] = df[column].str.strip('=("")')
                except:
                    pass
            df = adjust_report(df, list_report[i])
            df = df.apply(pd.to_numeric, errors='ignore', downcast='integer')
            df.to_excel(wb, sheet_name=list_sheet[i], index=False)
        except:
            print('Missing report: ' + list_report[i])
        i += 1
    wb.save()
Is there any way to speed it up?

idiom
Let us rename list_report to reports.
Then your while loop is usually expressed as simply: for i in range(len(reports)):
You access the i-th element several times. The loop could bind that for you, with: for i, report in enumerate(reports):.
But it turns out you never even need i. So most folks would write this as: for report in reports:
code organization
This bit of code is very nice:
for column in strip_column:
    try:
        df[column] = df[column].str.strip('=("")')
    except:
        pass
I recommend you bury it in a helper function, say def strip_punctuation.
(The list should be plural, I think? strip_columns?)
Then you would have a simple sequence of df assignments, as sketched below.
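Putting the idiom and the helper together, here is a minimal sketch of that refactoring. It assumes the same globals as the question (path_input, path_output, reportdate, dateformat, ExcelWriter); the narrower exception types are my guesses at what the bare except clauses were really catching:

    def strip_punctuation(df, strip_columns):
        # Remove the ="..." wrapper some CSV exports put around values.
        for column in strip_columns:
            try:
                df[column] = df[column].str.strip('=("")')
            except (KeyError, AttributeError):
                pass  # column missing, or not a string column
        return df

    def to_file(reports, sheets, strip_columns, name):
        wb = ExcelWriter(path_output + '\\' + name + dateformat + '.xlsx')
        for report, sheet in zip(reports, sheets):
            try:
                df = pd.read_csv(path_input + '\\' + report + reportdate + '.csv')
            except FileNotFoundError:
                print('Missing report: ' + report)
                continue
            df = strip_punctuation(df, strip_columns)
            df = adjust_report(df, report)
            df = df.apply(pd.to_numeric, errors='ignore', downcast='integer')
            df.to_excel(wb, sheet_name=sheet, index=False)
        wb.save()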
timing
Profile elapsed time with time.time(). Surround each df assignment with code like this:
from time import time

t0 = time()
df = ...
print(time() - t0)
That will show you which part of your processing pipeline takes the longest and therefore should receive the most effort for speeding it up.
I suspect adjust_report() uses the bulk of the time,
but without seeing it that's hard to say.
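If you end up wrapping many assignments this way, a tiny context manager keeps the instrumentation readable. This is a sketch of my own, not code from the question:

    from contextlib import contextmanager
    from time import time

    @contextmanager
    def timed(label):
        # Print the elapsed wall-clock time of the wrapped block.
        t0 = time()
        yield
        print('{}: {:.3f}s'.format(label, time() - t0))

    # Usage:
    # with timed('adjust_report'):
    #     df = adjust_report(df, report)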

Related

How to append a row to a dataframe with a loop?

I've written a function to transform an Excel sheet and take only one row from the monthly data. Every month I'll have the data on a new Excel sheet.
I've made this:
def bocapago(nombre):
    path = '/content/drive/MyDrive/Fundacion Frontera Economica/Muni/python/inputs/BOCAS DE PAGO'
    filename = path + "/" + nombre.upper() + '.xlsx'
    input_cols = [0, 1, 2, 3]  # columns to import
    df = pd.read_excel(filename,
                       header=0,
                       usecols=input_cols,
                       index_col=False,
                       )
    df.columns = ['n_tasa', 'Fecha', 'Lugar', 'Importe']
    pd.to_datetime(df['Fecha'])
    df['Periodo'] = pd.DatetimeIndex(df['Fecha']).month
    df['Periodo'] = nombre
    df['Periodo'] = df['Periodo'].str[:3] + "-" + df['Periodo'].str[-4:]
    df = pd.pivot_table(df, values='Importe', index='Periodo', columns='Lugar', aggfunc='sum')
    df = df.assign(Total=df.sum(1))
    df = df.rename(columns={'Total': 'TOTAL GENERAL'})
    df.head()
    return df
That is the function to read and process the sheet. And then I did this as a second step:
ENERO1 = bocapago('ENERO2021')
FEBRERO1 = bocapago('FEBRERO2021')
MARZO1 = bocapago('MARZO2021')
MAYO1 = bocapago('MAYO2021')
ingxboca = [ENERO1, FEBRERO1, MARZO1, MAYO1]
ingxboca = pd.concat(ingxboca)
ingxboca = ingxboca.merge(ingresos['TOTAL IACM'], how='left', on='Periodo')
ingxboca['DIFERENCIA'] = ingxboca['TOTAL IACM']-ingxboca['TOTAL GENERAL']
ingxboca.head()
I use another dataframe called "ingresos" in this case to merge.
My question is how I can write a for or while loop for the second step, so I can include all of it inside the function called "bocapago" or make another function like "finishing".
I would keep bocapago as its own function, just like you've done, and have the second function call it. It keeps the complexity of a single function lower and will be easier for code reuse in the future. If I understood your question correctly, would this work?
def new_function(file_list: list):
    ingxboca = pd.concat([bocapago(f) for f in file_list])
    ingxboca = ingxboca.merge(ingresos['TOTAL IACM'], how='left', on='Periodo')
    ingxboca['DIFERENCIA'] = ingxboca['TOTAL IACM'] - ingxboca['TOTAL GENERAL']
    return ingxboca.head()
I'm not sure if that answered the question or not. If so, I imagine the list comprehension in the first line did it. Keep in mind you can add if statements to a list comprehension. You can also pass in a string and use something like glob to give you a file list with rules in it, as sketched below.
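For instance, a minimal sketch of the glob variant; the directory argument and the assumption that each workbook is named like ENERO2021.xlsx are mine:

    import glob
    import os

    def new_function_from_dir(input_dir: str):
        # Find every workbook in the directory and feed its base name
        # (e.g. 'ENERO2021') to bocapago.
        paths = glob.glob(os.path.join(input_dir, '*.xlsx'))
        names = [os.path.splitext(os.path.basename(p))[0] for p in paths]
        ingxboca = pd.concat([bocapago(n) for n in names])
        ingxboca = ingxboca.merge(ingresos['TOTAL IACM'], how='left', on='Periodo')
        ingxboca['DIFERENCIA'] = ingxboca['TOTAL IACM'] - ingxboca['TOTAL GENERAL']
        return ingxboca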

Script keeps showing "SettingWithCopyWarning"

Hello, my problem is that my script keeps showing the message below:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
downcast=downcast
I searched Google for a while regarding this, and it seems like my code is somehow assigning a sliced dataframe to a new variable, which is problematic.
The problem is **I can't find where my code gets problematic**.
I tried the copy function, and separated the nested functions, but it is not working.
I attached my code below.
def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
    ops = {">": gt, "<": lt}
    col_get = str(col_get)
    value_get = int(value_get)
    if methods_get is "|x|":
        new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
    else:
        new_file = file_get[ops[operator_get](file_get[col_get], np.percentile(file_get[col_get], value_get))]
    return new_file
Basically what I was trying to do was make a Flask API that gets an Excel file as input and returns a CSV file with some filtering. So I defined some functions first.
def get_brandlist(df_input, brand_input):
    if brand_input == "default":
        final_list = (pd.unique(df_input["브랜드"])).tolist()
    else:
        final_list = brand_input.split("/")
    if '브랜드' in final_list:
        final_list.remove('브랜드')
    final_list = [x for x in final_list if str(x) != 'nan']
    return final_list
Then I defined the main function
def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
    # // 2-1 // remove unnecessary rows and columns with na values
    df_data = df_data.dropna(axis=0 & 1, how='all')
    df_data.fillna(method='pad', inplace=True)
    # // 2-2 // iterate over all rows to find which row contains the brand value
    default_number = 0
    for row in df_data.itertuples():
        if '브랜드' in row:
            df_data.columns = df_data.iloc[default_number, :]
            break
        else:
            default_number = default_number + 1
    # // 2-3 // create the list containing all the target brand names
    brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)
    # // 2-4 // subset the target brands into another dataframe
    df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]
    # // 2-5 // split the dataframe based on the brand name, and apply the input condition
    df_per_brand = {}
    df_per_brand_modified = {}
    for brand_each in brand_list:
        df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
        file = df_per_brand[brand_each].copy()
        df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name, methods_get=methods,
                                                         operator_get=operator, value_get=value)
    # // 2-6 // merge all the remaining dataframes
    df_merged = pd.DataFrame()
    for brand_each in brand_list:
        df_merged = df_merged.append(df_per_brand_modified[brand_each], ignore_index=True)
    final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
    return final_df
And I am going to import this function in my app.py later.
I am quite new to coding, so I'm really sorry if my code is hard to understand, but I just really want to get rid of this annoying warning message. Thanks in advance for the help :)
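For anyone chasing the same warning: it is usually raised when you write into a dataframe that pandas considers a slice (view) of another dataframe. A generic minimal example of the pattern and the usual fixes, not a diagnosis of the exact line above:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

    # Raises SettingWithCopyWarning: `sub` may be a view of `df`.
    sub = df[df['a'] > 1]
    sub['b'] = 0

    # Fix 1: take an explicit copy before modifying the slice.
    sub = df[df['a'] > 1].copy()
    sub['b'] = 0

    # Fix 2: write through the original dataframe with a single .loc call.
    df.loc[df['a'] > 1, 'b'] = 0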

How to loop through a DataFrame containing 80k+ rows

This question might have other answers, but I could not figure out how to apply them to my current code.
I have to iterate through the DataFrame and modify certain column values as shown below:
NOTE: All of the columns are strings. The ones ending in _Length contain the int length of the corresponding string columns.
for col in range(0, 200):
    if df['Partial_Input_Length'][col] < 50:
        df['Full_Input'][col] = df['Partial_Input'][col] + " " + df['Input5'][col] + " " + df['Input6'][col]
    else:
        df['Full_Input'][col] = df['Partial_Input'][col]
This worked when I used a test DataFrame containing only 200 rows. If I use for col in range(0, 80000): on the 80k-row DataFrame, it takes a huge amount of time until every operation is done.
I also tried out with itertuples() in this way:
for col in df.itertuples():
    if col.Partial_Input_Length < 50:
        col.Full_Input = col.Partial_Input + " " + col.Input5 + " " + col.Input6
    else:
        col.Full_Input = col.Partial_Input
But after running it, I get the following error:
File "", line 23, in
col.Full_Input = col.Partial_Input + " " + col.Input5 + " " + col.Input6
AttributeError: can't set attribute
Moreover, I tried with iterrows() like this:
for index, col in df.iterrows():
if df['Partial_Input_Length'][index] < 50:
df['Full_Input'][index] = df['Partial_Input'][index] + " " + df['Input5'][index] + " " + df['Input6'][index]
else:
df['Full_Input'][index] = df['Partial_Input'][index]
But the code above is taking huge amounts of time, as well.
Is it normal that every time I run these iterations on a big dataframe it takes a lot of time or am I doing something wrong?
I am quite a newbie when it comes to iterating in Python. What method should I use for the quickest iteration time, and which fits what I am trying to do?
You can do it without looping:
df['Full_Input'] = df['Partial_Input'].str.cat(df['Input5'], sep=" ").str.cat(df['Input6'], sep=" ")
df['Full_Input'] = np.where(df['Partial_Input_Length'].astype(int) < 50, df['Full_Input'], df['Partial_Input'])
First of all, you should not be modifying the elements that you are iterating over.
Almost all iter* functions in pandas return read-only items, so setting anything on them will not work.
To do what you want, use apply, or run a loop that calls a function returning a dict with the changes you want, and then either remake the entire dataframe or do a merge.
something like
# if your modification is simple enough, a plain apply will work
df['new_col'] = df.apply(lambda x: f'{x.startDate.year}-{x.startDate.week}', axis=1)

# if you want to do something more complex with all the items in the row
def foo(row):
    def modification_code(item):
        modified_item = item  # placeholder: transform the item here
        return modified_item
    return {
        'primary_key': row.primary_key,
        'modified_data': modification_code(row.item)
    }

modified_data = [foo(row) for row in df.itertuples()]

# sometimes this may be sufficient,
new_df = pd.DataFrame(modified_data)

# alternatively, you can do a merge with the original data
new_df = pd.merge(df, new_df, how='left', left_on='primary_key', right_on='primary_key')

Loop list in python

I'm a newbie to Python, but I'm attempting to call the Google Distance Matrix API.
This is what my data frame looks like (screenshot not included). Loading the data into a data frame:
data = pd.read_csv(input_filename, encoding ='utf8')
I just need some help looping over the list.
Issue: it keeps printing the entire list.
# Column names in your input data
start_latitude_name = "Start Latitude"
start_longitude_name = "Start Longitude"
end_latitude_name = "End Latitude"
end_longitude_name = "End Longitude"

start_latitude_names = data[start_latitude_name].tolist()
end_latitude_names = data[end_latitude_name].tolist()
start_longitude_names = data[start_longitude_name].tolist()
end_longitude_names = data[end_longitude_name].tolist()

for start_latitude_name in start_latitude_names:
    origins = start_latitude_name, start_longitude_name
    destinations = end_latitude_name, end_longitude_name
    mode = "walking"
    # Set up your distance matrix url
    distancematrix_url = "*Omitted unnecessary parts*origins={0}&destinations={1}&mode={2}&language=en-EN&key={3}".format(origins, destinations, mode, API_KEY)
    print(distancematrix_url)
Current Output (From each loop)
# Omitted unnecessary info
origins=40.7614645,123.0,-73.9825913,456.0&destinations=40.65815,789.0,-73.98283,0.0
Expected Output (From each loop)
origins=40.7614645,-73.9825913&destinations=40.65815,-73.98283
I'm sure that I'm not looping it correctly, but I have tried the answers from several posts and they didn't work for me.
I'm open to better alternatives for looping over the data. Feel free to correct me.
Thanks!
You could do this with pandas and df.iterrows():
import pandas as pd

data = pd.read_csv(input_filename, encoding='utf8')

for idx, row in data.iterrows():
    origins = "{},{}".format(row['Start Latitude'], row['Start Longitude'])
    destinations = "{},{}".format(row['End Latitude'], row['End Longitude'])
    mode = "walking"
    # Set up your distance matrix url
    distancematrix_url = "*Omitted unnecessary parts*origins={0}&destinations={1}&mode={2}&language=en-EN&key={3}".format(origins, destinations, mode, API_KEY)
    print(distancematrix_url)
If I've understood correctly, you can vectorize this operation and use the string-representations of your coordinates:
import pandas as pd

# Make pandas print entire strings without truncating them
pd.set_option("display.max_colwidth", None)

# Create a dummy df from your example
df = pd.DataFrame({"start_latitude": [40.76, 123.00], "start_longitude": [-73.98, 456.00], "end_latitude": [40.65, 789.00], "end_longitude": [-73.98, 0.00]})
print(df)

# Set globals
mode = "walking"
API_KEY = "my_key"

# Create the url strings for each row
df["distance_matrix_url"] = "origins=" + df["start_latitude"].map(str) + "," + df["start_longitude"].map(str) + "&destinations=" + df["end_latitude"].map(str) + "," + df["end_longitude"].map(str) + "&mode=" + mode + "&language=en-EN&key=" + API_KEY

# Print results
print(df)
Output:
   end_latitude  end_longitude  start_latitude  start_longitude  distance_matrix_url
0  40.65         -73.98         40.76           -73.98           origins=40.76,-73.98&destinations=40.65,-73.98&mode=walking&language=en-EN&key=my_key
1  789.00        0.00           123.00          456.00           origins=123.0,456.0&destinations=789.0,0.0&mode=walking&language=en-EN&key=my_key
Is this what you're looking for?

Adding a specified value to each row in a pandas data frame

I am iterating over the rows that are available, but it doesn't seem to be the most optimal way to do it; it takes forever.
Is there a special way in pandas to do it?
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")

# NEED TO ADD DATA FROM THAT COLUMN
df = pd.read_csv(dataset_path, delimiter=',', skiprows=range(0, 1), names=['TCOUNT','CORE','COUNTER','EMPTY','NAME','TSTAMP','MULT','STAMPME'])
df = df.drop('MULT', 1)
df = df.drop('EMPTY', 1)
df = df.drop('TSTAMP', 1)

for index, row in df.iterrows():
    TMP_TIME = INIT_TIME + datetime.timedelta(seconds=row['TCOUNT'])
    df['STAMPME'] = TMP_TIME.strftime("%s")
In addition, the datetime I am adding is in the following format:
2017-05-11 11:12:37.100192 1494493957
2017-05-11 11:12:37.200541 1494493957
and therefore the unix timestamp is the same (and it is correct), but is there a better way to represent it?
Assuming the datetimes are correctly reflecting what you're trying to do, with respect to Pandas you should be able to do:
df['STAMPME'] = df['TCOUNT'].apply(lambda x: (datetime.timedelta(seconds=x) + INIT_TIME).strftime("%s"))
As noted here, you should not use iterrows() to modify the DF you are iterating over. If you need to iterate row by row (as opposed to using the apply method), you can use another data object, e.g. a list, to retain the values you're calculating, and then create a new column from that, as sketched below.
Also, for future reference, the itertuples() method is faster than iterrows(), although it requires you to know the index of each column (i.e. row[x] as opposed to row['name']).
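A minimal sketch of that list-based approach, reusing the names from the question:

    import datetime

    # Collect the computed timestamps in a plain list, then attach the
    # list as a new column in a single assignment.
    stamps = []
    for row in df.itertuples():
        tmp_time = INIT_TIME + datetime.timedelta(seconds=row.TCOUNT)
        stamps.append(tmp_time.strftime("%s"))
    df['STAMPME'] = stamps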
I'd rewrite your code like this
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")
INIT_TIME = pd.to_datetime(INIT_TIME)

df = pd.read_csv(
    dataset_path, delimiter=',', skiprows=range(0, 1),
    names=['TCOUNT', 'CORE', 'COUNTER', 'EMPTY', 'NAME', 'TSTAMP', 'MULT', 'STAMPME']
)
df = df.drop(['MULT', 'EMPTY', 'TSTAMP'], 1)

df['STAMPME'] = pd.to_timedelta(df['TCOUNT'], 's') + INIT_TIME
