For loop looping multiple times - python

At the request of many, I have simplified the problem as far as I can imagine (my imagination doesn't go that far), and I think it is reproducible as well. The two Excel files I've been using are called "first apples.xlsx" and "second apples.xlsx". I've been using the following code:
import os
import numpy as np
import pandas as pd
import glob
#%%
path = os.getcwd() + r"\apples"
file_locations = glob.glob(path + "\*.xlsx")
#%%
df = {}
for i, file in enumerate(file_locations):
    df[i] = pd.read_excel(file, usecols=['Description', 'Price'])
#%%
price_standard_apple = []
price_special_apple = []
special_apple_description = ['Golden', 'Diamond', 'Titanium']
#%%
for file in range(len(df)):
    df_description = pd.DataFrame(df[file].iloc[:, -2])
    df_prices = pd.DataFrame(df[file].iloc[:, -1])
    for description in df_description['Description']:
        if description in special_apple_description:
            description_index = df_description.loc[df_description['Description'] == description].index.values
            price = df_prices['Price'].iloc[description_index]
            price_sum = np.sum(price)
            price_special_apple.append(price_sum)
        elif description not in special_apple_description:
            description_index = df_description.loc[df_description['Description'] == description].index.values
            price = df_prices['Price'].iloc[description_index]
            price_sum = np.sum(price)
            price_standard_apple.append(price_sum)
I would expect the sum of the red colored cells (the special apples, so to speak) to be 97 and that of the standard apples to be 224. This is not the case, and the problem seems to be in the second loop. Python prints the following values: standard: 234, special: 209.

I think you are making this harder than it needs to be. (The over-count happens because, for every occurrence of a description, your inner loop re-selects and re-sums all rows with that description, so a duplicated description contributes its group sum more than once.)
Given your test data from above you can simply do:
import pandas as pd
test_data = pd.read_csv(r'path\to\file', sep=',')
special = ['Golden', 'Diamond', 'Titanium']
mask = test_data['Description'].isin(special)
specials_price = sum(test_data[mask]['Price']) # -> 97
other = sum(test_data[~mask]['Price']) # -> 224
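If you want to stay with your two Excel files rather than a CSV, the same isin idea works per file; here is a sketch reusing the folder layout from your question:
import glob
import os
import pandas as pd
# Sketch: apply the mask per Excel file and accumulate the two totals.
special = ['Golden', 'Diamond', 'Titanium']
total_special = 0
total_standard = 0
for path in glob.glob(os.path.join(os.getcwd(), 'apples', '*.xlsx')):
    data = pd.read_excel(path, usecols=['Description', 'Price'])
    mask = data['Description'].isin(special)
    total_special += data.loc[mask, 'Price'].sum()
    total_standard += data.loc[~mask, 'Price'].sum()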

I think the break statement would be useful in your case. It would let you break out of the inner for loop once you've retrieved the value you wanted, while the outer for loop continues.
This is also covered in this question: Python: Continuing to next iteration in outer loop
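For instance, a minimal sketch using the df dictionary from the question (the print is just a stand-in for whatever you do with the match):
for i in df:
    for description in df[i]['Description']:
        if description in special_apple_description:
            print(f"file {i}: found special apple {description}")
            break  # leaves only the inner loop; the outer loop moves to the next file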

Related

Why can't I replace null values in this excel sheet?

In my code, I run a t-test which sometimes yields "NaN" or "nan" when running a test on two zero-value groups. I have tried making new data frames, tried replacing with .replace, and also tried fillna(), but nothing was successful. I also get errors when trying to define a new dataframe or read the file again after adding new calculations.
How do I replace the nulls and "nan" in these files: "significant_report2.xls" or "quant_report2.xls"?
import json
import os, sys
import numpy as np
import pandas as pd
import scipy.stats
output_report = "quant_report2.xls"
significant_report = "significant_report2.xls"
output_report_writer = open(output_report, "w")
significant_writer = open(significant_report, "w")
# setup samples grouped by control and treatment
header = []
for idx in control_indices:
    header.append(quant_columns[idx])
for idx in treatment_indices:
    header.append(quant_columns[idx])
output_report_writer.write("Feature\t%s\tP-value\tctrl_means\tctrl_stdDev\ttx_means\ttx_stdDev\n" % "\t".join(header))
significant_writer.write("Feature\t%s\tP-value\tctrl_means\tctrl_stdDev\ttx_means\ttx_stdDev\n" % "\t".join(header))
feature_list = list(quantitative_data_frame.index)
for feature_idx in range(len(feature_list)):
    feature_name = feature_list[feature_idx]
    control_values = quantitative_data_frame.iloc[feature_idx, control_indices]
    treatment_values = quantitative_data_frame.iloc[feature_idx, treatment_indices]
    ttest_stat, ttest_pvalue = scipy.stats.ttest_ind(control_values, treatment_values, equal_var=False)
    ctrl_means = np.mean(control_values, axis=0)
    ctrl_stdDev = scipy.stats.tstd(control_values)
    tx_means = np.mean(treatment_values, axis=0)
    tx_stdDev = scipy.stats.tstd(treatment_values)
    output_report_writer.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (feature_name,
        "\t".join([str(x) for x in list(control_values)]),
        "\t".join([str(x) for x in list(treatment_values)]), ttest_pvalue, ctrl_means, ctrl_stdDev, tx_means, tx_stdDev))
    significant_writer.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (feature_name, "\t".join([str(x) for x in list(control_values)]), "\t".join([str(x) for x in list(treatment_values)]), ttest_pvalue, ctrl_means, ctrl_stdDev, tx_means, tx_stdDev))
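A hedged sketch of one way to clear the NaNs before the report rows are written, assuming quantitative_data_frame and the scalar t-test outputs from the code above (the 0 placeholder is illustrative, not prescribed):
import numpy as np
# Sketch: normalize literal "NaN"/"nan" strings to real NaN, then fill.
quantitative_data_frame = quantitative_data_frame.replace(["NaN", "nan"], np.nan).fillna(0)
# Guard the scalar t-test output the same way before writing it:
ttest_pvalue = 0.0 if np.isnan(ttest_pvalue) else ttest_pvalue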

Convert Multiple Python Lines to a Concurrent DataFrame and Merge with Source Data

I apologize if this is a rudimentary question; I feel like it should be easy, but I cannot figure it out. I have the code listed below, which essentially looks at two columns in a CSV file and matches up job titles that have a similarity of 0.7. To do this, I use difflib.get_close_matches. However, the output is multiple single lines, and whenever I try to convert to a DataFrame, every single line becomes its own DataFrame and I cannot figure out how to merge/concat them. All code, as well as the current and desired outputs, is below. Any help would be much appreciated.
Current Code is:
import pandas as pd
import difflib
df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n=3
cutoff = 0.7
for aList in aLists:
    best = difflib.get_close_matches(aList, bLists, n, cutoff)
    print(best)
Current Output is:
['SW Engineer']
['Manu Engineer']
[]
['IT Help']
Desired Output is:
Output
0 SW Engineer
1 Manu Engineer
2 (blank)
3 IT Help
Any help would be greatly appreciated!
Here is a simple way to achieve this. I first convert each match list to a string, then strip the leading and trailing brackets from that string and append the result to a global list.
import pandas as pd
import difflib
import numpy as np
df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7
best = []
for aList in aLists:
    temp = difflib.get_close_matches(aList, bLists, n, cutoff)
    temp = str(temp)
    strippedString = temp.lstrip("[").rstrip("]")
    # print(temp)
    best.append(strippedString)
print(best)
Output
[
"'SW Engineer'",
"'Manu Engineer'",
'',
"'IT Help'"
]
Here is another, arguably better, way to achieve this.
You can simply use numpy to concatenate the per-row match lists into a single array, and then convert it back to a normal list if you want. Note that rows with no match (empty lists) disappear from the combined result, as the output below shows.
import pandas as pd
import difflib
import numpy as np
df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7
best = []
for aList in aLists:
    temp = difflib.get_close_matches(aList, bLists, n, cutoff)
    best.append(temp)
# print(best)
# Use concatenate() to join the per-row lists into one array
combinedNumpyArray = np.concatenate(best)
# Convert the numpy array back to a normal list
normalArray = combinedNumpyArray.tolist()
print(normalArray)
Output
['SW Engineer', 'Manu Engineer', 'IT Help']
Thanks
You could use pandas' .apply() to run your function on each entry. The result can then either be added as a new column or used to build a new dataframe.
For example:
import pandas as pd
import difflib
def get_best_match(word):
    matches = difflib.get_close_matches(word, JT, n, cutoff)
    return matches[0] if matches else None
df = pd.read_csv('name.csv')
JT = df['JT']
n = 3
cutoff = 0.7
df['Output'] = df['JTs'].apply(get_best_match)
Or for a new dataframe:
df_output = pd.DataFrame({'Output' : df['JTs'].apply(get_best_match)})
Giving you:
JTs JT Output
0 Software Engineer Manu Engineer SW Engineer
1 Manufacturing Engineer SW Engineer Manu Engineer
2 Human Resource Manager IT Help None
3 IT Help Desk f IT Help
Or:
Output
0 SW Engineer
1 Manu Engineer
2 None
3 IT Help

I want to run a loop with condition and save all outputs as dataframes with different names

I wrote a function that depends only on a dataframe; the function's output is also a dataframe. I would like to make different dataframes according to a condition and save them as different datasets with different names. However, I couldn't save them as dataframes with different names, so instead I do the process manually. Is there code that would do the same? It would be very helpful.
import os
import numpy as np
import pandas as pd
data1 = pd.read_csv('C:/Users/Oz/Desktop/vintage/vintage1.csv', encoding='latin-1')
product_list= data1['product_types'].unique()
def vintage_table(df):
    df['Disbursement_Date'] = pd.to_datetime(df.Disbursement_Date)
    df['Closing_Date'] = pd.to_datetime(df.Closing_Date)
    df['NPL_date'] = pd.to_datetime(df.NPL_date, errors='ignore')
    df['NPL_date_period'] = df.loc[df.NPL_date > '2015-01-01', 'NPL_date'].apply(lambda x: x.strftime('%Y-%m'))
    df['Dis_date_period'] = df.Disbursement_Date.apply(lambda x: x.strftime('%Y-%m'))
    df['diff'] = ((df.NPL_date - df.Disbursement_Date) / np.timedelta64(3, 'M')).round(0)
    df = df.groupby(['Dis_date_period', 'NPL_date_period']).agg({'Dis_amount': 'sum', 'NPL_amount': 'sum', 'diff': 'mean'})
    df.reset_index(level=0, inplace=True)
    df['Vintage_Ratio'] = df['NPL_amount'] / df['Dis_amount']
    table = pd.pivot_table(df, values='Vintage_Ratio', index='Dis_date_period', columns=['diff']).fillna(0)
    return
The above is the function
# for e in product_list:
#     sub = data1[data1['product_types'] == e]
#     print(sub)
consumer = data1[data1['product_types'] == product_list[0]]
mortgage = data1[data1['product_types'] == product_list[1]]
vehicle = data1[data1['product_types'] == product_list[2]]
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
I would like to improve this part; is there a better way to do the same process?
You could have your vintage_table() function return the table it builds (return table instead of the bare return) rather than just modifying one dataframe over and over, and that way you could do this in the second code block:
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
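If you would rather not hard-code one variable per product type, a common pattern is a dict keyed by product type. A sketch built on the same data1 and product_list (this assumes vintage_table is changed to return table, as suggested above):
# Sketch: one vintage table per product type, stored under its name.
tables = {}
for product in product_list:
    subset = data1[data1['product_types'] == product]
    tables[product] = vintage_table(subset)
# Access e.g. tables[product_list[0]] for the first product type.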

My program computes values as strings and not as floats even when I change the type

I have a problem with my program and I'm confused: I don't know why it won't change the type of the columns, or maybe it is changing the type of the columns and it just still treats the columns as strings. When I change the type to float and multiply by 8, a value such as 4 gives me 44444444. Here is my code.
import pandas as pd
import re
import numpy as np
link = "excelfilett.txt"
file = open(link, "r")
frames = []
is_count_frames = False
for line in file:
    if "[Frames]" in line:
        is_count_frames = True
    if is_count_frames == True:
        frames.append(line)
    if "[EthernetRouting]" in line:
        break
number_of_rows = len(frames) - 3
header = re.split(r'\t', frames[1])
number_of_columns = len(header)
frame_array = np.full((number_of_rows, number_of_columns), 0)
df_frame_array = pd.DataFrame(frame_array)
df_frame_array.columns= header
for row in range(number_of_rows):
    frame_row = re.split(r'\t', frames[row+2])
    for position in range(len(frame_row)):
        df_frame_array.iloc[row, position] = frame_row[position]
df_frame_array['[MinDistance (ms)]'].astype(float)
df_frame_array.loc[:,'[MinDistance (ms)]'] *= 8
print(df_frame_array['[MinDistance (ms)]'])
but it gives me the value repeated 8 times, like (100100...100100). I also tried putting the values in a list:
MinDistList = df_frame_array['[MinDistance (ms)]'].tolist()
product = []
for i in MinDistList:
    product.append(i*8)
print(product)
but it still won't work, any ideas?
df_frame_array['[MinDistance (ms)]'].astype(float) doesn't change the column in place, but returns a new one.
You had the right idea, so just store it back:
df_frame_array['[MinDistance (ms)]'] = df_frame_array['[MinDistance (ms)]'].astype(float)
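With that assignment in place, both the *= 8 line and the list loop do numeric multiplication. If some cells might not parse as floats, pd.to_numeric with errors='coerce' is a more forgiving variant (a sketch on the same column):
# Sketch: unparseable cells become NaN instead of raising an error.
df_frame_array['[MinDistance (ms)]'] = pd.to_numeric(
    df_frame_array['[MinDistance (ms)]'], errors='coerce')
df_frame_array['[MinDistance (ms)]'] *= 8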

Time efficiency by eliminating three for loops

I have a script similar to this:
import random
import pandas as pd
FA = []
FB = []
Value = []
df = pd.DataFrame()
df_save = pd.DataFrame(index=['min','max'])
days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
numbers = list(range(24)) # FA.unique()
mix = '(pairwise combination of days and numbers, i.e. 0Monday, 0Tuesday, ... 1Monday, 1Tuesday, ...)'  # I don't know how to do this combination btw
def Calculus():
    global min, max
    min = df['Value'][boolean].min()
    max = df['Value'][boolean].max()
for i in range(1000):
    FA.append(random.randrange(0, 23, 1))
    FB.append(random.choice(days))
    Value.append(random.random())
df['FA'] = FA
df['FB'] = FB
df['FAB'] = df['FA'].astype(str) + df['FB'].astype(str)
df['Value'] = Value
mix_factor = df['FA'].astype(str) + df['FB'].astype(str)
for i in numbers:
    boolean = df['FA'] == i
    Calculus()
    df_save[str(i)] = [min, max]
for i in days:
    boolean = df['FB'] == i
    Calculus()
    df_save[str(i)] = [min, max]
for i in mix_factor.unique():
    boolean = df['FAB'] == i
    Calculus()
    df_save[str(i)] = [min, max]
My question is: is there another way to do the same thing more time-efficiently? My real data (df in this case) is a CSV with millions of rows, and these three loops are taking forever.
Maybe using 'apply', but I have never worked with it before.
Any insight will be very appreciated, thanks.
You could put all three loops into one, depending on what your exact code is. Does Calculus() take a parameter? If not, putting the loops together would let you run Calculus() fewer times.
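As an alternative to merging the loops, pandas can compute all the per-group minima and maxima in one vectorized pass with groupby/agg; a sketch using the df, 'FA', 'FB', and 'FAB' columns built in the question:
# Sketch: one groupby per former loop, no Python-level iteration per group.
by_fa = df.groupby('FA')['Value'].agg(['min', 'max'])
by_fb = df.groupby('FB')['Value'].agg(['min', 'max'])
by_fab = df.groupby('FAB')['Value'].agg(['min', 'max'])
# e.g. by_fa.loc[3] holds the min and max of Value over all rows with FA == 3.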
