Python pandas: empty df but columns have elements
There's a really irritating issue in my script and I have no idea what's wrong. When I try to filter my dataframe and then add rows to a new one that I want to export to Excel, this happens:
The file exports as an empty DF, and print also shows me that "report" is empty, but when I print report.Name, report.Value, etc., I get normal, proper output with elements. I can also only export one column to Excel, not the entire DF, which looks empty... What can cause this strange behaviour?
So this is my script:
import pandas as pd

df = pd.read_excel('testfile2.xlsx')
report = pd.DataFrame(columns=['Type', 'Name', 'Value'])

for index, row in df.iterrows():
    if type(row[0]) == str:
        type_name = row[0].split(" ")
        if type_name[0] == 'const':
            selected_index = index
            report['Type'].loc[index] = type_name[1]
            report['Name'].loc[index] = type_name[2]
            report['Value'].loc[index] = row[1]
        else:
            for elements in type_name:
                report['Value'].loc[selected_index] += " " + elements
    elif type(row[0]) == float:
        df = df.drop(index=index)
print(report)       # output: Empty DataFrame
print(report.Name)  # output: over 500 elements
You are trying to manipulate a series that does not exist, which leads to the described behaviour.
Doing what you did with a much simpler example, I get the same result:
report = pd.DataFrame(columns=['Type','Name','Value'])
report['Type'].loc[0] = "A"
report['Name'].loc[0] = "B"
report['Value'].loc[0] = "C"
print(report) #empty df
print(report.Name) # prints "B" in a series
Easy solution: Just add the whole row instead of the three single values:
report = pd.DataFrame(columns=['Type','Name','Value'])
report.loc[0] = ["A", "B", "C"]
or in your code:
report.loc[index] = [type_name[1], type_name[2], row[1]]
If you want to do it the same way you are doing it at the moment, you first need to add an empty series with the given index to your DataFrame before you can manipulate it:
report.loc[index] = pd.Series([])
report['Type'].loc[index] = type_name[1]
report['Name'].loc[index] = type_name[2]
report['Value'].loc[index] = row[1]
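If you want to avoid per-cell assignments entirely, a common and usually faster pattern is to collect plain Python lists and build the DataFrame once at the end. A minimal sketch with made-up parsed values standing in for your split rows:

import pandas as pd

# hypothetical parsed (type, name, value) tuples, not the asker's real data
parsed_rows = [('int', 'x', 1), ('str', 'y', 'a b c')]

rows = []
for type_, name, value in parsed_rows:
    rows.append([type_, name, value])  # plain lists, no .loc writes per cell

report = pd.DataFrame(rows, columns=['Type', 'Name', 'Value'])
print(report)  # all three columns populated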
Related
How do I loop column names in a pandas dataframe?
I am new to Python and have never really used pandas, so forgive me if this doesn't make sense. I am trying to create a df based on frontend data I am sending to a Flask route. The data is looped through and appended for each row. My only problem is that I don't know how to get the df columns to reflect that. Here is my code to build the rows:

claims = csv_data["claims"]
setups = csv_data["setups"]
for setup in setups:
    setup = setups[0]
    offerings = setup["currentOfferings"]
    considered = setup["considerationSet"]
    reach_dict = setup["reach"]
    favorite_dict = setup["favorite"]
    summary_dict = setup["summaryMetrics"]

rows = []
for i, claim in enumerate(claims):
    row = []
    row.append(i + 1)
    row.append(claim)
    for setup in setups:
        setup = setups[0]
        row.append("X") if claim in setup["currentOfferings"] else row.append(float('nan'))
        row.append("X") if claim in setup["considerationSet"] else row.append(float('nan'))
        if claim in setup["currentOfferings"]:
            reach_score = reach_dict[claim]
            reach_percentage = "{:.0%}".format(reach_score)
            row.append(reach_percentage)
        else:
            row.append(float('nan'))
        if claim in setup["currentOfferings"]:
            favorite_score = favorite_dict[claim]
            fav_percentage = "{:.0%}".format(favorite_score)
            row.append(fav_percentage)
        else:
            row.append(float('nan'))
    rows.append(row)

I know that I can put columns = ["#", "Claims", "Setups", etc...] in the df, but that doesn't work because the rows are looping through multiple setups, and the number of setups can change. If I don't specify the column names, then I just have numbers as column names. Ideally it should loop through the data it receives in the route, and would start with "#", "Claims" as columns, and then for each setup "Setup 1", "Consideration Set 1", "Reach", "Favorite", "Setup 2", "Consideration Set 2", and so on... etc. I tried to create a similar type of loop for the columns:

my_columns = []
for i, row in enumerate(rows):
    col = []
    if row[0] != None:
        col.append("#")
    else:
        pass
    if row[1] != None:
        col.append("Claims")
    else:
        pass
    if row[2] != None:
        col.append("Setup")
    else:
        pass
    if row[3] != None:
        col.append("Consideration Set")
    else:
        pass
    if row[4] != None:
        col.append("Reach")
    else:
        pass
    if row[5] != None:
        col.append("Favorite")
    else:
        pass
    my_columns.append(col)

df = pd.DataFrame(
    rows,
    columns=my_columns
)

But this didn't work because I have the same issue of no loop: I have 6 column names passed and 10 data columns. I'm not sure if I am just not doing the loop of the columns properly, or if I am making everything more complicated than it needs to be. I am trying to accomplish this without having to explicitly name the columns, because this is just sample data; there could end up being 3, 4, however many setups in the actual app.
I don't know if this is the most efficient way of doing something like this, but I think this is what you want to achieve:

def create_columns(df):
    new_cols = []
    for i in range(len(df.columns)):
        repeated_cols = 6  # here is the number of columns you need to repeat for every setup
        idx = 1 + i // repeated_cols
        basic = ['#', 'Claims', f'Setup_{idx}', f'Consideration_Set_{idx}', 'Reach', 'Favorite']
        new_cols.append(basic[i % len(basic)])
    return new_cols

df.columns = create_columns(df)
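For illustration, applying this to a hypothetical 12-column frame (i.e. two setups) relabels the columns like so:

import pandas as pd

df = pd.DataFrame([range(12)])  # made-up data standing in for the real frame
df.columns = create_columns(df)
print(df.columns.tolist())
# ['#', 'Claims', 'Setup_1', 'Consideration_Set_1', 'Reach', 'Favorite',
#  '#', 'Claims', 'Setup_2', 'Consideration_Set_2', 'Reach', 'Favorite']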
If your data comes as a CSV, then try pd.read_csv() to create the dataframe.
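A minimal sketch, assuming a hypothetical file name:

import pandas as pd

df = pd.read_csv("setups.csv")  # column names are taken from the file's header row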
Populate a column in a dataframe with a list using a for loop
I would like to populate a dataframe using a for loop. One of the columns is a list. This list is empty at the beginning, and at each iteration an element is added to or removed from it. When I print my list at each iteration I get the right results, but when I print my dataframe, I get the same list on each row. If you have a look at my code, the list I am updating is list_employe. The magic should happen in the 3 last rows, but it did not. Does anyone have an idea why the list is updated correctly on its own, while the dataframe records only the last update on all rows?

list_employe = []
total_employe = 0
rows = []
shiftday = example['SHIFT_DATE'].dt.strftime('%Y-%m-%d').unique().tolist()

for i in shiftday:
    shift_day = example[example['SHIFT_DATE'] == i]
    list_employe_shift = example[example['SHIFT_DATE'] == i]['EMPLOYEE_CODE_POS_UPPER'].unique().tolist()
    new_employe = 0
    end_employe = 0
    for k in list_employe_shift:
        shift_days_emp = shift_day[shift_day['EMPLOYEE_CODE_POS_UPPER'] == k]
        days = shift_days_emp.iloc[0]['last_day']
        #print(days)
        if k in list_employe:
            if days > 1:
                end_employe = end_employe + 1
                total_employe = total_employe - 1
                list_employe.remove(k)
        else:
            new_employe = new_employe + 1
            total_employe = total_employe + 1
            list_employe.extend([k])
    day = i
    total_emp = total_employe
    new_emp = new_employe
    end_emp = end_employe
    rows.append([day, total_emp, new_emp, end_emp, list_employe])
    print(list_employe)

df = pd.DataFrame(rows, columns=["day", "total_employe", "new_employe", "end_employe", "list_employe"])
The list list_employe is always the same object that you append to the list rows. What you need to do to solve the problem is, at the 3rd line from the bottom:

rows.append([day, total_emp, new_emp, end_emp, list(list_employe)])

which creates a new list at each iteration.
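To see why, here is a minimal sketch of the aliasing problem with made-up values:

rows = []
shared = []
for i in range(3):
    shared.append(i)
    rows.append(shared)        # every entry is the same list object
print(rows)                    # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]

rows = []
shared = []
for i in range(3):
    shared.append(i)
    rows.append(list(shared))  # the copy freezes the current state
print(rows)                    # [[0], [0, 1], [0, 1, 2]]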
convert pandas series (with strings) to python list
It's probably a silly thing, but I can't seem to correctly convert a pandas series, originally obtained from an Excel sheet, to a list. dfCI is created by importing data from an Excel sheet and looks like this:

tab      var          val
MsrData  sortfield    DetailID
MsrData  strow        4
MsrData  inputneeded  "MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"

# get list of cols for which input is needed
cols = dfCI[((dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'))]['val'].values.tolist()
print(cols)
>> ['"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"']

# replace null text with text
invalid = 'Input Needed'
for col in cols:
    dfMSR[col] = np.where((dfMSR[col].isnull()), invalid, dfMSR[col])

However, the second set of (single) quotes, added when I converted cols from a series to a list, makes all the columns a single value, so that

col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'

The desired output for cols is

cols = ["MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"]

What am I doing wrong?
Once you've got col, you can convert it to your expected output:

In [1109]: col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'

In [1114]: cols = [i.strip() for i in col.replace('"', '').split(',')]

In [1115]: cols
Out[1115]: ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Another possible solution that comes to mind, given the structure of cols, is:

list(eval(cols[0]))
# ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']

Although this is valid, it's less safe and I would go with the list comprehension as #MayankPorwal suggested.
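If safety is the concern, the standard library's ast.literal_eval is a middle ground, since it only parses Python literals. A minimal sketch:

import ast

col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
cols = list(ast.literal_eval(col))  # the string parses as a tuple of string literals
print(cols)
# ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']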
Script keeps showing "SettingWithCopyWarning"
Hello, my problem is that my script keeps showing the message below:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast

I searched Google for a while regarding this, and it seems like my code is somehow assigning a sliced dataframe to a new variable, which is problematic. The problem is I can't find where my code gets problematic. I tried the copy function, and separated the nested functions, but it is not working. I attached my code below.

def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
    ops = {">": gt, "<": lt}
    col_get = str(col_get)
    value_get = int(value_get)
    if methods_get is "|x|":
        new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
    else:
        new_file = file_get[ops[operator_get](file_get[col_get],
                                              np.percentile(file_get[col_get], value_get))]
    return new_file

Basically what I was about to do was to make a Flask API that gets an Excel file as input and returns a CSV file with some filtering. So I defined some functions first.

def get_brandlist(df_input, brand_input):
    if brand_input == "default":
        final_list = (pd.unique(df_input["브랜드"])).tolist()
    else:
        final_list = brand_input.split("/")
    if '브랜드' in final_list:
        final_list.remove('브랜드')
    final_list = [x for x in final_list if str(x) != 'nan']
    return final_list

Then I defined the main function:

def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
    # // 2-1 // to remove unnecessary rows and columns with na values
    df_data = df_data.dropna(axis=0 & 1, how='all')
    df_data.fillna(method='pad', inplace=True)

    # // 2-2 // iterate over all rows to find which row contains brand value
    default_number = 0
    for row in df_data.itertuples():
        if '브랜드' in row:
            df_data.columns = df_data.iloc[default_number, :]
            break
        else:
            default_number = default_number + 1

    # // 2-3 // create the list that contains all the target brand names
    brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)

    # // 2-4 // subset the target brand into another dataframe
    df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]

    # // 2-5 // split the dataframe based on the "brand name", and apply the input condition
    df_per_brand = {}
    df_per_brand_modified = {}
    for brand_each in brand_list:
        df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
        file = df_per_brand[brand_each].copy()
        df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name,
                                                         methods_get=methods, operator_get=operator,
                                                         value_get=value)

    # // 2-6 // merge all the remaining dataframes
    df_merged = pd.DataFrame()
    for brand_each in brand_list:
        df_merged = df_merged.append(df_per_brand_modified[brand_each], ignore_index=True)

    final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
    return final_df

And I am going to import this function in my app.py later. I am quite new to coding, so I'm really sorry if my code is hard to understand, but I just really wanted to get rid of this annoying warning message. Thanks for the help in advance :)
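For reference, a minimal example (with made-up column names, not my real data) of how this warning typically arises, and the usual .copy() fix:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

subset = df[df["a"] > 1]       # a slice: pandas may return a view or a copy
subset["b"] = 0                # triggers SettingWithCopyWarning

safe = df[df["a"] > 1].copy()  # explicit copy makes the intent unambiguous
safe["b"] = 0                  # no warning; modifies the copy only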
Python - Pandas library returns wrong column values after parsing a CSV file
SOLVED: Found the solution by myself. It turns out that when you want to retrieve specific columns by their names, you should pass the names in the order they appear inside the csv (which is really stupid for a library that is intended to save some parsing time for a developer, IMO). Correct me if I am wrong, but I don't see an option to get a specific column's values by its name if the columns are in a different order...

I am trying to read a comma separated value file with Python and then parse it using the pandas library. Since the file has many values (columns) that are not needed, I make a list of the column names I do need. Here's a look at the csv file format:

Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8

This list is passed to pandas.read_csv()'s names parameter. See the code below.
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
    cols_to_use = [None] * RawDataCols.COUNT
    cols_to_use[RawDataCols.DATE] = 'Date'
    cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
    cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
    cols_to_use[RawDataCols.FTHG] = 'FTHG'
    cols_to_use[RawDataCols.HG] = 'HG'
    cols_to_use[RawDataCols.FTAG] = 'FTAG'
    cols_to_use[RawDataCols.AG] = 'AG'
    cols_to_use[RawDataCols.FTR] = 'FTR'
    cols_to_use[RawDataCols.RES] = 'Res'
    cols_to_use[RawDataCols.HTHG] = 'HTHG'
    cols_to_use[RawDataCols.HTAG] = 'HTAG'
    cols_to_use[RawDataCols.HTR] = 'HTR'
    cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
    cols_to_use[RawDataCols.HS] = 'HS'
    cols_to_use[RawDataCols.AS] = 'AS'
    cols_to_use[RawDataCols.HST] = 'HST'
    cols_to_use[RawDataCols.AST] = 'AST'
    cols_to_use[RawDataCols.HHW] = 'HHW'
    cols_to_use[RawDataCols.AHW] = 'AHW'
    cols_to_use[RawDataCols.HC] = 'HC'
    cols_to_use[RawDataCols.AC] = 'AC'
    cols_to_use[RawDataCols.HF] = 'HF'
    cols_to_use[RawDataCols.AF] = 'AF'
    cols_to_use[RawDataCols.HFKC] = 'HFKC'
    cols_to_use[RawDataCols.AFKC] = 'AFKC'
    cols_to_use[RawDataCols.HO] = 'HO'
    cols_to_use[RawDataCols.AO] = 'AO'
    cols_to_use[RawDataCols.HY] = 'HY'
    cols_to_use[RawDataCols.AY] = 'AY'
    cols_to_use[RawDataCols.HR] = 'HR'
    cols_to_use[RawDataCols.AR] = 'AR'
    return cols_to_use

# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
    # Clear the database table if it has any logs
    # if MatchRawData.objects.count != 0:
    #     MatchRawData.objects.delete()

    cols_to_use = cols_to_extract()

    # Read and parse the csv file
    parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)

    for col in cols_to_use:
        values = parsed_csv[col].values
        for val in values:
            print(str(col) + ' --------> ' + str(val))

Where RawDataCols is an IntEnum:

class RawDataCols(IntEnum):
    DATE = 0
    HOME_TEAM = 1
    AWAY_TEAM = 2
    FTHG = 3
    HG = 4
    FTAG = 5
    AG = 6
    FTR = 7
    RES = 8
    ...

The column names are obtained using it. That part of the code works ok: the correct column name is obtained, but after trying to get its values using values = parsed_csv[col].values, pandas returns the values of a wrong column. The wrong column's index is around 13 indexes away from the one I am trying to get. What am I missing?
You can select columns name-wise. Just use the following line:

values = parsed_csv[["Column Name", "Column Name2"]]

Or select index-wise with:

cols = [1, 2, 3, 4]
values = parsed_csv[parsed_csv.columns[cols]]
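As a related note, pd.read_csv() also accepts a usecols parameter, which selects columns by header name regardless of their order in the file. A minimal sketch with a hypothetical file name:

import pandas as pd

# usecols filters by header name; the order of names in the list does not matter
parsed_csv = pd.read_csv("raw_data.csv", usecols=["Date", "HomeTeam", "AwayTeam", "FTR"])
print(parsed_csv.head())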