Delete a row when a cell is empty - python
I'm trying to delete every row that has an empty cell in the 'calories.xlsx' spreadsheet and write all remaining data to the 'destination.xlsx' spreadsheet. The code below is how far I got, but it still does not delete rows that have an empty value in the calories column.
This is the data set:
[Data Set - screenshot omitted]
How can I develop my code to solve this problem?
import pandas as pd

FileName = 'calories.xlsx'
SheetName = pd.read_excel(FileName, sheet_name='Sheet1')
df = SheetName
print(df)

ListCalories = ['Calories']
print(ListCalories)

for Cell in ListCalories:
    if Cell == 'NaN':
        df.drop[Cell]

print(df)
df.to_excel('destination.xlsx')
Create dummy data:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'calories': [2306, 3256, 1235, np.nan, 3654, 3256],
    'Person': ['person1', 'person2', 'person3', 'person4', 'person5', 'person6'],
})
Print the data frame:

   calories   Person
0    2306.0  person1
1    3256.0  person2
2    1235.0  person3
3       NaN  person4
4    3654.0  person5
5    3256.0  person6
Remove a row if the calories value is missing:

new_df = df.dropna(how='any', subset=['calories'])
Result
calories Person
0 2306.0 person1
1 3256.0 person2
2 1235.0 person3
4 3654.0 person5
5 3256.0 person6
Save as Excel:

new_df.to_excel('destination.xlsx', index=False)
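Side note: how='any' is the default, and with a single-column subset it makes no difference, so an equivalent shorter call is:

new_df = df.dropna(subset=['calories'])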
Your ListCalories contains only one element, the string 'Calories'; I'll assume this was a typo. What you are probably trying to do is:
import pandas as pd

FileName = 'calories.xlsx'
df = pd.read_excel(FileName, sheet_name='Sheet1')
print(df)

# you don't need this, but I kept it for you
ListCalories = df['Calories']
print(ListCalories)

clean_df = df[df['Calories'].notna()]  # selects only the rows that don't have an NA value in the Calories column
print(clean_df)

clean_df.to_excel('destination.xlsx')
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html
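Worth adding: comparing cells to the string 'NaN', as in the original loop, never matches real missing values, because pandas reads empty Excel cells as the float NaN. If a sheet ever stored the literal text 'NaN' instead (a hypothetical case, not shown in the question), a sketch of the extra conversion step would be:

import numpy as np

# only needed if the sheet contains the text 'NaN' rather than truly empty cells
df['Calories'] = df['Calories'].replace('NaN', np.nan)
clean_df = df[df['Calories'].notna()]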
Related
Pandas .Split If Else
I've created a program so that I can split the dataset's column into 4 columns, but some of my datasets only have 2 columns, so when that section is reached, an error is thrown. I believe an if/else statement can help with this. Here is the code for my program:

import pandas as pd
import os

# reading csv file from path
filepath = "C:/Users/username/folder1/folder2/folder3/b 2 col.csv"
file_encoding = 'cp1252'
data = pd.read_csv(filepath, header=None, names=list(range(0, 4)), encoding=file_encoding)
data.columns = ['ID', 'Name', 'S ID', 'SName']

# new data frame with split value columns
new = data["Name"].str.split(",", n=1, expand=True)

# making separate last name column from new data frame
data["Last Name"] = new[0]

# making separate first name column from new data frame
data["First Name"] = new[1]

# new data frame with split value columns (2)
new = data["SName"].str.split(",", n=1, expand=True)

# making separate last name column from new data frame
data["S Last Name"] = new[0]

# making separate first name column from new data frame
data["S First Name"] = new[1]

# saving file name as its path
filename = os.path.basename(filepath) + ".xlsx"
data.to_excel(filename, index=False)
data

This section onwards is responsible for the splitting of the second set of data:

# new data frame with split value columns (2)
new = data["SName"].str.split(",", n=1, expand=True)

The problem is that not all of my CSVs have four columns, so if I can implement an if/else here to check whether data is present, then proceed, else skip and move to the next section:

# saving file name as its path
filename = os.path.basename(filepath) + ".xlsx"
data.to_excel(filename, index=False)
data

I believe the program would then work with my datasets. Link to an example of my datasets: https://drive.google.com/drive/folders/1nkLgo5tSFsxOTCa5EMWZlezDFi8AyaDq?usp=sharing

Thanks for helping
IIUC, assuming the (.csv) files are in the same folder, here is a proposition with pandas.concat:

import pandas as pd
import os

filepath = "C:/Users/username/folder1/folder2"
file_encoding = "cp1252"

list_df = []
for filename in os.listdir(filepath):
    if filename.endswith(".csv"):
        df = pd.read_csv(os.path.join(filepath, filename), header=None,
                         encoding=file_encoding, on_bad_lines="skip")
        df = (pd.concat([df.iloc[:, i:i+5]
                           .pipe(lambda df_: df_.rename(columns={col: i for i, col in enumerate(df_.columns)}))
                         for i in range(0, df.shape[1], 5)], axis=0)
                .set_axis(["ID", "FullName", "Street No", "Street Add 1", "Street Add 2"], axis=1)
                .dropna(how="all"))
        df.insert(0, "filename", filename)  # comment this line if you don't want the filename as a column
        list_df.append(df)

out = (pd.concat(list_df, ignore_index=True)
         .pipe(lambda df_: df_.join(df_["FullName"]
                                      .str.split(", ", expand=True)
                                      .rename(columns={0: "FirstName", 1: "LastName"}))))

Output:

print(out.head())

          filename   ID       FullName    Street No Street Add 1 Street Add 2 FirstName LastName
0  a 4 col upd.csv  NaN           Name          NaN          NaN          NaN      Name     None
1  a 4 col upd.csv  1.0   Bruce, Wayne  Street No 1 Street Add 1 Street Add 2     Bruce    Wayne
2  a 4 col upd.csv  2.0  James, Gordon  Street No 2 Street Add 2 Street Add 3     James   Gordon
3  a 4 col upd.csv  3.0   Fish, Mooney  Street No 3 Street Add 3 Street Add 4      Fish   Mooney
4  a 4 col upd.csv  4.0   Selina, Kyle  Street No 4 Street Add 4 Street Add 5    Selina     Kyle
You can split your 2 columns into 4 columns like this:

import numpy as np

# if there are some missing columns
data['First Name'] = np.nan
data['Last Name'] = np.nan
data['S First Name'] = np.nan
data['S Last Name'] = np.nan

# if there are no missing values, remove the lines above
data[['First Name', 'Last Name']] = data.Name.astype(str).str.split(",", expand=True)
data[['S First Name', 'S Last Name']] = data.SName.astype(str).str.split(",", expand=True)
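If you would rather keep the original script and just skip the second split for 2-column files (the if/else the asker describes), a minimal sketch of the guard could look like the following. Because the file is read with names=list(range(0, 4)), the 'SName' column always exists but is all-NaN for 2-column files, so the check is on its values rather than its presence:

# only split the second name column when it actually holds data
if data["SName"].notna().any():
    new = data["SName"].str.split(",", n=1, expand=True)
    data["S Last Name"] = new[0]
    data["S First Name"] = new[1]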
How to group data by count of columns in Pandas?
I have a CSV file with a lot of rows and a different number of columns per row. How do I group the data by count of columns and show each group in a different frame? The CSV file has the following data:

1 OLEG US FRANCE BIG
1 OLEG FR 18
1 NATA 18

Because I have a different number of columns in each row, I have to group the rows by count of columns and show 3 frames so that I can then set headers:

     ID NAME STATE COUNTRY HOBBY
FR1: 1  OLEG US    FRANCE  BIG

     ID NAME COUNTRY AGE
FR2: 1  OLEG FR      18

     ID NAME AGE
FR3: 1  NATA 18

In other words, I need to group the rows by count of columns and show them in different dataframes.
Since pandas doesn't allow you to have different numbers of columns, just don't use it to import your data. Your goal is to create three separate df's, so first import the data as lists, and then deal with the different lengths. One way to solve this is to read the data with csv.reader and create the df's with a list comprehension together with a condition on the length of the lists.

import csv
import pandas as pd

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

df1 = pd.DataFrame([item for item in data if len(item) == 3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item) == 4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item) == 5], columns='ID NAME STATE COUNTRY HOBBY'.split())

print(df1, df2, df3, sep='\n\n')

  ID  NAME AGE
0  1  NATA  18

  ID  NAME COUNTRY AGE
0  1  OLEG      FR  18

  ID  NAME STATE COUNTRY HOBBY
0  1  OLEG    US  FRANCE   BIG

If you need to hardcode too many lines for the same step (e.g. too many df's), then you should consider using a loop to create them and storing each dataframe as a key/value pair in a dictionary.

EDIT

Here is a slightly optimized way of creating those df's. I think you can't get around creating a list of the columns you want to use for the separate df's, so you need to know which numbers of columns occur in your data (unless you want to create those df's without naming the columns).

col_list = [['ID', 'NAME', 'AGE'],
            ['ID', 'NAME', 'COUNTRY', 'AGE'],
            ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame(
        [item for item in data if len(item) == len(cols)], columns=cols)

for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')

key='df_3':
  ID  NAME AGE
0  1  NATA  18

key='df_4':
  ID  NAME COUNTRY AGE
0  1  OLEG      FR  18

key='df_5':
  ID  NAME STATE COUNTRY HOBBY
0  1  OLEG    US  FRANCE   BIG

Now you don't have variables for your df's; instead you have them as keys in a dictionary. (I named each df after the number of columns it has, so df_3 is the df with three columns.) If you need to import the data with pandas, you could have a look at this post.
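If you do want to stay inside pandas for the import, a hedged sketch (assuming the same space-delimited input.csv) is to pad every row to the maximum width and then bucket the rows by how many fields each one actually has:

import pandas as pd

# pad short rows with NaN up to 5 columns, then group rows by field count
raw = pd.read_csv('input.csv', sep=' ', header=None, names=list(range(5)))
counts = raw.notna().sum(axis=1)
dict_of_dfs = {f'df_{n}': g.dropna(axis=1, how='all') for n, g in raw.groupby(counts)}

The column names would still need to be assigned per group, as with col_list above.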
How to save a pandas pivot table with xlsxwriter in excel
I want to save a pandas pivot table properly and nicely formatted into an Excel workbook. I have a pandas pivot table, based on this formula:

table = pd.pivot_table(d2, values=['Wert'], index=['area', 'Name'],
                       columns=['Monat'], aggfunc={'Wert': np.sum},
                       margins=True).fillna('')

From my original dataframe df:

Area Name2 Monat Wert
A    A     1     2
A    A     1     3
A    B     1     2
A    A     2     1

so the pivot table looks like this:

          Wert
Monat        1  2 All
Area Name
A    A       5  1   6
     B       2      2
All          7  1   8

Then I want to save this in an Excel workbook with the following code:

import pandas as pd
import xlsxwriter

workbook = xlsxwriter.Workbook('myexcel.xlsx')
worksheet1 = workbook.add_worksheet('table1')
caption = 'Table1'
worksheet1.set_column(1, 14, 25)  # irrelevant, just a random size right now
worksheet1.write('B1', caption)
worksheet1.add_table('B3:K100', {'data': table.values.tolist()})  # also wrong size, from B3 to K100
workbook.close()

But the result looks like this (with different values), so the headers are missing:

[screenshot omitted]

How can I adjust it and save a pivot table in Excel? If I use the pandas command .to_excel, it looks like this:

[screenshot omitted]

Which is fine, but the columns do not respect the width of the names, the background color is not looking nice, and I am also missing a caption.
I found the solution with a combination from this topic:

flattened = pd.DataFrame(table.to_records())
flattened.columns = [column.replace("('Wert', ", "Monat: ").replace(")", "")
                     for column in flattened.columns]  # only for renaming the column headers

And then:

workbook = xlsxwriter.Workbook(excelfilename, options={'nan_inf_to_errors': True})
worksheet = workbook.add_worksheet('Table1')
worksheet.set_column(0, flattened.shape[1], 25)
worksheet.add_table(0, 0, flattened.shape[0], flattened.shape[1] - 1,
                    {'data': flattened.values.tolist(),
                     'columns': [{'header': c} for c in flattened.columns.tolist()],
                     'style': 'Table Style Medium 9'})
workbook.close()
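An alternative worth noting (a sketch, not the answerer's exact method): pandas can drive xlsxwriter itself through pd.ExcelWriter, which writes the headers for you and still exposes the worksheet for styling:

import pandas as pd

with pd.ExcelWriter('myexcel.xlsx', engine='xlsxwriter') as writer:
    flattened.to_excel(writer, sheet_name='Table1', index=False)
    # the underlying xlsxwriter worksheet remains available for formatting
    writer.sheets['Table1'].set_column(0, flattened.shape[1] - 1, 25)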
Shift down one row then rename the column
My data looks like this:

pd.read_csv('/Users/admin/desktop/007538839.csv').head()

   105586.18
0  105582.910
1  105585.230
2  105576.445
3  105580.016
4  105580.266

I want to move that 105586.18 to the 0 index, because right now it is the column name. And after that I want to name this column 'flux'. I've tried

pd.read_csv('/Users/admin/desktop/007538839.csv', sep='\t', names=["flux"])

but it did not work, probably because the dataframe is not in the right format. How can I achieve that?
For me, your code works fine:

import pandas as pd
from io import StringIO

temp = u"""105586.18
105582.910
105585.230
105576.445
105580.016
105580.266"""
# after testing, replace StringIO(temp) with '/Users/admin/desktop/007538839.csv'
df = pd.read_csv(StringIO(temp), sep='\t', names=["flux"])
print(df)

         flux
0  105586.180
1  105582.910
2  105585.230
3  105576.445
4  105580.016
5  105580.266

To overwrite the original file with the same data under the new flux header:

df.to_csv('/Users/admin/desktop/007538839.csv', index=False)
Try this:

df = pd.read_csv('/Users/admin/desktop/007538839.csv', header=None)
df.columns = ['flux']

header=None is your friend here.
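To see why header=None matters, a small illustration (same path as in the question): without it, read_csv consumes the first value as the column name, which is exactly the reported symptom:

import pandas as pd

df_bad = pd.read_csv('/Users/admin/desktop/007538839.csv')  # the first value, 105586.18, becomes the header
df_ok = pd.read_csv('/Users/admin/desktop/007538839.csv', header=None, names=['flux'])  # every row stays data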
How to convert series to dataframe in Pandas
I have two CSVs that I need to compare based on one column, putting matched rows in one CSV and unmatched rows in another. So I created an index on that column in the second CSV and looped through the first:

df1 = pd.read_csv(file1, nrows=100)
df2 = pd.read_csv(file2, nrows=100)
df2.set_index('crc', inplace=True)

matched_list = []
non_matched_list = []
for _, row in df1.iterrows():
    try:
        x = df2.loc[row['crc']]
        matched_list.append(x)
    except KeyError:
        non_matched_list.append(row)

The x here is a series in the following format:

policyID                   448094
statecode                      FL
county                CLAY COUNTY
eq_site_limit           1322376.3
hu_site_limit           1322376.3
fl_site_limit           1322376.3
fr_site_limit           1322376.3
tiv_2011                1322376.3
tiv_2012               1438163.57
eq_site_deductible              0
hu_site_deductible            0.0
fl_site_deductible              0
fr_site_deductible              0
point_latitude          30.063936
point_longitude        -81.707664
line                  Residential
construction              Masonry
point_granularity               3
Name: 448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,0,0.0, dtype: object

My output csv should be in the following format:

policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
114455,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1

for all the series in the matched and unmatched lists. How do I do it? I cannot get rid of the index on the second CSV, as performance is important. Following are the contents of the two CSV files.

File1:

policyID,statecode,county,crc,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
114455,FL,CLAY COUNTY,589658,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,745689.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
333743,FL,CLAY COUNTY,0,12563.76,0,0,79520.76,86854.48,0,0,0,0,30.063236,-81.707703,Residential,Wood,3
172534,FL,CLAY COUNTY,0,254281.5,0,254281.5,254281.5,246144.49,0,0,0,0,30.060614,-81.702675,Residential,Wood,1
785275,FL,CLAY COUNTY,0,515035.62,0,0,515035.62,884419.17,0,0,0,0,30.063236,-81.707703,Residential,Masonry,3
995932,FL,CLAY COUNTY,0,19260000,0,0,19260000,20610000,0,0,0,0,30.102226,-81.713882,Commercial,Reinforced Concrete,1
223488,FL,CLAY COUNTY,328500,328500,328500,328500,328500,348374.25,0,16425,0,0,30.102217,-81.707146,Residential,Wood,1
433512,FL,CLAY COUNTY,315000,315000,315000,315000,315000,265821.57,0,15750,0,0,30.118774,-81.704613,Residential,Wood,1
142071,FL,CLAY COUNTY,705600,705600,705600,705600,705600,1010842.56,14112,35280,0,0,30.100628,-81.703751,Residential,Masonry,1

File2:

policyID,statecode,county,crc,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,190724.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
333743,FL,CLAY COUNTY,0,79520.76,0,0,79520.76,86854.48,0,0,0,0,30.063236,-81.707703,Residential,Wood,3
172534,FL,CLAY COUNTY,0,254281.5,0,254281.5,254281.5,246144.49,0,0,0,0,30.060614,-81.702675,Residential,Wood,1
785275,FL,CLAY COUNTY,0,51564.9,0,0,515035.62,884419.17,0,0,0,0,30.063236,-81.707703,Residential,Masonry,3
995932,FL,CLAY COUNTY,0,457962,0,0,19260000,20610000,0,0,0,0,30.102226,-81.713882,Commercial,Reinforced Concrete,1
223488,FL,CLAY COUNTY,328500,328500,328500,328500,328500,348374.25,0,16425,0,0,30.102217,-81.707146,Residential,Wood,1
433512,FL,CLAY COUNTY,315000,315000,315000,315000,315000,265821.57,0,15750,0,0,30.118774,-81.704613,Residential,Wood,1
142071,FL,CLAY COUNTY,705600,705600,705600,705600,705600,1010842.56,14112,35280,0,0,30.100628,-81.703751,Residential,Masonry,1
253816,FL,CLAY COUNTY,831498.3,831498.3,831498.3,831498.3,831498.3,1117791.48,0,0,0,0,30.10216,-81.719444,Residential,Masonry,1
894922,FL,CLAY COUNTY,0,24059.09,0,0,24059.09,33952.19,0,0,0,0,30.095957,-81.695099,Residential,Wood,1

Edit: Added sample csv
I think you can do it this way, instead of looping:

df1.loc[df1.crc.isin(df2.index)].to_csv('/path/to/matched.csv', index=False)
df1.loc[~df1.crc.isin(df2.index)].to_csv('/path/to/unmatched.csv', index=False)

Demo:

In [62]: df1.loc[df1.crc.isin(df2.index)].to_csv(r'c:/temp/matched.csv', index=False)
In [63]: df1.loc[~df1.crc.isin(df2.index)].to_csv(r'c:/temp/unmatched.csv', index=False)

Results:

matched.csv:

policyID,statecode,county,crc,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0.0,0,0,30.063935999999998,-81.70766400000001,Residential,Masonry,3
333743,FL,CLAY COUNTY,0.0,12563.76,0.0,0.0,79520.76,86854.48,0,0.0,0,0,30.063236,-81.70770300000001,Residential,Wood,3
172534,FL,CLAY COUNTY,0.0,254281.5,0.0,254281.5,254281.5,246144.49,0,0.0,0,0,30.060614,-81.702675,Residential,Wood,1
785275,FL,CLAY COUNTY,0.0,515035.62,0.0,0.0,515035.62,884419.17,0,0.0,0,0,30.063236,-81.70770300000001,Residential,Masonry,3
995932,FL,CLAY COUNTY,0.0,19260000.0,0.0,0.0,19260000.0,20610000.0,0,0.0,0,0,30.102226,-81.713882,Commercial,Reinforced Concrete,1
223488,FL,CLAY COUNTY,328500.0,328500.0,328500.0,328500.0,328500.0,348374.25,0,16425.0,0,0,30.102217,-81.707146,Residential,Wood,1
433512,FL,CLAY COUNTY,315000.0,315000.0,315000.0,315000.0,315000.0,265821.57,0,15750.0,0,0,30.118774,-81.704613,Residential,Wood,1
142071,FL,CLAY COUNTY,705600.0,705600.0,705600.0,705600.0,705600.0,1010842.56,14112,35280.0,0,0,30.100628000000004,-81.703751,Residential,Masonry,1

unmatched.csv:

policyID,statecode,county,crc,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
114455,FL,CLAY COUNTY,589658.0,498960.0,498960.0,498960.0,498960.0,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
206893,FL,CLAY COUNTY,745689.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0.0,0,0,30.089578999999997,-81.700455,Residential,Wood,1
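And if the looping version is kept anyway, the title question (converting the collected Series back into a dataframe) has a short answer: a list of Series converts in one call, and a single Series x becomes a one-row frame with to_frame().T. A minimal sketch, reusing the names from the question's loop:

import pandas as pd

matched_df = pd.DataFrame(matched_list)        # each Series becomes one row
non_matched_df = pd.DataFrame(non_matched_list)
matched_df.to_csv('matched.csv', index=False)  # then write each frame out
one_row = x.to_frame().T                       # a single Series as a one-row DataFrame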