Delete a row when a cell is empty - python

I'm trying to delete rows with an empty cell from the 'calories.xlsx' spreadsheet and send all data, except those empty rows, to the 'destination.xlsx' spreadsheet. The code below is how far I got, but it still does not delete rows that have an empty value in the calories column.
This is the data set (screenshot omitted):
How can I develop my code to solve this problem?
import pandas as pd
FileName = 'calories.xlsx'
SheetName = pd.read_excel(FileName, sheet_name = 'Sheet1')
df = SheetName
print(df)
ListCalories = ['Calories']
print(ListCalories)
for Cell in ListCalories:
    if Cell == 'NaN':
        df.drop[Cell]
print(df)
df.to_excel('destination.xlsx')

Create dummy data
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'calories': [2306, 3256, 1235, np.nan, 3654, 3256],
    'Person': ['person1', 'person2', 'person3', 'person4', 'person5', 'person6'],
})
Print data frame
calories Person
0 2306.0 person1
1 3256.0 person2
2 1235.0 person3
3 NaN person4
4 3654.0 person5
5 3256.0 person6
Remove rows where the calories value is missing:
new_df=df.dropna(how='any',subset=['calories'])
Result
calories Person
0 2306.0 person1
1 3256.0 person2
2 1235.0 person3
4 3654.0 person5
5 3256.0 person6
Save as Excel:
new_df.to_excel('destination.xlsx',index=False)

Your ListCalories contains only one element, the string 'Calories'; I'll assume this was a typo.
What you are probably trying to do is:
import pandas as pd
FileName = 'calories.xlsx'
df = pd.read_excel(FileName, sheet_name = 'Sheet1')
print(df)
# you don't need this, but I kept it for you
ListCalories = df['Calories']
print(ListCalories)
clean_df = df[df['Calories'].notna()] # this will select only the rows that don't have an NA value in the Calories column
print(clean_df)
clean_df.to_excel('destination.xlsx')
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html
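If you don't want the DataFrame's row index written into the file, pass index=False, as in the first answer:
clean_df.to_excel('destination.xlsx', index=False)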

Related

Pandas .Split If Else

I've created a program to split a dataset's column into 4 columns, but some of my datasets only have 2 columns, so when that section is reached, an error is thrown. I believe an if/else statement can help with this.
Here is the code for my program:
import pandas as pd
import os
# reading csv file from url
filepath = "C:/Users/username/folder1/folder2/folder3/b 2 col.csv"
file_encoding = 'cp1252'
data = pd.read_csv(filepath , header=None, names = list(range(0,4)) , encoding=file_encoding)
data.columns =['ID', 'Name', 'S ID', 'SName']
# new data frame with split value columns
new = data["Name"].str.split(",", n = 1, expand = True)
# making separate first name column from new data frame
data["Last Name"]= new[0]
# making separate last name column from new data frame
data["First Name"]= new[1]
# new data frame with split value columns (2)
new = data["SName"].str.split(",", n = 1, expand = True)
# making separate first name column from new data frame
data["S Last Name"]= new[0]
# making separate last name column from new data frame
data["S First Name"]= new[1]
# Saving File name as its path
filename = os.path.basename(filepath) + ".xlsx"
data.to_excel(filename, index=False)
data
This section onwards is responsible for splitting the second set of data:
# new data frame with split value columns (2)
new = data["SName"].str.split(",", n = 1, expand = True)
The problem is that not all of my CSVs have four columns, so I'd like to implement an if/else here that checks whether the data is present: if so, proceed; else, skip and move to the next section:
# Saving File name as its path
filename = os.path.basename(filepath) + ".xlsx"
data.to_excel(filename, index=False)
data
With that, I believe the program would work with all my datasets.
Link to an example of my datasets: https://drive.google.com/drive/folders/1nkLgo5tSFsxOTCa5EMWZlezDFi8AyaDq?usp=sharing
Thanks for helping
IIUC, assuming the (.csv) files are in the same folder, here is a proposal using pandas.concat:
import pandas as pd
import os

filepath = "C:/Users/username/folder1/folder2"
file_encoding = "cp1252"

list_df = []
for filename in os.listdir(filepath):
    if filename.endswith(".csv"):
        df = pd.read_csv(os.path.join(filepath, filename),
                         header=None, encoding=file_encoding, on_bad_lines="skip")
        df = (pd.concat([df.iloc[:, i:i+5].pipe(lambda df_: df_.rename(columns={col: i for i, col in enumerate(df_.columns)}))
                         for i in range(0, df.shape[1], 5)], axis=0)
                .set_axis(["ID", "FullName", "Street No", "Street Add 1", "Street Add 2"], axis=1)
                .dropna(how="all"))
        df.insert(0, "filename", filename)  # comment this line out if you don't want the filename as a column
        list_df.append(df)

out = (pd.concat(list_df, ignore_index=True)
         .pipe(lambda df_: df_.join(df_["FullName"]
                                       .str.split(", ", expand=True)
                                       .rename(columns={0: "FirstName", 1: "LastName"}))))
Output:
print(out.head())
filename ID FullName Street No Street Add 1 Street Add 2 FirstName LastName
0 a 4 col upd.csv NaN Name NaN NaN NaN Name None
1 a 4 col upd.csv 1.0 Bruce, Wayne Street No 1 Street Add 1 Street Add 2 Bruce Wayne
2 a 4 col upd.csv 2.0 James, Gordon Street No 2 Street Add 2 Street Add 3 James Gordon
3 a 4 col upd.csv 3.0 Fish, Mooney Street No 3 Street Add 3 Street Add 4 Fish Mooney
4 a 4 col upd.csv 4.0 Selina, Kyle Street No 4 Street Add 4 Street Add 5 Selina Kyle
You can split your 2 columns into 4 columns like this.
import numpy as np

# if some columns are missing
data['First Name'] = np.nan
data['Last Name'] = np.nan
data['S First Name'] = np.nan
data['S Last Name'] = np.nan
# if there are no missing columns, remove the lines above
data[['First Name', 'Last Name']] = data.Name.astype(str).str.split(",", expand=True)
data[['S First Name', 'S Last Name']] = data.SName.astype(str).str.split(",", expand=True)
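For the literal if/else check asked about in the question, here is a minimal sketch (assuming the same data frame as in the question): the second split only runs when the SName column exists and actually holds values.
# sketch: skip the second split when SName is absent or entirely empty
if "SName" in data.columns and data["SName"].notna().any():
    new = data["SName"].str.split(",", n=1, expand=True)
    data["S Last Name"] = new[0]
    data["S First Name"] = new[1]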

How to group data by count of columns in Pandas?

I have a CSV file with a lot of rows and a varying number of columns.
How can I group the data by column count and show each group in a different frame?
The CSV file contains the following data:
1 OLEG US FRANCE BIG
1 OLEG FR 18
1 NATA 18
Because each row has a different number of columns, I have to group the rows by column count and show 3 frames, so that I can then set headers:
ID NAME STATE COUNTRY HOBBY
FR1: 1 OLEG US FRANCE BIG
ID NAME COUNTRY AGE
FR2: 1 OLEG FR 18
ID NAME AGE
FR3: 1 NATA 18
In other words, I need to group the rows by column count and show them in different dataframes.
Since pandas doesn't allow columns of different lengths, just don't use it to import your data. Your goal is to create three separate df's, so first import the data as lists, and then deal with the different lengths.
One way to solve this is to read the data with csv.reader and create the df's with a list comprehension combined with a condition on the length of the lists.
import csv
import pandas as pd

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

df1 = pd.DataFrame([item for item in data if len(item)==3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item)==4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item)==5], columns='ID NAME STATE COUNTRY HOBBY'.split())
print(df1, df2, df3, sep='\n\n')
ID NAME AGE
0 1 NATA 18
ID NAME COUNTRY AGE
0 1 OLEG FR 18
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
If you need to hardcode too many lines for the same step (e.g. too many df's), you should consider using a loop to create them and storing each dataframe as a key/value pair in a dictionary.
EDIT
Here is a slightly optimized way of creating those df's. I think you can't get around creating a list of the columns you want to use for the separate df's, so you need to know which variations of column count occur in your data (unless you want to create those df's without naming the columns).
col_list = [['ID', 'NAME', 'AGE'],
            ['ID', 'NAME', 'COUNTRY', 'AGE'],
            ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame([item for item in data if len(item)==len(cols)], columns=cols)

for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')
key='df_3':
ID NAME AGE
0 1 NATA 18
key='df_4':
ID NAME COUNTRY AGE
0 1 OLEG FR 18
key='df_5':
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
Now you don't have variables for your df's; instead you have them in a dictionary as keys. (I named each df after the number of columns it has; df_3 is the df with three columns.)
If you need to import the data with pandas, you could have a look at this post.
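For completeness, a minimal sketch of importing the ragged file with pandas directly (assuming a space-delimited 'input.csv' with at most 5 columns): pad every row to the maximum width, then group by the count of non-null cells.
import pandas as pd

# names=range(5) pads shorter rows with NaN up to 5 columns
df = pd.read_csv('input.csv', sep=' ', header=None, names=range(5))
for n, group in df.groupby(df.notna().sum(axis=1)):
    print(f'rows with {n} columns:')
    print(group.dropna(axis=1, how='all'), '\n')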

How to save a pandas pivot table with xlsxwriter in excel

I want to save a pandas pivot table, properly and nicely formatted, into an Excel workbook.
I have a pandas pivot table built like this:
table = pd.pivot_table(df, values=['Wert'], index=['Area', 'Name'], columns=['Monat'],
                       aggfunc={'Wert': np.sum}, margins=True).fillna('')
From my original dataframe:
df
Area Name Monat Wert
A A 1 2
A A 1 3
A B 1 2
A A 2 1
so the pivot table looks like this:
Wert
Monat 1 2 All
Area Name
A A 5 1 6
B 2 2
All 7 1 8
Then I want to save this in an Excel workbook with the following code:
import pandas as pd
import xlsxwriter
workbook = xlsxwriter.Workbook('myexcel.xlsx')
worksheet1 = workbook.add_worksheet('table1')
caption = 'Table1'
worksheet1.set_column(1, 14, 25) #irrelevant, just a random size right now
worksheet1.write('B1', caption)
worksheet1.add_table('B3:K100', {'data': table.values.tolist()}) #also wrong size from B3 to K100
workbook.close()
But then the output looks like this (screenshot omitted; different values): the headers are missing.
How can I adjust it and save a pivot table in excel?
If I use the pandas command .to_excel instead, it looks like this:
Which is fine, but the column width does not respect the length of the names, the background color does not look nice, and I am also missing a caption.
I found the solution with a combination from this topic:
flattened = pd.DataFrame(table.to_records())
flattened.columns = [column.replace("('Wert', ", "Monat: ").replace(")", "") for column in flattened.columns]  # only for renaming the column headers
And then:
workbook = xlsxwriter.Workbook(excelfilename, options={'nan_inf_to_errors': True})
worksheet = workbook.add_worksheet('Table1')
worksheet.set_column(0, flattened.shape[1], 25)
worksheet.add_table(0, 0, flattened.shape[0], flattened.shape[1]-1,
                    {'data': flattened.values.tolist(),
                     'columns': [{'header': c} for c in flattened.columns.tolist()],
                     'style': 'Table Style Medium 9'})
workbook.close()
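For reference, the same flattened frame can also be written through pandas with the xlsxwriter engine, which keeps the headers without building the table by hand (a sketch, assuming the table from above):
import pandas as pd

flattened = pd.DataFrame(table.to_records())  # flatten the MultiIndex into plain columns
with pd.ExcelWriter('myexcel.xlsx', engine='xlsxwriter') as writer:
    flattened.to_excel(writer, sheet_name='Table1', index=False)
    writer.sheets['Table1'].set_column(0, flattened.shape[1] - 1, 25)  # widen the columns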

Shift down one row then rename the column

My data is looking like this:
pd.read_csv('/Users/admin/desktop/007538839.csv').head()
105586.18
0 105582.910
1 105585.230
2 105576.445
3 105580.016
4 105580.266
I want to move that 105586.18 to index 0, because right now it is the column name. After that I want to name this column 'flux'. I've tried
pd.read_csv('/Users/admin/desktop/007538839.csv', sep='\t', names = ["flux"])
but it did not work, probably because the dataframe is not in the right format.
How can I achieve that?
For me your code works fine:
import pandas as pd
temp=u"""105586.18
105582.910
105585.230
105576.445
105580.016
105580.266"""
from io import StringIO  # pd.compat.StringIO was removed in newer pandas
# after testing, replace 'StringIO(temp)' with '/Users/admin/desktop/007538839.csv'
df = pd.read_csv(StringIO(temp), sep='\t', names=["flux"])
print (df)
flux
0 105586.180
1 105582.910
2 105585.230
3 105576.445
4 105580.016
5 105580.266
To overwrite the original file with the same data under the new header flux:
df.to_csv('/Users/admin/desktop/007538839.csv', index=False)
Try this:
df=pd.read_csv('/Users/admin/desktop/007538839.csv',header=None)
df.columns=['flux']
header=None is your friend here.
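The two steps can also be combined into a single read_csv call by passing both arguments at once:
df = pd.read_csv('/Users/admin/desktop/007538839.csv', header=None, names=['flux'])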

How to convert series to dataframe in Pandas

I have two CSVs that I need to compare based on one column. I need to put the matched rows in one CSV and the unmatched rows in another.
So I created an index on that column in the second CSV and looped through the first:
df1 = pd.read_csv(file1, nrows=100)
df2 = pd.read_csv(file2, nrows=100)
df2.set_index('crc', inplace=True)

matched_list = []
non_matched_list = []
for _, row in df1.iterrows():
    try:
        x = df2.loc[row['crc']]
        matched_list.append(x)
    except KeyError:
        non_matched_list.append(row)
The x here is a series in the following format
policyID 448094
statecode FL
county CLAY COUNTY
eq_site_limit 1322376.3
hu_site_limit 1322376.3
fl_site_limit 1322376.3
fr_site_limit 1322376.3
tiv_2011 1322376.3
tiv_2012 1438163.57
eq_site_deductible 0
hu_site_deductible 0.0
fl_site_deductible 0
fr_site_deductible 0
point_latitude 30.063936
point_longitude -81.707664
line Residential
construction Masonry
point_granularity 3
Name: 448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,0,0.0, dtype: object
My output CSV should be in the following format:
policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
114455,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
How do I do that for all the series in the matched and unmatched lists?
I cannot get rid of the index on the second CSV, as performance is important.
Following are the content of two csv files.
File1:
policyID,statecode,county,crc,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
114455,FL,CLAY COUNTY,589658,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,745689.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
333743,FL,CLAY COUNTY,0,12563.76,0,0,79520.76,86854.48,0,0,0,0,30.063236,-81.707703,Residential,Wood,3
172534,FL,CLAY COUNTY,0,254281.5,0,254281.5,254281.5,246144.49,0,0,0,0,30.060614,-81.702675,Residential,Wood,1
785275,FL,CLAY COUNTY,0,515035.62,0,0,515035.62,884419.17,0,0,0,0,30.063236,-81.707703,Residential,Masonry,3
995932,FL,CLAY COUNTY,0,19260000,0,0,19260000,20610000,0,0,0,0,30.102226,-81.713882,Commercial,Reinforced Concrete,1
223488,FL,CLAY COUNTY,328500,328500,328500,328500,328500,348374.25,0,16425,0,0,30.102217,-81.707146,Residential,Wood,1
433512,FL,CLAY COUNTY,315000,315000,315000,315000,315000,265821.57,0,15750,0,0,30.118774,-81.704613,Residential,Wood,1
142071,FL,CLAY COUNTY,705600,705600,705600,705600,705600,1010842.56,14112,35280,0,0,30.100628,-81.703751,Residential,Masonry,1
File2:
policyID,statecode,county,crc,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,190724.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
333743,FL,CLAY COUNTY,0,79520.76,0,0,79520.76,86854.48,0,0,0,0,30.063236,-81.707703,Residential,Wood,3
172534,FL,CLAY COUNTY,0,254281.5,0,254281.5,254281.5,246144.49,0,0,0,0,30.060614,-81.702675,Residential,Wood,1
785275,FL,CLAY COUNTY,0,51564.9,0,0,515035.62,884419.17,0,0,0,0,30.063236,-81.707703,Residential,Masonry,3
995932,FL,CLAY COUNTY,0,457962,0,0,19260000,20610000,0,0,0,0,30.102226,-81.713882,Commercial,Reinforced Concrete,1
223488,FL,CLAY COUNTY,328500,328500,328500,328500,328500,348374.25,0,16425,0,0,30.102217,-81.707146,Residential,Wood,1
433512,FL,CLAY COUNTY,315000,315000,315000,315000,315000,265821.57,0,15750,0,0,30.118774,-81.704613,Residential,Wood,1
142071,FL,CLAY COUNTY,705600,705600,705600,705600,705600,1010842.56,14112,35280,0,0,30.100628,-81.703751,Residential,Masonry,1
253816,FL,CLAY COUNTY,831498.3,831498.3,831498.3,831498.3,831498.3,1117791.48,0,0,0,0,30.10216,-81.719444,Residential,Masonry,1
894922,FL,CLAY COUNTY,0,24059.09,0,0,24059.09,33952.19,0,0,0,0,30.095957,-81.695099,Residential,Wood,1
Edit:
Added sample csv
I think you can do it this way:
df1.loc[df1.crc.isin(df2.index)].to_csv('/path/to/matched.csv', index=False)
df1.loc[~df1.crc.isin(df2.index)].to_csv('/path/to/unmatched.csv', index=False)
instead of looping...
Demo:
In [62]: df1.loc[df1.crc.isin(df2.index)].to_csv(r'c:/temp/matched.csv', index=False)
In [63]: df1.loc[~df1.crc.isin(df2.index)].to_csv(r'c:/temp/unmatched.csv', index=False)
Results:
matched.csv:
policyID,statecode,county,crc,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0.0,0,0,30.063935999999998,-81.70766400000001,Residential,Masonry,3
333743,FL,CLAY COUNTY,0.0,12563.76,0.0,0.0,79520.76,86854.48,0,0.0,0,0,30.063236,-81.70770300000001,Residential,Wood,3
172534,FL,CLAY COUNTY,0.0,254281.5,0.0,254281.5,254281.5,246144.49,0,0.0,0,0,30.060614,-81.702675,Residential,Wood,1
785275,FL,CLAY COUNTY,0.0,515035.62,0.0,0.0,515035.62,884419.17,0,0.0,0,0,30.063236,-81.70770300000001,Residential,Masonry,3
995932,FL,CLAY COUNTY,0.0,19260000.0,0.0,0.0,19260000.0,20610000.0,0,0.0,0,0,30.102226,-81.713882,Commercial,Reinforced Concrete,1
223488,FL,CLAY COUNTY,328500.0,328500.0,328500.0,328500.0,328500.0,348374.25,0,16425.0,0,0,30.102217,-81.707146,Residential,Wood,1
433512,FL,CLAY COUNTY,315000.0,315000.0,315000.0,315000.0,315000.0,265821.57,0,15750.0,0,0,30.118774,-81.704613,Residential,Wood,1
142071,FL,CLAY COUNTY,705600.0,705600.0,705600.0,705600.0,705600.0,1010842.56,14112,35280.0,0,0,30.100628000000004,-81.703751,Residential,Masonry,1
unmatched.csv:
policyID,statecode,county,crc,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
114455,FL,CLAY COUNTY,589658.0,498960.0,498960.0,498960.0,498960.0,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
206893,FL,CLAY COUNTY,745689.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0.0,0,0,30.089578999999997,-81.700455,Residential,Wood,1
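If you do keep the loop from the question, the lists of series can be turned back into dataframes directly (a sketch, assuming matched_list and non_matched_list from the question; pd.DataFrame turns a list of Series into one row per Series):
matched_df = pd.DataFrame(matched_list).reset_index(drop=True)
non_matched_df = pd.DataFrame(non_matched_list).reset_index(drop=True)
matched_df.to_csv('matched.csv', index=False)
non_matched_df.to_csv('unmatched.csv', index=False)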
