Python Mapping and Anonymizing Script not doing what it should - python

I have a small script that aims to anonymize an Excel file using another Excel file. More specifically, there is a mastersheet that contains the columns "sensitive" and "Anonymized_Value". Another Excel file called "Raw" also has a column named "sensitive" containing the same values as "sensitive" in the mastersheet, so I am trying to replace the "sensitive" values in Raw with the corresponding "Anonymized_Value" from the mastersheet. (Note: all values in "sensitive" are unique, each with its own unique "Anonymized_Value".)
import pandas as pd
# Load the Master_Anonymizer.xlsx file into a pandas dataframe
master_df = pd.read_excel("Master_Anonymizer.xlsx")
# Create a dictionary mapping the "sensitive" to "Anonymized_Value"
sensitive_dict = dict(zip(master_df["sensitive"], master_df["Anonymized_Value"]))
# Load the raw dataset into a pandas dataframe
raw_df = pd.read_excel("Raw.xlsx")
# Find the first column whose name contains "sensitive" (case-insensitive)
sensitive_column = [col for col in raw_df.columns if "sensitive" in col.lower()][0]
# Replace the values in the "sensitive" column with "Anonymized_Value"
raw_df[sensitive_column] = raw_df[sensitive_column].map(sensitive_dict)
# Save the anonymized dataframe to a new excel file
raw_df.to_excel("Anonymized.xlsx", index=False)
When I run it, all the formatting of "Anonymized.xlsx" gets messed up. More specifically, the column names become bolded, and some columns (whose names do not contain "sensitive") are being altered/blanked out.
Any help?
Thank you
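A hedged note on the symptoms: Series.map returns NaN for any value missing from the mapping dictionary, and DataFrame.to_excel rewrites the workbook from scratch with pandas' default bold, bordered header style, so none of the original formatting survives the round trip. A minimal sketch that at least leaves unmatched values in place (assuming the sensitive_column detection above):
# .replace substitutes only the values found in the dict and keeps the rest,
# unlike .map, which turns anything missing from the dict into NaN
raw_df[sensitive_column] = raw_df[sensitive_column].replace(sensitive_dict)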

Related

Make pandas to_excel stop styling

I'm using Pandas to edit an Excel file which other people are using. But when I save it using df.to_excel, Pandas adds an ugly looking black border to cells in the header and in the index. I want it to be written in a plain format, how a CSV file would look if I opened it up in Excel. It would be even better if it were written back using the same styles it was read in with.
Is there any way to make df.to_excel write without styling or with the original styles?
Thanks.
Try this trick:
import io
import pandas as pd

(pd.read_csv(io.StringIO(df.to_csv()), header=None)
    .to_excel("output.xlsx", header=None, index=None))
If you still want index and header values, but without styling, you can use this (requires openpyxl):
def insert_dataframe(df, sheet, start_row=1, start_col=1):
    """Inserts a dataframe into an openpyxl sheet at the given (row, col) position.

    Parameters
    ----------
    df : pandas.DataFrame
        Any dataframe
    sheet : openpyxl.worksheet.worksheet.Worksheet
        The sheet where the dataframe should be inserted
    start_row : int
        The row where the dataframe should be inserted (default is 1)
    start_col : int
        The column where the dataframe should be inserted (default is 1)
    """
    # iterate over the dataframe's index names and insert them as labels
    for name_idx, name in enumerate(df.index.names):
        label_col = start_col + name_idx
        sheet.cell(row=start_row, column=label_col, value=name)
        # for each name, iterate the index values as rows in the current index-name column
        value_row = start_row + 1
        for i_value in list(df.index.values):
            if isinstance(df.index, pd.MultiIndex):
                val = i_value[name_idx]
            else:
                val = i_value
            sheet.cell(row=value_row, column=label_col, value=val)
            # go to the next row
            value_row += 1
    row_idx = 0
    col_idx = label_col + 1
    # insert the column labels and the values beneath them
    for label, content in df.items():
        sheet.cell(row=start_row, column=col_idx, value=label)
        for row_idx, value_ in enumerate(content):
            sheet.cell(row=start_row + row_idx + 1, column=col_idx, value=value_)
        col_idx += 1
Gist: https://gist.github.com/Aer0naut/094ff1b6838b2177a4222591ace8f6bf
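A minimal usage sketch (the workbook name and the dataframe df are hypothetical; openpyxl's load_workbook leaves the file's existing styling intact):
import pandas as pd
from openpyxl import load_workbook

# open the workbook other people are formatting, write the plain values,
# and save it back without touching any existing styles
wb = load_workbook("shared.xlsx")
insert_dataframe(df, wb.active)
wb.save("shared.xlsx")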

Split per attribute

I am trying to read a big CSV and then split it into smaller CSV files, based on unique values in the column team.
At first I created a new dataframe for each team, and new txt files were generated, one for each unique value in the team column.
Code:
import pandas as pd
df = pd.read_csv('combined.csv')
df = df[df.team == 'RED']
df.to_csv('RED.csv')
However, I want to start from a single dataframe, read all unique 'teams', and create a .txt file for each team, with headers.
Is it possible?
pandas.DataFrame.groupby, when iterated without an aggregation, yields the group name together with the sub-dataframe associated with each unique value in the groupby column.
The following code will create a file for the data associated to each unique value in the column used to groupby.
Use f-strings to create a unique filename for each group.
import pandas as pd
# create the dataframe
df = pd.read_csv('combined.csv')
# groupby the desired column and iterate through the groupby object
for group, dataframe in df.groupby('team'):
    # save the dataframe for each group to a csv
    dataframe.to_csv(f'{group}.txt', sep='\t', index=False)

Write multiple dataframes into a single flat text file to input into Pandas

I have an excel file with multiple sheets that I convert into a dictionary of dataframes where the key represents the sheet's name:
xl = pd.ExcelFile(r"D:\Python Code\PerformanceTable.xlsx")
pacdict = {name: pd.read_excel(xl, name) for name in xl.sheet_names}
I would like to replace this input Excel file with a flat text file -- but would still like to end up with the same outcome of a dictionary of dataframes.
Any suggestions on how I might be able to format the text file so it still contains data for multiple, named tables/sheets and can be read into the above format? Preferably still making Pandas' built-in functionality do the heavy lifting.
Loop through each sheet. Create a new column called "sheet_source". Concatenate the sheet dataframes to a master dataframe. Lastly, export to a CSV file.
# create a master dataframe to store the sheets
df_master = pd.DataFrame()

# loop through each dict key
for each_df_key in pacdict.keys():
    # dataframe for each sheet
    sheet_df = pacdict[each_df_key]
    # add column for sheet name
    sheet_df['sheet_source'] = each_df_key
    # concatenate each sheet to the master
    df_master = pd.concat([df_master, sheet_df])

# after the for-loop, export the master dataframe to CSV
df_master.to_csv('new_dataframe.csv', index=False)
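To get back to a dictionary of dataframes, one per original sheet, the flat file can be re-read and split on the sheet_source column (a sketch, reusing the filename from above):
import pandas as pd

df_master = pd.read_csv('new_dataframe.csv')
# rebuild the dict of dataframes keyed by the original sheet name
pacdict = {name: group.drop(columns='sheet_source')
           for name, group in df_master.groupby('sheet_source')}
Incidentally, pd.read_excel(path, sheet_name=None) already returns such a dictionary, keyed by sheet name, straight from the Excel file.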

Cleaning dataframe- assign value in one cell to column

I am reading multiple CSV files from a folder into a dataframe. I loop over all the files in the folder and then concat the dataframes to obtain the final dataframe.
However the CSV file has one summary row from which I want to extract the date, and then add as a new column for all the rows in that csv/dataframe.
df = pd.read_csv(f, header=None, names=['Inverter', 'Day Yield', 'month Yield', 'Year Yield', 'SpecificYieldDay', 'SYMth', 'SYYear', 'Power'], sep=';', **kwargs)
df['date'] = df.loc[[0], ['Day Yield']]
df
I expect the 'date' column to be filled with the date for that file for all the rows in that particular csv, but it gets filled correctly only for the first row.
Refer to the image of the dataframe: I want all the rows of the 'date' column to show 7/25/2019 instead of only the first row.
I have also added an example of one of the csv files I am reading from:
csv file
If I understood correctly, the value that you want to add as a new column for all rows is in df.loc[[0],['Day Yield']].
If that is correct you can do the following:
df = df.assign(date=[df.loc[0, 'Day Yield']] * len(df))
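A self-contained sketch of the broadcast with toy data (not the asker's file):
import pandas as pd

df = pd.DataFrame({'Day Yield': ['7/25/2019', '1.2', '3.4']})
# the scalar from row 0 is repeated once per row, filling the whole column
df = df.assign(date=[df.loc[0, 'Day Yield']] * len(df))
print(df['date'])  # every row now shows 7/25/2019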

additional column when saving pandas data frame to csv file

Here is the code to process and save a csv file, along with the raw input csv file and the output csv file, using pandas on Python 2.7. I am wondering why there is an additional column at the beginning when saving the file? Thanks.
c_a,c_b,c_c,c_d
hello,python,pandas,0.0
hi,java,pandas,1.0
ho,c++,numpy,0.0
import pandas as pd

sample = pd.read_csv('123.csv', header=None, skiprows=1,
                     dtype={0: str, 1: str, 2: str, 3: float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
sample.to_csv('saved.csv')
Here is the saved file, there is an additional column at the beginning, whose values are 0, 1, 2.
cat saved.csv
,c_a,c_b,c_c,c_d
0,hello,python,pandas,0
1,hi,java,pandas,1
2,ho,c++,numpy,0
The additional column corresponds to the index of the dataframe and is added when you read the CSV file. You can use this index to slice, select, or sort your DF in an effective manner.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
http://pandas.pydata.org/pandas-docs/stable/indexing.html
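When the index column has already been written out, it can be read back as the index rather than as data (a sketch using the saved.csv from above):
import pandas as pd

# index_col=0 turns the first, unnamed column back into the dataframe index
sample = pd.read_csv('saved.csv', index_col=0)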
If you want to avoid this index, you can set the index flag to False when you save your dataframe with to_csv. Also, you are removing the header and re-adding it later, but you can use the header of the CSV to avoid this step.
sample = pd.read_csv('123.csv', dtype={0: str, 1: str, 2: str, 3: float})
sample.to_csv('output.csv', index=False)
Hope it helps :)
