update a column and save it back to dataset in pandas - python

I'm working on a football dataset that has a few columns. One column, called TimeUnder, has datatype int64. I want to append the unit 's' to all the values in the column and save it back to the dataset.
I converted the column to string and appended an 's' to each value in TimeUnder, then saved the modifications to a new CSV file:
import pandas as pd
football=pd.read_csv("Football_dataset.csv")
football1=football['TimeUnder'].astype(str) + 's'
football1.to_csv("football_modified.csv")
football_m=pd.read_csv("football_modified.csv")
football_m.head()
But the new CSV has only the modified column; I want all of the previous columns of the dataset along with the modified one.

Currently, you are building a separate Series from the modified column and writing only that to CSV.
Instead, assign the modified column back into the original DataFrame and write the whole DataFrame out.
Change this football1=football['TimeUnder'].astype(str) + 's' to:
football['TimeUnder']=football['TimeUnder'].astype(str) + 's'
Then write to csv:
football.to_csv("football_modified.csv")
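Putting it together, a minimal end-to-end version (the index=False is an optional extra that stops pandas from writing the row index as an unnamed column):
import pandas as pd

football = pd.read_csv("Football_dataset.csv")

# Overwrite the column on the original DataFrame: cast to string, append the unit
football['TimeUnder'] = football['TimeUnder'].astype(str) + 's'

# Write the whole DataFrame, not just the one column
football.to_csv("football_modified.csv", index=False)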

Related

Python Mapping and Anonymizing Script not doing what it should be

I have a small script that aims to anonymize an Excel file using another Excel file. More specifically, there is a mastersheet that contains the columns "sensitive" and "Anonymized_Value". Another Excel file, "Raw", also has a column named "sensitive" with the same values as the mastersheet's, so I am trying to replace the "sensitive" values in Raw with the "Anonymized_Value" from the mastersheet. (Note: all values in "sensitive" are unique, each with its own unique "Anonymized_Value".)
import pandas as pd
# Load the master anonymizer workbook into a pandas dataframe
master_df = pd.read_excel("Master_Anonymizer.xlsx")
# Create a dictionary mapping the "sensitive" to "Anonymized_Value"
sensitive_dict = dict(zip(master_df["sensitive"], master_df["Anonymized_Value"]))
# Load the raw dataset into a pandas dataframe
raw_df = pd.read_excel("Raw.xlsx")
# Find the first column whose name contains "sensitive" (case-insensitive)
sensitive_column = [col for col in raw_df.columns if "sensitive" in col.lower()][0]
# Replace the values in the "sensitive" column with "Anonymized_Value"
raw_df[sensitive_column] = raw_df[sensitive_column].map(sensitive_dict)
# Save the anonymized dataframe to a new excel file
raw_df.to_excel("Anonymized.xlsx", index=False)
When I run it, the formatting of "Anonymized.xlsx" gets messed up. More specifically, the column names become bolded, and some columns (whose names do not contain "sensitive") are being altered/blanked out.
Any help?
Thank you
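One likely cause worth flagging: raw_df.to_excel() writes a brand-new workbook with pandas' default styling, so any formatting in the original Raw.xlsx is lost, and columns pandas parsed as NaN come back blank. A minimal sketch of an alternative that keeps the original formatting by editing cell values in place with openpyxl (this assumes the data sits on the active sheet with headers in row 1; file and column names are taken from the question):
import pandas as pd
from openpyxl import load_workbook

master_df = pd.read_excel("Master_Anonymizer.xlsx")
sensitive_dict = dict(zip(master_df["sensitive"], master_df["Anonymized_Value"]))

# Open the existing workbook so cell styles and untouched columns survive
wb = load_workbook("Raw.xlsx")
ws = wb.active

# Locate the column whose header contains "sensitive" (case-insensitive)
header = [cell.value for cell in ws[1]]
col_idx = next(i for i, name in enumerate(header, start=1)
               if name and "sensitive" in str(name).lower())

# Replace values cell by cell, leaving every other cell untouched
for (cell,) in ws.iter_rows(min_row=2, min_col=col_idx, max_col=col_idx):
    cell.value = sensitive_dict.get(cell.value, cell.value)

wb.save("Anonymized.xlsx")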

pandas slicing data frame by column numbers defined in list

Suppose I had data with 12 columns; the following would get me all 12 of them.
import numpy as np
import pandas as pd
from io import StringIO

train_data = np.asarray(pd.read_csv(StringIO(train_data), sep=',', header=None))
inputs = train_data[:, :12]
However, let's say I want a subset of these columns (not all of them).
If I had a list
a=[1,5,7,10]
is there a smart way I can pass "a" so that I get a new dataframe whose columns reflect the entries of "a", i.e. the first column of the new dataframe is the first column in "a" of the big dataframe, the next column is the 5th column of the big dataframe, etc.?
Thank you.
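A minimal sketch of both routes, assuming "a" is the list of 0-based column positions (the example file name is hypothetical):
import numpy as np
import pandas as pd

a = [1, 5, 7, 10]

df = pd.read_csv("train.csv", header=None)  # hypothetical file name

# On the DataFrame, iloc selects columns by integer position
# (positions are 0-based, so 1 is the second column)
subset_df = df.iloc[:, a]

# On a NumPy array, a list works directly as a fancy index
train_data = df.to_numpy()
inputs = train_data[:, a]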

Using the same name for multiple columns in a large dataframe

I have created a large dataframe from 19 individual CSV files. All the CSV files have a similar data structure/type because they are the same experimental data from multiple runs. After merging all the CSV files into a large dataframe, I want to change the column names. I have 40 columns, and I want to use the same name for some of them: columns 2, 5, 8, … should be named "Counts", columns 3, 6, 9, … should be named "File name", and so on. Right now all the column names are numbers. How can I change the column names?
I have tried this code
newDf.rename(columns = {'0':'Time',tuple(['2','5','8','11','14','17','20','23','26','29','32','35','38','41','44','47','50','53','56']):'File_Name' })
But it didn't work
My datafile looks like this ...
I'm not sure if I understand correctly; you wish to rename the columns based on their content:
df.columns = [f"FileName_{i}" if df[col].dtype == "O" else f"Count_{i}" for i, col in enumerate(df.columns)]
What this does is check each column's data type: if it is object, it assigns a "FileName_" name to that element; otherwise a "Count_" name.
Then rename the first column to "Time". A pandas Index is immutable, so item assignment (df.columns[0] = "Time") raises an error; use rename instead:
df = df.rename(columns={df.columns[0]: "Time"})
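As a quick illustration, here is a minimal, hypothetical frame (toy data, not the asker's file) run through the same comprehension and rename:
import pandas as pd

# Toy frame: integer column labels, one numeric time column,
# then an object (file name) column and a numeric count column
df = pd.DataFrame({0: [0.1, 0.2], 1: ["run1.csv", "run1.csv"], 2: [5, 9]})

df.columns = [f"FileName_{i}" if df[col].dtype == "O" else f"Count_{i}"
              for i, col in enumerate(df.columns)]
df = df.rename(columns={df.columns[0]: "Time"})

print(df.columns.tolist())  # ['Time', 'FileName_1', 'Count_2']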

Cleaning dataframe - assign value in one cell to column

I am reading multiple CSV files from a folder into a dataframe. I loop over all the files in the folder and then concat the dataframes to obtain the final dataframe.
However, each CSV file has one summary row from which I want to extract the date and add it as a new column for all the rows of that csv/dataframe.
df=pd.read_csv(f,header=None,names=['Inverter',"Day Yield",'month Yield','Year Yield','SpecificYieldDay','SYMth','SYYear','Power'],sep=';', **kwargs)
df['date']=df.loc[[0],['Day Yield']]
df
I expect the 'date' column to be filled with that file's date for all the rows of that particular csv, but it gets filled correctly only for the first row.
I want all rows of the 'date' column to show 7/25/2019 (as in the attached image of the dataframe) instead of only the first row.
I have also added an example of one of the CSV files I am reading from.
If I understood correctly, the value that you want to broadcast into a new column for all rows sits in the first row's 'Day Yield' cell.
If that is correct, select it as a scalar rather than as a one-cell DataFrame, and pandas will broadcast it to every row:
df = df.assign(date=df.loc[0, 'Day Yield'])
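For context, a minimal sketch of the surrounding loop (the folder pattern is hypothetical; the column names are taken from the question):
import glob
import pandas as pd

frames = []
for f in glob.glob("data/*.csv"):  # hypothetical folder
    df = pd.read_csv(f, header=None,
                     names=['Inverter', 'Day Yield', 'month Yield', 'Year Yield',
                            'SpecificYieldDay', 'SYMth', 'SYYear', 'Power'],
                     sep=';')
    # Scalar access broadcasts the summary-row date to every row
    df['date'] = df.loc[0, 'Day Yield']
    frames.append(df)

final = pd.concat(frames, ignore_index=True)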

Append new data to a dataframe

I have a csv file with many columns but for simplicity I am explaining the problem using only 3 columns. The column names are 'user', 'A' and 'B'. I have read the file using the read_csv function in pandas. The data is stored as a data frame.
Now I want to remove some rows of this dataframe based on their values: I want to keep only the user rows where column A is not equal to 'a' and column B is not equal to 'b', and skip the rest.
The problem is that I want to build a dataframe dynamically, appending one row at a time, and I do not know in advance how many rows there will be, so I cannot specify the index when defining the dataframe.
I am using the following code:
import pandas as pd
header = ['user', 'A', 'B']
userdata = pd.read_csv('.../path/to/file.csv', sep='\t', usecols=header)
df = pd.DataFrame(columns=header)
for index, row in userdata.iterrows():
    if row['A'] != 'a' and row['B'] != 'b':
        data = {'user': row['user'], 'A': row['A'], 'B': row['B']}
        df.append(data, ignore_index=True)
The 'data' dict is being populated properly, but the append has no effect: at the end, df is still empty.
Any help would be appreciated.
Thank you in advance.
Regarding your immediate problem, append() doesn't modify the DataFrame; it returns a new one. So you would have to reassign df via:
df = df.append(data,ignore_index=True)
But a better solution would be to avoid iteration altogether and simply query for the rows you want. For example:
df = userdata.query('A != "a" and B != "b"')
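Also worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the row-by-row version needs a different shape anyway. If you truly must build the frame incrementally, collecting plain dicts in a list and constructing the DataFrame once at the end is the usual (and much faster) pattern; a sketch:
import pandas as pd

header = ['user', 'A', 'B']
userdata = pd.read_csv('.../path/to/file.csv', sep='\t', usecols=header)  # path as elided in the question

rows = []
for _, row in userdata.iterrows():
    if row['A'] != 'a' and row['B'] != 'b':
        rows.append({'user': row['user'], 'A': row['A'], 'B': row['B']})

df = pd.DataFrame(rows, columns=header)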
