How to update selected headers only in a CSV using Python?

I want to change my csv headers from
column_1, column_2, ABC_column, column_4, XYZ_column
To
new_column_1, new_column_2, ABC_column, new_column_4, XYZ_column
I can easily change all the columns using writer.writerow, but when there is a new value in place of ABC_column I want to keep that as well: if DEF_column appears instead of ABC_column, I don't want to change it either.
So it should only change the columns that are not in 3rd and 5th place, and leave whatever comes in 3rd and 5th place as it is.

Use pandas:
import pandas as pd
df = pd.read_csv(path_to_csv)
df = df.rename(columns={'column_1': 'new_column_1', 'column_2': 'new_column_2', ...})
df.to_csv(path_to_csv, index=False)  # index=False keeps pandas from writing an extra index column
You can apply any renaming logic when building that dictionary.
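If the 3rd and 5th headers should be kept no matter what they contain, a positional rename avoids hardcoding ABC_column/XYZ_column entirely. A minimal sketch (the sample headers are made up to show a DEF_column surviving in 3rd place):

```python
import pandas as pd

# sample frame with a DEF_column in 3rd place to show it is preserved
df = pd.DataFrame(columns=['column_1', 'column_2', 'DEF_column',
                           'column_4', 'XYZ_column'])

# one new name per position; None means "keep whatever header is there"
new_names = ['new_column_1', 'new_column_2', None, 'new_column_4', None]
df.columns = [new if new is not None else old
              for old, new in zip(df.columns, new_names)]
print(list(df.columns))
# ['new_column_1', 'new_column_2', 'DEF_column', 'new_column_4', 'XYZ_column']
```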


Put a new column inside a Dataframe Python

The DataFrame I am working on has a column called "Brand" with a value "SEAT " containing a trailing white space. I managed to strip the white space, but I don't know how to put the cleaned column back into the original DataFrame. I need this because I have to filter the DataFrame by "SEAT" and show those rows.
I tried this:
import pandas as pd
brand_reviewed = df_csv2.Brand.str.rstrip()
brand_correct = 'SEAT'
brand_reviewed.loc[brand_reviewed['Brand'].isin(brand_correct)]
Thank you very much!
As far as I understand,
you're trying to return rows that match the pattern "SEAT".
You are not forced to create a new column. You can directly do the following:
df2 = brand_reviewed[brand_reviewed.Brand.str.rstrip() == "SEAT"]
print(df2)
You have done great. I will also mention another way to clean the white spaces. Also, if you just want to add a new column to your current DataFrame, just write the last line of this code.
import pandas as pd
brand_reviewed = pd.read_csv("df_csv2.csv")
data2 = brand_reviewed["Brand"].str.strip()
brand_reviewed["New Column"] = data2
If you have another query, let me know.
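Combining both answers into one pass (a sketch with inline sample data, since df_csv2.csv isn't available here): overwrite the column with its stripped values, then filter.

```python
import pandas as pd

# hypothetical stand-in for the CSV contents
df_csv2 = pd.DataFrame({'Brand': ['SEAT ', 'AUDI', 'SEAT'],
                        'Model': ['Ibiza', 'A3', 'Leon']})

df_csv2['Brand'] = df_csv2['Brand'].str.strip()  # replace in place, no new column needed
seat_rows = df_csv2[df_csv2['Brand'] == 'SEAT']
print(seat_rows)
```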

update 'cell' in csv row by finding a unique value in a column header?

If I have a simple csv like this:
name,age,color,team,completed
tim,34,green,5
jim,31,blue,6
kim,33,yellow,5
I want to, in Python (pandas is fine if a third-party module is needed), find an id, in this case a name (row), and then update the value under 'completed' to YES. The names will always be unique. The sheet may not always be in the same order, but the header names will always be the same.
Is there a way to find the cell coords at name=="Jim" and 'completed' ?
Good evening,
Importing CSV
While I understand you may desire only using core Python modules, I recommend using Pandas for this task.
import pandas as pd
df = pd.read_csv('csv_file.csv')
Conditional Variable Assignment
One way is to use .loc[row, column] to select the rows where df['name'] == 'jim' and assign 'YES' to the 'completed' column. Rows where 'name' is not equal to 'jim' are left as missing values.
df.loc[df['name'] == 'jim', 'completed'] = 'YES'
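Putting the whole round trip together (a sketch; the CSV is inlined here with trailing commas so every row has a 'completed' field, and index=False keeps the index out of the output):

```python
import pandas as pd
from io import StringIO

# inline stand-in for csv_file.csv
csv_text = """name,age,color,team,completed
tim,34,green,5,
jim,31,blue,6,
kim,33,yellow,5,
"""

df = pd.read_csv(StringIO(csv_text))
df.loc[df['name'] == 'jim', 'completed'] = 'YES'
out = df.to_csv(index=False)  # pass a path instead to write the file back
print(out)
```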

Dynamically update pandas column names to avoid code change

Is there a way to dynamically update column names that are based on previous column names? Or what are best practices for column names while processing data? Below I explain the problem:
When processing data, I often need to create columns that are calculated from the previous columns, and I set up the names like below:
|STUDENT|GRADE|GRADE_AVG|GRADE_AVG_FORMATTED|GRADE_AVG_FORMATTED_FINAL
The problem is, if I need to make a change in the middle of this data flow [for example, hypothetically, say I needed to scale the grade before taking the average], I would have to rename all the column names that were produced after this point. See below:
|STUDENT|GRADE|**GRADE_SCALED**|GRADE_SCALED_AVG|GRADE_SCALED_AVG_FORMATTED|GRADE_SCALED_AVG_FORMATTED_FINAL
Since the code that calculates each column is based on the previous column names, this renaming gets really cumbersome, especially for big datasets with a lot of code already written. Any suggestions on how to dynamically update the column names, or best practices for it?
To clarify, an extension of the example:
my code would look like:
df[GRADE_AVG] = df[GRADE].apply(something)
df[GRADE_AVG_FORMATTED] = df[GRADE_AVG].apply(something)
df[GRADE_AVG_FORMATTED_FINAL] = df[GRADE_AVG_FORMATTED].apply(something)
...
... more column names based on the previous one..
...
df[FINAL_SCORE] = df[GRADE_AVG_FORMATTED_FINAL_REVISED...etc]
And then... I need to change GRADE_AVG to GRADE_SCALED_AVG in the code, so I would have to change those column names. This is a small example, but when there are many column names based on the previous one, changing the code gets messy.
What I do is change all the column names in the code, like below (but this gets really impractical), hence my question:
df[GRADE_SCALED_AVG] = df[GRADE].apply(something)
df[GRADE_SCALED_AVG_FORMATTED] = df[GRADE_SCALED_AVG].apply(something)
df[GRADE_SCALED_AVG_FORMATTED_FINAL] = df[GRADE_SCALED_AVG_FORMATTED].apply(something)
...
... more column names based on the previous one..
...
df[FINAL_SCORE] = df[GRADE_SCALED_AVG_FORMATTED_FINAL_REVISED...etc]
Let's say your columns start with GRADE. You can do this:
df.columns = ['GRADE_SCALED_'+ '_'.join(x.split('_')[1:]) if x.startswith('GRADE') else x for x in df.columns]
# sample test case
>>> l = ['abc','GRADE_AVG','GRADE_AVG_TOTAL']
>>> ['GRADE_SCALED_'+ '_'.join(x.split('_')[1:]) if x.startswith('GRADE') else x for x in l]
['abc', 'GRADE_SCALED_AVG', 'GRADE_SCALED_AVG_TOTAL']
A nice way to rename dynamically is with rename method:
import pandas as pd
import re
header = '|STUDENT|GRADE|GRADE_AVG|GRADE_AVG_FORMATTED|GRADE_AVG_FORMATTED_FINAL'
df = pd.DataFrame(columns=header.split('|')) # your dataframe
print(df)
# now rename: can take a function or a dictionary as a parameter
df1 = df.rename(lambda x: re.sub('^GRADE', 'GRADE_SCALE', x), axis=1)
print(df1)
#Empty DataFrame
#Columns: [, STUDENT, GRADE, GRADE_AVG, GRADE_AVG_FORMATTED, GRADE_AVG_FORMATTED_FINAL]
#Index: []
#Empty DataFrame
#Columns: [, STUDENT, GRADE_SCALE, GRADE_SCALE_AVG, GRADE_SCALE_AVG_FORMATTED, GRADE_SCALE_AVG_FORMATTED_FINAL]
#Index: []
However, in your case, I'm not sure this is what you are looking for. Are the AVG and FORMATTED columns generated from the GRADE column? Also, is this renaming or replacing? Doesn't the content of the columns change as well?
It seems a more complete description of the problem might help.
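On the best-practices side of the question, one common approach (a sketch, not taken from the answers above) is to bind each column name to a single variable and derive downstream names from it, so a mid-pipeline change is made in exactly one place:

```python
import pandas as pd

# change GRADE_BASE once and every downstream name follows
GRADE_BASE = 'GRADE_SCALED'
AVG = f'{GRADE_BASE}_AVG'
FMT = f'{AVG}_FORMATTED'

df = pd.DataFrame({'STUDENT': ['ana', 'bo'], GRADE_BASE: [80, 90]})
df[AVG] = df[GRADE_BASE] * 1.0          # stand-in for the real averaging step
df[FMT] = df[AVG].round(1).astype(str)  # stand-in for the formatting step
print(list(df.columns))
```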

concatenate excel data with python or Excel

Here's my problem, I have an Excel sheet with 2 columns (see below)
I'd like to print (on the Python console or in an Excel cell) all the data in this form:
"1": ["1123", "1165", "1143", "1091", "n"], where n ∈ [A2; A205]
We don't really care about column B, but I need every postal code in this specific form.
Is there a way to do it with Excel, or in Python with pandas? (If you have any other ideas I would love to hear them.)
Cheers
I think you can use parse_cols (renamed usecols in newer pandas) to parse the first column, and then filter out all rows from 205 to 1000 with skiprows in read_excel:
df = pd.read_excel('test.xls',
                   sheet_name='Sheet1',
                   parse_cols=0,
                   skiprows=list(range(205, 1000)))
print(df)
Last, use tolist to convert the first column to a list:
print({"1": df.iloc[:,0].tolist()})
The simplest solution is to parse only the first column and then use iloc:
df = pd.read_excel('test.xls',
                   parse_cols=0)
print({"1": df.iloc[:206, 0].astype(str).tolist()})
I am not familiar with excel, but pandas could easily handle this problem.
First, read the Excel file into a DataFrame
import pandas as pd
df = pd.read_excel(filename)
Then, print as you like
print({"1": list(df.iloc[0:N]['A'])})
where N is the amount you would like to print. That is it. If the list is not a string list, you will need to cast the ints to strings.
Also, there are a lot of parameters that control how read_excel loads the file; you can go through the documentation to set suitable ones.
Hope this is helpful to you.
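The dict-building step itself doesn't depend on Excel at all; once the first column is in a DataFrame, it is one line. A minimal sketch (the column values are made up from the question's example, and the frame stands in for the sheet read by read_excel):

```python
import pandas as pd

# hypothetical stand-in for the first column of the sheet
df = pd.DataFrame({'A': ['1123', '1165', '1143', '1091']})

result = {"1": df['A'].astype(str).tolist()}
print(result)
# {'1': ['1123', '1165', '1143', '1091']}
```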

Append new data to a dataframe

I have a csv file with many columns but for simplicity I am explaining the problem using only 3 columns. The column names are 'user', 'A' and 'B'. I have read the file using the read_csv function in pandas. The data is stored as a data frame.
Now I want to remove some rows in this dataframe based on their values: if the value in column A is not equal to 'a' and column B is not equal to 'b', I want to skip those user rows.
The problem is I want to dynamically create a dataframe to which I can append one row at a time. Also I do not know the number of rows that there would be. Therefore, I cannot specify the index when defining the dataframe.
I am using the following code:
import pandas as pd

header = ['user', 'A', 'B']
userdata = pd.read_csv('.../path/to/file.csv', sep='\t', usecols=header)
df = pd.DataFrame(columns=header)
for index, row in userdata.iterrows():
    if row['A'] != 'a' and row['B'] != 'b':
        data = {'user': row['user'], 'A': row['A'], 'B': row['B']}
        df.append(data, ignore_index=True)
The 'data' dict is being populated properly, but the append has no effect; at the end, df is still empty.
Any help would be appreciated.
Thank you in advance.
Regarding your immediate problem, append() doesn't modify the DataFrame; it returns a new one. So you would have to reassign df via:
df = df.append(data,ignore_index=True)
But a better solution would be to avoid iteration altogether and simply query for the rows you want. For example:
df = userdata.query('A != "a" and B != "b"')
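One caveat worth noting: DataFrame.append was deprecated and removed in pandas 2.0. If you really do need row-by-row accumulation, the usual pattern is to collect dicts in a list and build the frame once at the end. A sketch with inline sample data (the values are made up to exercise the filter):

```python
import pandas as pd

# hypothetical stand-in for the CSV contents
userdata = pd.DataFrame({'user': ['u1', 'u2', 'u3'],
                         'A': ['a', 'x', 'y'],
                         'B': ['b', 'y', 'b']})

rows = []
for _, row in userdata.iterrows():
    if row['A'] != 'a' and row['B'] != 'b':
        rows.append(row.to_dict())

df = pd.DataFrame(rows)  # one construction instead of repeated appends
print(df)
```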
