How can I clean text in an Excel file with Python? - python

I have an Excel file with numbers (integers) in some rows of the first column (A) and text in all rows of the second column (B):
I want to clean this text, that is I want to remove tags like < b r > (without spaces). My current approach doesn't seem to work:
file_name = "F:\Project\comments_all_sorted.xlsx"
import pandas as pd
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.replace('<br>', '')
clean_df.to_excel('output.xlsx')
What this code does (which I don't want it to do) is it adds running numbers in the first column (A), replacing also the few numbers that were already there, and it adds a first row with '1' in second column of this row (cell 1B):
I'm sure there's an easy way to solve my problem and I'm just not trained enough to see it.
Thanks!

Try this:
df['column_name'] = df['column_name'].str.replace(r'<br>', '')

The index in the output file can be turned off with index=False in the df.to_excel function, i.e.:
clean_df.to_excel('output.xlsx', index=False)

Calling .replace('<br>', '') on a whole dataframe only replaces cells whose entire value is '<br>'; it does not remove the tag from inside longer text unless you pass regex=True. An alternative is to call .str.replace on each column explicitly. In this case, I just iterate through all columns in case there is more than one.
To get rid of the first column with the sequential numbers (that's the index of the dataframe), add the parameter index=False. The number 1 at the top is the column name. To get rid of that, use header=False.
import pandas as pd
file_name = r"F:\Project\comments_all_sorted.xlsx"  # raw string so the backslashes are taken literally
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.copy()
for col in clean_df.columns:
    clean_df[col] = df[col].str.replace('<br>', '')
clean_df.to_excel('output.xlsx', index=False, header=False)
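To make the difference between the approaches above concrete, here is a minimal sketch on a small in-memory DataFrame. The sample strings are made up for illustration, and the column label 0 stands in for the text column from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for column B of the spreadsheet
df = pd.DataFrame({0: ["first line<br>second line", "<br>", "no tags here"]})

# Without regex=True, DataFrame.replace only matches whole cell values
whole_cell = df.replace("<br>", "")

# With regex=True, the pattern is also replaced inside longer strings
substring = df.replace("<br>", "", regex=True)

# Equivalent per-column approach using the string accessor
per_column = df[0].str.replace("<br>", "", regex=False)

print(whole_cell[0].tolist())
print(substring[0].tolist())
print(per_column.tolist())
```

Either of the last two results can then be written out with to_excel('output.xlsx', index=False, header=False) as described above.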

Related

Reset labels in Pandas DataFrame, Python

I have a CSV file whose first row contains wrong data; the real column labels are in row 2. When I load the file into a DataFrame, the column labels are therefore incorrect and the correct names end up as the values of row 0. Is there a function similar to reset_index() but for columns? PS: I cannot change the CSV file. [Image: DataFrame with wrong labels]
Suppose your CSV file is data.csv.
Try this code:
import pandas as pd
# read the CSV file
df = pd.read_csv('data.csv')
# replace the header names with integers
df.columns = range(df.shape[1])
# save the data to another CSV file without the header row
df.to_csv('data_without_header.csv', header=False, index=False)
# read the new CSV file; its first row (the correct names) becomes the header
new_df = pd.read_csv('data_without_header.csv')
# display the first rows of the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass the "header" argument with the (zero-based) number of the correct row. For example, if the proper column names are in row 2:
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this will erase the previous rows from the DataFrame. If you still want to keep them, you can do the following thing:
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[2, :] # replace columns with values in row 2
Cheers.
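As a rough illustration of the two options above, here is a sketch that uses an in-memory CSV instead of a file on disk. The CSV contents and the column names are invented for the example; the header row is assumed to be the second line (zero-based row 1).

```python
import io
import pandas as pd

# Hypothetical CSV: a junk first line, then the real column names
csv_text = "junk_a,junk_b\nname,score\nalice,1\nbob,2\n"

# Option 1: tell read_csv which line holds the header; earlier lines are dropped
df1 = pd.read_csv(io.StringIO(csv_text), header=1)

# Option 2: read everything as data, then promote a row to the header
raw = pd.read_csv(io.StringIO(csv_text), header=None)
df2 = raw.iloc[2:].reset_index(drop=True)  # keep only the rows below the header row
df2.columns = raw.iloc[1].tolist()         # use the values of row 1 as column labels

print(df1)
print(df2)
```

Note that with option 2 the values stay strings, because the column was read together with the text rows; option 1 lets pandas infer numeric types.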

How to efficiently remove junk above headers in an .xls file

I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets displays the selected parameters for the data set. (E.G 1) I am looking to automate the removal of these.
The number of rows that this takes up varies, so I am unable to blanket remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution where I convert the file to a CSV, read it as a string and remove everything before the first column name.
data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
    s = f.read() + '\n'
a=(s[s.index("Col1"):])
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works but it seems wildly inefficient:
Multiple format conversions
Reading every line in the file when the only rows being altered occur within first ~20
Dataframe ends up with column headers shifted over by one and must be re-aligned (Less concern)
With some of the files being around 20mb, merging a batch of 8 can take close to 10 minutes.
A little hacky, but an idea to speed up your process, by doing some operations directly on your dataframe. Considering you know your first column name to be Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as header
df.columns = df.iloc[column_row]
# Remove the column name (currently a useless index number)
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
If you know the number of junk rows, you can skip them using "skiprows":
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)
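The header-search idea above can be wrapped in a small helper. This is only a sketch: promote_header_row is a hypothetical name, the sample rows are invented, and it assumes the sheet was loaded with no header (for example with header=None) so that the junk rows and the real header row are all ordinary data rows.

```python
import pandas as pd


def promote_header_row(raw, first_column_name):
    """Return a copy of raw in which the first row whose first cell equals
    first_column_name becomes the header and all rows above it are dropped."""
    mask = (raw.iloc[:, 0] == first_column_name).to_numpy()
    if not mask.any():
        raise ValueError(f"no row starts with {first_column_name!r}")
    pos = int(mask.argmax())                # position of the first matching row
    out = raw.iloc[pos + 1:].copy()         # keep only the data below the header row
    out.columns = raw.iloc[pos].tolist()    # promote that row to column labels
    return out.reset_index(drop=True)


# Hypothetical stand-in for a sheet with a variable number of junk rows
raw = pd.DataFrame([
    ["Report parameters", None, None],
    ["Generated by system", None, None],
    ["Col1", "Col2", "Col3"],
    ["a", 1, 2],
    ["b", 3, 4],
])

clean = promote_header_row(raw, "Col1")
print(clean)
```

Because the search runs on the already-loaded DataFrame, no intermediate CSV conversion is needed.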

Cant drop columns with pandas if index_col = 0 is used while reading csv's [duplicate]

I have the following code which imports a CSV file. There are 3 columns and I want to set the first two of them to variables. When I set the second column to the variable "efficiency" the index column is also tacked on. How can I get rid of the index column?
df = pd.DataFrame.from_csv('Efficiency_Data.csv', header=0, parse_dates=False)
energy = df.index
efficiency = df.Efficiency
print efficiency
I tried using
del df['index']
after I set
energy = df.index
which I found in another post but that results in "KeyError: 'index' "
When writing to and reading from a CSV file, include the arguments index=False and index_col=False, respectively. An example follows:
To write:
df.to_csv(filename, index=False)
and to read from the csv
df = pd.read_csv(filename, index_col=False)
This should prevent the issue so you don't need to fix it later.
df.reset_index(drop=True, inplace=True)
DataFrames and Series always have an index. Although it displays alongside the column(s), it is not a column, which is why del df['index'] did not work.
If you want to replace the index with simple sequential numbers, use df.reset_index(). By default this keeps the old index as a new column; pass drop=True to discard it instead.
To get a sense for why the index is there and how it is used, see e.g. 10 minutes to Pandas.
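A short sketch of the point above, using an in-memory CSV. The numbers and the third column name (Error) are made up; only Efficiency comes from the question.

```python
import io
import pandas as pd

# Hypothetical three-column file similar to the one in the question
csv_text = "Energy,Efficiency,Error\n1.0,0.5,0.01\n2.0,0.7,0.02\n"

# index_col=0 turns the first column into the index rather than a column
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print("index" in df.columns)  # the index is not a column, so del df['index'] fails

# Plain arrays carry no index, if only the values are wanted
energy = df.index.to_numpy()
efficiency = df["Efficiency"].to_numpy()

# reset_index() moves the index back into an ordinary column
flat = df.reset_index()

# index=False keeps the row labels out of the written file
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
print(buffer.getvalue())
```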
You can set one of the columns as an index in case it is an "id" for example.
In this case the index column will be replaced by one of the columns you have chosen.
df.set_index('id', inplace=True)
If your problem is the same as mine, where you just want to reset the column headers to run from 0 to the number of columns, do
df = pd.DataFrame(df.values)
EDIT:
Not a good idea if you have heterogeneous data types. Better to just use
df.columns = range(len(df.columns))
You can specify which column is the index in your CSV file by using the index_col parameter of the from_csv function (from_csv has since been removed from pandas; read_csv accepts the same parameter).
If this doesn't solve your problem, please provide an example of your data.
One thing that I do is df=df.reset_index()
then df=df.drop(['index'],axis=1)
To remove, or rather not create, the index column, you can set index_col to False and keep header as 0. Here is an example of how you can do it.
recording = pd.read_excel("file.xls",
                          sheet_name="sheet1",
                          header=0,
                          index_col=False)
Setting header=0 turns the first row into the column headers, which you can use later to refer to the columns.
It works for me this way:
Df = data.set_index("name of the column header to start as index column" )

Pandas dataframe read_excel does not consider blank upper left cells as columns?

I'm trying to read an Excel or CSV file into pandas dataframe. The file will read the first two columns only, and the top row of the first two columns will be the column names. The problem is when I have the first column of the top row empty in the Excel file.
IDs
2/26/2010 2
3/31/2010 4
4/31/2010 2
5/31/2010 2
Then, the last line of the following code fails:
uploaded_file = request.FILES['file-name']
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1])
else:
    df = pd.read_excel(uploaded_file, usecols=[0,1])
ref_date = 'ref_date'
regime_tag = 'regime_tag'
df.columns = [ref_date, regime_tag]
Apparently, it only reads one column (i.e. the IDs). However, with read_csv, it reads both columns, with the first column being unnamed. I want it to behave that way and read both columns regardless of whether the top cells are empty or filled. How do I go about doing that?
What's happening is the first "column" in the Excel file is being read in as an index, while in the CSV file it's being treated as a column / series.
I recommend you work the other way and amend pd.read_csv to read the first column as an index. Then use reset_index to elevate the index to a series:
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1], index_col=0)
else:
    df = pd.read_excel(uploaded_file, header=[0,1], usecols=[0,1])
df = df.reset_index() # this will elevate index to a column called 'index'
This will give consistent output, i.e. first series will have label 'index' and the index of the dataframe will be the regular pd.RangeIndex.
You could potentially use a dispatcher to get rid of the unwieldy if / else construct:
file_flag = {True: pd.read_csv, False: pd.read_excel}
read_func = file_flag[uploaded_file.name.endswith('.csv')]
df = read_func(uploaded_file, usecols=[0,1], index_col=0).reset_index()
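The CSV half of this can be tried without an uploaded file. The sketch below uses an in-memory CSV with an empty top-left cell; it does not exercise read_excel, and the data is a shortened copy of the example from the question.

```python
import io
import pandas as pd

# CSV with an empty first header cell, as in the question
csv_text = ",IDs\n2/26/2010,2\n3/31/2010,4\n"

# Default behaviour: the blank header becomes a placeholder column name
as_column = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1])
print(list(as_column.columns))

# Approach from the answer: read the first column as the index, then
# turn it back into a regular column and assign the final names
as_index = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1], index_col=0).reset_index()
as_index.columns = ["ref_date", "regime_tag"]
print(as_index)
```

Assigning both column names at the end works here because the frame always has exactly two columns after reset_index().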

Concatenate 2 Rows to be header/column names

I have an Excel sheet that is really poorly formatted. The actual column names I would like to use are spread across two rows. For example, if the correct column name should be Labor Percent, cell A1 would contain Labor, and cell A2 would contain Percent.
I try to load the file, here's what I'm doing:
import os
os.getcwd()
os.chdir(r'xxx')
import pandas as pd
file = 'problem.xls'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df = xl.parse('WEEKLY NUMBERS', skiprows=35)
As you can see in the picture, the remainder of what should be the column name is in the second row. Is there a way to rename the columns by concatenating? Can this somehow be done with the header= argument in the xl.parse bit?
You can rename the columns yourself by setting:
df.columns = ['name1', 'name2', 'name3' ...]
Note that you must specify a name for every column.
Then drop the first row to get rid of the unwanted row of column names.
df = df.drop(0)
Here's something you can try. Essentially it reads in the first two rows as your header, but treats it as a hierarchical multi-index. The second line of code below then flattens that multi-index down to a single string. I'm not 100% certain it will work for your data but is worth a try - it worked for the small dummy test data I tried it with:
df = pd.read_excel('problem.xlsx', sheet_name='WEEKLY NUMBERS', header=[0, 1])
df.columns = df.columns.map(' '.join)
The second line was taken from this answer about flattening a multi-index.
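The flattening step can be checked on its own with a hand-built two-level header. This is only a sketch: the column names other than Labor Percent and all of the numbers are invented, and the MultiIndex is constructed directly instead of being read from the Excel file.

```python
import pandas as pd

# Hypothetical two-row header, as it would look after reading with header=[0, 1]
columns = pd.MultiIndex.from_tuples([
    ("Labor", "Percent"),
    ("Labor", "Hours"),
    ("Sales", "Total"),
])
df = pd.DataFrame([[0.25, 40, 1000], [0.30, 35, 1200]], columns=columns)

# Join the two levels of each column label into a single string
df.columns = [" ".join(str(part) for part in parts).strip() for parts in df.columns]

print(list(df.columns))
```

The str() and strip() calls are a small precaution for header cells that are not plain strings; with clean text headers, ' '.join alone as in the answer does the same thing.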
