How to remove decimal point from string using pandas - python

I'm reading an xls file and converting it to a csv file in Databricks using PySpark.
My input data is the string 101101114501700 in the xls file, but after converting it to CSV format using pandas and writing it to the datalake folder, the data shows up as 101101114501700.0. My code is given below. Please help me understand why I am getting the decimal part in the data.
for file in os.listdir("/path/to/file"):
    if file.endswith(".xls"):
        filepath = os.path.join("/path/to/file", file)
        filepath_pd = pd.ExcelFile(filepath)
        names = filepath_pd.sheet_names
        df = pd.concat([filepath_pd.parse(name) for name in names])
        df1 = df.to_csv("/path/to/file" + file.split('.')[0] + ".csv", sep=',', encoding='utf-8', index=False)
        print(time.strftime("%Y%m%d-%H%M%S") + ": XLS files converted to CSV and moved to folder")

I think the field is automatically parsed as float when reading the Excel file. I would correct it afterwards:
df['column_name'] = df['column_name'].astype(int)
If your column contains nulls you can't convert it to integer, so you will need to fill the nulls first:
df['column_name'] = df['column_name'].fillna(0).astype(int)
Then you can concatenate and store the file the way you were doing it.
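For illustration, a minimal sketch of that fix on a toy frame, using a hypothetical column name id; pandas' nullable Int64 dtype is an alternative if you need to keep the nulls:
import pandas as pd

# toy frame standing in for the parsed Excel data; 'id' is a hypothetical column name
df = pd.DataFrame({'id': [101101114501700.0, None]})

# fill nulls, then cast to a plain integer dtype
df['id'] = df['id'].fillna(0).astype('int64')
print(df['id'].tolist())  # [101101114501700, 0]

# alternative to fillna: the nullable integer dtype keeps nulls as <NA>
# df['id'] = df['id'].astype('Int64')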

Your question has nothing to do with Spark or PySpark. It's related to Pandas.
This is because Pandas interprets and infers columns' data types automatically. Since all the values in your column are numeric, Pandas considers it a float column.
To avoid this, the pandas.ExcelFile.parse method accepts an argument called converters; you can use it to tell Pandas a specific column's data type:
# if you want one specific column as string
df = pd.concat([filepath_pd.parse(name, converters={'column_name': str}) for name in names])
OR
# if you want all columns as string
# and you have multiple sheets that do not have the same columns
# this merges all sheets into one dataframe
def get_converters(excel_file, sheet_name, dt_cols):
    cols = excel_file.parse(sheet_name).columns
    converters = {col: str for col in cols if col not in dt_cols}
    for col in dt_cols:
        converters[col] = pd.to_datetime
    return converters

df = pd.concat([filepath_pd.parse(name, converters=get_converters(filepath_pd, name, ['date_column'])) for name in names]).reset_index(drop=True)
OR
# if you want all columns as string
# and all your sheets have the same columns
cols = filepath_pd.parse().columns
dt_cols = ['date_column']
converters = {col: str for col in cols if col not in dt_cols}
for col in dt_cols:
    converters[col] = pd.to_datetime
df = pd.concat([filepath_pd.parse(name, converters=converters) for name in names]).reset_index(drop=True)

Related

How can I clean text in an Excel file with Python?

I have an Excel file with numbers (integers) in some rows of the first column (A) and text in all rows of the second column (B):
I want to clean this text, that is, I want to remove tags like <br>. My current approach doesn't seem to work:
file_name = "F:\Project\comments_all_sorted.xlsx"
import pandas as pd
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.replace('<br>', '')
clean_df.to_excel('output.xlsx')
What this code does (which I don't want it to do) is add running numbers in the first column (A), replacing the few numbers that were already there, and add a first row with '1' in its second column (cell B1):
I'm sure there's an easy way to solve my problem and I'm just not trained enough to see it.
Thanks!
Try this:
df['column_name'] = df['column_name'].str.replace(r'<br>', '')
The index in the output file can be turned off with index=False in the df.to_excel function, i.e.:
clean_df.to_excel('output.xlsx', index=False)
As far as I'm aware, you can't use .str.replace on an entire dataframe, so you need to explicitly call out the column. In this case, I just iterate through all columns in case there is more than just the one. (Plain df.replace without regex=True only matches whole cell values, which is why your substring replacement did nothing.)
To get rid of the first column with the sequential numbers (that's the index of the dataframe), add the parameter index=False. The number 1 at the top is the column name; to get rid of that, use header=False:
import pandas as pd

file_name = r"F:\Project\comments_all_sorted.xlsx"
# no header, no row-label column; use only column B (which contains the text)
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B')
clean_df = df.copy()
for col in clean_df.columns:
    clean_df[col] = df[col].str.replace('<br>', '')
clean_df.to_excel('output.xlsx', index=False, header=False)
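If you'd rather not loop over columns, a one-line alternative (a sketch, assuming the columns hold text): DataFrame.replace does substring replacement across the whole frame once you pass regex=True:
# regex=True makes .replace match substrings inside cells instead of
# only replacing cells whose entire value equals '<br>'
clean_df = df.replace('<br>', '', regex=True)
clean_df.to_excel('output.xlsx', index=False, header=False)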

Trying to convert a column of integers into date, using parse function in pandas

Really new to Python, so this might be a really stupid question. I am trying to read in a dataframe and convert a column of integers into dates. The numbers look like this: 200205, 200206, ... in the format YYYYMM.
parse_dates works when a similar column contains just integers for years (2005, 2006, ...), but when the data introduces months, it runs into issues.
This is what I have tried so far:
# Read in returns data and clean up some columns, drop duplicates
returns_path = path + "DE_data_long.csv"
returns_df = pd.read_csv(returns_path, parse_dates=['mdate'], sep=';')
For reference, this piece of code did do what I wanted:
ratio_path = path + "DE_data_annual_long.csv"
ratios_df = pd.read_csv(ratio_path, parse_dates=['fyear'], sep=';')
ratios_df['year'] = ratios_df['fyear'].dt.year
ratios_df['month'] = ratios_df['fyear'].dt.month
ratios_df.drop_duplicates(subset=['ISIN','month','year'], keep='first', inplace=True)
You can use the date_parser arg of the read_csv method to parse your date format. Note that date_parser only applies to the columns listed in parse_dates, and pd.datetime has been removed from recent pandas versions, so use the datetime module directly:
from datetime import datetime

parser = lambda date: datetime.strptime(date, '%Y%m')
returns_df = pd.read_csv(returns_path, parse_dates=['mdate'], date_parser=parser, sep=';')
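Alternatively, a small sketch of the approach newer pandas prefers (date_parser is deprecated as of pandas 2.0): read the column as-is and parse it afterwards with pd.to_datetime and an explicit format:
import pandas as pd

# toy data standing in for the 'mdate' column of DE_data_long.csv
returns_df = pd.DataFrame({'mdate': [200205, 200206, 200207]})

# cast to string so the '%Y%m' format applies, then parse explicitly
returns_df['mdate'] = pd.to_datetime(returns_df['mdate'].astype(str), format='%Y%m')
print(returns_df['mdate'].dt.month.tolist())  # [5, 6, 7]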

Pandas dataframe read_excel does not consider blank upper left cells as columns?

I'm trying to read an Excel or CSV file into a pandas dataframe. The code reads only the first two columns, and the top row of those two columns supplies the column names. The problem arises when the first cell of the top row is empty in the Excel file:
            IDs
2/26/2010   2
3/31/2010   4
4/31/2010   2
5/31/2010   2
Then, the last line of the following code fails:
uploaded_file = request.FILES['file-name']
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0, 1])
else:
    df = pd.read_excel(uploaded_file, usecols=[0, 1])

ref_date = 'ref_date'
regime_tag = 'regime_tag'
df.columns = [ref_date, regime_tag]
Apparently, it only reads one column (the IDs). With read_csv, however, it reads both columns, with the first column being unnamed. I want read_excel to behave that way and read both columns regardless of whether the top-left cell is empty or filled. How do I go about doing that?
What's happening is that the first "column" in the Excel file is being read in as an index, while in the CSV file it's treated as a column / series.
I recommend you work the other way and amend pd.read_csv to read the first column as an index. Then use reset_index to elevate the index to a series:
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0, 1], index_col=0)
else:
    df = pd.read_excel(uploaded_file, usecols=[0, 1], index_col=0)

df = df.reset_index()  # this will elevate the index to a column called 'index'
This gives consistent output: the first series will have the label 'index' and the index of the dataframe will be the regular pd.RangeIndex.
You could potentially use a dispatcher to get rid of the unwieldy if / else construct:
file_flag = {True: pd.read_csv, False: pd.read_excel}
read_func = file_flag[uploaded_file.name.endswith('.csv')]
df = read_func(uploaded_file, usecols=[0,1], index_col=0).reset_index()
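After the reset_index, you can assign the labels from your question in one step, so the line that used to fail now works for both file types:
# rename the elevated index and the IDs column to the desired labels
df.columns = ['ref_date', 'regime_tag']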

Python pandas excel output not in decided form

I have an Excel file with 100 sheets. I need to extract data from each sheet's column P, beginning at row 7, and create a new file with all the extracted data in the same column. In my output file, the data lands in different columns, e.g. Sheet 2's data in column R, Sheet 3's in column B.
How can I make the data land in the same column in the new output Excel file? Thank you.
P.S. Combining all sheets' column P data into a single column in a single sheet is enough for me.
import pandas as pd
import os

Flat_Price = "Flat Pricing.xlsx"
dfs = pd.read_excel(Flat_Price, sheet_name=None, usecols="P", skiprows=6)
df = pd.concat(dfs)
print(df)

with pd.ExcelWriter("Output.xlsx") as writer:
    df.to_excel(writer, sheet_name="Sheet1")
print(os.path.abspath("Output.xlsx"))
You need the parameter header=None so every sheet gets the same default column name 0; without it, each sheet's first data row becomes that sheet's column name, and concat places the differently-named columns side by side:
dfs = pd.read_excel(Flat_Price,
                    sheet_name=None,
                    usecols="P",
                    skiprows=6,
                    header=None)
Then you can extract the sheet number from the first level of the MultiIndex, convert it to integer, and sort with sort_index:
df = df.set_index([df.index.get_level_values(0).str.extract(r'(\d+)', expand=False).astype(int),
                   df.index.get_level_values(1)]).sort_index()
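A quick illustration of why the numeric extraction is needed, assuming hypothetical sheet names like 'Sheet1', 'Sheet2', ...:
# plain string sorting is lexicographic, so 'Sheet10' lands before 'Sheet2'
print(sorted(['Sheet1', 'Sheet2', 'Sheet10']))  # ['Sheet1', 'Sheet10', 'Sheet2']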

How to import all fields from xls as strings into a Pandas dataframe?

I am trying to import a file from xlsx into a Python pandas dataframe. I would like to prevent fields/columns from being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting.
So for an Excel sheet with 100 columns, I would do the following, using a dict comprehension with range(100):
import pandas as pd

filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(100)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files have a varying number of columns every time, and I am looking for a way to handle this without changing the range manually each time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this (note that xl.book exposes the underlying xlrd Book, so it relies on the xlrd engine used for .xls files; openpyxl's book object has a different API):
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i: str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel(); it forces every column to be read as strings:
import pandas as pd

filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution is:
1. read in one row of data just to get the column names and the number of columns
2. create the dictionary automatically, mapping each column to the string type
3. re-read the full data using the dictionary created at step 2
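A minimal sketch of those three steps, assuming the path from the question and a pandas version whose read_excel supports nrows:
import pandas as pd

filename = r'C:\DemoFile.xlsx'

# step 1: read only the header row to discover the columns
header_df = pd.read_excel(filename, nrows=0)

# step 2: map every discovered column name to str
converters = {col: str for col in header_df.columns}

# step 3: re-read the full file with those converters
df = pd.read_excel(filename, converters=converters)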
