I have an Excel file with 100 sheets. I need to extract the data from each sheet's column P, starting from row 7, and create a new file with all of the extracted data in the same column. In my output file, the data ends up in different columns (e.g. Sheet 2's data in column R, Sheet 3's in column B).
How can I get the data into a single column in the new output Excel file? Thank you.
PS: Combining all sheets' column P data into a single column on a single sheet is enough for me.
import pandas as pd
import os

Flat_Price = "Flat Pricing.xlsx"

# read column P from every sheet, skipping the first 6 rows
dfs = pd.read_excel(Flat_Price, sheet_name=None, usecols="P", skiprows=6)
df = pd.concat(dfs)
print(df)

with pd.ExcelWriter("Output.xlsx") as writer:
    df.to_excel(writer, sheet_name="Sheet1")

print(os.path.abspath("Output.xlsx"))
You need the parameter header=None so that the column gets the default name 0:
dfs = pd.read_excel(Flat_Price,
                    sheet_name=None,
                    usecols="P",
                    skiprows=6,
                    header=None)
Then it is possible to extract the sheet number from the first level of the MultiIndex, convert it to an integer, and sort by sort_index:
df = df.set_index([df.index.get_level_values(0).str.extract(r'(\d+)', expand=False).astype(int),
                   df.index.get_level_values(1)]).sort_index()
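Putting the pieces together, a minimal end-to-end sketch (assuming the sheets are named like Sheet1, Sheet2, ... so a number can be pulled out of each sheet name) could look like this:

import pandas as pd

# read column P (from row 7 down) of every sheet into a dict of DataFrames
dfs = pd.read_excel("Flat Pricing.xlsx", sheet_name=None, usecols="P",
                    skiprows=6, header=None)

# stack all sheets on top of each other; the sheet name becomes level 0 of the index
df = pd.concat(dfs)

# pull the number out of each sheet name so the sheets sort as 1, 2, ..., 100
# rather than lexically as 1, 10, 100, 11, ...
df = df.set_index([df.index.get_level_values(0).str.extract(r'(\d+)', expand=False).astype(int),
                   df.index.get_level_values(1)]).sort_index()

# all extracted values now sit in the single column named 0
df.to_excel("Output.xlsx", sheet_name="Sheet1")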
I want to concatenate multiple DataFrames with different sheet names and different columns, then export to Excel.
import numpy as np
import pandas as pd

column = [["Banana","apple"],
          ["Banana","Grape"],
          ["Apple","Pizza"]]

for i in range(3):
    random_data = np.random.randint(10, 25, size=(5, 2))
    df = pd.DataFrame(random_data, columns=column[i])
I hope to end up with three sheets, each with the different column names given.
I've tried something like pd.concat([sheet_df, df]), but in that case all the columns show up in the combined DataFrame even when a df doesn't have that column, which is not what I want.
I appreciate your help!
Use an ExcelWriter:
from pandas import ExcelWriter
...

sheets = ['Sheet1', 'Sheet2', 'Sheet3']
path = r'yourpath.xlsx'

with ExcelWriter(path, engine='openpyxl') as writer:
    for cols, sheet in zip(column, sheets):
        random_data = np.random.randint(10, 25, size=(5, 2))
        df = pd.DataFrame(random_data, columns=cols)
        df.to_excel(writer, sheet_name=sheet)
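Because the writer is used as a context manager, the workbook is saved automatically when the with block exits, so no explicit save call is needed, and each DataFrame keeps only its own columns because it goes to its own sheet rather than being concatenated. To check the result, the sheets can be read back individually; a small sketch reusing the same (hypothetical) path:

# read every sheet back into a dict of DataFrames keyed by sheet name
sheets_back = pd.read_excel(path, sheet_name=None)
for name, frame in sheets_back.items():
    print(name, list(frame.columns))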
How can I convert columns to rows in a pd.DataFrame? My current code is below; instead of having my values returned in columns, I want them displayed in rows. I have tried using iterrows:
df = pd.DataFrame(columns=cleaned_array)
output = df.to_csv(index=False, mode='a', encoding="utf-8")
print(output)
Try this:
df = pd.DataFrame(columns=cleaned_array)
df = df.T
This will interchange your rows and columns; note that .T returns a new, transposed DataFrame rather than modifying df in place.
You want to use the transpose function.
df.T or df.transpose()
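As a small illustrative sketch (the data here is made up), transposing turns each column into a row:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df)
#    a  b
# 0  1  4
# 1  2  5
# 2  3  6

print(df.T)
#    0  1  2
# a  1  2  3
# b  4  5  6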
I have a csv file whose first row contains bad data; the column labels are actually in row number 2. So when I load this file into a DataFrame the column labels are incorrect, and the correct names become the values of row 0. Is there any function similar to reset_index() but for columns? PS: I cannot change the csv file. (There was an image here for better understanding: a DataFrame with the wrong labels.)
Hello, let's suppose your csv file is data.csv.
Try this code:
import pandas as pd
#reading the csv file
df = pd.read_csv('data.csv')
#changing the headers name to integers
df.columns = range(df.shape[1])
#saving the data in another csv file
df.to_csv('data_without_header.csv',header=None,index=False)
#reading the new csv file
new_df = pd.read_csv('data_without_header.csv')
#plotting the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass the "header" argument with the (zero-based) number of the correct row; for example, if the proper column names are in row 2:
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this will erase the previous rows from the DataFrame. If you still want to keep them, you can do the following:
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[2, :] # replace columns with values in row 2
Cheers.
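Another common pattern, sketched here for this situation (assuming the junk sits in the raw file's first row and the labels in its second row; the filename is hypothetical), is to read everything as data and promote the label row yourself:

import pandas as pd

# read the whole file as data, with no row treated as a header
df = pd.read_csv('my_csv.csv', header=None)

# promote the second raw row (index 1) to column labels,
# then drop the junk row and the label row
df.columns = df.iloc[1]
df = df.drop(index=[0, 1]).reset_index(drop=True)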
I'm reading an xls file and converting it to a csv file in Databricks using PySpark.
My input data is the string 101101114501700 in the xls file, but after converting it to csv format using pandas and writing it to the data lake folder, the data shows up as 101101114501700.0. My code is given below. Please help me understand why I am getting the decimal part in the data.
import os
import time
import pandas as pd

for file in os.listdir("/path/to/file"):
    if file.endswith(".xls"):
        filepath = os.path.join("/path/to/file", file)
        filepath_pd = pd.ExcelFile(filepath)
        names = filepath_pd.sheet_names
        df = pd.concat([filepath_pd.parse(name) for name in names])
        df1 = df.to_csv("/path/to/file" + file.split('.')[0] + ".csv", sep=',', encoding='utf-8', index=False)
        print(time.strftime("%Y%m%d-%H%M%S") + ": XLS files converted to CSV and moved to folder")
I think the field is automatically parsed as float when reading the excel. I would correct it afterwards:
df['column_name'] = df['column_name'].astype(int)
If your column contains nulls you can't convert it to integer, so you will need to fill the nulls first:
df['column_name'] = df['column_name'].fillna(0).astype(int)
Then you can concatenate and store the data the way you were doing it.
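As a side note, on reasonably recent pandas versions the nullable Int64 dtype is an alternative that keeps missing values instead of filling them with 0; a small sketch, assuming the values really are whole numbers:

# nullable integer dtype: missing values stay as <NA> instead of being filled
df['column_name'] = df['column_name'].astype('Int64')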
Your question has nothing to do with Spark or PySpark. It's related to Pandas.
This is because Pandas interprets and infers each column's data type automatically. Since all the values of your column are numeric, Pandas will treat it as a float data type.
To avoid this, the pandas.ExcelFile.parse method accepts an argument called converters; you can use it to tell Pandas the data type of specific columns:
# if you want one specific column as string
df = pd.concat([filepath_pd.parse(name, converters={'column_name': str}) for name in names])
OR
# if you want all columns as string
# and you have multiple sheets that do not all have the same columns
# this merges all sheets into one dataframe
def get_converters(excel_file, sheet_name, dt_cols):
    cols = excel_file.parse(sheet_name).columns
    converters = {col: str for col in cols if col not in dt_cols}
    for col in dt_cols:
        converters[col] = pd.to_datetime
    return converters

df = pd.concat([filepath_pd.parse(name, converters=get_converters(filepath_pd, name, ['date_column'])) for name in names]).reset_index(drop=True)
OR
# if you want all columns as string
# and all your sheets have the same columns
cols = filepath_pd.parse().columns
dt_cols = ['date_column']
converters = {col: str for col in cols if col not in dt_cols}
for col in dt_cols:
    converters[col] = pd.to_datetime

df = pd.concat([filepath_pd.parse(name, converters=converters) for name in names]).reset_index(drop=True)
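To confirm the converters took effect before writing the csv, the dtypes can be inspected; a small sketch reusing the hypothetical column_name and an assumed output path:

print(df.dtypes)                 # converted columns should show as object (strings)
print(df['column_name'].head())  # values like '101101114501700', without the trailing .0
df.to_csv("output.csv", sep=',', encoding='utf-8', index=False)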
I wonder how to save a new pandas Series into a csv file as a separate column. Suppose I have two csv files which both contain a column 'A'. I apply some mathematical function to it and then create a new variable 'B'.
For example:
data = pd.read_csv('filepath')
data['B'] = data['A']*10
# then append data.B to a list: B_list.append(data.B)
This continues until all of the rows of the first and second csv files have been read.
I would like to save column B from both csv files into a new spreadsheet.
For example, I need this result:
column1 (from csv1)    column2 (from csv2)
data.B value           data.B value
By using this code:
pd.DataFrame(np.array(B_list)).T.to_csv('file.csv', index=False, header=None)
I won't get my preferred result.
Each column in a pandas DataFrame is a pandas Series, so your B_list is actually a list of pandas Series, which you can pass to the DataFrame() constructor and then transpose (or, as @jezrael shows, merge horizontally with pd.concat(..., axis=1)).
finaldf = pd.DataFrame(B_list).T
finaldf.to_csv('output.csv', index=False, header=None)
And should the csv files have different numbers of rows, the shorter Series are padded with NaNs in the corresponding rows.
I think you need to concat the column from data1 with the column from data2 first:
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=None)
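For completeness, a minimal end-to-end sketch of this approach, assuming two hypothetical files file1.csv and file2.csv that both contain a numeric column A:

import pandas as pd

B_list = []
for path in ['file1.csv', 'file2.csv']:
    data = pd.read_csv(path)
    data['B'] = data['A'] * 10   # the mathematical function from the question
    B_list.append(data['B'])

# place the two B columns side by side; rows missing in the shorter file become NaN
out = pd.concat(B_list, axis=1)
out.to_csv('file.csv', index=False, header=None)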