Hello, I have several xlsx files and merged them into one DataFrame using pandas. It worked, but instead of getting the column names I had in the xlsx files, I got numbers as column labels and the column titles became a row, like this:
Output: 1 2 3
COLTITLE1 COLTITLE2 COLTITLE3
When they should be like this:
Output: COLTITLE1 COLTITLE2 COLTITLE3
The column titles have become a row of data instead of column names. How can I get back the column names I had in the xlsx files? For clarity, the column names are identical in both xlsx files. Help would be appreciated; here's my code:
# import modules
from IPython.display import display
import pandas as pd
import numpy as np
pd.set_option("display.max_rows", 999)
pd.set_option('max_colwidth',100)
%matplotlib inline
# filenames
file_names = ["data/OrderReport.xlsx", "data/OrderReport2.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in file_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# concatenate them
atlantic_data = pd.concat(frames)
# write it out
atlantic_data.to_excel("c.xlsx", header=False, index=False)
I hope I understood your question correctly. The cause is header=None, not index_col=None (which is the default and changes nothing). Drop header=None and pandas will use the first row of each sheet as the column names:
frames = [x.parse(x.sheet_names[0]) for x in excels]
With header=None pandas assumes the sheet has no header row, so it numbers the columns itself and treats your column titles as the first row of data.
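To see the effect of the header argument in isolation, here is a minimal sketch. It uses an in-memory CSV and pd.read_csv as a stand-in so it runs without any Excel file; the header argument of ExcelFile.parse and pd.read_excel is interpreted the same way. The sample data is made up for illustration.

```python
import io
import pandas as pd

# Made-up sample data: one header line followed by two data rows.
raw = "COLTITLE1,COLTITLE2,COLTITLE3\n1,2,3\n4,5,6\n"

# header=None: pandas numbers the columns and keeps the titles as a data row.
no_header = pd.read_csv(io.StringIO(raw), header=None)

# Default header handling: the first line becomes the column names.
with_header = pd.read_csv(io.StringIO(raw))

print(no_header.columns.tolist())
print(with_header.columns.tolist())
```

Note that to_excel(..., header=False) also leaves the column names out of the written file, so you may want to drop that argument as well if the output should keep them.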
Related
I need to merge different Excel sheets into one and also add a new column holding the corresponding sheet name.
The code below merges all sheets, but how do I add the sheet name as a column?
import pandas as pd
df = pd.concat(pd.read_excel(r"C:\\Users\\xx\\FC_List.xlsx", sheet_name=None), ignore_index=True)
print(df)
df.to_csv(r"C:\\Users\\Users\\FC_List.csv", index=False)
The code below fetches the sheet names:
import pandas as pd
df = pd.read_excel(r"C:\\Users\\cc\\FC_List.xlsx", sheet_name=None)
df.keys()
Can you advise how to combine the two, so that the sheet name is added as a new column?
Split it into steps.
import pandas as pd
dfs = pd.read_excel(r"C:\\Users\\xx\\FC_List.xlsx", sheet_name=None)
df = pd.concat(dfs,keys=dfs.keys())
This puts the sheet name in the outer level of the index; you can then reset the index and rename the resulting column.
You could also do something like this:
df = pd.concat([sheet.assign(src_sheet=sheet_name) for sheet_name,sheet in dfs.items()])
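The "reset it and rename it" step can be sketched roughly as follows. The dict below is a made-up stand-in for what pd.read_excel(..., sheet_name=None) returns (sheet name mapped to DataFrame), and the column name src_sheet is just an illustrative choice.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): sheet name -> DataFrame.
dfs = {
    "Sheet1": pd.DataFrame({"a": [1, 2]}),
    "Sheet2": pd.DataFrame({"a": [3]}),
}

# Concatenating a dict uses its keys as the outer index level;
# names= labels the two index levels so they are easy to refer to.
df = pd.concat(dfs, names=["src_sheet", "row"])

# Move the sheet name from the index into a regular column,
# then discard the leftover per-sheet row numbers.
df = df.reset_index(level="src_sheet").reset_index(drop=True)

print(df)
```

The assign-based one-liner gives the same information in one pass; the index-based route is handy if you want to keep the sheet name as an index level for a while before flattening.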
I have a CSV file whose first row contains the wrong data. The label names are in row 2, so when I load the file into a DataFrame the column labels are incorrect and the correct names end up as the values of row 0. Is there a function similar to reset_index() but for columns? PS: I cannot change the CSV file. Here is an image for better understanding: DataFrame with wrong labels
Hello, let's suppose your CSV file is data.csv.
Try this code:
import pandas as pd
#reading the csv file
df = pd.read_csv('data.csv')
#changing the headers name to integers
df.columns = range(df.shape[1])
#saving the data in another csv file
df.to_csv('data_without_header.csv',header=None,index=False)
#reading the new csv file
new_df = pd.read_csv('data_without_header.csv')
#previewing the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass the header argument with the zero-based number of the row that holds them. For example, if the proper column names are in row 2 (the third line of the file):
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this drops the preceding rows from the DataFrame. If you want to keep them, read the file without a header and then promote that row to the column labels:
df = pd.read_csv('my_csv.csv', header=None)
df.columns = df.iloc[2, :] # replace columns with values in row 2
Cheers.
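For the case in the question (labels on the second line of the file), both routes can be sketched like this. The CSV content is invented for illustration and read from memory so the snippet is self-contained.

```python
import io
import pandas as pd

# Made-up file: a junk first line, the real labels on the second line.
raw = "junk1,junk2,junk3\nname,age,city\nAnn,30,Oslo\nBob,25,Rome\n"

# Option 1: point the parser at the real header (zero-based line number).
df1 = pd.read_csv(io.StringIO(raw), header=1)

# Option 2: repair a frame that was already loaded with the wrong labels.
df2 = pd.read_csv(io.StringIO(raw))
df2.columns = df2.iloc[0]                   # promote row 0 to column labels
df2 = df2.iloc[1:].reset_index(drop=True)   # drop that row from the data
df2.columns.name = None                     # clear the leftover row label

print(df1.dtypes)
print(df2.dtypes)
```

Option 1 is usually preferable because the parser infers proper dtypes; with option 2 the columns stay as strings (object dtype), since the label row was parsed as data, and may need converting afterwards.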
I have been given 5 CSV files, and I want to combine all the data from these files into one single table.
I have tried pd.concat and .join from pandas, but I can only get two files combined. So far I've tried the following:
data = pd.read_csv('data.csv')
data1 = pd.read_csv('data2.csv')
merge = data.join(data1, lsuffix='_NOM', rsuffix='_NIM')
In the end, I want all the data side by side in one table. Sample file: data.csv
You can loop through the directory that contains the .csv files. For example:
import glob
import pandas as pd
frames = []  # one DataFrame per file
for filename in glob.glob('./<path to your data files>/*.csv'):
    frames.append(pd.read_csv(filename))
# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
df = pd.concat(frames)
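Since the question asks for the data side by side, here is a rough sketch of both layouts. It writes three small made-up CSV files to a temporary directory so it can run as-is; the shared id column and the value_N column names are assumptions for illustration only.

```python
import glob
import os
import tempfile
import pandas as pd

# Create a few made-up CSV files that share an "id" column.
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({"id": [1, 2], f"value_{i}": [i, i + 10]}).to_csv(
        os.path.join(tmp_dir, f"data{i}.csv"), index=False
    )

paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
frames = [pd.read_csv(p) for p in paths]

# Rows stacked on top of each other (columns missing in a file become NaN).
stacked = pd.concat(frames, ignore_index=True)

# Columns side by side, aligned on the shared "id" column.
side_by_side = pd.concat([f.set_index("id") for f in frames], axis=1)

print(side_by_side)
```

If the files have no common key column, pd.concat(frames, axis=1) without set_index aligns on row position instead.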
I have more than 20 xlsx files, and each file may contain a different number of worksheets. Fortunately, the columns are the same in all worksheets and all xlsx files. A linked post gave me some ideas, and I have been trying a few ways to import and append all Excel files (all worksheets) into a single DataFrame (around 4 million rows).
Note: I did check another linked post as well, but it only covers the file level; my case goes from the file level down to the worksheet level.
I have tried the code below:
# import all necessary packages
import pandas as pd
from pathlib import Path
import glob
import sys
# set source path
source_dataset_path = "C:/Users/aaa/Desktop/Sample_dataset/"
source_dataset_list = glob.iglob(source_dataset_path + "Sales transaction *")
out_df = pd.DataFrame() ## create empty output dataframe once, before the loops
for file in source_dataset_list:
    #xls = pd.ExcelFile(source_dataset_list[i])
    sys.stdout.write(str(file))
    sys.stdout.flush()
    xls = pd.ExcelFile(file)
    for sheet in xls.sheet_names:
        sys.stdout.write(str(sheet))
        sys.stdout.flush() ## view the excel file's sheet names
        #df = pd.read_excel(source_dataset_list[i], sheet_name=sheet)
        df = pd.read_excel(file, sheet_name=sheet)
        out_df = pd.concat([out_df, df]) ## append the rows of this sheet to the output
Question:
My approach is to first read every Excel file and get the list of sheets inside it, then load each sheet and append them all. The looping does not seem very efficient, especially as the data size grows with every append.
Is there a more efficient way to import and append all sheets from multiple Excel files?
Use sheet_name=None in read_excel to get back a dict of DataFrames, one per sheet name, then join them with concat and finally add the result to the final DataFrame:
out_df = pd.DataFrame()
for f in source_dataset_list:
    df = pd.read_excel(f, sheet_name=None)
    cdf = pd.concat(df.values())
    # DataFrame.append was removed in pandas 2.0, so concat is used here too
    out_df = pd.concat([out_df, cdf], ignore_index=True)
Another solution:
cdf = [pd.read_excel(excel_names, sheet_name=None).values()
       for excel_names in source_dataset_list]
out_df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
If I understand your problem correctly, setting sheet_name=None in pd.read_excel does the trick.
import os
import pandas as pd
path = "C:/Users/aaa/Desktop/Sample_dataset/"
dfs = [
    pd.concat(pd.read_excel(path + x, sheet_name=None))
    for x in os.listdir(path)
    if x.endswith(".xlsx") or x.endswith(".xls")
]
df = pd.concat(dfs)
I have a pretty straightforward solution if you want to read all the sheets of a single file.
import pandas as pd
df = pd.concat(pd.read_excel(path+file_name, sheet_name=None),
               ignore_index=True)
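Putting the answers above together, one possible shape is to collect every sheet of every file in a list and call concat once at the end, which avoids re-copying the growing DataFrame on each iteration. This is only a sketch: load_all_sheets, the reader parameter, and the src_file / src_sheet column names are hypothetical names introduced here, and the fake reader exists only so the example runs without real workbooks.

```python
import pandas as pd

def load_all_sheets(paths, reader=pd.read_excel):
    """Read every sheet of every workbook and stack them into one DataFrame.

    reader is injectable (hypothetical convenience) so the logic can be
    exercised without real files; by default it is pd.read_excel.
    """
    frames = []
    for path in paths:
        # sheet_name=None returns a dict: sheet name -> DataFrame
        sheets = reader(path, sheet_name=None)
        for sheet_name, sheet in sheets.items():
            # Record where each row came from.
            frames.append(sheet.assign(src_file=path, src_sheet=sheet_name))
    # One concat at the end; raises ValueError if no sheets were found.
    return pd.concat(frames, ignore_index=True)

# Demo with a fake reader standing in for pd.read_excel.
def fake_reader(path, sheet_name=None):
    return {"S1": pd.DataFrame({"x": [1]}), "S2": pd.DataFrame({"x": [2, 3]})}

out = load_all_sheets(["a.xlsx", "b.xlsx"], reader=fake_reader)
print(out)
```

With real files you would call it as load_all_sheets(glob.glob(source_dataset_path + "Sales transaction *")), assuming an Excel engine such as openpyxl is installed.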
I combined several excel worksheets into a new workbook using pandas that looks like the following:
Example Excel Workbook
I am now trying to clean up the workbook/DataFrame using Python (for practice) by creating a new column whose value equals the table name, which is listed in col[0] in the row above 'Name'. I know how to do it in Excel, but I am trying to learn how to transform the data using Python. There are currently 7051 rows in the dataset, if that helps.
The final outcome would look something like this:
Example Solution
Please let me know if you have any ideas on how to further clean it up using python. I have the excel solution but am really hoping to learn how to do it with python.
Example of code used to combine worksheets:
import pandas as pd
import numpy as np
import os, collections, csv
from os.path import basename
df = []
f = 'ex_DATA.xlsx'
numberOfSheets = 22 #Modify this.
for i in range(1,numberOfSheets+1):
    # the keyword is sheet_name; the old sheetname spelling is no longer accepted
    data = pd.read_excel(f, sheet_name = 'TAB_'+str(i), header=None)
    df.append(data)
final = "ex_DATA2.xlsx" #Path to the file in which new sheet will be saved.
df = pd.concat(df)
df = df.dropna(axis=0, how='all')
df.to_excel(final, header=False, index=False)
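One possible way to build the table-name column is to flag the rows that sit directly above a 'Name' row and forward-fill their value. This is a sketch under assumptions: the screenshots are not available, so the layout below (a title row in column 0, then a header row starting with 'Name', then data rows) and the sample values are guesses, and the column names Value and table are made up.

```python
import pandas as pd

# Assumed layout of the combined sheet (read with header=None).
df = pd.DataFrame({
    0: ["Table A", "Name", "x", "y", "Table B", "Name", "z"],
    1: [None, "Value", 1, 2, None, "Value", 3],
})

# Header rows are the ones whose first cell is 'Name';
# title rows are the rows immediately above them.
is_header = df[0] == "Name"
is_title = is_header.shift(-1, fill_value=False)

# Keep the title text only on title rows, then fill it downwards.
df["table"] = df[0].where(is_title).ffill()

# Drop the title and header rows, leaving only data rows.
clean = df[~(is_title | is_header)].reset_index(drop=True)
clean.columns = ["Name", "Value", "table"]

print(clean)
```

If the real sheets have blank separator rows or a different header keyword, the two boolean masks would need adjusting accordingly.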