Create new dataframe column equal to the value above row = 'Name' - python

I combined several excel worksheets into a new workbook using pandas that looks like the following:
Example Excel Workbook
I am now trying to clean up the workbook/dataframe using python (for practice) by creating a new column equal to the table name, which is listed in col[0] in the row above 'Name'. I know how to do it in excel, but am trying to learn how to transform the data using python. There are currently 7051 rows in the dataset, if that helps.
The final outcome would look something like this:
Example Solution
Please let me know if you have any ideas on how to further clean it up using python. I have the excel solution but am really hoping to learn how to do it with python.
Example of code used to combine worksheets:
import pandas as pd
import numpy as np
import os, collections, csv
from os.path import basename
df = []
f = 'ex_DATA.xlsx'
numberOfSheets = 22 #Modify this.
for i in range(1,numberOfSheets+1):
    # sheet_name is the current keyword; older pandas versions spelled it sheetname
    data = pd.read_excel(f, sheet_name='TAB_'+str(i), header=None)
    df.append(data)
final = "ex_DATA2.xlsx" #Path to the file in which new sheet will be saved.
df = pd.concat(df)
df = df.dropna(axis=0, how='all')
df.to_excel(final, header=None, index=None)
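One possible way to build the new column is sketched below. It assumes the combined frame was read with header=None (so the first column is labelled 0), that each table's name sits in column 0 in the row directly above its 'Name' row, and that the title rows can be dropped afterwards. The function name add_table_name, the column name "Table" and the sample rows are made up for illustration; they are not taken from the workbook in the question.

```python
import pandas as pd

def add_table_name(df, marker="Name", new_col="Table"):
    """Label every row with the table title found just above each marker row."""
    # Work on a copy with a clean 0..n-1 index (pd.concat leaves duplicate labels).
    out = df.reset_index(drop=True)
    # True on the rows whose first column holds the marker, e.g. 'Name'.
    is_header = out[0].eq(marker)
    # The title row is the row directly above a marker row.
    is_title = is_header.shift(-1, fill_value=False).astype(bool)
    # Keep the title text on title rows only, then carry it down to the next title.
    out[new_col] = out[0].where(is_title).ffill()
    # Drop the title rows themselves; their text now lives in the new column.
    return out[~is_title].reset_index(drop=True)

# Hypothetical sample shaped like the combined workbook described above.
raw = pd.DataFrame([
    ["Table A", None],
    ["Name", "Value"],
    ["x", 1],
    ["y", 2],
    ["Table B", None],
    ["Name", "Value"],
    ["z", 3],
])
tidy = add_table_name(raw)
print(tidy)
```

The repeated 'Name' header rows are kept here; if they are not wanted, they could be filtered out in the same way with tidy[tidy[0] != "Name"].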

Related

How to read excel data only after a string is found but without using skiprows

I want to read the data after the string "Executed Trade". I want to do that dynamically, without using "skiprows". I know openpyxl can be an option, but I am still struggling to do so. Could you please help me with this? I have many files like the one shown in the image.
Try:
import pandas as pd
#change the Excel filename and the two mentions of 'col1' for whatever the column is
df = pd.read_excel('dictatorem.xlsx')
df = df.iloc[df.col1[df.col1 == 'Executed Trades'].index.tolist()[0]+1:]
df.columns = df.iloc[0]
df = df[1:]
df = df.reset_index(drop=True)
print(df)
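If the column holding the marker is not known in advance, a variant along these lines might help. It is only a sketch: it assumes the sheet was read with header=None, that the marker text fills a whole cell, and that the row right after the marker holds the column titles. The helper name rows_after_marker and the sample data are invented for illustration.

```python
import pandas as pd

def rows_after_marker(raw, marker):
    """Return the table that starts right after the first row containing marker."""
    # True for every row in which at least one cell equals the marker.
    hits = raw.eq(marker).any(axis=1)
    if not hits.any():
        raise ValueError("marker %r not found" % marker)
    # Position (not label) of the first matching row.
    start = int(hits.to_numpy().argmax())
    body = raw.iloc[start + 1:]
    # The first row after the marker supplies the column names.
    table = body.iloc[1:].copy()
    table.columns = body.iloc[0].tolist()
    return table.reset_index(drop=True)

# Hypothetical sheet content, as pd.read_excel(..., header=None) might return it.
raw = pd.DataFrame([
    ["some report text", None],
    ["Executed Trades", None],
    ["id", "qty"],
    [1, 10],
    [2, 20],
])
trades = rows_after_marker(raw, "Executed Trades")
print(trades)
```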
Example input/output:

How to concatenate multiple csv files into a single csv file using a column as index using python

I have to merge several csv files, each containing features about a place keyed by place_id, into one file so that I can create a model to predict a rating for a particular place.
I have already tried using pandas.concat and merging the files through the Linux terminal, but I just get null values for all the other features, as the place_id keeps on repeating.
#importing libraries
import pandas as pd
import numpy as np
import glob
#creating a single dataframe
fileList = glob.glob('chef*.csv')
fileList.append('rating_final.csv')
dfList = []
for file in fileList:
    print(file)
    df = pd.read_csv(file)
    dfList.append(df)
concatDf = pd.concat(dfList, axis=0)
I expect to get a csv file with different features according to a single place_id but what I get is a csv file in which place_id keeps on repeating with a single feature only.
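Try merging on place_id instead of stacking the files row-wise:
import glob
from functools import reduce
import pandas as pd
# read_csv does not expand wildcards, so list the matching files with glob first
frames = [pd.read_csv(f) for f in glob.glob('chef*.csv')]
frames.append(pd.read_csv('rating_final.csv'))
# join the frames column-wise on the shared key (assumes every file has a place_id column)
merged = reduce(lambda left, right: left.merge(right, on='place_id', how='outer'), frames)
merged.to_csv('merged.csv', index=False)
print(merged)
The merged output will be available in merged.csv (an example output name; pick any path you like).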
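To see why stacking the files row-wise produces nulls while a key-based join does not, here is a tiny in-memory comparison. The two frames and their column names (cuisine, parking) are hypothetical stand-ins for the real feature files; only place_id comes from the question.

```python
import pandas as pd

# Hypothetical stand-ins for two feature files that share the place_id key.
cuisine = pd.DataFrame({"place_id": [1, 2], "cuisine": ["thai", "pizza"]})
parking = pd.DataFrame({"place_id": [1, 2], "parking": ["yes", "no"]})

# Row-wise concat stacks the files: each place_id appears once per file,
# and the columns a file does not have are filled with NaN.
stacked = pd.concat([cuisine, parking], axis=0, ignore_index=True)

# A merge on the key puts all features for a place on a single row.
merged = cuisine.merge(parking, on="place_id", how="outer")

print(stacked)
print(merged)
```

With how="outer", places missing from one file are still kept, with NaN only for the features that file would have supplied.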

python efficient way to append all worksheets in multiple excel into pandas dataframe

I have more than 20 xlsx files, and each file may contain a different number of worksheets. Fortunately, the columns are the same in all worksheets and all xlsx files. By referring to another question, I got some ideas. I have been trying a few ways to import and append all excel files (all worksheets) into a single dataframe (around 4 million rows of records).
Note: I did check another related question as well, but it only covers the file level; mine consists of files and goes down to the worksheet level.
I have tried below code
# import all necessary package
import pandas as pd
from pathlib import Path
import glob
import sys
# set source path
source_dataset_path = "C:/Users/aaa/Desktop/Sample_dataset/"
source_dataset_list = glob.iglob(source_dataset_path + "Sales transaction *")
for file in source_dataset_list:
    #xls = pd.ExcelFile(source_dataset_list[i])
    sys.stdout.write(str(file))
    sys.stdout.flush()
    xls = pd.ExcelFile(file)
    out_df = pd.DataFrame() ## create empty output dataframe
    for sheet in xls.sheet_names:
        sys.stdout.write(str(sheet))
        sys.stdout.flush() ## # View the excel files sheet names
        #df = pd.read_excel(source_dataset_list[i], sheet_name=sheet)
        df = pd.read_excel(file, sheet_name=sheet)
        out_df = out_df.append(df) ## This will append rows of one dataframe to another(just like your expected output)
Question:
My approach is to first read every single excel file and get the list of sheets inside it, then load the sheets and append them all. The looping does not seem very efficient, especially as the data size grows with every append.
Is there any other efficient way to import and append all sheets from multiple excel files?
Use sheet_name=None in read_excel to return a dict of DataFrames keyed by sheet name, join each file's sheets with concat, and then concat the per-file results once at the end (DataFrame.append was removed in pandas 2.0, and appending inside the loop copies the data every time):
frames = []
for f in source_dataset_list:
    df = pd.read_excel(f, sheet_name=None)
    frames.append(pd.concat(df.values()))
out_df = pd.concat(frames, ignore_index=True)
Another solution:
cdf = [pd.read_excel(excel_names, sheet_name=None).values()
       for excel_names in source_dataset_list]
out_df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
If I understand your problem correctly, setting sheet_name=None in pd.read_excel does the trick.
import os
import pandas as pd
path = "C:/Users/aaa/Desktop/Sample_dataset/"
dfs = [
    pd.concat(pd.read_excel(path + x, sheet_name=None))
    for x in os.listdir(path)
    if x.endswith(".xlsx") or x.endswith(".xls")
]
df = pd.concat(dfs)
I have a pretty straightforward solution if you want to read all the sheets of one file (path and file_name are whatever folder and workbook you are reading).
import pandas as pd
df = pd.concat(pd.read_excel(path+file_name, sheet_name=None),
               ignore_index=True)
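Since the answers above all reduce to "collect the sheets, then concat once", here is a small sketch of that pattern that also records where each row came from. It assumes the input has the shape pd.read_excel(path, sheet_name=None) returns per file, that is, a dict of sheet name to DataFrame. The function name combine_workbooks, the source_file/source_sheet columns and the sample frames are made up for illustration.

```python
import pandas as pd

def combine_workbooks(workbooks):
    """Stack every sheet of every workbook into one frame with a single concat.

    workbooks maps a file name to {sheet name: DataFrame}, for example
    {name: pd.read_excel(name, sheet_name=None) for name in file_names}.
    """
    frames = []
    for file_name, sheets in workbooks.items():
        for sheet_name, frame in sheets.items():
            # assign returns a copy, so the caller's frames stay untouched.
            frames.append(frame.assign(source_file=file_name,
                                       source_sheet=sheet_name))
    # One concat at the end avoids re-copying the growing result on every step.
    return pd.concat(frames, ignore_index=True)

# Hypothetical in-memory stand-ins for two workbooks.
workbooks = {
    "Sales transaction 1.xlsx": {
        "Jan": pd.DataFrame({"amount": [10, 20]}),
        "Feb": pd.DataFrame({"amount": [30]}),
    },
    "Sales transaction 2.xlsx": {
        "Jan": pd.DataFrame({"amount": [40]}),
    },
}
all_rows = combine_workbooks(workbooks)
print(all_rows)
```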

Not getting back the column names after reading into an xlsx file

Hello, I have xlsx files and merged them into one dataframe using pandas. It worked, but instead of getting back the column names that I had in the xlsx files, I got numbers as column names and the column titles became a row, like this:
Output: 1 2 3
COLTITLE1 COLTITLE2 COLTITLE3
When they should be like this:
Output: COLTITLE1 COLTITLE2 COLTITLE3
The column titles are no longer column titles; they have become a row. How can I get back the column names that I had in the xlsx files? For clarity, the column names are the same in both xlsx files. Help would be appreciated. Here is my code:
# import modules
from IPython.display import display
import pandas as pd
import numpy as np
pd.set_option("display.max_rows", 999)
pd.set_option('max_colwidth',100)
%matplotlib inline
# filenames
file_names = ["data/OrderReport.xlsx", "data/OrderReport2.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in file_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# concatenate them
atlantic_data = pd.concat(frames)
# write it out
atlantic_data.to_excel("c.xlsx", header=False, index=False)
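I hope I understood your question correctly. The culprit is header=None rather than index_col=None (which is the default anyway). Drop header=None and the column names come back as usual:
frames = [x.parse(x.sheet_names[0]) for x in excels]
With header=None pandas assumes the sheet has no header row, so it labels the columns with integers and treats your column titles as the first row of data. Also note that writing with header=False in to_excel leaves the column names out of the output file.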
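If the frames have already been read with integer column labels and re-reading is not convenient, the first row can be promoted to the header afterwards. This is just a sketch of that repair; the helper name promote_first_row and the sample values are invented, and it assumes the titles really are in the first row of the frame.

```python
import pandas as pd

def promote_first_row(frame):
    """Use the first row as column names and drop it from the data."""
    out = frame.iloc[1:].copy()
    out.columns = frame.iloc[0].tolist()
    # Columns read alongside the titles end up as object dtype;
    # infer_objects() converts them back to numeric types where possible.
    return out.reset_index(drop=True).infer_objects()

# Hypothetical frame as it looks when the titles were read as a data row.
broken = pd.DataFrame([
    ["COLTITLE1", "COLTITLE2", "COLTITLE3"],
    [1, 2, 3],
    [4, 5, 6],
])
fixed = promote_first_row(broken)
print(fixed)
```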

How to read Excel Workbook (pandas)

First I want to say that I am not an expert by any means. I am versed but carry a burden of schedule and learning Python like I should have at a younger age!
Question:
I have a workbook that will on occasion have more than one worksheet. When reading in the workbook I will not know the number of sheets or their names. The data arrangement will be the same on every sheet, with some columns going by the name of 'Unnamed'. The problem is that everything I try or find online uses pandas.ExcelFile to gather all sheets, which is fine, but I need to be able to skip 4 rows, read only the 42 rows after that, and parse specific columns. Although the sheets have the same structure, the column names might be the same or different, and I would like them to be merged.
So here is what I have:
import pandas as pd
from openpyxl import load_workbook
# Load in the file location and name
cause_effect_file = r'C:\Users\Owner\Desktop\C&E Template.xlsx'
# Set up the ability to write dataframe to the same workbook
book = load_workbook(cause_effect_file)
writer = pd.ExcelWriter(cause_effect_file)
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
# Get the file skip rows and parse columns needed
xl_file = pd.read_excel(cause_effect_file, skiprows=4, parse_cols = 'B:AJ', na_values=['NA'], convert_float=False)
# Loop through the sheets loading data in the dataframe
dfi = {sheet_name: xl_file.parse(sheet_name)
       for sheet_name in xl_file.sheet_names}
# Remove columns labeled as un-named
for col in dfi:
    if r'Unnamed' in col:
        del dfi[col]
# Write dataframe to sheet so we can see what the data looks like
dfi.to_excel(writer, "PyDF", index=False)
# Save it back to the book
writer.save()
The link to the file i am working with is below
Excel File
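Try to modify the following based on your specific needs:
import pandas as pd
# path is the workbook location, e.g. cause_effect_file from the question
xls = pd.ExcelFile(path)
Then iterate over all the available data sheets and concatenate once at the end (parse_cols has been replaced by usecols, and DataFrame.append was removed in pandas 2.0):
frames = []
for name in xls.sheet_names:
    # header=4 skips the first 4 rows; nrows=42 reads only the 42 rows after the header
    a = xls.parse(name, header=4, usecols='B:AJ', nrows=42)
    a["Sheet Name"] = name
    frames.append(a)
df = pd.concat(frames, ignore_index=True)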
You can adjust the header row and the columns to read for each sheet. I added a column that will indicate the name of the data sheet the row came from.
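The question also asks to drop the columns that pandas labels 'Unnamed: N' when a header cell is empty. One way that might work on the combined frame is sketched below; the helper name drop_unnamed_columns and the sample column names are hypothetical.

```python
import pandas as pd

def drop_unnamed_columns(frame):
    """Return frame without the columns whose label starts with 'Unnamed'."""
    # astype(str) guards against non-string labels such as integers.
    keep = ~frame.columns.astype(str).str.startswith("Unnamed")
    return frame.loc[:, keep]

# Hypothetical frame with one placeholder column from an empty header cell.
sample = pd.DataFrame({"Tag": [1], "Unnamed: 3": [None], "Cause": [2]})
cleaned = drop_unnamed_columns(sample)
print(cleaned.columns.tolist())
```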
You probably want to look at using read_only mode in openpyxl. This will allow you to load only those sheets that you're interested in and look at only the cells you're interested in.
If you want to work with Pandas dataframes then you'll have to create these yourself but that shouldn't be too hard.
