I have many Excel sheets in which the data is arranged in a table, as shown below.
I want to convert each of these tables to another, list-like format, as shown below.
In this table:
Dir -> Name of the tab
Year -> Same for entire table
DOM -> Day of the month
DOW -> Day of the week.
Hour -> Column label in original table
Traffic Count -> Value in the original table
There are close to 1000 such sheets. The data in each sheet is at the same location. What is the best way to do this? Should I write a VBA script, or is there anything in Excel I can use to make my life easier?
I solved the problem using Python and the xlrd module:
import xlrd  # opening .xlsx files with xlrd requires xlrd < 2.0
import numpy as np
from os import listdir
import calendar
import re

direction = {'N': 0, 'S': 1, 'E': 2, 'W': 3}
# rend[month] = 9 + number of days in that month: the exclusive end row of the data
rend = [0, 40, 37, 40, 39, 40, 39, 40, 40, 39, 40, 39, 40]
# Initialize the matrix to store the final result
aData = np.matrix(np.zeros((0, 7)))
# Get all the .xlsx files from the current directory
filenames = [f for f in listdir('.') if f.endswith('.xlsx')]
for file in filenames:
    # The file names are in the format gdot_39446_yyyy_mm.xlsx,
    # where yyyy is the year and mm is the month number (0 = Jan, 11 = Dec).
    # Extract the month and year from the file name.
    ms = re.search(r'.+_.+_.+_([0-9]+)\.', file, re.M).group(1)
    month = int(ms) + 1
    year = int(file[11:15])
    # Open the workbook
    workbook = xlrd.open_workbook(file)
    # The workbook has three sheets. I want information from
    # sheet2 and sheet3 (indexed by 1 and 2 respectively).
    for i in range(1, 3):
        sheet = workbook.sheet_by_index(i)
        di = sheet.name[-1]
        data = [[sheet.cell_value(r, c) for c in range(2, 26)] for r in range(9, rend[month])]
        mData = np.matrix(data)
        mData[np.where(mData == '')] = 0  # some cells are blank; insert 0 in those cells
        n, d = mData.shape
        rows = n * d
        rData = np.matrix(np.zeros((rows, 7)))
        rData[:, 0].fill(direction[di])
        rData[:, 1].fill(year)
        rData[:, 2].fill(month)
        for j in range(rows):
            # integer division, so the day number is the same on Python 2 and 3
            rData[j, 3] = (j // 24) + 1
            rData[j, 4] = calendar.weekday(year, month, (j // 24) + 1)
            rData[j, 5] = j % 24
        for k in range(n):
            rData[k*24:((k+1)*24), 6] = mData[k, :].T
        aData = np.vstack((aData, rData))
np.savetxt("alldata.csv", aData, delimiter=',', fmt='%s')
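The reshaping step in the script above can also be expressed without NumPy. The following is a minimal sketch, assuming the hourly counts of one sheet have already been read into a list of rows (one row per day, 24 values per row). The helper name `wide_to_long` and the sample values are hypothetical and only serve as illustration.

```python
import calendar


def wide_to_long(table, direction, year, month):
    """Flatten a day-by-hour table into one record per (day, hour)."""
    records = []
    for day_index, hourly_counts in enumerate(table):
        dom = day_index + 1                       # day of month, 1-based
        dow = calendar.weekday(year, month, dom)  # 0 = Monday ... 6 = Sunday
        for hour, count in enumerate(hourly_counts):
            # Blank cells arrive as '' from the reader; store them as 0.
            value = 0 if count == '' else count
            records.append((direction, year, month, dom, dow, hour, value))
    return records


# Hypothetical sample: two days of made-up counts, with one blank cell.
demo = [[10 * d + h for h in range(24)] for d in range(2)]
demo[0][3] = ''
long_rows = wide_to_long(demo, "N", 2013, 1)
print(len(long_rows))
print(long_rows[24])
```

Each record has the same seven fields as a row of the CSV written by the script, so the list could be passed to `csv.writer(...).writerows(long_rows)` once all sheets have been processed.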
For example, you have one Excel file with 10,000 rows of data in it. When that file is loaded in PyCharm or Jupyter Notebook, it gets an index range, also known as row labels. My Python code should read those 10,000 row labels and split the data into 10 separate Excel files with 1,000 rows each.
As another example, if there are 9,999 rows in one sheet/file, my Python code should put 9,000 rows into 9 files and the remaining 999 into one more file, without any mistakes. (This is the important question.)
I am asking because my data has no unique values that my code could use to split the file with .unique.
You could use pandas to read your file, chunk it, then rewrite it:
import math
import pandas as pd
df = pd.read_excel("/path/to/excels/file.xlsx")
n_partitions = 3
# rows per output file, rounded up so that no row is left over
chunk_size = math.ceil(len(df) / n_partitions)
for i in range(n_partitions):
    sub_df = df.iloc[(i*chunk_size):((i+1)*chunk_size)]
    sub_df.to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")
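When the row count does not divide evenly by the number of partitions, one option is to spread the leftover rows over the first few partitions so that the file sizes differ by at most one row. A minimal sketch of that idea follows; `partition_bounds` is a hypothetical helper, not part of pandas.

```python
def partition_bounds(n_rows, n_partitions):
    """Return (start, stop) row ranges that split n_rows into n_partitions parts."""
    base, extra = divmod(n_rows, n_partitions)
    bounds = []
    start = 0
    for i in range(n_partitions):
        # The first `extra` partitions take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


bounds = partition_bounds(10, 3)
print(bounds)  # [(0, 4), (4, 7), (7, 10)]
```

Each pair can be used as `df.iloc[start:stop]` to obtain the rows of one output file.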
EDIT:
Or, if you prefer to set the number of rows per Excel file:
import pandas as pd
df = pd.read_excel("/path/to/excels/file.xlsx")
rows_per_file = 4
n_chunks = len(df) // rows_per_file
for i in range(n_chunks):
    start = i*rows_per_file
    stop = (i+1) * rows_per_file
    sub_df = df.iloc[start:stop]
    sub_df.to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")
# write the leftover rows, if any, to one more file instead of overwriting the last one
if n_chunks * rows_per_file < len(df):
    sub_df = df.iloc[n_chunks * rows_per_file:]
    sub_df.to_excel(f"/output/path/to/test-{n_chunks}.xlsx", sheet_name="a")
You'll need openpyxl to read/write Excel files.
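The fixed-rows-per-file split, including the shorter last file, comes down to computing row ranges. Here is a small sketch of that computation on its own, so it can be checked without any Excel file; `chunk_bounds` is a hypothetical helper name.

```python
def chunk_bounds(n_rows, rows_per_file):
    """Return (start, stop) ranges of at most rows_per_file rows covering n_rows."""
    return [
        (start, min(start + rows_per_file, n_rows))
        for start in range(0, n_rows, rows_per_file)
    ]


bounds = chunk_bounds(9999, 1000)
print(len(bounds))  # 10
print(bounds[-1])   # (9000, 9999)
```

For the 9,999-row case from the question this gives nine ranges of 1,000 rows and a final range of 999 rows. Each range can be applied as `df.iloc[start:stop]`.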
The following code snippet is working fine for me:
import pandas as pd
import openpyxl
import math
data = pd.read_excel(r"path_to_excel_file.xlsx")
_row_range = 200
# number of output files needed, rounded up
_block = math.ceil(len(data)/_row_range)
for x in range(_block):
    startRow = x*_row_range
    endRow = (x+1)*_row_range
    _data = data.iloc[startRow:endRow]
    _data.to_excel(f"file_name_{x}.xlsx", sheet_name="Sheet1", index=False)
This gets the job done as well. It assumes 19,000 rows per output file; edit that to suit your scenario.
import pandas as pd
import math
data = pd.read_excel(filename)  # filename: path to the source workbook
count = len(data)
rows_per_file = 19000
no_of_files = math.ceil(count/rows_per_file)
start_row = 0
end_row = rows_per_file
for x in range(no_of_files):
    new_data = data.iloc[start_row:end_row]
    new_data.to_excel(f"filename_{x}.xlsx")
    start_row = end_row  # iloc excludes end_row, so the next chunk starts there
    end_row = end_row + rows_per_file
I have an Excel table that contains:
ID product
03/1/2021
16/1/2022
12/2/2022
14/3/2023
A
4
1
2
5
B
6
1
3
C
7
6
In the same sheet I have drop-down lists for the year and the month.
If I select, for example, year = 2021 and month = 1 in the drop-down lists,
it should return something like this:
ID product  03/1/2021
A           4
B           6
C
It should then calculate the sum of those cells: sum = 10 in this case.
Here is my code:
# import load_workbook
import pandas as pd
import numpy as np
from openpyxl import load_workbook
from openpyxl.worksheet.datavalidation import DataValidation
from openpyxl import Workbook
from openpyxl.styles import PatternFill
# set file path
filepath = r'test.xlsx'
wb = load_workbook(filepath)
ws = wb["sheet1"]
# Generate the years 2021 to 2029 in column MK (MK1:MK9)
for number in range(1, 10):
    ws['MK{}'.format(number)].value = "202{}".format(number)
data_val = DataValidation(type="list", formula1='=MK1:MK9')
ws.add_data_validation(data_val)
# drop-down list with all the values from column MK
data_val.add(ws["E2"])
# Generate the month numbers 1 to 12 in column MN (MN1:MN12)
for numbers in range(1, 13):
    ws['MN{}'.format(numbers)].value = "{}".format(numbers)
data_vals = DataValidation(type="list", formula1='=MN1:MN12')
ws.add_data_validation(data_vals)
# drop-down list with all the values from column MN
data_vals.add(ws["E3"])
# add a color to the 'year' and 'month' cells
ws['E2'].fill = PatternFill(start_color='FFFFFF00', end_color='FFFFFF00', fill_type = 'solid')
ws['E3'].fill = PatternFill(start_color='FFFFFF00', end_color='FFFFFF00', fill_type = 'solid')
# save workbook
wb.save(filepath)
Any suggestions?
Thank you for your help.
Assuming your Excel file looks like the one below:
The final code looks like this:
import pandas as pd
import xlrd  # opening .xlsx files with xlrd requires xlrd < 2.0
file = r'C:\path\test_exl.xlsx'
sheetname = 'Sheet1'
n = 2
df = pd.read_excel(file, skiprows=[*range(n)], index_col=[0])
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_name(sheetname)
year = worksheet.cell(0,0).value
month = worksheet.cell(1,0).value
datetime_cols = pd.to_datetime(df.columns, dayfirst=True, errors='coerce')
out = (df.loc[:, (datetime_cols.year == year) & (datetime_cols.month == month)]
       .reset_index())
print(out)
  ID Product  03-01-2021
0          A         4.0
1          B         6.0
2          C         NaN
Breakdown:
You can first read the table into pandas using pd.read_excel:
file = r'C:\path\test_exl.xlsx'
sheetname = 'Sheet1'
n = 2  # change n to how many lines to skip before the table starts.
# In the image above my dataframe starts at line 3, so I put n=2
df = pd.read_excel(file, skiprows=[*range(n)], index_col=[0])
Then access the cell values using xlrd:
import xlrd
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_name(sheetname)
year = worksheet.cell(0,0).value   # A1 is 0,0
month = worksheet.cell(1,0).value  # A2 is 1,0 and so on..
# print(year, month) gives 2021 and 1
Then convert the columns to datetime and filter:
datetime_cols = pd.to_datetime(df.columns, dayfirst=True, errors='coerce')
out = (df.loc[:, (datetime_cols.year == year) & (datetime_cols.month == month)]
       .reset_index())
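The question also asks for the sum of the selected cells, which the filter above does not compute. A minimal sketch of that last step follows, run on an in-memory frame rather than the Excel file. The sample values are illustrative (only the 03/1/2021 column matches the question's expected result), and an explicit `format` is passed to `pd.to_datetime` here instead of `dayfirst=True`, which is a choice made for this sketch.

```python
import pandas as pd

# Illustrative subset of the table: two date columns, blank cells as None.
df = pd.DataFrame(
    {"03/1/2021": [4, 6, None], "16/1/2022": [1, None, None]},
    index=pd.Index(["A", "B", "C"], name="ID product"),
)
year, month = 2021, 1  # values that would come from the drop-down cells

# Parse the column labels as day/month/year dates.
datetime_cols = pd.to_datetime(df.columns, format="%d/%m/%Y", errors="coerce")
mask = (datetime_cols.year == year) & (datetime_cols.month == month)
out = df.loc[:, mask]

# Sum every selected cell; blank (NaN) cells are skipped by default.
total = out.sum().sum()
print(total)
```

With the sample values, only the 03/1/2021 column is selected and the total is 4 + 6.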
Please find the code below:
import pandas as pd
import csv
# Reading the csv file
df_new = pd.read_csv('source.csv')
# saving xlsx file
GFG = pd.ExcelWriter('source.xlsx')
df_new.to_excel(GFG, index = False)
GFG.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
# read excel
xl = pd.ExcelFile("source.xlsx")
df = xl.parse("Sheet1")
# get the column you want to copy
column = df["Marks"]
# paste it in the new excel file
with pd.ExcelWriter('Target_excel.xlsx', mode='A') as writer:
    column.to_excel(writer, sheet_name= "new sheet name", index = False)
    writer.close()
This code replaces the existing contents of the target Excel file.
I want to update a column in sheet 2 without changing the other columns.
Example:
Excel file 1 --> column_name = 'Marks'
Marks = 10, 20, 30
Excel file 2 --> there are two columns present in this file
Subject_name = Math, English, Science
Marks = 50, 20, 40
So I want to copy the "Marks" column from Excel file 1 and paste it into the "Marks" column of Excel file 2 (without changing the data in the "Subject_name" column).
import pandas as pd
import openpyxl as pxl

def get_col_idx(worksheet, col_name):
    # 0-based index of the column whose header (first row) equals col_name, or -1 if absent
    return next((i for i, col in enumerate(worksheet.iter_cols(1, worksheet.max_column)) if col[0].value == col_name), -1)

### ----- 0. csv -> xlsx (as in your code)
df_new = pd.read_csv("source.csv")
GFG = pd.ExcelWriter("source.xlsx")
df_new.to_excel(GFG, index=False)
GFG.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
### ----- 1. getting data to copy
# open file and get sheet of interest
source_workbook = pxl.load_workbook("source.xlsx")
source_sheet = source_workbook["Sheet1"]
# get "Marks" column index
col_idx = get_col_idx(source_sheet, "Marks")
# get contents in each cell
col_contents = [row[col_idx].value for row in source_sheet.iter_rows(min_row=2)]
### ----- 2. copy contents to target excel file
target_workbook = pxl.load_workbook("Target_excel.xlsx")
target_sheet = target_workbook["new sheet name"]
col_idx = get_col_idx(target_sheet, "Marks")
for i, value in enumerate(col_contents):
    # row i+2 skips the header; openpyxl columns are 1-based, hence col_idx+1
    cell = target_sheet.cell(row=i+2, column=col_idx+1)
    cell.value = value
target_workbook.save("Target_excel.xlsx")
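The idea behind the answer, overwriting one named column cell by cell while leaving the others alone, can be tried out on plain lists before touching any workbook. This is a minimal sketch under that assumption; `replace_column` is a hypothetical helper, and the sample rows are the values from the question.

```python
def replace_column(rows, col_name, new_values):
    """Return a copy of rows (header first) with one named column overwritten."""
    header = rows[0]
    col_idx = header.index(col_name)  # raises ValueError if the column is missing
    updated = [list(header)]
    for i, row in enumerate(rows[1:]):
        new_row = list(row)
        # Only overwrite rows for which a new value was supplied.
        if i < len(new_values):
            new_row[col_idx] = new_values[i]
        updated.append(new_row)
    return updated


target = [
    ["Subject_name", "Marks"],
    ["Math", 50],
    ["English", 20],
    ["Science", 40],
]
result = replace_column(target, "Marks", [10, 20, 30])
for row in result:
    print(row)
```

The `Subject_name` column is untouched and only `Marks` changes. With openpyxl, the input rows could come from `ws.iter_rows(values_only=True)`.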
I am working on a task where we have many xlsx files, each with about 100 rows, and I would like to merge them into one new big xlsx file with xlsxwriter.
Is it possible to do this with one loop that reads and writes simultaneously?
I can read the files and I can create a new one. On the first run I could write all cells into the new file, but when I checked the file, each file read had overwritten the values of the previous one. So I only got the last file's data, plus leftover rows wherever an earlier file had more rows than the last one.
Here is the code I wrote:
#!/usr/bin/env python3
import os
import time
import xlrd
import xlsxwriter
from datetime import datetime
from datetime import date

def print_values_to():
    loc = ("dir/")
    wr_workbook = xlsxwriter.Workbook('All_Year_All_Values.xlsx')
    wr_worksheet = wr_workbook.add_worksheet('Test')
    # --------------------------------------------------------
    all_rows = 0
    for file in os.listdir(loc):
        print(loc + file)
        workbook = xlrd.open_workbook(loc + file)
        sheet = workbook.sheet_by_index(0)
        number_of_rows = sheet.nrows
        number_of_columns = sheet.ncols
        all_rows = all_rows + number_of_rows
        dropped_numbers = []
        for i in range(number_of_rows):  # -------- number / number_of_rows
            if i == 0:
                all_rows = all_rows - 1
                continue
            for x in range(number_of_columns):
                type_value = sheet.cell_value(i, x)
                if isinstance(type_value, float):
                    changed_to_integer = int(sheet.cell_value(i, x))  # ----
                    values = changed_to_integer  # -----
                elif isinstance(type_value, str):
                    new_date = datetime.strptime(type_value, "%d %B %Y")
                    right_format = new_date.strftime("%Y-%m-%d")
                    values = right_format
                # write into new excel file
                wr_worksheet.write(i, x, values)
                # list of all values
                dropped_numbers.append(values)
            # print them on the console
            print(dropped_numbers)
            # Writing into new excel
            # wr_worksheet.write(i, x, values)
            # clear list of values for another run
            dropped_numbers = []
        print("Number of all rows: ", number_of_rows)
        print("\n")
    wr_workbook.close()
I went through the xlsxwriter documentation, but it doesn't say explicitly that this is not possible.
So I am still hoping that I can arrange it somehow.
Many thanks for any ideas.
Me again, but now with an answer. The solution was stupidly simple.
One variable increment did the trick. Right at the start of the row loop I just added p = p + 1 (with p initialized once, before the loop over the files), and voilà, all the data is in one xlsx file.
So at the top:
for i in range(number_of_rows): # -------- number / number_of_rows
    p = p + 1
and for the writer I just changed the row counter:
wr_worksheet.write(p, x, values)
aaaaaaaaaaah...
Many thanks.
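The running row counter described above can be sketched independently of xlrd and xlsxwriter. The example below assumes each input file has already been read into a list of rows with a header in the first row; `normalize`, `merged_cells` and the sample data are hypothetical names and values used only for illustration.

```python
from datetime import datetime


def normalize(value):
    """Floats become ints; 'd Month yyyy' strings become ISO dates."""
    if isinstance(value, float):
        return int(value)
    if isinstance(value, str):
        return datetime.strptime(value, "%d %B %Y").strftime("%Y-%m-%d")
    return value


def merged_cells(tables):
    """Yield (out_row, col, value) for every data cell of every table."""
    out_row = 0  # running row counter shared by all input tables
    for table in tables:
        for row in table[1:]:  # skip each table's header row
            for col, value in enumerate(row):
                yield out_row, col, normalize(value)
            out_row += 1  # advance once per written row, never reset per file


file_a = [["date", "count"], ["5 March 2021", 3.0], ["6 March 2021", 4.0]]
file_b = [["date", "count"], ["7 March 2021", 5.0]]
cells = list(merged_cells([file_a, file_b]))
for cell in cells:
    print(cell)
```

Because the counter is incremented only after a data row has been emitted, the output rows are contiguous, and each tuple can be passed to `wr_worksheet.write(out_row, col, value)`.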
I'm working through data for a research project. Output is in the form of .csv files, which have been converted to .xlsx files. There is a separate output file for each participant, with each file containing data on about 40 different measurements across several dozen (or so) stimuli. To make any sense of the data collected, we need to look at each stimulus separately, with its relevant associated measurements. Each output file is large (50 columns by 60000 rows). I’m looking to parse the data using openpyxl, searching a pre-specified column for cells with a particular string value. When such a cell is found, I want to write that cell to a new workbook along with other specified columns from the same row.
For instance, parsing the following table, I’m trying to use openpyxl to search column A for ‘Slide 2’. When this value is found for a particular row, that cell is written to a new workbook along with the values in column C and D for that same row.
   A        B      C      D
1  Slide    Data1  Data2  Data3
2  Slide 1  1      2      3
3  Slide 2  4      5      6
4  Slide 2  7      8      9
Would write:
   A        B  C  D
2  Slide 2  5  6
3
4
... or some similar format.
I would also look to fill column D and E with data from the next file, and F and G with data from the file after that (and so on), but I can probably figure that part out.
I’ve tried:
from openpyxl import load_workbook
wb = load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]
dest_filename = r'output.xlsx'
for x in range (0, 100): #0-100 as proof of concept before parsing entire worksheet
    if ws.cell(row = x, column =26) == 'some_image.jpg':
        print (ws.cell(row =x, column =26), ws.cell(row = x, column = 10), ws.cell(row = x, column = 17))
wb.save = dest_filename
I also tried adding the following, in an attempt to create a worksheet in memory within which to manipulate cells:
for i in range (0, 30):
    for j in range (0, 100):
        print (ws.cell(row =i, column=j))
... both with minor variations, but they all output a copy of the original file.
I’ve read and re-read the documentation for openpyxl but to no avail. There doesn’t seem to be any similar question on the forums here either.
Any insight in correctly manipulating and writing data would be greatly appreciated. I also hope this might help other people trying to make sense of huge datasets. Thanks in advance!
I'm on Windows 7 running Python3.3.2 (64 bit) with openpyxl-1.6.2. Data was originally in .csv format, so could be exported to .xls or other formats if this helps. I looked into xlutils (using xlwt and xlrd) briefly, but openpyxl worked better with xlsx files.
Edit
Many thanks to @MikeMüller for pointing out that I needed two workbooks to transfer data between. That makes much more sense.
I now have the following, but it still returns an empty workbook. The original cells are not blank. (The commented lines are for simplification - without the indent, of course - but code not successful either way.)
import openpyxl
wb = openpyxl.load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]
wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]
#n = 1
#for x in range (0, 1000):
#if ws.cell(row = x, column = 27) == '7.image2.jpg':
ws_out.cell(row = n, column = 1) == ws.cell(row = x, column = 26) #x changed
ws_out.cell(row = n, column = 2) == ws.cell(row = x, column = 10) #x changed
ws_out.cell(row = n, column = 3) == ws.cell(row = x, column = 17) #x changed
#n += 1
wb_out.save('output108.xlsx')
Edit 2
I've updated the code to include the .value for cells, but it still returns a blank workbook.
import openpyxl
wb = openpyxl.load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]
wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]
n = 1
for x in range (0, 1000):
    if ws.cell(row=x, column=27).value == '7.Image001.jpg':
        ws_out.cell(row=n, column=1).value = ws.cell(row=x, column=27).value
        ws_out.cell(row=n, column=2).value = ws.cell(row=x, column=10).value
        ws_out.cell(row=n, column=3).value = ws.cell(row=x, column=17).value
        n += 1
wb_out.save('output108.xlsx')
Summary for the next person with trouble:
You need two workbooks in memory: one to import your file, the other to write to a new workbook file.
Use the cell.value attribute to pull the contents of each cell of your imported workbook, and assign it to the .value of the desired cell in the exported workbook.
Make sure you start counting rows and columns at zero with openpyxl 1.x, as used here; openpyxl 2.0 and later count from one.
You are doing cell assignment incorrectly. Here's what should work:
import openpyxl
wb = openpyxl.load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]
wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]
n = 1
for x in range (0, 1000):
    if ws.cell(row=x, column=27).value == '7.image2.jpg':
        ws_out.cell(row=n, column=1).value = ws.cell(row=x, column=26).value #x changed
        ws_out.cell(row=n, column=2).value = ws.cell(row=x, column=10).value #x changed
        ws_out.cell(row=n, column=3).value = ws.cell(row=x, column=17).value #x changed
        n += 1
wb_out.save('output108.xlsx')
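For readers on a current openpyxl release, the same filter-and-copy idea can be sketched with `iter_rows` and `append`, which avoids manual row counters. This is a minimal sketch, assuming openpyxl 2.6 or later (for `values_only=True`) and 1-based column numbers; `copy_matching_rows` is a hypothetical helper and the sample rows are the small table from the question.

```python
from openpyxl import Workbook


def copy_matching_rows(ws_in, ws_out, match_col, wanted, keep_cols):
    """Append to ws_out the matched cell plus keep_cols for every matching row."""
    copied = 0
    # min_row=2 skips the header row; values_only yields plain cell values.
    for row in ws_in.iter_rows(min_row=2, values_only=True):
        if row[match_col - 1] == wanted:
            ws_out.append([row[match_col - 1]] + [row[c - 1] for c in keep_cols])
            copied += 1
    return copied


# Build the question's example table in memory instead of loading a file.
wb_in = Workbook()
ws_in = wb_in.active
for sample_row in [
    ("Slide", "Data1", "Data2", "Data3"),
    ("Slide 1", 1, 2, 3),
    ("Slide 2", 4, 5, 6),
    ("Slide 2", 7, 8, 9),
]:
    ws_in.append(sample_row)

wb_out = Workbook()
ws_out = wb_out.active
n_copied = copy_matching_rows(ws_in, ws_out, 1, "Slide 2", (3, 4))
result = list(ws_out.iter_rows(values_only=True))
print(n_copied, result)
```

In real use, `ws_in` would come from `openpyxl.load_workbook(...)` and the result would be written with `wb_out.save(...)`; both calls are left out here so the sketch touches no files.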
You need to open a second workbook for writing:
import openpyxl
wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]
Put this in your loop:
ws_out.cell('cell indices here').value = desired_value
Save your file:
wb_out.save(dest_filename)