Python: Split one Excel file into multiple Excel files by rows

For example, suppose you have one Excel file containing 10,000 rows. When I load that file in PyCharm or a Jupyter notebook, I get an index range (the row labels). My Python code should read those ten thousand rows and split them into 10 separate Excel files with 1,000 rows each.
Another example: if one sheet/file contains 9,999 rows, my code should put 9,000 rows into 9 files and the remaining 999 rows into a tenth file, without losing anything. (This is the important part of the question.)
I am asking because my data has no unique values that my code could use to split the files with .unique.

You could use pandas to read your file, chunk it, then rewrite it:
import math
import pandas as pd
df = pd.read_excel("/path/to/excels/file.xlsx")
n_partitions = 3
# Rows per output file, rounded up so that no row is dropped
chunk_size = math.ceil(len(df) / n_partitions)
for i in range(n_partitions):
    sub_df = df.iloc[(i * chunk_size):((i + 1) * chunk_size)]
    sub_df.to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")
EDIT:
Or, if you prefer to set the number of rows per .xlsx file:
import pandas as pd
df = pd.read_excel("/path/to/excels/file.xlsx")
rows_per_file = 4
n_chunks = len(df) // rows_per_file
for i in range(n_chunks):
    start = i * rows_per_file
    stop = (i + 1) * rows_per_file
    sub_df = df.iloc[start:stop]
    sub_df.to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")
# Write any leftover rows to one more file instead of overwriting the last one
remainder_start = n_chunks * rows_per_file
if remainder_start < len(df):
    sub_df = df.iloc[remainder_start:]
    sub_df.to_excel(f"/output/path/to/test-{n_chunks}.xlsx", sheet_name="a")
You'll need openpyxl to read/write Excel files.

The following code snippet works for me:
import pandas as pd
import openpyxl
import math
data = pd.read_excel(r"path_to_excel_file.xlsx")
_row_range = 200
# Number of output files, rounded up to cover a partial last block
_block = math.ceil(len(data) / _row_range)
for x in range(_block):
    startRow = x * _row_range
    endRow = (x + 1) * _row_range
    _data = data.iloc[startRow:endRow]
    _data.to_excel(f"file_name_{x}.xlsx", sheet_name="Sheet1", index=False)

This gets the job done as well. It assumes 19,000 rows per output file; edit that to suit your scenario.
import pandas as pd
import math
data = pd.read_excel(filename)  # filename is the path to your input workbook
count = len(data)
rows_per_file = 19000
no_of_files = math.ceil(count / rows_per_file)
start_row = 0
end_row = rows_per_file
for x in range(no_of_files):
    new_data = data.iloc[start_row:end_row]
    new_data.to_excel(f"filename_{x}.xlsx")
    # iloc slices exclude the end position, so the next chunk starts at end_row
    start_row = end_row
    end_row = end_row + rows_per_file
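The answers above all hinge on getting the slice boundaries right, so here is a minimal sketch of just that arithmetic, using only the standard library. The helper name chunk_bounds is made up for illustration and is not part of pandas.

```python
def chunk_bounds(n_rows, rows_per_file):
    """Return (start, stop) row offsets that cover n_rows in blocks of rows_per_file.

    chunk_bounds is a hypothetical helper name; stop is exclusive, like iloc slices.
    """
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    return [
        (start, min(start + rows_per_file, n_rows))
        for start in range(0, n_rows, rows_per_file)
    ]


# The 9,999-row case from the question: nine full blocks plus one of 999 rows
bounds = chunk_bounds(9999, 1000)
print(len(bounds))
print(bounds[-1])
```

Each pair can then be fed to something like df.iloc[start:stop].to_excel(...), numbering the output files with enumerate so that the last partial block gets its own file.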

Related

Delete rows in CSV based on specific value

I want to delete specific rows in my CSV with Python. The CSV has multiple rows and columns.
import numpy as np
np.df2 = pd.read_csv('C:/Users/.../Data.csv', delimiter='\t')
np.df_2=np.df2[['Colum.X', 'Colum.Y']]
Python should open Data.csv and then delete every complete row where the value of Colum.X > 5 or the value of Colum.Y > 20.
You can accomplish this with pandas; there is no need for NumPy. I assume the columns in your CSV are actually named 'Colum.X' and 'Colum.Y'.
import pandas as pd
df = pd.read_csv('C:/Users/.../Data.csv', delimiter='\t')
df = df.loc[df['Colum.X'] <= 5] # Take only the rows where Colum.X <= 5
df = df.loc[df['Colum.Y'] <= 20] # Take only the rows where Colum.Y <= 20
df.to_csv('C:/Users/.../Data.csv', index=False) # Export back to CSV (comma-separated by default)
Not entirely sure what you're doing with np.df2, but the following will work:
import pandas as pd
df = pd.read_csv('C:/Users/.../Data.csv', delimiter='\t')
df2 = df[(df['X'] <= 5) & (df['Y'] <= 20)]
You might have to add names=['X', 'Y'] to the read_csv call (plus header=0 if the file already has a header row you want to replace), depending on what your CSV data looks like.
You can then overwrite the original file with:
df2.to_csv('C:/Users/.../Data.csv')
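To see the combined boolean mask in action without touching a real file, here is a small sketch that reads tab-separated text from memory. The sample values are invented for illustration; only the column names come from the question.

```python
import io

import pandas as pd

# Invented sample data standing in for Data.csv (tab-separated, as in the question)
raw = io.StringIO("Colum.X\tColum.Y\n1\t10\n6\t10\n3\t25\n5\t20\n")
df = pd.read_csv(raw, delimiter="\t")

# Keep a row only if neither deletion condition (X > 5 or Y > 20) applies
keep = (df["Colum.X"] <= 5) & (df["Colum.Y"] <= 20)
filtered = df[keep]
print(filtered)
```

Deleting rows where X > 5 or Y > 20 is the same as keeping rows where X <= 5 and Y <= 20, which is why the mask uses & rather than |.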

how to read many columns from excel using python

I want to read about 162 rows of data from Excel. I tried this code, but I couldn't figure out how to do it in a loop:
import xlrd
file_location = "dec_DB.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('Sheet1')
x = []
for rownum in range(sheet.nrows):
    x.append(sheet.cell(rownum, 3))
You can read the columns you want from an excel sheet easily in one line with pandas:
import pandas as pd
x = pd.read_excel(io = "/Users/atheeralzaydi/Desktop/KACST/DB/dec_DB.xlsx", sheet_name = "Sheet1", usecols = [1,2,3,12,34,100])
x will be a pandas DataFrame.
EDIT:
To read the first 162 rows (+1 including the header -- the row that includes the column names), instead of the parameter usecols, use the parameter nrows:
import pandas as pd
x = pd.read_excel(io = "/Users/atheeralzaydi/Desktop/KACST/DB/dec_DB.xlsx", sheet_name = "Sheet1", nrows = 162)
To import specific rows and not the first m rows, use the parameter skiprows.
To import the entire sheet, simply omit these parameters and all rows will be imported.
More information is available in the pandas read_excel documentation.

How to use Python to split Excel worksheet with rows containing \n into individual rows?

I have a spreadsheet with some rows that contain \n that need to be broken apart into individual rows.
I am able to open the sheet with openpyxl and convert the worksheet to a pandas dataframe, but I've been pulling my hair out trying to figure out how to split the rows.
Input:
Desired output:
Note that row 7 became row 7 & row 8 - that is the desired behavior for any row that has \n.
Any assistance would be much appreciated!
EDIT: My crappy original code is below; this is as far as I've gotten and I'm not sure where to go from here.
from openpyxl import load_workbook
from openpyxl import Workbook
import numpy as np
import pandas as pd
wb = load_workbook(filename="<filename>")
ws = wb["Page 1"]
# load worksheet into pandas dataframe
wsdf = pd.DataFrame(ws.values)
# create output wb/ws
output_wb = Workbook()
output_ws = output_wb.active
output_ws.title = "output"
# identify rows where crlf > 0
toBeSplit = []
pos = 0
for row in wsdf.iloc[:,1]:
    #print ( pos, " ", str(row).count("\n") )
    if ( str(row).count("\n") > 0 ):
        toBeSplit.append(pos)
    pos = pos + 1
print ( "Rows to be split: ", toBeSplit)
# write output
output_wb.save('<filename>')
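One possible way forward, sketched here under the assumption that only a single column holds the embedded newlines and that the other columns should simply be repeated on each new row: split that column into lists and let DataFrame.explode fan them out. The function name split_multiline_rows and the sample data are invented for illustration.

```python
import pandas as pd


def split_multiline_rows(df, column):
    """Return a copy of df where each newline-separated value in `column` gets its own row.

    Hypothetical helper: non-string cells (numbers, NaN) are left untouched.
    """
    out = df.copy()
    out[column] = out[column].apply(
        lambda v: v.split("\n") if isinstance(v, str) else v
    )
    # explode() turns each list element into a separate row, repeating the other columns
    return out.explode(column).reset_index(drop=True)


# Invented sample: row 7 contains a newline and should become two rows
sample = pd.DataFrame({"id": [6, 7], "note": ["single", "first\nsecond"]})
result = split_multiline_rows(sample, "note")
print(result)
```

The result can be written back with result.to_excel(...). If several columns contain newlines that must stay aligned with each other, this sketch is not enough on its own.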

openpyxl: a better way to read a range of numbers to an array

I am looking for a better (more readable / less hacked together) way of reading a range of cells using openpyxl. What I have at the moment works, but involves composing the excel cell range (e.g. A1:C3) by assembling bits of the string, which feels a bit rough.
At the moment this is how I read nCols columns and nRows rows starting from a particular cell (minimum working example; assumes that worksheet.xlsx is in the working directory and has the cell references written in cells A1 to C3 of Sheet1):
from openpyxl import load_workbook
import numpy as np
firstCol = "B"
firstRow = 2
nCols = 2
nRows = 2
lastCol = chr(ord(firstCol) + nCols - 1)
cellRange = firstCol + str(firstRow) + ":" + lastCol + str(firstRow + nRows - 1)
wsName = "Sheet1"
wb = load_workbook(filename="worksheet.xlsx", data_only=True)
data = np.array([[i.value for i in j] for j in wb[wsName][cellRange]])
print(data)
Returns:
[[u'B2' u'C2']
[u'B3' u'C3']]
As well as being a bit ugly there are functional limitations with this approach. For example in sheets with more than 26 columns it will fail for columns like AA.
Is there a better/correct way to read nRows and nCols from a given top-left corner using openpyxl?
openpyxl provides functions for converting between numerical column indices (1-based index) and Excel's 'AA' style. See the utils module for details.
However, you'll have little need for them in general. You can use the get_squared_range() method of worksheets for programmatic access. And, starting with openpyxl 2.4, you can do the same with the iter_rows() and iter_cols() methods. NB. iter_cols() is not available in read-only mode.
The equivalent MWE using iter_rows() would be:
from openpyxl import load_workbook
import numpy as np
wsName = "Sheet1"
wb = load_workbook(filename="worksheet.xlsx", data_only=True)
ws = wb[wsName]
firstRow = 2
firstCol = 2
nCols = 2
nRows = 2
allCells = np.array([[cell.value for cell in row] for row in ws.iter_rows()])
# allCells is zero-indexed
data = allCells[(firstRow-1):(firstRow-1+nRows),(firstCol-1):(firstCol-1+nCols)]
print(data)
The equivalent MWE using get_squared_range() would be:
from openpyxl import load_workbook
import numpy as np
wsName = "Sheet1"
wb = load_workbook(filename="worksheet.xlsx", data_only=True)
firstCol = 2
firstRow = 2
nCols = 2
nRows = 2
data = np.array([[i.value for i in j] for j in wb[wsName].get_squared_range(
firstCol, firstRow, firstCol+nCols-1, firstRow+nRows-1)])
print(data)
Both of which return:
[[u'B2' u'C2']
[u'B3' u'C3']]
See also https://openpyxl.readthedocs.io/en/default/pandas.html for more information on using Pandas and openpyxl together.
For completeness (and so I can find it later) the equivalent code using the pandas function read_excel suggested by @Rob in a comment would be:
import pandas
import numpy as np
wsName = "Sheet1"
df = pandas.read_excel(open("worksheet.xlsx", "rb"), sheet_name=wsName, header=None)
firstRow = 2
firstCol = 2
nCols = 2
nRows = 2
# Data-frame is zero-indexed; iloc slices exclude the end position
data = np.array(df.iloc[(firstRow-1):(firstRow-1+nRows), (firstCol-1):(firstCol-1+nCols)])
print(data)
Which returns:
[[u'B2' u'C2']
[u'B3' u'C3']]
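Since get_squared_range() was deprecated in favour of iter_rows() and may be missing from newer openpyxl releases, here is a sketch that passes the block limits straight to iter_rows() instead of reading the whole sheet and slicing afterwards. It assumes openpyxl 2.6 or later for the values_only argument; read_block is a made-up helper name, and the workbook is built in memory so the example does not depend on worksheet.xlsx.

```python
from openpyxl import Workbook
from openpyxl.utils import get_column_letter


def read_block(ws, first_row, first_col, n_rows, n_cols):
    """Return the values of an n_rows x n_cols block as a list of lists (1-based indices)."""
    rows = ws.iter_rows(
        min_row=first_row,
        max_row=first_row + n_rows - 1,
        min_col=first_col,
        max_col=first_col + n_cols - 1,
        values_only=True,
    )
    return [list(row) for row in rows]


# Build a small in-memory sheet whose cells hold their own references (A1 to C3)
wb = Workbook()
ws = wb.active
for r in range(1, 4):
    for c in range(1, 4):
        ws.cell(row=r, column=c, value=f"{get_column_letter(c)}{r}")

data = read_block(ws, first_row=2, first_col=2, n_rows=2, n_cols=2)
print(data)
```

Because the columns are addressed by number, this also works past column Z; get_column_letter is only used here to generate the sample values.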

subtract consecutive rows from a .dat file

I wish to subtract each row from the preceding row in a .dat file and then make a new column out of the result. Specifically, I want to do this with the first column, time: find the time interval for each timestep and store it in a new column. With help from the Stack Overflow community I wrote some pseudocode in pandas, but it's not working so far:
import pandas as pd
import numpy as np
from sys import argv
from pylab import *
import csv
script, filename = argv
# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./flash.dat").readlines()]
# write it as a new CSV file
with open("./flash.dat", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
columns_to_keep = ['#time']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
df = pd.DataFrame({"#time": pd.date_range("24 sept 2016"),periods=5*24,freq="1h")})
df["time"] = df["#time"] + [pd.Timedelta(minutes=m) for m in np.random.choice(a=range(60), size=df.shape[0])]
df["value"] = np.random.normal(size=df.shape[0])
df["prev_time"] = [np.nan] + df.iloc[:-1]["time"].tolist()
df["time_delta"] = df.time - df.prev_time
df
dataframe.plot(x='#time', y='time_delta', style='r')
print dataframe
show()
I am also sharing the file for your convenience; your help is much appreciated.
https://www.dropbox.com/s/w4jbxmln9e83355/flash.dat?dl=0
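A minimal sketch of the row-to-row subtraction, assuming flash.dat is whitespace-delimited with a header line whose first column is named #time (as the usecols list in the question suggests). The sample values below are invented, since the real file is not reproduced here.

```python
import io

import pandas as pd

# Invented stand-in for flash.dat: whitespace-separated, first column "#time"
sample = io.StringIO("#time value\n0.0 1.5\n0.5 1.7\n1.5 2.0\n")
df = pd.read_csv(sample, sep=r"\s+")

# diff() subtracts the previous row from the current one; the first row has no predecessor
df["time_delta"] = df["#time"].diff()
print(df)
```

With the real file, the StringIO object would be replaced by the path to flash.dat; no intermediate CSV conversion should be needed if the columns are separated by spaces or tabs.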
