I am trying to do the following:
For each entry in Col A, if that entry recurs in Col A, add together all of its values from Col E.
Then write only those (summed) values from Col E into another Excel sheet, so that each Col A entry has all of the Col E values corresponding to it.
However, my code only produces output for the last row.
Here is the code that I've written:
#! /usr/bin/env python
from xlrd import open_workbook
from tempfile import TemporaryFile
from xlwt import Workbook

wb = open_workbook('/Users/dem/Documents/test.xlsx')
wk = wb.sheet_by_index(0)

for i in range(wk.nrows):
    a = str(wk.cell(i, 0).value)
    b = []
    e = []
    for j in range(wk.nrows):
        c = str(wk.cell(j, 0).value)
        d = str(wk.cell(j, 4).value)
        if a == c:
            b.append(d)
    print b
    e.append(b)

book = Workbook()
sheet1 = book.add_sheet('sheet1')
n = 0
for n, item in enumerate(e):
    sheet1.write(n, 0, item)
    n += 1

book.save('/Users/dem/Documents/res.xls')
book.save(TemporaryFile())
My resulting sheet is wrong: it only contains the values for the last row.
The explanations are in the comments in the code:
#! /usr/bin/env python
from xlrd import open_workbook
from tempfile import TemporaryFile
from xlwt import Workbook
import copy

wb = open_workbook('C:\\Temp\\test.xls')
wk = wb.sheet_by_index(0)

# e must be defined outside the loop, otherwise it is reset to an empty list
# on every iteration; e stores the final result
e = []
# f stores the Col A values that have already been processed,
# so each value is only handled once
f = []

for i in range(wk.nrows):
    b = []
    temp = None
    a = str(wk.cell(i, 0).value)
    # only process each Col A value once
    if a in f:
        continue
    # record the Col E value of row i itself, exactly once
    k = str(wk.cell(i, 4).value)
    b.append(k)
    # start from i+1 to avoid double counting
    for j in range(i + 1, wk.nrows):
        c = str(wk.cell(j, 0).value)
        if a == c:
            # these operations run only when a duplicate is actually found
            d = str(wk.cell(j, 4).value)
            f.append(a)
            # record every matching value from Col E
            b.append(d)
            # use deepcopy here so e keeps a snapshot of the current values
            temp = copy.deepcopy(b)
    # in your data, row 5 has no duplicate, so temp stays None for it;
    # avoid adding None to the final result
    if temp:
        e.append(temp)

book = Workbook()
sheet1 = book.add_sheet('sheet1')
for n, item in enumerate(e):
    # xlwt cannot write a list into a single cell, so join the values first
    sheet1.write(n, 0, ", ".join(item))
    # no n += 1 needed here; enumerate increments n by itself

book.save('C:\\Temp\\res.xls')
book.save(TemporaryFile())
I think you should look into using csv.writer with dialect='excel'; there is a usage example in the documentation. This is probably the simplest way to work with Excel data when you don't need much functionality, as in your case.
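For what it's worth, a minimal sketch of that idea applied to this question might look like the following (assuming the key is in Col A, the values are in Col E, the file paths are placeholders taken from the question, and Python 2 as in the code above):

import csv
from collections import OrderedDict
from xlrd import open_workbook

wb = open_workbook('/Users/dem/Documents/test.xlsx')
wk = wb.sheet_by_index(0)

# collect every Col E value under its Col A key, preserving first-seen order
groups = OrderedDict()
for i in range(wk.nrows):
    key = str(wk.cell(i, 0).value)
    groups.setdefault(key, []).append(wk.cell(i, 4).value)

# one output row per Col A entry: the key followed by all of its Col E values
with open('/Users/dem/Documents/res.csv', 'wb') as f:  # on Python 3 use open(..., 'w', newline='')
    writer = csv.writer(f, dialect='excel')
    for key, values in groups.items():
        writer.writerow([key] + values)

The resulting res.csv opens directly in Excel, which may be enough if the .xls format itself is not a requirement.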
I am new to Python. Currently I need to count the number of duplicates, delete the duplicate rows, and write the number of occurrences of each duplicate into a new column. Below is my code:
import pandas as pd
from openpyxl import load_workbook

filepath = '/Users/jordanliu/Desktop/test/testA.xlsx'
data = load_workbook(filepath)
sku = data.active

duplicate_column = []
for x in range(sku.max_row):
    duplicate_count = 0
    for i in range(x):
        if sku.cell(row=i + 2, column=1).value == sku.cell(row=x + 2, column=1).value:
            duplicate_count = duplicate_column[i] + 1
            sku.cell(row=i + 2, column=1).value = 0
    duplicate_column.append(duplicate_count)

for x in range(len(duplicate_column)):
    sku.cell(row=x + 2, column=3).value = duplicate_column[x]

for y in range(sku.max_row):
    y = y + 1
    if sku.cell(row=y, column=1).value == 0:
        sku.delete_rows(y, 1)

data.save(filepath)
I originally tried pandas, but because the execution time was extraordinarily long I decided to change to openpyxl; it doesn't seem to help much. Many other posts suggest using CSV, but since it's the writing process that takes the majority of the time, I thought that wouldn't help much either.
Can someone please provide me some help here?
for x in range(sku.max_row):
    duplicate_count = 0
    for i in range(x):
        if sku.cell(row=i + 2, column=1).value == sku.cell(row=x + 2, column=1).value:
            duplicate_count = duplicate_column[i] + 1
            sku.cell(row=i + 2, column=1).value = 0
For this portion, you are rechecking the same values over and over. Assuming the values are meant to be unique in the end, which is how I read your code, you should instead keep a cache in a hashed type (dict or set) and do these subsequent lookups against it, rather than going through sku.cell every time.
So it would be something like:
xl_cache = {}          # maps a Col A value to the row where it first appeared
duplicate_count = {}   # maps a row number to how many duplicates of it were found
delete_set = set()     # rows to delete later

# openpyxl rows are 1-indexed; row 1 is the header, so the data starts at row 2
for x in range(2, sku.max_row + 1):
    x_val = sku.cell(row=x, column=1).value
    if x_val in xl_cache:                      # then this is not the first time
        duplicate_count[xl_cache[x_val]] += 1  # increase the original row's duplicate count
        delete_set.add(x)                      # mark this duplicate row for deletion
    else:
        xl_cache[x_val] = x    # key is the value, value is the row number
        duplicate_count[x] = 0 # key is the row number, value is the duplicate count
So now you have a dictionary of originals with their duplicate counts; you need to go back, delete the rows you don't want, and write the duplicate counts into the sheet. Go backwards through the range, and for each row check for deletion first, otherwise update the duplicate count: start at the maximum row and reduce by 1 each time.
y = sku.max_row
# walk the rows in reverse so deleting a row does not shift the rows still to visit;
# stop at 2 so the header row is left alone
for i in range(y, 1, -1):
    if i in delete_set:
        sku.delete_rows(i, 1)
    else:
        sku.cell(row=i, column=3).value = duplicate_count[i]
In theory, this would only traverse your range twice in total, and lookups from the cache would be O(1) on average. You need to traverse this in reverse to maintain row order as you delete rows.
Since I don't actually have your sample data, I can't test this code completely, so there could be minor issues, but I tried to use the structures you already have in your code to make it easy for you to adapt.
I have written a Python script that fetches the cell values and displays them in a list, row by row.
Here is my script:
book = openpyxl.load_workbook(excel_file_name)
active = book.get_sheet_by_name(excel_file_name)

def iter_rows(active):
    for row in active.iter_rows():
        yield [cell.value for cell in row]

res = list(iter_rows(active))
for new in res:
    print new
Output for the above script:
[state, country, code]
[abc, xyz, 0][def, lmn, 0]
I want output in below format:
[state:abc, country:xyz, code:0][state:def, country:lmn, code:0]
Please note: I want to do this with openpyxl.
I think this should do exactly what you want.
import openpyxl

book = openpyxl.load_workbook("Book1.xlsx")
active = book.get_sheet_by_name("Sheet1")
res = list(active)

final = []
for x in range(1, len(res)):
    partFinal = {}
    partFinal[res[0][0].value] = res[x][0].value
    partFinal[res[0][1].value] = res[x][1].value
    partFinal[res[0][2].value] = res[x][2].value
    final.append(partFinal)

for x in final:
    print x
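If the sheet does not always have exactly three columns, a hedged variant of the same idea zips the header row with each data row instead of hard-coding the indexes (assuming, as above, that the first row holds the headers and that the file and sheet names are placeholders):

import openpyxl

book = openpyxl.load_workbook("Book1.xlsx")
active = book.get_sheet_by_name("Sheet1")

rows = list(active.iter_rows())
headers = [cell.value for cell in rows[0]]   # first row is assumed to be the header row

final = []
for row in rows[1:]:
    # pair every header with the matching cell in this row
    final.append({key: cell.value for key, cell in zip(headers, row)})

for record in final:
    print record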
I have a huge table file that looks like the following. In order to work on individual products (name), I tried to use pandas groupby, but it seems to put the whole table (~10G) in memory, which I cannot afford.
name index change
A Q QQ
A Q QQ
A Q QQ
B L LL
C Q QQ
C L LL
C LL LL
C Q QQ
C L LL
C LL LL
The name column is sorted and I only care about one name at a time. I hope to use the following criteria on the "change" column to filter each name:
Check whether the number of "QQ" rows overwhelms the number of "LL" rows. Basically, if the number of rows containing "QQ" minus the number of rows containing "LL" is >= 2, then discard/ignore the "LL" rows for this name from now on. If "LL" overwhelms "QQ" in the same way, discard the "QQ" rows instead. (E.g. A has 3 QQ and 0 LL, and C has 4 LL and 2 QQ; both are fine.)
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
C L LL
C LL LL
C L LL
C LL LL
Comparing "change" to "index", if no change occurs (e.g. LL in both columns), the row is not valid. Further, for the valid changes, the remaining QQ or LL has to be continuous for >=3 times. Therefore C only has 2 valid changes, and it will be filtered out.
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
I wonder if there is a way to work on the table name by name and release the memory after each name (and to avoid having to apply the two criteria step by step). Any hint or suggestion would be appreciated!
Because the file is sorted by "name", you can read the file row-by-row:
def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        if index not in group_by:
            group_by[index] = []
        group_by[index].append(change)

    # do the step 1 filter logic here

    # do the step 2 filter logic here
    for index in group_by:
        if index in group_by[index]:
            # Because there is at least one "no change",
            # this whole "name" can be thrown out, so return here.
            return

    for index in group_by:
        output_file.write("%s\t%s\t%s\n" % (name, index, group_by[index]))

current_name = None
current_data = []

# input_filename and output_filename are assumed to be defined elsewhere
input_file = open(input_filename, "r")
output_file = open(output_filename, "w")

header = input_file.readline()
for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]

    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []
    current_data.append((index, change))

# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)

input_file.close()
output_file.close()
I don't totally understand the logic you've explained in #1, so I left that blank; a possible sketch of it is shown below. I also feel you probably want to do step #2 first, as that will quickly rule out entire names.
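For completeness, here is a hedged sketch of what that step #1 filter might look like, assuming the intent is: count the QQ and LL rows for a name, and if one kind outnumbers the other by at least 2, drop the outnumbered kind. The filter_changes name is mine, not from the question; process_name could call it on data before building group_by.

def filter_changes(data):
    # data is the list of (index, change) pairs collected for one name
    qq = sum(1 for _, change in data if change == "QQ")
    ll = sum(1 for _, change in data if change == "LL")
    if qq - ll >= 2:
        # QQ overwhelms LL: ignore the LL rows for this name from now on
        return [(index, change) for index, change in data if change != "LL"]
    if ll - qq >= 2:
        # LL overwhelms QQ: ignore the QQ rows instead
        return [(index, change) for index, change in data if change != "QQ"]
    return data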
Since your file is sorted and you only seem to be operating on the sub segments by name, perhaps just use Python's groupby and create a table for each name segment as you go:
from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {k: [] for k in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)

        seg_fram = pd.DataFrame.from_dict(seg_data)
        print k
        print seg_fram
        print
Prints:
A
change index name
0 QQ Q A
1 QQ Q A
2 QQ Q A
B
change index name
0 LL L B
C
change index name
0 QQ Q C
1 LL L C
2 LL LL C
3 QQ Q C
4 LL L C
5 LL LL C
Then the largest piece of memory you will have will be dictated by the largest contiguous group and not the size of the file.
You can use about half the memory of that method by appending to the DataFrame row by row instead of building the intermediate dict:
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            df = pd.DataFrame({k: v for k, v in zip(header, e.split())}, index=[idx])
            seg_fram = seg_fram.append(df)
(might be slower though...)
If that does not work, consider using a disk database.
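As one possible shape of that last suggestion (a sketch only; the database path and table layout are assumptions), the rows could be streamed once into an on-disk SQLite database and then pulled back one name at a time:

import sqlite3

conn = sqlite3.connect('/tmp/so.db')   # on disk, not in memory
conn.execute("CREATE TABLE IF NOT EXISTS data (name TEXT, idx TEXT, change TEXT)")

with open('/tmp/so.csv') as f:
    next(f)   # skip the header row
    conn.executemany("INSERT INTO data VALUES (?, ?, ?)",
                     (line.split() for line in f))
conn.commit()

# work on one name at a time without holding the whole file in memory
for (name,) in conn.execute("SELECT DISTINCT name FROM data"):
    rows = conn.execute("SELECT idx, change FROM data WHERE name = ?", (name,)).fetchall()
    # apply the two filtering criteria to rows here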
I want to write the values of the list only to column A of the new workbook, for example:
a1 = 1
a2 = 2
a3 = 3
etc. etc. but right now I get this:
a1 = 1 b1 = 2 c1= 3 d1= 4
a1 = 1 b1 = 2 c1= 3 d1= 4
a1 = 1 b1 = 2 c1= 3 d1= 4
My code:
# create a new workbook and worksheet
from openpyxl import Workbook

values = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
wb = Workbook(write_only=True)
ws = wb.create_sheet()

for row in range(0, len(values)):
    ws.append([i for i in values])

wb.save('newfile.xlsx')
The code above fills every cell in the range A1:O15; I only want to fill the values into column A (A1:A15).
Not yet tested, but I would think you need to append a one-element list per value, so that each value lands in its own row of column A. Tested: the following works.
for row in range(len(values)):
    ws.append([values[row]])
You need to create a nested list to append the values into a column; refer to the code below.
from openpyxl import Workbook

values = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
wb = Workbook()
ws = wb.create_sheet()

newlist = [[i] for i in values]
print(newlist)

for x in newlist:
    ws.append(x)

wb.save('newfile.xlsx')
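If you want to keep the write-only workbook from the question, the same one-element-list idea applies; here is a sketch, assuming openpyxl's write-only mode as in the original code:

from openpyxl import Workbook

values = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
wb = Workbook(write_only=True)   # write-only workbooks start with no sheet at all
ws = wb.create_sheet()

for v in values:
    ws.append([v])   # a one-element list puts one value per row, all in column A

wb.save('newfile.xlsx')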
When I create a random List of numbers like so:
import random

columns = 10
rows = 10
for x in range(rows):
    a_list = []
    for i in range(columns):
        a_list.append(str(random.randint(1000000,99999999)))
    values = ",".join(str(i) for i in a_list)
    print values
then all is well.
But when I attempt to send the output to a file, like so:
import sys

sys.stdout = open('random_num.csv', 'w')
for i in a_list:
    print ", ".join(map(str, a_list))
only the last row is output, 10 times over. How do I write the entire list to a .csv file?
In your first example, you're creating a new list for every row. (By the way, you don't need to convert them to strs twice).
In your second example, you print the last list you had created previously. Move the output into the first loop:
import random

columns = 10
rows = 10
with open("random_num.csv", "w") as outfile:
    for x in range(rows):
        a_list = [random.randint(1000000,99999999) for i in range(columns)]
        values = ",".join(str(i) for i in a_list)
        print values
        outfile.write(values + "\n")
Tim's answer works well, but I think you are trying to print to the terminal and to the file in different places. So, with minimal modifications to your code, you can use a new variable all_list:
import random
import sys

all_list = []
columns = 10
rows = 10
for x in range(rows):
    a_list = []
    for i in range(columns):
        a_list.append(str(random.randint(1000000,99999999)))
    values = ",".join(str(i) for i in a_list)
    print values
    all_list.append(a_list)

sys.stdout = open('random_num.csv', 'w')
for a_list in all_list:
    print ", ".join(map(str, a_list))
The csv module takes care of a lot of the fiddly details of dealing with CSV files.
As you can see below, you don't need to worry about conversion to strings or adding line endings.
import csv
import random

columns = 10
rows = 10
with open("random_num.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    for x in range(rows):
        a_list = [random.randint(1000000,99999999) for i in range(columns)]
        writer.writerow(a_list)