Help with Excel, Python and XLRD


I'm relatively new to programming, which is why I've chosen Python to learn.
At the moment I'm attempting to read a list of usernames and passwords from an Excel spreadsheet with xlrd and use them to log in to something, then log out, move to the next row, log in again, and so on.
Here is a snippet of the code:
import xlrd
wb = xlrd.open_workbook('test_spreadsheet.xls')
# Load XLRD Excel Reader
sheetname = wb.sheet_names() #Read for XCL Sheet names
sh1 = wb.sheet_by_index(0) #Login
def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID
print readRows()
I've gotten the variables out and it reads all of them in one shot. Here is where my lack of programming skills comes into play: I know I need to iterate through these and do something with them, but I'm kind of lost on what the best practice is. Any insight would be great.
Thank you again

A couple of pointers:
I'd suggest you not print a function that has no return value (that just prints None); instead simply call it, or return something to print.
def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID
readRows()
Or, looking at the docs, you can take a slice from row_values:
row_values(rowx, start_colx=0, end_colx=None)
Returns a slice of the values of the cells in the given row.
Because you just want the columns with index 4 to 6 (end_colx is exclusive, like any slice end, so pass 7):
def readRows():
    # using list comprehension; each entry holds columns 4, 5 and 6 of one row
    return [ sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows) ]
print readRows()
Using the second method you get a list as the return value of your function, so you can assign all of the data read from the Excel file to a variable. It is actually a list of lists, one inner list of values per row.
L1 = readRows()
for row in L1:
    print row[0], row[1], row[2]
After you have your data, you can work with it by iterating through the list, much like the print example above.
def login(name, password, id):
    # do stuff with name password and id passed into method
    ...
for row in L1:
    login(*row)  # unpack the three values of the row into the three arguments
You may also want to look into different data structures for storing your data. If you need to find a user by name, a dictionary is probably your best bet:
def readRows():
    rows = [ sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows) ]
    # each sliced row is [name, password, id], so index it from 0
    return dict([ [row[0], (row[1], row[2])] for row in rows ])
D1 = readRows()
print D1['Bob']
('sdafdfadf',23)
import pprint
pprint.pprint(D1)
{'Bob': ('sdafdfadf',23),
 'Cat': ('asdfa',24),
 'Dog': ('fadfasdf',24)}
One thing to note is that in Python 2 (and in Python 3 before 3.7) a dictionary does not guarantee any particular order for its entries.
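To tie the pointers above together, here is a minimal sketch of the "log in, log out, go to the next row" loop the question describes. It is written for Python 3 and uses a hard-coded list in place of the xlrd sheet; the sample values and the `login`, `logout` and `process_all` functions are placeholders invented for illustration, not part of xlrd or of the original code.

```python
# Stand-in for the list of lists that readRows() returns; the values are made up.
rows = [
    ["Bob", "hunter2", 23],
    ["Cat", "s3cret", 24],
]

def login(name, password, sup_id):
    # Placeholder: a real script would open a session here.
    return "logged in %s (supervisor %s)" % (name, sup_id)

def logout(name):
    # Placeholder: a real script would close the session here.
    return "logged out %s" % name

def process_all(rows):
    """Log in and out once per row, collecting a message for each step."""
    log = []
    for name, password, sup_id in rows:  # unpack each three-item row
        log.append(login(name, password, sup_id))
        log.append(logout(name))
    return log

for line in process_all(rows):
    print(line)
```

Unpacking the row directly in the for statement avoids the rows[4], rows[5], rows[6] bookkeeping once the slice has already narrowed each row to three values.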

I'm not sure if you are intent on using xlrd, but you may want to check out PyWorkbooks (note: I am the author of PyWorkbooks :D).
from PyWorkbooks.ExWorkbook import ExWorkbook
B = ExWorkbook()
B.change_sheet(0)
# Note: it might be B[:1000, 3:6]. I can't remember if xlrd uses pythonic addressing (0 is first row)
data = B[:1000,4:7] # gets a generator, the '1000' is arbitrarily large.
def readRows():
    while True:
        try:
            userNm, Password, supID = data.next() # you could also do data[0]
            print userNm, Password, supID
            if userNm is None: break # when there is no more data it stops
        except IndexError:
            print 'list too long'
readRows()
You will find that this is significantly faster (and easier I hope) than anything you would have done. Your method will get an entire row, which could be a thousand elements long. I have written this to retrieve data as fast as possible (and included support for such things as numpy).
In your case, speed probably isn't as important. But in the future, it might be :D
Check it out. Documentation is available with the program for newbie users.
http://sourceforge.net/projects/pyworkbooks/

Seems to be good. With one remark: you should replace "rows" by "cells" because you actually read values from cells in every single row

Related

My CSV files are not being assigned to the correct Key in a dictionary

import csv
def read_prices(tikrList):
    #read each file and get the price list dictionary
    def getPriceDict():
        priceDict = {}
        TLL = len(tikrList)
        for x in range(0,TLL):
            with open(tikrList[x] + '.csv','r') as csvFile:
                csvReader = csv.reader(csvFile)
                for column in csvReader:
                    priceDict[column[0]] = float(column[1])
        return priceDict
    #populate the final dictionary with the price dictionary from the previous function
    def popDict():
        combDict = {}
        TLL = len(tikrList)
        for x in range(0,TLL):
            for y in tikrList:
                combDict[y] = getPriceDict()
        return combDict
    return(popDict())
print(read_prices(['GOOG','XOM','FB']))
The problem is that in the final dictionary the keys GOOG, XOM and FB all hold the values from the FB file only, as you can see in this output:
{'GOOG': {'2015-12-31': 104.660004, '2015-12-30': 106.220001},
 'XOM': {'2015-12-31': 104.660004, '2015-12-30': 106.220001},
 'FB': {'2015-12-31': 104.660004, '2015-12-30': 106.220001}}
I have 3 different CSV files, but every key ends up with the contents of the FB file.
I apologize ahead of time if my code is not easy to read or doesn't make sense. I think there is an issue with storing the values and returning priceDict in the getPriceDict function, but I can't seem to figure it out.
Any help is appreciated, thank you!

Since this is classwork I won't provide a solution but I'll point a few things out.
You have defined three functions - two are defined inside the third. While structuring functions like that can make sense for some problems/solutions I don't see any benefit in your solution. It seems to make it more complicated.
The two inner functions don't have any parameters, you might want to refactor them so that when they are called you pass them the information they need. One advantage of a function is to encapsulate an idea/process into a self-contained code block that doesn't rely on resources external to itself. This makes it easy to test so you know that the function works and you can concentrate on other parts of the code.
This piece of your code doesn't make much sense - it never uses x from the outer loop:
...
for x in range(0,TLL):
    for y in tikrList:
        combDict[y] = getPriceDict()
When you iterate over a list, the iteration visits the items themselves and stops after the last one, so there is no need to iterate over index numbers to access the items. Don't do for i in range(len(thelist)): print(thelist[i]); loop over the list directly:
>>> tikrList = ['GOOG','XOM','FB']
>>> for name in tikrList:
...     print(name)
...
GOOG
XOM
FB
>>>
When you read through a tutorial or the documentation, don't just look at the examples; read and understand the text.
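As an illustration of the point about passing a function the information it needs, here is a small sketch written for Python 3. It is not the solution to the assignment: `parse_prices` is a hypothetical helper that receives its input as an argument instead of reaching for an outer variable, and it assumes every row has exactly two columns (a date and a price).

```python
import csv
import io

def parse_prices(lines):
    """Build a {date: price} dict from an iterable of two-column CSV lines."""
    prices = {}
    for date, price in csv.reader(lines):
        prices[date] = float(price)
    return prices

# Because the input is a parameter, the function can be checked on its own,
# here with an in-memory file instead of a real one on disk.
sample = io.StringIO("2015-12-31,104.66\n2015-12-30,106.22\n")
print(parse_prices(sample))
```

An open file object can be passed in exactly the same way as the in-memory sample, which is what makes a function like this easy to test in isolation.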

Python - Get item from a list under a list

I have a list like below.
list = [[Name,ID,Age,mark,subject],[karan,2344,23,87,Bio],[karan,2344,23,87,Mat],[karan,2344,23,87,Eng]]
I need to get only the name 'Karan' as output.
How can I get that?

This is a 2D list,
list[i][j]
will give you the 'i'th list within your list and the 'j'th item within that list.
So to get karan you want list[1][0]

I upvoted Lio Elbammalf, but decided to provide an answer that makes a couple of assumptions that should have been clarified in the question:
1. The first item of the list is the headers; they are actually in the list (and not just there as part of the question), and they are provided as part of the list because there is no guarantee that the headers will always be in the same order.
2. This is probably a CSV file.
Ignoring 2 for the moment, what you would want to do is remove the "headers" from the list (so that the rest of the list is uniform), and then find the index of "Name" (your desired output).
myinput = [["Name","ID","Age","mark","subject"],
           ["karan",2344,23,87,"Bio"],
           ["karan",2344,23,87,"Mat"],
           ["karan",2344,23,87,"Eng"]]
def getnames(myinput):
    ## Remove the headers from the list to simplify everything
    headers = myinput.pop(0)
    ## Figure out where to find the person's Name
    nameindex = headers.index("Name")
    ## Return a list of the Name in each row
    return [stats[nameindex] for stats in myinput]
If the name is guaranteed to be the same in each row, then you can just return myinput[0][nameindex], as suggested in the other answer.
Now, if 2 is true, I'm assuming you're using the csv module, in which case load the file using the DictReader class and then just access each row using the 'Name' key:
import csv
def loadfile(myfile):
    with open(myfile) as f:
        reader = csv.DictReader(f)
        return list(reader)
def getname(rows):
    ## This is the same return as above, and again you can just
    ## return rows[0]['Name'] if you know you only need the first one
    return [row['Name'] for row in rows]

In Python 3 you can do this
_, [x, _, _, _, _], *_ = ls
Now x will be karan.
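For completeness, here is a runnable version of that unpacking, written for Python 3. The data is the list from the question with the strings quoted; the inner pattern uses `*_` instead of four separate underscores, which is a small variation on the line above so that the row does not have to contain exactly five items.

```python
ls = [["Name", "ID", "Age", "mark", "subject"],
      ["karan", 2344, 23, 87, "Bio"],
      ["karan", 2344, 23, 87, "Mat"],
      ["karan", 2344, 23, 87, "Eng"]]

# Skip the header row, take the first field of the next row, ignore the rest.
_, [x, *_], *_ = ls
print(x)  # karan
```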

Count and flag duplicates in a column in a csv

This type of question has been asked many times, so apologies; I have searched hard for an answer but have not found anything close enough to my needs, and as a total newbie I am not sufficiently advanced to customize an existing answer. Thanks in advance for any help.
Here's my query:
I have 30 or so csv files and each contains between 500 and 15,000 rows.
Within each of them (in the 1st column) - are rows of alphabetical IDs (some contain underscores and some also have numbers).
I don't care about the unique IDs - but I would like to identify the duplicate IDs and the number of times they appear in all the different csv files.
Ideally I'd like the output for each duped ID to appear in a new csv file and be listed in 2 columns ("ID", "times_seen")
It may be that I need to compile just 1 csv with all the IDs for your code to run properly - so please let me know if I need to do that
I am using python 2.7 (a crawling script that I run needs this version, apparently).
Thanks again

The easiest way to achieve what you want is probably to use a dictionary.
import csv
import os
# Assuming all your csv are in a single directory we will iterate on the
# files in this directory, selecting only those ending with .csv
# to list files in the directory we will use the walk function in the
# os module. os.walk(path_to_dir) returns a generator (a lazy iterator)
# this generator generates tuples of the form root_directory,
# list_of_directories, list_of_files.
# So: declare the generator
file_generator = os.walk("/path/to/csv/dir")
# get the first values, as we won't recurse in subdirectories, we
# only need this one
root_dir, list_of_dir, list_of_files = file_generator.next()
# Now, we only keep the files ending with .csv. Let me break that down
csv_list = []
for f in list_of_files:
    if f.endswith(".csv"):
        csv_list.append(f)
# That's what was contained in the line
# csv_list = [f for f in os.walk("/path/to/csv/dir").next()[2] if f.endswith(".csv")]
# The dictionary (key value map) that will contain the id count.
ref_count = {}
# We loop on all the csv filenames...
for csv_file in csv_list:
    # open the files in read mode; the names returned by os.walk are
    # relative to root_dir, so join the two to get a usable path
    with open(os.path.join(root_dir, csv_file), "r") as _:
        # build a csv reader around the file
        csv_reader = csv.reader(_)
        # loop on all the lines of the file, transformed to lists by the
        # csv reader
        for row in csv_reader:
            # If we haven't encountered this id yet, create
            # the corresponding entry in the dictionary.
            if row[0] not in ref_count:
                ref_count[row[0]] = 0
            # increment the number of occurrences associated with
            # this id
            ref_count[row[0]] += 1
# now write to csv output
with open("youroutput.csv", "w") as _:
    writer = csv.writer(_)
    for k, v in ref_count.iteritems():
        # as requested we only take duplicates
        if v > 1:
            # use the writer to write the list to the file
            # the delimiters will be added by it.
            writer.writerow([k, v])
You may need to tweak the csv reader and writer options a little to fit your needs, but this should do the trick. You'll find the documentation here: https://docs.python.org/2/library/csv.html. I haven't tested it though. Correcting any little mistakes that may have crept in is left as a practice exercise :).

That's rather easy to achieve. It would look something like:
import os
# Set to what kind of separator you have. '\t' for TAB
delimiter = ','
# Dictionary to keep count of ids
ids = {}
# Iterate over files in a dir
for in_file in os.listdir(os.curdir):
    # Check whether it is csv file (dummy way but it shall work for you)
    if in_file.endswith('.csv'):
        with open(in_file, 'r') as ifile:
            for line in ifile:
                my_id = line.strip().split(delimiter)[0]
                # If id does not exist in a dict = set count to 0
                if my_id not in ids:
                    ids[my_id] = 0
                # Increment the count
                ids[my_id] += 1
# saves ids and counts to a file
with open('ids_counts.csv', 'w') as ofile:
    for key, val in ids.iteritems():
        # write down counts to a file using same column delimiter
        ofile.write('{}{}{}\n'.format(key, delimiter, val))

Check out the pandas package. You can read and write csv files quite easily with it.
http://pandas.pydata.org/pandas-docs/stable/10min.html#csv
Then, when having the csv-content as a dataframe you convert it with the as_matrix function.
Use the answers to this question to get the duplicates as a list.
Find and list duplicates in a list?
I hope this helps

As you are a newbie, I'll try to give some directions instead of posting an answer, mainly because this is not a "code this for me" platform.
Python has a library called csv that lets you read data from CSV files (Boom! Surprised?). Start by reading one file, preferably an example file that you create with just 10 or so rows; then increase the number of rows, or use a for loop to iterate over different files. The examples at the bottom of the page that I linked will help you print this info.
As you will see, the output you get from this library is a list with all the elements of each row. Your next step should be extracting just the ID that you are interested in.
The next logical step is counting the number of appearances. There is a class in the standard library's collections module called Counter. It has a method called update that you can use as follows:
from collections import Counter
c = Counter()
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 1})
c['safddsfasdf'] # 1
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 2})
c['safddsfasdf'] # 2
c.update(['fdf'])
c # Counter({'safddsfasdf': 2, 'fdf': 1})
c['fdf'] # 1
So basically you have to pass it a list with the elements you want to count. You could put more than one ID in the list, for example reading 10 IDs before inserting them, for improved efficiency, but remember not to build a list of thousands of elements if you want good memory behaviour.
If you try this and get into some trouble come back and we will help further.
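As a small illustration of passing several IDs in one update call, the following Python 3 sketch uses made-up ID strings. It also shows that asking a Counter for an ID it has never seen returns 0 instead of raising an error.

```python
from collections import Counter

c = Counter()
# One update call with a small batch; the repeated ID is counted twice.
c.update(["id_a", "id_b", "id_a"])
c.update(["id_b"])
print(c["id_a"], c["id_b"], c["missing"])  # 2 2 0
```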
Edit
Spoiler alert: I decided to give a full answer to the problem. Please skip it if you want to find your own solution and learn Python in the process.
# The csv module will help us read and write to the files
from csv import reader, writer
# The collections module has a useful type called Counter that fulfills our needs
from collections import Counter
# Getting the names/paths of the files is not this question goal,
# so I'll just have them in a list
files = [
    "file_1.csv",
    "file_2.csv",
]
# The output file name/path will also be stored in a variable
output = "output.csv"
# We create the item that is gonna count for us
appearances = Counter()
# Now we will loop each file
for file in files:
    # We open the file in reading mode and get a handle
    with open(file, "r") as file_h:
        # We create a csv parser from the handle
        file_reader = reader(file_h)
        # Here you may need to do something if your first row is a header
        # We loop over all the rows
        for row in file_reader:
            # We insert the id into the counter
            appearances.update(row[:1])
            # row[:1] will get explained afterwards, it is the first column of the row in list form
# Now we will open/create the output file and get a handle
with open(output, "w") as file_h:
    # We create a csv writer for the handle, this time to write
    file_writer = writer(file_h)
    # If you want to insert a header to the output file this is the place
    # We loop through our Counter object to write them:
    # here we have different options, if you want them sorted
    # by number of appearances Counter.most_common() is your friend,
    # if you dont care about the order you can use the Counter object
    # as if it was a normal dict
    # Option 1: ordered
    for id_and_times in appearances.most_common():
        # id_and_times is a tuple with the id and the times it appears,
        # so we check the second element (they start at 0)
        if id_and_times[1] == 1:
            # As they are ordered, we can stop the loop when we reach
            # the first 1 to finish the earliest possible.
            break
        # As we have ended the loop if it appears once,
        # only duplicate IDs will reach to this point
        file_writer.writerow(id_and_times)
    # Option 2: unordered
    for id_and_times in appearances.iteritems():
        # This time we can not stop the loop as they are unordered,
        # so we must check them all
        if id_and_times[1] > 1:
            file_writer.writerow(id_and_times)
I offered 2 options: writing them ordered (based on the Counter.most_common() doc) and unordered (based on the normal dict method dict.iteritems()). Choose one, otherwise every duplicate is written twice. From a speed point of view I'm not sure which one would be faster: the first needs to sort the Counter but stops looping at the first non-duplicated element, while the second doesn't need to sort but has to loop over every ID. The speed will probably depend on your data.
About the row[:1] thingy:
row is a list
You can get a subset of a list telling the initial and final positions
In this case the initial position is omited, so it defaults to the start
The final position is 1, so just the first element gets selected
So the output is another list with just the first element
row[:1] == [row[0]]: they produce the same output; taking a slice that contains only the first element is the same as constructing a new list with only the first element
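The slicing behaviour described above can be checked directly. This short Python 3 sketch uses a made-up row; it also shows one difference worth knowing: on an empty row the slice quietly returns an empty list, whereas indexing with [0] would raise an IndexError.

```python
row = ["abc_123", "x", "y"]
print(row[:1])               # ['abc_123']
print(row[:1] == [row[0]])   # True

empty = []
print(empty[:1])             # []  (empty[0] would raise IndexError instead)
```

That is why update(row[:1]) copes with blank lines in the CSV file: an empty list simply adds nothing to the Counter.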

Is there a way to do it faster?

ladder has around 15,000 elements and this code snippet runs in 5-8 seconds. Is there any way to make it faster? I tried it without checking for duplicates and without building the accs list, and the time went down to 2-3 seconds, but I don't want duplicates in the CSV file.
I am using Python 2.7.9.
accs =[]
with codecs.open('test.csv','w', encoding="UTF-8") as out:
    row = ''
    for element in ladder:
        if element['account']['name'] not in accs:
            accs.append(element['account']['name'])
            row += element['account']['name']
            if 'twitch' in element['account']:
                row += "," + element['account']['twitch']['name'] + ","
            else:
                row += ",,"
            row += str(element['account']['challenges']['total']) + "\n"
    out.write(row)

seen = set()
results = []
for user in ladder:
    acc = user['account']
    name = acc['name']
    if name not in seen:
        seen.add(name)
        twitch_name = acc['twitch']['name'] if "twitch" in acc else ''
        challenges = acc['challenges']['total']
        results.append("%s,%s,%d" % (name, twitch_name, challenges))
with codecs.open('test.csv','w', encoding="UTF-8") as out:
    out.write("\n".join(results))

You can’t do much about the loop, since you need to go through every element in ladder after all. However, you can improve this membership test:
if element['account']['name'] not in accs:
Since accs is a list, this will essentially loop through all items of accs and check if the name is in there. And you loop for every element in ladder, so this can easily become very inefficient.
Use a set instead of a list for accs, as this gives you constant-time membership lookup on average. That reduces your algorithm from quadratic to linear complexity. For that, just use accs = set() and change your code to use accs.add() instead of append.
Another issue is that you are doing string concatenation. Every time you do someString + "something" you are throwing away that string object and creating a new one. This can become inefficient for a high number of operations too. Instead, use a list here to collect all the elements you want to write, and then join them:
row = []
row.append(element['account']['name'])
if 'twitch' in element['account']:
    row.append(element['account']['twitch']['name'])
else:
    row.append('')
row.append(str(element['account']['challenges']['total']))
out.write(','.join(row))
out.write('\n')
Alternatively, since you are writing to a file anyway, you could just call out.write multiple times with each string part.
Finally, you could also look into the csv module if you are interested in writing out CSV data.
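To show how the two suggestions combine with the csv module, here is a hedged sketch written for Python 3 (the question uses 2.7.9, where the file handling around csv differs). The `write_ladder` function and the sample data are made up for illustration; the data only mimics the shape that the question's code implies.

```python
import csv
import io

def write_ladder(ladder, out):
    """Write one CSV row per unique account name to the file-like object out."""
    seen = set()  # set membership tests take constant time on average
    writer = csv.writer(out, lineterminator="\n")
    for element in ladder:
        account = element["account"]
        name = account["name"]
        if name in seen:
            continue  # skip accounts that were already written
        seen.add(name)
        twitch = account["twitch"]["name"] if "twitch" in account else ""
        writer.writerow([name, twitch, account["challenges"]["total"]])

# Made-up sample data; the third entry repeats the first account name.
ladder = [
    {"account": {"name": "a", "twitch": {"name": "a_tv"}, "challenges": {"total": 3}}},
    {"account": {"name": "b", "challenges": {"total": 1}}},
    {"account": {"name": "a", "challenges": {"total": 9}}},
]
buf = io.StringIO()
write_ladder(ladder, buf)
print(buf.getvalue())
```

Letting csv.writer build each line also takes care of quoting, which matters if an account name ever contains a comma.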

Best way to parse a file with columns that randomly change order before importing it into SQL Server 2008?

I have a file that has columns that look like this:
Column1,Column2,Column3,Column4,Column5,Column6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
Column1,Column3,Column2,Column6,Column5,Column4
1,3,2,6,5,4
1,3,2,6,5,4
1,3,2,6,5,4
Column2,Column3,Column4,Column5,Column6,Column1
2,3,4,5,6,1
2,3,4,5,6,1
2,3,4,5,6,1
The columns randomly re-order in the middle of the file, and the only way to know the order is to look at the last set of headers right before the data (Column1,Column2, etc.) (I've also simplified the data so that it's easier to picture. In real life, there is no way to tell data apart as they are all large integer values that could really go into any column)
Obviously this isn't very SQL Server friendly when it comes to using BULK INSERT, so I need to find a way to arrange all of the columns in a consistent order that matches my table's column order in my SQL database. What's the best way to do this? I've heard Python is the language to use, but I have never worked with it. Any suggestions/sample scripts in any language are appreciated.

A solution in python:
I would read line-by-line and look for headers. When I find a header, I use it to figure out the order (somehow). Then I pass that order to itemgetter which will do the magic of reordering elements:
from operator import itemgetter
def header_parse(line,order_dict):
    header_info = line.split(',')
    indices = [None] * len(header_info)
    for i,col_name in enumerate(header_info):
        indices[order_dict[col_name]] = i
    return indices
def fix(fname,foutname):
    with open(fname) as f,open(foutname,'w') as fout:
        #Assume first line is a "header" and gives the order to use for the
        #rest of the file
        line = f.readline()
        order_dict = dict((name,i) for i,name in enumerate(line.strip().split(',')))
        reorder_magic = itemgetter(*header_parse(line.strip(),order_dict))
        for line in f:
            if line.startswith('Column'): #somehow determine if this is a "header"
                reorder_magic = itemgetter(*header_parse(line.strip(),order_dict))
            else:
                fout.write(','.join(reorder_magic(line.strip().split(','))) + '\n')
if __name__ == '__main__':
    import sys
    fix(sys.argv[1],sys.argv[2])
Now you can call it as:
python fixscript.py badfile goodfile

Since you didn't mention a specific problem, I'm going to assume you're having problems coming up with an algorithm.
For each row,
    Parse the row into fields.
    If it's the first header line,
        Output the header.
        Create a map of field names to position.
            %map = map { $fields[$_] => $_ } 0..$#fields;
        Create a map of original positions to new positions.
            @map = @map{ @fields };
    If it's a header line other than the first,
        Update map of original positions to new positions.
            @map = @map{ @fields };
    If it's not a header line,
        Reorder fields.
            @fields[ @map ] = @fields;
        Output the row.
(Snippets are in Perl.)
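The outline above could be sketched in Python 3 roughly as follows. The `reorder` function and its `is_header` argument are names chosen for this illustration; how a header line is recognised in the real data is left to the caller, since the question says data and headers cannot be told apart by value alone.

```python
def reorder(lines, is_header):
    """Yield rows with columns rearranged into the order of the first header."""
    target = None     # column name -> position in the first header
    positions = None  # current position -> target position
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if is_header(fields):
            if target is None:
                target = {name: i for i, name in enumerate(fields)}
                yield fields  # output the header once
            positions = [target[name] for name in fields]
        else:
            out = [None] * len(fields)
            for src, dst in enumerate(positions):
                out[dst] = fields[src]
            yield out

sample = [
    "Column1,Column2,Column3",
    "1,2,3",
    "Column3,Column1,Column2",
    "3,1,2",
]
for row in reorder(sample, lambda f: f[0].startswith("Column")):
    print(",".join(row))
```

With this sample both data rows come out as 1,2,3, and only the first header is written, which is the shape BULK INSERT needs.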

This can be fixed easily in two steps:
1. Split the file into multiple files whenever a new header starts.
2. Read each file using the csv DictReader, sort the keys and re-output the rows in the correct order.
Here is an example of how you can go about it:
def is_header(line):
    return line.find('Column') >= 0
def process(lines):
    headers = None
    for line in lines:
        line = line.strip()
        if is_header(line):
            headers = list(enumerate(line.split(",")))
            headers_map = dict(headers)
            headers.sort(key=lambda (i,v):headers_map[i])
            print ",".join([h for i,h in headers])
            continue
        values = list(enumerate(line.split(",")))
        values.sort(key=lambda (i,v):headers_map[i])
        print ",".join([v for i,v in values])
if __name__ == "__main__":
    import sys
    process(open(sys.argv[1]))
You can also change the is_header function so that it correctly identifies headers in your real data.
