Effective ways of accessing Excel files with Python - python

(Before I go ahead and ask this question, please understand that I have done research; this is just to fill in holes in my information.)
I have a standard Excel spreadsheet (.xls) that contains one table with the following info in it:
Now what I would like to achieve is to translate this .xls file directly into a data type that can be stored in memory, such as a dictionary, so that the Python application I'm writing can access the information accordingly.
I have read up a fair bit on this, but my coding experience isn't 100%, as it has been a while.

You can use the pandas library, an excellent library for Excel manipulation.
import pandas as pd

# the second argument selects the sheet to read
data_frame = pd.read_excel("path_to_excel", "sheet_name")
data_frame is like a table or matrix that holds your data, and you can manipulate it very easily.
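Since you mentioned wanting a dictionary, note that the data frame converts directly; a minimal sketch using to_dict (orient="records" gives one dict per row):
import pandas as pd

data_frame = pd.read_excel("path_to_excel", "sheet_name")
# one dictionary per row, keyed by column name
records = data_frame.to_dict(orient="records")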

So what you want should be possible with the csv module, assuming you convert your .xls to a .csv (just Save As... in Excel).
Like so:
import csv

with open('filepath.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['A'], row['B'])
DictReader takes the first row, and assumes those are the dictionary keys. Then it turns each row into a dictionary where you can access the values using the keys defined in the first row.
If you don't want it to be an actual dict, you can just use csv.reader(f) which allows you to access the rows using list indexing (the above example would end with print(row[0], row[1])).
This all has the nice bonus of using only the standard library, with no third-party imports, so it will run on any machine with Python.
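And if the goal is one big in-memory dictionary keyed by a column (assuming here a column named 'A' with unique values), a small variation on the above:
import csv

with open('filepath.csv', 'r') as f:
    # map each row's 'A' value to that row's full dictionary
    lookup = {row['A']: row for row in csv.DictReader(f)}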

You can use xlrd to loop through the Excel file and build a dictionary, as you suggested.
A better alternative would be pandas, which reads your Excel sheet as a table called a data frame. You can access any cell, row or column of this data frame.
E.g., say you have:
X Y
0 0.213784 0.461443
1 0.703082 0.600445
2 0.111101 0.648624
3 0.101367 0.924729
>>> import pandas as pd
>>> df = pd.read_excel(filename)
>>> df["X"]
0 0.213784
1 0.703082
2 0.111101
3 0.101367
>>> df["Y"]
0 0.461443
1 0.600445
2 0.648624
3 0.924729
>>> df["X"][0]
0.21378370373100195
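Whole rows work the same way; for example, continuing the session above:
>>> df.loc[0]  # the first row as a Series
X    0.213784
Y    0.461443
Name: 0, dtype: float64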

Related

What is the fastest way to retrieve header names from Excel files using pandas

I have some large Excel files, and I'm organizing their column names into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd

get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large Excel file is very inefficient.
You can use openpyxl for this instead:
from openpyxl import load_workbook

wb = load_workbook(r"E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in wb.worksheets:
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        columns[sheet.title] = value
columns then maps each sheet title to a tuple of that sheet's column names; assuming you only have one sheet, you will get a single entry here.
If you want faster reading, then I suggest you use another file type. Excel files, while convenient, are not plain text (an .xls is binary and an .xlsx is a zip archive), so for pandas to read and correctly parse one it must process the full file. Using nrows or skipfooter to work with less data only takes effect after the full file is loaded, so it shouldn't really affect the waiting time. By contrast, when working with a .csv file, given that it is plain text with no significant metadata, you can extract just the first rows of it as an iterable using the chunksize parameter in pd.read_csv().
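As a sketch of that approach, assuming a hypothetical CSV export of the same data at E:\DATA\dbo.csv:
import pandas as pd

# with chunksize, read_csv returns an iterator of DataFrames
reader = pd.read_csv(r"E:\DATA\dbo.csv", chunksize=1)  # hypothetical CSV version of the .xlsx above
first_chunk = next(iter(reader))
get_col = list(first_chunk.columns)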
Other than that, calling list() on a dataframe already returns a list of its columns, so my only suggestion for the code you use is:
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl'))
The stronger suggestion is to change the file type if you specifically want to address this issue.

Replace a row in a pandas dataframe with values from dictionary

I am trying to populate an empty dataframe by using the csv module to iterate over a large tab-delimited file, replacing each row in the dataframe with these values. (Before you ask: yes, I have tried all the normal read_csv methods, and nothing has worked because of dtype issues and how large the file is.)
I first made an empty numpy array using np.empty, with the dimensions of my data, and then converted it to a pandas DataFrame. Then I did the following:
with open(input_file) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    row_num = 0
    for row in reader:
        for key, value in row.items():
            df.loc[row_num, key] = value
        row_num += 1
This is working great, except that my file has 900,000 columns, so it is unbelievably slow. This also feels like something pandas could do more efficiently, but I've been unable to find out how. The dictionary for each row given by DictReader looks like:
{'columnName1':<value>,'columnName2':<value> ...}
Where the values are what I want to put in the dataframe in those columns for that row.
Thanks!
So what you could do in this case is to read your big csv file in smaller chunks. I had the same issue with a 32 GB CSV file, so I had to build chunks. After reading them in, you can work with them.
# read the large csv file with specified chunksize
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
chunksize=1000000 sets how many rows are read in at once.
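A quick sketch of actually consuming those chunks (the processing step is a placeholder to fill in):
import pandas as pd

chunks = []
for chunk in pd.read_csv(r'../input/data.csv', chunksize=1000000):
    # filter or transform each chunk here, keeping only what you need
    chunks.append(chunk)

# stitch the processed chunks back into a single dataframe
df = pd.concat(chunks, ignore_index=True)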
Helpful website:
https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c

How to perform a check on a dataframe at the time of importing it using read_csv?

I am trying to import a .csv file using pandas in Python, with pandas.read_csv. But I have a requirement to check each row in the dataframe and take the values of two specific columns into an array. As my dataframe has almost 3 million rows (~1 GB), doing this iteratively after the import takes time. Can I do the check while importing the file itself? Is it a good idea to modify the read_csv library function to accommodate this?
df = pd.read_csv("file.csv")

def get():
    for a in list_A:  # this list is of size ~2300
        for b in list_B:  # this list is of size ~12000
            if a row exists such that it has a, b:
                # do something
Due to the very large size of the lists, this function runs slowly. Also, querying a dataframe of this size slows down execution. Any suggestions/solutions to improve the performance?
Python's built-in csv module reads the file line by line instead of loading it fully into memory.
The code would look something like this:
import csv

with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        # positions 1 and 3 assume where your two columns sit; adjust as needed
        if row[1] in list_A and row[3] in list_B:
            ...  # do something with the row
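One extra note: in on a plain list scans the whole list each time, so with lists this large it should be much faster to convert them to sets first; a sketch:
import csv

set_A = set(list_A)  # set membership is O(1); list membership is O(n)
set_B = set(list_B)

with open('file.csv') as csvfile:
    for row in csv.reader(csvfile):
        if row[1] in set_A and row[3] in set_B:
            ...  # do something with the row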

Methods of reading in and creating a list/array with excel information from excel

Imagine I am given two columns: a,b,c,d,e,f and e,f,g,h,i,j (commas indicating a new row in the column)
How could I read in such information from Excel and put it into two separate arrays? I would like to manipulate this info and read it off later as part of an output.
You have a few choices here. If your data is rectangular, starts from A1, etc., you should just use pandas.read_excel:
import pandas as pd

df = pd.read_excel("/path/to/excel/file", sheet_name="My Sheet Name")
print(df["column1"].values)
print(df["column2"].values)
If your data is a little messier and the read_excel options aren't enough to get at it, then I think your only choice is to use something a little lower level, like the fantastic xlrd module (read the quickstart in the README).
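For completeness, a minimal xlrd sketch, assuming an .xls file (xlrd 2.x no longer reads .xlsx) with your two columns in A and B of the first sheet:
import xlrd

book = xlrd.open_workbook("/path/to/excel/file.xls")
sheet = book.sheet_by_index(0)

# each column as a plain Python list
first_col = sheet.col_values(0)   # a, b, c, d, e, f
second_col = sheet.col_values(1)  # e, f, g, h, i, j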

Reading a specific row/column from an Excel CSV file

I am a beginner at Python, and I'm looking to take 3 specific columns, starting at a certain row, from a .csv spreadsheet and then import each into Python.
For example, I would need to take 1000 rows' worth of data from column F, starting at row 12.
I've looked at options using csv and pandas, but I can't figure out how to have them start importing at a certain row/column.
Any help would be greatly appreciated.
If the spreadsheet is not huge, the easiest approach is to load the entire CSV file into Python using the csv module and then extract the required rows and columns. For example:
import csv

with open('Book1.csv', newline='') as f:
    rows = list(csv.reader(f))
data = [row[5] for row in rows[11:11+1000]]
will do the trick. Remember that Python starts numbering from 0, so row[5] is column F from your spreadsheet and rows[11] is row 12.
CSV files are plain text files, so there is no way to jump straight to a given line; you will have to read line by line and count. Have a look at the csv module in Python's standard library, which explains how to (easily) read lines.
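If the file is too large to hold in memory all at once, you can still skip-and-count lazily; a sketch with itertools.islice, using the same row/column numbers as the answer above:
import csv
from itertools import islice

with open('Book1.csv', newline='') as f:
    reader = csv.reader(f)
    # skip the first 11 rows, then take the next 1000 (rows 12 through 1011)
    data = [row[5] for row in islice(reader, 11, 11 + 1000)]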
