I want to import one column with 10 rows into Python as a list.
So in Excel I have, for example: One, Two, Three, Four, ..., Ten
Everything is written in column A across rows 1-10.
Now I want to import these cells into Python, so that my result is:
list = ['One', 'Two', 'Three', 'Four', ..., 'Ten']
Since I am a total noob at programming, I have no clue how to do it, so please tell me the easiest way. None of the tutorials I have found got me the result I want.
Thank you
I am using Python 2.7
Even though pandas is a great library, for your simple task you can just use xlrd:
import xlrd
wb = xlrd.open_workbook(path_to_my_workbook)  # path to your Excel file
ws = wb.sheet_by_index(0)  # first sheet in the workbook
mylist = ws.col_values(0)  # all values in column A, as a list
Note that list is not a good name for a variable in Python, because it shadows the name of the built-in type.
I am unsure whether your data is in XLSX or CSV form. If XLSX, use this Python Excel tutorial. If CSV, it is much easier, and you can follow the code snippet below. If you don't want to use pandas, you can use the numpy library. The example snippet below reads a single-column CSV file into a list:
import numpy as np
csv_file = np.genfromtxt('filepath/relative/to/your/script.csv',
                         delimiter=',', dtype=str)
column = csv_file.tolist()  # a one-column file yields a 1-D array
This works for a file that has only one column of text. If you have more columns, use the following snippet to get just the first column; the 0 indexes the first column.
column = csv_file[:, 0].tolist()
I recommend installing pandas.
pip install pandas
and
import pandas
df = pandas.read_excel('path/to/data.xlsx')  # the options of that method are quite neat; returns a pandas.DataFrame
print df.head()  # show a preview of the loaded data
idx_of_column = 5 - 1  # in case the column of interest is the 5th in Excel
print list(df.iloc[:, idx_of_column])  # access via integer index
print df.loc[['my_row_1', 'my_row_2'], ['my_column_1', 'my_column_2']]  # access elements via row and column labels
print list(df['my_column_1'])  # straightforward access via column name
(check out the pandas docs)
or
pip install xlrd
and this code:
from xlrd import open_workbook

wb = open_workbook('simple.xls')
for s in wb.sheets():
    print 'Sheet:', s.name
    for row in range(s.nrows):
        values = []
        for col in range(s.ncols):
            values.append(str(s.cell(row, col).value))  # str() so join also works on numeric cells
        print ','.join(values)
(example from https://github.com/python-excel/tutorial/raw/master/python-excel.pdf)
Related
I've done research and can't find anything that has solved my issue. I need a Python script that reads CSV files from a folder path, checks for empty cells within a column, and then displays a popup notifying users of the empty cells. Anything helps!!
Use the pandas library
pip install pandas
You can import each CSV file as a DataFrame and check the cells with loops; a rough sketch follows.
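A minimal sketch of that idea, assuming tkinter is available for the popup (the folder path is a placeholder):
import glob
import pandas as pd
import tkinter as tk
from tkinter import messagebox

problems = []
for path in glob.glob('path/to/folder/*.csv'):  # placeholder folder path
    df = pd.read_csv(path)
    for column in df.columns:
        empty_rows = df.index[df[column].isnull()].tolist()
        if empty_rows:
            problems.append('%s: %r has empty cells at rows %s' % (path, column, empty_rows))
if problems:
    root = tk.Tk()
    root.withdraw()  # hide the bare root window, show only the popup
    messagebox.showwarning('Empty cells found', '\n'.join(problems))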
A simple example using the Python csv module:
cat test.csv
1,,test
2,dog,cat
3,foo,
import csv
with open('test.csv') as csv_file:
    empty_list = []
    c_reader = csv.DictReader(csv_file, fieldnames=["id", "fld_1", "fld_2"])
    for row in c_reader:
        # map the row's id to the name of any empty field in that row
        row_dict = {row["id"]: item for item in row if not row[item]}
        if row_dict:
            empty_list.append(row_dict)
empty_list
[{'1': 'fld_1'}, {'3': 'fld_2'}]
This example assumes that there is at least one column that will always have a value and is the equivalent of a primary key. You have not mentioned what client you will be running this in. So it is not possible at this time to come up with a code example that presents this to the user for action.
Hello, I think it's quite easy to solve with pandas:
import numpy as np
import pandas as pd
df = pd.read_csv('<path>')
df.describe()  # to just see empty stuff
np.where(pd.isnull(df))  # to show indexes of empty cells -> from https://stackoverflow.com/questions/27159189/find-empty-or-nan-entry-in-pandas-dataframe
Alternatively, you can read the file and check line by line for empty cells; a rough sketch follows.
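A minimal line-by-line version, assuming a comma-delimited file:
with open('test.csv') as f:
    for line_no, line in enumerate(f, start=1):
        for col_no, cell in enumerate(line.rstrip('\n').split(','), start=1):
            if not cell.strip():
                print('empty cell at line %d, column %d' % (line_no, col_no))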
I have a workbook in Excel and I need to find the first column that is empty / has no data in it. I need to keep Excel open at all times, so something like openpyxl won't do.
Here's my code so far:
import xlwings as xw
from pathlib import Path
wbPath = Path('test.xlsx')
wb = xw.Book(wbPath)
sourceSheet = wb.sheets['source']
This can be done using
sourceSheet["A1"].expand("right").last_cell.column
which returns the last used column in row 1; the first empty column is that value plus one.
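A runnable sketch of that idea, continuing from the code in the question:
import xlwings as xw
from pathlib import Path

wbPath = Path('test.xlsx')
wb = xw.Book(wbPath)
sourceSheet = wb.sheets['source']
last_used = sourceSheet["A1"].expand("right").last_cell.column
first_empty = last_used + 1  # first column in row 1 with no data
print(first_empty)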
Depending on what you need exactly, the following code might be the most robust. Using used_range, it gives you, as an integer, the first empty column past the very end of the data, regardless of any empty/blank columns before the last column with data.
a_rng = sourceSheet.used_range[-1].offset(column_offset=1).column
print(a_rng)
There are multiple ways to read Excel data into Python.
Pandas also provides an API for reading and writing:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

df = pd.read_excel('File.xlsx', sheet_name='Sheet1')  # 'sheetname' in older pandas versions
That works fine.
BUT: what is the way to read the tables of every sheet directly into a pandas DataFrame?
The sheet in question includes a table that does not start at cell (1,1).
Moreover, the sheet might include several tables (ListObjects in VBA).
I cannot find anywhere a way to read them into pandas.
Note 1: It is not possible to modify the workbook to bring all the tables to cell (1,1).
Note 2: I would like to use just pandas (if possible) and minimize the need to import other libraries. But if there is no other way, I am ready to use another library. In any case, I could not manage it with xlwings, for instance.
Here it looks like it is possible to parse the Excel file, but no solution is provided for tables, just for complete sheets.
The documentation of pandas does not seem to offer that possibility.
Thanks.
You can use xlwings, a great package for working with Excel files in Python.
This is for a single table, but it is pretty trivial to use the xlwings collections (App > books > sheets > tables) to iterate over all tables. Tables are of course ListObjects.
import xlwings
import pandas

with xlwings.App() as App:
    _ = App.books.open('my.xlsx')
    rng = App.books['my.xlsx'].sheets['mysheet'].tables['mytablename'].range
    df: pandas.DataFrame = rng.expand().options(pandas.DataFrame).value
I understand that this question has been marked solved already, but I found an article that provides a much more robust solution:
Full Post
I suppose a newer version of this library supports better visibility of the workbook structure. Here is a summary:
Load the workbook using the load_workbook function from openpyxl.
Then you are able to access the sheets within, each of which contains the collection of ListObjects (tables) in Excel.
Once you gain access to the tables, you are able to get the range addresses of those tables.
Finally, loop through the ranges and create a pandas DataFrame from each.
This is a nicer solution as it gives us the ability to loop through all the sheets and tables in a workbook.
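A sketch of those steps, assuming a recent openpyxl (where ws.tables.items() yields (name, range) pairs) and a placeholder file name:
import openpyxl
import pandas as pd

wb = openpyxl.load_workbook('my.xlsx', data_only=True)  # data_only: values instead of formulas
for ws in wb.worksheets:
    for name, ref in ws.tables.items():  # e.g. ('Table1', 'B3:I10')
        rows = [[cell.value for cell in row] for row in ws[ref]]
        df = pd.DataFrame(rows[1:], columns=rows[0])  # a ListObject's first row is its header
        print(name)
        print(df)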
Here is a way to parse one table; however, it needs you to know some information about the sheet being parsed.
import pandas as pd

df = pd.read_excel("file.xlsx", usecols="B:I", index_col=3)
print(df)
Not elegant, and it works only if one table is present inside the sheet, but it's a first step:
import pandas as pd
import string

letter = list(string.ascii_uppercase)  # assumes at most 26 columns
df1 = pd.read_excel("file.xlsx")

def get_start_column(df):
    for i, column in enumerate(df.columns):
        if df[column].first_valid_index() is not None:  # 'is not None' so row 0 also counts
            return letter[i]

def get_last_column(df):
    columns = df.columns
    len_column = len(columns)
    for i, column in enumerate(columns):
        if df[column].first_valid_index() is not None:
            return letter[len_column - i]

def get_first_row(df):
    for index, row in df.iterrows():
        if not row.isnull().values.all():
            return index + 1

def usecols(df):
    start = get_start_column(df)
    end = get_last_column(df)
    return f"{start}:{end}"

df = pd.read_excel("file.xlsx", usecols=usecols(df1), header=get_first_row(df1))
print(df)
(Before I go ahead and ask this question, please understand that I have done research; this is just to fill in holes in my information.)
I have a standard Excel spreadsheet (.xls) that contains one table.
What I would like to achieve is to translate this .xls file directly into a data type, such as a dictionary, that the Python application I'm writing can keep in memory and access accordingly.
I have read up a fair bit on this, but my experience in coding isn't 100%, as it has been a while.
You can use the pandas library, an excellent library for Excel manipulation.
import pandas as pd
data_frame = pd.read_excel("path_to_excel", "sheet_name")
data_frame is like a table or matrix that holds your data; you can manipulate it really easily.
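If you specifically want a dictionary in memory, the DataFrame converts directly; a small sketch of two of the orientations that to_dict supports (the column names in the comments are made up):
records = data_frame.to_dict('records')  # one dict per row: [{'col_a': 1, 'col_b': 2}, ...]
by_column = data_frame.to_dict('list')   # one list per column: {'col_a': [1, ...], 'col_b': [2, ...]}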
So what you want should be possible with the csv module, assuming you convert your xls to a csv (just save as...).
Like so:
import csv
with open('filepath.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['A'], row['B'])
DictReader takes the first row, and assumes those are the dictionary keys. Then it turns each row into a dictionary where you can access the values using the keys defined in the first row.
If you don't want it to be an actual dict, you can just use csv.reader(f), which lets you access the rows using list indexing (the above example would end with print(row[0], row[1])); see the sketch after the next paragraph.
This all has the nice bonus of being able to use the standard library without any 3rd party imports - so will run on any machine with Python.
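For completeness, a sketch of that csv.reader variant:
import csv
with open('filepath.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)  # grab (and skip past) the header row
    for row in reader:
        print(row[0], row[1])  # access by position instead of by key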
You can use xlrd to loop through the Excel file and create a dictionary, as you suggested; a rough version is sketched at the end of this answer.
A better alternative would be pandas, which reads your Excel file as a table, called a data frame. You can access any cell, row or column of this data frame.
E.g., you have:
X Y
0 0.213784 0.461443
1 0.703082 0.600445
2 0.111101 0.648624
3 0.101367 0.924729
>>> import pandas as pd
>>> df = pd.read_excel(filename)
>>> df["X"]
0 0.213784
1 0.703082
2 0.111101
3 0.101367
>>> df["Y"]
0 0.461443
1 0.600445
2 0.648624
3 0.924729
>>> df["X"][0]
0.21378370373100195
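And a minimal sketch of the xlrd dictionary idea mentioned at the top of this answer (the file name is a placeholder; the header row supplies the keys, each later row becomes one record):
import xlrd
wb = xlrd.open_workbook('data.xls')
ws = wb.sheet_by_index(0)
header = ws.row_values(0)
records = [dict(zip(header, ws.row_values(r))) for r in range(1, ws.nrows)]
print(records[0])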
I'm trying to use Python to manipulate data in large txt files.
I have a txt file with more than 2000 columns, and about a third of these have a title which contains the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestions on how I can do that?
I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.
EDIT 1: Thank you all! At the moment of writing 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer so I didn't check for a day or two, and was happily surprised by this. I'm very impressed.
EDIT 2: I've added a picture that shows what a part of the original txt-file can look like, in case it will help anyone in the future:
One way of doing this, without the installation of third-party modules like numpy/pandas, is as follows. Given an input file, called "input.csv" like this:
a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1
The following code does what you want.
import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

# Instantiate a CSV reader, check if you have the appropriate delimiter
reader = csv.reader(open(input_filename), delimiter=',')

# Get the first row (assuming this row contains the header)
input_header = next(reader)

# Filter out the columns that you want to keep by storing the column index
columns_to_keep = []
for i, name in enumerate(input_header):
    if 'net' in name.lower():  # case-insensitive, so 'Net' matches too
        columns_to_keep.append(i)

# Create a CSV writer to store the columns you want to keep
writer = csv.writer(open(output_filename, 'w'), delimiter=',')

# Construct the header of the output file
output_header = []
for column_index in columns_to_keep:
    output_header.append(input_header[column_index])

# Write the header to the output file
writer.writerow(output_header)

# Iterate over the remainder of the input file, construct a row
# with the columns you want to keep and write it to the output file
for row in reader:
    new_row = []
    for column_index in columns_to_keep:
        new_row.append(row[column_index])
    writer.writerow(new_row)
Note that there is no error handling. There are at least two cases that should be handled. The first is checking for the existence of the input file (hint: check the functionality provided by the os and os.path modules). The second is handling blank lines or lines with an inconsistent number of columns. Both are sketched below.
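A rough sketch of those two checks, reusing names from the snippet above (the column indices are what the sample input would produce):
import os.path

input_filename = 'input.csv'
if not os.path.isfile(input_filename):
    raise SystemExit('Input file not found: %s' % input_filename)

columns_to_keep = [2, 4]  # what the loop above computes for the sample header
with open(input_filename) as f:
    for line in f:
        row = line.rstrip('\n').split(',')
        if not any(row):  # blank line: nothing to keep
            continue
        if len(row) <= max(columns_to_keep):  # inconsistent number of columns
            continue
        print([row[i] for i in columns_to_keep])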
This could be done, for instance, with pandas:
import pandas as pd
df = pd.read_csv('path_to_file.txt', sep='\s+')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col.lower()]  # case-insensitive match
df_filtered = df[selected_columns]
df_filtered.to_csv('new_file.txt')
Of course, since we don't have the structure of your text file, you would have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).
This will load the whole file into memory and then filter out the unnecessary columns. If your file is so large that it cannot be loaded into RAM at once, there is a way to load only specific columns with the usecols argument, sketched below.
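A sketch of that approach: read just the header first, pick the matching names, then hand them to usecols (which also accepts a callable that filters column names):
import pandas as pd

header = pd.read_csv('path_to_file.txt', sep=r'\s+', nrows=0)  # header row only, no data
net_columns = [col for col in header.columns if 'net' in col.lower()]
df_filtered = pd.read_csv('path_to_file.txt', sep=r'\s+', usecols=net_columns)
df_filtered.to_csv('new_file.txt', index=False)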
You can use the pandas filter function to select a few columns based on a regex. Since the column titles contain 'Net' with a capital N, match case-insensitively:
data_filtered = data.filter(regex='(?i)net')