I think I have a fairly simple question for a Python expert. After a lot of struggling I put together the code below. It opens an Excel file, transforms it into a list of lists, and appends a column to that list of lists. Now I want to rename and recalculate the values in this added column. How do I script it so that I always address the last column of a list of lists, even though the number of columns can differ?
import xlrd
file_location = "path"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
data = [x + [0] for x in data]
If you have a function called calculate_value that takes a row and returns the value for that row, you could do it like this:
def calculate_value(row):
    # calculate it...
    return value

def add_calculated_column(rows, func):
    result_rows = []
    for row in rows:
        # create a new row to avoid changing the old data
        new_row = row + [func(row)]
        result_rows.append(new_row)
    return result_rows
data_with_column = add_calculated_column(data, calculate_value)
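For instance, a hypothetical calculate_value (the summing rule here is just an illustration, not from the original post) that totals the numeric cells in a row:

def calculate_value(row):
    # sum only the numeric cells; strings and other types are skipped
    return sum(cell for cell in row if isinstance(cell, (int, float)))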
I found an easier and more flexible way of adjusting the values in the last column.
counter = 0
for row in data:
    counter = counter + 1
    if counter == 1:
        value = 'Vrspng'
    else:
        value = counter
    row.append(value)
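Since the original question was about always addressing the last column: a negative index does that regardless of how many columns a row has. The same loop as a sketch that overwrites the appended cell instead of appending another one:

for i, row in enumerate(data):
    row[-1] = 'Vrspng' if i == 0 else i + 1  # row[-1] is always the last column, however wide the row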
So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe, so now I have a list of 100 unique dataframes. Iterating through the list and writing to Excel just overwrites the data in the file; I want to append to the end of the .xlsx file. The biggest catch is that I can only use Excel 2010, as I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, so I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
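For reference, openpyxl keeps all changes in memory until the workbook is saved, so a missing save() call also looks like "nothing was appended". A sketch of the full append-and-save round trip, assuming outfile_path already exists and main_df_list holds the dataframes:

from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active  # append() adds rows after the last used row of this sheet
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
wb.save(outfile_path)  # nothing reaches the file on disk until this call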
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will list the entirety of the program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what good Python practice actually looks like.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing the lists that I will store looped data into
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read the files into a dataframe; main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # lists to store where headers are located in the dataframe
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if not isinstance(cell, str):
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare: if the cell is a string and matches the search params, then do...
                if insensitive_compare(search_loop, cell):  # user-defined comparison function
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)  # store the row where the header sits in that dataframe
                        column_list.append(name)  # store the column where the header sits in that dataframe
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turn the dataframe into a set of booleans that are True wherever a value exists
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]]:
            # read down the column until the boolean mask turns False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store the actual data into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use two of them. I would like them to print out into Excel one after the other, not overwriting each other.
The comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concat'd into a single dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together I just had to export the result using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]]:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
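As an aside, the growing-concat loop above can be collapsed: pd.concat accepts the whole list in one call, which avoids re-copying the accumulated frame on every iteration. A sketch, assuming main_df holds all the frames:

to_xlsx_df = pd.concat(main_df)  # one concat over the whole list of dataframes
to_xlsx_df.to_excel(outfile_path)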
The output to Excel ended up as one sheet with all of the dataframes stacked together, which is what I wanted.
Hopefully this can help someone else out too.
I am trying to iterate through a dataframe, classify each row and add the output to the end of the row in a new column.
It seems to be adding the same classification to every row:
dfMach = pd.read_csv("C:/Users/nicholas/Desktop/machineSum.csv", encoding='latin-1')
dfNew = dfMach
dfNew["Classification"] = ""

for index, row in dfMach.iterrows():
    aVar = dfMach['Summary'].iat[0]
    aClass = cl.classify(aVar)
    dfNew['Classification'] = aClass
Where am I going wrong?
Thank you
Use apply instead of looping explicitly, i.e.
dfMach['Classification'] = dfMach['Summary'].apply(cl.classify)
A couple of simple mistakes to be corrected in your code, and a bit of improvement, i.e.
dfNew = dfMach.copy()  # dfNew = dfMach will not create a new copy, so you have to use dfMach.copy()
dfNew["Classification"] = ""

for index, row in dfMach.iterrows():
    # as @jez suggested, we need to use loc for assignment
    dfNew.loc[index, 'Classification'] = cl.classify(row['Summary'])
I am struggling to write code that finds the first empty row of a Google sheet.
I am using the gspread package from github.com/burnash/gspread.
I would be glad if someone could help :)
I currently have just imported the modules and opened the worksheet:
import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('ddddd-61d0b758772b.json', scope)
gc = gspread.authorize(credentials)
sheet = gc.open("Event Discovery")
ws = sheet.worksheet('Event Discovery')
I want to find row 1158, which is currently the first empty row of the worksheet, with a function, so that every time the old empty row is filled it will find the next empty row.
I solved this using:
def next_available_row(worksheet):
    str_list = list(filter(None, worksheet.col_values(1)))
    return str(len(str_list) + 1)
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('auth.json', scope)
gc = gspread.authorize(credentials)
worksheet = gc.open("sheet name").sheet1
next_row = next_available_row(worksheet)
#insert on the next available row
worksheet.update_acell("A{}".format(next_row), somevar)
worksheet.update_acell("B{}".format(next_row), somevar2)
This alternative method resolves issues with the accepted answer by accounting for rows that may have skipped values (such as fancy header sections in a document), as well as sampling the first N columns:
def next_available_row(sheet, cols_to_sample=2):
    # looks for the first empty row based on the values appearing in the first N columns
    cols = sheet.range(1, 1, sheet.row_count, cols_to_sample)
    return max([cell.row for cell in cols if cell.value], default=0) + 1  # default=0 so an empty sheet yields row 1
If you can count on all of your previous rows being filled in:
len(sheet.get_all_values()) + 1
will give you the first free row
get_all_values returns a 2D list of the sheet's data. Each nested list is a row, so the length of the 2D list is the number of rows that has any data.
A similar problem is finding the first free column:
from xlsxwriter.utility import xl_col_to_name
# Square 2D list, doesn't matter which row len you check
column_count = len(sheet.get_all_values()[0])
column = xl_col_to_name(column_count)
def find_empty_cell():
    alphabet = list(map(chr, range(65, 91)))
    for letter in alphabet[0:2]:  # look only at columns A and B
        for x in range(1, 1000):
            cell_coord = letter + str(x)
            if wks.acell(cell_coord).value == "":
                return cell_coord
I use this kinda sloppy function to find the first empty cell. I can't look for an empty row, because the other columns already have values.
Oh, and there are some differences between 2.7 and 3.6 with map that required me to wrap the alphabet in list().
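A less chatty variant under the same assumptions (the wks handle, columns A and B only): col_values fetches a whole column in one API call instead of issuing one acell request per cell:

def find_empty_cell(wks, max_rows=1000):
    for col_index, letter in ((1, 'A'), (2, 'B')):
        values = wks.col_values(col_index)  # one request per column
        for x in range(1, max_rows + 1):
            # col_values stops at the last non-empty cell, so anything past
            # the end of the list (or an empty string within it) is free
            if x > len(values) or values[x - 1] == "":
                return letter + str(x)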
import pygsheets

gc = pygsheets.authorize(service_file='************************.json')
ss = gc.open('enterprise_finance')
ws = ss[0]

row_count = len(ws.get_all_records()) + 2  # records exclude the header row, so +2 lands on the first free row
ws.set_dataframe(raw_output, (row_count, 1), copy_index=True, copy_head=True)
ws.delete_rows(row_count, number=1)
I am processing a CSV file in Python that's delimited by a comma (,).
Each column is a sampled parameter; for instance, column 0 is time, sampled once a second, column 1 is altitude, sampled 4 times a second, etc.
So the columns will look like this:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all of its data. That way I can do calculations and organize my data into a new file automatically (the sampled data I am working with has a substantial number of rows).
I want to do this for any file, not just one, so the number of columns can vary.
Normally, if every file were consistent, I would do something like:
import csv

time = []
alt = []
dct = {}

with open('test.csv', "r") as csvfile:
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        time.append(row[0])
        alt.append(row[1])  # etc. for all columns
I am pretty new to Python. Is this a good way to tackle this, and if not, what is a better methodology?
Thanks for your time
Pandas will probably work best for you. If you use read_csv from pandas, it will create a DataFrame based on the columns; it's roughly a dictionary of lists.
You can also use the .tolist() method of pandas to convert a column to a list if you want a list specifically.
import pandas as pd

data = pd.read_csv("soqn.csv")

dict_of_lists = {}
for column_name in data.columns:
    temp_list = data[column_name].tolist()
    dict_of_lists[column_name] = temp_list
print(dict_of_lists)
EDIT:
dict_of_lists = {column_name: data[column_name].tolist() for column_name in data.columns}
# This dict comprehension might work faster.
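Pandas can also build the same structure directly, without a loop or comprehension:

dict_of_lists = data.to_dict('list')  # same dict-of-lists in one call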
I think I made my problem simpler and just focused on one column.
What I ultimately wanted to do was interpolate to the highest sampling rate. So here is what I came up with; please let me know if I can do anything more efficiently. I used A LOT of searching on this site to help build this. Again, I am new at Python (about 2-3 weeks, but some former programming experience).
import csv

header = []
# initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0

with open('test2.csv', "r") as csvfile:  # open csv file
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        header.append(row[0])  # make a list that consists of all content in column A

for x in range(0, len(header) - 1):  # go through the entire column
    if header[x].isdigit() and header[x + 1] == "":  # find lower bound of sample to be interpolated
        loc_int = x
        temp_i = int(header[x])
    elif header[x + 1].isdigit() and header[x] == "":  # find upper bound of sample to be interpolated
        loc_fin = x
        temp_f = int(header[x + 1])
        if temp_f > temp_i:  # calculate interpolated values
            f_min_i = temp_f - temp_i
            interp = f_min_i / float((loc_fin + 1) - loc_int)
            for y in range(loc_int, loc_fin + 1):
                header[y] = temp_i + interp * (y - loc_int)

print(header)

with open("output.csv", 'wb') as g:  # write to new file
    writer = csv.writer(g)
    for item in header:
        writer.writerow([item])
I couldn't figure out how to write my new list header, with its interpolated values, back over column A of my old file, test2.csv.
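For what it's worth, one way to do that is to read the whole file, swap out column A, and rewrite it; a sketch assuming the file fits in memory and header already holds the interpolated values:

import csv

with open('test2.csv', 'r') as f:
    rows = list(csv.reader(f))

for row, value in zip(rows, header):
    row[0] = value  # replace column A with the interpolated value

with open('test2.csv', 'w', newline='') as f:  # on Python 2 use mode 'wb' instead
    csv.writer(f).writerows(rows)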
Anywho thank you very much for looking...
I am trying to loop through a spreadsheet and grab the value of a cell in a row under a certain column, like so:
# Row by row, go through the originalWorkSheet and save the values from the selected columns
numberOfRowsInOriginalWorkSheet = originalWorkSheet.nrows - 1
rowCounter = 0
while rowCounter <= numberOfRowsInOriginalWorkSheet:
    row = originalWorkSheet.row(rowCounter)
    # Grab the values in certain columns, say with the
    # column name "Promotion", and save them to a variable
    rowCounter += 1
Is this possible? My google-fu has failed me on this one.
Thank you for the help!
The simplest way:
from xlrd import open_workbook

book = open_workbook(path_to_file)
sheet = book.sheet_by_index(0)
for i in range(1, sheet.nrows):
    row = sheet.row_values(i)
    variable = row[0]  # replace 0 with the index of the column you need
Or you can loop over the row list and print each cell value:
book = open_workbook(path_to_file)
sheet = book.sheet_by_index(0)
for i in range(1, sheet.nrows):
    row = sheet.row_values(i)
    for cnt in range(len(row)):
        print(row[cnt])
Hope this helps
There are many ways to do this; take a look at the docs.
Something like this:
promotion_col_index = <promotion column index>
list_of_promotion_cells = originalWorkSheet.col(promotion_col_index)
list_of_promotion_values = [cell.value for cell in list_of_promotion_cells]
will get you a list of the values in the "Promotion" column
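If only the header name is known, the column index can be looked up from the header row first; a sketch assuming the headers live in row 0:

header_row = originalWorkSheet.row_values(0)
promotion_col_index = header_row.index("Promotion")  # find the column by its name
promotion_cells = originalWorkSheet.col(promotion_col_index)[1:]  # skip the header cell
list_of_promotion_values = [cell.value for cell in promotion_cells]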