Python Pandas - Iterating and adding data to a blank column

I am trying to iterate through a dataframe, classify each row, and add the output to the end of the row in a new column.
It seems to be adding the same classification to every row:
dfMach = pd.read_csv("C:/Users/nicholas/Desktop/machineSum.csv", encoding='latin-1')
dfNew = dfMach
dfNew["Classification"] = ""
for index, row in dfMach.iterrows():
    aVar = dfMach['Summary'].iat[0]
    aClass = cl.classify(aVar)
    dfNew['Classification'] = aClass
Where am I going wrong?
Thank you

Use apply instead of looping explicitly, i.e.
dfMach['Classification'] = dfMach['Summary'].apply(cl.classify)
There are a couple of simple mistakes to be corrected in your code, and a bit of improvement, i.e.
dfNew = dfMach.copy()  # dfNew = dfMach does not create a new copy, so use dfMach.copy()
dfNew["Classification"] = ""
for index, row in dfMach.iterrows():
    # as @jez suggested, use loc for the assignment
    dfNew.loc[index, 'Classification'] = cl.classify(row['Summary'])
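And for a quick sanity check, here is a minimal, self-contained sketch of the apply approach, with a stand-in classifier since cl.classify isn't shown in the question:
import pandas as pd

def classify(text):
    # hypothetical stand-in for cl.classify, purely for illustration
    return 'long' if len(text) > 5 else 'short'

dfMach = pd.DataFrame({'Summary': ['broken pump', 'leak', 'motor fault']})
dfMach['Classification'] = dfMach['Summary'].apply(classify)
print(dfMach)  # each row now gets its own classification, not one repeated value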

Related

I want to save my scraped records in a CSV; this Python code works fine for me, but it makes extra columns each time new data is saved

It makes extra columns each time new data is saved to this sheet. How can I stop it from making extra columns?
Data = [[Category, Headlines, Author, Source, Published_Date, Feature_Image, Content, url]]
cols = ['Category','Headlines','Author','Source','Published_Date','Feature_Image','Content','URL']
try:
    opened_df = pd.read_csv('C:/Users/Public/pagedata.csv')
    opened_df = pd.concat([opened_df, pd.DataFrame(Data, columns=cols)])
except:
    opened_df = pd.DataFrame(Data, columns=cols)
opened_df.to_csv('C:/Users/Public/pagedata.csv')
Try this in the last line of your code:
opened_df.to_csv('C:/Users/Public/pagedata.csv', index = False)
In case the "Unnamed: 0" column gets created every time and is an unwanted column in your CSV, try this before the last line of your code:
opened_df.drop('Unnamed: 0', axis = 1, inplace = True)
opened_df.to_csv('C:/Users/Public/pagedata.csv', index = False)
I hope this solves your issue.
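For context: the extra "Unnamed: 0" column appears because to_csv writes the DataFrame index by default, and the next read_csv reads it back in as an ordinary data column. A minimal sketch of the round trip, using a throwaway file name:
import pandas as pd

df = pd.DataFrame({'A': [1, 2]})
df.to_csv('demo.csv')                       # writes the index as an unnamed first column
df2 = pd.read_csv('demo.csv')               # the index comes back as "Unnamed: 0"
df3 = pd.read_csv('demo.csv', index_col=0)  # or: read it back in as the index instead
df.to_csv('demo.csv', index=False)          # simplest: don't write the index at all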

How to find the first empty row of a Google spreadsheet using Python gspread?

I am struggling to write code that finds the first empty row of a Google sheet.
I am using the gspread package from github.com/burnash/gspread.
I would be glad if someone could help :)
I currently have just imported the modules and opened the worksheet:
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('ddddd-61d0b758772b.json', scope)
gc = gspread.authorize(credentials)
sheet = gc.open("Event Discovery")
ws = sheet.worksheet('Event Discovery')
I want to find row 1158, which is the first empty row of the worksheet, with a function, so that every time the old empty row is filled, it will find the next empty row.
I solved this using:
def next_available_row(worksheet):
    str_list = list(filter(None, worksheet.col_values(1)))
    return str(len(str_list) + 1)
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('auth.json', scope)
gc = gspread.authorize(credentials)
worksheet = gc.open("sheet name").sheet1
next_row = next_available_row(worksheet)
#insert on the next available row
worksheet.update_acell("A{}".format(next_row), somevar)
worksheet.update_acell("B{}".format(next_row), somevar2)
This alternative method resolves issues with the accepted answer by accounting for rows that may have skipped values (such as fancy header sections in a document), as well as by sampling the first N columns:
def next_available_row(sheet, cols_to_sample=2):
    # looks for the first empty row based on values appearing in the first N columns
    cols = sheet.range(1, 1, sheet.row_count, cols_to_sample)
    return max([cell.row for cell in cols if cell.value]) + 1
If you can count on all of your previous rows being filled in:
len(sheet.get_all_values()) + 1
will give you the first free row.
get_all_values returns a 2D list of the sheet's data. Each nested list is a row, so the length of the 2D list is the number of rows that have any data.
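A small illustration of that shape, with made-up data:
rows = sheet.get_all_values()
# e.g. rows == [['Name', 'Score'], ['alice', '10'], ['bob', '7']]
first_free_row = len(rows) + 1  # -> 4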
A similar problem is finding the first free column:
from xlsxwriter.utility import xl_col_to_name
# Square 2D list, so it doesn't matter which row's length you check
column_count = len(sheet.get_all_values()[0])
column = xl_col_to_name(column_count)
def find_empty_cell():
    alphabet = list(map(chr, range(65, 91)))
    for letter in alphabet[0:2]:  # look only at columns A and B
        for x in range(1, 1000):
            cell_coord = letter + str(x)
            if wks.acell(cell_coord).value == "":
                return cell_coord
I use this kinda sloppy function to find the first empty cell. I can't find an empty row because the other columns already have values.
Oh, and there are some issues between 2.7 and 3.6 with map that required me to turn the map object into a list.
import pygsheets

gc = pygsheets.authorize(service_file='************************.json')
ss = gc.open('enterprise_finance')
ws = ss[0]

# append below the existing records (+2 for the header row and 1-based indexing)
row_count = len(ws.get_all_records()) + 2
ws.set_dataframe(raw_output, (row_count, 1), copy_index=True, copy_head=True)
# set_dataframe writes the header again, so delete that duplicate header row
ws.delete_rows(row_count, number=1)

Import CSV and create one list for each column in Python

I am processing a CSV file in Python that's delimited by a comma (,).
Each column is a sampled parameter; for instance, column 0 is time, sampled once a second, column 1 is altitude, sampled 4 times a second, etc.
So the columns will look as below:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all its data. That way I can do calculations and organize my data into a new file automatically (the sampled data I am working with has a substantial number of rows).
I want to do this for any file, not just one, so the number of columns can vary.
Normally, if every file were consistent, I would do something like:
import csv

time = []
alt = []
dct = {}
with open('test.csv', "r") as csvfile:
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        time.append(row[0])
        alt.append(row[1])  # etc. for all columns
I am pretty new to Python. Is this a good way to tackle this, and if not, what is a better methodology?
Thanks for your time
Pandas will probably work best for you. If you use read_csv from pandas, it will create a DataFrame based on the columns. It's roughly a dictionary of lists.
You can also use the .tolist() functionality of pandas to convert a column to a list if you want a list specifically.
import pandas as pd

data = pd.read_csv("soqn.csv")
dict_of_lists = {}
for column_name in data.columns:
    temp_list = data[column_name].tolist()
    dict_of_lists[column_name] = temp_list
print dict_of_lists
EDIT:
dict_of_lists = {column_name: data[column_name].tolist() for column_name in data.columns}
# This dict comprehension might work faster.
I think I made my problem simpler and just focused on one column.
What I ultimately wanted to do was interpolate to the highest sampling rate. So here is what I came up with... Please let me know if I can do anything more efficiently. I did A LOT of searching on this site to help build this. Again, I am new to Python (about 2-3 weeks, but some former programming experience).
import csv

header = []

# initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0

with open('test2.csv', "r") as csvfile:  # open csv file
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        header.append(row[0])  # make a list of all content in column A

for x in range(0, len(header) - 1):  # go through the entire column
    if header[x].isdigit() and header[x+1] == "":  # find lower bound of sample to be interpolated
        loc_int = x
        temp_i = int(header[x])
    elif header[x+1].isdigit() and header[x] == "":  # find upper bound of sample to be interpolated
        loc_fin = x
        temp_f = int(header[x+1])
        if temp_f > temp_i:  # calculate interpolated values
            f_min_i = temp_f - temp_i
            interp = f_min_i / float((loc_fin + 1) - loc_int)
            for y in range(loc_int, loc_fin + 1):
                header[y] = temp_i + interp * (y - loc_int)

print header

with open("output.csv", 'wb') as g:  # write to new file
    writer = csv.writer(g)
    for item in header:
        writer.writerow([item])
I couldn't figure out how to write my new list "header" with its interpolated values back out, replacing column A of my old file, test2.csv.
Anywho, thank you very much for looking...
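In case it helps, one way to do that write-back (a sketch, not from the original post, assuming the interpolated header list from above): read test2.csv once more, replace column A, and write all columns out together:
import csv

with open('test2.csv', 'r') as f:
    rows = list(csv.reader(f))

for i, value in enumerate(header):
    rows[i][0] = value  # replace column A with the interpolated value

with open('output.csv', 'wb') as g:  # 'wb' to match the Python 2 style above
    csv.writer(g).writerows(rows)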

Calculate and add column in python

I think I have a fairly simple question for a Python expert. After a lot of struggling, I put together the code below. I am opening an Excel file, transforming it into a list of lists, and adding a column to this list of lists. Now I want to rename and recalculate the rows of this added column. How do I script it so that I always take the last column of a list of lists, even though the number of columns could differ?
import xlrd
file_location = "path"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
data = [x + [0] for x in data]
If you have a function called calculate_value that takes a row and returns the value for that row, you could do it like this:
def calculate_value(row):
    # calculate it...
    return value

def add_calculated_column(rows, func):
    result_rows = []
    for row in rows:
        # create a new row to avoid changing the old data
        new_row = row + [func(row)]
        result_rows.append(new_row)
    return result_rows

data_with_column = add_calculated_column(data, calculate_value)
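For instance, with a hypothetical calculate_value that sums the numeric cells in each row (using the data list from the question):
def calculate_value(row):
    # hypothetical example: total of the numeric cells in the row
    return sum(v for v in row if isinstance(v, (int, float)))

data_with_column = add_calculated_column(data, calculate_value)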
I found an easier and more flexible way of adjusting the values in the last column:
counter = 0
for row in data:  # `row` rather than `list`, which shadows the built-in
    counter = counter + 1
    if counter == 1:
        value = 'Vrspng'
    else:
        value = counter
    row.append(value)
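The same logic reads a little more cleanly with enumerate (a sketch of the identical behavior):
for counter, row in enumerate(data, start=1):
    row.append('Vrspng' if counter == 1 else counter)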

Removing rows from an Excel file in Python using xlrd, xlwt, and xlutils

Hello everyone and thank you in advance.
I have a Python script where I am opening a template Excel file, adding data (while preserving the style), and saving it again. I would like to be able to remove rows that I did not edit before saving out the new xls file. My template xls file has a footer, so I want to delete the extra rows before the footer.
Here is how I am loading the xls template:
self.inBook = xlrd.open_workbook(file_path, formatting_info=True)
self.outBook = xlutils.copy.copy(self.inBook)
self.outBookCopy = xlutils.copy.copy(self.inBook)
I then write the info to outBook, grabbing the style from outBookCopy and applying it to each row that I modify in outBook.
So how do I delete rows from outBook before writing it? Thanks, everyone!
I achieved this using the pandas package:
import pandas as pd

# Read from Excel
xl = pd.ExcelFile("test.xls")

# Parse the Excel sheet into a DataFrame
dfs = xl.parse(xl.sheet_names[0])

# Update the DataFrame as required
# (here, removing rows that have a blank value in the "Name" column)
dfs = dfs[dfs['Name'] != '']

# Write the updated DataFrame back to the Excel sheet
dfs.to_excel("test.xls", sheet_name='Sheet1', index=False)
xlwt does not provide a simple interface for doing this, but I've had success with a somewhat similar problem (inserting multiple copies of a row into a copied workbook) by directly changing the worksheet's rows attribute and the row numbers on the row and cell objects.
The rows attribute is a dict, indexed on row number, so iterating a row range takes a little care and you can't slice it.
Given the number of rows you want to delete and the initial row number of the first row you want to keep, something like this might work:
rows_indices_to_move = range(first_kept_row, worksheet.last_used_row + 1)
max_used_row = 0
for row_index in rows_indices_to_move:
    new_row_number = row_index - number_to_delete
    if row_index in worksheet.rows:
        row = worksheet.rows[row_index]
        row._Row__idx = new_row_number
        for cell in row._Row__cells.values():
            if cell:
                cell.rowx = new_row_number
        worksheet.rows[new_row_number] = row
        max_used_row = new_row_number
    else:
        # There's no row in the block we're trying to slide up at this index,
        # but there might be a row already present to clear out.
        if new_row_number in worksheet.rows:
            del worksheet.rows[new_row_number]
# now delete any remaining rows (rows is a dict, so it can't be sliced away)
for stale_index in [idx for idx in worksheet.rows if idx > new_row_number]:
    del worksheet.rows[stale_index]
# and update the internal marker for the last remaining row
if max_used_row:
    worksheet.last_used_row = max_used_row
I would believe that there are bugs in that code; it's untested and relies on direct manipulation of the underlying data structures, but it should show the general idea. Modify the row and cell objects and adjust the rows dictionary so that the indices are correct.
Do you have merged ranges in the rows you want to delete, or below them? If so you'll also need to run through the worksheet's merged_ranges attribute and update the rows for them. Also, if you have multiple groups of rows to delete you'll need to adjust this answer - this is specific to the case of having a block of rows to delete and shifting everything below up.
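For the merged-range case, here is a rough sketch of that adjustment, assuming (as in xlwt) that merged_ranges holds (r1, r2, c1, c2) tuples and that every merged range lies entirely above or entirely below the deleted block:
ranges = worksheet.merged_ranges
for i, (r1, r2, c1, c2) in enumerate(ranges):
    if r1 >= first_kept_row:
        # shift the whole range up by the number of deleted rows
        ranges[i] = (r1 - number_to_delete, r2 - number_to_delete, c1, c2)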
As a side note - I was able to write text to my worksheet and preserve the predefined style thus:
def write_with_style(ws, row, col, value):
    if ws.rows[row]._Row__cells[col]:
        old_xf_idx = ws.rows[row]._Row__cells[col].xf_idx
        ws.write(row, col, value)
        ws.rows[row]._Row__cells[col].xf_idx = old_xf_idx
    else:
        ws.write(row, col, value)
That might let you skip having two copies of your spreadsheet open at once.
For those of us still stuck with xlrd/xlwt/xlutils, here's a filter you could use:
from xlutils.filter import BaseFilter

class RowFilter(BaseFilter):
    rows_to_exclude: "Iterable[int]"
    _next_output_row: int

    def __init__(
        self,
        rows_to_exclude: "Iterable[int]",
    ):
        self.rows_to_exclude = rows_to_exclude
        self._next_output_row = -1

    def _should_include_row(self, rdrowx):
        return rdrowx not in self.rows_to_exclude

    def row(self, rdrowx, wtrowx):
        if self._should_include_row(rdrowx):
            # Proceed with writing out the row to the output file
            self._next_output_row += 1
            self.next.row(
                rdrowx, self._next_output_row,
            )

    # After `row()` has been called, `cell()` is called for each cell of the row
    def cell(self, rdrowx, rdcolx, wtrowx, wtcolx):
        if self._should_include_row(rdrowx):
            self.next.cell(
                rdrowx, rdcolx, self._next_output_row, wtcolx,
            )
Then put it to use with e.g.:
from xlrd import open_workbook
from xlutils.filter import DirectoryWriter, XLRDReader, process

process(
    XLRDReader(open_workbook("input_filename.xls"), "output_filename.xls"),
    RowFilter([3, 4, 5]),
    DirectoryWriter("output_dir"),
)
