Endswith and Replacement Syntax - Pandas - Python

I'm using Python pandas to manipulate two columns, STY and SIZ (both object dtypes). What I want is to remove the trailing 'X' from STY values that end with 'X' and concatenate an 'X' onto the corresponding SIZ.
Please see the example below: the top is what I have now, the bottom is the result I want. (How do I get the result I want?)
This is my current code:
# for removing style
def reformat():
    for n in df['STY']:
        if str(n).endswith('X'):
            x = str(n).replace('X', '')
    # for adding string 'X'
    for x in df['SIZ']:
        if str(n) in df['STY'].endswith('X'):
            str(n).join('X')
In the end, I want to be able to apply the change and save the result to an Excel file.

Your problem, as described, can be solved with two statements. The first adds an "X" to the items in the second column if the items in the first column end with an "X":
df.loc[df.STY.str.endswith("X"), "SIZ"] += "X"
The second removes "X" from the ends of the items in the first column:
df.STY.replace("X$", "", regex=True, inplace=True)
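For instance, a minimal sketch with made-up sample values; note that the order matters, since SIZ must pick up the "X" before it is stripped from STY:
import pandas as pd

df = pd.DataFrame({"STY": ["A100X", "B200"], "SIZ": ["12", "14"]})  # assumed sample data
df.loc[df.STY.str.endswith("X"), "SIZ"] += "X"       # SIZ -> 12X, 14
df["STY"] = df["STY"].replace("X$", "", regex=True)  # STY -> A100, B200
print(df)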

Give this a try. The df.at call allows an in-place update of the dataframe at a specified location.
for i, row in df.iterrows():
    sty = str(row['STY'])
    siz = str(row['SIZ'])
    if sty.endswith('X'):
        df.at[i, 'STY'] = sty[:-1]  # remove the trailing character from STY
        df.at[i, 'SIZ'] = siz + 'X'  # update SIZ to include an X
Edit: to save it as an Excel file, do this:
df.to_excel("output.xlsx")
UPDATE
While this solution will work just fine, DYZ's solution is far more efficient.

How to find the value in a dataframe by multi-condition with unknown column, and return the value position and get the next row value [duplicate]

I want to get the column name from a whole dataframe (assume it contains more than 100 rows and more than 50 columns) based on a specific value contained in a specific column in pandas.
With the help of Bkmm3 (a member from India) I've succeeded with numeric terms but failed on alphabetic terms. The way I've tried is this:
df = pd.DataFrame({'A': ['APPLE', 'BALL', 'CAT'],
                   'B': ['ACTION', 'BATMAN', 'CATCHUP'],
                   'C': ['ADVERTISE', 'BEAST', 'CARTOON']})
response = input("input")
for i in df.columns:
    if (len(df.query(i + '==' + str(response))) > 0):
        print(i)
Then this error arises:
Traceback (most recent call last): NameError: name 'APPLE' is not defined
Any help from you guys will be very appreciated, thank you...
isin/eq works for DataFrames, and you can 100% vectorize this:
df.columns[df.isin(['APPLE']).any()] # df.isin([response])
Or,
df.columns[df.eq(response).any()]
Index(['A'], dtype='object')
And here's the roundabout way with DataFrame.eval and np.logical_or (were you to loop on columns):
df.columns[np.logical_or.reduce(
    [df.eval(f"{repr(response)} in {i}") for i in df]
)]
Index(['A'], dtype='object')
First, the reason for your error. With pd.DataFrame.query, as with regular comparisons, you need to surround strings with quotation marks. So this would work (notice the pair of " quotations):
response = input("input")
for i in df.columns:
    if not df.query(i + '=="' + str(response) + '"').empty:
        print(i)
inputAPPLE
A
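As an aside (not part of the original answer), query can also reference Python variables with the @ prefix, which sidesteps the manual quoting:
for i in df.columns:
    if not df.query(f'{i} == @response').empty:
        print(i)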
Next, you can extract index and/or columns via pd.DataFrame.any. coldspeed's solution is fine here, I'm just going to show how similar syntax can be used to extract both row and column labels.
# columns
print(df.columns[(df == response).any(0)])
Index(['A'], dtype='object')
# rows
print(df.index[(df == response).any(1)])
Int64Index([0], dtype='int64')
Notice that in both cases the result is an Index object. The code differs only in the property being extracted and in the axis argument passed to pd.DataFrame.any.
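If you ever need the exact cell positions rather than just the labels, here is a small sketch (my addition, assuming numpy is imported as np):
rows, cols = np.where(df.eq(response))
print(list(zip(df.index[rows], df.columns[cols])))  # [(0, 'A')]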

Adding Multiple Columns at Specific Locations in CSV file using Pandas

I am trying to place multiple columns (Score1, Score2, Score3 etc) before columns whose name begins with a certain text e.g.: Certainty.
I can insert columns at fixed locations using:
df.insert(17, "Score1", " ")
Adding a column then changes the column sequence, so then I would have to look and see where the next column is located. I can add a list of blank columns to the end of a CSV.
So essentially, my understanding is that I have to get pandas to read the column header. If the header text starts with "Certainty", then place a column called Score1 before it.
I tried using:
df.insert(df.filter(regex='Certainty').columns, "Score", " ")
However, as can be guessed, it doesn't work.
From what I understand, pandas is not efficient at iterative methods. Am I misinformed here?
Writing this also leads me to think that it needs a counter for Score1, 2, 3.
Any suggestions would be appreciated!
Thanks in advance.
Update (based on feedback provided):
Using the method by @SergeBallesta works.
cur = 0
for i, col in enumerate(df.columns):
    if col.startswith('Certainty'):
        df.insert(i + cur, f'Score{cur + 1}', '')
        cur += 1
Using the method by @JacoSolari:
I needed to make a modification so it finds all columns starting with "Certainty", and also adds Score1, Score2, Score3 automatically.
Version 1: This only adds Score1 in the correct place and then nothing else
counter = 0
certcol = df.columns[df.columns.str.contains('Certainty')]
col_idx = df.columns.get_loc(certcol[0])
col_names = [f'Score{counter + 1}']
[df.insert(col_idx, col_name, ' ') for col_name in col_names[::-1]]
Version 2: This adds Score1 in the correct place and then adds the rest after the first "Certainty" column. So it does not proceed to find the next one. Perhaps it needs a for loop somewhere?
cur = 0
certcol = df.columns[df.columns.str.contains('Certainty')]
for col in enumerate(certcol):
    col_idx = df.columns.get_loc(certcol[0])
    df.insert(cur + col_idx, f'Score{cur + 1}', '')
    cur += 1
I have posted this, in case anyone stumbles across the same need.
You will have to iterate over the columns. It is not as performant as numpy vectorized accesses but sometimes you have no other choice:
Here I would just do:
cur = 0
for i, col in enumerate(df.columns):
    if col.startswith('Certainty'):
        df.insert(i + cur, f'Score{cur + 1}', '')
        cur += 1
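A quick check of this loop on a made-up frame (column names assumed for illustration): enumerate iterates over a snapshot of the original columns, while cur shifts each insertion point to account for the Score columns already added.
import pandas as pd

df = pd.DataFrame(columns=['Id', 'Certainty1', 'Value', 'Certainty2'])
cur = 0
for i, col in enumerate(df.columns):
    if col.startswith('Certainty'):
        df.insert(i + cur, f'Score{cur + 1}', '')
        cur += 1
print(list(df.columns))
# ['Id', 'Score1', 'Certainty1', 'Value', 'Score2', 'Certainty2']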
You can find the location of your Certainty column like this
col_idx = df.columns.get_loc('Certainty')
Then you can add in a for loop each of your new columns and data (here just empty string as in your example) like this
col_names = ['1', '2', '3']
[df.insert(col_idx, col_name, '') for col_name in col_names[::-1]]
So you don't need to update the column index as long as you add the reversed ([::-1]) list of new columns.
Also have a look at this question if you didn't already.

How to return the string of a header based on the max value of a cell in Openpyxl

Good morning guys! Quick question about openpyxl:
I am working with Python, editing an xlsx document and generating various stats. Part of my script generates the max values of a cell range:
temp_list = []
temp_max = []
for row in sheet.iter_rows(min_row=3, min_col=10, max_row=508, max_col=13):
    print(row)
    for cell in row:
        temp_list.append(cell.value)
    print(temp_list)
    temp_max.append(max(temp_list))
    temp_list = []
I would also like to be able to print the string of the header of the column that contains the max value for the desired cell range. My data structure looks like this:
Any idea on how to do so?
Thanks!
This seems like a typical INDEX/MATCH Excel problem.
Have you tried retrieving the index for the max value in each temp_list?
You can use a function like numpy.argmax() to get the index of your max value within your "temp_list" array, then use this index to locate the header and append the string to a new list called, say, "max_headers" which contains all the header strings in order of appearance.
It would look something like this:
for cell in row:
    temp_list.append(cell.value)
i_max = np.argmax(temp_list)
# openpyxl columns are 1-based, so offset the 0-based i_max (plus the
# starting column of the scanned range, if it does not begin at column A)
max_headers.append(sheet.cell(row=1, column=i_max + 1).value)
And so on and so forth. Note that np.argmax also accepts a plain Python list, and the max_headers list has to be defined beforehand.
First, thanks Bernardo for the hint. I found a decently working solution but still have a little issue. Perhaps someone can be of assistance.
Let me amend my initial statement: here is the code I am working with now:
temp_list = []
headers_list = []
# Index starts at 1. Here we set the rows/columns containing the data to be analyzed.
for row in sheet.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32):
    for cell in row:
        temp_list.append(cell.value)
    for cell in row:
        if cell.value == max(temp_list):
            print(str(cell.column))
            print(cell.value)
            print(sheet.cell(row=1, column=cell.column).value)
            headers_list.append(sheet.cell(row=1, column=cell.column).value)
        else:
            print('keep going.')
    temp_list = []
This loop works but has a little issue: if, for instance, a row contains the same max value twice (e.g. 25, 9, 25, 8, 9), it will print two headers instead of one. My question is:
How can I get this loop to take into account only the first match of a max value in a row?
You probably want something like this:
headers = [c for c in next(ws.iter_rows(min_col=27, max_col=32, min_row=1, max_row=1, values_only=True))]
for row in ws.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32, values_only=True):
    mx = max(row)
    idx = row.index(mx)
    col = headers[idx]
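Note that tuple.index returns the position of the first occurrence, so a duplicated max value resolves to a single (first) header, which answers the question above. To actually collect the result, the loop body would end with something like (an assumed continuation, not part of the original answer):
    headers_list.append(col)  # hypothetical: store the first-match header for this row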

Getting all column values from google sheet using Gspread and Python

So I have a problem with gspread for Python 3.
When I do something like:
x = worksheet.cell(1,1).value
print(x)
Then I get the value of cell (1,1), which in my case is:
Nice
But when I do:
x = worksheet.col_values(1)
print(x)
Then I get all the results, as in:
'Nice', 'Cool','','','','','','','','','','','','','',''
And all the empty cells as well, which I don't understand: since I am asking just for values, why do I get all the empty '' entries, and why are the other results also in quotes? I would expect something like:
Nice
Cool
when I call for the values of a column and those are the only values. Does anyone know how to get such results?
According to the documentation at https://github.com/burnash/gspread it should work, but it does not.
You are getting all of the column data, contained in a list. It starts at row one and gives you all rows in that column to the bottom of the spreadsheet (1000 rows by default), including empty cells. The documentation tells you this:
col_values(col) Returns a list of all values in column col.
Empty cells in this list will be rendered as None.
This seems to have been changed to return empty strings instead, but the principle is the same.
To get just values, use a list comprehension:
x = [item for item in worksheet.col_values(1) if item]
Noting that the above will remove blank rows between items, which might cause misalignment if you try to work with multiple columns where row number is important. Since it's a list, individual items are accessed with:
for item in x:
print(item)
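If row positions matter (the misalignment caveat above), here is a small sketch, not from the original answer, that keeps each value paired with its 1-based row number:
x = [(i + 1, item) for i, item in enumerate(worksheet.col_values(1)) if item]
# e.g. [(1, 'Nice'), (2, 'Cool')]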
Looking again at the gspread documentation, I was able to create a dataframe and then obtain the column values:
gc = gspread.authorize(GoogleCredentials.get_application_default())
sht2 = gc.open_by_url('https://docs.google.com/spreadsheets/d/<id>')
worksheet = sht2.worksheet("Sheet-name")
dataframe = pd.DataFrame(worksheet.get_all_records())
dataframe.head(3)
Note: Don't forget to set your sheet's sharing settings to "Anyone with the link", so that the sheet can be accessed from e.g. Google Colab.
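From there, a single column can be pulled out of the dataframe directly; 'ColumnName' below is a stand-in for one of your actual headers:
values = dataframe['ColumnName'].tolist()  # hypothetical header name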
You can also create a while loop and do something like this.
Let's say you want columns E to G: start the loop at x = 5 and end it at x = 7. Just make sure you transpose the dataframe at the end before printing it.
columns = []
x = 5
while x < 8:
    data = sheet.col_values(x)[1:]
    x += 1
    columns.append(data)
df = pd.DataFrame(columns).T
print(df)
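As a side note, the same thing can be written as a list comprehension (an equivalent sketch, not from the original answer):
columns = [sheet.col_values(x)[1:] for x in range(5, 8)]  # columns E..G, skipping the header row
df = pd.DataFrame(columns).T
print(df)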

DataFrame change doesn't save when iterating

I am trying to read a certain DF from file and add to it two more columns containing, say, the year and the week derived from other columns in the DF. When I apply the code to generate a single new column, all works great. But when several columns are to be created, the change does not stick. Specifically, the new columns are created, but their values are not what they are supposed to be.
I know that this happens because I first set all new values to a certain initial string and then change some of them, but I don't understand why it works on a single column and is "nulled" for multiple columns, leaving only the latest column changed... Help please?
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    bad_ind = list(np.where(tbl[date_cols[i]] == 'No Fill')[0])
    tbl_ind = range(len(tbl))
    for i in range(len(bad_ind)):
        tbl_ind.remove(bad_ind[i])
    tmp = pd.to_datetime(tbl[date_cols[i]][tbl_ind])
    tbl[tmp_col_name][tbl_ind] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
If I try the following lines, disregarding possible "empty data values", everything works...
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    tmp = pd.to_datetime(tbl[date_cols[i]])
    tbl[tmp_col_name] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
It has to do with not changing all the data values, but I don't understand why the change does not apply. After all, before the second iteration begins, the DF seems to be updated, and then tbl[tmp_col_name] = 'No Week' for the second iteration "deletes" the changes made in the first iteration, but only partially: it leaves the new column created but filled with 'No Week' values...
Many thanks to @EdChum! Performing chained indexing may or may not work. In the case of creating multiple new columns and then filling in only some of their values, it doesn't work. More precisely, it does work, but only on the last updated column. Using the loc, iloc or ix accessors to set the data works. In the case of the above code, to make it work, one needs to cast tbl_ind into an np.array, using tbl[col_name[j]].iloc[np.array(tbl_ind)] = tmp.apply(lambda x: x.year)
Many thanks and credit for the answer to @EdChum.
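For reference, here is a minimal sketch of the accessor-based fix, written in terms of the question's own names but using a boolean mask instead of the index list (that substitution is my assumption, not part of the original answer):
mask = tbl[date_cols[i]] != 'No Fill'  # skip the 'No Fill' rows directly
tmp = pd.to_datetime(tbl.loc[mask, date_cols[i]])
tbl.loc[mask, tmp_col_name] = tmp.apply(
    lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))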
