Insert Row in Python Pandas DataFrame

(I'm new to Python, so sorry for any mistakes I make; I hope you can understand me.)
I have searched for a method to insert a row into a Pandas DataFrame in Python, and I found this:
add one row in a pandas.DataFrame
I have used the code provided in the accepted answer of that topic by fred, but the code overwrites my row.
My code (inserts a row with the value -1 for each column under certain conditions):
df.loc[i+1] = [-1 for n in range(len(df.columns))]
How can I make the code insert a row WITHOUT overwriting an existing one?
For example, if I have a 50-row DataFrame and insert a row at position 25 (just as an example), the DataFrame should end up with 51 rows, row 25 being the extra NEW row.
I hope you can understand my question. (Sorry for any English mistakes.)
Thank you.
UPDATE:
Someone suggested this:
Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
I tried it, but it did not work. In fact, it does nothing; it does not add any row.
line = pd.DataFrame({[-1 for n in range(len(df.columns))]}, index=[i+1])
df = pd.concat([ df.ix[:i], line, df.ix[i+1:]]).reset_index(drop=True)
Any other ideas?

With a small DataFrame, if you want to insert a line at position 1:

df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
df1 = pd.DataFrame([4, 5]).T
pd.concat([df[:1], df1.rename(columns=dict(zip(df1.columns, df.columns))), df[1:]])
#Out[46]:
#   a  b
#0  0  2
#0  4  5
#1  1  3
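On recent pandas (where `.ix` is gone), the same insert-at-position idea from the update can be sketched with `iloc` slices; `df`, the -1 fill value, and the position `i` follow the question's setup:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})

# Build a one-row frame of -1s matching df's columns (the question's fill value)
line = pd.DataFrame([[-1] * len(df.columns)], columns=df.columns)

i = 1  # position at which to insert
df = pd.concat([df.iloc[:i], line, df.iloc[i:]]).reset_index(drop=True)
print(df)  # 3 rows: [0, 2], [-1, -1], [1, 3]
```

`reset_index(drop=True)` renumbers the rows so the inserted row gets a proper position instead of a duplicated label.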


Adding a column into a dataframe based on a number from another dataframe

Hey guys, I have the following issue and am looking for code that would help me solve it.
I have this dataframe (based on the max column of an MSSQL table that functions as an index there; the data is already downloaded and passed to a df):
last_row
39021

And I have the following dataframe, which was created by consuming several CSVs and other sources:

blank_col    random column1    random column2
             asgshg2342342d    testdata1
             asert54363546     testdata2
As you can see, the first column is blank; I need to fill in the index based on the first dataframe, so the final product with the column inserted should look like this:

blank_col    random column1    random column2
39022        asgshg2342342d    testdata1
39023        asert54363546     testdata2
This is the code that I've been trying, and it gives me an error:

last_row_counter = df1.last_row.to_list()
n = int(float(input(last_row)))
df2['column_Id'] = n + 1

So basically I am just inserting last_row + 1 into each row of the second dataframe.
Any assistance with this will be much appreciated. PS: apologies for my English; it is not my first language.
In your case:
df2['column_Id'] = np.arange(len(df2)) + lastrowfromdf1 + 1
You can use squeeze to get the value from a Series when there is only one, then create a range and add it to that number:
df2['blank_col'] = df1['last_row'].squeeze() + np.arange(df2.shape[0]) + 1
Output:
>>> df2
   blank_col  random column1 random column2
0      39022  asgshg2342342d      testdata1
1      39023   asert54363546      testdata2
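A self-contained sketch of that `squeeze` approach, using small made-up frames shaped like the question's data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'last_row': [39021]})  # one-value frame from the MSSQL index
df2 = pd.DataFrame({
    'blank_col': [None, None],
    'random column1': ['asgshg2342342d', 'asert54363546'],
    'random column2': ['testdata1', 'testdata2'],
})

# squeeze() turns the one-element column into a scalar; arange numbers the rows
df2['blank_col'] = df1['last_row'].squeeze() + np.arange(df2.shape[0]) + 1
print(df2['blank_col'].tolist())  # [39022, 39023]
```

Each row thus gets a consecutive id starting one past the last index in df1, which is exactly the "last_row + 1 per row" intent stated in the question.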

How to skip rows in pandas dataframe iteration

So I've created a dataframe called celtics and the last column is called 'Change in W/L%' and is right now filled with all 0s.
I want to calculate the change in Win-Loss Percentage (see 'W/L%' column) if the coach's name in one row of Coaches is different from the name of the coach right underneath that row. I have written this loop to try and execute this program:
i = 0
while i < len(celtics) - 1:
    if celtics["Coaches"].loc[i].split("(")[0] != celtics["Coaches"].loc[i + 1].split("(")[0]:
        celtics["Change in W/L%"].loc[i] = celtics["W/L%"].loc[i] - celtics["W/L%"].loc[i + 1]
        i = i + 1
    i = i + 1
So basically, if the name of the coach in row i is different from the name of the coach in row i+1, the change in W/L% between the two rows is added to row i of the Change in W/L% column. However, when I execute the code, the dataframe ends up looking like this.
For example, row 1 should just have 0 in the Change in W/L% column; instead, it has been replaced by the difference in W/L% between row 1 and row 2, even though the coach's name is the same in both rows. Could anyone help me resolve this issue? Thanks!
Check out this solution from this question here on StackOverflow.
Skip rows while looping over data frame Pandas
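If the explicit loop (and its skip logic) is only there to avoid re-comparing rows, one possible vectorized reading of it compares each coach name to the next row's with `shift(-1)` instead of iterating. The data below is made up; the column names come from the question:

```python
import pandas as pd

celtics = pd.DataFrame({
    'Coaches': ['Smith (10-5)', 'Smith (8-7)', 'Jones (3-12)'],
    'W/L%': [0.667, 0.533, 0.200],
})

# Coach name is the part before the "(" record, as in the question's split
names = celtics['Coaches'].str.split('(').str[0]

# True where the coach differs from the row below (last row has no row below)
changed = (names != names.shift(-1)) & names.shift(-1).notna()

celtics['Change in W/L%'] = 0.0
celtics.loc[changed, 'Change in W/L%'] = celtics['W/L%'] - celtics['W/L%'].shift(-1)
```

Here only row 1 (Smith's last season before Jones) gets a non-zero change; rows where the coach repeats stay at 0, which avoids the overwrite the question describes.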

For Python Pandas, how to implement a "running" check of 2 rows against the previous 2 rows?

[updated with expected outcome]
I'm trying to implement a "running" check where I need the sum and mean of two rows to be more than those of the previous 2 rows.
Referring to the dataframe (copied into a spreadsheet) below: I'm trying to code a function where, if the mean of those two orange cells is more than that of the blue cells, the function returns true for row 8, under a new column called 'Cond11'. The dataframe here is historical, so all rows are available.
Note that the Rows column was added in the spreadsheet; it makes it easier for me to reference the rows here.
I have been using .rolling to refer to the current row + whatever number of rows to refer to, or using shift(1) to refer to the previous row.
df.loc[:, ('Cond9')] = df.n.rolling(4).mean() >= 30
df.loc[:, ('Cond10')] = df.a > df.a.shift(1)
I'm stuck here... how do I compare 2 rows against the previous 2 rows? Please advise!
The 2nd part of this question: I have another function that checks the latest rows in the dataframe for the same condition above. This function is meant to be used in real-time, when new data is streaming into the dataframe and the function is supposed to check the latest rows only.
Can I check if the following code works to detect the same conditions above?
cond11 = candles.n[-2:-1].sum() > candles.n[-4:-3].sum()
I believe this solves your problem:
df.rolling(4).apply(lambda rows: rows[0] + rows[1] < rows[2] + rows[3], raw=True)
The first 3 rows will be NaNs, but you did not define what you would like to happen there. (raw=True hands the lambda a plain array, so the positional indexing works.)
As for the second part: to be able to produce this condition live for new data, you just have to prepend the last 3 rows of your current data and then apply the same process to it:
pd.concat([df[-3:], df])
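Another way to sketch the 2-rows-vs-previous-2 check is with rolling pair sums, which avoids the lambda entirely. The column name `n` comes from the question; the values are made up:

```python
import pandas as pd

df = pd.DataFrame({'n': [10, 20, 15, 25, 30, 5]})

# Sum of each consecutive pair of rows, then compare with the pair two rows back
pair_sum = df['n'].rolling(2).sum()
df['Cond11'] = pair_sum > pair_sum.shift(2)
print(df['Cond11'].tolist())  # [False, False, False, True, True, False]
```

The first three rows come out False because one side of the comparison is still NaN there. Comparing means instead of sums gives the identical result, since both pair means are just the pair sums divided by 2.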

How to return the string of a header based on the max value of a cell in Openpyxl

Good morning guys! Quick question for openpyxl:
I am working with Python, editing an xlsx document and generating various stats. Part of my script generates the max values of a cell range:
temp_list = []
temp_max = []
for row in sheet.iter_rows(min_row=3, min_col=10, max_row=508, max_col=13):
    print(row)
    for cell in row:
        temp_list.append(cell.value)
    print(temp_list)
    temp_max.append(max(temp_list))
    temp_list = []
I would also like to be able to print the header string of the column that contains the max value for the desired cell range. My data structure looks like this:
Any idea on how to do so?
Thanks!
This seems like a typical INDEX/MATCH Excel problem.
Have you tried retrieving the index for the max value in each temp_list?
You can use a function like numpy.argmax() to get the index of your max value within your "temp_list" array, then use this index to locate the header and append the string to a new list called, say, "max_headers" which contains all the header strings in order of appearance.
It would look something like this:

for cell in row:
    temp_list.append(cell.value)
i_max = np.argmax(temp_list)
max_headers.append(sheet.cell(row=1, column=i_max + 1).value)

And so on and so forth. Of course, for that to work, the max_headers list would have to be defined beforehand, and since numpy.argmax is 0-based while openpyxl columns are 1-based, the index needs a + 1 offset (plus the range's starting column, if it does not begin at column 1).
First, thanks Bernardo for the hint. I found a decently working solution but still have a little issue; perhaps someone can be of assistance.
Let me amend my initial statement: here is the code I am working with now:
temp_list = []
headers_list = []
# Index starts at 1 // here we set the rows/columns containing the data to be analyzed
for row in sheet.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32):
    for cell in row:
        temp_list.append(cell.value)
    for cell in row:
        if cell.value == max(temp_list):
            print(str(cell.column))
            print(cell.value)
            print(sheet.cell(row=1, column=cell.column).value)
            headers_list.append(sheet.cell(row=1, column=cell.column).value)
        else:
            print('keep going.')
    temp_list = []
This code works but has a little issue: if a row contains its max value twice (e.g. 25, 9, 25, 8, 9), the loop will print 2 headers instead of one. My question is:
how can I get this loop to take into account only the first match of a max value in a row?
You probably want something like this:

headers = [c for c in next(ws.iter_rows(min_col=27, max_col=32, min_row=1, max_row=1, values_only=True))]
for row in ws.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32, values_only=True):
    mx = max(row)
    idx = row.index(mx)
    col = headers[idx]
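The reason `row.index(mx)` answers the duplicate-max question is that Python's `index` returns only the first occurrence. A plain-Python check, with hypothetical header names and the 25, 9, 25 values from the question:

```python
headers = ('colAA', 'colAB', 'colAC', 'colAD', 'colAE')  # hypothetical header row
row = (25, 9, 25, 8, 9)  # the max, 25, appears at positions 0 and 2

mx = max(row)
idx = row.index(mx)  # .index() stops at the FIRST match, so duplicates are ignored
col = headers[idx]
print(col)  # colAA
```

This is exactly what `values_only=True` buys in the answer above: each `row` is a plain tuple of values, so list/tuple methods like `index` apply directly instead of looping over cell objects.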

Loop for creating csv out of dataframe column index

I want to create a loop which produces multiple CSVs that share the same 9 columns at the beginning but differ iteratively in the last column:
[col1, col2, col3, col4, ..., col9, col[i]]
I have a dataframe with a shape of (20000, 209).
What I want is a loop which does not take too much computation power and resources but creates 200 CSVs that differ in the last column. All the columns exist in one dataframe; the columns which should be added are columns i = [10:-1].
I thought of something like:

for col in df.columns[10:-1]:
    dfi = df[:9]
    dfi.concat(df[10])
    dfi.dropna()
    dfi.to_csv('dfi.csv')

Maybe it is also possible to use
dfi.to_csv('dfi.csv', sequence=[:9, i])
The i should display the number of the added column. Any idea how to make this happen easily? :)
Thanks a lot!
I'm not sure I fully understand what you want, but are you saying that each csv should have just 10 columns: all share the first 9, and then there is one csv for each of the remaining 200 columns?
If so I would go for something as simple as:

base_cols = list(range(9))
for i in range(9, 209):
    df.iloc[:, base_cols + [i]].to_csv('csv{}.csv'.format(i))

Which should work, I think.
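A runnable sketch of that loop on a small dummy frame, writing into a temporary directory (sizes shrunk from the question's 20000×209 to 5×12, i.e. 9 shared columns plus 3 varying ones):

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(5, 12)),
                  columns=[f'col{i}' for i in range(1, 13)])

outdir = tempfile.mkdtemp()
base_cols = list(range(9))            # first 9 columns go into every file
for i in range(9, df.shape[1]):       # one file per remaining column
    out = df.iloc[:, base_cols + [i]]
    out.to_csv(os.path.join(outdir, 'csv{}.csv'.format(i)), index=False)
```

`iloc[:, base_cols + [i]]` is just a column selection, so each iteration writes a view-sized slice rather than copying the whole frame, which keeps the loop cheap even at 20000×209.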
