How to get the column name when iterating through dataframe pandas? - python

I only want the column name when iterating using
for index, row in df.iterrows()

When iterating over a dataframe using df.iterrows:
for i, row in df.iterrows():
...
Each row row is converted to a Series, where row.index corresponds to df.columns, and row.values corresponds to df.loc[i].values, the column values at row i.
Minimal Code Sample
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])
df
A B
a 1 3
b 2 4
row = None
for i, row in df.iterrows():
print(row['A'], row['B'])
# 1 3
# 2 4
row # outside the loop, `row` holds the last row
A 2
B 4
Name: b, dtype: int64
row.index
# Index(['A', 'B'], dtype='object')
row.index.equals(df.columns)
# True
row.index[0]
# A

You are already getting to column name, so if you just want to drop the series you can just use the throwaway _ variable when starting the loop.
for column_name, _ in df.iteritems():
# do something
However, I don't really understand the use case. You could just iterate over the column names directly:
for column in df.columns:
# do something

when we use for index, row in df.iterrows() the right answer is row.index[i] to get the cloumn name, for example:
pdf = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
pdf.head(5)
A B C D
0 3 1 2 6
1 5 8 7 3
2 7 2 2 5
3 0 9 9 4
4 1 8 1 4
for index, row in pdf[:3].iterrows():# we check only 3 rows in the dataframe
for i in range(4):
if row[i] > 7 :
print(row.index[i]) #then the answer is B
B

Related

Conditionally insert rows in the middle of dataframe using pandas

I have a dataset that I need to add rows based on conditions. Rows can be added anywhere within the dataset. i.e., middle, top, and bottom.
I have 26 columns in the data but will only use a few to set conditions.
I want my code to go through each row and check if a column named "potveg" has values 4,8 or 9. If true, add a row below it and set 'col,' 'lat' column values similar to those of the last row, and set the values of columns 'icohort' and 'isrccohort' to those of the last row + 1. Then export the new data frame to CSV. I have tried several implementations based on this logic: Pandas: Conditionally insert rows into DataFrame while iterating through rows in the middle
PS* New to Python and Pandas
Here is the code I have so far:
for index, row in df.iterrows():
last_row = df.iloc[index-1]
next_row = df.iloc[index]
new_row = {
'col':last_row.col,
'row':last_row.row,
'tmpvarname':last_row.tmpvarname,
'year':last_row.year,
'icohort':next_row.icohort,
'isrccohort':next_row.icohort,
'standage':3000,
'chrtarea':0,
'potveg':13,
'currentveg':13,
'subtype':13,
'agstate':0,
'agprevstate':0,
'tillflag':0,
'fertflag':0,
'irrgflag':0,
'disturbflag':0,
'disturbmonth':0,
'FRI':2000,
'slashpar':0,
'vconvert':0,
'prod10par':0,
'prod100par':0,
'vrespar':0,
'sconvert':0,
'tmpregion':last_row.tmpregion
}
new_row = {k:v for k,v in new_row.items()}
if (df.iloc[index]['potveg'] == 4):
newdata =df.append(new_row, ignore_index=True)
Following the steps you suggested, you could write something like:
df = pd.DataFrame({'id':[1,2,4,5], 'before': [1,2,4,5], 'after': [1,2,4,5]})
new_df = pd.DataFrame()
for i, row in df.iterrows():
new_df = pd.concat([new_df, pd.DataFrame(row.to_frame().transpose())])
if row['id'] == 2:
# add our new row, with data for `col` before coming from the previous row, and `after` coming from the following row
temp = pd.DataFrame({'id': [3], 'before': [df.loc[i]['before']], 'after': [df.loc[i+1]['after']]})
new_df = pd.concat([new_df, pd.DataFrame(temp)])
You might need to consider exploring how you could approach the problem without iterating over the dataframe as this might be quite slow if you have a large dataset. I'd suggest checking the apply function.
You should expect new_df to have:
id before after
1 1 1
2 2 2
3 2 4
4 4 4
5 5 5
With a row with id 3 added after the row with id 2.
Inserting rows at a specific position can be done this way:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 4, 5], 'col2': ['A', 'B', 'D', 'E']})
new_row = pd.DataFrame({'col1': [3], 'col2': ['C']})
idx_pos = 2
pd.concat([df.iloc[:idx_pos], new_row, df.iloc[idx_pos:]]).reset_index(drop=True)
Output:
col1 col2
0 1 A
1 2 B
2 3 C
3 4 D
4 5 E

Drop every nth column in pandas dataframe

i have a pandas dataframe where the columns are named like:
0,1,2,3,4,.....,n
i would like to drop every 3rd column so that i get a new dataframe where i would have the columns like:
0,1,3,4,6,7,9,.....,n
I have tried like this:
shape = df.shape[1]
for i in range(2,shape,3):
df = df.drop(df.columns[i], axis=1)
but i get an error saying index is out of bound and i assume this happens because the shape of the dataframe changes when i am dropping the columns. if i just don't store the output of the "for" loop, then the code works but i don't get my new dataframe.
How do i solve this?
Thanks
The issue with code is, each time you drop a column in your loop, you end up with a different set of columns because you overwrite the df back after each iteration. When you try to drop the next 3rd column of THAT new set of columns, you not only drop the wrong one, you end up running out of columns eventually. That's why you get the error you are getting.
iter1 -> 0,1,3,4,5,6,7,8,9,10 ... n #first you drop 2 which is 3rd col
iter2 -> 0,1,3,4,5,7,8,9,10 ... n #next you drop 6 which is 6th col (should be 5)
iter3 -> 0,1,3,4,5,7,8,9, ... n #next you drop 10 which is 9th col (should be 8)
What you want to do is calculate the indexes beforehand and then remove them in one go.
You can simply just get the indexes of columns you want to remove with range and then drop those.
drop_idx = list(range(2,df.shape[1],3)) #Indexes to drop
df2 = df.drop(drop_idx, axis=1) #Drop them at once over axis=1
print('old columns->', list(df.columns))
print('idx to drop->', drop_idx)
print('new columns->',list(df2.columns))
old columns-> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
idx to drop-> [2, 5, 8]
new columns-> [0, 1, 3, 4, 6, 7, 9]
Note: This works only because your columns names are same as indexes. If however, your column names are not like that, you will have to do an extra step of fetching the column names based on the index you want to drop.
drop_idx = list(range(2,df.shape[1],3))
drop_cols = [j for i,j in enumerate(df.columns) if i in drop_idx] #<--
df2 = df.drop(drop_cols, axis=1)
Here is solution with inverted logic - select all columns with removed each 3rd column.
You can filter values by compare added 1 to helper array, with 3 modulo compare for not equal 0 and pass to DataFrame.loc:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
df = df.loc[:, (np.arange(len(df.columns)) + 1) % 3 != 0]
print (df)
A B D E
0 a 4 1 5
1 b 5 3 3
2 c 4 5 6
3 d 5 7 9
4 e 5 1 2
5 f 4 0 4
You can use list comprehension to filter columns:
df = df[[k for k in df.columns if (k + 1) % 3 != 0]]
If the names are different (e.g. strings) and you want to discard every 3rd column regardless of its name, then:
df = df[[k for i, k in enumerate(df.columns, 1) if i % 3 != 0]]

How to remove all rows that have values above the 99th percentile for certain columns in pandas efficiently?

I have a pandas dataframe
a b c d e f
0 0.025641 0.554686 0.988809 0.176905 0.050028 0.333333
1 0.027151 0.520914 0.985590 0.409572 0.163980 0.424242
2 0.028788 0.478810 0.970480 0.288557 0.095053 0.939394
3 0.018692 0.450573 0.985910 0.178048 0.118399 0.484848
4 0.023256 0.787253 0.865287 0.217591 0.205670 0.303030
And a list of columns
cols_list = ['a', 'd', 'f']
I want to filter out all rows which have have values above the 99th percentile for all of these columns.
I could do something like:
for col in cols_list:
df[f'q_{col}'] = df[col].quantile([0.99]).values[0]
for col in cols_list:
df = df[df[col] <= df[f'q_{col}']]
Is there a more efficient way to do this ?
You can use the operator le to compare the dataframe with the quantiles, then use all/any to check for values along the rows:
valids = df[cols_list].le(df[cols_list].quantile(0.99)).all(1)
df[valids]
Output:
a b c d e f
0 0.025641 0.554686 0.988809 0.176905 0.050028 0.333333
3 0.018692 0.450573 0.985910 0.178048 0.118399 0.484848
4 0.023256 0.787253 0.865287 0.217591 0.205670 0.303030

reshape a pandas dataframe

suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like:
what does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
of course I could loop over the data and make a new list of list but there must be a better way. Any ideas ?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that it must have an identification variable, i, unlike melt. reset_index is used to create a this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape, for column id numpy.repeat:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len (df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, but it is possible it might be removed(with pd.wide_to_long too).
Possible solution is merging all 3 functions to one - maybe melt, but now it is not implementated. Maybe in some new version of pandas. Then my answer will be updated.
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that will be added below (and that was used to make df2.
Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(["A1", "B1"], 1)
# step 3: append
df = df.append(df2, ignore_index=True)
Note how when you do df.append() you need to specify ignore_index=True so the new columns get appended to the index rather than keep their old index.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])

How to iterate over DataFrame and generate a new DataFrame

I have a data frame looks like this:
P Q L
1 2 3
2 3
4 5 6,7
The objective is to check if there is any value in L, if yes, extract the value on L and P column:
P L
1 3
4,6
4,7
Note there might more than one values in L, in the case of more than 1 value, I would need two rows.
Bellow is my current script, it cannot generate the expected result.
df2 = []
ego
other
newrow = []
for item in data_DF.iterrows():
if item[1]["L"] is not None:
ego = item[1]['P']
other = item[1]['L']
newrow = ego + other + "\n"
df2.append(newrow)
data_DF2 = pd.DataFrame(df2)
First, you can extract all rows of the L and P columns where L is not missing like so:
df2 = df[~pd.isnull(df.L)].loc[:, ['P', 'L']].set_index('P')
Next, you can deal with the multiple values in some of the remaining L rows as follows:
df2 = df2.L.str.split(',', expand=True).stack()
df2 = df2.reset_index().drop('level_1', axis=1).rename(columns={0: 'L'}).dropna()
df2.L = df2.L.str.strip()
To explain: with P as index, the code splits the string content of the L column on ',' and distributes the individual elements across various columns. It then stacks the various new columns into a single new column, and cleans up the result.
First I extract multiple values of column L to new dataframe s with duplicity index from original index. Remove unnecessary columns L and Q. Then output join to original df and drop rows with NaN values.
print df
P Q L
0 1 2 3
1 2 3 NaN
2 4 5 6,7
s = df['L'].str.split(',').apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'L'
print s
0 3
2 6
2 7
Name: L, dtype: object
df = df.drop( ['L', 'Q'], axis=1)
df = df.join(s)
print df
P L
0 1 3
1 2 NaN
2 4 6
2 4 7
df = df.dropna().reset_index(drop=True)
print df
P L
0 1 3
1 4 6
2 4 7
I was solving a similar issue when I needed to create a new dataframe as a subset of a larger dataframe. Here's how I went about generating the second dataframe:
import pandas as pd
df2 = pd.DataFrame(columns=['column1','column2'])
for i, row in df1.iterrows():
if row['company_id'] == 12345 or row['company_id'] == 56789:
df2 = df2.append(row, ignore_index = True)

Categories

Resources