df = pd.DataFrame([['A',7], ['A',5], ['B',6]], columns = ['group', 'value'])
If I want to keep one row per group, the one with the minimum value, I use:
df[df['value'] == df.groupby('group')['value'].transform('min')]
However, if I want to keep the row with the lowest index, the following does not work:
df[df.index == df.groupby('group').index.transform('min')]
I know I could just use reset_index() and treat the index as a column, but can I avoid this:
df[df.reset_index()['index'] == df.reset_index().groupby('group')['index'].transform('min')]
You can sort by index (if it's not already sorted) and then take the first row in each group:
df.sort_index().groupby('group').first()
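For example, on the question's df, first() returns one row per group with 'group' moved into the index; if you'd rather keep the original index labels, head(1) after sorting does that too:
import pandas as pd
df = pd.DataFrame([['A', 7], ['A', 5], ['B', 6]], columns=['group', 'value'])
print(df.sort_index().groupby('group').first())
#        value
# group
# A          7
# B          6
print(df.sort_index().groupby('group').head(1))
#   group  value
# 0     A      7
# 2     B      6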
You could do:
import pandas as pd
df = pd.DataFrame([['A', 7], ['A', 5], ['B', 6]], columns=['group', 'value'])
idxs = df.reset_index().groupby('group')['index'].idxmin()
result = df.loc[idxs]
print(result)
Output
  group  value
0     A      7
2     B      6
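If you'd rather avoid reset_index altogether, one possible sketch (a judgment call, not the only way) is to group the index itself:
# Group the index values by 'group' and take the smallest label per group
first_idx = df.index.to_series().groupby(df['group']).min()
print(df.loc[first_idx])
#   group  value
# 0     A      7
# 2     B      6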
I have a dataset that I need to add rows to based on conditions. Rows can be added anywhere within the dataset, i.e., middle, top, or bottom.
I have 26 columns in the data but will only use a few to set conditions.
I want my code to go through each row and check if a column named "potveg" has the value 4, 8 or 9. If true, add a row below it, set the 'col' and 'lat' column values to those of the last row, and set the values of the 'icohort' and 'isrccohort' columns to those of the last row + 1. Then export the new data frame to CSV. I have tried several implementations based on this logic: Pandas: Conditionally insert rows into DataFrame while iterating through rows in the middle
PS: New to Python and Pandas.
Here is the code I have so far:
for index, row in df.iterrows():
    last_row = df.iloc[index-1]
    next_row = df.iloc[index]
    new_row = {
        'col': last_row.col,
        'row': last_row.row,
        'tmpvarname': last_row.tmpvarname,
        'year': last_row.year,
        'icohort': next_row.icohort,
        'isrccohort': next_row.icohort,
        'standage': 3000,
        'chrtarea': 0,
        'potveg': 13,
        'currentveg': 13,
        'subtype': 13,
        'agstate': 0,
        'agprevstate': 0,
        'tillflag': 0,
        'fertflag': 0,
        'irrgflag': 0,
        'disturbflag': 0,
        'disturbmonth': 0,
        'FRI': 2000,
        'slashpar': 0,
        'vconvert': 0,
        'prod10par': 0,
        'prod100par': 0,
        'vrespar': 0,
        'sconvert': 0,
        'tmpregion': last_row.tmpregion
    }
    new_row = {k: v for k, v in new_row.items()}
    if df.iloc[index]['potveg'] == 4:
        newdata = df.append(new_row, ignore_index=True)
Following the steps you suggested, you could write something like:
df = pd.DataFrame({'id': [1, 2, 4, 5], 'before': [1, 2, 4, 5], 'after': [1, 2, 4, 5]})
new_df = pd.DataFrame()
for i, row in df.iterrows():
    new_df = pd.concat([new_df, row.to_frame().T])
    if row['id'] == 2:
        # add our new row: `before` comes from the current row, `after` from the following row
        temp = pd.DataFrame({'id': [3], 'before': [df.loc[i, 'before']], 'after': [df.loc[i + 1, 'after']]})
        new_df = pd.concat([new_df, temp])
You might want to consider approaching the problem without iterating over the dataframe, as iteration can be quite slow on a large dataset. I'd suggest checking the apply function; a non-iterative sketch also follows the expected output below.
You should expect new_df to have:
id  before  after
 1       1      1
 2       2      2
 3       2      4
 4       4      4
 5       5      5
With a row with id 3 added after the row with id 2.
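As a rough non-iterative sketch of that idea (the helper column _key is hypothetical, used only to interleave the inserted rows, and a default RangeIndex is assumed so that matches.index + 1 addresses the following row):
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 4, 5], 'before': [1, 2, 4, 5], 'after': [1, 2, 4, 5]})
# Integer sort keys for existing rows; each inserted row gets its source row's key + 0.5
df['_key'] = range(len(df))
matches = df[df['id'] == 2]
new_rows = pd.DataFrame({'id': [3],
                         'before': matches['before'].to_numpy(),
                         'after': df.loc[matches.index + 1, 'after'].to_numpy(),
                         '_key': matches['_key'].to_numpy() + 0.5})
# Stitch everything together in key order, then drop the helper column
new_df = (pd.concat([df, new_rows])
          .sort_values('_key')
          .drop(columns='_key')
          .reset_index(drop=True))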
Inserting rows at a specific position can be done this way:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 4, 5], 'col2': ['A', 'B', 'D', 'E']})
new_row = pd.DataFrame({'col1': [3], 'col2': ['C']})
idx_pos = 2
pd.concat([df.iloc[:idx_pos], new_row, df.iloc[idx_pos:]]).reset_index(drop=True)
Output:
   col1 col2
0     1    A
1     2    B
2     3    C
3     4    D
4     5    E
I have been trying to build a preprocessing pipeline, but I am struggling a little to generate a list of the indexes for each column that is an object dtype. I have been able to get the names of each into an array using the following code:
categorical_features = [col for col in input.columns if input[col].dtype == 'object']
Is there an easy way to get the index of these columns, from the original input dataframe into a list, like this one that I built manually?
c = [1,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,28,29,
30,31,38,39,40,41,42,43,44,45,50,51,55,56]
Use df.select_dtypes + df.columns.get_indexer:
categorical_features = df.columns.get_indexer(df.select_dtypes('object').columns)
df.select_dtypes returns a copy of df with only the columns that are of the specified dtype(s) (you can specify multiple, e.g. df.select_dtypes(['object', 'int'])).
df.columns.get_indexer returns the indexes of the specified columns.
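For instance, on a small hypothetical frame mirroring the question's mix of dtypes:
import pandas as pd
df = pd.DataFrame({'A': ['x', 'y'], 'B': [1, 2], 'C': ['u', 'v']})
# Positions of the object-dtype columns within the original column order
print(df.columns.get_indexer(df.select_dtypes('object').columns))
# [0 2]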
I think you need select_dtypes and enumerate (enumerating the original columns, not just the selected ones, so the positions refer to the original dataframe):
df = pd.DataFrame({'A': ['A', 'B', 'C'], 'B': [1, 2, 3], 'C': [1, '2', '3']})
print(df)
   A  B  C
0  A  1  1
1  B  2  2
2  C  3  3
object_cols = df.select_dtypes('object').columns
idx_cols = [idx for idx, col in enumerate(df.columns) if col in object_cols]
[0, 2]
enumerate can help with that:
categorical_features_indexes = [i for i, col in enumerate(input.columns) if input[col].dtype == 'object']
We have a dataframe:
  column1 column2
0       A       h
1       B       l
2       C       p
and a list:
li = ['p', 'l']
How can I use the list to look up values in column2 and return the corresponding column1 values in that order?
The desired result:
np.array(['C', 'B'])
I tried this, but it does not preserve the order:
df.loc[df['column2'].isin(li)]['column1'].values
Set column2 as the index and use loc:
df.set_index('column2').loc[li, 'column1'].values
# array(['C', 'B'], dtype=object)
You can do something like:
df_new = df[df['column2'].isin(li)]
df_new.index = [li.index(x) for x in df_new['column2']]
df_new = df_new.sort_index()
df_new['column1'].values
# array(['C', 'B'], dtype=object)
If you are using pandas 1.x, you can also use the key argument of the sort_values function, as in the sketch below.
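A minimal sketch, assuming pandas >= 1.1 (where the key parameter was added); the order dict is an assumed helper:
# order maps each value in li to its desired position
order = {v: i for i, v in enumerate(li)}
result = (df[df['column2'].isin(li)]
          .sort_values('column2', key=lambda s: s.map(order))
          ['column1'].values)
# array(['C', 'B'], dtype=object)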
I have a dataframe with a lot of columns using the suffix '_o'. Is there a way to drop all the columns whose labels end with '_o'?
In this post I've seen a way to drop the columns that start with something using the filter function. But how do I drop the ones that end with something?
Pandonic
df = df.loc[:, ~df.columns.str.endswith('_o')]
df = df[df.columns[~df.columns.str.endswith('_o')]]
List comprehensions
df = df[[x for x in df if not x.endswith('_o')]]
df = df.drop(columns=[x for x in df if x.endswith('_o')])
To use df.filter() properly here you could use it with a lookbehind:
>>> df = pd.DataFrame({'a': [1, 2], 'a_o': [2, 3], 'o_b': [4, 5]})
>>> df.filter(regex=r'.*(?<!_o)$')
a o_b
0 1 4
1 2 5
This can be done by re-assigning the dataframe with only the needed columns
df = df.iloc[:, [not o.endswith('_o') for o in df.columns]]
I have an object whose type is Pandas, and print(object) gives the output below:
print(type(recomen_total))
print(recomen_total)
Output is
<class 'pandas.core.frame.Pandas'>
Pandas(Index=12, instrument_1='XXXXXX', instrument_2='XXXX', trade_strategy='XXX', earliest_timestamp='2016-08-02T10:00:00+0530', latest_timestamp='2016-08-02T10:00:00+0530', xy_signal_count=1)
I want to convert this object to a pd.DataFrame. How can I do it?
I tried pd.DataFrame(object) and also from_dict; both throw errors.
Interestingly, it will not convert to a DataFrame directly, but it will convert to a Series. Once you have the Series, use its to_frame method to convert it to a DataFrame:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])

for row in df.itertuples():
    print(pd.Series(row).to_frame())
Hope this helps!!
EDIT
In case you want to keep the column names, use the _asdict() method like this:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])

for row in df.itertuples():
    d = dict(row._asdict())
    print(pd.Series(d).to_frame())
Output:
         0
Index    a
col1     1
col2   0.1
         0
Index    b
col1     2
col2   0.2
To create a new DataFrame from an itertuples() namedtuple, you can use list() or Series too:
import pandas as pd

# source DataFrame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# empty DataFrame
df_new_fromAppend = pd.DataFrame(columns=['x', 'y'], data=None)

for r in df.itertuples():
    # create a new DataFrame from itertuples() via list() ([1:] skips the index):
    df_new_fromList = pd.DataFrame([list(r)[1:]], columns=['c', 'd'])
    # or create a new DataFrame from itertuples() via Series (drop(0) removes the index, T transposes the column to a row)
    df_new_fromSeries = pd.DataFrame(pd.Series(r).drop(0)).T
    # or insert the row into an existing DataFrame via .loc ([1:] skips the index):
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)[1:]

print('df_new_fromList:')
print(df_new_fromList, '\n')
print('df_new_fromSeries:')
print(df_new_fromSeries, '\n')
print('df_new_fromAppend:')
print(df_new_fromAppend, '\n')
Output:
df_new_fromList:
   c  d
0  2  4

df_new_fromSeries:
   1  2
0  2  4

df_new_fromAppend:
   x  y
0  1  3
1  2  4
To omit the index, use the param index=False (but I mostly need the index for the iteration):
for r in df.itertuples(index=False):
    # the [1:] isn't needed then, for example:
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)
The following works for me:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

for row in df.itertuples():
    row_as_df = pd.DataFrame.from_records([row], columns=row._fields)
    print(row_as_df)
The result is:
  Index  col1  col2
0     a     1   0.1
  Index  col1  col2
0     b     2   0.2
Sadly, AFAIU, there's no simple way to keep the column names without explicitly using "protected attributes" such as _fields.
With some tweaks to #Igor's answer, I arrived at this satisfactory code, which preserves the column names and uses as little pandas-specific code as possible.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]})
# Or initialize another dataframe above

# Get the list of column names
column_names = df.columns.values.tolist()

filtered_rows = []
for row in df.itertuples(index=False):
    # Some code logic to filter rows
    filtered_rows.append(row)

# Combine the filtered pandas.core.frame.Pandas rows into a single DataFrame
concatenated_df = pd.DataFrame.from_records(filtered_rows, columns=column_names)
concatenated_df.to_csv("path_to_csv", index=False)
The result is a CSV containing:
col1,col2
1,0.1
2,0.2
To convert a list of objects returned by Pandas .itertuples to a DataFrame, while preserving the column names:
# Example source DF
data = [['cheetah', 120], ['human', 44.72], ['dragonfly', 54]]
source_df = pd.DataFrame(data, columns=['animal', 'top_speed'])
      animal  top_speed
0    cheetah     120.00
1      human      44.72
2  dragonfly      54.00
Since Pandas does not recommend building DataFrames by adding single rows in a for loop, we will iterate and build the DataFrame at the end:
WOW_THAT_IS_FAST = 50
list_ = list()
for animal in source_df.itertuples(index=False, name='animal'):
    if animal.top_speed > WOW_THAT_IS_FAST:
        list_.append(animal)
Now build the DF in a single command and without manually recreating the column names.
filtered_df = pd.DataFrame(list_)
      animal  top_speed
0    cheetah     120.00
1  dragonfly      54.00