Drop multiple columns that end with certain string in Pandas - python

I have a dataframe with a lot of columns that use the suffix '_o'. Is there a way to drop all the columns that have '_o' at the end of their labels?
In this post I've seen a way to drop the columns that start with something, using the filter function. But how do I drop the ones that end with something?

Pandonic
df = df.loc[:, ~df.columns.str.endswith('_o')]
df = df[df.columns[~df.columns.str.endswith('_o')]]
List comprehensions
df = df[[x for x in df if not x.endswith('_o')]]
df = df.drop([x for x in df if x.endswith('_o')], axis=1)
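As a quick sanity check, here is a minimal, self-contained sketch (reusing the sample frame from the filter() answer below) showing that the mask and the list comprehension keep the same columns:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'a_o': [2, 3], 'o_b': [4, 5]})

# boolean mask built from the column labels
print(df.loc[:, ~df.columns.str.endswith('_o')].columns.tolist())    # ['a', 'o_b']

# the list comprehension gives the same result
print(df[[x for x in df if not x.endswith('_o')]].columns.tolist())  # ['a', 'o_b']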

To use df.filter() properly here you could use it with a lookbehind:
>>> df = pd.DataFrame({'a': [1, 2], 'a_o': [2, 3], 'o_b': [4, 5]})
>>> df.filter(regex=r'.*(?<!_o)$')
   a  o_b
0  1    4
1  2    5

This can be done by re-assigning the dataframe with only the needed columns:
df = df.iloc[:, [not o.endswith('_o') for o in df.columns]]
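As a quick check, with the same sample frame as in the filter() example above, this keeps only the columns that do not end with '_o':
df = pd.DataFrame({'a': [1, 2], 'a_o': [2, 3], 'o_b': [4, 5]})
df = df.iloc[:, [not o.endswith('_o') for o in df.columns]]
print(df)
#    a  o_b
# 0  1    4
# 1  2    5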

Related

Fastest way of filter index values based on list values from multiple columns in Pandas Dataframe?

I have a data frame like the one below. It has two columns, col1 and col2. From these two columns I want to filter a few values (combinations of two lists) and get their indexes. I wrote the logic for it, but it will be too slow for filtering a larger data frame. Is there a faster way to filter the data and get the list of indexes?
Data frame:-
import pandas as pd
d = {'col1': [11, 20,90,80,30], 'col2': [30, 40,50,60,90]}
df = pd.DataFrame(data=d)
print(df)
   col1  col2
0    11    30
1    20    40
2    90    50
3    80    60
4    30    90
l1=[11,90,30]
l2=[30,50,90]
final_result=[]
for i, j in zip(l1, l2):
    res = df[(df['col1'] == i) & (df['col2'] == j)]
    final_result.append(res.index[0])
print(final_result)
[0, 2, 4]
You can just use the underlying numpy array and create a boolean mask:
import numpy as np

mask = (df[['col1', 'col2']].values[:, None] == np.vstack([l1, l2]).T).all(-1).any(1)
# mask
# array([ True, False, True, False, True])
df.index[mask]
# prints
# Int64Index([0, 2, 4], dtype='int64')
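As a rough breakdown of the broadcasting involved (a sketch using the question's sample data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [11, 20, 90, 80, 30], 'col2': [30, 40, 50, 60, 90]})
l1 = [11, 90, 30]
l2 = [30, 50, 90]

pairs = np.vstack([l1, l2]).T          # shape (3, 2): the wanted (col1, col2) pairs
vals = df[['col1', 'col2']].values     # shape (5, 2): one row per DataFrame row
# vals[:, None] has shape (5, 1, 2); comparing with pairs broadcasts to (5, 3, 2)
eq = vals[:, None] == pairs
# a row matches a pair only if both columns are equal (all over the last axis),
# and it is kept if it matches any of the pairs (any over the pair axis)
mask = eq.all(-1).any(1)
print(df.index[mask].tolist())         # [0, 2, 4]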
You can use:
condition_1 = df['col1'].astype(str).str.contains('|'.join(map(str, l1)))
condition_2 = df['col2'].astype(str).str.contains('|'.join(map(str, l2)))
final_result = df.loc[condition_1 & condition_2].index.to_list()
Here is one way to do it: merge the two DataFrames and keep the rows that exist in both.
# create a DF of the combinations you want to match
df2 = pd.DataFrame({'col1': l1, 'col2': l2})
# merge the two DF
df3 = df.merge(df2, how='left',
               on=['col1', 'col2'], indicator='foundIn')
# keep the rows found in both
out = df3[df3['foundIn'].eq('both')].index.to_list()
out
[0, 2, 4]

How to match multiple columns from two dataframes that have different sizes?

One similar solution is found here, where the asker only has a single dataframe and the requirement was to match a fixed string value:
result = df.loc[(df['Col1'] =='Team2') & (df['Col2']=='Medium'), 'Col3'].values[0]
However, the problem I encountered with the .loc method is that it requires the two dataframes to have the same size, because it only matches values at the same row position in each dataframe. So if the order of the rows is mixed up in either of the dataframes, it will not work as expected.
Sample of this situation is shown below:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 3, 2], 'b': [4, 6, 5]})
Using df1.loc[(df1['a'] == df2['a']) & (df1['b'] == df2['b']), 'Status'] = 'Included' will yield:
But I'm looking for something like this:
I have looked into methods such as .lookup, but it is deprecated as of December 2020 (and it also requires similarly sized dataframes).
Use DataFrame.merge with the indicator parameter to add a new column with this information; if you then need to change the values, use numpy.where:
import numpy as np

df = df1.merge(df2, indicator='status', how='left')
df['status'] = np.where(df['status'].eq('both'), 'included', 'not included')
print(df)
   a  b    status
0  1  4  included
1  2  5  included
2  3  6  included
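In this sample every row of df1 also appears in df2, so every row ends up 'included'. As a small sketch with hypothetical data where one row is missing from df2, the 'not included' branch shows up as well:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 3], 'b': [4, 6]})   # the row (2, 5) is missing here

df = df1.merge(df2, indicator='status', how='left')
df['status'] = np.where(df['status'].eq('both'), 'included', 'not included')
print(df)
#    a  b        status
# 0  1  4      included
# 1  2  5  not included
# 2  3  6      included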

Python Pandas : How to use transform with the DataFrame index?

df = pd.DataFrame([['A',7], ['A',5], ['B',6]], columns = ['group', 'value'])
If I want to keep one row per group, the one having the minimum value, I use:
df[df['value'] == df.groupby('group')['value'].transform('min')]
However, if I want to keep the row with the lowest index, the following does not work:
df[df.index == df.groupby('group').index.transform('min')]
I know I could just use reset_index() and deal with the index as a column, but can I avoid this:
df[df.reset_index()['index'] == df.reset_index().groupby('group')['index'].transform('min')]
You can sort by index (if it's not already sorted) and then take the first row in each group:
df.sort_index().groupby('group').first()
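Note that this returns one row per group indexed by the group label rather than by the original index. A quick sketch with the question's data:
import pandas as pd

df = pd.DataFrame([['A', 7], ['A', 5], ['B', 6]], columns=['group', 'value'])
print(df.sort_index().groupby('group').first())
#        value
# group
# A          7
# B          6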
You could do:
import pandas as pd
df = pd.DataFrame([['A', 7], ['A', 5], ['B', 6]], columns=['group', 'value'])
idxs = df.reset_index().groupby('group')['index'].idxmin()
result = df.loc[idxs]
print(result)
Output
  group  value
0     A      7
2     B      6

Combine columns in different rows in dataframe python

I am trying to join columns in different rows in a dataframe.
import numpy as np
import pandas as pd

tdf = {'ph1': [1, 2], 'ph2': [3, 4], 'ph3': [5, 6], 'ph4': [np.nan, np.nan]}
df = pd.DataFrame(data=tdf)
df
Output:
   ph1  ph2  ph3  ph4
0    1    3    5  NaN
1    2    4    6  NaN
I combined ph1, ph2, ph3, ph4 with the code below:
for idx, row in df.iterrows():
    df = df[['ph1', 'ph2', 'ph3', 'ph4']]
    df["ConcatedPhoneNumbers"] = df.loc[0:].apply(lambda x: ', '.join(x), axis=1)
I got
df["ConcatPhoneNumbers"]
ConcatPhoneNumbers
1,3,5,,
2,4,6,,
Now I need to combine these columns properly using pandas. My expected result is 1,3,5,2,4,6, and the extra commas need to be removed.
I am a new Python learner. I did some research and got this far; please help me find the right approach.
It seems you need stack to remove the NaNs, then convert to int and str, and finally join:
import numpy as np
import pandas as pd

tdf = {'ph1': [1, 2], 'ph2': [3, 4], 'ph3': [5, 6], 'ph4': [np.nan, np.nan]}
df = pd.DataFrame(data=tdf)
cols = ['ph1', 'ph2', 'ph3', 'ph4']
s = ','.join(df[cols].stack().astype(int).astype(str).values.tolist())
print(s)
1,3,5,2,4,6
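As a rough breakdown of the intermediate steps (a sketch on the same data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ph1': [1, 2], 'ph2': [3, 4], 'ph3': [5, 6], 'ph4': [np.nan, np.nan]})
cols = ['ph1', 'ph2', 'ph3', 'ph4']

# stack() drops the NaNs and flattens the frame into a Series, row by row
print(df[cols].stack().tolist())
# [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]

# cast to int to drop the '.0', then to str so ','.join can concatenate
print(','.join(df[cols].stack().astype(int).astype(str)))
# 1,3,5,2,4,6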

Pandas: Creating DataFrame from Series

My current code is shown below - I'm importing a MAT file and trying to create a DataFrame from variables within it:
from scipy.io import loadmat
import pandas as pd

mat = loadmat(file_path)  # load mat-file
Variables = mat.keys()    # identify variable names
df = pd.DataFrame         # Initialise DataFrame
for name in Variables:
    B = mat[name]
    s = pd.Series(B[:, 1])
So within the loop, I can create a series of each variable (they're arrays with two columns - so the values I need are in column 2)
My question is how do I append the series to the dataframe? I've looked through the documentation and none of the examples seem to fit what I'm trying to do.
Here is how to create a DataFrame where each series is a row.
For a single Series (resulting in a single-row DataFrame):
series = pd.Series([1,2], index=['a','b'])
df = pd.DataFrame([series])
For multiple series with identical indices:
cols = ['a','b']
list_of_series = [pd.Series([1,2],index=cols), pd.Series([3,4],index=cols)]
df = pd.DataFrame(list_of_series, columns=cols)
For multiple series with possibly different indices:
list_of_series = [pd.Series([1,2],index=['a','b']), pd.Series([3,4],index=['a','c'])]
df = pd.concat(list_of_series, axis=1).transpose()
To create a DataFrame where each series is a column, see the answers by others. Alternatively, one can create a DataFrame where each series is a row, as above, and then use df.transpose(). However, the latter approach is inefficient if the columns have different data types.
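A small sketch of that caveat: going through a row-wise frame and then transposing upcasts everything to object, even for a series that was purely integer:
import pandas as pd

s_int = pd.Series([1, 2], index=['r0', 'r1'], name='ints')
s_str = pd.Series(['x', 'y'], index=['r0', 'r1'], name='strs')

# build the row-wise frame, then transpose so each series becomes a column
df = pd.DataFrame([s_int, s_str]).transpose()
print(df.dtypes)
# ints    object
# strs    object
# dtype: object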
No need to initialize an empty DataFrame (you weren't even doing that, you'd need pd.DataFrame() with the parens).
Instead, to create a DataFrame where each series is a column,
make a list of Series, series, and
concatenate them horizontally with df = pd.concat(series, axis=1)
Something like:
series = [pd.Series(mat[name][:, 1]) for name in Variables]
df = pd.concat(series, axis=1)
Nowadays there is a pandas.Series.to_frame method:
Series.to_frame(name=NoDefault.no_default)
Convert Series to DataFrame.
Parameters
name : object, optional
    The passed name should substitute for the series name (if it has one).
Returns
DataFrame
    DataFrame representation of Series.
Examples
s = pd.Series(["a", "b", "c"], name="vals")
s.to_frame()
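For reference, the output of the example above, plus the name parameter applied to an unnamed Series (a minimal sketch):
import pandas as pd

s = pd.Series(["a", "b", "c"], name="vals")
print(s.to_frame())
#   vals
# 0    a
# 1    b
# 2    c

# an unnamed Series can be given a column name on the way
print(pd.Series([1, 2]).to_frame(name="col"))
#    col
# 0    1
# 1    2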
I guess another way, possibly faster, to achieve this is:
1) Use a dict comprehension to get the desired dict (i.e., taking the 2nd column of each array).
2) Then use pd.DataFrame to create an instance directly from the dict, without looping over each column and concatenating.
Assuming your mat looks like this (you can ignore this since your mat is loaded from file):
In [135]: mat = {'a': np.random.randint(5, size=(4,2)),
     ...:        'b': np.random.randint(5, size=(4,2))}

In [136]: mat
Out[136]:
{'a': array([[2, 0],
        [3, 4],
        [0, 1],
        [4, 2]]),
 'b': array([[1, 0],
        [1, 1],
        [1, 0],
        [2, 1]])}
Then you can do:
In [137]: df = pd.DataFrame({name: mat[name][:, 1] for name in mat})

In [138]: df
Out[138]:
   a  b
0  0  0
1  4  1
2  1  0
3  2  1

[4 rows x 2 columns]
