Get the column index as a value when value exists - python

I need to create some additional columns in my table, or a separate table, based on the following:
I have a table,
and I need to create additional columns where the column indexes (names of columns) are inserted as values, like this:
How can I do it in pandas? Any ideas?
Thank you.

If you need the matched column names only for values equal to 1:
df = (df.set_index('name')
.eq(1)
.dot(df.columns[1:].astype(str) + ',')
.str.rstrip(',')
.str.split(',', expand=True)
.add_prefix('c')
.reset_index())
print(df)
Explanation:
The idea is to create a boolean mask with True for the values that should be replaced by column names: compare with DataFrame.eq by 1, then use matrix multiplication via DataFrame.dot with all columns except the first, each with an added separator. Then remove the last trailing separator with Series.str.rstrip, split into new columns with Series.str.split, and rename the new columns with DataFrame.add_prefix.
Another solution:
df1 = df.set_index('name').eq(1).apply(lambda x: x.index[x].tolist(), axis=1)
df = pd.DataFrame(df1.values.tolist(), index=df1.index).add_prefix('c').reset_index()
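The dot-product approach above can be sketched on toy data (the sample frame and its column names are assumptions, since the question's table is not shown):

```python
import pandas as pd

# Hypothetical sample data: 1 marks that a name belongs to a group column
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'g1': [1, 0, 1],
                   'g2': [0, 1, 1],
                   'g3': [0, 0, 1]})

# Boolean mask @ column names: True * 'g1,' == 'g1,', False * 'g1,' == ''
out = (df.set_index('name')
         .eq(1)
         .dot(df.columns[1:].astype(str) + ',')
         .str.rstrip(',')
         .str.split(',', expand=True)
         .add_prefix('c')
         .reset_index())
print(out)
```

Row `c` matches every group, so it fills `c0`, `c1`, `c2`; row `a` matches only `g1`, and the remaining cells are filled with None by the split.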


Drop all rows that contain any string from a dataframe in Pandas

I have a dataframe that contains strings in columns that should be only floats. I saw several solutions on how to drop a row with a specific string or parts of it from an individual column.
So for an individual column I suppose one could do it like this
new_df = df[df['Column'].dtypes != object]
But this
new_df = df[df.dtypes != object]
did not work. One could iterate over all columns via a loop, but is there a way to drop the strings for all columns at once?
Use DataFrame.select_dtypes:
#excluding object columns
new_df = df.select_dtypes(exclude=object)
#only floats columns
new_df = df.select_dtypes(include=float)
#only numeric columns
new_df = df.select_dtypes(include=np.number)
EDIT: if you instead need to coerce all values to numeric and drop the rows that contain any non-numeric value:
new_df = df.apply(pd.to_numeric, errors='coerce').dropna()
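Both variants can be sketched on a small frame (the column names and the stray string are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'b' holds a stray string among floats
df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [4.0, 'oops', 6.0]})

# Keep only non-object columns ('b' is object because of the string)
only_numeric_cols = df.select_dtypes(exclude=object)

# Coerce everything to numeric (strings become NaN), then drop those rows
new_df = df.apply(pd.to_numeric, errors='coerce').dropna()
print(new_df)
```

The first form drops the whole offending column; the second keeps the column but drops only the rows where coercion produced NaN.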

How to drop columns which contain specific characters, except one column?

A pandas DataFrame has 5 columns which contain 'verified' in the name. I want to drop all columns which contain 'verified' except the column named 'verified_90'. I am trying the following code, but it removes all columns which contain that specific word.
Column names: verified_30, verified_60, verified_90, verified_365, logo.verified, verified.at, etc.
df = df[df.columns.drop(list(df.filter(regex='Test')))]
You might be able to use a regex approach here:
df = df[df.columns.drop(list(df.filter(regex='^(?!verified_90$).*verified.*$')))]
Filter columns that do not contain verified OR that are named verified_90 with DataFrame.loc; here : means select all rows, and the mask selects the columns:
df.loc[:, ~df.columns.str.contains('verified') | (df.columns == 'verified_90')]
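The boolean-mask variant can be sketched with the column names from the question (the extra non-matching column is an assumption):

```python
import pandas as pd

# Frame with the column names from the question, plus a hypothetical 'other'
df = pd.DataFrame(columns=['verified_30', 'verified_60', 'verified_90',
                           'verified_365', 'logo.verified', 'other'])

# Keep columns whose names lack 'verified', or that are exactly 'verified_90'
out = df.loc[:, ~df.columns.str.contains('verified') | (df.columns == 'verified_90')]
print(out.columns.tolist())
```

Only `verified_90` survives among the 'verified' columns; unrelated columns such as `other` are untouched.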

Deleting columns from a csv if it contains a certain value

I have a csv from which I want to drop the columns which have only '-' values in them. These are the columns I want to drop:
How can I do this?
Use DataFrame.ne to test for values not equal to -, with DataFrame.all to test whether that holds in all rows, and filter by DataFrame.loc - the first : means all rows and the second part is the mask for filtering columns:
df = df.loc[:, df.ne('-').all()]
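A minimal sketch of the filter (the sample frame is an assumption, since the csv is not shown); note that as written the mask keeps only columns where every value differs from '-':

```python
import pandas as pd

# Hypothetical frame: column 'b' contains only '-' placeholders
df = pd.DataFrame({'a': [1, 2], 'b': ['-', '-']})

# Keep columns where every value is not '-'
out = df.loc[:, df.ne('-').all()]
print(out.columns.tolist())
```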

replace/change duplicate columns values where column name is same but values are different, then drop duplicate columns

Is there any way to drop duplicate columns, but replace their values depending on conditions? For example,
in the table below, I would like to remove the duplicate/second A and B columns, but replace the values of the primary A and B (1st and 2nd columns) where the value is 0 but is 1 in the duplicate columns.
Ex - in the 3rd row, where A and B have value 0, they should be replaced with the 1 from their respective duplicate columns.
Input Data:
Output Data:
This is an example of the problem I'm working on; my real data has around 200 columns, so I'm hoping to find an optimal solution without hardcoding column names for removal.
Use DataFrame.any per duplicated column name if the columns contain only 1 and 0 values:
df = df.any(axis=1, level=0).astype(int)
Or, if you need the maximal value per duplicated column name:
df = df.max(axis=1, level=0)
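A sketch with duplicated column names (the sample values are assumptions). On recent pandas versions the level argument of reductions like max was removed, so an equivalent transpose/groupby form is used here:

```python
import pandas as pd

# Hypothetical frame with duplicated column names A and B
df = pd.DataFrame([[1, 0, 0, 1],
                   [0, 0, 1, 1],
                   [0, 1, 0, 0]],
                  columns=['A', 'B', 'A', 'B'])

# Maximum per duplicated column name: transpose, group rows by name, max,
# then transpose back (works where df.max(axis=1, level=0) is unavailable)
out = df.T.groupby(level=0).max().T
print(out)
```

With 0/1 data this is the same result as the any-based variant, since the maximum of a group is 1 exactly when any value in it is 1.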

Create a new dataframe with only duplicated rows

I would like to have a new dataframe with only the rows that are duplicated in the previous df.
I tried to assign a new column that is true if there are duplicates and then select only the rows that are true. However, I got 0 entities, and I am sure that I have duplicates in the df.
I want to keep the first rows in the old dataframe and remove all the other duplicates.
The column with duplicate values is called 'merged'.
df = df.assign(
    is_duplicate=lambda d: d.duplicated()
).sort_values('merged').reset_index(drop=True)
df2 = df.loc[df['is_duplicate'] == 'True']
They are not strings, they are booleans, so use:
df2 = df.loc[df['is_duplicate']]
I think you need boolean indexing; loc is unnecessary here:
df[df.duplicated()]
Also, your solution cannot be combined with .reset_index(drop=True) that way, because then different rows are filtered; sorting is better done before or after the whole solution:
df = df.assign(is_duplicate= lambda d: d.duplicated())
df2= df[df['is_duplicate']]
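The boolean-indexing fix can be sketched on toy data (the sample values in 'merged' are assumptions):

```python
import pandas as pd

# Hypothetical frame where 'x' appears twice
df = pd.DataFrame({'merged': ['x', 'y', 'x', 'z']})

# duplicated() is False for the first occurrence, True for later repeats,
# so this keeps exactly the second-and-later duplicate rows
dups = df[df.duplicated()]

# And to keep the first rows in the original frame, drop the repeats
first_only = df.drop_duplicates()
print(dups)
```

Comparing `df['is_duplicate'] == 'True'` fails because the column holds booleans, not the string 'True', so that mask is False everywhere and selects 0 rows.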
