How to check which columns contain a certain value in a pandas.DataFrame? - python

I have a DataFrame in Python like the one below:
df = pd.DataFrame({"col1" : ["a", "b", "c"], "col2" : ["a", "d", "e"], "col3" : ["r", "t" , "g"]})
I would like to find which columns contain the value "a" (here, "col1" and "col2"). How can I check this?

(df=='a').any()
col1 True
col2 True
col3 False

If you need the column names as a list, compare all values with DataFrame.eq, use DataFrame.any to test for at least one True (match) per column, and then filter the column names:
c = df.columns[df.eq('a').any()].tolist()
print (c)
['col1', 'col2']
If you need to filter the matching columns into a new DataFrame, use DataFrame.loc:
df1 = df.loc[:, df.eq('a').any()]
print (df1)
col1 col2
0 a a
1 b d
2 c e
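The same idea can be wrapped in a small function and extended to report the exact cells that match. This is only a sketch: the helper name columns_containing is hypothetical, and the use of numpy.where to list the matching (row, column) pairs is an addition that does not appear in the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "c"],
                   "col2": ["a", "d", "e"],
                   "col3": ["r", "t", "g"]})


def columns_containing(frame, value):
    """Hypothetical helper: names of the columns where `value` occurs at least once."""
    return frame.columns[frame.eq(value).any()].tolist()


cols = columns_containing(df, "a")

# Positions of every matching cell, as (row label, column name) pairs.
row_pos, col_pos = np.where(df.eq("a").to_numpy())
locations = list(zip(df.index[row_pos], df.columns[col_pos]))

print(cols)
print(locations)
```

Searching for a value that is absent simply returns empty lists, so no special casing is needed.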

Related

How to repeat a row n times inside an iterrows loop

For each row of a dataframe, I want to repeat the row n times inside an iterrows loop and collect the result in a new dataframe. Basically I'm doing this:
df = pd.DataFrame(
    [
        ("abcd", "abcd", "abcd")  # create your data here, be consistent in the types.
    ],
    ["A", "B", "C"]  # add your column names here
)
n_times = 2
for index, row in df.iterrows():
    new_df = row.loc[row.index.repeat(n_times)]
new_df
and I get the following output:
0 abcd
0 abcd
1 abcd
1 abcd
2 abcd
2 abcd
Name: C, dtype: object
while it should be:
A B C
0 abcd abcd abcd
1 abcd abcd abcd
How should I proceed to get the desired output?
The df.T attribute in Pandas is used to transpose a DataFrame. Transposing a DataFrame means to flip its rows and columns, so that the rows become columns and the columns become rows.
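I don't think you defined your df the right way: the second positional argument of pd.DataFrame is the index, so the column names should be passed with columns=.
df = pd.DataFrame(data=[["abcd", "abcd", "abcd"]],
                  columns=["A", "B", "C"])
n_times = 2
# Stack n_times copies of df and renumber the rows from 0
new_df = pd.concat([df] * n_times, axis=0, ignore_index=True)
Is that what it should look like?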
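An alternative that avoids both iterrows and repeated concatenation is to repeat the index labels and select them with loc. This is a minimal sketch under the assumption that every row should be repeated the same number of times; it is not taken from the answer above.

```python
import pandas as pd

df = pd.DataFrame([["abcd", "abcd", "abcd"]], columns=["A", "B", "C"])
n_times = 2

# Index.repeat duplicates each label n_times; loc then selects the rows
# once per duplicated label, and reset_index renumbers them from 0.
new_df = df.loc[df.index.repeat(n_times)].reset_index(drop=True)
print(new_df)
```

With more than one row in df, each row is repeated in place, so the copies of a row stay next to each other.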

Add empty row after column that contains specific text

I have a large dataframe where I need to add an empty row after any instance where colA contains a colon.
To be honest I have absolutely no clue how to do this, my guess is that a function/ for loop needs to be written but I have had no luck...
I think this is what you are looking for. Suppose you have a dataframe like this:
df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})
# wherever cola is ":", a new empty row is appended after it
idx = [0] + (df[df.cola.str.match(':')].index + 1).tolist()
df1 = pd.DataFrame()
for i in range(len(idx) - 1):
    df1 = pd.concat([df1, df.iloc[idx[i]: idx[i+1]]], ignore_index=True)
    df1.loc[len(df1)] = ""
df1 = pd.concat([df1, df.iloc[idx[-1]:]], ignore_index=True)
print(df1)
# df1 is the result dataframe; this also handles a colon in the last row of the dataframe
Resulting dataframe:
cola
0 a
1 b
2 :
3
4 c
5 d
6 :
7
8 e
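For a large dataframe, growing df1 inside a loop can be slow. A loop-free sketch of the same result is shown below. It assumes a default integer index, uses str.contains (a colon anywhere in the value) rather than str.match (a colon at the start), and relies on a stable sort so that each blank row lands directly after the row it belongs to.

```python
import pandas as pd

df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})

# Rows whose cola value contains a literal colon.
mask = df["cola"].str.contains(":", regex=False)

# One blank row per match, carrying the same index label as the matching row.
blank = pd.DataFrame({"cola": [""] * int(mask.sum())}, index=df.index[mask])

# mergesort is stable, so for equal labels the original row stays ahead of its blank row.
result = (
    pd.concat([df, blank])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(result)
```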

Opposite of factorize function (Map numeric to categorical values)

I am searching for a way to map some numeric columns to categorical features.
All columns are of categorical nature but are represented as integers. However I need them to be a "String".
e.g.
col1 col2 col3 -> col1new col2new col3new
0 1 1 -> "0" "1" "1"
2 2 3 -> "2" "2" "3"
1 3 2 -> "1" "3" "2"
It does not matter what kind of String the new column contains as long as all distinct values from the original data set map to the same new String value.
Any ideas?
I have a NumPy representation of my data right now, but any pandas solution would also be helpful.
Thanks a lot!
You can use the applymap method. Consider the following example:
df = pd.DataFrame({'col1': [0, 2, 1], 'col2': [1, 2, 3], 'col3': [1, 3, 2]})
df.applymap(str)
col1 col2 col3
0 0 1 1
1 2 2 3
2 1 3 2
You can convert all elements of col1, col2, and col3 to str using the following command:
df = df.applymap(str)
You can also change the type of the elements column by column with the apply method that pandas offers. Starting from a frame such as:
frame = pd.DataFrame(np.random.randint(0, 90, size =(5000000, 3)), columns =['col1', 'col2', 'col3'])
you can define the new columns and their values in a new dataframe:
updated_frame = pd.DataFrame(np.random.randint(0, 90, size =(5000000, 3)), columns =['col1new', 'col2new', 'col3new'])
updated_frame['col1new'] = frame['col1'].apply(str)
updated_frame['col2new'] = frame['col2'].apply(str)
updated_frame['col3new'] = frame['col3'].apply(str)
You could use the .astype method. If you want to replace all the current columns with a string version, you could do the following (where df is your dataframe):
df = df.astype(str)
If you want to add the string columns as new ones:
df = df.assign(**{f"{col}new": df[col].astype(str) for col in df.columns})
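The approaches above can be compared side by side. The sketch below uses the small example frame from the first answer. One detail to be aware of: in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map, so the element-wise variant picks whichever method is available.

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 2, 1], "col2": [1, 2, 3], "col3": [1, 3, 2]})

# Whole-frame conversion: every integer becomes its string form.
as_text = df.astype(str)

# Element-wise conversion; DataFrame.map replaces applymap in pandas >= 2.1.
if hasattr(pd.DataFrame, "map"):
    elementwise = df.map(str)
else:
    elementwise = df.applymap(str)

# Keep the integer columns and add string copies next to them.
with_new = df.assign(**{f"{col}new": df[col].astype(str) for col in df.columns})

print(as_text["col1"].tolist())
print(with_new.columns.tolist())
```

Both conversions map equal integers to equal strings, which is the only requirement stated in the question.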

Manipulate multiindex column in pivot_table

I have seen this question asked multiple times, but the solutions from other questions did not work!
I have a data frame like this:
df = pd.DataFrame({
"date": ["20180920"] * 3 + ["20180921"] * 3,
"id": ["A12","A123","A1234","A12345","A123456","A0"],
"mean": [1,2,3,4,5,6],
"std" :[7,8,9,10,11,12],
"test": ["a", "b", "c", "d", "e", "f"],
"result": [70, 90, 110, "(-)", "(+)", 0.3],})
using pivot_table
df_sum_table = (pd.pivot_table(df,index=['id'], columns = ['date'], values = ['mean','std']))
I got
df_sum_table.columns
MultiIndex([('mean', '20180920'),
('mean', '20180921'),
( 'std', '20180920'),
( 'std', '20180921')],
names=[None, 'date'])
So I wanted to shift the date column one row below and remove the id row, but keep the id name there.
I tried following these past solutions:
ValueError when trying to have multi-index in DataFrame.pivot
Removing index name from df created with pivot_table()
Resetting index to flat after pivot_table in pandas
pandas pivot_table keep index
df_sum_table = (pd.pivot_table(df,index=['id'], columns = ['date'], values = ['mean','std'])).reset_index().rename_axis(None, axis=1)
but I am getting this error:
TypeError: Must pass list-like as names.
How can I remove date but keep id in the first column?
The desired output
#jezrael
Try with rename_axis:
df = df.pivot_table(index=['id'], columns = ['date'], values = ['mean', 'std']).rename_axis(columns={'date': None}).fillna('').reset_index().T.reset_index(level=1).T.reset_index(drop=True).reset_index(drop=True)
df.index = df.pop('id').replace('', 'id').tolist()
print(df)
Output:
mean mean std std
id 20180920 20180921 20180920 20180921
A0 6 12
A12 1 7
A123 2 8
A1234 3 9
A12345 4 10
A123456 5 11
You could use rename_axis and rename the specific column axis name with dictionary mapping. I specify the columns argument for column axis name mapping.
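If a two-row header is not strictly required, a simpler option is to flatten the MultiIndex columns into single names. This is a different technique from the rename_axis chain above, shown only as a sketch; the "stat_date" naming pattern is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20180920"] * 3 + ["20180921"] * 3,
    "id": ["A12", "A123", "A1234", "A12345", "A123456", "A0"],
    "mean": [1, 2, 3, 4, 5, 6],
    "std": [7, 8, 9, 10, 11, 12],
})

table = df.pivot_table(index="id", columns="date", values=["mean", "std"])

# Each column label is a (statistic, date) tuple; join the two parts into one string.
table.columns = [f"{stat}_{date}" for stat, date in table.columns]

# Move id out of the index so it becomes the first ordinary column.
table = table.reset_index()
print(table.columns.tolist())
```

Assigning a plain list to table.columns also drops the "date" axis name, so no rename_axis call is needed here.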

Pandas: Calculate mean leaving out own row's value

I want to calculate means by group, leaving out the value of the row itself.
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
I know how to return means by group:
df.groupby('col1').agg({'col2': 'mean'})
Which returns:
Out[247]:
col1 col2
1 a 4
3 a -5
5 a 4
But what I want is mean by group, leaving out the row's value. E.g. for the first row:
df.query('col1 == "a"')[1:4].mean()
which returns:
Out[251]:
col2 1.0
dtype: float64
Edit:
Expected output is a dataframe of the same format as df above, with a column mean_excl_own which is the mean across all other members in the group, excluding the row's own value.
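You could group by col1 and use transform to get each group's sum and size. Subtracting the row's own value from the group sum and dividing by the group size minus one gives the mean of the other rows in the group:
g = df.groupby('col1').col2
df['mean_excl_own'] = g.transform('sum').sub(df.col2).div(g.transform('count').sub(1))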
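As a sanity check, the leave-one-out mean can be computed as (group sum - own value) / (group size - 1) and compared with a slow row-by-row calculation. This is a sketch on the data from the question: the column name mean_excl_own comes from the question's edit, and the brute-force loop is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b", "a", "b", "a"],
                   "col2": [0, 4, 3, -5, 3, 4]})

grouped = df.groupby("col1")["col2"]

# transform returns one value per original row, so row order is preserved.
df["mean_excl_own"] = (grouped.transform("sum") - df["col2"]) / (
    grouped.transform("count") - 1
)

# Brute force: for each row, average col2 over the other rows of the same group.
brute_force = [
    df.loc[(df["col1"] == group) & (df.index != idx), "col2"].mean()
    for idx, group in df["col1"].items()
]

print(df)
```

For the first row (group "a", value 0) both calculations average 4, -5 and 4, which gives 1.0 and agrees with the value the question computes by hand.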
Thanks for all your input. I ended up using the approach linked to by #VnC.
Here's how I solved it:
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
group_summary = df.groupby('col1', as_index=False)['col2'].agg(['mean', 'count'])
df = pd.merge(df, group_summary, on = 'col1')
# group total (count * mean) minus the row's own value
df['other_sum'] = df['count'] * df['mean'] - df['col2']
df['result'] = df['other_sum'] / (df['count'] - 1)
Check out the final result:
df['result']
Which prints:
Out:
0 1.000000
1 -0.333333
2 2.666667
3 -0.333333
4 3.000000
5 3.000000
Name: result, dtype: float64
Edit: I previously had some trouble with column names, but I fixed it using this answer.
