I have a pandas DataFrame df:
import pandas as pd
# Create a Pandas dataframe from some data.
df = pd.DataFrame({'Var1': ['d', 'a --> b', 'e', 'c --> d'],
'Var2': ['a', 'e', 'a --> b', 'd'],
'Var3': ['c', 'd', 'a --> b', 'e']})
Which looks like this when printed (for reference):
| | Var1 | Var2 | Var3 |
|---|---------|---------|---------|
| 0 | d | a | c |
| 1 | a --> b | e | d |
| 2 | e | a --> b | a --> b |
| 3 | c --> d | d | e |
I would like to keep just rows 1, 2 and 3, which contain the value '-->'. In other words, I want to drop every row in my dataframe that doesn't contain '-->' in at least one column.
I know how to filter a single column: df[df['Var1'].str.contains('-->', regex=False)] gives me rows 1 and 3.
But I don't know how to apply this across all columns. I read some similar cases here and here, but couldn't figure out how to adapt them to my case.
Can you suggest a way to select those rows?
Combine all columns into one and search for the substring:
df[df.sum(axis=1).str.contains('-->')]
#      Var1     Var2     Var3
#1  a --> b        e        d
#2        e  a --> b  a --> b
#3  c --> d        d        e
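This works because summing string columns row-wise concatenates them. If you would rather not rely on that behavior, a minimal equivalent (a sketch, not part of the original answer) joins each row explicitly, with a separator so a match can't straddle two adjacent cells:
# Join each row's cells with '|', then search the joined string.
mask = df.apply('|'.join, axis=1).str.contains('-->', regex=False)
df[mask]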
You can filter them out using apply with any:
df1 = df[df.apply(lambda x: x.str.contains('-->').any(), axis=1)]
print(df1)
The output of this will be:
Original DataFrame:
      Var1     Var2     Var3
0        d        a        c
1  a --> b        e        d
2        e  a --> b  a --> b
3  c --> d        d        e
df1: contains only rows with '-->'
      Var1     Var2     Var3
1  a --> b        e        d
2        e  a --> b  a --> b
3  c --> d        d        e
Try .stack() with a boolean index:
s = df.stack().str.contains('-->').reset_index(level=1, drop=True)
df.loc[s[s].index.unique()]
      Var1     Var2     Var3
1  a --> b        e        d
2        e  a --> b  a --> b
3  c --> d        d        e
Say I have a dataframe with 7 columns. I'm only interested in columns A and B. Column B contains numerical values.
What I want to do is select only columns A and B, after applying some mathematical operation f to B. The SQL equivalent of what I mean is:
SELECT A, f(B)
FROM df;
I know that I can select just columns A and B by doing df[['A', 'B']]. Also, I can just add another column f_B saying: df['f_B'] = f(df['B']), and then select df[['A', 'f_B']].
However, is there a way of doing it without adding an extra column? What if f is as simple as dividing by 100?
EDIT: I do not want to use pandasql
EDIT2: Sharing sample input and expected output:
Input:
A | B | C | D
--------------
a | 1 | c | d
b | 2 | c | d
c | 3 | c | d
d | 4 | c | d
Expected output (only columns A and B required), assuming f is multiply by 2:
A | B
-----
a | 2
b | 4
c | 6
d | 8
First you take only the columns you need:
df = df[['A', 'B']] # replace the original df with a smaller one
new_df = df[['A', 'B']] # or allocate a new space
You can simply do:
df.B = df.B / 10
Using lambda:
df.B = df.B.apply(lambda value: value / 10)
For more complicated cases:
def f(value):
    # some logic
    result = value ** 2
    return result
df.B = df.B.apply(f)
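If you want a single expression that leaves the original frame untouched, DataFrame.assign handles the selection and the transformation in one go; a minimal sketch, taking f to be multiply by 2 as in the question:
result = df[['A', 'B']].assign(B=lambda x: x['B'] * 2)  # SELECT A, f(B) FROM df
print(result)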
A while ago I asked this question.
But that does not cover the case where two merged categories share a common category.
In that question I wanted to merge the categories A and B into AB. What if I have categories A, B, C and I want to merge A, B into AB, and B, C into BC?
Suppose I have the data:
+---+---+
| X | Y |
+---+---+
| A | D |
| B | D |
| B | E |
| B | D |
| A | E |
| C | D |
| C | E |
| B | E |
+---+---+
I want the cross-tab to look like:
+--------+---+---+
| X/Y | D | E |
+--------+---+---+
| A or B | 3 | 3 |
| B or C | 3 | 3 |
| C | 1 | 1 |
+--------+---+---+
I think you can build a crosstab over all unique values and then sum the rows selected by category labels:
df = pd.crosstab(df.X, df.Y)
df.loc['A or B'] = df.loc[['A','B']].sum()
df.loc['B or C'] = df.loc[['C','B']].sum()
df = df.drop(['A','B'])
print(df)
Y D E
X
C 1 1
A or B 3 3
B or C 3 3
EDIT: If you want a general solution it is not easy, because rows that belong to overlapping groups have to be repeated under each group name, like:
df1 = df[df['X'] == 'B'].assign(X = 'B or C')
df2 = df[df['X'] == 'C']
df = pd.concat([df, df1], ignore_index=True)
df['X'] = df['X'].replace({'A':'A or B', 'B': 'A or B', 'C': 'B or C'})
df = pd.concat([df, df2], ignore_index=True)
df = pd.crosstab(df.X, df.Y)
print (df)
Y D E
X
A or B 3 3
B or C 3 3
C 1 1
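A more uniform way to express the general case (a sketch of my own, assuming pandas 0.25+ for explode) is to map each category to every group it belongs to and let explode duplicate the rows before the crosstab:
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'B', 'B', 'A', 'C', 'C', 'B'],
                   'Y': ['D', 'D', 'E', 'D', 'E', 'D', 'E', 'E']})

# Each category lists all the groups it belongs to.
groups = {'A': ['A or B'], 'B': ['A or B', 'B or C'], 'C': ['B or C', 'C']}

# Map X to its list of groups, duplicate rows with explode, then crosstab.
exploded = df.assign(X=df['X'].map(groups)).explode('X').reset_index(drop=True)
print(pd.crosstab(exploded['X'], exploded['Y']))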
I have a dictionary d:
d = dict({'foo': ['a', 'b', 'c'], 'bar': ['d', 'e', 'f']})
How would I get a dataframe that looks like this:
+-----+--------+
| Key | Values |
+-----+--------+
| foo | a |
+-----+--------+
| foo | b |
+-----+--------+
| foo | c |
+-----+--------+
| bar | d |
+-----+--------+
| bar | e |
+-----+--------+
| bar | f |
+-----+--------+
This doesn't answer my question:
Dictionary with values as lists to pandas dataframe frame
You can try this:
df = pd.DataFrame(d).stack().sort_values()
df
#Out[2037]:
#0 foo a
#1 foo b
#2 foo c
#0 bar d
#1 bar e
#2 bar f
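Note that stack returns a Series with a MultiIndex, not yet the two-column frame from the question. One way to finish the job (an addition, assuming the Key/Values headers from the question) is to drop the numeric level and reset the index:
s = pd.DataFrame(d).stack().sort_values()
out = s.reset_index(level=0, drop=True).rename_axis('Key').reset_index(name='Values')
print(out)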
In pandas 0.25 and later you can use explode:
pd.Series(d).explode().reset_index()
Out[114]:
index 0
0 foo a
1 foo b
2 foo c
3 bar d
4 bar e
5 bar f
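The columns come back named index and 0; to match the headers in the question, rename them (a small addition to the answer above):
out = pd.Series(d).explode().reset_index().rename(columns={'index': 'Key', 0: 'Values'})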
You could do:
import pandas as pd
d = {'foo': ['a', 'b', 'c'], 'bar': ['d', 'e', 'f']}
df = pd.DataFrame([[key, value] for key, values in d.items() for value in values], columns=['keys', 'values'])
print(df)
Output
keys values
0 foo a
1 foo b
2 foo c
3 bar d
4 bar e
5 bar f
As an alternative you could use explode:
df = pd.DataFrame({'keys': list(d.keys()), 'values': list(d.values())}).explode('values').reset_index(drop=True)
print(df)
Output
keys values
0 foo a
1 foo b
2 foo c
3 bar d
4 bar e
5 bar f
I have a CSV file with survey data. One of the columns contains responses from a multi-select question. The values in that column are separated by ";"
| Q10 |
----------------
| A; B; C |
| A; B; D |
| A; D |
| A; D; E |
| B; C; D; E |
I want to split the column into multiple columns, one for each option:
| A | B | C | D | E |
---------------------
| A | B | C | | |
| A | B | | D | |
| A | | | D | |
| A | | | D | E |
| | B | C | D | E |
Is there any way to do this in Excel or Python, or some other way?
Here is a simple formula that does what is asked:
=IF(ISNUMBER(SEARCH("; "&B$1&";","; "&$A2&";")),B$1,"")
This assumes there is always a space between the ; and the lookup value. If not, we can remove the spaces with SUBSTITUTE:
=IF(ISNUMBER(SEARCH(";"&B$1&";",";"&SUBSTITUTE($A2," ","")&";")),B$1,"")
I know this question has been answered, but for those looking for a Python way to solve it, here it is (maybe not the most efficient way, though):
First split the column values, explode them and get the dummies. Next, group the dummy values back together across the given 5 (or N) columns:
df['Q10'] = df['Q10'].str.split('; ')
df = df.explode('Q10')
df = pd.get_dummies(df, columns=['Q10'])
dummy_col_list = df.columns.tolist()
df['New'] = df.index
new_df = df.groupby('New')[dummy_col_list].sum().reset_index()
del new_df['New']
You will get:
Q10_A Q10_B Q10_C Q10_D Q10_E
0 1 1 1 0 0
1 1 1 0 1 0
2 1 0 0 1 0
3 1 0 0 1 1
4 0 1 1 1 1
Now, if you want, you can rename the columns and replace each 1 with the column name:
import numpy as np

colName = new_df.columns.tolist()
newColList = []
for i in colName:
    newColName = i.split('_', 1)[1]
    newColList.append(newColName)
new_df.columns = newColList
for col in list(new_df.columns):
    new_df[col] = np.where(new_df[col] == 1, col, '')
Final output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
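For what it's worth, Series.str.get_dummies can build the same indicator table in a single step, so the split/explode/groupby round trip isn't needed; a minimal sketch of that route:
import pandas as pd

df = pd.DataFrame({'Q10': ['A; B; C', 'A; B; D', 'A; D', 'A; D; E', 'B; C; D; E']})

# One-hot encode the multi-select answers; sep='; ' matches the delimiter.
dummies = df['Q10'].str.get_dummies(sep='; ')

# Replace each 1 with its column letter and each 0 with ''.
out = dummies.apply(lambda s: s.map({1: s.name, 0: ''}))
print(out)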
If you want to do the job in Python:
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv')
df['A'] = np.where(df.Q10.str.contains('A'), 'A', '')
df['B'] = np.where(df.Q10.str.contains('B'), 'B', '')
df['C'] = np.where(df.Q10.str.contains('C'), 'C', '')
df['D'] = np.where(df.Q10.str.contains('D'), 'D', '')
df['E'] = np.where(df.Q10.str.contains('E'), 'E', '')
df.drop('Q10', axis=1, inplace=True)
df
Output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
It's not the most efficient way, but it works ;)
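The five near-identical assignments also collapse into a loop over the option letters (same logic, just shorter):
for letter in ['A', 'B', 'C', 'D', 'E']:
    df[letter] = np.where(df.Q10.str.contains(letter, regex=False), letter, '')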
I am still in a learning phase in Python and wanted to know how to roll up the data and count the duplicate rows in a new column called Count.
The data frame structure is as follows:
Col1| Value
A | 1
B | 1
A | 1
B | 1
C | 3
C | 3
C | 3
C | 3
My result should be as follows:
Col1|Value|Count
A | 1 | 2
B | 1 | 2
C | 3 | 4
>>> df2 = df.groupby(['Col1', 'Value']).size().reset_index()
>>> df2.columns = ['Col1', 'Value', 'Count']
>>> df2
Col1 Value Count
0 A 1 2
1 B 1 2
2 C 3 4
Roman Pekar's fine answer is correct for this case. However, I saw it after trying to write a solution for the general case stated in the text of your question, not just the example with specific column names. So, for the general case, consider:
df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
For example:
import pandas as pd
df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'b', 'c'], 'Value': [1, 2, 1, 3, 2]})
>>> df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
Col1 Value Count
0 a 1 2
1 a 2 1
2 b 3 1
3 c 2 1
You can also try:
df.groupby('Col1')['Value'].value_counts().reset_index(name='Count')
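On pandas 1.1 and later, DataFrame.value_counts covers the general case even more directly (a newer alternative, not part of the original answers); note it sorts by count descending, so re-sort if you need key order:
df.value_counts().reset_index(name='Count')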