I have a pandas DataFrame df:
import pandas as pd
# Create a Pandas dataframe from some data.
df = pd.DataFrame({'Var1': ['d', 'a --> b', 'e', 'c --> d'],
'Var2': ['a', 'e', 'a --> b', 'd'],
'Var3': ['c', 'd', 'a --> b', 'e']})
Which looks like this when printed (for reference):
| | Var1 | Var2 | Var3 |
|---|---------|---------|---------|
| 0 | d | a | c |
| 1 | a --> b | e | d |
| 2 | e | a --> b | a --> b |
| 3 | c --> d | d | e |
I would like to keep just rows 1, 2 and 3, which contain the value '-->'. In other words, I want to drop every row in my dataframe that doesn't contain '-->' in at least one column.
I know how to filter a single column: df[df['Var1'].str.contains('-->', regex=False)] gives me rows 1 and 3.
But I don't know how to apply this across all columns. I read some similar cases here and here, but couldn't figure out how to adapt them to my case.
Can you suggest a way to select those rows?
Combine all columns into one and search for the substring:
df[df.sum(axis=1).str.contains('-->')]
#      Var1     Var2     Var3
#1  a --> b        e        d
#2        e  a --> b  a --> b
#3  c --> d        d        e
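This works because summing string columns row-wise concatenates them. If you would rather not rely on that behavior, a minimal equivalent (a sketch, not part of the original answer) joins each row explicitly, with a separator so a match can't straddle two adjacent cells:
# Join each row's cells with '|', then search the joined string.
mask = df.apply('|'.join, axis=1).str.contains('-->', regex=False)
df[mask]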
You can filter them out using apply with any:
df1 = df[df.apply(lambda x: x.str.contains('-->').any(), axis=1)]
print(df1)
The output of this will be:
Original DataFrame:
      Var1     Var2     Var3
0        d        a        c
1  a --> b        e        d
2        e  a --> b  a --> b
3  c --> d        d        e
df1: contains only rows with '-->'
      Var1     Var2     Var3
1  a --> b        e        d
2        e  a --> b  a --> b
3  c --> d        d        e
Try .stack() with a boolean index:
s = df.stack().str.contains('-->').reset_index(level=1, drop=True)
df.loc[s[s].index.unique()]
      Var1     Var2     Var3
1  a --> b        e        d
2        e  a --> b  a --> b
3  c --> d        d        e
Say I have a dataframe with 7 columns. I'm only interested in columns A and B. Column B contains numerical values.
What I want to do is select only columns A and B, after applying some mathematical operation f to B. The SQL equivalent of what I mean is:
SELECT A, f(B)
FROM df;
I know that I can select just columns A and B by doing df[['A', 'B']]. Also, I can just add another column f_B saying: df['f_B'] = f(df['B']), and then select df[['A', 'f_B']].
However, is there a way of doing it without adding an extra column? What if f is as simple as dividing by 100?
EDIT: I do not want to use pandasql
EDIT2: Sharing sample input and expected output:
Input:
A | B | C | D
--------------
a | 1 | c | d
b | 2 | c | d
c | 3 | c | d
d | 4 | c | d
Expected output (only columns A and B required), assuming f is multiply by 2:
A | B
-----
a | 2
b | 4
c | 6
d | 8
First you take only the columns you need:
df = df[['A', 'B']] # replace the original df with a smaller one
new_df = df[['A', 'B']] # or allocate a new space
You can simply do:
df.B = df.B / 10
Using lambda:
df.B = df.B.apply(lambda value: value / 10)
For more complicated cases:
def f(value):
    # some logic
    result = value ** 2
    return result
df.B = df.B.apply(f)
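If you want a single expression that leaves the original frame untouched, DataFrame.assign handles the selection and the transformation in one go; a minimal sketch, taking f to be multiply by 2 as in the question:
result = df[['A', 'B']].assign(B=lambda x: x['B'] * 2)  # SELECT A, f(B) FROM df
print(result)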
A while ago I asked this question.
But that does not cover the case where two merged categories share a common category.
In that question I wanted to merge the categories A and B into AB. What if I have categories A, B, C and I want to merge A, B into AB, and B, C into BC?
Suppose I have the data:
+---+---+
| X | Y |
+---+---+
| A | D |
| B | D |
| B | E |
| B | D |
| A | E |
| C | D |
| C | E |
| B | E |
+---+---+
I want the cross-tab to look like:
+--------+---+---+
| X/Y | D | E |
+--------+---+---+
| A or B | 3 | 3 |
| B or C | 3 | 3 |
| C | 1 | 1 |
+--------+---+---+
I think you can build a crosstab over all unique values and then sum the rows selected by category labels:
df = pd.crosstab(df.X, df.Y)
df.loc['A or B'] = df.loc[['A','B']].sum()
df.loc['B or C'] = df.loc[['C','B']].sum()
df = df.drop(['A','B'])
print(df)
Y D E
X
C 1 1
A or B 3 3
B or C 3 3
EDIT: If you want a general solution it is not easy, because rows that belong to overlapping groups have to be repeated under each group name, like:
df1 = df[df['X'] == 'B'].assign(X = 'B or C')
df2 = df[df['X'] == 'C']
df = pd.concat([df, df1], ignore_index=True)
df['X'] = df['X'].replace({'A':'A or B', 'B': 'A or B', 'C': 'B or C'})
df = pd.concat([df, df2], ignore_index=True)
df = pd.crosstab(df.X, df.Y)
print (df)
Y D E
X
A or B 3 3
B or C 3 3
C 1 1
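A more uniform way to express the general case (a sketch of my own, assuming pandas 0.25+ for explode) is to map each category to every group it belongs to and let explode duplicate the rows before the crosstab:
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'B', 'B', 'A', 'C', 'C', 'B'],
                   'Y': ['D', 'D', 'E', 'D', 'E', 'D', 'E', 'E']})

# Each category lists all the groups it belongs to.
groups = {'A': ['A or B'], 'B': ['A or B', 'B or C'], 'C': ['B or C', 'C']}

# Map X to its list of groups, duplicate rows with explode, then crosstab.
exploded = df.assign(X=df['X'].map(groups)).explode('X').reset_index(drop=True)
print(pd.crosstab(exploded['X'], exploded['Y']))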
I have a dictionary d:
d = dict({'foo': ['a', 'b', 'c'], 'bar': ['d', 'e', 'f']})
How would I get a dataframe that looks like this:
+-----+--------+
| Key | Values |
+-----+--------+
| foo | a |
+-----+--------+
| foo | b |
+-----+--------+
| foo | c |
+-----+--------+
| bar | d |
+-----+--------+
| bar | e |
+-----+--------+
| bar | f |
+-----+--------+
This doesn't answer my question:
Dictionary with values as lists to pandas dataframe frame
You can try this:
df = pd.DataFrame(d).stack().sort_values()
df
#Out[2037]:
#0 foo a
#1 foo b
#2 foo c
#0 bar d
#1 bar e
#2 bar f
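Note that stack returns a Series with a MultiIndex, not yet the two-column frame from the question. One way to finish the job (an addition, assuming the Key/Values headers from the question) is to drop the numeric level and reset the index:
s = pd.DataFrame(d).stack().sort_values()
out = s.reset_index(level=0, drop=True).rename_axis('Key').reset_index(name='Values')
print(out)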
In pandas 0.25 and later you can use explode:
pd.Series(d).explode().reset_index()
Out[114]:
index 0
0 foo a
1 foo b
2 foo c
3 bar d
4 bar e
5 bar f
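The columns come back named index and 0; to match the headers in the question, rename them (a small addition to the answer above):
out = pd.Series(d).explode().reset_index().rename(columns={'index': 'Key', 0: 'Values'})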
You could do:
import pandas as pd
d = {'foo': ['a', 'b', 'c'], 'bar': ['d', 'e', 'f']}
df = pd.DataFrame([[key, value] for key, values in d.items() for value in values], columns=['keys', 'values'])
print(df)
Output
keys values
0 foo a
1 foo b
2 foo c
3 bar d
4 bar e
5 bar f
As an alternative you could use explode:
df = pd.DataFrame({'keys': list(d.keys()), 'values': list(d.values())}).explode('values').reset_index(drop=True)
print(df)
Output
keys values
0 foo a
1 foo b
2 foo c
3 bar d
4 bar e
5 bar f
I have a CSV file with survey data. One of the columns contains responses from a multi-select question. The values in that column are separated by ";"
| Q10 |
----------------
| A; B; C |
| A; B; D |
| A; D |
| A; D; E |
| B; C; D; E |
I want to split the column into multiple columns, one for each option:
| A | B | C | D | E |
---------------------
| A | B | C | | |
| A | B | | D | |
| A | | | D | |
| A | | | D | E |
| | B | C | D | E |
Is there any way to do this in Excel or Python, or some other way?
Here is a simple formula that does what is asked:
=IF(ISNUMBER(SEARCH("; "&B$1&";","; "&$A2&";")),B$1,"")
This assumes there is always a space between the ; and the lookup value. If not, we can remove the spaces with SUBSTITUTE:
=IF(ISNUMBER(SEARCH(";"&B$1&";",";"&SUBSTITUTE($A2," ","")&";")),B$1,"")
I know this question has been answered, but for those looking for a Python way to solve it, here it is (maybe not the most efficient way, though):
First split the column values, explode them and get the dummies. Next, group the dummy values back together across the given 5 (or N) columns:
df['Q10'] = df['Q10'].str.split('; ')
df = df.explode('Q10')
df = pd.get_dummies(df, columns=['Q10'])
dummy_col_list = df.columns.tolist()
df['New'] = df.index
new_df = df.groupby('New')[dummy_col_list].sum().reset_index()
del new_df['New']
You will get:
Q10_A Q10_B Q10_C Q10_D Q10_E
0 1 1 1 0 0
1 1 1 0 1 0
2 1 0 0 1 0
3 1 0 0 1 1
4 0 1 1 1 1
Now, if you want, you can rename the columns and replace each 1 with the column name:
import numpy as np

colName = new_df.columns.tolist()
newColList = []
for i in colName:
    newColName = i.split('_', 1)[1]
    newColList.append(newColName)
new_df.columns = newColList
for col in list(new_df.columns):
    new_df[col] = np.where(new_df[col] == 1, col, '')
Final output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
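For what it's worth, Series.str.get_dummies can build the same indicator table in a single step, so the split/explode/groupby round trip isn't needed; a minimal sketch of that route:
import pandas as pd

df = pd.DataFrame({'Q10': ['A; B; C', 'A; B; D', 'A; D', 'A; D; E', 'B; C; D; E']})

# One-hot encode the multi-select answers; sep='; ' matches the delimiter.
dummies = df['Q10'].str.get_dummies(sep='; ')

# Replace each 1 with its column letter and each 0 with ''.
out = dummies.apply(lambda s: s.map({1: s.name, 0: ''}))
print(out)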
If you want to do the job in Python:
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv')
df['A'] = np.where(df.Q10.str.contains('A'), 'A', '')
df['B'] = np.where(df.Q10.str.contains('B'), 'B', '')
df['C'] = np.where(df.Q10.str.contains('C'), 'C', '')
df['D'] = np.where(df.Q10.str.contains('D'), 'D', '')
df['E'] = np.where(df.Q10.str.contains('E'), 'E', '')
df.drop('Q10', axis=1, inplace=True)
df
Output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
It's not the most efficient way, but it works ;)
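The five near-identical assignments also collapse into a loop over the option letters (same logic, just shorter):
for letter in ['A', 'B', 'C', 'D', 'E']:
    df[letter] = np.where(df.Q10.str.contains(letter, regex=False), letter, '')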
I am still in a learning phase in Python and wanted to know how to roll up the data and count the duplicate rows in a new column called Count.
The data frame structure is as follows:
Col1| Value
A | 1
B | 1
A | 1
B | 1
C | 3
C | 3
C | 3
C | 3
My result should be as follows:
Col1|Value|Count
A | 1 | 2
B | 1 | 2
C | 3 | 4
>>> df2 = df.groupby(['Col1', 'Value']).size().reset_index()
>>> df2.columns = ['Col1', 'Value', 'Count']
>>> df2
Col1 Value Count
0 A 1 2
1 B 1 2
2 C 3 4
Roman Pekar's fine answer is correct for this case. However, I saw it after trying to write a solution for the general case stated in the text of your question, not just the example with specific column names. So, for the general case, consider:
df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
For example:
import pandas as pd
df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'b', 'c'], 'Value': [1, 2, 1, 3, 2]})
>>> df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
Col1 Value Count
0 a 1 2
1 a 2 1
2 b 3 1
3 c 2 1
You can also try:
df.groupby('Col1')['Value'].value_counts().reset_index(name='Count')
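On pandas 1.1 and later, DataFrame.value_counts covers the general case even more directly (a newer alternative, not part of the original answers); note it sorts by count descending, so re-sort if you need key order:
df.value_counts().reset_index(name='Count')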