I have a dataframe table:
Test results | Make
P | BMW
F | VW
F | VW
P | VW
P | VW
P | VW
I want to group by both make and test result and output a count, keeping both original columns, something like this:
Test results | Make | count
P | BMW | 1
F | VW | 2
P | VW | 3
I am currently doing:
pass_rates = df.groupby(['Test Results','Make']).size()
but it collapses make and test result into the index, while I need them to stay as regular columns.
You can add reset_index with the parameter name:
name : object, default None
The name of the column corresponding to the Series values.
pass_rates = df.groupby(['Test Results','Make']).size().reset_index(name='count')
print(pass_rates)
Test Results Make count
0 F VW 2
1 P BMW 1
2 P VW 3
If you want to disable sorting, add the parameter sort=False to groupby:
sort : boolean, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group.
pass_rates = df.groupby(['Test Results','Make'], sort=False).size().reset_index(name='count')
print(pass_rates)
Test Results Make count
0 P BMW 1
1 F VW 2
2 P VW 3
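As a side note, in recent pandas versions (1.1+) you can get the same result without reset_index by keeping the grouping keys as columns. A sketch; note the count column arrives named 'size' unless you rename it:

```python
import pandas as pd

df = pd.DataFrame({'Test Results': ['P', 'F', 'F', 'P', 'P', 'P'],
                   'Make': ['BMW', 'VW', 'VW', 'VW', 'VW', 'VW']})

# as_index=False keeps the grouping keys as regular columns;
# the count arrives in a column named 'size', renamed here to 'count'
pass_rates = (df.groupby(['Test Results', 'Make'], as_index=False)
                .size()
                .rename(columns={'size': 'count'}))
print(pass_rates)
```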
I have two dataframes, and I want to flag rows in the second one when the first one contains a matching pattern. A: very large (tens of thousands of rows).
date | items
20100605 | apple is red
20110606 | orange is orange
20120607 | apple is green
B: shorter with a few hundred rows.
id | color
123 | is Red
234 | not orange
235 | is green
The result would be to flag each row in B when its pattern is found in A, possibly adding a column to B like
B:
id | color | found
123 | is Red | true
234 | not orange | false
235 | is green | true
I was thinking of something like dfB['found'] = dfB['color'].isin(dfA['items']), but I don't see any way to ignore case. Also, with this approach it will flip values that are already True back to False; I don't want to change those which are already set True. I also believe it's inefficient to loop over large dataframes more than once. Running through A once and marking B would be a better way, but I'm not sure how to achieve that using isin(). Any other ways, especially ones ignoring the case of the pattern?
You can use something like this:
df2['check'] = df2['color'].apply(lambda x: any(x.casefold() in i.casefold() for i in df['items']))
or you can use str.contains, building one case-insensitive pattern from the second and third words of each item:
df2['check'] = df2['color'].str.contains('|'.join(df['items'].str.split(" ").str[1] + ' ' + df['items'].str.split(" ").str[2]), case=False)
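If you need to keep rows that are already flagged True, one option (a sketch, assuming B already has a boolean found column) is to OR the new matches into the existing column, escaping each phrase so regex metacharacters are matched literally:

```python
import re
import pandas as pd

dfA = pd.DataFrame({'date': [20100605, 20110606, 20120607],
                    'items': ['apple is red', 'orange is orange', 'apple is green']})
dfB = pd.DataFrame({'id': [123, 234, 235],
                    'color': ['is Red', 'not orange', 'is green'],
                    'found': [False, False, False]})

# does each B phrase appear, case-insensitively, in any A item?
# re.escape keeps regex metacharacters in the phrase literal
matches = dfB['color'].apply(
    lambda c: dfA['items'].str.contains(re.escape(c), case=False).any())

# OR into the existing column so rows that are already True stay True
dfB['found'] = dfB['found'] | matches
print(dfB['found'].tolist())  # [True, False, True]
```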
I am trying to create a column that holds a percentage computed from the values in other columns in Python. For example, let's assume that we have the following dataset.
+------------------------------------+------------+--------+
| Teacher | grades | counts |
+------------------------------------+------------+--------+
| Teacher1 | 1 | 1 |
| | 2 | 2 |
| | 3 | 1 |
| Teacher2 | 2 | 1 |
| Teacher3 | 3 | 2 |
| Teacher4 | 2 | 2 |
| | 3 | 2 |
+------------------------------------+------------+--------+
As you can see, we have teachers in the first column, the grades each teacher gives (1, 2 and 3) in the second column, and the number of times the corresponding grade was given in the third column. I am trying to get, for each teacher, the percentage of grades 1 and 2 among all grades given. For instance, Teacher1 gave one grade 1, two grade 2s, and one grade 3, so the percentage of grades 1 and 2 is 75%. Teacher2 gave only one grade 2, so the percentage is 100%. Similarly, Teacher3 gave two grade 3s, so the percentage is 0% because he/she did not give any grades 1 or 2. These percentages should be added as a new column in the dataset. Honestly, I couldn't even think of anything to try, and I didn't find anything about it when I searched here. Could you please help me create this column?
I am not sure this is the most efficient way, but I find it quite readable and easy to follow.
percents = {}  # store Teacher:percent
for t, g in df.groupby('Teacher'):  # t, g is short for teacher, group
    total = g.counts.sum()
    one_two = g.loc[g.grades.isin([1, 2])].counts.sum()  # consider only 1 & 2
    percent = (one_two / total) * 100
    percents[t] = [percent]
xf = pd.DataFrame(percents).T.reset_index()  # make a df from the dict
xf.columns = ['Teacher', 'percent']  # rename columns
df = df.merge(xf)  # merge with initial df
print(df)
Teacher grades counts percent
0 Teacher1 1 1 75.0
1 Teacher1 2 2 75.0
2 Teacher1 3 1 75.0
3 Teacher2 2 1 100.0
4 Teacher3 3 2 0.0
5 Teacher4 2 2 50.0
6 Teacher4 3 2 50.0
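A fully vectorized alternative, as a sketch using groupby().transform and assuming the column names from the question:

```python
import pandas as pd

df = pd.DataFrame({'Teacher': ['Teacher1'] * 3 + ['Teacher2', 'Teacher3'] + ['Teacher4'] * 2,
                   'grades': [1, 2, 3, 2, 3, 2, 3],
                   'counts': [1, 2, 1, 1, 2, 2, 2]})

# counts that belong to grades 1 or 2, zero otherwise
good = df['counts'].where(df['grades'].isin([1, 2]), 0)

# per-teacher sums broadcast back to every row with transform
df['percent'] = (good.groupby(df['Teacher']).transform('sum')
                 / df.groupby('Teacher')['counts'].transform('sum') * 100)
print(df['percent'].tolist())  # [75.0, 75.0, 75.0, 100.0, 0.0, 50.0, 50.0]
```

transform keeps the original row count, so no merge back is needed.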
I believe this will solve your query:
data['Percentage'] = 0.0
for teacher in teachers:
    x = data[data['Teachers'] == teacher]
    total = x['Counts'].sum()
    # counts given for grades 1 and 2 only
    one_two = x.loc[x['Grades'].isin([1, 2]), 'Counts'].sum()
    data.loc[x.index, 'Percentage'] = (one_two / total) * 100
Output:
   Teachers  Grades  Counts  Percentage
0  Teacher1       1       1        75.0
1  Teacher1       2       2        75.0
2  Teacher1       3       1        75.0
3  Teacher2       2       1       100.0
4  Teacher3       3       2         0.0
5  Teacher4       2       2        50.0
6  Teacher4       3       2        50.0
I have used boolean masking to segregate the data for each teacher. Most of the code is self-explanatory. For any other clarification please feel free to leave a comment.
Is there a way to add a prefix when filling NAs with ffill in pandas? I have a dataframe containing taxonomic information, like so:
| Kingdom | Phylum | Class | Order | Family | Genus |
| Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Lactobacillus |
| Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | | |
| Bacteria | Bacteroidetes | | | | |
Since not all of the taxa in my dataframe can be classified fully, I have some empty cells. Replacing the spaces with NA and using ffill, I can fill these with the last valid string in each row, but I would like to add a string to the filled values (for example "Unknown_Bacteroidales") so I can identify which ones were carried forward.
So far I tried this taxa_formatted = "unknown_" + taxonomy.fillna(method='ffill', axis=1) but this of course adds the "unknown_" prefix to everything in the dataframe.
You can do this using boolean masking with df.isna.
df = df.replace("", np.nan)  # skip this step if NaNs are already present
d = df.ffill()
d[df.isna()] += "(Copy)"
d
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales Lactobacillaceae(Copy) Lactobacillus(Copy)
2 Bacteria Bacteroidetes Bacteroidia(Copy) Bacteroidales(Copy) Lactobacillaceae(Copy) Lactobacillus(Copy)
You can use df.add here.
d = df.ffill(axis=1)
df.add("unknown_" + d[df.isna()], fill_value='')
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales unknown_Bacteroidales unknown_Bacteroidales
2 Bacteria Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes
You need to use mask and update:
# make true NaNs first, if needed:
# df = df.replace('', np.nan)
s = df.isnull()
df = df.ffill(axis=1)
df.update('unknown_' + df.mask(~s))
print(df)
print(df)
Bacteria Firmicutes Bacilli Lactobacillales \
0 Bacteria Bacteroidetes Bacteroidia Bacteroidales
1 Bacteria Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes
Lactobacillaceae Lactobacillus
0 unknown_Bacteroidales unknown_Bacteroidales
1 unknown_Bacteroidetes unknown_Bacteroidetes
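Putting the masking idea together as a runnable sketch, assuming the table from the question with the empty cells read in as NaN and the unknown_ prefix the question asks for:

```python
import numpy as np
import pandas as pd

cols = ['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus']
df = pd.DataFrame([
    ['Bacteria', 'Firmicutes', 'Bacilli', 'Lactobacillales', 'Lactobacillaceae', 'Lactobacillus'],
    ['Bacteria', 'Bacteroidetes', 'Bacteroidia', 'Bacteroidales', np.nan, np.nan],
    ['Bacteria', 'Bacteroidetes', np.nan, np.nan, np.nan, np.nan],
], columns=cols)

mask = df.isna()               # remember which cells were missing
filled = df.ffill(axis=1)      # carry the last valid name across each row
result = filled.mask(mask, 'unknown_' + filled)  # prefix only the filled-in cells
print(result.loc[1, 'Genus'])  # unknown_Bacteroidales
```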
I have a pandas dataframe that look like this:
df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'], 'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'], ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many favefood's of other people overlap with their own.
I.e., for each person I want to find out how many other people have a non-empty intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be a nested for-loop, i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Is there any way to do this more efficiently using pandas? Thanks!
The logic behind it:
s = df['favefood'].explode().str.get_dummies().groupby(level=0).sum()
s.dot(s.T).ne(0).sum(axis=1)-1
Out[84]:
0 3
1 2
2 2
3 1
4 0
dtype: int64
df['overlap']=s.dot(s.T).ne(0).sum(axis=1)-1
Method from sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s=pd.DataFrame(mlb.fit_transform(df['favefood']),columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1)-1
0 3
1 2
2 2
3 1
4 0
dtype: int64
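The explode/indicator-matrix approach, as a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'],
                   'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'],
                                ['mcd', 'popeyes'], ['wendys', 'kfc'],
                                ['tacobell', 'innout']]})

# one row per (person, food), then a person x food indicator matrix
s = pd.get_dummies(df['favefood'].explode()).groupby(level=0).sum()

# s.dot(s.T) counts shared foods between every pair; non-zero means overlap,
# and subtracting 1 removes each person's overlap with themselves
df['overlap'] = s.dot(s.T).ne(0).sum(axis=1) - 1
print(df['overlap'].tolist())  # [3, 2, 2, 1, 0]
```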
I have a column of sites: ['Canada', 'USA', 'China' ....]
Each site occurs many times in the SITE column and next to each instance is a true or false value.
INDEX | VALUE | SITE
0 | True | Canada
1 | False | Canada
2 | True | USA
3 | True | USA
And it goes on.
Goal 1: I want to find, for each site, what percent of the VALUE column is True.
Goal 2: I want to return a list of sites where % True in the VALUE column is greater than 10%.
How do I use groupby to achieve this? I only know how to use groupby to find the mean for each site which won't help me here.
Something like this:
In [13]: g = df.groupby('SITE')['VALUE'].mean()
In [14]: g[g > 0.1]
Out[14]:
SITE
Canada 0.5
USA 1.0
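Since the VALUE column is boolean, its mean is exactly the fraction of True values, so the groupby above covers Goal 1; for Goal 2, take the index of the filtered result to get an actual list of sites (a sketch with the sample rows from the question):

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [True, False, True, True],
                   'SITE': ['Canada', 'Canada', 'USA', 'USA']})

g = df.groupby('SITE')['VALUE'].mean()  # fraction True per site
sites = g[g > 0.1].index.tolist()       # Goal 2: sites above 10% True
print(sites)  # ['Canada', 'USA']
```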