I have a dataframe table:
Test results | Make
P | BMW
F | VW
F | VW
P | VW
P | VW
P | VW
I want to group by both make and test result and output a count, keeping both original columns, something like this:
Test results | Make | count
P | BMW | 1
F | VW | 2
P | VW | 3
I am currently doing:
pass_rates = df.groupby(['Test Results','Make']).size()
but it collapses make and test result into the index, while I need them to stay as regular columns.
You can add reset_index with the parameter name:
name : object, default None
The name of the column corresponding to the Series values.
pass_rates = df.groupby(['Test Results','Make']).size().reset_index(name='count')
print(pass_rates)
Test Results Make count
0 F VW 2
1 P BMW 1
2 P VW 3
If you want to disable sorting, add the parameter sort=False to groupby:
sort : boolean, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group.
pass_rates = df.groupby(['Test Results','Make'], sort=False).size().reset_index(name='count')
print(pass_rates)
Test Results Make count
0 P BMW 1
1 F VW 2
2 P VW 3
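As a side note, in recent pandas versions (1.1+) you can get the same result without reset_index by keeping the grouping keys as columns. A sketch; note the count column arrives named 'size' unless you rename it:

```python
import pandas as pd

df = pd.DataFrame({'Test Results': ['P', 'F', 'F', 'P', 'P', 'P'],
                   'Make': ['BMW', 'VW', 'VW', 'VW', 'VW', 'VW']})

# as_index=False keeps the grouping keys as regular columns;
# the count arrives in a column named 'size', renamed here to 'count'
pass_rates = (df.groupby(['Test Results', 'Make'], as_index=False)
                .size()
                .rename(columns={'size': 'count'}))
print(pass_rates)
```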
I have two dataframes, and I want to flag rows in the second one when the first one contains a matching pattern. A: very large (tens of thousands of rows).
date | items
20100605 | apple is red
20110606 | orange is orange
20120607 | apple is green
B: shorter with a few hundred rows.
id | color
123 | is Red
234 | not orange
235 | is green
The result would be to flag each row in B when its pattern is found in A, possibly adding a column to B like
B:
id | color | found
123 | is Red | true
234 | not orange | false
235 | is green | true
I was thinking of something like dfB['found'] = dfB['color'].isin(dfA['items']), but I don't see any way to ignore case. Also, with this approach it will flip values that are already True back to False; I don't want to change those which are already set True. I also believe it's inefficient to loop over large dataframes more than once. Running through A once and marking B would be a better way, but I'm not sure how to achieve that using isin(). Any other ways, especially ones ignoring the case of the pattern?
You can use something like this:
df2['check'] = df2['color'].apply(lambda x: any(x.casefold() in i.casefold() for i in df['items']))
or you can use str.contains, building one case-insensitive pattern from the second and third words of each item:
df2['check'] = df2['color'].str.contains('|'.join(df['items'].str.split(" ").str[1] + ' ' + df['items'].str.split(" ").str[2]), case=False)
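If you need to keep rows that are already flagged True, one option (a sketch, assuming B already has a boolean found column) is to OR the new matches into the existing column, escaping each phrase so regex metacharacters are matched literally:

```python
import re
import pandas as pd

dfA = pd.DataFrame({'date': [20100605, 20110606, 20120607],
                    'items': ['apple is red', 'orange is orange', 'apple is green']})
dfB = pd.DataFrame({'id': [123, 234, 235],
                    'color': ['is Red', 'not orange', 'is green'],
                    'found': [False, False, False]})

# does each B phrase appear, case-insensitively, in any A item?
# re.escape keeps regex metacharacters in the phrase literal
matches = dfB['color'].apply(
    lambda c: dfA['items'].str.contains(re.escape(c), case=False).any())

# OR into the existing column so rows that are already True stay True
dfB['found'] = dfB['found'] | matches
print(dfB['found'].tolist())  # [True, False, True]
```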
I am trying to create a column that holds a percentage computed from the values in other columns in Python. For example, let's assume that we have the following dataset.
+------------------------------------+------------+--------+
| Teacher | grades | counts |
+------------------------------------+------------+--------+
| Teacher1 | 1 | 1 |
| | 2 | 2 |
| | 3 | 1 |
| Teacher2 | 2 | 1 |
| Teacher3 | 3 | 2 |
| Teacher4 | 2 | 2 |
| | 3 | 2 |
+------------------------------------+------------+--------+
As you can see, we have teachers in the first column, the grades each teacher gives (1, 2 and 3) in the second column, and the number of times the corresponding grade was given in the third column. I am trying to get, for each teacher, the percentage of grades 1 and 2 among all grades given. For instance, Teacher1 gave one grade 1, two grade 2s, and one grade 3, so the percentage of grades 1 and 2 is 75%. Teacher2 gave only one grade 2, so the percentage is 100%. Similarly, Teacher3 gave two grade 3s, so the percentage is 0% because he/she did not give any grades 1 or 2. These percentages should be added as a new column in the dataset. Honestly, I couldn't even think of anything to try, and I didn't find anything about it when I searched here. Could you please help me create this column?
I am not sure this is the most efficient way, but I find it quite readable and easy to follow.
percents = {}  # store Teacher:percent
for t, g in df.groupby('Teacher'):  # t, g is short for teacher, group
    total = g.counts.sum()
    one_two = g.loc[g.grades.isin([1, 2])].counts.sum()  # consider only 1 & 2
    percent = (one_two / total) * 100
    percents[t] = [percent]
xf = pd.DataFrame(percents).T.reset_index()  # make a df from the dict
xf.columns = ['Teacher', 'percent']  # rename columns
df = df.merge(xf)  # merge with initial df
print(df)
Teacher grades counts percent
0 Teacher1 1 1 75.0
1 Teacher1 2 2 75.0
2 Teacher1 3 1 75.0
3 Teacher2 2 1 100.0
4 Teacher3 3 2 0.0
5 Teacher4 2 2 50.0
6 Teacher4 3 2 50.0
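A fully vectorized alternative, as a sketch using groupby().transform and assuming the column names from the question:

```python
import pandas as pd

df = pd.DataFrame({'Teacher': ['Teacher1'] * 3 + ['Teacher2', 'Teacher3'] + ['Teacher4'] * 2,
                   'grades': [1, 2, 3, 2, 3, 2, 3],
                   'counts': [1, 2, 1, 1, 2, 2, 2]})

# counts that belong to grades 1 or 2, zero otherwise
good = df['counts'].where(df['grades'].isin([1, 2]), 0)

# per-teacher sums broadcast back to every row with transform
df['percent'] = (good.groupby(df['Teacher']).transform('sum')
                 / df.groupby('Teacher')['counts'].transform('sum') * 100)
print(df['percent'].tolist())  # [75.0, 75.0, 75.0, 100.0, 0.0, 50.0, 50.0]
```

transform keeps the original row count, so no merge back is needed.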
I believe this will solve your query:
data['Percentage'] = 0.0
for teacher in teachers:
    x = data[data['Teachers'] == teacher]
    total = x['Counts'].sum()
    # counts given for grades 1 and 2 only
    one_two = x.loc[x['Grades'].isin([1, 2]), 'Counts'].sum()
    data.loc[x.index, 'Percentage'] = (one_two / total) * 100
Output:
   Teachers  Grades  Counts  Percentage
0  Teacher1       1       1        75.0
1  Teacher1       2       2        75.0
2  Teacher1       3       1        75.0
3  Teacher2       2       1       100.0
4  Teacher3       3       2         0.0
5  Teacher4       2       2        50.0
6  Teacher4       3       2        50.0
I have used boolean masking to segregate the data for each teacher. Most of the code is self-explanatory. For any other clarification please feel free to leave a comment.
Is there a way to add a prefix when filling NAs with ffill in pandas? I have a dataframe containing taxonomic information, like so:
| Kingdom | Phylum | Class | Order | Family | Genus |
| Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Lactobacillus |
| Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | | |
| Bacteria | Bacteroidetes | | | | |
Since not all of the taxa in my dataframe can be classified fully, I have some empty cells. Replacing the spaces with NA and using ffill, I can fill these with the last valid string in each row, but I would like to add a string to the filled values (for example "Unknown_Bacteroidales") so I can identify which ones were carried forward.
So far I tried this taxa_formatted = "unknown_" + taxonomy.fillna(method='ffill', axis=1) but this of course adds the "unknown_" prefix to everything in the dataframe.
You can do this using boolean masking with df.isna.
df = df.replace("", np.nan)  # skip this step if NaNs are already present
d = df.ffill()
d[df.isna()] += "(Copy)"
d
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales Lactobacillaceae(Copy) Lactobacillus(Copy)
2 Bacteria Bacteroidetes Bacteroidia(Copy) Bacteroidales(Copy) Lactobacillaceae(Copy) Lactobacillus(Copy)
You can use df.add here.
d = df.ffill(axis=1)
df.add("unknown_" + d[df.isna()], fill_value='')
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales unknown_Bacteroidales unknown_Bacteroidales
2 Bacteria Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes
You need to use mask and update:
# make true NaNs first, if needed:
# df = df.replace('', np.nan)
s = df.isnull()
df = df.ffill(axis=1)
df.update('unknown_' + df.mask(~s))
print(df)
print(df)
Bacteria Firmicutes Bacilli Lactobacillales \
0 Bacteria Bacteroidetes Bacteroidia Bacteroidales
1 Bacteria Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes
Lactobacillaceae Lactobacillus
0 unknown_Bacteroidales unknown_Bacteroidales
1 unknown_Bacteroidetes unknown_Bacteroidetes
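Putting the masking idea together as a runnable sketch, assuming the table from the question with the empty cells read in as NaN and the unknown_ prefix the question asks for:

```python
import numpy as np
import pandas as pd

cols = ['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus']
df = pd.DataFrame([
    ['Bacteria', 'Firmicutes', 'Bacilli', 'Lactobacillales', 'Lactobacillaceae', 'Lactobacillus'],
    ['Bacteria', 'Bacteroidetes', 'Bacteroidia', 'Bacteroidales', np.nan, np.nan],
    ['Bacteria', 'Bacteroidetes', np.nan, np.nan, np.nan, np.nan],
], columns=cols)

mask = df.isna()               # remember which cells were missing
filled = df.ffill(axis=1)      # carry the last valid name across each row
result = filled.mask(mask, 'unknown_' + filled)  # prefix only the filled-in cells
print(result.loc[1, 'Genus'])  # unknown_Bacteroidales
```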
I have a pandas dataframe that look like this:
df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'], 'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'], ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many favefood's of other people overlap with their own.
I.e., for each person I want to find out how many other people have a non-empty intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be a nested for-loop, i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Is there any way to do this more efficiently using pandas? Thanks!
The logic behind it:
s = df['favefood'].explode().str.get_dummies().groupby(level=0).sum()
s.dot(s.T).ne(0).sum(axis=1)-1
Out[84]:
0 3
1 2
2 2
3 1
4 0
dtype: int64
df['overlap']=s.dot(s.T).ne(0).sum(axis=1)-1
Method from sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s=pd.DataFrame(mlb.fit_transform(df['favefood']),columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1)-1
0 3
1 2
2 2
3 1
4 0
dtype: int64
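The explode/indicator-matrix approach, as a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'],
                   'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'],
                                ['mcd', 'popeyes'], ['wendys', 'kfc'],
                                ['tacobell', 'innout']]})

# one row per (person, food), then a person x food indicator matrix
s = pd.get_dummies(df['favefood'].explode()).groupby(level=0).sum()

# s.dot(s.T) counts shared foods between every pair; non-zero means overlap,
# and subtracting 1 removes each person's overlap with themselves
df['overlap'] = s.dot(s.T).ne(0).sum(axis=1) - 1
print(df['overlap'].tolist())  # [3, 2, 2, 1, 0]
```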
I have a column of sites: ['Canada', 'USA', 'China' ....]
Each site occurs many times in the SITE column and next to each instance is a true or false value.
INDEX | VALUE | SITE
0 | True | Canada
1 | False | Canada
2 | True | USA
3 | True | USA
And it goes on.
Goal 1: I want to find, for each site, what percent of the VALUE column is True.
Goal 2: I want to return a list of sites where % True in the VALUE column is greater than 10%.
How do I use groupby to achieve this? I only know how to use groupby to find the mean for each site which won't help me here.
Something like this:
In [13]: g = df.groupby('SITE')['VALUE'].mean()
In [14]: g[g > 0.1]
Out[14]:
SITE
Canada 0.5
USA 1.0
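Since the VALUE column is boolean, its mean is exactly the fraction of True values, so the groupby above covers Goal 1; for Goal 2, take the index of the filtered result to get an actual list of sites (a sketch with the sample rows from the question):

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [True, False, True, True],
                   'SITE': ['Canada', 'Canada', 'USA', 'USA']})

g = df.groupby('SITE')['VALUE'].mean()  # fraction True per site
sites = g[g > 0.1].index.tolist()       # Goal 2: sites above 10% True
print(sites)  # ['Canada', 'USA']
```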