I have a column of sites: ['Canada', 'USA', 'China' ....]
Each site occurs many times in the SITE column and next to each instance is a true or false value.
INDEX | VALUE | SITE
0 | True | Canada
1 | False | Canada
2 | True | USA
3 | True | USA
And it goes on.
Goal 1: I want to find, for each site, what percent of the VALUE column is True.
Goal 2: I want to return a list of sites where % True in the VALUE column is greater than 10%.
How do I use groupby to achieve this? I only know how to use groupby to find the mean for each site, which won't help me here.
The mean is actually exactly what you need: True counts as 1 and False as 0, so the mean of VALUE per site is the fraction of True values. Something like this:
In [13]: g = df.groupby('SITE')['VALUE'].mean()
In [14]: g[g > 0.1]
Out[14]:
SITE
Canada 0.5
USA 1.0
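To get Goal 2 as an actual list of site names, take the index of the filtered Series (a small follow-on sketch using g from above):
In [15]: g[g > 0.1].index.tolist()
Out[15]: ['Canada', 'USA']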
I have two dataframes, A and B, and I want to mark rows in B when A contains a matching pattern.
A: very large (tens of thousands of rows).
date | items
20100605 | apple is red
20110606 | orange is orange
20120607 | apple is green
B: shorter with a few hundred rows.
id | color
123 | is Red
234 | not orange
235 | is green
The result would be to flag each row in B where its pattern is found in A, possibly adding a column to B like:
B:
id | color | found
123 | is Red | true
234 | not orange | false
235 | is green | true
I'm thinking of something like dfB['found'] = dfB['color'].isin(dfA['items']), but I don't see any way to ignore case. Also, with this approach it can change True to False, and I don't want to change rows that are already set True. I also believe it's inefficient to loop over large dataframes more than once; running through A once and marking B would be a better way, but I'm not sure how to achieve that using isin(). Any other ways? Especially ones that ignore the case of the pattern.
You can use something like this:
df2['check'] = df2['color'].apply(lambda x: any(x.casefold() in i.casefold() for i in df['items']))  # any() already returns a bool
or you can use str.contains:
# build one case-insensitive regex from the second and third words of each item
df2['check'] = df2['color'].str.contains('|'.join(df['items'].str.split(" ").str[1] + ' ' + df['items'].str.split(" ").str[2]), case=False)
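To also preserve rows that are already flagged, one option (a sketch, assuming the flag column is named found as in the question) is to OR the new matches into the existing column instead of overwriting it:
# compute the case-insensitive matches once
mask = df2['color'].apply(lambda x: any(x.casefold() in i.casefold() for i in df['items']))
if 'found' in df2.columns:
    df2['found'] = df2['found'] | mask  # keep rows already set True
else:
    df2['found'] = mask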
I have a pandas dataframe that looks like this:
df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'], 'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'], ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many other people's favefood lists overlap with their own.
I.e., for each person I want to count how many other people have a non-empty intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be through a nested for-loop, i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Would there be any way to do this more efficiently using pandas notation? Thanks!
Logic behind it
s = df['favefood'].explode().str.get_dummies().groupby(level=0).sum()  # person x food indicator matrix (the original sum(level=0) is removed in newer pandas)
s.dot(s.T).ne(0).sum(axis=1) - 1  # shared-food counts per pair -> any overlap -> row sum minus the self-match
Out[84]:
0 3
1 2
2 2
3 1
4 0
dtype: int64
df['overlap'] = s.dot(s.T).ne(0).sum(axis=1) - 1
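To see the logic on the sample data, s is the person x food indicator table below; s.dot(s.T) then counts the foods each pair of people shares, ne(0) marks pairs with any overlap, and the row sum minus 1 drops the self-match:
print(s)
#    innout  kfc  mcd  popeyes  tacobell  wendys
# 0       0    1    1        0         0       1
# 1       0    0    1        0         0       0
# 2       0    0    1        1         0       0
# 3       0    1    0        0         0       1
# 4       1    0    0        0         1       0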
Method from sklearn
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
s = pd.DataFrame(mlb.fit_transform(df['favefood']), columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1) - 1
0 3
1 2
2 2
3 1
4 0
dtype: int64
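One caveat for the 2-million-row case: s.dot(s.T) materialises an n x n matrix, which will not fit in memory at that scale. A sketch (assuming scikit-learn's sparse_output option and scipy are available) that keeps the pairwise matrix sparse, which helps as long as most pairs share no food:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
m = mlb.fit_transform(df['favefood'])      # scipy sparse person x food indicator matrix
pairs = (m @ m.T) > 0                      # sparse boolean: which pairs share at least one food
df['overlap'] = pairs.sum(axis=1).A1 - 1   # dense row sums minus the self-match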
I want to merge two pandas dataframes:
df 1
City | Attraction | X | Z | Y
Somewhere Rainbows 1 2 3
Somewhere Trees 4 4 4
Somewhere Unicorns
df 2
City | Other Column | Also another column
Somewhere Something Something else
Normally this would be done like so:
df2.merge(df1[['City', 'Attraction']], left_on='City', right_on='City', how='left')
City | Other Column | Also another column | Attraction
Somewhere Something Something else Rainbows
Somewhere Something Something else Trees
Somewhere Something Something else Unicorns
However, I would like to group the results of the join into a comma separated list (or whatever):
City | Other Column | Also another column | Attraction
Somewhere Something Something else Rainbows, Trees, Unicorns
groupby() and map:
df2['Attraction'] = df2['City'].map(df1.groupby('City').Attraction.agg(', '.join))
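For reference, the intermediate groupby produces one joined string per City, which map then aligns against df2['City'] (shown here with the sample data):
df1.groupby('City').Attraction.agg(', '.join)
# City
# Somewhere    Rainbows, Trees, Unicorns
# Name: Attraction, dtype: object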
I'm trying to solve a problem for a Python course and found that someone has implemented solutions for the same problem on GitHub. I'm just trying to understand that solution.
I have a pandas dataframe called Top15 with 15 countries, and one of the columns in the dataframe is 'HighRenew'. This column stores the % of renewable energy used in each country. My task is to convert the values in the 'HighRenew' column into a boolean datatype.
If the value for a particular country is higher than the median renewable energy percentage across all 15 countries then I should encode it as 1, otherwise it should be a 0. The 'HighRenew' column is sliced out of the dataframe as a Series, which is copied below.
Country
China True
United States False
Japan False
United Kingdom False
Russian Federation True
Canada True
Germany True
India False
France True
South Korea False
Italy True
Spain True
Iran False
Australia False
Brazil True
Name: HighRenew, dtype: bool
The GitHub solution is implemented in 3 steps, of which I understand the first 2 but not the last one, where a lambda function is used. Can someone explain how this lambda function works?
median_value = Top15['% Renewable'].median()
Top15['HighRenew'] = Top15['% Renewable']>=median_value
Top15['HighRenew'] = Top15['HighRenew'].apply(lambda x:1 if x else 0)
lambda represents an anonymous (i.e. unnamed) function. If it is used with pd.Series.apply, each element of the series is fed into the lambda function. The result will be another pd.Series with each element run through the lambda.
apply + lambda is just a thinly veiled loop. You should prefer vectorised functionality where possible; @jezrael offers such a vectorised solution below.
The equivalent in regular python is below, given a list lst. Here each element of lst is passed through the lambda function and aggregated in a list.
list(map(lambda x: 1 if x else 0, lst))
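For example, with a concrete list:
list(map(lambda x: 1 if x else 0, [True, False, True]))
# [1, 0, 1]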
It is a Pythonic idiom to test for "Truthy" values using if x rather than if x == True, see this answer for more information on what is considered True.
I think apply is a loop under the hood, so it is better to use the vectorised astype - it converts True to 1 and False to 0:
Top15['HighRenew'] = (Top15['% Renewable']>=median_value).astype(int)
lambda x: 1 if x else 0
is an anonymous (lambda) function with a conditional - if x is truthy it returns 1, else it returns 0.
For more information about lambda functions, check these answers.
Instead of using workarounds or lambdas, just use pandas' built-in functionality meant for this problem. The approach is called masking, and in essence we use comparators against a Series (a column of a df) to get the boolean values:
import pandas as pd
import numpy as np
foo = [{
'Country': 'Germany',
'Percent Renew': 100
}, {
'Country': 'Australia',
'Percent Renew': 75
}, {
'Country': 'China',
'Percent Renew': 25
}, {
'Country': 'USA',
'Percent Renew': 5
}]
df = pd.DataFrame(foo, index=pd.RangeIndex(0, len(foo)))
df
#| Country   | Percent Renew |
#| Germany   | 100           |
#| Australia | 75            |
#| China     | 25            |
#| USA       | 5             |
np.mean(df['Percent Renew'])
# 51.25
df['Better Than Average'] = df['Percent Renew'] > np.mean(df['Percent Renew'])
#| Country   | Percent Renew | Better Than Average |
#| Germany   | 100           | True                |
#| Australia | 75            | True                |
#| China     | 25            | False               |
#| USA       | 5             | False               |
The reason I specifically propose this over the other solutions is that masking can be used for a host of other purposes as well. I won't go into them here, but once you learn that pandas supports this kind of functionality, it becomes a lot easier to perform other data manipulations in pandas.
EDIT: I read "needing boolean datatype" as needing True/False rather than the encoded 1 and 0; if the encoded version is what you need, the astype that was proposed will sufficiently convert the booleans to integer values. For masking purposes, though, the True/False values are needed for slicing.
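For example, the boolean column can be used directly to slice the frame (using the sample df from above):
df[df['Better Than Average']]
#| Country   | Percent Renew | Better Than Average |
#| Germany   | 100           | True                |
#| Australia | 75            | True                |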
I have a dataframe table:
Test results | Make
P | BMW
F | VW
F | VW
P | VW
P | VW
P | VW
And I want to group by both make and test result to output a count something like this, including both original columns:
Test results | Make | count
P | BMW | 1
F | VW | 2
P | VW | 3
I am currently doing:
pass_rates = df.groupby(['Test Results','Make']).size()
but it moves both Test Results and Make into the index of the resulting Series, when I need them to stay as columns in the original structure
You can add reset_index with parameter name:
name : object, default None
The name of the column corresponding to the Series values
pass_rates = df.groupby(['Test Results','Make']).size().reset_index(name='count')
print(pass_rates)
Test Results Make count
0 F VW 2
1 P BMW 1
2 P VW 3
If you want to disable sorting, add the parameter sort=False to groupby:
sort : boolean, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
pass_rates = df.groupby(['Test Results','Make'], sort=False).size().reset_index(name='count')
print(pass_rates)
Test Results Make count
0 P BMW 1
1 F VW 2
2 P VW 3
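On newer pandas (1.1+), DataFrame.value_counts is a shortcut for the same counts; a sketch (it sorts by count descending by default, so pass sort=False for groupby-like order):
pass_rates = df.value_counts(['Test Results', 'Make'], sort=False).reset_index(name='count')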