I am trying to replicate Excel's COUNTIF function in pandas but have hit a roadblock.
I have the dataframe below and need to count the Yes values for each country, quarter by quarter. The expected results are shown in the Quarter_1 and Quarter_2 columns.
result.head(3)
Country Jan 1 Feb 1 Mar 1 Apr 1 May 1 Jun 1 Quarter_1 Quarter_2
FRANCE Yes Yes No No No No 2 0
BELGIUM Yes Yes No Yes No No 2 1
CANADA Yes No No Yes No No 1 1
I tried the following, but pandas returns a single total instead, showing 5 for every row under Quarter_1. How can I compute this per country? Any help would be appreciated.
result['Quarter_1'] = (len(result[result['Jan 1'] == 'Yes'])
                       + len(result[result['Feb 1'] == 'Yes'])
                       + len(result[result['Mar 1'] == 'Yes']))
We can use the positions of your columns and floor division by 3 to assign each month column to a quarter. Then we group the columns by quarter and sum the Yes values.
Finally, we add the prefix Quarter_:
df = df.set_index('Country')
# column positions 0-2 -> quarter 1, 3-5 -> quarter 2, ...
grps = np.arange(len(df.columns)) // 3 + 1
dfn = (
    df.join(df.eq('Yes')
              .groupby(grps, axis=1)
              .sum()
              .astype(int)
              .add_prefix('Quarter_'))
      .reset_index()
)
Or use a list comprehension to rename the columns:
df = df.set_index('Country')
grps = np.arange(len(df.columns)) // 3
dfn = df.eq('Yes').groupby(grps, axis=1).sum().astype(int)
dfn.columns = [f'Quarter_{col+1}' for col in dfn.columns]
df = df.join(dfn).reset_index()
Country Jan 1 Feb 1 Mar 1 Apr 1 May 1 Jun 1 Quarter_1 Quarter_2
0 FRANCE Yes Yes No No No No 2 0
1 BELGIUM Yes Yes No Yes No No 2 1
2 CANADA Yes No No Yes No No 1 1
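As far as I know, recent pandas releases emit a deprecation warning for groupby(..., axis=1). A minimal sketch of the same per-row count that avoids it, assuming the month columns are in calendar order and come in blocks of three (the sample frame below is rebuilt from the question):

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'Country': ['FRANCE', 'BELGIUM', 'CANADA'],
    'Jan 1': ['Yes', 'Yes', 'Yes'],
    'Feb 1': ['Yes', 'Yes', 'No'],
    'Mar 1': ['No', 'No', 'No'],
    'Apr 1': ['No', 'Yes', 'Yes'],
    'May 1': ['No', 'No', 'No'],
    'Jun 1': ['No', 'No', 'No'],
})

# Boolean frame: True wherever a month cell equals 'Yes'.
months = df.columns.drop('Country')
is_yes = df[months].eq('Yes')

# Sum each block of three month columns row by row (one value per country).
for quarter, start in enumerate(range(0, len(months), 3), start=1):
    df[f'Quarter_{quarter}'] = is_yes.iloc[:, start:start + 3].sum(axis=1)

print(df)
```

The key difference from the attempt in the question is sum(axis=1), which counts across the columns of each row, whereas len(result[...]) counts matching rows in the whole frame and yields a single number.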
Related
I have this data:
ID Month
001 June
001 July
001 August
002 July
I want the result to be like this:
ID June July August
001 1 1 1
002 0 1 0
I have tried one-hot encoding; my code looks like this:
one_hot = pd.get_dummies(frame['Month'])
frame = frame.drop('Month',axis = 1)
frame = frame.join(one_hot)
However, the result is like this
ID June July August
001 1 0 0
001 0 1 0
001 0 0 1
002 0 1 0
May I know which part of my query is wrong?
get_dummies returns strictly one-hot encoded values, one row per input row. You can use pd.crosstab instead:
>>> out = pd.crosstab(df.ID, df.Month)
>>> out
Month August July June
ID
1 1 1 1
2 0 1 0
To preserve the order of appearance of Months, you can reindex:
>>> out.reindex(df.Month.unique(), axis=1)
Month June July August
ID
1 1 1 1
2 0 1 0
If an ID can be associated with the same month more than once and you want the cell to show 1 rather than the count, apply this afterwards:
out = out.ne(0).astype(int)
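A small sketch of that capping step, assuming a hypothetical extra row in which ID 001 appears in June twice (that duplicate is not in the question's data):

```python
import pandas as pd

# Question's data plus one hypothetical duplicate (001, June).
df = pd.DataFrame({
    'ID': ['001', '001', '001', '001', '002'],
    'Month': ['June', 'July', 'August', 'June', 'July'],
})

# Raw counts per ID and month, columns in order of first appearance.
out = pd.crosstab(df['ID'], df['Month']).reindex(df['Month'].unique(), axis=1)

# Cap every non-zero count at 1 to get a 0/1 indicator table.
capped = out.ne(0).astype(int)

print(out)
print(capped)
```

Here out holds a 2 for (001, June), while capped holds only zeros and ones.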
If you need one-hot encoding, convert ID to the index and aggregate with max per ID so that the output is always 0 or 1:
one_hot = (pd.get_dummies(frame.set_index('ID')['Month'], dtype=int)
             .groupby(level=0).max()
             .reindex(frame['Month'].unique(), axis=1))
print (one_hot)
June July August
ID
1 1 1 1
2 0 1 0
I'm looking for the pandas equivalent of R's spread. My dataframe looks like this:
Name age Language year Period
Nik 18 English 2018 Beginer
John 19 French 2019 Intermediate
Kane 33 Russian 2017 Advanced
xi 44 Thai 2015 Beginer
and I am looking for output like this:
Name  age  Language  Beginer  Intermediate  Advanced
Nik   18   English   2018
John  19   French             2019
Kane  33   Russian                          2017
xi    44   Thai      2015
My code:
pd.pivot(x1,values='year', columns=['Period'])
I only get the columns Beginer, Intermediate and Advanced, not the entire dataframe.
When reshaping, I tried passing an index, but pandas complains about duplicate entries in the index.
So I created a new index column, but I still do not get the entire dataframe.
If I understood correctly you could do something like this:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(['year', 'Period'], axis=1), res), axis=1)
print(output)
Output
Name age Language Advanced Beginner Intermediate
0 Nik 18 English 0 2018 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 2017 0 0
3 xi 44 Thai 0 2015 0
If you want to match the exact same output, convert the column to categorical first, and specify the order:
# encode as categorical
df['Period'] = pd.Categorical(df['Period'], ['Beginner', 'Advanced', 'Intermediate'], ordered=True)
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(['year', 'Period'], axis=1), res), axis=1)
print(output)
Output
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018 0 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 0 2017 0
3 xi 44 Thai 2015 0 0
Finally, if you want to replace the 0s with missing values, add a third step:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
res = res.replace(0, np.nan)
Output (with missing values)
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018.0 NaN NaN
1 John 19 French NaN NaN 2019.0
2 Kane 33 Russian NaN 2017.0 NaN
3 xi 44 Thai 2015.0 NaN NaN
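Writing through res.values can be fragile, because depending on the pandas version and settings the array may be a copy or read-only. A hedged alternative sketch multiplies the dummy columns by the year row by row instead. It assumes the Period values are spelled consistently (Beginner here) and is a different technique from the answer above, not a restatement of it:

```python
import pandas as pd

# Sample data rebuilt from the question, with Period spelled 'Beginner'.
df = pd.DataFrame({
    'Name': ['Nik', 'John', 'Kane', 'xi'],
    'age': [18, 19, 33, 44],
    'Language': ['English', 'French', 'Russian', 'Thai'],
    'year': [2018, 2019, 2017, 2015],
    'Period': ['Beginner', 'Intermediate', 'Advanced', 'Beginner'],
})

# 0/1 indicator columns, one per Period value.
dummies = pd.get_dummies(df['Period'], dtype='int64')

# Multiply each row by that row's year: the single 1 becomes the year.
res = dummies.mul(df['year'], axis=0)

# Attach the result to the remaining columns.
output = pd.concat([df.drop(columns=['year', 'Period']), res], axis=1)
print(output)
```

If missing values are preferred over zeros, res.replace(0, np.nan) can be applied before the concat, as in the third step above.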
One way to get the equivalent of R's spread function is pd.pivot_table.
If you don't mind losing the original index, you can call reset_index() on the newly created dataframe:
new_df = (pd.pivot_table(df, index=['Name','age','Language'],columns='Period',values='year',aggfunc='sum')).reset_index()
which will get you:
Period Name age Language Advanced Beginer Intermediate
0 John 19 French NaN NaN 2019.0
1 Kane 33 Russian 2017.0 NaN NaN
2 Nik 18 English NaN 2018.0 NaN
3 xi 44 Thai NaN 2015.0 NaN
EDIT
If you have many columns in your dataframe and you want to include them in the reshaped dataset:
1. Collect in a list the columns to be used in the pivot table (i.e. Period and year).
2. Collect all the other columns of your dataframe in a second list (using not in).
3. Use index_cols as the index in the pd.pivot_table() call.
non_index_cols = ['Period','year'] # SPECIFY THE 2 COLUMNS IN THE PIVOT TABLE TO BE USED
index_cols = [i for i in df.columns if i not in non_index_cols] # GET ALL THE REST IN A LIST
new_df = (pd.pivot_table(df, index=index_cols,columns='Period',values='year',aggfunc='sum')).reset_index()
new_df will include all the columns of your initial dataframe.
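When every combination of the index columns is unique, no aggregation is needed, and DataFrame.pivot can do the same reshape. This is a sketch that assumes pandas 1.1 or later (where pivot accepts a list for index) and uses the question's data with its original spelling of Beginer:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'Name': ['Nik', 'John', 'Kane', 'xi'],
    'age': [18, 19, 33, 44],
    'Language': ['English', 'French', 'Russian', 'Thai'],
    'year': [2018, 2019, 2017, 2015],
    'Period': ['Beginer', 'Intermediate', 'Advanced', 'Beginer'],
})

# Every column except the two being spread becomes part of the index.
index_cols = [c for c in df.columns if c not in ('Period', 'year')]

new_df = df.pivot(index=index_cols, columns='Period', values='year').reset_index()
new_df.columns.name = None  # drop the leftover 'Period' axis label

print(new_df)
```

Unlike pivot_table, pivot raises an error on duplicate index/column pairs instead of silently aggregating them, which can be useful as a sanity check.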
df = {'Region':['France','France','France','France'],'total':[1,2,3,4],'date':['12/30/19','12/31/19','01/01/20','01/02/20']}
df=pd.DataFrame.from_dict(df)
print(df)
Region total date
0 France 1 12/30/19
1 France 2 12/31/19
2 France 3 01/01/20
3 France 4 01/02/20
The dates are ordered. Now, if I use pivot:
pandas_temp = df.pivot(index='Region',values='total', columns='date')
print(pandas_temp)
date 01/01/20 01/02/20 12/30/19 12/31/19
Region
France 3 4 1 2
I am losing the order. How can I keep it?
Convert the values to datetimes before pivoting and then, if necessary, convert them back to your custom format:
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
pandas_temp = df.pivot(index='Region',values='total', columns='date')
pandas_temp = pandas_temp.rename(columns=lambda x: x.strftime('%m/%d/%y'))
#alternative
#pandas_temp.columns = pandas_temp.columns.strftime('%m/%d/%y')
print (pandas_temp)
date 12/30/19 12/31/19 01/01/20 01/02/20
Region
France 1 2 3 4
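If the rows are already in the desired order and the dates do not need to be parsed, another option is to reorder the pivoted columns by order of first appearance. A minimal sketch, assuming each date string should appear in the order it first occurs in the original frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['France', 'France', 'France', 'France'],
    'total': [1, 2, 3, 4],
    'date': ['12/30/19', '12/31/19', '01/01/20', '01/02/20'],
})

# pivot sorts the string labels lexicographically ...
pandas_temp = df.pivot(index='Region', columns='date', values='total')

# ... so restore the order in which the dates first appear in df.
pandas_temp = pandas_temp.reindex(columns=df['date'].unique())

print(pandas_temp)
```

This keeps the labels as plain strings, so it only helps when the input order is already chronological; parsing to datetimes is the more robust choice otherwise.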
I have a dataframe structured as follows:
Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Mar D
Jason Jan B
Sue Apr A
Jason Feb C
I want to get the list of students who got a D in 3 consecutive months within the past 6 months. In the example above, Sue will be on the list since she got a D in Jan, Feb and Mar. How can I do that using Python, pandas or NumPy?
Here is a solution, although it may not be the fastest in terms of execution time:
newdf = df.pivot(index='Name', columns='Month', values='Grade')
newdf = newdf[['Jan', 'Feb', 'Mar', 'Apr']].fillna(-1)
newdf['concatenated'] = newdf['Jan'].astype('str') + newdf['Feb'].astype('str') + newdf['Mar'].astype('str') + newdf['Apr'].astype('str')
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)]
Output will be like:
Month Jan Feb Mar Apr concatenated
Name
Sue D D D A DDDA
If you just want the names, use the following command instead.
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)].index.to_list()
I came up with this.
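df['Month_Nr'] = pd.to_datetime(df.Month, format='%b').dt.month
students = []
for name in df.Name.unique():
    # month numbers in which this student got a D, in calendar order
    d_months = df[(df.Name == name) & (df.Grade == 'D')].sort_values('Month_Nr')['Month_Nr']
    # a new run starts whenever the gap to the previous D month is not exactly 1
    runs = d_months.diff().ne(1).cumsum()
    # keep the student if the longest run covers at least 3 months
    if not d_months.empty and runs.value_counts().max() >= 3:
        students.append(name)
print(students)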
Output:
['Sue']
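The same run-length idea can be written without an explicit loop. The sketch below is one possible vectorized variant, not the answer's original code. It maps month abbreviations through a plain dictionary to avoid locale-dependent parsing, and it adds a hypothetical student, Ann, with D grades in Jan, Mar and Apr to show that a gap breaks the run:

```python
import pandas as pd

# Question's data plus a hypothetical student (Ann) with a gap in her D months.
df = pd.DataFrame({
    'Name':  ['Sue', 'Sue', 'Jason', 'Sue', 'Jason', 'Sue', 'Jason', 'Ann', 'Ann', 'Ann'],
    'Month': ['Jan', 'Feb', 'Mar', 'Mar', 'Jan', 'Apr', 'Feb', 'Jan', 'Mar', 'Apr'],
    'Grade': ['D', 'D', 'B', 'D', 'B', 'A', 'C', 'D', 'D', 'D'],
})

month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_nr = {name: nr for nr, name in enumerate(month_names, start=1)}
df['Month_Nr'] = df['Month'].map(month_nr)

# Only the D rows matter; order them per student by month.
d_rows = df[df['Grade'] == 'D'].sort_values(['Name', 'Month_Nr'])

# A new run starts when the student changes or the month is not previous + 1.
new_run = d_rows['Name'].ne(d_rows['Name'].shift()) | d_rows['Month_Nr'].diff().ne(1)
run_id = new_run.cumsum()

# Length of the run each row belongs to.
run_len = d_rows.groupby(run_id)['Month_Nr'].transform('count')

students = sorted(d_rows.loc[run_len >= 3, 'Name'].unique())
print(students)
```

Note that this sketch treats months within a single calendar year; a streak that wraps from December into January would need the date handling discussed in the next answer.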
You have a few ways to deal with this. The first is to use my previous solution, but that requires mapping months to academic-year numbers (i.e. September = 1, August = 12) so that you can apply arithmetic to work out consecutive values.
The second, shown below, is to convert the Month into a datetime and work out the difference in months; we can then apply a cumulative sum and filter for values of 3 or more.
from io import StringIO
import numpy as np
import pandas as pd
d = StringIO("""Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Dec D
Jason Jan B
Sue Apr A
Jason Feb C""")
df = pd.read_csv(d, sep=r'\s+')
df['date'] = pd.to_datetime(df['Month'], format='%b').dt.normalize()
# set any values greater than June to the previous year.
df['date'] = np.where(df['date'].dt.month > 6,
                      (df['date'] - pd.DateOffset(years=1)), df['date'])
df.sort_values(['Name', 'date'], inplace=True)
def month_diff(date):
    # express each date as a running month number, so that a step of
    # exactly one calendar month shows up as a difference of 1
    months = date.dt.year * 12 + date.dt.month
    cumulative_months = months.diff().eq(1).cumsum() + 1
    return cumulative_months
# transform keeps the original row index, so the result aligns with df
df['count'] = df.groupby(["Name", "Grade"])["date"].transform(month_diff)
print(df.drop('date',axis=1))
Name Month Grade count
4 Jason Jan B 1
6 Jason Feb C 1
2 Jason Mar B 1
3 Sue Dec D 1
0 Sue Jan D 2
1 Sue Feb D 3
5 Sue Apr A 1
print(df.loc[df['Name'] == 'Sue'])
Name Month Grade date count
3 Sue Dec D 1899-12-01 1
0 Sue Jan D 1900-01-01 2
1 Sue Feb D 1900-02-01 3
5 Sue Apr A 1900-04-01 1
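The first option mentioned above, mapping months to academic-year numbers, can be sketched as follows. It assumes an academic year that runs from September (1) to August (12), as in the example given, and relies on the default English month abbreviations from the standard calendar module:

```python
import calendar

# Assumed academic year: September = 1, ..., August = 12.
# calendar.month_abbr[1] is 'Jan', so (8 + i) % 12 + 1 walks Sep, Oct, ..., Aug.
academic_order = {
    calendar.month_abbr[(8 + i) % 12 + 1]: i + 1
    for i in range(12)
}

# December, January and February become consecutive integers,
# so a plain difference of 1 detects the wrap over the new year.
winter = [academic_order[m] for m in ['Dec', 'Jan', 'Feb']]
print(academic_order)
print(winter)
```

With this mapping, df['Month'].map(academic_order) could replace a calendar month number in any consecutive-month check.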
I have a tab-delimited file with movie genre and year in 2 columns:
Comedy 2013
Comedy 2014
Drama 2012
Mystery 2011
Comedy 2013
Comedy 2013
Comedy 2014
Comedy 2013
News 2012
Sport 2012
Sci-Fi 2013
Comedy 2014
Family 2013
Comedy 2013
Drama 2013
Biography 2013
I want to group the genres together by year and print out in the following format (does not have to be in alphabetical order):
Year 2011 2012 2013 2014
Biography 0 0 1 0
Comedy 0 0 5 3
Drama 0 1 1 0
Family 0 0 1 0
Mystery 1 0 0 0
News 0 1 0 0
Sci-Fi 0 0 1 0
Sport 0 1 0 0
How should I approach it? At the moment I'm creating my output through MS Excel, but I would like to do it through Python.
If you would rather not use pandas, you can do it as follows:
from collections import Counter
# load file, skipping blank lines (e.g. a trailing newline)
with open('tab.txt') as f:
    lines = [l for l in f.read().splitlines() if l.strip()]
# replace separating whitespace with exactly one space
lines = [' '.join(l.split()) for l in lines]
# find all years and genres
genres = sorted(set(l.split()[0] for l in lines))
years = sorted(set(l.split()[1] for l in lines))
# count genre-year combinations
C = Counter(lines)
# print table
print("Year".ljust(10), end=' ')
for y in years:
    print(y.rjust(6), end=' ')
print()
for g in genres:
    print(g.ljust(10), end=' ')
    for y in years:
        print(str(C[g + ' ' + y]).rjust(6), end=' ')
    print()
The most interesting function is probably Counter, which counts the number of occurrences of each element. To make sure that the length of the separating whitespace does not influence the counting, I replace it with a single space beforehand.
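Counter also accepts tuples as keys, which avoids normalising the whitespace into a joined string. A small sketch on a hypothetical four-row excerpt of the data (the rows list stands in for the parsed file):

```python
from collections import Counter

# Hypothetical excerpt; in practice each tuple would come from line.split().
rows = [("Comedy", "2013"), ("Comedy", "2014"), ("Drama", "2012"), ("Comedy", "2013")]

# Count each (genre, year) pair directly.
counts = Counter(rows)

genres = sorted({genre for genre, _ in rows})
years = sorted({year for _, year in rows})

# Missing pairs default to 0 because Counter returns 0 for unseen keys.
table = {genre: [counts[(genre, year)] for year in years] for genre in genres}

print(years)
print(table)
```

The table dictionary holds one list of counts per genre, ordered like years, and can be printed with the same ljust/rjust formatting as above.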
The easiest way to do this is with the pandas library, which provides many ways of working with tables of data:
df = pd.read_clipboard(names=['genre', 'year'])
df.pivot_table(index='genre', columns='year', aggfunc=len, fill_value=0)
Output:
year 2011 2012 2013 2014
genre
Biography 0 0 1 0
Comedy 0 0 5 3
Drama 0 1 1 0
Family 0 0 1 0
Mystery 1 0 0 0
News 0 1 0 0
Sci-Fi 0 0 1 0
Sport 0 1 0 0
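For a self-contained variant that does not depend on the clipboard, pd.crosstab produces the same kind of frequency table. The sketch below rebuilds the question's 16 rows in memory as tab-delimited text; reading from an actual file path would work the same way:

```python
from io import StringIO

import pandas as pd

# The question's rows, rebuilt as tab-delimited text.
pairs = [
    ("Comedy", 2013), ("Comedy", 2014), ("Drama", 2012), ("Mystery", 2011),
    ("Comedy", 2013), ("Comedy", 2013), ("Comedy", 2014), ("Comedy", 2013),
    ("News", 2012), ("Sport", 2012), ("Sci-Fi", 2013), ("Comedy", 2014),
    ("Family", 2013), ("Comedy", 2013), ("Drama", 2013), ("Biography", 2013),
]
raw = "\n".join(f"{genre}\t{year}" for genre, year in pairs)

# The data has no header row, so column names are supplied explicitly.
df = pd.read_csv(StringIO(raw), sep="\t", names=["genre", "year"])

# Frequency table: one row per genre, one column per year, 0 where absent.
table = pd.crosstab(df["genre"], df["year"])
print(table)
```

crosstab fills missing genre/year combinations with 0 by default, so no fill_value argument is needed.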
If you're only just starting with Python, you might find trying to learn pandas is a bit too much on top of learning the language, but once you have some Python knowledge, pandas provides very intuitive ways to interact with data.