Count how many values occur in a month Pandas - python

I have a dataframe as follows:
Country
From Date
02/04/2020 Canada
04/02/2020 Ireland
10/03/2020 France
11/03/2020 Italy
15/03/2020 Hungary
.
.
.
10/10/2020 Canada
And I simply want to do a groupby() or something similar which will tell me how many times a country occurs per month
eg.
Canada Ireland France . . .
2010 1 3 4 1
2 4 3 2
.
.
.
10 4 4 4
Is there a simple way to do this?
Any help much appreciated!

A different angle to solve your question would be to use groupby, value_counts and unstack.
It goes like this:
I assume your "From Date" is of type datetime (datetime64[ns]); if not:
df['From Date']=pd.to_datetime(df['From Date'], format= '%d/%m/%Y')
convert the date to string with Year + Month:
df['From Date'] = df['From Date'].dt.strftime('%Y-%m')
group by From Date and count the values:
df.groupby(['From Date'])['Country'].value_counts().unstack().fillna(0).astype(int)
desired result (from the snapshot in your question):
Country Canada France Hungary Ireland Italy
From Date
2020-02 0 0 0 1 0
2020-03 0 1 1 0 1
2020-04 1 0 0 0 0
Note the unstack() that places the countries on the horizontal axis, astype(int) to avoid values such as 1.0, and fillna(0) so that a country with no occurrences in a month shows zero.
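Put together, the steps above can be run end-to-end on a small sample (the column names follow the question; the data values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'From Date': ['02/04/2020', '04/02/2020', '10/03/2020', '11/03/2020', '15/03/2020'],
    'Country': ['Canada', 'Ireland', 'France', 'Italy', 'Hungary'],
})

# parse day-first dates, reduce to year-month, then count countries per month
df['From Date'] = pd.to_datetime(df['From Date'], format='%d/%m/%Y').dt.strftime('%Y-%m')
counts = df.groupby('From Date')['Country'].value_counts().unstack().fillna(0).astype(int)
print(counts)
```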

Check with crosstab
# df.index=pd.to_datetime(df.index, format= '%d/%m/%Y')
pd.crosstab(df.index.strftime('%Y-%m'), df['Country'])

Related

How can I count # of occurrences of more than one column (e.g. city & country)?

Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city & country, and then count()ing occurrences of this new all-1s column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
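An equivalent that doesn't rely on DataFrame.value_counts (which, if I recall correctly, was only added in pandas 1.1) is groupby + size; a sketch:

```python
import pandas as pd

df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
                  columns=['city', 'country'])

# size() counts rows per group, with no helper column needed
n = df.groupby(['city', 'country']).size().reset_index(name='n')
print(n)
```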

keep order while using python pandas pivot

df = {'Region':['France','France','France','France'],'total':[1,2,3,4],'date':['12/30/19','12/31/19','01/01/20','01/02/20']}
df=pd.DataFrame.from_dict(df)
print(df)
Region total date
0 France 1 12/30/19
1 France 2 12/31/19
2 France 3 01/01/20
3 France 4 01/02/20
The dates are ordered. Now if I am using pivot
pandas_temp = df.pivot(index='Region',values='total', columns='date')
print(pandas_temp)
date 01/01/20 01/02/20 12/30/19 12/31/19
Region
France 3 4 1 2
I am losing the order. How can I keep it ?
Convert values to datetimes before pivot and then if necessary convert to your custom format:
df['date'] = pd.to_datetime(df['date'])
pandas_temp = df.pivot(index='Region',values='total', columns='date')
pandas_temp = pandas_temp.rename(columns=lambda x: x.strftime('%m/%d/%y'))
#alternative
#pandas_temp.columns = pandas_temp.columns.strftime('%m/%d/%y')
print (pandas_temp)
date 12/30/19 12/31/19 01/01/20 01/02/20
Region
France 1 2 3 4
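If you'd rather not round-trip through datetimes, an alternative sketch (not from the answer above) is to reorder the pivoted columns by the dates' first order of appearance:

```python
import pandas as pd

df = pd.DataFrame({'Region': ['France', 'France', 'France', 'France'],
                   'total': [1, 2, 3, 4],
                   'date': ['12/30/19', '12/31/19', '01/01/20', '01/02/20']})

# pivot sorts columns lexicographically; re-select them in original order
pivoted = df.pivot(index='Region', values='total', columns='date')
pivoted = pivoted[df['date'].unique()]
print(pivoted)
```

This only works when the rows of the original dataframe are already in the desired date order, as they are in the question.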

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows on each country where the year adds 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though, I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation on the new dataframe (add 20 to the year, multiply the temperature by a constant or an array, etc.), and then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
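The question also asks for a loop that repeats this a number of times. Building on the same copy-and-concat idea, a minimal sketch (n_steps and latest are assumed names; each projection compounds on the previous one):

```python
import pandas as pd

tempChange = 1.15
df = pd.DataFrame({'Country': ['Afghanistan', 'Africa'],
                   'avgTemp': [14.0, 24.0],
                   'Year': [2012, 2012]})

n_steps = 3  # how many 20-year projections to append
latest = df
for _ in range(n_steps):
    step = latest.copy()
    step['avgTemp'] *= tempChange   # compound the temperature change
    step['Year'] += 20              # advance 20 years per step
    df = pd.concat([df, step], ignore_index=True)
    latest = step

print(df)
```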
where df is your data frame name:
df['newYear'] = df['year'] + 20
df['newTemp'] = df['avgTemp'] * tempChange
This will add two new columns to your df with the logic above. (A single expression like df['year'] + 20 * df['avgTemp'] would evaluate as year + (20 * avgTemp) because of operator precedence, which is probably not what you want.) I'm not sure if I understood your logic correctly, so the math may need some work.
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20, axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp'] * tempChange, axis=1)
This is how you apply a function to each row; plain vectorised arithmetic (e.g. dfName['year'] + 20) gives the same result and is faster.

How to output the top 5 of a specific column along with associated columns using python?

I've tried using df2.nlargest(5, ['1960']), which gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it's outputting all the columns. I just want it to include the columns "Country Name" and "1960" only, sorted by the column "1960."
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000
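No answer is shown for this question in this snippet, but a likely approach (a sketch, assuming the column names from the question; the data below is made up to stand in for df2) is to select just the two columns before calling nlargest:

```python
import pandas as pd

# illustrative stand-in for df2
df2 = pd.DataFrame({'Country Name': ['China', 'India', 'USA', 'France', 'Germany', 'Malta'],
                    '1960': [5_000_000_000, 499_999_999, 300_000, 100_000, 90_000, 10]})

# restrict to the two wanted columns, then take the 5 largest by '1960'
top5 = df2[['Country Name', '1960']].nlargest(5, '1960')
print(top5)
```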

Conditionally filling blank values in Pandas dataframes

I have a datafarme which looks like as follows (there are more columns having been dropped off):
memberID shipping_country
264991
264991 Canada
100 USA
5000
5000 UK
I'm trying to fill the blank cells with existing value of shipping country for each user:
memberID shipping_country
264991 Canada
264991 Canada
100 USA
5000 UK
5000 UK
However, I'm not sure what's the most efficient way to do this on a large-scale dataset. Perhaps a vectorised groupby method?
You can use GroupBy + ffill / bfill:
def filler(x):
    return x.ffill().bfill()

res = df.groupby('memberID')['shipping_country'].apply(filler)
A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.
This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 54
This should work for you, and also has the behaviour that if a memberID group only has empty string values ('') in shipping_country, those will be retained in the output df:
import numpy as np
df['shipping_country'] = df.replace('', np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')
Yields:
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 54
If you would like to leave the empty strings '' as NaN in the output df, then just remove the fillna(''), leaving:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
You can use chained groupbys, one with forward fill and one with backfill:
# replace blank values with NaN first (pd.np is deprecated; use numpy directly):
import numpy as np
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
This method will also allow a group made up of all NaN to remain NaN:
>>> df
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 1
6 1
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 1 NaN
6 1 NaN
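On current pandas the same fill can be written as a single groupby.transform; a compact sketch combining the two answers (ffill then bfill within each group, leaving all-blank groups as NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'memberID': [264991, 264991, 100, 5000, 5000, 1, 1],
                   'shipping_country': ['', 'Canada', 'USA', '', 'UK', '', '']})

df['shipping_country'] = df['shipping_country'].replace('', np.nan)
# fill forward then backward within each memberID group; all-NaN groups stay NaN
df['shipping_country'] = (df.groupby('memberID')['shipping_country']
                            .transform(lambda s: s.ffill().bfill()))
print(df)
```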
