Here comes an easy one. I want to create a list from each column in my data frame, and I tried to loop over the columns:
for columnName in grouped.iteritems():
    columnName = grouped[columnName]
It gives me a TypeError: '('africa', year (note that africa is one of the columns and year is the index). Does anybody know what is going on here?
This is my dataframe
continent africa antarctica asia ... north america oceania south america
year ...
2009 NaN NaN 1.0 ... NaN NaN NaN
2010 94.0 1.0 306.0 ... 72.0 12.0 21.0
2011 26.0 NaN 171.0 ... 21.0 2.0 4.0
2012 975.0 28.0 5318.0 ... 480.0 58.0 140.0
2013 1627.0 30.0 7363.0 ... 725.0 124.0 335.0
2014 3476.0 41.0 7857.0 ... 1031.0 202.0 520.0
2015 2999.0 43.0 12048.0 ... 1374.0 256.0 668.0
2016 2546.0 55.0 11429.0 ... 1798.0 325.0 3021.0
2017 7486.0 155.0 18467.0 ... 2696.0 640.0 2274.0
2018 10903.0 340.0 22979.0 ... 2921.0 723.0 1702.0
2019 7367.0 194.0 15928.0 ... 1971.0 457.0 993.0
[11 rows x 7 columns]
So I would expect to get one list with eleven elements for each column.
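iteritems returns pairs of (column_name, column_data), similar to Python's dict.items(). Each columnName in your loop is therefore a tuple rather than a column label, which is why indexing grouped with it fails. (iteritems was deprecated in pandas 1.5 and removed in 2.0; items is the equivalent.) If you want to iterate over the column names, you can just iterate over grouped like so: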
result = {}
for column_name in grouped:
    result[column_name] = [*grouped[column_name]]
This will leave you with a plain Python dict containing plain Python lists in result. Note that you would get pandas Series instead of lists if you just wrote result[column_name] = grouped[column_name].
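As a compact alternative, the same dict of lists can be built with items() or in a single to_dict call. This is only a sketch: the small frame below is a made-up stand-in for grouped, using a few values from the question.

```python
import numpy as np
import pandas as pd

# Small stand-in for `grouped` (illustrative subset of the question's data)
grouped = pd.DataFrame(
    {"africa": [np.nan, 94.0, 26.0], "asia": [1.0, 306.0, 171.0]},
    index=pd.Index([2009, 2010, 2011], name="year"),
)

# items() yields (column_name, Series) pairs, like dict.items()
result = {name: column.tolist() for name, column in grouped.items()}

# to_dict(orient="list") builds the same mapping in a single call
same_result = grouped.to_dict(orient="list")
```

Both variants keep NaN entries in the lists, so each list has one element per row of the frame.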
I am working with the following df:
date/time wind (°) wind (kt) temp (C°) humidity(%) currents (°) currents (kt) stemp (C°) stemp_diff
0 2017-04-27 11:21:54 180.0 14.0 27.0 5000.0 200.0 0.4 25.4 2.6
1 2017-05-04 20:31:12 150.0 15.0 22.0 5000.0 30.0 0.2 26.5 -1.2
2 2017-05-08 05:00:52 110.0 6.0 25.0 5000.0 30.0 0.2 27.0 -1.7
3 2017-05-09 05:00:55 160.0 13.0 23.0 5000.0 30.0 0.6 27.0 -2.0
4 2017-05-10 16:39:16 160.0 20.0 22.0 5000.0 30.0 0.6 26.5 -1.8
5 ... ... ... ... ... ... ... ... ...
6 2020-10-25 00:00:00 5000.0 5000.0 21.0 81.0 5000.0 5000.0 23.0 -2.0
7 2020-10-26 00:00:00 5000.0 5000.0 21.0 77.0 5000.0 5000.0 23.0 -2.0
8 2020-10-27 00:00:00 5000.0 5000.0 21.0 80.0 5000.0 5000.0 23.0 -2.0
9 2020-10-31 00:00:00 5000.0 5000.0 22.0 79.0 5000.0 5000.0 23.0 -2.0
10 2020-11-01 00:00:00 5000.0 5000.0 19.0 82.0 5000.0 5000.0 23.0 -2.0
I would like to find a way to iterate through the years and months in my date/time column (i.e. April 2017 all the way to November 2020) and, for each unique time period, calculate the average of the stemp column as a starting point (so the average for April 2017, May 2017, and so on, or even just by month).
I'm trying to use this sort of code, but not quite sure how to translate it into something that works:
for '%MM' in australia_in['date/time']:
    australia_in['stemp_diff'] = australia_in['stemp (C°)'] - australia_out['stemp (C°)']
australia_in
Any thoughts?
Let's first make sure that your date/time column is of datetime type, and set it as the index:
df['date/time'] = pd.to_datetime(df['date/time'])
df = df.set_index('date/time')
Then you have several options.
Using pandas.Grouper (on pandas 2.2 and later the month-end alias is 'ME'; 'M' is deprecated there):
>>> df.groupby(pd.Grouper(freq='M'))['stemp (C°)'].mean().dropna()
date/time
2017-04-30 25.40
2017-05-31 26.75
2020-10-31 23.00
2020-11-30 23.00
Name: stemp (C°), dtype: float64
Using the year and month:
>>> df.groupby([df.index.year, df.index.month])['stemp (C°)'].mean()
date/time date/time
2017 4 25.40
5 26.75
2020 10 23.00
11 23.00
Name: stemp (C°), dtype: float64
If you do not want to have the date/time as index, you can use the column instead:
df.groupby([df['date/time'].dt.year, df['date/time'].dt.month])['stemp (C°)'].mean()
And finally, to have nice names for the year and month:
(df.groupby([df['date/time'].dt.year.rename('year'),
df['date/time'].dt.month.rename('month')])
['stemp (C°)'].mean()
)
output:
year month
2017 4 25.40
5 26.75
2020 10 23.00
11 23.00
Name: stemp (C°), dtype: float64
Applying the same calculation on stemp_diff:
year month
2017 4 2.600
5 -1.675
2020 10 -2.000
11 -2.000
Name: stemp_diff, dtype: float64
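For reference, here is a self-contained sketch of the named year/month grouping, computing both columns at once. The frame is a made-up subset of the question's rows with only the columns needed, so the averages differ from the full output above.

```python
import pandas as pd

# Illustrative subset of the question's rows (only the columns needed here)
df = pd.DataFrame({
    "date/time": ["2017-04-27 11:21:54", "2017-05-04 20:31:12",
                  "2017-05-08 05:00:52", "2020-10-25 00:00:00"],
    "stemp (C°)": [25.4, 26.5, 27.0, 23.0],
    "stemp_diff": [2.6, -1.2, -1.7, -2.0],
})
df["date/time"] = pd.to_datetime(df["date/time"])

# Group on named year and month keys taken from the datetime column
keys = [df["date/time"].dt.year.rename("year"),
        df["date/time"].dt.month.rename("month")]
monthly = df.groupby(keys)[["stemp (C°)", "stemp_diff"]].mean()
print(monthly)
```

Selecting a list of columns before mean() returns a DataFrame with one averaged column per input column, indexed by (year, month).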
Assuming I have a pandas dataframe df, whereby df is:
vaccine age height
0 pfizer 48.0 181.0
1 moderna 81.0 175.0
2 moderna 27.0 190.4
3 moderna 64.0 178.5
I am trying to join the two columns (age, height) into a single column row-wise, whereby age and height are grouped by their respective vaccine.
Basically, I am trying to get:
vaccine new_col
0 pfizer 48.0
1 pfizer 181.0
2 moderna 81.0
3 moderna 175.0
4 moderna 27.0
5 moderna 190.4
6 moderna 64.0
7 moderna 178.5
I have unsuccessfully tried using pd.concat, df.merge, etc. I am not familiar with any pandas function that does this. I also tried using the apply function but I wasn't successful.
First set vaccine as the index, then stack the dataframe so that age and height become rows, drop the second index level (level 1, which holds the original column names), and finally reset the index:
df.set_index('vaccine').stack().droplevel(1).to_frame('new_col').reset_index()
vaccine new_col
0 pfizer 48.0
1 pfizer 181.0
2 moderna 81.0
3 moderna 175.0
4 moderna 27.0
5 moderna 190.4
6 moderna 64.0
7 moderna 178.5
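To see what each step of that chain does, the one-liner can be split up as follows. This is a sketch that rebuilds the question's frame by hand; the intermediate variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "vaccine": ["pfizer", "moderna", "moderna", "moderna"],
    "age": [48.0, 81.0, 27.0, 64.0],
    "height": [181.0, 175.0, 190.4, 178.5],
})

# stack() turns the age/height columns into an inner index level,
# giving a Series indexed by (vaccine, original column name)
stacked = df.set_index("vaccine").stack()

# Drop the inner level so only the vaccine label remains
values = stacked.droplevel(1)

# Name the values column and move vaccine back to a regular column
out = values.to_frame("new_col").reset_index()
print(out)
```

Because stack works row by row, each age is immediately followed by the height from the same original row.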
I have pivoted the Customer ID against their year of purchase, so that I know how many times each customer purchased in different years:
Customer ID 1996 1997 ... 2019 2020
100000000000001 7 7 ... NaN NaN
100000000000002 8 8 ... NaN NaN
100000000000003 7 4 ... NaN NaN
100000000000004 NaN NaN ... 21 24
100000000000005 17 11 ... 18 NaN
My desired result is to append two columns: the latest year of purchase (Last) and the number of years since that purchase (Recency):
Customer ID 1996 1997 ... 2019 2020 Last Recency
100000000000001 7 7 ... NaN NaN 1997 23
100000000000002 8 8 ... NaN NaN 1997 23
100000000000003 7 4 ... NaN NaN 1997 23
100000000000004 NaN NaN ... 21 24 2020 0
100000000000005 17 11 ... 18 NaN 2019 1
Here is what I tried:
df_pivot["Last"] = 2020
k = 2020
while math.isnan(df_pivot2[k]):
    df_pivot["Last"] = k-1
    k = k-1
df_pivot["Recency"] = 2020 - df_pivot["Last"]
However what I got is "TypeError: cannot convert the series to <class 'float'>"
Could anyone help me to get the result I need?
Thanks a lot!
Dennis
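You can get the last year of purchase using notna + cumsum and idxmax along axis=1, then subtract it from the max year to compute Recency. The running count of non-null values stops increasing after the last purchase, and idxmax returns the first column where that maximum is reached: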
c = df.filter(regex=r'\d+').columns
df['Last'] = df[c].notna().cumsum(1).idxmax(1)
df['Recency'] = c.max() - df['Last']
Customer ID 1996 1997 2019 2020 Last Recency
0 100000000000001 7.0 7.0 NaN NaN 1997 23
1 100000000000002 8.0 8.0 NaN NaN 1997 23
2 100000000000003 7.0 4.0 NaN NaN 1997 23
3 100000000000004 NaN NaN 21.0 24.0 2020 0
4 100000000000005 17.0 11.0 18.0 NaN 2019 1
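Here is a small self-contained sketch of the same idea, with the intermediate steps named. It assumes the year columns are integers and, unlike the frame above, keeps Customer ID in the index so that every column is a year; the three rows are taken from the question.

```python
import numpy as np
import pandas as pd

# Illustrative pivot: customer IDs in the index, integer years as columns
pivot = pd.DataFrame(
    {1996: [7, np.nan, 17], 1997: [7, np.nan, 11],
     2019: [np.nan, 21, 18], 2020: [np.nan, 24, np.nan]},
    index=pd.Index([100000000000001, 100000000000004, 100000000000005],
                   name="Customer ID"),
)
years = pivot.columns

has_purchase = pivot.notna()            # True where a purchase count exists
running = has_purchase.cumsum(axis=1)   # stops increasing after the last purchase
last = running.idxmax(axis=1)           # first column that holds the row maximum
recency = years.max() - last            # years since the last purchase

result = pivot.assign(Last=last, Recency=recency)
print(result)
```

Gaps between purchase years do not matter here, since only the position where the running count reaches its final value is used.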
One idea is to apply applymap(float) to your DataFrame (see the pandas documentation).
I am working with the Johns Hopkins Covid data for personal use to create charts. The data shows cumulative deaths by country; I want deaths per day. It seems to me the easiest way is to create two dataframes and subtract one from the other. But the file has dates as column names, and code such as df3 = df2 - df1 subtracts the columns with matching dates. So I want to rename all the columns with some easy index, for example 1, 2, 3, ....
I cannot figure out how to do this.
new_names=list(range(data.shape[1]))
data.columns=new_names
This renames the columns of data from 0 upwards.
You could reshape the data: use dates as row labels, and use country and province as column labels.
import pandas as pd
covid_csv = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
df_raw = (pd.read_csv(covid_csv)
            .set_index(['Country/Region', 'Province/State'])
            .drop(columns=['Lat', 'Long'])
            .transpose())
df_raw.index = pd.to_datetime(df_raw.index)
print( df_raw.iloc[-5:, 0:5] )
Country/Region Afghanistan Albania Algeria Andorra Angola
Province/State NaN NaN NaN NaN NaN
2020-07-27 1269 144 1163 52 41
2020-07-28 1270 148 1174 52 47
2020-07-29 1271 150 1186 52 48
2020-07-30 1271 154 1200 52 51
2020-07-31 1272 157 1210 52 52
Now, you can use the rich set of pandas tools for time-series analysis. For example, use diff() to go from cumulative deaths to per-day rates. Or, you could compute N-day moving averages, create time-series plots, ...
print(df_raw.diff().iloc[-5:, 0:5])
Country/Region Afghanistan Albania Algeria Andorra Angola
Province/State NaN NaN NaN NaN NaN
2020-07-27 10.0 6.0 8.0 0.0 1.0
2020-07-28 1.0 4.0 11.0 0.0 6.0
2020-07-29 1.0 2.0 12.0 0.0 1.0
2020-07-30 0.0 4.0 14.0 0.0 3.0
2020-07-31 1.0 3.0 10.0 0.0 1.0
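Finally, df_raw.sum(level='Country/Region', axis=1) will aggregate all provinces within a country. The level argument of sum was deprecated in pandas 1.3 and removed in 2.0; on newer versions, df_raw.T.groupby(level='Country/Region').sum().T should give an equivalent aggregation.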
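If you want to try diff() without downloading the file, here is a minimal offline sketch. The numbers and the country names A and B are made up; only the shape (dates as rows, cumulative counts as columns) mirrors the reshaped data.

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical countries
cumulative = pd.DataFrame(
    {"A": [10, 12, 12, 15], "B": [0, 1, 4, 4]},
    index=pd.date_range("2020-07-28", periods=4, freq="D"),
)

# Per-day values: each row minus the previous row
# (the first row is NaN because there is nothing to subtract from)
daily = cumulative.diff()

# 3-day moving average of the per-day values
rolling_avg = daily.rolling(window=3).mean()

print(daily)
```

No column renaming is needed with this layout, because the subtraction runs down the date index rather than across two separately labeled frames.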
Thanks for the time and effort but I figured out a simple way.
for i, col in enumerate(df):
    df.rename(columns={col: str(i)}, inplace=True)
to change the column names, and then
for i, col in enumerate(df):
    df.rename(columns={col: str(i + 43853)}, inplace=True)
to change them back to the dates I want.
I've used group by and pivot table from pandas package in order to create the following table:
Input:
q4 = q1[['category','Month']].groupby(['category','Month']).Month.agg({'Count':'count'}).reset_index()
q4 = pd.DataFrame(q4.pivot(index='category',columns='Month').reset_index())
then the output :
category Count
Month 6 7 8
0 adult-classes 29.0 109.0 162.0
1 air-pollution 27.0 43.0 13.0
2 babies-and-toddlers 4.0 51.0 2.0
3 bicycle 210.0 96.0 23.0
4 building NaN 17.0 NaN
5 buildings-maintenance 23.0 12.0 NaN
6 catering 1351.0 4881.0 1040.0
7 childcare 9.0 NaN NaN
8 city-planning 105.0 81.0 23.0
9 city-services 2461.0 2130.0 1204.0
10 city-taxes 1.0 4.0 42.0
I'm trying to add a condition on the months.
The problem I'm having is that after pivoting I can't access the columns.
How can I show only the rows where the count for month 6 < month 7 < month 8?
To flatten your multi-index, you can use renaming of your columns (check out this answer).
q4.columns = [''.join([str(c) for c in col]).strip() for col in q4.columns.values]
To remove NaNs:
q4.fillna(0, inplace=True)
To select according to your constraint:
result = q4[(q4['Count6'] < q4['Count7']) & (q4['Count7'] < q4['Count8'])]
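Putting the three steps together, a self-contained sketch could look like this. The frame is built by hand with the same two-level columns that the pivot produces, using three rows from the question. Keep in mind that filling NaN with 0 is a choice: a missing month then counts as zero in the comparison.

```python
import numpy as np
import pandas as pd

# Illustrative frame with the same two-level columns as the pivoted q4
q4 = pd.DataFrame(
    [["adult-classes", 29.0, 109.0, 162.0],
     ["bicycle", 210.0, 96.0, 23.0],
     ["building", np.nan, 17.0, np.nan]],
    columns=pd.MultiIndex.from_tuples(
        [("category", ""), ("Count", 6), ("Count", 7), ("Count", 8)]),
)

# Flatten ('Count', 6) -> 'Count6' and ('category', '') -> 'category'
q4.columns = ["".join(str(c) for c in col).strip() for col in q4.columns]

# Treat missing months as zero purchases
q4 = q4.fillna(0)

# Keep rows whose counts strictly increase from month 6 to month 8
result = q4[(q4["Count6"] < q4["Count7"]) & (q4["Count7"] < q4["Count8"])]
print(result)
```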