Assume I have a pandas dataframe df, where df is:
vaccine age height
0 pfizer 48.0 181.0
1 moderna 81.0 175.0
2 moderna 27.0 190.4
3 moderna 64.0 178.5
I am trying to join the two columns (age, height) into a single column row-wise, so that the age and height values stay grouped under their respective vaccine.
Basically, I am trying to get:
vaccine new_col
0 pfizer 48.0
1 pfizer 181.0
2 moderna 81.0
3 moderna 175.0
4 moderna 27.0
5 moderna 190.4
6 moderna 64.0
7 moderna 178.5
I have unsuccessfully tried pd.concat, df.merge, etc., and I am not aware of a pandas function that does this. I also tried the apply function, but without success.
First set the index to vaccine, then stack the dataframe, drop the index level at position 1, and finally reset the index:
df.set_index('vaccine').stack().droplevel(1).to_frame('new_col').reset_index()
vaccine new_col
0 pfizer 48.0
1 pfizer 181.0
2 moderna 81.0
3 moderna 175.0
4 moderna 27.0
5 moderna 190.4
6 moderna 64.0
7 moderna 178.5
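For completeness, a self-contained sketch of the above, with the sample frame rebuilt from the question:
import pandas as pd

df = pd.DataFrame({
    'vaccine': ['pfizer', 'moderna', 'moderna', 'moderna'],
    'age': [48.0, 81.0, 27.0, 64.0],
    'height': [181.0, 175.0, 190.4, 178.5],
})

# stack() interleaves age and height row by row, droplevel(1) removes the
# column-name level, and reset_index() turns vaccine back into a column.
out = (df.set_index('vaccine')
         .stack()
         .droplevel(1)
         .to_frame('new_col')
         .reset_index())
print(out)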
I have a sample similar to the problem I am running into. Here, I have company name and revenue for 3 years. The revenue is given in 3 different datasets. When I concatenate the data, it looks as follows:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10.0 NaN NaN
1 company_2 20.0 NaN NaN
2 company_3 30.0 NaN NaN
3 company_1 NaN 20.0 NaN
4 company_2 NaN 30.0 NaN
5 company_3 NaN 40.0 NaN
6 company_1 NaN NaN 50.0
7 company_2 NaN NaN 60.0
8 company_3 NaN NaN 70.0
9 company_4 NaN NaN 80.0
What I am trying to do is have company_name followed by the actual revenue columns; in a sense, drop the duplicate company_name rows and fold their data into the corresponding company_name row. My desired output:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10 20 50
1 company_2 20 30 60
2 company_3 30 40 70
3 company_4 0 0 80
Use melt and pivot_table:
# Melt to long form (one row per company/year pair), drop the NaN rows, then
# pivot back to wide, filling the missing combinations with 0.
out = (df.melt('company_name').dropna()
       .pivot_table('value', 'company_name', 'variable', fill_value=0)
       .rename_axis(columns=None).reset_index())
print(out)
# Output
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10 20 50
1 company_2 20 30 60
2 company_3 30 40 70
3 company_4 0 0 80
You can try:
df.set_index('company_name').stack().unstack().reset_index()
Or
df.groupby('company_name', as_index=False).first()
Output:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10.0 20.0 50.0
1 company_2 20.0 30.0 60.0
2 company_3 30.0 40.0 70.0
3 company_4 NaN NaN 80.0
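Note that, unlike the melt/pivot_table answer above, both of these keep NaN for the missing entries; chain .fillna(0) onto either expression if you want the zero-filled version.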
I would say concat might not be the join you should be using; instead try df_merge = pd.merge(df1, df2, how='inner', on='company'). Then you can do the same again with df_merge (your newly merged data) and the next dataframe. This keeps everything in line and only adds the columns the frames do not share. If the frames have more than the two columns you are looking at, you may need a little more data cleaning to get only the results you want, but this should get you started with your data all in the correct place.
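A sketch of that chained merge, using hypothetical per-year frames df_2020, df_2021, df_2022 (each holding company_name plus one revenue column); note it uses how='outer' rather than 'inner', since an inner join would drop company_4, which only appears in the 2022 data:
import pandas as pd

# df_2020, df_2021, df_2022 are hypothetical stand-ins for the three source datasets.
merged = pd.merge(df_2020, df_2021, how='outer', on='company_name')
merged = pd.merge(merged, df_2022, how='outer', on='company_name')
merged = merged.fillna(0)  # match the zero-filled desired output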
I have a Pandas dataset with a monthly Date-time index and a column of outstanding orders (like below):
Date        orders
1991-01-01  NaN
1991-02-01  NaN
1991-03-01  24
1991-04-01  NaN
1991-05-01  NaN
1991-06-01  NaN
1991-07-01  NaN
1991-08-01  34
1991-09-01  NaN
1991-10-01  NaN
1991-11-01  22
1991-12-01  NaN
I want to linearly interpolate the values to fill the NaNs. However, the interpolation has to be applied within 6-month blocks (non-rolling). For example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01; within each block the interpolation runs both forward and backward, and where a block starts or ends with NaN the interpolation should descend to a final value of 0. So for the same dataset above, here is how I would like the end result to look:
Date        orders
1991-01-01  8
1991-02-01  16
1991-03-01  24
1991-04-01  18
1991-05-01  12
1991-06-01  6
1991-07-01  17
1991-08-01  34
1991-09-01  30
1991-10-01  26
1991-11-01  22
1991-12-01  11
I am lost on how to do this in Pandas however. Any ideas?
The idea is to group the rows into 6-month bins, prepend and append a 0 value to each group, interpolate, and then remove the leading and trailing 0 again:
df['Date'] = pd.to_datetime(df['Date'])

# Pad each group with 0 on both ends, interpolate linearly between the
# known values, then slice the padding back off.
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]

# '6MS' buckets the dates into non-rolling 6-month blocks anchored at month starts.
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print(df)
Date orders
0 1991-01-01 8.0
1 1991-02-01 16.0
2 1991-03-01 24.0
3 1991-04-01 18.0
4 1991-05-01 12.0
5 1991-06-01 6.0
6 1991-07-01 17.0
7 1991-08-01 34.0
8 1991-09-01 30.0
9 1991-10-01 26.0
10 1991-11-01 22.0
11 1991-12-01 11.0
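For reference, a minimal sketch to reproduce the input frame from the question (with this construction the pd.to_datetime call above is already a no-op):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('1991-01-01', periods=12, freq='MS'),
    'orders': [np.nan, np.nan, 24, np.nan, np.nan, np.nan,
               np.nan, 34, np.nan, np.nan, 22, np.nan],
})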
Here comes an easy one: I want to create a list from each of the columns in my data frame, and tried to loop over it.
for columnName in grouped.iteritems():
columnName = grouped[columnName]
It gives me a TypeError mentioning ('africa', year) (note: africa is one of the columns and year the index). Does anybody know what is going on here?
This is my dataframe
continent africa antarctica asia ... north america oceania south america
year ...
2009 NaN NaN 1.0 ... NaN NaN NaN
2010 94.0 1.0 306.0 ... 72.0 12.0 21.0
2011 26.0 NaN 171.0 ... 21.0 2.0 4.0
2012 975.0 28.0 5318.0 ... 480.0 58.0 140.0
2013 1627.0 30.0 7363.0 ... 725.0 124.0 335.0
2014 3476.0 41.0 7857.0 ... 1031.0 202.0 520.0
2015 2999.0 43.0 12048.0 ... 1374.0 256.0 668.0
2016 2546.0 55.0 11429.0 ... 1798.0 325.0 3021.0
2017 7486.0 155.0 18467.0 ... 2696.0 640.0 2274.0
2018 10903.0 340.0 22979.0 ... 2921.0 723.0 1702.0
2019 7367.0 194.0 15928.0 ... 1971.0 457.0 993.0
[11 rows x 7 columns]
So I would expect to get one list with eleven elements for each column.
iteritems returns pairs of (column_name, column_data), similar to Python's dict.items(). If you want to iterate over the column names, you can just iterate over grouped like so:
result = {}
for column_name in grouped:
    result[column_name] = [*grouped[column_name]]
This will leave you with a plain Python dict of plain Python lists in result. Note that you would get pandas Series instead of lists if you just did result[column_name] = grouped[column_name].
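Side note: iteritems was deprecated in pandas 1.5 and removed in pandas 2.0. The same loop with items() (a sketch, assuming grouped is the frame above):
result = {}
for column_name, column_data in grouped.items():
    result[column_name] = column_data.tolist()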
I've used groupby and pivot_table from the pandas package in order to create the following table:
Input:
q4 = q1[['category','Month']].groupby(['category','Month']).Month.agg({'Count':'count'}).reset_index()
q4 = pd.DataFrame(q4.pivot(index='category',columns='Month').reset_index())
then the output :
category Count
Month 6 7 8
0 adult-classes 29.0 109.0 162.0
1 air-pollution 27.0 43.0 13.0
2 babies-and-toddlers 4.0 51.0 2.0
3 bicycle 210.0 96.0 23.0
4 building NaN 17.0 NaN
5 buildings-maintenance 23.0 12.0 NaN
6 catering 1351.0 4881.0 1040.0
7 childcare 9.0 NaN NaN
8 city-planning 105.0 81.0 23.0
9 city-services 2461.0 2130.0 1204.0
10 city-taxes 1.0 4.0 42.0
I'm trying to add a condition on the months. The problem I'm having is that after pivoting I can't access the columns. How can I show only the rows where the month-6 count < month-7 count < month-8 count?
To flatten your MultiIndex columns, you can rename them (check out this answer):
q4.columns = [''.join([str(c) for c in col]).strip() for col in q4.columns.values]
To remove NaNs:
q4.fillna(0, inplace=True)
To select according to your constraint:
result = q4[(q4['Count6'] < q4['Count7']) & (q4['Count7'] < q4['Count8'])]
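Alternatively, you can skip the flattening and index the MultiIndex columns with tuples directly (a sketch, assuming the month level kept its integer labels, which matches the printout above):
mask = (q4[('Count', 6)] < q4[('Count', 7)]) & (q4[('Count', 7)] < q4[('Count', 8)])
result = q4[mask]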
I have a dataframe (df) that I break down into 4 new dfs (media, client, code_type, and date). media has one column of null values, while the other three are single-column dfs, each consisting of nulls. After replacing the nulls in each dataframe, I try pd.concat to get a single df and get the result below.
code_type
0 P
1 P
2 P
3 P
4 P
5 P
code_name media_type acq. revenue
0 RASH NaN 50.0 34004.0
1 100 NaN 10.0 1035.0
2 NEWS NaN 61.0 3475.0
3 DR NaN 53.0 4307.0
4 SPORTS NaN 45.0 6503.0
5 DOUBL NaN 13.0 4205.0
client_id
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
5 2.0
date
0 2016-08-15
1 2016-08-15
2 2016-08-15
3 2016-08-15
4 2016-08-15
5 2016-08-15
I pd.merge media with a separate df to replace the NaNs under media.media_type, which appends a new media_type_y column:
code_name media_type_x acq. revenue media_type_y
0 RASH NaN 282 34004.0 Radio
1 100 NaN 119 1035.0 NaN
2 NEWS NaN 81 3475.0 SiriusXM
3 DR NaN 33 4307.0 SiriusXM
4 SPORTS NaN 25 6503.0 SiriusXM
5 DOUBL NaN 23 4205.0 Podcast
I then drop media_type_x and rename media_type_y to just media_type
final = m.loc[:,('code_name','media_type_y', 'acquisition', 'revenue')]
final = final.rename(columns={'media_type_y': 'media_type'})
So that when I concatenate, I have a complete df.
clean = pd.concat([media, client, code_type, date], axis=1)
code media acq. revenue client code_type date
0 RASH Radio 50.0 34004.0 NaN NaN NaT
1 100 NaN 10.0 1035.0 NaN NaN NaT
2 NEWS SiriusXM 61.0 3475.0 NaN NaN NaT
3 DR SiriusXM 53.0 4307.0 NaN NaN NaT
4 SPORTS SiriusXM 45.0 6503.0 NaN NaN NaT
5 DOUBL Podcast 13.0 4205.0 NaN NaN NaT
clean.client is supposed to be all 2
clean.code_type should be all P
clean.date should be all 08/15/2016
The dfs by themselves show the data; it's only when I concatenate that I lose the information. I think it may be something with the indexes, but I'm not sure. It could also be related to the fact that I have a column with both str and int values (see clean.code above), which might be why I get the runtime warning listed below:
//anaconda/lib/python3.5/site-packages/pandas/indexes/api.py:71: RuntimeWarning: unorderable types: int() < str(), sort order is undefined for incomparable objects
result = result.union(other)
Starting with this:
code_name media_type acq. revenue
0 RASH Radio 50.0 34004.0
1 100 NaN 10.0 1035.0
2 NEWS SiriusXM 61.0 3475.0
3 DR SiriusXM 53.0 4307.0
4 SPORTS SiriusXM 45.0 6503.0
5 DOUBL Podcast 13.0 4205.0
Try this:
df['client_id'] = 2
df['date'] = '08/15/2016'
df['code_type'] = 'P'
df
code_name media_type acq. revenue client_id date code_type
0 RASH Radio 50.0 34004.0 2 08/15/2016 P
1 100 NaN 10.0 1035.0 2 08/15/2016 P
2 NEWS SiriusXM 61.0 3475.0 2 08/15/2016 P
3 DR SiriusXM 53.0 4307.0 2 08/15/2016 P
4 SPORTS SiriusXM 45.0 6503.0 2 08/15/2016 P
5 DOUBL Podcast 13.0 4205.0 2 08/15/2016 P
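If you do want to keep the pd.concat route, the lost values are most likely an index-alignment problem: concat with axis=1 pairs rows up by index label, and after the merge media no longer shares an index with the other three frames, so the mismatches become NaN. Resetting each index first usually fixes it (a sketch, assuming the four frames from the question):
# Align on position, not on leftover index labels, before gluing the columns together.
clean = pd.concat(
    [d.reset_index(drop=True) for d in (media, client, code_type, date)],
    axis=1,
)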