Economy  Year  Indicator1  Indicator2  Indicator3  Indicator4  ...
UK       1     23          45          56          78
UK       2     24          87          32          42
UK       3     22          87          32          42
UK       4     2           87          32          42
FR       ...   ...         ...         ...         ...
This is my data (it extends further than shown), held as a DataFrame. I want to swap the header (the indicators) with the Year column; it seems like a pivot. There are hundreds of indicators and 20 years.
Use DataFrame.melt with DataFrame.pivot:
df = (df.melt(['Economy', 'Year'], var_name='Ind')
        .pivot(index=['Economy', 'Ind'], columns='Year', values='value')
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
Economy Ind 1 2 3 4
0 UK Indicator1 23 24 22 2
1 UK Indicator2 45 87 87 87
2 UK Indicator3 56 32 32 32
3 UK Indicator4 78 42 42 42
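For reference, a self-contained sketch of the same approach using the UK rows from the sample data above (runnable on pandas >= 1.1, where pivot accepts a list for index):

```python
import pandas as pd

# Sample data in the original wide layout: one row per (Economy, Year),
# one column per indicator.
df = pd.DataFrame({'Economy': ['UK', 'UK', 'UK', 'UK'],
                   'Year': [1, 2, 3, 4],
                   'Indicator1': [23, 24, 22, 2],
                   'Indicator2': [45, 87, 87, 87],
                   'Indicator3': [56, 32, 32, 32],
                   'Indicator4': [78, 42, 42, 42]})

# Melt the indicator columns into rows, then pivot the years into columns.
out = (df.melt(['Economy', 'Year'], var_name='Ind')
         .pivot(index=['Economy', 'Ind'], columns='Year', values='value')
         .reset_index()
         .rename_axis(None, axis=1))
print(out)
```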
Another option is to set the Year column as the index and then use transpose.
Consider the code below:
import pandas as pd
df = pd.DataFrame(columns=['Economy', 'Year', 'Indicator1', 'Indicator2', 'Indicator3', 'Indicator4'],
data=[['UK', 1, 23, 45, 56, 78],['UK', 2, 24, 87, 32, 42],['UK', 3, 22, 87, 32, 42],['UK', 4, 2, 87, 32, 42],
['FR', 1, 22, 33, 11, 35]])
# Make Year column as index
df = df.set_index('Year')
# Transpose columns to rows and vice-versa
df = df.transpose()
print(df)
gives you
Year 1 2 3 4 1
Economy UK UK UK UK FR
Indicator1 23 24 22 2 22
Indicator2 45 87 87 87 33
Indicator3 56 32 32 32 11
Indicator4 78 42 42 42 35
You can use transpose like this:
df = df.set_index('Year')
df = df.transpose()
print (df)
This is my sample code. My database contains columns for every date of the year, going back multiple years. Each column corresponds to a specific date.
import pandas as pd
df = pd.DataFrame([[10, 5, 25, 67, 25, 56],
                   [20, 10, 26, 45, 56, 34],
                   [30, 3, 27, 34, 78, 34],
                   [40, 9, 28, 45, 34, 76]],
                  columns=[pd.to_datetime('2022-09-14'), pd.to_datetime('2022-08-14'),
                           pd.to_datetime('2022-07-14'), pd.to_datetime('2021-09-14'),
                           pd.to_datetime('2020-09-14'), pd.to_datetime('2019-09-14')])
Is there a way to select only those columns which fit a particular criterion based on year, month, or quarter?
For example, I was hoping to get only those columns with the same date as today (or any starting date) in every year. Today is Sep 14, 2022, so I need only the columns for Sep 14, 2021, Sep 14, 2020, and so on. Another option could be to do the same on a month or quarter basis.
How can this be done in pandas?
Yes, you can do:
# day
df.loc[:, df.columns.day == 14]
2022-09-14 2022-08-14 2022-07-14 2021-09-14 2020-09-14 2019-09-14
0 10 5 25 67 25 56
1 20 10 26 45 56 34
2 30 3 27 34 78 34
3 40 9 28 45 34 76
# month
df.loc[:, df.columns.month == 9]
2022-09-14 2021-09-14 2020-09-14 2019-09-14
0 10 67 25 56
1 20 45 56 34
2 30 34 78 34
3 40 45 34 76
# quarter
df.loc[:, df.columns.quarter == 3]
2022-09-14 2022-08-14 2022-07-14 2021-09-14 2020-09-14 2019-09-14
0 10 5 25 67 25 56
1 20 10 26 45 56 34
2 30 3 27 34 78 34
3 40 9 28 45 34 76
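The day and month checks can also be combined to select "today's date in every year", which is what the question asks for. A minimal sketch with the sample columns above (today is hard-coded for reproducibility; in real use it would be pd.Timestamp.today()):

```python
import pandas as pd

df = pd.DataFrame([[10, 5, 25, 67, 25, 56]],
                  columns=pd.to_datetime(['2022-09-14', '2022-08-14', '2022-07-14',
                                          '2021-09-14', '2020-09-14', '2019-09-14']))

# Fixed date so the example is reproducible; normally pd.Timestamp.today().
today = pd.Timestamp('2022-09-14')

# Keep only columns whose month AND day match today's, i.e. the same
# calendar date in every year.
same_date = df.loc[:, (df.columns.month == today.month) & (df.columns.day == today.day)]
print(same_date.columns.year.tolist())
```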
beginner here!
I have a dataframe similar to this:
df = pd.DataFrame({'Country_Code': ['FR','FR','FR','USA','USA','USA','BR','BR','BR'],
                   'Indicator_Name': ['GPD','Pop','birth','GPD','Pop','birth','GPD','Pop','birth'],
                   '2005': [14, 34, 56, 25, 67, 68, 55, 8, 99],
                   '2006': [23, 34, 34, 43, 34, 34, 65, 34, 45]})
Index Country_Code Indicator_Name 2005 2006
0 FR GPD 14 23
1 FR Pop 34 34
2 FR birth 56 34
3 USA GPD 25 43
4 USA Pop 67 34
5 USA birth 68 34
6 BR GPD 55 65
7 BR Pop 8 34
8 BR birth 99 45
I need to pivot or transpose it, keeping the Country_Code, the years, and the indicator names as columns, like this:
index Country_Code year GPD Pop Birth
0 FR 2005 14 34 56
1 FR 2006 23 34 34
3 USA 2005 25 67 68
4 USA 2006 43 34 34
...
I used the transpose function like this:
df.set_index(['Indicator_Name']).transpose()
The result is nice, but I have the Countries as a row like this:
Indicator_Name GPD Pop birth GPD Pop birth GPD Pop birth
Country_Code FR FR FR USA USA USA BR BR BR
2005 14 34 56 25 67 68 55 8 99
2006 23 34 34 43 34 34 65 34 45
I also tried the pivot and pivot_table functions, but the result was not satisfactory. Could you please give me some advice?
import pandas as pd
df = pd.DataFrame({'Country_Code': ['FR','FR','FR','USA','USA','USA','BR','BR','BR'],
                   'Indicator_Name': ['GPD','Pop','birth','GPD','Pop','birth','GPD','Pop','birth'],
                   '2005': [14, 34, 56, 25, 67, 68, 55, 8, 99],
                   '2006': [23, 34, 34, 43, 34, 34, 65, 34, 45]})
df
#%% Pivot longer columns `'2005'` and `'2006'` to `'Year'`
df1 = df.melt(id_vars=["Country_Code", "Indicator_Name"],
var_name="Year",
value_name="Value")
#%% Pivot wider by values in `'Indicator_Name'`
df2 = (df1.pivot_table(index=['Country_Code', 'Year'],
columns=['Indicator_Name'],
values=['Value'],
aggfunc='first'))
Output:
Value
Indicator_Name GPD Pop birth
Country_Code Year
BR 2005 55 8 99
2006 65 34 45
FR 2005 14 34 56
2006 23 34 34
USA 2005 25 67 68
2006 43 34 34
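To get from that nested result to the flat shape the question asks for, a small variation helps: passing values='Value' as a plain string (rather than a list) avoids the extra 'Value' column level, so a simple reset_index finishes the job. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'Country_Code': ['FR','FR','FR','USA','USA','USA','BR','BR','BR'],
                   'Indicator_Name': ['GPD','Pop','birth','GPD','Pop','birth','GPD','Pop','birth'],
                   '2005': [14, 34, 56, 25, 67, 68, 55, 8, 99],
                   '2006': [23, 34, 34, 43, 34, 34, 65, 34, 45]})

df1 = df.melt(id_vars=['Country_Code', 'Indicator_Name'],
              var_name='Year', value_name='Value')

# values as a string keeps the columns flat (no nested 'Value' level).
df2 = (df1.pivot_table(index=['Country_Code', 'Year'],
                       columns='Indicator_Name',
                       values='Value',
                       aggfunc='first')
          .reset_index()
          .rename_axis(columns=None))
print(df2)
```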
The simplest in my opinion: you can pivot + stack:
(df.pivot(index='Country_Code', columns='Indicator_Name')
.rename_axis(columns=['year', None]).stack(0).reset_index()
)
output:
Country_Code year GPD Pop birth
0 BR 2005 55 8 99
1 BR 2006 65 34 45
2 FR 2005 14 34 56
3 FR 2006 23 34 34
4 USA 2005 25 67 68
5 USA 2006 43 34 34
I have a DataFrame as follows:
d = {'name': ['a', 'a','a','b','b','b'],
'var': ['v1', 'v2', 'v3', 'v1', 'v2', 'v3'],
'Yr1': [11, 21, 31, 41, 51, 61],
'Yr2': [12, 22, 32, 42, 52, 62],
'Yr3': [13, 23, 33, 43, 53, 63]}
df = pd.DataFrame(d)
name var Yr1 Yr2 Yr3
a v1 11 12 13
a v2 21 22 23
a v3 31 32 33
b v1 41 42 43
b v2 51 52 53
b v3 61 62 63
and I want to rearrange it to look like this:
name Yr v1 v2 v3
a 1 11 21 31
a 2 12 22 32
a 3 13 23 33
b 1 41 51 61
b 2 42 52 62
b 3 43 53 63
I am new to pandas and tried using other threads I found here but struggled to make it work. Any help would be much appreciated.
Try this
import pandas as pd
d = {'name': ['a', 'a', 'a', 'b', 'b', 'b'],
'var': ['v1', 'v2', 'v3', 'v1', 'v2', 'v3'],
'Yr1': [11, 21, 31, 41, 51, 61],
'Yr2': [12, 22, 32, 42, 52, 62],
'Yr3': [13, 23, 33, 43, 53, 63]}
df = pd.DataFrame(d)
# Solution
df.set_index(['name', 'var'], inplace=True)
df = df.unstack().stack(0)
print(df.reset_index())
output:
var name level_1 v1 v2 v3
0 a Yr1 11 21 31
1 a Yr2 12 22 32
2 a Yr3 13 23 33
3 b Yr1 41 51 61
4 b Yr2 42 52 62
5 b Yr3 43 53 63
Reference: pandas.DataFrame.stack
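The level_1 column in that output still holds the 'Yr1'-style labels rather than the plain numbers the question asked for. A small follow-up sketch (assuming the same df) that renames the column and strips the prefix:

```python
import pandas as pd

d = {'name': ['a', 'a', 'a', 'b', 'b', 'b'],
     'var': ['v1', 'v2', 'v3', 'v1', 'v2', 'v3'],
     'Yr1': [11, 21, 31, 41, 51, 61],
     'Yr2': [12, 22, 32, 42, 52, 62],
     'Yr3': [13, 23, 33, 43, 53, 63]}
df = pd.DataFrame(d)

out = (df.set_index(['name', 'var'])
         .unstack()
         .stack(0)                      # move the Yr level back into rows
         .reset_index()
         .rename(columns={'level_1': 'Yr'})
         .rename_axis(columns=None))
# Turn 'Yr1'/'Yr2'/'Yr3' into plain integers 1/2/3.
out['Yr'] = out['Yr'].str.replace('Yr', '', regex=False).astype(int)
print(out)
```

Note that on pandas >= 2.1 the stack(0) call may emit a FutureWarning about the new stack implementation; the result here is the same.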
Try groupby apply:
df.groupby("name").apply(
lambda x: x.set_index("var").T.drop("name")
).reset_index().rename(columns={"level_1": "Yr"}).rename_axis(columns=None)
name Yr v1 v2 v3
0 a Yr1 11 21 31
1 a Yr2 12 22 32
2 a Yr3 13 23 33
3 b Yr1 41 51 61
4 b Yr2 42 52 62
5 b Yr3 43 53 63
Or better:
df.pivot("var", "name", ["Yr1", "Yr2", "Yr3"]).T.sort_index(
level=1
).reset_index().rename({"level_0": "Yr"}, axis=1).rename_axis(columns=None)
Yr name v1 v2 v3
0 Yr1 a 11 21 31
1 Yr2 a 12 22 32
2 Yr3 a 13 23 33
3 Yr1 b 41 51 61
4 Yr2 b 42 52 62
5 Yr3 b 43 53 63
We can use pd.wide_to_long + df.unstack here.
pd.wide_to_long doc:
With stubnames [‘A’, ‘B’], this function expects to find one or more groups of columns with format A-suffix1, A-suffix2,…, B-suffix1, B-suffix2,… You specify what you want to call this suffix in the resulting long format with j (for example j=’year’).
pd.wide_to_long(
df, stubnames="Yr", i=["name", "var"], j="Y"
).squeeze().unstack(level=1).reset_index()
var name Y v1 v2 v3
0 a 1 11 21 31
1 a 2 12 22 32
2 a 3 13 23 33
3 b 1 41 51 61
4 b 2 42 52 62
5 b 3 43 53 63
We can use df.melt + df.pivot here.
out = df.melt(id_vars=['name', 'var'], var_name='Yr')
out['Yr'] = out['Yr'].str.replace('Yr', '')
out.pivot(index=['name', 'Yr'], columns='var', values='value').reset_index()
var name Yr v1 v2 v3
0 a 1 11 21 31
1 a 2 12 22 32
2 a 3 13 23 33
3 b 1 41 51 61
4 b 2 42 52 62
5 b 3 43 53 63
I have the following list in Python:
movie_list = [11, 21, 31, 41, 51, 62, 55]
and the following movie dataframe:
userId movieId
1 11
1 21
1 31
2 62
2 55
Now what I want to do is generate a similar dataframe containing, for each user, the movieIds that are in movie_list but not already in the dataframe.
My desired dataframe would be
userId movieId
1 41
1 51
1 62
1 55
2 11
2 21
2 31
2 41
2 51
How can I do it in pandas?
IIUC, we can aggregate movieId into a list per user, then take the set difference between movie_list and each user's movies:
s = (df.groupby('userId').movieId.agg(list)
       .map(lambda x: list(set(movie_list) - set(x)))
       .explode().reset_index())
userId movieId
0 1 41
1 1 51
2 1 62
3 1 55
4 2 41
5 2 11
6 2 51
7 2 21
8 2 31
One approach would be to use itertools.product to create all combinations of userId & movieId, then concat and drop_duplicates:
from itertools import product
movie_list = [11, 21, 31, 41, 51, 62, 55]
df_all = pd.DataFrame(product(df['userId'].unique(), movie_list), columns=df.columns)
df2 = pd.concat([df, df_all]).drop_duplicates(keep=False)
print(df2)
[out]
userId movieId
3 1 41
4 1 51
5 1 62
6 1 55
7 2 11
8 2 21
9 2 31
10 2 41
11 2 51
prod = pd.MultiIndex.from_product([df.userId.unique().tolist(), movie_list]).tolist()
(
pd.DataFrame(set(prod).difference([tuple(e) for e in df.values]),
columns=['userId', 'movieId'])
.sort_values(by=['userId', 'movieId'])
)
userId movieId
7 1 41
6 1 51
2 1 55
8 1 62
5 2 11
4 2 21
3 2 31
1 2 41
0 2 51
I think you need:
df = df.groupby("userId")["movieId"].apply(list).reset_index()
df["movieId"] = df["movieId"].apply(lambda x: list(set(movie_list)-set(x)))
df = df.explode("movieId")
print(df)
Output:
userId movieId
0 1 41
0 1 51
0 1 62
0 1 55
1 2 41
1 2 11
1 2 51
1 2 21
1 2 31
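Another common pattern for this is a left anti-join: build every (userId, movieId) pair, then keep only the pairs missing from df via merge with indicator=True. A sketch using the sample data:

```python
from itertools import product

import pandas as pd

movie_list = [11, 21, 31, 41, 51, 62, 55]
df = pd.DataFrame({'userId': [1, 1, 1, 2, 2],
                   'movieId': [11, 21, 31, 62, 55]})

# Every possible (userId, movieId) combination.
all_pairs = pd.DataFrame(product(df['userId'].unique(), movie_list),
                         columns=['userId', 'movieId'])

# Anti-join: keep combinations that never matched a row of df.
out = (all_pairs.merge(df, on=['userId', 'movieId'], how='left', indicator=True)
                .query('_merge == "left_only"')
                .drop(columns='_merge')
                .reset_index(drop=True))
print(out)
```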
Edit: Added defT
Does using pandas.cut change the structure of a pandas.DataFrame?
I am using pandas.cut in the following manner to map single age years to age groups and then aggregating afterwards. However, the aggregation does not work as I end up with NaN in all columns that are being aggregated. Here is my code:
cutoff = numpy.hstack([numpy.array(defT.MinAge[0]), defT.MaxAge.values])
labels = defT.AgeGrp
df['ageGrp'] = pandas.cut(df.Age,
bins = cutoff,
labels = labels,
include_lowest = True)
Here is defT:
AgeGrp MaxAge MinAge
1 18 14
2 21 19
3 24 22
4 34 25
5 44 35
6 54 45
7 65 55
Then I pass the data-frame into another function to aggregate:
grouped = df.groupby(['Year', 'Month', 'OccID', 'ageGrp', 'Sex', \
'Race', 'Hisp', 'Educ'],
as_index = False)
final = grouped.aggregate(numpy.sum)
If I change the ages to age groups in the following manner, it works perfectly:
df['ageGrp'] = 1
df.loc[(df.Age >= 14) & (df.Age <= 18), 'ageGrp'] = 1  # Age 14 - 18
df.loc[(df.Age >= 19) & (df.Age <= 21), 'ageGrp'] = 2  # Age 19 - 21
df.loc[(df.Age >= 22) & (df.Age <= 24), 'ageGrp'] = 3  # Age 22 - 24
df.loc[(df.Age >= 25) & (df.Age <= 34), 'ageGrp'] = 4  # Age 25 - 34
df.loc[(df.Age >= 35) & (df.Age <= 44), 'ageGrp'] = 5  # Age 35 - 44
df.loc[(df.Age >= 45) & (df.Age <= 54), 'ageGrp'] = 6  # Age 45 - 54
df.loc[(df.Age >= 55) & (df.Age <= 64), 'ageGrp'] = 7  # Age 55 - 64
df.loc[df.Age >= 65, 'ageGrp'] = 8                     # Age 65+
I would prefer to do this on the fly, importing the definition table and using pandas.cut, instead of being hard-coded.
Thank you in advance.
Here is, perhaps, a work-around.
Consider the following example which replicates the symptom you describe:
import numpy as np
import pandas as pd
np.random.seed(2015)
defT = pd.DataFrame({'AgeGrp': [1, 2, 3, 4, 5, 6, 7],
'MaxAge': [18, 21, 24, 34, 44, 54, 65],
'MinAge': [14, 19, 22, 25, 35, 45, 55]})
cutoff = np.hstack([np.array(defT['MinAge'][0]), defT['MaxAge'].values])
labels = defT['AgeGrp']
N = 50
df = pd.DataFrame(np.random.randint(100, size=(N,2)), columns=['Age', 'Year'])
df['ageGrp'] = pd.cut(df['Age'], bins=cutoff, labels=labels, include_lowest=True)
grouped = df.groupby(['Year', 'ageGrp'], as_index=False)
final = grouped.agg(np.sum)
print(final)
# Year ageGrp Age
# Year ageGrp
# 3 1 NaN NaN NaN
# 2 NaN NaN NaN
# ...
# 97 1 NaN NaN NaN
# 2 NaN NaN NaN
# [294 rows x 3 columns]
If we change
grouped = df.groupby(['Year', 'ageGrp'], as_index=False)
final = grouped.agg(np.sum)
to
grouped = df.groupby(['Year', 'ageGrp'], as_index=True)
final = grouped.agg(np.sum).dropna()
print(final)
then we obtain:
Age
Year ageGrp
6 7 61
16 4 32
18 1 34
25 3 23
28 5 39
34 7 60
35 5 42
38 4 25
40 2 19
53 7 59
56 4 25
5 35
66 6 54
67 7 55
70 7 56
73 6 51
80 5 36
81 6 46
85 5 38
90 7 58
97 1 18
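As a footnote to the work-around above: the all-NaN rows appear because grouping on a categorical column (which pandas.cut produces) generates every category combination by default. On modern pandas you can instead ask groupby for only the observed combinations, which avoids the dropna step entirely. A sketch of the same example:

```python
import numpy as np
import pandas as pd

np.random.seed(2015)
defT = pd.DataFrame({'AgeGrp': [1, 2, 3, 4, 5, 6, 7],
                     'MaxAge': [18, 21, 24, 34, 44, 54, 65],
                     'MinAge': [14, 19, 22, 25, 35, 45, 55]})
cutoff = np.hstack([defT['MinAge'].iloc[0], defT['MaxAge'].values])

df = pd.DataFrame(np.random.randint(100, size=(50, 2)), columns=['Age', 'Year'])
df['ageGrp'] = pd.cut(df['Age'], bins=cutoff,
                      labels=defT['AgeGrp'], include_lowest=True)

# observed=True limits the result to category combinations that actually
# occur in the data, so no all-NaN rows are produced.
final = df.groupby(['Year', 'ageGrp'], observed=True).agg('sum')
print(final)
```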