Transpose dataset and append different columns - python

I have a dataset like this:
date    a  diff_a    b  diff_b    c  diff_c
2020    0     NaN   10     NaN    5     NaN
2021    1       1   20      10    7       2
2022    3       2   30      10   13       6
2023    4       1   40      10   20       7
And I want to transpose this dataset and merge different columns below, like this:
date  Cat  value  diff
2020    a      0   NaN
2021    a      1     1
2022    a      3     2
2023    a      4     1
2020    b     10   ...
2021    b     20
2022    b     30
2023    b     40
2020    c      5
2021    c      7
2022    c     13
2023    c     20
The diff column is not that important, since if I can stack the other columns below each other I can just filter and then concat the dataframes. But how do I turn these columns into rows?
Kind Regards

My approach with DataFrame.melt
# mask for the diff_* columns
m = df.columns.str.contains('diff')
new_df = (df.melt(df.columns[~m], df.columns[m], var_name='Cat', value_name='diff')
            .assign(Cat=lambda x: x['Cat'].str.split('_').str[-1],
                    # for each row, grab the value from the column named in Cat
                    # (note: DataFrame.lookup was removed in pandas 2.0)
                    a=lambda x: x.lookup(x.index, x.Cat)))
# column a now holds the looked-up values; drop the remaining value columns (b and c)
new_df = new_df.drop(columns=list(filter(lambda x: x != 'a', new_df.Cat.unique())))
print(new_df)
    date   a Cat diff
0   2020   0   a  NaN
1   2021   1   a    1
2   2022   3   a    2
3   2023   4   a    1
4   2020  10   b  NaN
5   2021  20   b   10
6   2022  30   b   10
7   2023  40   b   10
8   2020   5   c  NaN
9   2021   7   c    2
10  2022  13   c    6
11  2023  20   c    7
EDIT
If we do not want the diff column we can drop it; also, if the looked-up column does not have to be named a, we can do:
m = df.columns.str.contains('diff')
new_df = (df.melt(df.columns[~m], df.columns[m], var_name='Cat', value_name='diff')
            # .drop(columns='diff')  # uncomment if you want to drop diff
            .assign(Cat=lambda x: x['Cat'].str.split('_').str[-1],
                    other=lambda x: x.lookup(x.index, x.Cat)))
# drop the original a/b/c columns, whose values now live in "other"
new_df = new_df.drop(columns=new_df['Cat'].unique())
print(new_df)
    date Cat  diff  other
0   2020   a   NaN      0
1   2021   a     1      1
2   2022   a     2      3
3   2023   a     1      4
4   2020   b   NaN     10
5   2021   b    10     20
6   2022   b    10     30
7   2023   b    10     40
8   2020   c   NaN      5
9   2021   c     2      7
10  2022   c     6     13
11  2023   c     7     20
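For completeness, a minimal sketch of an alternative with pd.wide_to_long (the sample data below is reconstructed from the question, so treat the exact values as an assumption): if the plain a/b/c columns are first renamed to value_a/value_b/value_c, every column follows a stub_suffix pattern and both value and diff can be reshaped in one call.
import numpy as np
import pandas as pd

# sample data reconstructed from the question
df = pd.DataFrame({
    'date': [2020, 2021, 2022, 2023],
    'a': [0, 1, 3, 4], 'diff_a': [np.nan, 1, 2, 1],
    'b': [10, 20, 30, 40], 'diff_b': [np.nan, 10, 10, 10],
    'c': [5, 7, 13, 20], 'diff_c': [np.nan, 2, 6, 7],
})

# give the plain columns the same stub_suffix pattern as the diff columns
df = df.rename(columns={'a': 'value_a', 'b': 'value_b', 'c': 'value_c'})

out = (pd.wide_to_long(df, stubnames=['value', 'diff'],
                       i='date', j='Cat', sep='_', suffix=r'\w+')
         .reset_index()
         .sort_values(['Cat', 'date'])
         .reset_index(drop=True))
print(out)
This pairs each value_* column with its matching diff_* column by suffix, so no lookup step is needed.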

I like @ansev's answer. Definitely elegant and oozing experience.
My attempt is below. Please note I drop the diff columns first, now that you don't need them, and then stack:
df2 = (df.drop(columns=df.columns[df.columns.str.contains('diff')])   # drop the diff_* columns first
         .set_index('date').stack().reset_index(level=0, drop=False)
         .rename_axis('Cat', axis=0).reset_index().sort_values(by='date'))
df2 = df2.rename(columns={0: 'value'})


Filter individuals that don't have data for the whole period

I am using Python 3.9 on PyCharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows so that I only keep the ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id and then filter based on set equality between the years you want and the years available for each particular id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Alternatively, use pandas.DataFrame.groupby and pandas.DataFrame.transform to count the distinct years available for each id, and build a boolean mask that keeps only the ids whose count matches the total number of distinct years in the data:
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
# keep ids whose number of distinct years equals the number of distinct years overall
mask = df.groupby('id')['year'].transform('nunique').eq(df['year'].nunique())
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
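A minimal sketch of yet another way (assuming the id/year column names above): build a crosstab of ids against years and keep the ids that have at least one entry for every year.
ct = pd.crosstab(df['id'], df['year'])
keep = ct.index[(ct > 0).all(axis=1)]   # ids present in every year
out = df[df['id'].isin(keep)]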

Python - Select all rows where there is data for each hour in a day

I have a data frame that consists of a series of dates:
Year Month Day Hour
2020 12 3 22
2021 1 1 0
2021 1 1 1
2021 1 1 2
...
2021 1 1 23
2021 1 2 1
2021 1 2 3
...
I would like to return all rows for dates that have information for all 24 hours in the day. In the above example, I would only want to return the rows:
2021 1 1 0
2021 1 1 1
2021 1 1 2
...
2021 1 1 23
My data set is very long. I would appreciate any assistance. Thank you.
import pandas as pd
import random as rd
# generate dummy data
sz = 40000
df = pd.DataFrame()
df['Y'] = [rd.randint(2020, 2021) for _ in range(sz)]
df['M'] = [rd.randint(1, 12) for _ in range(sz)]
df['D'] = [rd.randint(1, 31) for _ in range(sz)]
df['H'] = [rd.randint(0, 23) for _ in range(sz)]
# reference hour sequence (0..23)
h24 = list(range(24))
# group by day and check whether the group covers all 24 hours;
# days that don't are mapped to None and dropped, the rest are replaced
# by the reference sequence and exploded back into one row per hour
df = (df.groupby(by=['Y', 'M', 'D'])
        .apply(lambda x: h24 if x['H'].nunique() == 24 else None)
        .dropna()
        .explode()
        .reset_index()
        .rename(columns={0: 'H'}))
print(df)
Prints:
Y M D H
0 2020 1 3 0
1 2020 1 3 1
2 2020 1 3 2
3 2020 1 3 3
4 2020 1 3 4
... ... .. .. ..
1363 2021 12 11 19
1364 2021 12 11 20
1365 2021 12 11 21
1366 2021 12 11 22
1367 2021 12 11 23
[1368 rows x 4 columns]
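If you'd rather keep the original rows instead of regenerating the hour sequence, here is a minimal sketch using transform (assuming the Year/Month/Day/Hour column names from the question):
# keep only the days whose Hour column covers all 24 distinct hours
mask = df.groupby(['Year', 'Month', 'Day'])['Hour'].transform('nunique').eq(24)
out = df[mask]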

Pandas groupby and add new rows with random data

I have a pandas dataframe like so:
id date variable value
1 2019 x 100
1 2019 y 50.5
1 2020 x 10.0
1 2020 y NA
Now, I want to groupby id and date, and for each group add 3 more variables a, b, c with random values such that a+b+c=1.0 and a>b>c.
So my final dataframe will be something like this:
id date variable value
1 2019 x 100
1 2019 y 50.5
1 2019 a 0.49
1 2019 b 0.315
1 2019 c 0.195
1 2020 x 10.0
1 2020 y NA
1 2020 a 0.55
1 2020 b 0.40
1 2020 c 0.05
Update
It's possible without a loop and without appending dataframes.
# one row per (date, id), with the existing variables x/y as columns
d = df.groupby(['date', 'id', 'variable'])['value'].mean().unstack('variable').reset_index()
# random triples that sum to 1, sorted descending so that a > b > c
x = np.random.random((len(d), 3))
x /= x.sum(1)[:, None]
x[:, ::-1].sort()
d[['a', 'b', 'c']] = pd.DataFrame(x)
pd.melt(d, id_vars=['date', 'id']).sort_values(['date', 'id']).reset_index(drop=True)
Output
date id variable value
0 2019 1 x 100.000000
1 2019 1 y 50.500000
2 2019 1 a 0.367699
3 2019 1 b 0.320325
4 2019 1 c 0.311976
5 2020 1 x 10.000000
6 2020 1 y NaN
7 2020 1 a 0.556441
8 2020 1 b 0.336748
9 2020 1 c 0.106812
Solution with loop
Not elegant, but works.
gr = df.groupby(['id', 'date'])
l = []
for i, g in gr:
    # random triple that sums to 1, sorted descending so that a > b > c
    d = np.random.random(3)
    d /= d.sum()
    d[::-1].sort()
    ndf = pd.DataFrame({
        'variable': list('abc'),
        'value': d
    })
    ndf['id'] = g['id'].iloc[0]
    ndf['date'] = g['date'].iloc[0]
    l.append(pd.concat([g, ndf], sort=False).reset_index(drop=True))
pd.concat(l).reset_index(drop=True)
Output
id date variable value
0 1 2019 x 100.000000
1 1 2019 y 50.500000
2 1 2019 a 0.378764
3 1 2019 b 0.366415
4 1 2019 c 0.254821
5 1 2020 x 10.000000
6 1 2020 y NaN
7 1 2020 a 0.427007
8 1 2020 b 0.317555
9 1 2020 c 0.255439
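As a side note, a minimal sketch of another way to draw the random triples (assuming NumPy is available and n stands for the number of (id, date) groups): np.random.dirichlet produces vectors that sum to 1 by construction, and a descending sort gives a > b > c.
import numpy as np

n = 2                                          # assumed number of groups
x = np.random.dirichlet(np.ones(3), size=n)    # each row already sums to 1
x = np.sort(x, axis=1)[:, ::-1]                # sort each row descending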

Grabbing data from previous year in a Pandas DataFrame

I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However, my B_previous_year column ends up full of NaN:
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
In case you want to keep the integer format:
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then verify that the difference from one row to the next is, indeed, one year:
df = df.sort_values(by='year')
df['B_previous_year'] = df['B'].shift().where(df['year'].diff() == 1)
   year   B  B_previous_year
2  2017  17              NaN
1  2018   5             17.0
0  2019  10              5.0
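For reference, a minimal sketch of a merge-based lookup (column names as above), which also behaves sensibly when some years are missing from the frame:
# build a lookup of last year's B keyed on the following year, then left-join it back
prev = df[['year', 'B']].rename(columns={'B': 'B_previous_year'})
prev['year'] = prev['year'] + 1
out = df.merge(prev, on='year', how='left')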

Fill a column in a dataframe if a condition is met

I have the following dataframe:
PersonID  AmountPaid  PaymentReceivedDate  StartDate  withinNYears
       1         100                 2017       2016
       2          20                 2014       2014
       1          30                 2017       2016
       1          40                 2016       2016
       4         300                 2015       2000
       5         150                 2005       2002
What I'm looking for: the AmountPaid value should appear in the withinNYears column if the payment was made within N years of the start date; otherwise you get NaN.
N can be any number, but let's say 2 for this example (as I will be playing with this to see findings).
So the above dataframe would come out like this if the amount was paid within 2 years:
PersonID  AmountPaid  PaymentReceivedDate  StartDate  withinNYears
       1         100                 2017       2016           100
       2          20                 2014       2014            20
       1          30                 2017       2016            30
       1          40                 2016       2016            40
       4         300                 2015       2000           NaN
       5         150                 2005       2002           NaN
Does anyone know how to achieve this? Cheers.
Subtract the columns and compare with a scalar to get a boolean mask, then set the values with numpy.where, Series.where or DataFrame.loc:
import numpy as np

m = (df['PaymentReceivedDate'] - df['StartDate']) < 2
df['withinNYears'] = np.where(m, df['AmountPaid'], np.nan)
# alternatives
# df['withinNYears'] = df['AmountPaid'].where(m)
# df.loc[m, 'withinNYears'] = df['AmountPaid']
print(df)
   PersonID  AmountPaid  PaymentReceivedDate  StartDate  withinNYears
0         1         100                 2017       2016         100.0
1         2          20                 2014       2014          20.0
2         1          30                 2017       2016          30.0
3         1          40                 2016       2016          40.0
4         4         300                 2015       2000           NaN
5         5         150                 2005       2002           NaN
EDIT:
If the StartDate column holds datetimes:
m = (df['PaymentReceivedDate'] - df['StartDate'].dt.year) < 2
Or just assign with DataFrame.loc:
df.loc[(df['PaymentReceivedDate'] - df['StartDate']) < 2, 'withinNYears'] = df.AmountPaid
df
Out[37]:
PersonID AmountPaid ... StartDate withinNYears
0 1 100 ... 2016 100.0
1 2 20 ... 2014 20.0
2 1 30 ... 2016 30.0
3 1 40 ... 2016 40.0
4 4 300 ... 2000 NaN
5 5 150 ... 2002 NaN
[6 rows x 5 columns]
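Since N may vary, here is a minimal sketch with the window as a parameter (applied to the same dataframe as above):
N = 2  # number of years; change freely
within = (df['PaymentReceivedDate'] - df['StartDate']) < N
df['withinNYears'] = df['AmountPaid'].where(within)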
