How to transpose a dataframe column when duplicate entries exist in Python?

I am having difficulty transposing a certain column in Python.
I have the following df:
ID Value Date
1 15 2019/01/01
1 13 2019/02/01
1 17 2019/03/01
2 16 2019/01/01
2 14 2019/02/01
2 15 2019/03/01
I want to create a df such that the duplicates in the ID column are removed and the Values are transposed:
ID Value_01 Value_02 Value_03
1 15 13 17
2 16 14 15

Use cumcount with groupby to make your columns, then crosstab:
df1 = df.assign(key=df.groupby('ID').cumcount() + 1)
df2 = (pd.crosstab(df1["ID"], df1["key"], df1["Value"], aggfunc='first')
         .add_prefix("Value_")
         .reset_index()
         .rename_axis(None, axis=1))
print(df2)
ID Value_1 Value_2 Value_3
0 1 15 13 17
1 2 16 14 15
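The same result can also be reached with DataFrame.pivot instead of crosstab. A self-contained sketch, rebuilding the example frame above:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'Value': [15, 13, 17, 16, 14, 15],
                   'Date': ['2019/01/01', '2019/02/01', '2019/03/01',
                            '2019/01/01', '2019/02/01', '2019/03/01']})

# number the rows within each ID, then spread those numbers into columns
df1 = df.assign(key=df.groupby('ID').cumcount() + 1)
df2 = (df1.pivot(index='ID', columns='key', values='Value')
          .add_prefix('Value_')
          .reset_index()
          .rename_axis(None, axis=1))
print (df2)
   ID  Value_1  Value_2  Value_3
0   1       15       13       17
1   2       16       14       15
Unlike crosstab with aggfunc='first', pivot raises on duplicated ID/key pairs, so silent data loss is less likely here.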

Select specific rows from a large data frame

I have a data frame with 790 rows. I want to create a new data frame that excludes rows 300 to 400 and keeps the rest.
I tried:
df.loc[[:300, 400:]]
df.iloc[[:300, 400:]]
df_new = df.drop(labels=range([300:400]), axis=0)
None of these work. How can I achieve this goal?
Thanks in advance
Use range, or numpy.r_ to join index ranges:
df_new = df.drop(range(300, 400))
df_new = df.iloc[np.r_[0:300, 400:len(df)]]
Sample:
df = pd.DataFrame({'a':range(20)})
# print (df)
df1 = df.drop(labels=range(7,15))
print (df1)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
15 15
16 16
17 17
18 18
19 19
df1 = df.iloc[np.r_[0:7, 15:len(df)]]
print (df1)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
15 15
16 16
17 17
18 18
19 19
First select the index you want to drop and then create a new df:
i = df.iloc[299:400].index
new_df = df.drop(i)
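A boolean mask over the index is another option; a minimal sketch, assuming the default RangeIndex:
# keep every row whose index label falls outside 300..399
df_new = df[~df.index.isin(range(300, 400))]
This behaves like drop, but never raises if some of the labels are absent.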

Update non-available values of one pandas column based on another

I have a 2-column dataframe with column names ['user_id', 'cookie_id'], and I would like to update user_id values if they are NaN and there is an available user_id value for the same cookie_id.
Example:
(before)
user_id cookie_id
2 15
2 15
3 22
NaN 15
NaN 15
NaN 38
(after)
user_id cookie_id
2 15
2 15
3 22
2 15
2 15
NaN 38
If you need to replace only the missing values with the first non-missing value per cookie_id, use GroupBy.transform with 'first' and Series.fillna:
df['user_id'] = df['user_id'].fillna(df.groupby("cookie_id")['user_id'].transform('first'))
print (df)
user_id cookie_id
0 2.0 15
1 2.0 15
2 3.0 22
3 2.0 15
4 2.0 15
5 NaN 38
Or if you need the first non-missing value per group in every row, use:
df['user_id'] = df.groupby("cookie_id")['user_id'].transform('first')
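A self-contained sketch of the fillna approach, rebuilding the example frame above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': [2, 2, 3, np.nan, np.nan, np.nan],
                   'cookie_id': [15, 15, 22, 15, 15, 38]})

# first non-missing user_id per cookie_id, broadcast back to every row
first_ids = df.groupby('cookie_id')['user_id'].transform('first')
df['user_id'] = df['user_id'].fillna(first_ids)
print (df)
   user_id  cookie_id
0      2.0         15
1      2.0         15
2      3.0         22
3      2.0         15
4      2.0         15
5      NaN         38
cookie_id 38 stays NaN because its group has no non-missing user_id to borrow.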

How to write a function that transforms the columns of my dataframe into a single column?

I have a dataframe like this:
A = ID Material1 Materia2 Material3
14 0 0 0
24 1 0 0
12 1 1 0
25 0 0 2
I want to have all information in one column like this:
A = ID Materials
14 NaN
24 Material1
12 Material1
12 Material2
25 Material3
25 Material3
Can anyone help write a function, please?
Use DataFrame.melt, then repeat rows by their counts with Index.repeat and DataFrame.loc:
df1 = df.melt('ID', var_name='Materials')
df1 = df1.loc[df1.index.repeat(df1['value'])].drop('value', axis=1).reset_index(drop=True)
print (df1)
ID Materials
0 24 Material1
1 12 Material1
2 12 Materia2
3 25 Material3
4 25 Material3
EDIT: To also include IDs whose materials are all 0 (with NaN in Materials), use DataFrame.merge with a left join against a one-column DataFrame of the original df['ID'] values, deduplicated with DataFrame.drop_duplicates:
df1 = df.melt('ID', var_name='Materials')
df0 = df[['ID']].drop_duplicates()
print (df0)
ID
0 14
1 24
2 12
3 25
df2 = df1.loc[df1.index.repeat(df1['value'])].drop('value', axis=1).reset_index(drop=True)
df2 = df0.merge(df2, on='ID', how='left')
print (df2)
ID Materials
0 14 NaN
1 24 Material1
2 12 Material1
3 12 Materia2
4 25 Material3
5 25 Material3
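Putting both steps together as a runnable sketch, rebuilding the example frame above (the Materia2 column name is kept exactly as it appears in the question):
import pandas as pd

df = pd.DataFrame({'ID': [14, 24, 12, 25],
                   'Material1': [0, 1, 1, 0],
                   'Materia2': [0, 0, 1, 0],
                   'Material3': [0, 0, 0, 2]})

# long format: one row per (ID, material) pair, count in 'value'
df1 = df.melt('ID', var_name='Materials')
# repeat each row by its count: 0 drops out, 2 appears twice
df2 = (df1.loc[df1.index.repeat(df1['value'])]
          .drop('value', axis=1)
          .reset_index(drop=True))
# the left join brings back IDs whose counts were all zero as NaN rows
df2 = df[['ID']].drop_duplicates().merge(df2, on='ID', how='left')
print (df2)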

Pandas: Pack column into rows

I've been reading through pd.stack, pd.unstack and pd.pivot, but I can't wrap my head around getting what I want done.
Given a dataframe as follows:
id1 id2 id3 vals vals1
0 1 a -1 10 20
1 1 a -2 11 21
2 1 a -3 12 22
3 1 a -4 13 23
4 1 b -1 14 24
5 1 b -2 15 25
6 1 b -3 16 26
7 1 b -4 17 27
I'd like to get the following result
id1 id2 -1_vals -2_vals ... -1_vals1 -2_vals1 -3_vals1 -4_vals1
0 1 a 10 11 ... 20 21 22 23
1 1 b 14 15 ... 24 25 26 27
It's kind of a groupby with a pivot: the column id3 is being spread across the columns, and each new column name is the concatenation of the id3 value and the original column name.
EDIT: It is guaranteed that id3 is unique per id1 + id2, but some id1 + id2 groups will have different id3 values; in that case it is OK to put NaNs there.
Use DataFrame.set_index with DataFrame.unstack and DataFrame.sort_index to get a MultiIndex in the columns, then flatten it with a list comprehension and f-strings:
df1 = (df.set_index(['id1','id2','id3'])
         .unstack()
         .sort_index(level=[0,1], ascending=[True, False], axis=1))
#python 3.6+
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
#python below 3.6
#df1.columns = ['{}_{}'.format(b, a) for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
id1 id2 -1_vals -2_vals -3_vals -4_vals -1_vals1 -2_vals1 -3_vals1 \
0 1 a 10 11 12 13 20 21 22
1 1 b 14 15 16 17 24 25 26
-4_vals1
0 23
1 27
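An alternative sketch using DataFrame.pivot_table with aggfunc='first'; note the id3 column level may come out sorted ascending (-4 before -1), unlike the sort_index version above:
import pandas as pd

df = pd.DataFrame({'id1': [1] * 8,
                   'id2': list('aaaabbbb'),
                   'id3': [-1, -2, -3, -4] * 2,
                   'vals': range(10, 18),
                   'vals1': range(20, 28)})

# aggfunc='first' keeps the raw values, since id3 is unique per id1 + id2
df1 = df.pivot_table(index=['id1', 'id2'], columns='id3',
                     values=['vals', 'vals1'], aggfunc='first')
# columns are a (value, id3) MultiIndex; flatten to e.g. '-1_vals'
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)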

Generate a crosstab type dataframe with a binary count value in pandas

I have a pandas dataframe like this
UIID ISBN
a 12
b 13
I want to compare each UIID with the ISBN and add a count column to the dataframe.
UIID ISBN Count
a 12 1
a 13 0
b 12 0
b 13 1
How can this be done in pandas? I know the crosstab function does something similar, but I want the data in this format.
Use crosstab with melt:
df = (pd.crosstab(df['UIID'], df['ISBN'])
        .reset_index()
        .melt('UIID', value_name='count'))
print (df)
UIID ISBN count
0 a 12 1
1 b 12 0
2 a 13 0
3 b 13 1
An alternative solution with GroupBy.size and reindexing by MultiIndex.from_product:
s = df.groupby(['UIID','ISBN']).size()
mux = pd.MultiIndex.from_product(s.index.levels, names=s.index.names)
df = s.reindex(mux, fill_value=0).reset_index(name='count')
print (df)
UIID ISBN count
0 a 12 1
1 a 13 0
2 b 12 0
3 b 13 1
You can also use pd.DataFrame.unstack:
df = pd.crosstab(df.UIID, df.ISBN).unstack().reset_index()
print(df)
ISBN UIID 0
0 12 a 1
1 12 b 0
2 13 a 0
3 13 b 1
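The unstacked Series carries no name, which is why the value column prints as 0; a minimal tidy-up, assuming the same input df, is to name it during reset_index:
out = (pd.crosstab(df.UIID, df.ISBN)
         .unstack()
         .reset_index(name='count'))
print (out)
   ISBN UIID  count
0    12    a      1
1    12    b      0
2    13    a      0
3    13    b      1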
