pandas append duplicates as columns - python

I have a df that looks like this:
       ID  data1  data2
index
1       1      3      4
2       1      2      5
3       2      9      3
4       3      7      2
5       3      4      7
6       1     10     12
What I'm trying to do is append as columns all the rows that share the same ID, so that I'd get something like this:
       ID  data2  data3  data4  data5  data6  data7
index
1       1      3      4      2      5     10     12
3       2      9      3
4       3      7      2      4      7
The problem is that I don't know in advance how many columns I will have to append. Note that ID is NOT an index but a normal column; it is, however, the one used to find the duplicates.
I have already tried with pd.concat(), but had no luck.
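For reference, the sample frame can be rebuilt from the values shown above:
import pandas as pd

df = pd.DataFrame({'ID':    [1, 1, 2, 3, 3, 1],
                   'data1': [3, 2, 9, 7, 4, 10],
                   'data2': [4, 5, 3, 2, 7, 12]},
                  index=pd.Index([1, 2, 3, 4, 5, 6], name='index'))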

You can use cumcount to number the duplicates, then set_index + unstack for reshaping. Convert the resulting MultiIndex columns to flat names with map, and finally reset_index to turn ID back into a column.
df['g'] = df.groupby('ID').cumcount().astype(str)
df = df.set_index(['ID','g']).unstack().sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
   ID  data1_0  data2_0  data1_1  data2_1  data1_2  data2_2
0   1      3.0      4.0      2.0      5.0     10.0     12.0
1   2      9.0      3.0      NaN      NaN      NaN      NaN
2   3      7.0      2.0      4.0      7.0      NaN      NaN
Solution with pivot:
df['g'] = df.groupby('ID').cumcount().astype(str)
df = df.pivot(index='ID',columns='g').sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
   ID  data1_0  data2_0  data1_1  data2_1  data1_2  data2_2
0   1      3.0      4.0      2.0      5.0     10.0     12.0
1   2      9.0      3.0      NaN      NaN      NaN      NaN
2   3      7.0      2.0      4.0      7.0      NaN      NaN
Another solution with apply and the DataFrame constructor:
df = (df.groupby('ID')[['data1', 'data2']]
        .apply(lambda x: pd.DataFrame(x.values, columns=['a', 'b']))
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print (df)
   ID  a_0  b_0  a_1  b_1   a_2   b_2
0   1  3.0  4.0  2.0  5.0  10.0  12.0
1   2  9.0  3.0  NaN  NaN   NaN   NaN
2   3  7.0  2.0  4.0  7.0   NaN   NaN

Related

How to merge 3 dataframes' column with two criteria in python

I am trying to merge 3 columns from 3 dataframes based on 2 conditions. For example, I have the three dataframes below, called df_a, df_b and df_c (reproducible definitions are given at the end of the question).
I want to merge the column Results_b from df_b into df_a if they are the same company and in the same period. I would also like to remove the columns factor a and factor b.
I tried df_merged = pd.merge(df_a, df_b, on=['Company name', 'Period'], how='left') for merging df_a and df_b, and it works, but I am not sure how to merge only the Results_a and Results_b columns instead of all columns.
Lastly, I would also like to merge the column Results_c from df_c if they are the same company and in the same period. However, df_c data are quarterly (every 3 months) while df_a and df_b are monthly, so for the months not present in df_c I would like the data to be carried over from the previous available data. I am not sure how to deal with that.
This is the outcome that I would like to see:
It would be really appreciated if someone could help me! Thanks a lot.
For reproducing the dataframes:
df_a = pd.DataFrame({
    'Company name': ['A','B','C','A','B','C','A','B','C','A','B','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-03-31','2019-04-30','2019-04-30','2019-04-30'],
    'factor a': [37,41,64,52,97,10,55,47,52,61,59,70],
    'Results_a': [1,4,2,3,4,1,2,3,3,1,2,4]
})
# b
df_b = pd.DataFrame({
    'Company name': ['A','B','C','A','B','A','D','B','C'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-04-30','2019-04-30'],
    'factor b': [55,34,28,17,95,98,61,14,87],
    'Results_b': [2,3,1,4,2,1,4,1,4]
})
# c
df_c = pd.DataFrame({
    'Company name': ['A','B','C','A','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-04-30','2019-04-30'],
    'factor c': [27,63,18,23,89],
    'Results_c': [2,1,3,4,1],
})
You can also use merge and then update the values of column "Results_c" in a loop, if required:
# Merge the data sets
from functools import reduce

data_frames = [df_a, df_b, df_c]
df_result = reduce(lambda left, right: pd.merge(left, right,
                                                on=['Company name', 'Period'],
                                                how='left'),
                   data_frames)
df_result = df_result[[col for col in df_result.columns if "factor" not in col]]
# Update Results_c values if they are NaN for the whole month
lst_period = sorted(list(df_result["Period"].unique()))
for i in range(0, len(lst_period)):
    df_temp = df_result[df_result["Period"] == lst_period[i]]
    if df_temp["Results_c"].isna().sum() == 3:  # edit this number to match your company count; 3 here because of A, B, C
        lst_val = df_result[df_result["Period"] == lst_period[i-1]]["Results_c"]
        df_result.loc[df_result["Period"] == lst_period[i], "Results_c"] = list(lst_val)
Hope this Helps...
   Company name      Period  Results_a  Results_b  Results_c
0             A  2019-01-31          1        2.0        2.0
1             B  2019-01-31          4        3.0        1.0
2             C  2019-01-31          2        1.0        3.0
3             A  2019-02-28          3        4.0        2.0
4             B  2019-02-28          4        2.0        1.0
5             C  2019-02-28          1        NaN        3.0
6             A  2019-03-31          2        1.0        2.0
7             B  2019-03-31          3        NaN        1.0
8             C  2019-03-31          3        NaN        3.0
9             A  2019-04-30          1        NaN        4.0
10            B  2019-04-30          2        1.0        NaN
11            D  2019-04-30          4        NaN        1.0
Use concat:
dfs = [df_a, df_b, df_c]
out = pd.concat([df.set_index(['Company name', 'Period']).filter(like='Results_')
                 for df in dfs],
                axis=1).reset_index()
NB: to mimic the left merge, you can add .dropna(subset='Results_a').
output:
   Company name      Period  Results_a  Results_b  Results_c
0             A  2019-01-31        1.0        2.0        2.0
1             B  2019-01-31        4.0        3.0        1.0
2             C  2019-01-31        2.0        1.0        3.0
3             A  2019-02-28        3.0        4.0        NaN
4             B  2019-02-28        4.0        2.0        NaN
5             C  2019-02-28        1.0        NaN        NaN
6             A  2019-03-31        2.0        1.0        NaN
7             B  2019-03-31        3.0        NaN        NaN
8             C  2019-03-31        3.0        NaN        NaN
9             A  2019-04-30        1.0        NaN        4.0
10            B  2019-04-30        2.0        1.0        NaN
11            D  2019-04-30        4.0        NaN        1.0
12            D  2019-03-31        NaN        4.0        NaN
13            C  2019-04-30        NaN        4.0        NaN
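The question also asked for the quarterly Results_c values to be carried forward to the months missing from df_c. A minimal sketch of that step, assuming the merged frame is called out and that the Period strings sort chronologically (they do in the YYYY-MM-DD form used here):
out = out.sort_values(['Company name', 'Period'])
out['Results_c'] = out.groupby('Company name')['Results_c'].ffill()
This fills each company's missing Results_c from its own previous available value rather than from another company's row.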

In python, how to reshape a dataframe so that some datetime columns become rows

In a pandas dataframe, I want to transpose and group datetime columns into rows.
Like this (there are about 12 date columns):
  Category  Type  11/2021  12/2021
0        A     1      0.0       20
1        A     2      NaN       13
2        B     1      5.0        7
3        B     2     20.0        4
to one like this:
      Date Category  Type1  Type2
0  2021-11        A      0    NaN
1  2021-11        B      5   20.0
2  2021-12        A     20   13.0
3  2021-12        B      7    4.0
I thought about using pivot tables, but I wasn't able to do so.
You could do:
(df.melt(['Category', 'Type'], var_name='Date')
   .pivot(index=['Date', 'Category'], columns='Type')
   .reset_index())
      Date Category  value
Type                     1      2
0  11/2021        A    0.0    NaN
1  11/2021        B    5.0   20.0
2  12/2021        A   20.0   13.0
3  12/2021        B    7.0    4.0
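The result above still has MultiIndex columns (value over the Type level). To flatten them into the Type1/Type2 names from the question, a small sketch, assuming the pivoted frame was assigned to out:
out.columns = [f'Type{t}' if t != '' else name for name, t in out.columns]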
To be a little cleaner you could use pyjanitor:
import janitor
(df.pivot_longer(['Category', 'Type'], names_to='Date', values_to='type')
   .pivot_wider(['Date', 'Category'], names_from='Type', names_sep=''))
      Date Category  type1  type2
0  11/2021        A    0.0    NaN
1  11/2021        B    5.0   20.0
2  12/2021        A   20.0   13.0
3  12/2021        B    7.0    4.0
Another solution:
x = (
    df.set_index(["Category", "Type"])
      .stack()
      .unstack("Type")
      .add_prefix("Type")
      .reset_index()
)
x = x.rename(columns={"level_1": "Date"})
x.columns.name = None
print(x)
Prints:
  Category     Date  Type1  Type2
0        A  11/2021    0.0    NaN
1        A  12/2021   20.0   13.0
2        B  11/2021    5.0   20.0
3        B  12/2021    7.0    4.0
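If the Date strings should end up in the 2021-11 form shown in the desired output, one more hedged step (assuming every column header parses with the fixed %m/%Y format):
x['Date'] = pd.to_datetime(x['Date'], format='%m/%Y').dt.strftime('%Y-%m')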

split pandas column with tuple

I have a dictionary of the form:
data = {'A': [(1,2),(3,4),(5,6),(7,8),(8,9)],
        'B': [(3,4),(4,5),(5,6),(6,7)],
        'C': [(10,11),(12,13)]}
I create a dataFrame by:
df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in data.items()]))
which in turn becomes:
        A       B         C
0  (1, 2)  (3, 4)  (10, 11)
1  (3, 4)  (4, 5)  (12, 13)
2  (5, 6)  (5, 6)       NaN
3  (7, 8)  (6, 7)       NaN
4  (8, 9)     NaN       NaN
Is there a way to go from the dataframe above to the one below?
    A        B        C
  one two  one two  one two
0   1   2    3   4   10  11
1   3   4    4   5   12  13
2   5   6    5   6  NaN NaN
3   7   8    6   7  NaN NaN
4   8   9  NaN NaN  NaN NaN
You can use a list comprehension with the DataFrame constructor, converting each column to a list of tuples via values + tolist, and then concat. Dropping the NaN padding first keeps the constructor from receiving scalar NaN rows; concat re-aligns the shorter frames on the row index and brings the NaN back:
cols = ['A','B','C']
L = [pd.DataFrame(df[x].dropna().values.tolist(), columns=['one','two']) for x in cols]
df = pd.concat(L, axis=1, keys=cols)
print (df)
    A        B          C
  one two  one  two   one   two
0   1   2  3.0  4.0  10.0  11.0
1   3   4  4.0  5.0  12.0  13.0
2   5   6  5.0  6.0   NaN   NaN
3   7   8  6.0  7.0   NaN   NaN
4   8   9  NaN  NaN   NaN   NaN
EDIT:
A similar solution with a dict comprehension; integer values are converted to floats because the type of NaN is float too.
data = {'A': [(1,2),(3,4),(5,6),(7,8),(8,9)],
        'B': [(3,4),(4,5),(5,6),(6,7)],
        'C': [(10,11),(12,13)]}
cols = ['A','B','C']
d = {k: pd.DataFrame(v, columns=['one','two']) for k,v in data.items()}
df = pd.concat(d, axis=1)
print (df)
    A        B          C
  one two  one  two   one   two
0   1   2  3.0  4.0  10.0  11.0
1   3   4  4.0  5.0  12.0  13.0
2   5   6  5.0  6.0   NaN   NaN
3   7   8  6.0  7.0   NaN   NaN
4   8   9  NaN  NaN   NaN   NaN
EDIT:
To multiply multiple columns by one column, it is possible to use slicers:
s = df[('A', 'one')]
print (s)
0    1
1    3
2    5
3    7
4    8
Name: (A, one), dtype: int64
df.loc(axis=1)[:, 'one'] = df.loc(axis=1)[:, 'one'].mul(s, axis=0)
print (df)
      A          B           C
    one two   one  two   one   two
0   1.0   2   3.0  4.0  10.0  11.0
1   9.0   4  12.0  5.0  36.0  13.0
2  25.0   6  25.0  6.0   NaN   NaN
3  49.0   8  42.0  7.0   NaN   NaN
4  64.0   9   NaN  NaN   NaN   NaN
Another solution:
idx = pd.IndexSlice
df.loc[:, idx[:, 'one']] = df.loc[:, idx[:, 'one']].mul(s, axis=0)
print (df)
      A          B           C
    one two   one  two   one   two
0   1.0   2   3.0  4.0  10.0  11.0
1   9.0   4  12.0  5.0  36.0  13.0
2  25.0   6  25.0  6.0   NaN   NaN
3  49.0   8  42.0  7.0   NaN   NaN
4  64.0   9   NaN  NaN   NaN   NaN
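Relatedly, to pull every 'one' column out as a plain frame, DataFrame.xs also works on the MultiIndex columns (a small sketch):
ones = df.xs('one', axis=1, level=1)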

Set nans across multiple pandas dataframes

I have a number of similar dataframes where I would like to standardize the NaNs across all of them. For instance, if a NaN exists at df1.loc[0,'a'], then ALL other dataframes should be set to NaN at the same index location.
I am aware that I could group the dataframes to create one big multiindexed dataframe but sometimes I find it easier to work with a group of dataframes of the same structure.
Here is an example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
print(df1)
print(' ')
print(df2)
print(' ')
print(df3)
Output:
     a   b   c
0  0.0   1   2
1  3.0   4   5
2  6.0   7   8
3  NaN  10  11

   a     b   c
0  0   1.0   2
1  3   NaN   5
2  6   7.0   8
3  9  10.0  11

   a   b     c
0  0   1   NaN
1  3   4   5.0
2  6   7   8.0
3  9  10  11.0
However, I would like df1, df2 and df3 to have NaNs in the same locations:
print(df1)
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
Using the answer provided by piRSquared, I was able to extend it for dataframes of different sizes. Here is the function:
def set_nans_over_every_df(df_list):
    # Find the union of index and column values across all frames
    complete_index = sorted(set([idx for df in df_list for idx in df.index]))
    complete_columns = sorted(set([col for df in df_list for col in df.columns]))
    # Ensure that every df has the same indexes and columns
    df_list = [df.reindex(index=complete_index, columns=complete_columns) for df in df_list]
    # Find the NaNs in each df and set NaNs in every other df at the same location
    mask = np.isnan(np.stack([df.values for df in df_list])).any(0)
    df_list = [df.mask(mask) for df in df_list]
    return df_list
And an example using different sized dataframes:
df1 = pd.DataFrame(np.reshape(np.arange(15), (5,3)), index=[0,1,2,3,4], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), index=[0,1,2,3], columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(16), (4,4)), index=[0,1,2,3], columns=['a', 'b', 'c', 'd'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
df1, df2, df3 = set_nans_over_every_df([df1, df2, df3])
print(df1)
     a     b     c    d
0  0.0   1.0   NaN  NaN
1  3.0   NaN   5.0  NaN
2  6.0   7.0   8.0  NaN
3  NaN  10.0  11.0  NaN
4  NaN   NaN   NaN  NaN
I'd set up a mask in numpy, then use it with the pd.DataFrame.mask method:
mask = np.isnan(np.stack([d.values for d in [df1, df2, df3]])).any(0)
print(df1.mask(mask))
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
print(df2.mask(mask))
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
print(df3.mask(mask))
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
You can create a mask and then apply it to all dataframes:
mask = df1.notnull() & df2.notnull() & df3.notnull()
print (mask)
       a      b      c
0   True   True  False
1   True  False   True
2   True   True   True
3  False   True   True
You can also build the mask dynamically with reduce:
import functools
masks = [df1.notnull(),df2.notnull(),df3.notnull()]
mask = functools.reduce(lambda x,y: x & y, masks)
print (mask)
       a      b      c
0   True   True  False
1   True  False   True
2   True   True   True
3  False   True   True
print (df1[mask])
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
print (df2[mask])
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
print (df3[mask])
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
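Note that indexing with a boolean frame like df1[mask] is equivalent to df1.where(mask); the where form reads a bit more explicitly when assigning back:
df1 = df1.where(mask)  # NaN wherever mask is False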
assuming that all your DF are of the same shape and have the same indexes:
In [196]: df2[df1.isnull()] = df3[df1.isnull()] = np.nan
In [197]: df1[df3.isnull()] = df2[df3.isnull()] = np.nan
In [198]: df1[df2.isnull()] = df3[df2.isnull()] = np.nan
In [199]: df1
Out[199]:
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

In [200]: df2
Out[200]:
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

In [201]: df3
Out[201]:
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
One simple method is to add the DataFrames together, multiply the result by 0 (NaN survives the multiplication), and then add this DataFrame to each of the others individually.
df_zero = (df1 + df2 + df3) * 0
df1 + df_zero
df2 + df_zero
df3 + df_zero
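The same trick generalizes to any number of frames; a short sketch under the same equal-shape assumption:
dfs = [df1, df2, df3]
df_zero = sum(dfs) * 0            # NaN propagates through the sum
dfs = [d + df_zero for d in dfs]  # every frame now shares the union of NaNs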

Delete rows in dataframe based on column values

I need to rid myself of all rows with a null value in column C. Here is the code:
infile = "C:\****"
df = pd.read_csv(infile)
   A    B    C    D
   1    1  NaN    3
   2    3    7  NaN
   4    5  NaN    8
   5  NaN    4    9
 NaN    1    2  NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])
df = df[df['C'].notnull()]
df
Here is proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
     A    B    C    D
0  1.0  1.0  NaN  3.0
1  2.0  3.0  7.0  NaN
2  4.0  5.0  NaN  8.0
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN

In [101]: df.dropna(subset=['C'])
Out[101]:
     A    B    C    D
1  2.0  3.0  7.0  NaN
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN

In [102]: df[df.C.notnull()]
Out[102]:
     A    B    C    D
1  2.0  3.0  7.0  NaN
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN

In [103]: df = df[df.C.notnull()]

In [104]: df
Out[104]:
     A    B    C    D
1  2.0  3.0  7.0  NaN
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN
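If the frame still comes back empty after these methods, it is worth checking what read_csv actually parsed; with the wrong delimiter or quoting, every value in C can end up NaN. A quick diagnostic (no assumptions about the file beyond the code in the question):
df = pd.read_csv(infile)
print(df.head())             # eyeball the parsed columns
print(df['C'].isna().sum())  # how many values pandas really treats as NaN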
