My DataFrame is:
import pandas as pd
df1 = {'ID': [1, 1, 3], 'aa': [52, 52, 55], 'ab': [8285, 2490, 1000], 'type': ['A', 'B', 'C']}
df1 = pd.DataFrame(data=df1)
df1
ID aa ab type
0 1 52 8285 A
1 1 52 2490 B
2 3 55 1000 C
I want to merge overlapping intervals on the column "type" for each ID.
Desired DataFrame:
ID aa ab type
0 1 52 8285 A,B
1 1 52 2490 B,A
2 3 55 1000 C
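One possible reading of the desired output, as a rough sketch: every row lists its own type first, followed by the other types that share its ID. This assumes that all rows with the same ID count as overlapping (the aa/ab bounds are not compared), and the names types_by_id and merged are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 1, 3], 'aa': [52, 52, 55],
                    'ab': [8285, 2490, 1000], 'type': ['A', 'B', 'C']})

# Collect the types seen for each ID, e.g. 1 -> ['A', 'B'], 3 -> ['C'].
# Assumption: sharing an ID is treated as "overlapping".
types_by_id = df1.groupby('ID')['type'].agg(list)

# For every row, put its own type first, then the remaining types of its ID.
merged = df1.copy()
merged['type'] = [
    ','.join([own] + [other for other in types_by_id.loc[row_id] if other != own])
    for row_id, own in zip(df1['ID'], df1['type'])
]
print(merged)
```

If overlap really has to be decided from the aa and ab bounds, the grouping step would need an interval comparison instead of a plain groupby on ID.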
I have 3 tables of following form:
import pandas as pd
df1 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
'Value1': [2012, 2014, 2013, 2014],
'Value2': [55, 40, 84, 31]})
df1 = df1.set_index("ISIN")
df2 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
'Symbol': ['a', 'b', 'c', 'd']})
df2 = df2.set_index("ISIN")
df3 = pd.DataFrame({'Symbol': ['a', 'b', 'c', 'd'],
'01.01.2020': [1, 2, 3, 4],
'01.01.2021': [3,2,3,2]})
df3 = df3.set_index("Symbol")
My aim now is to merge all 3 tables together. I would proceed as follows:
Step 1 (merge df1 and df2):
result1 = pd.merge(df1, df2, on=["ISIN"])
print(result1)
The result is ok and gives me the table:
Value1 Value2 Symbol
ISIN
1 2012 55 a
4 2014 40 b
7 2013 84 c
10 2014 31 d
In the next step I want to merge it with df3, so I added an intermediate step that merges df2 and df3:
print(result1)
result2 = pd.merge(df2, df3, on=["Symbol"])
print(result2)
My problem is that the output now is:
Symbol 01.01.2020 01.01.2021
0 a 1 3
1 b 2 2
2 c 3 3
3 d 4 2
The ISIN column is lost here, and the step
result = pd.merge(result, result2, on=["ISIN"])
result.set_index("ISIN")
produces an error.
Is there an elegant way to merge these 3 tables together (with key column ISIN), and why is the key column lost in the second merge?
Just chain the merge operations:
result = df1.merge(df2.reset_index(), on='ISIN').merge(df3, on='Symbol')
Or using your syntax, use result1 as source for the second merge:
result1 = pd.merge(df1, df2.reset_index(), on=["ISIN"])
result2 = pd.merge(result1, df3, on=["Symbol"])
output:
ISIN Value1 Value2 Symbol 01.01.2020 01.01.2021
0 1 2012 55 a 1 3
1 4 2014 40 b 2 2
2 7 2013 84 c 3 3
3 10 2014 31 d 4 2
You should not set the index prior to joining if you wish to keep it as part of the data in your dataframe. I suggest first merging, then setting the index to your desired value. In a single line:
output = df1.merge(df2,on='ISIN').merge(df3,on='Symbol')
Outputs:
ISIN Value1 Value2 Symbol 01.01.2020 01.01.2021
0 1 2012 55 a 1 3
1 4 2014 40 b 2 2
2 7 2013 84 c 3 3
3 10 2014 31 d 4 2
You can now set the index to ISIN by adding .set_index('ISIN') to output:
Value1 Value2 Symbol 01.01.2020 01.01.2021
ISIN
1 2012 55 a 1 3
4 2014 40 b 2 2
7 2013 84 c 3 3
10 2014 31 d 4 2
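For frames that already have their index set, as in the question, a minimal sketch of the same idea is to move every index back into a column, chain the merges, and restore ISIN as the index only at the end. The names flat1, flat2 and flat3 are invented for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
                    'Value1': [2012, 2014, 2013, 2014],
                    'Value2': [55, 40, 84, 31]}).set_index('ISIN')
df2 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
                    'Symbol': ['a', 'b', 'c', 'd']}).set_index('ISIN')
df3 = pd.DataFrame({'Symbol': ['a', 'b', 'c', 'd'],
                    '01.01.2020': [1, 2, 3, 4],
                    '01.01.2021': [3, 2, 3, 2]}).set_index('Symbol')

# Turn every index back into an ordinary column so no key can be dropped.
flat1, flat2, flat3 = (frame.reset_index() for frame in (df1, df2, df3))

result = (
    flat1.merge(flat2, on='ISIN')
         .merge(flat3, on='Symbol')
         .set_index('ISIN')   # restore the key as index only at the very end
)
print(result)
```

Keeping the keys as plain columns during the merges avoids having to reason about which index survives each step.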
I have my input in a Pandas dataframe in the following format.
I would like to convert it into the below format
What I have managed to do so far:
I managed to extract the col values A and B from the column names and have cross joined them with the Name column to obtain the following dataframe. I am not sure if my approach is correct.
I am not sure how I should go about it. Any help would be appreciated. Thanks.
I agree with the earlier comment about posting data/code, but in this case it's simple enough to type in an example:
df = pd.DataFrame( { 'Name' : ['AA','BB','CC'],
'col1_A' : [5,2,5],
'col2_A' : [10,3,6],
'col1_B' : [15,4,7],
'col2_B' : [20,6,21],
})
print(df)
Name col1_A col2_A col1_B col2_B
0 AA 5 10 15 20
1 BB 2 3 4 6
2 CC 5 6 7 21
You can create a pd.MultiIndex to replace the column names to match the structure of the table:
df = df.set_index('Name')
df.columns = pd.MultiIndex.from_product([['A','B'],['val_1','val_2']], names=('col', None))
print(df)
col A B
val_1 val_2 val_1 val_2
Name
AA 5 10 15 20
BB 2 3 4 6
CC 5 6 7 21
Then stack() the 'col' column index, and reset both indices to be columns:
df = df.stack('col').reset_index()
print(df)
Name col val_1 val_2
0 AA A 5 10
1 AA B 15 20
2 BB A 2 3
3 BB B 4 6
4 CC A 5 6
5 CC B 7 21
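As a possible alternative to the MultiIndex/stack route above, pd.wide_to_long can split column names of the form stub_suffix directly. This is only a sketch under the assumption that every value column follows the colN_X pattern; the name long_df is made up here.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['AA', 'BB', 'CC'],
                   'col1_A': [5, 2, 5], 'col2_A': [10, 3, 6],
                   'col1_B': [15, 4, 7], 'col2_B': [20, 6, 21]})

long_df = (
    # 'col1' and 'col2' are the stubs; the part after '_' becomes the 'col' level.
    pd.wide_to_long(df, stubnames=['col1', 'col2'], i='Name', j='col',
                    sep='_', suffix=r'[A-Z]+')
      .rename(columns={'col1': 'val_1', 'col2': 'val_2'})
      .sort_index()      # order rows by Name, then by col
      .reset_index()
)[['Name', 'col', 'val_1', 'val_2']]
print(long_df)
```

The explicit suffix pattern is needed because the default suffix only matches digits.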
Example code:
import pandas as pd
import re
# Dummy dataframe
d = {'Name': ['AA', 'BB'], 'col1_A': [5, 4], 'col1_B': [10, 9], 'col2_A': [15, 14], 'col2_B': [20, 19]}
df = pd.DataFrame(d)
# Get all the number index inside 'col' columns name
col_idx = [re.findall(r'\d+', name)[0] for name in list(df.columns[df.columns.str.contains('col')])]
# Get all the alphabet suffix at end of 'col' columns name
col_sfx = [name.split('_')[-1] for name in list(df.columns[df.columns.str.contains('col')])]
# Get unique value in list
col_idx = list(dict.fromkeys(col_idx))
col_sfx = list(dict.fromkeys(col_sfx))
# Create new df with repeated 'Name' and 'col'
new_d = {'Name': [name for name in df['Name'] for i in range(len(col_sfx))], 'col': col_sfx * len(df.index)}
new_df = pd.DataFrame(new_d)
all_sub_df = []
all_sub_df.append(new_df)
print("Name and col:\n{}\n".format(new_df))
# Create new df for each val columns
for i_c in col_idx:
    df_coli = df.filter(like='col' + i_c, axis=1)
    df_coli = df_coli.stack().reset_index()
    df_coli = df_coli[df_coli.columns[-1:]]
    df_coli.columns = ['val_' + i_c]
    print("df_col{}:\n{}\n".format(i_c, df_coli))
    all_sub_df.append(df_coli)
# Concatenate all columns for result
new_df = pd.concat(all_sub_df, axis=1)
new_df
Outputs:
Name and col:
Name col
0 AA A
1 AA B
2 BB A
3 BB B
df_col1:
val_1
0 5
1 10
2 4
3 9
df_col2:
val_2
0 15
1 20
2 14
3 19
Name col val_1 val_2
0 AA A 5 15
1 AA B 10 20
2 BB A 4 14
3 BB B 9 19
I have 2 sample datasets, dfa and dfb:
import pandas as pd
a = {
'unit': ['A', 'B', 'C', 'D'],
'count': [ 1, 12, 34, 52]
}
b = {
'department': ['E', 'F'],
'count': [ 6, 12]
}
dfa = pd.DataFrame(a)
dfb = pd.DataFrame(b)
They look like:
dfa
count unit
1 A
12 B
34 C
52 D
dfb
count department
6 E
12 F
What I want is simply to stack dfa on top of dfb, not based on any column or index. I have checked this page: https://pandas.pydata.org/pandas-docs/stable/merging.html but couldn't find the right method for my purpose.
My desired output is a dfc that looks like the dataset below; I want to keep the headers:
dfc:
count unit
1 A
12 B
34 C
52 D
count department
6 E
12 F
In [37]: pd.concat([dfa, pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)],
ignore_index=True)
Out[37]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
or, in pandas versions before 2.0 (DataFrame.append was removed in 2.0):
In [39]: dfa.append(pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)) \
.reset_index(drop=True)
Out[39]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
UPDATE: merging 3 DFs:
pd.concat([dfa,
pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns),
pd.DataFrame(dfc.T.reset_index().T.values, columns=dfa.columns)],
ignore_index=True)
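The same pattern can be wrapped in a small helper for any number of frames. This is a sketch only: stack_with_headers is a hypothetical name, and it assumes every frame has the same number of columns as the first one.

```python
import pandas as pd

def stack_with_headers(first, *others):
    """Stack frames vertically, keeping each later frame's header as a data row."""
    pieces = [first]
    for frame in others:
        # The header of the later frame becomes an ordinary row ...
        header_row = pd.DataFrame([list(frame.columns)], columns=first.columns)
        # ... and its values are relabelled with the first frame's columns.
        body = pd.DataFrame(frame.values, columns=first.columns)
        pieces.extend([header_row, body])
    return pd.concat(pieces, ignore_index=True)

dfa = pd.DataFrame({'count': [1, 12, 34, 52], 'unit': ['A', 'B', 'C', 'D']})
dfb = pd.DataFrame({'count': [6, 12], 'department': ['E', 'F']})
dfc = stack_with_headers(dfa, dfb)
print(dfc)
```

Because header strings end up in the data, the resulting columns have object dtype.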
Option 1
You can construct it from scratch using np.vstack
import numpy as np
pd.DataFrame(
np.vstack([dfa.values, dfb.columns, dfb.values]),
columns=dfa.columns
)
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
Option 2
You can export to csv and read it back
from io import StringIO
import pandas as pd
pd.read_csv(StringIO(
'\n'.join([d.to_csv(index=None) for d in [dfa, dfb]])
))
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
dfa.loc[len(dfa),:] = dfb.columns
dfb.columns = dfa.columns
pd.concat([dfa, dfb])  # replaces dfa.append(dfb); DataFrame.append was removed in pandas 2.0
I have two columns as below:
id, colA, colB
0, a, 13
1, a, 52
2, b, 16
3, a, 34
4, b, 946
etc...
I am trying to create a third column, colC, that is colB if colA == a, otherwise 0.
This is what I was thinking, but it does not work:
data[data['colA']=='a']['colC'] = data[data['colA']=='a']['colB']
I was also thinking about using np.where(), but I don't think that would work here.
Any thoughts?
Use loc with a mask to assign:
In [300]:
df.loc[df['colA'] == 'a', 'colC'] = df['colB']
df['colC'] = df['colC'].fillna(0)
df
Out[300]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0
EDIT
or use np.where:
In [296]:
df['colC'] = np.where(df['colA'] == 'a', df['colB'], 0)
df
Out[296]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0
df['colC'] = df[df['colA'] == 'a']['colB']
should result in exactly what you want, afaik.
Then replace the NaNs with zeros using df.fillna(0, inplace=True).
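Another way to express "colB where colA is 'a', otherwise 0" in one step is Series.where, which keeps values where the condition holds and substitutes the second argument elsewhere. A minimal sketch on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'colA': ['a', 'a', 'b', 'a', 'b'],
                   'colB': [13, 52, 16, 34, 946]})

# Keep colB where colA == 'a'; use 0 everywhere else.
df['colC'] = df['colB'].where(df['colA'] == 'a', 0)
print(df)
```

This avoids the intermediate NaN values, so no fillna step is needed.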
I'd like to convert a Pandas DataFrame that is derived from a pivot table into a row representation as shown below.
This is where I'm at:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'goods': ['a', 'a', 'b', 'b', 'b'],
'stock': [5, 10, 30, 40, 10],
'category': ['c1', 'c2', 'c1', 'c2', 'c1'],
'date': pd.to_datetime(['2014-01-01', '2014-02-01', '2014-01-06', '2014-02-09', '2014-03-09'])
})
# we don't care about year in this example
df['month'] = df['date'].map(lambda x: x.month)
piv = df.pivot_table(["stock"], "month", ["goods", "category"], aggfunc="sum")
piv = piv.reindex(np.arange(piv.index[0], piv.index[-1] + 1))
piv = piv.ffill(axis=0)
piv = piv.fillna(0)
print(piv)
which results in
stock
goods a b
category c1 c2 c1 c2
month
1 5 0 30 0
2 5 10 30 40
3 5 10 10 40
And this is where I want to get to.
goods category month stock
a c1 1 5
a c1 2 0
a c1 3 0
a c2 1 0
a c2 2 10
a c2 3 0
b c1 1 30
b c1 2 0
b c1 3 10
b c2 1 0
b c2 2 40
b c2 3 0
Previously, I used
piv = piv.stack()
piv = piv.reset_index()
print(piv)
to get rid of the multi-indexes, but now that I pivot on two columns (["goods", "category"]) this gives:
month category stock
goods a b
0 1 c1 5 30
1 1 c2 0 0
2 2 c1 5 30
3 2 c2 10 40
4 3 c1 5 10
5 3 c2 10 40
Does anyone know how I can get rid of the multi-index in the column and get the result into a DataFrame of the exemplified format?
>>> piv.unstack().reset_index().drop('level_0', axis=1)
goods category month 0
0 a c1 1 5
1 a c1 2 5
2 a c1 3 5
3 a c2 1 0
4 a c2 2 10
5 a c2 3 10
6 b c1 1 30
7 b c1 2 30
8 b c1 3 10
9 b c2 1 0
10 b c2 2 40
11 b c2 3 40
Then all you need is to rename the last column from 0 to stock.
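Putting the unstack and the rename together could look like the sketch below. It rebuilds piv from the question's data, with the month given directly instead of derived from dates and the reindex step left out because months 1 to 3 are all present; long_df is a name chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'goods': ['a', 'a', 'b', 'b', 'b'],
    'stock': [5, 10, 30, 40, 10],
    'category': ['c1', 'c2', 'c1', 'c2', 'c1'],
    'month': [1, 2, 1, 2, 3],  # months of the dates used in the question
})
piv = df.pivot_table(values=["stock"], index="month",
                     columns=["goods", "category"], aggfunc="sum")
piv = piv.ffill(axis=0).fillna(0)

long_df = (
    piv.unstack()                      # Series indexed by (stock, goods, category, month)
       .reset_index()                  # the unnamed top level becomes 'level_0'
       .drop('level_0', axis=1)
       .rename(columns={0: 'stock'})   # the value column is named 0 by default
)
print(long_df)
```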
It seems to me that melt (aka unpivot) is very close to what you want to do:
In [11]: pd.melt(piv)
Out[11]:
NaN goods category value
0 stock a c1 5
1 stock a c1 5
2 stock a c1 5
3 stock a c2 0
4 stock a c2 10
5 stock a c2 10
6 stock b c1 30
7 stock b c1 30
8 stock b c1 10
9 stock b c2 0
10 stock b c2 40
11 stock b c2 40
There's a rogue column (stock) that appears here because that column header level is constant in piv. If we drop it first, melt works out of the box:
In [12]: piv.columns = piv.columns.droplevel(0)
In [13]: pd.melt(piv)
Out[13]:
goods category value
0 a c1 5
1 a c1 5
2 a c1 5
3 a c2 0
4 a c2 10
5 a c2 10
6 b c1 30
7 b c1 30
8 b c1 10
9 b c2 0
10 b c2 40
11 b c2 40
Edit: The above actually drops the index; you need to make it a column with reset_index:
In [21]: pd.melt(piv.reset_index(), id_vars=['month'], value_name='stock')
Out[21]:
month goods category stock
0 1 a c1 5
1 2 a c1 5
2 3 a c1 5
3 1 a c2 0
4 2 a c2 10
5 3 a c2 10
6 1 b c1 30
7 2 b c1 30
8 3 b c1 10
9 1 b c2 0
10 2 b c2 40
11 3 b c2 40
I know that the question has already been answered, but for my dataset's multiindex column problem, the provided solution was inefficient. So here I am posting another solution for unpivoting multiindex columns using pandas.
Here is the problem I had:
As one can see, the dataframe has a 3-level MultiIndex on the rows and a two-level MultiIndex on the columns.
The desired dataframe format was:
When I tried the options given above, pd.melt did not allow more than one column in the var_name argument. Therefore, every time I tried a melt, I ended up losing some attribute from my table.
The solution I found was to apply a double stacking function over my dataframe.
Before the code, it is worth noting that the desired name for the value column of my unpivoted table was "Populacao residente em domicilios particulares ocupados" (see the code below). Therefore, all my value entries should be stacked into this newly created column.
Here is a code snippet:
import pandas as pd
# reading my table
# sep, encoding and squeeze are omitted: current pandas read_excel does not accept them
df = pd.read_excel(r'my_table.xls', header=[2,3],
index_col=[0,1,2], na_values=['-', ' ', '*']).fillna(0)
df.index.names = ['COD_MUNIC_7', 'NOME_MUN', 'TIPO']
df.columns.names = ['sexo', 'faixa_etaria']
df.head()
# making the stacking:
df = pd.DataFrame(pd.Series(df.stack(level=0).stack(), name='Populacao residente em domicilios particulares ocupados')).reset_index()
df.head()
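Since the Excel file is not available, here is a small self-contained sketch of the same double stacking on invented data. The index values, age bands and numbers are made up purely for illustration, and the value column is shortened to 'populacao'.

```python
import pandas as pd

# Hypothetical stand-in for the Excel table: 3 row levels, 2 column levels.
columns = pd.MultiIndex.from_product(
    [['Homens', 'Mulheres'], ['0-4', '5-9']], names=['sexo', 'faixa_etaria'])
index = pd.MultiIndex.from_tuples(
    [(1, 'X', 'Urbana'), (2, 'Y', 'Rural')],
    names=['COD_MUNIC_7', 'NOME_MUN', 'TIPO'])
wide = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=index, columns=columns)

# First stack moves 'sexo' into the row index, the second moves 'faixa_etaria';
# the result is a Series, which is named and then flattened into columns.
long_df = wide.stack(level=0).stack().rename('populacao').reset_index()
print(long_df)
```

Each of the two column levels ends up as its own column, which is what a single melt call could not provide in my case.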
Another solution I found was to first apply a stacking function over the dataframe and then apply the melt.
Here is an alternative code:
df = df.stack('faixa_etaria').reset_index().melt(id_vars=['COD_MUNIC_7', 'NOME_MUN','TIPO', 'faixa_etaria'],
value_vars=['Homens', 'Mulheres'],
value_name='Populacao residente em domicilios particulares ocupados',
var_name='sexo')
df.head()
Sincerely yours,
Philipe Riskalla Leal