I would like to reverse a dataframe with dummy variables. For example,
from df_input:
Course_01 Course_02 Course_03
0 0 1
1 0 0
0 1 0
to df_output:
Course
0 03
1 01
2 02
I have been looking at the solution provided at Reconstruct a categorical variable from dummies in pandas, but it did not work. Any help would be much appreciated.
Many Thanks,
Best Regards,
Carlo
We can use wide_to_long, then select the rows that are not equal to zero. The input frame isn't shown in this thread; presumably it looks something like this (an assumption, reconstructed from the output below):
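import pandas as pd

# assumed input, reconstructed from the wide_to_long output shown below
df = pd.DataFrame({'id': ['id1', 'id2'],
                   'T_30': [0, 1],
                   'T_40': [1, 0]})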
ndf = pd.wide_to_long(df, stubnames='T_', i='id', j='T')
T_
id T
id1 30 0
id2 30 1
id1 40 1
id2 40 0
not_dummy = ndf[ndf['T_'].ne(0)].reset_index().drop('T_', axis=1)
id T
0 id2 30
1 id1 40
Update based on your edit:
ndf = pd.wide_to_long(df.reset_index(), stubnames='T_', i='index', j='T')
not_dummy = ndf[ndf['T_'].ne(0)].reset_index(level='T').drop('T_', axis=1)
T
index
1 30
0 40
You can use:
# set 'id' as the index if necessary
df = df.set_index('id')
# split the column names into a MultiIndex
df.columns = df.columns.str.split('_', expand=True)
# reshape by stack and remove 0 rows
df = (df.stack()
        .reset_index()
        .query('T != 0')
        .drop('T', axis=1)
        .rename(columns={'level_1': 'T'}))
print (df)
id T
1 id1 40
2 id2 30
EDIT:
import numpy as np

col_name = 'Course'
df.columns = df.columns.str.split('_', expand=True)
df = (df.replace(0, np.nan)
        .stack()
        .reset_index()
        .drop([col_name, 'level_0'], axis=1)
        .rename(columns={'level_1': col_name})
      )
print (df)
Course
0 03
1 01
2 02
Suppose you have the following dummy DF:
In [152]: d
Out[152]:
id T_30 T_40 T_50
0 id1 0 1 1
1 id2 1 0 1
we can prepare the following helper Series:
In [153]: v = pd.Series(d.columns.drop('id').str.replace(r'\D', '', regex=True).astype(int), index=d.columns.drop('id'))
In [155]: v
Out[155]:
T_30 30
T_40 40
T_50 50
dtype: int64
now we can multiply them, stack and filter:
In [154]: d.set_index('id').mul(v).stack().reset_index(name='T').drop('level_1', axis=1).query("T > 0")
Out[154]:
id T
1 id1 40
2 id1 50
3 id2 30
5 id2 50
I think melt() was pretty much made for this?
Your data, I think:
df_input = pd.DataFrame.from_dict({'Course_01':[0,1,0],
'Course_02':[0,0,1],
'Course_03':[1,0,0]})
Change names to match your desired output:
df_input.columns = df_input.columns.str.replace('Course_','')
Melt the dataframe:
dataMelted = pd.melt(df_input,
var_name='Course',
ignore_index=False)
Clean up zeros, etc:
df_output = (dataMelted[dataMelted['value'] != 0]
.drop('value', axis=1)
.sort_index())
>>> df_output
Course
0 03
1 01
2 02
# Create a new column for the categorical
# (assumes a default RangeIndex, as in the question's frame)
df['categ'] = None
for i in range(len(df)):
    if df.loc[i, 'Course_01'] == 1:
        df.loc[i, 'categ'] = '01'
    if df.loc[i, 'Course_02'] == 1:
        df.loc[i, 'categ'] = '02'
    if df.loc[i, 'Course_03'] == 1:
        df.loc[i, 'categ'] = '03'
df['categ'] = df['categ'].astype('category')
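For completeness, a common one-liner for this kind of reversal (a sketch, assuming exactly one dummy is set per row) uses idxmax along the columns:
import pandas as pd

df_input = pd.DataFrame({'Course_01': [0, 1, 0],
                         'Course_02': [0, 0, 1],
                         'Course_03': [1, 0, 0]})

# idxmax(axis=1) returns, per row, the name of the column holding the maximum,
# i.e. the column whose dummy is 1; strip the prefix to keep only the code
df_output = (df_input.idxmax(axis=1)
                     .str.replace('Course_', '', regex=False)
                     .to_frame('Course'))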
My goal is to iterate through a list of possible B values, such that each ID (col A) will have new rows added with C = 0 where the possible B value did not previously exist in the DF.
I have a dataframe with:
A B C
0 id1 2 10
1 id1 3 20
2 id2 1 30
possible_B_values = [1, 2, 3]
Resulting in:
A B C
0 id1 1 0
1 id1 2 10
2 id1 3 20
3 id2 1 30
4 id2 2 0
5 id2 3 0
Thanks in advance!
Using some index trickery:
import pandas as pd
df = pd.read_clipboard() # Your df here
possible_B_values = [1, 2, 3]
extrapolate_columns = ["A", "B"]
index = pd.MultiIndex.from_product(
    [df["A"].unique(), possible_B_values],
    names=extrapolate_columns
)
out = df.set_index(extrapolate_columns).reindex(index, fill_value=0).reset_index()
out:
A B C
0 id1 1 0
1 id1 2 10
2 id1 3 20
3 id2 1 30
4 id2 2 0
5 id2 3 0
Maybe you can create a DataFrame from a list of tuples with the possible B values and merge it with the original one:
import pandas as pd

# Create a list of tuples with all possible (A, B) combinations
possible_b_values = [1, 2, 3]
possible_b_rows = [(a, b) for a in df['A'].unique() for b in possible_b_values]
# Create a new DataFrame from the list of tuples
possible_b_df = pd.DataFrame(possible_b_rows, columns=['A', 'B'])
# Merge the new DataFrame with the original one, using the 'A' and 'B' columns as the keys
df = df.merge(possible_b_df, on=['A', 'B'], how='outer')
# Fill any null values in the 'C' column with 0
df['C'] = df['C'].fillna(0)
print(df)
Here is a one-liner pure-pandas way of solving this:
Set the index to B (this will help with the reindexing later).
Group by column A, then apply the following function to column C to reindex B.
The lambda x.reindex(range(1, 4), fill_value=0) takes each group x (the C values for one id, indexed by B), reindexes it over range(1, 4) = 1, 2, 3, and fills the missing values with 0.
Finally, reset_index brings A and B back into the dataframe.
out = (df.set_index('B')                                         # set index as B
         .groupby(['A'])['C']                                    # group by A, use apply on column C
         .apply(lambda x: x.reindex(range(1, 4), fill_value=0))  # reindex B to range(1, 4) per group, fill with 0
         .reset_index())                                         # reset index
print(out)
A B C
0 id1 1 0
1 id1 2 10
2 id1 3 20
3 id2 1 30
4 id2 2 0
5 id2 3 0
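If the candidate values are not a contiguous range, I believe the same chain works with the explicit list from the question (a small variation, not in the original answer):
out = (df.set_index('B')
         .groupby(['A'])['C']
         .apply(lambda x: x.reindex(possible_B_values, fill_value=0))
         .reset_index())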
I come from a SQL background and I use the following data processing step frequently:
Partition the table of data by one or more fields
For each partition, add a row number to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending
EX:
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
'data1' : [1,2,2,3,3],
'data2' : [1,10,2,3,30]})
df
data1 data2 key1
0 1 1 a
1 2 10 a
2 2 2 a
3 3 3 b
4 3 30 a
I'm looking for the pandas equivalent of this SQL window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
I've tried the following which I've gotten to work where there are no 'partitions':
def row_number(frame, orderby_columns, orderby_direction, name):
    frame.sort_values(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = range(len(frame.index))
I tried to extend this idea to work with partitions (groups in pandas) but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_values(by=['data1', 'data2'], ascending=[True, False], inplace=True)).reset_index()
def nf(x):
    x['rn'] = range(len(x.index))
df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I did this.
Ideally, there'd be a succinct way to replicate the window-function capability of SQL (I've figured out the window-based aggregates; those are one-liners in pandas, as sketched below). Can someone share with me the most idiomatic way to number rows like this in pandas?
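For reference, a window-based aggregate such as SQL's SUM(data2) OVER (PARTITION BY key1) maps to a groupby/transform one-liner (a minimal sketch, not part of the original question):
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# SQL: SUM(data2) OVER (PARTITION BY key1)
df['key1_total'] = df.groupby('key1')['data2'].transform('sum')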
you can also use sort_values(), groupby() and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
.groupby(['key1']) \
.cumcount() + 1
print(df)
yields:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
PS tested with pandas 0.18
Use the groupby.rank function.
Here is a working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1
a 2 2
a 3 3
b 4 1
b 5 2
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at the pandas rank method for more information; transform returns a result aligned with the original index, which is why it can be assigned straight back to a new column.
pandas.lib.fast_zip() can create a tuple array from a list of arrays. You can use this function to create a tuple Series and then rank it (note that pd.lib was removed in later pandas versions; a plain-zip adaptation follows this answer):
values = {'key1' : ['a','a','a','b','a','b'],
'data1' : [1,2,2,3,3,3],
'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))
def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag * df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)
rank = df.groupby("key1").apply(lambda df: rank_multi_columns(df, ["data1", "-data2"]))
print(rank)
the result:
a 1
b 2
c 3
d 2
e 4
f 1
dtype: float64
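Since pd.lib.fast_zip no longer exists in modern pandas, here is a sketch of the same idea using plain zip (my adaptation under that assumption, not the original author's code):
import pandas as pd

def rank_multi_columns_modern(df, cols, **kw):
    data = []
    for col in cols:
        # a leading "-" means rank that column in descending order
        flag = -1 if col.startswith("-") else 1
        data.append(flag * df[col.lstrip("-")])
    # build a Series of tuples as the compound sort key
    s = pd.Series(list(zip(*data)), index=df.index)
    return s.rank(**kw)

# drop-in replacement for rank_multi_columns in the groupby call above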
I have two dataframes like this:
df1 = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
'ID2':['0','10','80','0','0','0']})
df2 = pd.DataFrame({'ID1':['A','D','E','F'],
'ID2':['50','30','90','50'],
'aa':['1','2','3','4']})
I want to insert the ID2 values from df2 into ID2 in df1 and, at the same time, bring aa into df1, matching on ID1, to obtain a new dataframe like this:
df_result = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
'ID2':['50','10','80','30','90','50'],
'aa':['1','NaN','NaN','2','3','4']})
I've tried to use merge, but it didn't work.
You can use combine_first on the DataFrame after setting the index to ID1:
(df2.set_index('ID1') # values of df2 have priority in case of overlap
.combine_first(df1.set_index('ID1')) # add missing values from df1
.reset_index() # reset ID1 as column
)
output:
ID1 ID2 aa
0 A 50 1
1 B 10 NaN
2 C 80 NaN
3 D 30 2
4 E 90 3
5 F 50 4
Try this:
import numpy as np

new_df = (df1.assign(ID2=df1['ID2'].replace('0', np.nan))
             .merge(df2, on='ID1', how='left')
             .pipe(lambda g: g.assign(ID2=g.filter(like='ID2').bfill(axis=1).iloc[:, 0])
                              .drop(['ID2_x', 'ID2_y'], axis=1)))
Output:
>>> new_df
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
Use df.merge with Series.combine_first:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [571]: x['ID2'] = x.ID2_y.combine_first(x.ID2_x)
In [574]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
In [575]: x
Out[575]:
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
OR use df.filter with df.ffill:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [597]: x['ID2'] = x.filter(like='ID2').ffill(axis=1)['ID2_y']
In [599]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
I need to merge a df with a csv.
df1 contains only 1 column (the ids of the products I want to update)
df2 contains 2 columns (the ids of all the products, and their quantities)
df1=pd.read_csv(id_file, header=0, index_col=False)
df2 = pd.DataFrame(data=result_q)
df3=pd.merge(df1, df2)
What I want: a dataframe that contains only the ids from the csv/df1, merged with the quantities from df2 for the same ids.
If you want only the products that you have in the first dataframe, you can use this:
df_1
Out[11]:
id
0 1
1 2
2 4
3 5
df_2
Out[12]:
id prod
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
6 7 g
7 8 h
df_3 = df_1.merge(df_2,on='id')
df_3
Out[14]:
id prod
0 1 a
1 2 b
2 4 d
3 5 e
You need to use the parameter on='id' so that merge generates a new df with only the corresponding rows that share the same id (merge performs an inner join by default).
You can use new_df = pd.merge(df1, df2, on=['Product_id'])
I've found the solution. I needed to reset the index for my df2
df1=pd.read_csv(id_file)
df2 = pd.DataFrame(data=result_q).reset_index()
df1['id'] = pd.to_numeric(df1['id'], errors = 'coerce')
df2['id'] = pd.to_numeric(df2['id'], errors = 'coerce')
df3=df1.merge(df2, on='id')
Thank you everyone!
I have a dataframe containing (record formatted) json strings as follows:
In[9]: pd.DataFrame( {'col1': ['A','B'], 'col2': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
'[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})
Out[9]:
col1 col2
0 A [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1 B [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
I would like to extract the json and for each record add a new row to the dataframe:
co1 t v
0 A 05:15:00 20
1 A 05:20:00 25
2 B 05:15:00 10
3 B 05:20:00 15
I've been experimenting with the following code:
def json_to_df(x):
    df2 = pd.read_json(x.col2)
    return df2
df.apply(json_to_df, axis=1)
but the resulting dataframes are assigned as tuples, rather than creating new rows. Any advice?
The problem with apply is that you need to return multiple rows while it expects only one. A possible solution:
def json_to_df(row):
    _, row = row
    df_json = pd.read_json(row.col2)
    col1 = pd.Series([row.col1] * len(df_json), name='col1')
    return pd.concat([col1, df_json], axis=1)
dfs = list(map(json_to_df, df.iterrows()))  # returns a list of dataframes
df = pd.concat(dfs)                         # glues them together
df
col1 t v
0 A 05:15 20
1 A 05:20 25
0 B 05:15 10
1 B 05:20 15
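A more recent alternative (a sketch of my own, assuming pandas >= 1.0 for explode and pd.json_normalize; not part of the original answer):
import json
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B'],
                   'col2': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
                            '[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})

# parse each JSON string into a list of dicts, then explode to one dict per row
exploded = df.assign(col2=df['col2'].apply(json.loads)).explode('col2')
# spread the dict keys into columns and reattach col1
out = pd.concat([exploded[['col1']].reset_index(drop=True),
                 pd.json_normalize(exploded['col2'].tolist())], axis=1)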
Ok, taking a little inspiration from hellpanderrr's answer above, I came up with the following:
In [92]:
pd.DataFrame( {'X': ['A','B'], 'Y': ['fdsfds','fdsfds'], 'json': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
'[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']},)
Out[92]:
X Y json
0 A fdsfds [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1 B fdsfds [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
In [93]:
dfs = []
def json_to_df(row, json_col):
    json_df = pd.read_json(row[json_col])
    dfs.append(json_df.assign(**row.drop(json_col)))
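# note: _ below is the IPython shorthand for the previous output (the dataframe in Out[92])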
_.apply(json_to_df, axis=1, json_col='json')
pd.concat(dfs)
Out[93]:
t v X Y
0 05:15 20 A fdsfds
1 05:20 25 A fdsfds
0 05:15 10 B fdsfds
1 05:20 15 B fdsfds