Split pandas data frame columns by row value - python

I have a data frame like this:
>df = pd.DataFrame({'A':['M',2,3],'B':['M',2,3],'AA':['N',20,30],'BB':['N',20,30]})
>df = df.rename(columns={df.columns[2]: 'A'})
>df = df.rename(columns={df.columns[3]: 'B'})
>df
A B A B
0 M M N N
1 2 2 20 20
2 3 3 30 30
and I need to split the data frame column-wise by the value in row 0 ('M' or 'N'):
A B
0 M M
1 2 2
2 3 3
A B
0 N N
1 20 20
2 30 30
The data in the data frame comes from an Excel sheet and the column names are not unique.
Thanks for help!

This should get the job done:
df.loc[:,df.iloc[0, :] == "M"]
df.loc[:,df.iloc[0, :] == "N"]
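A minimal sketch of the mask approach, assuming the marker ('M' or 'N') always sits in the first row. The frame is built from a list of rows here (my choice, so the duplicate column names can be passed directly), and the mask is converted to a NumPy array so that the selection is positional and does not depend on the duplicated labels:

```python
import pandas as pd

# Build the frame row by row so the duplicate column names can be given directly.
df = pd.DataFrame(
    [["M", "M", "N", "N"], [2, 2, 20, 20], [3, 3, 30, 30]],
    columns=["A", "B", "A", "B"],
)

first_row = df.iloc[0]  # one marker per column; the index holds duplicate labels

# Boolean masks as plain arrays: True where the column belongs to the group.
df_m = df.loc[:, (first_row == "M").to_numpy()]
df_n = df.loc[:, (first_row == "N").to_numpy()]

print(df_m)
print(df_n)
```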

Use pandas iloc for selecting columns:
import pandas as pd
df = pd.DataFrame({'A':['M',2,3],'B':['M',2,3],'AA':['N',20,30],'BB':['N',20,30]})
df = df.rename(columns={df.columns[2]: 'A'})
df = df.rename(columns={df.columns[3]: 'B'})
df1 = df.iloc[:, :2]
df2 = df.iloc[:, 2:]
Output:
A B
0 M M
1 2 2
2 3 3
A B
0 N N
1 20 20
2 30 30

Use a list comprehension with loc:
dfs = [df.loc[:, df.loc[0,:].eq(s)] for s in ['M','N']]
This gives separate dataframes in a list.
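If the marker values are not known in advance, the same idea can be keyed by whatever appears in row 0. This is only a sketch: the names `markers` and `parts` are mine, and it assumes every column carries a marker in the first row.

```python
import pandas as pd

df = pd.DataFrame(
    [["M", "M", "N", "N"], [2, 2, 20, 20], [3, 3, 30, 30]],
    columns=["A", "B", "A", "B"],
)

markers = df.iloc[0]  # marker value for each column

# One sub-frame per distinct marker, looked up by the marker itself.
parts = {
    m: df.loc[:, (markers == m).to_numpy()]
    for m in markers.unique()
}

print(parts["M"])
print(parts["N"])
```

A dict makes it easier to pick a block by name than remembering its position in a list.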


How do you generate a rolling count the number of rows that are duplicated in Pandas? [duplicate]

I come from a sql background and I use the following data processing step frequently:
Partition the table of data by one or more fields
For each partition, add a rownumber to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending
EX:
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
'data1' : [1,2,2,3,3],
'data2' : [1,10,2,3,30]})
df
data1 data2 key1
0 1 1 a
1 2 10 a
2 2 2 a
3 3 3 b
4 3 30 a
I'm looking for how to do the PANDAS equivalent to this sql window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
I've tried the following which I've gotten to work where there are no 'partitions':
def row_number(frame,orderby_columns, orderby_direction,name):
    frame.sort_index(by = orderby_columns, ascending = orderby_direction, inplace = True)
    frame[name] = list(xrange(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas) but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace = True)).reset_index()
def nf(x):
    x['rn'] = list(xrange(len(x.index)))
df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I do this.
Ideally, there'd be a succinct way to replicate the window function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share the most idiomatic way to number rows like this in pandas?
You can also use sort_values(), groupby() and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
.groupby(['key1']) \
.cumcount() + 1
print(df)
yields:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
PS tested with pandas 0.18
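The same recipe can be wrapped in a small helper that mirrors the shape of the SQL clause. This is a sketch, not a pandas API: `row_number` and its parameter names are hypothetical, and it relies on the frame having a unique index so the result aligns back to the original rows.

```python
import pandas as pd

def row_number(frame, partition_by, order_by, ascending=True):
    """Return a 1-based row number per partition, indexed like `frame`."""
    ordered = frame.sort_values(order_by, ascending=ascending)
    # cumcount() numbers rows within each group in their current (sorted) order.
    return ordered.groupby(partition_by).cumcount() + 1

df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "a"],
    "data1": [1, 2, 2, 3, 3],
    "data2": [1, 10, 2, 3, 30],
})

# ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY data1 ASC, data2 DESC)
df["RN"] = row_number(df, "key1", ["data1", "data2"], ascending=[True, False])
print(df)
```

Assigning the result back works because the Series keeps the original index labels, so pandas aligns it row by row even though it was computed on the sorted frame.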
Use the groupby.rank function.
Here is a working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1
a 2 2
a 3 3
b 4 1
b 5 2
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at the pandas rank method for more information.
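pandas.lib.fast_zip() can create a tuple array from a list of arrays (note that pandas.lib belongs to older pandas releases and is not exposed as a public module in current ones). You can use this function to create a tuple series, and then rank it: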
values = {'key1' : ['a','a','a','b','a','b'],
'data1' : [1,2,2,3,3,3],
'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))
def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        # a leading "-" means: rank this column in descending order
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag*df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)
rank = df.groupby("key1").apply(lambda df:rank_multi_columns(df, ["data1", "-data2"]))
print(rank)
The result:
a 1
b 2
c 3
d 2
e 4
f 1
dtype: float64

Subtracting multiple columns between dataframes based on key

I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to deduct columns C and D in Df2 from columns C and D in Df1 based on the key contained in column A.
I also want to ensure that column B remains untouched, example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However what the answer does not explain is if there are other columns in the primary df (such as column B) that should not be involved as an index or with the operation.
Is somebody please able to advise?
I was originally performing a loop which find the value in the other df and deducts it however this takes too long for my code to run with the size of data I am working with.
The idea is to specify the column(s) to match on and the column(s) to subtract, move all remaining columns into a MultiIndex together with the match column(s), then subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or index by the match column(s) only and fill the resulting missing values from the original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
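For comparison, here is a self-contained sketch of a different route that avoids the MultiIndex: look up Df2's values in Df1's row order with reindex and subtract positionally. It assumes the keys in Df2's column A are unique; the names `lookup` and `Df3` and the choice to subtract 0 for keys missing from Df2 are mine.

```python
import pandas as pd

Df1 = pd.DataFrame({"A": ["x", "y", "z"], "B": ["j", "k", "l"],
                    "C": [5, 7, 9], "D": [2, 3, 4]})
Df2 = pd.DataFrame({"A": ["z", "x", "y"], "B": ["o", "p", "q"],
                    "C": [1, 2, 3], "D": [1, 1, 1]})

cols = ["C", "D"]

# Df2's C/D values reordered to match Df1's keys; unmatched keys subtract 0.
lookup = Df2.set_index("A")[cols].reindex(Df1["A"]).fillna(0)

Df3 = Df1.copy()
# Subtract as plain arrays so row order, not index labels, decides the pairing.
Df3[cols] = Df1[cols].to_numpy() - lookup.to_numpy()
print(Df3)
```

Column B never enters the calculation, so it is carried over from Df1 unchanged.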

Copy column value from one dataframe to another based on id in Pandas

I am trying to copy Name from df2 into df1 where ID is common between both dataframes.
df1:
ID Name
1 A
2 B
4 C
16 D
7 E
df2:
ID Name
1 X
2 Y
7 Z
Expected Output:
ID Name
1 X
2 Y
4 C
16 D
7 Z
I have tried it like this, but it didn't work. I don't understand how to assign the value here; assigning df2['Name'] is wrong.
for i in df2["ID"].tolist():
    df1['Name'].loc[(df1['ID'] == i)] = df2['Name']
Try with update
df1 = df1.set_index('ID')
df1.update(df2.set_index('ID'))
df1 = df1.reset_index()
df1
Out[476]:
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
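A runnable version of the update approach with the frames from the question, as a sketch. Note that update() modifies the frame it is called on in place and aligns on the index, which is why both frames are indexed by ID first; the name `result` is mine.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 4, 16, 7], "Name": list("ABCDE")})
df2 = pd.DataFrame({"ID": [1, 2, 7], "Name": list("XYZ")})

result = df1.set_index("ID")
# Rows whose ID also appears in df2 are overwritten; all others keep their Name.
result.update(df2.set_index("ID"))
result = result.reset_index()

print(result)
```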
If the order of rows does not matter, concatenating the two frames and calling drop_duplicates will achieve the result (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
pd.concat([df2, df1]).drop_duplicates(subset='ID')
Another solution would be:
s = df1["Name"]
df1.loc[:,"Name"]=df1["ID"].map(df2.set_index("ID")["Name"].to_dict()).fillna(s)
Output:
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
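The map-based variant also runs without the intermediate dict, since map accepts a Series as the lookup table. A short sketch using the question's data (the name `lookup` is mine, and the IDs in df2 are assumed to be unique):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 4, 16, 7], "Name": list("ABCDE")})
df2 = pd.DataFrame({"ID": [1, 2, 7], "Name": list("XYZ")})

# ID -> new Name; IDs that are absent from df2 map to NaN.
lookup = df2.set_index("ID")["Name"]

# Fall back to the existing Name wherever the lookup found nothing.
df1["Name"] = df1["ID"].map(lookup).fillna(df1["Name"])

print(df1)
```

Unlike the index-based solutions, this keeps ID as an ordinary column throughout.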
One more for consideration
df,dg = df1,df2
df = df.set_index('ID')
dg = dg.set_index('ID')
df.loc[dg.index,:] = dg # All columns
#df.loc[dg.index,'Name'] = dg.Name # Single column
df = df.reset_index()
>>> df
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
Or, for a single column (with 'ID' as the index of both frames), use the commented-out line above instead.

How do I hide the index column in pandas dataframe?

How am I supposed to remove the index column in the first row? I know it is not counted as a column, but when I transpose the data frame, it does not allow me to use my headers anymore.
df = df.transpose()
print(df)
df = df.drop('RTM',1)
df = df.drop('Requirements', 1)
df = df.drop('Test Summary Report', 1)
print(df)
This throws me an error "labels ['RTM'] not contained in axis".
RTM is contained in an axis, and this works if I do index_col=0:
df = xl.parse(sheet_name, header=1, index_col=0, usecols="A:E", nrows=6)
but then I lose my (0,0) value "Artifact name" as a header. Any help will be appreciated.
You can do this with .iloc, assigning the first row as the column names after transposing. Then you can drop the first row and clean up the name.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': list('ABCDE'),
'val1': np.arange(1,6,1),
'val2': np.arange(11,16,1)})
id val1 val2
0 A 1 11
1 B 2 12
2 C 3 13
3 D 4 14
4 E 5 15
Transpose and clean up the names
df = df.T
df.columns = df.iloc[0]
df = df.drop(df.iloc[0].index.name)
df.columns.name = None
df is now:
A B C D E
val1 1 2 3 4 5
val2 11 12 13 14 15
Alternatively, just create a new DataFrame to begin with, specifying which column you want to be the header column.
header_col = 'id'
cols = [x for x in df.columns if x != header_col]
pd.DataFrame(df[cols].values.T, columns=df[header_col], index=cols)
Output:
id A B C D E
val1 1 2 3 4 5
val2 11 12 13 14 15
Using the setup from @ALollz:
df.set_index('id').rename_axis(None).T
A B C D E
val1 1 2 3 4 5
val2 11 12 13 14 15
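Put together as a self-contained sketch, with the sample frame from the earlier answer. Moving 'id' into the index before transposing means the header column never becomes a data row, so the numeric columns keep a numeric dtype; the name `wide` is mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": list("ABCDE"),
    "val1": np.arange(1, 6),
    "val2": np.arange(11, 16),
})

# 'id' becomes the index, its axis name is cleared, then rows and columns swap.
wide = df.set_index("id").rename_axis(None).T

print(wide)
```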

Pandas parse json in column and expand to new rows in dataframe

I have a dataframe containing (record formatted) json strings as follows:
In[9]: pd.DataFrame( {'col1': ['A','B'], 'col2': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
'[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})
Out[9]:
col1 col2
0 A [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1 B [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
I would like to extract the json and for each record add a new row to the dataframe:
col1 t v
0 A 05:15:00 20
1 A 05:20:00 25
2 B 05:15:00 10
3 B 05:20:00 15
I've been experimenting with the following code:
def json_to_df(x):
    df2 = pd.read_json(x.col2)
    return df2
df.apply(json_to_df, axis=1)
but the resulting dataframes are assigned as tuples, rather than creating new rows. Any advice?
The problem with apply is that you need to return multiple rows and it expects only one. A possible solution:
from io import StringIO
def json_to_df(row):
    _, row = row  # iterrows() yields (index, row) pairs
    df_json = pd.read_json(StringIO(row.col2))
    col1 = pd.Series([row.col1]*len(df_json), name='col1')
    return pd.concat([col1,df_json],axis=1)
frames = list(map(json_to_df, df.iterrows())) #returns a list of dataframes
df = pd.concat(frames) #glues them together
df
col1 t v
0 A 05:15 20
1 A 05:20 25
0 B 05:15 10
1 B 05:20 15
Ok, taking a little inspiration from hellpanderrr's answer above, I came up with the following:
In [92]:
pd.DataFrame( {'X': ['A','B'], 'Y': ['fdsfds','fdsfds'], 'json': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
'[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']},)
Out[92]:
X Y json
0 A fdsfds [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1 B fdsfds [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
In [93]:
dfs = []
def json_to_df(row, json_col):
    json_df = pd.read_json(row[json_col])
    dfs.append(json_df.assign(**row.drop(json_col)))
_.apply(json_to_df, axis=1, json_col='json')
pd.concat(dfs)
Out[93]:
t v X Y
0 05:15 20 A fdsfds
1 05:20 25 A fdsfds
0 05:15 10 B fdsfds
1 05:20 15 B fdsfds
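On reasonably recent pandas versions (explode with ignore_index needs 1.1 or later) the same result can be reached without a Python-level loop over rows. This is a sketch of an alternative, not what the answers above use: parse each string with json.loads, give every record its own row with explode, and expand the dicts with pd.json_normalize. The names `exploded`, `records` and `out` are mine.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "col1": ["A", "B"],
    "col2": [
        '[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
        '[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]',
    ],
})

# Parse each JSON string into a list of dicts, then one dict per row.
exploded = df.assign(col2=df["col2"].map(json.loads)).explode("col2", ignore_index=True)

# Turn the dicts into columns and put them next to col1.
records = pd.json_normalize(exploded["col2"].tolist())
out = pd.concat([exploded[["col1"]], records], axis=1)

# The values arrive as strings in this data, so convert v explicitly.
out["v"] = out["v"].astype(float)

print(out)
```

Here t is left as a string; converting it to a time type is a separate step if needed.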
