I have a pandas DataFrame of this type
col1 col2 col3
1 [blue] [in,out]
2 [green, green] [in]
3 [green] [in]
and I need to convert it to a DataFrame that keeps the first column and spreads all the values from the other columns into rows:
col1 value
1 blue
1 in
1 out
2 green
2 green
2 in
3 green
3 in
Use DataFrame.stack with Series.explode to convert the lists, then clean up the index with DataFrame.reset_index:
df1 = (df.set_index('col1')
         .stack()
         .explode()
         .reset_index(level=1, drop=True)
         .reset_index(name='value'))
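For reference, the intermediate df.set_index('col1').stack() is a Series of lists with a (col1, column name) MultiIndex, which is why the second index level is dropped afterwards; roughly:
col1
1     col2            [blue]
      col3         [in, out]
2     col2    [green, green]
      col3              [in]
3     col2           [green]
      col3              [in]
dtype: object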
Alternative with DataFrame.melt and DataFrame.explode:
df1 = (df.melt('col1')
         .explode('value')
         # stable sort keeps col2 values before col3 within each col1
         .sort_values('col1', kind='mergesort')[['col1','value']]
         .reset_index(drop=True)
      )
print (df1)
col1 value
0 1 blue
1 1 in
2 1 out
3 2 green
4 2 green
5 2 in
6 3 green
7 3 in
Or a list comprehension solution:
L = [(k, x) for k, v in df.set_index('col1').to_dict('index').items()
            for k1, v1 in v.items()
            for x in v1]
df1 = pd.DataFrame(L, columns=['col1','value'])
print (df1)
col1 value
0 1 blue
1 1 in
2 1 out
3 2 green
4 2 green
5 2 in
6 3 green
7 3 in
Another solution consists of:
a list comprehension to build col1 with the repeated keys, and
list concatenation of the values in df['col2'] and df['col3'] to build the value column.
The code is as follows:
df_final = pd.DataFrame(
    {
        'col1': [
            i for i, sublist in zip(df['col1'], (df['col2'] + df['col3']).values)
            for val in range(len(sublist))
        ],
        'value': sum((df['col2'] + df['col3']).values, [])
    }
)
print(df_final)
col1 value
0 1 blue
1 1 in
2 1 out
3 2 green
4 2 green
5 2 in
6 3 green
7 3 in
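A note on the flattening step: sum(lists, []) concatenates repeatedly, which is quadratic in the total number of elements; for larger frames, itertools.chain.from_iterable is the usual linear-time alternative (a minimal sketch reusing the same frame):
from itertools import chain

# flatten the per-row concatenated lists in linear time
values = list(chain.from_iterable((df['col2'] + df['col3']).values))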
A plain loop-based approach that collects the values and the repeated keys, then flattens both lists:
d = []
c = []
for i in range(len(df)):
    d.append([j for j in df['col2'][i]])
    d.append([j for j in df['col3'][i]])
    c.append([df['col1'][i]] * (len(df['col2'][i]) + len(df['col3'][i])))
d = [i for sublist in d for i in sublist]
c = [i for sublist in c for i in sublist]
df1 = pd.DataFrame()
df1['col1'] = c
df1['value'] = d
df = df1
I come from a sql background and I use the following data processing step frequently:
Partition the table of data by one or more fields
For each partition, add a row number to each of its rows that ranks the row by one or more other fields, with the analyst specifying ascending or descending order
EX:
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'data1' : [1,2,2,3,3],
                   'data2' : [1,10,2,3,30]})
df
data1 data2 key1
0 1 1 a
1 2 10 a
2 2 2 a
3 3 3 b
4 3 30 a
I'm looking for the pandas equivalent of this SQL window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
I've tried the following which I've gotten to work where there are no 'partitions':
def row_number(frame, orderby_columns, orderby_direction, name):
    frame.sort_values(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(range(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas) but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_values(by=['data1', 'data2'], ascending=[True, False], inplace=True)).reset_index()
def nf(x):
    x['rn'] = list(range(len(x.index)))
df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I did this.
Ideally, there'd be a succinct way to replicate the window function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share with me the most idiomatic way to number rows like this in pandas?
You can also use sort_values(), groupby(), and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
             .groupby(['key1']) \
             .cumcount() + 1
print(df)
yields:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
PS tested with pandas 0.18
Use the groupby.rank function.
Here is a working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1
a 2 2
a 3 3
b 4 1
b 5 2
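This example ranks by a single column. For the original question's two-column ordering (data1 ascending, data2 descending), a possible sketch, assuming the df from that question, is to sort first so that method="first" breaks ties in the sorted order:
df['RN'] = (df.sort_values(['data1', 'data2'], ascending=[True, False])
              .groupby('key1')['data1']
              .rank(method='first')
              .astype(int))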
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
                   'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at Pandas Rank method for more information
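For reference, the result should look roughly like this (no ties within a group, and rank defaults to method='average', so the ranks are floats that restart per C1 group):
  C1  C2  Rank
0  a   1   1.0
1  a   2   2.0
2  a   3   3.0
3  b   4   1.0
4  b   5   2.0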
pandas.lib.fast_zip() can create a tuple array from a list of arrays. You can use this function to create a tuple Series, and then rank it:
values = {'key1' : ['a','a','a','b','a','b'],
          'data1' : [1,2,2,3,3,3],
          'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))
def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag * df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)
rank = df.groupby("key1").apply(lambda df: rank_multi_columns(df, ["data1", "-data2"]))
print(rank)
The result:
a 1
b 2
c 3
d 2
e 4
f 1
dtype: float64
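Note that pandas.lib was an internal module and fast_zip is gone from modern pandas; a hedged sketch of the same idea swaps that single line for the builtin zip (tuples compare lexicographically, so s.rank() orders them the same way):
values = list(zip(*data))  # replaces pd.lib.fast_zip(data)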
I have a pandas DataFrame (here represented using Excel):
Now I would like to delete all duplicates (1) of a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated to build a boolean mask and then set NaN values with loc, mask, or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove the duplicate rows by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
    'A': [1,5,7,9],
    'B': [1,2,1,3]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
I'm looking to create a new dataframe from data in two separate dataframes, effectively matching the index of each cell and putting the result into a two-column dataframe. My real datasets have exactly the same number of rows and columns, FWIW. Example below:
DF1:
Col1 Col2 Col3
1 2 3
3 8 7
DF2:
Col1 Col2 Col3
A B E
R S W
Desired Dataframe:
Col1 Col2
1 A
2 B
3 E
3 R
8 S
7 W
Thank you for your help!
Here is your code:
df3 = pd.Series(df1.values.ravel())  # default 'C' order flattens row by row, matching the desired output
df4 = pd.Series(df2.values.ravel())
df = pd.concat([df3, df4], axis=1)
df.columns = ['Col1', 'Col2']
Use DataFrame.to_numpy and ndarray.flatten:
df = pd.DataFrame(
    {'Col1': df1.to_numpy().flatten(), 'Col2': df2.to_numpy().flatten()})
# print(df)
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
You can do it easily like so:
list1 = df1.values.tolist()
list1 = [item for sublist in list1 for item in sublist]
list2 = df2.values.tolist()
list2 = [item for sublist in list2 for item in sublist]
data = {
    'Col1': list1,
    'Col2': list2
}
df = pd.DataFrame(data)
print(df)
Hope this helps :)
(pd.concat(map(lambda x: x.unstack().sort_index(level=-1), (df1, df2)), axis=1)
   .reset_index(drop=True)
   .rename(columns=['Col1', 'Col2'].__getitem__))
Result:
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
Another way (alternative):
pd.concat((df1.stack(), df2.stack()), axis=1, keys=['Col1', 'Col2']).reset_index(drop=True)
or:
d = {'Col1':df1,'Col2':df2}
pd.concat((v.stack() for k,v in d.items()),axis=1,keys=d.keys()).reset_index(drop=True)
#or pd.concat((d.values()),keys=d.keys()).stack().unstack(0).reset_index(drop=True)
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
I have 2 data frames that I want to merge/combine based on a condition.
Let's say these are two dfs.
df1
name tpye option store
a 2 8 0
b 4 9 8
c 3 6 2
g 3 2 7
k 1 6 2
m 3 6 5
df2
name red green yellow
a r g y
b r g y
m r g y
What I am trying to do is: if a df2['name'] value exists in df1['name'], add the red, green, and yellow columns to final_df.
So final_df would look like:
name tpye option store red green yellow
a 2 8 0 r g y
b 4 9 8 r g y
c 3 6 2
g 3 2 7
k 1 6 2
m 3 6 5 r g y
Try this. It works because pandas can assign efficiently via index, especially when the index is unique within each dataframe.
df1 = df1.set_index('name')
df2 = df2.set_index('name')
df1[['red', 'green', 'yellow']] = df2[['red', 'green', 'yellow']]
Alternatively, pd.merge will work, as @PaulH mentioned:
df1.merge(df2, how='left', on='name')
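Note that the left merge leaves NaN in the color columns for names missing from df2; if you want the blanks shown in the desired output, one possible follow-up (a sketch, not part of the original answer) is to fill them afterwards:
final_df = df1.merge(df2, how='left', on='name').fillna('')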
You can use the pandas join function. Your first dataframe would be the one whose values you want to keep. For example:
import pandas as pd
d1 = pd.DataFrame({'col1': [1, 2 , 4], 'col2': [3, 4 , 5]})
d2 = pd.DataFrame({'col1': [1, 10], 'col3': [3, 4]})
joined = d1.set_index('col1').join(d2.set_index('col1'))
Which gives exactly what you want:
>>> joined
col2 col3
col1
1 3 3.0
2 4 NaN
4 5 NaN
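Applied to the original question's frames, the same pattern would be (a sketch; reset_index restores name as a column):
final_df = df1.set_index('name').join(df2.set_index('name')).reset_index()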
For a given dataframe df, imported from a csv file and containing redundant data (columns), I would like to write a function that performs recursive filtering and subsequent renaming of df.columns, based on the number of arguments given.
Ideally the function should perform as follows.
When the input is (df, 'string1a', 'string1b', 'new_col_name1'), then:
filter1 = [col for col in df.columns if 'string1a' in col and 'string1b' in col]
df_out = df[filter1]
df_out.columns = ['new_col_name1']
return df_out
Whereas, when the input is:
(df, 'string1a', 'string1b', 'new_col_name1', 'string2a', 'string2b', 'new_col_name2', 'string3a', 'string3b', 'new_col_name3'), the function should return:
filter1 = [col for col in df.columns if 'string1a' in col and 'string1b' in col]
filter2 = [col for col in df.columns if 'string2a' in col and 'string2b' in col]
filter3 = [col for col in df.columns if 'string3a' in col and 'string3b' in col]
df_out = df[filter1 + filter2 + filter3]
df_out.columns = ['new_col_name1', 'new_col_name2', 'new_col_name3']
return df_out
I think you can use a dictionary to define the values, and then apply a function with np.logical_and.reduce because you need to check multiple values from a list:
df = pd.DataFrame({'aavfb':list('abcdef'),
                   'cedf':[4,5,4,5,5,4],
                   'd':[7,8,9,4,2,3],
                   'c':[1,3,5,7,1,0],
                   'abds':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
F aavfb abds c cedf d
0 a a 5 1 4 7
1 a b 3 3 5 8
2 a c 6 5 4 9
3 b d 9 7 5 4
4 b e 2 1 5 2
5 b f 4 0 4 3
def rename1(df, d):
    # loop over the dict
    for k, v in d.items():
        # mask for columns that contain all values in the list
        m = np.logical_and.reduce([df.columns.str.contains(x) for x in v])
        # set new column names by mask
        df.columns = np.where(m, k, df.columns)
    # filter all columns by the keys of the dict
    return df.loc[:, df.columns.isin(d.keys())]

d = {'new_col_name1': ['a', 'b'],
     'new_col_name2': ['c', 'd']}
print (rename1(df, d))
new_col_name1 new_col_name1 new_col_name2
0 a 5 4
1 b 3 5
2 c 6 4
3 d 9 5
4 e 2 5
5 f 4 4
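To mirror the flat positional-argument interface from the question, a hedged sketch (the wrapper name and the triple-grouping are my own) could fold the arguments into such a dictionary first:
def rename_filter(df, *args):
    # group the flat argument list into (stringA, stringB, new_name) triples
    d = {args[i + 2]: [args[i], args[i + 1]] for i in range(0, len(args), 3)}
    return rename1(df, d)
# usage: rename_filter(df, 'a', 'b', 'new_col_name1', 'c', 'd', 'new_col_name2')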