I have one dataset with 7 variables and 5 indicators (df1):
A B C D E F G R_1 R_2 R_3 R_4 R_5
0 4 16 5 7 1 12 9 B C D F A
1 8 4 10 14 4 5 9 B E A NaN NaN
A second key-value dataset (df2) gives the cut-off value for each variable:
Variable Value
0 A 11
1 B 15
2 C 22
3 D 25
4 E 3
5 F 14
6 G 15
I want to add another 5 columns, R_new_1 to R_new_5, based on this condition:
if R_1 = 'B' and the value of B in df1 (16) exceeds its cut-off in df2 (15):
df1['R_new_1'] = "C" (from R_2)
df1['R_new_2'] = "D" (from R_3)
df1['R_new_3'] = "F" (from R_4)
df1['R_new_4'] = "A" (from R_5)
df1['R_new_5'] = np.nan
Repeating the above for the new R_2 value, which is now stored in R_new_2, the result should be:
R_new_1 R_new_2 R_new_3 R_new_4 R_new_5
0 C D F A NaN
1 B A NaN NaN NaN
I have tried the following to automate this:
var_list = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
for col in var_list:
    df1[col + "_val"] = df2.loc[df2['Variable'] == col, 'Value'].iloc[0]
for col in var_list:
    df1[col + "_ind"] = np.where(df1[col + "_val"] > df1[col], "OK", "NOK")
The first run checks R_1. I am unable to recursively replace B with C in R_new_1, C with D in R_new_2, D with F in R_new_3, F with A in R_new_4, and then continue checking with R_new_2, and so on.
import numpy as np
import pandas as pd

## dataframe constructors
# input
data1 = [{'A': 4, 'B': 16, 'C': 5, 'D': 7, 'E': 1, 'F': 12, 'G': 9, 'R_1':'B', 'R_2':'C', 'R_3':'D', 'R_4':'F', 'R_5':'A'},
{'A': 8, 'B': 4, 'C': 10, 'D': 14, 'E': 4, 'F': 5, 'G': 9, 'R_1':'B', 'R_2':'E', 'R_3':'A', 'R_4':np.nan, 'R_5':np.nan}]
df1 = pd.DataFrame(data1)
#input
data2 = [['A', 11], ['B', 15], ['C', 22], ['D', 25], ['E', 3], ['F', 14], ['G', 15]]
df2 = pd.DataFrame(data2, columns=['Variable', 'Value'])
#desired output
data3 = [{'A': 4, 'B': 16, 'C': 5, 'D': 7, 'E': 1, 'F': 12, 'G': 9, 'R_1':'B', 'R_2':'C', 'R_3':'D', 'R_4':'F', 'R_5':'A', 'R_new_1':'C', 'R_new_2':'D', 'R_new_3':'F', 'R_new_4':'A', 'R_new_5':np.nan},
{'A': 8, 'B': 4, 'C': 10, 'D': 14, 'E': 4, 'F': 5, 'G': 9, 'R_1':'B', 'R_2':'E', 'R_3':'A', 'R_4':np.nan, 'R_5':np.nan, 'R_new_1':'B', 'R_new_2':'A', 'R_new_3':np.nan, 'R_new_4':np.nan, 'R_new_5':np.nan}]
df3 = pd.DataFrame(data3)
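One way to automate this (a sketch, not from the original post): build a cut-off mapping from df2, then filter each row's R values against it. Because every R value is checked only against its own variable's cut-off, the recursive re-checking described above reduces to a single filtering pass per row. This reproduces df3 for the example data; drop_exceeding is a hypothetical helper name.

import numpy as np
import pandas as pd

cutoff = dict(zip(df2['Variable'], df2['Value']))  # {'A': 11, 'B': 15, ...}
r_cols = ['R_1', 'R_2', 'R_3', 'R_4', 'R_5']
new_cols = ['R_new_1', 'R_new_2', 'R_new_3', 'R_new_4', 'R_new_5']

def drop_exceeding(row):
    # Keep each referenced variable whose row value does not exceed its
    # cut-off, preserving order, then pad with NaN back to five columns.
    kept = [v for v in row[r_cols].dropna() if row[v] <= cutoff[v]]
    return pd.Series(kept + [np.nan] * (len(r_cols) - len(kept)), index=new_cols)

df1[new_cols] = df1.apply(drop_exceeding, axis=1)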
Consider a dataframe like pivoted below, where replicates of some data are given as lists in a dataframe:
d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
[20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
[30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an Excel file as follows:
with pd.ExcelWriter('output.xlsx') as writer:
pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written so that replicate data are given in side-by-side cells?
Then you need to pivot your data in a slightly different way: first explode the Data column, then deduplicate with groupby.cumcount:
(df.explode('Data')
.assign(n=lambda d: d.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'n'], values='Data')
.droplevel('n', axis=1).rename_axis(columns=None)
)
Output:
A A A B B B B C C C C
Conc
0.1 10 9.7 8 100 110 80 NaN 1 1 0 NaN
0.5 50 40 30 3 4 5 6 2 1 2 2
1.0 100 90 80 20 15 10 NaN 10 5 9 3
2.0 NaN NaN NaN NaN NaN NaN NaN 30 40 50 20
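To get this into Excel, assign the pivoted result above to a variable (say out, a name not in the original answer) and reuse the same ExcelWriter pattern as the question:

out = (df.explode('Data')
       .assign(n=lambda d: d.groupby(level=0).cumcount())
       .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
       .droplevel('n', axis=1).rename_axis(columns=None))

with pd.ExcelWriter('output.xlsx') as writer:
    out.to_excel(writer, sheet_name='Sheet1', index_label='Conc')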
Besides @mozway's answer, just for formatting, you can blank the repeated column labels so each compound name appears only once in the header:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'col'], values='Data')
.rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')
Let's assume I have the following DataFrame:
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df
Out[10]:
a b c d
0 1 1 f 10
1 1 1 f 20
2 2 1 f 30
3 2 1 e 40
4 2 2 f 50
5 2 2 f 60
6 3 1 f 70
7 3 1 e 80
8 3 2 f 90
9 3 2 f 100
In the following I want to take the values of columns a and b where c == 'e', and use those values to select the respective rows of df (which would select rows 2, 3, 6, 7). The idea is to create a list of tuples and index df by that list:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new
Out[13]:
c d
a b
1 1 f 10
1 f 20
2 1 f 30
1 e 40
2 f 50
2 f 60
3 1 f 70
1 e 80
2 f 90
2 f 100
list_tup
Out[14]: [(2, 1), (3, 1)]
df.loc[list_tup]
This results in a TypeError: unhashable type: 'writeable void-scalar', which I don't understand. Any suggestions? I'm pretty new to Python and pandas, so I assume I'm missing something fundamental.
I believe it's better to use groupby().transform() with boolean indexing in this use case:
valids = (df['c'].eq('e') # check if `c` is 'e`
.groupby([df['a'],df['b']]) # group by `a` and `b`
.transform('any') # check if `True` occurs in the group
# use the same label for all rows in group
)
# filter with boolean indexing
df[valids]
Output:
a b c d
2 2 1 f 30
3 2 1 e 40
6 3 1 f 70
7 3 1 e 80
A similar idea with groupby().filter(), which is more readable but can be slightly slower:
df.groupby(['a','b']).filter(lambda x: x['c'].eq('e').any())
You could try an inner join.
import pandas as pd
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df.merge(df.loc[df['c']=='e', ['a','b']], on=['a','b'])
Output
a b c d
0 2 1 f 30
1 2 1 e 40
2 3 1 f 70
3 3 1 e 80
Maybe try a MultiIndex:
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Full code:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Outputs:
c d
a b
2 1 f 30
1 e 40
3 1 f 70
1 e 80
Let us try building a boolean mask: convert each (a, b) pair to a tuple and check membership in the (a, b) pairs where c == 'e':
m = df[['a','b']].apply(tuple, axis=1).isin(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False).tolist())
out = df[m].copy()
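Output (the same four rows as the other answers):
   a  b  c   d
2  2  1  f  30
3  2  1  e  40
6  3  1  f  70
7  3  1  e  80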
I'm looking for a way to sort a pandas DataFrame. pd.DataFrame.sort_values doesn't accept a key function. I can convert the frame to a list and apply a key via the sorted function, but that will be slow. The other option seems to be something involving a categorical index. I don't have a fixed number of rows, so I don't know whether a categorical index is applicable.
Here is an example of the kind of data I want to sort:
Input DataFrame:
clouds fluff
0 {[} 1
1 >>> 2
2 {1 3
3 123 4
4 AAsda 5
5 aad 6
Output DataFrame:
clouds fluff
0 >>> 2
1 {[} 1
2 {1 3
3 123 4
4 aad 6
5 AAsda 5
The rule for sorting (priority):
1. Special characters first (sorted by ASCII among themselves)
2. Then numbers
3. Then lowercase letters (lexicographical)
4. Then uppercase letters (lexicographical)
In plain Python I'd do something like:
from functools import cmp_to_key
def ks(a, b):
# "Not exactly this but similar"
if a.isupper():
return -1
else:
return 1
Example:
sorted(['aa', 'AA', 'dd', 'DD'], key=cmp_to_key(ks))
Answer:
['DD', 'AA', 'aa', 'dd']
How would you do it with Pandas?
As of pandas 1.1.0, pandas.DataFrame.sort_values accepts a key argument of type callable.
So in this case we would use:
df.sort_values(by='clouds', key=kf)
where kf is the key function: it receives a Series and must return a Series.
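For instance, here is a sketch of such a key function for the priority rule above (kf and rank are illustrative names, not from the original answer; it assumes the column holds strings and breaks ties within each class by plain string comparison, so strict ASCII order places '{1' before '{[}', slightly unlike the example output above):

import pandas as pd

def rank(x):
    # Map each value to a (priority, value) tuple:
    # 0 = special characters, 1 = digits, 2 = lowercase, 3 = uppercase;
    # ties within a class compare the string itself.
    ch = str(x)[0]
    if ch.isdigit():
        return (1, str(x))
    if ch.islower():
        return (2, str(x))
    if ch.isupper():
        return (3, str(x))
    return (0, str(x))  # special characters, ordered by ASCII

def kf(s: pd.Series) -> pd.Series:
    return s.map(rank)  # a Series of tuples sorts lexicographically

df = pd.DataFrame({'clouds': ['{[}', '>>>', '{1', '123', 'AAsda', 'aad'],
                   'fluff': [1, 2, 3, 4, 5, 6]})
print(df.sort_values(by='clouds', key=kf))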
I did this
import numpy as np
import pandas as pd
df = pd.DataFrame(['aa', 'dd', 'DD', 'AA'], columns=["data"])
# This is the sorting rule
rule = {
"DD": 1,
"AA": 10,
"aa": 20,
"dd": 30,
}
def particular_sort(series):
"""
Must return one Series
"""
return series.apply(lambda x: rule.get(x, 1000))
new_df = df.sort_values(by=["data"], key=particular_sort)
print(new_df) # DD, AA, aa, dd
Of course, you can do this too, but it may be harder to read:
new_df = df.sort_values(by=["data"], key=lambda x: x.apply(lambda y: rule.get(y, 1000)))
print(new_df) # DD, AA, aa, dd
This might be useful, though I'm still not sure about special characters and whether they can actually be sorted:
import pandas as pd
a = [2, 'B', 'c', 1, 'a', 'b', 3, 'C', 'A']
df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()
df['lower'] = df['a'].str.islower()
df['int'] = df['a'].apply(isinstance, args=[int])
df2 = pd.concat([df[df['int'] == True].sort_values(by=['a']),
                 df[df['lower'] == True].sort_values(by=['a']),
                 df[df['upper'] == True].sort_values(by=['a'])])
print(df2)
a upper lower int
3 1 NaN NaN True
0 2 NaN NaN True
6 3 NaN NaN True
4 a False True False
5 b False True False
2 c False True False
8 A True False False
1 B True False False
7 C True False False
You can also do it in one step, without creating the new True/False columns:
a = [2, 'B', 'c', 1, 'a', 'b', 3, 'C', 'A']
df = pd.DataFrame({"a": a})
df2 = pd.concat([df[df['a'].apply(isinstance, args=[int])].sort_values(by=['a']),
                 df[df['a'].str.islower() == True].sort_values(by=['a']),
                 df[df['a'].str.isupper() == True].sort_values(by=['a'])])
a
3 1
0 2
6 3
4 a
5 b
2 c
8 A
1 B
7 C
This seems to work:
from typing import Callable

import numpy as np
from pandas import DataFrame

def sort_dataframe_by_key(dataframe: DataFrame, column: str, key: Callable) -> DataFrame:
    """Sort a dataframe on a column using the key."""
    sort_ixs = sorted(np.arange(len(dataframe)), key=lambda i: key(dataframe.iloc[i][column]))
    return DataFrame(columns=list(dataframe), data=dataframe.iloc[sort_ixs].values)
It passes tests:
def test_sort_dataframe_by_key():
    dataframe = DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}])
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: x).equals(
        DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}]))
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
    assert sort_dataframe_by_key(dataframe, column='b', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}]))
    assert sort_dataframe_by_key(dataframe, column='c', key=lambda x: x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
What would be the workaround (or the tidier way) to insert a column into a pandas DataFrame where some index values are duplicated?
For example, having the following dataframe:
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df1 = df1.set_index([0])
df1
1 2
0
1 51 R
2 51 R
3 74 R
4 29 R
1 39 F
2 3 F
3 14 F
4 16 F
How can I insert the column foo from df2 (below) into df1?
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df2 = df2.set_index([0])
df2
foo 2
0
1 5 R
2 5 R
3 7 R
4 2 R
1 3 F
3 1 F
4 1 F
Note that the index 2 is missing from category F.
I would like the result to be something like:
1 foo 2
0
1 51 5 R
2 51 5 R
3 74 7 R
4 29 2 R
1 39 3 F
2 3 NaN F
3 14 1 F
4 16 1 F
I tried the DataFrame.insert method but am getting
df1.insert(2, 'FOO', df2['foo'])
ValueError: cannot reindex from a duplicate axis
The index and column 2 uniquely define a row in both data frames, so you can do a join on the two columns (after resetting the index):
df1.reset_index().merge(df2.reset_index(), how='left', on=[0,2]).set_index([0])
# 1 2 foo
#0
#1 51 R 5.0
#2 51 R 5.0
#3 74 R 7.0
#4 29 R 2.0
#1 39 F 3.0
#2 3 F NaN
#3 14 F 1.0
#4 16 F 1.0
Alternatively, set a MultiIndex of [0, 2] on both frames and join:
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df1 = df1.set_index([0, 2])
df2 = df2.set_index([0, 2])
df1.join(df2, how='left').reset_index(level=2)
2 1 foo
0
1 R 51 5.0
2 R 51 5.0
3 R 74 7.0
4 R 29 2.0
1 F 39 3.0
2 F 3 NaN
3 F 14 1.0
4 F 16 1.0
You're very close...
As you already know from your question, you can't do this directly, for the reason stated in the error: you have a repeated index. If you must have column 0 as the index, don't set it as the index before your merge; set it after:
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df = df1.merge(df2, how='left')
df = df.set_index([0])
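This yields the desired frame (output computed from the inputs above; it was not shown in the original answer):

    1  2  foo
0
1  51  R  5.0
2  51  R  5.0
3  74  R  7.0
4  29  R  2.0
1  39  F  3.0
2   3  F  NaN
3  14  F  1.0
4  16  F  1.0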
I tried to combine rows with the apply function in a DataFrame, but couldn't.
I would like to combine rows into one list when the (c1, c2) column information is the same.
For example:
Dataframe df1
  c1 c2 c3
0  0  x {'a': 1, 'b': 2}
1  0  x {'a': 3, 'b': 4}
2  0  y {'a': 5, 'b': 6}
3  0  y {'a': 7, 'b': 8}
4  2  x {'a': 9, 'b': 10}
5  2  x {'a': 11, 'b': 12}
expected result
Dataframe df1
  c1 c2 c3
0  0  x [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
1  0  y [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
2  2  x [{'a': 9, 'b': 10}, {'a': 11, 'b': 12}]
Source Pandas DF:
In [20]: df
Out[20]:
c1 c2 c3
0 0 x {'a': 1, 'b': 2}
1 0 x {'a': 3, 'b': 4}
2 0 y {'a': 5, 'b': 6}
3 0 y {'a': 7, 'b': 8}
4 2 x {'a': 9, 'b': 10}
5 2 x {'a': 11, 'b': 12}
Solution:
In [21]: df.groupby(['c1','c2'])['c3'].apply(list).to_frame('c3').reset_index()
Out[21]:
c1 c2 c3
0 0 x [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
1 0 y [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
2 2 x [{'a': 9, 'b': 10}, {'a': 11, 'b': 12}]
NOTE: I would recommend avoiding non-scalar values in Pandas DataFrame cells; this can cause various difficulties and performance issues.
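For example, a flatter, scalar-only alternative (a sketch; the flat name is hypothetical) spreads the dict keys into ordinary columns instead of storing dicts in cells:

# Spread each dict in c3 into scalar columns 'a' and 'b';
# pd.DataFrame(list_of_dicts) aligns on the same RangeIndex as df.
flat = df.drop(columns='c3').join(pd.DataFrame(df['c3'].tolist()))
print(flat)
#    c1 c2   a   b
# 0   0  x   1   2
# 1   0  x   3   4
# 2   0  y   5   6
# 3   0  y   7   8
# 4   2  x   9  10
# 5   2  x  11  12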