python: How can I combine rows in a DataFrame?

I tried to combine rows in a DataFrame with the apply function, but couldn't.
I would like to combine the rows' c3 values into one list when the (c1, c2) column combination is the same.
For example:
Dataframe df1
   c1 c2  c3
0   0  x  {'a': 1, 'b': 2}
1   0  x  {'a': 3, 'b': 4}
2   0  y  {'a': 5, 'b': 6}
3   0  y  {'a': 7, 'b': 8}
4   2  x  {'a': 9, 'b': 10}
5   2  x  {'a': 11, 'b': 12}
Expected result (DataFrame df1):
   c1 c2  c3
0   0  x  [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
1   0  y  [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
2   2  x  [{'a': 9, 'b': 10}, {'a': 11, 'b': 12}]

Source Pandas DF:
In [20]: df
Out[20]:
   c1 c2  c3
0   0  x  {'a': 1, 'b': 2}
1   0  x  {'a': 3, 'b': 4}
2   0  y  {'a': 5, 'b': 6}
3   0  y  {'a': 7, 'b': 8}
4   2  x  {'a': 9, 'b': 10}
5   2  x  {'a': 11, 'b': 12}
Solution:
In [21]: df.groupby(['c1','c2'])['c3'].apply(list).to_frame('c3').reset_index()
Out[21]:
   c1 c2  c3
0   0  x  [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
1   0  y  [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
2   2  x  [{'a': 9, 'b': 10}, {'a': 11, 'b': 12}]
NOTE: I would recommend avoiding non-scalar values in Pandas DataFrame cells; this can cause various difficulties and performance issues.
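If you do need the dict contents later, one way to keep the cells scalar (a sketch, not part of the original answer; the column names 'a' and 'b' come from the example data) is to expand the dict column into ordinary columns:

```python
import pandas as pd

df = pd.DataFrame({
    'c1': [0, 0, 0, 0, 2, 2],
    'c2': ['x', 'x', 'y', 'y', 'x', 'x'],
    'c3': [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6},
           {'a': 7, 'b': 8}, {'a': 9, 'b': 10}, {'a': 11, 'b': 12}],
})

# Expand the dict column into scalar columns 'a' and 'b'
flat = pd.concat([df[['c1', 'c2']], pd.json_normalize(df['c3'].tolist())], axis=1)
print(flat)
```

With scalar cells, groupby aggregations and vectorized operations stay fast and straightforward.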

Related

Recursively creating columns with cross joining dictionary-like dataframe in Python

I have one dataset with 7 variables and 5 indicators (df1):
   A   B   C   D  E   F  G R_1 R_2 R_3 R_4 R_5
0  4  16   5   7  1  12  9   B   C   D   F   A
1  8   4  10  14  4   5  9   B   E   A NaN NaN
Second key-value dataset showing cut-off value for each variable (df2):
  Variable  Value
0        A     11
1        B     15
2        C     22
3        D     25
4        E      3
5        F     14
6        G     15
I want to add another 5 columns, R_new_1 to R_new_5, based on this condition:
if R_1 = 'B' and the value of B in df1 (16) is greater than its cut-off from df2 (15), then:
df1['R_new_1'] = "C" (from R_2)
df1['R_new_2'] = "D" (from R_3)
df1['R_new_3'] = "F" (from R_4)
df1['R_new_4'] = "A" (from R_5)
df1['R_new_5'] = np.nan
This is then repeated for the new R_2 value, which is now stored in R_new_2.
  R_new_1 R_new_2 R_new_3 R_new_4 R_new_5
0       C       D       F       A     NaN
1       B       A     NaN     NaN     NaN
I have tried the below to automate the above:
var_list = {'A', 'B', 'C', 'D', 'E', 'F', 'G'}
for col in var_list:
    df1[str(col) + "_val"] = df2[df2['Variable'] == str(col)].iloc[0][1]
for col in var_list:
    if (df1[str(col) + "_val"] > df1[str(col)]):
        df1[str(col) + "_ind"] = "OK"
    else:
        df1[str(col) + "_ind"] = "NOK"
The first run checks for R_1. I am unable to recursively replace B with C in R_new_1, C with D in R_new_2, D with F in R_new_3, and F with A in R_new_4, and then continue checking for R_new_2, and so on.
import numpy as np
import pandas as pd
## dataframe constructors
# input
data1 = [{'A': 4, 'B': 16, 'C': 5, 'D': 7, 'E': 1, 'F': 12, 'G': 9, 'R_1': 'B', 'R_2': 'C', 'R_3': 'D', 'R_4': 'F', 'R_5': 'A'},
         {'A': 8, 'B': 4, 'C': 10, 'D': 14, 'E': 4, 'F': 5, 'G': 9, 'R_1': 'B', 'R_2': 'E', 'R_3': 'A', 'R_4': np.nan, 'R_5': np.nan}]
df1 = pd.DataFrame(data1)
#input
data2 = [['A', 11], ['B', 15], ['C', 22], ['D', 25], ['E', 3], ['F', 14], ['G', 15]]
df2 = pd.DataFrame(data2, columns=['Variable', 'Value'])
#desired output
data3 = [{'A': 4, 'B': 16, 'C': 5, 'D': 7, 'E': 1, 'F': 12, 'G': 9, 'R_1':'B', 'R_2':'C', 'R_3':'D', 'R_4':'F', 'R_5':'A', 'R_new_1':'C', 'R_new_2':'D', 'R_new_3':'F', 'R_new_4':'A', 'R_new_5':np.nan},
{'A': 8, 'B': 4, 'C': 10, 'D': 14, 'E': 4, 'F': 5, 'G': 9, 'R_1':'B', 'R_2':'E', 'R_3':'A', 'R_4':np.nan, 'R_5':np.nan, 'R_new_1':'B', 'R_new_2':'A', 'R_new_3':np.nan, 'R_new_4':np.nan, 'R_new_5':np.nan}]
df3 = pd.DataFrame(data3)
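The thread shows no solution, but one possible sketch of the rule described above (my own assumption, not from the original post) is to keep, per row, only the indicator variables whose value stays within the cut-off, let the remaining ones shift left, and pad with NaN; this produces the desired output in one pass rather than by literal recursion:

```python
import numpy as np
import pandas as pd

data1 = [{'A': 4, 'B': 16, 'C': 5, 'D': 7, 'E': 1, 'F': 12, 'G': 9,
          'R_1': 'B', 'R_2': 'C', 'R_3': 'D', 'R_4': 'F', 'R_5': 'A'},
         {'A': 8, 'B': 4, 'C': 10, 'D': 14, 'E': 4, 'F': 5, 'G': 9,
          'R_1': 'B', 'R_2': 'E', 'R_3': 'A', 'R_4': np.nan, 'R_5': np.nan}]
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame([['A', 11], ['B', 15], ['C', 22], ['D', 25],
                    ['E', 3], ['F', 14], ['G', 15]],
                   columns=['Variable', 'Value'])

cutoff = dict(zip(df2['Variable'], df2['Value']))
r_cols = ['R_1', 'R_2', 'R_3', 'R_4', 'R_5']

def shift_out_exceeded(row):
    # Keep each indicator only if the row's value for that variable
    # is within its cut-off; exceeded variables drop out and the
    # remaining ones shift left. NaN entries are skipped.
    kept = [v for v in row[r_cols]
            if isinstance(v, str) and row[v] <= cutoff[v]]
    kept += [np.nan] * (len(r_cols) - len(kept))
    return pd.Series(kept, index=[f'R_new_{i}' for i in range(1, 6)])

df1 = df1.join(df1.apply(shift_out_exceeded, axis=1))
print(df1[[f'R_new_{i}' for i in range(1, 6)]])
```

This reproduces data3 for the sample input; whether a single filtering pass is equivalent to the intended recursive replacement for all inputs is an assumption.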

Double loop with dictionary avoiding repetitions

Consider the following code snippet:
foo = {'a': 0, 'b': 1, 'c': 2}
for k1 in foo:
    for k2 in foo:
        print(foo[k1], foo[k2])
The output will be
0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
I do not care for the order of the key couples, so I would like a code that outputs
0 0
0 1
0 2
1 1
1 2
2 2
I tried with
foo = {'a': 0, 'b': 1, 'c': 2}
foo_2 = foo.copy()
for k1 in foo_2:
    for k2 in foo_2:
        print(foo[k1], foo[k2])
    foo_2.pop(k1)
but I clearly got
RuntimeError: dictionary changed size during iteration
Other solutions?
foo = {'a': 0, 'b': 1, 'c': 2}
foo_2 = foo.copy()
for k1 in foo:
    for k2 in foo_2:
        print(foo[k1], foo[k2])
    foo_2.pop(k1)
You looped over foo_2 twice, and popping k1 from foo_2 changed the dictionary's size while it was being iterated, causing the error. By looping over foo in the outer loop instead, you avoid the error.
>>> foo = {'a': 0, 'b': 1, 'c': 2}
>>> keys = list(foo.keys())
>>> for i, v in enumerate(keys):
...     for v2 in keys[i:]:
...         print(foo[v], foo[v2])
...
0 0
0 1
0 2
1 1
1 2
2 2
A basic approach.
foo = {'a': 0, 'b': 1, 'c': 2}
for v1 in foo.values():
    for v2 in foo.values():
        if v1 <= v2:
            print(v1, v2)
It could be done with itertools.combinations_with_replacement as well:
from itertools import combinations_with_replacement
foo = {'a': 0, 'b': 1, 'c': 2}
print(*[f'{foo[k1]} {foo[k2]}' for k1, k2 in combinations_with_replacement(foo.keys(), r=2)], sep='\n')
You can also copy the values of the foo dictionary into a list and loop, popping the first element after each outer pass:
foo = {'a': 0, 'b': 1, 'c': 2}
val_list = list(foo.values())
for v1 in foo.values():
    for v2 in val_list:
        print(v1, v2)
    val_list.pop(0)

How to generate the component id in the Networkx graph?

I have a large graph network generated using the NetworkX package. Here is a sample:
import networkx as nx
import pandas as pd
G = nx.path_graph(4)
nx.add_path(G, [10, 11, 12])
I'm trying to create a dataframe with Node, degree, component ID, and component. I created the degrees using:
degrees = list(nx.degree(G))
data = pd.DataFrame([list(d) for d in degrees], columns=['Node', 'degree']).sort_values('degree', ascending=False)
I extracted the components using:
Gcc = sorted(nx.connected_components(G), key=len, reverse=True)
Gcc
[{0, 1, 2, 3}, {10, 11, 12}]
I'm not sure how to create the ComponentID and Components columns in data.
Required output:
   Node  degree  ComponentID    Components
1     1       2            1  {0, 1, 2, 3}
2     2       2            1  {0, 1, 2, 3}
5    11       2            2  {10, 11, 12}
0     0       1            1  {0, 1, 2, 3}
3     3       1            1  {0, 1, 2, 3}
4    10       1            2  {10, 11, 12}
6    12       1            2  {10, 11, 12}
How to generate the component ids and add them to the nodes and degrees?
Create triplets of Node, ComponentID and Component by enumerating over the connected-component list, then create a new dataframe from these triplets and merge it with the given dataframe on Node:
df = pd.DataFrame([(n, i, c) for i,c in enumerate(Gcc, 1) for n in c],
columns=['Node', 'ComponentID', 'Components'])
data = data.merge(df, on='Node')
Alternatively, you can use map instead of merge to create the ComponentID and Components columns individually:
d = dict(enumerate(Gcc, 1))
data['ComponentID'] = data['Node'].map({n:i for i,c in d.items() for n in c})
data['Components'] = data['ComponentID'].map(d)
print(data)
   Node  degree  ComponentID    Components
1     1       2            1  {0, 1, 2, 3}
2     2       2            1  {0, 1, 2, 3}
5    11       2            2  {10, 11, 12}
0     0       1            1  {0, 1, 2, 3}
3     3       1            1  {0, 1, 2, 3}
4    10       1            2  {10, 11, 12}
6    12       1            2  {10, 11, 12}

How to perform a union between sets from different rows in the same column of a DataFrame

What is the best (fastest) way to perform a union between sets from different rows in the same column of a DataFrame?
For example for the following dataframe:
df_input=pd.DataFrame([[1,{1,2,3}],[1,{11,12}],[2,{1111,2222}],[2,{0,99}]], columns=['name', 'set'])
   name                  set
0     1            {1, 2, 3}
1     1             {11, 12}
2     2         {2222, 1111}
3     2              {0, 99}
I would like to get:
   name                  set
0     1    {1, 2, 3, 11, 12}
1     2  {0, 99, 2222, 1111}
In case I have two columns with different sets, how can I join both columns?
For example, for this dataframe:
df_input=pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])
   name          set1    set2
0     1     {1, 2, 3}  {b, a}
1     1      {11, 12}     {j}
2     2  {2222, 1111}  {m, n}
3     2       {0, 99}     {p}
I am looking for a way to get this as output:
   name                 set1       set2
0     1    {1, 2, 3, 11, 12}  {b, j, a}
1     2  {0, 99, 2222, 1111}  {m, p, n}
Thank you.
I am really not very knowledgeable in Pandas, and I'm sure there's a better way (if you have time, you should probably wait for a better answer), but something like this seems to do the trick:
import pandas as pd
df_input=pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])
new = pd.DataFrame()
for name, agg_df in df_input.groupby('name'):
    data = {
        'name': name,
        'set1': set(),
        'set2': set(),
    }
    agg_df['set1'].apply(lambda c: data['set1'].update(c))
    agg_df['set2'].apply(lambda c: data['set2'].update(c))
    new = new.append(data, ignore_index=True)
print(new.head())
prints:
name set1 set2
0 1.0 {1, 2, 3, 11, 12} {b, j, a}
1 2.0 {0, 99, 2222, 1111} {p, n, m}
There is more Python syntactic sugar that you could certainly use, but that's not really pandas:
import pandas as pd
df_input=pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])
SET_COLUMNS = ('set1', 'set2')
new = pd.DataFrame()
for name, agg_df in df_input.groupby('name'):
    data = {**{'name': name}, **{set_col: set() for set_col in SET_COLUMNS}}
    for set_col in SET_COLUMNS:
        agg_df[set_col].apply(lambda c: data[set_col].update(c))
    new = new.append(data, ignore_index=True)
print(new.head())
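A more compact alternative (a sketch, not one of the original answers) uses groupby with an aggregation that unions the sets in each group; it also avoids DataFrame.append, which was removed in pandas 2.0:

```python
import pandas as pd

df_input = pd.DataFrame(
    [[1, {1, 2, 3}, {'a', 'b'}], [1, {11, 12}, {'j'}],
     [2, {1111, 2222}, {'m', 'n'}], [2, {0, 99}, {'p'}]],
    columns=['name', 'set1', 'set2'])

# Union all sets within each group, column by column
out = (df_input.groupby('name', as_index=False)
               .agg(lambda s: set().union(*s)))
print(out)
```

The lambda receives each group's column as a Series of sets; set().union(*s) merges them into a single set.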

How to sort pandas DataFrame with a key?

I'm looking for a way to sort a pandas DataFrame. pd.DataFrame.sort_values doesn't accept a key function. I can convert the DataFrame to a list and apply a key to the sorted function, but that will be slow. The other option seems to be something involving a categorical index. I don't have a fixed number of rows, so I don't know whether a categorical index would be applicable.
I have given an example case of what kind of data I want to sort:
Input DataFrame:
  clouds  fluff
0    {[}      1
1    >>>      2
2     {1      3
3    123      4
4  AAsda      5
5    aad      6
Output DataFrame:
  clouds  fluff
0    >>>      2
1    {[}      1
2     {1      3
3    123      4
4    aad      6
5  AAsda      5
The rule for sorting (in priority order):
First, special characters (sorted by ASCII among themselves).
Next, numbers.
Next, lower-case letters (lexicographical).
Next, upper-case letters (lexicographical).
In plain Python I'd do it like this:
from functools import cmp_to_key
def ks(a, b):
    # "Not exactly this but similar"
    if a.isupper():
        return -1
    else:
        return 1
For example:
sorted(['aa', 'AA', 'dd', 'DD'], key=cmp_to_key(ks))
gives:
['DD', 'AA', 'aa', 'dd']
How would you do it with Pandas?
As of pandas 1.1.0, pandas.DataFrame.sort_values accepts an argument key with type callable.
So in this case we would use:
df.sort_values(by='clouds', key=kf)
where kf is the key function that operates on the Series; it accepts a Series and returns a Series.
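The answer does not spell out kf. One possible implementation of the question's priority rule (a sketch of my own, assuming pandas >= 1.1.0) ranks every character by category first and by its own ASCII value second:

```python
import pandas as pd

df = pd.DataFrame({'clouds': ['{[}', '>>>', '{1', '123', 'AAsda', 'aad'],
                   'fluff': [1, 2, 3, 4, 5, 6]})

def kf(series):
    # Per-character sort key: specials < digits < lowercase < uppercase,
    # then by the character itself within each category.
    def rank(ch):
        if ch.isdigit():
            cat = 1
        elif ch.islower():
            cat = 2
        elif ch.isupper():
            cat = 3
        else:
            cat = 0  # special characters come first
        return (cat, ch)
    return series.map(lambda s: tuple(rank(c) for c in s))

print(df.sort_values(by='clouds', key=kf))
```

Because the key returns a tuple per string, ties are broken character by character, which reproduces the question's expected ordering.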
As of pandas 1.2.0, I did this:
import numpy as np
import pandas as pd
df = pd.DataFrame(['aa', 'dd', 'DD', 'AA'], columns=["data"])
# This is the sorting rule
rule = {
    "DD": 1,
    "AA": 10,
    "aa": 20,
    "dd": 30,
}
def particular_sort(series):
    """
    Must return one Series
    """
    return series.apply(lambda x: rule.get(x, 1000))
new_df = df.sort_values(by=["data"], key=particular_sort)
print(new_df) # DD, AA, aa, dd
Of course, you can do this in one line too, but it may be harder to read:
new_df = df.sort_values(by=["data"], key=lambda x: x.apply(lambda y: rule.get(y, 1000)))
print(new_df) # DD, AA, aa, dd
This might be useful, though I'm still not sure about special characters. Can they actually be sorted?
import pandas as pd
a = [2, 'B', 'c', 1, 'a', 'b', 3, 'C', 'A']
df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()
df['lower'] = df['a'].str.islower()
df['int'] = df['a'].apply(isinstance, args=[int])
df2 = pd.concat([df[df['int'] == True].sort_values(by=['a']),
                 df[df['lower'] == True].sort_values(by=['a']),
                 df[df['upper'] == True].sort_values(by=['a'])])
print(df2)
   a  upper  lower    int
3  1    NaN    NaN   True
0  2    NaN    NaN   True
6  3    NaN    NaN   True
4  a  False   True  False
5  b  False   True  False
2  c  False   True  False
8  A   True  False  False
1  B   True  False  False
7  C   True  False  False
You can also do it in one step, without creating the True/False helper columns:
a = [2, 'B', 'c', 1, 'a', 'b', 3, 'C', 'A']
df = pd.DataFrame({"a": a})
df2 = pd.concat([df[df['a'].apply(isinstance, args=[int])].sort_values(by=['a']),
                 df[df['a'].str.islower() == True].sort_values(by=['a']),
                 df[df['a'].str.isupper() == True].sort_values(by=['a'])])
   a
3  1
0  2
6  3
4  a
5  b
2  c
8  A
1  B
7  C
This seems to work:
import numpy as np
from typing import Callable
from pandas import DataFrame

def sort_dataframe_by_key(dataframe: DataFrame, column: str, key: Callable) -> DataFrame:
    """Sort a dataframe by a column using the key."""
    sort_ixs = sorted(np.arange(len(dataframe)), key=lambda i: key(dataframe.iloc[i][column]))
    return DataFrame(columns=list(dataframe), data=dataframe.iloc[sort_ixs].values)
It passes tests:
def test_sort_dataframe_by_key():
    dataframe = DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}])
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: x).equals(
        DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}]))
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
    assert sort_dataframe_by_key(dataframe, column='b', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}]))
    assert sort_dataframe_by_key(dataframe, column='c', key=lambda x: x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
