A weird transformation of a pandas dataframe - python

My dataframe:
df = pd.DataFrame({'a':['A', 'B'], 'b':[{5:1, 11:2}, {5:3}]})
Expected output (each key will be transformed into 'n' keys; for example, in row 1, key = 5 (with value = 2) gets transformed into 5 and 6. This change also needs to be reflected in the 'a' column):
df_expected = pd.DataFrame({'a':['A1', 'A2', 'A1', 'A2', 'B1', 'B2', 'B3'], 'key':[5, 6, 11, 12, 5, 6, 7]})
My present state:
df['key']=df.apply(lambda x: x['b'].keys(), axis=1)
df['value']=df.apply(lambda x: max(x['b'].values()), axis=1)
df = df.loc[df.index.repeat(df.value)]
Stuck here. What should the next step be?

This will do your transform, outside of pandas.
d = {'a':['A', 'B'], 'b':[{5:1, 11:2}, {5:3}]}
out = { 'a':[], 'b':[] }
for a, b in zip(d['a'], d['b']):
    n = max(b.values())
    for k in b:
        for i in range(n):
            out['a'].append(f'{a}{i+1}')
            out['b'].append(k + i)
print(out)
Output:
{'a': ['A1', 'A2', 'A1', 'A2', 'B1', 'B2', 'B3'], 'b': [5, 6, 11, 12, 5, 6, 7]}

First, you need to preprocess your input dictionary like this:
import pandas as pd
d = {'a':['A', 'B'], 'b':[{5:2, 11:2}, {5:3}]} # Assuming 5:2 instead of 5:1.
res = {"a": [], "keys": []}
for idx, i in enumerate(d['b']):
    res['a'].extend([f"{d['a'][idx]}{k}" for j in i for k in range(1, i[j]+1)])
    res['keys'].extend([k for j in i for k in range(j, j+i[j])])
df = pd.DataFrame(res)
Output:
{'a': ['A1', 'A2', 'A1', 'A2', 'B1', 'B2', 'B3'], 'keys': [5, 6, 11, 12, 5, 6, 7]}

For a pandas solution:
df2 = (df.drop(columns='b')
         .join(pd.json_normalize(df['b'])
                 .rename_axis(columns='key')
                 .stack().reset_index(-1, name='repeat')
              )
         .loc[lambda d: d.index.repeat(d.pop('repeat'))]
      )
g = df2.groupby(['a', 'key']).cumcount()
df2['a'] += g.add(1).astype(str)
df2['key'] += g
print(df2)
Output:
    a  key
0  A1    5
0  A1   11
0  A2    6
0  A2   12
0  A3    7
0  A3   13
1  B1    5
1  B2    6
1  B3    7
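For reference, the whole expansion can also be written as one flat comprehension that builds (label, key) tuples and then a DataFrame in a single step. A minimal sketch, using the 5:2 value assumed in the preprocessing answer above:

```python
import pandas as pd

d = {'a': ['A', 'B'], 'b': [{5: 2, 11: 2}, {5: 3}]}  # assumes 5:2, as noted above

# One (label, key) tuple per expanded key: each key k becomes k, k+1, ..., k+n-1,
# where n is the row's maximum value, and the label gets the matching suffix.
records = [(f"{a}{i + 1}", k + i)
           for a, b in zip(d['a'], d['b'])
           for k in b
           for i in range(max(b.values()))]
out = pd.DataFrame(records, columns=['a', 'key'])
print(out)
```

This produces the same frame as df_expected in the question.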

Related

Remove element from every list in a column in pandas dataframe based on another column

I'd like to remove values in the lists in column B based on column A. How can I do this?
Given:
df = pd.DataFrame({
'A': ['a1', 'a2', 'a3', 'a4'],
'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []]
})
I want:
result = pd.DataFrame({
'A': ['a1', 'a2', 'a3', 'a4'],
'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []],
'Output': [['a2'], ['a1', 'a3'], ['a1'], []]
})
One way of doing that is applying a filtering function to each row via DataFrame.apply:
df['Output'] = df.apply(lambda x: [i for i in x.B if i != x.A], axis=1)
Another solution using iterrows():
for i, value in df.iterrows():
    try:
        value['B'].remove(value['A'])
    except ValueError:
        pass
print(df)
Output:
A B
0 a1 [a2]
1 a2 [a1, a3]
2 a3 [a1]
3 a4 []
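Note that the iterrows() version removes elements from the original lists in place, which is why its printed B column is already filtered. If you want to keep B untouched and write the result to a new column, a minimal non-mutating sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a1', 'a2', 'a3', 'a4'],
    'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []]
})

# Build new lists instead of calling list.remove on the originals
df['Output'] = [[i for i in b if i != a] for a, b in zip(df['A'], df['B'])]
print(df)
```

Unlike list.remove, which drops only the first occurrence, the comprehension drops every occurrence, matching the apply answer above.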

Compare nested list values within columns of a dataframe

How can I compare lists within two columns of a dataframe, identify whether the elements of one list are within the other list, and create another column with the missing elements?
The dataframe looks something like this:
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
'D': ['d1', 'd2', 'd3']})
I want to compare if elements of column C are in column B and output the missing values to column E, the desired output is:
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
'D': ['d1', 'd2', 'd3'],
'E': ['b2', ['b1','b2'], '']})
Like your previous related question, you can use a list comprehension. As a general rule, you shouldn't force multiple different types of output, e.g. list or str, depending on result. Therefore, I have chosen lists throughout in this solution.
df['E'] = [list(set(x) - set(y)) for x, y in zip(df['B'], df['C'])]
print(df)
A B C D E
0 a1 [b1, b2] [c1, b1] d1 [b2]
1 a2 [b1, b2, b3] [b3] d2 [b1, b2]
2 a3 [b2] [b2, b1] d3 []
def Desintersection(i):
    Output = [b for b in df['B'][i] if b not in df['C'][i]]
    if len(Output) == 0:
        return ''
    elif len(Output) == 1:
        return Output[0]
    else:
        return Output
df['E'] = df.index.map(Desintersection)
df
Similar to what I did in my previous answer:
(df.B.map(set)-df.C.map(set)).map(list)
Out[112]:
0 [b2]
1 [b2, b1]
2 []
dtype: object
I agree with @jpp that you shouldn't mix the types so much: when you try to apply the same function to the new E column, it will fail, because it expects each element to be a list.
This would work on E, as it converts single str values to [str] before comparison.
import pandas as pd
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
'D': ['d1', 'd2', 'd3']})
def difference(df, A, B):
    elements_to_list = lambda x: [n if isinstance(n, list) else [n] for n in x]
    diff = [list(set(a).difference(set(b))) for a, b in zip(elements_to_list(df[A]), elements_to_list(df[B]))]
    diff = [d if d else "" for d in diff]  # replace empty lists with empty strings
    return [d if len(d) != 1 else d[0] for d in diff]  # extract single values from their lists
df['E'] = difference(df, "B", "C")
df['F'] = difference(df, "B", "E")
print(list(df['E']))
print(list(df['F']))
['b2', ['b2', 'b1'], '']
['b1', 'b3', 'b2']
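One caveat with the set-based answers: set difference does not preserve the order of the elements in B (compare [b2, b1] against [b1, b2] in the two outputs above). If order matters, a minimal order-preserving sketch:

```python
import pandas as pd

df = pd.DataFrame({'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']]})

# Keep B's original ordering; the set is only used for fast membership tests
df['E'] = [[x for x in b if x not in set(c)] for b, c in zip(df['B'], df['C'])]
print(df['E'].tolist())
```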

How to create PySpark DataFrames when we have a large number of columns

|CallID| Customer | Response |
+------+----------------------------------+------------------------------------+
| 1 |Ready to repay the amount. |He is ready to pay $50 by next week.|
| 2 |Mr. John's credit card is blocked.|Asked to verify last 3 transactions.|
| 3 |Mr. Tom is unable to pay bills. |Asked to verify registered email add|
+------+----------------------------------+------------------------------------+
I am selecting individual columns, performing Spelling Correction and joining them back. Here's my code:
1. Selecting individual columns
from textblob import TextBlob
from itertools import islice
from pyspark.sql.functions import monotonically_increasing_id, col, asc
t = df.count()
newColumns = df.schema.names
df_t = df.select(df['Customer'])
s1 = ''
for i in range(t):
    rdd = df_t.rdd
    s = str(rdd.collect()[i][0])
    s1 = s1 + '|' + s
text = str(TextBlob(s1).correct())
l = text.split('|')
rdd2 = sc.parallelize(l)
df1 = rdd2.map(lambda x: (x,)) \
    .mapPartitionsWithIndex(lambda idx, it: islice(it, 1, None) if idx == 0 else it) \
    .toDF([newColumns[1]])
s = s1 = rdd = rdd2 = text = ''
l = []
df_t = df.select(df['Response'])
for i in range(t):
    rdd = df_t.rdd
    s = str(rdd.collect()[i][0])
    s1 = s1 + '|' + s
text = str(TextBlob(s1).correct())
l = text.split('|')
rdd2 = sc.parallelize(l)
df2 = rdd2.map(lambda x: (x,)) \
    .mapPartitionsWithIndex(lambda idx, it: islice(it, 1, None) if idx == 0 else it) \
    .toDF([newColumns[2]])
2. Joining them back
df1 = df1.withColumn("id", monotonically_increasing_id())
df2 = df2.withColumn("id", monotonically_increasing_id())
dffinal = df2.join(df1, "id", "outer").orderBy('id',
ascending=True).drop("id")
3. Final result
| Customer | Response |
+----------------------------------+------------------------------------+
|Ready to repay the amount. |He is ready to pay $50 by next week.|
|Mr. John's credit card is blocked.|Asked to verify last 3 transactions.|
|Mr. Tom is unable to pay bills. |Asked to verify registered email add|
+----------------------------------+------------------------------------+
This is a good approach when we have only a few columns. But is there a way to write generalized code that creates the DataFrames and joins them based on the number of columns, just like an array or list of elements?
Consider the below example:
In [1]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
In [8]: df2 = pd.DataFrame({'E': ['B2', 'B3', 'B6', 'B7'],
'F': ['D2', 'D3', 'D6', 'D7'],
'G': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])
In [9]: result = pd.concat([df1, df2], axis=1, sort=False)
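If the goal is a single generic loop over the columns rather than copy-pasted per-column blocks, the pattern can be factored into a function. A minimal sketch in pandas, with a stand-in fix callable instead of TextBlob (correct_text_columns and fix are names invented here, not an existing API); the same loop shape applies in PySpark by iterating over df.columns and rebuilding each column:

```python
import pandas as pd

def correct_text_columns(df, correct, columns=None):
    # Apply a correction callable to every (or only the selected) string column
    out = df.copy()
    cols = columns if columns is not None else out.select_dtypes(include='object').columns
    for col in cols:
        out[col] = out[col].map(correct)
    return out

# Hypothetical toy correction standing in for TextBlob(...).correct()
fix = lambda s: s.replace('teh', 'the')

calls = pd.DataFrame({'Customer': ['teh client called'],
                      'Response': ['pay teh bill']})
print(correct_text_columns(calls, fix))
```

Because the function copies the frame and loops over column names, adding or removing columns needs no code changes.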

Reorder your dataframe by reordering one column

Having a dataframe which looks like this:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
I wonder how to rearrange the dataframe when one column has been given a different order that one wants to apply to all the others, for example after changing the A column in this example:
df2 = pd.DataFrame({'A': ['A3', 'A0', 'A2', 'A1'],
'B': ['B3', 'B0', 'B2', 'B1'],
'C': ['C3', 'C0', 'C2', 'C1'],
'D': ['D3', 'D0', 'D2', 'D1']},
index=[0, 1, 2, 3])
You can use indexing via set_index, reindex and reset_index. Assumes your values in A are unique, which is the only case where such a transformation would make sense.
L = ['A3', 'A0', 'A2', 'A1']
res = df1.set_index('A').reindex(L).reset_index()
print(res)
A B C D
0 A3 B3 C3 D3
1 A0 B0 C0 D0
2 A2 B2 C2 D2
3 A1 B1 C1 D1
Did you mean to sort one specific row? If so, use:
df1.iloc[:1] = df1.iloc[:1].sort_index(axis=1, ascending=False)
print(df1)
For all columns, use:
df1 = df1.sort_index(axis=0, ascending=False)
For specific columns, use the iloc function.
You can use the key parameter from the sorted function:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
key = {'A3': 0, 'A0': 1, 'A2' : 2, 'A1': 3}
df1['A'] = sorted(df1.A, key=lambda e: key.get(e, 4))
print(df1)
Output
A B C D
0 A3 B0 C0 D0
1 A0 B1 C1 D1
2 A2 B2 C2 D2
3 A1 B3 C3 D3
By changing the values of key, you can set whatever order you want.
UPDATE
If what you want is to alter the order of the other columns based on the new order of A, you could try something like this:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A3', 'A0', 'A2', 'A1'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
key = [df1.A.values.tolist().index(k) for k in df2.A]
df2.B = df2['B'][key].tolist()
print(df2)
Output
A B C D
0 A3 B3 C0 D0
1 A0 B0 C1 D1
2 A2 B2 C2 D2
3 A1 B1 C3 D3
To alter all the columns, just apply the above for each column. Something like this:
for column in df2.columns.values:
    if column != 'A':
        df2[column] = df2[column][key].tolist()
print(df2)
Output
A B C D
0 A3 B3 C3 D3
1 A0 B0 C0 D0
2 A2 B2 C2 D2
3 A1 B1 C1 D1
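The per-column loop above can be collapsed into a single positional take; a minimal sketch, assuming the values in A are unique (as the set_index answer also requires):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

new_order = ['A3', 'A0', 'A2', 'A1']
# Position of each desired A value in the current frame
key = [df1['A'].tolist().index(k) for k in new_order]
res = df1.iloc[key].reset_index(drop=True)
print(res)
```

Because iloc takes whole rows, every column moves together and no per-column bookkeeping is needed.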

A strange error: ValueError: Shape of passed values is (7, 4), indices imply (7, 2)

The code below throws an exception, ValueError: Shape of passed values is (7, 4), indices imply (7, 2).
df4 = pd.DataFrame({'E': ['B2', 'B3', 'B6', 'B7'],
'F': ['D2', 'D3', 'D6', 'D7'],
'G': ['F2', 'F3', 'F6', 'F7']},
index=[2, 2, 6, 7])
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']},
index=[0, 1, 2])
result00 = pd.concat([df1, df4], axis=1,join='inner')
I am confused about the error. How do I merge the two dataframes? The result of the merge I want is like below.
You can use the merge() method:
In [122]: pd.merge(df1, df4, left_index=True, right_index=True)
Out[122]:
A B C D E F G
2 A2 B2 C2 D2 B2 D2 F2
2 A2 B2 C2 D2 B3 D3 F3
You can use pd.concat in the following form:
result00 = pd.concat([df1, df4], axis=1, join_axes = [df4.index], join = 'inner').dropna()
The earlier code did not work since there was a duplicate index in df4. Hope this helps.
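A version note: the join_axes argument was removed in pandas 1.0, so the concat call above fails on current versions. The same inner alignment can be sketched with DataFrame.join, which joins on the index and tolerates the duplicate label in df4:

```python
import pandas as pd

df4 = pd.DataFrame({'E': ['B2', 'B3', 'B6', 'B7'],
                    'F': ['D2', 'D3', 'D6', 'D7'],
                    'G': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 2, 6, 7])
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=[0, 1, 2])

# Inner join on the index: only label 2 is shared, and it appears twice in df4
result = df1.join(df4, how='inner')
print(result)
```

This gives the same two rows as the merge() answer above.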
