I would like to compare two parts of two different columns from an Excel file that have a different number of elements: a part of Column 3 against a part of Column 2. The Column 3 part has j elements and the Column 2 part has k elements (k > j). The Column 2 part starts at row j+1 and the Column 3 part starts at row 1. If an element from the Column 3 part matches an element from the Column 2 part, I should then check whether the Column 1 element in rows 1..j with the same index as the matched Column 3 item matches the Column 1 element in rows j+1..k with the same index as the matched Column 2 item. If it does, the Column 4 element with the same index as the matched Column 2 element should be written to a new Excel sheet.
Example: Column3[1] == Column2[2] (which represents element 'A') => Column1[1] == Column1[j+2] (which represents element 'P') => Column4[j+2] should be written to a new sheet.
Column 1 Column 2 Column 3 Column 4
P F A S
B G X T
C H K V
D I M W
P B R B
P A R D
C D H E
D E J k
E M K W
F F L Q
Q F K Q
For reading the Excel sheet cells from the original sheet, I have used df27.ix[:j-1, 1].
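(Note: .ix is deprecated in recent pandas; assuming the default integer index, the positional equivalent would be:)
df27.iloc[:j, 1]   # first j rows of the second column (.ix[:j-1] slicing was label-inclusive)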
One part of the code which reads the values of the mentioned parts from column 3 and column 2 might be:
for r3 in range(1, j + 1):            # Column 3 part: rows 1..j
    c3 = sheet['C' + str(r3)].value
    for r2 in range(j + 1, k + 1):    # Column 2 part: rows j+1..k
        c2 = sheet['B' + str(r2)].value
Any hint how I can accomplish this?
UPDATED
I have tried a new code which takes into consideration that we have '-', as joaquin mentioned in his example.
Joaquin's example:
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
New code:
from pandas import DataFrame as df
import pandas as pd
import openpyxl
wb=openpyxl.load_workbook('/media/sf_vboxshared/x.xlsx')
sheet=wb.get_sheet_by_name('Sheet1')
C13=[]
C12=[]
C1=[]
C2=[]
C3=[]
for s in range(2, sheet.max_row+1):
    C1second=sheet['A'+str(s)].value
    C2second=sheet['B'+str(s)].value
    C3second=sheet['C'+str(s)].value
    C1.append(C1second)
    C2.append(C2second)
    C3.append(C3second)
C1=[x.encode('UTF8') for x in C1]
for y in C2:
    if y is not None:
        C2=[x.encode('UTF8') if x is not None else None for x in C2]
for z in C3:
    if z is not None:
        C3=[x.encode('UTF8') if x is not None else None for x in C3]
for x in C1:
    C13.append(x)
for x in C3:
    C13.append(x)
for x in C1:
    C12.append(x)
for x in C2:
    C12.append(x)
tosave = pd.DataFrame()
df[C13]=pd.DataFrame(C13)
df[C12]=pd.DataFrame(C12)
for item in df[C13]:
    if '-' in item: continue
    new = df[df[C12] == item]
    tosave = tosave.append(new)
But I still get the following error on df[C13]=pd.DataFrame(C13): TypeError: 'type' object does not support item assignment. Any idea what is wrong?
Many thanks in advance,
Dan
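(For reference, the TypeError arises because from pandas import DataFrame as df binds the name df to the DataFrame class itself, so df[C13] = ... is item assignment on a type; the lists C13/C12 are also being used as column labels. A minimal sketch of the likely intent:)
df = pd.DataFrame()     # an instance, not the DataFrame class
df['C13'] = C13         # string label; the list C13 itself is the column data
df['C12'] = C12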
Given your df is
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
Then, I combine C1 with C3, and C1 with C2:
df['C13'] = df.apply(lambda x: x['C1'] + x['C3'], axis=1)
df['C12'] = df.apply(lambda x: x['C1'] + x['C2'], axis=1)
and then compare which rows have the same pair of characters in columns C13 and C12, saving the matches in tosave:
tosave = pd.DataFrame()
for item in df['C13']:
    if '-' in item: continue
    new = df[df['C12'] == item]
    tosave = tosave.append(new)
this gives you a tosave dataframe with the rows matching:
C1 C2 C3 C4 C13 C12
5 P A - D P- PA
That can be directly saved as it is, or you can save just column C4.
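(For completeness, a possible way to write the result to a new Excel sheet; the output file and sheet names are placeholders:)
tosave['C4'].to_excel('matches.xlsx', sheet_name='matches', index=False)   # hypothetical file name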
UPDATE: If you have data in every row, then you cannot use the '-' detection (or any other kind of detection based on the difference between empty and filled columns). On the other hand, if j and k are not defined (i.e. for any j and k), your problem actually reduces to finding, for each row, identical pairs below that row. Consequently, this:
tosave = pd.DataFrame()
for idx, item in enumerate(df['C13']):
    new = df[df['C12'] == item]
    tosave = tosave.append(new.loc[idx+1:])
solves the problem, given your labels and data are like:
C1 C2 C3 C4
0 P F A S
1 B G X T
2 C H K V
3 D I M W
4 P B R B
5 P A R D
6 C D H E
7 D E J k
8 E M K W
9 F F L Q
10 Q F K Q
This code also produces the same output as before:
C1 C2 C3 C4 C13 C12
5 P A R D PR PA
Note this probably needs some refinement (e.g. when a row produces two matches, the second row will produce one match, and you will need to remove duplicates from the final output).
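(A minimal sketch of that deduplication, assuming the matched rows keep their original index from df:)
tosave = tosave[~tosave.index.duplicated(keep='first')]   # drop rows appended more than once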
Related
My problem is a bit tricky (similar to an SQL merge/update) and I don't understand how to fix it. (I am giving a small sample of the dataframes below.)
I have two dataframes :
dfOld
A B C D E
x1 x2 g h r
q1 q2 x y s
t1 t2 h j u
p1 p2 r s t
AND
dfNew
A B C D E
x1 x2 a b c
s1 s2 p q r
t1 t2 h j u
q1 q2 x y z
We want to merge the dataframes with the following rules (we can think of ColA & ColB as keys):
For any ColA & ColB combination, if C/D/E are an exact match, it takes the value from either dataframe; if any value has changed in C/D/E, it takes the value from the new dataframe; if a new ColA/ColB combination is in dfNew, it takes those values; and if the ColA/ColB combination does not exist in dfNew, it takes the value from dfOld.
So my OutPut should be like:
A B C D E
x1 x2 a b c
q1 q2 x y z
t1 t2 h j u
p1 p2 r s t
s1 s2 p q r
I was trying :
mydfL = (df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
mydfR = (df1.merge(df,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
dfO = pd.concat([mydfL,mydfR])
dfO.drop("_merge", axis=1, inplace=True)
My output looks like (I kept the index for clarity):
A B C D E
0 x1 x2 a b c
2 s1 s2 p q r
3 q1 q2 x y z
0 x1 x2 g h r
2 q1 q2 x y s
3 p1 p2 r s t
However, this output does not serve my purpose. First and foremost, it does not include the totally identical row (between dfOld and dfNew), which consists of:
t1 t2 h j u
Next, for the ColA/ColB keys x1/x2 and q1/q2 it includes the rows from both dataframes, where I just wanted the updated values in ColC/D/E from the new dataframe (dfNew).
So can I get some help as to what I am missing, and what may be a better and more elegant way to do this? Thanks in advance.
You can use combine_first, using A/B as a temporary index:
out = (dfNew.set_index(['A', 'B'])
            .combine_first(dfOld.set_index(['A', 'B']))
            .reset_index())
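(Applied to the samples above, this keeps dfNew's values wherever an A/B key exists in both frames and falls back to dfOld otherwise; note the rows come out sorted by the temporary A/B index. A runnable sketch:)
import pandas as pd

dfOld = pd.DataFrame({'A': ['x1', 'q1', 't1', 'p1'], 'B': ['x2', 'q2', 't2', 'p2'],
                      'C': ['g', 'x', 'h', 'r'], 'D': ['h', 'y', 'j', 's'],
                      'E': ['r', 's', 'u', 't']})
dfNew = pd.DataFrame({'A': ['x1', 's1', 't1', 'q1'], 'B': ['x2', 's2', 't2', 'q2'],
                      'C': ['a', 'p', 'h', 'x'], 'D': ['b', 'q', 'j', 'y'],
                      'E': ['c', 'r', 'u', 'z']})

out = (dfNew.set_index(['A', 'B'])
            .combine_first(dfOld.set_index(['A', 'B']))
            .reset_index())
print(out)   # rows for p1/q1/s1/t1/x1, with dfNew values winning on overlap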
I have a rather large pandas dataframe (5k rows, 30 cols). I need to do as described below. I tried this pseudocode:
for i, row in main_df.iterrows():
    for j, sub in sub_df.iterrows():
        if sub[['v1', 'v2']] matches part of row:
            main_df.at[i, "sub_uid"] = sub["sub_uid"]
But this does not seem to work, or is just too hard for me to debug. (It is also absurdly time-consuming.)
I am basically out of ideas and hope for your help guys :)
main_df:
v1 v2 vx3 vx4
1 a b h j
2 a b n p
3 a c r g
4 d e p j
sub_df: take only some of main_df's columns and drop duplicates, then assign uids for all combinations of the v1/v2 parameters:
v1 v2 sub_uid
1 a b 01
2 a c 02
3 d e 03
Now back to main_df: add a column for sub_uids. For each record, determine the sub_uid using sub_df:
v1 v2 vx3 vx4 sub_uid
1 a b h j 01
2 a b n p 01
3 a c r g 02
4 d e p j 03
Use GroupBy.ngroup to directly assign sub_uid to main_df without creating sub_df:
In [2473]: df['sub_uid'] = df.groupby(['v1', 'v2']).ngroup().add(1)
In [2474]: df
Out[2474]:
v1 v2 vx3 vx4 sub_uid
1 a b h j 1
2 a b n p 1
3 a c r g 2
4 d e p j 3
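(If sub_df itself is still wanted, a possible sketch; the zero-padded uids shown in the question are an assumption handled here with str.zfill:)
# build sub_df from the unique v1/v2 combinations
sub_df = main_df[['v1', 'v2']].drop_duplicates().reset_index(drop=True)
sub_df['sub_uid'] = (sub_df.index + 1).astype(str).str.zfill(2)  # '01', '02', ...
# map the uids back onto main_df with a left merge
main_df = main_df.merge(sub_df, on=['v1', 'v2'], how='left')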
I have a dataframe of 9,000 columns and 100 rows. I want to insert a column after every 3rd column such that its value is equal to 50 for all rows.
Existing DataFrame
0 1 2 3 4 5 6 7 8 9....9000
0 a b c d e f g h i j ....x
1 k l m n o p q r s t ....x
.
.
100 u v w x y z aa bb cc....x
Desired DataFrame
0 1 2 3 4 5 6 7 8 9....12000
0 a b c 50 d e f 50 g h i j ....x
1 k l m 50 n o p 50 q r s t ....x
.
.
100 u v w 50 x y z 50 aa bb cc....x
Create a new DataFrame by indexing every 3rd column, add .5 to those column labels for correct sorting, and add it to the original with concat:
df.columns = np.arange(len(df.columns))
df1 = pd.DataFrame(50, index=df.index, columns= df.columns[2::3] + .5)
df2 = pd.concat([df, df1], axis=1).sort_index(axis=1)
df2.columns = np.arange(len(df2.columns))
print (df2)
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t
Numpy
# How many columns to group
x = 3
# Get the shape of things
a = df.to_numpy()
m, n = a.shape
k = n // x
# Get only a multiple of x columns and reshape
b = a[:, :k * x].reshape(m, k, x)
# Get the other columns missed by b
c = a[:, k * x:]
# array of 50's that we'll append to the last dimension
_50 = np.ones((m, k, 1), np.int64) * 50
# append 50's and reshape back to 2D
d = np.append(b, _50, axis=2).reshape(m, k * (x + 1))
# Create DataFrame while appending the missing bit
pd.DataFrame(np.append(d, c, axis=1))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t
Setup
df = pd.DataFrame(np.reshape([*'abcdefghijklmnopqrst'], (2, -1)))
So here is one solution: group the columns in chunks of three with np.arange(df.shape[1]) // 3, append a column of 50 to each chunk with assign, then renumber the columns:
s = pd.concat([y.assign(new=50)
               for x, y in df.groupby(np.arange(df.shape[1]) // 3, axis=1)],
              axis=1)
s.columns = np.arange(s.shape[1])
Create column E that is filled from column C. If D is < 10, then it should combine the C of the earlier row and the current row.
This is my Input DataSet:
I,A,B,C,D
1,P,100+,L,15
2,P,100+,M,9
3,P,100+,N,15
4,P,100+,O,15
5,Q,100+,L,2
6,Q,100+,M,15
7,Q,100+,N,3
8,Q,100+,O,15
I tried using some for loops. However, I think we can use the shift or append functions to accomplish this; however, I am getting value errors when using the shift function.
Desired Output:
I,A,B,C,D,E
1,P,100+,L,15,L
2,P,100+,M,9,M+N
3,P,100+,N,15,M+N
4,P,100+,O,15,O
5,Q,100+,L,2,L+O
6,Q,100+,M,15,M+N
7,Q,100+,N,3,M+N
8,Q,100+,O,15,L+O
I am working out the column E given in desired output table above.
Using np.where and shift:
## will populate with the C value at index+1 where the condition is True
df['E'] = np.where(df['D'] < 10, df.loc[df.index + 1, 'C'], df['C'])
## append the values of C and E
df['E'] = df.apply(lambda x: x.C + '+' + x.E if x.C != x.E else x.C, axis=1)
df['F'] = df['E'].shift(1)
## copy the value at the index+1 position where the condition is True
df['E'] = df.apply(lambda x: x.F if '+' in str(x.F) else x.E, axis=1)
df.drop('F', axis=1, inplace=True)
Output
I A B C D E
0 1 P 100+ L 15 L
1 2 P 100+ M 9 M+N
2 3 P 100+ N 15 M+N
3 4 P 100+ O 15 O
4 5 Q 100+ L 2 L+M
5 6 Q 100+ M 15 L+M
6 7 Q 100+ N 3 N+O
7 8 Q 100+ O 15 N+O
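(Note: on recent pandas versions df.loc[df.index + 1, 'C'] raises a KeyError, because the last shifted label does not exist; shift(-1) is a safer equivalent. A possible sketch producing the same output:)
# pair each row's C with the next row's C where D < 10
m = df['D'] < 10
df['E'] = np.where(m, df['C'] + '+' + df['C'].shift(-1, fill_value=''), df['C'])
# propagate the combined value down to the row following a D < 10 row
prev = m.shift(1, fill_value=False)
df.loc[prev, 'E'] = df['E'].shift(1)[prev]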
The idea is to create helper groups by replacing index values by the mask with Series.where and forward filling only one missing value, then set the new column with numpy.where, GroupBy.transform and join:
m = df['D'].lt(10)
g = df.index.to_series().where(m).ffill(limit=1)
df['E'] = np.where(g.notna(), df['C'].groupby(g.fillna(-1)).transform('+'.join), df['C'])
print (df)
I A B C D E
0 1 P 100+ L 15 L
1 2 P 100+ M 9 M+N
2 3 P 100+ N 15 M+N
3 4 P 100+ O 15 O
4 5 Q 100+ L 2 L+M
5 6 Q 100+ M 15 L+M
6 7 Q 100+ N 3 N+O
7 8 Q 100+ O 15 N+O
I have two functions applied to a dataframe:
res = df.apply(lambda x:pd.Series(list(x)))
res = res.applymap(lambda x: x.strip('"') if isinstance(x, str) else x)
Update: The dataframe has almost 700,000 rows, so this takes a long time to run. How can I reduce the running time?
Sample data :
A
----------
0 [1,4,3,c]
1 [t,g,h,j]
2 [d,g,e,w]
3 [f,i,j,h]
4 [m,z,s,e]
5 [q,f,d,s]
output:
A B C D E
-------------------------
0 [1,4,3,c] 1 4 3 c
1 [t,g,h,j] t g h j
2 [d,g,e,w] d g e w
3 [f,i,j,h] f i j h
4 [m,z,s,e] m z s e
5 [q,f,d,s] q f d s
The line of code res = df.apply(lambda x: pd.Series(list(x))) takes items from a list and fills them one by one into each column, as shown above. There will be almost 38 columns.
I think:
res = df.apply(lambda x:pd.Series(list(x)))
should be changed to:
df1 = pd.DataFrame(df['A'].values.tolist())
print (df1)
0 1 2 3
0 1 4 3 c
1 t g h j
2 d g e w
3 f i j h
4 m z s e
5 q f d s
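(If the original list column should be kept alongside the new ones, as in the desired output, a possible sketch; the column letters B-E are assumptions taken from the sample:)
df1 = pd.DataFrame(df['A'].values.tolist(), index=df.index, columns=list('BCDE'))
out = df.join(df1)   # keeps column A next to the split-out columns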
And second, if the column values are not mixed (numeric with strings):
cols = res.select_dtypes(object).columns
res[cols] = res[cols].apply(lambda x: x.str.strip('"'))