I have a dataframe like this:
import numpy as np
import pandas as pd
from collections import Counter
df = pd.DataFrame({'c0': ['app','e','i','owl','u'],'c1': ['p','app','i','g',''],'c2': ['g','p','app','owl','']})
df
c0 c1 c2
0 app p g
1 e app p
2 i i app
3 owl g owl
4 u
I would like to align the rows based on frequency of items.
Required dataframe:
c0 c1 c2
0 app app app
1 i i
2 owl owl
3 e p p
4 u g g
My attempt
all_cols = df.values.flatten()
all_cols = [i for i in all_cols if i]
freq = Counter(all_cols)
freq
I can get you this far:
import pandas as pd
df = pd.DataFrame({'c0': list('aeiou'),'c1': ['p','a','i','g',''],'c2': ['g','p','a','o','']})
allLetters = set(x for x in df.to_numpy().flatten() if x)
binaryIncidence = []
for letter in allLetters:
    binaryIncidence.append(tuple(int(letter in df[col].tolist()) for col in df.columns))
x = list(zip(allLetters, binaryIncidence))
x.sort(key=lambda y:(y[1], -ord(y[0])), reverse=True)
x = [[y[0] if b else '' for b in y[1]] for y in x]
df_results = pd.DataFrame(x, columns=df.columns)
print(df_results)
... with this output:
c0 c1 c2
0 a a a
1 i i
2 o o
3 e
4 u
5 g g
6 p p
However, in the sample output from your question, you show 'e' getting paired up with 'p', 'p', and also 'u' getting paired up with 'g', 'g'. It's not clear to me how this selection would be made.
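If any such pairing is acceptable, a greedy compaction sketch (my assumption, since the question does not specify a selection rule) can merge rows whose filled columns do not overlap, reusing the list x built above:
# greedily merge rows whose non-empty columns are disjoint
packed = []
for row in x:
    for prev in packed:
        if all(not (a and b) for a, b in zip(prev, row)):
            for i, v in enumerate(row):
                if v:
                    prev[i] = v
            break
    else:
        packed.append(list(row))
df_results = pd.DataFrame(packed, columns=df.columns)
Note this happens to pair 'e' with 'g', 'g' and 'u' with 'p', 'p' rather than the pairing shown in the question, which is exactly the ambiguity just described.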
UPDATE: generalize to strings of arbitrary length
This will work not just with strings of length <= 1 but with strings of arbitrary length:
import pandas as pd
df = pd.DataFrame({'c0': ['app','e','i','owl','u'],'c1': ['p','app','i','g',''],'c2': ['g','p','app','owl','']})
allStrings = set(x for x in df.to_numpy().flatten() if x)
binaryIncidence = []
for s in allStrings:
    binaryIncidence.append(tuple(int(s in df[col].tolist()) for col in df.columns))
x = list(zip(allStrings, binaryIncidence))
x.sort(key=lambda y:(tuple(-b for b in y[1]), y[0]))
x = [[y[0] if b else '' for b in y[1]] for y in x]
df_results = pd.DataFrame(x, columns=df.columns)
print(df_results)
Output:
c0 c1 c2
0 app app app
1 i i
2 owl owl
3 e
4 u
5 g g
6 p p
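A note on the sort key: tuple(-b for b in y[1]) orders the incidence patterns descending, so strings present in the leftmost columns come first, and y[0] breaks ties alphabetically within the same pattern.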
I have a string column that I wish to split into three columns depending on the string. The column looks like this
full_string
x a b c
d e
m n o
y m n
y d e f
d e f
x and y are prefixes. I want to convert this column into three columns
prefix_string first_string last_string
x a c
d e
m o
y m n
y d f
d f
I have this code
df['first_string'] = df[df['full_string'].str.split().str.len() == 2]['full_string'].str.split().str[0]
df['first_string'] = df[df['full_string'].str.split().str.len() > 2]['full_string'].str.split().str[1]
df['last_string'] = df['full_string'].str.split().str[-1]
prefix_string = ['x', 'y']
df['prefix_string'] = df[df['full_string'].str.split().str[0].isin(prefix_string)]['full_string'].str.split().str[0]
This code isn't working correctly for first_string. Is there a way to extract the first string irrespective of prefix_string and the string length?
Try with numpy.where and pandas.Series.str.split:
import numpy as np
prefix_str = ["x", "y"]
res = df["full_string"].str.split(" ", expand=True).ffill(axis=1)
res["last_string"] = res.iloc[:, -1]
res["prefix_string"] = np.where(res[0].isin(prefix_str), res[0], "")
res["first_string"] = np.where(res["prefix_string"].ne(""), res[1], res[0])
res = res[["prefix_string", "first_string", "last_string"]]
Outputs:
prefix_string first_string last_string
0 x a c
1 d e
2 m o
3 y m n
4 y d f
5 d f
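Here ffill(axis=1) pads shorter splits to the right, so res.iloc[:, -1] always holds the last token of each row; the two np.where calls then extract the prefix (or an empty string) and shift first_string one position to the right whenever a prefix is present.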
Instead of these lines in your above code:
df['first_string'] = df[df['full_string'].str.split().str.len() == 2]['full_string'].str.split().str[0]
df['first_string'] = df[df['full_string'].str.split().str.len() > 2]['full_string'].str.split().str[1]
make use of the split(), contains(), and fillna() methods:
split_df = df['full_string'].str.split(expand=True)
df['first_string'] = split_df.loc[~split_df[0].str.contains('x|y'), 0]
df['first_string'] = df['first_string'].fillna(split_df[1])
Output of df:
full_string first_string last_string prefix_string
0 x a b c a c x
1 d e d e NaN
2 m n o m o NaN
3 y m n m n y
4 y d e f d f y
5 d e f d f NaN
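For completeness, a minimal sketch that builds all three columns in one pass, assuming (as in the sample) the prefixes are exactly the literal tokens 'x' and 'y':
import numpy as np
import pandas as pd
df = pd.DataFrame({'full_string': ['x a b c', 'd e', 'm n o', 'y m n', 'y d e f', 'd e f']})
parts = df['full_string].str.split() if False else df['full_string'].str.split()  # list of tokens per row
has_prefix = parts.str[0].isin(['x', 'y'])       # assumed literal prefix tokens
df['prefix_string'] = np.where(has_prefix, parts.str[0], '')
df['first_string'] = np.where(has_prefix, parts.str[1], parts.str[0])
df['last_string'] = parts.str[-1]
print(df[['prefix_string', 'first_string', 'last_string']])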
I have two functions applied to a dataframe:
res = df.apply(lambda x:pd.Series(list(x)))
res = res.applymap(lambda x: x.strip('"') if isinstance(x, str) else x)
Update: the dataframe has almost 700,000 rows, and this takes a long time to run.
How can I reduce the running time?
Sample data:
A
----------
0 [1,4,3,c]
1 [t,g,h,j]
2 [d,g,e,w]
3 [f,i,j,h]
4 [m,z,s,e]
5 [q,f,d,s]
output:
A B C D E
-------------------------
0 [1,4,3,c] 1 4 3 c
1 [t,g,h,j] t g h j
2 [d,g,e,w] d g e w
3 [f,i,j,h] f i j h
4 [m,z,s,e] m z s e
5 [q,f,d,s] q f d s
This line of code, res = df.apply(lambda x: pd.Series(list(x))), takes the items from each list and fills them one by one into separate columns, as shown above. There will be almost 38 columns.
I think:
res = df.apply(lambda x:pd.Series(list(x)))
should be changed to:
df1 = pd.DataFrame(df['A'].values.tolist())
print (df1)
0 1 2 3
0 1 4 3 c
1 t g h j
2 d g e w
3 f i j h
4 m z s e
5 q f d s
And second, if the columns do not mix numeric and string values:
cols = res.select_dtypes(object).columns
res[cols] = res[cols].apply(lambda x: x.str.strip('"'))
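Putting both pieces together, a minimal sketch assuming column A holds lists of strings as in the sample (renaming the expanded columns to B..E is illustrative; with ~38 columns you would generate the names instead):
import pandas as pd
df = pd.DataFrame({'A': [['1', '4', '3', 'c'], ['t', 'g', 'h', 'j']]})
# expand the lists to columns in one constructor call, no row-wise apply
expanded = pd.DataFrame(df['A'].tolist(), index=df.index)
# strip quotes only on object (string) columns
cols = expanded.select_dtypes(object).columns
expanded[cols] = expanded[cols].apply(lambda s: s.str.strip('"'))
expanded.columns = ['B', 'C', 'D', 'E']
res = df.join(expanded)
print(res)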
I would like to compare two parts of two different columns from an Excel file that have different numbers of elements. The comparison should be made between a part of Column 3 and a part of Column 2. The Column 3 part has a length of j elements and the Column 2 part has a length of k elements (k > j). The Column 2 part starts from row j+1 and the Column 3 part starts from row 1. If an element from the Column 3 part matches an element from the Column 2 part, then I should check whether the element from Column 1 before row j that has the same index as the matched item from the Column 3 part matches the element from the Column 1 part between rows j+1 and k that has the same index as the matched item from the Column 2 part. If yes, then the element from Column 4 with the same index as the matched element from the Column 2 part should be written to a new Excel sheet.
Example: Column3[1] == Column2[2] (which represents element 'A') => Column1[1] == Column1[j+2] (which represents element 'P') => Column4[j+2] should be written to a new sheet.
Column 1 Column 2 Column 3 Column 4
P F A S
B G X T
C H K V
D I M W
P B R B
P A R D
C D H E
D E J k
E M K W
F F L Q
Q F K Q
For reading the Excel sheet cells from the original sheet, I have used df27.ix[:j-1, 1].
One part of the code which reads the values of the mentioned parts from column 3 and column 2 might be:
for j in range(1, j):
    c3 = sheet['B' + str(j)].value
    for k in range(j, j + k):
        c2 = sheet['B' + str(k)].value
Any hint on how I can accomplish this?
UPDATED
I have tried new code which takes into consideration that we have '-', as joaquin mentioned in his example.
Joaquin's example:
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
New code:
from pandas import DataFrame as df
import pandas as pd
import openpyxl
wb=openpyxl.load_workbook('/media/sf_vboxshared/x.xlsx')
sheet=wb.get_sheet_by_name('Sheet1')
C13=[]
C12=[]
C1=[]
C2=[]
C3=[]
for s in range(2, sheet.max_row+1):
    C1second=sheet['A'+str(s)].value
    C2second=sheet['B'+str(s)].value
    C3second=sheet['C'+str(s)].value
    C1.append(C1second)
    C2.append(C2second)
    C3.append(C3second)
C1=[x.encode('UTF8') for x in C1]
for y in C2:
    if y is not None:
        C2=[x.encode('UTF8') if x is not None else None for x in C2]
for z in C3:
    if z is not None:
        C3=[x.encode('UTF8') if x is not None else None for x in C3]
for x in C1:
    C13.append(x)
for x in C3:
    C13.append(x)
for x in C1:
    C12.append(x)
for x in C2:
    C12.append(x)
tosave = pd.DataFrame()
df[C13]=pd.DataFrame(C13)
df[C12]=pd.DataFrame(C12)
for item in df[C13]:
    if '-' in item: continue
    new = df[df[C12] == item]
    tosave = tosave.append(new)
But I still get the following error: df[C13]=pd.DataFrame(C13) TypeError: 'type' object does not support item assignment. Any idea what is wrong?
Many thanks in advance,
Dan
Given your df is
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
then, I combine C1 with C3, and C1 with C2:
df['C13'] = df.apply(lambda x: x['C1'] + x['C3'], axis=1)
df['C12'] = df.apply(lambda x: x['C1'] + x['C2'], axis=1)
and compare which rows have the same pair of characters in columns C13 and C12, and save them in tosave
tosave = pd.DataFrame()
for item in df['C13']:
if '-' in item: continue
new = df[df['C12'] == item]
tosave = tosave.append(new)
this gives you a tosave dataframe with the rows matching:
C1 C2 C3 C4 C13 C12
5 P A - D P- PA
That can be directly saved as it is, or you can save just column C4.
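A sketch of the save step (the output file and sheet names are illustrative):
# write just column C4 to a sheet of a separate output file
with pd.ExcelWriter('matches.xlsx') as writer:
    tosave[['C4']].to_excel(writer, sheet_name='matches', index=False)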
UPDATE: If you have data in each row, then you cannot use the '-' detection (or any other kind of detection based on the difference between empty and filled columns). On the other hand, if j and k are not defined (for any j and k), your problem actually reduces to finding, for each row, identical pairs below that row. Consequently, this:
tosave = pd.DataFrame()
for idx, item in enumerate(df['C13']):
new = df[df['C12'] == item]
tosave = tosave.append(new.loc[idx+1:])
solves the problem, given that your labels and data look like:
C1 C2 C3 C4
0 P F A S
1 B G X T
2 C H K V
3 D I M W
4 P B R B
5 P A R D
6 C D H E
7 D E J k
8 E M K W
9 F F L Q
10 Q F K Q
This code also produces the same output as before:
C1 C2 C3 C4 C13 C12
5 P A R D PR PA
Note this probably needs some refinement (e.g. when a row produces 2 matches, the second row will produce 1 match, and you will need to remove duplicates from the final output).
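A possible refinement sketch for that case, assuming duplicated matches should simply be dropped from the final output:
tosave = tosave.drop_duplicates()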
I have a table in pandas df
id_x id_y
a b
b c
a c
d a
x a
m b
c z
a k
b q
d w
a w
q v
How to read this table: the combinations for a are a-b, a-c, a-k, a-w, and similarly for b (b-c, b-q), and so on.
I want to write a function, def test_func(id), which takes an id_x from the df
and checks whether the number of occurrences of that id is at least 3, which may be done with df['id_x'].value_counts().
For example:
def test_func(id):
    if id_count >= 3:
        print 'yes'
        ddf = df[df['id_x'] == id]
        ddf.to_csv(id+".csv")
    else:
        print 'no'
        while id_count < 3:
            # do something (I've explained below what I have to do when count < 3)
Say for b the occurrence is only 2 (i.e. b-c and b-q), which is less than 3.
So in such a case, look at whether 'c' (from id_y) has any combinations.
c has 1 combination (c-z) and similarly q has 1 combination (q-v),
thus b should be linked with z and v:
id_x id_y
b c
b q
b z
b v
and store it in ddf2, like we stored ddf for the >= 3 case.
Also, for a particular id, I would like the csv saved with the name of that id.
I hope I explained my question correctly. I am very new to Python and I don't know how to write functions; this was my logic.
Can anyone help me with the implementation part?
Thanks in advance.
Edited: solution redesigned according to comments.
import pandas as pd
def direct_related(df, values, column_names=('x', 'y')):
    rels = set()
    for value in values:
        for i, v in df[df[column_names[0]] == value][column_names[1]].iteritems():
            rels.add(v)
    return rels

def indirect_related(df, values, recursion=1, column_names=('x', 'y')):
    rels = direct_related(df, values, column_names)
    for i in range(recursion):
        rels = rels.union(direct_related(df, rels, column_names))
    return rels

def related(df, value, recursion=1, column_names=('x', 'y')):
    rels = indirect_related(df, [value], recursion, column_names)
    return pd.DataFrame(
        {
            column_names[0]: value,
            column_names[1]: list(rels)
        }
    )

def min_related(df, value, min_appearances=3, max_recursion=10, column_names=('x', 'y')):
    for i in range(max_recursion + 1):
        if len(indirect_related(df, [value], i, column_names)) >= min_appearances:
            return related(df, value, i, column_names)
    return None

df = pd.DataFrame(
    {
        'x': ['a', 'b', 'a', 'd', 'x', 'm', 'c', 'a', 'b', 'd', 'a', 'q'],
        'y': ['b', 'c', 'c', 'a', 'a', 'b', 'z', 'k', 'q', 'w', 'w', 'v']
    }
)
print(min_related(df, 'b', 3))
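For the sample data this prints b paired with c, q, z and v; the order of the y column may vary, since the values come from a set.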
First, filter the DataFrame by group size (testing for < 3):
a = df.groupby('id_x').filter(lambda x: len(x) < 3)
print (a)
id_x id_y
1 b c
3 d a
4 x a
5 m b
6 c z
8 b q
9 d w
11 q v
Then filter the rows where id_x is b and rename the columns:
a1 = a.query("id_x == 'b'").rename(columns={'id_y':'id'})
print (a1)
id_x id
1 b c
8 b q
Also filter the rows where id_y is NOT b, renaming id_x to id:
a2 = a.query("id_y != 'b'").rename(columns={'id_x':'id'})
print (a2)
id id_y
1 b c
3 d a
4 x a
6 c z
8 b q
9 d w
11 q v
Then merge by column id:
b = pd.merge(a1,a2, on='id').drop('id', axis=1)
print (b)
id_x id_y
0 b z
1 b v
Last, concat the rows filtered for b to the new dataframe b:
c = pd.concat([a.query("id_x == 'b'"), b])
print (c)
id_x id_y
1 b c
8 b q
0 b z
1 b v
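To tie this back to the test_func the question asks for, a sketch assembling the steps above (the per-id CSV naming follows the question; the threshold of 3 is as stated there):
import pandas as pd

def test_func(df, id_):
    sub = df[df['id_x'] == id_]
    if len(sub) >= 3:
        # enough direct combinations: save them as-is
        sub.to_csv(id_ + '.csv', index=False)
        return sub
    # otherwise extend through id_y, exactly as in the steps above
    a = df.groupby('id_x').filter(lambda x: len(x) < 3)
    a1 = a.query("id_x == @id_").rename(columns={'id_y': 'id'})
    a2 = a.query("id_y != @id_").rename(columns={'id_x': 'id'})
    b = pd.merge(a1, a2, on='id').drop('id', axis=1)
    out = pd.concat([sub, b])
    out.to_csv(id_ + '.csv', index=False)
    return out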
Can a new index be applied to a DF according to a grouping made with groupby? Specifically: is there an elegant way to do that, and can the original DF be changed through groupby groups at all?
UPD:
My data looks like this:
A B C
0 a x 0.903343
1 a z 0.982050
2 g x 0.274823
3 g y 0.334491
4 c z 0.756728
5 f z 0.697841
6 d z 0.505845
7 b z 0.768199
8 b y 0.743012
9 e x 0.697212
I group by columns 'A' and 'B', and I want every unique pair of values from those columns to get the same index value in the original DF. Also, the original DF can be big, and I'm trying to figure out how to do such a reindex without inefficiently forming a whole new DF.
Currently I'm using this solution:
import pandas as pd
import random
from string import ascii_lowercase
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in xrange(10)],
                   'B': [random.choice(['x', 'y']) for _ in xrange(10)],
                   'C': [random.random() for _ in xrange(10)]})
df['id'] = None
new_df = pd.DataFrame()
for i, (n, g) in enumerate(df.groupby(['A', 'B'])):
    g['id'] = i
    new_df = new_df.append(g)
new_df.set_index('id', inplace=True)
You can do this quickly with some internal functions in pandas.
Create test DataFrame first:
import pandas as pd
import random
from string import ascii_lowercase
random.seed(1)
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in xrange(10)],
                   'B': [random.choice(['x', 'y']) for _ in xrange(10)],
                   'C': [random.random() for _ in xrange(10)]})
If you want the new id in the same order as columns A & B:
m = pd.MultiIndex.from_arrays((df.A, df.B))
df.index = pd.factorize(pd.lib.fast_zip(m.labels), sort=True)[0]
print df
The output is:
A B C
1 a y 0.025446
7 e x 0.541412
6 d y 0.939149
2 b x 0.381204
3 c x 0.216599
4 c y 0.422117
5 d x 0.029041
6 d y 0.221692
1 a y 0.437888
0 a x 0.495812
If you don't care about the order of the new id:
m = pd.MultiIndex.from_arrays((df.A, df.B))
la, lb = m.labels
df.index = pd.factorize(la*len(lb)+lb)[0]
print df
The output is:
A B C
0 a y 0.025446
1 e x 0.541412
2 d y 0.939149
3 b x 0.381204
4 c x 0.216599
5 c y 0.422117
6 d x 0.029041
2 d y 0.221692
0 a y 0.437888
7 a x 0.495812
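On newer pandas (0.20.2+), groupby(...).ngroup() produces this kind of group id directly, without touching pandas internals; a sketch:
# same ordering as the sorted (A, B) pairs
df.index = df.groupby(['A', 'B']).ngroup()
# or number the groups in order of first appearance instead
df.index = df.groupby(['A', 'B'], sort=False).ngroup()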