I have a table in a pandas DataFrame:
id_x id_y
a    b
b    c
a    c
d    a
x    a
m    b
c    z
a    k
b    q
d    w
a    w
q    v
Here is how to read this table: the combinations for a are a-b, a-c, a-k, a-w; similarly for b (b-c, b-q), and so on.
I want to write a function, def test_func(id), which takes an id_x from the df
and checks whether the number of occurrences of that id is at least 3, which may be done with df['id_x'].value_counts().
For example:
def test_func(id):
    id_count = df['id_x'].value_counts().get(id, 0)
    if id_count >= 3:
        print('yes')
        ddf = df[df['id_x'] == id]
        ddf.to_csv(id + ".csv")
    else:
        print('no')
        # do something (I've explained below what I have to do when the count is < 3)
Say for b the occurrence is only 2 (i.e. b-c and b-q), which is less than 3.
In such a case, look whether the id_y values ('c' and 'q') have any combinations of their own.
c has 1 combination (c-z), and similarly q has 1 combination (q-v);
thus b should be linked with z and v:
id_x id_y
b    c
b    q
b    z
b    v
and store it in ddf2, like we stored ddf for counts >= 3.
Also, for each particular id, I would like the CSV saved under the name of that id.
I hope I explained my question correctly. I am very new to Python and I don't know how to write functions; this was my logic.
Can anyone help me with the implementation part?
Thanks in advance.
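Here is a minimal sketch of the two branches, assuming the below-3 case only needs to follow each id_y one step (the one-step linking and the helper names are my own, not from the question):

import pandas as pd

def test_func(df, id):
    # count how many times this id appears in id_x
    id_count = (df['id_x'] == id).sum()
    if id_count >= 3:
        ddf = df[df['id_x'] == id]
    else:
        # follow each id_y one step: b-c plus c-z becomes b-z
        direct = df.loc[df['id_x'] == id, 'id_y']
        linked = df[df['id_x'].isin(direct)].assign(id_x=id)
        ddf = pd.concat([df[df['id_x'] == id], linked[['id_x', 'id_y']]])
    # save the result under the id's own name, as asked
    ddf.to_csv(id + ".csv", index=False)
    return ddf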
Edit: solution redesigned according to the comments
import pandas as pd

def direct_related(df, values, column_names=('x', 'y')):
    rels = set()
    for value in values:
        # .items() replaces the long-deprecated Series.iteritems()
        for i, v in df[df[column_names[0]] == value][column_names[1]].items():
            rels.add(v)
    return rels

def indirect_related(df, values, recursion=1, column_names=('x', 'y')):
    rels = direct_related(df, values, column_names)
    for i in range(recursion):
        rels = rels.union(direct_related(df, rels, column_names))
    return rels

def related(df, value, recursion=1, column_names=('x', 'y')):
    rels = indirect_related(df, [value], recursion, column_names)
    return pd.DataFrame(
        {
            column_names[0]: value,
            column_names[1]: list(rels)
        }
    )

def min_related(df, value, min_appearances=3, max_recursion=10, column_names=('x', 'y')):
    for i in range(max_recursion + 1):
        if len(indirect_related(df, [value], i, column_names)) >= min_appearances:
            return related(df, value, i, column_names)
    return None

df = pd.DataFrame(
    {
        'x': ['a', 'b', 'a', 'd', 'x', 'm', 'c', 'a', 'b', 'd', 'a', 'q'],
        'y': ['b', 'c', 'c', 'a', 'a', 'b', 'z', 'k', 'q', 'w', 'w', 'v']
    }
)

print(min_related(df, 'b', 3))
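For the sample data this should print the four pairs for b found one recursion step deep; the row order can vary because the pairs come out of a set:

   x  y
0  b  c
1  b  q
2  b  z
3  b  v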
First, filter the DataFrame to keep only ids whose group length is less than 3:
a = df.groupby('id_x').filter(lambda x: len(x) < 3)
print (a)
id_x id_y
1 b c
3 d a
4 x a
5 m b
6 c z
8 b q
9 d w
11 q v
Then filter the rows where id_x is b and rename id_y to id:
a1 = a.query("id_x == 'b'").rename(columns={'id_y':'id'})
print (a1)
id_x id
1 b c
8 b q
Also filter the rows where id_y is NOT b and rename id_x to id:
a2 = a.query("id_y != 'b'").rename(columns={'id_x':'id'})
print (a2)
id id_y
1 b c
3 d a
4 x a
6 c z
8 b q
9 d w
11 q v
Then merge by column id:
b = pd.merge(a1,a2, on='id').drop('id', axis=1)
print (b)
id_x id_y
0 b z
1 b v
Last, concat the rows filtered by b with the new DataFrame b:
c = pd.concat([a.query("id_x == 'b'"), b])
print (c)
id_x id_y
1 b c
8 b q
0 b z
1 b v
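If the same expansion is needed for more than one id, the steps above can be wrapped into a small helper (a sketch; expand_id and max_count are my own names, and pd/df are as above):

def expand_id(df, value, max_count=3):
    # keep only ids that appear fewer than max_count times
    a = df.groupby('id_x').filter(lambda x: len(x) < max_count)
    # direct pairs for this id, with id_y renamed so we can merge on it
    a1 = a[a['id_x'] == value].rename(columns={'id_y': 'id'})
    # candidate follow-up pairs, keyed by their own id_x
    a2 = a[a['id_y'] != value].rename(columns={'id_x': 'id'})
    # link value -> id -> id_y, then drop the middle column
    b = pd.merge(a1, a2, on='id').drop('id', axis=1)
    return pd.concat([a[a['id_x'] == value], b])

ddf2 = expand_id(df, 'b')
ddf2.to_csv('b.csv', index=False)  # saved under the id's name, as the question asks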
I want to divide Columns_A and Columns_B below into 3 columns.
The approach I am thinking of (but have no idea how to write in Python):
break down Columns_A and Columns_B into 3 columns
merge pass_one, pass_two and pass_three
append Columns_C and Columns_D to the longest list of values
Original data (I changed it to a list of lists):
Columns_A Columns_B Columns_C Columns_D
1         A         X         Y
1         A         X         Y
1         A         X         Y
2         B         X         Y
2         B         X         Y
3         C         X         Y
3         C         X         Y
3         C         X         Y
3         C         X         Y
11        D         Z         Q
12        E         Z         Q
12        E         Z         Q
12        E         Z         Q
13        F         Z         Q
13        F         Z         Q
What I would like to create:
Columns_A_1 Columns_B_1 Columns_A_2 Columns_B_2 Columns_A_3 Columns_B_3 Columns_C Columns_D
1           A           2           B           3           C           X         Y
1           A           2           B           3           C           X         Y
1           A           Blank       Blank       3           C           X         Y
Blank       Blank       Blank       Blank       3           C           X         Y
11          D           12          E           13          F           Z         Q
Blank       Blank       12          E           13          F           Z         Q
Blank       Blank       12          E           Blank       Blank       Z         Q
Code that I tried but didn't work (no error, but pass_two and pass_three output blank):
# (1) break down Columns_A and Columns_B into 3 columns
!pip install pandas
import pandas as pd

dic = {'Column_A': ["1","1","1","2","2","3","3","3","3","11","12","12","12","13","13"],
       'Column_B': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'E', 'E', 'E', 'F', 'F'],
       'Column_C': ['X'] * 9 + ['Z'] * 6,
       'Column_D': ['Y'] * 9 + ['Q'] * 6,}
df = pd.DataFrame(dic)
list_data = df.values.tolist()

pass_one = []    # Columns_A_1 and Columns_B_1
pass_two = []    # Columns_A_2 and Columns_B_2
pass_three = []  # Columns_A_3 and Columns_B_3

for row in list_data:
    Columns_A = row[0]
    Columns_B = row[1]
    Columns_C = row[2]
    Columns_D = row[3]
    list_one = [Columns_A, Columns_B]  # I would like to append these data sets
    if Columns_C in Columns_C and Columns_A not in Columns_A:
        pass_two.append(list_one)
    if Columns_C in Columns_C and Columns_A not in Columns_A:
        pass_three.append(list_one)
    else:
        pass_one.append(list_one)
Once Columns_A and Columns_B are separated into 3 lists of lists:
I would like to merge pass_one, pass_two and pass_three.
At last, append Columns_C and Columns_D to the longest list of values.
Does anyone have any ideas how to do this?
This is not a complete answer, but perhaps it'll get you one step further. I assumed your sort criterion was Column_A mod 10:
# create the column we can group by: Column_A as an integer, mod 10
df['Column_A_sort'] = df['Column_A'].astype(int) % 10
# group by that value
g = df.groupby('Column_A_sort')
Iterating over the groups:
for i in g.groups:
    print(g.get_group(i))
prints:
Column_A Column_B Column_C Column_D Column_A_sort
0 1 A X Y 1
1 1 A X Y 1
2 1 A X Y 1
9 11 D Z Q 1
Column_A Column_B Column_C Column_D Column_A_sort
3 2 B X Y 2
4 2 B X Y 2
10 12 E Z Q 2
11 12 E Z Q 2
12 12 E Z Q 2
Column_A Column_B Column_C Column_D Column_A_sort
5 3 C X Y 3
6 3 C X Y 3
7 3 C X Y 3
8 3 C X Y 3
13 13 F Z Q 3
14 13 F Z Q 3
As ignoring_gravity suggests, in order to go further, it'd be helpful to understand exactly your criteria for sorting and recombining the columns.
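If the mod-10 assumption does hold, here is one way to continue from there (a sketch, not a definitive solution; it reuses the Column_A_sort column from above, needs pandas >= 1.1 for the list-valued pivot index, and the slot column name is my own): number each value's occurrences within its Column_C block with cumcount, then pivot so each mod-10 class becomes a column pair.

# occurrence number of each Column_A value within its C/D block
df['slot'] = df.groupby(['Column_C', 'Column_A']).cumcount()

wide = df.pivot(index=['Column_C', 'Column_D', 'slot'],
                columns='Column_A_sort',
                values=['Column_A', 'Column_B'])
# order the columns as A/B pairs per class and flatten the names
wide = wide.sort_index(axis=1, level=1)
wide.columns = [f'{name}_{cls}' for name, cls in wide.columns]
wide = wide.reset_index().drop(columns='slot').fillna('Blank')
print(wide)

This should produce one row per occurrence slot, with 'Blank' wherever a class ran out of rows, which matches the shape of the desired output above.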
I have a list of objects like in the test variable below:
import dataclasses
from typing import List

@dataclasses.dataclass
class A:
    a: float
    b: float
    c: float

@dataclasses.dataclass
class B:
    prop: str
    attr: List["A"]

test = [
    B("z", [A('a', 'b', 'c'), A('d', 'l', 's')]),
    B("a", [A('s', 'v', 'c')]),
]
And I want to transform it into a pandas df like this:
prop a b c
0 z a b c
0 z d l s
1 a s v c
I can do it in several steps, but it seems unnecessary and inefficient, as I'm going through the same data multiple times:
a = pd.DataFrame(
    [obj.__dict__ for obj in test]
)
a
prop attr
0 z [A(a='a', b='b', c='c'), A(a='d', b='l', c='s')]
1 a [A(a='s', b='v', c='c')]
b = a.explode('attr')
b
prop attr
0 z A(a='a', b='b', c='c')
0 z A(a='d', b='l', c='s')
1 a A(a='s', b='v', c='c')
b[["a", "b", "c"]] = b.apply(lambda x: [x.attr.a, x.attr.b, x.attr.c], axis=1, result_type="expand")
b
prop attr a b c
0 z A(a='a', b='b', c='c') a b c
0 z A(a='d', b='l', c='s') d l s
1 a A(a='s', b='v', c='c') s v c
Can it be done a bit more efficient?
Use a combination of dataclasses.asdict and pd.json_normalize:
In [59]: pd.json_normalize([dataclasses.asdict(k) for k in test], 'attr', ['prop'])
Out[59]:
a b c prop
0 a b c z
1 d l s z
2 s v c a
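If you want prop first, as in the desired output, you can reorder the columns once at the end:

df = pd.json_normalize([dataclasses.asdict(k) for k in test], 'attr', ['prop'])
df = df[['prop', 'a', 'b', 'c']]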
Another version:
df = pd.DataFrame({"prop": b.prop, **a.__dict__} for b in test for a in b.attr)
Result:
prop a b c
0 z a b c
1 z d l s
2 a s v c
I have the following piece of code:
import pandas as pd

rev = input("rev: ")

def ds():
    data = pd.read_excel(r'H:\sysfile.xls', skiprows=2)
    dataset1 = pd.DataFrame(data, columns=['Col_1', 'Col_2', 'Col_3', 'Col_4'])
    dataset2 = pd.DataFrame(data, columns=['Col_5'])
    dataset2['Col_5'] = dataset2['Col_5'].fillna(rev)

ds()
Col_5 is an existing column in the xls file. I want to give every cell in that column the input from rev.
If I print() dataset1, I get the content of the existing DataFrame (read from the xls file):
A B C
0 x y z
1 x y z
2 x y z
Now I want to write the user input from rev = input() into DataFrame dataset2 and append it to dataset1.
INPUT:
>>>rev: h
Should become this (dataset2):
D
0 h
1 h
2 h
and appended to dataset1:
A B C D
0 x y z h
1 x y z h
2 x y z h
I need help!
If you already managed to fill dataset2 with the user input, what you need is to join both datasets:
dataset1 = pd.DataFrame({'A': ['X', 'X', 'X', 'X'], 'B': ['Y', 'Y', 'Y', 'Y'], 'C': ['z', 'z', 'z', 'z']})
Out[5]:
A B C
0 X Y z
1 X Y z
2 X Y z
3 X Y z
dataset2 = pd.DataFrame({'D': ['h', 'h', 'h', 'h']})
Out[8]:
D
0 h
1 h
2 h
3 h
At this point you need to join them:
result = pd.concat([dataset1, dataset2], axis=1, sort=False)
Out[10]:
A B C D
0 X Y z h
1 X Y z h
2 X Y z h
3 X Y z h
You can read further details here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
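Tying it back to the original ds() function, a sketch (note the original ds() never returned anything, so its result was lost; column names are taken from the question, and Col_5 is assumed to be empty in the file so fillna fills every cell):

import pandas as pd

rev = input("rev: ")

def ds():
    data = pd.read_excel(r'H:\sysfile.xls', skiprows=2)
    dataset1 = data[['Col_1', 'Col_2', 'Col_3', 'Col_4']]
    # fill the whole Col_5 with the user input
    dataset2 = data[['Col_5']].fillna(rev)
    # join the two column blocks side by side
    return pd.concat([dataset1, dataset2], axis=1, sort=False)

result = ds()
print(result)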
You can take the transpose of your DataFrame and add that column as a row.
After adding the row, use the transpose again to bring your DataFrame back to its normal form.
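A minimal sketch of that idea, reusing dataset1 and the column name 'D' from the example above (joining column-wise as in the first answer is more direct, but this works too):

t = dataset1.T        # transpose: columns become rows
t.loc['D'] = rev      # add the new column as a row, broadcast from the input
dataset1 = t.T        # transpose back to the normal form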
I have a large(ish) set of experimental data that contains pairs of values. Each pair is associated with a particular barcode. Ideally, each pair should have a unique barcode. Unfortunately, it turns out that I screwed something up during the experiment. Now several pairs share a single barcode. I need to exclude these pairs/barcodes from my analysis.
My data looks kind of like this:
The pairs are in columns 'A' and 'B' -- I just included 'X' to represent some arbitrary associated data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Barcode': ['AABBCC', 'AABBCC', 'BABACC', 'AABBCC', 'DABBAC', 'ABDABD', 'DABBAC'],
                   'A': ['v', 'v', 'x', 'y', 'z', 'h', 'z'],
                   'B': ['h', 'h', 'j', 'k', 'l', 'v', 'l'],
                   'X': np.random.randint(10, size=7)})
df = df[['Barcode', 'A', 'B', 'X']]
df
Barcode A B X
0 AABBCC v h 8
1 AABBCC v h 7
2 BABACC x j 2
3 AABBCC y k 3
4 DABBAC z l 8
5 ABDABD h v 0
6 DABBAC z l 4
I want to get rid of the rows described by barcode 'AABBCC', since this barcode is associated with two different pairs (rows 0 and 1 are both the same pair -- which is fine -- but, row 3 is a different pair).
df.loc[df.Barcode != 'AABBCC']
Barcode A B X
2 BABACC x j 6
4 DABBAC z l 0
5 ABDABD h v 7
6 DABBAC z l 5
My solution thus far:
def duplicates(bar):
    if len(df.loc[df.Barcode == bar].A.unique()) > 1 or len(df.loc[df.Barcode == bar].B.unique()) > 1:
        return 'collision'
    else:
        return 'single'

df['Barcode_collision'] = df.apply(lambda row: duplicates(row['Barcode']), axis=1)
df.loc[df.Barcode_collision == 'single']
Barcode A B X Barcode_collision
2 BABACC x j 6 single
4 DABBAC z l 0 single
5 ABDABD h v 7 single
6 DABBAC z l 5 single
Unfortunately, this is very slow with a large dataframe (~500,000 rows) using my delicate computer. I'm sure there must be a better/faster way. Maybe using the groupby function?
df.groupby(['Barcode', 'A', 'B']).count()
X
Barcode A B
AABBCC v h 2
y k 1
ABDABD h v 1
BABACC x j 1
DABBAC z l 2
Then filtering out rows that have more than one value in the second or third indexes? But my brain and my googling skills can't seem to get me further than this...
You can use filter:
print(df.groupby('Barcode').filter(lambda x: ((x.A.nunique() == 1) and (x.B.nunique() == 1))))
Barcode A B X Barcode_collision
2 BABACC x j 4 single
4 DABBAC z l 9 single
5 ABDABD h v 3 single
6 DABBAC z l 9 single
Another solution with transform and boolean indexing:
g = df.groupby('Barcode')
A = g.A.transform('nunique')
B = g.B.transform('nunique')
print (df[(A == 1) & (B == 1)])
Barcode A B X Barcode_collision
2 BABACC x j 2 single
4 DABBAC z l 6 single
5 ABDABD h v 1 single
6 DABBAC z l 3 single
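On ~500,000 rows, another option is to compute the per-barcode uniqueness once with nunique and filter with isin, which avoids calling a Python lambda per group (a sketch):

counts = df.groupby('Barcode')[['A', 'B']].nunique()
# barcodes whose A and B values are each consistent (a single pair)
good = counts[(counts['A'] == 1) & (counts['B'] == 1)].index
print(df[df['Barcode'].isin(good)])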
Can a new index be applied to a DataFrame according to a grouping made with groupby? Precisely: is there an elegant way to do that, and can the original DataFrame be changed through groupby groups at all?
UPD:
My data looks like this:
A B C
0 a x 0.903343
1 a z 0.982050
2 g x 0.274823
3 g y 0.334491
4 c z 0.756728
5 f z 0.697841
6 d z 0.505845
7 b z 0.768199
8 b y 0.743012
9 e x 0.697212
I am grouping by columns 'A' and 'B', and I want every unique pair of values from those columns to share the same index value in the original DataFrame. Also, the original DataFrame can be big, and I'm trying to figure out how to do such a reindex without inefficiently building a whole new DataFrame.
Currently I'm using this solution:
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in xrange(10)],
                   'B': [random.choice(['x', 'y']) for _ in xrange(10)],
                   'C': [random.random() for _ in xrange(10)]})
df['id'] = None
new_df = pd.DataFrame()
for i, (n, g) in enumerate(df.groupby(['A', 'B'])):
    g['id'] = i
    new_df = new_df.append(g)
new_df.set_index('id', inplace=True)
You can do this quickly with some internal functions in pandas:
Create a test DataFrame first:
import pandas as pd
import random
from string import ascii_lowercase

random.seed(1)
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in xrange(10)],
                   'B': [random.choice(['x', 'y']) for _ in xrange(10)],
                   'C': [random.random() for _ in xrange(10)]})
If you want the new id in the same order as columns A & B:
m = pd.MultiIndex.from_arrays((df.A, df.B))
df.index = pd.factorize(pd.lib.fast_zip(m.labels), sort=True)[0]
print df
The output is:
A B C
1 a y 0.025446
7 e x 0.541412
6 d y 0.939149
2 b x 0.381204
3 c x 0.216599
4 c y 0.422117
5 d x 0.029041
6 d y 0.221692
1 a y 0.437888
0 a x 0.495812
If you don't care about the order of the new id:
m = pd.MultiIndex.from_arrays((df.A, df.B))
la, lb = m.labels
df.index = pd.factorize(la*len(lb)+lb)[0]
print df
The output is:
A B C
0 a y 0.025446
1 e x 0.541412
2 d y 0.939149
3 b x 0.381204
4 c x 0.216599
5 c y 0.422117
6 d x 0.029041
2 d y 0.221692
0 a y 0.437888
7 a x 0.495812
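pd.lib.fast_zip and MultiIndex.labels have since been removed from pandas; in modern versions the same grouping ids can be produced with groupby(...).ngroup(), which I'd consider the idiomatic replacement (a sketch in Python 3):

import random
from string import ascii_lowercase

import pandas as pd

random.seed(1)
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in range(10)],
                   'B': [random.choice(['x', 'y']) for _ in range(10)],
                   'C': [random.random() for _ in range(10)]})

# same id for every unique (A, B) pair, numbered in sorted pair order
df.index = df.groupby(['A', 'B']).ngroup()
print(df)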