Creating a subset of an array from another array in Python

I have a basic question regarding working with arrays:
a = ['c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b = [0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]
I) Is there a short way to count the number of times 'c' in a corresponds to 0, 1, and 2 in b, the number of times 'b' in a corresponds to 0, 1, and 2, and so on?
II) How do I create new arrays c (a subset of a) and d (a subset of b) that contain only those elements where the corresponding element in a is 'c'?

In [10]: p = ['a', 'b', 'c', 'a', 'c', 'a']
In [11]: q = [1, 2, 1, 3, 3, 1]
In [12]: z = list(zip(p, q))  # list() matters on Python 3, where zip returns an iterator
In [13]: z
Out[13]: [('a', 1), ('b', 2), ('c', 1), ('a', 3), ('c', 3), ('a', 1)]
In [14]: counts = {}
In [15]: for pair in z:
    ...:     if pair in counts:
    ...:         counts[pair] += 1
    ...:     else:
    ...:         counts[pair] = 1
    ...:
In [16]: counts
Out[16]: {('a', 1): 2, ('a', 3): 1, ('b', 2): 1, ('c', 1): 1, ('c', 3): 1}
In [17]: sub_p = []
In [18]: sub_q = []
In [19]: for i, element in enumerate(p):
    ...:     if element == 'a':
    ...:         sub_p.append(element)
    ...:         sub_q.append(q[i])
In [20]: sub_p
Out[20]: ['a', 'a', 'a']
In [21]: sub_q
Out[21]: [1, 3, 1]
Explanation
zip takes two lists and runs a figurative zipper between them, producing a list of pairs (on Python 3 it returns a lazy iterator, hence the list() call above).
I've used a simplistic approach: I just maintain a map/dictionary that makes note of how many times it has seen each char-int pair.
Then I build two sub-lists, which you can modify to use the character in question and figure out what it maps to.
Alternative methods
As abarnert suggested, you could use a Counter from collections instead.
Or you could just use the count method on z, e.g. z.count(('a', 1)). Or you could use a defaultdict instead.
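For instance, a minimal sketch of the Counter variant, using the same p, q and z as above:
from collections import Counter

counts = Counter(zip(p, q))   # same tallies as the explicit loop
print(counts[('a', 1)])       # 2
print(z.count(('a', 1)))      # 2 - note list.count takes the whole tuple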

The questions are a bit vague, but here's a quick method (some would call it dirty) using Pandas, though I think something written without recourse to Pandas should be preferred.
import pandas as pd

# create OP's lists
a = ['c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b = [0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]

# dump the lists into a Pandas DataFrame
df = pd.DataFrame({'a': a, 'b': b})
Question 1
Provided I interpreted it correctly, you can cross-tabulate the two arrays with pd.crosstab(df.a, df.b).stack(). Cross-tabulation counts the number of times each number corresponds to a particular letter; .stack() reshapes the crosstab output into a more legible format.
# question 1
pd.crosstab(df.a, df.b).stack()

Out[9]:
a  b
a  0    3
   1    2
   2    2
b  0    4
   1    3
   2    0
c  0    4
   1    0
   2    0
dtype: int64
Question 2
Here, I use Pandas' boolean indexing to select only the elements in array a that correspond to the value 'c'. df.a == 'c' returns True for every value in a that is 'c' and False otherwise, and df.loc[df.a == 'c', 'a'] returns the values from a for which the boolean condition was True.
c = df.loc[df.a == 'c', 'a']
d = df.loc[df.a == 'c', 'b']
In [15]: c
Out[15]:
0     c
6     c
11    c
16    c
Name: a, dtype: object

In [16]: d
Out[16]:
0     0
6     0
11    0
16    0
Name: b, dtype: int64

Python lists have a count method (see https://www.tutorialspoint.com/python/python_lists.htm).
I suggest you first zip both lists, as said in the comments, and then count occurrences of the tuple ('c', 0), of the tuple ('c', 1), and so on; that's basically what you need for (I).
For (II), if I understood you correctly, you have to take the zipped lists and apply filter on them with lambda pair: pair[0] == 'c'.
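A minimal sketch of that approach, assuming a and b are the OP's lists from the question:
pairs = list(zip(a, b))
# (I): count individual letter/number pairs directly
n_c0 = pairs.count(('c', 0))   # times 'c' lines up with 0
n_c1 = pairs.count(('c', 1))
# (II): keep only pairs whose letter is 'c', then split back into two lists
c_pairs = list(filter(lambda pair: pair[0] == 'c', pairs))
c = [letter for letter, number in c_pairs]
d = [number for letter, number in c_pairs]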

Related

Python: How to get the statistics of the position of each item in multiple lists?

I want to analyze the items of the sequences and the positions in the sequence where each item appears.
For example:
dataframe['sequence_list'][0] = ['a','b', 'f', 'e']
dataframe['sequence_list'][1] = ['a','c', 'd', 'e']
dataframe['sequence_list'][2] = ['a','d']
...
dataframe['sequence_list'][i] = ['a','b', 'c']
What I want to get is:
How many times does 'a' appear in positions 0, 1, 2, 3 of the list?
How many times does 'b' appear in positions 0, 1, 2, 3 of the list?
...
Output would be like:
output[1,'a'] = 4
output[2,'a'] = 0
output[3,'a'] = 0
output[4,'a'] = 0
output[1,'b'] = 2
...
The output format could be different. I just want to know whether there is any quick matrix-computing methodology that can help me get the stats quickly.
Start by converting the lists into Series using one of the two statements:
df_ser = dataframe.sequence_list.apply(pd.Series)
df_ser = pd.DataFrame(dataframe.sequence_list.tolist()) # ~30% faster?
The columns of the new dataframe are item positions for each row:
#    0  1    2    3
# 0  a  b    f    e
# 1  a  c    d    e
# 2  a  d  NaN  NaN
# 3  a  b    c  NaN
Convert the column numbers into the second-level index, then the second-level index into a column of its own:
df_col = df_ser.stack().reset_index(level=1)
#    level_1  0
# 0        0  a
# 0        1  b
# 0        2  f
# ...
Count the combinations. This is your answer:
output = df_col.groupby(['level_1', 0]).size()
# level_1  0
# 0        a    4
# 1        b    2
#          c    1
#          d    1
# 2        c    1
#          d    1
#          f    1
# 3        e    2
You can have it as dictionary:
output.to_dict()
#{(0, 'a'): 4, (1, 'b'): 2, (1, 'c'): 1, (1, 'd'): 1,
# (2, 'c'): 1, (2, 'd'): 1, (2, 'f'): 1, (3, 'e'): 2}
All in one line:
dataframe.sequence_list.apply(pd.Series)\
    .stack().reset_index(level=1)\
    .groupby(['level_1', 0]).size().to_dict()
Setup
Using the setup
df = pd.DataFrame({'col': [['a','b', 'f', 'e'], ['a','c', 'd', 'e'], ['a','d'], ['a','b', 'c']]})
            col
0  [a, b, f, e]
1  [a, c, d, e]
2        [a, d]
3     [a, b, c]
You can use apply + Counter:
from collections import Counter
pd.DataFrame(df.col.tolist()).apply(Counter)
which yields
0 {'a': 4}
1 {'b': 2, 'c': 1, 'd': 1}
2 {'f': 1, 'd': 1, None: 1, 'c': 1}
3 {'e': 2, None: 2}
dtype: object
for each column, i.e. for each list position.
You can then parse the result however you need, e.g. fill in dicts with zeroes for missing combinations, or discard the None entries (they count the rows whose lists are too short to reach that position).
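For example, a small sketch (using the same df as above) that turns the per-position Counters into the OP's output[(position, item)] dict, dropping the None padding:
from collections import Counter
import pandas as pd

counters = pd.DataFrame(df.col.tolist()).apply(Counter)
output = {(pos, item): n
          for pos, counter in counters.items()   # Series.items() yields (position, Counter)
          for item, n in counter.items()
          if item is not None}                    # drop padding from short lists
# e.g. output[(0, 'a')] == 4; use output.get((pos, item), 0) for missing combinations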

Implementing k nearest neighbours from distance matrix?

I am trying to do the following:
Given a dataFrame of distance, I want to identify the k-nearest neighbours for each element.
Example:
   A  B  C  D
A  0  1  3  2
B  5  0  2  2
C  3  2  0  1
D  2  3  4  0
If k=2, it should return:
A: B D
B: C D
C: D B
D: A B
Distances are not necessarily symmetric.
I am thinking there must be something somewhere that does this in an efficient way using Pandas DataFrames. But I cannot find anything?
Homemade code is also very welcome! :)
Thank you!
The way I see it, I simply find the k + 1 smallest numbers/distances/neighbours for each row and remove the 0, which then gives you the k nearest neighbours. Keep in mind that the code will not work if any off-diagonal distance is zero! Only the diagonal is allowed to be 0.
import pandas as pd

X = pd.DataFrame([[0, 1, 3, 2], [5, 0, 2, 2], [3, 2, 0, 1], [2, 3, 4, 0]])
X.columns = ['A', 'B', 'C', 'D']
X.index = ['A', 'B', 'C', 'D']
X = X.T
for i in X.index:
    Y = X.nsmallest(3, i)       # k + 1 = 3 smallest distances in column i
    Y = Y.T
    Y = Y[Y.index.str.startswith(i)]
    Y = Y.loc[:, Y.any()]       # drop the zero self-distance
    for j in Y.index:
        print(i + ": ", list(Y.columns))
This prints out:
A: ['B', 'D']
B: ['C', 'D']
C: ['D', 'B']
D: ['A', 'B']
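Alternatively, a short sketch using numpy's argsort on the underlying array; like the code above it assumes the diagonal holds the only zero in each row:
import numpy as np
import pandas as pd

X = pd.DataFrame([[0, 1, 3, 2], [5, 0, 2, 2], [3, 2, 0, 1], [2, 3, 4, 0]],
                 index=list('ABCD'), columns=list('ABCD'))
k = 2
order = np.argsort(X.values, axis=1)  # per row: column positions sorted by distance
nearest = {row: [X.columns[j] for j in order[i] if X.columns[j] != row][:k]
           for i, row in enumerate(X.index)}
print(nearest)  # {'A': ['B', 'D'], 'B': ['C', 'D'], 'C': ['D', 'B'], 'D': ['A', 'B']}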

How to get back the index after groupby in pandas

I am trying to find the record with the maximum value among the first record of each group after groupby, and delete that record from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first()  # lost track of the index
desired_row = t[t.cost == t.cost.max()]
# delete this row from df

         cost
item_id
d           5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would also easily work on the last). In fact, because of the general nature of split-aggregate-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}  # .items(), not .iteritems(), on Python 3
df2 = df.iloc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
   cost item_id
0     1       a
5     1       c
2     1       b
6     5       d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
   cost item_id
0     1       a
1     2       a
2     1       b
3     1       b
4     3       b
5     1       c
Try this:
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
t = df.drop_duplicates(subset=['item_id'], keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]

Out[186]:
   cost item_id
0     1       a
1     2       a
2     1       b
3     1       b
4     3       b
5     1       c
Or, using isin with negation (~):
Consider this df with a few more rows:
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5, 1, 7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]

   cost item_id
0     1       a
1     2       a
2     1       b
3     1       b
4     3       b
5     1       c
7     1       d
8     7       d
Overview: create a dataframe from a dictionary. Group by item_id and find the max value. Enumerate over the grouped series, using the key (a numeric position) to recover the alpha index value. Build a result_df dataframe if you desire.
df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                        'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped = df_temp.groupby(['item_id'])['cost'].max()
result_df = pd.DataFrame(columns=['item_id', 'cost'])
for key, value in enumerate(grouped):
    index = grouped.index[key]
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
    result_df = result_df.append({'item_id': index, 'cost': value}, ignore_index=True)
print(result_df.head(5))
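Since DataFrame.append is gone in pandas 2.0, here is a small sketch of the same result without the loop, using only groupby:
result_df = df_temp.groupby('item_id', as_index=False)['cost'].max()
print(result_df.head(5))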

groupby multiple value columns

I need to do a fuzzy groupby where a single record can be in one or more groups.
I have a DataFrame like this:
test = pd.DataFrame({'score1': pd.Series(['a', 'b', 'c', 'd', 'e']), 'score2': pd.Series(['b', 'a', 'k', 'n', 'c'])})
Output:
  score1 score2
0      a      b
1      b      a
2      c      k
3      d      n
4      e      c
I wish to have groups like this:
The group keys should be the union of the unique values between score1 and score2. Record 0 should be in groups a and b because it contains both score values. Similarly record 1 should be in groups b and a; record 2 should be in groups c and k and so on.
I've tried doing a groupby on two columns like this:
In [192]: score_groups = test.groupby(['score1', 'score2'])
However, I get the group keys as tuples - ('a', 'b'), ('b', 'a'), ('c', 'k'), etc. - instead of unique group keys where records can be in multiple groups. The output is shown below:
In [192]: score_groups.groups
Out[192]: {('a', 'b'): [0],
           ('b', 'a'): [1],
           ('c', 'k'): [2],
           ('d', 'n'): [3],
           ('e', 'c'): [4]}
Also, I need the indexes preserved because I'm using them for another operation later.
Please help!
Combine the two columns into a single column using e.g. pd.concat():
s = pd.concat([test['score1'], test['score2'].rename('score1')]).reset_index()
s.columns = ['val', 'grp']
   val grp
0    0   a
1    1   b
2    2   c
3    3   d
4    4   e
5    0   b
6    1   a
7    2   k
8    3   n
9    4   c
And then .groupby() on 'grp' and collect 'val' in a list:
s = s.groupby('grp').apply(lambda x: x.val.tolist())
a [0, 1]
b [1, 0]
c [2, 4]
d [3]
e [4]
k [2]
n [3]
or, if you prefer dict:
s.to_dict()
{'e': [4], 'd': [3], 'n': [3], 'k': [2], 'a': [0, 1], 'c': [2, 4], 'b': [1, 0]}
Or, to the same effect in a single step, skipping renaming the columns:
test.unstack().reset_index(-1).groupby(0).apply(lambda x: x.level_1.tolist())
a [0, 1]
b [1, 0]
c [2, 4]
d [3]
e [4]
k [2]
n [3]
Using Stefan's help, I solved it like this.
In [283]: frame1 = test[['score1']]
     ...: frame2 = test[['score2']]
     ...: frame2 = frame2.rename(columns={'score2': 'score1'})  # avoid inplace rename on a slice
     ...: test = pd.concat([frame1, frame2])
     ...: test
Out[283]:
  score1
0      a
1      b
2      c
3      d
4      e
0      b
1      a
2      k
3      n
4      c
Notice the duplicate indices. The indexes have been preserved, which is what I wanted. Now, let's get to business - the groupby operation.
In [284]: groups = test.groupby('score1')
     ...: groups.get_group('a')  # Get group with key a
Out[284]:
  score1
0      a
1      a

In [285]: groups.get_group('b')  # Get group with key b
Out[285]:
  score1
1      b
0      b

In [286]: groups.get_group('c')  # Get group with key c
Out[286]:
  score1
2      c
4      c

In [287]: groups.get_group('k')  # Get group with key k
Out[287]:
  score1
2      k
I'm baffled by how pandas retrieves rows with the correct index even though they are duplicated. As I understand, the group by operation uses an inverted index data structure to store the references (indexes) to rows. Any insights would be greatly appreciated. Anyone who answers this will have their answer accepted :)
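A small sketch toward that insight (not an authoritative account of pandas internals): the groupby object keeps a mapping from each group key to the positional indices of its rows, so duplicate index labels never have to be resolved by label:
groups = test.groupby('score1')
print(groups.indices)
# e.g. {'a': array([0, 6]), 'b': array([1, 5]), 'c': array([2, 9]), ...}
# get_group then pulls rows by position, which is unambiguous even with duplicates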
Reorganize your data for ease of manipulation (having multiple value columns for the same data will always cause you headaches):
import pandas as pd
test = pd.DataFrame({'score1' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']), 'score2' : pd.Series([2, 1, 8, 9, 3], index=['a', 'b', 'c', 'd', 'e'])})
test['name'] = test.index
result = pd.melt(test, id_vars=['name'], value_vars=['score1', 'score2'])
  name variable  value
0    a   score1      1
1    b   score1      2
2    c   score1      3
3    d   score1      4
4    e   score1      5
5    a   score2      2
6    b   score2      1
7    c   score2      8
8    d   score2      9
9    e   score2      3
Now you have only one column for your value and it's easy to groupby score or select by your name column:
hey = result.groupby('value')
hey.groups
#below are the indices that you care about
{1: [0, 6], 2: [1, 5], 3: [2, 9], 4: [3], 5: [4], 8: [7], 9: [8]}

Mapping values into a new dataframe column

I have a dataset (~7000 rows) that I have imported into Pandas for some "data wrangling", but I need some pointers in the right direction to take the next step. My data looks something like the below; it describes a structure with several sub-levels. B, D and again B are sub-levels of A; C is a sub-level of B; and so on...
Level, Name
0, A
1, B
2, C
1, D
2, E
3, F
3, G
1, B
2, C
But I want something like the below, with Name and Mother_name on the same row:
Level, Name, Mother_name
1, B, A
2, C, B
1, D, A
2, E, D
3, F, E
3, G, E
1, B, A
2, C, B
If I understand the format correctly, the parent of a name depends on the
nearest prior row whose level is one less than the current row's level.
Your DataFrame has a modest number of rows (~7000). So there is little harm (to
performance) in simply iterating through the rows. If the DataFrame were very
large, you often get better performance if you can use column-wise vectorized Pandas
operations instead of row-wise iteration. However, in this case it appears that
using column-wise vectorized Pandas operations is awkward and
overly-complicated. So I believe row-wise iteration is the best choice here.
Using df.iterrows to perform row-wise iteration, you can simply record the current parents for every level as you go, and fill in the "mother"s as appropriate:
import pandas as pd

df = pd.DataFrame({'level': [0, 1, 2, 1, 2, 3, 3, 1, 2],
                   'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B', 'C']})

parent = dict()   # most recent name seen at each level
mother = []
for index, row in df.iterrows():
    parent[row['level']] = row['name']
    mother.append(parent.get(row['level'] - 1))  # nearest prior row one level up
df['mother'] = mother
print(df)
yields
   level name mother
0      0    A   None
1      1    B      A
2      2    C      B
3      1    D      A
4      2    E      D
5      3    F      E
6      3    G      E
7      1    B      A
8      2    C      B
If you can specify the mapping of the two columns in something like a dictionary, then you can just use the map method of the original column.
import pandas
names = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B', 'C']
# name -> sublevel
sublevel_map = {
    'A': 'A',
    'B': 'A',
    'C': 'B',
    'D': 'A',
    'E': 'D',
    'F': 'E',
    'G': 'E'
}
df = pandas.DataFrame({'Name': names})
df['Sublevel'] = df['Name'].map(sublevel_map)
Which gives you:
  Name Sublevel
0    A        A
1    B        A
2    C        B
3    D        A
4    E        D
5    F        E
6    G        E
7    B        A
8    C        B
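If the mapping isn't known up front, a hypothetical sketch could derive it from the Level column, reusing the parent-tracking idea from the iterrows answer above (levels is assumed to hold the question's Level values, names is the list defined above):
levels = [0, 1, 2, 1, 2, 3, 3, 1, 2]   # the question's Level column
last_at_level = {}
sublevel_map = {}
for lvl, name in zip(levels, names):
    sublevel_map[name] = last_at_level.get(lvl - 1, name)  # fall back to self at the root
    last_at_level[lvl] = name
# reproduces the hand-written sublevel_map above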
