I have this data:
member_id device_id
19404 dfbc9d3230304cdfb0316cc32c41b67f [2016-04-28, 2016-04-27, 2016-04-26, 2016-04-22]
19555 176e307bd8714a00ac2b99276123f0a7 [2016-04-29, 2016-04-28, 2016-04-27, 2016-04-23]
19632 a6d4b631e09a4b31afef4c93472c7da3 [2016-04-29, 2016-04-28, 2016-04-27]
19792 0146b09048ce4c47af4bbc69e7999137 [2016-04-23, 2016-04-22, 2016-04-21, 2016-04-20]
20258 1510f9b4efc14183ad412eb54c9e058f [2016-04-09]
20258 5f42f4d02d38456689e58d6a1b9a3e16 [2016-04-29, 2016-04-28, 2016-04-25, 2016-04-22]
and I need to count the values in the third column's lists.
I tried len(); I thought it would return the length of each list, but it doesn't work here.
new = data.groupby(['member_id', 'device_id'])['event_date'].unique()
count() returns the total number of values, not the per-list lengths.
Assuming that you have a list of values in your last column l:
In [113]: df.l.map(len)
Out[113]:
0 4
1 4
2 3
3 4
4 1
5 4
Name: l, dtype: int64
If your last column contains strings, you can convert them to lists first:
df.l.str.replace(r'[\[\]]', '').str.split(r'\s*,\s*').map(len)
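Note that in pandas 2.0+ str.replace defaults to regex=False (and str.split gained a regex flag in 1.4), so on newer versions the regex intent has to be spelled out; a minimal sketch of the same conversion under that assumption:
# explicit regex=True keeps the patterns working on pandas 2.0+
df.l.str.replace(r'[\[\]]', '', regex=True).str.split(r'\s*,\s*', regex=True).map(len)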
Is this what you are looking for?
import pandas as pd

df = pd.DataFrame(columns=('member_id', 'device_id', 'event_date'), data=[
    [19404, 'dfbc9d3230304cdfb0316cc32c41b67f', ['2016-04-28', '2016-04-27', '2016-04-26', '2016-04-22']],
    [19555, '176e307bd8714a00ac2b99276123f0a7', ['2016-04-29', '2016-04-28', '2016-04-27', '2016-04-23']],
    [19632, 'a6d4b631e09a4b31afef4c93472c7da3', ['2016-04-29', '2016-04-28', '2016-04-27']],
    [19792, '0146b09048ce4c47af4bbc69e7999137', ['2016-04-23', '2016-04-22', '2016-04-21', '2016-04-20']],
    [20258, '1510f9b4efc14183ad412eb54c9e058f', ['2016-04-09']],
    [20258, '5f42f4d02d38456689e58d6a1b9a3e16', ['2016-04-29', '2016-04-28', '2016-04-25', '2016-04-22']],
])

new = df.groupby(['member_id', 'device_id'])['event_date']
for each_n in new:
    # each_n is a (group_key, group_values) tuple
    print(each_n[0], len(each_n[1].values[0]))
Output
(19404, 'dfbc9d3230304cdfb0316cc32c41b67f') 4
(19555, '176e307bd8714a00ac2b99276123f0a7') 4
(19632, 'a6d4b631e09a4b31afef4c93472c7da3') 3
(19792, '0146b09048ce4c47af4bbc69e7999137') 4
(20258, '1510f9b4efc14183ad412eb54c9e058f') 1
(20258, '5f42f4d02d38456689e58d6a1b9a3e16') 4
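As a loop-free alternative, Series.str.len() also works element-wise on list-valued cells, so the same counts can be read off directly (a minimal sketch against the df built above):
# per-row list lengths, indexed by (member_id, device_id)
counts = df.set_index(['member_id', 'device_id'])['event_date'].str.len()
print(counts)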
You can apply the len function to the grouped column. The .iat[0] gets the first item in the group, which in this case is your list.
>>> df.groupby(['member_id', 'device_id'])['event_date'].agg(
{'event_count': lambda group: len(group.iat[0])})
event_count
member_id device_id
19404 dfbc9d3230304cdfb0316cc32c41b67f 4
19555 176e307bd8714a00ac2b99276123f0a7 4
19632 a6d4b631e09a4b31afef4c93472c7da3 3
19792 0146b09048ce4c47af4bbc69e7999137 4
20258 1510f9b4efc14183ad412eb54c9e058f 1
5f42f4d02d38456689e58d6a1b9a3e16 4
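One caveat: renaming through a dict passed to SeriesGroupBy.agg, as above, was deprecated in pandas 0.25 and removed in 1.0. On modern pandas, the named-aggregation spelling should produce the same table:
# named aggregation works on pandas >= 0.25 and replaces dict renaming
df.groupby(['member_id', 'device_id'])['event_date'].agg(
    event_count=lambda group: len(group.iat[0]))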
For example, I have a pandas dataframe with a Hash_Value column holding lists of hashes and a Name column (a reconstruction is shown below).
Ignoring the "Name" column, I want a dataframe that labels the hashes of the same group with a shared "ID".
Here, we traverse each row: we encounter "8a43" and assign ID 1 to it, and wherever we find the same hash value we assign its ID as 1. Then we move on to the next row and encounter "79e2" and "b183"; we traverse all the rows and wherever we find these values we store their ID as 2. The tricky case arises when we reach "abc7": it inherits the ID that was assigned when "1cee" was first seen in "abc5". But I also want that in rows after the current one, wherever "26ea" is found, the same ID is assigned to those as well.
I hope all this makes sense. If not, feel free to reach out to me via comments or message, and I will clear it up quickly.
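Since the example frame itself did not survive the paste, here is a reconstruction inferred from the outputs in the answers below, with the hash lists stored as strings (which is what the first answer's parsing assumes):
import pandas as pd

# hypothetical reconstruction of the question's frame, based on the outputs below
df = pd.DataFrame({
    'Hash_Value': ['[8a43]', '[79e2,b183]', '[f82a]', '[b183]',
                   '[eaa7,5ea9,1cee]', '[5ea9]', '[1cee,26ea]',
                   '[79e2]', '[8a43]', '[26ea]'],
    'Name': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5',
             'abc6', 'abc7', 'abc8', 'abc9', 'abc10'],
})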
Solution using dict
import numpy as np
import pandas as pd

hashvalues = list(df['Hash_Value'])
dic, i = {}, 1
id_list = []

for hashlist in hashvalues:
    if isinstance(hashlist, str):
        # convert the string representation to a list
        hashlist = hashlist.replace('[', '').replace(']', '')
        hashlist = hashlist.split(',')

        # check if the first hash is unknown
        if hashlist[0] not in dic:
            # assign a new id
            dic[hashlist[0]] = i
            k = i
            i += 1
        else:
            # if known, reuse the existing id
            k = dic[hashlist[0]]

        for h in hashlist[1:]:
            # set the ids of the rest of the list's hashes
            # equal to the first hash's id
            dic[h] = k

        id_list.append(k)
    else:
        id_list.append(np.nan)

df['ID'] = id_list
print(df)
Hash_Value Name ID
0 [8a43] abc1 1
1 [79e2,b183] abc2 2
2 [f82a] abc3 3
3 [b183] abc4 2
4 [eaa7,5ea9,1cee] abc5 4
5 [5ea9] abc6 4
6 [1cee,26ea] abc7 4
7 [79e2] abc8 2
8 [8a43] abc9 1
9 [26ea] abc10 4
Use a networkx solution: build a graph from the hash lists so linked hashes fall into the same connected component, build a dictionary mapping each hash to its component, then select the first value of each Hash_Value list with .str[0] and use Series.map:
# if necessary, convert the strings to lists first
# df['Hash_Value'] = df['Hash_Value'].str.strip('[]').str.split(',')
import networkx as nx

G = nx.Graph()
for l in df['Hash_Value']:
    # each list becomes a path, so all hashes in it end up connected
    nx.add_path(G, l)

new = list(nx.connected_components(G))
print(new)
[{'8a43'}, {'79e2', 'b183'}, {'f82a'}, {'5ea9', '1cee', '26ea', 'eaa7'}]
mapped = {node: cid for cid, component in enumerate(new) for node in component}
df['ID'] = df['Hash_Value'].str[0].map(mapped) + 1
print(df)
Hash_Value Name ID
0 [8a43] abc1 1
1 [79e2, b183] abc2 2
2 [f82a] abc3 3
3 [b183] abc4 2
4 [eaa7, 5ea9, 1cee] abc5 4
5 [5ea9] abc6 4
6 [1cee, 26ea] abc7 4
7 [79e2] abc8 2
8 [8a43] abc9 1
9 [26ea] abc10 4
I need a way to extract all words that start with 'A' followed immediately by a 6-digit numeric string (e.g. A112233, A000023).
Each cell contains sentences and there could potentially be a user error where they forget to put a space, so if you could account for that as well it would be greatly appreciated.
I've done research into using Python regex and Pandas, but I just don't know enough yet and am kind of on a time crunch.
Suppose your df is constructed by the following code:
import pandas as pd
df1 = pd.DataFrame(
    {
        "columnA": ["A194533", "A4A556633 system01A484666", "A4A556633",
                    "a987654A948323a882332A484666", "A238B004867",
                    "pageA000023lol", "a089923",
                    "something lol a484876A48466 emoji",
                    "A906633 A556633a556633"]
    }
)
print(df1)
Output:
columnA
0 A194533
1 A4A556633 system01A484666
2 A4A556633
3 a987654A948323a882332A484666
4 A238B004867
5 pageA000023lol
6 a089923
7 something lol a484876A48466 emoji
8 A906633 A556633a556633
Now let's fetch the targets matching the regex pattern:
result = df1['columnA'].str.extractall(r'([A]\d{6})')
Output:
0
match
0 0 A194533
1 0 A556633
1 A484666
2 0 A556633
3 0 A948323
1 A484666
5 0 A000023
8 0 A906633
1 A556633
And count them:
result.value_counts()
Output:
A556633 3
A484666 2
A000023 1
A194533 1
A906633 1
A948323 1
dtype: int64
Collect the unique values into a list:
unique_list = [i[0] for i in result.value_counts().index.tolist()]
Output:
['A556633', 'A484666', 'A000023', 'A194533', 'A906633', 'A948323']
And the counts into a list:
unique_count_list = result.value_counts().values.tolist()
Output:
[3, 2, 1, 1, 1, 1]
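If you'd rather have the codes and their counts together in one tidy frame instead of two parallel lists, a sketch like this should work (0 is the default column name extractall gives the capture group):
# one frame with a 'code' column and a 'count' column
summary = result[0].value_counts().rename_axis('code').reset_index(name='count')
print(summary)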
I want to convert a dataframe which has tuples in cells into a dataframe with MultiIndex.
Here is an example of the table code:
d = {2:[(0,2),(0,4)], 3:[(826.0, 826.0),(4132.0, 4132.0)], 4:[(6019.0, 6019.0),(12037.0, 12037.0)], 6:[(18337.0, 18605.0),(36674.0, 37209.0)]}
test = pd.DataFrame(d)
This is what the dataframe looks like:
2 3 4 6
0 (0, 2) (826.0, 826.0) (6019.0, 6019.0) (18337.0, 18605.0)
1 (0, 4) (4132.0, 4132.0) (12037.0, 12037.0) (36674.0, 37209.0)
This is what I want it to look like:
2 3 4 6
0 A 0 826.0 6019.0 18337.0
B 2 826.0 6019.0 18605.0
1 A 0 4132.0 12037.0 36674.0
B 4 4132.0 12037.0 37209.0
Thanks for your help!
Unsure about efficiency, because this relies on the apply method, but you could concat the dataframe with itself, assigning 'A' to the first copy and 'B' to the second through a helper column. Then sort the resulting dataframe by its index, and use apply to change even rows to the first value of the tuple and odd ones to the second:
df = pd.concat([test.assign(X='A'), test.assign(X='B')]).set_index(
'X', append=True).sort_index().rename_axis(index=(None, None))
df.iloc[0:len(df):2] = df.iloc[0:len(df):2].apply(lambda x: x.apply(lambda y: y[0]))
df.iloc[1:len(df):2] = df.iloc[1:len(df):2].apply(lambda x: x.apply(lambda y: y[1]))
It gives the expected result:
2 3 4 6
0 A 0 826 6019 18337
B 2 826 6019 18605
1 A 0 4132 12037 36674
B 4 4132 12037 37209
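For comparison, a sketch that avoids the even/odd apply passes: split the tuples up front with applymap (renamed DataFrame.map in pandas 2.1), then concat with keys and reorder the index levels:
# 'A' rows take the first tuple element, 'B' rows the second
out = pd.concat(
    [test.applymap(lambda t: t[0]), test.applymap(lambda t: t[1])],
    keys=['A', 'B']
).swaplevel(0, 1).sort_index()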
s = pd.Series(["08-10-2017", "08-10-2017", "08-10-2017", "09-10-2017", "09-10-2017", "09-10-2017", "10-10-2017", "10-10-2017", "10-10-2017", "11-10-2017", "11-10-2017", "11-10-2017", "12-10-2017", "12-10-2017", "12-10-2017", "13-10-2017", "13-10-2017", "13-10-2017", "14-10-2017", "14-10-2017"])
p = pd.DataFrame(data=s)
p.columns = ['date']
p.groupby('date').agg('count').reset_index().columns
Where is the 'count' column?
I think you are looking for value_counts:
p.date.value_counts()
Out[1095]:
09-10-2017 3
13-10-2017 3
10-10-2017 3
12-10-2017 3
08-10-2017 3
11-10-2017 3
14-10-2017 2
Name: date, dtype: int64
And if you want to do it with groupby:
p.groupby('date').size()
And if you do want to use count:
p.groupby('date').agg({'date':'count'})
Out[1101]:
date
date
08-10-2017 3
09-10-2017 3
10-10-2017 3
11-10-2017 3
12-10-2017 3
13-10-2017 3
14-10-2017 2
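As for why the original attempt showed no 'count' column: groupby('date') consumes date as the group key, and since it was the frame's only column, agg('count') has nothing left to count. To get an explicit 'count' column back as a regular frame, size plus reset_index avoids the name clash:
# yields a frame with 'date' and 'count' columns
p.groupby('date').size().reset_index(name='count')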
I have a DataFrame like so:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
4 0.225629 46.681293 0.540616
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
How do I get rid of the fourth row, which holds the max value of sq_resid? Note: the max will change from dataset to dataset, so just removing the 4th row isn't enough.
I have tried several things; I can remove the max value, which leaves the dataframe like below, but I haven't been able to remove the whole row.
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
4 0.225629 46.681293 NaN
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
You could just filter the df like so:
In [255]:
df.loc[df['sq_resid']!=df['sq_resid'].max()]
Out[255]:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
or drop using idxmax, which returns the row label of the max value:
In [257]:
df.drop(df['sq_resid'].idxmax())
Out[257]:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
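A final caveat: idxmax returns only the first label holding the maximum, so the drop variant removes a single row even if several rows tie, whereas the boolean filter removes every tied row. A small convenience wrapper for the filter form (drop_max is a hypothetical helper, not a pandas function):
def drop_max(frame, col):
    # keep only the rows whose value differs from the column's maximum
    return frame.loc[frame[col] != frame[col].max()]

drop_max(df, 'sq_resid')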