Count various entries in a DataFrame - python
I want to find out how many different devices are in this list. Is the query below sufficient, or do I have to do more?
With such a large amount of data I unfortunately don't know which method is right, or whether this solution is correct. Some devices appear more than once, so the number of rows is not equal to the number of devices.
Suggestions in Python or in SQL are welcome.
import pandas as pd
from sqlalchemy import create_engine # database connection
from IPython.display import display
disk_engine = create_engine('sqlite:///gender-train-devices.db')
phones = pd.read_sql_query('SELECT device_id, COUNT(device_id) FROM phone_brand_device_model GROUP BY [device_id]', disk_engine)
print(phones)
The output is:
device_id COUNT(device_id)
0 -9223321966609553846 1
1 -9223067244542181226 1
2 -9223042152723782980 1
3 -9222956879900151005 1
4 -9222896629442493034 1
5 -9222894989445037972 1
6 -9222894319703307262 1
7 -9222754701995937853 1
8 -9222661944218806987 1
9 -9222399302879214035 1
10 -9222352239947207574 1
11 -9222173362545970626 1
12 -9221825537663503111 1
13 -9221768839350705746 1
14 -9221767098072603291 1
15 -9221674814957667064 1
16 -9221639938103564513 1
17 -9221554258551357785 1
18 -9221307795397202665 1
19 -9221086586254644858 1
20 -9221079146476055829 1
21 -9221066489596332354 1
22 -9221046405740900422 1
23 -9221026417907250887 1
24 -9221015678978880842 1
25 -9220961720447724253 1
26 -9220830859283101130 1
27 -9220733369151052329 1
28 -9220727250496861488 1
29 -9220452176650064280 1
... ... ...
186686 9219686542557325817 1
186687 9219842210460037807 1
186688 9219926280825642237 1
186689 9219937375310355234 1
186690 9219958455132520777 1
186691 9220025918063413114 1
186692 9220160557900894171 1
186693 9220562120895859549 1
186694 9220807070557263555 1
186695 9220814716773471568 1
186696 9220880169487906579 1
186697 9220914901466458680 1
186698 9221114774124234731 1
186699 9221149157342105139 1
186700 9221152396628736959 1
186701 9221297143137682579 1
186702 9221586026451102237 1
186703 9221608286127666096 1
186704 9221693095468078153 1
186705 9221768426357971629 1
186706 9221843411551060582 1
186707 9222110179000857683 1
186708 9222172248989688166 1
186709 9222214407720961524 1
186710 9222355582733155698 1
186711 9222539910510672930 1
186712 9222779211060772275 1
186713 9222784289318287993 1
186714 9222849349208140841 1
186715 9223069070668353002 1
[186716 rows x 2 columns]
If you want the number of different devices, you can just query the database:
SELECT COUNT(DISTINCT device_id)
FROM phone_brand_device_model;
Of course, if you already have the data in a data frame for some other purpose, you can just count the number of rows there, since the GROUP BY already gives you one row per device.
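For example, a minimal sketch building on the phones frame from the question (each row of that grouped result is one distinct device_id):
num_devices = len(phones)   # or phones.shape[0]; for the output above this is 186716
print(num_devices)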
If you already have data in memory as a dataframe, you can use:
df['device_id'].nunique()
Otherwise, use Gordon's SQL solution; it should be faster.
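If you want to run that SQL from Python, a small sketch reusing the disk_engine connection from the question (the alias n_devices is just illustrative) might look like:
n = pd.read_sql_query(
    'SELECT COUNT(DISTINCT device_id) AS n_devices FROM phone_brand_device_model',
    disk_engine,
)
print(n['n_devices'].iloc[0])   # single-row result: the number of distinct devices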
If you want to do it in pandas, you can do something like:
len(phones.device_id.unique())
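One small caveat worth knowing (not mentioned in the answers above): Series.nunique() drops NaN by default, while unique() keeps it, so the two approaches can differ by one if the column contains missing values. A quick illustration:
s = pd.Series([1, 2, 2, None])
print(s.nunique())              # 2 - NaN is excluded by default
print(len(s.unique()))          # 3 - NaN counts as a distinct value
print(s.nunique(dropna=False))  # 3 - include NaN explicitly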
Related
Sample from dataframe with conditions
I have a large dataset and I want to sample from it, but with a condition. What I need is a new dataframe where the counts of the two values of a boolean column (0 and 1) are much closer to each other.
What I have:
df['target'].value_counts()
0 = 4000
1 = 120000
What I need:
new_df['target'].value_counts()
0 = 4000
1 = 6000
I know I can use df.sample, but I don't know how to apply the condition. Thanks
Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:
df.groupby('target').sample(4000)
Demo:
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})
df.groupby('x').sample(5)
    x
8   0
6   0
7   0
2   0
9   0
18  1
33  1
24  1
32  1
15  1
If you need to sample conditionally based on the group value, you can do:
df.groupby('target', group_keys=False).apply(
    lambda g: g.sample(4000 if g.name == 0 else 6000)
)
Demo:
df.groupby('x', group_keys=False).apply(
    lambda g: g.sample(4 if g.name == 0 else 6)
)
    x
7   0
8   0
2   0
1   0
18  1
12  1
17  1
22  1
30  1
28  1
Assuming the following input, and using the values 4/6 instead of 4000/6000:
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
You could group by your target and sample to take at most N values per group:
df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))
Example output:
    target
4        0
0        0
8        0
12       0
10       1
14       1
1        1
7        1
11       1
13       1
If you want the same size for both groups, you can simply use:
df.groupby('target').sample(n=4)
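As a side note, both DataFrame.sample and groupby(...).sample accept a random_state argument, so a reproducible variant of the capped sampling above could look like:
df.groupby('target', group_keys=False).apply(
    lambda g: g.sample(min(len(g), 6), random_state=0)  # fixed seed: same sample every run
)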
Python Groupby and Count
I'm working on creating a Sankey plot and have the raw data mapped so that I know the source and target nodes. I'm having an issue with grouping the source & target and then counting the number of times each pair occurs, e.g. using the table below, finding out how many times 0 -> 4 occurs and recording that in the dataframe.
index   event_action_num  next_action_num
227926  0                 6
227928  1                 5
227934  1                 6
227945  1                 7
227947  1                 6
227951  0                 7
227956  0                 6
227958  2                 6
227963  0                 6
227965  1                 6
227968  1                 5
227972  3                 6
Where I want to end up is:
event_action_num  next_action_num  count_of
0                 4                1728
0                 5                2382
0                 6                3739
etc.
I have tried:
df_new_2 = df_new.groupby(['event_action_num', 'next_action_num']).count()
but it doesn't give me the result I'm looking for. Thanks in advance
Try using agg('size') instead of count():
df_new_2 = df_new.groupby(['event_action_num', 'next_action_num']).agg('size')
For your sample data the output will be:
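Computed here from the 12 sample rows shown in the question (the original answer stops before showing it), that grouped size comes out as:
event_action_num  next_action_num
0                 6                  3
                  7                  1
1                 5                  2
                  6                  3
                  7                  1
2                 6                  1
3                 6                  1
dtype: int64
If you want count_of as a regular column, as in the desired output, you can follow up with df_new_2.reset_index(name='count_of').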
Compare preceding two rows with subsequent two rows of each group till last record
I asked a question earlier which was deleted; this is a less verbose form that is easier to read. I have a dataframe as given below:
df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
                   'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                   'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['fake_flag'] = ''
I would like to fill values in the column fake_flag based on the rules below:
1) If the preceding two rows are constant (e.g. 5,5) or decreasing (e.g. 7,5), pick the higher of the two rows. In these examples that is 7 from (7,5) and 5 from (5,5).
2) Check whether the current row is greater than the output from rule 1 by 3 or more points (>= 3) and whether it repeats in the next row (2 occurrences of the same value). It can be 8 or greater than 8 if the rule 1 output is 5, e.g. (8 in row n, 8 in row n+1) or (10 in row n, 10 in row n+1). If yes, write fake VAC in the fake_flag column.
This is what I tried:
for i in t1.index:
    if i >= 2:
        print("current value is ", t1[i])
        print("preceding 1st (n-1) ", t1[i-1])
        print("preceding 2nd (n-2) ", t1[i-2])
        if (t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]):  # rule 1 check
            r1_output = t1[i-2]  # the max of the two values; when they are constant, t1[i-2] and t1[i-1] are the same anyway
            print("rule 1 output is ", r1_output)
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                print("check for next value is same as current value", t1[i+1])
                if (t1[i] == t1[i+1]):  # rule 2 check
                    print("fake flag is being set")
                    df['fake_flag'][i] = 'fake_vac'
This check should happen for all records (one by one) for each subject_id. My dataset has a million records, so I can't run a loop over it; any efficient and elegant solution is welcome. I expect my output to be as shown below (subject_id = 1, subject_id = 2).
import numpy as np
import pandas as pd

df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
                   'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                   'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['shift1'] = df['PEEP'].shift(1)
df['shift2'] = df['PEEP'].shift(2)
df['fake_flag'] = np.where((df['shift1'] == df['shift2']) | (df['shift1'] < df['shift2']), 'fake VAC', '')
df.drop(['shift1','shift2'], axis=1)
Output:
0   1   1   7
1   1   2   5
2   1   3  10  fake VAC
3   1   4  10
4   1   5  11  fake VAC
5   1   6  11
6   1   7  14  fake VAC
7   1   8  14
8   1   9  17  fake VAC
9   1  10  17
10  1  11  21  fake VAC
11  1  12  21
12  1  13  23  fake VAC
13  1  14  23
14  1  15  25  fake VAC
15  1  16  25
16  1  17  22  fake VAC
17  1  18  20  fake VAC
18  1  19  26  fake VAC
19  1  20  26
20  2   1   5  fake VAC
21  2   2   7  fake VAC
22  2   3   8
23  2   4   8
24  2   5   9  fake VAC
25  2   6   9
26  2   7  13  fake VAC
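If you also need rule 2 and want the shifts kept within each subject_id, a fully vectorized sketch of both rules as literally stated in the question (untested against the poster's expected output, which is not shown) could look like this:
import numpy as np
import pandas as pd

def flag_fake_vac(g):
    peep = g['PEEP']
    prev1 = peep.shift(1)             # value in row n-1
    prev2 = peep.shift(2)             # value in row n-2
    nxt = peep.shift(-1)              # value in row n+1
    rule1 = prev2 >= prev1            # preceding pair is constant or decreasing
    r1_output = prev2                 # the higher of the two preceding values
    rule2 = (peep >= r1_output + 3) & (peep == nxt)  # jump of >= 3 that repeats in the next row
    g = g.copy()
    g['fake_flag'] = np.where(rule1 & rule2, 'fake_vac', '')
    return g

# apply the rules separately within each subject so the shifts never cross subject boundaries
df = df.groupby('subject_id', group_keys=False).apply(flag_fake_vac)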
Python Pandas Feature Generation as aggregate function
I have a pandas df which is more or less like:
   ID  key  dist
0   1   57     1
1   2   22     1
2   3   12     1
3   4   45     1
4   5   94     1
5   6   36     1
6   7   38     1
.....
This df contains a couple of million points. I am trying to generate some descriptors to incorporate the time nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of the particular key in the window. I did an implementation, but according to my estimate, for 23 different windows the calculation will run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. I have the uneasy feeling, however, that iteration is probably not the smartest way to do this aggregation. Is there a way to implement it so it runs faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from the OP's to include multiple key values:
   ID  key  dist
0   1   57     1
1   2   22     1
2   3   12     1
3   4   45     1
4   5   94     1
5   6   36     1
6   7   38     1
7   8   94     1
8   9   94     1
9  10   38     1

import pandas as pd
df = pd.read_clipboard()

Based on these data, and the counting criteria defined by the OP, we expect the output to be:
    key  dist  window
ID
1    57     1       0
2    22     1       0
3    12     1       0
4    45     1       0
5    94     1       0
6    36     1       0
7    38     1       0
8    94     1       1
9    94     1       2
10   38     1       1

Using the OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window

print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop

Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas indexing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of instances key appears in df, prior to this row
    return sum(df.ix[:cut_idx, 'key'] == key)

print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop

Note that with millions of records, this will still take a while, and the relative performance gains will likely be diminished somewhat compared to this small test case.

UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1, 100, size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx, 'key':keys, 'dist':dists})
df = df.set_index('ID')

Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop

Here's enough output to show that the computation is working:
    dist  key  window
ID
0      1   83       0
1      1    4       0
2      1   87       0
3      1   66       0
4      1   31       0
5      1   33       0
6      1    1       0
7      1   77       0
8      1   49       0
9      1   49       1
10     1   97       0
11     1   36       0
12     1   19       0
13     1   75       0
14     1    4       1

Note: To revert ID from index to column, use df.reset_index() at the end.
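One thing to note about the cumsum() version: it counts every earlier occurrence of the key, with no 200-row cap. If the cap from the original question matters, a per-key searchsorted sketch (the helper name windowed_key_count is hypothetical) along these lines keeps the count restricted to the previous 200 index positions:
import numpy as np
import pandas as pd

def windowed_key_count(df, window=200):
    """For each row at index i, count earlier rows with the same key whose index lies in (i - window, i)."""
    pieces = []
    for _, g in df.groupby('key'):
        idx = g.index.to_numpy()                  # sorted index positions of this key's rows
        before = np.arange(len(idx))              # rows of this key strictly before position p
        too_old = np.searchsorted(idx, idx - window, side='right')  # of those, how many fall at index <= i - window
        pieces.append(pd.Series(before - too_old, index=g.index))
    return pd.concat(pieces).sort_index()

df['window1'] = windowed_key_count(df, window=200)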
Divide part of a dataframe by another while keeping columns that are not being divided
I have two data frames as below:
  Sample_name   C14-Cer   C16-Cer    C18-Cer  C18:1-Cer   C20-Cer
0        1 1   0.161456  0.033139   0.991840   2.111023  0.846197
1        1 10  0.636140  1.024235  36.333741  16.074662  3.142135
2        1 13  0.605840  0.034337   2.085061   2.125908  0.069698
3        1 14  0.038481  0.152382   4.608259   4.960007  0.162162
4        1 5   0.035628  0.087637   1.397457   0.768467  0.052605
5        1 6   0.114375  0.020196   0.220193   7.662065  0.077727

  Sample_name   C14-Cer   C16-Cer    C18-Cer  C18:1-Cer    C20-Cer
0        1 1   0.305224  0.542488  66.428382  73.615079  10.342252
1        1 10  0.814696  1.246165  73.802644  58.064363  11.179206
2        1 13  0.556437  0.517383  50.555948  51.913547   9.412299
3        1 14  0.314058  1.148754  56.165767  61.261950   9.142128
4        1 5   0.499129  0.460813  40.182454  41.770906   8.263437
5        1 6   0.300203  0.784065  47.359506  52.841821   9.833513

I want to divide the numerical values in the selected cells of the first by the second, and I am using the following code:
df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
However, this way I lose the information from the column "Sample_name":
    C14-Cer   C16-Cer   C18-Cer  C18:1-Cer   C20-Cer
0  0.528977  0.061088  0.014931   0.028677  0.081819
1  0.780831  0.821909  0.492309   0.276842  0.281070
2  1.088785  0.066367  0.041243   0.040951  0.007405
3  0.122529  0.132650  0.082047   0.080964  0.017738
4  0.071381  0.190178  0.034778   0.018397  0.006366
5  0.380993  0.025759  0.004649   0.145000  0.007904
How can I perform the division while keeping the column "Sample_name" in the resulting dataframe?
You can selectively overwrite using loc, the same way that you're already performing the division:
df1_int.loc[:,'C14-Cer':] = df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
This preserves the Sample_name column:
In [12]: df.loc[:,'C14-Cer':] = df.loc[:,'C14-Cer':].div(df1.loc[:,'C14-Cer':])
         df
Out[12]:
      Sample_name   C14-Cer   C16-Cer   C18-Cer  C18:1-Cer   C20-Cer
index
0            1 1   0.528975  0.061087  0.014931   0.028677  0.081819
1            1 10  0.780831  0.821910  0.492309   0.276842  0.281070
2            1 13  1.088785  0.066367  0.041243   0.040951  0.007405
3            1 14  0.122528  0.132650  0.082047   0.080964  0.017738
4            1 5   0.071380  0.190179  0.034778   0.018397  0.006366
5            1 6   0.380992  0.025758  0.004649   0.145000  0.007904
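An alternative sketch that does not modify df1_int in place (assuming the column layout shown above, with Sample_name to the left of the numeric block): keep the identifier column as it is and concatenate it with the divided block:
result = pd.concat(
    [
        df1_int.loc[:, :'Sample_name'],                          # untouched identifier column(s)
        df1_int.loc[:, 'C14-Cer':].div(df2.loc[:, 'C14-Cer':]),  # element-wise division of the numeric block
    ],
    axis=1,
)
Note that .div aligns on both index and columns, so the two frames should share the same row index; mismatched rows produce NaN.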