Count various entries in a DataFrame - python

I want to find out how many different devices are in this list.
Is my SQL statement sufficient for that, or do I have to do more?
With such a large amount of data, I unfortunately cannot tell which method is right or whether the result is correct.
Some devices appear more than once, so the number of rows is not equal to the number of devices.
Suggestions in Python or SQL are welcome.
import pandas as pd
from sqlalchemy import create_engine # database connection
from IPython.display import display
disk_engine = create_engine('sqlite:///gender-train-devices.db')
phones = pd.read_sql_query('SELECT device_id, COUNT(device_id) FROM phone_brand_device_model GROUP BY [device_id]', disk_engine)
print(phones)
The output is:
device_id COUNT(device_id)
0 -9223321966609553846 1
1 -9223067244542181226 1
2 -9223042152723782980 1
3 -9222956879900151005 1
4 -9222896629442493034 1
5 -9222894989445037972 1
6 -9222894319703307262 1
7 -9222754701995937853 1
8 -9222661944218806987 1
9 -9222399302879214035 1
10 -9222352239947207574 1
11 -9222173362545970626 1
12 -9221825537663503111 1
13 -9221768839350705746 1
14 -9221767098072603291 1
15 -9221674814957667064 1
16 -9221639938103564513 1
17 -9221554258551357785 1
18 -9221307795397202665 1
19 -9221086586254644858 1
20 -9221079146476055829 1
21 -9221066489596332354 1
22 -9221046405740900422 1
23 -9221026417907250887 1
24 -9221015678978880842 1
25 -9220961720447724253 1
26 -9220830859283101130 1
27 -9220733369151052329 1
28 -9220727250496861488 1
29 -9220452176650064280 1
... ... ...
186686 9219686542557325817 1
186687 9219842210460037807 1
186688 9219926280825642237 1
186689 9219937375310355234 1
186690 9219958455132520777 1
186691 9220025918063413114 1
186692 9220160557900894171 1
186693 9220562120895859549 1
186694 9220807070557263555 1
186695 9220814716773471568 1
186696 9220880169487906579 1
186697 9220914901466458680 1
186698 9221114774124234731 1
186699 9221149157342105139 1
186700 9221152396628736959 1
186701 9221297143137682579 1
186702 9221586026451102237 1
186703 9221608286127666096 1
186704 9221693095468078153 1
186705 9221768426357971629 1
186706 9221843411551060582 1
186707 9222110179000857683 1
186708 9222172248989688166 1
186709 9222214407720961524 1
186710 9222355582733155698 1
186711 9222539910510672930 1
186712 9222779211060772275 1
186713 9222784289318287993 1
186714 9222849349208140841 1
186715 9223069070668353002 1
[186716 rows x 2 columns]

If you want the number of different devices, you can just query the database:
SELECT COUNT(distinct device_id)
FROM phone_brand_device_model ;
Of course, if you already have the data in a data frame for some other purpose you can count the number of rows there.

If you already have data in memory as a dataframe, you can use:
df['device_id'].nunique()
Otherwise, use the SQL query from Gordon's answer; it should be faster.

If you want to do it in pandas, you can do something like:
len(phones.device_id.unique())
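To compare the two approaches side by side, here is a minimal sketch. It assumes a tiny in-memory SQLite table as a stand-in for phone_brand_device_model (the rows and the phone_brand column are made up for illustration) and counts the distinct ids once in SQL and once in pandas:

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the real table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phone_brand_device_model (device_id INTEGER, phone_brand TEXT)"
)
conn.executemany(
    "INSERT INTO phone_brand_device_model VALUES (?, ?)",
    [(101, "a"), (102, "b"), (101, "a"), (103, "c"), (102, "b")],
)

# Option 1: let the database count the distinct ids.
n_sql = pd.read_sql_query(
    "SELECT COUNT(DISTINCT device_id) AS n_devices FROM phone_brand_device_model",
    conn,
)["n_devices"].iloc[0]

# Option 2: load the column and count the distinct ids in pandas.
phones = pd.read_sql_query("SELECT device_id FROM phone_brand_device_model", conn)
n_pandas = phones["device_id"].nunique()

# Devices that occur on more than one row.
occurrences = phones["device_id"].value_counts()
dupes = occurrences[occurrences > 1]
```

Both counts should agree. If they do, and the GROUP BY result has as many rows as the distinct count, every group really is a separate device.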

Related

Sample from dataframe with conditions

I have a large dataset and I want to sample from it, but with a condition. What I need is a new dataframe with almost the same number (count) of rows for each value of a boolean column of 0 and 1.
What I have:
df['target'].value_counts()
0 = 4000
1 = 120000
What I need:
new_df['target'].value_counts()
0 = 4000
1 = 6000
I know I can use df.sample, but I don't know how to insert the condition.
Thanks
Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:
df.groupby('target').sample(4000)
Demo:
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})
df.groupby('x').sample(5)
x
8 0
6 0
7 0
2 0
9 0
18 1
33 1
24 1
32 1
15 1
If you need to sample conditionally based on the group value, you can do:
df.groupby('target', group_keys=False).apply(
lambda g: g.sample(4000 if g.name == 0 else 6000)
)
Demo:
df.groupby('x', group_keys=False).apply(
lambda g: g.sample(4 if g.name == 0 else 6)
)
x
7 0
8 0
2 0
1 0
18 1
12 1
17 1
22 1
30 1
28 1
Assuming the following input and using the values 4/6 instead of 4000/6000:
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
You could groupby your target and sample to take at most N values per group:
df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))
example output:
target
4 0
0 0
8 0
12 0
10 1
14 1
1 1
7 1
11 1
13 1
If you want the same size you can simply use df.groupby('target').sample(n=4)
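When each class needs its own sample size, one possible variation is to keep the sizes in a dict and sample group by group. This is only a sketch: the `sizes` mapping, the `value` column and the scaled-down row counts (40 and 1200 instead of 4000 and 120000) are assumptions for illustration.

```python
import pandas as pd

# Scaled-down stand-in for the imbalanced data in the question.
df = pd.DataFrame({"target": [0] * 40 + [1] * 1200, "value": range(1240)})

# Hypothetical per-class sample sizes (the question asks for 4000 and 6000).
sizes = {0: 40, 1: 60}

# Sample each class separately, never asking for more rows than it has.
parts = [
    group.sample(n=min(len(group), sizes[label]), random_state=0)
    for label, group in df.groupby("target")
]
new_df = pd.concat(parts)

counts = new_df["target"].value_counts().to_dict()
```

Passing `random_state` makes the draw reproducible; drop it if you want a different sample on each run.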

Python Groupby and Count

I'm working on creating a Sankey plot and have the raw data mapped so that I know the source and target nodes. I'm having an issue with grouping by source and target and then counting the number of times each pair occurs, e.g. using the table below, finding out how many times 0 -> 4 occurs and recording that in the dataframe.
index event_action_num next_action_num
227926 0 6
227928 1 5
227934 1 6
227945 1 7
227947 1 6
227951 0 7
227956 0 6
227958 2 6
227963 0 6
227965 1 6
227968 1 5
227972 3 6
Where I want to end up is:
event_action_num next_action_num count_of
0 4 1728
0 5 2382
0 6 3739
etc
Have tried:
df_new_2 = df_new.groupby(['event_action_num', 'next_action_num']).count()
but it doesn't give me the result I'm looking for.
Thanks in advance
Try to use agg('size') instead of count():
df_new.groupby(['event_action_num', 'next_action_num']).agg('size')
For your sample data output will be:
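As a minimal sketch, the twelve sample rows from the question can be rebuilt and counted like this. The `reset_index(name='count_of')` step is an addition that turns the result into the three-column layout the question asks for; the frame name `df_new` follows the question.

```python
import pandas as pd

# The twelve sample rows from the question.
df_new = pd.DataFrame({
    "event_action_num": [0, 1, 1, 1, 1, 0, 0, 2, 0, 1, 1, 3],
    "next_action_num":  [6, 5, 6, 7, 6, 7, 6, 6, 6, 6, 5, 6],
})

# size() counts rows per (source, target) pair; reset_index names the count column.
counts = (
    df_new.groupby(["event_action_num", "next_action_num"])
    .size()
    .reset_index(name="count_of")
)
```

Unlike count(), size() does not depend on any other column being present or non-null, which is why it works on a frame that only holds the two grouping columns.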

Compare preceding two rows with subsequent two rows of each group till last record

I asked this question earlier; it was deleted, and I have rewritten it in a less verbose form so that it is easier to read.
I have a dataframe as given below
df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] , 'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['fake_flag'] = ''
I would like to fill values in the column fake_flag based on the rules below:
1) If the preceding two rows are constant (e.g. 5, 5) or decreasing (e.g. 7, 5), pick the higher of the two values. That is 7 for (7, 5) and 5 for (5, 5).
2) Check whether the current row is greater than the output of rule 1 by 3 or more points (>= 3) and whether the same value repeats in the next row (two occurrences of the same value). If the rule 1 output is 5, the value can be 8 or greater, e.g. 8 in row n and 8 in row n+1, or 10 in row n and 10 in row n+1. If yes, enter fake VAC in the fake_flag column.
This is what I tried
for i in t1.index:
    if i >=2:
        print("current value is ", t1[i])
        print("preceding 1st (n-1) ", t1[i-1])
        print("preceding 2nd (n-2) ", t1[i-2])
        if (t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]): # rule 1 check
            r1_output = t1[i-2] # we get the max of these two values (t1[i-2]), it doesn't matter when it's constant(t1[i-2] or t1[i-1]) will have the same value anyway
            print("rule 1 output is ", r1_output)
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                print("check for next value is same as current value", t1[i+1])
                if (t1[i]==t1[i+1]): # rule 2 check
                    print("fake flag is being set")
                    df['fake_flag'][i] = 'fake_vac'
This check should happen for all records, one by one, within each subject_id. My dataset has a million records, so an efficient and elegant solution would be helpful; I can't run a loop over a million records.
I expect my output to look like this:
subject_id = 1
subject_id = 2
import numpy as np
import pandas as pd
df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] , 'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['shift1']=df['PEEP'].shift(1)
df['shift2']=df['PEEP'].shift(2)
df['fake_flag'] = np.where((df['shift1'] ==df['shift2']) | (df['shift1'] < df['shift2']), 'fake VAC', '')
df = df.drop(['shift1', 'shift2'], axis=1)  # drop() returns a new frame, so assign it back
Output
subject_id day PEEP fake_flag
0 1 1 7
1 1 2 5
2 1 3 10 fake VAC
3 1 4 10
4 1 5 11 fake VAC
5 1 6 11
6 1 7 14 fake VAC
7 1 8 14
8 1 9 17 fake VAC
9 1 10 17
10 1 11 21 fake VAC
11 1 12 21
12 1 13 23 fake VAC
13 1 14 23
14 1 15 25 fake VAC
15 1 16 25
16 1 17 22 fake VAC
17 1 18 20 fake VAC
18 1 19 26 fake VAC
19 1 20 26
20 2 1 5 fake VAC
21 2 2 7 fake VAC
22 2 3 8
23 2 4 8
24 2 5 9 fake VAC
25 2 6 9
26 2 7 13 fake VAC
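The snippet above only checks rule 1 and shifts across subject boundaries. A possible vectorized sketch that applies both rules within each subject_id is shown below. It reflects one reading of the rules in the question (baseline = the n-2 value when the two preceding rows are constant or decreasing), and the short PEEP series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up example: two subjects with six readings each.
df = pd.DataFrame({
    "subject_id": [1] * 6 + [2] * 6,
    "PEEP": [7, 5, 10, 10, 11, 11, 5, 7, 8, 8, 12, 12],
})

# Shift within each subject so values never leak across subject boundaries.
peep = df.groupby("subject_id")["PEEP"]
prev1 = peep.shift(1)   # row n-1
prev2 = peep.shift(2)   # row n-2
nxt = peep.shift(-1)    # row n+1

# Rule 1: the two preceding rows are constant or decreasing;
# in that case the higher of the two is the n-2 value.
rule1 = prev2 >= prev1
baseline = prev2

# Rule 2: current value is at least 3 above the baseline and repeats in the next row.
rule2 = (df["PEEP"] >= baseline + 3) & (df["PEEP"] == nxt)

# Comparisons against missing neighbours (NaN) are False, so edge rows stay unflagged.
df["fake_flag"] = np.where(rule1 & rule2, "fake VAC", "")

flagged = df.index[df["fake_flag"] == "fake VAC"].tolist()
```

Because everything is expressed as shifted columns, no Python-level loop runs over the rows, which is what matters at a million records.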

Python Pandas Feature Generation as aggregate function

I have a pandas df which is more or less like
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
This DF contains a couple of million points. I am trying to generate some descriptors to incorporate the time nature of the data. The idea is that for each row I create a window of length x going back in the data and count the occurrences of that row's key in the window. I wrote an implementation, but by my estimate the calculation for 23 different windows will run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index        # current row
        lid1 = lid - 200   # lower bound of the 200-row window
        pid = row['key']
        # count earlier rows inside the window that share this row's key
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. However, I have the uneasy feeling that iterating is probably not the smartest way to do this data aggregation. Is there a way to implement it so that it runs faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # label-based slicing with .loc is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of instances key appears in df, prior to this row
    # (.loc replaces .ix, which was removed in pandas 1.0)
    return sum(df.loc[:cut_idx, 'key'] == key)
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records, this will still take a while, and the relative performance gain will likely be somewhat smaller than in this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key')['dist'].cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
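The cumsum approach counts every earlier occurrence of a key, not only those inside a fixed-length window. If the window length matters, one possible sketch is to work per key on the sorted row positions and use a binary search to discard occurrences that are too old. The function name `window_counts` is hypothetical, and the bounds mirror the question's condition `index > lid - window & index < lid`.

```python
import numpy as np
import pandas as pd

def window_counts(keys: pd.Series, window: int) -> pd.Series:
    """For each row, count earlier rows with the same key whose position
    is greater than (current position - window)."""
    out = np.zeros(len(keys), dtype=int)
    # .indices maps each key to the row positions where it occurs
    for positions in keys.groupby(keys).indices.values():
        positions = np.sort(positions)
        # occurrences at or before the window's lower bound are too old
        too_old = np.searchsorted(positions, positions - window, side="right")
        # the j-th occurrence has j earlier ones; subtract the too-old ones
        out[positions] = np.arange(len(positions)) - too_old
    return pd.Series(out, index=keys.index, name=f"window{window}")

# Keys from the ten-row sample above.
keys = pd.Series([57, 22, 12, 45, 94, 36, 38, 94, 94, 38])
wide = window_counts(keys, 200)   # window covers everything: same as the expected output
narrow = window_counts(keys, 3)   # only the two preceding rows are inside the window
```

The loop runs once per distinct key rather than once per row, so with around a hundred keys it stays cheap even for millions of rows. Calling it once per window length gives the 23 feature columns.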

Divide part of a dataframe by another while keeping columns that are not being divided

I have two data frames as below:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.161456 0.033139 0.991840 2.111023 0.846197
1 1 10 0.636140 1.024235 36.333741 16.074662 3.142135
2 1 13 0.605840 0.034337 2.085061 2.125908 0.069698
3 1 14 0.038481 0.152382 4.608259 4.960007 0.162162
4 1 5 0.035628 0.087637 1.397457 0.768467 0.052605
5 1 6 0.114375 0.020196 0.220193 7.662065 0.077727
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.305224 0.542488 66.428382 73.615079 10.342252
1 1 10 0.814696 1.246165 73.802644 58.064363 11.179206
2 1 13 0.556437 0.517383 50.555948 51.913547 9.412299
3 1 14 0.314058 1.148754 56.165767 61.261950 9.142128
4 1 5 0.499129 0.460813 40.182454 41.770906 8.263437
5 1 6 0.300203 0.784065 47.359506 52.841821 9.833513
I want to divide the numerical values in the selected cells of the first by the second and I am using the following code:
df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
However, this way I lose the information from the column "Sample_name".
C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 0.528977 0.061088 0.014931 0.028677 0.081819
1 0.780831 0.821909 0.492309 0.276842 0.281070
2 1.088785 0.066367 0.041243 0.040951 0.007405
3 0.122529 0.132650 0.082047 0.080964 0.017738
4 0.071381 0.190178 0.034778 0.018397 0.006366
5 0.380993 0.025759 0.004649 0.145000 0.007904
How can I perform the division while keeping the column "Sample_name" in the resulting dataframe?
You can selectively overwrite using loc, the same way that you're already performing the division:
df1_int.loc[:,'C14-Cer':] = df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
This preserves the Sample_name column:
In [12]:
df.loc[:,'C14-Cer':] = df.loc[:,'C14-Cer':].div(df1.loc[:,'C14-Cer':])
df
Out[12]:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
index
0 1 1 0.528975 0.061087 0.014931 0.028677 0.081819
1 1 10 0.780831 0.821910 0.492309 0.276842 0.281070
2 1 13 1.088785 0.066367 0.041243 0.040951 0.007405
3 1 14 0.122528 0.132650 0.082047 0.080964 0.017738
4 1 5 0.071380 0.190179 0.034778 0.018397 0.006366
5 1 6 0.380992 0.025758 0.004649 0.145000 0.007904
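If you would rather leave the original frame untouched, a small variation is to copy it first and select the numeric columns by name. This is only a sketch with two made-up rows and two of the columns; `value_cols` and `result` are illustrative names.

```python
import pandas as pd

# Made-up miniature versions of the two frames.
df1 = pd.DataFrame({
    "Sample_name": ["1 1", "1 10"],
    "C14-Cer": [1.0, 4.0],
    "C16-Cer": [2.0, 9.0],
})
df2 = pd.DataFrame({
    "Sample_name": ["1 1", "1 10"],
    "C14-Cer": [2.0, 8.0],
    "C16-Cer": [4.0, 3.0],
})

# Every column except the label column takes part in the division.
value_cols = df1.columns.drop("Sample_name")

# Work on a copy so df1 keeps its original values.
result = df1.copy()
result[value_cols] = df1[value_cols].div(df2[value_cols])
```

Note that div aligns on the index and on column names, so both frames need the same row order (or a shared index such as Sample_name) for the ratios to be meaningful.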
