Efficient method to split a DataFrame multiple times in Python?

I currently have a pandas DataFrame df with the size of 168078 rows × 43 columns. A summary of df is shown below:
doi gender order year ... count
9384155 10.1103/PRL.102.039801 male 1 2009 ... 1
...
3679211 10.1103/PRD.69.024009 male 2 2004 ... 501
The df is currently sorted by count, which ranges from 1 to 501.
I would like to split df into 501 smaller DataFrames, one per count value. In other words, at the end of the process I would have 501 sub-DataFrames, each corresponding to one characteristic count value.
Since the number of resulting (desired) DataFrames is quite high, and since count is quantitative, I was wondering if:
a) it is possible to split the DataFrame that many times (if yes, then how), and
b) it is possible to name each DataFrame after its count value without manually assigning a name 501 times; for example, the sub-DataFrame with count == 1 would be df.1 without having to assign it by hand.

The best practice is to create a dictionary of DataFrames. Below is an example:
import pandas as pd

df = pd.DataFrame({'A': [4, 5, 6, 7, 7, 5, 4, 5, 6, 7],
                   'count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'C': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})
print(df)
A count C
0 4 1 a
1 5 2 b
2 6 3 c
3 7 4 d
4 7 5 e
5 5 6 f
6 4 7 g
7 5 8 h
8 6 9 i
9 7 10 j
Now we create the dictionary. As you can see, the key is the value of count in each row.
Keep in mind that Series.unique is used here so that, if there are two rows with the same count value, they end up in the same dictionary entry.
dfs={key:df[df['count']==key] for key in df['count'].unique()}
Below I show the content of the entire dictionary created and how to access it:
for key in dfs:
    print(f'dfs[{key}]')
    print(dfs[key])
    print('-' * 50)
dfs[1]
A count C
0 4 1 a
--------------------------------------------------
dfs[2]
A count C
1 5 2 b
--------------------------------------------------
dfs[3]
A count C
2 6 3 c
--------------------------------------------------
dfs[4]
A count C
3 7 4 d
--------------------------------------------------
dfs[5]
A count C
4 7 5 e
--------------------------------------------------
dfs[6]
A count C
5 5 6 f
--------------------------------------------------
dfs[7]
A count C
6 4 7 g
--------------------------------------------------
dfs[8]
A count C
7 5 8 h
--------------------------------------------------
dfs[9]
A count C
8 6 9 i
--------------------------------------------------
dfs[10]
A count C
9 7 10 j
--------------------------------------------------

You can just use groupby to get the result, as shown below. Here:
g.groups: gives the group name (group id) for each group
g.get_group: gives you one group for a given group name
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.choice(["a", "b", "c", "d"], 10),
                   'count': np.random.choice(10, 10)})

g = df.groupby("count")
for key in g.groups:
    print(g.get_group(key))
    print("\n---------------")
Result
A count
3 c 0
---------------
A count
9 a 2
---------------
A count
0 c 3
2 b 3
---------------
A count
1 b 4
5 d 4
6 a 4
7 b 4
---------------
A count
8 c 5
---------------
A count
4 d 8
---------------
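If, as in the first answer, you want a dictionary keyed by the count value, the groupby object can also be turned into one directly. A minimal sketch (the column values are just illustrative):
import pandas as pd

df = pd.DataFrame({'A': [4, 5, 6, 7, 7, 5, 4, 5, 6, 7],
                   'count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'C': list('abcdefghij')})

# Iterating a groupby yields (key, sub-DataFrame) pairs, so dict() collects
# one sub-DataFrame per distinct count value
dfs = dict(tuple(df.groupby('count')))
print(dfs[3])
Each dfs[k] is then the sub-DataFrame whose count equals k, which gives the per-value naming asked about in the question (dfs[1] instead of df.1).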

Related

Sampling one sample from each ID with equal number of samples in each group

I have a dataset with multiple rows of data for each ID. There are around 5000 IDs, and each ID can have 1 to 22 rows of data, each row belonging to a different group. I want to sample 1 row from each ID, and I want the sampled data to be equally distributed among the groups.
This is a dummy df, which is simplified so that there are 8 IDs, and each ID can have 1 to 4 rows of data:
id group
1 a
1 b
1 c
1 d
2 a
2 b
3 a
3 b
3 c
3 d
4 a
4 b
4 d
5 a
5 b
5 c
5 d
6 a
6 d
7 a
7 b
7 d
8 a
8 b
8 c
8 d
Since there are 8 IDs and 4 groups, I want the sampled data to have 2 IDs from each group. The number 2 is just because I want an equal distribution among groups, so if there are 20 IDs and 4 groups, I would want the sampled data to have 5 IDs from each group. Also, I want to sample one row from each ID, so all IDs should appear once and only once in the sampled data. Is there a way to do this?
I've tried using weights in pd.DataFrame.sample, using 1/frequency of each group as the weight, hoping that rows in lower-frequency groups would have more weight and therefore a higher chance of being sampled, so that the final sampled data would be roughly equally distributed among groups. But it didn't work as I expected. I tried different random states, but none of them gave me a sampled dataset that was equally distributed among groups. This is the code I used:
import pandas as pd

# Create dummy dataframe:
d = {'id': [1,1,1,1,
2,2,
3,3,3,3,
4,4,4,
5,5,5,5,
6,6,
7,7,7,
8,8,8,8],
'group': ['a','b','c','d',
'a','b',
'a','b','c','d',
'a','b','d',
'a','b','c','d',
'a','d',
'a','b','d',
'a','b','c','d']}
df = pd.DataFrame(data=d)
#Calculate weights
df['inverted_freq'] = 1./df.groupby('group')['group'].transform('count')
#Sample one row from each ID
df1 = df.groupby('id').apply(pd.DataFrame.sample, random_state=1, n=1, weights=df.inverted_freq).reset_index(drop=True)
My expected output is:
id group
1 d
2 b
3 a
4 d
5 c
6 a
7 b
8 c
or something similar to this, with one row per ID and an equal number of rows per group.
Suggestions in either R or Python would be greatly appreciated. Thanks!
We can use data.table
library(data.table)
setDT(df1)[df1[, sample(.I, 2), group]$V1]
In R, you can use dplyr::slice_sample:
library(dplyr)
df %>%
  group_by(group) %>%
  slice_sample(n = 2)
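For reference, a rough pandas equivalent of the dplyr approach above, assuming pandas >= 1.1 (where DataFrameGroupBy.sample is available). Like the R answers, it balances the groups but does not guarantee that every ID appears exactly once:
import pandas as pd

# df as constructed in the question
d = {'id': [1,1,1,1, 2,2, 3,3,3,3, 4,4,4, 5,5,5,5, 6,6, 7,7,7, 8,8,8,8],
     'group': ['a','b','c','d', 'a','b', 'a','b','c','d', 'a','b','d',
               'a','b','c','d', 'a','d', 'a','b','d', 'a','b','c','d']}
df = pd.DataFrame(data=d)

# Two rows per group, analogous to dplyr's slice_sample(n = 2)
print(df.groupby('group').sample(n=2, random_state=1))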
Try:
df.groupby("id").sample(n=2)
id group
0 1 a
2 1 c
4 2 a
5 2 b
7 3 b
9 3 d
12 4 d
10 4 a
13 5 a
16 5 d
18 6 d
17 6 a
21 7 d
19 7 a
23 8 b
24 8 c

How to select the 3 last dates in Python

I have a dataset that looks like this:
ID date
1 01-01-2012
1 05-02-2012
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 19-05-2012
2 07-08-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 15-04-2013
3 17-05-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I am working with Python and I would like to select the 3 last dates for each ID. Here is the dataset I would like to have:
ID date
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I used this code to select the very last date for each ID:
df_2=df.sort_values(by=['date']).drop_duplicates(subset='ID',keep='last')
But how can I select more than one date (for example the 3 last dates, or 4 last dates, etc)?
You might use groupby and tail in the following way to get the last 2 items from each group:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3],'value':['A','B','C','D','E','F','G','H','I']})
df2 = df.groupby('ID').tail(2)
print(df2)
Output:
ID value
1 1 B
2 1 C
4 2 E
5 2 F
7 3 H
8 3 I
Note that for simplicity's sake I used different (already sorted) data for building df.
You can try this:
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])
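One caveat: the date column in the question holds day-first strings, so sorting them as plain text is not guaranteed to be chronological. A minimal sketch of converting first (assuming the dd-mm-yyyy format shown above, with df being the ID/date frame from the question):
import pandas as pd

# Parse the day-first strings into real datetimes before sorting
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
print(df.sort_values('date').groupby('ID').tail(3).sort_values(['ID', 'date']))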
I tried this, but with a non-datetime data type:
a = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
b = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
import pandas as pd
import numpy as np
a = np.array([a,b])
df=pd.DataFrame(a.T,columns=['ID','Date'])
# the tail would give you the last n number of elements you are interested in
df_ = df.groupby('ID').tail(3)
df_
output:
ID Date
2 1 c
3 1 d
4 1 e
7 2 h
8 2 i
9 2 j
12 3 m
13 3 n
14 3 o

python pandas, trying to find unique combinations of two columns and merging while summing a third column

Hi, I will show what I'm trying to do through examples:
I start with a dataframe like this:
> pd.DataFrame({'A':['a','a','a','c'],'B':[1,1,2,3], 'count':[5,6,1,7]})
A B count
0 a 1 5
1 a 1 6
2 a 2 1
3 c 3 7
I need to find a way to get all the unique combinations of columns A and B and merge the duplicate rows. The count column should be summed across the merged rows, so the result should be the following:
A B count
0 a 1 11
1 a 2 1
2 c 3 7
Thanks for any help.
Use groupby and aggregate with sum:
print(df.groupby(['A','B'], as_index=False)['count'].sum())
A B count
0 a 1 11
1 a 2 1
2 c 3 7
print(df.groupby(['A','B'])['count'].sum().reset_index())
A B count
0 a 1 11
1 a 2 1
2 c 3 7

groupby, sum and count to one table

I have a dataframe below
df=pd.DataFrame({"A":np.random.randint(1,10,9),"B":np.random.randint(1,10,9),"C":list('abbcacded')})
A B C
0 9 6 a
1 2 2 b
2 1 9 b
3 8 2 c
4 7 6 a
5 3 5 c
6 1 3 d
7 9 9 e
8 3 4 d
I would like to get the grouping result below (with column "C" as the key); the rows for c, d and e are dropped intentionally.
number A_sum B_sum
a 2 16 15
b 2 3 11
This is a 2 row x 3 column DataFrame, and the grouping key is column C.
The column "number" represents the count of each letter (a and b).
A_sum and B_sum represent the per-group sums of columns A and B for each letter in column C.
I guess we should use the groupby method, but how can I get this summary table?
You can do this using a single groupby with
res = df.groupby(df.C).agg({'A': 'sum', 'B': {'sum': 'sum', 'count': 'count'}})
res.columns = ['A_sum', 'B_sum', 'count']
One option is to count the size and sum the columns for each group separately and then join them by index:
df.groupby("C")['A'].agg({"number": 'size'}).join(df.groupby('C').sum())
# number A B
# C
# a 2 11 8
# b 2 14 12
# c 2 8 5
# d 2 11 12
# e 1 7 2
You can also do df.groupby('C').agg(["sum", "size"]) which gives an extra duplicated size column, but if you are fine with that, it should also work.
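Note that the dictionary-based renaming used in the two answers above has been deprecated and later removed in newer pandas versions (it raises a SpecificationError). A sketch of the same summary using named aggregation instead, assuming pandas >= 0.25:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randint(1, 10, 9),
                   "B": np.random.randint(1, 10, 9),
                   "C": list("abbcacded")})

# One output column per (input column, aggregation) pair
res = df.groupby("C").agg(number=("C", "size"),
                          A_sum=("A", "sum"),
                          B_sum=("B", "sum"))

# Keep only the letters of interest, as in the question
print(res.loc[["a", "b"]])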

pandas - groupby and select a variable number of random values according to a column

Starting from this simple dataframe df:
df = pd.DataFrame({'c':[1,1,2,2,2,2,3,3,3], 'n':[1,2,3,4,5,6,7,8,9], 'N':[1,1,2,2,2,2,2,2,2]})
I'm trying to select N random values from n for each c. So far I managed to groupby and get one single element per group with:
sample = df.groupby('c').apply(lambda x: x.iloc[np.random.randint(0, len(x))])
that returns:
N c n
c
1 1 1 2
2 2 2 4
3 2 3 8
My expected output would be something like:
N c n
c
1 1 1 2
2 2 2 4
2 2 2 3
3 2 3 8
3 2 3 7
so getting 1 sample from c=1 and 2 samples for c=2 and c=3, according to the N column.
Pandas objects now have a .sample method that returns a random selection of rows:
>>> df.groupby('c').apply(lambda g: g.n.sample(g.N.iloc[0]))
c
1 1 2
2 5 6
2 3
3 6 7
7 8
Name: n, dtype: int64
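If you want the full rows (as in the expected output) rather than just the n column, a variant of the same idea should work; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'c': [1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'n': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'N': [1, 1, 2, 2, 2, 2, 2, 2, 2]})

# For each c group, sample N whole rows; group_keys=False keeps the
# original index instead of prepending the group key
sampled = df.groupby('c', group_keys=False).apply(lambda g: g.sample(g.N.iloc[0]))
print(sampled)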
