I have a large dataset (df) with lots of columns and I am trying to get the total number of rows for each day.
|datetime|id|col3|col4|col...
1 |11-11-2020|7|col3|col4|col...
2 |10-11-2020|5|col3|col4|col...
3 |09-11-2020|5|col3|col4|col...
4 |10-11-2020|4|col3|col4|col...
5 |10-11-2020|4|col3|col4|col...
6 |07-11-2020|4|col3|col4|col...
I want my result to be something like this
|datetime|id|col3|col4|col...|Count
6 |07-11-2020|4|col3|col4|col...| 1
3 |09-11-2020|5|col3|col4|col...| 1
2 |10-11-2020|5|col3|col4|col...| 1
4 |10-11-2020|4|col3|col4|col...| 2
1 |11-11-2020|7|col3|col4|col...| 1
I tried to resample per day using pd.Grouper, like this: df = df.groupby(['id','col3', pd.Grouper(key='datetime', freq='D')]).sum().reset_index(), and this is my result. I am still new to programming and Pandas; I have read through the pandas docs but am still unable to get the output I want.
|datetime|id|col3|col4|col...
6 |07-11-2020|4|col3|1|0.0
3 |07-11-2020|5|col3|1|0.0
2 |10-11-2020|5|col3|1|0.0
4 |10-11-2020|4|col3|2|0.0
1 |11-11-2020|7|col3|1|0.0
try this:
df = df.groupby(['datetime','id','col3']).count()
If you want the count values for all columns based only on the date, then:
df.groupby('datetime').count()
You'll get a DataFrame that has the datetime as the index, with each column cell holding the number of (non-null) entries for that date.
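If you want to keep every original row and simply append a Count column showing how many rows share the same date, as in your desired output, a transform-based sketch along these lines should work (the column names come from your example; the day-first date format is my assumption):
import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'], dayfirst=True)  # assumes dd-mm-yyyy strings
df['Count'] = df.groupby('datetime')['datetime'].transform('size')  # per-date row count on every row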
I am trying to add up the numbers inside the cells of one column of a Pandas DataFrame. The dataframe was created like this:
data = [['ID_123456', 'example=1(abc)'], ['ID_123457', 'example=1(def)'], ['ID_123458', 'example=1(try)'], ['ID_123459', 'example=1(try)'], ['ID_123460', 'example=1(try),2(test)'], ['ID_123461', 'example=1(try),2(test),9(yum)'], ['ID_123462', 'example=1(try)'], ['ID_123463', 'example=1(try),7(test)']]
df = pd.DataFrame(data, columns = ['ID', 'occ'])
display(df)
The table looks like this:
ID occ
ID_123456 example=1(abc)
ID_123457 example=1(def)
ID_123458 example=1(try)
ID_123459 example=1(try)
ID_123460 example=1(try),2(test)
ID_123461 example=1(try),2(test),9(yum)
ID_123462 example=1(try)
ID_123463 example=1(try),7(test)
The following question is related, but I was unable to run its command on my dataframe:
Sum all integers in a PANDAS DataFrame "cell"
The command gives an error of "string index out of range".
The output should look like this:
ID occ count
ID_123456 example=1(abc) 1
ID_123457 example=1(def) 1
ID_123458 example=1(try) 1
ID_123459 example=1(try) 1
ID_123460 example=1(try),2(test) 3
ID_123461 example=1(try),2(test),9(yum) 12
ID_123462 example=1(try) 1
ID_123463 example=1(try),7(test) 8
If you want to sum all the numbers in the occ column, use Series.str.extractall to pull out the digits, convert them to integers, and sum them per row:
df['count'] = df['occ'].str.extractall(r'(\d+)')[0].astype(int).groupby(level=0).sum()
print (df)
ID occ count
0 ID_123456 example=1(abc) 1
1 ID_123457 example=1(def) 1
2 ID_123458 example=1(try) 1
3 ID_123459 example=1(try) 1
4 ID_123460 example=1(try),2(test) 3
5 ID_123461 example=1(try),2(test),9(yum) 12
6 ID_123462 example=1(try) 1
7 ID_123463 example=1(try),7(test) 8
I have a dataframe with more than 4 million rows and 30 columns. Here is a sample of my patient dataframe:
df = pd.DataFrame({
'subject_ID':[1,1,1,1,1,2,2,2,2,2,3,3,3],
'date_visit':['1/1/2020 12:35:21','1/1/2020 14:35:32','1/1/2020 16:21:20','01/02/2020 15:12:37','01/03/2020 16:32:12',
'1/1/2020 12:35:21','1/3/2020 14:35:32','1/8/2020 16:21:20','01/09/2020 15:12:37','01/10/2020 16:32:12',
'11/01/2022 13:02:31','13/01/2023 17:12:31','16/01/2023 19:22:31'],
'item_name':['PEEP','Fio2','PEEP','Fio2','PEEP','PEEP','PEEP','PEEP','PEEP','PEEP','Fio2','Fio2','Fio2']})
I would like to do two things:
1) Find the subjects and their records which are missing in the sequence
2) Get the count of item_name for each subject
For q2, this is what I tried:
df.groupby(['subject_ID','item_name']).count() # this produces output, but the column names are not okay. Why does it show the count value under the `date_visit` column?
For q1, this is what I am trying:
df['day'].le(df['shift_date'].add(1))
I expect my output to look like the one shown below.
You can get the item_name counts per subject (your second question) with value_counts; the count() in your attempt shows a number under date_visit simply because count() reports the number of non-null values in every remaining column:
In [14]: df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)
Out[14]:
item_name Fio2 PEEP
subject_ID
1 2 3
2 0 5
3 3 0
EDIT:
I think you've still got your date formats a bit mixed up in your sample data, and I strongly recommend switching everything to the ISO 8601 standard, since that prevents problems like this down the road. pandas won't correctly parse that 11/01/2022 entry on its own, so I've manually fixed it in the sample.
Using what I assume these dates are supposed to be, you can find the gaps by grouping and using .resample():
In [73]: df['dates'] = pd.to_datetime(df['date_visit'])
In [74]: df.loc[10, 'dates'] = pd.to_datetime("2022-01-11 13:02:31")
In [75]: dates = df.groupby("subject_ID").apply(lambda x: x.set_index('dates').resample('D').first())
In [76]: dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
Out[76]:
subject_ID dates
0 2 2020-01-02
1 2 2020-01-04
2 2 2020-01-05
3 2 2020-01-06
4 2 2020-01-07
5 3 2022-01-12
6 3 2022-01-14
7 3 2022-01-15
You can then add a missing-sequence flag to that first frame by checking whether each subject_ID shows up in this new frame of gap dates.
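A minimal sketch of that last step, assuming the counts frame from Out[14] is stored as counts and the gap dates from Out[76] as gaps (both variable names are mine, not from the code above):
counts = df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)
gaps = dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
counts['has_gap'] = counts.index.isin(gaps['subject_ID'].unique())  # True if the subject is missing any day in its sequence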
I have a dataframe with individuals and their household IDs and I would like to create a variable that contains the household size.
I am using Python 3.7. I tried to use the groupby function combined with the size function (I tried count as well). The idea is that, for each observation about an individual, I want to count the number of observations in the dataframe with the same household ID and store it in a new variable.
Consider that each observation has a household ID (hh_id) and that I would like to store the household size in the hh_size variable.
I tried the following:
df['hh_size'] = df.groupby('hh_id').size
I expect hh_size variable to contain for each observation the household size. However, I get a column with only nan.
When I use df.groupby('hh_id').size alone, I get the expected result, but I cannot manage to store it in the hh_size variable.
For example:
individual hh_id hh_size
1 1 2
2 1 2
3 2 1
4 3 1
Thanks,
Julien
If I understand correctly, you have to convert the result to a new DataFrame with .to_frame(name='hh_size'), and you may have to reset the index.
import pandas as pd
df = pd.DataFrame({
'individual': [1,1,2,2,3,4],
'hh_id': [1,1,1,1,2,3],
})
sizes = df.groupby(['individual', 'hh_id']).size()
new_df = sizes.to_frame(name='hh_size').reset_index()
print(new_df)
Result:
individual hh_id hh_size
0 1 1 2
1 2 1 2
2 3 2 1
3 4 3 1
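If you want the household size stored on every row of the original dataframe, as in the expected output in the question, a transform-based sketch (my suggestion, using the question's example data) would be:
import pandas as pd

df = pd.DataFrame({
    'individual': [1, 2, 3, 4],
    'hh_id': [1, 1, 2, 3],
})
# transform('size') broadcasts each group's size back onto every row of that group
df['hh_size'] = df.groupby('hh_id')['hh_id'].transform('size')
print(df)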
I have a tall pandas dataframe called use with columns ID, Date, .... Each row is unique, but each ID has many rows, with one row per ID per date.
ID Date Other_data
1 1-1-01 10
2 1-1-01 23
3 1-1-01 0
1 1-2-01 11
3 1-2-01 1
1 1-3-01 9
2 1-3-01 20
3 1-3-01 2
I also have a list of unique ids, ids = use['ID'].drop_duplicates()
I want to find the intersection of all of the dates, that is, only the dates for which each ID has data. The end result in this toy problem should be [1-1-01, 1-3-01]
Currently, I loop through, subsetting by ID and taking the intersection. Roughly speaking, it looks like this:
dates = use['Date'].drop_duplicates()
for i in ids:
    id_dates = use[(use['ID'] == i)]['Date'].values
    dates = set(dates).intersection(id_dates)
This strikes me as horrifically inefficient. What is a more efficient way to identify dates where each ID has data?
Thanks very much!
Using crosstab: any date column that contains a 0 is a date where at least one ID has no data, which you can detect with df.eq(0).any().
df=pd.crosstab(use.ID,use.Date)
df
Out[856]:
Date 1-1-01 1-2-01 1-3-01
ID
1 1 1 1
2 1 0 1
3 1 1 1
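To get the final list of dates, keep only the crosstab columns that contain no zeros. A small sketch, assuming the crosstab above is stored as df:
has_all_ids = (df > 0).all()                    # one boolean per date column
print(has_all_ids[has_all_ids].index.tolist())  # ['1-1-01', '1-3-01']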
Find the unique IDs per date, then check if that's all of them.
gp = df.groupby('Date').ID.nunique()
gp[gp == df.ID.nunique()].index.tolist()
#['1-1-01', '1-3-01']
I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe, where the solutions use n as an absolute number rather than a percentage: Pandas get topmost n records within each group. However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows of each group. How would I approach this problem?
You can construct a Boolean series of flags from a groupby and use it to filter the dataframe. First let's create an example dataframe and look at the number of rows for each unique value in the first series:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resulting dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.
Here is another option which builds on some of the answers in the post you mentioned.
First of all, here is a quick function to either round up or round down. If we want the top 30% of the rows of a dataframe 8 rows long, we would try to take 2.4 rows, so we will need to either round up or round down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had one group with only one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish.
import math

def round_func(x, up=True):
    '''Round a float up (ceiling) or down (floor) to an integer.'''
    if up:
        return math.ceil(x)
    else:
        return math.floor(x)
Next I make a dataframe to work with and set a parameter p, the fraction of the rows from each group that we should keep. Everything else follows, and I have commented the code so that hopefully you can follow along.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30 # top fraction to keep. Currently set to 30%
df_top = df.groupby('id').apply( # group by the ids
lambda x: x.reset_index()['value'].nlargest( # in each group take the top rows by column 'value'
round_func(x.count().max()*p))) # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1) # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1