How do I map one dataframe into another with fewer rows, summing the values of rows whose indices fall within a given interval?
For example, given df:

     Survived
Age
20          1
22          1
23          3
24          2
30          2
33          1
40          8
42          7

Desired df (for interval = 5):

     Survived
Age
20          7
25          0
30          3
35          0
40         15

Desired df (for interval = 10):

     Survived
Age
20          7
30          3
40         15
You can use a function for the groupby argument:
In [6]: df.groupby(lambda x: x//10 * 10).sum()
Out[6]:
    Survived
20         7
30         3
40        15
Note that this also works with an interval of 5, but it doesn't handle empty groups the way you want: it doesn't fill them in with zeroes!
In [12]: df.groupby(lambda x: x//5 * 5).sum()
Out[12]:
    Survived
20         7
30         3
40        15
However, if the data contained values for those in-between groups, you can see that the 5-wide grouping works:
In [18]: df
Out[18]:
     Survived
Age
20          1
22          1
23          3
24          2
26         99
30          2
33          1
40          8
42          7
47         99

In [19]: df.groupby(lambda x: x//5 * 5).sum()
Out[19]:
    Survived
20         7
25        99
30         3
40        15
45        99
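If you do want the empty 5-wide groups to appear as zeroes, one option (a sketch, not part of the original answer) is to reindex the grouped result over the full set of bin labels:

grouped = df.groupby(lambda x: x // 5 * 5).sum()
# build every 5-wide bin label between the first and last, filling gaps with 0
full_bins = range(grouped.index.min(), grouped.index.max() + 5, 5)
grouped = grouped.reindex(full_bins, fill_value=0)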
First convert the int index to a TimedeltaIndex and then resample:
df.index = pd.TimedeltaIndex(df.index.to_series(), unit='s')
print (df)
          Survived
00:00:20         1
00:00:22         1
00:00:23         3
00:00:24         2
00:00:30         2
00:00:33         1
00:00:40         8
00:00:42         7

df1 = df.resample('5S').sum().fillna(0)
df1.index = df1.index.seconds
print (df1)
    Survived
20       7.0
25       0.0
30       3.0
35       0.0
40      15.0

df2 = df.resample('10S').sum().fillna(0)
df2.index = df2.index.seconds
print (df2)
    Survived
20         7
30         3
40        15
EDIT:
It also works nicely when Age goes above 60:
print (df)
     Survived
Age
20          1
22          1
23          3
24          2
30          2
33          1
40          8
42          7
60          8
62          7
70          8
72          7

df.index = pd.TimedeltaIndex(df.index.to_series(), unit='s')
df1 = df.resample('5S').sum().fillna(0)
df1.index = df1.index.seconds
print (df1)
    Survived
20       7.0
25       0.0
30       3.0
35       0.0
40      15.0
45       0.0
50       0.0
55       0.0
60      15.0
65       0.0
70      15.0

df2 = df.resample('10S').sum().fillna(0)
df2.index = df2.index.seconds
print (df2)
    Survived
20       7.0
30       3.0
40      15.0
50       0.0
60      15.0
70      15.0
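The whole approach can be wrapped into a small helper parameterized by the interval (a sketch; interval_sum is a hypothetical name, not from the answer):

import pandas as pd

def interval_sum(df, interval):
    # treat the integer Age index as seconds so resample can bin it
    out = df.copy()
    out.index = pd.TimedeltaIndex(out.index.to_series(), unit='s')
    out = out.resample('{}S'.format(interval)).sum().fillna(0)
    out.index = out.index.seconds  # back to plain integers
    return out

df1 = interval_sum(df, 5)
df2 = interval_sum(df, 10)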
You can create a new column from the Age column and then use groupby. In order to create the new column, Age first needs to be taken out of the index:
df.reset_index(inplace=True)

def cat_age(age):
    return 10 * int(age / 10.)

df['category_age'] = df.Age.apply(cat_age)
df.groupby('category_age', as_index=False).agg({'Survived': sum})

Output:

   category_age  Survived
0            20         7
1            30         3
2            40        15
Of course, if you want to change the categories, you can pass the interval to cat_age:
def cat_age(age, interval):
    return interval * int(1. * age / interval)
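A usage sketch for the parameterized version, reusing the frame from above:

df['category_age'] = df.Age.apply(lambda x: cat_age(x, 5))
df.groupby('category_age', as_index=False).agg({'Survived': sum})

As with the plain groupby approach, empty bins simply won't appear in the output.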
Related
I currently have two dataframes, df_ages and df_count:
In [1]: df_ages
Out[1]:
   Enrolled  Age
1         Y   44
2         Y   35
3         N   37
4         Y   55
5         N   26
6         Y   19
7         N   18
8         N   49
9         Y   26
10        Y   25
11        Y   25
12        Y   32
13        Y   25
14        N   50
15        N   58

In [2]: df_count
Out[2]:
   Min  Max  counts  percentage
1   18   25
2   26   35
3   36   45
4   46   55
5   56   65
I am looking for code to populate the df_count['counts'] column with the number of people whose age falls within the Min/Max range in the preceding columns. The ['percentage'] column should hold each range's share of the total number of entries.
The desired output is shown below:

In [2]: df_count
Out[2]:
   Min  Max  counts  percentage
1   18   25       5        33.3
2   26   35       4        26.7
3   36   45       2        13.3
4   46   55       3        20.0
5   56   65       1         6.7
You can try apply over the rows together with Series.between:
# for each (Min, Max) row, count the ages that fall in the inclusive range
df_count['counts'] = df_count.apply(lambda row: df_ages['Age'].between(row['Min'], row['Max']).sum(), axis=1)
# express the counts as a percentage of all entries
df_count['percentage'] = df_count['counts'].div(len(df_ages)).mul(100).round(1)
print(df_count)

   Min  Max  counts  percentage
0   18   25       5        33.3
1   26   35       4        26.7
2   36   45       2        13.3
3   46   55       3        20.0
4   56   65       1         6.7
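Since the ranges here are contiguous and non-overlapping, pd.cut offers a vectorized alternative to the row-wise apply (a sketch, assuming integer ages and the frames above):

# bin edges: one below the first Min, then each Max (intervals are right-closed)
bins = [df_count['Min'].iloc[0] - 1] + df_count['Max'].tolist()
counts = pd.cut(df_ages['Age'], bins=bins).value_counts(sort=False)
df_count['counts'] = counts.to_numpy()
df_count['percentage'] = df_count['counts'].div(len(df_ages)).mul(100).round(1)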
Assume the following:
df1:
x y z
1 10 11
2 20 22
3 30 33
4 40 44
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
3 40 43
4 10 14
4 20 24
4 30 34
df2:
x b
1 100
2 200
df3:
y c
10 1000
20 2000
I want all rows from df1 for which either x appears in df2 or y appears in df3, meaning in this case:
out:
x y z
1 10 11
2 20 22
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
4 10 14
4 20 24
I would like to do this in pure pandas, with no for loops. It seems standard enough, but I don't really know what to look for.
You can use isin for both conditions, chain them with a bitwise OR, and use the result for boolean indexing on the dataframe:
df1[df1.x.isin(df2.x) | df1.y.isin(df3.y)]
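Spelled out with intermediate masks (the same logic, just more explicit):

# rows whose x appears in df2's x column
mask_x = df1['x'].isin(df2['x'])
# rows whose y appears in df3's y column
mask_y = df1['y'].isin(df3['y'])
# | is the elementwise OR; keep rows satisfying either condition
out = df1[mask_x | mask_y]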
I have a dataframe with three columns:
a b c
0 73 12
73 80 2
80 100 5
100 150 13
Values in "a" and "b" are days. I need to find the average values of "c" in each 30 day-interval (slice values inside [min(a),max(b)] in 30 days and calculate average of c). I want as a result have a dataframe like this:
aa bb c_avg
0 30 12
30 60 12
60 90 6.33
90 120 9
120 150 13
Another sample of data could be:

        a       b           c
0  1264.0  1629.0    0.000000
1  1629.0  1632.0  133.333333
6  1632.0  1699.0    0.000000
2  1699.0  1706.0   21.428571
7  1706.0  1723.0    0.000000
3  1723.0  1726.0   50.000000
8  1726.0  1890.0    0.000000
4  1890.0  1893.0   33.333333
1  1893.0  1994.0    0.000000
How can I get to the final table?
First create a ranges DataFrame covering the span defined by the a and b columns:
import numpy as np
import pandas as pd

a = np.arange(0, 180, 30)
df1 = pd.DataFrame({'aa': a[:-1], 'bb': a[1:]})
#print (df1)
Then cross join all rows via a helper column tmp:
df3 = pd.merge(df1.assign(tmp=1), df.assign(tmp=1), on='tmp')
#print (df3)
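On pandas 1.2 and newer, the helper column isn't needed; a cross merge does the same thing:

# equivalent cross join without the tmp column (requires pandas >= 1.2)
df3 = pd.merge(df1, df, how='cross')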
Finally, filter. There are two possible solutions, depending on which columns drive the filtering: keep rows where a bin edge (aa or bb) falls inside [a, b], or keep rows where an interval endpoint (a or b) falls inside [aa, bb]:
df4 = df3[df3['aa'].between(df3['a'], df3['b']) | df3['bb'].between(df3['a'], df3['b'])]
print (df4)
     aa   bb  tmp    a    b   c
0     0   30    1    0   73  12
4    30   60    1    0   73  12
8    60   90    1    0   73  12
10   60   90    1   80  100   5
14   90  120    1   80  100   5
15   90  120    1  100  150  13
19  120  150    1  100  150  13

df4 = df4.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df4)
    aa   bb     c
0    0   30  12.0
1   30   60  12.0
2   60   90   8.5
3   90  120   9.0
4  120  150  13.0
df5 = df3[df3['a'].between(df3['aa'], df3['bb']) | df3['b'].between(df3['aa'], df3['bb'])]
print (df5)
     aa   bb  tmp    a    b   c
0     0   30    1    0   73  12
8    60   90    1    0   73  12
9    60   90    1   73   80   2
10   60   90    1   80  100   5
14   90  120    1   80  100   5
15   90  120    1  100  150  13
19  120  150    1  100  150  13

df5 = df5.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df5)
    aa   bb          c
0    0   30  12.000000
1   60   90   6.333333
2   90  120   9.000000
3  120  150  13.000000
I am working with a pandas df that looks like this:
ID time
34 43
2 99
2 20
34 8
2 90
What would be the best approach to create a variable that represents the difference from the most recent time per ID?
ID time diff
34 43 35
2 99 9
2 20 NA
34 8 NA
2 90 70
Here's one possibility
df["diff"] = df.sort_values("time").groupby("ID")["time"].diff()
df
ID time diff
0 34 43 35.0
1 2 99 9.0
2 2 20 NaN
3 34 8 NaN
4 2 90 70.0
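Note that the sorted result realigns to the original row order because assignment matches on index labels, not position. A minimal reproducible sketch of the same idea:

import pandas as pd

df = pd.DataFrame({"ID": [34, 2, 2, 34, 2], "time": [43, 99, 20, 8, 90]})
# sort by time so diff() compares consecutive times within each ID;
# the result realigns to df's original index on assignment
df["diff"] = df.sort_values("time").groupby("ID")["time"].diff()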
I have a dataframe with, say, 4 columns [['a','b','c','d']], to which I add a column ['total'] containing the sum of the other columns for each row. I then add another column ['growth of total'] with the growth rate of the total.
Some of the values in [['a','b','c','d']] are blank, rendering the ['total'] column invalid for those rows. I can easily get rid of these rows with df.dropna(how='any').
However, my growth rate will be invalid not only for rows with missing values in [['a','b','c','d']] but also for the row that follows each of them. How do I drop all of these rows?
IIUC, you can use notnull with all to mask off any rows with NaN and any rows that follow NaN rows:
In [43]:
df = pd.DataFrame({'a': [0, np.NaN, 2, 3, np.NaN], 'b': [np.NaN, 1, 2, 3, 4], 'c': [0, np.NaN, 2, 3, 4]})
df
Out[43]:
     a   b    c
0    0 NaN    0
1  NaN   1  NaN
2    2   2    2
3    3   3    3
4  NaN   4    4

In [44]:
df[df.notnull().all(axis=1) & df.shift().notnull().all(axis=1)]
Out[44]:
   a  b  c
3  3  3  3
Here's one option that I think does what you're looking for:
In [76]: df = pd.DataFrame(np.arange(40).reshape(10,4))
In [77]: df.iloc[1, 2] = np.nan
In [78]: df.iloc[6, 1] = np.nan
In [79]: df['total'] = df.sum(axis=1, skipna=False)
In [80]: df
Out[80]:
    0    1    2   3  total
0   0    1    2   3      6
1   4    5  NaN   7    NaN
2   8    9   10  11     38
3  12   13   14  15     54
4  16   17   18  19     70
5  20   21   22  23     86
6  24  NaN   26  27    NaN
7  28   29   30  31    118
8  32   33   34  35    134
9  36   37   38  39    150

In [81]: df['growth'] = df['total'].iloc[1:] - df['total'].values[:-1]
In [82]: df
Out[82]:
    0    1    2   3  total  growth
0   0    1    2   3      6     NaN
1   4    5  NaN   7    NaN     NaN
2   8    9   10  11     38     NaN
3  12   13   14  15     54      16
4  16   17   18  19     70      16
5  20   21   22  23     86      16
6  24  NaN   26  27    NaN     NaN
7  28   29   30  31    118     NaN
8  32   33   34  35    134      16
9  36   37   38  39    150      16
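Incidentally, the subtraction in In [81] is just a first difference, so the built-in diff produces the same growth column, NaNs included:

# equivalent to df['total'].iloc[1:] - df['total'].values[:-1]
df['growth'] = df['total'].diff()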