I have a pandas df, like this:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
I would like calculate the days of last value. Like the bellow example. How can I using diff() for this? And the calculus change by ID.
Output:
ID date value days_last_value
0 10 2022-01-01 100 0
1 10 2022-01-02 150 1
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200 3
5 10 2022-01-06 0
6 10 2022-01-07 150 2
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100 5
12 23 2022-02-01 490 0
13 23 2022-02-02 0
14 23 2022-02-03 350 2
15 23 2022-02-04 333 1
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211 4
20 23 2022-02-09 100 1
Explanation below.
import pandas as pd
df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100, 490, 0, 350, 333, 0, 0, 0, 211, 100]})
days = df.groupby(['ID', (df['value'] != 0).cumsum()]).size().groupby('ID').shift(fill_value=0)
days.index = df.index[df['value'] != 0]
df['days_last_value'] = days
df
ID value days_last_value
0 10 100 0.0
1 10 150 1.0
2 10 0 NaN
3 10 0 NaN
4 10 200 3.0
5 10 0 NaN
6 10 150 2.0
7 10 0 NaN
8 10 0 NaN
9 10 0 NaN
10 10 0 NaN
11 10 100 5.0
12 23 490 0.0
13 23 0 NaN
14 23 350 2.0
15 23 333 1.0
16 23 0 NaN
17 23 0 NaN
18 23 0 NaN
19 23 211 4.0
20 23 100 1.0
First, we'll have to group by 'ID'.
We also creates groups for each block of days, by creating a True/False series where value is not 0, then performing a cumulative sum. That is the part (df['value'] != 0).cumsum(), which results in
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 4
8 4
9 4
10 4
11 5
12 6
13 6
14 7
15 8
16 8
17 8
18 8
19 9
20 10
We can use the values in this series to also group on; combining that with the 'ID' group, you have the individual blocks of days. This is the df.groupby(['ID', (df['value'] != 0).cumsum()]) part.
Now, for each block, we get its size, which is obviously the interval in days; which is what you want. We do need to shift one up, since we've counted the total number of days per group, and the difference would be one less; and fill with 0 at the bottom. But this shift has to be by ID group, so we first group by ID again before shifting (as we lost the grouping after doing .size()).
Now, this new series needs to get assigned back to the dataframe, but it's obviously shorter. Since its index it also reset, we can't easily reassign it (not with df['days_last_value'], df.loc[...] or df.iloc).
Instead, we select the index values of the original dataframe where value is not zero, and set the index of the days equal to that.
Now, it's easy step to directly assign the days to relevant column in the dataframe: Pandas will match the indices.
I have a pandas.DataFrame of the form. I'll show you a simple example.(In reality, it consists of hundreds of millions of rows of data.).
I want to change the number as the letter in column '2' changes. Numbers in the remaining columns (columns:1,3 ~) should not change.
df=
index 1 2 3
0 0 a100 1
1 1.04 a100 2
2 32 a100 3
3 5.05 a105 4
4 1.01 a105 5
5 155 a105 6
6 3155.26 a105 7
7 354.12 a100 8
8 5680.13 a100 9
9 125.55 a100 10
10 13.32 a100 11
11 5656.33 a156 12
12 456.61 a156 13
13 23.52 a1235 14
14 35.35 a1235 15
15 350.20 a100 16
16 30. a100 17
17 13.50 a100 18
18 323.13 a231 19
19 15.11 a1111 20
20 11.22 a1111 21
Here is my expected result:
df=
index 1 2 3
0 0 0 1
1 1.04 0 2
2 32 0 3
3 5.05 1 4
4 1.01 1 5
5 155 1 6
6 3155.26 1 7
7 354.12 2 8
8 5680.13 2 9
9 125.55 2 10
10 13.32 2 11
11 5656.33 3 12
12 456.61 3 13
13 23.52 4 14
14 35.35 4 15
15 350.20 5 16
16 30 5 17
17 13.50 5 18
18 323.13 6 19
19 15.11 7 20
20 11.22 7 21
How do I solve this problem?
Use consecutive groups created by compare for not equal shifted values with cumulative sum and then subtract 1:
#if column is string '2'
df['2'] = df['2'].ne(df['2'].shift()).cumsum().sub(1)
#if column is number 2
df[2] = df[2].ne(df[2].shift()).cumsum().sub(1)
print (df)
index 1 2 3
0 0 0.00 0 1
1 1 1.04 0 2
2 2 32.00 0 3
3 3 5.05 1 4
4 4 1.01 1 5
5 5 155.00 1 6
6 6 3155.26 1 7
7 7 354.12 2 8
8 8 5680.13 2 9
9 9 125.55 2 10
10 10 13.32 2 11
11 11 5656.33 3 12
12 12 456.61 3 13
13 13 23.52 4 14
14 14 35.35 4 15
15 15 350.20 5 16
16 16 30.00 5 17
17 17 13.50 5 18
18 18 323.13 6 19
19 19 15.11 7 20
20 20 11.22 7 21
I have a data frame that looks like this
DRUG READING1 READING2 READING3 TEMPERATURE
A 1 2 3 12
A 1 2 3 12
A 2 2 1 14
A 2 3 3 16
B 8 9 7 12
B 8 8 8 14
B 8 9 9 14
B 9 8 7 16
C 12 11 12 12
C 11 11 11 12
C 12 11 11 14
C 12 11 11 14
C 11 12 12 14
C 11 12 11 16
C 11 12 12 16
D 20 21 21 12
Now I want to replace the values of a particular Drug 'C' with the average values of the readings belonging to the particular temperature for the final outcome to be something like this
DRUG READING1 READING2 READING3 TEMPERATURE
A 1 2 3 12
A 1 2 3 12
A 2 2 1 14
A 2 3 3 16
B 8 9 7 12
B 8 8 8 14
B 8 9 9 14
B 9 8 7 16
C 11.5 11 11.5 12
C 11.5 11 11.5 12
C 11.7 11.3 11.3 14
C 11.7 11.3 11.3 14
C 11.7 11.3 11.3 14
C 11 12 11 16
D 20 21 21 12
or other way to see this is:
DRUG READING1 READING2 READING3 TEMPERATURE
A 1 2 3 12
A 1 2 3 12
A 2 2 1 14
A 2 3 3 16
B 8 9 7 12
B 8 8 8 14
B 8 9 9 14
B 9 8 7 16
C 11.5 11 11.5 12
C 11.7 11.3 11.3 14
C 11 12 11 16
D 20 21 21 12
This is groupby().mean():
df.groupby(['DRUG', 'TEMPERATURE'], as_index=False).mean()
Output:
DRUG TEMPERATURE READING1 READING2 READING3
0 A 12 1.000000 2.000000 3.000000
1 A 14 2.000000 2.000000 1.000000
2 A 16 2.000000 3.000000 3.000000
3 B 12 8.000000 9.000000 7.000000
4 B 14 8.000000 8.000000 8.000000
5 B 16 9.000000 8.000000 7.000000
6 C 12 11.500000 11.000000 11.500000
7 C 14 11.666667 11.333333 11.333333
8 C 16 11.000000 12.000000 11.500000
9 D 12 20.000000 21.000000 21.000000
I have a column as follows:
A B
0 0 20.00
1 1 35.00
2 2 75.00
3 3 29.00
4 4 125.00
5 5 16.00
6 6 52.50
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.20
17 17 27.44
18 18 57.01
19 19 29.88
I want to change the values of the column as follows
if 0<B<10.0, then Replace the cell value of B by "0 to 10"
if 10.1<B<20.0, then Replace the cell value of B by "10 to 20"
continue like this until the maximum range achieved.
I have tried
ds['B'] = np.where(ds['B'].between(10.0,20.0), "10 to 20", ds['B'])
But once I perform this operation, the DataFrame is occupied by the string "10 to 20" so I cannot perform this operation again for the remaining values of the DataFrame. After this step, the DataFrame looks like this:
A B
0 0 10 to 20
1 1 35.0
2 2 75.0
3 3 29.0
4 4 125.0
5 5 10 to 20
6 6 52.5
7 7 nan
8 8 nan
9 9 nan
10 10 nan
11 11 nan
12 12 nan
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.2
17 17 27.44
18 18 57.01
19 19 29.88
And the following line: ds['B'] = np.where(ds['B'].between(20.0,30.0), "20 to 30", ds['B']) will throw TypeError: '>=' not supported between instances of 'str' and 'float'
How can i code this to change all of the values in the DataFrame to these strings of ranges all at once?
Build your bins and labels and use pd.cut:
bins = np.arange(0, df["B"].max() // 10 * 10 + 10, 10).astype(int)
labels = [' to '.join(t) for t in zip(bins[:-1].astype(str), bins[1:].astype(str))]
df["B"] = pd.cut(df["B"], bins=bins, labels=labels)
>>> df
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 NaN
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
This can be done with much less code as this is actually just a matter of string formatting.
ds['B'] = ds['B'].apply(lambda x: f'{int(x/10) if x>=10 else ""}0 to {int(x/10)+1}0' if pd.notnull(x) else x)
You can create a custom function that maps each range to a string. For example, 19.0 will be mapped to "10 to 20", and then apply this function to each row.
I've written the code so that the minimum and maximum of the range is generalizable to the DataFrame, and takes on values that are multiples of 10.
import numpy as np
import pandas as pd
## copy and paste your DataFrame
ds = pd.read_clipboard()
# floor to nearest multiple of 10
ds_min = ds['B'].min()//10*10
# ceiling to the nearest multiple of 10
ds_max = round(ds['B'].max(),-1)
ranges = np.linspace(ds_min, ds_max, ((ds_max-ds_min)/10)+1)
def map_value_to_string(value):
for idx in range(1,len(ranges)):
low_value, high_value = ranges[idx-1], ranges[idx]
if low_value < value <= high_value:
return f"{int(low_value)} to {int(high_value)}"
else:
continue
ds['B'] = ds['B'].apply(lambda x: map_value_to_string(x))
Output:
>>> ds
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 None
8 8 None
9 9 None
10 10 None
11 11 None
12 12 None
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
I'm working with a large dataset. The following is an example, calculated with a smaller dataset.
In this example i got the measurements of the pollution of 3 rivers for different timespans. Each year, the amount pollution of a river is measured at a measuring station downstream ("pollution"). It has already been calculated, in which year the river water was polluted upstream ("year_of_upstream_pollution"). My goal ist to create a new column ["result_of_upstream_pollution"], which contains the amount of pollution connected to the "year_of_upstream_pollution". For this, the data from the "pollution"-column has to be reassigned.
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
y1 = [2002,2002,2003,2005,2005,np.NaN,1991,1992,1993,1994,np.NaN,np.NaN,2012,2012,2013,2014,2015,np.NaN]
poll = [10,14,20,11,8,11,
20,22,20,25,18,21,
30,19,15,10,26,28]
dictr1 ={"river_id":ids,"year":year,"pollution": poll,"year_of_upstream_pollution":y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)
river_id year pollution year_of_upstream_pollution
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
Example: river_id = 1, year = 2000, year_of_upstream_pollution = 2002
value of the pollution-column in year 2002 = 20
Therefore: result_of_upstream_pollution = 20
The resulting column should look like this:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
My own approach:
### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])
# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)
len(listr) # = 28
This results in the following ValueError:
"Length of values does not match length of index"
My explanation for this is, that the values in the "year"-column of "dfr3" are not unique, which leads to several numbers being assigned to each year and explains why: len(listr) = 28
I haven't been able to find a way around this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be much appreciated!
As you said in the title, this is merge on two column:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(df)
Output:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
I just realized that this solution doesn't seem to be working for me.
When i execute the code, this is what happens:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(dfr1)
river_id year pollution year_of_upstream_pollution \
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 22.0
6 20.0
7 25.0
8 18.0
9 15.0
10 15.0
11 10.0
12 26.0
13 28.0
14 NaN
15 NaN
16 NaN
17 NaN
For some reason, this code doesn't seem to be handling the "NaN" values in the right way.
If there is an "NaN"-value (in the column: "year_of_upstream_pollution"), there shouldnt be a value in "result_of_upstream_pollution".
Equally, the ids 14,15 and 16 all have values for the "year_of_upstream_pollution" which has matching data in the "pollution-column" and therefore should also have values in the result-column.
On top of that, it seems that all values after the first "NaN" (at id = 5) are assigned the wrong values.
#Quang Hoang Thank you very much for trying to solve my problem! Could you maybe explain why my results differ from yours?
Does anyone know how i can get this code to work?