I have a DataFrame with a column named 'Pressure' that contains repeated values, and I want to categorize it into segments. The column looks like this:
pressure
0.03
0.03
0.03
2.07
2.07
2.07
3.01
3.01
I have tried the groupby() method but was not able to create a segment column. I think there is an easy way to do this in pandas; can anybody help me with it?
I need an output like this
Pressure Segment
0.03 1
0.03 1
0.03 1
2.07 2
2.07 2
2.07 2
3.01 3
3.01 3
Thanks in advance
Use factorize if performance is important:
data["Segment"]= pd.factorize(data["pressure"])[0] + 1
print (data)
pressure Segment
0 0.03 1
1 0.03 1
2 0.03 1
3 2.07 2
4 2.07 2
5 2.07 2
6 3.01 3
7 3.01 3
Performance:
data = pd.DataFrame({'pressure': np.sort(np.random.randint(1000, size=10000)) / 100})
In [312]: %timeit data["pressure"].replace({j: i for i,j in enumerate(data["pressure"].unique(),start=1)}).astype("int")
141 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [313]: %timeit pd.factorize(data["pressure"])[0] + 1
751 µs ± 3.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
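For reference, a minimal self-contained version of the factorize approach, rebuilt from the sample values in the question (lowercase column name assumed, as used in this answer):

import pandas as pd

# sample values from the question
data = pd.DataFrame({"pressure": [0.03, 0.03, 0.03, 2.07, 2.07, 2.07, 3.01, 3.01]})
# factorize labels groups in order of first appearance, starting at 0, so add 1
data["Segment"] = pd.factorize(data["pressure"])[0] + 1
print(data)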
Create a dict that maps each unique value in the pressure column to a label, then use replace:
d = {j: i for i,j in enumerate(data["Pressure"].unique(),start=1)}
data["Segment"]= data["Pressure"].replace(d).astype("int")
print(data)
Output:
Pressure Segment
0.03 1
0.03 1
0.03 1
2.07 2
2.07 2
2.07 2
3.01 3
3.01 3
I have a df with two columns, ver and time, and I want to use the values in ver as column headers in a new df.
This is what my data looks like:
ver   time
a     2.31
b     3.45
b     3.75
a     2.21
b     3.87
b     4.02
a     1.97
a     3.56
This is what I am trying to get:
   a     b
2.31  3.45
2.21  3.75
1.97  3.87
3.56  4.02
Try with cumcount to create a key, then pivot:
out = df.assign(key=df.groupby('ver').cumcount()).pivot(index='key', columns='ver', values='time')
ver a b
key
0 2.31 3.45
1 2.21 3.75
2 1.97 3.87
3 3.56 4.02
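For reference, a self-contained sketch of the same cumcount + pivot idea, rebuilt from the sample data in the question:

import pandas as pd

# sample data from the question
df = pd.DataFrame({'ver':  ['a', 'b', 'b', 'a', 'b', 'b', 'a', 'a'],
                   'time': [2.31, 3.45, 3.75, 2.21, 3.87, 4.02, 1.97, 3.56]})
# number the rows within each 'ver' group, then use that counter as the new index
out = df.assign(key=df.groupby('ver').cumcount()).pivot(index='key', columns='ver', values='time')
print(out)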
My dataframe has multiple values per day. I want to extract the value from the last timestamp of each day.
Date_Timestamp Values
2010-01-01 11:00:00 2.5
2010-01-01 15:00:00 7.1
2010-01-01 23:59:00 11.1
2010-02-01 08:00:00 12.5
2010-02-01 17:00:00 37.1
2010-02-01 23:53:00 71.1
output:
Date_Timestamp Values
2010-01-01 23:59:00 11.1
2010-02-01 23:53:00 71.1
df['Date_Timestamp']=pd.to_datetime(df['Date_Timestamp'])
df.groupby(df['Date_Timestamp'].dt.date)['Values'].apply(lambda x: x.tail(1))
Use pandas.core.groupby.GroupBy.last
This is a vectorized method that is incredibly fast compared to .apply.
# given dataframe df with Date_Timestamp as a datetime
dfg = df.groupby(df.Date_Timestamp.dt.date).last().reset_index(drop=True)
# display(dfg)
Date_Timestamp Values
2010-01-01 23:59:00 11.1
2010-02-01 23:53:00 71.1
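For reference, a minimal self-contained run of the groupby(...).last() approach with the sample rows from the question:

import pandas as pd

# sample rows from the question
df = pd.DataFrame({'Date_Timestamp': ['2010-01-01 11:00:00', '2010-01-01 15:00:00', '2010-01-01 23:59:00',
                                      '2010-02-01 08:00:00', '2010-02-01 17:00:00', '2010-02-01 23:53:00'],
                   'Values': [2.5, 7.1, 11.1, 12.5, 37.1, 71.1]})
df['Date_Timestamp'] = pd.to_datetime(df['Date_Timestamp'])
# group by the calendar date and keep the last row of each group
dfg = df.groupby(df.Date_Timestamp.dt.date).last().reset_index(drop=True)
print(dfg)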
timeit test
import pandas as pd
import numpy as np
from datetime import datetime
# test data with 2M rows
np.random.seed(365)
rows = 2000000
df = pd.DataFrame({'datetime': pd.bdate_range(datetime(2020, 1, 1), freq='h', periods=rows).tolist(),
                   'values': np.random.rand(rows) * 1000})
# display(df.head())
datetime values
2020-01-01 00:00:00 941.455743
2020-01-01 01:00:00 641.602705
2020-01-01 02:00:00 684.610467
2020-01-01 03:00:00 588.562066
2020-01-01 04:00:00 543.887219
%%timeit
df.groupby(df.datetime.dt.date).last().reset_index(drop=True)
[out]:
100k: 39.8 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
200k: 80.7 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
400k: 164 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2M: 791 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This answer, with apply, is terrible
# I let it run for 1.5 hours and it didn't finish
# I reran the test for this answer at 100k and 200k rows
%%timeit
df.groupby(df.datetime.dt.date)['values'].apply(lambda x: x.tail(1))
[out]:
100k: 2.42 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
200k: 8.77 s ± 328 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
400k: 38.2 s  # I only did %%time instead of %%timeit - it takes too long
800k: 2min 54s
[pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
Is there any way to speed up this operation?
Basically, for a given date range I am creating all the dates in between.
Use DataFrame.itertuples:
L = [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
Or zip of both columns:
L = [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
If you want to join them together:
s = pd.concat(L, ignore_index=True)
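A minimal self-contained sketch of the zip variant plus the concat step, using a small made-up pair of date ranges:

import pandas as pd

# two tiny made-up ranges just to show the shape of the result
df = pd.DataFrame({'START_DATE': pd.to_datetime(['2015-01-01', '2015-01-03']),
                   'END_DATE':   pd.to_datetime(['2015-01-03', '2015-01-05'])})
L = [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
s = pd.concat(L, ignore_index=True)
print(s)  # one long Series with all dates from both ranges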
Performance for 100 rows:
np.random.seed(123)
def random_dates(start, end, n=100):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
df = pd.DataFrame({'START_DATE': start, 'END_DATE':random_dates(start, end)})
print (df)
In [155]: %timeit [pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
33.5 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [156]: %timeit [pd.date_range(row[1].START_DATE, row[1].END_DATE) for row in df[['START_DATE', 'END_DATE']].iterrows()]
30.3 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [157]: %timeit [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
25.3 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
24.3 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And for 1000 rows:
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
df = pd.DataFrame({'START_DATE': start, 'END_DATE':random_dates(start, end, n=1000)})
In [159]: %timeit [pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
333 ms ± 3.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [160]: %timeit [pd.date_range(row[1].START_DATE, row[1].END_DATE) for row in df[['START_DATE', 'END_DATE']].iterrows()]
314 ms ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [161]: %timeit [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
243 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [162]: %timeit [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
246 ms ± 2.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Instead of creating a pd.Series on each iteration, do:
[pd.date_range(row[1].START_DATE, row[1].END_DATE)
 for row in df[['START_DATE', 'END_DATE']].iterrows()]
And create a dataframe from the result. Here's an example:
df = pd.DataFrame([
    {'start_date': pd.Timestamp(2019, 1, 1), 'end_date': pd.Timestamp(2019, 1, 10)},
    {'start_date': pd.Timestamp(2019, 1, 2), 'end_date': pd.Timestamp(2019, 1, 8)},
    {'start_date': pd.Timestamp(2019, 1, 6), 'end_date': pd.Timestamp(2019, 1, 14)}
])
dr = [pd.date_range(df.loc[i,'start_date'], df.loc[i,'end_date']) for i,_ in df.iterrows()]
pd.DataFrame(dr)
0 1 2 3 4 5 \
0 2019-01-01 2019-01-02 2019-01-03 2019-01-04 2019-01-05 2019-01-06
1 2019-01-02 2019-01-03 2019-01-04 2019-01-05 2019-01-06 2019-01-07
2 2019-01-06 2019-01-07 2019-01-08 2019-01-09 2019-01-10 2019-01-11
6 7 8 9
0 2019-01-07 2019-01-08 2019-01-09 2019-01-10
1 2019-01-08 NaT NaT NaT
2 2019-01-12 2019-01-13 2019-01-14 NaT
My data frame is as follows
selection_id last_traded_price
430494 1.46
430494 1.48
430494 1.56
430494 1.57
430495 2.45
430495 2.67
430495 2.72
430495 2.87
I have lots of rows containing selection ids, and I need to keep the selection_id column the same but transpose the data in last_traded_price to look like this:
selection_id last_traded_price
430494 1.46 1.48 1.56 1.57 e.t.c
430495 2.45 2.67 2.72 2.87 e.t.c
I've tried to use a pivot:
df.pivot(index='selection_id', columns='last_traded_price', values='last_traded_price')
Pivot isn't working due to duplicate rows in selection_id.
Is it possible to transpose the data first and drop the duplicates after?
Option 1
groupby + apply
v = df.groupby('selection_id').last_traded_price.apply(list)
pd.DataFrame(v.tolist(), index=v.index)
0 1 2 3
selection_id
430494 1.46 1.48 1.56 1.57
430495 2.45 2.67 2.72 2.87
Option 2
You can do this with pivot, as long as you have another column of counts to pass for the pivoting (it needs to be pivoted along something, that's why).
df['Count'] = df.groupby('selection_id').cumcount()
df.pivot(index='selection_id', columns='Count', values='last_traded_price')
Count 0 1 2 3
selection_id
430494 1.46 1.48 1.56 1.57
430495 2.45 2.67 2.72 2.87
You can use cumcount to create a counter for the new column names, then reshape with set_index + unstack or pandas.pivot:
g = df.groupby('selection_id').cumcount()
df = df.set_index(['selection_id',g])['last_traded_price'].unstack()
print (df)
0 1 2 3
selection_id
430494 1.46 1.48 1.56 1.57
430495 2.45 2.67 2.72 2.87
Similar solution with pivot:
df = pd.pivot(index=df['selection_id'],
columns=df.groupby('selection_id').cumcount(),
values=df['last_traded_price'])
print (df)
0 1 2 3
selection_id
430494 1.46 1.48 1.56 1.57
430495 2.45 2.67 2.72 2.87
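For reference, a self-contained sketch of the set_index + unstack approach, rebuilt from the sample rows in the question:

import pandas as pd

# sample rows from the question
df = pd.DataFrame({'selection_id': [430494] * 4 + [430495] * 4,
                   'last_traded_price': [1.46, 1.48, 1.56, 1.57, 2.45, 2.67, 2.72, 2.87]})
# counter within each selection_id becomes the new column labels
g = df.groupby('selection_id').cumcount()
out = df.set_index(['selection_id', g])['last_traded_price'].unstack()
print(out)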
I have data that looks like this:
1.00 1.00 1.00
3.23 4.23 0.33
1.23 0.13 3.44
4.55 12.3 14.1
2.00 2.00 2.00
1.21 1.11 1.11
3.55 5.44 5.22
4.11 1.00 4.00
It comes in chunks of 4 lines. The first line of each chunk is the index and the rest are the values.
A chunk always has 4 lines, but the number of columns can be more than 3.
For example:
1.00 1.00 1.00 <- 1st chunk, the index = 1
3.23 4.23 0.33 <- values
1.23 0.13 3.44 <- values
4.55 12.3 14.1 <- values
My example above only contains 2 chunks, but in practice there can be more.
What I want to do is to create a dictionary of data frames so I can process them
chunk by chunk. Namely from this:
In [1]: import pandas as pd
In [2]: df = pd.read_table("http://dpaste.com/29R0BSS.txt",header=None, sep = " ")
In [3]: df
Out[3]:
0 1 2
0 1.00 1.00 1.00
1 3.23 4.23 0.33
2 1.23 0.13 3.44
3 4.55 12.30 14.10
4 2.00 2.00 2.00
5 1.21 1.11 1.11
6 3.55 5.44 5.22
7 4.11 1.00 4.00
Into a dictionary of data frames, such that I can do something like this (written out by hand):
>> # Let's call new data frame `nd`.
>> nd[1]
>> 0 1 2
0 3.23 4.23 0.33
1 1.23 0.13 3.44
2 4.55 12.30 14.10
There are lots of ways to do this; I tend to use groupby, e.g. something like
>>> grouped = df.groupby(np.arange(len(df)) // 4)
>>> d = {v.iloc[0][0]: v.iloc[1:].reset_index(drop=True) for k,v in grouped}
>>> for k,v in d.items():
... print(k)
... print(v)
...
1.0
0 1 2
0 3.23 4.23 0.33
1 1.23 0.13 3.44
2 4.55 12.30 14.10
2.0
0 1 2
0 1.21 1.11 1.11
1 3.55 5.44 5.22
2 4.11 1.00 4.00
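For reference, a self-contained version of the same idea, rebuilt from the sample values above; the dictionary keys end up being the floats 1.0 and 2.0 taken from each chunk's index row:

import numpy as np
import pandas as pd

# rebuild the sample frame from the question
df = pd.DataFrame([[1.00, 1.00, 1.00], [3.23, 4.23, 0.33],
                   [1.23, 0.13, 3.44], [4.55, 12.30, 14.10],
                   [2.00, 2.00, 2.00], [1.21, 1.11, 1.11],
                   [3.55, 5.44, 5.22], [4.11, 1.00, 4.00]])

# group every 4 consecutive rows together
grouped = df.groupby(np.arange(len(df)) // 4)
# key = first value of the chunk's index row, value = the remaining 3 rows
d = {v.iloc[0][0]: v.iloc[1:].reset_index(drop=True) for k, v in grouped}
print(d[1.0])  # the three value rows that followed the index row 1.00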