Pandas: Find first occurrences of elements that appear in a certain column - python

Let's assume that I have the following DataFrame:
df_raw = pd.DataFrame({"id": [102, 102, 103, 103, 103],
                       "val1": [9, 2, 4, 7, 6],
                       "val2": [np.nan, 3, np.nan, 4, 5],
                       "val3": [4, np.nan, np.nan, 5, 1],
                       "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3),
                                pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9),
                                pd.Timestamp(2005, 2, 3)]})
I want to access the rows containing the first occurrence of each id. These rows would be:
df_first = pd.DataFrame({"id": [102, 103],
                         "val1": [9, 4],
                         "val2": [np.nan, np.nan],
                         "val3": [4, np.nan],
                         "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2003, 4, 4)]})
Basically, what I would like to achieve in the end is to fill in the NaNs that appear in the first occurrence of each id. So the final DataFrame would be:
df_processed = pd.DataFrame({"id": [102, 102, 103, 103, 103],
                             "val1": [9, 2, 4, 7, 6],
                             "val2": [-1, 3, -1, 4, 5],
                             "val3": [4, np.nan, -1, 5, 1],
                             "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3),
                                      pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9),
                                      pd.Timestamp(2005, 2, 3)]})
An important note is that the rows are already grouped by id and date and sorted in an ascending manner, so they appear exactly as in the provided example.

IIUC, use drop_duplicates then concat:
df1 = df_raw.drop_duplicates('id').fillna(-1)  # first row per id, with NaNs filled
target = pd.concat([df1, df_raw.loc[~df_raw.index.isin(df1.index)]]).sort_index()
target
date id val1 val2 val3
0 2002-01-01 102 9 -1.0 4.0
1 2002-03-03 102 2 3.0 NaN
2 2003-04-04 103 4 -1.0 -1.0
3 2003-08-09 103 7 4.0 5.0
4 2005-02-03 103 6 5.0 1.0
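Note that the concat step is avoidable: drop_duplicates keeps the original index labels, so the filled first rows can be assigned straight back onto a copy (a small sketch of the same idea):
target = df_raw.copy()
target.loc[df1.index] = df1  # overwrite only the first-occurrence rows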

You can use pd.Series.duplicated with Boolean row indexing:
mask = ~df_raw['id'].duplicated()  # True on the first occurrence of each id
val_cols = ['val2', 'val3']
df_raw.loc[mask, val_cols] = df_raw.loc[mask, val_cols].fillna(-1)
print(df_raw)
id val1 val2 val3 date
0 102 9 -1.0 4.0 2002-01-01
1 102 2 3.0 NaN 2002-03-03
2 103 4 -1.0 -1.0 2003-04-04
3 103 7 4.0 5.0 2003-08-09
4 103 6 5.0 1.0 2005-02-03
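An equivalent mask can be built with groupby.cumcount, which reads as "first row within each id group" (a sketch over the same df_raw; it produces the same mask as the duplicated approach):
mask = df_raw.groupby('id').cumcount() == 0  # cumcount 0 marks the first row of each id
df_raw.loc[mask, val_cols] = df_raw.loc[mask, val_cols].fillna(-1)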

Related

Writing a DataFrame to an Excel file where items in a list are put into separate cells

Consider a DataFrame like pivoted below, where replicates of some data are given as lists:
d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
     'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
     'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
              [20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
              [30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an Excel file as follows:
with pd.ExcelWriter('output.xlsx') as writer:
    pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written so that replicate data are given in side-by-side cells? Desired Excel file:
Then you need to pivot your data in a slightly different way: first explode the Data column, then deduplicate with groupby.cumcount:
(df.explode('Data')
   .assign(n=lambda d: d.groupby(level=0).cumcount())
   .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
   .droplevel('n', axis=1).rename_axis(columns=None)
)
Output:
A A A B B B B C C C C
Conc
0.1 10 9.7 8 100 110 80 NaN 1 1 0 NaN
0.5 50 40 30 3 4 5 6 2 1 2 2
1.0 100 90 80 20 15 10 NaN 10 5 9 3
2.0 NaN NaN NaN NaN NaN NaN NaN 30 40 50 20
Besides @mozway's answer, just for formatting, you can use:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
         .pivot(index='Conc', columns=['Compound', 'col'], values='Data')
         .rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')
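Either pivoted result can also be written through the same ExcelWriter context manager shown in the question (a sketch; the filename and sheet name are just examples):
with pd.ExcelWriter('output.xlsx') as writer:
    piv.to_excel(writer, sheet_name='Sheet1')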

Make multiple means from one column with pandas

With Python 3.10
Sample data:
import pandas as pd
import numpy as np
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54],
        [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54],
        [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, 10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
print(df)
df = df.mask(df == 0).fillna(df.mean())
print(df) # <---this works but you will see what I mean about looking off..
Updated Solution:
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = round(df['data'].rolling(4, 1).apply(lambda x: np.nanmean(x)), 2)
df['final2'] = np.where(df['data'] > 0, df['data'], df['ma'])
print(df)
# it replaces the zeros and NULLS with a value, (sometimes it fits well, sometimes, not so much).
The idea is I have one or more column(s) with bad or missing data.
If I use .fillna(df.mean()) for this, it sticks out like a sore thumb.
My goal is to use a percentage of the total number of elements in the DataFrame column to build the new mean from.
I would like to take len(df) * 0.30 (30%) and divide it in half.
I would collect half of those numbers from above the index point where the bad data (null/0) exists,
and half of them from below that index.
These collected elements would then be used to calculate a replacement for the missing or bad index point.
This would be more useful on a data set that is irregular or has missing/bad data.
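In code, the idea might look roughly like this (an untested sketch of the description above; the helper name local_mean_fill is made up, and treating both 0 and NaN as bad data is an assumption):
def local_mean_fill(s, frac=0.30):
    s = s.mask(s == 0)  # treat zeros as missing, like NaN (assumption)
    half = max(1, int(len(s) * frac) // 2)  # half of the 30% window on each side
    out = s.copy()
    for i in np.flatnonzero(s.isna().to_numpy()):
        window = s.iloc[max(0, i - half):i + half + 1]  # half below, half above the bad point
        out.iloc[i] = window.mean()  # Series.mean skips NaN by default
    return out
df['local'] = local_mean_fill(df['data'])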
You can take a rolling mean with min_periods=1 to smooth out the data,
or you can do a variant of this method to customise what you want.
Inside the lambda I used np.nanmean(x).
import pandas as pd
import numpy as np
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54],
        [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54],
        [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, 10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = df['data'].rolling(3, 1).apply(lambda x: np.nanmean(x))  # window=3, min_periods=1
df['final'] = np.where(df['data'] >= 0, df['data'], df['ma'])  # only NaNs fall through to the rolling mean
print(df)
result:
column data something ma final
0 1 14890.0 3 14890.000000 14890.0
1 4 5.0 6 7447.500000 5.0
2 7 8.0 9 4967.666667 8.0
3 11 13.0 14 8.666667 13.0
4 12 0.0 18 7.000000 0.0
5 87 NaN 54 6.500000 6.5
6 1 0.0 3 0.000000 0.0
7 4 5.0 6 2.500000 5.0
8 7 8.0 9 4.333333 8.0
9 11 13.0 14 8.666667 13.0
10 12 0.0 18 7.000000 0.0
11 87 NaN 54 6.500000 6.5
12 1 0.0 3 0.000000 0.0
13 4 5.0 6 2.500000 5.0
14 7 8.0 9 4.333333 8.0
15 11 13.0 14 8.666667 13.0
16 12 0.0 18 7.000000 0.0
17 87 10026.0 54 3346.333333 10026.0

How to convert pandas dataframe using Unstack?

I want to convert the following DataFrame into a new DataFrame using the pandas unstack function. Please help me out.
df = pd.DataFrame([["A", 1, 1, 10],
                   ["A", 2, 1, 20],
                   ["A", 1, 2, 30],
                   ["A", 2, 2, 40],
                   ["B", 1, 1, 50],
                   ["B", 2, 1, 60],
                   ["B", 1, 2, 70],
                   ["B", 2, 2, 80],
                   ["B", 1, 3, 90]],
                  columns=["itemid", "segment", "pass", "p1"])
piv = df.pivot(index=["itemid", "segment"], columns="pass", values="p1")
piv.columns = [f"p1_pass_{idx}" for idx in piv.columns]
piv.reset_index()
Output:
itemid segment p1_pass_1 p1_pass_2 p1_pass_3
0 A 1 10.0 30.0 NaN
1 A 2 20.0 40.0 NaN
2 B 1 50.0 70.0 90.0
3 B 2 60.0 80.0 NaN
I don't know whether the use of unstack is totally necessary for you.
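If you do want unstack specifically, the same reshape can be written with it (a sketch equivalent to the pivot above):
piv = df.set_index(["itemid", "segment", "pass"])["p1"].unstack("pass")
piv.columns = [f"p1_pass_{idx}" for idx in piv.columns]
piv.reset_index()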

Keep the last n real values of uneven rows in a dataframe?

I am collecting heart rate values over the course of time. Each subject varies in the length of time that data was collected. I would like to make a table of the last 2 seconds of collected data.
import pandas as pd
import numpy as np
#example data
example_s = [["4/20/21 4:20", 302, 0, 0, 1, 2, 3],
             ["2/17/21 9:20", 135, 1, 1.4, 8, 10, np.nan, np.nan],
             ["2/17/21 9:20", 111, 5, 5, 1, np.nan, np.nan, np.nan, np.nan]]
example_s_table = pd.DataFrame(example_s, columns=['Date_Time', 'CID', 0, 1, 2, 3, 4, 5, 6])
desired_outcome = [["4/20/21 4:20", 302, 1, 2, 3],
                   ["2/17/21 9:20", 135, 1.4, 8, 10],
                   ["2/17/21 9:20", 111, 5, 5, 1]]
desired_outcome_table = pd.DataFrame(desired_outcome, columns=['Date_Time', 'CID', "Second 1", "Second 2", "Second 3"])
I can see how to collect a single instance of the data from the example shown in "Python Dataframe Get Value of Last Non Null Column for Each Row", but would like to know how to quickly add multiple values to my table:
desired_outcome_table["Last Second"] = example_s_table.iloc[:, 1:].ffill(axis=1).iloc[:, -1]
Try:
df = example_s_table.copy()
df = df.set_index(['Date_Time', 'CID'])
df_out = df.mask(df.eq(0))\
           .apply(lambda x: pd.Series(x.dropna().tail(3).values), axis=1)\
           .rename(columns=lambda x: f'Second {x+1}')
df_out['Last Second'] = df_out['Second 3']
print(df_out.reset_index())
Output:
Date_Time CID Second 1 Second 2 Second 3 Last Second
0 4/20/21 4:20 302 1.0 2.0 3.0 3.0
1 2/17/21 9:20 135 1.4 8.0 10.0 10.0
2 2/17/21 9:20 111 5.0 5.0 1.0 1.0
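The same pipeline generalises to any number of trailing readings by changing tail(3); for example, tail(2) keeps just the last two seconds per row, matching the wording of the question (a sketch reusing df from above):
df_out2 = (df.mask(df.eq(0))
             .apply(lambda x: pd.Series(x.dropna().tail(2).values), axis=1)
             .rename(columns=lambda x: f'Second {x+1}'))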

Pandas get_group method on DatetimeIndexResamplerGroupby

Question: Does the get_group method work on a DataFrame with a DatetimeIndexResamplerGroupby index? If so, what is the appropriate syntax?
Sample data:
data = [[2, 4, 1, datetime.datetime(2017, 1, 1)],
        [2, 4, 2, datetime.datetime(2017, 1, 5)],
        [3, 4, 1, datetime.datetime(2017, 1, 7)]]
df1 = pd.DataFrame(data, columns=list('abc') + ['dates'])
gb3 = df1.set_index('dates').groupby('a').resample('D')
gb3
DatetimeIndexResamplerGroupby [freq=<Day>, axis=0, closed=left, label=left, convention=e, base=0]
gb3.sum()
a b c
a dates
2 2017-01-01 2.0 4.0 1.0
2017-01-02 NaN NaN NaN
2017-01-03 NaN NaN NaN
2017-01-04 NaN NaN NaN
2017-01-05 2.0 4.0 2.0
3 2017-01-07 3.0 4.0 1.0
The get_group method works for me on a pandas.core.groupby.DataFrameGroupBy object.
I've tried various approaches; the typical error is TypeError: Cannot convert input [(0, 1)] of type <class 'tuple'> to Timestamp.
The below should be what you're looking for (if I understand the question correctly):
import pandas as pd
import datetime

data = [[2, 4, 1, datetime.datetime(2017, 1, 1)],
        [2, 4, 2, datetime.datetime(2017, 1, 5)],
        [3, 4, 1, datetime.datetime(2017, 1, 7)]]
df1 = pd.DataFrame(data, columns=list('abc') + ['dates'])
gb3 = df1.groupby(['a', pd.Grouper(key='dates')])
gb3.get_group((2, '2017-01-01'))
Output:
a b c dates
0 2 4 1 2017-01-01
I believe resample/pd.Grouper can be used interchangeably in this case (someone correct me if I'm wrong). Let me know if this works for you.
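The Grouper route also extends to time bins, mirroring the resample('D') in the question (a sketch on the same df1; the daily frequency is an assumption):
gb4 = df1.groupby(['a', pd.Grouper(key='dates', freq='D')])
gb4.get_group((2, pd.Timestamp('2017-01-01')))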
Yes, it does. The following code returns the sum of the monthly values for the year 2015:
df.resample('MS').sum().resample('Y').get_group('2015-12-31')
