Pandas.DataFrame: efficient way to add a column "seconds since last event"

I have a Pandas.DataFrame with a standard index representing seconds, and I want to add a column "seconds elapsed since last event" where the events are given in a list. Specifically, say
event = [2, 5]
and
df = pd.DataFrame(np.zeros((7, 1)))
| | 0 |
|---:|----:|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
Then I want to obtain
| | 0 | x |
|---:|----:|-----:|
| 0 | 0 | <NA> |
| 1 | 0 | <NA> |
| 2 | 0 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 2 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
I tried
df["x"] = pd.Series(range(5)).shift(2)
| | 0 | x |
|---:|----:|----:|
| 0 | 0 | nan |
| 1 | 0 | nan |
| 2 | 0 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 2 |
| 5 | 0 | nan |
| 6 | 0 | nan |
so apparently to make it work I need to write df["x"] = pd.Series(range(5+2)).shift(2).
More importantly, when I then do df["x"] = pd.Series(range(2+5)).shift(5) I obtain
| | 0 | x |
|---:|----:|----:|
| 0 | 0 | nan |
| 1 | 0 | nan |
| 2 | 0 | nan |
| 3 | 0 | nan |
| 4 | 0 | nan |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
That is, the previous values have been overwritten. Is there a way to assign new values without overwriting existing values with NaN?
Then I could do something like
for i in event:
    df["x"] = pd.Series(range(len(df))).shift(i)
Or is there a more efficient way?
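For the record, here is my naive code. It works, but it looks inefficient and poorly designed:
def add_seconds_since_event(df, event):  # wrapper name is illustrative; only the body was posted
    c = 1000000  # sentinel for rows before the first event
    df["x"] = c
    if event:
        idx = 0
        for i in df.itertuples():
            print(i)
            if idx < len(event) and i.Index == event[idx]:
                c = 0  # reset the counter on an event row
                idx += 1
            df.loc[i.Index, "x"] = c
            c += 1
    return df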

IIUC, you can combine cumsum with groupby().cumcount():
s = df.index.isin(event).cumsum()
# or equivalently
# s = df.loc[event, 0].reindex(df.index).notna().cumsum()
df['x'] = np.where(s > 0, df.groupby(s).cumcount(), np.nan)
Output:
0 x
0 0.0 NaN
1 0.0 NaN
2 0.0 0.0
3 0.0 1.0
4 0.0 2.0
5 0.0 0.0
6 0.0 1.0
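To see why this works, it can help to print the intermediate values. The following is a minimal sketch on the question's data; the name `counts` is introduced here for illustration, and `s` is wrapped in a Series only so that it is easy to print.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((7, 1)))
event = [2, 5]

# True on rows whose index is an event; the running sum labels each
# stretch of rows by how many events have occurred so far.
s = pd.Series(df.index.isin(event), index=df.index).cumsum()
print(s.tolist())        # [0, 0, 1, 1, 1, 2, 2]

# Position of each row inside its stretch, counted from 0.
counts = df.groupby(s).cumcount()
print(counts.tolist())   # [0, 1, 0, 1, 2, 0, 1]

# Rows before the first event (label 0) have no previous event.
df["x"] = counts.where(s > 0)
print(df["x"].tolist())  # [nan, nan, 0.0, 1.0, 2.0, 0.0, 1.0]
```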

Let's try this:
df = pd.DataFrame(np.zeros((7, 1)))
event = [2, 5]
df.loc[event, 0] = 1
df = df.replace(0, np.nan)
grp=df[0].cumsum().ffill()
df['x'] = df.groupby(grp).cumcount().mask(grp.isna())
df
Output:
| | 0 | x |
|---:|----:|----:|
| 0 | nan | nan |
| 1 | nan | nan |
| 2 | 1 | 0 |
| 3 | nan | 1 |
| 4 | nan | 2 |
| 5 | 1 | 0 |
| 6 | nan | 1 |
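
Because the index itself holds the seconds, another option is to subtract the time of the most recent event from each row's index value. This is a minimal sketch of that idea rather than a restatement of the code above; the names `seconds` and `last_event` are illustrative. Since it subtracts index values instead of counting rows, it should also behave sensibly if the index has gaps.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((7, 1)))
event = [2, 5]

# The index as a Series, so it can be masked and forward-filled.
seconds = pd.Series(df.index, index=df.index)

# Keep the index value on event rows only, then carry it forward so that
# every row knows the time of the most recent event (NaN before the first).
last_event = seconds.where(seconds.isin(event)).ffill()

df["x"] = seconds - last_event
print(df["x"].tolist())  # [nan, nan, 0.0, 1.0, 2.0, 0.0, 1.0]
```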

Related

Have the name of the column if my row is equal to 1 return in another one

I have a dataframe that looks like this
+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583 |
+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| code | 1 | 0 | 0 | 1 |
+--------+--------+--------+--------+--------+
For each 1 in the row with index 'code', I would like the name of the corresponding column to appear in a new column 'key_name'. Here is the desired final result:
+--------+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583|key_name|
+--------+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 | Q111 |
| 2 | 0 | 1 | 1 | 0 | nan |
| 3 | 0 | 0 | 0 | 1 | nan |
| 4 | 1 | 0 | 0 | 1 | Info583|
| code | 1 | 0 | 0 | 1 | nan |
+--------+--------+--------+--------+--------+--------+
Thanks for any help or advice!
I think this is what you're looking for:
df['key_name'] = np.nan
condition = df.loc['code', :] == 1
df.loc[condition, 'key_name'] = df.columns[condition]
First make the column with just nan's. Then compute your condition: row with index 'code' equals 1. Then plug in the column names when condition is met.
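
The condition step can be inspected on its own. The sketch below rebuilds the question's table by hand and lists the column names where the 'code' row equals 1; the name `flagged` is illustrative. How those names should then be placed into rows of 'key_name' is not clear from the desired table, so the sketch stops at the list.

```python
import pandas as pd

# Frame rebuilt by hand from the table in the question.
df = pd.DataFrame(
    {"Q111": [1, 0, 0, 1], "Q570": [0, 1, 0, 0],
     "Q7891": [0, 1, 0, 0], "Info583": [0, 0, 1, 1]},
    index=[1, 2, 3, "code"],
)

# Boolean Series indexed by column name: True where the 'code' row is 1.
condition = df.loc["code"] == 1

# Column names selected by that mask.
flagged = df.columns[condition.to_numpy()].tolist()
print(flagged)  # ['Q111', 'Info583']
```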

Fill duplicates with missing value after grouping with some logic

I have a dataframe in which I need to blank out duplicated ticket_id values when the owner_type is the same; when the owner_type differs, 'm' should be picked over 's'. If no value is picked, NaN is returned:
data = pd.DataFrame({'owner_type':['m','m','m','s','s','m','s','s'],'ticket_id':[1,1,2,2,3,3,4,4]})
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | 1 |
| 1 | m | 1 |
| 2 | m | 2 |
| 3 | s | 2 |
| 4 | s | 3 |
| 5 | m | 3 |
| 6 | s | 4 |
| 7 | s | 4 |
Should give back:
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | NaN |
| 1 | m | NaN |
| 2 | m | 2 |
| 3 | s | NaN |
| 4 | s | NaN |
| 5 | m | 3 |
| 6 | s | NaN |
| 7 | s | NaN |
Pseudocode would be something like: if ticket_id is duplicated, look at owner_type; if owner_type has more than one value, return the value for 'm' and NaN for 's'.
My attempt:
data.groupby('ticket_id').apply(lambda x: x['owner_type'] if len(x) < 2 else NaN)
This does not work.
Try this:
(df['ticket_id'].where(
~df.duplicated(['owner_type','ticket_id'],keep=False) &
df['owner_type'].eq(df.groupby('ticket_id')['owner_type'].transform('min'))))
Old answer:
m = ~df.duplicated(keep=False) & df['owner_type'].eq('m')
df['ticket_id'].where(m)
Output:
0 NaN
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
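
For reference, here is a self-contained sketch of the first expression applied to the question's `data` frame, with the two conditions split out. The names `repeated_pair`, `preferred`, and `result` are illustrative; the sketch relies on 'm' sorting before 's', so that `transform('min')` picks 'm' whenever a ticket has both owner types.

```python
import pandas as pd

data = pd.DataFrame({
    "owner_type": ["m", "m", "m", "s", "s", "m", "s", "s"],
    "ticket_id": [1, 1, 2, 2, 3, 3, 4, 4],
})

# Rows whose (owner_type, ticket_id) pair occurs more than once.
repeated_pair = data.duplicated(["owner_type", "ticket_id"], keep=False)

# Per ticket, the alphabetically smallest owner_type ('m' wins over 's').
preferred = data.groupby("ticket_id")["owner_type"].transform("min")

# Keep ticket_id only where the pair is unique and the owner is preferred.
keep = ~repeated_pair & data["owner_type"].eq(preferred)
result = data.assign(ticket_id=data["ticket_id"].where(keep))
print(result)
```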

Is there a way to find out each occurrence of a column value in another column from a different dataset?

I have two datasets, dataset1 and dataset2, which have a common column called SAX that holds strings.
dataset1=
SAX
0 glngsyu
1 zicobgm
2 eerptow
3 cqbsynt
4 zvmqben
.. ...
475 rfikekw
476 bnbzvqx
477 rsuhgax
478 ckhloio
479 lbzujtw
480 rows × 2 columns
and
dataset2 =
SAX timestamp
0 hssrlcu 16015
1 ktyuymp 16016
2 xncqmfr 16017
3 aanlmna 16018
4 urvahvo 16019
... ... ...
263455 jeivqzo 279470
263456 bzasxgw 279471
263457 jspqnqv 279472
263458 sxwfchj 279473
263459 gxqnhfr 279474
263460 rows × 2 columns
I need to find and print out the timestamps for whenever a value in SAX column of dataset1 exists in SAX column of dataset2.
Is there a function/method for accomplishing the above?
Thanks.
Let's create an arbitrary dataset to showcase how it works:
import pandas as pd
import numpy as np
def sax_generator(num):
    return [''.join(chr(x) for x in np.random.randint(97, 97+26, size=4)) for _ in range(num)]
df1 = pd.DataFrame(sax_generator(10), columns=['sax'])
df2 = pd.DataFrame({'sax': sax_generator(10), 'timestamp': range(10)})
Let's peek into the data:
df1 =
| | sax |
|---:|:------|
| 0 | cvtj |
| 1 | fmjy |
| 2 | rjpi |
| 3 | gwtv |
| 4 | qhov |
| 5 | uriu |
| 6 | kpku |
| 7 | xkop |
| 8 | kzoe |
| 9 | nydj |
df2 =
| | sax | timestamp |
|---:|:------|------------:|
| 0 | kzoe | 0 |
| 1 | npyo | 1 |
| 2 | uriu | 2 |
| 3 | hodu | 3 |
| 4 | rdko | 4 |
| 5 | pspn | 5 |
| 6 | qnut | 6 |
| 7 | gtyz | 7 |
| 8 | gfzs | 8 |
| 9 | gcel | 9 |
Now ensure we have some matching values in df2 from df1, which we can later check:
df2.loc[2, 'sax'] = df1.loc[5, 'sax']
df2.loc[0, 'sax'] = df1.loc[8, 'sax']
Then use:
df2.loc[df1.sax.apply(lambda x: df2.sax.str.contains(x)).any(), 'timestamp']
to get:
| | timestamp |
|---:|------------:|
| 0 | 0 |
| 2 | 2 |
With np.where you can get the indices back as well:
np.where(df1.sax.apply(lambda x: df2.sax.str.contains(x)) == True)
# -> (array([5, 8]), array([2, 0]))
Here we can see that df1 has matching indices [5, 8] and df2 has [2, 0], which is exactly what we enforced with the lines above...
If we have a look at the return of df1.sax.apply(lambda x: df2.sax.str.contains(x)), the result above matches exactly the indices (magic...whooo):
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
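Note that `str.contains` performs a substring (regex) match. If only exact matches of whole SAX strings are needed, which is how the question reads, `Series.isin` may be a simpler option. A minimal sketch on two small hand-made frames that reuse a few of the sample values above:

```python
import pandas as pd

df1 = pd.DataFrame({"sax": ["cvtj", "uriu", "kzoe"]})
df2 = pd.DataFrame({
    "sax": ["kzoe", "npyo", "uriu", "hodu"],
    "timestamp": [0, 1, 2, 3],
})

# True for every row of df2 whose sax value also appears in df1.
mask = df2["sax"].isin(df1["sax"])

matches = df2.loc[mask, ["sax", "timestamp"]]
print(matches)
```

Each row of `df2` is checked against the set of values in `df1`, so no row-by-row `apply` is needed.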
Step 1: Convert dataset2 to a dict that maps each timestamp to its SAX value:
import numpy as np
import pandas as pd
a_dictionary = dataset2.set_index('timestamp')['SAX'].to_dict()
Step 2: Use a comparison in a for loop to extract the timestamps.
lookup_value = "abcdef"  # This can be an item from a list.
all_keys = []
for key, value in a_dictionary.items():
    if value == lookup_value:
        all_keys.append(key)
print(all_keys)
Step3: ENJOY!

Assign a total value of 1 if any number is present in a column, else 0

I have a dataset similar to the sample below:
| id | old_a | old_b | new_a | new_b |
|----|-------|-------|-------|-------|
| 6 | 3 | 0 | 0 | 0 |
| 6 | 9 | 0 | 2 | 0 |
| 13 | 3 | 0 | 0 | 0 |
| 13 | 37 | 0 | 0 | 1 |
| 13 | 30 | 0 | 0 | 6 |
| 13 | 12 | 2 | 0 | 0 |
| 6 | 7 | 0 | 2 | 0 |
| 6 | 8 | 0 | 0 | 0 |
| 6 | 19 | 0 | 3 | 0 |
| 6 | 54 | 0 | 0 | 0 |
| 87 | 6 | 0 | 2 | 0 |
| 87 | 11 | 1 | 1 | 0 |
| 87 | 25 | 0 | 1 | 0 |
| 87 | 10 | 0 | 0 | 0 |
| 9 | 8 | 1 | 0 | 0 |
| 9 | 19 | 0 | 2 | 0 |
| 9 | 1 | 0 | 0 | 0 |
| 9 | 34 | 0 | 7 | 0 |
I'm providing this sample dataset for the above table:
data=[[6,3,0,0,0],[6,9,0,2,0],[13,3,0,0,0],[13,37,0,0,1],[13,30,0,0,6],[13,12,2,0,0],[6,7,0,2,0],
[6,8,0,0,0],[6,19,0,3,0],[6,54,0,0,0],[87,6,0,2,0],[87,11,1,1,0],[87,25,0,1,0],[87,10,0,0,0],
[9,8,1,0,0],[9,19,0,2,0],[9,1,0,0,0],[9,34,0,7,0]]
data= pd.DataFrame(data,columns=['id','old_a','old_b','new_a','new_b'])
I want to look into columns 'new_a' and 'new_b' for each id and even if a single value exists in these two columns for each id, I want to count it as 1 irrespective of the number of times any value has occurred and assign 0 if no value is present. For example, if I look into the id '9', there are two distinct values in new_a, but I want to count it as 1. Similarly, for id '13', there are no values in new_a, so I would want to assign it 0.
My final output should look like:
| id | new_a | new_b |
|----|-------|-------|
| 6 | 1 | 0 |
| 9 | 1 | 0 |
| 13 | 0 | 1 |
| 87 | 1 | 0 |
I would eventually want to calculate the % of clients using new_a and new_b. So from the above table, 75% clients use new_a and 25% use new_b. I'm a beginner in python and not sure how to proceed in this.
Use GroupBy.any, because 0 is treated as False, and then convert the boolean output to integers:
df = data.groupby('id')[['new_a','new_b']].any().astype(int).reset_index()
print (df)
id new_a new_b
0 6 1 0
1 9 1 0
2 13 0 1
3 87 1 0
For the percentages, take the mean of the output above:
s = df[['new_a','new_b']].mean().mul(100)
print (s)
new_a 75.0
new_b 25.0
dtype: float64
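
If only the percentages are needed, the integer conversion can be skipped, since the mean of a boolean column is the fraction of True values. A minimal sketch on the question's sample data; the names `usage` and `pct` are illustrative:

```python
import pandas as pd

data = pd.DataFrame(
    [[6, 3, 0, 0, 0], [6, 9, 0, 2, 0], [13, 3, 0, 0, 0], [13, 37, 0, 0, 1],
     [13, 30, 0, 0, 6], [13, 12, 2, 0, 0], [6, 7, 0, 2, 0], [6, 8, 0, 0, 0],
     [6, 19, 0, 3, 0], [6, 54, 0, 0, 0], [87, 6, 0, 2, 0], [87, 11, 1, 1, 0],
     [87, 25, 0, 1, 0], [87, 10, 0, 0, 0], [9, 8, 1, 0, 0], [9, 19, 0, 2, 0],
     [9, 1, 0, 0, 0], [9, 34, 0, 7, 0]],
    columns=["id", "old_a", "old_b", "new_a", "new_b"],
)

# One row per id: True if any value in the column is non-zero.
usage = data.groupby("id")[["new_a", "new_b"]].any()

# Share of ids with at least one non-zero value, as a percentage.
pct = usage.mean().mul(100)
print(pct)
```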

Pandas: Matrix from pd.crosstab()

I'm progressively learning pandas. I figured out that pd.crosstab() can do marvels, but I'm having a hard time making it work in this case.
I have a list of objects obj, each tagged with an int, and I want the matrix showing which objects share the same tag (1 if the tag is the same, 0 otherwise):
| obj | tag |
|-----|-----|
| a | 0 |
| b | 2 |
| c | 1 |
| ... | ... |
| z | 2 |
->
| | a | b | c | ... | z |
|-----|---|---|---|-----|---|
| a | 1 | 0 | 0 | . | 0 |
| b | 0 | 1 | 0 | . | 1 |
| c | 0 | 0 | 1 | . | 0 |
| ... | . | . | . | . | 0 |
| z | 0 | 1 | 0 | 0 | 1 |
There are some formidable ways to do it; is there a more pandas-friendly one?
PS: I tried pd.crosstab(df.obj, df.obj, values=df.tag, aggfunc=[np.sum]), but the result is filled with NaN.
You can use merge with crosstab and DataFrame.rename_axis:
df = df.merge(df, on='tag')
df = pd.crosstab(df.obj_x, df.obj_y).rename_axis(None).rename_axis(None, axis=1)
print (df)
a b c z
a 1 0 0 0
b 0 1 0 1
c 0 0 1 0
z 0 1 0 1
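
As a cross-check, the same matrix can be built by comparing the tag array with itself through NumPy broadcasting. This is an alternative sketch, not the crosstab approach above; it assumes each obj appears once, and the four-row frame is a hand-made stand-in for the elided table in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"obj": ["a", "b", "c", "z"], "tag": [0, 2, 1, 2]})

tags = df["tag"].to_numpy()

# Compare every tag with every other tag: entry (i, j) is 1 when
# object i and object j carry the same tag, else 0.
same_tag = (tags[:, None] == tags[None, :]).astype(int)

labels = df["obj"].tolist()
matrix = pd.DataFrame(same_tag, index=labels, columns=labels)
print(matrix)
```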
