I have the following dataset:
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='2020-07-01', end='2020-07-10', freq='d')
l1 = [np.nan, np.nan, 3, np.nan, np.nan, 4, np.nan, np.nan, 5, np.nan]
l2 = [np.nan, np.nan, np.nan, np.nan, np.nan, 4, np.nan, np.nan, 1, 3]
df = pd.DataFrame({
    'date': date_rng,
    'value': l1,
    'group': 'a'
})
df2 = pd.DataFrame({
    'date': date_rng,
    'value': l2,
    'group': 'b'
})
df = pd.concat([df, df2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df
I would like to count the days until the first value appears for each group. I was able to find the date with the following code, but I would also like to get the number of days for each group.
# first valid value for each group
df.set_index(["date"]).groupby('group')['value'].apply(pd.Series.first_valid_index)
EDIT:
This would be the expected outcome:
columns = ["group", "number_of_days"]
df_features = pd.DataFrame([["a", 3],
["b", 6],],
columns=columns)
df_features
Use GroupBy.first to get the first date per group, subtract with Series.sub, convert to days with Series.dt.days, add 1, and convert to a two-column DataFrame:
s1 = df.groupby('group')['date'].first()
s2 = df.set_index(["date"]).groupby('group')['value'].apply(pd.Series.first_valid_index)
df = s2.sub(s1).dt.days.add(1).reset_index(name='number_of_days')
print(df)
group number_of_days
0 a 3
1 b 6
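An alternative sketch (a judgment call, not the method above): if the dates are consecutive daily values and every group starts on the same first date, the day number can be read off the row position within each group. Using the question's original concatenated df:
out = (df.assign(pos=df.groupby('group').cumcount() + 1)  # 1-based day number within each group
         .dropna(subset=['value'])                        # rows that actually have a value
         .groupby('group')['pos'].first()                 # first day with a value
         .reset_index(name='number_of_days'))
print(out)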
Hi everyone, I'm quite new to pandas, so I won't attach code (only pseudo-code) because I have no idea how to implement this.
I have two DataFrames: one with a job number and a date related to it (let's call this DF2), and a bigger one with a bunch of different data (this will be DF1).
I would like to compare DF1 with DF2 and, where a string in DF1['jobNo'] is equal to a string in DF2['jobNo'], set DF1['Date'] equal to DF2['Date'].
Any ideas? I really need your help.
Thanks
If you're trying to check whether the dates match when the jobNo values match, my approach would be to merge the two dataframes on jobNo and compare the dates.
import pandas as pd
df1 = pd.DataFrame({'jobNo': [0, 3, 1], 'date': [9, 8, 3]})
df2 = pd.DataFrame({'jobNo': [0, 3, 2], 'date': [9, 5, 3]})
df3 = df2.merge(df1, on=["jobNo"], suffixes=('_2', '_1'))  # inner join: keeps only jobNos present in both frames
df3["date_match"] = df3["date_2"] == df3["date_1"]         # vectorized comparison instead of a row-wise apply
print(df3)
jobNo date_2 date_1 date_match
0 0 9 9 True
1 3 5 8 False
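Note that merge defaults to an inner join, so jobNo values that appear in only one frame (1 and 2 above) are dropped; pass how='outer' if you need to keep them:
df3 = df2.merge(df1, on=["jobNo"], how='outer', suffixes=('_2', '_1'))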
If what you mean by df1['date'] == df2['date'] is that we're going to change the date in df1 when there's a match, then this code looks for a match and replaces the date using apply:
import pandas as pd
df1 = pd.DataFrame({'jobNo': [0, 3, 1], 'date': [9, 8, 3]})
df2 = pd.DataFrame({'jobNo': [0, 3, 2], 'date': [7, 5, 4]})
df1['new_date'] = df1.apply(
    lambda x: x['date'] if x['jobNo'] not in df2['jobNo'].values
    else df2.loc[df2['jobNo'] == x['jobNo'], 'date'].values[0],
    axis=1,
)
print(df1)
jobNo date new_date
0 0 9 7
1 3 8 5
2 1 3 3
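As a side note, the row-wise apply gets slow on large frames. A vectorized sketch of the same replacement, assuming jobNo values are unique in df2:
mapping = df2.set_index('jobNo')['date']          # jobNo -> date lookup table
df1['new_date'] = (df1['jobNo'].map(mapping)      # matched dates, NaN where no match
                      .fillna(df1['date'])        # fall back to the original date
                      .astype(int))               # map introduces NaN, which upcasts to float; cast back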
Is there a way to find the maximum length of contiguous periods without data for each column?
df.isna().sum() gives me the total number of NaNs, but here in the example I am looking for a way to get A=3 and B=2:
import pandas as pd
import numpy as np
i = pd.date_range('2018-04-09', periods=8, freq='1D')
df = pd.DataFrame({'A': [1, 5, np.nan ,np.nan, np.nan, 2, 5, np.nan], 'B' : [np.nan, 2, 3, np.nan, np.nan, 6, np.nan, 8]}, index=i)
df
For one Series you can make groups of consecutive NaNs using the non-NaNs as starting points. Then count them and get the max:
s = df['A'].isna()
s.groupby((~s).cumsum()).sum().max()
Output: 3
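To see why this works, here are the intermediate values for column A (just an illustration):
s = df['A'].isna()
# s:             [False, False, True, True, True, False, False, True]
# (~s).cumsum(): [1,     2,     2,    2,    2,    3,     4,     4]
# Each run of NaNs shares a label with the non-NaN row before it, so summing
# the boolean values per label gives the length of each run:
s.groupby((~s).cumsum()).sum()   # 0, 3, 0, 1 -> max is 3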
Now do this for all columns:
def max_na_stretch(s):
s = s.isna()
return s.groupby((~s).cumsum()).sum().max()
df.apply(max_na_stretch)
Output:
A 3
B 2
dtype: int64
I have the following CSV and I need to get the duplicated values from the DialedNumber column and then the average Duration of those duplicates.
I already got the duplicates with the following code:
df = pd.read_csv('cdrs.csv')
dnidump = pd.DataFrame(df, columns=['DialedNumber'])
pd.options.display.float_format = '{:.0f}'.format
dupl_dni = dnidump.pivot_table(index=['DialedNumber'], aggfunc='size')
a1 = dupl_dni.to_frame().rename(columns={0:'TimesRepeated'}).sort_values(by=['TimesRepeated'], ascending=False)
b = a1.head(10)
print(b)
Output:
DialedNumber TimesRepeated
50947740194 4
50936564292 2
50931473242 3
I can't figure out how to get the average duration of those duplicates, any ideas?
Thanks
try:
df_mean = df.groupby('DialedNumber')['Duration'].mean()  # select Duration so non-numeric columns aren't averaged
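If you only want the averages for numbers that appear more than once, a sketch combining the count and the mean in one pass (column names assumed from the question):
stats = (df.groupby('DialedNumber')['Duration']
           .agg(TimesRepeated='size', AvgDuration='mean')
           .query('TimesRepeated > 1')
           .sort_values('TimesRepeated', ascending=False))
print(stats.head(10))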
Use df.groupby('column').mean()
Here is sample code.
Input
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': [2461, 1023, 9, 5614, 212],
'C': [2, 4, 8, 16, 32]}, columns=['A', 'B', 'C'])
df.groupby('A').mean()
Output
B C
A
1 1164.333333 4.666667
2 2913.000000 24.000000
API reference of pandas.core.groupby.GroupBy.mean
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.mean.html
How can I make this groupby-apply run faster, or how can I write it in a different way?
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, np.nan, 3, np.nan, 1, 2, np.nan, 4, np.nan]})
result = df.groupby("ID").apply(
    lambda x: len(x[x['value'].notnull()].index)
    if (len(x[x['value'] == 1].index) >= 1) & (len(x[x['value'] == 4].index) == 0)
    else 0
)
output:
ID
1    3
2    0
dtype: int64
My program runs very slowly right now. Can I make it faster? In the past I have filtered before using groupby(), but I don't see an easy way to do that in this situation.
Not sure if this is what you need. I have decomposed it a bit, but you can easily method-chain it to make the code more compact:
df = pd.DataFrame(
{
"ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
"value": [1, 2, np.nan, 3, np.nan, 1, 2, np.nan, 4, np.nan],
}
)
df["x1"] = df["value"] == 1
df["x2"] = df["value"] == 4
df2 = df.groupby("ID").agg(
y1=pd.NamedAgg(column="x1", aggfunc="max"),
y2=pd.NamedAgg(column="x2", aggfunc="max"),
cnt=pd.NamedAgg(column="value", aggfunc="count"),
)
df3 = df2.assign(z=lambda x: (x['y1'] & ~x['y2'])*x['cnt'])
result = df3.drop(columns=['y1', 'y2', 'cnt'])
print(result)
which will yield
z
ID
1 3
2 0
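A fully vectorized variant of the same idea, without the helper columns (a sketch, same logic as above):
has1 = df['value'].eq(1).groupby(df['ID']).any()   # group contains a 1
has4 = df['value'].eq(4).groupby(df['ID']).any()   # group contains a 4
result = df.groupby('ID')['value'].count().where(has1 & ~has4, 0)
print(result)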
If any elements are present along with NaN, then I want to keep the elements and delete only the NaNs.
Example 1:
index values
0 [nan, 'a', nan, nan]
output should be:
index values
0 ['a']
Example 2:
index values
0 [nan, 'a', 'b', 'c']
1 [nan, nan, nan]
output should be:
index values
0 ['a', 'b', 'c']
1 []
This is one approach using Series.apply.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [[np.nan, np.nan, np.nan, "a", np.nan], [np.nan, np.nan], ["a", "b"]]})
df["a"] = df["a"].apply(lambda x: [i for i in x if str(i) != "nan"])
print(df)
Output:
a
0 [a]
1 []
2 [a, b]
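One caveat: str(i) != "nan" would also drop a literal string "nan" if one ever appeared in a list. If that matters, pd.isna is the safer test:
df["a"] = df["a"].apply(lambda x: [i for i in x if not pd.isna(i)])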
You can use the fact that np.nan == np.nan evaluates to False:
df = pd.DataFrame([[0, [np.nan, 'a', 'b', 'c']],
[1, [np.nan, np.nan, np.nan]],
[2, [np.nan, 'a', np.nan, np.nan]]],
columns=['index', 'values'])
df['values'] = df['values'].apply(lambda x: [i for i in x if i == i])
print(df)
index values
0 0 [a, b, c]
1 1 []
2 2 [a]
lambda is just an anonymous function. You could also use a named function:
def remove_nan(x):
return [i for i in x if i == i]
df['values'] = df['values'].apply(remove_nan)
Related: Why is NaN not equal to NaN?
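A quick demonstration of why the i == i test filters out NaNs:
import numpy as np
print(np.nan == np.nan)  # False: NaN is not equal to itself
print('a' == 'a')        # True: ordinary values compare equal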
Another option is to build a Series from each list and drop the NaNs (note this returns NumPy arrays rather than lists):
df['values'].apply(lambda v: pd.Series(v).dropna().values)
You can use pd.Series.map on df['values']:
import pandas as pd
my_filter = lambda x: not pd.isna(x)  # keep items that are not NaN
df['new_values'] = df['values'].map(lambda x: list(filter(my_filter, x)))