I have a large data set for which I need to calculate the number of checked-out items vs. the number of checked-in items.
Here is some sample data, where rollingTotalCheckedOut describes the expected value: while items are checked out, the number of checked-out items increases; when items are checked back in, it decreases.
import pandas as pd

df = pd.DataFrame([
    ['A', 1624990605, 1627102404, 1],
    ['A', 1624990635, 1625015061, 2],
    ['A', 1624990790, 1624991096, 3],
    ['A', 1624990790, 1624990913, 4],
    ['A', 1624990822, 1624991711, 5],
    ['A', 1624990945, 1624991096, 5],
    ['A', 1624991036, 1624991066, 6],
    ['A', 1624991067, 1624991188, 5],
], columns=['ID', 'out_ts', 'in_ts', 'rollingTotalCheckedOut'])

# some helpers: human-readable timestamps
df['checkoutTime'] = pd.to_datetime(df['out_ts'], unit='s', origin='unix')
df['checkinTime'] = pd.to_datetime(df['in_ts'], unit='s', origin='unix')
I am not even sure how best to describe this problem. What is my strategy here, i.e. how should I frame and tackle this problem? A rolling window does not seem suitable because, in this case, the first row stays "checked out" for a very long time.
Here is what I got. The idea is to treat every checkout as a +1 event and every check-in as a -1 event, stack the events, sort by timestamp, and take a cumulative sum. It is not exactly your calculation, but I can't immediately see an error; I will check again. Honestly, though, I am not sure why you have a 5 in the last row: the previous interval ended (check-in at 1624991066) just before the new one started (check-out at 1624991067), so the cumulative count there comes out as 6.
import pandas as pd

df = pd.DataFrame([
    ['A', 1624990605, 1627102404, 1],
    ['A', 1624990635, 1625015061, 2],
    ['A', 1624990790, 1624991096, 3],
    ['A', 1624990790, 1624990913, 4],
    ['A', 1624990822, 1624991711, 5],
    ['A', 1624990945, 1624991096, 5],
    ['A', 1624991036, 1624991066, 6],
    ['A', 1624991067, 1624991188, 5],
], columns=['ID', 'out_ts', 'in_ts', 'rollingTotalCheckedOut'])

# Merge key that uniquely identifies each interval.
df["full_interval"] = df["out_ts"].astype("str") + "_" + df["in_ts"].astype("str")

# One +1 event per checkout ...
df_out = df.drop(columns=["in_ts"])
df_out["ts"] = df_out["out_ts"]
df_out["op"] = "out"
df_out["op_val"] = 1

# ... and one -1 event per check-in.
df_in = df.drop(columns=["out_ts"])
df_in["ts"] = df_in["in_ts"]
df_in["op"] = "in"
df_in["op_val"] = -1

# Stack all events in time order; the running sum is the number of open checkouts.
df_stacked = pd.concat([df_out, df_in]).sort_values("ts")
df_stacked["rollingTotalCheckedOut"] = df_stacked["op_val"].cumsum()

# Keep only the checkout events (check-in rows have NaN out_ts after the concat)
# and merge their running totals back onto the original frame.
df_stacked = df_stacked.sort_values("out_ts").dropna(subset=["out_ts"])
df = df.merge(df_stacked.loc[:, ["ID", "full_interval", "rollingTotalCheckedOut"]],
              how="left", on=["ID", "full_interval"])
df = df.drop(columns=["full_interval"])
df
Output:
ID out_ts in_ts rollingTotalCheckedOut_x rollingTotalCheckedOut_y
0 A 1624990605 1627102404 1 1
1 A 1624990635 1625015061 2 2
2 A 1624990790 1624991096 3 3
3 A 1624990790 1624990913 4 4
4 A 1624990822 1624991711 5 5
5 A 1624990945 1624991096 5 5
6 A 1624991036 1624991066 6 6
7 A 1624991067 1624991188 5 6
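For reference, here is the same stack-the-events idea in a more compact form. This is just a sketch against the sample df above (a single ID 'A', so no per-ID grouping); the tie-breaking rule, processing check-ins before checkouts at the same timestamp, is an assumption about the semantics you want, and rollingTotalCheckedOut_calc is a hypothetical column name.

# +1 event at each checkout, -1 event at each check-in; remember the source row.
events = pd.concat([
    pd.DataFrame({'ts': df['out_ts'], 'delta': 1, 'row': df.index}),
    pd.DataFrame({'ts': df['in_ts'], 'delta': -1, 'row': df.index}),
])
# Sort by time; on ties, -1 sorts before +1, so check-ins are processed first.
events = events.sort_values(['ts', 'delta'])
# Running number of open checkouts after each event.
events['open'] = events['delta'].cumsum()
# The rolling total for each original row is the count at its checkout event.
df['rollingTotalCheckedOut_calc'] = events.loc[events['delta'] == 1].set_index('row')['open']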
I have a dataframe with some string columns and some numeric columns, and I want to handle the missing values.
I want to replace the NaN values with the mean of each row.
I saw different questions on this website, but they differ from mine, for example this link: Pandas Dataframe: Replacing NaN with row average
If all the values in a row are NaN, I want to delete that row. I have also provided a sample case as follows:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan, np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan, np.nan]
Expected output:
df = pd.DataFrame()
df['id'] = ['a', 'b', 'n']
df['md'] = ['d', 'e', 'l']
df['c1'] = [2, 6, 5]
df['c2'] = [0, 5, 3]
df['c3'] = [8, 7, 4]
df
Note:
I have used the following code; however, it is very slow, and for a big dataframe it takes a very long time to run.
index_colum = df.columns.get_loc("c1")
df_withno_id = df.iloc[:, index_colum:]
rowsidx_with_all_NaN = df_withno_id[df_withno_id.isnull().all(axis=1)].index.values
df = df.drop(df.index[rowsidx_with_all_NaN])
for i, cols in df_withno_id.iterrows():
    if i not in rowsidx_with_all_NaN:
        mean = np.nanmean(list(cols))
        df.loc[i] = df.loc[i].replace(np.nan, mean)
Can anybody help me with this? thanks.
First, select only the float columns. Second, drop the rows where all of these columns are NaN. Finally, transpose the dataframe (the float columns only), fill each column's NaN with its mean, and transpose back.
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan, np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan, np.nan]

# Float columns only.
numeric_cols = df.select_dtypes(include='float64').columns
# Drop rows that are all NaN across the numeric columns.
df.dropna(how='all', subset=numeric_cols, inplace=True)
# Transpose so each original row becomes a column, fill NaN with the
# column (i.e. row) mean, then transpose back.
df[numeric_cols] = df[numeric_cols].T.fillna(df[numeric_cols].T.mean()).T
df
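For the sample frame above this yields (row c dropped, each remaining NaN replaced by its row mean):

  id md   c1   c2   c3
0  a  d  2.0  0.0  8.0
1  b  e  6.0  5.0  7.0
3  n  l  5.0  3.0  4.0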
I want to select products, but only those that never contain a 0 in x.
Input:
test = pd.DataFrame(
    [
        ['a', 0],
        ['a', 3],
        ['a', 4],
        ['b', 3],
        ['b', 2],
        ['c', 1],
        ['d', 0]
    ]
)
test.columns = ['product', 'x']
# pseudo-code for the query I am trying to express:
test.query("select distinct (product) where x not in (0) ")
Expected outcome:
b, c
How can I do this in both pandas and SQL?
In SQL, you would use:
select product
from t
group by product
having min(x) > 0;
This works assuming x is never negative. A more general formulation is:
having sum(case when x = 0 then 1 else 0 end) = 0
In pandas you can do this with isin:
test.loc[~test['product'].isin(test.loc[test.x.eq(0),'product']),'product'].unique()
Out[41]: array(['b', 'c'], dtype=object)
Or with set difference:
set(test['product'].tolist())-set(test.loc[test.x.eq(0),'product'].tolist())
Out[47]: {'b', 'c'}
If you want to filter your dataframe, you can use groupby with .any():
test[~test.groupby('product')['x'].transform(lambda x: x.eq(0).any())]
Output:
  product  x
3       b  3
4       b  2
5       c  1
If you only want the unique values, you can append ['product'].unique().tolist() to the code above.
Then we have the output:
['b', 'c']
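Another option, if you want a single expression, is groupby().filter(), which keeps only the groups whose x values are all non-zero. A sketch on the same test frame:

# Keep the groups where every x is non-zero, then list the surviving products.
kept = test.groupby('product').filter(lambda g: g['x'].ne(0).all())
kept['product'].unique().tolist()
# ['b', 'c']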
I have a DataFrame where the index consists of identical strings.
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['a', 'a', 'a'], columns=['A', 'B', 'C'])
>>> df
A B C
a 0 2 3
a 0 4 1
a 10 20 30
Let's say I am trying to access the value in col 'B' at the first row. I am using something like this:
>>> df.iloc[0]['B']
2
Reading the post here, it seems .at is recommended for efficiency. Is there a better way in my example to return the value by row number and column name?
Try iat with get_indexer:
df.iat[0,df.columns.get_indexer(['B'])[0]]
Out[124]: 2
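Since the column labels here are unique, df.columns.get_loc('B') already returns a plain integer, so a slightly shorter equivalent is:

df.iat[0, df.columns.get_loc('B')]  # returns 2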
I have the following CSV, and I need to get the duplicated values from the DialedNumber column and then the average Duration of those duplicates.
I already got the duplicates with the following code:
df = pd.read_csv('cdrs.csv')
dnidump = pd.DataFrame(df, columns=['DialedNumber'])
pd.options.display.float_format = '{:.0f}'.format
dupl_dni = dnidump.pivot_table(index=['DialedNumber'], aggfunc='size')
a1 = dupl_dni.to_frame().rename(columns={0:'TimesRepeated'}).sort_values(by=['TimesRepeated'], ascending=False)
b = a1.head(10)
print(b)
Output:
              TimesRepeated
DialedNumber
50947740194               4
50931473242               3
50936564292               2
I can't figure out how to get the average duration of those duplicates. Any ideas?
Thanks.
Try:
df_mean = df.groupby('DialedNumber').mean()  # on recent pandas you may need mean(numeric_only=True)
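If you also want the repeat count next to the average in one pass, you can aggregate both at once. A sketch, assuming the duration column is literally named 'Duration' as capitalized in the question:

# size = how many times each number was dialed, mean = average Duration.
stats = df.groupby('DialedNumber')['Duration'].agg(['size', 'mean'])
# Keep only numbers that occur more than once, most-repeated first.
repeated = stats[stats['size'] > 1].sort_values('size', ascending=False)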
Use df.groupby('column').mean()
Here is sample code.
Input
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [2461, 1023, 9, 5614, 212],
                   'C': [2, 4, 8, 16, 32]}, columns=['A', 'B', 'C'])
df.groupby('A').mean()
Output
B C
A
1 1164.333333 4.666667
2 2913.000000 24.000000
API reference of pandas.core.groupby.GroupBy.mean
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.mean.html
I am trying to find the record with the maximum value among the first records of each group after a groupby, and delete that record from the original dataframe.
import pandas as pd

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)

t = df.groupby('item_id').first()  # lost track of the index
desired_row = t[t.cost == t.cost.max()]
# delete this row from df
         cost
item_id
d           5
I need to keep track of desired_row, delete that row from df, and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a fully general way, but this will work in your case since you are taking the first item of each group (it would work just as easily on the last). In fact, because of the general nature of split-apply-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups  # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}
df2 = df.iloc[list(subset.values())]
# These are the first items in each group.
>>> df2
   cost item_id
0     1       a
5     1       c
2     1       b
6     5       d
# Exclude any rows where the cost equals the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
   cost item_id
0     1       a
1     2       a
2     1       b
3     1       b
4     3       b
5     1       c
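A shorter variant of the same idea is groupby().nth(0), which also takes the first row of each group. Note this is version-dependent: in recent pandas, nth(0) keeps the original row labels, while older releases indexed the result by the group key, so treat this as a sketch:

first_rows = df.groupby('item_id').nth(0)  # first row per group, original index kept (recent pandas)
df.drop(first_rows['cost'].idxmax())       # drop the max-cost first row from df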
Try this:
import pandas as pd

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
t = df.drop_duplicates(subset=['item_id'], keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Or using "not in". Consider this df with a few more rows:
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5, 1, 7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
Overview: create a dataframe from a dictionary, group by item_id and find the max value, then enumerate over the grouped result, using the numeric key to look up the alphabetic index value. Build a result_df dataframe if you want one.
import pandas as pd

df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                        'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped = df_temp.groupby(['item_id'])['cost'].max()

# Collect rows, then build the result frame in one go
# (DataFrame.append was removed in pandas 2.0).
rows = []
for key, value in enumerate(grouped):
    index = grouped.index[key]
    rows.append({'item_id': index, 'cost': value})
result_df = pd.DataFrame(rows, columns=['item_id', 'cost'])
print(result_df.head(5))
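For what it's worth, the loop above is equivalent to resetting the index on the grouped Series:

result_df = grouped.reset_index()  # columns: item_id, cost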