I have a DataFrame with hourly intervals as columns and employee IDs as rows.
I want to iterate over each column (hourly interval) and extract it to a list ONLY if the column contains the number 1 (1 means they are available in that hour, 0 means they are not).
I've tried iterrows() and iteritems(), and neither gives me what I want from this DataFrame, which is a new list like
available = ['0800', '0900', '1000', '1100']
from which I can then extract the min and max values to create a schedule.
Apologies if this is somewhat vague; I'm pretty new to Python 3 and Pandas.
You don't need to iterate.
Suppose you have a DataFrame like this:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 1 0 1 1 0
1 1 0 1 1 1 1 1 1 0 1
2 1 1 1 0 0 0 0 0 0 0
3 0 1 1 0 1 1 0 0 1 1
4 1 0 1 0 1 0 1 0 0 0
5 0 1 1 0 0 0 0 0 0 0
6 1 0 0 0 1 1 1 1 0 0
7 0 1 0 1 0 1 1 1 1 1
8 0 0 1 0 1 1 1 0 0 0
9 1 0 0 1 0 0 1 1 1 1
You can use this code to get, for each row, the names of all the columns where the value is 1:
df['available'] = df.apply(lambda row: row[row == 1].index.tolist(), axis=1)
0 1 2 3 4 5 6 7 8 9 available
0 0 0 0 0 0 1 0 1 1 0 [5, 7, 8]
1 1 0 1 1 1 1 1 1 0 1 [0, 2, 3, 4, 5, 6, 7, 9]
2 1 1 1 0 0 0 0 0 0 0 [0, 1, 2]
3 0 1 1 0 1 1 0 0 1 1 [1, 2, 4, 5, 8, 9]
4 1 0 1 0 1 0 1 0 0 0 [0, 2, 4, 6]
5 0 1 1 0 0 0 0 0 0 0 [1, 2]
6 1 0 0 0 1 1 1 1 0 0 [0, 4, 5, 6, 7]
7 0 1 0 1 0 1 1 1 1 1 [1, 3, 5, 6, 7, 8, 9]
8 0 0 1 0 1 1 1 0 0 0 [2, 4, 5, 6]
9 1 0 0 1 0 0 1 1 1 1 [0, 3, 6, 7, 8, 9]
And if you want the min/max from this you can use
df['min_max'] = df['available'].apply(lambda x: (min(x), max(x)))
available min_max
0 [5, 7, 8] (5, 8)
1 [0, 2, 3, 4, 5, 6, 7, 9] (0, 9)
2 [0, 1, 2] (0, 2)
3 [1, 2, 4, 5, 8, 9] (1, 9)
4 [0, 2, 4, 6] (0, 6)
5 [1, 2] (1, 2)
6 [0, 4, 5, 6, 7] (0, 7)
7 [1, 3, 5, 6, 7, 8, 9] (1, 9)
8 [2, 4, 5, 6] (2, 6)
9 [0, 3, 6, 7, 8, 9] (0, 9)
You can simply do
available = df.columns[df.T.any(axis=1)].tolist()
In general it is not advisable to iterate over Pandas DataFrames unless they are small, since iteration bypasses pandas' vectorized functions and is therefore much slower.
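For instance, here is a vectorized version of the original question (a sketch with hypothetical hour labels: note that `0800` is not a valid Python integer literal, so string column names are assumed):

```python
import pandas as pd

# Hypothetical availability grid: rows are employee IDs, columns are hour labels
df = pd.DataFrame(
    [[1, 1, 1, 0], [0, 1, 1, 1]],
    index=["emp1", "emp2"],
    columns=["0800", "0900", "1000", "1100"],
)

# Column labels where a given employee's row contains 1 -- no iteration needed
available = df.columns[df.loc["emp1"] == 1].tolist()
print(available)  # ['0800', '0900', '1000']
```

Since the labels are zero-padded strings, `min(available)` and `max(available)` still give the earliest and latest hour.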
Can you show the rest of your code?
Assuming only 0s and 1s are in the DataFrame, the following conditional selection should work (if I'm correctly interpreting what you want; it seems more likely that you want what Shubham Periwal posted):
filtered_df = df[df != 0]
lists = filtered_df.values.tolist()
Or in 1 line:
lists = df[df != 0].values.tolist()
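One caveat worth knowing (a small sketch on a toy frame): conditional selection keeps the frame's shape and fills non-matching cells with NaN, so the resulting lists still contain placeholders you may want to drop:

```python
import math

import pandas as pd

df = pd.DataFrame([[1, 0, 1], [0, 1, 0]])

# non-matching cells become NaN, so each row keeps its original length
lists = df[df != 0].values.tolist()
print(lists)  # [[1.0, nan, 1.0], [nan, 1.0, nan]]

# drop the NaNs per row if you only want the matches
clean = [[v for v in row if not math.isnan(v)] for row in lists]
print(clean)  # [[1.0, 1.0], [1.0]]
```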
I'm working with the following dictionary, which I load into a pandas DataFrame to manipulate:
data = {"Value": [4, 4, 2, 1, 1, 1, 0, 7, 0, 4, 1, 1, 3, 0, 3, 0, 7, 0, 4, 1, 0, 1, 0, 1, 4, 4, 2, 3],
"IdPar": [0, 0, 0, 0, 0, 0, 10, 10, 10, 10, 10, 0, 0, 22, 22, 28, 28, 28, 28, 0, 0, 38, 38 , 0, 0, 0, 0, 0]
}
df = pd.DataFrame(data)
I would like a new column called Count such that, whenever a number repeats in the IdPar column, a sequential counter is generated in the same row, with the condition that whenever IdPar is 0, Count is also 0. This is what I expect to get:
    Value  IdPar  Count
0       4      0      0
1       4      0      0
2       2      0      0
3       1      0      0
4       1      0      0
5       1      0      0
6       0     10      1
7       7     10      2
8       0     10      3
9       4     10      4
10      1     10      5
11      1      0      0
12      3      0      0
13      0     22      1
14      3     22      2
15      0     28      1
16      7     28      2
17      0     28      3
18      4     28      4
19      1      0      0
20      0      0      0
21      1     38      1
22      0     38      2
23      1      0      0
24      4      0      0
25      4      0      0
26      2      0      0
27      3      0      0
I've reviewed the pandas documentation and tried many functions (ne, shift, cumsum, groupby, pivot_table, transform), but none of them gives the result I want:
s = df.pivot_table(index = ['IdPar'], aggfunc = 'size')
print(s)
t = df['IdPar'].ne(df['IdPar'].shift()).cumsum()
print(t)
df['Count'] = df['IdPar'].isin(df['IdPar'])
df['Count'] = df.loc[df['Count'] == True, 'IdPar']
print(df)
The closest I've come is filling Count with the total number of repetitions of each IdPar value, which is the code below, but I don't want that either:
df['Count'] = df.groupby(['IdPar'])['Value'].transform('count')
print(df['Count'])
I really appreciate anyone who can help me. Any comment helps.
Try cumcount:
df['Count'] = df.groupby('IdPar')['IdPar'].cumcount() + 1
df.loc[df['IdPar'] == 0, 'Count'] = 0
print(df)
Or try in one line:
df['Count'] = df.groupby('IdPar').cumcount().add(1).mask(df['IdPar'].eq(0), 0)
Both codes output:
    Value  IdPar  Count
0       4      0      0
1       4      0      0
2       2      0      0
3       1      0      0
4       1      0      0
5       1      0      0
6       0     10      1
7       7     10      2
8       0     10      3
9       4     10      4
10      1     10      5
11      1      0      0
12      3      0      0
13      0     22      1
14      3     22      2
15      0     28      1
16      7     28      2
17      0     28      3
18      4     28      4
19      1      0      0
20      0      0      0
21      1     38      1
22      0     38      2
23      1      0      0
24      4      0      0
25      4      0      0
26      2      0      0
27      3      0      0
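As a quick cross-check (a sketch using the question's data), the two variants produce the same column. Note that both rely on each nonzero IdPar value appearing in a single contiguous block; if the same nonzero IdPar reappeared later, groupby would keep counting across the separated blocks:

```python
import pandas as pd

data = {"Value": [4, 4, 2, 1, 1, 1, 0, 7, 0, 4, 1, 1, 3, 0, 3, 0, 7, 0, 4, 1, 0, 1, 0, 1, 4, 4, 2, 3],
        "IdPar": [0, 0, 0, 0, 0, 0, 10, 10, 10, 10, 10, 0, 0, 22, 22, 28, 28, 28, 28, 0, 0, 38, 38, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)

# variant 1: cumcount, then overwrite the IdPar == 0 rows
a = df.groupby('IdPar')['IdPar'].cumcount() + 1
a[df['IdPar'] == 0] = 0

# variant 2: the same thing with mask, in one line
b = df.groupby('IdPar').cumcount().add(1).mask(df['IdPar'].eq(0), 0)

print(a.equals(b))  # True
```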
I have the following DataFrame:
>>> df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 3, 3, 3], "b": [1, 5, 7, 9, 2, 4, 6, 14, 5], "c": [1, 0, 0, 1, 1, 1, 1, 0, 1]})
>>> df
a b c
0 1 1 1
1 1 5 0
2 1 7 0
3 1 9 1
4 2 2 1
5 2 4 1
6 3 6 1
7 3 14 0
8 3 5 1
I want to calculate the mode of column c for every unique value in a and then select the rows where c has this value.
This is my own solution:
>>> major_types = df.groupby(['a'])['c'].apply(lambda x: pd.Series.mode(x)[0])
>>> df = df.merge(major_types, how="left", right_index=True, left_on="a", suffixes=("", "_major"))
>>> df = df[df['c'] == df['c_major']].drop(columns="c_major", axis=1)
Which would output the following:
>>> df
a b c
1 1 5 0
2 1 7 0
4 2 2 1
5 2 4 1
6 3 6 1
8 3 5 1
It is very inefficient for large DataFrames. Any idea on what to do?
IIUC, you can use GroupBy.transform instead of apply + merge:
df.loc[df['c'].eq(df.groupby('a')['c'].transform(lambda x: x.mode()[0]))]
a b c
1 1 5 0
2 1 7 0
4 2 2 1
5 2 4 1
6 3 6 1
8 3 5 1
Or
s = df.groupby(['a','c'])['c'].transform('size')
df.loc[s.eq(s.groupby(df['c']).transform('max'))]
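The transform variant can be checked against the question's frame; a quick sketch (note that on a tie, `mode()[0]` picks the smaller value, which matches the behavior of the question's own `pd.Series.mode(x)[0]`):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 3, 3, 3],
                   "b": [1, 5, 7, 9, 2, 4, 6, 14, 5],
                   "c": [1, 0, 0, 1, 1, 1, 1, 0, 1]})

# keep rows where c equals the per-group mode of c
out = df.loc[df['c'].eq(df.groupby('a')['c'].transform(lambda x: x.mode()[0]))]
print(out.index.tolist())  # [1, 2, 4, 5, 6, 8]
```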
I have a df where I want to do multi-label classification. One of the approaches suggested to me was to calculate a probability vector. Here's an example of my df with what would represent training data:
id ABC DEF GHI
1 0 0 0 1
2 1 0 1 0
3 2 1 0 0
4 3 0 1 1
5 4 0 0 0
6 5 0 1 1
7 6 1 1 1
8 7 1 0 1
9 8 1 1 0
And I would like to concatenate columns ABC, DEF and GHI into a new column. I will also have to do this with more than 3 columns, so I want to do it relatively cleanly using a column list or something similar:
col_list = ['ABC','DEF','GHI']
The result I am looking for would be something like:
id ABC DEF GHI Conc
1 0 0 0 1 [0,0,1]
2 1 0 1 0 [0,1,0]
3 2 1 0 0 [1,0,0]
4 3 0 1 1 [0,1,1]
5 4 0 0 0 [0,0,0]
6 5 0 1 1 [0,1,1]
7 6 1 1 1 [1,1,1]
8 7 1 0 1 [1,0,1]
9 8 1 1 0 [1,1,0]
Try:
col_list = ['ABC','DEF','GHI']
df['agg_lst'] = df.apply(lambda x: [x[col] for col in col_list], axis=1)
You can use 'agg' with function 'list':
df[col_list].agg(list, axis=1)
1 [0, 0, 1]
2 [0, 1, 0]
3 [1, 0, 0]
4 [0, 1, 1]
5 [0, 0, 0]
6 [0, 1, 1]
7 [1, 1, 1]
8 [1, 0, 1]
9 [1, 1, 0]
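A third option, if all the columns are numeric and you want plain Python lists: going through the underlying NumPy array avoids the per-row Python-level apply entirely (a sketch on the question's shape, with made-up id values):

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 1, 2],
                   "ABC": [0, 0, 1],
                   "DEF": [0, 1, 0],
                   "GHI": [1, 0, 0]})
col_list = ['ABC', 'DEF', 'GHI']

# .values gives one 2-D array; tolist() turns each row into a list
df['Conc'] = df[col_list].values.tolist()
print(df['Conc'].tolist())  # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
```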
If I have an array [1, 2, 3, 4, 5] and a Pandas Dataframe
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the Pandas DataFrame adding my array to each previous row?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row, which you can create using np.arange(len(df))[:,None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
df = pd.DataFrame([
[1,1,1],
[0,0,0],
[0,0,0],
])
s = pd.Series([1,2,3])
# add to every row except first, then cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment (with the array as a plain list l):
l = [1, 2, 3, 4, 5]
df[1:] = (df + l).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df+l).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
After cumsum, you can shift and add back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)
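To check the shift/fillna variant end to end, a sketch with the question's frame (fillna introduces floats, so the result is cast back to int for display):

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0]])
a = [1, 2, 3, 4, 5]

# add the array to every row, cumulate, shift down one row, then add back
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
out = df.add(updated).astype(int)
print(out.values.tolist())
# [[1, 1, 1, 1, 1], [2, 3, 4, 5, 6], [3, 5, 7, 9, 11], [4, 7, 10, 13, 16]]
```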
I have a data set with a variable that takes the values 0 and 1.
I need output in the following way.
Variable - 0 1 1 1 0 1 1 1 0 1 1 0
Flag - 1 1 1 1 2 2 2 2 3 3 3 4
Every time the variable hits 0, the flag should increment by 1, and it should remain the same until the next 0 is encountered.
I'm converting code from SAS to Python. This was pretty easy in SAS, but I'm finding it difficult in pandas. Is there anything like SAS's RETAIN functionality in pandas? I don't see a retain function in the pandas documentation.
Thanks in Advance.
I think you need to compare with 0 and take the cumsum:
s = pd.Series([ 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0])
print (s)
0 0
1 1
2 1
3 1
4 0
5 1
6 1
7 1
8 0
9 1
10 1
11 0
dtype: int64
s1 = (s == 0).cumsum()
print (s1)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 4
dtype: int32
df = pd.DataFrame({'Variable': [ 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]})
df['Flag'] = (df.Variable == 0).cumsum()
print (df)
Variable Flag
0 0 1
1 1 1
2 1 1
3 1 1
4 0 2
5 1 2
6 1 2
7 1 2
8 0 3
9 1 3
10 1 3
11 0 4
Instead of using pandas, you can just use a loop, like this:
a = '0 1 1 1 0 1 1 1 0 1 1 0'
flags = []
flag = 0
for i in a.split():
    if int(i) == 0:
        flag += 1
        flags.append(flag)
    else:
        flags.append(flag)
print(flags)
Output:
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4]
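For what it's worth, the loop and the vectorized cumsum answer above agree; a quick cross-check:

```python
import pandas as pd

a = '0 1 1 1 0 1 1 1 0 1 1 0'

# loop version: bump the flag on every 0, carry it forward otherwise
flags = []
flag = 0
for i in a.split():
    if int(i) == 0:
        flag += 1
    flags.append(flag)

# vectorized version from the cumsum answer above
s = pd.Series([int(x) for x in a.split()])
vectorized = (s == 0).cumsum().tolist()

print(flags == vectorized)  # True
```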