I have a data frame in pandas like this:
Name Date
A 9/1/21
B 10/20/21
C 9/8/21
D 9/20/21
K 9/29/21
K 9/15/21
M 10/1/21
C 9/12/21
D 9/9/21
C 9/9/21
R 9/20/21
I need to get the count of items by week.
weeks = [9/6/21, 9/13/21, 9/20/21, 9/27/21, 10/4/21]
Example: From 9/6 to 9/13, the output should be:
Name Weekly count
A 0
B 0
C 3
D 1
M 0
K 0
R 0
Similarly, I need to find the count on these intervals: 9/13 to 9/20, 9/20 to 9/27, and 9/27 to 10/4. Thank you!
With the caveat that you have to pick a definition for the first day of the week, you could take something from the following code.
import numpy as np
import pandas as pd

df = pd.DataFrame(data=d)  # d holds the Name/Date records from the question
df['Date'] = pd.to_datetime(df['Date'])
I. Discontinuous index
Monday is chosen as the first day of the week.
#(1) Build a series of first_day_of_week; Monday is chosen as the first day of the week
weeks_index = df['Date'] - df['Date'].dt.weekday * np.timedelta64(1, 'D')
#(2) Groupby and some tidying
df2 = (df.groupby([df['Name'], weeks_index])
         .count()
         .rename(columns={'Date': 'Count'})
         .swaplevel()            # weeks to first level
         .sort_index()
         .unstack(1).fillna(0.0)
         .astype(int)
         .rename_axis('first_day_of_week')
      )
>>> print(df2)
Name A B C D K M R
first_day_of_week
2021-08-30 1 0 0 0 0 0 0
2021-09-06 0 0 3 1 0 0 0
2021-09-13 0 0 0 0 1 0 0
2021-09-20 0 0 0 1 0 0 1
2021-09-27 0 0 0 0 1 1 0
2021-10-18 0 1 0 0 0 0 0
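As a side note (not how the answer above builds it), the same Monday week key can also be derived with Period; a small sketch, assuming the df defined above:
# Alternative: 'W' periods run Monday through Sunday, so start_time is that week's Monday
weeks_index = df['Date'].dt.to_period('W').dt.start_time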
II. Continuous index
This part does not differ much from the previous one.
We build a continuous version of the index, to be used for reindexing.
Monday is chosen as the first day of the week (obviously for both indices).
#(1a) Build a series of first_day_of_week; Monday is chosen as the first day of the week
weeks_index = df['Date'] - df['Date'].dt.weekday * np.timedelta64(1, 'D')
#(1b) Build a continuous series of first_day_of_week
continuous_weeks_index = pd.date_range(start=weeks_index.min(),
                                       end=weeks_index.max(),
                                       freq='W-MON')  # Monday
#(2) Groupby, unstack, reindex, and some tidying
df2 = (df
       # groupby and count
       .groupby([df['Name'], weeks_index])
       .count()
       .rename(columns={'Date': 'Count'})
       # unstack on weeks
       .swaplevel()            # weeks to first level
       .sort_index()
       .unstack(1)
       # reindex to insert weeks with no data
       .reindex(continuous_weeks_index)  # new index
       # clean up
       .fillna(0.0)
       .astype(int)
       .rename_axis('first_day_of_week')
      )
>>> print(df2)
Name A B C D K M R
first_day_of_week
2021-08-30 1 0 0 0 0 0 0
2021-09-06 0 0 3 1 0 0 0
2021-09-13 0 0 0 0 1 0 0
2021-09-20 0 0 0 1 0 0 1
2021-09-27 0 0 0 0 1 1 0
2021-10-04 0 0 0 0 0 0 0
2021-10-11 0 0 0 0 0 0 0
2021-10-18 0 1 0 0 0 0 0
Last step, if needed:
df2.stack()
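If you only need the four intervals listed in the question (week starts 9/6, 9/13, 9/20, and 9/27), you could then slice the result; a sketch assuming df2 from above (requested_weeks is just an illustrative name):
# Keep only the week starts the question asks about
requested_weeks = pd.to_datetime(['2021-09-06', '2021-09-13', '2021-09-20', '2021-09-27'])
print(df2.loc[df2.index.isin(requested_weeks)])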
I'm trying to fill a column with numbers from -5000 to 5004, stepping by 4, between a condition in one column and a condition in another. The count starts when start == 1. The count won't always reach 5004, so it needs to stop when end == 1.
Here is an example of the input:
start end
1 0
0 0
0 0
0 0
0 0
0 1
0 0
0 0
1 0
0 0
I have tried np.arange:
df['time'] = df['start'].apply(lambda x: np.arange(-5000,5004,4) if x==1 else 0)
This obviously doesn't work - I ended up with a series in one cell. I also messed around with cycle from itertools, but that doesn't work because the distances between the start and end aren't always equal. I also feel there might be a way to do this with ffill:
rise = df[df.start.where(df.start==1).ffill(limit=1250).notnull()]
Not sure how to edit this to stop at the correct place though.
I'd love to have a lambda function that achieves this, but I'm not sure where to go from here.
Here is my expected output:
start end time
1 0 -5000
0 0 -4996
0 0 -4992
0 0 -4988
0 0 -4984
0 1 -4980
0 0 nan
0 0 nan
1 0 -5000
0 0 -4996
# Build a group key: a new group starts at each start==1 and on the row after each end==1
grouping = df['start'].add(df['end'].shift(1).fillna(0)).cumsum()
# Within each group, count rows and scale to the -5000, -4996, ... sequence
df['time'] = (df.groupby(grouping).cumcount() * 4 - 5000)
# Blank out groups that contain neither a start nor an end marker
df.loc[df.groupby(grouping).filter(lambda x: x[['start', 'end']].sum().sum() == 0).index, 'time'] = np.nan
Output:
>>> df
start end time
0 1 0 -5000.0
1 0 0 -4996.0
2 0 0 -4992.0
3 0 0 -4988.0
4 0 0 -4984.0
5 0 1 -4980.0
6 0 0 NaN
7 0 0 NaN
8 1 0 -5000.0
9 0 0 -4996.0
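To see how this segments the rows, it can help to print the intermediate key on the sample input; a sketch, assuming df is the sample frame shown above:
# The key increases at every start==1 and on the row after every end==1,
# so rows 0-5 form group 1, the in-between rows 6-7 form group 2, and rows 8-9 form group 3
grouping = df['start'].add(df['end'].shift(1).fillna(0)).cumsum()
print(grouping.tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0]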
a = [[0,0,0,0],[0,-1,1,0],[1,-1,1,0],[1,-1,1,0]]
df = pd.DataFrame(a, columns=['A','B','C','D'])
df
Output:
A B C D
0 0 0 0 0
1 0 -1 1 0
2 1 -1 1 0
3 1 -1 1 0
So, reading down vertically per column, the values in each column all begin at 0 on the first row; once they change they can never change back, and they can become either a 1 or a -1. I would like to rearrange the dataframe columns so that the columns are in this order:
Columns that hit 1, ordered by how early (in which row) they do so
Columns that hit -1, ordered by how early they do so
Finally, the remaining columns that never changed value and stayed zero (if there are even any left)
Desired Output:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
My main data frame is 3000 rows and 61 columns; is there any way of doing this quickly?
We have to handle the positive and negative values separately. One way is to take the sum of each column and then use sort_values to adjust the ordering:
a = df.sum().sort_values(ascending=False)
b = pd.concat((a[a.gt(0)],a[a.lt(0)].sort_values(),a[a.eq(0)]))
out = df.reindex(columns=b.index)
print(out)
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
Try with pd.Series.first_valid_index
s = df.where(df.ne(0))                     # keep only the non-zero values
s1 = s.apply(pd.Series.first_valid_index)  # first row where each column becomes non-zero
s2 = s.bfill().iloc[0]                     # the value it becomes (1 or -1); NaN if it never changes
out = df.loc[:, pd.concat([s2, s1], axis=1, keys=[0, 1])
               .sort_values([0, 1], ascending=[False, True]).index]
out
Out[35]:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
I have a dataframe and want to create a new column based on other rows of the dataframe. My dataframe looks like
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
Now I want to check: if the freq of a row is zero, is there another row with the same ProjektID, Jahr, and Week where the freq is not 0? If there is, I want a new column "other" with value 1, and 0 otherwise.
So, the output should be
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
This time I have no approach; can anyone help?
Thanks!
The following solution tests if the required conditions are True.
import io
import pandas as pd
Data
df = pd.read_csv(io.StringIO("""
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
"""), sep="\s\s+", engine="python")
Make a column other with all values zero.
df['other'] = 0
If ProjektID, Jahr, and Week are duplicated and any of the freq values is larger than zero, then the rows that are duplicated (keep=False also captures the first occurrence of each duplicate) and where freq is zero get other set to 1. Change any() to all() if you need all values to be larger than zero.
if (df.loc[df[['ProjektID', 'Jahr', 'Week']].duplicated(), 'freq'] > 0).any():
    df.loc[(df[['ProjektID', 'Jahr', 'Week']].duplicated(keep=False)) & (df['freq'] == 0), ['other']] = 1
else:
    print("Other stays zero")
Output: other becomes 1 only for row 1 (ProjektID 17364, Jahr 2020, Week 3, where freq is 0), which matches the expected result in the question.
I think the best way to solve this is not to use pandas too much :-) converting things to sets and tuples should make it fast enough.
The idea is to make a set of all the triples (ProjektID, Jahr, Week) that appear in the dataset with freq != 0 and then check, for all lines with freq == 0, whether their triple belongs to this set or not. In code, I'm creating a dummy dataset with:
x = pd.DataFrame(np.random.randint(0, 2, (8, 4)), columns=['id', 'year', 'week', 'freq'])
which in my case randomly gave:
>>> x
id year week freq
0 1 0 0 0
1 0 0 0 1
2 0 1 0 1
3 0 0 1 0
4 0 1 0 0
5 1 0 0 1
6 0 0 1 1
7 0 1 1 0
Now, we want triplets only where freq != 0, so we use
x1 = x.loc[x['freq'] != 0]
triplets = {tuple(row) for row in x1[['id', 'year', 'week']].values}
Note that I'm using x1.values, which is not a pandas DataFrame but rather a numpy array, so each row in there can be converted to a tuple. This is necessary because dataframe rows (or numpy arrays, or lists) are mutable objects that cannot be hashed, so they cannot go into a set directly. Using a set instead of e.g. a list (which doesn't have this restriction) is for efficiency.
Next, we define a boolean variable which is True if a triplet (id, year, week) belongs to the above set:
belongs = x[['id', 'year', 'week']].apply(lambda x: tuple(x) in triplets, axis=1)
We are basically done; this is essentially the extra column you want, except we also need to require freq == 0:
x['other'] = np.logical_and(belongs, x['freq'] == 0).astype(int)
(the final .astype(int) is to get values 0 and 1, as you asked, instead of False and True). Final result in my case:
>>> x
id year week freq other
0 1 0 0 0 1
1 0 0 0 1 0
2 0 1 0 1 0
3 0 0 1 0 1
4 0 1 0 0 1
5 1 0 0 1 0
6 0 0 1 1 0
7 0 1 1 0 0
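Applied to the question's actual column names, a minimal sketch of the same set-based idea (assuming df is the frame from the question, with other as the new column) might be:
# Set of (ProjektID, Jahr, Week) triples that occur with a non-zero freq
triplets = {tuple(row) for row in df.loc[df['freq'] != 0, ['ProjektID', 'Jahr', 'Week']].values}
# True where the row's triple also appears with freq != 0 somewhere in the frame
belongs = df[['ProjektID', 'Jahr', 'Week']].apply(lambda r: tuple(r) in triplets, axis=1)
df['other'] = (belongs & (df['freq'] == 0)).astype(int)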
Looks like I am too late ...:
df.set_index(['ProjektID', 'Jahr', 'Week'], drop=True, inplace=True)
df['other'] = 0
df.other.mask(df.freq == 0,
              df.freq[df.freq == 0].index.isin(df.freq[df.freq != 0].index),
              inplace=True)
df.other = df.other.astype('int')
df.reset_index(drop=False, inplace=True)
OK, I admit I had trouble formulating a good title for this, so I will try to give an example.
This is my sample dataframe:
df = pd.DataFrame([
(1,"a","good"),
(1,"a","good"),
(1,"b","good"),
(1,"c","bad"),
(2,"a","good"),
(2,"b","bad"),
(3,"a","none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg({'id':'count'})
This results in:
id
id type
1 a 2
b 1
c 1
2 a 1
b 1
3 a 1
This is fine, although later on I will also need the id repeated in every row. But this is not the most important part.
What I would need now is something like this:
id good bad none
id type
1 a 2 2 0 0
b 1 1 0 0
c 1 0 1 0
2 a 1 1 0 0
b 1 0 1 0
3 a 1 0 0 1
And even better would be a result like this, because I will need this back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality, there will be many more columns I am grouping by. They would have to be completely populated as well.
id good bad none
id type
1 a 2 2 0 0
1 b 1 1 0 0
1 c 1 0 1 0
2 a 1 1 0 0
2 b 1 0 1 0
3 a 1 0 0 1
Thank you for helping me out.
You can use groupby + size (with the eval column added as the last grouping key) or value_counts with unstack:
df1 = (df.groupby(["id", "type", 'eval'])
         .size()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
df1 = (df.groupby(["id", "type"])['eval']
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
But for writing to Excel:
df1.to_excel('file.xlsx')
So you need reset_index as the last step.
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forgot about the id column, but that would be a duplicate column name, so name it id1 instead:
df1.insert(0, 'id1', df1.sum(axis=1))
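Putting the EDIT together with the export step, a small sketch (assuming df1 from either groupby version above):
# Add the per-row total first, then flatten the index for Excel
df1.insert(0, 'id1', df1.sum(axis=1))
df1.reset_index().to_excel('file.xlsx', index=False)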
I have the following dataframe:
c3ann c3nfx c3per c4ann c4per pastr primf
c3ann 1 0 1 0 1 0 1
c3nfx 1 0 1 0 1 0 1
c3per 1 0 1 0 1 0 1
c4ann 1 0 1 0 1 0 1
c4per 1 0 1 0 1 0 1
pastr 1 0 1 0 1 0 1
primf 1 0 1 0 1 0 1
I would like to reorder the rows and columns so that the order is this:
primf pastr c3ann c3nfx c3per c4ann c4per
I can do this for just the columns like this:
cols = ['primf', 'pastr', 'c3ann', 'c3nfx', 'c3per', 'c4ann', 'c4per']
df = df[cols]
How do I do this such that the row headers are also changed appropriately?
You can use reindex to reorder both the columns and index at the same time.
df = df.reindex(index=cols, columns=cols)
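Note that reindex matches by label, so a label that does not exist in the original frame simply comes back as a row/column of NaN rather than raising an error. A minimal toy sketch of reordering both axes (standalone data, not the question's frame):
import pandas as pd

# Toy frame: reindex reorders both the rows and the columns by label
df = pd.DataFrame([[1, 2], [3, 4]], index=['x', 'y'], columns=['x', 'y'])
order = ['y', 'x']
print(df.reindex(index=order, columns=order))
#    y  x
# y  4  3
# x  2  1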