I'm trying to fill in a column with the numbers -5000 to 5004, stepping by 4, between a condition in one column and a condition in another. The count starts when start == 1. The count won't always get all the way to the end of the range, so it needs to stop when end == 1.
Here is an example of the input:
start end
1 0
0 0
0 0
0 0
0 0
0 1
0 0
0 0
1 0
0 0
I have tried np.arange:
df['time'] = df['start'].apply(lambda x: np.arange(-5000,5004,4) if x==1 else 0)
This obviously doesn't work - I ended up with a series in one cell. I also messed around with cycle from itertools, but that doesn't work because the distances between the start and end aren't always equal. I also feel there might be a way to do this with ffill:
rise = df[df.start.where(df.start==1).ffill(limit=1250).notnull()]
Not sure how to edit this to stop at the correct place though.
I'd love to have a lambda function that achieves this, but I'm not sure where to go from here.
Here is my expected output:
start end time
1 0 -5000
0 0 -4996
0 0 -4992
0 0 -4988
0 0 -4984
0 1 -4980
0 0 nan
0 0 nan
1 0 -5000
0 0 -4996
The idea: build a group label that increases at every start and again just after every end, then number the rows within each group; groups that contain neither a start nor an end get NaN:
import numpy as np

# A new group begins at each start==1 and on the row after each end==1
grouping = df['start'].add(df['end'].shift(1).fillna(0)).cumsum()

# Number the rows within each group: 0, 1, 2, ... -> -5000, -4996, -4992, ...
df['time'] = df.groupby(grouping).cumcount() * 4 - 5000

# Blank out the groups that fall outside any start/end window
outside = df.groupby(grouping).filter(lambda x: x[['start', 'end']].sum().sum() == 0).index
df.loc[outside, 'time'] = np.nan
Output:
>>> df
start end time
0 1 0 -5000.0
1 0 0 -4996.0
2 0 0 -4992.0
3 0 0 -4988.0
4 0 0 -4984.0
5 0 1 -4980.0
6 0 0 NaN
7 0 0 NaN
8 1 0 -5000.0
9 0 0 -4996.0
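To see how the groups are formed, you can print the intermediate grouping series; a minimal sketch using the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'start': [1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
                   'end':   [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]})

# A new group starts at each start==1 and on the row after each end==1
grouping = df['start'].add(df['end'].shift(1).fillna(0)).cumsum()
print(grouping.astype(int).tolist())  # [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]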
Hi, I have a dataframe in pandas like below:
exit  new_column
0     0
0     0
1     1
0     0
0     0
0     0
0     0
0     0
1     1
I need to create the desired output column as given below.
In the exit column, check whether there is another occurrence of 1 within the next 10 rows of a given occurrence of 1. If so, set the value of the new column to 1 for the current occurrence and to 0 for the later occurrence. If there is no other occurrence of 1 in the next 10 rows, set the new column to 1 for the current occurrence. If the value of exit is 0, set the new column to 0.
exit  new_column  desired_output
0     0           0
0     0           0
1     1           1
0     0           0
0     0           0
0     0           0
0     0           0
0     0           0
1     1           0
I tried the code below, but I am not able to achieve the desired output column; I am getting results similar to new_column, which is not intended.
df['new_column'] = 0
for i, row in df.iterrows():
    if row['exit'] == 1:
        next_rows = df.loc[i+1:i+10, 'exit']
        if (next_rows == 1).any():
            df.loc[i, 'new_column'] = 1
            later_occurrence_index = next_rows[next_rows == 1].index[0]
            df.loc[later_occurrence_index, 'new_column'] = 0
        else:
            df.loc[i, 'new_column'] = 1
    else:
        df.loc[i, 'new_column'] = 0
You can use rolling with a trailing window of 10 rows:
# True when the last row of the window is a 1 and it is the only 1 in the window
check_one = lambda x: (x.iloc[-1] == 1) & (x.sum() == 1)
df['out'] = (df[['exit', 'new_column']].eq(1).all(axis=1)
             .rolling(10, min_periods=1)
             .apply(check_one).astype(int))
print(df)
# Output
exit new_column out
0 0 0 0
1 0 0 0
2 1 1 1
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 1 1 0
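To make the window logic explicit, here is a plain-loop equivalent of the trailing window that rolling uses; a sketch, reconstructing the df from the output above:
import pandas as pd

df = pd.DataFrame({'exit':       [0, 0, 1, 0, 0, 0, 0, 0, 1],
                   'new_column': [0, 0, 1, 0, 0, 0, 0, 0, 1]})

# A row keeps its 1 only if it is the sole 1 among itself and the 9 rows before it
both = df[['exit', 'new_column']].eq(1).all(axis=1)
df['out'] = [int(both.iloc[i] and both.iloc[max(0, i - 9):i + 1].sum() == 1)
             for i in range(len(both))]
print(df['out'].tolist())  # [0, 0, 1, 0, 0, 0, 0, 0, 0]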
I have a data frame in pandas like this:
Name Date
A 9/1/21
B 10/20/21
C 9/8/21
D 9/20/21
K 9/29/21
K 9/15/21
M 10/1/21
C 9/12/21
D 9/9/21
C 9/9/21
R 9/20/21
I need to get the count of items by week.
weeks = [9/6/21, 9/13/21, 9/20/21, 9/27/21, 10/4/21]
Example: From 9/6 to 9/13, the output should be:
Name Weekly count
A 0
B 0
C 3
D 1
M 0
K 0
R 0
Similarly, I need to find the count on these intervals: 9/13 to 9/20, 9/20 to 9/27, and 9/27 to 10/4. Thank you!
Maybe, with the caveat of how the first day of a week is defined, you could take something from the following code.
import numpy as np
import pandas as pd

# Data from the question
d = {'Name': ['A', 'B', 'C', 'D', 'K', 'K', 'M', 'C', 'D', 'C', 'R'],
     'Date': ['9/1/21', '10/20/21', '9/8/21', '9/20/21', '9/29/21',
              '9/15/21', '10/1/21', '9/12/21', '9/9/21', '9/9/21', '9/20/21']}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'])
I. Discontinuous index
Monday is chosen as the first day of week
#(1) Build a series of first_day_of_week, monday is chosen as the first day of week
weeks_index = df['Date'] - df['Date'].dt.weekday * np.timedelta64(1, 'D')
#(2) Groupby and some tidying
df2 = ( df.groupby([df['Name'], weeks_index])
.count()
.rename(columns={'Date':'Count'})
.swaplevel() # weeks to first level
.sort_index()
.unstack(1).fillna(0.0)
.astype(int)
.rename_axis('first_day_of_week')
)
>>> print(df2)
Name A B C D K M R
first_day_of_week
2021-08-30 1 0 0 0 0 0 0
2021-09-06 0 0 3 1 0 0 0
2021-09-13 0 0 0 0 1 0 0
2021-09-20 0 0 0 1 0 0 1
2021-09-27 0 0 0 0 1 1 0
2021-10-18 0 1 0 0 0 0 0
II. Continuous index
This part does not differ much from the previous one.
We build a continuous version of the index to be used for reindexing
Monday is chosen as the first day of week (obviously for both indices)
#(1a) Build a series of first_day_of_week, Monday is chosen as the first day of week
weeks_index = df['Date'] - df['Date'].dt.weekday * np.timedelta64(1, 'D')
#(1b) Build a continuous series of first_day_of_week
continuous_weeks_index = pd.date_range(start=weeks_index.min(),
end=weeks_index.max(),
freq='W-MON') # monday
#(2) Groupby, unstack, reindex, and some tidying
df2 = ( df
# groupby and count
.groupby([df['Name'], weeks_index])
.count()
.rename(columns={'Date':'Count'})
# unstack on weeks
.swaplevel() # weeks to first level
.sort_index()
.unstack(1)
# reindex to insert weeks with no data
.reindex(continuous_weeks_index) # new index
# clean up
.fillna(0.0)
.astype(int)
.rename_axis('first_day_of_week')
)
>>> print(df2)
Name A B C D K M R
first_day_of_week
2021-08-30 1 0 0 0 0 0 0
2021-09-06 0 0 3 1 0 0 0
2021-09-13 0 0 0 0 1 0 0
2021-09-20 0 0 0 1 0 0 1
2021-09-27 0 0 0 0 1 1 0
2021-10-04 0 0 0 0 0 0 0
2021-10-11 0 0 0 0 0 0 0
2021-10-18 0 1 0 0 0 0 0
Last step, if needed:
df2.stack()
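For what it's worth, part I can also be written as a single crosstab; just a sketch, reusing weeks_index and continuous_weeks_index from above:
# Same discontinuous table as in part I
df3 = pd.crosstab(weeks_index, df['Name']).rename_axis('first_day_of_week')

# Continuous version: reindex on the full weekly range
df3 = df3.reindex(continuous_weeks_index, fill_value=0).rename_axis('first_day_of_week')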
a = [[0,0,0,0],[0,-1,1,0],[1,-1,1,0],[1,-1,1,0]]
df = pd.DataFrame(a, columns=['A','B','C','D'])
df
Output:
A B C D
0 0 0 0 0
1 0 -1 1 0
2 1 -1 1 0
3 1 -1 1 0
Reading down each column: values all begin at 0 on the first row, and once they change they can never change back; they can become either a 1 or a -1. I would like to rearrange the dataframe columns so that the columns are in this order:
Columns that hit 1, ordered by the earliest row in which they do so
Columns that hit -1, ordered by the earliest row in which they do so
Finally, the remaining columns that never changed values and stayed at 0 (if there are even any left)
Desired Output:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
My main data frame is 3000 rows by 61 columns; is there any way of doing this quickly?
We have to handle the positive and negative values separately. Since a column only ever flips once, a column that hits 1 in an earlier row has a strictly larger sum (and one that hits -1 earlier has a smaller one), so one way is to take the sum of each column and then adjust the ordering with sort_values:
a = df.sum().sort_values(ascending=False)
b = pd.concat((a[a.gt(0)],a[a.lt(0)].sort_values(),a[a.eq(0)]))
out = df.reindex(columns=b.index)
print(out)
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
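To check the intermediate ordering on the frame above, a quick sketch reusing a and b from the answer:
a = df.sum().sort_values(ascending=False)
print(a.to_dict())        # {'C': 3, 'A': 2, 'D': 0, 'B': -3}
b = pd.concat((a[a.gt(0)], a[a.lt(0)].sort_values(), a[a.eq(0)]))
print(b.index.tolist())   # ['C', 'A', 'B', 'D']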
Try with pd.Series.first_valid_index
s = df.where(df.ne(0))
s1 = s.apply(pd.Series.first_valid_index)
s2 = s.bfill().iloc[0]
out = df.loc[:,pd.concat([s2,s1],axis=1,keys=[0,1]).sort_values([0,1],ascending=[False,True]).index]
out
Out[35]:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
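For the 3000-row, 61-column case, a variant based on first-hit positions avoids apply entirely; this is just a sketch of the same ordering logic, not taken from either answer above:
import pandas as pd

df = pd.DataFrame([[0, 0, 0, 0], [0, -1, 1, 0], [1, -1, 1, 0], [1, -1, 1, 0]],
                  columns=['A', 'B', 'C', 'D'])

# First row where each column hits 1 / -1 (NaN if it never does)
pos = df.eq(1).idxmax().where(df.eq(1).any())
neg = df.eq(-1).idxmax().where(df.eq(-1).any())

# Columns hitting 1 by earliest hit, then those hitting -1, then the rest
order = list(pos.sort_values().dropna().index) + list(neg.sort_values().dropna().index)
order += [c for c in df.columns if c not in order]
out = df[order]
print(out.columns.tolist())  # ['C', 'A', 'B', 'D']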
I have two dataframes (pandas.DataFrame), each looking as follows. Let's call the first one df_A:
code1 code2 code3 code4 code5
0 1 4 2 0 0
1 3 2 1 5 0
2 2 3 0 0 0
and the second one:
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
The objects (rows) are each given up to 5 codes, shown by the five columns in the first df.
I instead want a binary representation of which codes each object has, as shown in the second df.
The dummy-value functions in pandas and scikit-learn take into account which position a code is written in; that is unimportant here.
The attempts I have made with my own code have not worked, due to my inexperience with Python and pandas.
This case is different from others I have seen on Stack Overflow, as all the columns here represent the same thing.
Thank you!
Edit:
for colname in df_bin.columns:
    for row in range(len(df_codes)):
        if int(colname) in df_codes.iloc[[row]]:
            df_bin[colname][row] = 1
This is one of the attempts I made so far.
You can try stack then str.get_dummies:
s = (df.stack()
       .loc[lambda x: x != 0]
       .astype(str)
       .str.get_dummies()
       .sum(level=0)
       .add_prefix('Has'))
Has1 Has2 Has3 Has4 Has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
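Note that Series.sum(level=...) was deprecated and removed in pandas 2.0; on current versions the same thing can be written with a groupby (same logic, just restated):
s = (df.stack()
       .loc[lambda x: x != 0]
       .astype(str)
       .str.get_dummies()
       .groupby(level=0).sum()
       .add_prefix('Has'))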
Let's try:
(df.stack().groupby(level=0)
.value_counts()
.unstack(fill_value=0)
[range(1,6)]
.add_prefix('has')
)
Output:
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
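One caveat: value_counts counts occurrences, so a code that appeared twice in the same row would give a 2 rather than a 1. If a strict 0/1 indicator is needed, clip the result; a small variation on the above:
out = (df.stack().groupby(level=0)
         .value_counts()
         .unstack(fill_value=0)
         [range(1, 6)]
         .clip(upper=1)
         .add_prefix('has'))
The same caveat applies to the crosstab approach below.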
Here's another way using pd.crosstab:
df_out = df.reset_index().melt('index')
df_out = pd.crosstab(df_out['index'], df_out['value']).drop(0, axis=1).add_prefix('has')
Output:
value has1 has2 has3 has4 has5
index
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
I am using R to construct and analyze a data set created by a colleague's Python script. The script returns the following structure, where 13 is the number of samples and 3128 is the number of observations of traits, each coded as a single digit (every digit after the sample name represents a single column, with the value encoding the trait):
13 3128
>1062_0 0000000000[...]
>1066A_0 000001010[...]
>1067A_0 000002010[...]
>1067B_0 110013010[...]
>1067C_0 000024010[...]
>1067D_0 000024010[...]
>1084A_0 200100010[...]
>1084B_0 001005110[...]
>1084C_0 000000010[...]
>1086_0 0100002100[...]
>1087_0 3002040100[...]
>1088_0 0000060111[...]
>C105_0 0000050120[...]
I am working to get these data into a data frame which has 13 rows and 3,128 columns.
I have used the read.phylip function of phylotools to read in this file above and can get it into a data.frame:
SL_FFR_input <- read.phylip(fil = "matrix.phy")
SL_FFR_frame <- phy2dat(SL_FFR_input)
However, this results in a data frame of two columns: V1 being the sample names, and V2 being a single string of all of the single-digit codings.
The frame that would be useful is shown below, where the sample names form the row names and each value now has its own column.
>1062_0 0 0 0 0 0 0 0 0 0[...]
>1066A_0 0 0 0 0 0 1 0 1 0[...]
>1067A_0 0 0 0 0 0 2 0 1 0[...]
>1067B_0 1 1 0 0 1 3 0 1 0[...]
>1067C_0 0 0 0 0 2 4 0 1 0[...]
>1067D_0 0 0 0 0 2 4 0 1 0[...]
>1084A_0 2 0 0 1 0 0 0 1 0[...]
>1084B_0 0 0 1 0 0 5 1 1 0[...]
>1084C_0 0 0 0 0 0 0 0 1 0[...]
>1086_0 0 1 0 0 0 0 2 1 0[...]
>1087_0 3 0 0 2 0 4 0 1 0[...]
>1088_0 0 0 0 0 0 6 0 1 1[...]
>C105_0 0 0 0 0 0 5 0 1 2[...]
It would be a huge help if someone could point me in the right direction!
I recommend dplyr + tidyr; it's possible to do this with strsplit and rbind, but it's ugly.
library(dplyr)
library(tidyr)
df1 <- data.frame(snames = c('a','b','c'),
digits = c('0000000000000',
'0000100000000',
'0000000001000'))
result <- df1 %>% separate(digits, paste0('X',1:13),sep = 1:12)
That will separate the column at character positions 1:12 and name the resulting columns X1 through X13.
EDIT: for your case, change the 13 to 3128, the 12 to 3127, and "digits" to whatever the name of your column is.