I have a dataframe that looks like this
+-------+------+------+-------+---------+
| index | Q111 | Q570 | Q7891 | Info583 |
+-------+------+------+-------+---------+
| 1     | 1    | 0    | 0     | 0       |
| 2     | 0    | 1    | 1     | 0       |
| 3     | 0    | 0    | 0     | 1       |
| code  | 1    | 0    | 0     | 1       |
+-------+------+------+-------+---------+
For every column where the row with index 'code' contains a 1, I would like to store that column's name in a new column 'key_name'. Here is the desired final result:
+-------+------+------+-------+---------+----------+
| index | Q111 | Q570 | Q7891 | Info583 | key_name |
+-------+------+------+-------+---------+----------+
| 1     | 1    | 0    | 0     | 0       | Q111     |
| 2     | 0    | 1    | 1     | 0       | nan      |
| 3     | 0    | 0    | 0     | 1       | nan      |
| 4     | 1    | 0    | 0     | 1       | Info583  |
| code  | 1    | 0    | 0     | 1       | nan      |
+-------+------+------+-------+---------+----------+
Thanks for any help or advice!
I think this is what you're looking for:
import numpy as np

df['key_name'] = np.nan
condition = (df.loc['code', :] == 1).to_numpy()   # True for columns where the 'code' row is 1
df.loc[condition, 'key_name'] = df.columns[condition]
First create the column filled with NaN. Then compute the condition: which entries of the row with index 'code' equal 1. Then plug in the column names wherever the condition is met.
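For reference, here is a self-contained sketch; the DataFrame construction is just my reconstruction of the sample table from the question (rows 1-4 from the desired output plus the 'code' row):

import numpy as np
import pandas as pd

# rebuild the sample table from the question
df = pd.DataFrame(
    {'Q111':    [1, 0, 0, 1, 1],
     'Q570':    [0, 1, 0, 0, 0],
     'Q7891':   [0, 1, 0, 0, 0],
     'Info583': [0, 0, 1, 1, 1]},
    index=[1, 2, 3, 4, 'code'])

df['key_name'] = np.nan
condition = (df.loc['code', :] == 1).to_numpy()   # True for columns where the 'code' row is 1
df.loc[condition, 'key_name'] = df.columns[condition]
print(df)

Rows 1 and 4 get Q111 and Info583 respectively; every other row keeps NaN, as in the desired output.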
I have a dataset similar to the sample below:
| id | old_a | old_b | new_a | new_b |
|----|-------|-------|-------|-------|
| 6 | 3 | 0 | 0 | 0 |
| 6 | 9 | 0 | 2 | 0 |
| 13 | 3 | 0 | 0 | 0 |
| 13 | 37 | 0 | 0 | 1 |
| 13 | 30 | 0 | 0 | 6 |
| 13 | 12 | 2 | 0 | 0 |
| 6 | 7 | 0 | 2 | 0 |
| 6 | 8 | 0 | 0 | 0 |
| 6 | 19 | 0 | 3 | 0 |
| 6 | 54 | 0 | 0 | 0 |
| 87 | 6 | 0 | 2 | 0 |
| 87 | 11 | 1 | 1 | 0 |
| 87 | 25 | 0 | 1 | 0 |
| 87 | 10 | 0 | 0 | 0 |
| 9 | 8 | 1 | 0 | 0 |
| 9 | 19 | 0 | 2 | 0 |
| 9 | 1 | 0 | 0 | 0 |
| 9 | 34 | 0 | 7 | 0 |
I'm providing this sample dataset for the above table:
import pandas as pd

data = [[6, 3, 0, 0, 0], [6, 9, 0, 2, 0], [13, 3, 0, 0, 0], [13, 37, 0, 0, 1],
        [13, 30, 0, 0, 6], [13, 12, 2, 0, 0], [6, 7, 0, 2, 0], [6, 8, 0, 0, 0],
        [6, 19, 0, 3, 0], [6, 54, 0, 0, 0], [87, 6, 0, 2, 0], [87, 11, 1, 1, 0],
        [87, 25, 0, 1, 0], [87, 10, 0, 0, 0], [9, 8, 1, 0, 0], [9, 19, 0, 2, 0],
        [9, 1, 0, 0, 0], [9, 34, 0, 7, 0]]
data = pd.DataFrame(data, columns=['id', 'old_a', 'old_b', 'new_a', 'new_b'])
For each id, I want to look at the columns 'new_a' and 'new_b': if even a single nonzero value exists in a column for that id, I want to count it as 1, irrespective of how many times values occur, and assign 0 if no value is present. For example, if I look at id 9, there are two distinct nonzero values in new_a, but I want to count it as 1. Similarly, for id 13 there are no nonzero values in new_a, so I would want to assign it 0.
My final output should look like:
| id | new_a | new_b |
|----|-------|-------|
| 6 | 1 | 0 |
| 9 | 1 | 0 |
| 13 | 0 | 1 |
| 87 | 1 | 0 |
I would eventually want to calculate the percentage of clients using new_a and new_b. So from the above table, 75% of clients use new_a and 25% use new_b. I'm a beginner in Python and not sure how to proceed with this.
Use GroupBy.any, because zeros are treated as False, and convert the boolean output to integers:
df = data.groupby('id')[['new_a','new_b']].any().astype(int).reset_index()
print (df)
id new_a new_b
0 6 1 0
1 9 1 0
2 13 0 1
3 87 1 0
For the percentages, take the mean of the output above:
s = df[['new_a','new_b']].mean().mul(100)
print (s)
new_a 75.0
new_b 25.0
dtype: float64
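The mean works here because, after the groupby, each id contributes exactly one 0/1 row, so the column mean is the fraction of ids with at least one nonzero value. An equivalent, more explicit calculation (just a sketch using the df produced above):

# number of ids with a 1, divided by the total number of ids
s = df[['new_a','new_b']].sum().div(len(df)).mul(100)
print (s)
new_a    75.0
new_b    25.0
dtype: float64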
I have a time series dataframe with the following structure:
| ID | second | speaker1 | speaker2 | company | ... |
|----|--------|----------|----------|---------|-----|
| A | 1 | 1 | 1 | name1 | |
| A | 2 | 1 | 1 | name1 | |
| A | 3 | 1 | 1 | name1 | |
| B | 1 | 1 | 1 | name2 | |
| B | 2 | 1 | 1 | name2 | |
| B | 3 | 1 | 1 | name2 | |
| B | 4 | 1 | 1 | name2 | |
| C | 1 | 1 | 1 | name3 | |
| C | 2 | 1 | 1 | name3 | |
*Note that speaker1 and speaker2 can be either 0 or 1; I set them all to 1 here for clarity.*
I would like to add rows to each group until every group has the same number of rows (where the target number of rows is that of the ID with the most rows).
For every new row, I would like to populate the speaker1 and speaker2 columns with 0s while keeping the values in the other columns the same for that ID.
So the output should be:
| ID | second | speaker1 | speaker2 | company | ... |
|:--:|:------:|:--------:|:--------:|:-------:|:---:|
| A | 1 | 1 | 1 | name1 | |
| A | 2 | 1 | 1 | name1 | |
| A | 3 | 1 | 1 | name1 | |
| A | 4 | 0 | 0 | name1 | |
| B | 1 | 1 | 1 | name2 | |
| B | 2 | 1 | 1 | name2 | |
| B | 3 | 1 | 1 | name2 | |
| B | 4 | 1 | 1 | name2 | |
| C | 1 | 1 | 1 | name3 | |
| C | 2 | 1 | 1 | name3 | |
| C | 3 | 0 | 0 | name3 | |
| C | 4 | 0 | 0 | name3 | |
So far I have tried a groupby and apply, but found it to be extremely slow as I have many rows and columns in this dataframe.
def add_rows_sec(w):
    """Input: dataframe for one ID group; output: dataframe with rows added up to the max call length."""
    while w['second'].max() < clean_data['second'].max():  # while shorter than the longest call in the full data set
        last_row = w.iloc[-1].copy()
        last_row['second'] += 1
        last_row['speaker1'] = 0
        last_row['speaker2'] = 0
        w = w.append(last_row)
    return w

df.groupby('ID').apply(add_rows_sec).reset_index(drop=True)
Is there a way of doing this with numpy? Something like
condition = w['second'].max() < df['second'].max()
choice = pd.Series([w.ID, w.second + 1, 0, 0, w.company...])
df = np.select(condition, choice, default = np.nan)
Any help is much appreciated!
A different approach with pandas:

- construct a DataFrame that is the Cartesian product of ID and second
- outer join it back to the original data frame
- fill missing values based on your spec

No groupby(), no loops.
df = pd.DataFrame({"ID":["A","A","A","B","B","B","B","C","C"],"second":["1","2","3","1","2","3","4","1","2"],"speaker1":["1","1","1","1","1","1","1","1","1"],"speaker2":["1","1","1","1","1","1","1","1","1"],"company":["name1","name1","name1","name2","name2","name2","name2","name3","name3"]})
df2 = pd.DataFrame({"ID":df["ID"].unique()}).assign(foo=1).merge(\
pd.DataFrame({"second":df["second"].unique()}).assign(foo=1)).drop("foo", 1)\
.merge(df, on=["ID","second"], how="outer")
df2["company"] = df2["company"].fillna(method="ffill")
df2.fillna(0)
output
ID second speaker1 speaker2 company
0 A 1 1 1 name1
1 A 2 1 1 name1
2 A 3 1 1 name1
3 A 4 0 0 name1
4 B 1 1 1 name2
5 B 2 1 1 name2
6 B 3 1 1 name2
7 B 4 1 1 name2
8 C 1 1 1 name3
9 C 2 1 1 name3
10 C 3 0 0 name3
11 C 4 0 0 name3
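On pandas 1.2 or newer, the helper foo column is not needed: how="cross" builds the same Cartesian product directly (a sketch of the same idea; the fill steps stay the same):

grid = pd.merge(pd.DataFrame({"ID": df["ID"].unique()}),
                pd.DataFrame({"second": df["second"].unique()}),
                how="cross")                        # every (ID, second) combination
df2 = grid.merge(df, on=["ID", "second"], how="outer")
df2["company"] = df2["company"].ffill()
df2 = df2.fillna(0)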
I have a dataframe whose data is indexed by the date, with dates ranging from 6-1 to 6-18.
What I need to do is perform a "pivot" or horizontal merge based on the date.
So for example, let's say today is 6-18. I need to go through this dataframe, find the rows whose date is 6-18, and pivot/join them horizontally onto the same dataframe.
Expected output (1 signifies there is data there, 0 signifies null/NaN):
Before the join, df:
date | x | y | z
6-15 | 1 | 1 | 1
6-15 | 2 | 2 | 2
6-18 | 3 | 3 | 3
6-18 | 3 | 3 | 3
Joining the df on 6-18:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
When I use append, join, or merge, what I get is this:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
What I've done is extract the date that I want into a new dataframe using loc:
df_daily = df_metrics.loc[str(_date_map['daily']['start'].date())]
df_daily.columns = [str(cols) + " (Daily)" if cols in metric_names else cols for cols in df_daily.columns]
And then joining it to the master df:
df = df.join(df_daily, lsuffix=' (Daily)', rsuffix=' (Monthly)').reset_index()
When I try joining or merging, the dataset gets huge; I'm assuming it compares every row against every row, so when the date of one row doesn't match, it creates a new row with NaN.
My dataset goes from about 30k rows to 2.8 million.
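What is most likely happening is a many-to-many join: both frames contain each date many times, so joining on the date alone returns every pairwise combination of matching rows. A small sketch of the effect with hypothetical data (not the real 30k-row frame); giving each duplicate date a position with cumcount() is one common way to make the rows line up one-to-one instead:

import pandas as pd

left = pd.DataFrame({"date": ["6-18"] * 3, "x": [1, 2, 3]})
right = pd.DataFrame({"date": ["6-18"] * 3, "x (6-18)": [1, 2, 3]})

# many-to-many join on a duplicated key: 3 x 3 = 9 rows come back
print(len(left.merge(right, on="date")))            # 9

# give each duplicate date a running position so the join key becomes unique
left["pos"] = left.groupby("date").cumcount()
right["pos"] = right.groupby("date").cumcount()
print(len(left.merge(right, on=["date", "pos"])))   # 3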
The first column, cond, contains either 1 or 0.
The second column, event, contains either 1 or 0.
I want to create a third column where each row holds the cumulative sum of the cond column, modulo 4, computed between two rows where event==1 (the row where event==1 that starts a segment is included in the cumulative sum, but the row that starts the next segment is not).
+------+-------+--------+
| cond | event | Result |
+------+-------+--------+
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 0 | 0 | 2 |
| 1 | 0 | 3 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 1 | 1 | 1 |
+------+-------+--------+
This can be easily tackled with pandas groupby, transform, and cumsum:
event_cum = df['event'].cumsum()   # segment label that increases at each event row
result = df['cond'].groupby(event_cum).transform('cumsum').mod(4)
result[event_cum == 0] = 0 # rows before the first event
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 0
8 1
9 2
10 1
Name: cond, dtype: int64
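For reference, a self-contained version that just re-types the sample table from the question and writes the result into a new column:

import pandas as pd

df = pd.DataFrame({
    'cond':  [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
    'event': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

event_cum = df['event'].cumsum()
df['Result'] = df['cond'].groupby(event_cum).transform('cumsum').mod(4)
df.loc[event_cum == 0, 'Result'] = 0   # rows before the first event
print(df)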