Pandas: add rows to each group until condition is met - python

I have a time series dataframe with the following structure:
| ID | second | speaker1 | speaker2 | company | ... |
|----|--------|----------|----------|---------|-----|
| A | 1 | 1 | 1 | name1 | |
| A | 2 | 1 | 1 | name1 | |
| A | 3 | 1 | 1 | name1 | |
| B | 1 | 1 | 1 | name2 | |
| B | 2 | 1 | 1 | name2 | |
| B | 3 | 1 | 1 | name2 | |
| B | 4 | 1 | 1 | name2 | |
| C | 1 | 1 | 1 | name3 | |
| C | 2 | 1 | 1 | name3 | |
*Note that speaker1 and speaker2 can each be either 0 or 1; I set them all to 1 here for clarity.
I would like to add rows to each group until every group has the same number of rows, where the target is the row count of the ID with the most rows.
For every new row, I would like to populate the speaker1 and speaker2 columns with 0s while keeping the values in the other columns the same for that ID.
So the output should be:
| ID | second | speaker1 | speaker2 | company | ... |
|:--:|:------:|:--------:|:--------:|:-------:|:---:|
| A | 1 | 1 | 1 | name1 | |
| A | 2 | 1 | 1 | name1 | |
| A | 3 | 1 | 1 | name1 | |
| A | 4 | 0 | 0 | name1 | |
| B | 1 | 1 | 1 | name2 | |
| B | 2 | 1 | 1 | name2 | |
| B | 3 | 1 | 1 | name2 | |
| B | 4 | 1 | 1 | name2 | |
| C | 1 | 1 | 1 | name3 | |
| C | 2 | 1 | 1 | name3 | |
| C | 3 | 0 | 0 | name3 | |
| C | 4 | 0 | 0 | name3 | |
So far I have tried a groupby and apply, but found it to be extremely slow as I have many rows and columns in this dataframe.
def add_rows_sec(w):
    """Input: the sub-frame for one ID. Output: the same frame with rows added up to the max call length."""
    while w['second'].max() < df['second'].max():  # while this call is shorter than the longest call in the full data set
        last_row = w.iloc[-1].copy()
        last_row['second'] += 1
        last_row['speaker1'] = 0
        last_row['speaker2'] = 0
        w = w.append(last_row)  # note: DataFrame.append was removed in pandas 2.0
    return w

df.groupby('ID').apply(add_rows_sec).reset_index(drop=True)
Is there a way of doing this with numpy? Something like
condition = w['second'].max() < df['second'].max()
choice = pd.Series([w.ID, w.second + 1, 0, 0, w.company...])
df = np.select(condition, choice, default = np.nan)
Any help is much appreciated!

A different approach with pandas:
- construct a DataFrame that is the Cartesian product of ID and second
- outer join it back to the original DataFrame
- fill the missing values based on your spec

No groupby(), no loops.
df = pd.DataFrame({"ID":["A","A","A","B","B","B","B","C","C"],"second":["1","2","3","1","2","3","4","1","2"],"speaker1":["1","1","1","1","1","1","1","1","1"],"speaker2":["1","1","1","1","1","1","1","1","1"],"company":["name1","name1","name1","name2","name2","name2","name2","name3","name3"]})
df2 = pd.DataFrame({"ID":df["ID"].unique()}).assign(foo=1).merge(\
pd.DataFrame({"second":df["second"].unique()}).assign(foo=1)).drop("foo", 1)\
.merge(df, on=["ID","second"], how="outer")
df2["company"] = df2["company"].fillna(method="ffill")
df2.fillna(0)
Output:
ID second speaker1 speaker2 company
0 A 1 1 1 name1
1 A 2 1 1 name1
2 A 3 1 1 name1
3 A 4 0 0 name1
4 B 1 1 1 name2
5 B 2 1 1 name2
6 B 3 1 1 name2
7 B 4 1 1 name2
8 C 1 1 1 name3
9 C 2 1 1 name3
10 C 3 0 0 name3
11 C 4 0 0 name3
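
The same grid-then-fill idea can also be expressed with `pd.MultiIndex.from_product` and `reindex` instead of the helper-column merge; a minimal sketch, assuming numeric `second` values and the column names from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": 1,
    "speaker2": 1,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})

# Build the full (ID, second) grid: every ID paired with every second up to the global max.
full_index = pd.MultiIndex.from_product(
    [df["ID"].unique(), range(1, df["second"].max() + 1)],
    names=["ID", "second"],
)

out = df.set_index(["ID", "second"]).reindex(full_index).reset_index()
out["company"] = out.groupby("ID")["company"].ffill()  # per-ID constants carry forward
out[["speaker1", "speaker2"]] = out[["speaker1", "speaker2"]].fillna(0).astype(int)
```

Grouping before the ffill keeps one ID's company from leaking into the next group if an ID were ever missing its first rows.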

Related

Have the name of the column returned in another column if my row is equal to 1

I have a dataframe that looks like this
+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583 |
+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| code | 1 | 0 | 0 | 1 |
+--------+--------+--------+--------+--------+
For each 1 in the row with index 'code', I would like the name of the corresponding column in a new column 'key_name'. Here is the desired final result:
+--------+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583|key_name|
+--------+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 | Q111 |
| 2 | 0 | 1 | 1 | 0 | nan |
| 3 | 0 | 0 | 0 | 1 | nan |
| 4 | 1 | 0 | 0 | 1 | Info583|
| code | 1 | 0 | 0 | 1 | nan |
+--------+--------+--------+--------+--------+--------+
Thanks for any help or advice!
I think this is what you're looking for:
df['key_name'] = np.nan
condition = df.loc['code', :] == 1
df.loc[condition, 'key_name'] = df.columns[condition]
First make the column with just NaNs. Then compute your condition: the row with index 'code' equals 1. Then plug in the column names where the condition is met.
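
A small, self-contained illustration of the selection step, using a reduced, hypothetical version of the question's frame (it only demonstrates how the boolean condition picks out column names):

```python
import pandas as pd

df = pd.DataFrame(
    {"Q111": [1, 0, 0, 1], "Q570": [0, 1, 0, 0],
     "Q7891": [0, 1, 0, 0], "Info583": [0, 0, 1, 1]},
    index=[1, 2, 3, "code"],
)

condition = df.loc["code", :] == 1  # boolean Series indexed by column name
print(df.columns[condition])        # Index(['Q111', 'Info583'], dtype='object')
```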

Replacing values with ffill in pandas?

I have various columns in a pandas dataframe that have dummy values and I want to fill them as follows:
Input columns:
+----+----+
| c1 | c2 |
+----+----+
| 0  | 1  |
| 0  | 0  |
| 1  | 0  |
| 0  | 0  |
| 0  | 1  |
| 0  | 1  |
| 1  | 0  |
| 0  | 1  |
+----+----+
Output columns:
+----+----+
| c1 | c2 |
+----+----+
| 0  | 1  |
| 0  | 1  |
| 1  | 1  |
| 1  | 1  |
| 1  | 2  |
| 1  | 3  |
| 2  | 3  |
| 2  | 4  |
+----+----+
How can I get this output in pandas?
If there are only 0 and 1 values, this works with a cumulative sum - DataFrame.cumsum:
df1 = df.cumsum()
print(df1)
c1 c2
0 0 1
1 0 1
2 1 1
3 1 1
4 1 2
5 1 3
6 2 3
7 2 4
If there are values other than 0 and 1, you can take the cumulative sum of a mask that tests for values not equal to 0:
df2 = df.ne(0).cumsum()
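
A self-contained run of both variants, rebuilding the input columns shown above:

```python
import pandas as pd

df = pd.DataFrame({
    "c1": [0, 0, 1, 0, 0, 0, 1, 0],
    "c2": [1, 0, 0, 0, 1, 1, 0, 1],
})

df1 = df.cumsum()        # running count of the 1s in each column
df2 = df.ne(0).cumsum()  # same idea, but any non-zero value increments the count
```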

Count the number of duplicates grouped by ID - pandas

I'm not sure if this is a duplicate question, but here it goes.
Assuming I have the following table:
import pandas as pd

lst = [1, 1, 1, 2, 2, 3, 3, 4, 5]
lst2 = ['A', 'A', 'B', 'D', 'E', 'A', 'A', 'A', 'E']
df = pd.DataFrame(list(zip(lst, lst2)),
                  columns=['ID', 'val'])
which produces the following table:
+----+-----+
| ID | val |
+----+-----+
| 1  | A   |
| 1  | A   |
| 1  | B   |
| 2  | D   |
| 2  | E   |
| 3  | A   |
| 3  | A   |
| 4  | A   |
| 5  | E   |
+----+-----+
The goal is to count the duplicates on val grouped by ID:
+----+-----+--------------+
| ID | val | is_duplicate |
+----+-----+--------------+
| 1  | A   | 1            |
| 1  | A   | 1            |
| 1  | B   | 0            |
| 2  | D   | 0            |
| 2  | E   | 0            |
| 3  | A   | 1            |
| 3  | A   | 1            |
| 4  | A   | 0            |
| 5  | E   | 0            |
+----+-----+--------------+
I tried the following code, but it counts the duplicates over the whole column:
df_grouped = df.groupby(['val']).size().reset_index(name='count')
while the following only flags the duplicates without the grouping:
df.duplicated(subset=['val'])
What would be the best approach for this?
Let us try duplicated:
df['is_dup'] = df.duplicated(subset=['ID', 'val'], keep=False).astype(int)
df
Out[21]:
ID val is_dup
0 1 A 1
1 1 A 1
2 1 B 0
3 2 D 0
4 2 E 0
5 3 A 1
6 3 A 1
7 4 A 0
8 5 E 0
You can use .groupby on the relevant columns and get the count per group. Appending > 1 then tests whether the specified group contains duplicates, producing a boolean True/False result. Finally, use .astype(int) to convert the boolean to an int, which changes True to 1 and False to 0:
df['is_duplicate'] = (df.groupby(['ID','val'])['val'].transform('count') > 1).astype(int)
Out[7]:
ID val is_duplicate
0 1 A 1
1 1 A 1
2 1 B 0
3 2 D 0
4 2 E 0
5 3 A 1
6 3 A 1
7 4 A 0
8 5 E 0
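
Both answers produce identical flags on this data; a quick equivalence check, assuming the `df` built in the question:

```python
left = df.duplicated(subset=['ID', 'val'], keep=False).astype(int)
right = (df.groupby(['ID', 'val'])['val'].transform('count') > 1).astype(int)
assert left.equals(right)
```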

Python: Merge up to 3 columns of a DataFrame into 1, where any of the 3 might not exist

I have a Dataframe formed like this:
+------+------+------+--------+--------+--------+--------+
| Col1 | Col2 | Col3 | Col1.1 | Col2.1 | Col3.1 | Col1.2 |
+------+------+------+--------+--------+--------+--------+
| 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 1 | 0 | 1 |
+------+------+------+--------+--------+--------+--------+
Now I want to merge the columns into one, like all Col1.* into Col1 where there is a 1:
+------+------+------+--------+--------+
| Col1 | Col2 | Col3 | Col2.1 | Col3.1 |
+------+------+------+--------+--------+
| 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
+------+------+------+--------+--------+
df['Col1'] = df[['Col1', 'Col1.1', 'Col1.2']].max(axis=1)
This works if all 3 columns exist. But obviously not if I want to merge Col2 with Col2.1 and Col2.2, because those columns do not exist.
Is there a way with pandas, or in Python generally, to do this task with some function, or do I need to go the long way with a lot of if cases?
Let's use string manipulation and groupby with axis=1 and max:
df.groupby(df.columns.str[:4], axis=1).max()
Output:
| | Col1 | Col2 | Col3 |
|---:|-------:|-------:|-------:|
| 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 |
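
On pandas 2.x, where `DataFrame.groupby(axis=1)` is deprecated, the same collapse can be written by transposing first. A sketch under that assumption; splitting the labels on "." rather than slicing the first four characters is a slight generalization that also handles stems of other lengths:

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": [0, 1, 0], "Col2": [0, 1, 0], "Col3": [1, 0, 1],
    "Col1.1": [1, 0, 0], "Col2.1": [0, 0, 1], "Col3.1": [0, 1, 0],
    "Col1.2": [0, 0, 1],
})

# Group the column labels by their stem (the part before any ".suffix"),
# then take the max within each stem and transpose back.
stems = df.columns.str.split(".").str[0]
out = df.T.groupby(stems.values).max().T
```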

Pandas one-hot encoding with multiple like columns

I have several 'condition' columns in a dataset. These columns are all eligible to receive the same coded input. This is only to allow multiple conditions to be associated with a single record - which column the code winds up in carries no meaning.
In the sample below there are only 5 unique values across the 3 condition columns, although if you consider each column separately, there are 3 unique values in each. So when I apply one-hot encoding to these variables together I get 9 new columns, but I only want 5 (one for each unique value in the collective set of columns).
Here is a sample of the original data:
| cond1 | cond2 | cond3 | target |
|-------|-------|-------|--------|
| I219 | E119 | I48 | 1 |
| I500 | | | 0 |
| I48 | I500 | F171 | 1 |
| I219 | E119 | I500 | 0 |
| I219 | I48 | | 0 |
Here's what I tried:
import pandas as pd
df = pd.read_csv('micro.csv', dtype='object')
df['cond1'] = pd.Categorical(df['cond1'])
df['cond2'] = pd.Categorical(df['cond2'])
df['cond3'] = pd.Categorical(df['cond3'])
dummies = pd.get_dummies(df[['cond1', 'cond2', 'cond3']], prefix='cond')
dummies
Which gives me:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_I48 | cond_I500 | cond_F171 | cond_I48 | cond_I500 |
|-----------|----------|-----------|-----------|----------|-----------|-----------|----------|-----------|
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
So I have multiple coded columns for any code that appears in more than one column (I48 and I500). I would like only a single column for each, so I can check for correlations between individual codes and my target variable.
Is there a way to do this? This is the result I'm after:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_F171 |
|-----------|----------|-----------|-----------|-----------|
| 1 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 |
Take the max if you need 1 and 0 values in the output:
dfDummies = dummies.max(axis=1, level=0)
Or use sum if you need to count the 1 values:
dfDummies = dummies.sum(axis=1, level=0)
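
On pandas 2.0+, where the `level` argument to `max`/`sum` was removed, an equivalent is to group the transposed frame by its (duplicated) labels; a sketch assuming the `dummies` frame built above:

```python
# pandas >= 2.0: collapse duplicate dummy columns by grouping on the label itself.
dfDummies = dummies.T.groupby(level=0).max().T  # 1/0 output
counts = dummies.T.groupby(level=0).sum().T     # counts of the 1 values
```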
