I have a dataframe (df) in Python with two columns: ID and Date.
| ID | Date |
| ------------- |:-------------:|
| 1 | 06-14-2019 |
| 1 | 06-10-2019 |
| 2 | 06-16-2019 |
| 3 | 06-12-2019 |
| 3 | 06-12-2019 |
I'm trying to add a column to the dataframe which contains, for each row, the count of rows where ID matches the current row's ID and Date <= the current row's Date, like the following:
| ID | Date | Count |
| ------------- |:-------------:|:-------------:|
| 1 | 06-14-2019 | 2 |
| 1 | 06-10-2019 | 1 |
| 2 | 06-16-2019 | 1 |
| 3 | 06-12-2019 | 2 |
| 3 | 06-12-2019 | 2 |
I have tried something like:
grouped = df.groupby(['ID'])
df['count'] = df.apply(lambda row: grouped.get_group[row['ID']][grouped.get_group(row['ID'])['Date'] < row['Date']]['ID'].size, axis=1)
This results in the following error:
TypeError: ("'method' object is not subscriptable", 'occurred at index 0')
Suggestions are welcome.
I forgot to mention: my real dataframe contains almost 4 million rows, so I'm looking for a smart and fast solution that won't take too long to run.
Using df.iterrows():
df['Count'] = None
for idx, value in df.iterrows():
    df.iloc[idx, -1] = len(df[(df.ID == value['ID']) & (df.Date <= value['Date'])])
Output:
+---+----+------------+-------+
| | ID | Date | Count |
+---+----+------------+-------+
| 0 | 1 | 06-14-2019 | 2 |
| 1 | 1 | 06-10-2019 | 1 |
| 2 | 2 | 06-16-2019 | 1 |
| 3 | 3 | 06-12-2019 | 2 |
| 4 | 3 | 06-12-2019 | 2 |
+---+----+------------+-------+
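As an aside, the TypeError in the original attempt comes from grouped.get_group[...]: get_group is a method, so it needs parentheses, not square brackets. For almost 4 million rows, iterrows() will also be slow; here is a vectorized sketch using groupby().rank() (assuming Date is a string in MM-DD-YYYY format, as in the sample):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 3, 3],
                   'Date': ['06-14-2019', '06-10-2019', '06-16-2019',
                            '06-12-2019', '06-12-2019']})
# Parse the dates so <= compares chronologically rather than as strings
df['Date'] = pd.to_datetime(df['Date'], format='%m-%d-%Y')
# Within each ID, rank(method='max') is exactly the count of dates <= the current one
df['Count'] = df.groupby('ID')['Date'].rank(method='max').astype(int)
On the sample above this reproduces the Count column without any per-row Python loop.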
I have a dataframe that looks like this:
+--------+--------+--------+--------+---------+
| index  | Q111   | Q570   | Q7891  | Info583 |
+--------+--------+--------+--------+---------+
| 1      | 1      | 0      | 0      | 0       |
| 2      | 0      | 1      | 1      | 0       |
| 3      | 0      | 0      | 0      | 1       |
| code   | 1      | 0      | 0      | 1       |
+--------+--------+--------+--------+---------+
For each 1 in the row with index 'code', I would like the name of the corresponding column in a new column 'key_name'. Here is the desired final result:
+--------+--------+--------+--------+---------+----------+
| index  | Q111   | Q570   | Q7891  | Info583 | key_name |
+--------+--------+--------+--------+---------+----------+
| 1      | 1      | 0      | 0      | 0       | Q111     |
| 2      | 0      | 1      | 1      | 0       | nan      |
| 3      | 0      | 0      | 0      | 1       | nan      |
| 4      | 1      | 0      | 0      | 1       | Info583  |
| code   | 1      | 0      | 0      | 1       | nan      |
+--------+--------+--------+--------+---------+----------+
Thanks for any help or advice!
I think this is what you're looking for:
import numpy as np

df['key_name'] = np.nan
condition = df.loc['code', :] == 1
df.loc[condition, 'key_name'] = df.columns[condition]
First create the column filled with NaNs. Then compute the condition: the row with index 'code' equals 1. Then plug in the column names where the condition is met.
I have a dataframe with a column that contains either a 1 or a 0; this is the Signal column.
I want to cycle through this dataframe until I get to the first 1, then take the value in the Open column and put it into another DataFrame, Total, in the Buy column.
Then, as it continues through the dataframe, when it reaches the first 0 after that, take that row's Open value and put it into the same DataFrame Total, in the Sold column.
I know I need a loop within a loop but I'm not getting very far!
Any pointers/help would be appreciated!
Total = DataFrame()
for i in range(len(df)):
    if i.Signal == 1:
        Total['Buy'] = i.Open
    if i.Signal == 0:
        Total['Sold'] = i.Open
I know the code is wrong!
Cheers
Example DataFrame:
df = pd.DataFrame({'Signal': [0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,1,1,0,0], 'Open': np.random.rand(20)})
>>> df
| | Signal | Open |
|---:|---------:|----------:|
| 0 | 0 | 0.959061 |
| 1 | 0 | 0.820516 |
| 2 | 1 | 0.0562783 |
| 3 | 1 | 0.612508 |
| 4 | 1 | 0.288703 |
| 5 | 1 | 0.332118 |
| 6 | 0 | 0.949236 |
| 7 | 0 | 0.20909 |
| 8 | 1 | 0.574924 |
| 9 | 1 | 0.170012 |
| 10 | 1 | 0.0255655 |
| 11 | 1 | 0.788829 |
| 12 | 0 | 0.39462 |
| 13 | 0 | 0.493338 |
| 14 | 0 | 0.347471 |
| 15 | 1 | 0.574096 |
| 16 | 1 | 0.286367 |
| 17 | 1 | 0.131327 |
| 18 | 0 | 0.38943 |
| 19 | 0 | 0.592241 |
# get the position of the first 1
first_1 = (df['Signal'] == 1).idxmax()
# Create a mask with True in the position of the first 1
# and every time a different value appears (0 after a 1, or 1 after a 0)
mask = np.full(len(df), False)
mask[first_1] = True
for i in range(first_1 + 1, len(df)):
    mask[i] = df['Signal'][i] != df['Signal'][i-1]
>>> df[mask]
| | Signal | Open |
|---:|---------:|----------:|
| 2 | 1 | 0.0562783 |
| 6 | 0 | 0.949236 |
| 8 | 1 | 0.574924 |
| 12 | 0 | 0.39462 |
| 15 | 1 | 0.574096 |
| 18 | 0 | 0.38943 |
# Create a new DF with 'Buy' = masked Open values at even positions (the 1s)
# and 'Sold' = masked Open values at odd positions (the 0s)
open_values = df[mask]['Open'].to_numpy()
total = pd.DataFrame({'Buy': open_values[::2], 'Sold': open_values[1::2]})
>>> total
| | Buy | Sold |
|---:|----------:|---------:|
| 0 | 0.0562783 | 0.949236 |
| 1 | 0.574924 | 0.39462 |
| 2 | 0.574096 | 0.38943 |
This works under the assumption that the original df ends with 0s rather than 1s, i.e. after every run of 1s there is at least one 0.
The assumption makes sense, since the objective is to take differences later.
If the last value is a 1, it will produce ValueError: All arrays must be of the same length.
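As a side note, the transition mask can also be built without the Python loop; a vectorized sketch with shift() (same trailing-0s assumption):
# True wherever Signal differs from the previous row; fill_value=0 keeps
# leading 0s out of the mask while still marking the first 1 as a transition
mask = df['Signal'].ne(df['Signal'].shift(fill_value=0)).to_numpy()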
Here is a sample piece of the current dataframe; it is the first day and all 24 hours. The whole dataframe is a year broken down into 24-hour segments.
+-------+-----+------+---------------+-------------------+
| month | day | hour | project_name | hourly_production |
+-------+-----+------+---------------+-------------------+
| 1 | 1 | 1 | Blah | 0 |
| 1 | 1 | 2 | Blah | 0 |
| 1 | 1 | 3 | Blah | 0 |
| 1 | 1 | 4 | Blah | 0 |
| 1 | 1 | 5 | Blah | 0 |
| 1 | 1 | 6 | Blah | 0 |
| 1 | 1 | 7 | Blah | 0 |
| 1 | 1 | 8 | Blah | 1.44 |
| 1 | 1 | 9 | Blah | 40.42 |
| 1 | 1 | 10 | Blah | 49.13 |
| 1 | 1 | 11 | Blah | 47.57 |
| 1 | 1 | 12 | Blah | 43.77 |
| 1 | 1 | 13 | Blah | 42.33 |
| 1 | 1 | 14 | Blah | 45.25 |
| 1 | 1 | 15 | Blah | 48.54 |
| 1 | 1 | 16 | Blah | 46.34 |
| 1 | 1 | 17 | Blah | 18.35 |
| 1 | 1 | 18 | Blah | 0 |
| 1 | 1 | 19 | Blah | 0 |
| 1 | 1 | 20 | Blah | 0 |
| 1 | 1 | 21 | Blah | 0 |
| 1 | 1 | 22 | Blah | 0 |
| 1 | 1 | 23 | Blah | 0 |
| 1 | 1 | 24 | Blah | 0 |
+-------+-----+------+---------------+-------------------+
Here is my current code:
df0_partition_1 = df0[['project_id', 'start_date', 'degradation_factor_solar', 'snapshot_datetime']]
df0_partition_2 = df0_partition_1.groupby(['project_id', 'start_date', 'degradation_factor_solar', 'snapshot_datetime']).size().reset_index()
df2_partition_1 = df2[df2['duration_year']==df2['duration_year'].max()]
df2_partition_2 = df2_partition_1.groupby(['project_id', 'snapshot_datetime']).size().reset_index()
df_merge = pd.merge(df0_partition_2, df2_partition_2, on=['project_id', 'snapshot_datetime'], how='left')
df_merge.rename(columns={'0_y':'duration_year'}, inplace=True)
df_parts = df_merge[['project_id', 'start_date', 'duration_year', 'degradation_factor_solar', 'snapshot_datetime']].dropna()
for index, row in df_parts.iterrows():
    df1_filtered = df1[(df1['project_id'] == row['project_id']) &
                       (df1['snapshot_datetime'] == row['snapshot_datetime'])]
    df1_filtered['year'] = pd.to_datetime(row['start_date']).year
    for y in range(1, int(row['duration_year'])+1):
        df_stg = df1_filtered[[df1_filtered['year'] + y, df1_filtered['hourly_production']*(1-(float(row.loc['degradation_factor_solar'].strip('%'))*y/100))]]
        df_final = df1_filtered.append(df_stg)
I need help figuring out how to create the final dataframe. The final dataframe is an append of future years with the degradation factor applied to the hourly production. I am not sure how to increment the year in the DF, apply the degradation factor, and then append.
Right now this gives me TypeError: 'Series' objects are mutable, thus they cannot be hashed
It turns out I needed a df.copy() in order to stop messing up my original dataframe and thus have an append that works (the TypeError itself came from indexing df1_filtered with whole Series in the df_stg line, which pandas tries to hash as column labels).
df0_partition_1 = df0[['project_id', 'start_date', 'degradation_factor_solar', 'snapshot_datetime']]
df0_partition_2 = df0_partition_1.groupby(['project_id', 'start_date', 'degradation_factor_solar', 'snapshot_datetime']).size().reset_index()
df2_partition_1 = df2[df2['duration_year']==df2['duration_year'].max()]
df2_partition_2 = df2_partition_1.groupby(['project_id', 'snapshot_datetime']).size().reset_index()
df_merge = pd.merge(df0_partition_2, df2_partition_2, on=['project_id', 'snapshot_datetime'], how='left')
df_merge.rename(columns={'0_y':'duration_year'}, inplace=True)
df_parts = df_merge[['project_id', 'start_date', 'duration_year', 'degradation_factor_solar', 'snapshot_datetime']].dropna()
for index, row in df_parts.iterrows():
    df1_filtered = df1[(df1['project_id'] == row['project_id']) &
                       (df1['snapshot_datetime'] == row['snapshot_datetime'])]
    df1_filtered['year'] = pd.to_datetime(row['start_date']).year
    df1_filtered.reset_index(inplace=True, drop=True)
    df1_filtered.drop(columns='project_name', inplace=True)
    df_stg_1 = df1_filtered.copy()
    for y in range(2, int(row['duration_year'])+1):
        year = df1_filtered['year']+(y-1)
        hourly_production = df1_filtered['hourly_production']
        df_stg_1['year'] = year
        df_stg_1['hourly_production'] = hourly_production*(1-(float(row.loc['degradation_factor_solar'].strip('%'))*(y-1)/100))
        df_stg_2 = df1_filtered.append(df_stg_1)
    df_final = df1_filtered.append(df_stg_2)
    df_final.reset_index(inplace=True, drop=True)
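One caveat: DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0, so on newer versions the two appends need pd.concat instead, e.g.:
df_stg_2 = pd.concat([df1_filtered, df_stg_1], ignore_index=True)
df_final = pd.concat([df1_filtered, df_stg_2], ignore_index=True)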
I have several 'condition' columns in a dataset. These columns are all eligible to receive the same coded input. This is only to allow multiple conditions to be associated with a single record - which column the code winds up in carries no meaning.
In the sample below there are only 5 unique values across the 3 condition columns, although if you consider each column separately, there are 3 unique values in each. So when I apply one-hot encoding to these variables together I get 9 new columns, but I only want 5 (one for each unique value in the collective set of columns).
Here is a sample of the original data:
| cond1 | cond2 | cond3 | target |
|-------|-------|-------|--------|
| I219 | E119 | I48 | 1 |
| I500 | | | 0 |
| I48 | I500 | F171 | 1 |
| I219 | E119 | I500 | 0 |
| I219 | I48 | | 0 |
Here's what I tried:
import pandas as pd
df = pd.read_csv('micro.csv', dtype='object')
df['cond1'] = pd.Categorical(df['cond1'])
df['cond2'] = pd.Categorical(df['cond2'])
df['cond3'] = pd.Categorical(df['cond3'])
dummies = pd.get_dummies(df[['cond1', 'cond2', 'cond3']], prefix = 'cond')
dummies
Which gives me:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_I48 | cond_I500 | cond_F171 | cond_I48 | cond_I500 |
|-----------|----------|-----------|-----------|----------|-----------|-----------|----------|-----------|
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
So I have multiple coded columns for any code that appears in more than one column (I48 and I500). I would like only a single column for each, so I can check for correlations between individual codes and my target variable.
Is there a way to do this? This is the result I'm after:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_F171 |
|-----------|----------|-----------|-----------|-----------|
| 1 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 |
Take the max if you need 1 and 0 values in the output:
dfDummies = dummies.max(axis=1, level=0)
Or use sum if you need to count the 1 values:
dfDummies = dummies.sum(axis=1, level=0)
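Note that the level argument of max/sum was deprecated in pandas 1.3 and removed in 2.0; on newer versions the same grouping of duplicate column names can be done by transposing and grouping on the index:
dfDummies = dummies.T.groupby(level=0).max().T
# or, to count the 1 values
dfDummies = dummies.T.groupby(level=0).sum().T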
I have the following DataFrame
| name | number |
|------|--------|
| a | 1 |
| a | 1 |
| a | 1 |
| b | 2 |
| b | 2 |
| b | 2 |
| c | 3 |
| c | 3 |
| c | 3 |
| d | 4 |
| d | 4 |
| d | 4 |
I wish to merge all the rows by name, with their number values added up and kept in line with the name.
Desired output:
| name | number |
|------|--------|
| a | 3 |
| b | 6 |
| c | 9 |
| d | 12 |
It seems you need groupby and aggregate sum:
df = df.groupby('name', as_index=False)['number'].sum()
#or
#df = df.groupby('name')['number'].sum().reset_index()
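A quick check against the sample data above:
import pandas as pd

df = pd.DataFrame({'name': ['a']*3 + ['b']*3 + ['c']*3 + ['d']*3,
                   'number': [1]*3 + [2]*3 + [3]*3 + [4]*3})
print(df.groupby('name', as_index=False)['number'].sum())
#   name  number
# 0    a       3
# 1    b       6
# 2    c       9
# 3    d      12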
Assuming DataFrame is your table name:
SELECT name, SUM(number) [number] FROM DataFrame GROUP BY name
Insert the result after deleting the original rows.