I have a dataframe indexed by date, with dates ranging from 6-1 to 6-18.
What I need to do is perform a "pivot" or a horizontal merge based on the date.
For example, let's say today is 6-18. I need to go through this dataframe, find the rows dated 6-18, and pivot/join them horizontally onto the same dataframe.
Expected output (1 signifies there is data there, 0 signifies null/NaN):
Before the join, df:
date | x | y | z
6-15 | 1 | 1 | 1
6-15 | 2 | 2 | 2
6-18 | 3 | 3 | 3
6-18 | 3 | 3 | 3
Joining the df on 6-18:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
When I use append, join, or merge, what I get is this:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
What I've done is extract the date that I want, to a new dataframe using loc.
df_daily = df_metrics.loc[str(_date_map['daily']['start'].date())]
df_daily.columns = [str(cols) + " (Daily)" if cols in metric_names else cols for cols in df_daily.columns]
And then I join it to the master df:
df = df.join(df_daily, lsuffix=' (Daily)', rsuffix=' (Monthly)').reset_index()
When I try joining or merging, the dataset gets very large. I assume it compares every row against every other row, so whenever the date of a row doesn't match, it creates a new row with NaN.
My dataset grows from about 30k rows to 2.8 million.
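The row growth described above can be reproduced on the small example. The following is a minimal sketch, not the asker's actual code: it assumes string dates in the index and uses hypothetical names (`target`, `suffix`, `blown_up`, `same_length`). Joining on a non-unique index pairs every left row for a date with every right row for that date, which is what multiplies the rows. Masking the frame and concatenating side by side keeps the original row count.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; dates are kept as plain strings (assumption).
df = pd.DataFrame(
    {"x": [1, 2, 3, 3], "y": [1, 2, 3, 3], "z": [1, 2, 3, 3]},
    index=pd.Index(["6-15", "6-15", "6-18", "6-18"], name="date"),
)
target = "6-18"            # hypothetical "today"
suffix = f" ({target})"

# Index join: each 6-18 row on the left is paired with every 6-18 row
# on the right, so 2 rows become 2 * 2 = 4, plus the 2 unmatched 6-15 rows.
daily = df.loc[[target]].add_suffix(suffix)
blown_up = df.join(daily)

# Alternative: blank out the non-matching rows and concatenate column-wise.
# The two frames share the same index, so no row-by-row matching happens.
is_target = df.index.to_numpy() == target
mask = np.repeat(is_target[:, None], df.shape[1], axis=1)
same_length = pd.concat([df, df.where(mask).add_suffix(suffix)], axis=1)

print(len(blown_up), len(same_length))
```

The masked columns come back as floats because NaN is used for the rows that do not match the target date.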
I have a dataframe that looks like this:
+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583 |
+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| code | 1 | 0 | 0 | 1 |
+--------+--------+--------+--------+--------+
For each 1 in the row with index 'code', I would like the name of the corresponding column to appear in a new column 'key_name'. Here is the desired final result:
+--------+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583 |key_name|
+--------+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 | Q111 |
| 2 | 0 | 1 | 1 | 0 | nan |
| 3 | 0 | 0 | 0 | 1 | nan |
| 4 | 1 | 0 | 0 | 1 | Info583|
| code | 1 | 0 | 0 | 1 | nan |
+--------+--------+--------+--------+--------+--------+
Thanks for any help or advice!
I think this is what you're looking for:
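condition = (df.loc['code', :] == 1).to_numpy()  # positional mask, one entry per original column
names = df.columns[condition]                    # names of the columns where 'code' is 1
df['key_name'] = pd.Series(np.nan, index=df.index, dtype=object)
df.loc[condition, 'key_name'] = names
First compute the condition: the row with index 'code' equals 1. The conversion to a NumPy array matters, because the comparison returns a Series labelled by column names, which df.loc cannot align against the row index. Then create the column filled with NaN and plug in the column names where the condition is met. The mask is applied by position, so this assumes the frame has as many rows as it has original columns.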
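A self-contained sketch of the same idea on the question's input table, under one loudly stated assumption: the flags in the 'code' row are mapped to rows by position (row i receives the name of column i), which only works when the frame has as many rows as columns. The desired table in the question contains a row 4 that the input lacks, so this sketch covers only the four input rows. The name `flags` is hypothetical.

```python
import numpy as np
import pandas as pd

# Input table from the question (four rows, four columns).
df = pd.DataFrame(
    {
        "Q111": [1, 0, 0, 1],
        "Q570": [0, 1, 0, 0],
        "Q7891": [0, 1, 0, 0],
        "Info583": [0, 0, 1, 1],
    },
    index=[1, 2, 3, "code"],
)

# One boolean per column: True where the 'code' row holds a 1.
flags = (df.loc["code"] == 1).to_numpy()

# Positional mapping (assumption): row i gets the name of column i if flagged.
df["key_name"] = np.where(flags, df.columns.to_numpy(dtype=object), np.nan)

print(df)
```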
I have various columns in a pandas dataframe that have dummy values and I want to fill them as follows:
Input Columns
+----+----+
| c1 | c2 |
+----+----+
| 0 | 1 |
| 0 | 0 |
| 1 | 0 |
| 0 | 0 |
| 0 | 1 |
| 0 | 1 |
| 1 | 0 |
| 0 | 1 |
Output columns:
+----+----+
| c1 | c2 |
+----+----+
| 0 | 1 |
| 0 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 2 | 4 |
How can I get this output in pandas?
If there are only 0 and 1 values, a cumulative sum works (DataFrame.cumsum):
df1 = df.cumsum()
print (df1)
c1 c2
0 0 1
1 0 1
2 1 1
3 1 1
4 1 2
5 1 3
6 2 3
7 2 4
If there are values other than 0 and 1, take the cumulative sum of a mask that tests for values not equal to 0:
df2 = df.ne(0).cumsum()
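To make the difference between the two variants concrete, here is a small sketch. The frame `other` with values 5 and 7 is made up for illustration and does not come from the question.

```python
import pandas as pd

# Columns from the question: only 0 and 1 values.
df = pd.DataFrame(
    {"c1": [0, 0, 1, 0, 0, 0, 1, 0], "c2": [1, 0, 0, 0, 1, 1, 0, 1]}
)
df1 = df.cumsum()  # running count of the 1 values

# Hypothetical column with non-zero values other than 1.
other = pd.DataFrame({"c1": [0, 5, 0, 7]})
plain = other.cumsum()        # adds the values themselves: 0, 5, 5, 12
df2 = other.ne(0).cumsum()    # counts the non-zero entries: 0, 1, 1, 2

print(df1)
print(plain["c1"].tolist(), df2["c1"].tolist())
```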
I have several 'condition' columns in a dataset. These columns are all eligible to receive the same coded input. This is only to allow multiple conditions to be associated with a single record - which column the code winds up in carries no meaning.
In the sample below there are only 5 unique values across the 3 condition columns, although if you consider each column separately, there are 3 unique values in each. So when I apply one-hot encoding to these variables together I get 9 new columns, but I only want 5 (one for each unique value in the collective set of columns).
Here is a sample of the original data:
| cond1 | cond2 | cond3 | target |
|-------|-------|-------|--------|
| I219 | E119 | I48 | 1 |
| I500 | | | 0 |
| I48 | I500 | F171 | 1 |
| I219 | E119 | I500 | 0 |
| I219 | I48 | | 0 |
Here's what I tried:
import pandas as pd
df = pd.read_csv('micro.csv', dtype='object')
df['cond1'] = pd.Categorical(df['cond1'])
df['cond2'] = pd.Categorical(df['cond2'])
df['cond3'] = pd.Categorical(df['cond3'])
dummies = pd.get_dummies(df[['cond1', 'cond2', 'cond3']], prefix = 'cond')
dummies
Which gives me:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_I48 | cond_I500 | cond_F171 | cond_I48 | cond_I500 |
|-----------|----------|-----------|-----------|----------|-----------|-----------|----------|-----------|
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
So I have multiple coded columns for any code that appears in more than one column (I48 and I500). I would like only a single column for each, so I can check for correlations between individual codes and my target variable.
Is there a way to do this? This is the result I'm after:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_F171 |
|-----------|----------|-----------|-----------|-----------|
| 1 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 |
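Take the max per column name if you need 1 and 0 data in the output (note that the level argument used here was removed from max and sum in pandas 2.0):
dfDummies = dummies.max(axis=1, level=0)
Or use sum if you need to count the 1 values:
dfDummies = dummies.sum(axis=1, level=0)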
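On pandas versions where `max` and `sum` no longer accept `level`, the same collapse of duplicate column names can be sketched by grouping the transposed frame. This is an alternative spelling, not the answer's original code. The data is rebuilt from the question's sample table, with blanks treated as missing values, and `dtype=int` is passed so the dummies are 1/0 rather than booleans.

```python
import pandas as pd

# Sample data from the question; empty cells are treated as missing.
df = pd.DataFrame(
    {
        "cond1": ["I219", "I500", "I48", "I219", "I219"],
        "cond2": ["E119", None, "I500", "E119", "I48"],
        "cond3": ["I48", None, "F171", "I500", None],
        "target": [1, 0, 1, 0, 0],
    }
)

# Same prefix for all three columns, so shared codes get duplicate names.
dummies = pd.get_dummies(df[["cond1", "cond2", "cond3"]], prefix="cond", dtype=int)

# Transpose, group the duplicate names together, take the max, transpose back.
combined = dummies.T.groupby(level=0).max().T

print(combined)
```

Replacing `.max()` with `.sum()` counts how many of the three columns held each code. The resulting columns come out sorted by name rather than in the order shown in the question.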
The first column, cond, contains either 1 or 0.
The second column, event, contains either 1 or 0.
I want to create a third column where each row is the cumulative sum of cond, modulo 4, taken over the rows between two rows where event == 1 (the first row where event == 1 is included in the cumulative sum, but the last one is not).
+------+-------+--------+
| cond | event | Result |
+------+-------+--------+
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 0 | 0 | 2 |
| 1 | 0 | 3 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 1 | 1 | 1 |
+------+-------+--------+
This can easily be tackled with groupby, transform, and cumsum:
event_cum = df['event'].cumsum()
result = df['cond'].groupby(event_cum).transform('cumsum').mod(4)
result[event_cum == 0] = 0 # rows before the first event
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 0
8 1
9 2
10 1
Name: cond, dtype: int64
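The reason this works is that the running sum of event stays constant between two events, so it can serve as a group label for each stretch of rows. A self-contained sketch with the frame rebuilt from the question's table follows. It calls `groupby(...).cumsum()` directly, which should be equivalent to `transform('cumsum')` here, and uses `where` in place of the boolean assignment.

```python
import pandas as pd

# Columns from the question's table.
df = pd.DataFrame(
    {
        "cond": [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
        "event": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    }
)

# Group label: 0 before the first event, 1 from the first event on, and so on.
event_cum = df["event"].cumsum()

# Cumulative sum of cond restarted at every event, then reduced modulo 4.
result = df["cond"].groupby(event_cum).cumsum().mod(4)

# Rows before the first event are forced to 0.
df["Result"] = result.where(event_cum > 0, 0)

print(df["Result"].tolist())  # [0, 0, 0, 1, 2, 2, 3, 0, 1, 2, 1]
```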
I would like to find all cases for all ids in a Pandas DataFrame.
What would be an efficient solution? I have around 10k records, and they are processed server-side. Would it be a good idea to create a new DataFrame, or is there a more efficient data structure I can use? A case is satisfied when an id contains all the names in that case.
Input (Pandas DataFrame)
id | name |
-----------
1 | bla1 |
2 | bla2 |
2 | bla3 |
2 | bla4 |
3 | bla5 |
4 | bla9 |
5 | bla6 |
5 | bla7 |
6 | bla8 |
Cases
names [
[bla2, bla3, bla4], #case 1
[bla1, bla3, bla7], #case 2
[bla3, bla1, bla6], #case 3
[bla6, bla7] #case 4
]
Needed output (unless there is a more efficient way)
id | case1 | case2 | case3 | case4 |
------------------------------------
1 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 0 | 0 |
names = [
['bla2', 'bla3', 'bla4'], # case 1
['bla1', 'bla3', 'bla7'], # case 2
['bla3', 'bla1', 'bla6'], # case 3
['bla6', 'bla7'] # case 4
]
df = df.groupby('id').apply(lambda x: \
pd.Series([int(pd.Series(y).isin(x['name']).all()) for y in names]))\
.rename(columns=lambda x: 'case{}'.format(x + 1))
df
+------+---------+---------+---------+---------+
| id | case1 | case2 | case3 | case4 |
|------+---------+---------+---------+---------|
| 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 |
+------+---------+---------+---------+---------+
First, group by id, and then apply a check for each case in turn to every group. The objective is to check whether a group contains all the items of a given case. This is handled by isin in conjunction with the list comprehension. The outer pd.Series expands the result into separate columns, and rename is used to rename the columns.
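As a possible alternative to `groupby(...).apply`, the same check can be sketched with Python sets: collect each id's names into a set once, then test every case with a subset comparison. This is an illustrative variant, not the answer's method. The names `name_sets` and `cases` are hypothetical, and whether it is faster on 10k rows has not been measured here.

```python
import pandas as pd

# Input frame from the question.
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 2, 3, 4, 5, 5, 6],
        "name": ["bla1", "bla2", "bla3", "bla4", "bla5", "bla9", "bla6", "bla7", "bla8"],
    }
)
names = [
    ["bla2", "bla3", "bla4"],  # case 1
    ["bla1", "bla3", "bla7"],  # case 2
    ["bla3", "bla1", "bla6"],  # case 3
    ["bla6", "bla7"],          # case 4
]

# One set of names per id.
name_sets = df.groupby("id")["name"].agg(set)

# A case is satisfied when all of its names are in the id's set (subset test).
cases = pd.DataFrame(
    {
        f"case{i}": name_sets.map(lambda s, c=frozenset(case): int(c <= s))
        for i, case in enumerate(names, start=1)
    }
)

print(cases)
```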