I have a dataframe df1, and I have a list which contains names of several columns of df1.
df1:
User_id month day Age year CVI ZIP sex wgt
0 1 7 16 1977 2 NA M NaN
1 2 7 16 1977 3 NA M NaN
2 3 7 16 1977 2 DM F NaN
3 4 7 16 1977 7 DM M NaN
4 5 7 16 1977 3 DM M NaN
... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN
35546 35547 12 31 2002 10 RM F 14
35547 35548 12 31 2002 7 DO M 51
35548 35549 12 31 2002 5 NaN NaN NaN
list= [u"User_id", u"day", u"ZIP", u"sex"]
I want to make a new dataframe df2 which will contain only those columns which are in the list, and a dataframe df3 which will contain the columns which are not in the list.
Here I found that I need to do:
df2=df1[df1[df1.columns[1]].isin(list)]
But as a result I get:
Empty DataFrame
Columns: []
Index: []
[0 rows x 9 columns]
What am I doing wrong, and how can I get the needed result? Why "9 columns" if it is supposed to be 4?
Solution with Index.difference:
L = [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[L]
df3 = df1[df1.columns.difference(df2.columns)]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Or:
df2 = df1[L]
df3 = df1[df1.columns.difference(pd.Index(L))]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
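Note that Index.difference returns the remaining labels in sorted order, which is why df3 above comes back as Age, CVI, month, wgt, year rather than in df1's original column order. If the original order matters, dropping the listed columns is one option. A minimal sketch, assuming a small made-up stand-in for df1 (the values are illustrative, not the asker's data):

```python
import pandas as pd

# Small stand-in for df1 (illustrative values only)
df1 = pd.DataFrame({
    "User_id": [1, 2, 3],
    "month": [7, 7, 7],
    "day": [16, 16, 16],
    "year": [1977, 1977, 1977],
    "ZIP": ["NA", "NA", "DM"],
    "sex": ["M", "M", "F"],
})
L = ["User_id", "day", "ZIP", "sex"]

df2 = df1[L]                # the listed columns, in the order given by L
df3 = df1.drop(columns=L)   # everything else, in df1's original order

print(list(df2.columns))
print(list(df3.columns))
```

Here df3 keeps month before year because drop does not reorder the columns that remain.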
Never name a list "list"; doing so shadows the built-in type.
my_list= [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[df1.keys()[df1.keys().isin(my_list)]]
or
df2 = df1[df1.columns[df1.columns.isin(my_list)]]
You can try:
df2 = df1[list] # it does a projection on the columns contained in the list
df3 = df1[[col for col in df1.columns if col not in list]]
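As for why the original attempt printed an empty frame with 9 columns: df1.columns[1] is a single column label, so the expression tests the values stored in that column against the list of names, which yields an all-False row mask. A boolean Series used inside df1[...] filters rows and leaves every column in place. A small sketch of that behaviour, assuming a made-up four-column frame and a list called cols:

```python
import pandas as pd

# Illustrative stand-in for df1 (four columns instead of nine)
df1 = pd.DataFrame({
    "User_id": [1, 2, 3],
    "month": [7, 7, 12],
    "day": [16, 16, 31],
    "sex": ["M", "M", "F"],
})
cols = ["User_id", "day", "sex"]

# df1.columns[1] is the label 'month', so this checks whether the month
# VALUES (7, 7, 12) appear in the list of column names: they never do
row_mask = df1[df1.columns[1]].isin(cols)
print(row_mask.tolist())

# A boolean Series inside df1[...] selects rows, so no rows survive
# while all columns are kept
empty = df1[row_mask]
print(empty.shape)
```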
I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. The general look of panel data is as follows:
df = pd.DataFrame({'name': ['a']*4+['b']*3+['c']*4,
'year':[2001,2002,2004,2005]+[2000,2002,2003]+[2001,2002,2003,2005],
'val1':[1,2,3,4,5,6,7,8,9,10,11],
'val2':[2,5,7,11,13,17,19,23,29,31,37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables grouped by name. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts up/down by one row, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years with:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then I use a normal shift. However, my data is quite large, and filling in the missing years often triples its size, which is not very efficient.
I am looking for a better approach for this scenario.
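The solution is to create a check column that flags whether the previous row (for the lag) or the next row (for the lead) within the same name is exactly one year away. Set the check column to 1.0 where it is and to np.nan where it is not, then multiply it by your normal groupby shift: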
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None
To create the lag and lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
You can compare this with the merge method; the check-column approach is much more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
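The same idea can be written without the helper columns by masking the shifted values directly. The following is a minimal sketch, not the answerer's exact code: it uses Series.where in place of the 1.0/NaN multiplication, and it assumes the rows are already sorted by year within each name.

```python
import pandas as pd

df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001, 2002, 2004, 2005] + [2000, 2002, 2003] + [2001, 2002, 2003, 2005],
                   'val1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                   'val2': [2, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]})

g = df.groupby('name')

# Distance in years to the previous / next row of the same name
gap_prev = df['year'] - g['year'].shift(1)
gap_next = g['year'].shift(-1) - df['year']

# Keep the shifted value only where the neighbouring row is exactly one year away;
# everywhere else (including the NaN gaps at group edges) the result is NaN
df['val1_lag'] = g['val1'].shift(1).where(gap_prev == 1)
df['val1_lead'] = g['val1'].shift(-1).where(gap_next == 1)

print(df)
```

If the frame is not sorted, a df.sort_values(['name', 'year']) beforehand would be needed for the row-based shift to line up with consecutive years.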
Don't use shift but a merge with the year ± 1:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
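For reuse across several columns, the merge idea can be wrapped in a small function. This is only a sketch: value_from_year_offset is a hypothetical helper name, the merge keys are spelled out explicitly, and it assumes each (name, year) pair occurs at most once (otherwise the left merge would return more rows than df has).

```python
import pandas as pd

df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001, 2002, 2004, 2005] + [2000, 2002, 2003] + [2001, 2002, 2003, 2005],
                   'val1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                   'val2': [2, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]})

def value_from_year_offset(frame, col, k):
    """Hypothetical helper: value of `col` for the same name, k years earlier.

    A negative k looks k years ahead instead. Missing years give NaN.
    """
    # Relabel each row with the year in which its value should show up
    shifted = frame[['name', 'year', col]].assign(year=frame['year'] + k)
    merged = frame[['name', 'year']].merge(shifted, on=['name', 'year'], how='left')
    # Return plain values so the assignment does not depend on frame's index
    return merged[col].to_numpy()

df['val1_lag'] = value_from_year_offset(df, 'val1', 1)
df['val1_lead'] = value_from_year_offset(df, 'val1', -1)
print(df)
```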
I have a dataframe with a thousand records, like this:
ID to from Date price Type
1 69 18 2/2020 10 A
2 11 12 2/2020 5 A
3 18 10 3/2020 4 B
4 10 11 3/2020 10 A
5 12 69 3/2020 4 B
6 12 20 3/2020 3 B
7 69 21 3/2020 3 A
The output that I want is:
ID to from Date price Type ID to from Date price Type
1 69 18 2/2020 4 A 5 12 69 3/2020 4 B
1' 69 18 2/2020 6 A Nan Nan Nan Nan Nan Nan
2 11 12 2/2020 5 A Nan Nan Nan Nan Nan Nan
4 10 11 3/2020 4 A 3 18 10 3/2020 4 B
4' 10 11 3/2020 6 A Nan Nan Nan Nan Nan Nan
Nan Nan Nan Nan Nan Nan 6 12 20 3/2020 3 B
7 69 21 3/2020 3 A Nan Nan Nan Nan Nan Nan
The idea is to iterate over the rows: if the type is B, put the row next to the first record of type A whose "to" equals the B row's "from".
If the prices are equal, that is fine; if they are not, split the row with the higher price, and the price of the new row is the difference between the two.
I split the dataframe into type A and type B, and I'm trying to iterate over both of them:
grp = df.groupby('Type')
transformed_df_list = []
for idx, frame in grp:
    # groups come out in sorted key order, so 'A' is first and 'B' second
    frame = frame.reset_index(drop=True)
    transformed_df_list.append(frame.copy())
A = transformed_df_list[0]
B = transformed_df_list[1]
for i, row in A.iterrows():
    for j, row1 in B.iterrows():
        if row['to'] == row1['from']:
            if row['price'] == row1['price']:
                row_df = pd.DataFrame([row1])
output = pd.merge(A, B, how='left', left_on=['to'], right_on=['from'])
The problem is that with the merge function I get several duplicate rows, and I can't check the price to split the row.
Is there a way to insert a B row into the A dataframe without the merge function?
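One way to read the procedure described in the question is as a greedy pairing over plain Python records, which avoids merge and its duplicate rows. The sketch below is only an interpretation under stated assumptions: each B row is used at most once, the row with the higher price is split whichever side it is on (the question only shows A rows being split), and unmatched B rows are appended at the end, so the row order differs slightly from the desired output. The names pairs, used_b, and the _A/_B suffixes are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7],
    "to": [69, 11, 18, 10, 12, 12, 69],
    "from": [18, 12, 10, 11, 69, 20, 21],
    "Date": ["2/2020", "2/2020", "3/2020", "3/2020", "3/2020", "3/2020", "3/2020"],
    "price": [10, 5, 4, 10, 4, 3, 3],
    "Type": ["A", "A", "B", "A", "B", "B", "A"],
})

a_rows = df[df["Type"] == "A"].to_dict("records")
b_rows = df[df["Type"] == "B"].to_dict("records")

used_b = set()   # positions in b_rows that are already paired
pairs = []       # (A record or None, B record or None), one per output row

for a in a_rows:
    # First unused B row whose 'from' equals this A row's 'to'
    j = next((k for k, b in enumerate(b_rows)
              if k not in used_b and b["from"] == a["to"]), None)
    if j is None:
        pairs.append((a, None))
        continue
    used_b.add(j)
    b = b_rows[j]
    common = min(a["price"], b["price"])
    pairs.append(({**a, "price": common}, {**b, "price": common}))
    if a["price"] > common:
        # Remainder of the A row goes on its own line, with a primed ID
        pairs.append(({**a, "ID": f"{a['ID']}'", "price": a["price"] - common}, None))
    elif b["price"] > common:
        # Remainder of the B row (assumed to be handled symmetrically)
        pairs.append((None, {**b, "ID": f"{b['ID']}'", "price": b["price"] - common}))

# B rows that never found a partner
pairs += [(None, b) for k, b in enumerate(b_rows) if k not in used_b]

# Empty dicts become all-NaN rows
left = pd.DataFrame([a or {} for a, _ in pairs], columns=df.columns)
right = pd.DataFrame([b or {} for _, b in pairs], columns=df.columns)
output = pd.concat([left.add_suffix("_A"), right.add_suffix("_B")], axis=1)
print(output)
```

The nested search is quadratic in the worst case, which should be acceptable for around a thousand records; for much larger data, grouping the B rows by "from" in a dictionary first would be a natural refinement.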
I have a dataframe as follows. This dataframe is actually the result of an outer join of two tables.
IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
100 1234 xyz 3 17 1 nan right_only
nan nan nan -1 -1 None None right_only
nan nan nan 1 15 None None right_only
nan nan nan 100 100 None 1 right_only
Objective: I want a resultant dataframe based on column conditions. The conditions are given below
if ishod == 1 the resultant df will be:
IndentID IndentNo role_name role_id user_id
100 1234 xyz 3 17
if ishod!=1 and isdesignatedhod==1 the resultant df will be:
IndentID IndentNo role_name role_id user_id
100 1234 xyz 100 100
I am really clueless on how to proceed on this. Any clue will be appreciated!!
To select rows based on a value in a certain column, you can use the following notation:
df[ df["column_name"] == value_to_keep ]
Here is an example of this in action:
import pandas as pd
d = {'col1': [1,2,1,2,1,2,1,2,1,2,1,2,1,2,1],
'col2': [3,4,5,3,4,5,3,4,5,3,4,5,3,4,5],
'col3': [6,7,8,9,6,7,8,9,6,7,8,9,6,7,8]}
# create a dataframe
df = pd.DataFrame(d)
This is what df looks like:
In [17]: df
Out[17]:
col1 col2 col3
0 1 3 6
1 2 4 7
2 1 5 8
3 2 3 9
4 1 4 6
5 2 5 7
6 1 3 8
7 2 4 9
8 1 5 6
9 2 3 7
10 1 4 8
11 2 5 9
12 1 3 6
13 2 4 7
14 1 5 8
Now to select all rows for which the value is '2' in the first column:
df_1 = df[df["col1"] == 2]
In [19]: df_1
Out [19]:
col1 col2 col3
1 2 4 7
3 2 3 9
5 2 5 7
7 2 4 9
9 2 3 7
11 2 5 9
13 2 4 7
You can also combine multiple conditions this way:
df_2 = df[(df["col2"] >= 4) & (df["col3"] != 7)]
In [22]: df_2
Out [22]:
col1 col2 col3
2 1 5 8
4 1 4 6
7 2 4 9
8 1 5 6
10 1 4 8
11 2 5 9
14 1 5 8
Hope this example helps!
Andre gives the right answer. Also keep in mind the dtype of the columns ishod and isdesignatedhod. They are of "object" type, in this specific case strings.
So you have to use quotes when comparing these object columns with numbers:
df[df["ishod"] == "1"]
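If it is unclear whether the flags are stored as strings or as numbers, one option is to convert them with pd.to_numeric before comparing, so the same condition works in both cases. A minimal sketch, assuming a cut-down version of the frame from the question in which the flag columns hold the string "1" or None:

```python
import pandas as pd

# Cut-down stand-in for the joined frame (flag columns are object dtype)
df = pd.DataFrame({
    "IndentID": [100, None, None, None],
    "role_id": [3, -1, 1, 100],
    "user_id": [17, -1, 15, 100],
    "ishod": ["1", None, None, None],
    "isdesignatedhod": [None, None, None, "1"],
})
print(df.dtypes)

# Unparseable or missing entries become NaN instead of raising
ishod = pd.to_numeric(df["ishod"], errors="coerce")
isdesignatedhod = pd.to_numeric(df["isdesignatedhod"], errors="coerce")

hod_rows = df[ishod == 1]
designated_rows = df[(ishod != 1) & (isdesignatedhod == 1)]

print(hod_rows[["role_id", "user_id"]])
print(designated_rows[["role_id", "user_id"]])
```

A comparison with NaN is never equal, so rows with a missing ishod count as "not 1" in the second condition.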
This should do approximately what you want
import io
import pandas as pd
nan = float("nan")
def func(row):
    # Replace the flagged rows with the desired values; leave the others unchanged
    if row["ishod"] == "1":
        return pd.Series([100, 1234, "xyz", 3, 17, nan, nan, nan], index=row.index)
    elif row["isdesignatedhod"] == "1":
        return pd.Series([100, 1234, "xyz", 100, 100, nan, nan, nan], index=row.index)
    else:
        return row
pd.read_csv(io.StringIO(
"""IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
100 1234 xyz 3 17 1 nan right_only
nan nan nan -1 -1 None None right_only
nan nan nan 1 15 None None right_only
nan nan nan 100 100 None 1 right_only
"""), sep=" +", engine='python')\
.apply(func,axis=1)
Output:
IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
0 100.0 1234.0 xyz 3 17 NaN NaN NaN
1 NaN NaN NaN -1 -1 None None right_only
2 NaN NaN NaN 1 15 None None right_only
3 100.0 1234.0 xyz 100 100 NaN NaN NaN
I have a Pandas dataframe, df. Here are the first five rows:
Id StartDate EndDate
0 0 2015-08-11 2018-07-13
1 1 2014-02-15 2016-01-25
2 2 2014-12-20 NaT
3 3 2015-01-09 2015-01-14
4 4 2014-07-20 NaT
I want to construct a new dataframe, df2. df2 should have a row for each month between StartDate and EndDate, inclusive, for each Id in df. For example, since the first row of df has StartDate in August 2015 and EndDate in July 2018, df2 should have rows corresponding to August 2015, September 2015, ..., July 2018. If an Id in df has no EndDate, we will take it to be June 2019.
I would like df2 to use a multiindex with the first level being the corresponding Id in df, the second level being the year, and the third level being the month. For example, if the above five rows were all of df, then df2 should look like:
Id Year Month
0 2015 8
9
10
11
12
2016 1
2
3
4
5
6
7
8
9
10
11
12
2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
... ... ...
4 2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
2
3
4
5
6
7
8
9
10
11
12
2019 1
2
3
4
5
6
The following code does the trick, but takes about 20 seconds on my decent laptop for 10k Ids. Can I be more efficient somehow?
import numpy as np
import pandas as pd
def build_multiindex_for_id_(id_, enroll_month, enroll_year, cancel_month, cancel_year):
    # Given id_ and start/end dates,
    # returns 2d array to be converted to multiindex.
    # Each row of returned array represents a month/year
    # between enroll date and cancel date inclusive.
    year = enroll_year
    month = enroll_month
    multiindex_array = [[],[],[]]
    while (month != cancel_month) or (year != cancel_year):
        multiindex_array[0].append(id_)
        multiindex_array[1].append(year)
        multiindex_array[2].append(month)
        month += 1
        if month == 13:
            month = 1
            year += 1
    # The loop stops before the cancel month, so add it here.
    multiindex_array[0].append(id_)
    multiindex_array[1].append(year)
    multiindex_array[2].append(month)
    return np.array(multiindex_array)
# Begin by constructing array for first id.
array_for_multiindex = build_multiindex_for_id_(0,8,2015,7,2018)
# Append the rest of the multiindices for the remaining ids.
for _, row in df.loc[1:].fillna(pd.to_datetime('2019-06-30')).iterrows():
    current_id_array = build_multiindex_for_id_(
        row['Id'],
        row['StartDate'].month,
        row['StartDate'].year,
        row['EndDate'].month,
        row['EndDate'].year)
    array_for_multiindex = np.append(array_for_multiindex, current_id_array, axis=1)
df2_index = pd.MultiIndex.from_arrays(array_for_multiindex).rename(['Id','Year','Month'])
pd.DataFrame(index=df2_index)
Here's my approach after some trial and error:
(df.melt(id_vars='Id')
.fillna(pd.to_datetime('June 2019'))
.set_index('value')
.groupby('Id').apply(lambda x: x.asfreq('M').ffill())
.reset_index('value')
.assign(year=lambda x: x['value'].dt.year,
month=lambda x: x['value'].dt.month)
.set_index(['year','month'], append=True)
)
Output:
value Id variable
Id year month
0 2015 8 2015-08-31 NaN NaN
9 2015-09-30 NaN NaN
10 2015-10-31 NaN NaN
11 2015-11-30 NaN NaN
12 2015-12-31 NaN NaN
2016 1 2016-01-31 NaN NaN
2 2016-02-29 NaN NaN
3 2016-03-31 NaN NaN
4 2016-04-30 NaN NaN
5 2016-05-31 NaN NaN
6 2016-06-30 NaN NaN
7 2016-07-31 NaN NaN
8 2016-08-31 NaN NaN
9 2016-09-30 NaN NaN
10 2016-10-31 NaN NaN
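Another possibility is to skip both the per-row loop and the groupby by doing the month arithmetic on integers with NumPy. The sketch below is an alternative technique, not the approach shown above: it turns each date into a month count (year * 12 + month - 1), repeats every Id once per month in its range, and rebuilds Year and Month from the counts. It assumes every StartDate is present and not later than the (filled) EndDate.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4],
    "StartDate": pd.to_datetime(["2015-08-11", "2014-02-15", "2014-12-20",
                                 "2015-01-09", "2014-07-20"]),
    "EndDate": pd.to_datetime(["2018-07-13", "2016-01-25", None,
                               "2015-01-14", None]),
})

# Missing end dates are taken to be June 2019, as in the question
end = df["EndDate"].fillna(pd.Timestamp("2019-06-30"))

# Month number since year 0 for the first and last month of each Id
first = (df["StartDate"].dt.year * 12 + df["StartDate"].dt.month - 1).to_numpy()
last = (end.dt.year * 12 + end.dt.month - 1).to_numpy()
counts = last - first + 1                      # months per Id, inclusive

# For every output row: 0, 1, 2, ... within its own Id block
block_start = np.repeat(np.cumsum(counts) - counts, counts)
offsets = np.arange(counts.sum()) - block_start
months = np.repeat(first, counts) + offsets

df2 = pd.DataFrame(index=pd.MultiIndex.from_arrays(
    [np.repeat(df["Id"].to_numpy(), counts), months // 12, months % 12 + 1],
    names=["Id", "Year", "Month"]))
print(df2.index[:3])
```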
Here is the dataframe:
A B val val2 loc
1 march 3 2 NY
1 april 5 1 NY
1 may 12 4 NY
2 march 4 1 NJ
2 april 7 5 NJ
2 may 12 1 NJ
3 march 1 8 CA
3 april 54 6 CA
3 may 2 9 CA
I'd like to transform this into:
march march april april may may
val1 val2 val1 val2 val1 val2
A B
1 NY 3 5 12 2 1 4
2 NJ 4 7 12 1 5 5
3 CA 1 54 2 8 6 9
I'm looking into pivot tables and stacking and unstacking, but I'm truly stuck. I'm not sure where to start.
With pd.pivot_table, and some swapping of levels:
new_df = (pd.pivot_table(df,['val','val2'],['A','loc'],['B'])
.sort_index(axis=1, level=1)
.swaplevel(0, axis=1))
>>> new_df
B april march may
val val2 val val2 val val2
A loc
1 NY 5 1 3 2 12 4
2 NJ 7 5 4 1 12 1
3 CA 54 6 1 8 2 9
If the ordering of your columns is important (as in you need it as march, april and may), you can set it to an ordered categorical:
new_df = (pd.pivot_table(df,['val','val2'],['A','loc'],
[pd.Categorical(df.B, categories=['march','april','may'],
ordered=True)])
.dropna(how='all')
.sort_index(axis=1, level=1)
.swaplevel(0, axis=1))
>>> new_df
B march april may
val val2 val val2 val val2
A loc
1 NY 3.0 2.0 5.0 1.0 12.0 4.0
2 NJ 4.0 1.0 7.0 5.0 12.0 1.0
3 CA 1.0 8.0 54.0 6.0 2.0 9.0
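Since the question mentions stacking and unstacking, here is a sketch of the same reshape with set_index and unstack, followed by an explicit reindex to put the months in calendar order. It is an alternative to the pivot_table answer, not a replacement: unstack does no aggregation, so it assumes every (A, loc, B) combination appears only once, whereas pivot_table would average duplicates. The variable names wide and column_order are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "B": ["march", "april", "may"] * 3,
    "val": [3, 5, 12, 4, 7, 12, 1, 54, 2],
    "val2": [2, 1, 4, 1, 5, 1, 8, 6, 9],
    "loc": ["NY"] * 3 + ["NJ"] * 3 + ["CA"] * 3,
})

# Desired column layout: month on top, then val / val2 under each month
column_order = pd.MultiIndex.from_product(
    [["march", "april", "may"], ["val", "val2"]])

wide = (df.set_index(["A", "loc", "B"])[["val", "val2"]]
          .unstack("B")                  # B moves into the columns, below val/val2
          .swaplevel(0, 1, axis=1)       # put the month level on top
          .reindex(columns=column_order))
print(wide)
```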