Merge and delete duplicates - python

I have two large datasets I want to merge which have a common column, "gene".
Entries are meant to be unique in df1 (although the sample below does contain a few repeats, e.g. Cenpa).
In [85]: df1
Out[85]:
gene
0 Cdk12
1 Cdk2ap1
2 Cdk7
3 Cdk8
4 Cdx2
5 Cenpa
6 Cenpa
7 Cenpa
8 Cenpc1
9 Cenpe
10 Cenpj
In [86]: df2
Out[86]:
gene year DOI
0 Cdk12 2001 10.1038/35055500
1 Cdk12 2002 10.1038/nature01266
2 Cdk12 2002 10.1074/jbc.M106813200
3 Cdk12 2003 10.1073/pnas.1633296100
4 Cdk12 2003 10.1073/pnas.2336103100
5 Cdk12 2005 10.1093/nar/gni045
6 Cdk12 2005 10.1126/science.1112014
7 Cdk12 2008 10.1101/gr.078352.108
8 Cdk12 2011 10.1371/journal.pbio.1000582
9 Cdk12 2012 10.1074/jbc.M111.321760
10 Cdk12 2016 10.1038/cdd.2015.157
11 Cdk12 2017 10.1093/cercor/bhw081
12 Cdk2ap1 2001 10.1006/geno.2001.6474
13 Cdk2ap1 2001 10.1038/35055500
14 Cdk2ap1 2002 10.1038/nature01266
I want to keep the order of df1 because I am going to join that alongside a different dataset.
Dataframe 2 has many entries for each "gene" and I want only one for each gene.
The most recent value in "year" will decide which "gene" entry to keep.
I have tried:
reading the files into pandas and then naming the columns
df1 = pd.read_csv('T1inorderforMerge.csv', header = None)
df2 = pd.read_csv('T2inorderforMerge.csv', header = None)
df1.columns = ["gene"]
df2.columns = ["gene","year","DOI"]
I have tried all variations of the code below, i.e. changing how= and the order of the DataFrames.
df3 = pd.merge(df1, df2, on ="gene", how="left")
I have tried vertical and horizontal stacking which, obvious to some, didn't work. There is a lot of other messy code I have also tried, but I really want to see how/if I can do this using pandas.

I think one possible solution is to create helper columns which count the occurrences of each gene and then merge the pairs - the first Cdk12 in df1 with the first Cdk12 in df2, the second Cdk12 with the second Cdk12, and so on. Unique values are merged 1 to 1, the classic way (because a is then always 0):
df1['a'] = df1.groupby('gene').cumcount()
df2['a'] = df2.groupby('gene').cumcount()
print (df1)
gene a
0 Cdk12 0
1 Cdk2ap1 0
2 Cdk7 0
3 Cdk8 0
4 Cdx2 0
5 Cenpa 0
6 Cenpa 1
7 Cenpa 2
8 Cenpc1 0
9 Cenpe 0
10 Cenpj 0
print (df2)
gene year DOI a
0 Cdk12 2001 10.1038/35055500 0
1 Cdk12 2002 10.1038/nature01266 1
2 Cdk12 2002 10.1074/jbc.M106813200 2
3 Cdk12 2003 10.1073/pnas.1633296100 3
4 Cdk12 2003 10.1073/pnas.2336103100 4
5 Cdk12 2005 10.1093/nar/gni045 5
6 Cdk12 2005 10.1126/science.1112014 6
7 Cdk12 2008 10.1101/gr.078352.108 7
8 Cdk12 2011 10.1371/journal.pbio.1000582 8
9 Cdk12 2012 10.1074/jbc.M111.321760 9
10 Cdk12 2016 10.1038/cdd.2015.157 10
11 Cdk12 2017 10.1093/cercor/bhw081 11
12 Cdk2ap1 2001 10.1006/geno.2001.6474 0
13 Cdk2ap1 2001 10.1038/35055500 1
14 Cdk2ap1 2002 10.1038/nature01266 2
df3 = pd.merge(df1, df2, on =["a","gene"], how="left").drop('a', axis=1)
print (df3)
gene year DOI
0 Cdk12 2001.0 10.1038/35055500
1 Cdk2ap1 2001.0 10.1006/geno.2001.6474
2 Cdk7 NaN NaN
3 Cdk8 NaN NaN
4 Cdx2 NaN NaN
5 Cenpa NaN NaN
6 Cenpa NaN NaN
7 Cenpa NaN NaN
8 Cenpc1 NaN NaN
9 Cenpe NaN NaN
10 Cenpj NaN NaN
You also get NaNs for all rows whose gene/counter pair has no match.
But if you need to process only unique values in df1['gene'], then call drop_duplicates first on both DataFrames:
df1 = df1.drop_duplicates('gene')
df2 = df2.drop_duplicates('gene')
print (df1)
gene
0 Cdk12
1 Cdk2ap1
2 Cdk7
3 Cdk8
4 Cdx2
5 Cenpa
8 Cenpc1
9 Cenpe
10 Cenpj
print (df2)
gene year DOI
0 Cdk12 2001 10.1038/35055500
12 Cdk2ap1 2001 10.1006/geno.2001.6474
df3 = pd.merge(df1, df2, on ="gene", how="left")
print (df3)
gene year DOI
0 Cdk12 2001.0 10.1038/35055500
1 Cdk2ap1 2001.0 10.1006/geno.2001.6474
2 Cdk7 NaN NaN
3 Cdk8 NaN NaN
4 Cdx2 NaN NaN
5 Cenpa NaN NaN
6 Cenpc1 NaN NaN
7 Cenpe NaN NaN
8 Cenpj NaN NaN
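Note that drop_duplicates keeps the first row per gene, while the question wants the most recent year. One possible refinement (a sketch, not part of the answer above, starting again from the question's original df1 and df2):
import pandas as pd

# keep, for each gene in df2, the row with the largest year
latest = (df2.sort_values('year', ascending=False)
             .drop_duplicates('gene'))

# left merge preserves the row order of df1
df3 = pd.merge(df1.drop_duplicates('gene'), latest, on='gene', how='left')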

Not sure what type(df1) is, but:
In [1]: df1 = ['a', 'f', 'g']
In [2]: df2 = [['a', 7, True], ['g',8, False]]
In [3]: [[inner_item for inner_item in df2 if inner_item[0] == outer_item][0] if len([inner_item for inner_item in df2 if inner_item[0] == outer_item])>0 else [outer_item,None,None] for outer_item in df1]
Out[3]: [['a', 7, True], ['f', None, None], ['g', 8, False]]
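A more readable equivalent of that comprehension (a sketch, assuming each key appears at most once in df2, as in the example) builds a lookup dict first:
lookup = {row[0]: row for row in df2}
df3 = [lookup.get(gene, [gene, None, None]) for gene in df1]
# [['a', 7, True], ['f', None, None], ['g', 8, False]]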

Related

Panel Data - dealing with missing year when creating lead and lag variables

I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. The general look of panel data is as follows:
df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001,2002,2004,2005] + [2000,2002,2003] + [2001,2002,2003,2005],
                   'val1': [1,2,3,4,5,6,7,8,9,10,11],
                   'val2': [2,5,7,11,13,17,19,23,29,31,37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables that are grouped by name. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts up/down 1 row, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years with:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then I use a normal shift. However, my data is quite large, and this often triples the data size, which is not very efficient here.
I am looking for a better approach for this scenario.
The solution is to create check columns that flag whether the year is contiguous for the lag and the lead: set each check column to 1.0 or NaN, then multiply it into your normal groupby shift.
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None
To create the lag/lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
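The lead column follows the same pattern (a short sketch completing the approach above; the lead line is not part of the original timing):
df['val1_lag'] = df.groupby('name')['val1'].shift(1) * df['yearlag']
df['val1_lead'] = df.groupby('name')['val1'].shift(-1) * df['yearlead']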
You can compare it with the merge method below; the merge is much more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
Don't use shift but a merge with the year ± 1:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
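Why this works (an explanatory note, not from the original answer): with no on= argument, merge joins on all shared columns, here name and year, so adding 1 to year in the right-hand frame lines each row up with the same name's previous calendar year. The same thing with the keys spelled out explicitly:
shifted = df.assign(year=df['year'] + 1)  # equivalent to df.eval('year = year+1')
df['val1_lag'] = df[['name', 'year']].merge(shifted, on=['name', 'year'], how='left')['val1']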

Faster way to construct a multiindex of dates in Pandas

I have a Pandas dataframe, df. Here are the first five rows:
Id StartDate EndDate
0 0 2015-08-11 2018-07-13
1 1 2014-02-15 2016-01-25
2 2 2014-12-20 NaT
3 3 2015-01-09 2015-01-14
4 4 2014-07-20 NaT
I want to construct a new dataframe, df2. df2 should have a row for each month between StartDate and EndDate, inclusive, for each Id in df. For example, since the first row of df has StartDate in August 2015 and EndDate in July 2018, df2 should have rows corresponding to August 2015, September 2015, ..., July 2018. If an Id in df has no EndDate, we will take it to be June 2019.
I would like df2 to use a multiindex with the first level being the corresponding Id in df, the second level being the year, and the third level being the month. For example, if the above five rows were all of df, then df2 should look like:
Id Year Month
0 2015 8
9
10
11
12
2016 1
2
3
4
5
6
7
8
9
10
11
12
2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
... ... ...
4 2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
2
3
4
5
6
7
8
9
10
11
12
2019 1
2
3
4
5
6
The following code does the trick, but takes about 20 seconds on my decent laptop for 10k Ids. Can I be more efficient somehow?
import numpy as np
import pandas as pd

def build_multiindex_for_id_(id_, enroll_month, enroll_year, cancel_month, cancel_year):
    # Given id_ and start/end dates,
    # returns 2d array to be converted to multiindex.
    # Each row of returned array represents a month/year
    # between enroll date and cancel date inclusive.
    year = enroll_year
    month = enroll_month
    multiindex_array = [[], [], []]
    while (month != cancel_month) or (year != cancel_year):
        multiindex_array[0].append(id_)
        multiindex_array[1].append(year)
        multiindex_array[2].append(month)
        month += 1
        if month == 13:
            month = 1
            year += 1
    multiindex_array[0].append(id_)
    multiindex_array[1].append(year)
    multiindex_array[2].append(month)
    return np.array(multiindex_array)

# Begin by constructing array for first id.
array_for_multiindex = build_multiindex_for_id_(0, 8, 2015, 7, 2018)

# Append the rest of the multiindices for the remaining ids.
for _, row in df.loc[1:].fillna(pd.to_datetime('2019-06-30')).iterrows():
    current_id_array = build_multiindex_for_id_(
        row['Id'],
        row['StartDate'].month,
        row['StartDate'].year,
        row['EndDate'].month,
        row['EndDate'].year)
    array_for_multiindex = np.append(array_for_multiindex, current_id_array, axis=1)

df2_index = pd.MultiIndex.from_arrays(array_for_multiindex).rename(['Id', 'Year', 'Month'])
pd.DataFrame(index=df2_index)
Here's my approach after some trial and error:
(df.melt(id_vars='Id')
   .fillna(pd.to_datetime('June 2019'))
   .set_index('value')
   .groupby('Id').apply(lambda x: x.asfreq('M').ffill())
   .reset_index('value')
   .assign(year=lambda x: x['value'].dt.year,
           month=lambda x: x['value'].dt.month)
   .set_index(['year', 'month'], append=True)
)
Output:
value Id variable
Id year month
0 2015 8 2015-08-31 NaN NaN
9 2015-09-30 NaN NaN
10 2015-10-31 NaN NaN
11 2015-11-30 NaN NaN
12 2015-12-31 NaN NaN
2016 1 2016-01-31 NaN NaN
2 2016-02-29 NaN NaN
3 2016-03-31 NaN NaN
4 2016-04-30 NaN NaN
5 2016-05-31 NaN NaN
6 2016-06-30 NaN NaN
7 2016-07-31 NaN NaN
8 2016-08-31 NaN NaN
9 2016-09-30 NaN NaN
10 2016-10-31 NaN NaN
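A further option (a sketch, not from the answers above; it assumes the question's Id/StartDate/EndDate columns and a pandas version with DataFrame.explode, 0.25+) is to build one list of monthly periods per row and explode it, which avoids the repeated np.append copies of the loop-based code:
import pandas as pd

# Replace a missing EndDate with June 2019, as in the question.
end = df['EndDate'].fillna(pd.Timestamp('2019-06-30'))

# One list of monthly periods per row, then explode to one row per month.
months = [list(pd.period_range(s, e, freq='M')) for s, e in zip(df['StartDate'], end)]
tmp = pd.DataFrame({'Id': df['Id'].to_numpy(), 'month': months}).explode('month')

per = pd.PeriodIndex(tmp['month'], freq='M')
df2_index = pd.MultiIndex.from_arrays([tmp['Id'].to_numpy(), per.year, per.month],
                                      names=['Id', 'Year', 'Month'])
df2 = pd.DataFrame(index=df2_index)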

Filter Dates in Pandas

Currently have a dataset structured the following way:
id_number start_date end_date data1 data2 data3 ...
Basically, I have a whole bunch of id's with a certain date range and then multiple columns of summary data. My problem is that I need yearly totals of the summary data. This means I need to get to a place where I can groupby year on a single occurrence of each document. However, it is not guaranteed that a document exists for a given year, and the date ranges can span multiple years. Any help would be greatly appreciated, I am quite stuck.
Sample dataframe:
df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
df.start = pd.to_datetime(df['start'], format= "%m/%d/%Y")
df.end = pd.to_datetime(df['end'], format= "%m/%d/%Y")
Assuming we have a DataFrame df:
id_num start end value
0 1 2002-03-10 2005-04-12 1
1 1 2005-04-13 2005-05-20 2
2 1 2007-05-21 2009-08-10 3
3 2 2012-02-20 2015-02-20 4
4 3 2003-10-19 2012-12-12 5
we can create a row for each year for our start to end ranges with:
ys = [np.arange(x[0], x[1] + 1) for x in zip(df['start'].dt.year, df['end'].dt.year)]
df = (pd.DataFrame(ys, df.index)
        .stack()
        .astype(int)
        .reset_index(level=1, drop=True)
        .to_frame('year')
        .join(df, how='left')
        .reset_index())
print(df)
Here we're first creating the ys variable with the list of years for each start-end range from our DataFrame, and the df = ... is splitting these year lists into separate rows and joining back to the original DataFrame (very similar to what's done in this post: How to convert column with list of values into rows in Pandas DataFrame).
Output:
index year id_num start end value
0 0 2002 1 2002-03-10 2005-04-12 1
1 0 2003 1 2002-03-10 2005-04-12 1
2 0 2004 1 2002-03-10 2005-04-12 1
3 0 2005 1 2002-03-10 2005-04-12 1
4 1 2005 1 2005-04-13 2005-05-20 2
5 2 2007 1 2007-05-21 2009-08-10 3
6 2 2008 1 2007-05-21 2009-08-10 3
7 2 2009 1 2007-05-21 2009-08-10 3
8 3 2012 2 2012-02-20 2015-02-20 4
9 3 2013 2 2012-02-20 2015-02-20 4
10 3 2014 2 2012-02-20 2015-02-20 4
11 3 2015 2 2012-02-20 2015-02-20 4
12 4 2003 3 2003-10-19 2012-12-12 5
13 4 2004 3 2003-10-19 2012-12-12 5
14 4 2005 3 2003-10-19 2012-12-12 5
15 4 2006 3 2003-10-19 2012-12-12 5
16 4 2007 3 2003-10-19 2012-12-12 5
17 4 2008 3 2003-10-19 2012-12-12 5
18 4 2009 3 2003-10-19 2012-12-12 5
19 4 2010 3 2003-10-19 2012-12-12 5
20 4 2011 3 2003-10-19 2012-12-12 5
21 4 2012 3 2003-10-19 2012-12-12 5
Note:
I changed the original ranges to test cases where there are some years missing for some id_num, e.g. for id_num=1 we have years 2002-2005, 2005-2005 and 2007-2009, so we should not get 2006 for id_num=1 in the output (and we don't, so it passes the test)
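From here, the yearly totals the question asks for are a plain groupby (a sketch; 'value' stands in for the question's data1, data2, ... columns):
yearly_totals = df.groupby('year')['value'].sum()            # one total per year
yearly_by_id = df.groupby(['id_num', 'year'])['value'].sum()  # or per document and year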
I've taken your example and added some random values so we have something to work with:
df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
df.start = pd.to_datetime(df['start'], format= "%m/%d/%Y")
df.end = pd.to_datetime(df['end'], format= "%m/%d/%Y")
np.random.seed(0) # seeding the random values for reproducibility
df['value'] = np.random.random(len(df))
So far we have:
id_num start end value
0 1 2002-03-10 2005-04-12 0.548814
1 1 2005-04-13 2005-05-20 0.715189
2 1 2005-05-21 2009-08-10 0.602763
3 2 2012-02-20 2015-02-20 0.544883
4 3 2003-10-19 2012-12-12 0.423655
We want values at the end of the year for each given date, whether it is beginning or end. So we will treat all dates the same. We just want date + user + value:
tmp = df[['end', 'value']].copy()
tmp = tmp.rename(columns={'end':'start'})
new = pd.concat([df[['start', 'value']], tmp], sort=True)
new['id_num'] = pd.concat([df.id_num, df.id_num]) # doubling the id numbers
Giving us:
start value id_num
0 2002-03-10 0.548814 1
1 2005-04-13 0.715189 1
2 2005-05-21 0.602763 1
3 2012-02-20 0.544883 2
4 2003-10-19 0.423655 3
0 2005-04-12 0.548814 1
1 2005-05-20 0.715189 1
2 2009-08-10 0.602763 1
3 2015-02-20 0.544883 2
4 2012-12-12 0.423655 3
Now we can group by ID number and year:
new = new.groupby(['id_num', new.start.dt.year]).sum().reset_index(0).sort_index()
id_num value
start
2002 1 0.548814
2003 3 0.423655
2005 1 2.581956
2009 1 0.602763
2012 2 0.544883
2012 3 0.423655
2015 2 0.544883
And finally, for each user we expand the range to have every year in between, filling forward missing data:
new = (new.groupby('id_num')
          .apply(lambda x: x.reindex(pd.RangeIndex(x.index.min(), x.index.max() + 1))
                            .fillna(method='ffill'))
          .drop(columns='id_num'))
value
id_num
1 2002 0.548814
2003 0.548814
2004 0.548814
2005 2.581956
2006 2.581956
2007 2.581956
2008 2.581956
2009 0.602763
2 2012 0.544883
2013 0.544883
2014 0.544883
2015 0.544883
3 2003 0.423655
2004 0.423655
2005 0.423655
2006 0.423655
2007 0.423655
2008 0.423655
2009 0.423655
2010 0.423655
2011 0.423655
2012 0.423655

Reformatting a dataframe into a new output format

I have the output from a pivot table in a dataframe (df) which looks like:
Year Month sum
2005 10 -1.596817e+05
11 -2.521054e+05
12 5.981900e+05
2006 1 8.686413e+05
2 1.673673e+06
3 1.218341e+06
4 4.131970e+05
5 1.090499e+05
6 1.495985e+06
7 1.736795e+06
8 1.155071e+05
...
9 7.847369e+05
10 -5.564139e+04
11 -7.435682e+05
12 1.073361e+05
2017 1 3.427652e+05
2 3.574432e+05
3 5.026018e+04
Is there a way to reformat the dataframe so the output to console would look like:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
All the values would be populated in the new table as well.
Use unstack:
In [18]: df['sum'].unstack('Month')
Out[18]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN -159681.70 -252105.4 598190.0
2006.0 868641.3 1673673.0 1218341.00 413197.0 109049.9 1495985.0 1736795.0 115507.1 784736.9 -55641.39 -743568.2 107336.1
2017.0 342765.2 357443.2 50260.18 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Try df.pivot(index='year', columns='month', values='sum').
To fill the empty year cells (if they are empty), use df.fillna(method='ffill') before the above.
Reading the answer above, it should be mentioned that my suggestion works in cases where year and month aren't the index.
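For reference, a minimal self-contained sketch of the unstack route, assuming the pivot output is a 'sum' column indexed by (Year, Month) as shown in the question:
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2005, 10), (2005, 11), (2005, 12), (2006, 1), (2006, 2)],
    names=['Year', 'Month'])
s = pd.Series([-1.596817e+05, -2.521054e+05, 5.981900e+05, 8.686413e+05, 1.673673e+06],
              index=idx, name='sum')

wide = s.unstack('Month')  # Years become rows, Months become columns
print(wide)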

Get dataframe columns from a list using isin

I have a dataframe df1, and I have a list which contains names of several columns of df1.
df1:
User_id month day Age year CVI ZIP sex wgt
0 1 7 16 1977 2 NA M NaN
1 2 7 16 1977 3 NA M NaN
2 3 7 16 1977 2 DM F NaN
3 4 7 16 1977 7 DM M NaN
4 5 7 16 1977 3 DM M NaN
... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN
35546 35547 12 31 2002 10 RM F 14
35547 35548 12 31 2002 7 DO M 51
35548 35549 12 31 2002 5 NaN NaN NaN
list= [u"User_id", u"day", u"ZIP", u"sex"]
I want to make a new dataframe df2 which will contain only those columns which are in the list, and a dataframe df3 which will contain the columns which are not in the list.
Here I found that I need to do:
df2=df1[df1[df1.columns[1]].isin(list)]
But as a result I get:
Empty DataFrame
Columns: []
Index: []
[0 rows x 9 columns]
What am I doing wrong and how can I get the needed result? Why "9 columns" if it is supposed to be 4?
Your attempt filters rows, not columns: df1[df1.columns[1]].isin(list) tests whether the values of the second column appear in the list, nothing matches, so every row is dropped while all 9 columns remain. To split by column names instead, here is a solution with Index.difference:
L = [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[L]
df3 = df1[df1.columns.difference(df2.columns)]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Or:
df2 = df1[L]
df3 = df1[df1.columns.difference(pd.Index(L))]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Never name a list "list":
my_list= [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[df1.keys()[df1.keys().isin(my_list)]]
or
df2 = df1[df1.columns[df1.columns.isin(my_list)]]
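For df3 (the columns not in the list), the same pattern can simply be negated - a small addition not in the original answer:
df3 = df1[df1.columns[~df1.columns.isin(my_list)]]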
You can try:
df2 = df1[list] # it does a projection on the columns contained in the list
df3 = df1[[col for col in df1.columns if col not in list]]
