Merge two pandas dataframes, with lists in every cell - python

I want to merge two dataframes, with the resulting dataframe having a list in every single cell. I'm completely lost on how to do this.
My current solution uses the index of each dataframe to build a dict (e.g. dict[index[0]]['DEPTH'] = []) and then loops over the rows of the dataframes to append to the dict keys (e.g. dict[index[0]]['DEPTH'].append(cell_value)), but I suspect that's very inefficient and slow.
Does a pandas solution exist that would get this done?
df1 and df2 would each look like the result below, but with a single scalar value per cell. The resulting df would look something like this:
                 DEPTH       A
chr1~10007022~C  [1, 1]  [0, 0]
chr1~10007023~T  [1, 1]  [0, 0]
...
chr1~10076693~T  [1, 1]  [0, 0]
Keep in mind:
the indexes of the dataframes will probably differ, but not always.
the dataframes will probably contain >100M rows each.

You could concatenate the two, group by the item, and then aggregate with list.
import pandas as pd

# two sample frames keyed by 'item'
df = pd.DataFrame({'item': ['chr1-10007022-C', 'chr1-10007023-T'],
                   'DEPTH': [1, 1],
                   'A': [0, 0],
                   'C': [0, 0]})
df = df.set_index('item')

df2 = pd.DataFrame({'item': ['chr1-10007022-C', 'chr1-10007026-X'],
                    'DEPTH': [1, 1],
                    'A': [0, 0],
                    'C': [0, 0]})
df2 = df2.set_index('item')

# stack the frames, then collect each column's values per index label into a list
out = pd.concat([df, df2]).groupby(level=0).agg(list)
Output
                  DEPTH       A       C
item
chr1-10007022-C  [1, 1]  [0, 0]  [0, 0]
chr1-10007023-T     [1]     [0]     [0]
chr1-10007026-X     [1]     [0]     [0]
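If you only want keys that appear in both frames (so every cell holds a two-element list, one value from each frame), a minimal sketch with the frames above; common and both are hypothetical names introduced here:
# keep only index labels present in both frames before concatenating
common = df.index.intersection(df2.index)
both = pd.concat([df.loc[common], df2.loc[common]]).groupby(level=0).agg(list)
With the sample data above, only chr1-10007022-C survives.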

Related

Faster way to set new df value using np.array index values in dataframe

I need to set the value of a new pandas df column based on the NumPy array index, also stored in the df. This works, but it is pretty slow with a large df. Any tips on how to speed things up?
import numpy as np
import pandas as pd

a = np.random.random((5, 5))
df = pd.DataFrame(np.array([[1, 1], [3, 3], [2, 2], [3, 2]]), columns=['i', 'j'])

# (row, col) tuples of zero-based indices into a
df['ij'] = df.apply(lambda x: (int(x['i'] - 1), int(x['j'] - 1)), axis=1)

# slow: one scalar lookup per row
for idx, r in df.iterrows():
    df.loc[idx, 'new'] = a[r['ij']]
With NumPy indexing:
inds = df[["i", "j"]].to_numpy() - 1
df["new"] = a[inds[:, 0], inds[:, 1]]
where we index into a using the first column of inds for the row positions and the second column for the column positions, giving:
>>> a
array([[0.27494719, 0.17706064, 0.71306907, 0.94776026, 0.04024955],
       [0.56557293, 0.63732559, 0.12254121, 0.53177861, 0.48435987],
       [0.33299644, 0.43459935, 0.57227818, 0.96142159, 0.79794503],
       [0.80112425, 0.52816002, 0.01885327, 0.39880301, 0.51974912],
       [0.60377461, 0.24419486, 0.88203753, 0.87263663, 0.49345361]])
>>> inds
array([[0, 0],
       [2, 2],
       [1, 1],
       [2, 1]])
>>> df
   i  j       new
0  1  1  0.274947
1  3  3  0.572278
2  2  2  0.637326
3  3  2  0.434599
For the ij column, you can do df["ij"] = inds.tolist().
Coming from the NumPy side, you could convert the row/column pairs into flat indices into a with ravel_multi_index, and then look them up with np.take:
df["new"] = np.take(a, np.ravel_multi_index([df.i - 1, df.j - 1], a.shape))

Can I create a column where each row is a running list in a Pandas data frame using groupby?

Imagine I have a Pandas DataFrame:
import pandas as pd

# create df
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [5, 4, 6, 3, 2, 3]})
Let's assume it is ordered by 'id' and an imaginary, not shown, date column (ascending).
I want to create another column where each row holds the running list of 'val' values seen so far (up to that date) within its 'id'.
The ending DataFrame will look like this:
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [5, 4, 6, 3, 2, 3],
                   'val_list': [[5], [5, 4], [5, 4, 6], [3], [3, 2], [3, 2, 3]]})
I don't want to use a loop because the actual df I am working with has about 4 million records. I imagine I would use a lambda function in conjunction with groupby, something like this:
df['val_list'] = df.groupby('id')['val'].apply(lambda x: x.runlist())
This raises an AttributeError because the runlist() method does not exist, but I am thinking the solution would be something like this.
Does anyone know what to do to solve this problem?
Let us try
# wrap each value in a one-element list; cumsum then concatenates the lists within each group
df['new'] = df.val.map(lambda x: [x]).groupby(df.id).apply(lambda x: x.cumsum())
Out[138]:
0          [5]
1       [5, 4]
2    [5, 4, 6]
3          [3]
4       [3, 2]
5    [3, 2, 3]
Name: val, dtype: object
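If relying on cumsum over an object-dtype Series feels fragile, a plain accumulator per group is a simple alternative sketch; it does loop, but only once over the values in plain Python, and it assumes the frame is already ordered within each id as stated in the question:
# build the running lists group by group with an explicit accumulator
result = []
for _, group in df.groupby('id', sort=False):
    acc = []
    for v in group['val']:
        acc = acc + [v]  # new list each step so rows don't share state
        result.append(acc)
df['val_list'] = result
Because the frame is sorted by id and sort=False preserves first-appearance order, the concatenated per-group results line up with the original rows.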

Comparing two data frames columns and assigning Zero and One

I have a dataframe and a list which contains some of the column names from my dataframe, as follows:
my_frame:
col1, col2, col3, ..., coln
2, 3, 4, ..., 2
5, 8, 5, ..., 1
6, 1, 8, ..., 9
my_list:
['col1','col3','coln']
Now, I want to create an array the size of my original dataframe's column count, consisting only of zeros and ones. Basically, the array should contain 1 if the column name appears in my_list, otherwise 0. My desired output should look like this:
my_array = [1, 0, 1, 0, 0, ..., 1]
This should help you:
import pandas as pd

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}
df = pd.DataFrame(dictt)

my_list = ['a', 'h', 'g', 'c']
my_array = []
# append 1 if the column name appears in my_list, else 0
for column in df.columns:
    if column in my_list:
        my_array.append(1)
    else:
        my_array.append(0)
print(my_array)
Output:
[1, 0, 1]
If you want my_array to be a NumPy array instead of a list, use this:
import pandas as pd
import numpy as np

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}
df = pd.DataFrame(dictt)

my_list = ['a', 'h', 'g', 'c']
my_array = np.empty(0, dtype=int)
# append 1 if the column name appears in my_list, else 0
for column in df.columns:
    if column in my_list:
        my_array = np.append(my_array, 1)
    else:
        my_array = np.append(my_array, 0)
print(my_array)
Output:
[1 0 1]
I have used test data in my code for easier understanding; you can replace the test dataframe with your actual dataframe. Hope this helps!
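As a side note, the whole loop can be replaced with a vectorized one-liner: Index.isin returns a boolean mask over the columns, which casts directly to 0/1. A minimal sketch with the same test data:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
my_list = ['a', 'h', 'g', 'c']

# True where the column name appears in my_list, cast to 0/1 ints
my_array = df.columns.isin(my_list).astype(int)
print(my_array)
Output:
[1 0 1]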

Row-wise difference between two lists in pandas

I am using pandas to incrementally find new elements, i.e. for every row, I check whether the values in its list have been seen before. If they have, we ignore them; if not, we select them.
I was able to do this using df.iterrows(), but I have >1M rows, so I believe a vectorized apply might be better.
Here's sample data and code. Once you run this code, you will get expected output:
import collections.abc
import pandas as pd
from numpy import nan as NA

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'A', 'D', 'E', 'E', 'E'],
                   'Value': [1, 2, 3, 4, 3, 5, 2, 3, 7, 2, 3, 9]})

# wrap all elements by group in a list
Changed_df = df.groupby('ID')['Value'].apply(list).reset_index()
Changed_df = Changed_df.rename(columns={'Value': 'Elements'})
Changed_df = Changed_df.reset_index(drop=True)

def flatten(l):
    # recursively yield scalars from arbitrarily nested iterables
    for el in l:
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

Changed_df["Elements_s"] = Changed_df['Elements'].shift()

# attempt 1: for loop
Changed_df["Diff"] = NA
Changed_df["count"] = 0
Elements_so_far = []
# replace NA with an empty list in columns that will go through list operations
for col in ["Elements", "Elements_s", "Diff"]:
    Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])
for idx, row in Changed_df.iterrows():
    diff = list(set(row['Elements']) - set(Elements_so_far))
    Changed_df.at[idx, "Diff"] = diff
    Elements_so_far.append(row['Elements'])
    Elements_so_far = list(set(flatten(Elements_so_far)))  # keep unique elements
    Changed_df.loc[idx, "count"] = len(diff)
Commentary about the code:
I am not a fan of this code because it's clunky and inefficient.
I say inefficient because I created Elements_s, which holds shifted values, and because of the for loop through the rows.
Elements_so_far keeps track of all the elements we have discovered so far. If a new element shows up, we record it in the Diff column.
We also keep track of the number of newly discovered elements in the count column.
I'd appreciate it if an expert could help me with a vectorized version of the code.
I did try a vectorized version, but I couldn't get very far.
#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)
I was inspired by How to compare two columns both with list of strings and create a new column with unique items? to write the above, but I couldn't make it work. The linked SO thread does row-wise differences among columns.
I am using Python 3.6.7 via Anaconda; the pandas version is 0.23.4.
You could sort, then use NumPy to get the unique indexes, and then construct your groupings, e.g.:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)
Out[]:
ID
A    [1, 2, 3, 4, 5]
D                [7]
E                [9]
Name: Value, dtype: object
Or to get close to your current output:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])
pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)
Out[]:
          Elements             Diff  Count
ID
A  [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B           [2, 3]               []      0
C              [3]               []      0
D              [7]              [7]      1
E        [2, 3, 9]              [9]      1
One alternative using drop_duplicates and groupby:
# Groupby and apply list func.
df1 = df.groupby('ID')['Value'].apply(list).to_frame('Elements')
# Sort values , drop duplicates by Value column then use groupby.
df1['Diff'] = df.sort_values(['ID','Value']).drop_duplicates('Value').groupby('ID')['Value'].apply(list)
# Use str.len for count.
df1['Count'] = df1['Diff'].str.len().fillna(0).astype(int)
# To fill NaN with empty list
df1['Diff'] = df1.Diff.apply(lambda x: x if type(x)==list else [])
          Elements             Diff  Count
ID
A  [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B           [2, 3]               []      0
C              [3]               []      0
D              [7]              [7]      1
E        [2, 3, 9]              [9]      1
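Another option, sketched here under the assumption that groups already appear in the desired chronological order (as in the question), is a single pass with a running set of seen values; it is not vectorized, but it visits each value once and needs no shifted helper column:
seen = set()
records = []
for gid, group in df.groupby('ID', sort=False):
    vals = list(group['Value'])
    # unique values in first-seen order that have not appeared in any earlier group
    new = [v for v in dict.fromkeys(vals) if v not in seen]
    seen.update(vals)
    records.append({'ID': gid, 'Elements': vals, 'Diff': new, 'count': len(new)})
out = pd.DataFrame(records).set_index('ID')
This gives the same Diff and count values as above, except that each Diff keeps first-seen order rather than set order.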

Get row index from DataFrame row

Is it possible to get the row number (i.e. "the ordinal position of the index value") of a DataFrame row without adding an extra column that contains the row number (the index can be arbitrary, i.e. even a MultiIndex)?
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
>>> result = df[df.a > 3]
>>> result.iloc[0]
a 4
Name: 2, dtype: int64
# but how can I get the original row index of iloc[0] in df?
I could have done df['row_index'] = range(len(df)) which would maintain the original row number, but I am wondering if Pandas has a built-in way of doing this.
Access the .name attribute and use get_loc:
In [10]:
df.index.get_loc(result.iloc[0].name)
Out[10]:
2
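If you need the ordinal positions of several rows at once, Index.get_indexer does the same lookup in one vectorized call, e.g. with the frame above:
In [11]:
# positions of all filtered rows in the original frame
df.index.get_indexer(result.index)
Out[11]:
array([2, 4, 5])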
Looking at this from a different side:
for r in df.itertuples():
    r.Index   # the index label of the current row
where df is the data frame. You may want to add a conditional to grab the index only when a condition is met.
