Numpy array as single element of pandas dataframe - python

I have a numpy array and an empty dataframe:
element = numpy.array([1,2,3])
df = pandas.DataFrame(columns = ["Col"])
I want to insert element in the first row of df. The following code:
df["Col"] = element
Gives me a 3x1 dataframe whose elements are 1, 2 and 3. I want a 1x1 dataframe whose single element is the array. How can I get this result?
Thanks in advance!

Use DataFrame.loc or DataFrame.at to set the array at a specific label:
df.loc[0, "Col"] = element
print (df)
         Col
0  [1, 2, 3]
df.at[0, "Col"] = element

Wrap element in a list.
>>> df['Col'] = [element]
>>> df
         Col
0  [1, 2, 3]

Related

Merge two pandas dataframes, as lists in every cell

I want to merge 2 dataframes, with the resulting dataframe having a list in every single cell. I'm completely lost on how to do this.
My current solution is using the index of each dataframe to build a dict (eg. dict[index[0]]['DEPTH'] = []), and then looping over rows of the dataframes to append to dict keys (eg. dict[index[0]]['DEPTH'].append(cell_value)), but I'm thinking that's super inefficient and slow.
Does a pandas solution exist that would get this done?
df1 would look like this:
df2 would look like this:
Resulting df would look something like this:
                  DEPTH       A
chr1~10007022~C  [1, 1]  [0, 0]
chr1~10007023~T  [1, 1]  [0, 0]
...
chr1~10076693~T  [1, 1]  [0, 0]
Keep in mind:
indexes of dataframe would probably differ, but not always.
dataframes will probably contain >100M rows each
You could concatenate the two, groupby the item and then agg with list.
import pandas as pd

df = pd.DataFrame({'item': ['chr1-10007022-C', 'chr1-10007023-T'],
                   'DEPTH': [1, 1],
                   'A': [0, 0],
                   'C': [0, 0]})
df = df.set_index('item')

df2 = pd.DataFrame({'item': ['chr1-10007022-C', 'chr1-10007026-X'],
                    'DEPTH': [1, 1],
                    'A': [0, 0],
                    'C': [0, 0]})
df2 = df2.set_index('item')

out = pd.concat([df, df2]).groupby(level=0).agg(list)
Output
                  DEPTH       A       C
item
chr1-10007022-C  [1, 1]  [0, 0]  [0, 0]
chr1-10007023-T     [1]     [0]     [0]
chr1-10007026-X     [1]     [0]     [0]
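Applied to the question's own index labels (with hypothetical DEPTH/A values, since the original frames weren't shown), the same concat/groupby/agg pattern gives:

```python
import pandas as pd

df1 = pd.DataFrame({'DEPTH': [1, 1], 'A': [0, 0]},
                   index=['chr1~10007022~C', 'chr1~10007023~T'])
df2 = pd.DataFrame({'DEPTH': [1, 1], 'A': [0, 0]},
                   index=['chr1~10007022~C', 'chr1~10076693~T'])

out = pd.concat([df1, df2]).groupby(level=0).agg(list)

# Rows present in both frames get two-element lists, the rest one-element lists.
assert out.loc['chr1~10007022~C', 'DEPTH'] == [1, 1]
assert out.loc['chr1~10007023~T', 'DEPTH'] == [1]
```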

How to index a dataframe using a condition on a column that is a column of numpy arrays?

I currently have a pandas dataframe that has a column of values that are numpy arrays. I am trying to get the rows of the dataframe where the value of the column is an empty numpy array but I can't index using the pandas method.
Here is an example dataframe.
data = {'Name': ['A', 'B', 'C', 'D'], 'stats': [np.array([1,1,1]), np.array([]), np.array([2,2,2]), np.array([])]}
df = pd.DataFrame(data)
I am trying to just get the rows where 'stats' is None, but when I try df[df['stats'] is None] I just get a KeyError: False.
How can I filter by rows that contain an empty list?
Additionally, how can I filter by row where the numpy array is something specific? i.e. get all rows of df where df['stats'] == np.array([1, 1, 1])
Thanks
You can check the length with Series.str.len, because it works with all iterables:
print (df['stats'].str.len())
0    3
1    0
2    3
3    0
Name: stats, dtype: int64
And then filter, e.g. rows with len=0:
df = df[df['stats'].str.len().eq(0)]
#alternative
#df = df[df['stats'].apply(len).eq(0)]
print (df)
  Name stats
1    B    []
3    D    []
If you need to test for a specific array, you can compare tuples:
df = df[df['stats'].apply(tuple) == tuple(np.array([1, 1, 1]))]
print (df)
  Name      stats
0    A  [1, 1, 1]
for this question:
"Additionally, how can I filter by row where the numpy array is something specific? i.e. get all rows of df where df['stats'] == np.array([1, 1, 1])"
data = {'Name': ['A', 'B', 'C', 'D'], 'stats': [np.array([1,1,1]), np.array([]), np.array([2,2,2]), np.array([])]}
df = pd.DataFrame(data)
df = df[df['stats'].apply(lambda x: np.array_equal(x, np.array([1,1,1])))]
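Both filters side by side on the example data (a sketch; np.array_equal is the safer equality test, since == on arrays of different lengths fails to broadcast):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'stats': [np.array([1, 1, 1]), np.array([]),
                             np.array([2, 2, 2]), np.array([])]})

# Rows whose array is empty.
empty = df[df['stats'].str.len().eq(0)]

# Rows whose array equals a specific target, shape-safe.
target = np.array([1, 1, 1])
match = df[df['stats'].apply(lambda x: np.array_equal(x, target))]

assert empty['Name'].tolist() == ['B', 'D']
assert match['Name'].tolist() == ['A']
```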

Faster way to set new df value using np.array index values in dataframe

I need to set the value of a new pandas df column based on the NumPy array index, also stored in the df. This works, but it is pretty slow with a large df. Any tips on how to speed things up?
a = np.random.random((5, 5))
df = pd.DataFrame(np.array([[1, 1], [3, 3], [2, 2], [3, 2]]), columns=['i', 'j'])
df['ij'] = df.apply(lambda x: (int(x['i'] - 1), int(x['j'] - 1)), axis=1)
for idx, r in df.iterrows():
    df.loc[idx, 'new'] = a[r['ij']]
With NumPy indexing:
inds = df[["i", "j"]].to_numpy() - 1
df["new"] = a[inds[:, 0], inds[:, 1]]
where we index into a along rows with numbers in inds' first column and columns with its second column.
to get
>>> a
array([[0.27494719, 0.17706064, 0.71306907, 0.94776026, 0.04024955],
       [0.56557293, 0.63732559, 0.12254121, 0.53177861, 0.48435987],
       [0.33299644, 0.43459935, 0.57227818, 0.96142159, 0.79794503],
       [0.80112425, 0.52816002, 0.01885327, 0.39880301, 0.51974912],
       [0.60377461, 0.24419486, 0.88203753, 0.87263663, 0.49345361]])
>>> inds
array([[0, 0],
       [2, 2],
       [1, 1],
       [2, 1]])
>>> df
   i  j       new
0  1  1  0.274947
1  3  3  0.572278
2  2  2  0.637326
3  3  2  0.434599
For the ij column, you can do df["ij"] = inds.tolist().
Coming from the NumPy side, you could convert the (i, j) pairs into flat indices into a using ravel_multi_index:
df["new"] = np.take(a, np.ravel_multi_index([df.i - 1, df.j - 1], a.shape))

Comparing two data frames columns and assigning Zero and One

I have a dataframe and a list, which contains some of the column names from my dataframe, as follows:
my_frame:
col1, col2, col3, ..., coln
2, 3, 4, ..., 2
5, 8, 5, ..., 1
6, 1, 8, ..., 9
my_list:
['col1','col3','coln']
Now, I want to create an array the size of my original dataframe (the total number of columns) consisting only of zeros and ones: 1 if the column name appears in "my_list", otherwise 0. My desired output should look like this:
my_array = [1, 0, 1, 0, 0, ..., 1]
This should help u:
import pandas as pd

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}
df = pd.DataFrame(dictt)
my_list = ['a', 'h', 'g', 'c']
my_array = []
for column in df.columns:
    if column in my_list:
        my_array.append(1)
    else:
        my_array.append(0)
print(my_array)
Output:
[1, 0, 1]
If u wanna use my_array as a numpy array instead of a list, then use this:
import pandas as pd
import numpy as np

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}
df = pd.DataFrame(dictt)
my_list = ['a', 'h', 'g', 'c']
my_array = np.empty(0, dtype=int)
for column in df.columns:
    if column in my_list:
        my_array = np.append(my_array, 1)
    else:
        my_array = np.append(my_array, 0)
print(my_array)
Output:
[1 0 1]
I have used test data in my code for easier understanding. You can replace the test data with your actual data (i.e. replace my test dataframe with your actual dataframe). Hope that this helps!
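As an aside, the loop can be collapsed into one vectorized line with Index.isin (same test data as above):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
my_list = ['a', 'h', 'g', 'c']

# Boolean mask over the column names, cast to 0/1.
my_array = df.columns.isin(my_list).astype(int)
print(my_array)  # [1 0 1]
```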

Get row index from DataFrame row

Is it possible to get the row number (i.e. "the ordinal position of the index value") of a DataFrame row without adding an extra row that contains the row number (the index can be arbitrary, i.e. even a MultiIndex)?
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
>>> result = df[df.a > 3]
>>> result.iloc[0]
a    4
Name: 2, dtype: int64
# but how can I get the original row index of iloc[0] in df?
I could have done df['row_index'] = range(len(df)) which would maintain the original row number, but I am wondering if Pandas has a built-in way of doing this.
Access the .name attribute and use get_loc:
In [10]:
df.index.get_loc(result.iloc[0].name)
Out[10]:
2
Looking at this from a different side:
for r in df.itertuples():
    getattr(r, 'Index')
Where df is the data frame. You may want to add a conditional to collect the index only when a condition is met.
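Putting the get_loc answer together as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
result = df[df.a > 3]

# .name gives the index label; get_loc maps it to the ordinal position.
pos = df.index.get_loc(result.iloc[0].name)
assert pos == 2
```

With a default RangeIndex the label and position coincide, but get_loc also works for arbitrary indexes, including a MultiIndex.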
