Insert python list in all rows as new pd.DataFrame column - python

I have a python list:
my_list = [1, 'V']
I have a pd.DataFrame:
A B C
0 f v b
1 f i n
2 f i m
I need to create a new column in my dataframe whose value in every row is my_list:
A B C D
0 f v b [1, 'V']
1 f i n [1, 'V']
2 f i m [1, 'V']
As far as I understand, python lists can be cell values, because df.groupby with apply(list) produces them:
df = df.groupby(['A', 'B'], group_keys=True)['C'].apply(list).reset_index(name='H')
A B H
0 f i [n, m]
1 f v [b]
Is it possible without converting my_list's type? What is the easiest way to do that?
I tried:
df['D'] = my_list
df['D'] = pd.Series(my_list)
but they did not meet my expectations

You can assign the list once per row; the number of rows can be read off the shape of the dataframe.
my_list = [1, 'V']
df = pd.DataFrame({'col1': ['f', 'f', 'f'], 'col2': ['v', 'i', 'i'], 'col3': ['b', 'n', 'm']})
df['new_col'] = [my_list] * df.shape[0]
This repeats my_list once for every row of the DataFrame. (Note that np.repeat(my_list, df.shape[0]) does not work here: it flattens the list and repeats each element, producing an array of length 6 rather than three list-valued cells.)
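If the DataFrame does not have the default RangeIndex, a variant that aligns explicitly (a small sketch of my own, not part of the original answer) is:
df['new_col'] = pd.Series([my_list] * len(df), index=df.index)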

You can do it by creating a new array with my_list through hstack and then forming a new DataFrame. The code below has been tested and works fine.
import numpy as np
import pandas as pd
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.array([1, 'V']).repeat(3).reshape(2,3).transpose()
df = pd.DataFrame(np.hstack((a1,a2)))
Edit: Another snippet that has been tested is:
import pandas as pd
import numpy as np
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.squeeze(np.dstack((np.array(1).repeat(3), np.array('V').repeat(3))))
df = pd.DataFrame(np.hstack((a1,a2)))

Related

Pandas Dataframe explode List, add new columns and count values

I'm a little bit stuck. I have a DataFrame with a list in a column.
id  list
1   [a, b]
2   [a, a, a, b]
3   [c, b, b]
4   [c, a]
5   [f, f, b]
The possible values are a, b, c, d, e, f in general.
I want to count when two values appear in a list together, and also when a value appears more than once in the same list.
I want to use these counts to create a heatmap with all values on the x and y axes, showing e.g. how many times a appears in a list with itself, or how many times a and b appear together.
I tried this so far, but it is not exactly the solution I want.
Make new columns and count values:
df['a'] = df['list'].explode().str.contains('a').groupby(level=0).any().astype('int')
df['b'] = df['list'].explode().str.contains('b').groupby(level=0).any().astype('int')
df['c'] = df['list'].explode().str.contains('c').groupby(level=0).any().astype('int')
df['d'] = df['list'].explode().str.contains('d').groupby(level=0).any().astype('int')
df['e'] = df['list'].explode().str.contains('e').groupby(level=0).any().astype('int')
df['f'] = df['list'].explode().str.contains('f').groupby(level=0).any().astype('int')
Here I hit the first problem: these new columns only flag whether a value occurs in the list, so I also get a count when the value appears just once in the list.
make x axis
df_explo = df.explode(['list'],ignore_index=True)
Get the sum of each count column:
df2 = df_explo.groupby(['list']).agg({'a': 'sum', 'b': 'sum', 'c': 'sum', 'd': 'sum', 'e': 'sum', 'f': 'sum'}).reset_index()
set index to list
df3 = df2.set_index('list')
create heatmap
sns.heatmap(df3,cmap='RdYlGn_r', linewidths=0.5,annot=True,fmt="d")
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from itertools import combinations
data = [
    ['a', 'b'],
    ['a', 'a', 'a', 'b'],
    ['b', 'b', 'b'],
    ['c', 'a'],
    ['f', 'f', 'b']
]
letters = ['a', 'b', 'c', 'd', 'e', 'f']
duplicate_occurrences = pd.DataFrame(0, index=[0], columns=letters)
co_occurrences = pd.DataFrame(0, index=letters, columns=letters)
for l in data:
    duplicates = [k for k, v in Counter(l).items() if v > 1]
    for d in duplicates:
        duplicate_occurrences[d] += 1
    co = list(combinations(set(l), 2))
    for a, b in co:
        co_occurrences.loc[a, b] += 1
        co_occurrences.loc[b, a] += 1
plt.figure(figsize=(7, 1))
sns.heatmap(duplicate_occurrences, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
plt.title('Duplicate Occurrence Counts')
plt.show()
sns.heatmap(co_occurrences, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
plt.title('Co-Occurrence Counts')
plt.show()
The first plot shows how often each letter occurs at least twice in a list, the second shows how often each pair of letters occurs together in a list.
In case you want to plot the duplicate occurrences on the diagonal, you could do it e.g. as follows:
df = pd.DataFrame(0, index=letters, columns=letters)
for l in data:
    for k, v in Counter(l).items():
        if v > 1:
            df.loc[k, k] += 1
    for a, b in combinations(set(l), 2):
        df.loc[a, b] += 1
        df.loc[b, a] += 1
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
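As a vectorized alternative (a sketch of my own, not from the answer above), you can build a 0/1 membership matrix and take a matrix product to get the pair counts, assuming only presence (not multiplicity) matters for co-occurrence, and reusing data and letters as defined above:
import numpy as np
import pandas as pd
# 0/1 membership matrix: one row per list, one column per letter
member = pd.DataFrame([[int(x in l) for x in letters] for l in data], columns=letters)
pair_counts = member.T.dot(member)       # [i, j] = number of lists containing both i and j
np.fill_diagonal(pair_counts.values, 0)  # the diagonal held per-letter list counts; zero it to keep only pairs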

Manage the missing value in a dataframe with string and number

I have a dataframe with some string columns and some number columns, and I want to handle the missing values.
I want to replace the NaN values with the mean of each row.
I looked at similar questions on this site, but they differ from mine, e.g. this one: Pandas Dataframe: Replacing NaN with row average
If all the values in a row are NaN, I want to delete that row. I have also provided a sample case as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan,np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan,np.nan]
The expected result is:
df = pd.DataFrame()
df['id'] = ['a', 'b', 'n']
df['md'] = ['d', 'e', 'l']
df['c1'] = [2, 6, 5]
df['c2'] = [0, 5, 3]
df['c3'] = [8, 7,4]
df
Note:
I have used the following code; however, it is very slow, and for a big dataframe it takes a very long time to run.
index_colum = df.columns.get_loc("c1")
df_withno_id = df.iloc[:,index_colum:]
rowsidx_with_all_NaN = df_withno_id[df_withno_id.isnull().all(axis=1)].index.values
df = df.drop(df.index[rowsidx_with_all_NaN])
for i, cols in df_withno_id.iterrows():
    if i not in rowsidx_with_all_NaN:
        endsidx = len(cols)
        extract_data = list(cols[0:endsidx])
        mean = np.nanmean(extract_data)
        fill_nan = [mean for x in extract_data if np.isnan(x)]
        df.loc[i] = df.loc[i].replace(np.nan, mean)
Can anybody help me with this? thanks.
First, select only the float columns. Second, drop the rows where all of these columns are NaN. Finally, transpose the dataframe (float columns only), fill each NaN with its column mean (which is the original row mean), and transpose back.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan,np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan,np.nan]
numeric_cols = df.select_dtypes(include='float64').columns
df.dropna(how = 'all', subset = numeric_cols, inplace = True)
df[numeric_cols] = df[numeric_cols].T.fillna(df[numeric_cols].T.mean()).T
df
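The double transpose can also be avoided (a sketch of my own, not part of the answer above) by computing the row means once and letting fillna align on the index:
row_means = df[numeric_cols].mean(axis=1)
# fillna with a Series aligns on the row index, so each NaN gets its own row's mean
df[numeric_cols] = df[numeric_cols].apply(lambda col: col.fillna(row_means))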

Python pandas: elegant division within dataframe

I'm new on stackoverflow and have switched from R to python. I'm trying to do something that is probably not too difficult, and while I can get there by butchering, I'm wondering what the most pythonic way to do it is. I want to divide certain values in a column (E where F=a) by values further down in the same column (E where F=b), using column D as a lookup:
import pandas as pd
df = pd.DataFrame({'D':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1], 'E':[10,20,30,40,50,100, 250, 250, 360, 567, 400],'F':['a', 'a', 'a', 'a', 'a', 'b','b', 'b', 'b', 'b', 'c']})
print(df)
out = pd.DataFrame({'D': [1, 2, 3, 4, 5], 'a/b': [0.1, 0.08, 0.12, 0.1111, 0.0881]})
print(out)
Can anyone help write this nicely?
I'm not entirely sure what you mean by "using column D as a lookup", since there is no need for such a lookup in the example you provided.
However, the quick and dirty way to achieve the output you provided is:
output = pd.DataFrame({'a/b': df[df['F'] == 'a']['E'].values / df[df['F'] == 'b']['E'].values})
output['D'] = df['D']
which makes the output:
a/b D
0 0.100000 1
1 0.080000 2
2 0.120000 3
3 0.111111 4
4 0.088183 5
Look up with .loc in a pandas dataframe, as df.loc[rows, columns], where the conditions for rows and columns are True:
import numpy as np
# get the unique values from column D; sorted() makes the order deterministic
idx = sorted(set(df['D']))
# A is an array of values with 'F'=a
A = np.array([df.loc[(df['F']=='a') & (df['D']==i),'E'].values[0] for i in idx])
# B is an array of values with 'F'=b
B = np.array([df.loc[(df['F']=='b') & (df['D']==i),'E'].values[0] for i in idx])
# Now divide to build your new dataframe of ratios
out = pd.DataFrame(np.vstack([A/B,idx]).T, columns = ['a/b','D'])
Instead of using numpy.vstack, you can build the frame directly:
out = pd.DataFrame({'a/b': A/B, 'D': idx})
with the same result. I tried to do it in a single line (for no reason whatsoever).
Got it:
df = df.set_index('D')
out = df.loc[(df['F'] == 'a'), 'E'] / df.loc[(df['F'] == 'b'), 'E']
out = out.reset_index()
Thanks for your thoughts - I got inspired.
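For reference, a pivot-based sketch (my own phrasing of the same idea, assuming each (D, F) pair occurs at most once, as in the sample data) reads naturally as well:
wide = df.pivot(index='D', columns='F', values='E')  # one column per F value, indexed by D
out = (wide['a'] / wide['b']).rename('a/b').reset_index()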

Implementing k nearest neighbours from distance matrix?

I am trying to do the following:
Given a DataFrame of distances, I want to identify the k nearest neighbours of each element.
Example:
A B C D
A 0 1 3 2
B 5 0 2 2
C 3 2 0 1
D 2 3 4 0
If k=2, it should return:
A: B D
B: C D
C: D B
D: A B
Distances are not necessarily symmetric.
I am thinking there must be something somewhere that does this efficiently using Pandas DataFrames, but I cannot find anything.
Homemade code is also very welcome! :)
Thank you!
The way I see it, I simply find the k + 1 smallest distances in each row and remove the 0, which then gives you the k nearest neighbours. Keep in mind that the code will not work if there are zero distances off the diagonal! Only the diagonal entries are allowed to be 0.
import pandas as pd
import numpy as np
X = pd.DataFrame([[0, 1, 3, 2],[5, 0, 2, 2],[3, 2, 0, 1],[2, 3, 4, 0]])
X.columns = ['A', 'B', 'C', 'D']
X.index = ['A', 'B', 'C', 'D']
X = X.T
for i in X.index:
    Y = X.nsmallest(3, i)
    Y = Y.T
    Y = Y[Y.index.str.startswith(i)]
    Y = Y.loc[:, Y.any()]
    for j in Y.index:
        print(i + ": ", list(Y.columns))
This prints out:
A: ['B', 'D']
B: ['C', 'D']
C: ['D', 'B']
D: ['A', 'B']
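For larger matrices, a vectorized sketch (my own, not from the answer above) using numpy.argsort, assuming the only zero in each row is the self-distance on the diagonal:
import numpy as np
import pandas as pd
X = pd.DataFrame([[0, 1, 3, 2], [5, 0, 2, 2], [3, 2, 0, 1], [2, 3, 4, 0]],
                 index=list('ABCD'), columns=list('ABCD'))
k = 2
# sort each row's distances; position 0 is the zero self-distance, so take the next k
# (kind='stable' keeps ties in column order)
order = np.argsort(X.values, axis=1, kind='stable')[:, 1:k + 1]
neighbours = {row: list(X.columns[order[i]]) for i, row in enumerate(X.index)}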

Create a matrix from a list of key-value pairs

I have a list of numpy arrays containing name-value pairs, where both names and values are strings. Each name and each value can occur multiple times in the list, and I would like to convert the list to a binary matrix.
The columns represent the values and the rows represent the names; a field set to 1 indicates that that particular name-value pair occurs.
E.g. I have:
A : aa
A : bb
A : cc
B : bb
C : aa
and I want to convert it to:
aa bb cc
A 1 1 1
B 0 1 0
C 1 0 0
I have some code that does this, but I was wondering whether there is an easier/out-of-the-box way of doing it with numpy or some other library.
This is my code so far:
resources = set(result[:, 1])
resourcesDict = {}
i = 0
for r in resources:
    resourcesDict[r] = i
    i = i + 1
clients = set(result[:, 0])
clientsDict = {}
i = 0
for c in clients:
    clientsDict[c] = i
    i = i + 1
arr = np.zeros((len(clientsDict), len(resourcesDict)), dtype='bool')
for line in result[:, 0:2]:
    arr[clientsDict[line[0]], resourcesDict[line[1]]] = True
and result contains the following:
array([['a', 'aa'], ['a', 'bb'], ..]
I feel that using Pandas.DataFrame.pivot is the best way
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6]})
>>> df
foo bar baz
0 one A 1
1 one B 2
2 one C 3
3 two A 4
4 two B 5
5 two C 6
Or
you can load your pair list using
>>> df = pd.read_csv('ratings.csv')
Then
>>> df.pivot(index='foo', columns='bar', values='baz')
A B C
one 1 2 3
two 4 5 6
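For the binary matrix in the question specifically, pd.crosstab is close to an out-of-the-box fit (a sketch, assuming the pairs sit in a two-column structure as in the question; the column names 'name' and 'value' are mine):
import pandas as pd
pairs = pd.DataFrame([['A', 'aa'], ['A', 'bb'], ['A', 'cc'], ['B', 'bb'], ['C', 'aa']],
                     columns=['name', 'value'])
# crosstab counts occurrences; clip(upper=1) collapses repeated pairs to a 0/1 indicator
mat = pd.crosstab(pairs['name'], pairs['value']).clip(upper=1)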
You probably have something like:
m_dict = {'A': ['aa', 'bb', 'cc'], 'B': ['bb'], 'C': ['aa']}
I would go like this:
from collections import defaultdict
res = {}
for k, v in m_dict.items():
    res[k] = defaultdict(int)
    for col in v:
        res[k][col] = 1
Edit: given your format, it would probably be more in the line of:
m_array = [['A', 'aa'], ['A', 'bb'], ['A', 'cc'], ['B', 'bb'], ['C', 'aa']]
res = defaultdict(lambda: defaultdict(int))
for k, v in m_array:
    res[k][v] = 1
which both give:
>>> res['A']['aa']
1
>>> res['B']['aa']
0
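To materialize the nested dict as the binary matrix from the question, one option (my own addition, not part of the original answer) is:
import pandas as pd
# dict-of-dicts: outer keys become columns, so transpose to put names on the rows
out = pd.DataFrame(res).T.fillna(0).astype(int)
fillna(0) fills in the name-value pairs that never occurred.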
This is a job for np.unique. It is not clear what format your data is in, but you need to get two 1-D arrays, one with the keys, another with the values, e.g.:
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                ['B', 'bb'], ['C', 'aa']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
out = np.zeros((len(rows), len(cols)), dtype=int)
out[row_idx, col_idx] += 1
>>> out
array([[1, 1, 1],
[0, 1, 0],
[1, 0, 0]])
>>> rows
array(['A', 'B', 'C'],
dtype='|S2')
>>> cols
array(['aa', 'bb', 'cc'],
dtype='|S2')
If you have no repeated key-value pairs, this code will work just fine. If there are repetitions, I would suggest abusing scipy's sparse module:
import scipy.sparse as sps
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                ['B', 'bb'], ['C', 'aa'], ['A', 'bb']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
out = sps.coo_matrix((np.ones_like(row_idx), (row_idx, col_idx))).A
>>> out
array([[1, 2, 1],
[0, 1, 0],
[1, 0, 0]])
d = {'A': ['aa', 'bb', 'cc'], 'C': ['aa'], 'B': ['bb']}
rows = 'ABC'
cols = ('aa', 'bb', 'cc')
print('  ', ' '.join(cols))
for row in rows:
    print(row, ' ', end='')
    for col in cols:
        print(' 1' if col in d.get(row, []) else ' 0', end='')
    print()
   aa bb cc
A   1 1 1
B   0 1 0
C   1 0 0
