After using my script my algorithms return the expected outcome in a list of lists like this: pred = [['b','c','d'], ['b','a','u'], ... ['b','i','o']]
I already have a dataframe that needs those values added in a new matching column.
The flattened list has exactly as many entries as the dataframe has rows, so I just need to create a new column containing all the values from the inner lists.
However when I try to put the list into the column I get the error:
ValueError: Length of values does not match length of index
Looking at the data, it puts an entire inner list into one row instead of putting each entry in its own row.
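For reference, a minimal reproduction of that error, using the six-row frame from the edit below and two hypothetical inner lists:

```python
import pandas as pd

df = pd.DataFrame({'sent': [0, 0, 0, 1, 1, 1],
                   'token': ['a', 'b', 'b', 'a', 'b', 'c']})
pred = [['b', 'c', 'd'], ['b', 'a', 'u']]

# pandas sees only 2 values (one per inner list) for a 6-row index,
# so the assignment raises instead of spreading the tokens out
try:
    df['pred'] = pred
except ValueError as err:
    print(err)
```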
EDIT:
All values in the list should be put in the column named pred
sent token pred
0 a b
0 b c
0 b d
1 a b
1 b a
1 c u
Solution:
x = []
for sublist in pred:
    if sublist is not None:
        x += sublist  # += extends, so x ends up as one flat list
df_new = pd.DataFrame(df)
df_new["pred"] = x[:len(df_new.index)]
You can use itertools.chain to flatten the list of lists, then slice the result to the length of your dataframe.
Data from #ak_slick.
import pandas as pd
from itertools import chain
df = pd.DataFrame({'sent': [0, 0, 0, 1, 1, 1],
'token': ['a', 'b', 'b', 'a', 'b', 'c']})
lst = [['b','c','d'], ['b','a','u'], ['b','i','o']]
# filter(None, ...) also drops any None placeholders the inner lists might contain
df['pred'] = list(filter(None, chain.from_iterable(lst)))[:len(df.index)]
print(df)
sent token pred
0 0 a b
1 0 b c
2 0 b d
3 1 a b
4 1 b a
5 1 c u
import pandas as pd
# combine input lists
x = []
for sublist in [['b','c','d'], ['b','a','u'], ['b','i','o']]:
    x += sublist
# output into a single column
a = pd.Series(x)
# mock original dataframe
b = pd.DataFrame({'sent': [0, 0, 0, 1, 1, 1],
'token': ['a', 'b', 'b', 'a', 'b', 'c']})
# add column to existing dataframe
# this will avoid the mismatched-length error by ignoring anything longer
# than your original dataframe
b['pred'] = a
sent token pred
0 0 a b
1 0 b c
2 0 b d
3 1 a b
4 1 b a
5 1 c u
Related
I have a dataframe with words as its index and a corresponding sentiment score in another column. I also have a second dataframe with one column whose rows each hold a list of words (a token list). For each row, I want the average sentiment score of the tokens in its list. This has to be done for a huge number of rows, so efficiency is important.
One method I have in mind is given below:
import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
'''
df
                      tokens
0                  [a, b, c]
1  [hi, this, is, a, sample]
'''
def find_score(tokenlist, ref_df):
    # ref_df contains two columns, 'tokens' and 'sentiment_score'
    temp_df = pd.DataFrame()
    temp_df['tokens'] = tokenlist
    return temp_df.merge(ref_df, on='tokens', how='inner')['sentiment_score'].mean()
    # this should return the score

# Series.apply has no axis argument, and args must be a tuple
df['score'] = df['tokens'].apply(find_score, args=(ref_df,))
# each input for find_score will be a list
Is there any more efficient way to do it without creating dataframe for each list?
You can create a dictionary for mapping from the reference dataframe ref_df and then use .map() on each token list on each row of dataframe df, as follows:
import numpy as np

ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Demo
Test Data Construction
import numpy as np
import pandas as pd

a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'tokens': ['a', 'b', 'c', 'd', 'hi', 'this', 'is', 'sample', 'example'],
                       'sentiment_score': [1, 2, 3, 4, 11, 12, 13, 14, 15]})
print(df)
tokens
0 [a, b, c]
1 [hi, this, is, a, sample]
print(ref_df)
tokens sentiment_score
0 a 1
1 b 2
2 c 3
3 d 4
4 hi 11
5 this 12
6 is 13
7 sample 14
8 example 15
Run New Code
ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Output
print(df)
tokens score
0 [a, b, c] 2.0
1 [hi, this, is, a, sample] 10.2
Let's try explode, merge, and agg:
import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'sentiment_score': {'a': 1, 'b': 2,
'c': 3, 'hi': 4,
'this': 5, 'is': 6,
'sample': 7}})
# Explode Tokens into rows (Preserve original index)
new_df = df.explode('tokens').reset_index()
# Merge sentiment_scores
new_df = new_df.merge(ref_df, left_on='tokens',
right_index=True,
how='inner')
# Group By Original Index and agg back to lists and take mean
new_df = new_df.groupby('index') \
.agg({'tokens': list, 'sentiment_score': 'mean'}) \
.reset_index(drop=True)
print(new_df)
Output:
tokens sentiment_score
0 [a, b, c] 2.0
1 [a, hi, this, is, sample] 4.6
After Explode:
index tokens
0 0 a
1 0 b
2 0 c
3 1 hi
4 1 this
5 1 is
6 1 a
7 1 sample
After Merge
index tokens sentiment_score
0 0 a 1
1 1 a 1
2 0 b 2
3 0 c 3
4 1 hi 4
5 1 this 5
6 1 is 6
7 1 sample 7
(The one-liner)
new_df = df.explode('tokens') \
.reset_index() \
.merge(ref_df, left_on='tokens',
right_index=True,
how='inner') \
.groupby('index') \
.agg({'tokens': list, 'sentiment_score': 'mean'}) \
.reset_index(drop=True)
If the order of the tokens in the list matters, the scores can be calculated and merged back to the original df instead of using list aggregation:
mean_scores = df.explode('tokens') \
                .reset_index() \
                .merge(ref_df, left_on='tokens',
                       right_index=True,
                       how='inner') \
                .groupby('index')[['sentiment_score']].mean() \
                .reset_index(drop=True)
new_df = df.merge(mean_scores,
left_index=True,
right_index=True)
print(new_df)
Output:
tokens sentiment_score
0 [a, b, c] 2.0
1 [hi, this, is, a, sample] 4.6
I have n variables. Suppose n equals 3 in this case. I want to apply one function to all of the combinations (or permutations, depending on how you want to solve this) of the variables and store each result in the matching row and column of a dataframe.
import numpy as np
import pandas as pd

a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
df = pd.DataFrame({x: np.nan for x in indexes}, index=indexes)
If I apply sum(the function can be anything), then the result that I want to get is like this:
a b c
a 2 3 4
b 3 4 5
c 4 5 6
I can only think of iterating all the variables, apply the function one by one, and use the index of the iterators to set the value in the dataframe. Is there any better solution?
You can use apply and return a pd.Series to that effect. In such cases, pandas uses the Series index as the columns of the resulting dataframe.
s = pd.Series({"a": 1, "b": 2, "c": 3})
s.apply(lambda x: x+s)
Just note that the operation you do is between an element and a series.
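As a runnable sketch of the idea (variable names are my own):

```python
import pandas as pd

s = pd.Series({"a": 1, "b": 2, "c": 3})
# each element x is added to the whole Series; the returned Series'
# index ('a', 'b', 'c') becomes the columns of the result
df = s.apply(lambda x: x + s)
print(df)
#    a  b  c
# a  2  3  4
# b  3  4  5
# c  4  5  6
```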
If performance is important, I believe you need a broadcast sum of an array created from the variables:
import numpy as np
import pandas as pd

a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
arr = np.array([a, b, c])
df = pd.DataFrame(arr + arr[:, None], index=indexes, columns=indexes)
print(df)
a b c
a 2 3 4
b 3 4 5
c 4 5 6
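A variant of the same broadcast (my addition, not from the answer above): np.add.outer builds the identical pairwise-sum matrix and generalizes to other NumPy ufuncs such as np.multiply.outer:

```python
import numpy as np
import pandas as pd

indexes = ['a', 'b', 'c']
arr = np.array([1, 2, 3])

# outer() applies the ufunc to every pair (arr[i], arr[j])
df = pd.DataFrame(np.add.outer(arr, arr), index=indexes, columns=indexes)
print(df)
#    a  b  c
# a  2  3  4
# b  3  4  5
# c  4  5  6
```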
If I have a df such as this:
a b
0 1 3
1 2 4
I can use df['c'] = '' and df['d'] = -1 to add 2 columns and become this:
a b c d
0 1 3 -1
1 2 4 -1
How can I wrap this in a function, so I can apply that function to df and add all the columns at once, instead of adding them one by one separately as above? Thanks
Create a dictionary:
dictionary = {'c': '', 'd': -1}
def new_columns(df, dictionary):
    return df.assign(**dictionary)
then call it with your df:
df = new_columns(df, dictionary)
or just (if you don't need a function call; not sure what your use case is):
df.assign(**dictionary)
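A self-contained sketch of the assign(**dictionary) approach on the question's frame (the new_cols name is mine):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
new_cols = {'c': '', 'd': -1}

# assign(**mapping) adds every key as a new column in one call and
# returns a new DataFrame; scalar values are broadcast down the rows
df = df.assign(**new_cols)
print(df)
#    a  b c  d
# 0  1  3    -1
# 1  2  4    -1
```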
def update_df(a_df, new_cols_names, new_cols_vals):
    for n, v in zip(new_cols_names, new_cols_vals):
        a_df[n] = v

update_df(df, ['c', 'd', 'e'], ['', 5, 6])
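Run against the question's example frame, the loop-based helper mutates the dataframe in place; a self-contained sketch:

```python
import pandas as pd

def update_df(a_df, new_cols_names, new_cols_vals):
    # plain column assignment mutates the frame in place;
    # scalar values are broadcast to every row
    for n, v in zip(new_cols_names, new_cols_vals):
        a_df[n] = v

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
update_df(df, ['c', 'd', 'e'], ['', 5, 6])
print(df)
#    a  b c  d  e
# 0  1  3    5  6
# 1  2  4    5  6
```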
I want to, at the same time, create a new column in a pandas dataframe and set its first value to a list.
I want to transform this dataframe
df = pd.DataFrame.from_dict({'a':[1,2],'b':[3,4]})
a b
0 1 3
1 2 4
into this one
a b c
0 1 3 [2,3]
1 2 4 NaN
I tried :
df.loc[0, 'c'] = [2,3]
df.loc[0, 'c'] = np.array([2,3])
df.loc[0, 'c'] = [[2,3]]
df.at[0,'c'] = [2,3]
df.at[0,'d'] = [[2,3]]
It does not work.
How should I proceed?
If the first element of a series is a list, then the series must be of type object (not the most efficient for numerical computations). This should work, however.
df = df.assign(c=None)
df.loc[0, 'c'] = [2, 3]
>>> df
a b c
0 1 3 [2, 3]
1 2 4 None
If you really need the remaining values of column c to be NaNs instead of None, use this:
df.loc[1:, 'c'] = np.nan
The problem seems to have something to do with the dtype of the c column. If you convert it to 'object', you can set a cell to a list with iat, loc or at (set_value worked the same way, but it has since been removed from pandas).
df2 = (
    df.assign(c=np.nan)
      .assign(c=lambda x: x.c.astype(object))
)
df2.at[0, 'c'] = [2, 3]
print(df2)
   a  b       c
0  1  3  [2, 3]
1  2  4     NaN
I have a dataframe (df) that looks like:
name type cost
a apples 1
b apples 2
c oranges 1
d banana 4
e orange 6
Apart from using 2 for loops, is there a way to compare each name and type in the list against every other, keeping pairs where the name is not compared with itself (a vs a), the type is the same (apples vs apples), and the pair is not a repeat in reverse order (if a vs b is present, b vs a should not be)? The output should look like:
name1, name2, status
a b 0
c e 0
Where the first 2 elements are the names where the criteria match and the third element is always a 0.
I have tried to do this with 2 for loops (see below) but can't get it to reject say b vs a if we already have a vs b.
def pairListCreator(staticData):
    for x, row1 in df.iterrows():
        name1 = row1['name']
        type1 = row1['type']
        for y, row2 in df.iterrows():
            name2 = row2['name']
            type2 = row2['type']
            if name1 != name2 and type1 == type2:
                pairList = name1, name2, 0
Something like this
import pandas as pd
# Data
data = [['a', 'apples', 1],
['b', 'apples', 2],
['c', 'orange', 1],
['d', 'banana', 4],
['e', 'orange', 6]]
# Create Dataframe
df = pd.DataFrame(data, columns=['name', 'type', 'cost'])
df.set_index('name', inplace=True)
# Print DataFrame
print(df)
# Count number of rows
nr_of_rows = df.shape[0]
# Create result and compare
res_col_nam = ['name1', 'name2', 'status']
result = pd.DataFrame(columns=res_col_nam)
for i in range(nr_of_rows):
    x = df.iloc[i]
    for j in range(i + 1, nr_of_rows):
        y = df.iloc[j]
        if x['type'] == y['type']:
            temp = pd.DataFrame([[x.name, y.name, 0]], columns=res_col_nam)
            # DataFrame.append has been removed from pandas; use pd.concat
            result = pd.concat([result, temp])
# Reset the index
result.reset_index(inplace=True)
result.drop('index', axis=1, inplace=True)
# Print result
print('result:')
print(result)
Output:
type cost
name
a apples 1
b apples 2
c orange 1
d banana 4
e orange 6
result:
name1 name2 status
0 a b 0.0
1 c e 0.0
You can use a self join on column type first, then sort the values in the name columns of each row with apply(sorted).
Then drop the rows where both names are equal (boolean indexing), call drop_duplicates, and add the new column status with assign:
df = pd.merge(df, df, on='type', suffixes=('1', '2'))
names = ['name1', 'name2']
df[names] = df[names].apply(sorted, axis=1, result_type='expand')
df = (df[df.name1 != df.name2].drop_duplicates(subset=names)[names]
        .assign(status=0)
        .reset_index(drop=True))
print(df)
name1 name2 status
0 a b 0
1 c e 0
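As a further alternative not taken from either answer, itertools.combinations yields each unordered pair of rows exactly once, so the b-vs-a repeat can never be produced in the first place:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                   'type': ['apples', 'apples', 'orange', 'banana', 'orange'],
                   'cost': [1, 2, 1, 4, 6]})

# combinations() walks pairs with i < j only, so each unordered pair
# of rows appears exactly once and no dedup step is needed
pairs = [(x.name, y.name, 0)
         for x, y in combinations(df.itertuples(index=False), 2)
         if x.type == y.type]

result = pd.DataFrame(pairs, columns=['name1', 'name2', 'status'])
print(result)
#   name1 name2  status
# 0     a     b       0
# 1     c     e       0
```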