Sorry for the trivial question.
I am having troubles with selecting and replacing a value in list based on the values in another column. I have the following list:
Jack 0.794938 0
Marc 0.05155265 0
Eliza 0.96454115 0
Louis 0.075102 0
Milo 0.951499 0
Marc 0.63319 0
Michael 0.719391 0
Louis 0.502843 0
Eliza 0.620387 0
I would like to keep the first occurrence of each name with the third column taking the value of the second column of the second occurrence. So the result should be:
Jack 0.794938 0
Marc 0.05155265 0.63319
Eliza 0.96454115 0.620387
Louis 0.075102 0.502843
Milo 0.951499 0
Michael 0.719391 0
I am using this code:
res = []
already_added = set()
for e in a:
key1 = e[0]
if key1 not in already_added:
res.append(e)
from that point on I would like something like:
else:
res[res[:][0] == e[0]][2] = e[1]
or
else:
res[np.where(res[:][0] == e[0]][2])] = e[1]
But I keep getting the TypeError: list indices must be integers or slices, not list.
Can someone help me solve this?
Thanks
Edit: I corrected the indices
Here is a pure numpy solution. It sorts the records by first column to easily find duplicate names.
import numpy as np
data = """
Jack 0.794938 0
Marc 0.05155265 0
Eliza 0.96454115 0
Louis 0.075102 0
Milo 0.951499 0
Marc 0.63319 0
Michael 0.719391 0
Louis 0.502843 0
Eliza 0.620387 0
"""
data = (line.split() for line in data.strip().split('\n'))
data = np.array([(x, float(y), float(z)) for x, y, z in data], dtype=object)
res = data.copy()
idx = np.argsort(res[:, 0], kind='mergesort')
dupl = res[idx[:-1], 0] == res[idx[1:], 0]
res[idx[:-1][dupl], 2] = res[idx[1:][dupl], 1]
mask = np.ones(res.shape[:1], dtype=bool)
mask[idx[1:][dupl]] = False
res = res[mask]
Result:
# array([['Jack', 0.794938, 0.0],
# ['Marc', 0.05155265, 0.63319],
# ['Eliza', 0.96454115, 0.620387],
# ['Louis', 0.075102, 0.502843],
# ['Milo', 0.951499, 0.0],
# ['Michael', 0.719391, 0.0]], dtype=object)
You could use Pandas:
Load values into a dataframe, df:
csvfile = StringIO("""Jack 0.794938 0
Marc 0.05155265 0
Eliza 0.96454115 0
Louis 0.075102 0
Milo 0.951499 0
Marc 0.63319 0
Michael 0.719391 0
Louis 0.502843 0
Eliza 0.620387 0""")
df= pd.read_csv(csvfile, header=None, sep='\s\s+')
Then, use groupby and unstack:
df.groupby(0).apply(lambda x: pd.Series(x[1].tolist()))\
.unstack().add_prefix('value').reset_index()
Output:
0 value0 value1
0 Eliza 0.964541 0.620387
1 Jack 0.794938 NaN
2 Louis 0.075102 0.502843
3 Marc 0.051553 0.633190
4 Michael 0.719391 NaN
5 Milo 0.951499 NaN
Related
DOB Name
0 1956-10-30 Anna
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry
6 1972-05-04 Kate
In the dataframe similar to the one above where I have duplicate names. So I am want to add a suffix '_0' to the name if DOB is before 1990 and a duplicate name.
I am expecting a result like this
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I am using the following
df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0')
But I am getting this result
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 NaN
2 2001-09-09 NaN
3 1993-01-15 NaN
4 1999-05-02 NaN
5 1962-12-17 Jerry_0
6 1972-05-04 NaN
How can I add a suffix to the Name which is a duplicate and have to be born before 1990.
Problem in your df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0') is that df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))] is a filtered dataframe whose rows are less than the original. When you assign it back, the not filtered rows doesn't have corresponding value in the filtered dataframe, so it becomes NaN.
You can try mask instead
m = (df['DOB'] < '1990-01-01') & df['Name'].duplicated(keep=False)
df['Name'] = df['Name'].mask(m, df['Name']+'_0')
You can use masks and boolean indexing:
# is the year before 1990?
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# is the name duplicated?
m2 = df['Name'].duplicated(keep=False)
# if both conditions are True, add '_0' to the name
df.loc[m1&m2, 'Name'] += '_0'
output:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code to make a function for replace values:
df = {'Name':['al', 'el', 'naila', 'dori','jlo'],
'living':['Alvando','Georgia GG','Newyork NY','Indiana IN','Florida FL'],
'sample2':['malang','kaltim','ambon','jepara','sragen'],
'output':['KOTA','KAB','WILAYAH','KAB','DAERAH']
}
df = pd.DataFrame(df)
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0)
df = df.replace('KAB', 1)
But I am actually expecting this output with the simple code that doesn't repeat replace
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
I've tried using np.where but it doesn't give the desired result. all results display 0, but the original value is 1
df['output'] = pd.DataFrame({'output':np.where(df == "KAB", 1, 0).reshape(-1, )})
This code should work for you:
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0).replace('KAB', 1)
Output:
>>> df
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
If i have dataset like this:
id person_name salary
0 [alexander, william, smith] 45000
1 [smith, robert, gates] 65000
2 [bob, alexander] 56000
3 [robert, william] 80000
4 [alexander, gates] 70000
If we sum that salary column then we will get 316000
I really want to know how much person who named 'alexander, smith, etc' (in distinct) makes in salary if we sum all of the salaries from its splitting name in this dataset (that contains same string value).
output:
group sum_salary
alexander 171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william 125000 #sum from id 0 + 3
smith 110000 #sum from id 0 + 1
robert 145000 #sum from id 1 + 3
gates 135000 #sum from id 1 + 4
bob 56000 #sum from id 2
as we see the sum of sum_salary columns is not the same as the initial dataset. all because the function requires double counting.
I thought it seems familiar like string count, but what makes me confuse is the way we use aggregation function. I've tried creating a new list of distinct value in person_name columns, then stuck comes.
Any help is appreciated, Thank you very much
Solutions working with lists in column person_name:
#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')
print (type(df.loc[0, 'person_name']))
<class 'list'>
First idea is use defaultdict for store sumed values in loop:
from collections import defaultdict
d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
for x in p:
d[x] += int(s)
print (d)
defaultdict(<class 'int'>, {'alexander': 171000,
'william': 125000,
'smith': 110000,
'robert': 145000,
'gates': 135000,
'bob': 56000})
And then:
df1 = pd.DataFrame({'group':list(d.keys()),
'sum_salary':list(d.values())})
print (df1)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another solution with repeating values by length of lists and aggregate sum:
from itertools import chain
df1 = pd.DataFrame({
'group' : list(chain.from_iterable(df['person_name'].tolist())),
'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})
df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another sol:
df_new=(pd.DataFrame({'person_name':np.concatenate(df.person_name.values),
'salary':df.salary.repeat(df.person_name.str.len())}))
print(df_new.groupby('person_name')['salary'].sum().reset_index())
person_name salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
Can be done concisely with dummies though performance will suffer due to all of the .str methods:
df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()
#alexander 171000
#bob 56000
#gates 135000
#robert 145000
#smith 110000
#william 125000
#dtype: int64
I parsed this as strings of lists, by copying OP's data and using pandas.read_clipboard(). In case this was indeed the case (a series of strings of lists), this solution would work:
df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars = 'salary').drop(columns='variable')
# Some cleaning up, then a simple groupby
df.value = df.value.str.replace('[', '')
df.value = df.value.str.replace(']', '')
df.value = df.value.str.replace(' ', '')
df.groupby('value')['salary'].sum()
Output:
value
alexander 171000
bob 56000
gates 135000
robert 145000
smith 110000
william 125000
Another way you can do this is with iterrows(). This will not be as fast jezraels solution. But it works:
ids = []
names = []
salarys = []
# Iterate over the rows and extract the names from the lists in person_name column
for ix, row in df.iterrows():
for name in row['person_name']:
ids.append(row['id'])
names.append(name)
salarys.append(row['salary'])
# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id':ids,
'names':names,
'salary':salarys})
# Groupby on person_name and get the sum
print(df_new.groupby('names').salary.sum().reset_index())
Output
names salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
I have two dataframes I am working with, one which contains a list of players and another that contains play by play data for the players from the other dataframe. Portions of the rows of interest within these two dataframes are shown below.
0 Matt Carpenter
1 Jason Heyward
2 Peter Bourjos
3 Matt Holliday
4 Jhonny Peralta
5 Matt Adams
...
Name: Name, dtype: object
0 Matt Carpenter grounded out to second (Grounder).
1 Jason Heyward doubled to right (Liner).
2 Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
What I am trying to do is create a column in the first dataframe that counts the number of occurrences of the string (df['Name'] + ' scored') in the column in the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", etc. I know you can use str.contains to do this type of thing, but it only seems to work if you put in the explicit string. For example,
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
it returns the error 'Series' objects are mutable, thus they cannot be hashed. I have looked at various similar questions but cannot find the solution to this problem for the life of me. Any assistance on this would be greatly appreciated, thank you!
I think need findall by regex with join all values of Name, then create indicator columns by MultiLabelBinarizer and add all missing columns by reindex:
s = df1['Name'] + ' scored'
pat = r'\b{}\b'.format('|'.join(s))
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
columns=mlb.classes_,
index=df2.index).reindex(columns=s, fill_value=0)
print (df)
Name Matt Carpenter scored Jason Heyward scored Peter Bourjos scored \
0 0 0 0
1 0 0 0
2 0 1 0
Name Matt Holliday scored Jhonny Peralta scored Matt Adams scored
0 0 0 0
1 0 0 0
2 0 0 0
Last if necessary join to df1:
df = df2.join(df)
print (df)
Play Matt Carpenter scored \
0 Matt Carpenter grounded out to second (Grounder). 0
1 Jason Heyward doubled to right (Liner). 0
2 Matt Holliday singled to right (Liner). Jason ... 0
Jason Heyward scored Peter Bourjos scored Matt Holliday scored \
0 0 0 0
1 0 0 0
2 1 0 0
Jhonny Peralta scored Matt Adams scored
0 0 0
1 0 0
2 0 0
I have a dictionary of states (example IA:Idaho). I have loaded the dictionary into a DataFrame bystate_df.
then I am importing a CSV with states deaths that I want to add them to the bystate_df as I read the lines:
byState_df = pd.DataFrame(states.items())
byState_df['Deaths'] = 0
df['Deaths'] = df['Deaths'].convert_objects(convert_numeric=True)
print byState_df
for index, row in df.iterrows():
if row['Area'] in states:
byState_df[(byState_df[0] == row['Area'])]['Deaths'] = row['Deaths']
print byState_df
but the byState_df is still 0 afterwords:
0 1 Deaths
0 WA Washington 0
1 WI Wisconsin 0
2 WV West Virginia 0
3 FL Florida 0
4 WY Wyoming 0
5 NH New Hampshire 0
6 NJ New Jersey 0
7 NM New Mexico 0
8 NA National 0
I test row['Deaths'] while it iterates and it's producing the correct values, it just seem to be setting the byState_df value incorrectly.
Can you try the following code where I use .loc instead of [][].
byState_df = pd.DataFrame(states.items())
byState_df['Deaths'] = 0
df['Deaths'] = df['Deaths'].convert_objects(convert_numeric=True)
print byState_df
for index, row in df.iterrows():
if row['Area'] in states:
byState_df.loc[byState_df[0] == row['Area'], 'Deaths'] = row['Deaths']
print byState_df