I am trying to generate a list of values grouped by 'Melder' and add that list as a column to my dataframe, but apply(list) doesn't work in conjunction with new_df.insert():
This works, but generates a new DataFrame with only the groupby values:
new_df2 = new_df.groupby('Melder')['SAG-Nummer'].apply(list)
This adds a column to my current dataframe, but the values are all NaN
Example:
my_df.insert(1,'Liste',my_df.groupby('Melder')['SAG-Nummer'].apply(list))
print(my_df)
SAG-Nummer Liste Melder
0 SAG-2001-0389 NaN Meyer
1 SAG-2001-0388 NaN Meyer
2 SAG-2001-1833 NaN Schmidt
3 SAG-2001-1836 NaN Berg
new_df2 = new_df.groupby('Melder')['SAG-Nummer'].apply(list)
print(new_df2)
Melder
Berg [SAG-2001-1836]
Meyer [SAG-2001-0389, SAG-2001-0388]
Schmidt [SAG-2001-1833]
Expected Result:
SAG-Nummer Liste Melder
0 SAG-2001-0389 [SAG-2001-0389, SAG-2001-0388] Meyer
1 SAG-2001-0388 [SAG-2001-0389, SAG-2001-0388] Meyer
2 SAG-2001-1833 [SAG-2001-1833] Schmidt
3 SAG-2001-1836 [SAG-2001-1836] Berg
Use the following transformation to expand the result of each group row-wise:
my_df.assign(Liste=my_df.groupby('Melder')['SAG-Nummer'].transform(lambda x: [x.values] * len(x)))
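An equivalent sketch of the same idea: build the per-group lists once with apply(list), then map them back onto each row via 'Melder'.
liste = my_df.groupby('Melder')['SAG-Nummer'].apply(list)
my_df['Liste'] = my_df['Melder'].map(liste)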
Related
I want to create 44 dataframe columns based on TAZ_1270 such that each column is shift(-1) of the previous column.
How can I do it instead of writing it 44 times?
df['m1']=df['TAZ_1270'].shift(-1)
df['m2']=df['m1'].shift(-1)
df['m3']=df['m2'].shift(-1)
Use DataFrame.assign with a dict comprehension.
Here is a minimal example with 4 shifts:
df = pd.DataFrame({'TAZ_1270': [100047, 100500, 100488, 100099]})
# TAZ_1270
# 0 100047
# 1 100500
# 2 100488
# 3 100099
df = df.assign(**{f'm{i}': df['TAZ_1270'].shift(-i) for i in range(1, 5)})
# TAZ_1270 m1 m2 m3 m4
# 0 100047 100500.0 100488.0 100099.0 NaN
# 1 100500 100488.0 100099.0 NaN NaN
# 2 100488 100099.0 NaN NaN NaN
# 3 100099 NaN NaN NaN NaN
Re: questions in comments
Why use **?
DataFrame.assign normally accepts the format df.assign(col1=foo, col2=bar, ...). When we use ** on a dict in a function call, it automatically unpacks the dict's 'col1': foo, 'col2': bar, ... pairs into col1=foo, col2=bar, ... arguments.
Why use f?
This is f-string syntax (introduced in Python 3.6). f'm{i}' is just a more concise version of 'm' + str(i).
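Putting both pieces together, a minimal sketch of how this generalizes to the original 44 columns (assuming the source column is 'TAZ_1270'):
cols = {f'm{i}': df['TAZ_1270'].shift(-i) for i in range(1, 45)}   # m1 .. m44
df = df.assign(**cols)   # same as df.assign(m1=..., m2=..., ..., m44=...)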
I need to blank out columns which contain only duplicated values within each Name group.
I cannot remove/drop a column for one group, because the same column could be useful for another group.
So when a specific column contains only duplicates within a group, I need to make that column empty (replace it with np.nan, for example).
my df:
Name,B,C,D
Adam,20,dog,cat
Adam,20,cat,elephant
Katie,21,cat,cat
Katie,21,cat,dog
Brody,22,dog,dog
Brody,21,cat,dog
expected output:
# grouping by Name; there are always exactly two rows per Name, no more, no less.
Name,B,C,D
Adam,np.nan,dog,cat
Adam,np.nan,cat,elephant
Katie,np.nan,np.nan,cat
Katie,np.nan,np.nan,dog
Brody,22,dog,np.nan
Brody,21,cat,np.nan
I know I should use the groupby() function and duplicated(), but I don't know what this approach should look like:
output=df[df.duplicated(keep=False)].groupby('Name')
output=output.replace({True:'np.nan'},regex=True)
Use GroupBy.transform with a lambda function and DataFrame.mask to replace the duplicated values:
df = df.set_index('Name')
output=df.mask(df.groupby('Name').transform(lambda x: x.duplicated(keep=False))).reset_index()
print (output)
Name B C D
0 Adam NaN dog cat
1 Adam NaN cat elephant
2 Katie NaN NaN cat
3 Katie NaN NaN dog
4 Brody 22.0 dog NaN
5 Brody 21.0 cat NaN
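An equivalent sketch that avoids moving Name into the index (assuming the value columns are B, C and D):
cols = ['B', 'C', 'D']
df[cols] = df[cols].mask(df.groupby('Name')[cols].transform(lambda x: x.duplicated(keep=False)))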
The problem is to build groups of rows that start at each non-null row and concatenate the strings in column 0 per group, without iterating over the indices with a for-loop after finding them.
I am able to do it, but only by slicing the df according to the not-null indices and then looping over them. I want this to be done without separately iterating over the indices.
import numpy as np
import pandas as pd

df = pd.DataFrame([["a", 2, 1], ["b", np.nan, np.nan], ["c", np.nan, np.nan],
                   ["d", 3, 4], ["e", np.nan, np.nan]])
list1 = []
indexes = df.dropna().index.values.tolist()
indexes.append(df.shape[0])
for i in range(len(indexes) - 1):
    list1.append("".join(df[0][indexes[i]:indexes[i + 1]].tolist()))
# list1 becomes ['abc', 'de']
This is the sample DF:
0 1 2
0 a 2.0 1.0
1 b NaN NaN
2 c NaN NaN
3 d 3.0 4.0
4 e NaN NaN
The expected output is a list like: ['abc', 'de']
Explanation :
first string
a: not null (start picking)
b: null
c: null
second string
d: not null (second not-null encountered, start the second string)
e: null
This is a case for cumsum:
# change all(axis=1) to any(axis=1) if only one NaN in a row is enough
s = df.iloc[:,1:].notnull().all(axis=1)
df[0].groupby(s.cumsum()).apply(''.join)
output:
1 abc
2 de
Name: 0, dtype: object
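To get the plain Python list the question asks for, a small follow-up (a sketch reusing the s from above):
out = df[0].groupby(s.cumsum()).apply(''.join).tolist()
# out == ['abc', 'de']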
I have two dataframes obtained with pandas and Python.
df1 = pd.DataFrame({'Index': ['0','1','2'], 'number':[3,'dd',1], 'people':[3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity':[3,2,'hi'], 'persons':[1,5,np.nan]})
I would like to sum the quantities of the columns based on Index. The columns do not have the same names and may contain strings. (I have in fact 50 columns in each df.) I want to consider NaN as 0. The result should look like:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how this could be done.
Note:
For sure a double while or for would do the trick, just not very elegant...
indices = 0
while indices < len(df.index) - 1:
    columna = 0
    while columna < numbercolumns - 1:
        df3.iloc[indices, columna] = df1.iloc[indices, columna] + df2.iloc[indices, columna]
        columna += 1
    indices += 1
Thank you.
You can try concatenating both dataframes, then summing based on the Index group:
df1.columns = df.columns
df1.people = pd.to_numeric(df1.people,errors='coerce')
pd.concat([df,df1]).groupby('Index').sum()
Out:
number people
Index
A 8 5.0
B 2 2.0
C 2 5.0
F 3 3.0
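Applied to the question's df1/df2, a sketch under these assumptions: df2 also gets 'Index' as its index, strings are coerced to NaN, and min_count=2 keeps NaN wherever either value is missing or non-numeric (which matches the expected df3):
df2 = df2.set_index('Index')
df2.columns = df1.columns                      # align column names so the frames stack
both = pd.concat([df1, df2]).apply(pd.to_numeric, errors='coerce')   # strings become NaN
df3 = both.groupby('Index').sum(min_count=2)   # need both values present, else NaN
#        number  people
# Index
# 0         6.0     4.0
# 1         NaN     NaN
# 2         NaN     NaN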
What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_7 ... atc_100 columns there will need to be. I just need to gather all associated atc_codes into one row for each visit_id.
This seems like a groupby and then a pivot, but I have tried many times and failed. I also tried to self-join à la SQL using pandas' merge(), but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!
Use cumcount first to count the values per group; this creates the column labels for pivot. Then add the missing columns with reindex_axis and change the column names with add_prefix. Last, reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = (pd.pivot(index=df['visit_id'], columns=g, values=df['atc_code'])
        .reindex_axis(range(1, 8), 1)
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
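reindex_axis has since been removed from pandas; here is a sketch of the same idea with current pandas, starting again from the original df with visit_id and atc_code columns and using set_index + unstack instead of pivot:
g = df.groupby('visit_id').cumcount() + 1
df = (df.set_index(['visit_id', g])['atc_code']
        .unstack()
        .reindex(columns=range(1, 8))
        .add_prefix('atc_')
        .reset_index())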