Applying values to a column and grouping all columns by those values - python

I have a pandas dataframe as shown here. All lines without a value in ["sente"] contain further information, but they are not yet linked to ["sente"].
id pos value sente
1 a I 21
2 b have 21
3 b a 21
4 a cat 21
5 d ! 21
6 cat N NaN
7 a My 22
8 a cat 22
9 b is 22
10 a cute 22
11 d . 22
12 cat N NaN
13 cute M NaN
Now I want each row without a value in ["sente"] to take its value from the row above. Then I want to group them all by ["sente"] and create a new column whose content comes from the rows without a value in ["sente"].
sente pos value content
21 a,b,b,a,d I have a cat ! 'cat,N'
22 a,a,b,a,d My cat is cute . 'cat,N','cute,M'
This would be my first step:
df.loc[(df['sente'] != df["sente"].shift(-1)) & df["sente"].isna(), "sente"] = df["sente"].shift(1)
but it only works for one additional row, not if there are 2 or more.
This groups up one column like I want it:
df.groupby(["sente"])['value'].apply(lambda x: " ".join(x))
But for more columns it doesn't work like I want:
df.groupby(["sente"]).agg(lambda x: ",".join(x))
Is there any way to do this without using stack functions?

Use:
import numpy as np

#boolean mask of NaN values
m = df['sente'].isnull()
#new column of joined columns, only where the mask is True
df['content'] = np.where(m, df['pos'] + ',' + df['value'], np.nan)
#set pos and value to NaN in those rows via the mask
df[['pos', 'value']] = df[['pos', 'value']].mask(m)
print (df)
id pos value sente content
0 1 a I 21.0 NaN
1 2 b have 21.0 NaN
2 3 b a 21.0 NaN
3 4 a cat 21.0 NaN
4 5 d ! 21.0 NaN
5 6 NaN NaN NaN cat,N
6 7 a My 22.0 NaN
7 8 a cat 22.0 NaN
8 9 b is 22.0 NaN
9 10 a cute 22.0 NaN
10 11 d . 22.0 NaN
11 12 NaN NaN NaN cat,N
12 13 NaN NaN NaN cute,M
Last, replace the NaNs by forward filling with ffill, then group and join each column, removing NaNs with dropna:
df1 = df.groupby(df["sente"].ffill()).agg(lambda x: " ".join(x.dropna()))
print (df1)
pos value content
sente
21.0 a b b a d I have a cat ! cat,N
22.0 a a b a d My cat is cute . cat,N cute,M
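For convenience, here is the whole thing as a self-contained sketch (the frame is rebuilt from the question's data; the explicit column selection keeps the numeric id and sente columns out of the string join, which newer pandas versions would reject rather than silently drop):
import numpy as np
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({
    'id': range(1, 14),
    'pos': ['a', 'b', 'b', 'a', 'd', 'cat', 'a', 'a', 'b', 'a', 'd', 'cat', 'cute'],
    'value': ['I', 'have', 'a', 'cat', '!', 'N', 'My', 'cat', 'is', 'cute', '.', 'N', 'M'],
    'sente': [21, 21, 21, 21, 21, np.nan, 22, 22, 22, 22, 22, np.nan, np.nan],
})

m = df['sente'].isnull()                             # mark the info rows
df['content'] = np.where(m, df['pos'] + ',' + df['value'], np.nan)
df[['pos', 'value']] = df[['pos', 'value']].mask(m)  # blank them out elsewhere
df1 = (df.groupby(df['sente'].ffill())[['pos', 'value', 'content']]
         .agg(lambda x: ' '.join(x.dropna())))
print(df1)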

Related

Insert rows from dataframeB to DataframeA with keys and without Merge

I have a dataframe with thousands of records:
ID to from Date price Type
1 69 18 2/2020 10 A
2 11 12 2/2020 5 A
3 18 10 3/2020 4 B
4 10 11 3/2020 10 A
5 12 69 3/2020 4 B
6 12 20 3/2020 3 B
7 69 21 3/2020 3 A
The output that I want is:
ID to from Date price Type ID to from Date price Type
1 69 18 2/2020 4 A 5 12 69 3/2020 4 B
1' 69 18 2/2020 6 A NaN NaN NaN NaN NaN NaN
2 11 12 2/2020 5 A NaN NaN NaN NaN NaN NaN
4 10 11 3/2020 4 A 3 18 10 3/2020 4 B
4' 10 11 3/2020 6 A NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN 6 12 20 3/2020 3 B
7 69 21 3/2020 3 A NaN NaN NaN NaN NaN NaN
The idea is to iterate over the rows: if the type is B, put the row next to the first record with type A whose to equals the B row's from. If the prices are equal, that's fine; if not, split the row with the higher price, and the new row carries the subtracted remainder.
I split the dataframe into types A and B, and I'm trying to iterate over both of them:
grp = df.groupby('Type')
transformed_df_list = []
for idx, frame in grp:
    frame.reset_index(drop=True, inplace=True)
    transformed_df_list.append(frame.copy())
A = transformed_df_list[0]
B = transformed_df_list[1]
for i, row in A.iterrows():
    for j, row1 in B.iterrows():
        if row['to'] == row1['from']:
            if row['price'] == row1['price']:
                row_df = pd.DataFrame([row1])
output = pd.merge(A, B, how='left', left_on=['to'], right_on=['from'])
The problem is that with the merge function I get several duplicate rows, and I can't check the price to split the row. Is there a way to insert the B rows into the A dataframe without the merge function?
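For what it's worth, here is a minimal sketch of the matching-and-splitting logic described above (assuming df holds the table from the question; it only covers the example's case where the A row carries the higher price, and real data would also need the symmetric case where the B row's price is higher):
import pandas as pd

A = df[df['Type'] == 'A'].reset_index(drop=True)
B = df[df['Type'] == 'B'].reset_index(drop=True)

pairs, used_b = [], set()
for _, a in A.iterrows():
    # first still-unused B row whose 'from' matches this A row's 'to'
    cand = B[(B['from'] == a['to']) & (~B.index.isin(used_b))]
    if cand.empty:
        pairs.append((a, None))
        continue
    b = cand.iloc[0]
    used_b.add(cand.index[0])
    if a['price'] == b['price']:
        pairs.append((a, b))
    else:
        matched = a.copy()
        matched['price'] = b['price']            # the part that pairs with B
        rest = a.copy()
        rest['ID'] = f"{a['ID']}'"
        rest['price'] = a['price'] - b['price']  # the leftover, unmatched part
        pairs.append((matched, b))
        pairs.append((rest, None))
for j, b in B.iterrows():
    if j not in used_b:
        pairs.append((None, b))                  # B rows with no A partner
# 'pairs' can then be laid out side by side to build the desired table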

How to loc 5 rows before and 5 rows after value 1 in column

I have a dataframe, and I want to select the 5 rows before and the 5 rows after each row where the flag value is 1.
df = pd.DataFrame({'A': [2, 1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
Expected output:
df1_before = pd.DataFrame({'A': [1, 3, 4, 7, 8],
                           'flag': [0, 0, 0, 0, 1]})
df1_after = pd.DataFrame({'A': [8, 11, 1, 15, 20],
                          'flag': [1, 1, 1, 0, 0]})
The same process should be repeated for all three rows where flag is 1.
I think one easy way is to loop over the index where the flag is 1 and select the rows you want with loc:
l = len(df)
for idx in df[df.flag.astype(bool)].index:
    dfb = df.loc[max(idx-4, 0):idx]
    dfa = df.loc[idx:min(idx+4, l)]
    #do stuff
The min and max functions ensure the boundaries are not overrun when a flag=1 occurs within the first or last 5 rows. Note also that because loc slicing is inclusive, if you want 5 rows you need to use +/-4 on idx to get the right segment.
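For instance, to collect the segments instead of just looking at them inside the loop (a small usage sketch based on the example df):
before_segments, after_segments = [], []
l = len(df)
for idx in df[df.flag.astype(bool)].index:
    before_segments.append(df.loc[max(idx-4, 0):idx])
    after_segments.append(df.loc[idx:min(idx+4, l)])

# before_segments[0] and after_segments[0] reproduce df1_before and
# df1_after from the question (rows 1-5 and 5-9 around the first flag)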
That said, depending on what your actual #do stuff is, you might want to change tactics. Let's say, for example, you want to calculate the difference between the sum of A over the 5 rows after and the 5 rows before. You could use rolling and shift:
df['roll'] = df.rolling(5)['A'].sum()
df.loc[df.flag.astype(bool), 'diff_roll'] = df['roll'].shift(-4) - df['roll']
print (df)
A flag roll diff_roll
0 2 0 NaN NaN
1 1 0 NaN NaN
2 3 0 NaN NaN
3 4 0 NaN NaN
4 7 0 17.0 NaN
5 8 1 23.0 32.0 #=55-23, where 55 is the sum of A over df1_after and 23 over df1_before
6 11 1 33.0 29.0
7 1 1 31.0 36.0
8 15 0 42.0 NaN
9 20 0 55.0 NaN
10 15 0 62.0 NaN
11 16 0 67.0 NaN
12 87 0 153.0 NaN

Fixing Indexing when Appending Dataframes

I am appending three CSVs:
df = pd.read_csv("places_1.csv")
temp = pd.read_csv("places_2.csv")
df = df.append(temp)
temp = pd.read_csv("places_3.csv")
df = df.append(temp)
print(df.head(20))
The joined table looks like:
location device_count population
0 A 11 NaN
1 B 12 NaN
2 C 13 NaN
3 D 14 NaN
4 E 15 NaN
0 F 21 NaN
1 G 22 NaN
2 H 23 NaN
3 I 24 NaN
4 J 25 NaN
0 K 31 NaN
1 L 32 NaN
2 M 33 NaN
3 N 34 NaN
4 O 35 NaN
As you can see, the indices are not unique.
When I run this loop to fill the population column with twice the device_count:
df2 = df.copy()
for index, row in df.iterrows():
    df.iloc[index, df.columns.get_loc('population')] = row['device_count'] * 2
I get the following erroneous result:
location device_count population
0 A 11 62.0
1 B 12 64.0
2 C 13 66.0
3 D 14 68.0
4 E 15 70.0
0 F 21 NaN
1 G 22 NaN
2 H 23 NaN
3 I 24 NaN
4 J 25 NaN
0 K 31 NaN
1 L 32 NaN
2 M 33 NaN
3 N 34 NaN
4 O 35 NaN
For each CSV, it overwrites the rows that share indices with the first CSV.
I have also tried creating a new column of integers and calling df.set_index(). That did not work.
Any tips?
First, use ignore_index; second, don't use append, use pd.concat([temp1, temp2, temp3], ignore_index=True).
As others have stated, you can use ignore_index, and you probably should use pd.concat here. Alternatively, for other situations where you are not combining DataFrames, you can also use df = df.reset_index(drop=True) to change the indices after the fact.
Additionally, you should avoid using iterrows() for reasons listed in the docs here. Using the following works way better:
df.loc[:, 'population'] = df.loc[:, 'device_count'].astype('int') * 2
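Putting the two suggestions together, a sketch of the full pipeline (using the same hypothetical file names as the question):
import pandas as pd

# read and combine with a fresh, unique 0..n-1 index
frames = [pd.read_csv(f"places_{i}.csv") for i in (1, 2, 3)]
df = pd.concat(frames, ignore_index=True)

# vectorized: no iterrows, no duplicate-index surprises
df['population'] = df['device_count'].astype('int') * 2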

Fill NaN based on group

I would like to fill each NaN with the mean of its group's values.
Example:
Groups Temp
1 5 27
2 5 23
3 5 NaN (will be replaced by 25)
4 1 NaN (will be replaced by the mean of the Temps that are in group 1)
Any suggestions? Thanks!
Use groupby and transform with a lambda function combining fillna and mean:
df = df.assign(Temp=df.groupby('Groups')['Temp'].transform(lambda x: x.fillna(x.mean())))
print(df)
Output:
   Groups  Temp
0       5  27.0
1       5  23.0
2       5  25.0
3       1   NaN
Row 3 stays NaN because group 1 has no non-NaN values to average.
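If you also want to fill the rows left NaN because their whole group is NaN (like row 3), a possible follow-up, sketched here, is to fall back to the overall mean afterwards:
# fall back to the global mean for groups whose values were all NaN
df['Temp'] = df['Temp'].fillna(df['Temp'].mean())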

Unmelt Pandas DataFrame

I have a pandas dataframe with two id variables:
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                   'num': [10, 10, 12, 13, 14, 15],
                   'q': ['a', 'b', 'd', 'a', 'b', 'z'],
                   'v': [2, 4, 6, 8, 10, 12]})
id num q v
0 1 10 a 2
1 1 10 b 4
2 1 12 d 6
3 2 13 a 8
4 2 14 b 10
5 3 15 z 12
I can pivot the table with:
df.pivot('id','q','v')
And end up with something close:
q a b d z
id
1 2 4 6 NaN
2 8 10 NaN NaN
3 NaN NaN NaN 12
However, what I really want is (the original unmelted form):
id num a b d z
1 10 2 4 NaN NaN
1 12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
2 14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
In other words:
'id' and 'num' are my indices (normally I've only seen either 'id' or 'num' as the index, but I need both since I'm trying to retrieve the original unmelted form)
'q' are my columns
'v' are my values in the table
Update
I found a close solution from Wes McKinney's blog:
df.pivot_table(index=['id','num'], columns='q')
v
q a b d z
id num
1 10 2 4 NaN NaN
12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
However, the format is not quite the same as what I want above.
You could use set_index and unstack
In [18]: df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
Out[18]:
q id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
You're really close, slaw. Just rename your column index to None and you've got what you want.
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel().rename(None)
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
Note that the 'v' column is expected to be numeric by default so that it can be aggregated. Otherwise, pandas will error out with:
DataError: No numeric types to aggregate
To resolve this, you can specify your own aggregation function using a custom lambda:
df2 = df.pivot_table(index=['id','num'], columns='q', aggfunc= lambda x: x)
You can remove the name q:
df1.columns=df1.columns.tolist()
Zero's answer plus removing q:
df1 = df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
df1.columns=df1.columns.tolist()
id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
This might work just fine:
Pivot:
df2 = df.pivot_table(index=['id', 'num'], columns='q').reset_index()
Concatenate the 1st-level column names with the 2nd:
df2.columns = [s1 + str(s2) for (s1, s2) in df2.columns.tolist()]
#columns become: id, num, va, vb, vd, vz
I came up with a close solution:
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel()
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
Still can't figure out how to drop 'q' from the dataframe.
It can be done in three steps:
#1: Prepare auxiliary column 'id_num':
df['id_num'] = df[['id', 'num']].apply(tuple, axis=1)
df = df.drop(columns=['id', 'num'])
#2: 'pivot' is almost an inverse of melt:
df, df.columns.name = df.pivot(index='id_num', columns='q', values='v').reset_index(), ''
#3: Bring back 'id' and 'num' columns:
df['id'], df['num'] = zip(*df['id_num'])
df = df.drop(columns=['id_num'])
This is the result, but with a different order of columns:
a b d z id num
0 2.0 4.0 NaN NaN 1 10
1 NaN NaN 6.0 NaN 1 12
2 8.0 NaN NaN NaN 2 13
3 NaN 10.0 NaN NaN 2 14
4 NaN NaN NaN 12.0 3 15
Alternatively, with the proper order:
def multiindex_pivot(df, columns=None, values=None):
    #inspired by: https://github.com/pandas-dev/pandas/issues/23955
    names = list(df.index.names)
    df = df.reset_index()
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    df = df.reset_index()  #me
    df.columns.name = ''  #me
    return df

df = df.set_index(['id', 'num'])
df = multiindex_pivot(df, columns='q', values='v')
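On a side note, if memory serves, newer pandas versions (1.1+) let pivot itself take a list of index columns, so starting from the original flat df the helper isn't needed:
# assumes pandas >= 1.1 and the original flat df from the question
df1 = df.pivot(index=['id', 'num'], columns='q', values='v').reset_index()
df1.columns.name = None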
