Check value on merged dataframe and make changes on original dataframe - python

I have a merged_dataframe whose id column comes from dataframe X and whose value column comes from dataframe Y.
I want to drop ids like A, whose last row has value 1.
How do I do this so that the A rows are also dropped from dataframe X?
id value
A 0
A 1
B 0
C 0
To check the last value of each group, is it done by using
merged_dataframe = merged_dataframe.groupby('id').nth(-1)
get_last_value = merged_dataframe['value']

Here's one method of doing it, i.e.
import numpy as np

# take the last row of each id group; True where that last value is 1
mask = df.groupby('id', as_index=False)['value'].nth(-1) == 1
# mark those last rows as NaN, then drop them
df.loc[mask[mask].index, 'value'] = np.nan
ndf = df.dropna()
Output:
id value
0 A 0.0
2 B 0.0
3 C 0.0
If you have a dataframe like
id value
0 A 1.0
1 A 0.0
2 A 1.0
3 B 1.0
4 B 0.0
5 B 1.0
6 C 0.0
Then the output is:
id value
0 A 1.0
1 A 0.0
3 B 1.0
4 B 0.0
6 C 0.0
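If the goal is to drop every row of an offending id from the original X dataframe (as the question asks), one option is to collect those ids from the merged frame and filter X with isin. This is only a sketch; the contents of X and merged_dataframe below are made up for illustration:
import pandas as pd

# hypothetical stand-ins for the original frames
X = pd.DataFrame({'id': ['A', 'A', 'B', 'C'], 'other': [10, 11, 12, 13]})
merged_dataframe = pd.DataFrame({'id': ['A', 'A', 'B', 'C'], 'value': [0, 1, 0, 0]})

# ids whose last merged value is 1
last = merged_dataframe.groupby('id')['value'].last()
ids_to_drop = last[last == 1].index

# drop every row with those ids from the original X dataframe
X = X[~X['id'].isin(ids_to_drop)]
print(X)   # the A rows are gone; B and C remain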

Related

How to rank the categorical values while one-hot-encoding

I have the data like this.
id  feature_1  feature_2
1   a          e
2   b          c
3   c          d
4   d          b
5   e          a
I want a one-hot-encoded-like result where the category from feature_1 is represented by 1 and the category from feature_2 by 0.5, like the following table.
id  a    b    c    d    e
1   1    0    0    0    0.5
2   0    1    0.5  0    0
3   0    0    1    0.5  0
4   0    0.5  0    1    0
5   0.5  0    0    0    1
But when applying sklearn.preprocessing.OneHotEncoder, it outputs 10 columns with the respective 1s.
How can I achieve this?
For the two columns, you can do:
pd.crosstab(df.id, df.feature_1) + pd.crosstab(df['id'], df['feature_2']) * .5
Output:
feature_1 a b c d e
id
1 1.0 0.0 0.0 0.0 0.5
2 0.0 1.0 0.5 0.0 0.0
3 0.0 0.0 1.0 0.5 0.0
4 0.0 0.5 0.0 1.0 0.0
5 0.5 0.0 0.0 0.0 1.0
If you have more than two features, with the weights defined, you can melt the frame and then map the features to the weights:
weights = {'feature_1': 1, 'feature_2': 0.5}
flatten = df.melt('id')
(flatten['variable'].map(weights)
    .groupby([flatten['id'], flatten['value']])
    .sum()
    .unstack('value', fill_value=0)
)
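Putting the weighted-melt approach together as a runnable sketch (the frame below is rebuilt from the question's first table, so treat it as illustrative rather than the poster's actual data):
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'feature_1': list('abcde'),
                   'feature_2': list('ecdba')})

weights = {'feature_1': 1, 'feature_2': 0.5}
flat = df.melt('id')                    # columns: id, variable, value
out = (flat['variable'].map(weights)    # replace each column name with its weight
         .groupby([flat['id'], flat['value']])
         .sum()
         .unstack('value', fill_value=0))
print(out)                              # ids as rows, categories a-e as columns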

Summarize data from a list of pandas dataframes

I have a list of dfs, df_list:
[ CLASS IDX A B C D
0 1 1 1.0 0.0 0.0 0.0
1 1 2 1.0 0.0 0.0 0.0
2 1 3 1.0 0.0 0.0 0.0,
CLASS IDX A B C D
0 1 1 NaN NaN NaN NaN
1 1 2 1.0 0.0 0.0 0.0
2 1 3 1.0 0.0 0.0 0.0,
CLASS IDX A B C D
0 1 1 0.900 0.100 0.0 0.0
1 1 2 1.000 0.000 0.0 0.0
2 1 3 0.999 0.001 0.0 0.0]
I would like to summarize the data into one df based on conditions and values in the individual dfs. Each df has 4 columns of interest: A, B, C and D. If, for example, the value in column A is >= 0.1 in df_list[0], I want to print 'A' in the summary df. If two columns, for example A and B, have values >= 0.1, I want to print 'A/B'. The final summary df for this data should be:
CLASS IDX 0 1 2
0 1 1 A NaN A/B
1 1 2 A A A
2 1 3 A A A
In the summary df, the column labels (0,1,2) represent the position of the df in the df_list.
I am starting with this:
for index, values in enumerate(df_list):
    # summarize the data
But I am not sure what the best way to continue would be.
Any help greatly appreciated!
Here is one approach:
import numpy as np

cols = ['A', 'B', 'C', 'D']

def join_func(df):
    # True where a cell is >= 0.1
    m = df[cols].ge(0.1)
    # replace qualifying cells with their column name, blank out the rest,
    # then join the surviving names row-wise with '/'
    return (df[cols].mask(m, cols)
            .where(m, np.nan)
            .apply(lambda x: '/'.join(x.dropna()), axis=1))

result = (df_list[0].loc[:, ['CLASS', 'IDX']]
          .assign(**{str(i): join_func(df)
                     for i, df in enumerate(df_list)}))
print(result)
CLASS IDX 0 1 2
0 1 1 A A/B
1 1 2 A A A
2 1 3 A A A
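As an alternative to join_func, a common pandas idiom is to dot the boolean mask with the column names; each True repeats the corresponding name and each False drops it, so one matrix product builds the joined labels. A small self-contained sketch on made-up data (not the poster's df_list):
import pandas as pd

cols = ['A', 'B', 'C', 'D']
df0 = pd.DataFrame({'CLASS': [1, 1], 'IDX': [1, 2],
                    'A': [0.9, 1.0], 'B': [0.1, 0.0],
                    'C': [0.0, 0.0], 'D': [0.0, 0.0]})

# True where a cell is >= 0.1; dotting with 'A/', 'B/', ... concatenates
# the names of all qualifying columns, then the trailing '/' is stripped
m = df0[cols].ge(0.1)
labels = m.dot(pd.Index(cols) + '/').str.rstrip('/')
print(labels)   # 0    A/B
                # 1      A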

How to create another column according to other column value in python?

I have the following dataframe with the following code:
for i in range(int(tower1_base), int(tower1_top)):
    if i not in tower1_not_included_int:
        df = pd.concat([df, pd.DataFrame({"Tower": 1, "Floor": i, "Unit": list("ABCDEFG")})],
                       ignore_index=True)
Result:
Tower Floor Unit
0 1 1.0 A
1 1 1.0 B
2 1 1.0 C
3 1 1.0 D
4 1 1.0 E
5 1 1.0 F
6 1 1.0 G
How can I create another Index column like this?
Tower Floor Unit Index
0 1 1.0 A 1A1
1 1 2.0 B 1B2
2 1 3.0 C 1C3
3 1 4.0 D 1D4
4 1 5.0 E 1E5
5 1 6.0 F 1F6
6 1 7.0 G 1G7
You can simply concatenate the columns as strings:
df['Index'] = df['Tower'].astype(str)+df['Unit']+df['Floor'].astype(int).astype(str)
Outputs this for the first version of your dataframe:
Tower Floor Unit Index
0 1 1.0 A 1A1
1 1 1.0 B 1B1
2 1 1.0 C 1C1
3 1 1.0 D 1D1
4 1 1.0 E 1E1
5 1 1.0 F 1F1
6 1 1.0 G 1G1
Another approach.
I've created a copy of the dataframe and reordered the columns, to make the "melting" easier.
dfAl = df.reindex(columns=['Tower', 'Unit', 'Floor'])
to_load = []                        # list to load the new column
vals = pd.DataFrame.to_numpy(dfAl)  # all values extracted
for sublist in vals:
    # melt the values into one string; removesuffix only trims a trailing '.0',
    # so multi-digit floors are not mangled the way strip('.0') would mangle them
    combs = ''.join([str(i).removesuffix('.0') for i in sublist])
    to_load.append(combs)
df['Index'] = to_load
If you really want the 'Index' column to be a real index, the last step:
df = df.set_index('Index')
print(df)
Tower Floor Unit
Index
1A1 1 1.0 A
1B2 1 2.0 B
1C3 1 3.0 C
1D4 1 4.0 D
1E5 1 5.0 E
1F6 1 6.0 F
1G7 1 7.0 G
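A loop-free variant of the same melting idea (a sketch; the frame below is rebuilt to match the question's second table rather than generated by the original loop):
import pandas as pd

df = pd.DataFrame({'Tower': [1] * 7,
                   'Floor': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                   'Unit': list('ABCDEFG')})

# stringify the reordered columns, trim the trailing '.0' from the floats,
# and glue each row together into one label
dfAl = df.reindex(columns=['Tower', 'Unit', 'Floor']).astype(str)
df['Index'] = dfAl.replace(r'\.0$', '', regex=True).apply(''.join, axis=1)
print(df)   # Index column: 1A1, 1B2, ..., 1G7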

How to change a simple network dataframe to a correlation table? [duplicate]

This question already has answers here: How can I pivot a dataframe? (5 answers). Closed 3 years ago.
I have a dataframe which is in this format
from to weight
0 A D 3
1 B A 5
2 C E 6
3 A C 2
I wish to convert this to a correlation-type dataframe which would look like this -
A B C D E
A 0 0 2 3 0
B 5 0 0 0 0
C 0 0 0 0 6
D 0 0 0 0 0
E 0 0 0 0 0
I thought a possible (read naïve) solution would be to loop over the dataframe and then assign the values to the correct cells of another dataframe by comparing the rows and columns.
Something similar to this:
labels = sorted(set(df["from"]) | set(df["to"]))
new_df = pd.DataFrame(index=labels, columns=labels)
for i in range(len(df)):
    new_df.loc[df.iloc[i, 0], df.iloc[i, 1]] = df.iloc[i, 2]
And that worked. However, I've read about not looping over Pandas dataframes here.
The primary issue is that my dataframe is larger than this - a couple thousand rows. So I wish to know if there's another solution to this since this method doesn't sit well with me in terms of being Pythonic. Possibly faster as well, since speed is a concern.
IIUC, this is a pivot with reindex:
# all labels, taken from both the 'from' and the 'to' column
all_vals = sorted(set(df['from']) | set(df['to']))

(df.pivot(index='from', columns='to', values='weight')
   .reindex(all_vals)
   .reindex(all_vals, axis=1)
   .fillna(0)
)
Output:
to A B C D E
from
A 0.0 0.0 2.0 3.0 0.0
B 5.0 0.0 0.0 0.0 0.0
C 0.0 0.0 0.0 0.0 6.0
D 0.0 0.0 0.0 0.0 0.0
E 0.0 0.0 0.0 0.0 0.0
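An equivalent sketch using pd.crosstab, which also makes the label handling explicit (the frame below is rebuilt from the question's example):
import pandas as pd

df = pd.DataFrame({'from': ['A', 'B', 'C', 'A'],
                   'to': ['D', 'A', 'E', 'C'],
                   'weight': [3, 5, 6, 2]})

# every label that occurs in either column
labels = sorted(set(df['from']) | set(df['to']))

# crosstab builds the square table; reindex adds the labels that never
# occur on one side, and fillna/astype clean up the missing pairs
adj = (pd.crosstab(df['from'], df['to'], values=df['weight'], aggfunc='sum')
         .reindex(index=labels, columns=labels)
         .fillna(0)
         .astype(int))
print(adj)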

pandas take average on odd rows

I want to insert a row between every pair of rows in a dataframe, filled with the average of the current and the next row (all columns are numeric).
starting data:
time value value_1 value-2
0 0 0 4 3
1 2 1 6 6
intermediate df:
time value value_1 value-2
0 0 0 4 3
1 1 0 4 3 #duplicate of row 0
2 2 1 6 6
3 3 1 6 6 #duplicate of row 2
I would like to create df_1:
time value value_1 value-2
0 0 0 4 3
1 1 0.5 5 4.5 #average of row 0 and 2
2 2 1 6 6
3 3 2 8 8 #average of row 2 and 4
To do this I appended a copy of the starting dataframe to create the intermediate dataframe shown above:
df = df_0.append(df_0)
df.sort_values(['time'], ascending=[True], inplace=True)
df = df.reset_index()
df['value_shift'] = df['value'].shift(-1)
df['value_shift_1'] = df['value_1'].shift(-1)
df['value_shift_2'] = df['value_2'].shift(-1)
then I was thinking of applying a function to each column:
def average_vals(row):
    # average every odd row
    if int(row.name) % 2 != 0:
        # take the average of value and value_shift for each value,
        # but this way I need to create 3 separate functions
        ...
Is there a way to do this without writing a separate function for each column and applying to each column one by one (in real data I have tens of columns)?
How about this method, using DataFrame.reindex and DataFrame.interpolate?
df.reindex(np.arange(len(df.index) * 2) / 2).interpolate().reset_index(drop=True)
Explanation
Reindex in half steps: reindex(np.arange(len(df.index) * 2) / 2).
This gives a DataFrame like this:
time value value_1 value-2
0.0 0.0 0.0 4.0 3.0
0.5 NaN NaN NaN NaN
1.0 2.0 1.0 6.0 6.0
1.5 NaN NaN NaN NaN
Then use DataFrame.interpolate to fill in the NaN values; the default is linear interpolation, so the mean of the two neighbours in this case.
Finally, use .reset_index(drop=True) to fix your index.
Should give
time value value_1 value-2
0 0.0 0.0 4.0 3.0
1 1.0 0.5 5.0 4.5
2 2.0 1.0 6.0 6.0
3 2.0 1.0 6.0 6.0
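Put together as a runnable sketch (the column is named value_2 here, matching the code in the question rather than the value-2 header in its tables):
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [0, 2], 'value': [0, 1],
                   'value_1': [4, 6], 'value_2': [3, 6]})

# insert a half-step row between every pair of rows, then fill the gaps
# with linear interpolation (i.e. the mean of the two neighbours)
df_1 = (df.reindex(np.arange(len(df.index) * 2) / 2)
          .interpolate()
          .reset_index(drop=True))
print(df_1)
Note that interpolate only fills between existing points, so the final inserted row simply repeats the last original row (as in the output above); drop it with df_1.iloc[:-1] if that trailing row is not wanted.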
