After processing some data I have a DataFrame df. Now I need to get the 3 largest values in each row, together with the name of the column each value came from.
data=[[4.12,3,2],[1.0123123,-6.12312,5.123123],[-3.123123,-8.512323,6.12313]]
df = pd.DataFrame(data,columns =['a','b','c'],index=['aa','bb','cc'])
df
Output:
           a         b         c
aa  4.120000  3.000000  2.000000
bb  1.012312 -6.123120  5.123123
cc -3.123123 -8.512323  6.123130
Then I appended the column name to each value:
df1 = df.astype(str).apply(lambda x:x+'='+x.name)
              a            b           c
aa       4.12=a        3.0=b       2.0=c
bb  1.0123123=a   -6.12312=b  5.123123=c
cc  -3.123123=a  -8.512323=b   6.12313=c
I need the values in each row sorted from largest to smallest. I have tried sorting the data frame but could not get the output I want.
What I need is:
final_df
         max=1        max=2        max=3
aa      4.12=a        3.0=b        2.0=c
bb  5.123123=c  1.0123123=a   -6.12312=b
cc   6.12313=c  -3.123123=a  -8.512323=b
I suggest you proceed as follows:
import pandas as pd
data=[[4.12,3,2],[1.0123123,-6.12312,5.123123],[-3.123123,-8.512323,6.12313]]
df = pd.DataFrame(data,columns =['a','b','c'],index=['aa','bb','cc'])
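# first sort the values of each row in descending order, in place
# (this relies on df.values being a writable view of the data, which recent pandas versions do not guarantee)
df.values[:, ::-1].sort(axis=1)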
# then rename row values
df1 = df.astype(str).apply(lambda x: x + '=' + x.name)
# rename columns
df1.columns = [f"max={i}" for i in range(1, len(df.columns)+1)]
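Result (note that only the values are sorted; each label reflects the new column position rather than the column the value came from, so 5.123123 is labelled a instead of c):
         max=1        max=2        max=3
aa      4.12=a        3.0=b        2.0=c
bb  5.123123=a  1.0123123=b   -6.12312=c
cc   6.12313=a  -3.123123=b  -8.512323=c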
The solution proposed by @GuglielmoSanchini does not give the expected result, because the column labels no longer match the values. The following does:
# Imports
import pandas as pd
import numpy as np
# Data
data=[[4.12,3,2],[1.0123123,-6.12312,5.123123],[-3.123123,-8.512323,6.12313]]
df = pd.DataFrame(data,columns =['a','b','c'],index=['aa','bb','cc'])
df1 = df.astype(str).apply(lambda x:x+'='+x.name)
data = []
for index, row in df1.iterrows():
    # indices of the numbers sorted in descending order
    # (the number is everything before the last '=', so longer column names work too)
    indices_max = np.argsort([float(item.rsplit('=', 1)[0]) for item in row])[::-1]
    # add the row's values in sorted order
    data.append(row.iloc[indices_max].values.tolist())
# create the new dataframe with the sorted values
df = pd.DataFrame(data, columns=[f"max={i}" for i in range(1, len(df1.columns) + 1)])
df.index = df1.index
print(df)
Here is the result:
         max=1        max=2        max=3
aa      4.12=a        3.0=b        2.0=c
bb  5.123123=c  1.0123123=a   -6.12312=b
cc   6.12313=c  -3.123123=a  -8.512323=b
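For larger frames, the same result can probably be reached without looping over the string values. The following is only a sketch under the assumption that all columns are numeric; the names order, sorted_values and sorted_labels are illustrative and do not come from the answers above. It sorts the numbers first and attaches the column names afterwards, so no string parsing is needed.

```python
import numpy as np
import pandas as pd

data = [[4.12, 3, 2], [1.0123123, -6.12312, 5.123123], [-3.123123, -8.512323, 6.12313]]
df = pd.DataFrame(data, columns=['a', 'b', 'c'], index=['aa', 'bb', 'cc'])

values = df.to_numpy()
# Column positions of each row's values, largest first
order = np.argsort(-values, axis=1)
# Reorder the values and the matching column names with the same positions
sorted_values = np.take_along_axis(values, order, axis=1)
sorted_labels = df.columns.to_numpy()[order]

# Build the "value=column" strings row by row
rows = [
    [f"{float(v)}={c}" for v, c in zip(row_values, row_labels)]
    for row_values, row_labels in zip(sorted_values, sorted_labels)
]
final_df = pd.DataFrame(
    rows,
    index=df.index,
    columns=[f"max={i}" for i in range(1, df.shape[1] + 1)],
)
print(final_df)
```

Keeping the numeric sort separate from the labelling also means column names of any length work.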
Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge them so that any value in df1 is overwritten if there is a value in df2 at that location, and any new values in df2 are added, including the new rows and columns.
The result should be:
     A  B    C
0    1  3  nan
1    2  8   10
2  nan  9   11
I've tried combine_first, but that causes only nan values to be overwritten.
update has the issue where new rows are created rather than overwritten.
merge has many issues.
I've tried writing my own function
def take_right(df1, df2, j, i):
    print (df1)
    print (df2)
    try:
        s1 = df1[j][i]
    except:
        s1 = np.NaN
    try:
        s2 = df2[j][i]
    except:
        s2 = np.NaN
    if math.isnan(s2):
        #print(s1)
        return s1
    else:
        # print(s2)
        return s2
def combine_df(df1, df2):
    rows = (set(df1.index.values.tolist()) | set(df2.index.values.tolist()))
    #print(rows)
    columns = (set(df1.columns.values.tolist()) | set(df2.columns.values.tolist()))
    #print(columns)
    df = pd.DataFrame()
    #df.columns = columns
    for i in rows:
        #df[:][i]=[]
        for j in columns:
            df = df.insert(int(i), j, take_right(df1,df2,j,i), allow_duplicates=False)
            # print(df)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output DataFrame whose columns and index are the union of those of df1 and df2, and then use the DataFrame.update method to write their values into out_df. Because df2 is applied last, its values win wherever both frames have one.
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
    columns = df1.columns.union(df2.columns),
    index = df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
Why does combine_first not work? Calling it on df2 gives df2's values priority and fills the gaps from df1:
df = df2.combine_first(df1)
print(df)
Output:
     A  B     C
0  1.0  3   NaN
1  2.0  8  10.0
2  NaN  9  11.0
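The order of the two frames is what decides which values survive. As a small sketch (the names df1_first and df2_first are only illustrative), comparing both call orders shows the difference:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({'B': [8, 9], 'C': [10, 11]}, index=[1, 2])

# The calling frame has priority; the argument only fills its missing values
df1_first = df1.combine_first(df2)  # keeps B=4 at index 1
df2_first = df2.combine_first(df1)  # keeps B=8 at index 1, as asked for

print(df1_first)
print(df2_first)
```

In both cases the result covers the union of rows and columns; only the overlapping cell at index 1, column B differs.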
I want to replace a row in a csv file with a variable. The row itself also has to be a variable. The following code is an example:
import pandas as pd
# sample dataframe
df = pd.DataFrame({'A': ['a','b','c'], 'B':['b','c','d']})
print("Original DataFrame:\n", df)
x = 1
y = 12698
df_rep = df.replace([int(x),1], y)
print("\nAfter replacing:\n", df_rep)
This can be done with positional indexing, e.g. df.iloc[row_num, col_num].
# update the cell in row x, column 1 (column 'B')
df.iloc[x, 1] = y
# print df
print(df)
   A      B
0  a      b
1  b  12698
2  c      d
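Since the question is about a CSV file, the full round trip might look roughly like the sketch below. It makes a few assumptions: an in-memory io.StringIO buffer stands in for the real file, the names csv_text, row_num and new_value are hypothetical, and the columns are read as generic objects so that any replacement value can be stored.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk (hypothetical content)
csv_text = "A,B\na,b\nb,c\nc,d\n"
df = pd.read_csv(io.StringIO(csv_text), dtype=object)

row_num = 1          # row to change, held in a variable
new_value = "12698"  # replacement value, held in a variable

# Replace one cell by position: row `row_num`, second column
df.iloc[row_num, 1] = new_value

# Write the result back out; with a real file, pass its path instead of the buffer
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

To replace a whole row rather than one cell, assigning a list of the right length to df.iloc[row_num] should work in the same way.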
Say I have the following variables and dataframe:
a = '2020-04-23 14:00:00+00:00','2020-04-23 13:00:00+00:00','2020-04-23 12:00:00+00:00','2020-04-23 11:00:00+00:00','2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 04:00:00+00:00'
b = '2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 05:00:00+00:00','2020-04-23 04:00:00+00:00','2020-04-23 03:00:00+00:00','2020-04-23 02:00:00+00:00','2020-04-23 01:00:00+00:00'
aa = 7105.50,6923.50,6692.50,6523.00,6302.5,6081.5,6262.0,6451.50,6369.50,6110.00
bb = 6386.00,6221.00,6505.00,6534.70,6705.00,6535.00,7156.50,7422.00,7608.50,8098.00
df1 = pd.DataFrame()
df1['timestamp'] = a
df1['price'] = aa
df2 = pd.DataFrame()
df2['timestamp'] = b
df2['price'] = bb
print(df1)
print(df2)
I am trying to concatenate the following rows:
from the top row of df1 down to '2020-04-23 08:00:00+00:00'
from '2020-04-23 07:00:00+00:00' down to the last row of df2
For illustration, this is what the resulting DataFrame should look like:
c = '2020-04-23 14:00:00+00:00','2020-04-23 13:00:00+00:00','2020-04-23 12:00:00+00:00','2020-04-23 11:00:00+00:00','2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 05:00:00+00:00','2020-04-23 04:00:00+00:00','2020-04-23 03:00:00+00:00','2020-04-23 02:00:00+00:00','2020-04-23 01:00:00+00:00'
cc = 7105.50,6923.50,6692.50,6523.00,6302.5,6081.5,6262.0,6534.70,6705.00,6535.00,7156.50,7422.00,7608.50,8098.00
df = pd.DataFrame()
df['timestamp'] = c
df['price'] = cc
print(df)
Any ideas?
You can convert the timestamp columns to datetime objects with pd.to_datetime, and then use boolean indexing and pd.concat to select and merge them:
df1.timestamp = pd.to_datetime(df1.timestamp)
df2.timestamp = pd.to_datetime(df2.timestamp)
dfs = [df1.loc[df1.timestamp >= pd.to_datetime("2020-04-23 08:00:00+00:00"), :],
       df2.loc[df2.timestamp <= pd.to_datetime("2020-04-23 07:00:00+00:00"), :]
      ]
df_conc = pd.concat(dfs)
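Put together with the data from the question, this could look like the sketch below. The helper hourly and the names cutoff_1 and cutoff_2 are only illustrative, and ignore_index=True is an addition here that gives the combined frame a fresh 0..n-1 index.

```python
import pandas as pd

def hourly(hours):
    # Build the UTC timestamp strings used in the question for the given hours
    return [f"2020-04-23 {h:02d}:00:00+00:00" for h in hours]

df1 = pd.DataFrame({
    "timestamp": hourly([14, 13, 12, 11, 10, 9, 8, 7, 6, 4]),
    "price": [7105.50, 6923.50, 6692.50, 6523.00, 6302.5,
              6081.5, 6262.0, 6451.50, 6369.50, 6110.00],
})
df2 = pd.DataFrame({
    "timestamp": hourly([10, 9, 8, 7, 6, 5, 4, 3, 2, 1]),
    "price": [6386.00, 6221.00, 6505.00, 6534.70, 6705.00,
              6535.00, 7156.50, 7422.00, 7608.50, 8098.00],
})

# Parse the strings so that comparisons are done on real timestamps
df1["timestamp"] = pd.to_datetime(df1["timestamp"])
df2["timestamp"] = pd.to_datetime(df2["timestamp"])

cutoff_1 = pd.Timestamp("2020-04-23 08:00:00+00:00")  # last row taken from df1
cutoff_2 = pd.Timestamp("2020-04-23 07:00:00+00:00")  # first row taken from df2

df_conc = pd.concat(
    [df1[df1["timestamp"] >= cutoff_1], df2[df2["timestamp"] <= cutoff_2]],
    ignore_index=True,
)
print(df_conc)
```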
I have a dataframe where every second column name is missing, e.g.:
Step_1.
The idea is to fill unnamed columns with previous name to get:
Step_2.
To sum up "in" and "out" in each class, to get final result like this
The intermediary Step_1 is important and cannot be skipped to get the final result.
I appreciate any help and apologize for not being clear enough when asking question at the first attempt.
Thank you
The idea is to convert the columns to a Series, replace the names starting with Unnamed by missing values, and then forward fill them:
df.columns = df.columns.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill()
print (df)
  Column_1 Column_1 Column_2 Column_2
0        a        d        f        g
EDIT:
If the column names already contain missing values instead:
df.columns = df.columns.to_series().ffill()
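As a self-contained check of the masking approach, here is a small sketch. It assumes the unnamed columns carry pandas' default names such as 'Unnamed: 1'; the sample frame and the variable names are made up for illustration.

```python
import pandas as pd

# Sample frame in which every second column name is a placeholder
df = pd.DataFrame(
    [["a", "d", "f", "g"]],
    columns=["Column_1", "Unnamed: 1", "Column_2", "Unnamed: 3"],
)

names = df.columns.to_series()
# Blank out the placeholder names, then carry the previous real name forward
names = names.mask(names.str.startswith("Unnamed")).ffill()
df.columns = names.tolist()

print(df.columns.tolist())
```

Note that the resulting column names are duplicated, so selecting df["Column_1"] returns both columns.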
A MultiIndex solution is necessary if the second row is a header too. First use header=[0,1] to read the columns as a MultiIndex:
import pandas as pd
from io import StringIO
temp=u"""Column_1;Unnamed_column;Column_2;Unnamed_column
a;d;f;g
1;5;5;6
7;8;9;4"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", header=[0,1])
print (df)
  Column_1 Unnamed_column Column_2 Unnamed_column
         a              d        f              g
0        1              5        5              6
1        7              8        9              4
a = df.columns.get_level_values(0)
b = df.columns.get_level_values(1)
df.columns = [a.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill(), b]
print (df)
  Column_1    Column_2
         a  d        f  g
0        1  5        5  6
1        7  8        9  4
I tried this:
t = pd.DataFrame(df.columns)
t.loc[t[0].str.startswith('Unnamed: '), 0] = np.nan
# forward fill so each unnamed column takes the previous name (use bfill() to take the next name instead)
t[0] = t[0].ffill()
df.columns = t[0].values
Create a temporary dataframe from the columns of the original dataframe, apply ffill or bfill as needed, and assign the values back to the original dataframe.
You can rewrite df.columns with a list comprehension.
from itertools import chain
import pandas as pd
df = pd.DataFrame(
    {"Column_1": [1], "Unnamed_column1": [2], "Column_2": [3], "Unnamed_column2": [4]})
# repeat every named column (each even position) twice
cols = [[c, c] for c in df.columns[::2]]
df.columns = list(chain.from_iterable(cols))
That said, it might be better to assign unique names to the columns, since they are used as keys/indices, e.g.:
cols = [[c, c+"_new"] for c in df.columns[::2]]
I have a dataframe such as:
label  column1
a      1
a      2
b      6
b      4
I would like to make a dataframe with a new column that holds the other value of column1 from the row with the same label. Such as:
label  column1  column2
a      1        2
a      2        1
b      6        4
b      4        6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
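import numpy as np
import pandas as pd
x = pd.DataFrame({ 'label': ['a','a','b','b'],
                   'column1': [1,2,6,4] })
# within each label group, add column1 in reversed order as column2
y = x.groupby('label').apply(
    lambda g: g.assign(column2 = np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True) # optional: drop the group-level index
print(y)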
You can try the code block below:
#create the Dataframe
df = pd.DataFrame({'label':['a','a','b','b'],
                   'column1':[1,2,6,4]})
#Group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()
#Concat those groups to create column2 (a stable sort keeps "last" before "first" within each label)
df2 = (pd.concat([b,a])
       .sort_values(by='label', kind='mergesort')
       .rename(columns={'column1':'column2'})
       .reset_index(drop=True))
#Attach column2 to the original Dataframe by row position (assumes df is ordered by label);
#passing both on= and left_index/right_index to merge raises an error in current pandas
df = df.join(df2['column2'])[['label','column1','column2']]
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# Create dataframe
df = pd.DataFrame(data = {'label' :['a', 'a', 'b', 'b'],
                          'column1' :[1,2, 6,4]})
# iterate over dataframe, identify matching label and opposite value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # set value in the new column (set_value was removed in pandas 1.0, so use .loc)
    df.loc[index, 'column2'] = newvalue
df.head()
You can use groupby with apply, creating a new Series with the values of each group in reversed order:
df['column2'] = df.groupby('label')["column1"] \
                  .apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
   column1 label  column2
0        1     a        2
1        2     a        1
2        6     b        4
3        4     b        6
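A variant that may be simpler is groupby with transform, which returns a result aligned with the original rows, so no index reset is needed. This is only a sketch and, like the answers above, it assumes each label occurs in a pair; it does not require the pairs to be adjacent.

```python
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})

# Reverse the values inside each label group; transform keeps the original row order
df['column2'] = df.groupby('label')['column1'].transform(lambda s: s.to_numpy()[::-1])

print(df)
```

For groups with more than two rows this reverses the whole group, which may or may not be what "opposite value" should mean there.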