Merge two columns into one while respecting ascending order - python

Hello, I would like to replace my "Timestamp" column with an outer merge of "Timestamp" and "Timestamp+0.4". Moreover, I would like the values in the "input" column to still correspond to this new merged "Timestamp" column, with NaN wherever the value is not defined (for example, 0.6 = NaN in the "input" column).
My expected output is a single sorted "Timestamp" column containing the union of both timestamp columns, with NaN in "input" for the timestamps that had no original value (0.6, 1.8, 2.8).
Do you have any idea how to achieve this?
Here is the code to create the dataframe:
import pandas as pd

df = pd.DataFrame({'Timestamp': [0.2, 0.4, 0.8, 1.2, 1.4, 1.6, 2.0, 2.4],
                   'input': [10, 20, 40, 5, 15, 25, 0, 20]})
df["Timestamp+0.4"] = df["Timestamp"] + 0.4
Thanks a lot!

You may use concat to concatenate the two timestamp columns along the row axis. After that, drop the duplicates using only the Timestamp column as the subset, then finally sort the values again by the Timestamp column.
import pandas as pd

df = pd.DataFrame({'Timestamp': [0.2, 0.4, 0.8, 1.2, 1.4, 1.6, 2.0, 2.4],
                   'input': [10, 20, 40, 5, 15, 25, 0, 20]})
# One-column frame of shifted timestamps (the column name "Timestamp" is kept)
df1 = pd.DataFrame(df["Timestamp"] + 0.4)
# Stack the two frames; rows coming from df1 have NaN in "input"
df = pd.concat([df, df1])
# Round away floating-point noise (0.2 + 0.4 is not exactly 0.6) so duplicates match
df["Timestamp"] = round(df["Timestamp"], 8)
# Keep the first occurrence, i.e. the row carrying the original "input" value
df = df.drop_duplicates(subset=["Timestamp"], keep="first")
df = df.sort_values(["Timestamp"], ignore_index=True)
print(df)
Output from df
Timestamp input
0 0.2 10.0
1 0.4 20.0
2 0.6 NaN
3 0.8 40.0
4 1.2 5.0
5 1.4 15.0
6 1.6 25.0
7 1.8 NaN
8 2.0 0.0
9 2.4 20.0
10 2.8 NaN
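Since the question explicitly mentions an outer merge, here is an equivalent sketch using pd.merge instead of concat (my variation, not part of the answer above; the rounding again guards against floating-point noise such as 0.2 + 0.4 != 0.6):
import pandas as pd

df = pd.DataFrame({'Timestamp': [0.2, 0.4, 0.8, 1.2, 1.4, 1.6, 2.0, 2.4],
                   'input': [10, 20, 40, 5, 15, 25, 0, 20]})
# Frame holding only the shifted, rounded timestamps
shifted = pd.DataFrame({'Timestamp': (df['Timestamp'] + 0.4).round(8)})
merged = (df.assign(Timestamp=df['Timestamp'].round(8))
            .merge(shifted, on='Timestamp', how='outer')  # union of both key sets
            .sort_values('Timestamp', ignore_index=True))
print(merged)  # same table as above
The outer join already de-duplicates matching keys, so the explicit drop_duplicates step is not needed here.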

Related

Replace values in Pandas Dataframe using another Dataframe as a lookup table

I'm looking to replace the values in one DataFrame with values from a second DataFrame, by matching each value in the first DataFrame against the column labels of the second DataFrame.
Example:
import numpy as np
import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03', '2003-05-04'])
df = pd.DataFrame({'A':[1,1,3,12], 'B':[12,1,3,3], 'C':[3,12,12,1]}, index = dt_index)
df2 = pd.DataFrame({1:[1.4,4.2,1.3,5.6], 12:[2.3,7.3,9.5,0.4], 3:[8.8,0.1,8.7,2.4], 4:[9.6,9.8,5.5,1.8]}, index = dt_index)
df =
A B C
2003-05-01 1 12 3
2003-05-02 1 1 12
2003-05-03 3 3 12
2003-05-04 12 3 1
df2 =
1 12 3 4
2003-05-01 1.4 2.3 8.8 9.6
2003-05-02 4.2 7.3 0.1 9.8
2003-05-03 1.3 9.5 8.7 5.5
2003-05-04 5.6 0.4 2.4 1.8
Expected output:
expect = pd.DataFrame({'A':[1.4,4.2,8.7,0.4], 'B':[2.3,4.2,8.7,2.4], 'C':[8.8,7.3,9.5,5.6]}, index = dt_index)
expect =
A B C
2003-05-01 1.4 2.3 8.8
2003-05-02 4.2 4.2 7.3
2003-05-03 8.7 8.7 9.5
2003-05-04 0.4 2.4 5.6
Attempt:
X = df.copy()
for i in np.unique(df):
    X.mask(df == i, df2[i], axis=0, inplace=True)
My attempt seems to work but I'm not sure if it has any pitfalls and how it would scale as the sizes of the Dataframe increase.
Are there better or faster solutions?
EDIT:
After cottontail's helpful answer, I realised I had made an oversimplification in my example. The values in df and the columns of df and df2 cannot be assumed to be sequential.
I've now modified the example to reflect that.
One approach is to use stack() to reshape df2 into a Series and reindex() it using the values in df, then reshape back into the original shape using unstack().
# Look up each (date, value) pair from df in df2's (date, column-label) MultiIndex
tmp = df2.stack().reindex(df.stack().droplevel(-1).items())
# Restore df's column labels as the second index level so unstack rebuilds df's shape
tmp.index = pd.MultiIndex.from_arrays([tmp.index.get_level_values(0), df.columns.tolist()*len(df)])
df = tmp.unstack()
Another approach is to iteratively create a dummy dataframe shaped like df2, multiply it by df2, reduce it into a Series (using sum()) and assign it to an empty dataframe shaped like df.
X = pd.DataFrame().reindex_like(df)
df['dummy'] = 1
for c in X:
    X[c] = (
        df.groupby([df.index, c])['dummy'].size()
        .unstack(fill_value=0)
        .reindex(df2.columns, axis=1, fill_value=0)
        .mul(df2)
        .sum(1)
    )
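For larger frames, a fully vectorized sketch (my own addition, not from the answers above) maps each value of df to the position of the matching df2 column and fancy-indexes the underlying NumPy array. It assumes df is the original frame (without the dummy helper column) and that every value in df appears among df2's columns:
import numpy as np
import pandas as pd

# Column position in df2 for each value of df, taken in row-major order
col_pos = df2.columns.get_indexer(df.to_numpy().ravel())
# Matching row position: each row index repeated once per column of df
row_pos = np.repeat(np.arange(len(df)), df.shape[1])
out = pd.DataFrame(df2.to_numpy()[row_pos, col_pos].reshape(df.shape),
                   index=df.index, columns=df.columns)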

Pandas DataFrame, group by column into single line items but extend columns by number of occurrences per group

I am trying to reformat a DataFrame into a single line item per categorical group, but my fixed format needs to retain all elements of data associated with the category as new columns.
For example, I have a DataFrame:
import pandas as pd

dta = {'day': ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'C', 'C', 'C'],
       'param1': [100, 200, 2, 3, 7, 23, 43, 98, 1, 0],
       'param2': [1, 20, 65, 3, 67, 2, 3, 98, 654, 5]}
df = pd.DataFrame(dta)
I need to be able to transform/reformat the DataFrame so that the data is grouped by the 'day' column (i.e. one row per day) but with columns generated dynamically according to how many entries fall within each category.
For example, category C in the 'day' column has 5 entries, so for 'day' C you would have 5 param1 values and 5 param2 values.
The associated values for days A and B would be populated with NaN or empty where they do not have entries.
e.g.
import numpy as np

dta2 = {'day': ['A', 'B', 'C'],
        'param1_1': [100, 7, 23],
        'param1_2': [200, np.nan, 43],
        'param1_3': [2, np.nan, 98],
        'param1_4': [3, np.nan, 1],
        'param1_5': [np.nan, np.nan, 0],
        'param2_1': [1, 67, 2],
        'param2_2': [20, np.nan, 3],
        'param2_3': [65, np.nan, 98],
        'param2_4': [3, np.nan, 654],
        'param2_5': [np.nan, np.nan, 5]}
df2 = pd.DataFrame(dta2)
Unfortunately this is a predefined format that I have to maintain.
I am aiming to use Pandas as efficiently as possible to minimise deconstructing and reassembling the DataFrame.
You first need to melt, then add a helper column that cumcounts the labels per group, and pivot:
df2 = (
    df.melt(id_vars='day')
      .assign(group=lambda d: d.groupby(['day', 'variable']).cumcount().add(1).astype(str))
      .pivot(index='day', columns=['variable', 'group'], values='value')
)
df2.columns = df2.columns.map('_'.join)
df2 = df2.reset_index()
output:
day param1_1 param1_2 param1_3 param1_4 param1_5 param2_1 param2_2 param2_3 param2_4 param2_5
0 A 100.0 200.0 2.0 3.0 NaN 1.0 20.0 65.0 3.0 NaN
1 B 7.0 NaN NaN NaN NaN 67.0 NaN NaN NaN NaN
2 C 23.0 43.0 98.0 1.0 0.0 2.0 3.0 98.0 654.0 5.0
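An equivalent sketch (my variation on the same idea, not from the answer above) skips the melt by pushing a per-day counter into the index and unstacking; it assumes df as defined in the question:
# Number the rows within each day: '1', '2', '3', ...
counter = df.groupby('day').cumcount().add(1).astype(str)
out = df.set_index(['day', counter]).unstack()  # columns become (param, counter) pairs
out.columns = out.columns.map('_'.join)         # flatten to param1_1, param1_2, ...
out = out.reset_index()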

How do I fill na values in a column with the average of previous non-na and next non-na value in pandas?

Raw table:
Column A
5
nan
nan
15
New table:
Column A
5
10
10
15
One option might be the following: fill the NaNs twice, once forward (ffill) and once backward (bfill), then average the two results:
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [np.nan, 5, np.nan, np.nan, 15]})
# ffill()/bfill() replace the deprecated fillna(method=...) spelling
filled_series = [df['x'].ffill(), df['x'].bfill()]
print(pd.concat(filled_series, axis=1).mean(axis=1))
# 0     5.0
# 1     5.0
# 2    10.0
# 3    10.0
# 4    15.0
As you can see, this works even if nan happens at the beginning or at the end.
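For contrast, a plain element-wise average such as (ffill + bfill) / 2 would lose those edge values, because NaN propagates through addition; the concat/mean(axis=1) route avoids this since mean() skips NaNs by default. A quick sketch:
naive = (df['x'].ffill() + df['x'].bfill()) / 2
print(naive)
# 0     NaN   <- NaN + 5 stays NaN at the boundary
# 1     5.0
# 2    10.0
# 3    10.0
# 4    15.0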

How do I create a rank table for a given pandas dataframe with multiple numerical columns?

I would like to create a rank table based on a multi-column pandas dataframe, with several numerical columns.
Let's use the following df as an example:
Name  Sales  Volume  Reviews
A      1000     100      100
B      2000     200       50
C      5400     500       10
I would like to create a new table, ranked_df, that ranks the values in each column in descending order while maintaining essentially the same format:
Name  Sales_rank  Volume_rank  Reviews_rank
A               3            3             1
B               2            2             2
C               1            1             3
Now, I can iteratively do this by looping through the columns, i.e.
df = pd.DataFrame({
    "Name": ['A', 'B', 'C'],
    "Sales": [1000, 2000, 5400],
    "Volume": [100, 200, 500],
    "Reviews": [100, 50, 10]
})
# make a copy of the original df
ranked_df = df.copy()
# define our interested columns
interest_cols = ['Sales', 'Volume', 'Reviews']
for col in interest_cols:
    ranked_df[f"{col}_rank"] = df[col].rank(ascending=False)
# drop the cols not needed
...
But my question is this: is there a more elegant or more pythonic way of doing this? Maybe an apply on the dataframe? Or some vectorized operation by throwing it to numpy?
Thank you.
df.set_index('Name').rank(ascending=False).reset_index()
  Name  Sales  Volume  Reviews
0    A    3.0     3.0      1.0
1    B    2.0     2.0      2.0
2    C    1.0     1.0      3.0
You could also use transform/apply to hit each column:
df.set_index('Name').transform(pd.Series.rank, ascending = False)
Sales Volume Reviews
Name
A 3.0 3.0 1.0
B 2.0 2.0 2.0
C 1.0 1.0 3.0
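To reproduce the exact expected layout with the _rank suffixes, here is a small sketch built on either answer (my addition, not from the answers above):
ranked_df = (df.set_index('Name')
               .rank(ascending=False)
               .astype(int)          # ranks have no NaN here, so int is safe
               .add_suffix('_rank')
               .reset_index())
print(ranked_df)
#   Name  Sales_rank  Volume_rank  Reviews_rank
# 0    A           3             3             1
# 1    B           2             2             2
# 2    C           1             1             3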

Need to combine multiple rows based on index

I have a dataframe with values like
0 1 2
a 5 NaN 6
a NaN 2 NaN
I need the output produced by combining the two rows that share the index 'a'. I also need to add the multiple columns together and output them as a single column.
The expected output is below; the value is 13 since 5 + 2 + 6 = 13:
0
a 13
I tried this with the concat function but got errors.
How about using Pandas DataFrame.sum()?
import pandas as pd
import numpy as np

data = pd.DataFrame({"0": [5, np.nan], "1": [np.nan, 2], "2": [6, np.nan]})
row_total = data.sum(axis=1, skipna=True)  # sum each row, ignoring NaN
row_total.sum(axis=0)                      # then sum the row totals
result:
13.0
EDIT: @Chris's comment (which I did not see while writing my answer) shows how to do it in one line, if all rows have the same index.
data:
data = pd.DataFrame({"0": [5, np.nan],
                     "1": [np.nan, 2],
                     "2": [6, np.nan]},
                    index=['a', 'a'])
gives:
0 1 2
a 5.0 NaN 6.0
a NaN 2.0 NaN
Then
data.groupby(data.index).sum().sum(axis=1)
returns
a    13.0
dtype: float64
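To get the expected single-column frame exactly, the one-liner can be wrapped as follows (my own framing of the same idea):
out = data.groupby(level=0).sum().sum(axis=1).to_frame('0')
print(out)
#       0
# a  13.0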
