Concatenating two Pandas DataFrames while maintaining index order - python

Basic question - I am trying to concatenate two DataFrames, with the resulting DataFrame preserving the index in order of the original two. For example:
import pandas as pd

df = pd.DataFrame({'Houses': [10, 20, 30, 40, 50], 'Cities': [3, 4, 7, 6, 1]}, index=[1, 2, 4, 6, 8])
df2 = pd.DataFrame({'Houses': [15, 25, 35, 45, 55], 'Cities': [1, 8, 11, 14, 4]}, index=[0, 3, 5, 7, 9])
Using pd.concat([df, df2]) simply appends df2 to the end of df. Instead, I want to concatenate them so that the result is in index order (0 through 9).

Use concat with the sort parameter to avoid a warning, then DataFrame.sort_index:
df = pd.concat([df, df2], sort=False).sort_index()
print(df)
   Cities  Houses
0       1      15
1       3      10
2       4      20
3       8      25
4       7      30
5      11      35
6       6      40
7      14      45
8       1      50
9       4      55

Try using:
print(df.T.join(df2.T).T.sort_index())
Output:
   Cities  Houses
0       1      15
1       3      10
2       4      20
3       8      25
4       7      30
5      11      35
6       6      40
7      14      45
8       1      50
9       4      55
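As an aside, since the two indexes do not overlap, combine_first produces the same union sorted by index; a minimal sketch (note the alignment step may upcast the dtypes to float along the way):
# combine_first aligns df and df2 on the union of their indexes, sorted
out = df.combine_first(df2)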

Related

select specific rows from a large data frame

I have a data frame with 790 rows. I want to create a new data frame that excludes rows 300 to 400 and keeps the rest.
I tried:
df.loc[[:300, 400:]]
df.iloc[[:300, 400:]]
df_new = df.drop(labels=range([300:400]), axis=0)
None of these work. How can I achieve this goal?
Thanks in advance
Use range, or numpy.r_ to join index ranges:
import numpy as np

df_new = df.drop(range(300, 400))
df_new = df.iloc[np.r_[0:300, 400:len(df)]]
Sample:
df = pd.DataFrame({'a': range(20)})
# print(df)
df1 = df.drop(labels=range(7, 15))
print(df1)
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
15  15
16  16
17  17
18  18
19  19
df1 = df.iloc[np.r_[0:7, 15:len(df)]]
print(df1)
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
15  15
16  16
17  17
18  18
19  19
First select the index of the rows you want to drop, then create a new DataFrame (adjust the slice depending on whether your rows are numbered from 0 or 1):
i = df.iloc[299:400].index
new_df = df.drop(i)
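A boolean mask over row positions also works; a minimal sketch, assuming the 300-400 range is positional and the frame is called df:
import numpy as np

# keep every row whose position falls outside 300..399
mask = ~np.isin(np.arange(len(df)), np.arange(300, 400))
df_new = df[mask]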

How to group a dataframe by column and receive a new column for every group

I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'timestamp': [10, 10, 10, 20, 20, 20], 'idx': [1, 2, 3, 1, 2, 3],
                   'v1': [1, 2, 4, 5, 1, 9], 'v2': [1, 2, 8, 5, 1, 2]})
   timestamp  idx  v1  v2
0         10    1   1   1
1         10    2   2   2
2         10    3   4   8
3         20    1   5   5
4         20    2   1   1
5         20    3   9   2
I'd like to group the data by timestamp and calculate the following statistic:
np.sum(v1 * v2) for every timestamp. I'd like to see the following result:
   timestamp  idx  v1  v2  stat
0         10    1   1   1    37
1         10    2   2   2    37
2         10    3   4   8    37
3         20    1   5   5    44
4         20    2   1   1    44
5         20    3   9   2    44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But the stat column comes out all NaN - what is wrong with my code?
We want groupby transform here, not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function, we need to join back to scale up the aggregated result:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # needed to use join, and it also sets the column name
    on='timestamp'
)
df:
   timestamp  idx  v1  v2  stat
0         10    1   1   1    37
1         10    2   2   2    37
2         10    3   4   8    37
3         20    1   5   5    44
4         20    2   1   1    44
5         20    3   9   2    44
The issue is that groupby apply is producing summary information:
timestamp
10    37
20    44
dtype: int64
This does not assign back to the DataFrame naturally as there are only 2 rows when the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this using groupby transform which is designed to produce a:
like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values
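As another way to scale the two summary rows back up, the per-group result can be mapped onto the grouping column; a small sketch reusing calc_some_stat from above:
# Series indexed by timestamp: {10: 37, 20: 44}
per_group = df.groupby('timestamp').apply(calc_some_stat)
# map each row's timestamp to its group's aggregate
df['stat'] = df['timestamp'].map(per_group)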

Merge two Series according to their index

After a long time of googling, I have not found a solution to my, probably often asked, problem.
I have two DataFrames:
DF1:               DF2:
       val                val
index              index
1        3         2        5
3       10         4       15
5       20         7       35
6       30         8       40
and need an output like this:
DF_out:
       val
index
1        3
2        5
3       10
4       15
5       20
6       30
7       35
8       40
DF1 and DF2 should be combined and sorted according to their indices.
Side notes:
DF1 and DF2 never have the same index twice
The values of the DataFrames are always sequential
I would very much appreciate your help!
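For reference, a reproducible construction of the two frames; a sketch, with names and values taken from the question:
import pandas as pd

DF1 = pd.DataFrame({'val': [3, 10, 20, 30]}, index=pd.Index([1, 3, 5, 6], name='index'))
DF2 = pd.DataFrame({'val': [5, 15, 35, 40]}, index=pd.Index([2, 4, 7, 8], name='index'))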
Use concat with DataFrame.sort_index:
df = pd.concat([DF1, DF2]).sort_index()
print(df)
       val
index
1        3
2        5
3       10
4       15
5       20
6       30
7       35
8       40

Summing columns in Dataframe that have matching column headers

I have a dataframe that currently looks somewhat like this.
import numpy as np
import pandas as pd

In [161]: pd.DataFrame(np.c_[s, t], columns=["M1", "M2", "M1", "M2"])
Out[161]:
      M1  M2  M1  M2
6/7    1   2   3   5
6/8    2   4   7   8
6/9    3   6   9   9
6/10   4   8   8  10
6/11   5  10  20  40
Except, instead of just four columns, there are approximately 1000 columns, running from M1 to ~M340 (multiple columns share the same header). I want to sum the values of matching columns, row by row. Ideally, the result DataFrame would look like:
      M1_sum  M2_sum
6/7        4       7
6/8        9      12
6/9       12      15
6/10      12      18
6/11      25      50
I wanted to somehow apply groupby and sum, but was unsure how to do that when some headers match 3 other columns while others match only one (or even none).
You probably want to groupby the first level over the column axis (axis=1), and then perform a .sum(), like:
>>> df.groupby(level=0,axis=1).sum().add_suffix('_sum')
   M1_sum  M2_sum
0       4       7
1       9      12
2      12      15
3      12      18
4      25      50
If we rename the last column to M1 instead, it will again group this correctly:
>>> df
   M1  M2  M1  M1
0   1   2   3   5
1   2   4   7   8
2   3   6   9   9
3   4   8   8  10
4   5  10  20  40
>>> df.groupby(level=0,axis=1).sum().add_suffix('_sum')
   M1_sum  M2_sum
0       9       2
1      17       4
2      21       6
3      22       8
4      65      10
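Note that groupby(..., axis=1) is deprecated in recent pandas releases (2.1 and later), so on newer versions an equivalent route is to transpose, group on the row index, and transpose back; a sketch under that assumption:
# group the transposed frame by its index labels, then transpose back
summed = df.T.groupby(level=0).sum().T.add_suffix('_sum')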

Replace by previous values

I have a DataFrame like the one below. The goal is to replace a specific value with the value that precedes it.
import pandas as pd

test = pd.DataFrame([2, 2, 3, 1, 1, 2, 4, 6, 43, 23, 4, 1, 3, 3, 1, 1, 1, 4, 5], columns=['A'])
If one wants to replace every 1 with the previous value, a possible solution is:
for li in test[test['A'] == 1].index:
    test['A'].iloc[li] = test['A'].iloc[li - 1]
However, it is very inefficient. Can you suggest a more efficient solution?
IIUC, replace 1 with np.nan, then ffill:
import numpy as np

test.replace(1, np.nan).ffill().astype(int)
Out[881]:
     A
0    2
1    2
2    3
3    3
4    3
5    2
6    4
7    6
8   43
9   23
10   4
11   4
12   3
13   3
14   3
15   3
16   3
17   4
18   5
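mask plus ffill expresses the same idea directly on the column, without a separate replace step; a minimal sketch:
# hide the 1s (they become NaN), then carry the last valid value forward
test['A'] = test['A'].mask(test['A'].eq(1)).ffill().astype(int)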
