Select specific rows from a large data frame - Python

I have a data frame with 790 rows. I want to create a new data frame that excludes rows 300 to 400 and keeps the rest.
I tried:
df.loc[[:300, 400:]]
df.iloc[[:300, 400:]]
df_new=df.drop(labels=range([300:400]), axis=0)
This does not work. How can I achieve this goal?
Thanks in advance

Use range, or numpy.r_ to join index ranges:
df_new=df.drop(range(300,400))
df_new=df.iloc[np.r_[0:300, 400:len(df)]]
Sample:
df = pd.DataFrame({'a':range(20)})
# print (df)
df1 = df.drop(labels=range(7,15))
print (df1)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
15 15
16 16
17 17
18 18
19 19
df1 = df.iloc[np.r_[0:7, 15:len(df)]]
print (df1)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
15 15
16 16
17 17
18 18
19 19

First select the index labels you want to drop, then create a new DataFrame:
i = df.iloc[299:400].index
new_df = df.drop(i)
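If the DataFrame has a plain RangeIndex, a boolean mask over the row positions gives the same result. A minimal sketch, assuming the 790-row frame from the question and that positions 300-399 are the ones to drop:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(790)})     # stand-in for the 790-row frame
pos = np.arange(len(df))                 # positional row numbers
df_new = df[(pos < 300) | (pos >= 400)]  # keep everything outside positions 300-399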

Related

How to group dataframe by column and receive new column for every group

I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group the data by timestamp and calculate the following statistic:
np.sum(v1*v2) for every timestamp. I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)
df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But the stat column contains all NaN values - what is wrong with my code?
We want groupby transform here, not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function, we need to join back to scale up the aggregated DataFrame:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)
df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # Needed to use join but also sets the col name
    on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally, as there are only 2 rows when the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this by using groupby transform, which is designed to produce a:
like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values
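A minimal sketch of that shape difference, reusing the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'timestamp': [10, 10, 10, 20, 20, 20], 'idx': [1, 2, 3, 1, 2, 3],
                   'v1': [1, 2, 4, 5, 1, 9], 'v2': [1, 2, 8, 5, 1, 2]})
prod = df['v1'] * df['v2']
print(prod.groupby(df['timestamp']).sum())               # 2 rows, one per timestamp
print(prod.groupby(df['timestamp']).transform('sum'))    # 6 rows, aligned with df.index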

Concatenating two Pandas DataFrames while maintaining index order

Basic question - I am trying to concatenate two DataFrames, with the resulting DataFrame preserving the index in order of the original two. For example:
df = pd.DataFrame({'Houses':[10,20,30,40,50], 'Cities':[3,4,7,6,1]}, index = [1,2,4,6,8])
df2 = pd.DataFrame({'Houses':[15,25,35,45,55], 'Cities':[1,8,11,14,4]}, index = [0,3,5,7,9])
Using pd.concat([df, df2]) simply appends df2 to the end of df. I am instead trying to concatenate them so that the index comes out in order (0 through 9).
Use concat with the sort parameter to avoid a warning, and then DataFrame.sort_index:
df = pd.concat([df, df2], sort=False).sort_index()
print(df)
Cities Houses
0 1 15
1 3 10
2 4 20
3 8 25
4 7 30
5 11 35
6 6 40
7 14 45
8 1 50
9 4 55
Try using:
print(df.T.join(df2.T).T.sort_index())
Output:
Cities Houses
0 1 15
1 3 10
2 4 20
3 8 25
4 7 30
5 11 35
6 6 40
7 14 45
8 1 50
9 4 55
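As a self-contained usage example with the frames from the question (nothing assumed beyond the sample data):
import pandas as pd

df = pd.DataFrame({'Houses': [10, 20, 30, 40, 50], 'Cities': [3, 4, 7, 6, 1]}, index=[1, 2, 4, 6, 8])
df2 = pd.DataFrame({'Houses': [15, 25, 35, 45, 55], 'Cities': [1, 8, 11, 14, 4]}, index=[0, 3, 5, 7, 9])
out = pd.concat([df, df2], sort=False).sort_index()  # stack the rows, then restore 0-9 order
print(out)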

Modify a column in first n rows based on other column value in a DataFrame

I want to modify a column in the first n rows based on another column's value in a DataFrame, like this:
df.loc[(df.A == i), 'B'][0:10] = 100
It did not work.
I have also tried sampling the first n rows, like this:
(df.sample(10)).loc[(df.A == i), 'B'] = 100
But it returned ValueError: cannot reindex from a duplicate axis.
You can use head and loc like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(100),'B':[1]*100})
df.loc[df[(df.A % 2 == 0)].head(10).index,'B'] = 100
print(df.head(25))
Output:
A B
0 0 100
1 1 1
2 2 100
3 3 1
4 4 100
5 5 1
6 6 100
7 7 1
8 8 100
9 9 1
10 10 100
11 11 1
12 12 100
13 13 1
14 14 100
15 15 1
16 16 100
17 17 1
18 18 100
19 19 1
20 20 1
21 21 1
22 22 1
23 23 1
24 24 1
I can only come up with this:
df.loc[(df.A == i) & (df.index.isin(df.iloc[:10, :].index)), 'B'] = 100
For the sample approach, this will work:
s=(df.sample(10))
s.loc[(df.A == i), 'B'] = 100
And based on a discussion on GitHub:
You should NEVER do this type of chained inplace setting. It is simply bad practice.
PS: (df.sample(10)).loc[(df.A == i), 'B'] = 100  # this is chained inplace setting
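A minimal sketch of a non-chained version of the sample approach: collect the matching index labels first, then write through df.loc in one step (the value of i here is assumed purely for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(100), 'B': [1] * 100})
i = 4                                # hypothetical condition value
s = df.sample(10, random_state=0)    # a copy of 10 random rows
rows = s.index[s['A'] == i]          # labels in the sample that satisfy the condition
df.loc[rows, 'B'] = 100              # single .loc assignment on the original frame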

Combine 2 tables without a header (with a common column)

I have 2 tables, say:
table1 = 101 1 2 3
201 4 5 6
301 7 8 9
table2 = 10 11 101 12
13 14 201 15
16 17 301 18
It is clear that table1 column 1 and table2 column 3 are the common columns. I want to join these 2 tables using pd.join, but the problem is that my tables do not have a header. So how can I do this using pandas?
EDIT
I am using pd.read_csv to read the tables, and my tables are text files.
outputtable = 101 1 2 3 10 11 12
201 4 5 6 13 14 15
301 7 8 9 16 17 18
and I would like to export the outputtable as a text file.
I'd set the index to the ordinal columns you want to merge on, then merge, and rename the index, since you need to reset the index afterwards:
In [121]:
import io
import pandas as pd
# read in data, you can ignore the io.StringIO bit and replace with your paths
t="""101 1 2 3
201 4 5 6
301 7 8 9"""
table1 = pd.read_csv(io.StringIO(t), sep='\s+', header=None)
t1="""10 11 101 12
13 14 201 15
16 17 301 18"""
table2 = pd.read_csv(io.StringIO(t1), sep='\s+', header=None)
# merge the tables after setting the index
merged = table1.set_index(0).merge(table2.set_index(2), left_index=True, right_index=True)
# rename the index so reset_index doesn't complain about column 0 already existing
merged.index.name = 'index'
merged = merged.reset_index()
merged
Out[121]:
index 1_x 2 3_x 0 1_y 3_y
0 101 1 2 3 10 11 12
1 201 4 5 6 13 14 15
2 301 7 8 9 16 17 18
You can now export the df as desired and pass header=False:
In [124]:
merged.to_csv(header=False, index=False)
Out[124]:
'101,1,2,3,10,11,12\n201,4,5,6,13,14,15\n301,7,8,9,16,17,18\n'
What you can easily do as well (I assume df1 and df2 are your two tables):
l1 = [''.join(df1.applymap(str)[c].tolist()) for c in df1]
l2 = [''.join(df2.applymap(str)[c].tolist()) for c in df2]
indexes = [l1.index(i) for i in list(set(l1)-set(l2))]
In [194]: pd.concat([df2, df1.iloc[:, indexes]], axis=1)
Out[194]:
0 1 2 3 1 2 3
0 10 11 101 12 1 2 3
1 13 14 201 15 4 5 6
2 16 17 301 18 7 8 9
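A minimal self-contained sketch of another option, not taken from the answers above: because header=None labels the columns 0 through 3, you can merge directly on those labels (overlapping labels get _x/_y suffixes, and 'output.txt' is a hypothetical path):
import io
import pandas as pd

t = """101 1 2 3
201 4 5 6
301 7 8 9"""
t1 = """10 11 101 12
13 14 201 15
16 17 301 18"""
table1 = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None)
table2 = pd.read_csv(io.StringIO(t1), sep=r'\s+', header=None)
merged = table1.merge(table2, left_on=0, right_on=2)   # join key appears once per table
merged.to_csv('output.txt', sep=' ', header=False, index=False)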

I want to get the relative index of a column in a pandas dataframe

I want to make a new column of the 5-day return for a stock, let's say. I am using a pandas DataFrame. I computed a moving average using the rolling_mean function, but I'm not sure how to reference rows like I would in a spreadsheet (B6-B1, for example). Does anyone know how I can do this index reference and subtraction?
sample data frame:
day price 5-day-return
1 10 -
2 11 -
3 15 -
4 14 -
5 12 -
6 18 I want to find this ((day 5 price) - (day 1 price))
7 20 then continue this down the list
8 19
9 21
10 22
Are you wanting this:
In [10]:
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df
Out[10]:
day price 5-day-return
0 1 10 0
1 2 11 0
2 3 15 0
3 4 14 0
4 5 12 0
5 6 18 8
6 7 20 9
7 8 19 4
8 9 21 7
9 10 22 10
shift returns the value at the given offset; we use this to subtract the price 5 rows back from the current row. fillna fills the NaN values that occur before the first valid calculation.
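As a side note on the moving average mentioned in the question: rolling_mean was removed from pandas long ago. A minimal sketch of the current rolling API, with a hypothetical '5-day-ma' column on the same sample data:
import pandas as pd

df = pd.DataFrame({'day': range(1, 11),
                   'price': [10, 11, 15, 14, 12, 18, 20, 19, 21, 22]})
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df['5-day-ma'] = df['price'].rolling(5).mean()   # NaN for the first 4 rows, then the 5-row average
print(df)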
