Drop row in Pandas Series and clean up index - python

I have a Pandas Series and based on a random number I want to pick a row (5 in the code example below) and drop that row. When the row is dropped I want to create a new index for the remaining rows (0 to 8). The code below:
print 'Original series: ', sample_mean_series
print 'Length of original series', len(sample_mean_series)
sample_mean_series = sample_mean_series.drop([5],axis=0)
print 'Series with item 5 dropped: ', sample_mean_series
print 'Length of modified series:', len(sample_mean_series)
print sample_mean_series.reindex(range(len(sample_mean_series)))
And this is the output:
Original series:
0 0.000074
1 -0.000067
2 0.000076
3 -0.000017
4 -0.000038
5 -0.000051
6 0.000125
7 -0.000108
8 -0.000009
9 -0.000052
Length of original series 10
Series with item 5 dropped:
0 0.000074
1 -0.000067
2 0.000076
3 -0.000017
4 -0.000038
6 0.000125
7 -0.000108
8 -0.000009
9 -0.000052
Length of modified series: 9
0 0.000074
1 -0.000067
2 0.000076
3 -0.000017
4 -0.000038
5 NaN
6 0.000125
7 -0.000108
8 -0.000009
My problem is that the last value (-0.000052) is dropped. I want to drop the "5 NaN" row instead and keep -0.000052, with the index running from 0 to 8. This is what I want it to look like:
0 0.000074
1 -0.000067
2 0.000076
3 -0.000017
4 -0.000038
5 0.000125
6 -0.000108
7 -0.000009
8 -0.000052

Somewhat confusingly, reindex does not mean "create a new index". To create a new index, just assign to the index attribute. So at your last step just do sample_mean_series.index = range(len(sample_mean_series)).
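A minimal sketch of that last step, using a small made-up series whose index has a gap left by a drop (the variable name follows the question):

import pandas as pd

sample_mean_series = pd.Series([0.000074, -0.000067, 0.000125], index=[0, 1, 6])  # gap left by the drop
sample_mean_series.index = range(len(sample_mean_series))  # assign a fresh 0..n-1 index in place
print(sample_mean_series)  # now labelled 0, 1, 2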

Here's a one-liner:
In [1]: s
Out[1]:
0 -0.942184
1 0.397485
2 -0.656745
3 1.415797
4 1.123858
5 -1.890870
6 0.401715
7 -0.193306
8 -1.018140
9 0.262998
I use the Series.drop method to drop row 5 and then use reset_index to re-number the indices to be consecutive. Without using reset_index, the indices would jump from 4 to 6 with no 5.
By default, reset_index will move the original index into a DataFrame and return it alongside the series values. Passing drop=True prevents this from happening.
In [2]: s2 = s.drop([5]).reset_index(drop=True)
In [3]: s2
Out[3]:
0 -0.942184
1 0.397485
2 -0.656745
3 1.415797
4 1.123858
5 0.401715
6 -0.193306
7 -1.018140
8 0.262998
Name: 0

To drop rows in a dataframe and clean up index:
b = df['amount'] > 10000                          # boolean mask of the rows to keep
df_dropped = df.drop(df[~b].index).reset_index()  # drop the rest and renumber the index
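If you don't want the old index kept as a column, the same line works with drop=True (a minor variant, not from the original answer):

df_dropped = df.drop(df[~b].index).reset_index(drop=True)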

df.reset_index(drop=True, inplace=True)
Will do exactly what you want.
When you reset the index, the old index is added as a column, and a new sequential index is used. You can use the drop parameter to avoid the old index being added as a column.
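As a minimal sketch of that difference (the small series here is made up for illustration):

import pandas as pd

s = pd.Series([0.1, 0.2, 0.3], index=[0, 1, 3])  # index has a gap, e.g. after a drop
print(s.reset_index())           # old index kept as an 'index' column; result is a DataFrame
print(s.reset_index(drop=True))  # old index discarded; result is a Series with a fresh 0..n-1 index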

Related

Adding multiple row values into one row, keeping the index interval the same as the number of rows added, in Python

I have a data frame with many columns (30-40) in a time series running continuously from 1 to 1440 minutes.
df
time colA colB colC.....
1 5 4 3
2 1 2 3
3 5 4 3
4 6 7 3
5 9 0 3
6 4 4 0
..
Now I want to add every two row values into one, but I want the interval of the 'time' index to stay the same as the number of rows I am adding. The resulting data frame is:
df
time colA colB colC.......
1 6 6 6
3 11 11 6
5 13 4 3
..
Here I added every two row values into one, and the 'time' index interval likewise spans 2 rows: 1, 3, 5, ...
Is it possible to achieve that?
Another way would be to group your data set every two rows and aggregate using sum on your 'colX' columns and mean on your time column. Chaining astype(int) truncates the resulting means to integers:
d = {col: 'sum' for col in [c for c in df.columns if c.startswith('col')]}  # sum every 'colX' column
df.groupby(df.index // 2).agg({**d, 'time': 'mean'}).astype(int)  # pair rows up, average 'time', truncate to int
prints back:
colA colB colC time
0 6 6 6 1
1 11 11 6 3
2 13 4 3 5
One way is to do the addition for all columns and then fix the time column (this assumes an even number of rows, so every row has a partner):
df_new = df[1::2].reset_index(drop=True) + df[::2].reset_index(drop=True)
df_new['time'] = df[::2]['time'].values

Group separated counting values in a pandas dataframe

I have following df
A B
0 1 10
1 2 20
2 NaN 5
3 3 1
4 NaN 2
5 NaN 3
6 1 10
7 2 50
8 NaN 80
9 3 5
The data consists of repeating sequences from 1-3, separated by a variable number of NaNs. I want to group each of these sequences from 1-3 and get the minimum value of column B within each sequence.
Desired Output something like:
B_min
0 1
6 5
Many thanks beforehand
The idea is to first remove rows with missing values using DataFrame.dropna, then group by a helper Series created by comparing A to 1 with Series.eq and taking the cumulative sum with Series.cumsum, take the minimum of B per group, and finally clean the result up into a one-column DataFrame:
df = (df.dropna(subset=['A'])
        .groupby(df['A'].eq(1).cumsum())['B']
        .min()
        .reset_index(drop=True)
        .to_frame(name='B_min'))
print(df)
B_min
0 1
1 5
All you need is df.groupby() with min() applied. Is this what you are expecting?
df.groupby('A')['B'].min()
Output:
A
1 10
2 20
3 1
Nan 80
If you don't want the NaNs in your group you can drop them using df.dropna()
df.dropna().groupby('A')['B'].min()

How to divide a column into 5 groups by the column's sorted values, and then add a column

How do I divide a column into 5 groups by the column's sorted values, and then add a column indicating the group?
For example:
import pandas as pd
df = pd.DataFrame({'x1':[1,2,3,4,5,6,7,8,9,10]})
and I want to add a column like this:
You probably want to look at pd.cut, and set the argument bins to an integer of however many groups you want, and the labels argument to False (to return integer indicators of your groups instead of ranges):
df['add_col'] = pd.cut(df['x1'], bins=5, labels=False) + 1
>>> df
x1 add_col
0 1 1
1 2 1
2 3 2
3 4 2
4 5 3
5 6 3
6 7 4
7 8 4
8 9 5
9 10 5
Note that the + 1 is only there so that your groups are numbered 1 to 5, as in your desired output. If you omit the + 1, they will be numbered 0 to 4.
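If what you actually want is five equal-sized groups based on the sorted values (quantiles), rather than the five equal-width value ranges that pd.cut produces, pd.qcut is the usual alternative; a minimal sketch (for this particular x1 both give the same grouping):

df['add_col'] = pd.qcut(df['x1'], q=5, labels=False) + 1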

Automating slicing procedures using pandas

I am currently using Pandas and Python to handle many of the repetitive tasks I need done for my master's thesis. At this point, I have written some code (with help from Stack Overflow) that, based on some event dates in one file, finds a start and end date to use as a date range in another file. These dates are then located and appended to an empty list, which I can then output to Excel. However, using the code below I get a dataframe with 5 columns and 400,000+ rows (which is basically what I want), but not in the layout I want when outputting to Excel. Below is my code:
end_date = pd.DataFrame(data=(df_sample['Date'] - pd.DateOffset(days=2)))
start_date = pd.DataFrame(data=(df_sample['Date'] - pd.offsets.BDay(n=252)))
merged_dates = pd.merge(end_date, start_date, left_index=True, right_index=True)

ff_factors = []
for index, row in merged_dates.iterrows():
    time_range = (df['Date'] > row['Date_y']) & (df['Date'] <= row['Date_x'])
    df_factor = df.loc[time_range]
    ff_factors.append(df_factor)

appended_data = pd.concat(ff_factors, axis=0)
I need the data to be 5 columns and 250 rows (the columns are variable identifiers) placed side by side, so that when outputting it to Excel I have, for example, columns A-D and then 250 rows for each column. This then needs to be repeated for columns E-H and so on. Using iloc, I can locate the first 250 observations with appended_data.iloc[0:250], with both 5 columns and 250 rows, and then output it to Excel.
Is there any way for me to automate the process, so that after selecting the first 250 and outputting them to Excel, it selects the next 250 and outputs them next to the first 250, and so on?
I hope the above is precise and clear, else I'm happy to elaborate!
EDIT:
The above picture illustrates what I get when outputting to Excel: 5 columns and 407,764 rows. What I need is to get this split up in the following way:
The second picture illustrates how I need the total sample to be split up. The first five columns and their corresponding 250 rows need to be as in the second picture. When I do the next split using iloc[250:500], I will get the next 250 rows, which need to be added after the initial five columns, and so on.
You can do this with a combination of np.reshape, which can be made to behave as desired on individual columns and should be much faster than a loop through the rows, and pd.concat, which joins the resulting dataframes back together:
def reshape_appended(df, target_rows, pad=4):
    df = df.copy()  # don't modify in-place
    # below line adds strings, '0000',...,'0004' to the column names
    # this ensures sorting the columns preserves the order
    df.columns = [str(i).zfill(pad)+df.columns[i] for i in range(len(df.columns))]
    # target number of new columns per column in df
    target_cols = len(df.index)//target_rows
    last_group = pd.DataFrame()
    # below conditional fires if there will be leftover rows - % is mod
    if len(df.index)%target_rows != 0:
        last_group = df.iloc[-(len(df.index)%target_rows):].reset_index(drop=True)
        df = df.iloc[:-(len(df.index)%target_rows)]  # keep rows that divide nicely
    # this is a large list comprehension, that I'll elaborate on below
    groups = [pd.DataFrame(df[col].values.reshape((target_rows, target_cols),
                                                  order='F'),
                           columns=[str(i).zfill(pad)+col for i in range(target_cols)])
              for col in df.columns]
    if not last_group.empty:  # if there are leftover rows, add them back
        last_group.columns = [pad*'9'+col for col in last_group.columns]
        groups.append(last_group)
    out = pd.concat(groups, axis=1).sort_index(axis=1)
    out.columns = out.columns.str[2*pad:]  # remove the extra characters in the column names
    return out
last_group takes care of any rows that don't divide evenly into sets of 250. The playing around with column names enforces proper sorting order.
df[col].values.reshape((target_rows, target_cols), order='F')
Reshapes the values in the column col of df into the shape specified by the tuple (target_rows, target_cols), with the ordering Fortran uses, indicated by F.
columns=[str(i).zfill(pad)+col for i in range(target_cols)]
is just giving names to these columns, with an eye to establishing proper ordering afterward.
Ex:
df = pd.DataFrame(np.random.randint(0, 10, (23, 3)), columns=list('abc'))
reshape_appended(df, 5)
Out[160]:
a b c a b c a b c a b c a b c
0 8 3 0 4 1 9 5 4 7 2 3 4 5.0 7.0 2.0
1 1 6 1 3 5 1 1 6 0 5 9 4 6.0 0.0 1.0
2 3 1 3 4 3 8 9 3 9 8 7 8 7.0 3.0 2.0
3 4 0 1 5 5 6 6 4 4 0 0 3 NaN NaN NaN
4 9 7 3 5 7 4 6 5 8 9 5 5 NaN NaN NaN
df
Out[161]:
a b c
0 8 3 0
1 1 6 1
2 3 1 3
3 4 0 1
4 9 7 3
5 4 1 9
6 3 5 1
7 4 3 8
8 5 5 6
9 5 7 4
10 5 4 7
11 1 6 0
12 9 3 9
13 6 4 4
14 6 5 8
15 2 3 4
16 5 9 4
17 8 7 8
18 0 0 3
19 9 5 5
20 5 7 2
21 6 0 1
22 7 3 2
My best guess at solving the problem would be to loop until the counter is greater than the length, so:
i = 250  # counter
j = 0    # left limit
for x in range(len(appended_data)):
    appended_data.iloc[j:i]
    i += 250
    if i > len(appended_data):
        appended_data.iloc[j:len(appended_data)]
        break
    else:
        j = i
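To actually write each 250-row slice next to the previous one, one option is pandas' ExcelWriter together with the startcol offset of to_excel. A rough sketch, assuming appended_data is the concatenated frame from the question and that the output file name is arbitrary:

import pandas as pd

chunk = 250
n_cols = appended_data.shape[1]  # 5 columns in the question

with pd.ExcelWriter('output.xlsx') as writer:  # file name is an assumption
    for k, start in enumerate(range(0, len(appended_data), chunk)):
        block = appended_data.iloc[start:start + chunk].reset_index(drop=True)
        # each block of up to 250 rows is shifted n_cols columns further to the right
        block.to_excel(writer, sheet_name='Sheet1', startcol=k * n_cols, index=False)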

Python Pandas value-dependent column creation

I have a pandas DataFrame with columns "Time" and "A". For each row, df["Time"] is an integer timestamp and df["A"] is a float. I want to create a new column "B" which has the value of df["A"], but the one that occurs at or immediately before five seconds in the future. I can do this iteratively as:
for i in df.index:
    df["B"][i] = df["A"][max(df[df["Time"] <= df["Time"][i] + 5].index)]
However, the df has tens of thousands of records so this takes far too long, and I need to run this a few hundred times so my solution isn't really an option. I am somewhat new to pandas (and only somewhat less new to programming in general) so I'm not sure if there's an obvious solution to this supported by pandas.
It would help if I had a way of referencing the specific value of df["Time"] in each row while creating the column, so I could do something like:
df["B"] = df["A"][max(df[df["Time"] <= df["Time"][corresponding_row]+5].index)]
Thanks.
Edit: Here's an example of what my goal is. If the dataframe is as follows:
Time A
0 0
1 1
4 2
7 3
8 4
10 5
12 6
15 7
18 8
20 9
then I would like the result to be:
Time A B
0 0 2
1 1 2
4 2 4
7 3 6
8 4 6
10 5 7
12 6 7
15 7 9
18 8 9
20 9 9
where each line in B comes from the value of A in the row whose Time is greater by at most 5. So if Time is the index as well, then df["B"][0] = df["A"][4], since 4 is the largest time which is at most 5 greater than 0. In code, 4 = max(df["Time"][df["Time"] <= 0 + 5]), which is why df["B"][0] is df["A"][4].
Use tshift. You may need to resample first to fill in any missing values. I don't have time to test this, but try this.
df['B'] = df.resample('s', how='ffill').tshift(5, freq='s').reindex_like(df)
And a tip for getting help here: if you provide a few rows of sample data and an example of your desired result, it's easy for us to copy/paste and try out a solution for you.
Edit
OK, looking at your example data, let's leave your Time column as integers.
In [59]: df
Out[59]:
A
Time
0 0
1 1
4 2
7 3
8 4
10 5
12 6
15 7
18 8
20 9
Make an array containing the first and last Time values and all the integers in between.
In [60]: index = np.arange(df.index.values.min(), df.index.values.max() + 1)
Make a new DataFrame with all the gaps filled in.
In [61]: df1 = df.reindex(index, method='ffill')
Make a new column with the same data shifted up by 5 -- that is, looking forward in time by 5 seconds.
In [62]: df1['B'] = df1.shift(-5)
And now drop all the filled-in times we added, taking only values from the original Time index.
In [63]: df1.reindex(df.index)
Out[63]:
A B
Time
0 0 2
1 1 2
4 2 4
7 3 6
8 4 6
10 5 7
12 6 7
15 7 9
18 8 NaN
20 9 NaN
How you fill in the last values, for which there is no "five seconds later", is up to you. Judging from your desired output, maybe use fillna with a constant value set to the last value in column A.
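For instance, a hedged one-liner in the spirit of that suggestion (df and df1 as defined above):

df1['B'] = df1['B'].fillna(df['A'].iloc[-1])  # pad the tail with the last observed value of A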
