Drop preceding rows based on conditions - Python

df:
          Id                timestamp  data        Date  sig  event1  Start  End  Timediff2  datadiff2         B
51253  51494  2020-01-27 06:22:08.330  19.5  2020-01-27 -1.0     0.0    NaN  1.0        NaN        NaN       NaN
51254  51495  2020-01-27 06:22:08.430  19.0  2020-01-27  1.0     1.0    0.0  0.0        0.1        NaN       NaN
51255  51496  2020-01-27 07:19:06.297  19.5  2020-01-27  1.0     0.0    1.0  0.0   3417.967        0.0  0.000000
51256  51497  2020-01-27 07:19:06.397  20.0  2020-01-27  1.0     0.0    0.0  0.0        0.1        1.0  0.000293
51259  51500  2020-01-27 07:32:19.587  20.5  2020-01-27  1.0     0.0    0.0  1.0    793.290        1.0  0.001261
I have two questions:
I want to drop the rows that come immediately before the rows where Timediff2 == 0.1.
Add another condition: drop these rows, unless for that row Start == 1.

I suggest the following: first I create a flag column ("top") marking the row just before Timediff2 == 0.1, then I filter:
import pandas as pd
import numpy as np

df = pd.DataFrame({"Start": [np.nan, 0.0, 1.0, 0.0, 0.0],
                   "Timediff2": [np.nan, 0.1, 3417, 0.1, 793]})
# flag rows that immediately precede a row where Timediff2 == 0.1
df["top"] = (df["Timediff2"] == 0.1).shift(-1)
# keep rows that are not flagged (a NaN flag also fails this test), unless Start == 1
df = df.loc[(df["Start"] == 1) | (df["top"] == False), :]
df = df.drop(columns="top")
The result is:
Start Timediff2
1 0.0 0.1
2 1.0 3417.0
3 0.0 0.1
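
One caveat with this approach: shift(-1) leaves NaN in the last row's flag, and NaN == False evaluates to False, so the last row is always dropped unless its Start == 1. If that is not intended, here is a minimal sketch of a variant (same toy frame as above) that treats the trailing NaN as "not a predecessor":

import pandas as pd
import numpy as np

df = pd.DataFrame({"Start": [np.nan, 0.0, 1.0, 0.0, 0.0],
                   "Timediff2": [np.nan, 0.1, 3417, 0.1, 793]})
# fill_value=False keeps the last row from being dropped by the NaN flag
top = (df["Timediff2"] == 0.1).shift(-1, fill_value=False)
df = df[(df["Start"] == 1) | ~top]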


How to read a csv with rows of NUL ('\x00') into pandas?

I have a set of CSV files with Date and Time as the first two columns (no headers in the files). The files open up fine in Excel, but when I try to read them into Python using pandas read_csv, only the first Date is returned, whether or not I try a type conversion.
When I open a file in Notepad, it is not simply comma-separated and has lots of space before each line after line 1; I have tried skipinitialspace=True to no avail.
I have also tried various type conversions, but none work. I am currently using parse_dates=[['Date','Time']], infer_datetime_format=True, dayfirst=True.
Example output (no conversion):
0 1 2 3 4 ... 12 13 14 15 16
0 02/03/20 15:13:39 5.5 5.8 42.84 ... 30.0 79.0 0.0 0.0 0.0
1 NaN 15:13:49 5.5 5.8 42.84 ... 30.0 79.0 0.0 0.0 0.0
2 NaN 15:13:59 5.5 5.7 34.26 ... 30.0 79.0 0.0 0.0 0.0
3 NaN 15:14:09 5.5 5.7 34.26 ... 30.0 79.0 0.0 0.0 0.0
4 NaN 15:14:19 5.5 5.4 17.10 ... 30.0 79.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
39451 NaN 01:14:27 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39452 NaN 01:14:37 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39453 NaN 01:14:47 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39454 NaN 01:14:57 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39455 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
And with parse_dates etc:
Date_Time pH1 SP pH Ph1 PV pH ... 1 2 3
0 02/03/20 15:13:39 5.5 5.8 ... 0.0 0.0 0.0
1 nan 15:13:49 5.5 5.8 ... 0.0 0.0 0.0
2 nan 15:13:59 5.5 5.7 ... 0.0 0.0 0.0
3 nan 15:14:09 5.5 5.7 ... 0.0 0.0 0.0
4 nan 15:14:19 5.5 5.4 ... 0.0 0.0 0.0
... ... ... ... ... ... ... ...
39451 nan 01:14:27 5.5 8.4 ... 0.0 0.0 0.0
39452 nan 01:14:37 5.5 8.4 ... 0.0 0.0 0.0
39453 nan 01:14:47 5.5 8.4 ... 0.0 0.0 0.0
39454 nan 01:14:57 5.5 8.4 ... 0.0 0.0 0.0
39455 nan nan NaN NaN ... NaN NaN NaN
Data copied from Notepad (there is actually more whitespace in front of each line but it wouldn't work here):
Data from 67.csv
02/03/20,15:13:39,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:13:49,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:13:59,5.5,5.7,34.26,7.2,6.8,10.63,60.0,22.3,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:14:09,5.5,5.7,34.26,7.2,6.8,10.63,60.0,15.3,300,45,30,79,0.0,0.0, 0.0
02/03/20,15:14:19,5.5,5.4,17.10,7.2,6.8,10.63,60.0,50.2,300,86,30,79,0.0,0.0, 0.0
The files also display correctly in Excel, so I know the information is there and readable.
Code
import sys
import numpy as np
import pandas as pd
from datetime import datetime
from tkinter import filedialog
from tkinter import *
def import_file(filename):
    print('\nOpening ' + filename + ":")
    ## Read the data in the file
    df = pd.read_csv(filename, header=None, low_memory=False)
    print(df)
    df['Date_Time'] = pd.to_datetime(df[0] + ' ' + df[1])
    df.drop(columns=[0, 1], inplace=True)
    print(df)
    return df
filenames = []
print('Select files to read, Ctrl or Shift for Multiples')
TkWindow = Tk()
TkWindow.withdraw()  # we don't want a full GUI, so keep the root window from appearing
## Show an "Open" dialog box and return the path to the selected file
filenames = filedialog.askopenfilename(title='Open data file', filetypes=(("Comma delimited", "*.csv"),), multiple=True)
TkWindow.destroy()
if len(filenames) == 0:
    print('No files selected - Exiting program.')
    sys.exit()
else:
    print('\n'.join(filenames))
## Read the data from the specified file/s
print('\nReading data file/s')
dfs = []
for filename in filenames:
    dfs.append(import_file(filename))
if len(dfs) > 1:
    print('\nCombining data files.')
The file is filled with NUL characters, '\x00', which need to be removed. Clean the rows first, then load the data from d with pandas.DataFrame.
import pandas as pd
import string  # to make column names

# the issue is that the file is filled with NUL, not whitespace
def import_file(filename):
    # open the file and clean it
    with open(filename) as f:
        d = list(f.readlines())
    # replace NUL, strip whitespace from the ends of the strings, split each string into a list
    d = [v.replace('\x00', '').strip().split(',') for v in d]
    # remove some empty rows
    d = [v for v in d if len(v) > 2]
    # load the file with pandas
    df = pd.DataFrame(d)
    # combine columns 0 and 1 into a datetime
    df['datetime'] = pd.to_datetime(df[0] + ' ' + df[1])
    # drop columns 0 and 1
    df.drop(columns=[0, 1], inplace=True)
    # set datetime as the index
    df.set_index('datetime', inplace=True)
    # convert data in columns to floats
    df = df.astype('float')
    # give character column names
    df.columns = list(string.ascii_uppercase)[:len(df.columns)]
    return df.copy()

# call the function
dfs = list()
filenames = ['67.csv']
for filename in filenames:
    dfs.append(import_file(filename))
display(dfs[0])
A B C D E F G H I J K L M N O
datetime
2020-02-03 15:13:39 5.5 5.8 42.84 7.2 6.8 10.63 60.0 0.0 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:13:49 5.5 5.8 42.84 7.2 6.8 10.63 60.0 0.0 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:13:59 5.5 5.7 34.26 7.2 6.8 10.63 60.0 22.3 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:14:09 5.5 5.7 34.26 7.2 6.8 10.63 60.0 15.3 300.0 45.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:14:19 5.5 5.4 17.10 7.2 6.8 10.63 60.0 50.2 300.0 86.0 30.0 79.0 0.0 0.0 0.0
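
For what it is worth, another common way to handle NUL-padded files is to strip the bytes up front and hand the cleaned buffer straight to read_csv. A minimal sketch, assuming the same headerless '67.csv' layout as above (add skiprows if the file carries a banner line):

import io
import pandas as pd

# read raw bytes and drop the NUL padding before parsing
with open('67.csv', 'rb') as f:
    cleaned = f.read().replace(b'\x00', b'')

# skipinitialspace handles the leading whitespace on each line
df = pd.read_csv(io.BytesIO(cleaned), header=None, skipinitialspace=True)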

Subset pandas dataframe based on first non-zero occurrence

Here is the sample dataframe:
Trade_signal
2007-07-31 0.0
2007-08-31 0.0
2007-09-28 0.0
2007-10-31 0.0
2007-11-30 0.0
2007-12-31 0.0
2008-01-31 0.0
2008-02-29 0.0
2008-03-31 0.0
2008-04-30 0.0
2008-05-30 0.0
2008-06-30 0.0
2008-07-31 -1.0
2008-08-29 0.0
2008-09-30 -1.0
2008-10-31 -1.0
2008-11-28 -1.0
2008-12-31 0.0
2009-01-30 -1.0
2009-02-27 -1.0
2009-03-31 0.0
2009-04-30 0.0
2009-05-29 1.0
2009-06-30 1.0
2009-07-31 1.0
2009-08-31 1.0
2009-09-30 1.0
2009-10-30 0.0
2009-11-30 1.0
2009-12-31 1.0
1 represents buy and -1 represents sell. I want to subset the dataframe so that the new dataframe starts with the first occurrence of 1. Expected output:
2009-05-29 1.0
2009-06-30 1.0
2009-07-31 1.0
2009-08-31 1.0
2009-09-30 1.0
2009-10-30 0.0
2009-11-30 1.0
2009-12-31 1.0
Please suggest the way forward. Apologies if this is a repeated question.
Simply do the following; here df["Trade_signal"] is the column containing the buy/sell data. Note .loc is used rather than .iloc, because the index holds date labels, not integer positions:
new_df = df.loc[df[df["Trade_signal"] == 1].index[0]:, :]
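An equivalent label-based sketch uses idxmax, which returns the first index label where the condition is True (this assumes at least one 1 exists in the column):

first_buy = (df["Trade_signal"] == 1).idxmax()
new_df = df.loc[first_buy:]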

How to drop rows with NaN cells in a dask dataframe?

I have a dask dataframe in which I want to delete all the rows that have a NaN value in the "selling_price" column.
image_features_df.head(3)
feat1 feat2 feat3 ... feat25087 feat25088 fid selling_price
0 0.0 0.0 0.0 ... 0.0 0.0 2 269.00
1 0.2 0.0 0.8 ... 0.0 0.3 22 NAN
2 0.5 0.0 0.4 ... 0.0 0.1 70 NAN
The above table shows a view of my dataframe.
I want the output to be a dask dataframe without any NaN cells in my "selling_price" column.
Expected Output:
image_features_df.head(3)
feat1 feat2 feat3 ... feat25087 feat25088 fid selling_price
0 0.0 0.0 0.0 ... 0.0 0.0 2 269.00
4 0.3 0.1 0.0 ... 0.0 0.3 26 1720.00
6 0.8 0.0 0.0 ... 0.0 0.1 50 18145.25
Try the following; it will remove a row if NaN is found in the selling_price column.
df.dropna(subset=['selling_price'])
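Note that dask is lazy, so assign the result back and materialize it when needed. A minimal usage sketch, assuming image_features_df is the dask dataframe from the question:

# dropna returns a new, lazy dask dataframe; nothing executes until .head()/.compute()
image_features_df = image_features_df.dropna(subset=['selling_price'])
print(image_features_df.head(3))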

Add missing datetime columns to grouped dataframe

Is it possible to add the missing date columns from the created date_range to the grouped dataframe df without a for loop, filling zeros as the missing values?
date_range has 7 date elements; df has 4 date columns. So how do I add the 3 missing columns to df?
import pandas as pd
from datetime import datetime
start = datetime(2018,6,4, )
end = datetime(2018,6,10,)
date_range = pd.date_range(start=start, end=end, freq='D')
DatetimeIndex(['2018-06-04', '2018-06-05', '2018-06-06', '2018-06-07',
'2018-06-08', '2018-06-09', '2018-06-10'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame({
'date':
['2018-06-07', '2018-06-10', '2018-06-09','2018-06-09',
'2018-06-08','2018-06-09','2018-06-08','2018-06-10',
'2018-06-10','2018-06-10',],
'name':
['sogan', 'lyam','alex','alex',
'kovar','kovar','kovar','yamo','yamo','yamo',]
})
df['date'] = pd.to_datetime(df['date'])
df = (df
.groupby(['name', 'date',])['date',]
.count()
.unstack(fill_value=0)
)
df
date date date date
date 2018-06-07 00:00:00 2018-06-08 00:00:00 2018-06-09 00:00:00 2018-06-10 00:00:00
name
alex 0 0 2 0
kovar 0 2 1 0
lyam 0 0 0 1
sogan 1 0 0 0
yamo 0 0 0 3
I would pivot the table to turn the date columns into rows, then use pandas' .asfreq function, as below:
DataFrame.asfreq(freq, method=None, how=None, normalize=False, fill_value=None)
source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.asfreq.html
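A sketch of that route, assuming the dates already form the index after pivoting (as in the solution below). One caveat: asfreq only fills gaps between the existing minimum and maximum dates, so it would not add 2018-06-04 through 2018-06-06 here, which is why .reindex ends up more suitable:

df = df.asfreq('D', fill_value=0)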
Thanks to Sina Shabani for the clue about making the date columns rows. In this situation, setting the date as the index and using .reindex turned out to be more suitable:
df = (df.groupby(['date', 'name'])['name']
.size()
.reset_index(name='count')
.pivot(index='date', columns='name', values='count')
.fillna(0))
df
name alex kovar lyam sogan yamo
date
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.index = pd.DatetimeIndex(df.index)
df = (df.reindex(pd.date_range(start, freq='D', periods=7), fill_value=0)
.sort_index())
df
name alex kovar lyam sogan yamo
2018-06-04 0.0 0.0 0.0 0.0 0.0
2018-06-05 0.0 0.0 0.0 0.0 0.0
2018-06-06 0.0 0.0 0.0 0.0 0.0
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.T
       2018-06-04  2018-06-05  2018-06-06  2018-06-07  2018-06-08  2018-06-09  2018-06-10
name
alex          0.0         0.0         0.0         0.0         0.0         2.0         0.0
kovar         0.0         0.0         0.0         0.0         2.0         1.0         0.0
lyam          0.0         0.0         0.0         0.0         0.0         0.0         1.0
sogan         0.0         0.0         0.0         1.0         0.0         0.0         0.0
yamo          0.0         0.0         0.0         0.0         0.0         0.0         3.0

Drop values satisfying condition plus arbitrary number of next values in a pandas DataFrame

So my final goal is to drop values in one column of a pandas DataFrame according to some condition on another column of the same DataFrame, plus several next values e.g.:
import pandas as pd
df = pd.DataFrame({'a': [0, 0.5, 0.2, 0, 0, 0, 0, 0.2, 0, 0, 0, 0.1, 0,],
'b': [0.1, -0.5, -0.3, None, 100., 0.2, 0.1, None, -0.3, -0.3, None, None, None]},
index=pd.date_range('2015/1/1', freq='D', periods=13))
df.loc[df['a'] > 0, 'b'] = None
print(df)
Result:
a b
2015-01-01 0.0 0.1
2015-01-02 0.5 NaN
2015-01-03 0.2 NaN
2015-01-04 0.0 NaN
2015-01-05 0.0 100.0
2015-01-06 0.0 0.2
2015-01-07 0.0 0.1
2015-01-08 0.2 NaN
2015-01-09 0.0 -0.3
2015-01-10 0.0 -0.3
2015-01-11 0.0 NaN
2015-01-12 0.1 NaN
2015-01-13 0.0 NaN
So this will drop the records where the condition is satisfied, but how do I drop the next 3 records after the condition was satisfied too? My desired output would look something like this:
a b
2015-01-01 0.0 0.1
2015-01-02 0.5 NaN
2015-01-03 0.2 NaN
2015-01-04 0.0 NaN
2015-01-05 0.0 NaN
2015-01-06 0.0 NaN
2015-01-07 0.0 0.1
2015-01-08 0.2 NaN
2015-01-09 0.0 NaN
2015-01-10 0.0 NaN
2015-01-11 0.0 NaN
2015-01-12 0.1 NaN
2015-01-13 0.0 NaN
Note that there can be sequential rows with a > 0.
[EDIT]: I seem to have found a solution:
for pos, i in df.iterrows():
    if pd.isnull(i['a']):
        pass
    elif i['a'] > 0:
        # .ix has been removed from pandas; use .loc with a 3-day Timedelta window
        df.loc[pos:pos + pd.Timedelta(days=3), 'b'] = None
    else:
        pass
This is rather slow, though, so any suggestions are welcome.
We can take the first value of the boolean condition's index, then slice the df using loc and set the following values:
In [392]:
# take the first value of the index
idx = (df['a'] > 0).index[0]
idx
Out[392]:
Timestamp('2015-01-01 00:00:00', offset='D')
In [393]:
# offset the range by 1 day at the begin and end points
df.loc[idx + pd.Timedelta(days=1):idx + pd.Timedelta(days=4), 'b'] = None
df
Out[393]:
a b
2015-01-01 0.0 0.1
2015-01-02 0.5 NaN
2015-01-03 0.0 NaN
2015-01-04 0.0 NaN
2015-01-05 0.0 NaN
EDIT
This is an alternative method extending the above answer, which worked on your original edit data. The new method uses the same principle, but we offset each matching timestamp from the index to build the window:
In [39]:
idx = df[df.a > 0].index
for index in idx:
    # the window runs from each matching timestamp to 3 days after it
    df.loc[index:index + pd.Timedelta(days=3), 'b'] = None
df
Out[39]:
a b
2015-01-01 0.0 0.1
2015-01-02 0.5 NaN
2015-01-03 0.2 NaN
2015-01-04 0.0 NaN
2015-01-05 0.0 NaN
2015-01-06 0.0 NaN
2015-01-07 0.0 0.1
2015-01-08 0.2 NaN
2015-01-09 0.0 NaN
2015-01-10 0.0 NaN
2015-01-11 0.0 NaN
2015-01-12 0.1 NaN
2015-01-13 0.0 NaN
Timings, however, show that your method is twice as fast; it is unclear whether my method will scale better, as that depends on the size and distribution of your data.
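
For completeness, a vectorized sketch of the same idea: mark the rows where a > 0, then extend the mask over the next 3 rows with a rolling window (this assumes one row per day, as in the example data):

# a window of 4 covers the matching row plus the 3 rows after it
mask = (df['a'] > 0).astype(int).rolling(window=4, min_periods=1).max().astype(bool)
df.loc[mask, 'b'] = None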
