Select dataframe rows defining a date range that contains a needle date - python

There are lots of answers that make it easy, given a date range, to select the rows that fall into it.
I don't want that.
I have data like this:
id other_flags d_dt_start d_dt_end
0 28 ... 1993-02-12 1993-12-31
1 28 ... 1993-02-12 1993-12-31
2 46 ... 1986-01-15 1993-09-30
3 46 ... 1986-01-15 1993-09-30
4 46 ... 1986-01-15 1993-09-30
Given a date, say 1986-06-15, I want to select the rows whose range contains it, giving me the subset with indices 2, 3, and 4. Currently I'm doing it with something like this:
subs = subs[(time >= subs['d_dt_start'])   # needle on or after the start
            & (time <= subs['d_dt_end'])]  # needle on or before the end
There has got to be a more elegant way to do this, something like the between method but the other way around: instead of 'you have a date, I have a date range', it's 'you have a date range, I have a date'.
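For what it's worth, a minimal sketch of two ways to phrase this, using the frame and needle date from the question (IntervalIndex is just one option, not necessarily more elegant):

import pandas as pd

# Reconstruction of the frame from the question
subs = pd.DataFrame({
    "id": [28, 28, 46, 46, 46],
    "d_dt_start": pd.to_datetime(["1993-02-12", "1993-02-12",
                                  "1986-01-15", "1986-01-15", "1986-01-15"]),
    "d_dt_end": pd.to_datetime(["1993-12-31", "1993-12-31",
                                "1993-09-30", "1993-09-30", "1993-09-30"]),
})
needle = pd.Timestamp("1986-06-15")

# The question's mask, written with .le/.ge for symmetry
hits = subs[subs["d_dt_start"].le(needle) & subs["d_dt_end"].ge(needle)]

# Or treat each row as an interval and ask which intervals contain the date
iv = pd.IntervalIndex.from_arrays(subs["d_dt_start"], subs["d_dt_end"], closed="both")
hits = subs[iv.contains(needle)]

print(hits.index.tolist())  # [2, 3, 4]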

Related

DataFrame.sort_values only looks at first digit rather than entire number

I have a DataFrame that looks like this,
del Ticker Open Interest
0 1 SPY 20,996,893
1 3 IWM 7,391,074
2 5 EEM 6,545,445
...
47 46 MU 1,268,256
48 48 NOK 1,222,759
49 50 ET 1,141,467
I want df['del'] to go in order from the lowest number to the greatest, but when I write df.sort_values('del') I get
del Ticker
0 1 SPY
29 10 BAC
5 11 GE
It appears to sort based on the first digit rather than the whole number. Am I using the correct code, or do I need to change it completely?
Assuming you have the numbers stored as strings, you can either:
add leading zeros to the strings, which makes lexicographic order match numeric order
df["del"] = df["del"].map(lambda x: x.zfill(10))
df = df.sort_values('del')
or convert the type to integer
df["del"] = df["del"].astype('int') # as recommended by Alex.Kh in comment
#df["del"] = df["del"].map(int) # my initial answer
df = df.sort_values('del')
I also noticed that del seems to be sorted the same way as your index, so you could even do:
df = df.sort_index()
To be explicit about the direction, .sort_values('del', ascending=True) goes from lowest to highest (ascending=True is the default).
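A minimal repro of the pitfall and the numeric fix (the ticker rows here are made up for illustration):

import pandas as pd

# String numbers sort lexicographically: "10" sorts before "2"
df = pd.DataFrame({"del": ["1", "10", "3", "2"],
                   "Ticker": ["SPY", "BAC", "IWM", "GE"]})

print(df.sort_values("del")["del"].tolist())  # ['1', '10', '2', '3'] - string order

df["del"] = df["del"].astype(int)             # numeric dtype fixes the ordering
print(df.sort_values("del")["del"].tolist())  # [1, 2, 3, 10]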

Pandas df conditionals: changing value name if pd.value_counts is less than something

I have a table of models in df2['model'], and
pd.value_counts(df2['model'].values, sort=True)
returns this:
'''
MONSTER 331
MULTISTRADA 134
HYPERMOTARD 69
SCRAMBLER 63
SUPERSPORT 31
...
900 1
T-MAX 1
FC 1
GTS 1
SCOUT 1
Length: 75, dtype: int64
'''
I want to rename all the values in df2['model'] that have a count < 5 to 'OTHER'.
Can anyone help me with how to go about this?
With the first line of code below you can get a list of the categories you want to change to 'OTHER': it takes your value counts and selects the rows that meet the condition you want (in this case, fewer than 5 occurrences).
Then you select the rows of the dataframe whose model cell is in that list of categories and change the value to 'OTHER'.
other_classes = data['model'].value_counts()[data['model'].value_counts() < 5].index
data.loc[data['model'].isin(other_classes), 'model'] = 'OTHER'
Hope it helps
I suspect it is not at all elegant or pythonic, but this worked in the end:
df_pooled_other = (df_final
    .assign(freq=df_final.groupby('model name')['model name'].transform('count'))
    .sort_values(by=['freq', 'model name', 'Age in months_x_x'],
                 ascending=[False, True, True]))
df_pooled_other['model name'] = np.where(df_pooled_other['freq'] <= 5,
                                         'Other', df_pooled_other['model name'])
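Both answers boil down to the same idea; a compact sketch combining them, assuming df2 as in the question:

import numpy as np

# Per-row frequency of each model, then relabel the rows under the cutoff
freq = df2.groupby("model")["model"].transform("count")
df2["model"] = np.where(freq < 5, "OTHER", df2["model"])
print(df2["model"].value_counts())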

Using 3 criteria for a Table Lookup Python

Backstory: I'm fairly new to Python, having only ever used MATLAB before.
I am looking to pull a specific value from a table based on the data I have.
The data I have is
Temperatures = [0.8,0.1,-0.8,-1.4,-1.7,-1.5,-2,-1.7,-1.7,-1.3,-0.7,-0.2,0.3,1.4,1.4,1.5,1.2,1,0.9,1.3,1.7,1.7,1.6,1.6]
Hour of the Day =
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
This is all data for a Monday.
My Monday table looks like this:
Temp | Hr0 | Hr1 | Hr2 ...
-15 < t <= -10 | 0.01 | 0.02 | 0.06 ...
-10 < t <= -5 | 0.04 | 0.03 | 0.2 ...
with the temperature bins increasing in steps of 5 up to 30, and the hour columns running through Hr23. The values in the table are constants that I would like to look up based on the temperature and hour.
For example, I'd like to be able to say:
print(monday(1, 1))  # 0.01
I will also be doing this for every day of the week as part of a mass data analysis, hence the need for it to be efficient.
What I've done so far:
So I have stored all of my tables in dictionaries that look kind of like this:
monday_hr0 = [0.01,0.04, ... ]
That is, one list per column, which I then index by the temperature bin.
What I have now is a bunch of loops that looks like this:
for i in range(0, 365):
    for j in range(0, 24):
        if Day[i] == monday:
            if hr[i + 24*j] == 0:
                if temp[i] == -15:
                    constant.append(monday_hr1[0])
                ...
            if hr[i + 24*j] == 1:
                if temp[i] == -15:
                    constant.append(monday_hr2[0])
                ...
            ...
        elif Day[i] == tuesday:
            if hr[i + 24*j] == 0:
                if temp[i] == -15:
                    constant.append(tuesday_hr1[0])
                ...
            if hr[i + 24*j] == 1:
                if temp[i] == -15:
                    constant.append(tuesday_hr2[0])
                ...
            ...
...
I'm basically saying: if it's a Monday, use this table; then if it's this hour, use this column; then if it's this temperature, use this cell. This is VERY inefficient, however.
I'm sure there's a quicker way but I can't wrap my head around it. Thank you very much for your help!
Okay, bear with me here, I'm on mobile. I'll try to write up a solution.
I am assuming the following:
you have a dictionary called day_data which contains the table of data for each day of the week.
you have a dictionary called days which maps 0-6 to a day of the week; 0 is Monday, 6 is Sunday.
you have a list of temperatures you want something done with
you have a time of the day you want to use to pick out the appropriate data from your day_data. You want to do this for each day of the year.
We should only have to iterate once through all 365 days and once through each hour of the day.
heat_load_days = {}
for day_index in range(365):
    day = days[day_index % 7]
    # day is now the day of the week
    data = day_data[day]
    heat_load = []
    for hour in range(24):
        # still unsure how to select which temperature row from the data table
        heat_load.append(day_data_selected)  # placeholder for the selected value
    heat_load_days[day] = heat_load
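The missing piece above is selecting the temperature row. Here is a hedged sketch of the whole lookup using numpy binning; the tables dict and its exact layout are my assumptions, not part of the original answer:

import numpy as np

# Assumed layout: tables[day] is a 2D array of shape (n_temp_bins, 24),
# with the temperature bin edges -15, -10, ..., 30 described in the question.
temp_edges = np.arange(-15, 31, 5)

def lookup(day, temps, hours, tables):
    """Return the table constant for each (temperature, hour) pair."""
    # right=True implements the "lo < t <= hi" bins from the question
    rows = np.digitize(temps, temp_edges, right=True) - 1
    rows = np.clip(rows, 0, len(temp_edges) - 2)  # clamp out-of-range temps
    return tables[day][rows, np.asarray(hours)]

# Usage with the question's first six Monday temperatures and a made-up table:
temps = np.array([0.8, 0.1, -0.8, -1.4, -1.7, -1.5])
hours = np.arange(len(temps))
tables = {"monday": np.random.rand(len(temp_edges) - 1, 24)}
constants = lookup("monday", temps, hours, tables)
print(constants)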

Taking single value from a grouped data frame in Pandas

I am a new Python convert (from Matlab). I am using the pandas groupby function, and I am getting tripped up by a seemingly easy problem. I have written a custom function that I apply to the grouped df that returns 4 different values. Three of the values are working great, but the other value is giving me an error. Here is the original df:
Index,SN,Date,City,State,ID,County,Age,A,B,C
0,32,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,32,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,32,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,32,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,35,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,38,9/1/16,X,AR,728-13,JJ County,3.0,negative,negative,negative
6,38,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
7,30,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
8,30,9/1/16,X,AR,728-13,JJ County,14.0,negative,negative,negative
9,30,9/1/16,X,AR,728-13,JJ County,5.0,negative,negative,negative
...
This is the function that transforms the data. Basically, it counts the number of 'positive' values and the total number of observations in the group. I also want it to return the ID value, and this is where the problem is:
def _ct_id_pos(grp):
    return (grp['ID'][0], grp[grp.A == 'positive'].shape[0],
            grp[grp.B == 'positive'].shape[0], grp.shape[0])
I apply the _ct_id_pos function to the data grouped by Date and SN:
FullMx_prime = FullMx.groupby(['Date', 'SN']).apply(_ct_id_pos).reset_index()
So, the method should return something like this:
Date SN ID 0
0 9/1/16 32 360 (360,2,1,4)
1 9/1/16 35 718 (718,0,0,1)
2 9/2/16 38 728 (728,1,0,2)
3 9/3/16 30 728 (728,2,0,3)
But, I keep getting the following error:
...
KeyError: 0
Obviously, it does not like this part of the function: grp['ID'][0]. I just want to take the first value of grp['ID'] because, if there are multiple values, they should all be the same (i.e., I could take the last; it does not matter). I have tried other ways to index, but to no avail.
Change grp['ID'][0] to grp.iloc[0]['ID']
The problem you are having is due to grp['ID'], which selects a column and returns a pandas.Series. That is straightforward enough, and you could reasonably expect [0] to select the first element. But [0] actually selects by the Series index label, and here the index comes from the dataframe that was grouped, so 0 is not always a valid label.
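A tiny illustration of the failure mode, with index labels chosen to mimic a non-first group:

import pandas as pd

# A group keeps the parent frame's index labels, e.g. rows 5 and 6
s = pd.Series(['728-13', '728-13'], index=[5, 6], name='ID')
# s[0]          # KeyError: 0 - label-based lookup, and no label 0 exists here
print(s.iloc[0])  # '728-13' - positional lookup always works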
Code:
def _ct_id_pos(grp):
    id = grp.iloc[0]['ID']
    a = grp[grp.A == 'positive'].shape[0]
    b = grp[grp.B == 'positive'].shape[0]
    sz = grp.shape[0]
    return id, a, b, sz
Test Code:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(u"""
Index,SN,Date,City,State,ID,County,Age,A,B,C
0,32,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,32,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,32,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,32,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,35,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,38,9/1/16,X,AR,728-13,JJ County,3.0,negative,negative,negative
6,38,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
7,30,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
8,30,9/1/16,X,AR,728-13,JJ County,14.0,negative,negative,negative
9,30,9/1/16,X,AR,728-13,JJ County,5.0,negative,negative,negative
"""), header=0, index_col=0)
print(df.groupby(['Date', 'SN']).apply(_ct_id_pos).reset_index())
Results:
Date SN 0
0 9/1/16 30 (728-13, 0, 0, 3)
1 9/1/16 32 (360, 0, 1, 4)
2 9/1/16 35 (718, 0, 0, 1)
3 9/1/16 38 (728-13, 0, 0, 2)
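As an aside, on newer pandas the same table can be built without apply; a sketch using named aggregation (column names as in the question):

out = (df.groupby(['Date', 'SN'])
         .agg(ID=('ID', 'first'),
              pos_A=('A', lambda s: s.eq('positive').sum()),
              pos_B=('B', lambda s: s.eq('positive').sum()),
              total=('A', 'size'))
         .reset_index())
print(out)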

Sorting by Date string in pandas - Python 2.7

I have .csv data that I want to sort by its date column. My date format is the following:
Week,Quarter,Year: So WK01Q12001 for example.
When I .sort() my dataframe on this column, the result is sorted like:
WK01Q12001, WK01Q12002, WK01Q12003, WK01Q22001, WK01Q22002, WK01Q22003, ... WK02Q12001, WK02Q12002...
for example. This makes sense because it's sorting the string in ascending order.
But I need my data sorted chronologically such that the result is like the following:
WK01Q12001, WK02Q12001, WK03Q12001, WK04Q12001, ... , WK01Q22001, WK02Q22001, ... WK01Q12002, WK02Q22002 ...
How can I sort it this way using pandas? Perhaps by sorting the string in reverse (right to left), or by creating some kind of datetime object?
I have also tried using Series(): pd.Series([pd.to_datetime(d) for d in weeklyData['Date']])
But the result is the same as with the .sort() method above.
UPDATE:
My DataFrame is similar in format to an Excel sheet and currently looks like the following. I want to sort chronologically by 'Date'.
Date Price Volume
WK01Q12001 32 500
WK01Q12002 43 400
WK01Q12003 55 300
WK01Q12004 58 350
WK01Q22001 33 480
WK01Q22002 40 450
.
.
.
WK13Q42004 60 400
You can add a new column to your dataframe containing the date components as a list.
e.g.
a = ["2001", "Q2", "WK01"]
b = ["2002", "Q2", "WK01"]
c = ["2002", "Q2", "WK02"]
So, you can apply a function to your data frame to do this...
import re

def tolist(x):
    g = re.match(r"(WK\d{2})(Q\d)(\d{4})", str(x))
    return [g.group(3), g.group(2), g.group(1)]
then...
df['datelist'] = df['Date'].apply(tolist)
which gives you your date as a list arranged in the order of importance...
Date Price Volume datelist
0 WK01Q12001 32 500 [2001, Q1, WK01]
1 WK01Q12002 22 400 [2002, Q1, WK01]
2 WK01Q12003 42 500 [2003, Q1, WK01]
When comparing lists of equal length in Python, the comparison operators work element by element, so a year-first list like the ones above orders chronologically.
That means the default DataFrame sort will order your data correctly when you do...
df.sort('datelist')
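A quick check of the list-comparison claim in plain Python:

# Lists compare element by element, so a [year, quarter, week] key
# orders chronologically: year first, then quarter, then week.
a = ["2001", "Q2", "WK01"]
b = ["2002", "Q2", "WK01"]
c = ["2002", "Q2", "WK02"]
assert a < b < c
print(sorted([c, a, b]))  # a, b, c in chronological order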
Use str.replace to change the order of the keys first:
s = "WK01Q12001, WK01Q12002, WK01Q12003, WK01Q22001, WK01Q22002, WK01Q22003, WK02Q12001, WK02Q12002"
date = map(str.strip, s.split(","))
df = pd.DataFrame({"date":date, "value":range(len(date))})
df["date2"] = df.date.str.replace(r"WK(\d\d)Q(\d)(\d{4})", r"\3Q\2WK\1")
df.sort("date2")
I was also able to accomplish this date reformatting very easily using SQL. When I first query my data, I do:
SELECT *,
       RIGHT([Date], 4) + SUBSTRING([Date], 5, 2) + LEFT([Date], 4) AS SortedDate
FROM [Table]
ORDER BY SortedDate ASC
This turns WK01Q12001 into the sortable key 2001Q1WK01.
Use the right tool for the job!
