Sorting by Date string in pandas - Python 2.7

I have .csv data that I want to sort by its date column. My date format is the following:
Week, Quarter, Year: so WK01Q12001, for example.
When I .sort() my dataframe on this column, the result is sorted like:
WK01Q12001, WK01Q12002, WK01Q12003, WK01Q22001, WK01Q22002, WK01Q22003, ... WK02Q12001, WK02Q12002...
for example. This makes sense because it's sorting the string in ascending order.
But I need my data sorted chronologically such that the result is like the following:
WK01Q12001, WK02Q12001, WK03Q12001, WK04Q12001, ..., WK01Q22001, WK02Q22001, ..., WK01Q12002, WK02Q12002, ...
How can I sort it this way using pandas? Perhaps by sorting the string in reverse (right to left), or by creating some kind of datetime object?
I have also tried using Series(): pd.Series([pd.to_datetime(d) for d in weeklyData['Date']])
But the result is the same as with the .sort() method above.
UPDATE:
My DataFrame is similar in format to an Excel sheet and currently looks like the following. I want to sort it chronologically by 'Date'.
Date Price Volume
WK01Q12001 32 500
WK01Q12002 43 400
WK01Q12003 55 300
WK01Q12004 58 350
WK01Q22001 33 480
WK01Q22002 40 450
...
WK13Q42004 60 400

You can add a new column to your dataframe containing the date components as a list.
e.g.
a = ["2001", "Q2", "WK01"]
b = ["2002", "Q2", "WK01"]
c = ["2002", "Q2", "WK02"]
So, you can apply a function to your DataFrame to do this...
import re

def tolist(x):
    g = re.match(r"(WK\d{2})(Q\d)(\d{4})", str(x))
    return [g.group(3), g.group(2), g.group(1)]
then...
df['datelist'] = df['Date'].apply(tolist)
which gives you your date as a list arranged in the order of importance...
Date Price Volume datelist
0 WK01Q12001 32 500 [2001, Q1, WK01]
1 WK01Q12002 43 400 [2002, Q1, WK01]
2 WK01Q12003 55 300 [2003, Q1, WK01]
When comparing lists of equal length, Python's comparison operators work element by element, so these lists compare lexicographically: year first, then quarter, then week. That means the standard DataFrame sort will order your data correctly when you do...
df.sort('datelist')
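Note that DataFrame.sort() was deprecated in pandas 0.17 and removed in 0.20; on newer versions the equivalent is:
df = df.sort_values('datelist')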

Use str.replace to change the order of the keys first:
s = "WK01Q12001, WK01Q12002, WK01Q12003, WK01Q22001, WK01Q22002, WK01Q22003, WK02Q12001, WK02Q12002"
date = map(str.strip, s.split(","))
df = pd.DataFrame({"date":date, "value":range(len(date))})
df["date2"] = df.date.str.replace(r"WK(\d\d)Q(\d)(\d{4})", r"\3Q\2WK\1")
df.sort("date2")

I was also able to accomplish this date reformatting very easily using SQL. When I first queried my data, I used:
SELECT *,
       RIGHT([Date], 4) + SUBSTRING([Date], 5, 2) + LEFT([Date], 4) AS SortedDate
FROM [Table]
ORDER BY SortedDate ASC
Use the right tool for the job!
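That said, the same rearrangement is just as easy in pandas with vectorized string slicing (a sketch, assuming the Date column always has the exact shape WKwwQqyyyy):
df["SortedDate"] = df["Date"].str[-4:] + df["Date"].str[4:6] + df["Date"].str[:4]
df = df.sort_values("SortedDate")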

Related

DataFrame.sort_values only looks at first digit rather than entire number

I have a DataFrame that looks like this:
del Ticker Open Interest
0 1 SPY 20,996,893
1 3 IWM 7,391,074
2 5 EEM 6,545,445
...
47 46 MU 1,268,256
48 48 NOK 1,222,759
49 50 ET 1,141,467
I want it to go in order from the lowest number to the greatest by df['del'], but when I write df.sort_values('del') I get:
del Ticker
0 1 SPY
29 10 BAC
5 11 GE
It appears to sort based on the first digit rather than the whole number. Am I using the correct code, or do I need to completely change it?
Assuming you have the numbers as strings, you can do one of the following:
add leading zeros to the string numbers, which makes string ordering match numeric ordering
df["del"] = df["del"].map(lambda x: x.zfill(10))
df = df.sort_values('del')
or convert the type to integer
df["del"] = df["del"].astype('int') # as recommended by Alex.Kh in comment
#df["del"] = df["del"].map(int) # my initial answer
df = df.sort_values('del')
I also noticed that del seems to be sorted in the same way as your index, so you could even do:
df = df.sort_index()
and to go from lowest to highest explicitly, use .sort_values('del', ascending=True).
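To see why the padding (or the cast) is needed, here is a minimal, self-contained reproduction of the behaviour with hypothetical toy data:
import pandas as pd

df = pd.DataFrame({"del": ["1", "3", "10", "2"]})
print(df.sort_values("del")["del"].tolist())         # ['1', '10', '2', '3'] -- string order
print(df["del"].astype(int).sort_values().tolist())  # [1, 2, 3, 10] -- numeric order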

How to get a value from a pandas core series?

I have a dataframe df which contains the timezones for particular IP numbers:
ip1 ip2 timezone
0 16777215 0
16777216 16777471 +10:00
16777472 16778239 +08:00
16778240 16779263 +11:00
16779264 16781311 +08:00
16781312 16785407 +09:00
...
The first row is valid for the IP numbers from 0 to 16777215, the second from 16777216 to 16777471, and so on.
Now I go through a folder and want to know the timezone for every file (after I calculate the ip_number of the file).
I use:
time = df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone']
and get my expected output:
1192 +05:30
Name: timezone, dtype: object
But this is a pandas Series, and I just want to have "+05:30".
How do I get that? Or is there another way, instead of df.loc[...], to get the value of the timezone column in df directly?
Just turn it into a list:
list(time)
If you are expecting only one value:
list(time)[0]
Or you can do it even earlier:
#for numpy array
time=df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].values
#for list
time=list(df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].values)
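Another common idiom, assuming the mask matches exactly one row, is positional access with .iloc:
time = df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].iloc[0]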
To pull the only value out of a Series of size 1, use the Series.item() method:
time = df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].item()
Note that this raises a ValueError if the Series contains more than one item.
Usually pulling single values out of a Series is an anti-pattern. NumPy/Pandas
is built around the idea that applying vectorized functions to large arrays is
going to be much much faster than using a Python loop that processes single
values one at a time.
Given your df and a list of IP numbers, here is a way to find the
corresponding timezone offsets for all the IP numbers with just one call to pd.merge_asof.
import pandas as pd

df = pd.DataFrame({'ip1': [0, 16777216, 16777472, 16778240, 16779264, 16781312],
                   'ip2': [16777215, 16777471, 16778239, 16779263, 16781311, 16785407],
                   'timezone': ['0', '+10:00', '+08:00', '+11:00', '+08:00', '+09:00']})
df1 = df.melt(id_vars=['timezone'], value_name='ip').sort_values(by='ip').drop('variable', axis=1)
ip_nums = [16777473, 16777471, 16778238, 16785406]
df2 = pd.DataFrame({'ip': ip_nums}).sort_values(by='ip')
result = pd.merge_asof(df2, df1)
print(result)
yields
ip timezone
0 16777471 +10:00
1 16777473 +08:00
2 16778238 +08:00
3 16785406 +09:00
Ideally, your next step would be to apply more NumPy/Pandas vectorized functions
to process the whole DataFrame at once. But if you must, you could iterate
through the result DataFrame row-by-row. Still, your code will look a little bit cleaner
since you'll be able to read off ip and corresponding offset easily (and without calling .item()).
for row in result.itertuples():
    print('{} --> {}'.format(row.ip, row.timezone))
# 16777471 --> +10:00
# 16777473 --> +08:00
# 16778238 --> +08:00
# 16785406 --> +09:00

python pass string to pandas dataframe in a specific format

I am not entirely sure if this is possible but I thought I would go ahead and ask. I currently have a string that looks like the following:
myString =
"{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}"
The datasets can be of varying lengths (this example has two datasets, but it could have more); however, the parameters will always be the same (Close, DownTicks, DownVolume, etc.).
Is there a way to create a dataframe from this string that takes the parameters as the index, and the numbers as the values in the column? So the dataframe would look something like this:
df =
0 1
index
Close 175.30 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.80
Low 173.66 174.94
Open 177.32 175.24
(etc)...
It looks like there are some issues with your input. As mentioned by @lmiguelvargasf, there's a missing comma at the end of the first dictionary. Additionally, there's a \n separating the records, which a simple str.replace can fix.
Once those issues have been solved, the process is pretty simple.
myString = '''{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}'''
import ast
import pandas as pd

myString = myString.replace('\n', ',')
list_of_dicts = list(ast.literal_eval(myString))
df = pd.DataFrame.from_dict(list_of_dicts).T
df
0 1
Close 175.3 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.8
Low 173.66 174.94
Open 177.32 175.24
OpenInterest 0 0
Status 29 536870941
TimeStamp \/Date(1521489600000)\/ \/Date(1521576000000)\/
TotalTicks 245246 135239
TotalVolume 33446771 19649350
UnchangedTicks 0 0
UnchangedVolume 0 0
UpTicks 122273 66168
UpVolume 14807630 8842514
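Alternatively, since each line of the original string is itself valid JSON (\/ is a legal JSON escape for /), here is a sketch that parses it with the json module and skips the comma fix entirely:
import json
import pandas as pd

# split the raw two-record string on newlines and parse each record
records = [json.loads(line) for line in myString.splitlines()]
df = pd.DataFrame(records).T  # parameters become the index, one column per record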

Select dataframe rows defining a date range that contains a needle date

There are lots of answers which make it easy to select some date range and get the ones that fall into that range.
I don't want that.
I have data like this:
id other_flags d_dt_start d_dt_end
0 28 ... 1993-02-12 1993-12-31
1 28 ... 1993-02-12 1993-12-31
2 46 ... 1986-01-15 1993-09-30
3 46 ... 1986-01-15 1993-09-30
4 46 ... 1986-01-15 1993-09-30
I want to select the ones that match when I have a date, say, 1986-06-15, thus giving me the subset of indices 2, 3, and 4. Currently, I'm doing this by something like this:
subs = subs[(time >= subs['d_dt_start'])   # starts on or before the date
            & (time <= subs['d_dt_end'])]  # ends on or after the date
There has got to be a more elegant way to do this, something like Series.between, just the other way around.
Basically, instead of saying 'you have a date, I have a date range', it's 'you have a date range, I have a date'.
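For reference, a self-contained version of the mask approach from the question, built from the sample rows above (the needle date is the one from the question):
import pandas as pd

df = pd.DataFrame({
    'id': [28, 28, 46, 46, 46],
    'd_dt_start': pd.to_datetime(['1993-02-12', '1993-02-12', '1986-01-15', '1986-01-15', '1986-01-15']),
    'd_dt_end': pd.to_datetime(['1993-12-31', '1993-12-31', '1993-09-30', '1993-09-30', '1993-09-30']),
})
needle = pd.Timestamp('1986-06-15')
subset = df[(df['d_dt_start'] <= needle) & (needle <= df['d_dt_end'])]
print(subset.index.tolist())  # [2, 3, 4]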

Calculating a max for every X number of lines, how to take leap year into account?

I am trying to take the yearly max of rainfall data for multiple years held in one array. I understand how to use a for loop to take the max of a single range, and I saw there was a similar question to the problem I'm having. However, I need to take leap years into account!
So to start, I have 14616 data points covering 1960-1965 (not including 1965), which span two leap years: 1960 and 1964. A leap year contains 2928 data points and every other year contains 2920 data points.
My first thought was to modify the solution from the similar question, which involved a for loop, as follows (a straight copy-paste from theirs):
daily_averages = []
for i, d in enumerate(data_you_want):
    if (i % 600) == 0:
        avg_for_day = np.mean(data_you_want[i - 600:i])
        daily_averages.append(avg_for_day)
Theirs took the average of every 600 lines of their data. I thought there might be a way to just modify this, but I couldn't figure out how to make it work. If modifying this won't work, is there another way to loop it with the leap years taken into account, without completely cutting up the file manually?
Fake data:
import numpy as np
fake = np.random.randint(2, 30, size = 14616)
Use pandas to handle the leap year functionality.
Create timestamps for your data with pandas.date_range().
import pandas as pd
index = pd.date_range(start = '1960-1-1 00:00:00', end = '1964-12-31 23:59:59' , freq='3H')
Then create a DataFrame using the timestamps for the index.
df = pd.DataFrame(data = fake, index = index)
Aggregate by year - taking advantage of the DatetimeIndex flexibilty.
>>> df['1960'].max()
0 29
dtype: int32
>>> df['1960'].mean()
0 15.501366
dtype: float64
>>>
>>> len(df['1960'])
2928
>>> len(df['1961'])
2920
>>> len(df['1964'])
2928
>>>
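(Aside: on recent pandas versions, row selection with df['1960'] raises a KeyError; the supported spelling of the same partial-string indexing is .loc, e.g. df.loc['1960'].max().)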
I just cobbled this together from the Time Series / Date functionality section of the docs. Given pandas' capabilities, this looks a bit naive and can probably be improved upon.
For example, resampling (using the same DataFrame):
>>> df.resample('A').mean()
0
1960-12-31 15.501366
1961-12-31 15.170890
1962-12-31 15.412329
1963-12-31 15.538699
1964-12-31 15.382514
>>> df.resample('A').max()
0
1960-12-31 29
1961-12-31 29
1962-12-31 29
1963-12-31 29
1964-12-31 29
>>>
>>> r = df.resample('A')
>>> r.agg([np.sum, np.mean, np.std])
0
sum mean std
1960-12-31 45388 15.501366 8.211835
1961-12-31 44299 15.170890 8.117072
1962-12-31 45004 15.412329 8.257992
1963-12-31 45373 15.538699 7.986877
1964-12-31 45040 15.382514 8.178057
>>>
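Equivalently, you can group by calendar year, which sidesteps the leap-year bookkeeping entirely and yields the same per-year maxima:
>>> df.groupby(df.index.year).max()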
Food for thought:
Time-aware Rolling vs. Resampling
