Defining lists based on indices of pandas dataframe - python

I have a pandas dataframe, and one of the columns has date values as strings (like "2014-01-01"). I would like to define a different list for each year that is present in the column, where the elements of the list are the index of the row in which the year is found in the dataframe.
Here's what I've tried:
import pandas as pd

df = pd.DataFrame(["2014-01-01", "2013-01-01", "2014-02-02", "2012-08-09"])
df = df.values.flatten().tolist()
for i in range(len(df)):
    df[i] = df[i][0:4]

y2012 = []; y2013 = []; y2014 = []
for i in range(len(df)):
    if df[i] == "2012":
        y2012.append(i)
    elif df[i] == "2013":
        y2013.append(i)
    else:
        y2014.append(i)

print(y2014)  # [0, 2]
print(y2013)  # [1]
print(y2012)  # [3]
Does anyone know a better way of doing this? This way works fine, but I have a lot of years, so I have to manually define each variable and then run it through the for loop, and so the code gets really long. I was trying to use groupby in pandas, but I couldn't seem to get it to work.
Thank you so much for any help!

Scan through the original DataFrame values and parse out the year. Given that, add the index into a defaultdict. That is, the following code creates a dict with one item per year; the value for a specific year is a list of the rows in which that year is found in the dataframe.
A defaultdict sounds scary, but it's just a dictionary. In this case, each value is a list. If we append to a nonexistent value, then it gets spontaneously created. Convenient!
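For instance, a minimal demonstration of that "spontaneous creation" behavior, separate from the pandas code below:

```python
from collections import defaultdict

# Appending to a key that doesn't exist yet silently creates an empty list first
d = defaultdict(list)
d["2014"].append(0)
d["2014"].append(2)
d["2013"].append(1)
print(dict(d))  # {'2014': [0, 2], '2013': [1]}
```

A plain dict would raise a KeyError on the first append; the defaultdict's factory (`list`) supplies the missing value instead.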
source
from collections import defaultdict
import pandas as pd

df = pd.DataFrame(["2014-01-01", "2013-01-01", "2014-02-02", "2012-08-09"])
# df = df.values.flatten().tolist()
dindex = defaultdict(list)
for index, dateval in enumerate(df.values):
    year = dateval[0].split('-')[0]
    dindex[year].append(index)
assert dindex == {'2014': [0, 2], '2013': [1], '2012': [3]}
print(dindex)
output
defaultdict(<class 'list'>, {'2014': [0, 2], '2013': [1], '2012': [3]})

Pandas is awesome for this kind of thing, so don't be so hasty to turn your dataframe back into lists right away.
The trick here lies in the .apply() method and the .groupby() method.
Take a dataframe that has strings with ISO-formatted dates in it.
Parse the column containing the date strings into datetime objects.
Create another column of years using the .year attribute of the items in the datetime column.
Group the dataframe by the new year column.
Iterate over the groupby object and extract your column.
Here's some code for you to play with and grok:
import pandas as pd
import dateutil.parser

df = pd.DataFrame({'strings': ["2014-01-01", "2013-01-01", "2014-02-02", "2012-08-09"]})
df['datetimes'] = df['strings'].apply(dateutil.parser.parse)
df['year'] = df['datetimes'].apply(lambda x: x.year)
grouped_data = df.groupby('year')
lists_by_year = {}
for year, data in grouped_data:
    lists_by_year[year] = list(data['strings'])
Which gives us a dictionary of lists, where each key is a year and the value is a list of the date strings from that year.
print(lists_by_year)
{2012: ['2012-08-09'],
 2013: ['2013-01-01'],
 2014: ['2014-01-01', '2014-02-02']}

As it turns out,
df.groupby('A')  # is just syntactic sugar for df.groupby(df['A'])
This means that all you have to do to group by year is leverage the apply function inside the groupby call and re-work the syntax.
Solution
getYear = lambda x: x.split("-")[0]
yearGroups = df.groupby(df["dates"].apply(getYear))  # assumes df has a 'dates' column of date strings
Output
for key, group in yearGroups:
    print(key)
2012
2013
2014
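Building on the groupby idea, `GroupBy.indices` gives exactly the per-year lists of row positions the original question asked for, without any manual loop. A small sketch using the question's sample data (the `dates` column name is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"dates": ["2014-01-01", "2013-01-01", "2014-02-02", "2012-08-09"]})

# GroupBy.indices maps each group key to an array of the positional
# row indices in that group -- the per-year index lists from the question
year_index = df.groupby(df["dates"].str[:4]).indices
print({k: list(v) for k, v in year_index.items()})
```

Any new year appearing in the data gets its own entry automatically, so no variables need to be defined by hand.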

Related

Fastest way to calculate and append rolling mean as columns for grouped dataframe

Have the following dataset. This is a small sample while the actual dataset is much larger.
What is the fastest way to:
iterate through days = (1,2,3,4,5,6)
calculate [...rolling(day, min_periods=day).mean()]
add it as column name df[f'sma_{day}']
Method I have is casting it to a dict of {ticker: price_df} and looping through it, shown below.
Have thought of methods like groupby and stack/unstack, but got stuck, and need help with appending the columns because they are multi-index.
Am favouring the method with the fastest %%timeit.
import yfinance as yf
df = yf.download(['MSFT','AAPL','AMZN'], start="2022-09-13").loc[:,['Close']].stack().swaplevel().sort_index()
df.index.set_names(['Ticker','Date'], inplace=True)
df
Here is a sample dictionary method I have..
df = df.reset_index()
df = dict(tuple(df.groupby(['Ticker'])))
## Iterate through days and keys
days = (1, 2, 3, 4, 5, 6)
for key in df.keys():
    for day in days:
        df[key][f'sma_{day}'] = df[key].Close.sort_index(ascending=True).rolling(day, min_periods=day).mean()
## Flatten dictionary
pd.concat(df.values()).set_index(['Ticker','Date']).sort_index()

How to apply .sort() with a key=lambda function to every row of a dataframe on a single column?

I have a dataframe with a column containing a list of dates:
data = [
[
1,
[
"2017-12-06",
"2017-12-05",
"2017-12-06",
"2018-01-03",
"2018-01-04",
"2017-11-24",
],
],
[
2,
[
"2019-03-10",
"2018-12-03",
"2018-12-04",
"2018-11-08",
"2018-11-30",
"2019-03-22",
"2018-11-24",
"2019-03-06",
"2017-11-16",
],
],
]
df = pd.DataFrame(data, columns=["id", "dates"])
df
id dates
1 [2017-12-06, 2017-12-05, 2017-12-06, 2018-01-03, 2018-01-04, 2017-11-24]
2 [2019-03-10, 2018-12-03, 2018-12-04, 2018-11-08, 2018-11-30, 2019-03-22, 2018-11-24, 2019-03-06, 2017-11-16]
print(df.dtypes)
id int64
dates object
dtype: object
I would like to sort the date containing column (dates). I have tried a number of methods with no success (including .apply(list.sort) in place). The only method that I've found that works is using .sort(key = ....) like below:
import datetime
from datetime import datetime
dates = [
"2019-03-10",
"2018-12-03",
"2018-12-04",
"2018-11-08",
"2018-11-30",
"2019-03-22",
"2018-11-24",
"2019-03-06",
"2017-11-16",
]
dates.sort(key=lambda date: datetime.strptime(date, "%Y-%m-%d"))
but I can only get it to work on a list and I want to apply this to that entire column in the dataframe df. Can anyone advise the best way to do this? Or perhaps there is an even better way to sort this column?
You can use .apply() to apply a given function (in this case 'sort') to every row of a dataframe column.
This should work:
df['dates'].apply(lambda row: row.sort(key=lambda date: datetime.strptime(date, "%Y-%m-%d")))
print(df)
returns:
id dates
0 1 ['2017-11-24', '2017-12-05', '2017-12-06', '2017-12-06', '2018-01-03', '2018-01-04']
1 2 ['2017-11-16', '2018-11-08', '2018-11-24', '2018-11-30', '2018-12-03', '2018-12-04', '2019-03-06', '2019-03-10', '2019-03-22']
Note that in this case the code df['dates'] = df['dates'].apply(...) will NOT work: list.sort() always sorts the list in place and returns None, so the assignment would fill the column with None. The .apply() call above works only through the side effect of mutating each list.
To apply other functions you might have to use the df = df.apply(...) formulation.
What I see here is that you want the list in every row to be sorted (not the column itself).
The code below applies a certain function (something like my_sort()) to each row of column "dates":
df['dates'].apply(my_sort)
You just need to implement my_sort to be applied to the list in each row. Something like:
def my_sort(dates):
    dates.sort(key=lambda date: datetime.strptime(date, "%Y-%m-%d"))
    return dates
list.sort() sorts the list and returns None so you need to return the list itself after calling sort.
Edit:
According to the comment from @jch, it is better practice to copy the list first and then call the sort method. This way, any unexpected behavior or error produced by the sort method (if any happens) won't affect the original list in your dataframe. To achieve that, you can change my_sort to something like:
from copy import deepcopy
from copy import deepcopy

def my_sort(dates):
    dates_copy = deepcopy(dates)
    dates_copy.sort(key=lambda date: datetime.strptime(date, "%Y-%m-%d"))
    return dates_copy
You can learn more about copy and deepcopy of objects here.
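Alternatively, the builtin sorted() returns a new list rather than mutating the input, which sidesteps both the None return and the copying concern in one go. A sketch on a one-row frame:

```python
from datetime import datetime
import pandas as pd

df = pd.DataFrame({"id": [1], "dates": [["2017-12-06", "2017-12-05", "2017-11-24"]]})

# sorted() builds a fresh list, so the assignment works and the
# original lists in the column are never mutated
df["dates"] = df["dates"].apply(
    lambda lst: sorted(lst, key=lambda d: datetime.strptime(d, "%Y-%m-%d"))
)
print(df["dates"].iloc[0])  # ['2017-11-24', '2017-12-05', '2017-12-06']
```

As an aside, ISO-formatted dates sort correctly as plain strings, so sorted(lst) with no key would also work for this particular format.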

How to rename the first column of a pandas dataframe?

I have come across this question many times over the internet; however, not many answers exist, except for a few like the following:
Cannot rename the first column in pandas DataFrame
I approached the same using following:
df = df.rename(columns={df.columns[0]: 'Column1'})
Is there a better or cleaner way of doing the rename of the first column of a pandas dataframe? Or any specific column number?
You're already using a cleaner way in pandas.
It is sad that
df.columns[0] = 'Column1'
is impossible, because Index objects do not support mutable assignment; it raises a TypeError.
You still could do iterable unpacking:
df.columns = ['Column1', *df.columns[1:]]
Or:
df = df.set_axis(['Column1', *df.columns[1:]], axis=1)
Not sure if cleaner, but possible idea is convert to list and set by indexing new value:
df = pd.DataFrame(columns=[4,7,0,2])
arr = df.columns.tolist()
arr[0] = 'Column1'
df.columns = arr
print (df)
Empty DataFrame
Columns: [Column1, 7, 0, 2]
Index: []
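If this comes up for arbitrary column positions, the rename-by-position idea from the question can be wrapped in a small helper (the name rename_nth is hypothetical, just for illustration):

```python
import pandas as pd

def rename_nth(df, n, name):
    """Return a copy of df with its n-th column renamed (hypothetical helper)."""
    return df.rename(columns={df.columns[n]: name})

df = pd.DataFrame(columns=[4, 7, 0, 2])
print(rename_nth(df, 0, "Column1").columns.tolist())  # ['Column1', 7, 0, 2]
print(rename_nth(df, 2, "Third").columns.tolist())    # [4, 7, 'Third', 2]
```

Since rename returns a new frame, the original stays untouched, which also makes the helper safe to chain.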

Slicing data frame with datetime columns (Python - Pandas)

Through the loc and iloc methods, Pandas allows us to slice dataframes. Still, I am having trouble doing this when the columns are datetime objects.
For instance, suppose the data frame generated by the following code:
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
Let us try to slice the first two columns of the dataframe through df.loc:
df.loc[0,'01-01-2001':'02-02-2002']
We get the following TypeError: '<' not supported between instances of 'datetime.date' and 'str'
How could this be solved?
df.iloc[0, [0, 1]]
Use iloc (or loc), but pass the columns' positions as the second parameter rather than their names: since the labels are datetime.date objects, string slices cannot be compared against them, so just give the integer indexes.
To piggyback off of @Ch3steR's comment above, that line should work:
dates = pd.to_datetime(dates)
At that point the date conversion should allow you to index the columns that fall in that range based on the date as listed below. Just make sure the end date is a little beyond the end date that you're trying to capture.
# Return all rows in columns between date range 1/1/2001 and 2/3/2002
df.loc[:, '1/1/2001':'2/3/2002']
2001-01-01 2002-02-02
0 1 2
You can call the dates from the list you created earlier and it doesn't give an error.
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
df.loc[0,dates[0]:dates[1]]
The two different formats are shown here. It's just important that you stick to one of them. Calling from the list works because it guarantees that the format is the same. But as you said, you need to be able to use arbitrary dates, so the second one is better for you.
>>> dates = pd.to_datetime(dates).date
>>> print("With .date")
With .date
>>> print(dates)
[datetime.date(2001, 1, 1) datetime.date(2002, 2, 2)
 datetime.date(2003, 3, 3)]
>>> dates = pd.to_datetime(dates)
>>> print("Without .date")
Without .date
>>> print(dates)
DatetimeIndex(['2001-01-01', '2002-02-02', '2003-03-03'], dtype='datetime64[ns]', freq=None)
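Putting the second approach together end to end: keeping the columns as a DatetimeIndex (no .date) lets .loc accept string slices directly, which is the sketch below (built on the question's own sample data):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})
# Keeping the columns as a DatetimeIndex (no .date) lets .loc accept strings
df.columns = pd.to_datetime(["01-01-2001", "02-02-2002", "03-03-2003"])
print(df.loc[:, "2001-01-01":"2002-02-02"])
```

The string endpoints are parsed against the DatetimeIndex, so any unambiguous date format works in the slice.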

Changing class pandas.Series to a list

Trying to change a column of a dataframe, which has type pandas.Series, to a list.
Tried changing it directly to a list, but it still comes up as a Series when checking its type.
First I get the first 4 characters so I can have just the year, then I create a new column in the table called year to hold that new data.
year = df['date'].str.extract(r'^(\d{4})')
df['year'] = pd.to_numeric(year)
df['year'].dtype
print(type(df['year']))
Want the type of 'year' to be a list. Thanks!
If you want to get a list of the year values from the date column, you could try this:
import pandas as pd
df = pd.DataFrame({'date':['2019/01/02', '2018/02/03', '2017/03/04']})
year = df.date.str.extract(r'(\d{4})')[0].to_list()
print(f'type: {type(year)}: {year}')
# type: <class 'list'>: ['2019', '2018', '2017']
df.date.str.extract returns a new DataFrame with one row per subject string and one column per capture group; we then take the first (and only) group with [0].
It seems pretty straightforward to turn a series into a list. The builtin list function works fine:
>>> df = pd.DataFrame({'date':['2019/01/02', '2018/02/03', '2017/03/04']})
>>> dates = list(df['date'])
>>> type(dates)
<class 'list'>
>>> dates
['2019/01/02', '2018/02/03', '2017/03/04']
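pandas also ships its own conversion method, Series.tolist(), which gives the same result as the builtin:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2019/01/02", "2018/02/03", "2017/03/04"]})
dates = df["date"].tolist()  # pandas' own Series-to-list conversion
print(type(dates), dates)
```

Either spelling is fine; tolist() reads slightly more explicitly in pandas-heavy code.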
