I have a list of columns that I want to rename a portion of based on a list of values.
I am looking at a file which has 12 months of data and each month is a different column (I need to keep it in this specific format unfortunately). This file is generated once per month and I keep the column names more general since I have to do a lot of calculations on them based the month number (for example, I need to compare information against the average of month 8, 9, and 10 every month).
Here are the columns I want to rename:
['month_1_Sign',
'month_2_Sign',
'month_3_Sign',
'month_4_Sign',
'month_5_Sign',
'month_6_Sign',
'month_7_Sign',
'month_8_Sign',
'month_9_Sign',
'month_10_Sign',
'month_11_Sign',
'month_12_Sign',
'month_1_Actual',
'month_2_Actual',
'month_3_Actual',
'month_4_Actual',
'month_5_Actual',
'month_6_Actual',
'month_7_Actual',
'month_8_Actual',
'month_9_Actual',
'month_10_Actual',
'month_11_Actual',
'month_12_Actual',
'month_1_Target',
'month_2_Target',
'month_3_Target',
'month_4_Target',
'month_5_Target',
'month_6_Target',
'month_7_Target',
'month_8_Target',
'month_9_Target',
'month_10_Target',
'month_11_Target',
'month_12_Target']
Here are the names I want to place:
required_date_range = sorted(list(pd.Series(pd.date_range((dt.datetime.today().date() + pd.DateOffset(months=-13)), periods=12, freq='MS')).dt.strftime('%Y-%m-%d')))
['2015-03-01',
'2015-04-01',
'2015-05-01',
'2015-06-01',
'2015-07-01',
'2015-08-01',
'2015-09-01',
'2015-10-01',
'2015-11-01',
'2015-12-01',
'2016-01-01',
'2016-02-01']
So month_12 columns (month_12 is always the latest month) would be changed to '2016-02-01_Sign', '2016-02-01_Actual', '2016-02-01_Target' in this example.
I tried doing this but it doesn't change anything (trying to change the month_# with the actual date it refers to):
final.replace('month_10', required_date_range[9], inplace=True)
final.replace('month_11', required_date_range[10], inplace=True)
final.replace('month_12', required_date_range[11], inplace=True)
final.replace('month_1', required_date_range[0], inplace=True)
final.replace('month_2', required_date_range[1], inplace=True)
final.replace('month_3', required_date_range[2], inplace=True)
final.replace('month_4', required_date_range[3], inplace=True)
final.replace('month_5', required_date_range[4], inplace=True)
final.replace('month_6', required_date_range[5], inplace=True)
final.replace('month_7', required_date_range[6], inplace=True)
final.replace('month_8', required_date_range[7], inplace=True)
final.replace('month_9', required_date_range[8], inplace=True)
You could construct a dict and then call map on the split column str:
In [27]:
d = dict(zip([str(x) for x in range(1,13)], required_date_range))
d
Out[27]:
{'1': '2015-03-01',
'10': '2015-12-01',
'11': '2016-01-01',
'12': '2016-02-01',
'2': '2015-04-01',
'3': '2015-05-01',
'4': '2015-06-01',
'5': '2015-07-01',
'6': '2015-08-01',
'7': '2015-09-01',
'8': '2015-10-01',
'9': '2015-11-01'}
In [26]:
df.columns = df.columns.to_series().str.rsplit('_').str[1].map(d) + '_' + df.columns.to_series().str.rsplit('_').str[-1]
df.columns
Out[26]:
Index(['2015-03-01_Sign', '2015-04-01_Sign', '2015-05-01_Sign',
'2015-06-01_Sign', '2015-07-01_Sign', '2015-08-01_Sign',
'2015-09-01_Sign', '2015-10-01_Sign', '2015-11-01_Sign',
'2015-12-01_Sign', '2016-01-01_Sign', '2016-02-01_Sign',
'2015-03-01_Actual', '2015-04-01_Actual', '2015-05-01_Actual',
'2015-06-01_Actual', '2015-07-01_Actual', '2015-08-01_Actual',
'2015-09-01_Actual', '2015-10-01_Actual', '2015-11-01_Actual',
'2015-12-01_Actual', '2016-01-01_Actual', '2016-02-01_Actual',
'2015-03-01_Target', '2015-04-01_Target', '2015-05-01_Target',
'2015-06-01_Target', '2015-07-01_Target', '2015-08-01_Target',
'2015-09-01_Target', '2015-10-01_Target', '2015-11-01_Target',
'2015-12-01_Target', '2016-01-01_Target', '2016-02-01_Target'],
dtype='object')
You're going to want to use the .rename method instead of the .replace! For instance this code:
import pandas as pd
d = {'a': [1, 2, 4], 'b':[2,3,4],'c':[3,4,5]}
df = pd.DataFrame(d)
df.rename(columns={'a': 'x1', 'b': 'x2'}, inplace=True)
Changes the 'a' and 'b' column title to 'x1' and 'x2' respectively.
The first line of the renaming code you have would change to:
final.rename(columns={'month_10':required_date_range[9]}, inplace=True)
In fact you could do every column in that one command by adding entries to the columns dictionary argument.
final.rename(columns={'month_10':required_date_range[9],
'month_9':required_date-range[8], ... (and so on) }, inplace=True)
from collections import product
df = pd.DataFrame(np.random.rand(3, 12 * 3), columns=['month_' + str(c[0]) + '_' + c[1] for c in product(range(1, 13), ['Sign', 'Actual', 'Target'])])
First create a mapping to the relevant months.
mapping = {'month_' + str(n): date for n, date in enumerate(required_date_range, 1)}
>>> mapping
{'month_1': '2015-03-01',
'month_10': '2015-12-01',
'month_11': '2016-01-01',
'month_12': '2016-02-01',
'month_2': '2015-04-01',
'month_3': '2015-05-01',
'month_4': '2015-06-01',
'month_5': '2015-07-01',
'month_6': '2015-08-01',
'month_7': '2015-09-01',
'month_8': '2015-10-01',
'month_9': '2015-11-01'}
Then reassign columns, joining the mapped month (e.g. '2016-02-01') to the rest of the column name. This was done using a list comprehension.
df.columns = [mapping.get(c[:c.find('_', 6)]) + c[c.find('_', 6):] for c in cols]
>>> df.columns.tolist()
['2015-03-01_Sign',
'2015-04-01_Sign',
'2015-05-01_Sign',
'2015-06-01_Sign',
'2015-07-01_Sign',
'2015-08-01_Sign',
'2015-09-01_Sign',
'2015-10-01_Sign',
'2015-11-01_Sign',
'2015-12-01_Sign',
'2016-01-01_Sign',
'2016-02-01_Sign',
'2015-03-01_Actual',
'2015-04-01_Actual',
'2015-05-01_Actual',
'2015-06-01_Actual',
'2015-07-01_Actual',
'2015-08-01_Actual',
'2015-09-01_Actual',
'2015-10-01_Actual',
'2015-11-01_Actual',
'2015-12-01_Actual',
'2016-01-01_Actual',
'2016-02-01_Actual',
'2015-03-01_Target',
'2015-04-01_Target',
'2015-05-01_Target',
'2015-06-01_Target',
'2015-07-01_Target',
'2015-08-01_Target',
'2015-09-01_Target',
'2015-10-01_Target',
'2015-11-01_Target',
'2015-12-01_Target',
'2016-01-01_Target',
'2016-02-01_Target']
Related
I have two dataframes. The values in the first dataframe are supposed to be found or retrieved from the second dataframe. The problem is, while the values contained in df1 is autogenerated from an application, the values in df2 are manually entered. To retrieve values of df1 in df2, I take into consideration the unique fields that should match a search, which are
Name
id
date
The above fields work only for values or data that had been inputted on the same date. But because df2 has been manually inputed, the same row value in df1(data_today) might be found in row df2(date_tomorow).
Here is an example code
import pandas as pd
df1 = pd.DataFrame([
['Aa', 1.2, '26-1-2022'],
['Bb', 2.2, '27-1-2022'],
['Bb', 2.3, '28-1-2022'],
['Cc', 3.2, '26-1-2022']
], columns=['name', 'id', 'date'])
df2 = pd.DataFrame([
['Aa', 1.2, '26-1-2022'],
['Bb', 2.2, '27-1-2022'],
['Bb', 2.3, '29-1-2022'],
['Cc', 3.2, '29-1-2022']
], columns=['name', 'id', 'date'])
selected_rows = pd.DataFrame()
for i in df1.index:
name_i = df1._get_value(i, 'name')
id_i = df1._get_value(i, 'id')
date_i = df1._get_value(i, 'date')
for j in df2.index:
name_j = df2._get_value(j, 'name')
id_i = df2._get_value(j, 'id')
date_j = df2._get_value(j, 'date')
if name_i == name_j and id_i == id_i and date_i == date_j:
selected_rows = selected_rows.append(df1.loc[i])
break
print(selected_rows)
How can I extend my code to also include the other dates? Thanks.
You can include a date condition where the difference between the date from df1 and df2 doesn't surpass a specific number of days example 3, 4, or whatever you choose.
from datetime import datetime
def days_between(d1, d2):
d1 = datetime.strptime(d1, "%Y-%m-%d")
d2 = datetime.strptime(d2, "%Y-%m-%d")
return abs((d2 - d1).days)
and in your code update this part:
if name_i == name_j and id_i == id_i and days_between(date_i,date_j)<4:
selected_rows = selected_rows.append(df1.loc[i])
break
I have this dataframe df where
>>> df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
'Cost':[10000, 5000, 15000, 2000],
'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
'Age':['20', '10', '13', '17']})
I want to determine the column index with the corresponding name. I tried it with this:
>>> list(df.columns)
But the solution above only returns the column names without index numbers.
How can I code it so that it would return the column names and the corresponding index for that column? Like This:
0 Date
1 Event
2 Cost
3 Name
4 Age
Simpliest is add pd.Series constructor:
pd.Series(list(df.columns))
Or convert columns to Series and create default index:
df.columns.to_series().reset_index(drop=True)
Or:
df.columns.to_series(index=False)
You can use loop like this:
myList = list(df.columns)
index = 0
for value in myList:
print(index, value)
index += 1
A nice short way to get a dictionary:
d = dict(enumerate(df))
output: {0: 'Date', 1: 'Event', 2: 'Cost', 3: 'Name', 4: 'Age'}
For a Series, pd.Series(list(df)) is sufficient as iteration occurs directly on the column names
In addition to using enumerate, this also can get a numbers in order using zip, as follows:
import pandas as pd
df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
'Cost':[10000, 5000, 15000, 2000],
'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
'Age':['20', '10', '13', '17']})
result = list(zip([i for i in range(len(df.columns))], df.columns.values,))
for r in result:
print(r)
#(0, 'Date')
#(1, 'Event')
#(2, 'Cost')
#(3, 'Name')
#(4, 'Age')
I want to convert this dict into a pandas dataframe where each key becomes a column and values in the list become the rows:
my_dict:
{'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
'Rank': [1, 3, 7, 4, 25],
}
The lists in my_dict can also have some missing values, which should appear as NaNs in dataframe.
This is how I'm currently trying to append it into my dataframe:
df = pd.DataFrame(columns = ['Last updated',
'Symbol',
'Name',
'Rank',]
df = df.append(my_dict, ignore_index=True)
#print(df)
df.to_excel(r'\walletframe.xlsx', index = False, header = True)
But my output only has a single row containing all the values.
The answer was pretty simple, instead of using
df = df.append(my_dict)
I used
df = pd.DataFrame.from_dict(my_dict).T
Which transposes the dataframe so it doesn't has any missing values for columns.
Credits to #Ank who helped me find the solution!
I am fetching data from an API into a pandas dataframe whose index values are as follows:-
df.index=['Q1-2013',
'Q1-2014',
'Q1-2015',
'Q1-2016',
'Q1-2017',
'Q1-2018',
'Q2-2013',
'Q2-2014',
'Q2-2015',
'Q2-2016',
'Q2-2017',
'Q2-2018',
'Q3-2013',
'Q3-2014',
'Q3-2015',
'Q3-2016',
'Q3-2017',
'Q3-2018',
'Q4-2013',
'Q4-2014',
'Q4-2015',
'Q4-2016',
'Q4-2017',
'Q4-2018']
It is a list of string values. Is there a way to convert this to pandas datetime?
I explored few Q&A and they are about using pd.to_datetime which works when the index is of object type.
In this example, index values are strings.
Expected output:
new_df=magic_function(df.index)
print(new_df.index[0])
01-2013
Wondering how to build "magic_function". Thanks in advance.
Q1 is quarter1 which is January, Q2 is quarter2 which is April and Q3 is quarter3 which is July, Q4 is quarter4 which is October
With a bit of manipulation for the parsing to work, you can use pd.PeriodIndex and format as wanted (reason being that the format %Y%q is expected):
df.index = [''.join(s.split('-')[::-1]) for s in df.index]
df.index = pd.PeriodIndex(df.index, freq='Q').to_timestamp().strftime('%m-%Y')
print(df.index)
Index(['01-2013', '01-2014', '01-2015', '01-2016', '01-2017', '01-2018',
'04-2013', '04-2014', '04-2015', '04-2016', '04-2017', '04-2018',
'07-2013', '07-2014', '07-2015', '07-2016', '07-2017', '07-2018',
'10-2013', '10-2014', '10-2015', '10-2016', '10-2017', '10-2018'],
dtype='object')
We could also get the required format using str.replace:
df.index = df.index.str.replace(r'(Q\d)-(\d+)', r'\2\1')
df.index = pd.PeriodIndex(df.index, freq='Q').to_timestamp().strftime('%m-%Y')
You can map a function to index: pandas.Index.map
quarter_months = {
'Q1': 1,
'Q2': 4,
'Q3': 7,
'Q4': 10,
}
def quarter_to_month_year(quarter_year):
quarter, year = quarter_year.split('-')
month_year = '%s-%s'%(quarter_months[quarter], year)
return pd.to_datetime(month_year, format='%m-%Y')
df.index = df.index.map(quarter_to_month_year)
This would produce following result:
DatetimeIndex(['2013-01-01', '2014-01-01', '2015-01-01', '2016-01-01',
'2017-01-01', '2018-01-01', '2013-04-01', '2014-04-01',
'2015-04-01', '2016-04-01', '2017-04-01', '2018-04-01',
'2013-07-01', '2014-07-01', '2015-07-01', '2016-07-01',
'2017-07-01', '2018-07-01', '2013-10-01', '2014-10-01',
'2015-10-01', '2016-10-01', '2017-10-01', '2018-10-01'],
dtype='datetime64[ns]', name='index', freq=None)
to_datetime() function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
It is a datetime64 object when applying to_datetime(), to_period() turns it into a period object, further modifications like to_timestamp().strftime('%m-%Y') turn the index items into strings :
import pandas as pd
df = pd.DataFrame(index=['Q1-2013',
'Q1-2014',
'Q1-2015',
'Q1-2016',
'Q1-2017',
'Q1-2018',
'Q2-2013',
'Q2-2014',
'Q2-2015',
'Q2-2016',
'Q2-2017',
'Q2-2018',
'Q3-2013',
'Q3-2014',
'Q3-2015',
'Q3-2016',
'Q3-2017',
'Q3-2018',
'Q4-2013',
'Q4-2014',
'Q4-2015',
'Q4-2016',
'Q4-2017',
'Q4-2018'])
# df_new = pd.DataFrame(index=pd.to_datetime(['-'.join(s.split('-')[::-1]) for s in df.index]))
df_new = pd.DataFrame(index=pd.to_datetime(['-'.join(s.split('-')[::-1]) for s in df.index]).to_period('M'))
# df_new = pd.DataFrame(index=pd.to_datetime(['-'.join(s.split('-')[::-1]) for s in df.index]).to_period('M').to_timestamp().strftime('m-%Y'))
print(df_new.index)
PeriodIndex(['2013-01', '2014-01', '2015-01', '2016-01', '2017-01', '2018-01',
'2013-04', '2014-04', '2015-04', '2016-04', '2017-04', '2018-04',
'2013-07', '2014-07', '2015-07', '2016-07', '2017-07', '2018-07',
'2013-10', '2014-10', '2015-10', '2016-10', '2017-10', '2018-10'],
dtype='period[M]', freq='M')
I have the following two dataframes:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['01/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
(perhaps it's clearer in the screenshots here: https://imgur.com/a/YNrWpR2)
The df2 is much larger than shown here - it contains columns for 100 companies. So for example, for the 10th company, the column names are: ReturnOnAssets.10, etc.
I have created a dictionary which maps the company names to the column names:
stocks = {'Microsoft':'','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7'}
and so on.
Now, what I am trying to achieve is adding a column "ReturnOnAssets" from d2 to d1, but for a specific company and for a specific date. So looking at df1, the first tweet (i.e. "text") contains a keyword "Amazon" and it was posted on 04/28/2017. I now need to go to df2 to the relevant column name for Amazon (i.e. "ReturnOnAssets.2") and fetch the value for the specified date.
So what I expect looks like this:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon', **'10.5'**], ["blala Amazon", '04/28/2017', 'Amazon', 'x'], ["blabla Netflix', '06/28/2017', 'Netflix', 'x']], columns=['text', 'date', 'keyword', 'ReturnOnAssets'])
By x I mean values which where not included in the example df1 and df2.
I am fairly new to pandas and I can't wrap my head around it. I tried:
keyword = df1['keyword']
txt = 'ReturnOnAssets.'+ stocks[keyword]
df1['ReturnOnAssets'] = df2[txt]
But I don't know how to fetch the relevant date, and also this gives me an error: "Series' objects are mutable, thus they cannot be hashed", which probably comes from the fact that I cannot just add a whole column of keywords to the text string.
I don't know how to achieve the operation I need to do, so I would appreciate help.
It can probably be shortened and you can add if statements to deal with when there are missing values.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([["blala Amazon", '05/28/2017', 'Amazon'], ["blala Facebook", '04/28/2017', 'Facebook'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'dates', 'keyword'])
df1
df2 = pd.DataFrame([['06/28/2017', '3.4', '10.2'], ['05/28/2017', '3.7', '10.5'], ['04/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAsset.1', 'ReturnOnAsset.2'])
#creating myself a bigger df2 to cover all the way to netflix
for i in range (9):
df2[('ReturnOnAsset.' + str(i))]=np.random.randint(1, 1000, df1.shape[0])
stocks = {'Microsoft':'0','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7', 'Netflix': '8'}
#new col where to store values
df1['ReturnOnAsset']=np.nan
for index, row in df1.iterrows():
colname=('ReturnOnAsset.' + stocks[row['keyword']] )
df1['ReturnOnAsset'][index]=df2.loc[df2['dates'] ==row['dates'] , colname]
Next time please give us a correct test data, I modified your dates and dictionary for match the first and second column (netflix and amazon values).
This code will work if and only if all dates from df1 are in df2 (Note that in df1 the column name is date and in df2 the column name is dates)
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '02/30/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['04/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Microsoft':'','Apple' :'5', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Netflix':'1',
'JPMorgan' :'6', 'Alphabet': '7'}
df1["ReturnOnAssets"]= [ df2["ReturnOnAssets." + stocks[ df1[ "keyword" ][ index ] ] ][ df2.index[ df2["dates"] == df1["date"][index] ][0] ] for index in range(len(df1)) ]
df1