I have a dataframe which looks like this:
I wanted to make a dataframe which looks like this:
For this I referred to the post "pandas convert some columns into rows".
By using the merge function I get a dataframe as shown below:
How do I get my dataframe in the format required?
The complete code is as shown:
import pandas as pd
from nsepy import get_history
from datetime import date
import numpy as np
stock = ['APLAPOLLO','AUBANK','AARTIDRUGS','AARTIIND','AAVAS','ABBOTINDIA','ADANIENT','ADANIGAS','ADANIGREEN','ADANIPORTS']
res = dict(zip(stock,stock))
start = date (2020, 11, 22)
end = date (2020, 12, 22)
for stock_name in stock:
    data = get_history(symbol=stock_name, start=start, end=end)
    res[stock_name]=data
for key, df in res.items():
    # create a column called "key name"
    df['key_name'] = key
lst = list(res.values())
df = pd.concat(lst)
df['boolean'] = df['Prev Close'] < df['Close']
df1 = pd.DataFrame({'boolean' : [True] + [False] * 2 + [True] * 3})
a = df['boolean']
b = a.cumsum()
df['trend'] = (b-b.mask(a).ffill().fillna(0).astype(int)).where(a, 0)
conditions = [(df['boolean']==True), (df['boolean']==False)]
values=['Win','Loose']
df['Win/Loss']=np.select(conditions,values)
df=df.drop(['Win/Loose'],axis=1)
df.to_csv('data.csv')
conditions = [(df['trend']>=2), df['trend']<2]
df2=df[['trend','Symbol']]
w=df2.melt(id_vars=["trend"],value_vars=['Symbol'])
IIUC, this can be solved with pivot_table():
Given the original dataframe you show in the first image:
new_df = df.pivot_table(index='Date',columns='Symbol',values='trend')
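To make the reshaping concrete, here is a minimal sketch on made-up data. The dates, symbols and trend values are invented for illustration, and the names long_df and wide_df are not from the question; Date is assumed to be an ordinary column.

```python
import pandas as pd

# Invented long-format data: one row per (Date, Symbol) pair.
long_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02"]),
    "Symbol": ["AUBANK", "AAVAS", "AUBANK", "AAVAS"],
    "trend": [1, 0, 2, 1],
})

# One row per date, one column per symbol; each cell holds the trend value.
wide_df = long_df.pivot_table(index="Date", columns="Symbol", values="trend")
print(wide_df)
```

Note that pivot_table aggregates duplicate (Date, Symbol) pairs with the mean by default, so the cells come back as floats. If every pair is unique, df.pivot(...) does the same reshaping and raises an error on duplicates instead of averaging them.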
In the following code I am generating new columns that hold a list of the names of those columns whose values are >90 and <10. I have another, similar time series dataframe, df_1, and I want to get from it the values for the columns that are listed in df['Top_90'] and df['Below_10'] of the first dataframe.
Thanks in advance!
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['Date'])
df['Data_1'] = np.random.randint(0,100,size=(len(date_rng)))
df['Data_2'] = np.random.randint(0,100,size=(len(date_rng)))
df['Data_3'] = np.random.randint(0,100,size=(len(date_rng)))
df['Data_4'] = np.random.randint(0,100,size=(len(date_rng)))
df['Top_90'] = list(map(str, df.apply(lambda x: ','.join(x.index[x > 80]), axis=1)))
df['Below_10'] = list(map(str, df.drop('Top_90', axis=1).apply(lambda x: ','.join(x.index[x > 10]), axis=1)))
date_rng_1 = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df_1 = pd.DataFrame(date_rng_1, columns=['Date'])
df_1['Data_1'] = np.random.randint(0,1000,size=(len(date_rng)))
df_1['Data_2'] = np.random.randint(0,1000,size=(len(date_rng)))
df_1['Data_3'] = np.random.randint(0,1000,size=(len(date_rng)))
df_1['Data_4'] = np.random.randint(0,1000,size=(len(date_rng)))
df_1 = df.set_index('Date')
for index in df_1.index:
    print(df_1)
    for col in df['Top_90']:
        print(df_1._get_value(index, col))
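One possible approach, sketched on small hand-made frames rather than the random data above: build a boolean mask from df and use it to pick the matching cells out of df_1. This assumes both frames share the same Date index and the same column names; value_cols, top_mask, picked and Top_90_values are names invented for this sketch.

```python
import pandas as pd

dates = pd.to_datetime(["2018-01-01 00:00", "2018-01-01 01:00", "2018-01-01 02:00"])
value_cols = ["Data_1", "Data_2"]

# Hand-made stand-ins for the two time series frames (same index, same columns).
df = pd.DataFrame({"Data_1": [95, 5, 91], "Data_2": [10, 99, 93]}, index=dates)
df_1 = pd.DataFrame({"Data_1": [950, 50, 910], "Data_2": [100, 990, 930]}, index=dates)

# True where the first frame exceeds the threshold.
top_mask = df[value_cols] > 90

# Keep only the df_1 cells flagged by the mask; everything else becomes NaN.
picked = df_1[value_cols].where(top_mask)

# Comma-separated column names per row, as in the question.
df["Top_90"] = [",".join(top_mask.columns[row]) for row in top_mask.to_numpy()]
# Matching df_1 values per row, stored as {column name: value}.
df["Top_90_values"] = [row.dropna().to_dict() for _, row in picked.iterrows()]
print(df)
```

The same pattern with `df[value_cols] < 10` would cover the Below_10 case.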
I would like to ask how to compute a sum using Python or Excel.
I would like to sum the "number" column based on the "time" column.
Sum of the Duration for (00:00 am - 00:59 am) is (2+4) 6.
Sum of the Duration for (02:00 am - 02:59 am) is (3+1) 4.
Could you please advise how to do this?
When you have a dataframe you can use groupby to accomplish this:
# import pandas module
import pandas as pd
# Create a dictionary with the values
data = {
    'time' : ["12:20:51", "12:40:51", "2:26:35", "2:37:35"],
    'number' : [2, 4, 3, 1]}
# create a Pandas dataframe
df = pd.DataFrame(data)
# or load the CSV instead
# df = pd.read_csv('path/dir/filename.csv')
# Convert time column to datetime data type
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')
# add values by hour (groupby sorts the hours in ascending order)
dff = df.groupby(df['time'].dt.hour)['number'].sum()
print(dff.head(50))
output:
time
2     4
12    6
When you need to group by more than one column, you can pass the columns as a list to .groupby(). The code will look like this:
import pandas as pd
df = pd.read_csv('filename.csv')
# Convert time column to datetime data type
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
# add values by hour
dff = df.groupby([df['date'], df['time'].dt.hour])['number'].sum()
print(dff.head(50))
# save the file
dff.to_csv("filename.csv")
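For a version that runs without a CSV file, here is a small self-contained sketch of the same two-key grouping. The dates and times are invented, and the names hourly and hour are chosen for this example only.

```python
import pandas as pd

# Invented sample rows: two on the same date and hour, two elsewhere.
df = pd.DataFrame({
    "date": ["01/02/2021", "01/02/2021", "01/02/2021", "02/02/2021"],
    "time": ["12:20:51", "12:40:51", "02:26:35", "02:37:35"],
    "number": [2, 4, 3, 1],
})
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
df["time"] = pd.to_datetime(df["time"], format="%H:%M:%S")

# Group by calendar date and hour of day, then sum "number" within each group.
hourly = (
    df.groupby([df["date"], df["time"].dt.hour.rename("hour")])["number"]
      .sum()
      .reset_index()
)
print(hourly)
```

reset_index() turns the (date, hour) group keys back into ordinary columns, which is usually easier to export or to read in Excel.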
My initial dataframe is like this:
Original Dataframe
My code:
import pandas as pd
import numpy as np
def visualize_weather():
    df=pd.read_csv('weather.csv')
    def break_date(date):
        day=date[-2:]
        month=date[-5:-3]
        year=date[:4]
        return day,month,year
    df['Day'],df['Month'],df['Year']=df['Date'].apply(break_date)
    return df[['Day','Month','Year','Date']]
visualize_weather()
I am trying to split the date into day, month and year and store them in separate columns.
But I am getting the error:
ValueError: too many values to unpack (expected 3)
Is there any way to achieve this without writing three different functions for day, month and year?
You can use the following code to modify the dataframe in place. Make the changes directly on the dataframe object inside the function; otherwise they will be lost.
import pandas as pd
df = pd.DataFrame(data={'dates': pd.bdate_range('2020-07-01', '2020-07-31', freq='B')})
def func(row):
    df.loc[row.name, 'Day'] = row['dates'].day
    df.loc[row.name, 'Month'] = row['dates'].month
    df.loc[row.name, 'Year'] = row['dates'].year
    print('Done')
df.apply(func, axis=1)
Hmm... Processing dates row by row is not good practice. Instead, you should:
Use pd.to_datetime() if you need a datetime column
Since you're working with dates, you can use datetime.year, datetime.month, datetime.day.
So:
import pandas as pd
import numpy as np
import datetime
first_date = datetime.date(2020, 1, 3)
second_date = datetime.date(2019, 2, 10)
third_date = datetime.date(2018, 2, 20)
df = pd.DataFrame({"dates":[first_date,second_date,third_date ]})
def new_col(df):
    # number of rows (df.size counts cells, which only matches for a single column)
    size = len(df)
    years = []
    months = []
    days = []
    for row in range(0, size):
        year = df.iloc[row, 0].year
        years.append(year)
        month = df.iloc[row, 0].month
        months.append(month)
        day = df.iloc[row, 0].day
        days.append(day)
    df['years'] = years
    df['months'] = months
    df['days'] = days
    df.drop(['dates'],axis='columns',inplace=True)
    return df
new_col(df)
P.S. You can also use any names you like for the new columns.
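For comparison, a vectorized sketch that avoids explicit loops altogether. The original ValueError comes from the fact that df['Date'].apply(break_date) returns one tuple per row, so the result has as many elements as the dataframe has rows and cannot be unpacked into three names. The sample dates below are invented, and the 'YYYY-MM-DD' string format is an assumption based on the slicing in the question.

```python
import pandas as pd

# Invented sample; the real data would come from weather.csv.
df = pd.DataFrame({"Date": ["2020-01-03", "2019-02-10", "2018-02-20"]})

# Parse once, then read the parts through the .dt accessor.
dates = pd.to_datetime(df["Date"], format="%Y-%m-%d")
df["Day"] = dates.dt.day
df["Month"] = dates.dt.month
df["Year"] = dates.dt.year

print(df[["Day", "Month", "Year", "Date"]])
```

This gives integer columns rather than the string slices of the original function, which is usually what you want for filtering or grouping.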
I have the dataframe below, called "df", and I am calculating the sum per unique id, "Id".
Can anyone help me optimize the code I have tried?
import pandas as pd
from datetime import datetime, timedelta
df= {'Date':['2019-01-11 10:23:45','2019-01-09 10:23:45', '2019-01-11 10:27:45',
             '2019-01-11 10:25:45', '2019-01-11 10:30:45', '2019-01-11 10:35:45',
             '2019-02-09 10:25:45'],
     'Id':['100','200','300','100','100', '100','200'],
     'Amount':[200,400,330,100,300,200,500],
    }
df= pd.DataFrame(df)
df["Date"] = pd.to_datetime(df['Date'])
You can try groupby, so that each lookup runs within its own sub-group rather than against the whole df:
s = {}
# 'NCC' and 'Sys' do not appear in the sample above; they are presumably columns of the full data
for x , y in df.groupby(['Id','NCC']):
    for i in y.index:
        start_date = y['Date'][i] - timedelta(seconds=300)
        end_date = y['Date'][i]
        mask = (y['Date'] >= start_date) & (y['Date'] < end_date)
        count = y.loc[mask]
        count = count.loc[(y['Sys'] == 1)]
        if len(count) == 0:
            s.update({i : 0})
        else:
            s.update({i : count['Amount'].sum()})
df['New']=pd.Series(s)
If the original data frame has 2 million rows, it would probably be faster to convert the 'Date' column to an index and sort it. Then you can sub select each 5-minute interval:
df = df.set_index('Date').sort_index()
df['Sum_Amt'] = 0
for end in df.index:
    start = end - pd.Timedelta('5min')
    current_window = df[start : end] # data frame with 5-minute look-back
    sum_amt = <calc logic applied to `current_window` goes here>
    df.at[end, 'Sum_Amt'] = sum_amt
    print(current_window)
    print()
I'm not following the logic for calculating Sum_Amt, so I left that out.
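If the goal is, for each row, the sum of Amount over the previous five minutes for the same Id, a time-based rolling window may avoid the Python loops entirely. This is only a sketch under that assumption: it ignores the 'NCC' and 'Sys' filters used in the loop above because those columns are not in the sample data, and the names sum_amt and Sum_Amt are chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2019-01-11 10:23:45", "2019-01-09 10:23:45", "2019-01-11 10:27:45",
             "2019-01-11 10:25:45", "2019-01-11 10:30:45", "2019-01-11 10:35:45",
             "2019-02-09 10:25:45"],
    "Id": ["100", "200", "300", "100", "100", "100", "200"],
    "Amount": [200, 400, 330, 100, 300, 200, 500],
})
df["Date"] = pd.to_datetime(df["Date"])

# closed="left" makes each window [t - 5min, t): earlier rows only, current row excluded.
sum_amt = (
    df.sort_values("Date")
      .set_index("Date")
      .groupby("Id")["Amount"]
      .rolling("5min", closed="left")
      .sum()
      .fillna(0)            # an empty window yields NaN; treat it as 0
      .rename("Sum_Amt")
      .reset_index()
)
print(sum_amt)
```

The result has one row per original row, ordered by Id and then Date. With closed="left" the window matches the `>= start_date` and `< end_date` mask of the loop version.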
I have previously asked the question Pandas set element style dependent on another dataframe, which I have a working solution to, but now I am trying to apply it to a data frame with a multi index and I am getting an error, which I do not understand.
Problem
I have a pandas df and accompanying boolean matrix. I want to highlight the df depending on the boolean matrix.
Data
import pandas as pd
import numpy as np
from datetime import datetime
date = pd.date_range(start = datetime(2016,1,1), end = datetime(2016,2,1), freq = "D")
i = len(date)
dic = {'X':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B']),
'Y':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B']),
'Z':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B'])}
df = pd.concat(dic.values(),axis=1,keys=dic.keys())
boo = [True, False]
bool_matrix = {'X':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B']),
'Y':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B']),
'Z':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B'])}
bool_matrix =pd.concat(bool_matrix.values(),axis=1,keys=bool_matrix.keys())
My attempted solution
def highlight(value):
    return 'background-color: green'
my_style = df.style
for column in df.columns:
    for i in df[column].index:
        data = bool_matrix.loc[i, column]
        if data:
            my_style = df.style.use(my_style.export()).applymap(highlight, subset = pd.IndexSlice[i, column])
my_style
Results
The above throws an AttributeError: 'Series' object has no attribute 'applymap'
I do not understand what is returning as a Series. This is a single value I am subsetting and this solution worked for non multi-indexed df's as shown below.
Without Multi-index
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(24)
date = pd.date_range(start = datetime(2016,1,1), end = datetime(2016,2,1), freq = "D")
df = pd.DataFrame({'A': np.linspace(1, 100, len(date))})
df = pd.concat([df, pd.DataFrame(np.random.randn(len(date), 4), columns=list('BCDE'))],
axis=1)
df['date'] = date
df.set_index("date", inplace = True)
boo = [True, False]
bool_matrix = pd.DataFrame(np.random.choice(boo, (len(date), 5),p=[0.3,.7]), index = date,columns=list('ABCDE'))
def highlight(value):
    return 'background-color: green'
my_style = df.style
for column in df.columns:
    for i in bool_matrix.index:
        data = bool_matrix.loc[i, column]
        if data:
            my_style = df.style.use(my_style.export()).applymap(highlight, subset = pd.IndexSlice[i,column])
my_style
Documentation
The docs make reference to CSS Classes and say that "Index label cells include level<k> where k is the level in a MultiIndex." I am obviously indexing this wrong, but am stumped on how to proceed.
It's very nice that there is a runnable example.
You can use df.style.apply(..., axis=None) to apply a highlight method to the whole dataframe.
With your df and bool_matrix, try this:
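def highlight(value):
    # start from an all-empty string frame so style strings are not written into float columns
    d = pd.DataFrame('', index=value.index, columns=value.columns)
    for c in d.columns:
        for r in df.index:
            if bool_matrix.loc[r, c]:
                d.loc[r, c] = 'background-color: green'
            else:
                d.loc[r, c] = ''
    return d
df.style.apply(highlight, axis=None)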
Or, to keep the code simple, you can try:
def highlight(value):
    return bool_matrix.applymap(lambda x: 'background-color: green' if x else '')
df.style.apply(highlight, axis=None)
Hope this is what you need.
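As a further variation, the whole style frame can be built in one vectorized step with numpy.where instead of a per-cell lambda. The sketch below uses a small invented MultiIndex frame and mask so that it runs on its own; it assumes df and bool_matrix carry identical row and column labels.

```python
import numpy as np
import pandas as pd

# Small invented stand-ins with two-level columns, like the question's data.
date = pd.date_range("2016-01-01", periods=3, freq="D")
cols = pd.MultiIndex.from_product([["X", "Y"], ["A", "B"]])
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), index=date, columns=cols)
bool_matrix = pd.DataFrame(np.arange(12).reshape(3, 4) % 3 == 0, index=date, columns=cols)

def highlight(data):
    # Line the mask up with the data being styled, then map True/False to CSS strings.
    mask = bool_matrix.loc[data.index, data.columns].to_numpy()
    css = np.where(mask, "background-color: green", "")
    return pd.DataFrame(css, index=data.index, columns=data.columns)

styles = highlight(df)
print(styles)
# In a notebook: df.style.apply(highlight, axis=None)  (the Styler needs jinja2 installed)
```

Because the function returns a frame with the same labels as the data, it works unchanged with axis=None on both flat and MultiIndex columns.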