I have a dataframe df like this:
time | Score| weekday
01-01-21 12:00 | 1 | Friday
01-01-21 24:00 | 33 | Friday
02-01-21 12:00 | 12 | Saturday
02-01-21 24:00 | 9 | Saturday
03-01-21 12:00 | 11 | Sunday
03-01-21 24:00 | 8 | Sunday
I now want to get the correlation between columns Score and weekday.
I did the following to get it:
s_corr = df.weekday.str.get_dummies().corrwith(df['Score'])
print (s_corr)
I am now wondering if this is the correct way of doing it? Or would it be better to create a new dataframe in which all the rows are first summed for each day by the time column and after this using the code above to get the correlation between Score and weekday? Or are there maybe other suggestions for improvement?
I have used numpy.corrcoef before for getting correlations between continuous and categorical variables. You can try it and see if it works for you:
I first created dummies for the categorical variables:
df_dummies = pd.get_dummies(df['weekday'], drop_first= True)
df_new = pd.concat([df['Score'], df_dummies], axis=1)
I then converted the DataFrame with the dummies to a numpy array and applied corrcoef to it:
df_arr = df_new.to_numpy()
corr_matrix = np.corrcoef(df_arr.T)
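For reference, here is a minimal, self-contained sketch of the steps above using the sample data from the question. The `score_corr` name and the cast of the dummies to float are my own additions for illustration. Only the first row of the correlation matrix (Score against each dummy) is kept, since that is the part the question asks about.

```python
import numpy as np
import pandas as pd

# Sample data from the question (the time column is not needed here)
df = pd.DataFrame({
    "Score": [1, 33, 12, 9, 11, 8],
    "weekday": ["Friday", "Friday", "Saturday", "Saturday", "Sunday", "Sunday"],
})

# One 0/1 column per weekday; drop_first=True drops the first category (Friday)
dummies = pd.get_dummies(df["weekday"], drop_first=True).astype(float)
df_new = pd.concat([df["Score"], dummies], axis=1)

# np.corrcoef expects one variable per row, hence the transpose
corr_matrix = np.corrcoef(df_new.to_numpy(dtype=float).T)

# Row 0 is Score; columns 1.. are the weekday dummies
score_corr = pd.Series(corr_matrix[0, 1:], index=dummies.columns)
print(score_corr)
```

These values should agree with what `corrwith` returns for the same dummy columns, so both approaches compute the same Pearson correlations.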
I know there are some questions on Stack Overflow about sumproduct, but the solutions are not working for me. I am also new to Python pandas.
For each row, I want to do a sumproduct of certain columns only if column['2020'] !=0.
I used the below code, but get error:
IndexError: ('index 2018 is out of bounds for axis 0 with size 27', 'occurred at index 0')
Please help. Thank you.
# df_copy is my dataframe
column_list=[2018,2019]
weights=[6,9]
def test(df_copy):
    if df_copy[2020]!=0:
        W_Avg=sum(df_copy[column_list]*weights)
    else:
        W_Avg=0
    return W_Avg
df_copy['sumpr']=df_copy.apply(test, axis=1)
df_copy
**|2020 | 2018 | 2019 | sumpr|**
|0 | 100 | 20 | 0 |
|1 | 30 | 10 | 270 |
|3 | 10 | 10 | 150 |
I am sorry if the table doesn't look like a table. I can't create a table properly in Stackoverflow.
Basically for a particular row, if
2020 = 2 ,
2018 =30 ,
2019 =10 ,
sumpr = 30 * 6 + 10 * 9 = 270
Your column names are most likely strings, not integers.
To confirm it, run df_copy.columns and you should receive something like:
Index(['2020', '2018', '2019'], dtype='object')
(note apostrophes surrounding column names).
So change your column list to:
column_list = ['2018', '2019']
In your function change also the column name to a string:
df_copy['2020']
Then your code should run.
You can also use more concise code (this needs import numpy as np):
df_copy['sumpr'] = np.where(df_copy['2020'] != 0, (df_copy[column_list] * weights).sum(axis=1), 0)
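As a quick check, here is a small runnable sketch of the vectorized approach on the three rows from the question. It assumes the column names are strings, as suggested above. The intermediate name `weighted` is only for illustration.

```python
import numpy as np
import pandas as pd

# The three sample rows from the question, with string column names (assumption)
df_copy = pd.DataFrame({
    "2020": [0, 1, 3],
    "2018": [100, 30, 10],
    "2019": [20, 10, 10],
})

column_list = ["2018", "2019"]
weights = [6, 9]

# Multiply each column by its weight, then sum across each row
weighted = df_copy[column_list].mul(weights, axis="columns").sum(axis=1)

# Keep the weighted sum only where the 2020 column is non-zero
df_copy["sumpr"] = np.where(df_copy["2020"] != 0, weighted, 0)
print(df_copy)
```

The second row gives 30 * 6 + 10 * 9 = 270 and the third gives 10 * 6 + 10 * 9 = 150, matching the expected table.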
I'm trying something I've never done before and I need some help.
Basically, I need to filter sections of a pandas dataframe, transpose each filtered section and then concatenate all the resulting sections together.
Here's a representation of my dataframe:
df:
id | text_field | text_value
1 Date 2021-06-23
1 Hour 10:50
2 Position City
2 Position Countryside
3 Date 2021-06-22
3 Hour 10:45
I can then use some filtering method to isolate parts of my data:
df.groupby('id').filter(lambda x: True)
test = df.query(' id == 1 ')
test = test[["text_field","text_value"]]
test_t = test.set_index("text_field").T
test_t:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
If I repeat the process for the rows with id == 3 and then concatenate the result with test_t, I'll have the following:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
text_value | 2021-06-22 | 10:45
I'm aware that doing this with the rows where id == 2 will give me other columns, and that's alright too; it's what I want as well.
What I can't figure out is how to do this for every "id" in my dataframe. I wasn't able to write a function or for loop that works. Can somebody help me?
To summarize:
1 - I need to split my dataframe into sections according to the values in the "id" column
2 - After that I need to remove the "id" column and transpose the result
3 - I need to concatenate all the resulting dataframes into one big dataframe
You can use pivot_table:
df.pivot_table(
index='id', columns='text_field', values='text_value', aggfunc='first')
Output:
text_field Date Hour Position
id
1 2021-06-23 10:50 NaN
2 NaN NaN City
3 2021-06-22 10:45 NaN
It's not exactly clear how you want to deal with repeating values though, would be great to have some description of that (id=2 would make a good example)
Update: If you want to ignore the ids and simply concatenate all the values:
pd.DataFrame(df.groupby('text_field')['text_value'].apply(list).to_dict())
Output:
Date Hour Position
0 2021-06-23 10:50 City
1 2021-06-22 10:45 Countryside
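If the repeated values for one id should be kept rather than reduced to the first one, one possible sketch is to join them into a single string per id and field. This is only an assumption about the desired result, and it uses groupby plus unstack instead of pivot_table. The `", "` separator and the name `wide` are arbitrary choices for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "text_field": ["Date", "Hour", "Position", "Position", "Date", "Hour"],
    "text_value": ["2021-06-23", "10:50", "City", "Countryside", "2021-06-22", "10:45"],
})

# Join all values that share the same (id, text_field), then move
# text_field from the index into the columns
wide = (
    df.groupby(["id", "text_field"])["text_value"]
      .agg(", ".join)
      .unstack()
)
print(wide)
```

With this data, id 2 ends up with "City, Countryside" in the Position column, while Date and Hour stay NaN for that row.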
I'm trying to convert some VBA scripts into a Python script, and I'm having trouble figuring a few things out, as the results differ from what the Excel file gives.
So I have an example dataframe like this :
|Name | A_Date |
_______________________
|RAHEAL | 04/30/2020|
|GIFTY | 05/31/2020|
|ERIC | 03/16/2020|
|PETER | 05/01/2020|
|EMMANUEL| 12/15/2019|
|BABA | 05/23/2020|
and I would want to achieve this result(VBA script result) :
|Name | A_Date | Sold
__________________________________
|RAHEAL | 04/30/2020| No
|GIFTY | 05/31/2020| Yes
|ERIC | 03/16/2020| No
|PETER | 05/01/2020| Yes
|EMMANUEL| 12/15/2019| No
|BABA | 05/23/2020| Yes
By converting this VBA script :
Range("C2").Select
Selection.Value = _
"=IF(RC[-1]>=(INT(R2C2)-DAY(INT(R2C2))+1),""Yes"",""No"")"
Selection.AutoFill Destination:=Range("C2", "C" & Cells(Rows.Count, 1).End(xlUp).Row)
Range("C1").Value = "Sold"
ActiveSheet.Columns("C").Copy
ActiveSheet.Columns("C").PasteSpecial xlPasteValues
Simply: =IF(B2>=(INT($B$2)-DAY(INT($B$2))+1),"Yes","No")
To this Python script:
sales['Sold']=np.where(sales['A_Date']>=(sales['A_Date'] - pd.to_timedelta(sales.A_Date.dt.day, unit='d'))+ timedelta(days=1),'Yes','No')
But I keep getting "Yes" for every row. Could anyone help me spot where I might have made a mistake?
import pandas as pd
df = pd.DataFrame({'Name':['RAHEAL','GIFTY','ERIC','PETER','EMMANUEL','BABA'],
'A_Date':['04/30/2020','05/31/2020','03/16/2020',
'05/01/2020','12/15/2019','05/23/2020']})
df['A_Date'] = pd.to_datetime(df['A_Date'])
print(df)
df['Sold'] = df['A_Date'] >= df['A_Date'].iloc[0].replace(day=1)
df['Sold'] = df['Sold'].map({True:'Yes', False:'No'})
print(df)
output:
Name A_Date
0 RAHEAL 2020-04-30
1 GIFTY 2020-05-31
2 ERIC 2020-03-16
3 PETER 2020-05-01
4 EMMANUEL 2019-12-15
5 BABA 2020-05-23
Name A_Date Sold
0 RAHEAL 2020-04-30 Yes
1 GIFTY 2020-05-31 Yes
2 ERIC 2020-03-16 No
3 PETER 2020-05-01 Yes
4 EMMANUEL 2019-12-15 No
5 BABA 2020-05-23 Yes
If I read the formula right, the condition is A_Date >= 04/01/2020 (i.e. the first day of the month of the date in B2), so RAHEAL should be Yes too.
I don't know if you noticed (and if this is intended), but if A_Date value has a fractional part (i.e. time), when you calculate the value for 1st of the month, there is room for error. If the time in B2 is let's say 10:00 AM, when you calculate cut value, it will be 04/1/2020 10:00. Then if you have another value, let's say 04/01/2020 09:00, it will be evaluated as False/No. This is how it works also in your Excel formula.
EDIT (12 Jan 2021): Note, values in column A_Date are of type datetime.datetime or datetime.date. Presumably they are converted when reading the Excel file or explicitly afterwards.
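If the time of day should not influence the cutoff, one option is to drop it before moving to the first of the month. The sketch below is an assumption about the desired behaviour, not part of the original formula. It uses `Timestamp.normalize()` to reset the time to midnight and `np.where` to produce the Yes/No labels. The names `first_date` and `cutoff` are only illustrative.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "Name": ["RAHEAL", "GIFTY", "ERIC", "PETER", "EMMANUEL", "BABA"],
    "A_Date": pd.to_datetime(
        ["04/30/2020", "05/31/2020", "03/16/2020",
         "05/01/2020", "12/15/2019", "05/23/2020"],
        format="%m/%d/%Y",
    ),
})

# Date in the first data row (cell B2 in the sheet)
first_date = sales["A_Date"].iloc[0]

# normalize() sets the time to midnight; replace(day=1) moves to the 1st of that month
cutoff = first_date.normalize().replace(day=1)

# Compare every row against the same fixed cutoff
sales["Sold"] = np.where(sales["A_Date"] >= cutoff, "Yes", "No")
print(sales)
```

The key difference from the code in the question is that every row is compared against one fixed cutoff derived from the first row, rather than against the first day of its own month.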
Very much embarrassed that I didn't see the simple, elegant solution that buran gave +. I did more of a literal translation.
first_date.toordinal() - 693594 is the integer date value for your initial date, current_date.toordinal() - 693594 is the integer date value for the current iteration of the dates column. I apply your cell formula logic to each A_Date row value and output as the corresponding Sold column value.
import pandas as pd
from datetime import datetime
def is_sold(current_date:datetime, first_date:datetime, day_no:int)->str:
    # use of toordinal idea from #rjha94 https://stackoverflow.com/a/47478659
    if current_date.toordinal() - 693594 >= first_date.toordinal() - 693594 - day_no + 1:
        return "Yes"
    else:
        return "No"
sales = pd.DataFrame({'Name':['RAHEAL','GIFTY','ERIC','PETER','EMMANUEL','BABA'],
'A_Date':['2020-04-30','2020-05-31','2020-03-16','2020-05-01','2019-12-15','2020-05-23']})
sales['A_Date'] = pd.to_datetime(sales['A_Date'], errors='coerce')
sales['Sold'] = sales['A_Date'].apply(lambda x:is_sold(x, sales['A_Date'][0], x.day))
print(sales)
I have data over the timespan of over a year. I am interested in grouping the data by week, and getting the slope of two variables by week. Here is what the data looks like:
Date | Total_Sales| Products
2015-12-30 07:42:50| 2900 | 24
2015-12-30 09:10:10| 3400 | 20
2016-02-07 07:07:07| 5400 | 25
2016-02-07 07:08:08| 1000 | 64
So ideally I would like to perform a linear regression on total_sales and products on each week of this data and record the slope. This works when each week is represented in the data, but I have problems when there are some weeks skipped in the data. I know I could do this with turning the date into the week number but I feel like the result will be skewed because there is over a year's worth of data.
Here is the code I have so far:
df['Date']=pd.to_datetime(vals['EventDate']) - pd.to_timedelta(7,unit='d')
df.groupby(pd.Grouper(key='Week', freq='W-MON')).apply(lambda v: linregress(v.Total_Sales, v.Products)[0]).reset_index()
However, I get the following error:
ValueError: Inputs must not be empty.
I expect the output to look like this:
Date | Slope
2015-12-28 | -0.008
2016-02-01 | -0.008
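I assume this happens because grouping with pd.Grouper and a weekly frequency also creates empty groups for the weeks that have no rows, and linregress raises an error on empty input. The Date column also carries varying timestamps, which does not help.
Try the following code, which groups only on the weeks that actually occur in the data. It worked for me: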
from datetime import timedelta
from scipy import stats
df['Date'] = pd.to_datetime(df['Date'])  # convert the Date column to datetime
# weekday() returns the day of the week as an integer, where Monday is 0 and Sunday is 6
df['daysoffset'] = df['Date'].apply(lambda x: x.weekday())
# x['Date'].date() drops the time of day and keeps only the date;
# subtracting the offset assigns the Monday of that week to the column 'week_start'
df['week_start'] = df.apply(lambda x: x['Date'].date() - timedelta(days=x['daysoffset']), axis=1)
df.groupby('week_start').apply(lambda v: stats.linregress(v.Total_Sales, v.Products)[0]).reset_index()
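The same idea can be sketched without row-wise apply. Note two substitutions in this version: the week start is computed with vectorized datetime operations, and the slope comes from `np.polyfit` rather than `scipy.stats.linregress`, purely to keep the example free of a SciPy dependency. For a straight-line fit both should give the same slope. The helper name `slope` is only for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-12-30 07:42:50", "2015-12-30 09:10:10",
        "2016-02-07 07:07:07", "2016-02-07 07:08:08",
    ]),
    "Total_Sales": [2900, 3400, 5400, 1000],
    "Products": [24, 20, 25, 64],
})

# Monday of each row's week: drop the time, then subtract the weekday (Monday = 0)
df["week_start"] = df["Date"].dt.normalize() - pd.to_timedelta(df["Date"].dt.weekday, unit="D")

def slope(group):
    # Degree-1 least-squares fit; the first coefficient is the slope
    return np.polyfit(group["Total_Sales"], group["Products"], 1)[0]

# Only weeks that actually occur in the data become groups, so none is empty
slopes = (
    df.groupby("week_start")[["Total_Sales", "Products"]]
      .apply(slope)
      .reset_index(name="Slope")
)
print(slopes)
```

Because the grouping key is an ordinary column rather than a frequency-based grouper, weeks missing from the data simply do not appear in the result.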
I have a dataframe that looks like the below.
Day | Price
12-05-2015 | 73
12-06-2015 | 68
11-07-2015 | 77
10-08-2015 | 54
I would like to subtract the price for one Day from the corresponding price 30 days later. To add to the days, I've used data.loc[data['Day'] + timedelta(days=30)] however this obviously overflowed near the final dates in my dataframe. Is there a way to subtract the prices without iterating over all the rows in the dataframe?
If it helps, my desired output is something like the following.
Start_Day | Price
12-05-2015 | -5
11-07-2015 | -23
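You can use the df.diff() function, which returns the difference between each row and the previous one (it works by row position, not by a 30-day offset):
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.diff.html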
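To make the difference between the two readings concrete, here is a rough sketch on the sample data. It assumes the dates are in day-month-year format, which the question does not state. The first variant looks up the price exactly 30 days later with `reindex`, which yields NaN instead of failing when that date is missing. The second uses `diff` on consecutive rows. The names `change_30d` and `consecutive` are only for illustration.

```python
import pandas as pd

# Sample data from the question; day-month-year is an assumption
data = pd.DataFrame({
    "Day": pd.to_datetime(
        ["12-05-2015", "12-06-2015", "11-07-2015", "10-08-2015"],
        format="%d-%m-%Y",
    ),
    "Price": [73, 68, 77, 54],
})

prices = data.set_index("Day")["Price"]

# Variant 1: price exactly 30 days later; dates not present become NaN
later = prices.reindex(prices.index + pd.Timedelta(days=30))
change_30d = pd.Series(
    later.to_numpy() - prices.to_numpy(),
    index=prices.index,
    name="Price_change",
).dropna()

# Variant 2: next row minus current row, labelled by the starting day
consecutive = prices.diff().shift(-1).dropna()

print(change_30d)
print(consecutive)
```

With these four rows only 11-07-2015 has a row exactly 30 days later, so the first variant returns a single value, while the second returns one value per consecutive pair regardless of the actual gap in days.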