Pyspark dataframe sum variable up to the current row's month [duplicate] - python

This question already has answers here:
Calculating Cumulative sum in PySpark using Window Functions
(2 answers)
Closed 3 months ago.
I have a pyspark dataframe that looks as follows:
date, loan
1.1.2020, 0
1.2.2020, 0
1.3.2020, 0
1.4.2020, 10000
1.5.2020, 200
1.6.2020, 0
I would like the loan taken out in month 4 to be reflected in the later months as well. So the resulting dataframe would be:
date, loan
1.1.2020, 0
1.2.2020, 0
1.3.2020, 0
1.4.2020, 10000
1.5.2020, 10200
1.6.2020, 10200
Is there any simple way to do this in pyspark? Thanks.

@Ehrendil - do you want to calculate a running total?
select date, loan,
sum(loan) over (order by date rows between unbounded preceding and current row) as running_total from table
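The same running total can be written with the PySpark DataFrame API. A minimal sketch, assuming one row per month; the sample data is copied from the question and the d.M.yyyy parsing format is an assumption based on it:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("1.1.2020", 0), ("1.2.2020", 0), ("1.3.2020", 0),
     ("1.4.2020", 10000), ("1.5.2020", 200), ("1.6.2020", 0)],
    ["date", "loan"],
)

# Parse the strings so the window is ordered chronologically, not lexicographically.
df = df.withColumn("d", F.to_date("date", "d.M.yyyy"))

# Running total: sum of loan over all rows up to and including the current one.
w = Window.orderBy("d").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("loan", F.sum("loan").over(w)).drop("d").show()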

Related

How to transfer rows to columns in a DataFrame using Python [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
I need some help.
I have the following CSV file loaded into a DataFrame:
How could I move the data in cases into columns week 1, week 2 (...) using Python and Pandas?
It would be something like this:
x = (
    df.pivot_table(
        index=["city", "population"],
        columns="week",
        values="cases",
        aggfunc="max",
    )
    .add_prefix("week ")
    .reset_index()
    .rename_axis("", axis=1)
)
print(x)
Prints:
city population week 1 week 2
0 x 50000 5 10
1 y 88000 2 15
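The CSV itself is not shown in the question; a hypothetical long-format input that would produce the printed result above (column names taken from the answer, values inferred from the output) looks like this:

import pandas as pd

# Hypothetical reconstruction of the long-format input; the real CSV is not shown.
df = pd.DataFrame({
    "city": ["x", "x", "y", "y"],
    "population": [50000, 50000, 88000, 88000],
    "week": [1, 2, 1, 2],
    "cases": [5, 10, 2, 15],
})

Running the pivot_table call above on this frame reproduces the printed output.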

How do you convert a date to a number? [duplicate]

This question already has answers here:
How to calculate number of days between two given dates
(15 answers)
Closed 1 year ago.
How do you convert a pandas dataframe column from a date formatted as below to a number as shown below:
date
0 4/5/2010
1 9/26/2014
2 8/3/2010
To this
date newFormat
0 4/5/2010 40273
1 9/26/2014 41908
2 8/3/2010 40393
Where the second column is the number of days since 1/1/1900.
Use:
data['newFormat'] = pd.to_datetime(data['date']).dt.strftime("%Y%m%d").astype(int)
Note that this produces an integer such as 20100405 (YYYYMMDD), not the day count shown in the question.
This has been answered before:
Pandas: convert date 'object' to int
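If the goal really is the day count shown in the question (e.g. 40273 for 4/5/2010, which matches Excel's date serial numbers), a sketch along these lines works; the 1899-12-30 origin is an assumption made to reproduce Excel's numbering:

import pandas as pd

data = pd.DataFrame({"date": ["4/5/2010", "9/26/2014", "8/3/2010"]})

# Parse the strings, then count days from Excel's origin (1899-12-30),
# which reproduces the serial numbers in the question (40273, 41908, 40393).
parsed = pd.to_datetime(data["date"], format="%m/%d/%Y")
data["newFormat"] = (parsed - pd.Timestamp("1899-12-30")).dt.days
print(data)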

Pandas dataframe - get column index for minimum value in a row [duplicate]

This question already has an answer here:
Python - Pandas: number/index of the minimum value in the given row
(1 answer)
Closed 2 years ago.
I am trying to get the column index for the lowest value in a row. For example, I have the dataframe
0 1 Min. dist
0 765.180690 672.136265 672.136265
1 512.437288 542.701564 512.437288
and need the following output
0 1 Min. dist ColumnID
0 765.180690 672.136265 672.136265 1
1 512.437288 542.701564 512.437288 0
I've gotten the Min. dist column by using the code df['Min. dist'] = df.min(axis=1)
Can anyone help with this? Thanks
Try using idxmin:
df['ColumnID'] = df.idxmin(axis=1)
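A small self-contained sketch with the data from the question; computing ColumnID from the original columns only keeps the derived Min. dist column out of the comparison:

import pandas as pd

df = pd.DataFrame({0: [765.180690, 512.437288], 1: [672.136265, 542.701564]})

# Minimum value per row, then the label of the column where that minimum occurs.
df['Min. dist'] = df[[0, 1]].min(axis=1)
df['ColumnID'] = df[[0, 1]].idxmin(axis=1)
print(df)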

Select rows with conditions based on two columns (start date and end date) [duplicate]

This question already has answers here:
pandas: multiple conditions while indexing data frame - unexpected behavior
(5 answers)
Pandas slicing/selecting with multiple conditions with or statement
(1 answer)
Closed 2 years ago.
I have a dataframe which looks like this:
id start_date end_date
0 1 2017/06/01 2021/05/31
1 2 2018/10/01 2022/09/30
2 3 2015/01/01 2019/02/28
3 4 2017/11/01 2021/10/31
Can anyone tell me how I can slice out the rows where the start date is 2017/06/01 and the rows where the end date is 2021/10/31?
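No answer is shown here, but with pandas a straightforward approach is boolean indexing with the two conditions combined. Since no single row in the sample has both dates, the conditions below are joined with | (use & instead if both must hold on the same row); the frame construction is only there to make the sketch runnable:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "start_date": ["2017/06/01", "2018/10/01", "2015/01/01", "2017/11/01"],
    "end_date": ["2021/05/31", "2022/09/30", "2019/02/28", "2021/10/31"],
})

# Wrap each condition in parentheses and combine them element-wise.
mask = (df["start_date"] == "2017/06/01") | (df["end_date"] == "2021/10/31")
print(df[mask])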

Remove duplicate rows with one different value [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 4 years ago.
I have a dataframe with duplicate rows except for one value. I want to filter them out and only keep the row with the higher value.
User_ID - Skill - Year_used
1 - skill_a - 2017
1 - skill_b - 2015
1 - skill_a - 2018
2 - skill_c - 2011
etc.
So for example rows with skill_a and the same User_ID need to be compared and only the one with the latest year should be kept.
transform('count')
Only gives me the number of rows in each User_ID group.
value_counts()
Only gives me a Series that I can't merge back into the df.
Any ideas?
Thank you
You can use drop_duplicates after sorting by Year_used, so the row with the max is kept:
df = df.sort_values('Year_used').drop_duplicates(['User_ID','Skill'], keep='last')
One option is to groupby the Skill and keep the max Year_used:
df.groupby(['User_ID','Skill']).Year_used.max().reset_index()
User_ID Skill Year_used
0 1 skill_a 2018
1 1 skill_b 2015
2 2 skill_c 2011
