Please help.
I have a dataframe like this:
| | ID | Result | measurement_1 | measurement_2 | measurement_3 | measurement_4 | measurement_5 | start_time | end-time |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
| 0 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-20 21:24:03.390000 | 2020-10-20 23:46:36.990000 |
| 1 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-21 04:36:03.390000 | 2020-10-21 06:58:36.990000 |
| 2 | 12345 | nan | 49584 | 2827 | nan | nan | nan | 2020-10-21 09:24:03.390000 | 2020-10-21 11:46:36.990000 |
| 3 | 12345 | nan | nan | nan | 3940 | nan | nan | 2020-10-21 14:12:03.390000 | 2020-10-21 16:34:36.990000 |
| 4 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-21 21:24:03.390000 | 2020-10-21 23:46:36.990000 |
| 5 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-22 02:40:51.390000 | 2020-10-22 05:03:24.990000 |
| 6 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-22 08:26:27.390000 | 2020-10-22 10:49:00.990000 |
| 7 | 12345 | Pass | nan | nan | nan | 392 | 304 | 2020-10-22 14:12:03.390000 | 2020-10-22 16:34:36.990000 |
| 8 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-22 19:57:39.390000 | 2020-10-22 22:20:12.990000 |
| 9 | 12346 | nan | 22839 | 4059 | nan | nan | nan | 2020-10-23 01:43:15.390000 | 2020-10-23 04:05:48.990000 |
| 10 | 12346 | nan | nan | nan | 4059 | nan | nan | 2020-10-23 07:28:51.390000 | 2020-10-23 09:51:24.990000 |
| 11 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-23 13:14:27.390000 | 2020-10-23 15:37:00.990000 |
| 12 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-23 19:00:03.390000 | 2020-10-23 21:22:36.990000 |
| 13 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-24 00:45:39.390000 | 2020-10-24 03:08:12.990000 |
| 14 | 12346 | Fail | nan | nan | nan | 2938 | 495 | 2020-10-24 06:31:15.390000 | 2020-10-24 08:53:48.990000 |
| 15 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-24 12:16:51.390000 | 2020-10-24 14:39:24.990000 |
| 16 | 12345 | nan | 62839 | 1827 | nan | nan | nan | 2020-10-24 18:02:27.390000 | 2020-10-24 20:25:00.990000 |
| 17 | 12345 | nan | nan | nan | 2726 | nan | nan | 2020-10-24 23:48:03.390000 | 2020-10-25 02:10:36.990000 |
| 18 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-25 05:33:39.390000 | 2020-10-25 07:56:12.990000 |
| 19 | 12345 | Fail | nan | nan | nan | nan | 1827 | 2020-10-25 11:19:15.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
and I want my output to look like this:
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
| | ID | Result | measurement_1 | measurement_2 | measurement_3 | measurement_4 | measurement_5 | start_time | end-time |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
| 0 | 12345 | Pass | 49584 | 2827 | 3940 | 392 | 304 | 2020-10-20 21:24:03.390000 | 2020-10-22 16:34:36.990000 |
| 1 | 12346 | Fail | 22839 | 4059 | 4059 | 2938 | 495 | 2020-10-22 19:57:39.390000 | 2020-10-24 08:53:48.990000 |
| 2 | 12345 | Fail | 62839 | 1827 | 2726 | nan | 1827 | 2020-10-24 12:16:51.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
So far I am able to group the columns on `ID` and `Result`. Now I want to apply a coalesce to the grouped result (newDf):
df = pd.read_excel("Test_Coalesce.xlsx")
newDf = df.groupby(['ID','Result'])
newDf.all().reset_index()
It looks like you want to groupby consecutive blocks of ID. If so:
# label consecutive runs of the same ID
blocks = df['ID'].ne(df['ID'].shift()).cumsum()
# GroupBy 'first'/'last' skip NaN, so 'first' coalesces each column to its first
# non-null value within a block, while end-time takes the last non-null value
agg_dict = {k: 'first' if k != 'end-time' else 'last' for k in df.columns}
df.groupby(blocks).agg(agg_dict)
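If you want the collapsed rows numbered 0, 1, 2 as in your desired output, you can drop the block labels afterwards; a small addition to the snippet above:
out = df.groupby(blocks).agg(agg_dict).reset_index(drop=True)
print(out)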
I tried scraping a table from the NBA site but got the error "No tables found".
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.nba.com/game/0022101100/play-by-play?latest=0'
html = requests.get(url).content
df_list = pd.read_html(html)
How do I go about getting the play-by-play table?
As stated, that data is dynamically rendered. You could either a) use Selenium to simulate opening the browser, let the page render, and then use pandas to parse the table tags, or b) use the NBA API and get the data in JSON format. Here is option b:
import requests
import pandas as pd
gameId = '0022101100'
url = f'https://cdn.nba.com/static/json/liveData/playbyplay/playbyplay_{gameId}.json'
jsonData = requests.get(url).json()
df = pd.json_normalize(jsonData,
record_path=['game', 'actions'])
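If you would rather go the Selenium route (option a), a minimal sketch might look like the following. It assumes Chrome with a matching driver is available and that the rendered page actually exposes the play-by-play as table tags; the fixed sleep is crude and an explicit wait would be more robust:
import time
import pandas as pd
from selenium import webdriver

url = 'https://www.nba.com/game/0022101100/play-by-play?latest=0'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(10)                 # give the JavaScript time to render the page
html = driver.page_source
driver.quit()
df_list = pd.read_html(html)   # parse whatever <table> tags the rendered page contains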
Here is another way to use the embedded JSON, parsing it out of the page's __NEXT_DATA__ script tag:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
gameId = '0021900709'
url = f'https://www.nba.com/game/{gameId}/play-by-play'
headers = {
'referer': 'https://www.nba.com/',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = soup.find('script', {'id':'__NEXT_DATA__'}).text
jsonData = json.loads(jsonStr)['props']['pageProps']['playByPlay']
df = pd.json_normalize(jsonData,
record_path=['actions'])
Output: first 10 rows of 548
print(df.head(10).to_markdown())
| | actionNumber | clock | timeActual | period | periodType | actionType | subType | qualifiers | personId | x | y | possession | scoreHome | scoreAway | edited | orderNumber | xLegacy | yLegacy | isFieldGoal | side | description | personIdsFilter | teamId | teamTricode | descriptor | jumpBallRecoveredName | jumpBallRecoverdPersonId | playerName | playerNameI | jumpBallWonPlayerName | jumpBallWonPersonId | jumpBallLostPlayerName | jumpBallLostPersonId | shotDistance | shotResult | pointsTotal | assistPlayerNameInitial | assistPersonId | assistTotal | officialId | foulPersonalTotal | foulTechnicalTotal | foulDrawnPlayerName | foulDrawnPersonId | shotActionNumber | reboundTotal | reboundDefensiveTotal | reboundOffensiveTotal | turnoverTotal | stealPlayerName | stealPersonId | blockPlayerName | blockPersonId | value |
|---:|---------------:|:------------|:-----------------------|---------:|:-------------|:-------------|:----------|:---------------------|-----------:|---------:|---------:|-------------:|------------:|------------:|:---------------------|--------------:|----------:|----------:|--------------:|:-------|:-------------------------------------------------------|:--------------------------|--------------:|:--------------|:-------------|:------------------------|---------------------------:|:-------------|:---------------|:------------------------|----------------------:|:-------------------------|-----------------------:|---------------:|:-------------|--------------:|:--------------------------|-----------------:|--------------:|-------------:|--------------------:|---------------------:|:----------------------|--------------------:|-------------------:|---------------:|------------------------:|------------------------:|----------------:|------------------:|----------------:|------------------:|----------------:|--------:|
| 0 | 2 | PT12M00.00S | 2022-03-25T23:10:44.0Z | 1 | REGULAR | period | start | [] | 0 | nan | nan | 0 | 0 | 0 | 2022-03-25T23:10:44Z | 20000 | nan | nan | 0 | | Period Start | [] | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 1 | 4 | PT11M55.00S | 2022-03-25T23:10:47.2Z | 1 | REGULAR | jumpball | recovered | [] | 1626220 | nan | nan | 1610612762 | 0 | 0 | 2022-03-25T23:10:47Z | 40000 | nan | nan | 0 | | Jump Ball R. Gobert vs. M. Plumlee: Tip to R. O'Neale | [1626220, 203497, 203486] | 1.61061e+09 | UTA | startperiod | R. O'Neale | 1.62622e+06 | O'Neale | R. O'Neale | Gobert | 203497 | Plumlee | 203486 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 2 | 7 | PT11M36.00S | 2022-03-25T23:11:06.3Z | 1 | REGULAR | 2pt | DUNK | ['pointsinthepaint'] | 203497 | 92.8548 | 47.0588 | 1610612762 | 0 | 2 | 2022-03-25T23:11:12Z | 70000 | -15 | 15 | 1 | right | R. Gobert DUNK (2 PTS) (D. Mitchell 1 AST) | [203497, 1628378] | 1.61061e+09 | UTA | nan | nan | nan | Gobert | R. Gobert | nan | nan | nan | nan | 2.08 | Made | 2 | D. Mitchell | 1.62838e+06 | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 3 | 9 | PT11M21.00S | 2022-03-25T23:11:25.8Z | 1 | REGULAR | foul | personal | ['2freethrow'] | 203497 | nan | nan | 1610612766 | 0 | 2 | 2022-03-25T23:11:38Z | 90000 | nan | nan | 0 | | R. Gobert shooting personal FOUL (1 PF) (Plumlee 2 FT) | [203497, 203486] | 1.61061e+09 | UTA | shooting | nan | nan | Gobert | R. Gobert | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 200832 | 1 | 0 | Plumlee | 203486 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 4 | 11 | PT11M21.00S | 2022-03-25T23:11:50.7Z | 1 | REGULAR | freethrow | 1 of 2 | [] | 203486 | nan | nan | 1610612766 | 0 | 2 | 2022-03-25T23:11:50Z | 110000 | nan | nan | 0 | | MISS M. Plumlee Free Throw 1 of 2 | [203486] | 1.61061e+09 | CHA | nan | nan | nan | Plumlee | M. Plumlee | nan | nan | nan | nan | nan | Missed | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 5 | 12 | PT11M21.00S | 2022-03-25T23:11:50.7Z | 1 | REGULAR | rebound | offensive | ['deadball', 'team'] | 0 | nan | nan | 1610612766 | 0 | 2 | 2022-03-25T23:11:50Z | 120000 | nan | nan | 0 | | TEAM offensive REBOUND | [] | 1.61061e+09 | CHA | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 11 | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 6 | 13 | PT11M21.00S | 2022-03-25T23:12:06.4Z | 1 | REGULAR | freethrow | 2 of 2 | [] | 203486 | nan | nan | 1610612766 | 1 | 2 | 2022-03-25T23:12:06Z | 130000 | nan | nan | 0 | | M. Plumlee Free Throw 2 of 2 (1 PTS) | [203486] | 1.61061e+09 | CHA | nan | nan | nan | Plumlee | M. Plumlee | nan | nan | nan | nan | nan | Made | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 7 | 14 | PT11M06.00S | 2022-03-25T23:12:22.2Z | 1 | REGULAR | 3pt | Jump Shot | [] | 1626220 | 69.7273 | 75.2451 | 1610612762 | 1 | 2 | 2022-03-25T23:12:29Z | 140000 | 126 | 232 | 1 | right | MISS R. O'Neale 26' 3PT | [1626220] | 1.61061e+09 | UTA | nan | nan | nan | O'Neale | R. O'Neale | nan | nan | nan | nan | 26.42 | Missed | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 8 | 15 | PT11M02.00S | 2022-03-25T23:12:26.2Z | 1 | REGULAR | rebound | offensive | [] | 1627823 | nan | nan | 1610612762 | 1 | 2 | 2022-03-25T23:12:29Z | 150000 | nan | nan | 0 | | J. Hernangomez REBOUND (Off:1 Def:0) | [1627823] | 1.61061e+09 | UTA | nan | nan | nan | Hernangomez | J. Hernangomez | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 14 | 1 | 0 | 1 | nan | nan | nan | nan | nan | nan |
| 9 | 16 | PT10M56.00S | 2022-03-25T23:12:33.1Z | 1 | REGULAR | 3pt | Jump Shot | ['2ndchance'] | 1628378 | 68.6761 | 70.098 | 1610612762 | 1 | 5 | 2022-03-25T23:12:38Z | 160000 | 100 | 242 | 1 | right | D. Mitchell 26' 3PT (3 PTS) (J. Hernangomez 1 AST) | [1628378, 1627823] | 1.61061e+09 | UTA | nan | nan | nan | Mitchell | D. Mitchell | nan | nan | nan | nan | 26.19 | Made | 3 | J. Hernangomez | 1.62782e+06 | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
Data is loaded dynamically via an API in JSON format, so you can extract it using .json() and pandas as follows:
import requests
import pandas as pd
api_url = 'https://cdn.cookielaw.org/vendorlist/iab2Data.json'
headers= {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
}
req=requests.get(api_url,headers=headers).json()
df = pd.DataFrame(req)
print(df)
Output:
gvlSpecificationVersion tcfPolicyVersion ... vendorListVersion lastUpdated
1 2 2 ... 159 2022-09-01T16:05:33Z
2 2 2 ... 159 2022-09-01T16:05:33Z
3 2 2 ... 159 2022-09-01T16:05:33Z
4 2 2 ... 159 2022-09-01T16:05:33Z
5 2 2 ... 159 2022-09-01T16:05:33Z
... ... ... ... ... ...
1146 2 2 ... 159 2022-09-01T16:05:33Z
1147 2 2 ... 159 2022-09-01T16:05:33Z
1148 2 2 ... 159 2022-09-01T16:05:33Z
1149 2 2 ... 159 2022-09-01T16:05:33Z
1150 2 2 ... 159 2022-09-01T16:05:33Z
[907 rows x 10 columns]
After appending 4 different dataframes to
list_1 = [ ]
I have the following data stored in list_1:
| date | 16/17 |
| -------- | ------|
| 2016-12-29 | 50 |
| 2016-12-30 | 52 |
| 2017-01-01 | 53 |
| 2017-01-02 | 51 |
[4 rows x 1 columns],
| date | 17/18 |
| -------- | ------|
| 2017-12-29 | 60 |
| 2017-12-31 | 62 |
| 2018-01-01 | 64 |
| 2018-01-03 | 65 |
[4 rows x 1 columns],
| date | 18/19 |
| -------- | ------|
| 2018-12-30 | 54 |
| 2018-12-31 | 53 |
| 2019-01-02 | 52 |
| 2019-01-03 | 51 |
[4 rows x 1 columns],
| date | 19/20 |
| -------- | ------|
| 2019-12-29 | 62 |
| 2019-12-30 | 63 |
| 2020-01-01 | 62 |
| 2020-01-02 | 60 |
[4 rows x 1 columns],
For changing the date format to month/day I use the following code:
pd.to_datetime(df['date']).dt.strftime('%m/%d')
But the problem is that when I want to arrange the data by month/day like this:
| date | 16/17 | 17/18 | 18/19 | 19/20 |
| -------- | ------| ------| ------| ------|
| 12/29 | 50 | 60 | NaN | 62 |
| 12/30 | 52 | NaN | 54 | 63 |
| 12/31 | NaN | 62 | 53 | NaN |
| 01/01 | 53 | 64 | NaN | 62 |
| 01/02 | 51 | NaN | 52 | 60 |
| 01/03 | NaN | 65 | 51 | NaN |
I've tried the following:
df = pd.concat(list_1,axis=1)
also:
df = pd.concat(list_1)
df.reset_index(inplace=True)
df = df.groupby(['date']).first()
also:
df = pd.concat(list_1)
df.reset_index(inplace=True)
df = df.groupby(['date'], sort=False).first()
but still cannot achieve the desired result.
You can use sort=False in groupby and create a new helper column holding each row's offset from the first value of its DatetimeIndex, then use that column for sorting:
def f(x):
    # convert the index to datetimes and add a helper column 'new'
    # holding each row's offset from that frame's first date
    x.index = pd.to_datetime(x.index)
    return x.assign(new = x.index - x.index.min())

L = [x.pipe(f) for x in list_1]
# mergesort is stable, so rows with equal offsets keep their original list order
df = pd.concat(L, axis=0).sort_values('new', kind='mergesort')
df = df.groupby(df.index.strftime('%m/%d'), sort=False).first().drop('new', axis=1)
print (df)
16/17 17/18 18/19 19/20
date
12/29 50.0 60.0 NaN 62.0
12/30 52.0 NaN 54.0 63.0
12/31 NaN 62.0 53.0 NaN
01/01 53.0 64.0 NaN 62.0
01/02 51.0 NaN 52.0 60.0
01/03 NaN 65.0 51.0 NaN
I have a dataframe that looks something like the following:
+------------+------------------+--------+-----+-----+---+--------+-----------------------------+
| B_date | B_Time | F_Type | Fix | Est | S | C_Type | C_Time |
+------------+------------------+--------+-----+-----+---+--------+-----------------------------+
| 2019-07-22 | 16:42:27.7325458 | 1 | 100 | 100 | 2 | 2 | 2019-07-22 16:42:47.2129273 |
| 2019-07-22 | 16:44:04.7817750 | 1 | 100 | 100 | 2 | 2 | 2019-07-22 16:45:26.2923547 |
| 2019-07-22 | 16:48:21.5976290 | 1 | 100 | 100 | 7 | | |
| 2019-07-23 | 13:11:20.4519581 | 1 | 100 | 100 | 7 | | |
| 2019-07-23 | 13:28:49.5092331 | 1 | 100 | 100 | 2 | 2 | 2019-07-23 13:28:54.5274793 |
| 2019-07-23 | 13:29:06.6108796 | 1 | 100 | 100 | 2 | 2 | 2019-07-23 13:30:48.5358081 |
| 2019-07-23 | 13:31:12.7684213 | 1 | 100 | 100 | 2 | 3 | 2019-07-23 13:33:50.9405643 |
| 2019-07-25 | 09:32:12.7799801 | 1 | 105 | 105 | 7 | | |
| 2019-07-25 | 09:57:58.4536238 | 1 | 158 | 158 | 4 | | |
| 2019-07-25 | 10:03:22.7888221 | 1 | 152 | 152 | 2 | 2 | 2019-07-25 10:03:27.9576175 |
+------------+------------------+--------+-----+-----+---+--------+-----------------------------+
I need to get output as follows:
+------------+-------------------------------+--------+-----+-----+---+--------+-------------------------------+---------------+-----------------+---------------+
| B_date | B_Time | F_Type | Fix | Est | S | C_Type | C_Time | cancel_diff_1 | cancel_diff_2 | cancel_diff_3 |
+------------+-------------------------------+--------+-----+-----+---+--------+-------------------------------+---------------+-----------------+---------------+
| 2019-07-22 | 2019-07-22 16:42:27.732545800 | 1 | 100 | 100 | 2 | 2 | 2019-07-22 16:42:47.212927300 | NaT | 00:00:19.480381 | NaT |
| 2019-07-22 | 2019-07-22 16:44:04.781775000 | 1 | 100 | 100 | 2 | 2 | 2019-07-22 16:45:26.292354700 | NaT | 00:01:21.510579 | NaT |
| 2019-07-22 | 2019-07-22 16:48:21.597629000 | 1 | 100 | 100 | 7 | NaN | NaT | NaT | NaT | NaT |
| 2019-07-23 | 2019-07-23 13:11:20.451958100 | 1 | 100 | 100 | 7 | NaN | NaT | NaT | NaT | NaT |
| 2019-07-23 | 2019-07-23 13:28:49.509233100 | 1 | 100 | 100 | 2 | 2 | 2019-07-23 13:28:54.527479300 | NaT | 00:00:05.018246 | NaT |
+------------+-------------------------------+--------+-----+-----+---+--------+-------------------------------+---------------+-----------------+---------------+
I have actually done this using a function that assigns and checks values one by one, which is more of a plain-Python way; I want to do it with simple pandas.
IIUC try this:
df['B_Time'] = pd.to_datetime(df['B_date'] + ' ' + df['B_Time'])   # combine date and time parts
df['C_Time'] = pd.to_datetime(df['C_Time'])
df.loc[df['C_Type']==1, 'cancel_diff_1'] = df.loc[df['C_Type']==1, 'C_Time'] - df.loc[df['C_Type']==1, 'B_Time']
df.loc[df['C_Type']==2, 'cancel_diff_2'] = df.loc[df['C_Type']==2, 'C_Time'] - df.loc[df['C_Type']==2, 'B_Time']
df.loc[df['C_Type']==3, 'cancel_diff_3'] = df.loc[df['C_Type']==3, 'C_Time'] - df.loc[df['C_Type']==3, 'B_Time']
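If you prefer not to repeat the mask for every type, the same logic can be written as a short loop (same column assumptions as above):
for t in (1, 2, 3):
    mask = df['C_Type'] == t
    df.loc[mask, f'cancel_diff_{t}'] = df.loc[mask, 'C_Time'] - df.loc[mask, 'B_Time']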
I have each user's latest balance for the days they were active, as you can see in the latest_balance column below:
+----+------+------------+----------------+--+
| | user | date | latest_balance | |
| 0 | A | 2019-07-26 | 705.0 | |
| 1 | A | 2019-07-29 | 990.0 | |
| 2 | A | 2019-07-30 | 5.0 | |
| 3 | A | 2019-07-31 | 25.0 | |
| 4 | A | 2019-08-01 | 155.0 | |
| 5 | A | 2019-08-02 | 405.0 | |
| 6 | A | 2019-08-03 | 525.0 | |
| 7 | A | 2019-08-05 | 1000.0 | |
| 8 | A | 2019-08-06 | 825.0 | |
| 9 | B | 2019-08-07 | 230.0 | |
| 10 | A | 2019-08-07 | 965.0 | |
| 11 | B | 2019-08-08 | 224.0 | |
| 12 | A | 2019-08-08 | 80.0 | |
| 13 | A | 2019-08-09 | 380.0 | |
| 14 | B | 2019-08-10 | 4.0 | |
| 15 | B | 2019-08-11 | 114.0 | |
| 16 | A | 2019-08-12 | 725.0 | |
| 17 | B | 2019-08-12 | 234.0 | |
| 18 | A | 2019-08-13 | 815.0 | |
| 19 | B | 2019-08-13 | 243.0 | |
| 20 | B | 2019-08-15 | 13.0 | |
| 21 | A | 2019-08-16 | 75.0 | |
| 22 | B | 2019-08-16 | 53.0 | |
| 23 | A | 2019-08-17 | 890.0 | |
| 24 | B | 2019-08-17 | 36.0 | |
| 25 | A | 2019-08-19 | 100.0 | |
| 26 | A | 2019-08-20 | 115.0 | |
| 27 | A | 2019-08-21 | 150.0 | |
+----+------+------------+----------------+--+
As you can see, if a user is not active on some day we do not see that user's balance for that day, so we cannot build a total daily sum directly.
I need to calculate the total balance across users for every day, using each user's last known balance even on days when they have no transaction.
My idea was to use a Python dictionary and dict.update():
if the user has a transaction, update their balance with the new value; if not, carry the previous balance forward for that day.
my code is:
from datetime import date, timedelta
date_upd =[]
total = {}
date_t ={}
start_date = min(df['date'])
end_date = max(df['date'])
delta = timedelta(days=1)
while start_date <= end_date:
    for i, k in enumerate(df['date']):
        if k == start_date:
            #print(k)
            total.update({df['user'][i]: df['latest_balance'][i]})
        else:
            total.update({df['user'][i]: df['latest_balance'][i]})
            pass
        date_upd.append(sum(total.values()))
    start_date += delta
    #date_t.update(total)
and it gives me this result:
+----------+
| 705.0, |
| 990.0, |
| 5.0, |
| 25.0, |
| 155.0, |
| 405.0, |
| 525.0, |
| 1000.0, |
| 825.0, |
| 1055.0, |
| 1195.0, |
| 1189.0, |
| 304.0, |
| 604.0, |
| 384.0, |
| 494.0, |
| 839.0, |
| 959.0, |
| 1049.0, |
| 1058.0, |
| 828.0, |
| 88.0, |
| 128.0, |
| 943.0, |
| 926.0, |
| 136.0, |
| 151.0, |
| 186.0 |
+----------+
which has a few extra results because the code does not actually loop over each day.
It should be:
705.0,
990.0,
5.0,
25.0,
155.0,
405.0,
525.0,
1000.0,
825.0,
,
1195.0,
,
304.0,
604.0,
384.0,
494.0,
839.0,
959.0,
,
1058.0,
828.0,
,
128.0,
,
926.0,
136.0,
151.0,
186.0
Not sure if I understand the question 100% but something like this?
df.pivot_table(columns='user', index='date', values='latest_balance').ffill().sum(axis=1)
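If you also want to see the per-user forward-filled balances before summing, the same one-liner can be split up (assuming the column names shown in the table above):
filled = df.pivot_table(columns='user', index='date', values='latest_balance').ffill()
daily_total = filled.sum(axis=1)   # one total per date, using each user's last known balance
print(daily_total)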
The code below gives me this table:
raw = pd.read_clipboard()
raw.head()
+---+---------------------+-------------+---------+----------+-------------+
| | Afghanistan | South Asia | 652225 | 26000000 | Unnamed: 4 |
+---+---------------------+-------------+---------+----------+-------------+
| 0 | Albania | Europe | 28728 | 3200000 | 6656000000 |
| 1 | Algeria | Middle East | 2400000 | 32900000 | 75012000000 |
| 2 | Andorra | Europe | 468 | 64000 | NaN |
| 3 | Angola | Africa | 1250000 | 14500000 | 14935000000 |
| 4 | Antigua and Barbuda | Americas | 442 | 77000 | 770000000 |
+---+---------------------+-------------+---------+----------+-------------+
But when I attempt to rename the columns and create a DataFrame, all of the data disappears:
df = pd.DataFrame(raw, columns = ['name', 'region', 'area', 'population', 'gdp'])
df.head()
+---+------+--------+------+------------+-----+
| | name | region | area | population | gdp |
+---+------+--------+------+------------+-----+
| 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN |
+---+------+--------+------+------------+-----+
Any idea why?
You should just write:
df.columns = ['name', 'region', ...]
The reason your data disappears is that passing columns= to the DataFrame constructor selects (reindexes) columns by those labels; since none of the new names exist in raw, every column comes back as all-NaN. Assigning to df.columns simply relabels the existing columns in place, which is also more efficient than constructing a new DataFrame.
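A minimal sketch of the fix, assuming the clipboard still holds the same five columns; the header=None part is only needed if you also want the Afghanistan row back as data rather than as the header:
raw = pd.read_clipboard(header=None)
raw.columns = ['name', 'region', 'area', 'population', 'gdp']
raw.head()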