I tried scraping a table from the NBA site but got the error "No tables found".
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.nba.com/game/0022101100/play-by-play?latest=0'
html = requests.get(url).content
df_list = pd.read_html(html)
How do I go about getting the play-by-play table?
As stated, that data is dynamically rendered. You could a) use Selenium to simulate opening the browser, allow the page to render, and THEN use pandas to parse the table tags, or b) use the NBA API and get the data in JSON format. Here is option (b), using the NBA's live-data CDN endpoint:
import requests
import pandas as pd

gameId = '0022101100'
url = f'https://cdn.nba.com/static/json/liveData/playbyplay/playbyplay_{gameId}.json'
jsonData = requests.get(url).json()

# Each play is an element of game -> actions in the JSON payload
df = pd.json_normalize(jsonData, record_path=['game', 'actions'])
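For completeness, here is a minimal sketch of option (a) with Selenium. It assumes Chrome plus a matching chromedriver are installed, and that the rendered page actually exposes <table> tags for pandas to parse, which may not hold for this page; that uncertainty is why option (b) is usually the better route.

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.nba.com/game/0022101100/play-by-play?latest=0'
driver = webdriver.Chrome()
try:
    driver.get(url)
    # Wait (up to 20s) for the page's JavaScript to render at least one table
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table')))
    df_list = pd.read_html(StringIO(driver.page_source))
finally:
    driver.quit()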
Another route pulls the same play-by-play JSON out of the page's embedded __NEXT_DATA__ script tag, with no browser automation needed:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json

gameId = '0021900709'
url = f'https://www.nba.com/game/{gameId}/play-by-play'

# A referer and user-agent keep the request from being rejected
headers = {
    'referer': 'https://www.nba.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# The Next.js page embeds its state as JSON in the __NEXT_DATA__ script tag
jsonStr = soup.find('script', {'id': '__NEXT_DATA__'}).text
jsonData = json.loads(jsonStr)['props']['pageProps']['playByPlay']

df = pd.json_normalize(jsonData, record_path=['actions'])
print(df.head(10).to_markdown())
Output (first 10 rows of 548):
| | actionNumber | clock | timeActual | period | periodType | actionType | subType | qualifiers | personId | x | y | possession | scoreHome | scoreAway | edited | orderNumber | xLegacy | yLegacy | isFieldGoal | side | description | personIdsFilter | teamId | teamTricode | descriptor | jumpBallRecoveredName | jumpBallRecoverdPersonId | playerName | playerNameI | jumpBallWonPlayerName | jumpBallWonPersonId | jumpBallLostPlayerName | jumpBallLostPersonId | shotDistance | shotResult | pointsTotal | assistPlayerNameInitial | assistPersonId | assistTotal | officialId | foulPersonalTotal | foulTechnicalTotal | foulDrawnPlayerName | foulDrawnPersonId | shotActionNumber | reboundTotal | reboundDefensiveTotal | reboundOffensiveTotal | turnoverTotal | stealPlayerName | stealPersonId | blockPlayerName | blockPersonId | value |
|---:|---------------:|:------------|:-----------------------|---------:|:-------------|:-------------|:----------|:---------------------|-----------:|---------:|---------:|-------------:|------------:|------------:|:---------------------|--------------:|----------:|----------:|--------------:|:-------|:-------------------------------------------------------|:--------------------------|--------------:|:--------------|:-------------|:------------------------|---------------------------:|:-------------|:---------------|:------------------------|----------------------:|:-------------------------|-----------------------:|---------------:|:-------------|--------------:|:--------------------------|-----------------:|--------------:|-------------:|--------------------:|---------------------:|:----------------------|--------------------:|-------------------:|---------------:|------------------------:|------------------------:|----------------:|------------------:|----------------:|------------------:|----------------:|--------:|
| 0 | 2 | PT12M00.00S | 2022-03-25T23:10:44.0Z | 1 | REGULAR | period | start | [] | 0 | nan | nan | 0 | 0 | 0 | 2022-03-25T23:10:44Z | 20000 | nan | nan | 0 | | Period Start | [] | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 1 | 4 | PT11M55.00S | 2022-03-25T23:10:47.2Z | 1 | REGULAR | jumpball | recovered | [] | 1626220 | nan | nan | 1610612762 | 0 | 0 | 2022-03-25T23:10:47Z | 40000 | nan | nan | 0 | | Jump Ball R. Gobert vs. M. Plumlee: Tip to R. O'Neale | [1626220, 203497, 203486] | 1.61061e+09 | UTA | startperiod | R. O'Neale | 1.62622e+06 | O'Neale | R. O'Neale | Gobert | 203497 | Plumlee | 203486 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 2 | 7 | PT11M36.00S | 2022-03-25T23:11:06.3Z | 1 | REGULAR | 2pt | DUNK | ['pointsinthepaint'] | 203497 | 92.8548 | 47.0588 | 1610612762 | 0 | 2 | 2022-03-25T23:11:12Z | 70000 | -15 | 15 | 1 | right | R. Gobert DUNK (2 PTS) (D. Mitchell 1 AST) | [203497, 1628378] | 1.61061e+09 | UTA | nan | nan | nan | Gobert | R. Gobert | nan | nan | nan | nan | 2.08 | Made | 2 | D. Mitchell | 1.62838e+06 | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 3 | 9 | PT11M21.00S | 2022-03-25T23:11:25.8Z | 1 | REGULAR | foul | personal | ['2freethrow'] | 203497 | nan | nan | 1610612766 | 0 | 2 | 2022-03-25T23:11:38Z | 90000 | nan | nan | 0 | | R. Gobert shooting personal FOUL (1 PF) (Plumlee 2 FT) | [203497, 203486] | 1.61061e+09 | UTA | shooting | nan | nan | Gobert | R. Gobert | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 200832 | 1 | 0 | Plumlee | 203486 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 4 | 11 | PT11M21.00S | 2022-03-25T23:11:50.7Z | 1 | REGULAR | freethrow | 1 of 2 | [] | 203486 | nan | nan | 1610612766 | 0 | 2 | 2022-03-25T23:11:50Z | 110000 | nan | nan | 0 | | MISS M. Plumlee Free Throw 1 of 2 | [203486] | 1.61061e+09 | CHA | nan | nan | nan | Plumlee | M. Plumlee | nan | nan | nan | nan | nan | Missed | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 5 | 12 | PT11M21.00S | 2022-03-25T23:11:50.7Z | 1 | REGULAR | rebound | offensive | ['deadball', 'team'] | 0 | nan | nan | 1610612766 | 0 | 2 | 2022-03-25T23:11:50Z | 120000 | nan | nan | 0 | | TEAM offensive REBOUND | [] | 1.61061e+09 | CHA | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 11 | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 6 | 13 | PT11M21.00S | 2022-03-25T23:12:06.4Z | 1 | REGULAR | freethrow | 2 of 2 | [] | 203486 | nan | nan | 1610612766 | 1 | 2 | 2022-03-25T23:12:06Z | 130000 | nan | nan | 0 | | M. Plumlee Free Throw 2 of 2 (1 PTS) | [203486] | 1.61061e+09 | CHA | nan | nan | nan | Plumlee | M. Plumlee | nan | nan | nan | nan | nan | Made | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 7 | 14 | PT11M06.00S | 2022-03-25T23:12:22.2Z | 1 | REGULAR | 3pt | Jump Shot | [] | 1626220 | 69.7273 | 75.2451 | 1610612762 | 1 | 2 | 2022-03-25T23:12:29Z | 140000 | 126 | 232 | 1 | right | MISS R. O'Neale 26' 3PT | [1626220] | 1.61061e+09 | UTA | nan | nan | nan | O'Neale | R. O'Neale | nan | nan | nan | nan | 26.42 | Missed | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 8 | 15 | PT11M02.00S | 2022-03-25T23:12:26.2Z | 1 | REGULAR | rebound | offensive | [] | 1627823 | nan | nan | 1610612762 | 1 | 2 | 2022-03-25T23:12:29Z | 150000 | nan | nan | 0 | | J. Hernangomez REBOUND (Off:1 Def:0) | [1627823] | 1.61061e+09 | UTA | nan | nan | nan | Hernangomez | J. Hernangomez | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 14 | 1 | 0 | 1 | nan | nan | nan | nan | nan | nan |
| 9 | 16 | PT10M56.00S | 2022-03-25T23:12:33.1Z | 1 | REGULAR | 3pt | Jump Shot | ['2ndchance'] | 1628378 | 68.6761 | 70.098 | 1610612762 | 1 | 5 | 2022-03-25T23:12:38Z | 160000 | 100 | 242 | 1 | right | D. Mitchell 26' 3PT (3 PTS) (J. Hernangomez 1 AST) | [1628378, 1627823] | 1.61061e+09 | UTA | nan | nan | nan | Mitchell | D. Mitchell | nan | nan | nan | nan | 26.19 | Made | 3 | J. Hernangomez | 1.62782e+06 | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
The data is loaded dynamically via an API in JSON format, so you can extract it with requests' .json() and pandas as follows:
import requests
import pandas as pd
api_url = 'https://cdn.cookielaw.org/vendorlist/iab2Data.json'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
}
req = requests.get(api_url, headers=headers).json()
df = pd.DataFrame(req)
print(df)
Output:
gvlSpecificationVersion tcfPolicyVersion ... vendorListVersion lastUpdated
1 2 2 ... 159 2022-09-01T16:05:33Z
2 2 2 ... 159 2022-09-01T16:05:33Z
3 2 2 ... 159 2022-09-01T16:05:33Z
4 2 2 ... 159 2022-09-01T16:05:33Z
5 2 2 ... 159 2022-09-01T16:05:33Z
... ... ... ... ... ...
1146 2 2 ... 159 2022-09-01T16:05:33Z
1147 2 2 ... 159 2022-09-01T16:05:33Z
1148 2 2 ... 159 2022-09-01T16:05:33Z
1149 2 2 ... 159 2022-09-01T16:05:33Z
1150 2 2 ... 159 2022-09-01T16:05:33Z
[907 rows x 10 columns]
I have a dataframe. I need to drop the duplicates of ticket_id if the owner_type is the same, and if not, pick 'm' over 's'. If no value is picked, a NaN is returned:
data = pd.DataFrame({'owner_type':['m','m','m','s','s','m','s','s'],'ticket_id':[1,1,2,2,3,3,4,4]})
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | 1 |
| 1 | m | 1 |
| 2 | m | 2 |
| 3 | s | 2 |
| 4 | s | 3 |
| 5 | m | 3 |
| 6 | s | 4 |
| 7 | s | 4 |
Should give back:
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | NaN |
| 1 | m | NaN |
| 2 | m | 2 |
| 3 | s | NaN |
| 4 | s | NaN |
| 5 | m | 3 |
| 6 | s | NaN |
| 7 | s | NaN |
Pseudocode would be something like: if ticket_id is duplicated, look at owner_type; if owner_type has more than one value, return the value for 'm' and NaN for 's'.
My attempt
data.groupby('ticket_id').apply(lambda x: x['owner_type'] if len(x) < 2 else NaN)
Not working
Try this:
# 'min' of owner_type per ticket picks 'm' over 's' ('m' < 's' alphabetically);
# rows that are full (owner_type, ticket_id) duplicates are masked to NaN
(data['ticket_id'].where(
    ~data.duplicated(['owner_type', 'ticket_id'], keep=False) &
    data['owner_type'].eq(data.groupby('ticket_id')['owner_type'].transform('min'))))
Old answer:
m = ~data.duplicated(keep=False) & data['owner_type'].eq('m')
data['ticket_id'].where(m)
Output:
0 NaN
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
Please help. I have a dataframe like:
| | ID | Result | measurement_1 | measurement_2 | measurement_3 | measurement_4 | measurement_5 | start_time | end-time |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
| 0 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-20 21:24:03.390000 | 2020-10-20 23:46:36.990000 |
| 1 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-21 04:36:03.390000 | 2020-10-21 06:58:36.990000 |
| 2 | 12345 | nan | 49584 | 2827 | nan | nan | nan | 2020-10-21 09:24:03.390000 | 2020-10-21 11:46:36.990000 |
| 3 | 12345 | nan | nan | nan | 3940 | nan | nan | 2020-10-21 14:12:03.390000 | 2020-10-21 16:34:36.990000 |
| 4 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-21 21:24:03.390000 | 2020-10-21 23:46:36.990000 |
| 5 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-22 02:40:51.390000 | 2020-10-22 05:03:24.990000 |
| 6 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-22 08:26:27.390000 | 2020-10-22 10:49:00.990000 |
| 7 | 12345 | Pass | nan | nan | nan | 392 | 304 | 2020-10-22 14:12:03.390000 | 2020-10-22 16:34:36.990000 |
| 8 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-22 19:57:39.390000 | 2020-10-22 22:20:12.990000 |
| 9 | 12346 | nan | 22839 | 4059 | nan | nan | nan | 2020-10-23 01:43:15.390000 | 2020-10-23 04:05:48.990000 |
| 10 | 12346 | nan | nan | nan | 4059 | nan | nan | 2020-10-23 07:28:51.390000 | 2020-10-23 09:51:24.990000 |
| 11 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-23 13:14:27.390000 | 2020-10-23 15:37:00.990000 |
| 12 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-23 19:00:03.390000 | 2020-10-23 21:22:36.990000 |
| 13 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-24 00:45:39.390000 | 2020-10-24 03:08:12.990000 |
| 14 | 12346 | Fail | nan | nan | nan | 2938 | 495 | 2020-10-24 06:31:15.390000 | 2020-10-24 08:53:48.990000 |
| 15 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-24 12:16:51.390000 | 2020-10-24 14:39:24.990000 |
| 16 | 12345 | nan | 62839 | 1827 | nan | nan | nan | 2020-10-24 18:02:27.390000 | 2020-10-24 20:25:00.990000 |
| 17 | 12345 | nan | nan | nan | 2726 | nan | nan | 2020-10-24 23:48:03.390000 | 2020-10-25 02:10:36.990000 |
| 18 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-25 05:33:39.390000 | 2020-10-25 07:56:12.990000 |
| 19 | 12345 | Fail | nan | nan | nan | nan | 1827 | 2020-10-25 11:19:15.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
and want my output to look like
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
| | ID | Result | measurement_1 | measurement_2 | measurement_3 | measurement_4 | measurement_5 | start_time | end-time |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
| 0 | 12345 | Pass | 49584 | 2827 | 3940 | 392 | 304 | 2020-10-20 21:24:03.390000 | 2020-10-22 16:34:36.990000 |
| 1 | 12346 | Fail | 22839 | 4059 | 4059 | 2938 | 495 | 2020-10-22 19:57:39.390000 | 2020-10-24 08:53:48.990000 |
| 2 | 12345 | Fail | 62839 | 1827 | 2726 | nan | 1827 | 2020-10-24 12:16:51.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
So far I am able to group the columns on `ID` and `Result`. Now I want to apply the coalesce to it (newDf):
df = pd.read_excel("Test_Coalesce.xlsx")
newDf = df.groupby(['ID','Result'])
newDf.all().reset_index()
It looks like you want to group by consecutive blocks of ID. If so:

# A new block starts every time ID changes from the previous row
blocks = df['ID'].ne(df['ID'].shift()).cumsum()

# groupby's 'first' returns the first NON-null value per group, which
# coalesces the measurement columns; 'end-time' takes the block's last value
agg_dict = {k: 'first' if k != 'end-time' else 'last'
            for k in df.columns}
df.groupby(blocks).agg(agg_dict)
I have a dataframe of connected values (edges and nodes). It shows how family and friends are connected and it looks like:
+---------------+--------------+--------------+----------------+-----------------+-------------+------------+--------------+------------+--------------+------------+--------------+-------------------+-------------------+-------------------+
| Orginal_Match | Orginal_Name | Connected_ID | Connected_Name | Connection_Type | Match-Final | ID_Match_0 | Name_Match_0 | ID_Match_1 | Name_Match_1 | ID_match_2 | Name_Match_2 | Connection_Type_0 | Connection_Type_1 | Connection_Type_2 |
+---------------+--------------+--------------+----------------+-----------------+-------------+------------+--------------+------------+--------------+------------+--------------+-------------------+-------------------+-------------------+
| 1 | A | 2 | B | FRIEND | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | A | 4 | E | FAMILY | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | A | 3 | F | FRIEND | 2 | 3 | C | 11 | H | 2 | B | FRIEND | FRIEND | FRIEND |
| 1 | A | 5 | G | FRIEND | 2 | 4 | E | NaN | NaN | NaN | NaN | FAMILY | NaN | NaN |
| 1 | A | 6 | D | FRIEND | 2 | 3 | C | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 7 | B | FAMILY | 2 | 2 | B | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 7 | B | FRIEND | 2 | 2 | B | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 8 | B | FRIEND | 2 | 2 | B | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 9 | C | OTHER | 2 | 3 | C | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 10 | I | FRIEND | 3 | 3 | C | 6 | D | NaN | NaN | FRIEND | FRIEND | NaN |
+---------------+--------------+--------------+----------------+-----------------+-------------+------------+--------------+------------+--------------+------------+--------------+-------------------+-------------------+-------------------+
In the above dataframe, Original_Match is connected to Connected_ID either directly or indirectly. Match-Final says how many edges sit between them along a single path, so if Match-Final is 1, Original_Match and Connected_ID are directly connected. For any value x in Match-Final, the number of intermediate nodes between Original_Match and Connected_ID along a single path is x-1 (that is, the number of edges one must walk along to get from Original_Match to Connected_ID is equal to Match-Final; there can still be n nodes at each step, but they are all the same distance away). The ID_Match_0 through ID_Match_n columns list all the IDs that were connected in between the previous node and the next node.
As a note for clarity, Match-Final only states the number of edges between the first and last node in the connection if you were to walk along one path. It says nothing about the number of nodes at each step. Match-Final could therefore be 2, meaning you would need to walk along two edges to get from Original_Match to Connected_ID, yet there could be n paths that do so, because Original_Match may be connected to n nodes which are each connected to Connected_ID. They are all still only two steps away from one another.
So for example, in the above dataframe, the data states that:
Row 0: Match-Final == 1, so Original_Match is connected directly to Connected_ID, and they are connected via Connection_Type. Therefore
1A---FRIEND--2B
__________________________________________________________________________________________________
Row 2: Match-Final == 2, so Original_Match is connected to Connected_ID via ID_Match_0, ID_Match_1, ID_Match_2, using all the corresponding Connection_Type columns. Therefore
11H---------------------#
| |
FRIEND FRIEND
| |
1--FRIEND--3C--FRIEND--3F
| |
FRIEND FRIEND
| |
2B---------------------#
_________________________________________________________________________________________
Row 9: Match-Final == 3, so Original_Match is connected to something, which is then connected to ID_Match_1, ID_Match_2, which then connects to Connected_ID. Therefore
1A--FRIEND--3C--FRIEND--10I
In order to make this a network graph, however, I need to transform the dataframe above into:
+---------------+--------------+--------------+----------------+-----------------+
| Orginal_Match | Orginal_Name | Connected_ID | Connected_Name | Connection_Type |
+---------------+--------------+--------------+----------------+-----------------+
| 1 | A | 2 | B | FRIEND |
| 1 | A | 4 | E | FAMILY |
| 1 | A | 3 | F | FRIEND |
| 1 | A | 5 | G | FRIEND |
| 1 | A | 6 | D | FRIEND |
| 1 | A | 7 | B | FAMILY |
| 1 | A | 7 | B | FRIEND |
| 1 | A | 8 | B | FRIEND |
| 1 | A | 9 | C | OTHER |
| 1 | A | 10 | I | FRIEND |
| 3 | C | 3 | F | FRIEND |
| 11 | H | 3 | F | FRIEND |
| 2 | B | 3 | F | FRIEND |
| 4 | E | 5 | G | FAMILY |
| 3 | C | 6 | D | FRIEND |
| 2 | B | 7 | B | FRIEND |
| 2 | B | 7 | B | FRIEND |
| 2 | B | 8 | B | FRIEND |
| 3 | C | 9 | C | FRIEND |
| 3 | C | 10 | I | FRIEND |
| 6 | D | 10 | I | FRIEND |
+---------------+--------------+--------------+----------------+-----------------+
This means I need to append the values in ID_Match_0, ..., ID_Match_n and Name_Match_0, ..., Name_Match_n to Original_Match and Connected_ID based on where they match and the number in Match-Final. I also need to append Connection_Type_n to Connection_Type by the same criteria.
This would need to be looped over the n ID_Match, Name_Match, and Connection_Type columns.
I have considered using np.where but I haven't gotten anywhere with it. Any help would be greatly appreciated!
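For what it's worth, here is a hedged sketch of one way to build that edge list with plain pandas. It assumes the question's frame is loaded as df, that there are three ID_Match/Name_Match/Connection_Type column triples as shown, and that the header's ID_match_2 is normalized to ID_Match_2 (other column names, including the Orginal_Match spelling, follow the question's frame):

import pandas as pd

edge_cols = ['Orginal_Match', 'Orginal_Name',
             'Connected_ID', 'Connected_Name', 'Connection_Type']

# Keep every original Orginal_Match -> Connected_ID edge as-is
pieces = [df[edge_cols]]

# For each match triple, treat ID_Match_i as the source node of an
# edge into Connected_ID, typed by Connection_Type_i
for i in range(3):
    sub = df[[f'ID_Match_{i}', f'Name_Match_{i}',
              'Connected_ID', 'Connected_Name', f'Connection_Type_{i}']]
    sub = sub.dropna(subset=[f'ID_Match_{i}'])
    sub.columns = edge_cols
    pieces.append(sub)

edges = pd.concat(pieces, ignore_index=True)

On the sample frame this yields 21 edges, matching the desired output up to row order.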
I'm trying to set values to NaN in a dataframe based on a column value. I've tried some methods suggested on the web, but none of them actually sets the values to NaN for that particular column.
Following is some data for understanding purposes:
| user_id        | product_id_x | rating_x | product_id_y | rating_y |
|----------------|--------------|----------|--------------|----------|
| A3G70XRVGQJSD4 | NaN          | NaN      | B0000DC3TN   | 2.0      |
| A392RM05V6KJ4B | B003AI2VGA   | 3.0      | B00004CQYO   | 4.0      |
| A7JI1GQJ9KYUA  | NaN          | NaN      | Q700063BT0   | 4.0      |
| A3GZWYWL3BQDLI | NaN          | NaN      | B003A3R3ZY   | 5.0      |
| A141HP4LYPWMSR | B003AI2VGA   | 3.0      | B002LMSWNC   | 3.0      |
The requirement is that I want to set rating_y to NaN where product_id_x is NaN. This is the code that I've written for this purpose, but it's not setting the values to NaN:
masterDf=data.merge(data2,on="user_id",how="outer")
#masterDf contains the complete dataframe
masterDf.loc[masterDf['product_id_x']=='Nan','rating_y']='Nan'
Also this:
masterDfnan= masterDf.where(masterDf['product_id_x']=='Nan')
masterDfnan['rating_y']='Nan'
I also tried some other methods, but none of them worked. Please help, thanks.
Try this; you may get your desired result:
masterDf.loc[masterDf['product_id_x'] == 'Nan', 'rating_y'] = np.nan
This will give you something like the following:
| user_id        | product_id_x | rating_x | product_id_y | rating_y |
|----------------|--------------|----------|--------------|----------|
| A3G70XRVGQJSD4 | NaN          | NaN      | B0000DC3TN   | NaN      |
| A392RM05V6KJ4B | B003AI2VGA   | 3.0      | B00004CQYO   | 4.0      |
| A7JI1GQJ9KYUA  | NaN          | NaN      | Q700063BT0   | NaN      |
| A3GZWYWL3BQDLI | NaN          | NaN      | B003A3R3ZY   | NaN      |
| A141HP4LYPWMSR | B003AI2VGA   | 3.0      | B002LMSWNC   | 3.0      |
Give this a try, and if it doesn't help, please let me know.
Have you tried NumPy's np.nan? (First import numpy as np.)
If your 'Nan' values are strings, do something like:
masterDf[cols] = masterDf[cols].apply(pd.to_numeric,errors='coerce')
After that, or if your 'Nan's already are np.nan:
masterDf.loc[masterDf['product_id_x'].isnull(),'rating_y'] = np.nan
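A minimal, self-contained demo of that approach, using two made-up rows in the question's shape:

import numpy as np
import pandas as pd

masterDf = pd.DataFrame({
    'user_id': ['A3G70XRVGQJSD4', 'A392RM05V6KJ4B'],
    'product_id_x': [np.nan, 'B003AI2VGA'],
    'rating_x': [np.nan, 3.0],
    'product_id_y': ['B0000DC3TN', 'B00004CQYO'],
    'rating_y': [2.0, 4.0],
})

# Real missing values are matched with isnull(), not the string 'Nan'
masterDf.loc[masterDf['product_id_x'].isnull(), 'rating_y'] = np.nan
print(masterDf)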
The code below gives me this table:
raw = pd.read_clipboard()
raw.head()
+---+---------------------+-------------+---------+----------+-------------+
| | Afghanistan | South Asia | 652225 | 26000000 | Unnamed: 4 |
+---+---------------------+-------------+---------+----------+-------------+
| 0 | Albania | Europe | 28728 | 3200000 | 6656000000 |
| 1 | Algeria | Middle East | 2400000 | 32900000 | 75012000000 |
| 2 | Andorra | Europe | 468 | 64000 | NaN |
| 3 | Angola | Africa | 1250000 | 14500000 | 14935000000 |
| 4 | Antigua and Barbuda | Americas | 442 | 77000 | 770000000 |
+---+---------------------+-------------+---------+----------+-------------+
But when I attempt to rename the columns and create a DataFrame, all of the data disappears:
df = pd.DataFrame(raw, columns = ['name', 'region', 'area', 'population', 'gdp'])
df.head()
+---+------+--------+------+------------+-----+
| | name | region | area | population | gdp |
+---+------+--------+------+------------+-----+
| 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN |
+---+------+--------+------+------------+-----+
Any idea why?
You should just write:
df.columns = ['name', 'region', 'area', 'population', 'gdp']
Passing columns to the DataFrame constructor doesn't rename anything: it selects columns with those names from the frame you pass in, and since none of the new names exist in raw, every column comes back as NaN. Assigning to df.columns is also much more efficient, as you aren't copying the entire DataFrame.
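As a side note, read_clipboard consumed your first data row (Afghanistan) as the header. A small sketch of one way to avoid that, assuming the same table is still on the clipboard:

import pandas as pd

# header=None keeps the first row (Afghanistan) as data rather than as the header
raw = pd.read_clipboard(header=None)
raw.columns = ['name', 'region', 'area', 'population', 'gdp']
raw.head()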