I have two DataFrames as below. I want to join the right-side table (cycle-time data) to the left table (current data).
Left Table- Current data (df_current)
| datetime_index | current | speed | cycle_counter |
|--------------------------|---------|-------|---------------|
| 27-10-2022 08:30:56.3056 | 30 | 60 | 1 |
| 27-10-2022 08:30:58.3058 | 30 | 60 | 1 |
| 27-10-2022 08:30:59.3059 | 31 | 62 | 1 |
| 27-10-2022 08:30:59.3059 | 30 | 60 | 1 |
| 27-10-2022 08:31:00.310 | 30.5 | 61 | 2 |
| 27-10-2022 08:31:01.311 | 30 | 60 | 2 |
| 27-10-2022 08:31:02.312 | 31 | 61 | 2 |
| 27-10-2022 08:31:02.312 | 30 | 60 | 3 |
| 27-10-2022 08:31:03.313 | 31 | 62 | 3 |
| 27-10-2022 08:31:04.314 | 30 | 60 | 3 |
Right Table- Cycletime data (df_cycletime)
| cycle_counter | total_time | up_time |
|---------------|------------|---------|
| 1 | 20 | 6 |
| 2 | 22 | 7 |
| 3 | 24 | 5 |
Code:
I used the code below:
df = df_current.reset_index().merge(df_cycletime, how='left', on='cycle_counter').set_index('datetime_index')
What I get
| datetime_index | current | speed | cycle_counter | total_time | up_time |
|--------------------------|---------|-------|---------------|------------|---------|
| 27-10-2022 08:30:56.3056 | 30 | 60 | 1 | 20 | 6 |
| 27-10-2022 08:30:58.3058 | 30 | 60 | 1 | 20 | 6 |
| 27-10-2022 08:30:59.3059 | 31 | 62 | 1 | 20 | 6 |
| 27-10-2022 08:30:59.3059 | 30 | 60 | 1 | 20 | 6 |
| 27-10-2022 08:31:00.310 | 30.5 | 61 | 2 | 22 | 7 |
| 27-10-2022 08:31:01.311 | 30 | 60 | 2 | 22 | 7 |
| 27-10-2022 08:31:02.312 | 31 | 61 | 2 | 22 | 7 |
| 27-10-2022 08:31:02.312 | 30 | 60 | 3 | 24 | 5 |
| 27-10-2022 08:31:03.313 | 31 | 62 | 3 | 24 | 5 |
| 27-10-2022 08:31:04.314 | 30 | 60 | 3 | 24 | 5 |
Requirement: I don't want 'total_time' and 'up_time' to repeat; they should appear only once per cycle_counter.
| datetime_index | current | speed | cycle_counter | total_time | up_time |
|--------------------------|---------|-------|---------------|------------|---------|
| 27-10-2022 08:30:56.3056 | 30 | 60 | 1 | 20 | 6 |
| 27-10-2022 08:30:58.3058 | 30 | 60 | 1 | | |
| 27-10-2022 08:30:59.3059 | 31 | 62 | 1 | | |
| 27-10-2022 08:30:59.3059 | 30 | 60 | 1 | | |
| 27-10-2022 08:31:00.310 | 30.5 | 61 | 2 | 22 | 7 |
| 27-10-2022 08:31:01.311 | 30 | 60 | 2 | | |
| 27-10-2022 08:31:02.312 | 31 | 61 | 2 | | |
| 27-10-2022 08:31:02.312 | 30 | 60 | 3 | 24 | 5 |
| 27-10-2022 08:31:03.313 | 31 | 62 | 3 | | |
| 27-10-2022 08:31:04.314 | 30 | 60 | 3 | | |
You have to find the duplicates in the total_time and up_time columns according to the cycle_counter column and replace them with an empty string (""). This will work for all the data:
df.loc[df.duplicated(['cycle_counter','total_time', 'up_time']), ['total_time','up_time']] = ""
print(df)
cycle_counter total_time up_time
0 1 20 6
1 1
2 1
3 1
4 2 22 7
5 2
6 2
7 3 24 5
8 3
9 3
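For reference, a minimal end-to-end sketch of the merge plus the blanking (column names taken from the tables above; marking duplicates on cycle_counter alone is equivalent here because each cycle appears in one contiguous block):
df = (
    df_current.reset_index()
    .merge(df_cycletime, how='left', on='cycle_counter')
    .set_index('datetime_index')
)
# Blank total_time/up_time on every row after the first of each cycle_counter.
# Note: assigning "" turns the columns into object dtype; use NaN (numpy.nan)
# instead if they need to stay numeric.
df.loc[df['cycle_counter'].duplicated(), ['total_time', 'up_time']] = ""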
I would like to find out how long it takes for the value in one column to change after the value changes in another column. I have loaded an example of the table below.
| Datetime | Col1 | Col2 |
|--------------------------|------|----|
| 23-02-03 12:01:27.213000 | 60 | 0 |
| 23-02-03 12:01:27.243000 | 60 | 0 |
| 23-02-03 12:01:27.313000 | 60 | 0 |
| 23-02-03 12:01:27.353000 | 50 | 0 |
| 23-02-03 12:01:27.413000 | 50 | 0 |
| 23-02-03 12:01:27.453000 | 50 | 0 |
| 23-02-03 12:01:27.513000 | 50 | 10 |
| 23-02-03 12:01:27.553000 | 50 | 10 |
| 23-02-03 12:01:27.613000 | 50 | 10 |
| 23-02-03 12:01:27.653000 | 50 | 10 |
| 23-02-03 12:01:27.713000 | 50 | 10 |
| 23-02-03 12:01:27.753000 | 50 | 10 |
| 23-02-03 12:01:27.813000 | 50 | 10 |
| 23-02-03 12:01:27.853000 | 49.5 | 10 |
| 23-02-03 12:01:27.913000 | 49.5 | 10 |
| 23-02-03 12:01:27.953000 | 49.5 | 10 |
| 23-02-03 12:01:28.013000 | 49.5 | 10 |
| 23-02-03 12:01:28.053000 | 49.5 | 10 |
| 23-02-03 12:01:28.113000 | 49.5 | 10 |
| 23-02-03 12:01:28.153000 | 49.5 | 10 |
| 23-02-03 12:01:28.213000 | 49.5 | 10 |
| 23-02-03 12:01:28.253000 | 49.5 | 25 |
| 23-02-03 12:01:28.313000 | 49.5 | 25 |
| 23-02-03 12:01:28.353000 | 49.5 | 25 |
| 23-02-03 12:01:28.423000 | 49.5 | 25 |
| 23-02-03 12:01:28.453000 | 48.3 | 25 |
| 23-02-03 12:01:28.533000 | 48.3 | 25 |
| 23-02-03 12:01:28.553000 | 48.3 | 25 |
| 23-02-03 12:01:28.634000 | 48.3 | 25 |
| 23-02-03 12:01:28.653000 | 48.3 | 25 |
| 23-02-03 12:01:28.743000 | 48.3 | 33 |
| 23-02-03 12:01:28.753000 | 48.3 | 33 |
| 23-02-03 12:01:28.843000 | 48.3 | 33 |
| 23-02-03 12:01:28.853000 | 48.3 | 33 |
| 23-02-03 12:01:28.943000 | 48.3 | 33 |
| 23-02-03 12:01:28.953000 | 48.3 | 33 |
| 23-02-03 12:01:29.043000 | 48.3 | 33 |
| 23-02-03 12:01:29.053000 | 48.3 | 33 |
| 23-02-03 12:01:29.143000 | 48.3 | 33 |
| 23-02-03 12:01:29.153000 | 48.3 | 33 |
| 23-02-03 12:01:29.243000 | 48.3 | 33 |
| 23-02-03 12:01:29.253000 | 48.3 | 33 |
| 23-02-03 12:01:29.343000 | 48.3 | 33 |
| 23-02-03 12:01:29.353000 | 49.1 | 33 |
| 23-02-03 12:01:29.443000 | 49.1 | 33 |
| 23-02-03 12:01:29.463000 | 49.1 | 33 |
| 23-02-03 12:01:29.543000 | 49.1 | 59 |
| 23-02-03 12:01:29.563000 | 49.1 | 59 |
So the first column is the timestamp. When the value in Col1 changes (for example from 50 to 49.5), the value in Col2 changes a short while after.
From this example:
Col1 changes from 60 to 50 at 27.353
Col2 changes from 0 to 10 at 27.513
So it takes 0.160 seconds for the value in Col2 to change after the value changes in Col1.
I would like to use a Python script to calculate this time difference, and also the average time difference.
I have extracted just the relevant rows to show below:
| First Change | | |
|--------------------------|------|----|
| 23-02-03 12:01:27.353000 | 50 | |
| 23-02-03 12:01:27.513000 | | 10 |
| Time diff | | |
| 0.16 | | |
| Second change | | |
| 23-02-03 12:01:27.853000 | 49.5 | |
| 23-02-03 12:01:28.253000 | | 25 |
| Time diff | | |
| 0.4 | | |
| Third change | | |
| 23-02-03 12:01:28.453000 | 48.3 | |
| 23-02-03 12:01:28.743000 | | 33 |
| Time diff | | |
| 0.29 | | |
| Fourth change | | |
| 23-02-03 12:01:29.353000 | 49.1 | |
| 23-02-03 12:01:29.543000 | | 59 |
| 0.19 | | |
| Average Time diff | | |
| 0.26 | | |
thanks
So, I have been able to get the value changes with the following code:
df['Change 1'] = df['Col1'].diff()
df['Change 2'] = df['Col2'].diff()
This stores when Col1 changes and when Col2 changes, as seen below, but I am not sure how to get the time difference between them.
| Datetime | Col1 | Col2 | Change 1 | Change 2 |
|----------------------------|------|------|----------|----------|
| 23-02-03 12:01:27.213000 | 60 | 0 | 0 | 0 |
| 23-02-03 12:01:27.243000 | 60 | 0 | 0 | 0 |
| 23-02-03 12:01:27.313000 | 60 | 0 | 0 | 0 |
| 23-02-03 12:01:27.353000 | 50 | 0 | 10 | 0 |
| 23-02-03 12:01:27.413000 | 50 | 0 | 0 | 0 |
| 23-02-03 12:01:27.453000 | 50 | 0 | 0 | 0 |
| 23-02-03 12:01:27.513000 | 50 | 10 | 0 | 10 |
| 23-02-03 12:01:27.553000 | 50 | 10 | 0 | 0 |
| 23-02-03 12:01:27.613000 | 50 | 10 | 0 | 0 |
| 23-02-03 12:01:27.653000 | 50 | 10 | 0 | 0 |
| 23-02-03 12:01:27.713000 | 50 | 10 | 0 | 0 |
| 23-02-03 12:01:27.753000 | 50 | 10 | 0 | 0 |
| 23-02-03 12:01:27.813000 | 50 | 10 | 0 | 0 |
| 23-02-03 12:01:27.853000 | 49.5 | 10 | 0.5 | 0 |
| 23-02-03 12:01:27.913000 | 49.5 | 10 | 0 | 0 |
| 23-02-03 12:01:27.953000 | 49.5 | 10 | 0 | 0 |
| 23-02-03 12:01:28.013000 | 49.5 | 10 | 0 | 0 |
| 23-02-03 12:01:28.053000 | 49.5 | 10 | 0 | 0 |
| 23-02-03 12:01:28.113000 | 49.5 | 10 | 0 | 0 |
| 23-02-03 12:01:28.153000 | 49.5 | 10 | 0 | 0 |
| 23-02-03 12:01:28.213000 | 49.5 | 10 | 0 | 0 |
| 23-02-03 12:01:28.253000 | 49.5 | 25 | 0 | 15 |
| 23-02-03 12:01:28.313000 | 49.5 | 25 | 0 | 0 |
| 23-02-03 12:01:28.353000 | 49.5 | 25 | 0 | 0 |
| 23-02-03 12:01:28.423000 | 49.5 | 25 | 0 | 0 |
| 23-02-03 12:01:28.453000 | 48.3 | 25 | 1.2 | 0 |
| 23-02-03 12:01:28.533000 | 48.3 | 25 | 0 | 0 |
| 23-02-03 12:01:28.553000 | 48.3 | 25 | 0 | 0 |
| 23-02-03 12:01:28.634000 | 48.3 | 25 | 0 | 0 |
| 23-02-03 12:01:28.653000 | 48.3 | 25 | 0 | 0 |
| 23-02-03 12:01:28.743000 | 48.3 | 33 | 0 | 8 |
| 23-02-03 12:01:28.753000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:28.843000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:28.853000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:28.943000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:28.953000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:29.043000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:29.053000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:29.143000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:29.153000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:29.243000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:29.253000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:29.343000 | 48.3 | 33 | 0 | 0 |
| 23-02-03 12:01:29.353000 | 49.1 | 33 | 0.8 | 0 |
| 23-02-03 12:01:29.443000 | 49.1 | 33 | 0 | 0 |
| 23-02-03 12:01:29.463000 | 49.1 | 33 | 0 | 0 |
| 23-02-03 12:01:29.543000 | 49.1 | 59 | 0 | 26 |
| 23-02-03 12:01:29.563000 | 49.1 | 59 | 0 | 0 |
I've had an idea: if I were able to drop the rows in between, it might make the changes easier to check.
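In case it helps, here is a sketch of one way to compute the gaps, assuming the table above is loaded as a DataFrame df with columns 'Datetime', 'Col1' and 'Col2':
import pandas as pd

# Parse the timestamps (year-month-day order assumed for the sample data)
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%y-%m-%d %H:%M:%S.%f')

# Timestamps at which each column changes value (the first row is not a change)
col1_changes = df.loc[df['Col1'].diff().fillna(0).ne(0), 'Datetime'].reset_index(drop=True)
col2_changes = df.loc[df['Col2'].diff().fillna(0).ne(0), 'Datetime'].reset_index(drop=True)

# Pair the n-th change in Col1 with the n-th change in Col2 and take the gap
n = min(len(col1_changes), len(col2_changes))
time_diffs = (col2_changes.iloc[:n] - col1_changes.iloc[:n]).dt.total_seconds()

print(time_diffs)         # 0.16, 0.40, 0.29, 0.19 for the sample above
print(time_diffs.mean())  # 0.26
This pairs the changes by order of occurrence, so it assumes every change in Col1 is followed by exactly one corresponding change in Col2, as in the sample data.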
I have a dataframe df_groups that contains the sample number, group number and accuracy.
Table 1: Samples with their groups
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 0 | 0 | 6 | 91.6 |
| 1 | 1 | 4 | 92.9333 |
| 2 | 2 | 2 | 91 |
| 3 | 3 | 2 | 90.0667 |
| 4 | 4 | 4 | 91.8 |
| 5 | 5 | 5 | 92.5667 |
| 6 | 6 | 6 | 91.1 |
| 7 | 7 | 5 | 92.3333 |
| 8 | 8 | 2 | 92.7667 |
| 9 | 9 | 0 | 91.1333 |
| 10 | 10 | 4 | 92.5 |
| 11 | 11 | 5 | 92.4 |
| 12 | 12 | 7 | 93.1333 |
| 13 | 13 | 7 | 93.5333 |
| 14 | 14 | 2 | 92.1 |
| 15 | 15 | 6 | 93.2 |
| 16 | 16 | 8 | 92.7333 |
| 17 | 17 | 8 | 90.8 |
| 18 | 18 | 3 | 91.9 |
| 19 | 19 | 3 | 93.3 |
| 20 | 20 | 5 | 90.6333 |
| 21 | 21 | 9 | 92.9333 |
| 22 | 22 | 4 | 93.3333 |
| 23 | 23 | 9 | 91.5333 |
| 24 | 24 | 9 | 92.9333 |
| 25 | 25 | 1 | 92.3 |
| 26 | 26 | 9 | 92.2333 |
| 27 | 27 | 6 | 91.9333 |
| 28 | 28 | 5 | 92.1 |
| 29 | 29 | 8 | 84.8 |
+----+----------+------------+------------+
I want to return a dataframe with any accuracy above a given value (e.g. 92).
So the results will be like this:
Table 2: Samples with their groups when accuracy is above 92.
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 1 | 1 | 4 | 92.9333 |
| 2 | 5 | 5 | 92.5667 |
| 3 | 7 | 5 | 92.3333 |
| 4 | 8 | 2 | 92.7667 |
| 5 | 10 | 4 | 92.5 |
| 6 | 11 | 5 | 92.4 |
| 7 | 12 | 7 | 93.1333 |
| 8 | 13 | 7 | 93.5333 |
| 9 | 14 | 2 | 92.1 |
| 10 | 15 | 6 | 93.2 |
| 11 | 16 | 8 | 92.7333 |
| 12 | 19 | 3 | 93.3 |
| 13 | 21 | 9 | 92.9333 |
| 14 | 22 | 4 | 93.3333 |
| 15 | 24 | 9 | 92.9333 |
| 16 | 25 | 1 | 92.3 |
| 17 | 26 | 9 | 92.2333 |
| 18 | 28 | 5 | 92.1 |
+----+----------+------------+------------+
So, the result should be returned based on the condition that the accuracy is greater than or equal to a predefined value (e.g. 92, 90, 85, etc.).
You can use df.loc[df['Accuracy'] >= predefined_accuracy].
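A short runnable sketch using the names from the question (the threshold is just an example):
predefined_accuracy = 92
result = df_groups.loc[df_groups['Accuracy'] >= predefined_accuracy]
result = result.reset_index(drop=True)  # optional: renumber the remaining rows
print(result)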
Before web scraping, I'm using the following code to log in.
import requests
from bs4 import BeautifulSoup
login_url = 'https://www.footballoutsiders.com/user/login'
data = {
    'username': 'username',
    'password': 'password'
}

with requests.Session() as s:
    response = s.post(login_url, data)
    print(response.text)
I then do the following to view the table, but the cells are still locked.
index_page = s.get('https://www.footballoutsiders.com/stats/nfl/historical-lookup-by-week/2020/1/overall')
soup = BeautifulSoup(index_page.text, 'lxml')
table1 = soup.find('table')
table1
What am I doing wrong?
You are passing in the wrong form fields. They should be 'name' and 'pass'. Also, don't use BeautifulSoup to parse <table> tags yourself when you only need the content; pandas can do that for you (it uses bs4 under the hood).
import requests
import pandas as pd
LOGIN_URL = 'https://www.footballoutsiders.com/user/login?destination=home'
login = {
    'name': '123#email.com',
    'pass': '54321',
    'form_id': 'user_login_form',
    'op': 'Login'
}
s = requests.Session()
s.post(LOGIN_URL, data=login)
index_page = s.get('https://www.footballoutsiders.com/stats/nfl/historical-lookup-by-week/2020/1/overall')
df = pd.read_html(index_page.text)[0]
Output:
print(df.to_markdown())
| | Team | W-L | Total DVOA | Total DVOA.1 | Weighted DVOA | Weighted DVOA.1 | Offense DVOA | Offense DVOA.1 | Offense Weighted DVOA | Offense Weighted DVOA.1 | Defense DVOA | Defense DVOA.1 | Defense Weighted DVOA | Defense Weighted DVOA.1 | Special Teams DVOA | Special Teams DVOA.1 | Special Teams Weighted DVOA | Special Teams Weighted DVOA.1 |
|---:|:-------|:------|-------------:|:---------------|----------------:|:------------------|---------------:|:-----------------|------------------------:|:--------------------------|---------------:|:-----------------|------------------------:|:--------------------------|---------------------:|:-----------------------|------------------------------:|:--------------------------------|
| 0 | BAL | 1-0 | 1 | 88.0% | 1 | 88.0% | 1 | 39.9% | 1 | 39.9% | 3 | -38.8% | 3 | -38.8% | 2 | 9.4% | 2 | 9.4% |
| 1 | NE | 1-0 | 2 | 52.3% | 2 | 52.3% | 3 | 36.4% | 3 | 36.4% | 5 | -23.8% | 5 | -23.8% | 25 | -7.9% | 25 | -7.9% |
| 2 | JAX | 1-0 | 3 | 38.0% | 3 | 38.0% | 4 | 35.8% | 4 | 35.8% | 13 | 0.5% | 13 | 0.5% | 5 | 2.8% | 5 | 2.8% |
| 3 | SEA | 1-0 | 4 | 37.0% | 4 | 37.0% | 2 | 38.6% | 2 | 38.6% | 21 | 9.5% | 21 | 9.5% | 3 | 7.8% | 3 | 7.8% |
| 4 | PIT | 1-0 | 5 | 36.0% | 5 | 36.0% | 14 | 6.5% | 14 | 6.5% | 2 | -39.0% | 2 | -39.0% | 27 | -9.4% | 27 | -9.4% |
| 5 | WAS | 1-0 | 6 | 35.9% | 6 | 35.9% | 28 | -32.7% | 28 | -32.7% | 1 | -69.4% | 1 | -69.4% | 13 | -0.8% | 13 | -0.8% |
| 6 | BUF | 1-0 | 7 | 16.7% | 7 | 16.7% | 17 | 2.6% | 17 | 2.6% | 7 | -19.0% | 7 | -19.0% | 22 | -4.8% | 22 | -4.8% |
| 7 | LV | 1-0 | 8 | 13.7% | 8 | 13.7% | 5 | 31.7% | 5 | 31.7% | 23 | 16.5% | 23 | 16.5% | 15 | -1.4% | 15 | -1.4% |
| 8 | NO | 1-0 | 9 | 10.8% | 9 | 10.8% | 24 | -13.7% | 24 | -13.7% | 9 | -14.3% | 9 | -14.3% | 1 | 10.2% | 1 | 10.2% |
| 9 | MIN | 0-1 | 10 | 10.8% | 10 | 10.8% | 6 | 28.2% | 6 | 28.2% | 26 | 20.1% | 26 | 20.1% | 6 | 2.7% | 6 | 2.7% |
| 10 | LAC | 1-0 | 11 | 4.1% | 11 | 4.1% | 22 | -7.6% | 22 | -7.6% | 6 | -20.4% | 6 | -20.4% | 26 | -8.7% | 26 | -8.7% |
| 11 | CAR | 0-1 | 12 | 2.5% | 12 | 2.5% | 8 | 23.4% | 8 | 23.4% | 27 | 24.0% | 27 | 24.0% | 4 | 3.2% | 4 | 3.2% |
| 12 | CHI | 1-0 | 13 | 0.3% | 13 | 0.3% | 19 | 0.9% | 19 | 0.9% | 12 | -1.6% | 12 | -1.6% | 16 | -2.2% | 16 | -2.2% |
| 13 | DAL | 0-1 | 14 | 0.3% | 14 | 0.3% | 12 | 9.2% | 12 | 9.2% | 18 | 3.9% | 18 | 3.9% | 23 | -5.1% | 23 | -5.1% |
| 14 | DET | 0-1 | 15 | -0.1% | 15 | -0.1% | 21 | -0.3% | 21 | -0.3% | 15 | 1.2% | 15 | 1.2% | 9 | 1.4% | 9 | 1.4% |
| 15 | KC | 1-0 | 16 | -1.3% | 16 | -1.3% | 9 | 17.6% | 9 | 17.6% | 25 | 17.5% | 25 | 17.5% | 14 | -1.3% | 14 | -1.3% |
| 16 | GB | 1-0 | 17 | -5.0% | 17 | -5.0% | 7 | 24.2% | 7 | 24.2% | 30 | 28.7% | 30 | 28.7% | 12 | -0.4% | 12 | -0.4% |
| 17 | ARI | 1-0 | 18 | -5.2% | 18 | -5.2% | 20 | 0.5% | 20 | 0.5% | 17 | 2.9% | 17 | 2.9% | 17 | -2.8% | 17 | -2.8% |
| 18 | SF | 0-1 | 19 | -6.1% | 19 | -6.1% | 18 | 2.3% | 18 | 2.3% | 14 | 0.6% | 14 | 0.6% | 24 | -7.8% | 24 | -7.8% |
| 19 | HOU | 0-1 | 20 | -9.3% | 20 | -9.3% | 11 | 11.1% | 11 | 11.1% | 24 | 16.7% | 24 | 16.7% | 19 | -3.6% | 19 | -3.6% |
| 20 | LAR | 1-0 | 21 | -13.5% | 21 | -13.5% | 13 | 7.5% | 13 | 7.5% | 20 | 9.5% | 20 | 9.5% | 28 | -11.4% | 28 | -11.4% |
| 21 | TEN | 1-0 | 22 | -14.3% | 22 | -14.3% | 16 | 3.4% | 16 | 3.4% | 10 | -11.1% | 10 | -11.1% | 32 | -28.7% | 32 | -28.7% |
| 22 | TB | 0-1 | 23 | -16.3% | 23 | -16.3% | 25 | -16.1% | 25 | -16.1% | 8 | -16.4% | 8 | -16.4% | 30 | -16.7% | 30 | -16.7% |
| 23 | DEN | 0-1 | 24 | -17.1% | 24 | -17.1% | 23 | -13.3% | 23 | -13.3% | 19 | 3.9% | 19 | 3.9% | 11 | 0.1% | 11 | 0.1% |
| 24 | ATL | 0-1 | 25 | -26.0% | 25 | -26.0% | 10 | 11.8% | 10 | 11.8% | 32 | 38.9% | 32 | 38.9% | 10 | 1.1% | 10 | 1.1% |
| 25 | IND | 0-1 | 26 | -27.0% | 26 | -27.0% | 15 | 3.8% | 15 | 3.8% | 28 | 26.0% | 28 | 26.0% | 21 | -4.8% | 21 | -4.8% |
| 26 | CIN | 0-1 | 27 | -28.4% | 27 | -28.4% | 29 | -33.2% | 29 | -33.2% | 11 | -8.5% | 11 | -8.5% | 20 | -3.7% | 20 | -3.7% |
| 27 | NYJ | 0-1 | 28 | -35.2% | 28 | -35.2% | 26 | -21.1% | 26 | -21.1% | 16 | 2.7% | 16 | 2.7% | 29 | -11.4% | 29 | -11.4% |
| 28 | PHI | 0-1 | 29 | -41.1% | 29 | -41.1% | 32 | -71.3% | 32 | -71.3% | 4 | -33.6% | 4 | -33.6% | 18 | -3.4% | 18 | -3.4% |
| 29 | MIA | 0-1 | 30 | -48.9% | 30 | -48.9% | 27 | -23.2% | 27 | -23.2% | 29 | 28.3% | 29 | 28.3% | 8 | 2.6% | 8 | 2.6% |
| 30 | NYG | 0-1 | 31 | -54.7% | 31 | -54.7% | 30 | -45.4% | 30 | -45.4% | 22 | 11.9% | 22 | 11.9% | 7 | 2.6% | 7 | 2.6% |
| 31 | CLE | 0-1 | 32 | -107.6% | 32 | -107.6% | 31 | -57.5% | 31 | -57.5% | 31 | 33.3% | 31 | 33.3% | 31 | -16.8% | 31 | -16.8% |
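If the field names are ever in doubt, one way to double-check what a login form expects is to list the form inputs first; a sketch (the page structure may of course differ):
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    resp = s.get('https://www.footballoutsiders.com/user/login')
    soup = BeautifulSoup(resp.text, 'lxml')
    # Print each form's id together with the names of its input fields
    for form in soup.find_all('form'):
        print(form.get('id'), [i.get('name') for i in form.find_all('input')])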
I am new to Python and have two dataframes: df1 contains information about all students with their group and score, and df2 contains updated information for a few students who changed their group and score. How can I update the information in df1 based on the values in df2 (group and score)?
df1
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 0 | 0.845435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 3 | 0.843209 |
| 9 | 9 | 4 | 0.84902 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 2 | 0.843043 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 1 | 0.85426 |
+----+----------+-----------+----------------+
df2
+----+-----------+----------+----------------+
| | group |student No| score |
|----+-----------+----------+----------------|
| 0 | 2 | 1 | 0.887435 |
| 1 | 0 | 19 | 0.81214 |
| 2 | 3 | 17 | 0.899041 |
| 3 | 0 | 8 | 0.853333 |
| 4 | 4 | 9 | 0.88512 |
+----+-----------+----------+----------------+
The result (df3)
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 2 | 0.887435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 0 | 0.853333 |
| 9 | 9 | 4 | 0.88512 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 3 | 0.899041 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 0 | 0.81214 |
+----+----------+-----------+----------------+
My code to update df1 from df2:
dfupdated = df1.merge(df2, how='left', on=['student No'], suffixes=('', '_new'))
dfupdated['group'] = np.where(pd.notnull(dfupdated['group_new']), dfupdated['group_new'], dfupdated['group'])
dfupdated['score'] = np.where(pd.notnull(dfupdated['score_new']), dfupdated['score_new'], dfupdated['score'])
dfupdated.drop(['group_new', 'score_new'], axis=1, inplace=True)
dfupdated.reset_index(drop=True, inplace=True)
But I face the following error:
KeyError: "['group'] not in index"
I don't know what's wrong; I ran the same code and got the answer. Here is a different way to solve it. Try:
dfupdated = df1.merge(df2, on='student No', how='left')
dfupdated['group'] = dfupdated['group_y'].fillna(dfupdated['group_x'])
dfupdated['score'] = dfupdated['score_y'].fillna(dfupdated['score_x'])
dfupdated.drop(['group_x', 'group_y', 'score_x', 'score_y'], axis=1, inplace=True)
This will give you the solution you want.
To get the max score from each group:
dfupdated.groupby(['group'], sort=False)['score'].max()
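An alternative sketch using DataFrame.update, which aligns on the index and overwrites df1's values with the non-NaN values from df2 (this assumes 'student No' is unique in both frames):
df3 = df1.set_index('student No')
df3.update(df2.set_index('student No'))  # in place; rows not present in df2 are left untouched
df3 = df3.reset_index()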
I have some data as follows:
+-----+-------+-------+--------------------+
| Sys | Event | Code | Duration |
+-----+-------+-------+--------------------+
| | 1 | 65 | 355.52 |
| | 1 | 66 | 18.78 |
| | 1 | 66 | 223.42 |
| | 1 | 66 | 392.17 |
| | 2 | 66 | 449.03 |
| | 2 | 66 | 506.03 |
| | 2 | 66 | 73.93 |
| | 3 | 66 | 123.17 |
| | 3 | 66 | 97.85 |
+-----+-------+-------+--------------------+
Now, for each Code, I want to sum the Durations for all Event = 1 and so on, regardless of Sys. How do I approach this?
As DYZ says:
df.groupby(['Code', 'Event']).Duration.sum()
Output:
Code Event
65 1 355.52
66 1 634.37
2 1028.99
3 221.02
Name: Duration, dtype: float64
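If a flat DataFrame is preferred over the MultiIndex Series shown above, as_index=False (or a trailing reset_index()) gives one:
df.groupby(['Code', 'Event'], as_index=False)['Duration'].sum()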