I have a sample table below:
Temperature Voltage Data
25 3.3 2.15
25 3.3 2.21
25 3.3 2.23
25 3.3 2.26
25 3.3 2.19
25 3.45 2.4
25 3.45 2.37
25 3.45 2.42
25 3.45 2.34
25 3.45 2.35
105 3.3 3.2
105 3.3 3.22
105 3.3 3.23
105 3.3 3.24
105 3.3 3.26
105 3.45 3.33
105 3.45 3.32
105 3.45 3.34
105 3.45 3.3
105 3.45 3.36
I would like to calculate the average Data for each Temperature and Voltage case. I could do this in Excel by making a pivot table, but I would like to learn how to do it in a Python script so I can automate this data-processing step.
Thank you,
Victor
P.S. Sorry for the weird table formatting. I'm not exactly sure how to correctly copy and paste a table in here.
I think the function you need is .groupby() if you are familiar with it:
df.groupby(['Temperature','Voltage'])['Data'].mean()
This will compute the mean of the Data column for each unique Temperature and Voltage combination. Here is an example:
import pandas as pd
data = {
'Temperature': [25,25,25,25,25,25,25,25,25,25,105,105,105,105,105,105,105,105,105,105],
'Voltage': [3.3,3.3,3.3,3.3,3.3,3.45,3.45,3.45,3.45,3.45,3.3,3.3,3.3,3.3,3.3,3.45,3.45,3.45,3.45,3.45],
'Data': [2.15,2.21,2.23,2.26,2.19,2.4,2.37,2.42,2.34,2.35,3.2,3.22,3.23,3.24,3.26,3.33,3.32,3.34,3.3,3.36]
}
df = pd.DataFrame(data)
print(df.groupby(['Temperature','Voltage'])['Data'].mean())
Output:
Temperature Voltage
25 3.30 2.208
3.45 2.376
105 3.30 3.230
3.45 3.330
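If you would like output shaped like an Excel pivot table (Voltage across the columns), pivot_table performs the same aggregation; this is just an alternative view of the same result:
pivot = df.pivot_table(index='Temperature', columns='Voltage',
                       values='Data', aggfunc='mean')  # mean of Data per (Temperature, Voltage)
print(pivot)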
I am new to Python and am currently using BeautifulSoup to try to pull some table data. I cannot get the individual elements out of the td tags. What I have so far is:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/').text
soup = BeautifulSoup(source, 'lxml')
td = soup.find_all('td', {'class': 'text-center'})
print(td)
This does display all of the td elements that I want to extract, but I am unable to figure out how to get each individual element out of them.
Thank you in advance for the help, it is much appreciated.
Try this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/').text
soup = BeautifulSoup(source, 'lxml')
td = soup.find_all('td', {'class': 'text-center'})
for cell in td:
    print(cell.get_text(strip=True))  # print each cell's text on its own line
Prints:
S10
NA
14
35.7%
0.91
1744
-48
33:19
11.2
12.4
5.5
7.0
50.0
64.3
2.71
54.2
1.00
57.1
1.14
and so on....
The following script extracts the table and saves the data to a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/')
soup = BeautifulSoup(res.text, 'html.parser')
table = soup.find("table", class_="table_list playerslist tablesaw trhover")
# the header cells become the column names
columns = [i.get_text(strip=True) for i in table.find("thead").find_all("th")]
data = []
# remove the header so only the body rows remain, then collect each row's cells
table.find("thead").extract()
for tr in table.find_all("tr"):
    data.append([td.get_text(strip=True) for td in tr.find_all("td")])
df = pd.DataFrame(data, columns=columns)
df.to_csv("data.csv", index=False)
Output:
Name Season Region Games Win rate K:D GPM GDM Game duration Kills / game Deaths / game Towers killed Towers lost FB% FT% DRAPG DRA% HERPG HER% DRA#15 TD#15 GD#15 NASHPG NASH% CSM DPM WPM VWPM WCPM
0 100 Thieves S10 NA 14 35.7% 0.91 1744 -48 33:19 11.2 12.4 5.5 7.0 50.0 64.3 2.71 54.2 1.00 57.1 1.14 0.4 -378 0.64 42.9 33.2 1937 3.0 1.19 1.31
1 CLG S10 NA 14 35.7% 0.81 1705 -120 35:25 10.6 13.2 4.9 7.9 28.6 28.6 1.93 31.5 0.57 28.6 0.64 -0.6 -1297 0.57 30.4 32.6 1826 3.2 1.17 1.37
2 Cloud9 S10 NA 14 78.6% 1.91 1922 302 28:52 15.0 7.9 8.3 3.1 64.3 64.3 3.07 72.5 1.43 71.4 1.29 0.7 2410 1.00 78.6 33.3 1921 3.0 1.10 1.26
3 Dignitas S10 NA 14 28.6% 0.86 1663 -147 32:44 8.9 10.4 3.9 8.1 42.9 35.7 2.14 41.7 0.57 28.6 0.79 -0.7 -796 0.36 25.0 32.5 1517 3.1 1.28 1.23
4 Evil Geniuses S10 NA 14 50.0% 0.85 1738 -0 34:09 11.1 13.1 6.5 6.0 64.3 57.1 2.36 48.5 1.00 53.6 1.00 0.5 397 0.50 46.5 32.3 1895 3.2 1.36 1.34
5 FlyQuest S10 NA 14 57.1% 1.28 1770 65 34:55 13.4 10.4 6.5 5.2 71.4 35.7 2.86 53.4 1.00 50.0 0.79 -0.1 69 0.71 69.2 32.7 1801 3.2 1.16 1.72
6 Golden Guardians S10 NA 14 50.0% 0.96 1740 6 36:13 10.7 11.1 6.3 6.1 50.0 35.7 3.29 62.8 0.86 42.9 1.43 0.1 711 0.50 43.6 33.7 1944 3.2 1.27 1.53
7 Immortals S10 NA 14 21.4% 0.54 1609 -246 33:54 7.5 14.0 4.3 7.9 35.7 35.7 2.29 39.9 1.00 53.6 0.79 -0.4 -1509 0.36 25.0 31.4 1734 3.3 1.37 1.47
8 Team Liquid S10 NA 14 78.6% 1.31 1796 135 35:07 11.4 8.6 7.9 4.4 42.9 64.3 2.36 43.6 0.93 50.0 1.14 0.2 522 1.21 78.6 33.1 1755 3.5 1.27 1.42
9 TSM S10 NA 14 64.3% 1.12 1768 52 34:20 11.6 10.4 7.2 5.7 50.0 78.6 2.79 51.9 1.21 64.3 0.93 0.1 -129 0.86 57.1 32.6 1729 3.2 1.33 1.33
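As a side note, since the stats are in a plain HTML table, pandas can often read them directly. This is a minimal sketch, assuming the page returns the table to a plain request and that the stats table is the first one found:
import pandas as pd

# read_html parses every <table> element on the page and returns a list of DataFrames
tables = pd.read_html('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/')
df = tables[0]  # assumed: the stats table is the first table on the page
df.to_csv("data.csv", index=False)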
I'm trying to compare live sensor data stored in an SQL database to a table of "acceptable thresholds".
Each entry in the database has three variables that identify it, and then a series of independent variables which I'm trying to compare.
This is the table to which I'd like to compare each entry. 'Gear', 'Throttle Bucket' and '% Load' are the three variables that I want to use to identify which row from the comparison table I want to compare each entry to.
Gear Throttle Bucket % Load Sensor 1 Expected Sensor 2 Expected Sensor 3 Expected
1 1000-1500 0-50 40.1 72.4 8.7
1 1000-1500 51-100 58.8 70.4 17.5
1 1000-1500 0-50 79.0 68.5 22.5
1 1000-1500 51-100 40.2 71.7 8.5
2 1000-1500 0-50 58.6 70.6 16.9
2 1000-1500 51-100 43.9 71.0 10.1
2 1000-1500 0-50 46.6 67.4 14.3
2 1000-1500 51-100 78.5 95.4 25.0
3 1000-1500 0-50 17.7 68.4 3.7
3 1000-1500 51-100 47.7 88.4 19.7
3 1000-1500 0-50 46.6 67.4 14.3
3 1000-1500 51-100 78.5 95.4 25.0
4 1000-1500 0-50 17.7 68.4 3.7
4 1000-1500 0-50 40.2 71.7 8.5
4 1000-1500 51-100 77.7 69.2 23.8
And this is how the data is coming into the server.
Entry Number Gear Throttle % Load Sensor 1 Sensor 2 Sensor 3
1 2 1134 36% 34.1 76.4 9.7
2 4 1758 87% 47.6 89.4 14.5
3 3 1642 15% 52.6 64.5 17.3
4 1 1224 51% 48.6 59.7 4.2
5 4 1421 34% 61.4 74.9 3.9
6 2 1801 73% 52.9 88.4 12.5
So what I've been trying to do is, for every entry in that table, use the values for 'Gear', 'Throttle' and '% Load' to find out what the expected value for each of the sensors should have been, and then calculate the percent difference.
I have the "Comparison Table" stored locally as a Pandas dataframe, and every morning I bulk download all the entries from the live sensor data (approximately 1000 lines).
I've put together a loop with iterrows(), but I'm having trouble comparing float values to the bucketed columns, and I'm also having trouble combining three different conditions to pinpoint the correct row.
Please let me know if there is any information that I've missed. I appreciate any help in advance!
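One way to avoid iterrows() here is to parse the bucket strings in the comparison table into numeric bounds, bucket the live readings the same way, and then merge on the three keys. The frame names (thresholds, live) and the exact column labels are assumptions based on the tables shown in the question, and note that duplicate key rows in the comparison table would produce multiple matches:
import pandas as pd

# "thresholds" is the comparison table, "live" is the daily bulk download (names assumed)

def in_bucket(value, bucket):
    # True if value falls inside a "lo-hi" bucket string such as "1000-1500"
    lo, hi = (float(x) for x in bucket.split("-"))
    return lo <= value <= hi

def to_bucket(value, buckets):
    # map a numeric reading onto the matching bucket label (None if nothing matches)
    for b in buckets:
        if in_bucket(value, b):
            return b
    return None

thr = thresholds.rename(columns={"% Load": "Load Bucket"})

live = live.copy()
live["% Load"] = live["% Load"].astype(str).str.rstrip("%").astype(float)  # "36%" -> 36.0
live["Throttle Bucket"] = live["Throttle"].apply(to_bucket, buckets=thr["Throttle Bucket"].unique())
live["Load Bucket"] = live["% Load"].apply(to_bucket, buckets=thr["Load Bucket"].unique())

# a single merge replaces the loop: each entry picks up its expected sensor values
merged = live.merge(thr, on=["Gear", "Throttle Bucket", "Load Bucket"], how="left")

for i in (1, 2, 3):
    expected = merged[f"Sensor {i} Expected"]
    merged[f"Sensor {i} % diff"] = (merged[f"Sensor {i}"] - expected) / expected * 100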
I want to find the correlation between cities and Rainfall. Note that 'city' is categorical, not numerical. I want to compare their rainfall.
How do I go about it? I haven't seen anything on here that talks about how to deal with duplicate cities with different data, like:
Date Location MinTemp MaxTemp Rainfall
12/1/2008 Albury 13.4 22.9 0.6
12/2/2008 Albury 7.4 25.1 0
12/3/2008 Albury 12.9 25.7 0
12/5/2008 Brisbane 20.5 29 9.6
12/6/2008 Brisbane 22.1 33.4 7.8
12/7/2008 Brisbane 22.6 33.4 12.4
12/8/2008 Brisbane 21.9 26.7 0
12/9/2008 Brisbane 19.5 27.6 0.2
12/10/2008 Brisbane 22.1 30.3 0.6
3/30/2011 Tuggeranong 9.8 25.2 0.4
3/31/2011 Tuggeranong 10.3 18.5 2.8
5/1/2011 Tuggeranong 5.5 20.8 0
5/2/2011 Tuggeranong 11 16.1 0
5/3/2011 Tuggeranong 7.3 17.5 0.6
8/29/2016 Woomera 15 22.9 0
8/30/2016 Woomera 12.5 22.1 12.8
8/31/2016 Woomera 8 20 0
9/1/2016 Woomera 11.6 21.4 0
9/2/2016 Woomera 11.2 19.6 0.3
9/3/2016 Woomera 7.1 20.4 0
9/4/2016 Woomera 6.5 18.6 0
9/5/2016 Woomera 7.3 21.5 0
One possible solution, if I understood you correctly (based on the title of your post), is:
Step 1
Prepare a dataset with Locations as columns and Rainfall as rows (note that you will lose information here: each city is truncated to the length of the shortest rainfall series):
df2=df.groupby("Location")[["Location", "Rainfall"]].head(3) # head(3) is first 3 observations
df2.loc[:,"col"] = 4*["x1","x2","x3"] # 4 is number of unique cities
df3 = df2.pivot_table(index="col",columns="Location",values="Rainfall")
df3
Location Albury Brisbane Tuggeranong Woomera
col
x1 0.6 9.6 0.4 0.0
x2 0.0 7.8 2.8 12.8
x3 0.0 12.4 0.0 0.0
Step 2
Compute the correlation matrix on the resulting dataset:
df3.corr()
Location Albury Brisbane Tuggeranong Woomera
Location
Albury 1.000000 -0.124534 -0.381246 -0.500000
Brisbane -0.124534 1.000000 -0.869799 -0.797017
Tuggeranong -0.381246 -0.869799 1.000000 0.991241
Woomera -0.500000 -0.797017 0.991241 1.000000
An alternative, slightly more involved solution would be to keep the longest series and impute the missing values with means or medians.
But even though you would feed more data into your algorithm, it would not cure the main problem: your data seem to be misaligned. To do correlation analysis properly, you should make sure you compare comparable values, e.g. summer rainfall in one city with summer rainfall in another. That means you need an equal number of comparable rainfall observations for each city: e.g. winter, spring, summer, autumn; or January, February, ..., December.
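As a concrete illustration of that alignment point: if the cities share observation dates in the full dataset (they do not in the small sample above), you could pivot on Date so that each row is a day and each column a city, and then correlate; missing days become NaN and corr() excludes them pairwise:
# align on Date instead of on an arbitrary observation counter
aligned = df.pivot_table(index="Date", columns="Location", values="Rainfall")
print(aligned.corr())  # pairwise Pearson correlation over the overlapping dates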
Can someone please tell me how I can fill in the missing values of my dataframe? The missing values don't come up as NaN or anything common; instead they show as two dots, like '..'. How would I go about filling them in with the mean of the row that they are in?
1971 1990 1999 2000 2001 2002
Estonia .. 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia .. 12.4 13.3 13.6 14.5 14.6
My headers are the years and my index are the countries.
It seems you can use mask: compare the underlying NumPy array (df.values) against '..', replace the matches with the row means, and finally cast all columns to float:
print (df.mean(axis=1))
Estonia 10.26
Spain 210.82
SlovakRepublic 29.70
Slovenia 13.68
df = df.mask(df.values == '..', df.mean(axis=1), axis=0).astype(float)
print (df)
1971 1990 1999 2000 2001 2002
Estonia 10.26 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia 13.68 12.4 13.3 13.6 14.5 14.6
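An alternative, starting from the original (unmasked) frame, is to convert the '..' markers into real NaN values, cast to float, and then fill each row with its own mean; a short sketch:
import numpy as np

# turn the '..' placeholders into real missing values and make everything numeric
df = df.replace('..', np.nan).astype(float)
# fill each row's NaNs with that row's mean
df = df.apply(lambda row: row.fillna(row.mean()), axis=1)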
You should be able to use .set_value (note that it is deprecated and has been removed in newer pandas, where .at is the replacement):
df_name.set_value('index', 'column', value)
For example:
df_name.set_value('Estonia', '1971', 50)
or, with current pandas:
df_name.at['Estonia', '1971'] = 50
I have a file which has different readings in columns. I am able to find the daily mean using the following code, which works when the month, day and time are separated by spaces. I am wondering how I can do the same if I have the date, month and year in the first column and the time in the second column. How can I calculate weekly, daily and monthly averages? Please note the data is not equally sampled.
import pandas as pd
import numpy as np

df = pd.read_csv("Data_set.csv", sep=r'\s+', names=["month", "day", "time", "Temperature"])
group = df.groupby(["month", "day"])
daily = group.aggregate({"Temperature": np.mean})
daily.to_csv('daily.csv')
Date Time T1 T2 T3
17/12/2013 00:28:38 19 23.1 7.3
17/12/2013 00:58:38 19 22.9 7.3
17/12/2013 01:28:38 18.9 22.8 6.3
17/12/2013 01:58:38 18.9 23.1 6.3
17/12/2013 02:28:38 18.8 23 6.3
.....
.....
24/12/2013 19:58:21 14.7 15.5 7
24/12/2013 20:28:21 14.7 15.5 7
24/12/2013 20:58:21 14.7 15.5 7
24/12/2013 21:28:21 14.7 15.6 6
24/12/2013 21:58:21 14.7 15.5 6
24/12/2013 22:28:21 14.7 15.5 5
24/12/2013 22:58:21 14.7 15.5 4
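One way to handle this (a sketch, assuming the file is whitespace-separated with the header row shown above) is to parse Date and Time into a single DatetimeIndex and let resample do the averaging; resample does not need equally spaced samples:
import pandas as pd

df = pd.read_csv("Data_set.csv", sep=r"\s+")  # columns assumed: Date, Time, T1, T2, T3
# combine the two columns into one datetime index (day-first dates like 17/12/2013)
df.index = pd.to_datetime(df["Date"] + " " + df["Time"], dayfirst=True)
df = df.drop(columns=["Date", "Time"])

daily = df.resample("D").mean()    # daily averages
weekly = df.resample("W").mean()   # weekly averages
monthly = df.resample("M").mean()  # monthly averages
daily.to_csv("daily.csv")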