I have two columns in my dataframe, like this:
label   info_version
        18.2.6
        18.2.6
        18.2.7
        18.2.8
        18.2.8
        20.1.1
        20.2.1
I want to label every version increment with its upgrade type: the first entry for every id stays none, and then the labelling starts. If there is no change, the entry is none. So the output should look like this:
label   info_version
none    18.2.6
none    18.2.6
patch   18.2.7
patch   18.2.8
none    18.2.8
major   20.1.1
minor   20.2.1
import pandas as pd
from packaging import version

def version_upgrade(prev_version, current_version):
    if prev_version is None:
        return None
    elif version.parse(current_version) > version.parse(prev_version):
        if version.parse(current_version).major > version.parse(prev_version).major:
            return "major"
        elif version.parse(current_version).minor > version.parse(prev_version).minor:
            return "minor"
        else:
            return "patch"
    else:
        return None

semver_df["label"] = None
prev_version_list = semver_df["info_version"].shift(1).tolist()
semver_df["label"] = semver_df["info_version"].apply(lambda x: version_upgrade(prev_version_list.pop(0), x))
This code works when I provide sample data; however, in my case I also need to sort by commit dates, and I am not sure how that can be achieved. Any help would be highly appreciated!
It should work basically the same way:
import pandas as pd
from packaging import version

def version_upgrade(prev_version, current_version):
    if prev_version is None:
        return None
    elif version.parse(current_version) > version.parse(prev_version):
        if version.parse(current_version).major > version.parse(prev_version).major:
            return "major"
        elif version.parse(current_version).minor > version.parse(prev_version).minor:
            return "minor"
        else:
            return "patch"
    else:
        return None

data = {
    "label": [None, None, None, None, None, None, None],
    "info_version": ["18.2.6", "18.2.6", "18.2.7", "18.2.8", "18.2.8", "20.1.1", "20.2.1"],
    "commit_date": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07"]
}
semver_df = pd.DataFrame(data)

# sort by commit date before computing the labels
semver_df["commit_date"] = pd.to_datetime(semver_df["commit_date"])
semver_df = semver_df.sort_values("commit_date")
semver_df.dropna(subset=["info_version"], inplace=True)

semver_df["label"] = None
prev_version_list = semver_df["info_version"].shift(1).tolist()
semver_df["label"] = semver_df["info_version"].apply(lambda x: version_upgrade(prev_version_list.pop(0), x))
returning:
label info_version commit_date
0 None 18.2.6 2022-01-01
1 None 18.2.6 2022-01-02
2 patch 18.2.7 2022-01-03
3 patch 18.2.8 2022-01-04
4 None 18.2.8 2022-01-05
5 major 20.1.1 2022-01-06
6 minor 20.2.1 2022-01-07
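For reference, a roughly equivalent variant of the last two lines (just a sketch) pairs each row with its shifted predecessor instead of popping from prev_version_list; the NaN that shift() produces for the first row is treated the same as None:

# Sketch: compute the previous version per row with shift() and apply row-wise.
semver_df["prev_version"] = semver_df["info_version"].shift(1)
semver_df["label"] = semver_df.apply(
    lambda r: None if pd.isna(r["prev_version"])
    else version_upgrade(r["prev_version"], r["info_version"]),
    axis=1,
)
semver_df = semver_df.drop(columns="prev_version")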
I'm trying to scrape a site (discogs.com) for a few different fields (num_have, num_want, num_versions, num_for_sale, value) per release_id. Generally it works ok, but I want to set some conditions to exclude release ids where:
num_have is greater than 18,
num_versions is 2 or less,
num_for_sale is 5 or less,
So I want results to be any release id that meets all three conditions. I can do that for conditions 1 & 2, but the 3rd is giving me trouble. I don't know how to adjust for where num_for_sale is 0. According to the api documentation (https://www.discogs.com/developers/#page:marketplace,header:marketplace-release-statistics), the body should look like this:
{
    "lowest_price": {
        "currency": "USD",
        "value": 2.09
    },
    "num_for_sale": 26,
    "blocked_from_sale": false
}
and "Releases that have no items for sale in the marketplace will return a body with null data in the lowest_price and num_for_sale keys. Releases that are blocked for sale will also have null data for these keys." So I think my errors are coming from where num_for_sale is 0, the script doesn't know what when value. When I wrap the code that accesses market_data in a try-except block, and set the values for value and currency to None if an exception occurs, I get an AttributeError "NoneType' object has no attribute 'get'"
What am I doing wrong? How should I rewrite this code:
import pandas as pd
import requests
import time
import tqdm

unique_northAmerica = pd.read_pickle("/Users/EJ/northAmerica_df.pkl")
unique_northAmerica = unique_northAmerica.iloc[1:69]

headers = {'Authorization': 'Discogs key=MY-KEY'}
results = []

for index, row in tqdm.tqdm(unique_northAmerica.iterrows(), total=len(unique_northAmerica)):
    release_id = row['release_id']
    response = requests.get(f'https://api.discogs.com/releases/{release_id}', headers=headers)
    data = response.json()

    if 'community' in data:
        num_have = data['community']['have']
        num_want = data['community']['want']
    else:
        num_have = None
        num_want = None

    if "master_id" in data:
        master_id = data['master_id']
        response = requests.get(f"https://api.discogs.com/masters/{master_id}/versions", headers=headers)
        versions_data = response.json()
        if "versions" in versions_data:
            num_versions = len(versions_data["versions"])
        else:
            num_versions = 1
    else:
        num_versions = 1

    response = requests.get(f'https://api.discogs.com/marketplace/stats/{release_id}', headers=headers)
    market_data = response.json()
    num_for_sale = market_data.get('num_for_sale', None)

    # Add the condition to only append to `results` if num_have <= 18 and num_versions <= 2
    if num_have and num_versions and num_have <= 18 and num_versions <= 2:
        if num_for_sale and num_for_sale <= 5:
            if 'lowest_price' in market_data:
                value = market_data['lowest_price'].get('value', None)
            else:
                value = None
        else:
            value = None
        if num_for_sale == 0:
            value = None
        results.append({
            'release_id': release_id,
            'num_have': num_have,
            'num_want': num_want,
            'num_versions': num_versions,
            'num_for_sale': num_for_sale,
            'value': value
        })

    time.sleep(4)

df = pd.DataFrame(results)
df.to_pickle("/Users/EJ/example.pkl")
Thanks in advance!
I've tried wrapping the code that accesses market_data in a try-except block and setting the values for value and currency to None if an exception occurs, but I still get an AttributeError: "'NoneType' object has no attribute 'get'".
Edit:
Traceback (most recent call last)
Cell In [139], line 41
39 if num_for_sale <= 5:
40 if 'lowest_price' in market_data:
---> 41 value = market_data['lowest_price'].get('value', None)
42 else:
43 value = None
AttributeError: 'NoneType' object has no attribute 'get'
You just need to add a check to see if the data is None.
if 'lowest_price' in market_data and market_data['lowest_price'] is not None:
    value = market_data['lowest_price'].get('value', None)
else:
    value = None
In fact, you can probably skip checking whether lowest_price exists, because the API docs tell you it will be there; it just might have null data.
So you could change it to:
if market_data['lowest_price']:
    value = ...
else:
    value = None
Per the discogs api docs:
Releases that have no items for sale in the marketplace will return a body with null data in the lowest_price and num_for_sale keys. Releases that are blocked for sale will also have null data for these keys.
Which means that in one of those situations the converted json would look like this:
{
    "lowest_price": None,
    "num_for_sale": None,
    "blocked_from_sale": False
}
So when your code tries to call .get on market_data['lowest_price'], what you're actually doing is calling None.get, which raises the error.
The reason it is still including rows where num_for_sale > 5 is that you are appending the results regardless of whether your check returns true or false. To fix it, all you need to do is adjust the indentation of your results.append statement.
if num_have and num_versions and num_have <= 18 and num_versions <= 2:
    if num_for_sale and num_for_sale <= 5:
        if market_data['lowest_price']:
            value = market_data['lowest_price'].get('value', None)
        else:
            value = None
        results.append({
            'release_id': release_id,
            'num_have': num_have,
            'num_want': num_want,
            'num_versions': num_versions,
            'num_for_sale': num_for_sale,
            'value': value
        })
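A slightly more compact way to express the same value lookup (just a sketch of the same idea) is to fall back to an empty dict when lowest_price is null, so .get() is always safe:

# Sketch: an empty dict stands in for a null lowest_price, so .get() never hits None.
lowest = market_data.get('lowest_price') or {}
value = lowest.get('value')  # stays None when nothing is for sale or the release is blocked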
I have a class similar to the code below. It maintains a pandas DataFrame of states; my real case is much more complex than the one given here, so I need to create several algorithm-based filters to find a particular required record.
import pandas as pd
import datetime as dt

class stateMachine:
    def __init__(self):
        self.contextStack = pd.DataFrame(columns=['state',
                                                  'end_date', 'speed',
                                                  'weight', 'area'])
        self.state = 'idle'
        self.day = dt.date.today() - dt.timedelta(days=100)

    def update(self):
        state_ctx = {
            'state': self.state,
            'end_date': self.day,
            'speed': self.speed,
            'weight': self.weight
        }
        df = pd.DataFrame([state_ctx])
        self.contextStack = pd.concat([self.contextStack, df], ignore_index=True)
        self.day = self.day + dt.timedelta(days=1)

    def set_speed(self, speed):
        self.speed = speed

    def set_weight(self, weight):
        self.weight = weight

    def set_state(self, state):
        self.state = state
Here is a very simple example of adding items:
sm = stateMachine()
states = ['idle', 'on', 'off']
for i in range(0, 10):
    sm.set_speed(i)
    sm.set_weight(i + 100)
    sm.set_state(states[i % 3])
    sm.update()
So after running this, I get my DataFrame:
state end_date speed weight
0 idle 2022-01-24 0 100
1 on 2022-01-25 1 101
2 off 2022-01-26 2 102
3 idle 2022-01-27 3 103
4 on 2022-01-28 4 104
5 off 2022-01-29 5 105
6 idle 2022-01-30 6 106
7 on 2022-01-31 7 107
8 off 2022-02-01 8 108
9 idle 2022-02-02 9 109
My current algorithm can find one selected item; it looks like this:
def get_filtered_stack(self, state):
    filtered_df = self.contextStack[(self.contextStack['state'] == state)]
    return filtered_df

def find_item_understate(self, state, weight, speed):
    self.state_stack = self.get_filtered_stack(state)
    # after some operation, I get the index of the wanted row
    # let's assume it is 0
    ctx = self.state_stack.iloc[0, :]
    return ctx
Here comes my problem: after my higher-level application gets this 'ctx', it is later no longer able to address it back to its index in the original whole pandas DataFrame.
context = sm.find_item_understate('on', 104, 4)
because the 'index' information is lost after 'iloc'. Here is what the test context from the above code looks like:
state on
end_date 2022-01-25
speed 1
weight 101
Name: 1, dtype: object
In some cases, I need to find the original index back in my later processing; in this case, it is 1.
But since the context has already lost the 'index' information, it causes me the trouble that, at the end of the day, the output context has lost its way back home.
Note: the date/speed/weight columns can't be referred to as a filter to address back the index.
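One observation that may help, based only on the printout shown above: the Series returned by iloc still carries the original row label in its .name attribute ("Name: 1"), so the caller could use that to address the row in the full contextStack. A minimal sketch:

# Sketch: the row label survives as the Series' .name attribute.
context = sm.find_item_understate('on', 104, 4)
original_index = context.name              # 1 in this example
original_row = sm.contextStack.loc[original_index]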
I am trying to make a table (or CSV; I'm using a pandas dataframe) from the information in an XML file.
The file is here (.zip is 14 MB, XML is ~370 MB): https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip . It has package information for different languages - node.js, python, java etc. - aka the CPE 2.3 list by the US government org NVD.
This is what the first 30 lines look like:
<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
  <generator>
    <product_name>National Vulnerability Database (NVD)</product_name>
    <product_version>4.9</product_version>
    <schema_version>2.3</schema_version>
    <timestamp>2022-03-17T03:51:01.909Z</timestamp>
  </generator>
  <cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
    <title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
    <references>
      <reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
      <reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
    </references>
    <cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
  </cpe-item>
The tree structure of the XML file is quite simple: the root is 'cpe-list', the child element is 'cpe-item', and the grandchild elements are 'title', 'references' and 'cpe23-item'.
From 'title', I want the text in the element;
From 'cpe23-item', I want the 'name' attribute;
From 'references', I want the 'href' attribute of each of its 'reference' children (the great-grandchildren of the root).
The dataframe should look like this:
| cpe23_name | title_text | ref1 | ref2 | ref3 | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN | NaN
1 | 'cpe23name 2'| 'this is a java pkg' | 'url1'| 'url2'| NaN | NaN
...
My code is here, finished in ~100 sec:
import xml.etree.ElementTree as et
import pandas as pd

xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()

import time
start_time = time.time()

df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", "others"]
title = '{http://cpe.mitre.org/dictionary/2.0}title'
ref = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'

p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None

rows = []
i = 0
while i < len(xroot):
    for elm in xroot[i]:
        if elm.tag == title:
            p_text = elm.text                      # assign p_text
        elif elm.tag == ref:
            for nn in elm:
                s = nn.text.lower()                # check the lowercased text in refs
                if 'version' in s:
                    p_vers = nn.attrib.get('href')     # assign p_vers
                elif 'advisor' in s:
                    p_advi = nn.attrib.get('href')     # assign p_advi
                elif 'product' in s:
                    p_prod = nn.attrib.get('href')     # assign p_prod
                elif 'vendor' in s:
                    p_vend = nn.attrib.get('href')     # assign p_vend
                elif 'change' in s:
                    p_chan = nn.attrib.get('href')     # assign p_chan
                else:
                    p_othe = nn.attrib.get('href')
        elif elm.tag == cpe_item:
            p_cpe = elm.attrib.get("name")         # assign p_cpe
        else:
            print(elm.tag)
    row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
    rows.append(row)
    p_cpe = None
    p_text = None
    p_vend = None
    p_prod = None
    p_vers = None
    p_chan = None
    p_advi = None
    p_othe = None
    print(len(rows))    # this shows how far I got during the running time
    i += 1

out_df1 = pd.DataFrame(rows, columns=df_cols)    # moved outside the loop by removing the indent
print("---853k rows take %s seconds ---" % (time.time() - start_time))
Updated: the faster way is to move the second-to-last line outside the loop. Since 'rows' already collects the info in each iteration, there is no need to make a new dataframe every time.
The running time is now 136.0491042137146 seconds. Yay!
Since your XML is fairly flat, consider the recently added IO method, pandas.read_xml, introduced in v1.3. Because the XML uses a default namespace, reference elements in the xpath via the namespaces argument:
url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)
If you do not have the default parser, lxml, installed, use the etree parser:
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)
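As a quick sanity check on either call (a sketch; the exact columns depend on how read_xml flattens the child elements):

# Sketch: inspect the parsed result before building anything on top of it.
print(df.shape)
print(df.columns.tolist())
print(df.head())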
Sample dataframe
ticket_start_time ticket_end_time status customer_type ticket_type customer_type
0 None None None None None None
1 None None None None None None
2 None None None None None None
3 None None None None None None
8 2021-10-22 16:26:50 2021-10-22 19:16:28 Por Acción R INSTALLATION R
9 2021-10-22 16:26:50 2021-10-22 16:38:23 Por Acción R INSTALLATION R
10 2021-10-22 16:26:50 2021-10-22 19:16:28 Por Acción R INSTALLATION R
I'm using the code below, but it is hardcoded. How can I create a reusable function for the above dataframe?
import pyarrow
import pandas as pd

df = read_df()
columns_list = [req_cols]
filter_conditions = ["status = 'closed'" and "customer_type = 'R'"]
df.query()

def select_filter_df(df, columns_list, filter_conditions):
    # apply the filters, and query
    return df
Use function parameters?
import os
import logging
import pandas as pd

logger = logging.getLogger(__name__)   # assumed logger setup

def select_filter_df(filename, columns, querystring):
    try:
        df = pd.read_parquet(filename, columns=columns, engine='pyarrow')
        df = df.query(querystring)
    except Exception as error:
        logger.error(error)
    return df

# How to use it:
file_path = r"D:\Project_centriam"
filename = os.path.join(file_path, "merged_result.parquet")
cols = ["ticket_start_time", "ticket_end_time", "status", "customer_type", "ticket_type", "customer_type"]
qs = 'status == "Rechazado" and ticket_type =="INSTALLATION" and customer_type =="R"'
df = select_filter_df(filename, cols, qs)
required_cols = ["install_ticket_start_time", "install_ticket_end_time", "install_status", "install_customer_type", "install_ticket_type"]
filter_condition = 'install_status == "Rechazado" and install_ticket_type =="INSTALLATION" and install_customer_type =="R"'

def filter_df(filename, column_list, filter_condition):
    try:
        df = pd.read_parquet(filename, engine='pyarrow')
        selected = df.filter(column_list)          # keep only the requested columns
        filtered_df = selected.query(filter_condition)
        print(filtered_df)
        return filtered_df
    except Exception as error:
        logger.error(error)

filter_df(filename, required_cols, filter_condition)
Comparing predicted data from multiple sources with actual data
I am trying to write a Python program which compares predicted points for fantasy football players with actual points, in order to see which data provider's predictive algorithm is the most accurate. I have figured out how to visualize the actual and predicted points for a single player, as in the above graph. However, not being a statistics expert, I am unsure what statistical method(s) to use in order to compute and visualize the comparison of identical data for 500+ players. Ultimately I'd like to end up with one graph which shows the difference between predicted and actual values over the course of an entire season for all players. Can someone please point me in the right general direction in terms of how this would be done theoretically from a statistics point of view? This is a Python learning experience/exercise for me and I'm just using the chosen data because I have a personal interest in it. Thanks!
Below is a sample of my code. I realize that it's quite ugly currently and probably needs to be turned into a function, but this is just a quick proof of concept before I turn it into something more proper.
import sqlite3
import matplotlib.pyplot as plt
import pandas as pd
player_id = str(12) # This is just player #12, randomly chosen
conn = sqlite3.connect('FPL.sqlite')
cur = conn.cursor()
df_ffs = pd.read_sql_query('SELECT * FROM FFScout WHERE player_id ='+player_id,conn)
df_ffs = df_ffs.transpose()
df_ffs.columns = ['FFScout']
df_fff = pd.read_sql_query('SELECT * FROM FFFix WHERE player_id ='+player_id,conn)
df_fff = df_fff.transpose()
df_fff.columns = ['FFFix']
df_ffrev = pd.read_sql_query('SELECT * FROM FPLReview WHERE player_id ='+player_id,conn)
df_ffrev = df_ffrev.transpose()
df_ffrev.columns = ['FPLReview']
df_foverlord = pd.read_sql_query('SELECT * FROM FOverlord WHERE player_id ='+player_id,conn)
df_foverlord = df_foverlord.transpose()
df_foverlord.columns = ['FOverlord']
df_foomni = pd.read_sql_query('SELECT * FROM Foomni WHERE player_id ='+player_id,conn)
df_foomni = df_foomni.transpose()
df_foomni.columns = ['Foomni']
df_actual = pd.read_sql_query('SELECT * FROM GW WHERE player_id ='+player_id,conn)
df_actual = df_actual.transpose()
df_actual.columns = ['ActualGW']
merged_df = pd.concat([df_ffs,df_fff,df_ffrev,df_foverlord,df_foomni,df_actual], axis =1)
merged_df = merged_df.drop(merged_df.index[0])
print(merged_df.head(25))
merged_df.plot()
plt.xlabel('Gameweek')
plt.ylabel('Points')
The merged_df dataframe currently looks like this (first 25 rows):
FFScout FFFix FPLReview FOverlord Foomni ActualGW
GW1 None None None None None 0
GW2 None None None None None 8
GW3 None None None None None 1
GW4 None None None None None 5
GW5 None None None None None 0
GW6 None None None None None 0
GW7 None None None None None 0
GW8 None None None None None 0
GW9 None None None None None 1
GW10 None None None None None 5
GW11 None None None None None 4
GW12 None None None None None 2
GW13 None None None None None 12
GW14 None None None None None 2
GW15 5.1 None None None None 7
GW16 4.5 None None 4.8 5.2 None
GW17 3.9 None 3.67 6.2 3.3 None
GW18 3.9 None 4.22 5 None None
GW19 4.4 None 4.63 4.6 None None
GW20 4.1 None 4.19 6.4 None None
GW21 None None 4.25 7 None None
GW22 None None None None None None
GW23 None None None None None None
GW24 None None None None None None
GW25 None None None None None None
The ActualGW column is the actual points for the player in each gameweek. The other columns are predicted values coming from the other data sources for the same gameweek. So basically what I'd ultimately like to do is this x 500 players, but all in one graph (somehow).
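One common starting point (just a sketch under the merged_df layout shown above; build_merged_df and player_ids below are hypothetical placeholders for however the per-player frames are assembled) is to compute a per-provider error metric such as mean absolute error (MAE) against ActualGW, collect it for every player, and summarize all players in a single chart:

# Sketch: per-provider absolute error against the actual points for one player.
import pandas as pd

provider_cols = ['FFScout', 'FFFix', 'FPLReview', 'FOverlord', 'Foomni']

def player_errors(merged_df, player_id):
    df = merged_df.apply(pd.to_numeric, errors='coerce')          # None -> NaN
    errors = df[provider_cols].sub(df['ActualGW'], axis=0).abs()  # |predicted - actual|
    errors['player_id'] = player_id
    return errors

# Hypothetical aggregation over all players:
# all_errors = pd.concat(player_errors(build_merged_df(pid), pid) for pid in player_ids)
# mae = all_errors[provider_cols].mean()   # NaNs (missing predictions) are ignored
# mae.plot(kind='bar')                     # one chart summarizing every player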