I have a dataframe with a model id and associated values. The columns are date, client_id, model_id, category1, category2, color, and price. I have a simple flask app where the user can select a model id and add to their "purchase" history. Based on the model id I would like to add a row to the dataframe and bring the associated values of category1, category2, color, and price. What is the best way to do this using Pandas? I know in Excel I'd use a vlookup but I am unsure how to go about it using Python. Assume category1, category2, color, and price are unique to each model id.
client_id = input("ENTER Client ID: ")
model_id = input("ENTER Model ID: ")
def update_history(df, client_id, model_id):
    today = pd.to_datetime('today')
    # putting in tmp but just need to "lookup" these values from the original dataframe somehow
    df.loc[len(df)] = [today, client_id, model_id, 'tmp', 'tmp', 'tmp', 'tmp']
    return df
The code below adds a new row with new values to an existing dataframe. The new values are passed in to the function as arguments.
Import libraries
import pandas as pd
import numpy as np
import datetime
Create sample dataframe
model_id = ['M1', 'M2', 'M3']
today = ['2018-01-01', '2018-01-02', '2018-01-01']
client_id = ['C1', 'C2', 'C3']
category1 = ['orange', 'apple', 'beans']
category2 = ['fruit', 'fruit', 'grains']
df = pd.DataFrame({'today': today, 'model_id': model_id, 'client_id': client_id,
                   'category1': category1, 'category2': category2})
df['today'] = pd.to_datetime(df['today'])
df
Function
def update_history(df, client_id, model_id, category1, category2):
    today = pd.to_datetime('today')
    # Create a temp dataframe with new values.
    # Column names in this dataframe should match the existing dataframe
    temp = pd.DataFrame({'today': [today], 'model_id': [model_id], 'client_id': [client_id],
                         'category1': [category1], 'category2': [category2]})
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, temp], ignore_index=True)
    return df
Call function to append a row with new values to existing dataframe
update_history(df, client_id='C4', model_id='M4', category1='apple', category2='fruit')
You could try this. If you are appending more than one row at a time, it is faster to collect the rows as dictionaries in a list and add them to the dataframe in a single step.
modelid = ['MOD1', 'MOD2', 'MOD3']
today = ['2018-07-15', '2018-07-18', '2018-07-20']
clients = ['CLA', 'CLA', 'CLB']
cat_1 = ['CAT1', 'CAT2', 'CAT3']
cat_2 = ['CAT11', 'CAT12', 'CAT13']
mdf = pd.DataFrame({"model_id": modelid, "today": today, "client_id": clients, "cat_1":cat_1, "cat_2":cat_2})
def update_history(df, client_id, model_id):
    today = pd.to_datetime('today')
    # the first existing row for this model_id supplies the category values
    row = df[df.model_id == model_id].iloc[0]
    rows_list = []
    new_row = {"today": today, "client_id": client_id,
               "model_id": model_id, "cat_1": row["cat_1"],
               "cat_2": row["cat_2"]}
    rows_list.append(new_row)
    df2 = pd.DataFrame(rows_list)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, df2], ignore_index=True)
    return df
mdf = update_history(mdf,"CLC","MOD1")
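The batching idea mentioned in this answer could look roughly like the sketch below. The function name `add_purchases` and its `purchases` argument are made up for illustration, and it assumes every requested model id already exists in the dataframe.

```python
import pandas as pd

# Sample purchase history; column names follow the answer above.
mdf = pd.DataFrame({
    "model_id": ["MOD1", "MOD2", "MOD3"],
    "today": pd.to_datetime(["2018-07-15", "2018-07-18", "2018-07-20"]),
    "client_id": ["CLA", "CLA", "CLB"],
    "cat_1": ["CAT1", "CAT2", "CAT3"],
    "cat_2": ["CAT11", "CAT12", "CAT13"],
})

def add_purchases(df, purchases):
    """Append one row per (client_id, model_id) pair with a single concat.

    Hypothetical helper: assumes each model_id is already present in df.
    """
    today = pd.Timestamp.today()
    rows = []
    for client_id, model_id in purchases:
        # The first existing row for this model supplies the category values.
        match = df[df["model_id"] == model_id].iloc[0]
        rows.append({
            "model_id": model_id,
            "today": today,
            "client_id": client_id,
            "cat_1": match["cat_1"],
            "cat_2": match["cat_2"],
        })
    # Build the new rows once and concatenate once, rather than growing
    # the dataframe inside the loop.
    return pd.concat([df, pd.DataFrame(rows)], ignore_index=True)

mdf = add_purchases(mdf, [("CLC", "MOD1"), ("CLD", "MOD3")])
```

Collecting plain dictionaries and concatenating once avoids copying the whole dataframe for every new row.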
This is what I ended up doing. I still think there is a more elegant solution, so please let me know!
#create dataframe
modelid = ['MOD1', 'MOD2', 'MOD3']
today = ['2018-07-15', '2018-07-18', '2018-07-20']
clients = ['CLA', 'CLA', 'CLB']
cat_1 = ['CAT1', 'CAT2', 'CAT3']
cat_2 = ['CAT11', 'CAT12', 'CAT13']
mdf = pd.DataFrame({"model_id": modelid, "today": today, "client_id": clients, "cat_1":cat_1, "cat_2":cat_2})
#reorder columns
mdf = mdf[['cat_1', 'cat_2', 'model_id', 'client_id', 'today']]
#create lookup table
# drop_duplicates returns a new frame, which avoids modifying a slice of mdf in place
lookup = mdf[['cat_1','cat_2','model_id']].drop_duplicates()
#get values
client_id = input("ENTER Client ID: ")
model_id = input("ENTER Model ID: ")
#append model id to list
model_id_lst=[]
model_id_lst.append(model_id)
today=pd.to_datetime('today')
#grab associated cat_1, and cat_2 from lookup table
temp=lookup[lookup['model_id'].isin(model_id_lst)]
out=temp.values.tolist()
out[0].extend([client_id, today])
#add this as a row to the df
mdf.loc[len(mdf)]=out[0]
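One possibly tidier variant, closer to what VLOOKUP does, is to index the lookup table by model_id and read the associated values with `.loc`. This is only a sketch: it assumes one set of category values per model id (as stated in the question), and the extra `lookup` parameter and the `KeyError` handling are illustrative choices rather than part of the original code.

```python
import pandas as pd

mdf = pd.DataFrame({
    "model_id": ["MOD1", "MOD2", "MOD3"],
    "today": pd.to_datetime(["2018-07-15", "2018-07-18", "2018-07-20"]),
    "client_id": ["CLA", "CLA", "CLB"],
    "cat_1": ["CAT1", "CAT2", "CAT3"],
    "cat_2": ["CAT11", "CAT12", "CAT13"],
})

# One row per model_id, indexed by model_id: the pandas analogue of a
# VLOOKUP range.
lookup = mdf.drop_duplicates("model_id").set_index("model_id")[["cat_1", "cat_2"]]

def update_history(df, lookup, client_id, model_id):
    """Append one purchase row, pulling cat_1 and cat_2 from the lookup table."""
    if model_id not in lookup.index:
        raise KeyError(f"unknown model_id: {model_id!r}")
    new_row = {
        "model_id": model_id,
        "today": pd.Timestamp.today(),
        "client_id": client_id,
        # lookup.loc[model_id] is a Series holding cat_1 and cat_2
        **lookup.loc[model_id].to_dict(),
    }
    return pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

mdf = update_history(mdf, lookup, "CLC", "MOD2")
```

Building the lookup once and reusing it also means an unknown model id fails loudly instead of raising an IndexError on an empty list.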
I have a function that is driving me crazy and I am supposed to use only PySpark.
The table below is a representation of the data:
The ID, Name, Surname and Validity columns are available to partition by, and for each ID I need to add a column holding the percentage of emails that are set correctly.
Like the image below:
How can I solve this problem?
window = Window.partitionBy("ID", "email", "name", "surname", "validity").orderBy(col("ID").desc())
df = df.withColumn("row_num", row_number().over(window))
df_new = df.withColumn("total valid emails per ID", df.select("validity").where(df.validity == "valid" & df.row_num == 1)).count()
Something like:
from pyspark.sql import Window, functions as F
win = Window.partitionBy("ID", "email", "name", "surname")
df = df.withColumn(
    "pct_valid",
    F.sum(F.when(F.col("validity") == "Valid", 1).otherwise(0)).over(win)
    / F.col("total emails"),
)
This would work:
df.withColumn("ValidAsNumber", F.when(F.col("Validity") == "Valid", 1).otherwise(0))\
    .withColumn("TotalValid", F.sum("ValidAsNumber").over(Window.partitionBy("ID")))\
    .withColumn("PercentValid", F.expr("(TotalValid/TotalEmails)*100")).show()
Input:
Output (I kept the intermediate columns for understanding, you can drop them):
I am processing CDC data received via Kafka tables and loading it into Databricks Delta tables. Everything works except for a nested JSON string, which is not loaded when using from_json or spark.read.json.
When I try to fetch the schema of the JSON from level 1, using "spark.read.json(df.rdd.map(lambda row: row.value)).schema", the column INPUT_DATA is treated as a string and loaded as a string object. Below are a sample JSON string, the code that I tried, and the expected results.
I have many topics to process and each topic has a different schema, so I would like to process them dynamically. I would prefer not to store the schemas, since they may change over time and I would like my code to handle the changes automatically.
I would appreciate any help, as I have spent a whole day on this and am still trying. Thanks in advance.
Sample Json with nested tree:
after = {
"id_transaction": "121",
"product_id": 25,
"transaction_dt": 1662076800000000,
"creation_date": 1662112153959000,
"product_account": "40012",
"input_data": "{\"amount\":[{\"type\":\"CASH\",\"amount\":1000.00}],\"currency\":\"USD\",\"coreData\":{\"CustId\":11021,\"Cust_Currency\":\"USD\",\"custCategory\":\"Premium\"},\"context\":{\"authRequired\":false,\"waitForConfirmation\":false,\"productAccount\":\"CA12001\"},\"brandId\":\"TOYO-2201\",\"dealerId\":\"1\",\"operationInfo\":{\"trans_Id\":\"3ED23-89DKS-001AA-2321\",\"transactionDate\":1613420860087},\"ip_address\":null,\"last_executed_step\":\"PURCHASE_ORDER_CREATED\",\"last_result\":\"OK\",\"output_dataholder\":\"{\"DISCOUNT_AMOUNT\":\"0\",\"BONUS_AMOUNT_APPLIED\":\"10000\"}",
"dealer_id": 1,
"dealer_currency": "USD",
"Cust_id": 11021,
"process_status": "IN_PROGRESS",
"tot_amount": 10000,
"validation_result_code": "OK_SAVE_AND_PROCESS",
"operation": "Create",
"timestamp_ms": 1675673484042
}
I have created the following script to get all the columns of the JSON structure:
import json
# table_column_schema = {}
json_keys = {}
child_members = []
table_column_schema = {}
column_schema = []
dbname = "mydb"
tbl_name = "tbl_name"
def get_table_keys(dbname):
    # f-string so that the database and table names are actually substituted
    table_values_extracted = f"select value from {dbname}.{tbl_name} limit 1"
    cmd_key_pair_data = spark.sql(table_values_extracted)
    jsonkeys = cmd_key_pair_data.collect()[0][0]
    json_keys = json.loads(jsonkeys)
    column_names_as_keys = json_keys["after"].keys()
    value_column_data = json_keys["after"].values()
    column_schema = list(column_names_as_keys)
    for i in value_column_data:
        # a value that looks like a JSON object is parsed and its nested keys collected
        if ("{" in str(i) and "}" in str(i)):
            a = json.loads(i)
            for i2 in a.values():
                if (str(i2).startswith("{") and str(i2).endswith('}')):
                    column_schema = column_schema + list(i2.keys())
    # keyed by table name so the processing script below can look it up
    table_column_schema[tbl_name] = column_schema
    return 0
get_table_keys(dbname)
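For what it's worth, the key collection above can also be written recursively so that it handles any nesting depth. The sketch below is plain Python with no Spark involved. The helper name `collect_keys` is made up, and the record is a shortened, well-formed stand-in for the sample, not the real payload.

```python
import json

def collect_keys(value, prefix=""):
    """Return dotted key paths, descending into dicts and JSON-encoded strings."""
    if isinstance(value, str):
        # A string field may itself hold a serialized JSON object.
        try:
            parsed = json.loads(value)
        except ValueError:
            return []
        if not isinstance(parsed, dict):
            return []
        value = parsed
    if not isinstance(value, dict):
        return []
    keys = []
    for name, child in value.items():
        path = f"{prefix}.{name}" if prefix else name
        keys.append(path)
        keys.extend(collect_keys(child, path))
    return keys

# Shortened stand-in for the "after" record (an assumption for this sketch).
after = {
    "id_transaction": "121",
    "product_id": 25,
    "input_data": json.dumps({
        "currency": "USD",
        "coreData": {"CustId": 11021, "custCategory": "Premium"},
    }),
}

nested_keys = collect_keys(after)
print(nested_keys)
```

The dotted paths keep nested keys distinct from top-level ones with the same name, which a flat list of keys cannot do.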
The following code is used to process the json and create a dataframe with all nested jsons as the columns:
from pyspark.sql.functions import from_json, to_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType, MapType
import time
dbname = "mydb"
tbl_name = "tbl_name"
start = time.time()
df = spark.sql(f'select value from {dbname}.{tbl_name} limit 2')
tbl_columns = table_column_schema[tbl_name]
data = []
for i in tbl_columns:
    if i == 'input_data':
        # print('FOUND !!!!')
        data.append(StructField(f'{i}', MapType(StringType(),StringType()), True))
    else:
        data.append(StructField(f'{i}', StringType(), True))
# infer the top-level schema from the raw JSON strings in the value column
schema2 = spark.read.json(df.rdd.map(lambda row: row.value)).schema
print(type(schema2))
df2 = df.withColumn("value", from_json("value", schema2)).select(col('value.after.*'), col('value.op'))
Note: The VALUE is a column in my delta table (bronze layer)
Current dataframe output:
Expected dataframe output:
You can use rdd to get the schema and from_json to read the value as json.
from pyspark.sql import functions as f
schema = spark.read.json(df.rdd.map(lambda r: r.input_data)).schema
df = df.withColumn('input_data', f.from_json('input_data', schema))
new_cols = df.columns + df.select('input_data.*').columns
df = df.select('*', 'input_data.*').toDF(*new_cols).drop('input_data')
df.show(truncate=False)
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
|Cust_id|creation_date |dealer_currency|dealer_id|id_transaction|operation|process_status|product_account|product_id|timestamp_ms |tot_amount|transaction_dt |validation_result_code|amount |brandId |context |coreData |currency|dealerId|ip_address|last_executed_step |last_result|operationInfo |output_dataholder|
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
|11021 |1662112153959000|USD |1 |121 |Create |IN_PROGRESS |40012 |25 |1675673484042|10000 |1662076800000000|OK_SAVE_AND_PROCESS |[{1000.0, CASH}]|TOYO-2201|{false, CA12001, false}|{11021, USD, Premium}|USD |1 |null |PURCHASE_ORDER_CREATED|OK |{3ED23-89DKS-001AA-2321, 1613420860087}|{10000, 0} |
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
So I currently have what is above.
I've managed to separate them into categories using groupby but now I would like to put them in a subplot of tables.
##open comma separated file and the columns Name, In Stock, committed, reorder point
file = pd.read_csv('Katana/InventoryItems-2022-01-06-09_10.csv',
                   usecols=['Name','In stock','Committed', 'Reorder point','Category'])
##take the columns and put them in to a list
Name = file['Name'].tolist()
InStock = file['In stock'].tolist()
Committed = file['Committed'].tolist()
ReorderPT = file['Reorder point'].tolist()
Category = file['Category'].tolist()
##take the lists and change them into appropriate type of data
inStock = [int(float(i)) for i in InStock]
commited = [int(float(i)) for i in Committed]
reorderpt = [int(float(i)) for i in ReorderPT]
##have the lists with correct data type and arrange them
inventory = {'Name': Name,
             'In stock': inStock,
             'Committed': commited,
             'Reorder point': reorderpt,
             'Category': Category
             }
##take the inventory arrangement and display them into a table
frame = pd.DataFrame(inventory)
grouped = frame.groupby(frame.Category)
df_elec = grouped.get_group('Electronics')
df_bedp = grouped.get_group('Bed Packaging')
df_fil = grouped.get_group('Filament')
df_fast = grouped.get_group('Fasteners')
df_kit = grouped.get_group('Kit Packaging')
df_pap = grouped.get_group('Paper')
Try something along the lines of:
import matplotlib.pyplot as plt
fig,axs = plt.subplots(nrows=6,ncols=1)
for ax,data in zip(axs,[df_elec,df_bedp,df_fil,df_fast,df_kit,df_pap]):
    data.plot(ax=ax,table=True)
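If the goal is tables only, with no line plots drawn, a variation is to hide each axis and call matplotlib's `Axes.table` directly. The sketch below uses a small made-up inventory in place of the CSV from the question. Iterating over the groupby also removes the need for one `get_group` call per category.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the inventory CSV in the question.
frame = pd.DataFrame({
    "Name": ["Board", "Cable", "PLA 1kg", "M3 screw"],
    "In stock": [12, 40, 7, 500],
    "Committed": [2, 5, 1, 60],
    "Reorder point": [5, 10, 3, 100],
    "Category": ["Electronics", "Electronics", "Filament", "Fasteners"],
})

# Iterating over the groupby yields (category, sub-frame) pairs.
groups = {name: sub.drop(columns="Category")
          for name, sub in frame.groupby("Category")}

fig, axs = plt.subplots(nrows=len(groups), ncols=1,
                        figsize=(6, 2 * len(groups)), squeeze=False)
for ax, (name, sub) in zip(axs.ravel(), groups.items()):
    ax.axis("off")      # hide the axes so only the table is drawn
    ax.set_title(name)
    ax.table(cellText=sub.values.tolist(),
             colLabels=list(sub.columns),
             loc="center")

# Call plt.show() or fig.savefig(...) to view the result.
```

Each subplot then holds one category's table under its own title.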
I would like to identify doctors based on their title in a dataframe and create a new column to indicate if they are a doctor but I am struggling with my code.
doctorcriteria = ['Dr', 'dr']
def doctor(x):
    if doctorcriteria in x:
        return 'Doctor'
    else:
        return 'Not a doctor'
df['doctorcall'] = df.caller_name
df.doctorcall.fillna('Not a doctor', inplace=True)
df.doctorcall = df.doctorcall.apply(doctor)
To create a new column with a function, you can use apply:
df = pd.DataFrame({'Title': ['Dr', 'dr', 'Mr'],
                   'Name': ['John', 'Jim', 'Jason']})
doctorcriteria = ['Dr', 'dr']
def doctor(x):
    if x.Title in doctorcriteria:
        return 'Doctor'
    else:
        return 'Not a doctor'
df['IsDoctor'] = df.apply(doctor, axis=1)
But a more direct route to the answer would be to use map on the Title column. Note that this version yields True/False rather than the 'Doctor'/'Not a doctor' labels.
doctor_titles = {'Dr', 'dr'}
df['IsDoctor'] = df['Title'].map(lambda title: title in doctor_titles)
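If the title is embedded in a single caller_name column, as in the question, rather than stored in its own Title column, a vectorized string match may be simpler. This is a sketch under that assumption, with made-up sample names. The pattern treats a leading "dr" (in any case) followed by a word boundary as the title, and missing names count as not a doctor.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the real caller_name values are not shown in the question.
df = pd.DataFrame({
    "caller_name": ["Dr John Smith", "dr. Jim Jones", "Drake Brown",
                    "Mr Jason Lee", None],
})

# A leading "dr" followed by a word boundary, ignoring case; na=False maps
# missing names to False instead of NaN.
is_doctor = df["caller_name"].str.contains(r"^\s*dr\b", case=False, na=False)
df["doctorcall"] = np.where(is_doctor, "Doctor", "Not a doctor")
```

The word boundary keeps names such as "Drake" from being flagged, and `na=False` replaces the separate `fillna` step.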
I am planning to do some financial research and learning using data from the NASDAQ.
I want to retrieve data from Nasdaq such that the header has the following:
Stock Symbol
Company Name
Last Sale
Market Capitalization
IPO Year
Sector
Industry
Last Update
And I used Python code to get the "list of companies and ticker names" using:
import pandas as pd
import json
PACKAGE_NAME = 'nasdaq-listings'
PACKAGE_TITLE = 'Nasdaq Listings'
nasdaq_listing = 'ftp://ftp.nasdaqtrader.com/symboldirectory/nasdaqlisted.txt'# Nasdaq only
def process():
    nasdaq = pd.read_csv(nasdaq_listing,sep='|')
    nasdaq = _clean_data(nasdaq)
    # Create a few other data sets
    nasdaq_symbols = nasdaq[['Symbol','Company Name']] # Nasdaq w/ 2 columns
    # (dataframe, filename) datasets we will put in schema & create csv
    datasets = [(nasdaq,'nasdaq-listed'), (nasdaq_symbols,'nasdaq-listed-symbols')]
    for df, filename in datasets:
        df.to_csv('data/' + filename + '.csv', index=False)
    with open("datapackage.json", "w") as outfile:
        json.dump(_create_datapackage(datasets), outfile, indent=4, sort_keys=True)
def _clean_data(df):
    # TODO: do I want to save the file creation time (last row)
    df = df.copy()
    # Remove test listings
    df = df[df['Test Issue'] == 'N']
    # Create New Column w/ Just Company Name
    df['Company Name'] = df['Security Name'].apply(lambda x: x.split('-')[0]) #nasdaq file uses - to separate stock type
    #df['Company Name'] = TODO, remove stock type for otherlisted file (no separator)
    # Move Company Name to 2nd Col
    cols = list(df.columns)
    cols.insert(1, cols.pop(-1))
    df = df.loc[:, cols]
    return df
def _create_file_schema(df, filename):
    fields = []
    for name, dtype in zip(df.columns,df.dtypes):
        if str(dtype) == 'object' or str(dtype) == 'boolean': # does datapackage.json use boolean type?
            dtype = 'string'
        else:
            dtype = 'number'
        fields.append({'name':name, 'description':'', 'type':dtype})
    return {
        'name': filename,
        'path': 'data/' + filename + '.csv',
        'format':'csv',
        'mediatype': 'text/csv',
        'schema':{'fields':fields}
    }
def _create_datapackage(datasets):
    resources = []
    for df, filename in datasets:
        resources.append(_create_file_schema(df,filename))
    return {
        'name': PACKAGE_NAME,
        'title': PACKAGE_TITLE,
        'license': '',
        'resources': resources,
    }
process()
Now for each of these symbols, I want to get the other data (as in above).
Is there any way I could do this?
Have you taken a look at pandas-datareader? You may be able to get the other data from there. It has multiple data sources, such as Google and Yahoo Finance:
http://pandas-datareader.readthedocs.io/en/latest/remote_data.html#remote-data-google