Python function to create percentage columns from column modalities

I would like a function that creates new columns from the modalities (categories) of the columns of a database.
For each observation, these created columns should contain the percentage of each modality.

First, load the data as JSON or as an array:

import pandas as pd

hold_data = pd.read_csv(r'C:\Users\abdulalim\Desktop\Test\Products.csv')
hold_data.to_json(r'C:\Users\abdulalim\Desktop\Test\New_Products.json')

Output: {"Product":{"0":"Desktop Computer","1":"Tablet","2":"Printer","3":"Laptop"},"Price":{"0":700,"1":250,"2":120,"3":1200}}

If the data arrives as an array:

def get_data(request):
    hold_data = request.data
    for key in hold_data:
        print(key[0])
        print(key[1])

If the data arrives as JSON:

def get_data(request):
    hold_data = request.data
    hold_data['data1']
    hold_data['data2']

After you get the data, do whatever you want with it.
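For the question as asked (one new column per modality, holding that modality's percentage), a minimal sketch might look like the following. The DataFrame and the column name 'color' are hypothetical, and this assumes the percentage wanted is each modality's overall share, repeated on every observation:

import pandas as pd

# hypothetical example data
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'red']})

# percentage of each modality across the whole column
pct = df['color'].value_counts(normalize=True).mul(100)

# one new column per modality, holding that modality's percentage
for modality, share in pct.items():
    df['color_' + str(modality) + '_pct'] = share

# if 0/1 indicator columns per modality are wanted instead:
# indicators = pd.get_dummies(df['color'], prefix='color')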

Related

How to filter out column data from multiple rows of data?

Good evening,

Hi everyone, I got the following JSON file from Walmart regarding their product items and prices.
I loaded up a Jupyter notebook, imported pandas, and then loaded the file into a DataFrame with custom columns, as shown below.
Now this is what I want to do:
Make new columns named min price and max price, and load the data into them.
How can I do that?
Here is the code in the Jupyter notebook for reference.
I also want the offer price, as some items don't have minPrice and maxPrice :)
EDIT: here is the Python code:
import json
import pandas as pd

with open("walmart.json") as f:
    data = json.load(f)

walmart = data["items"]
wdf = pd.DataFrame(walmart, columns=["productId", "primaryOffer"])
print(wdf.loc[0, "primaryOffer"])

pd.set_option('display.max_colwidth', None)
print(wdf)
Here is the JSON File:
https://pastebin.com/sLGCFCDC
The following code snippet, added after your code, achieves the required task:
min_prices = []
max_prices = []
offer_prices = []

for i, row in wdf.iterrows():
    if 'showMinMaxPrice' in row['primaryOffer']:
        min_prices.append(row['primaryOffer']['minPrice'])
        max_prices.append(row['primaryOffer']['maxPrice'])
        offer_prices.append('N/A')
    else:
        min_prices.append('N/A')
        max_prices.append('N/A')
        offer_prices.append(row['primaryOffer']['offerPrice'])

wdf['minPrice'] = min_prices
wdf['maxPrice'] = max_prices
wdf['offerPrice'] = offer_prices
Here we check for the 'showMinMaxPrice' key of the JSON stored in the 'primaryOffer' column. For cases where minPrice and maxPrice are available, offerPrice is recorded as 'N/A', and vice versa. The values are first collected in lists and then added to the dataframe as columns.
The new minPrice, maxPrice, and offerPrice columns then show up in wdf.head().
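A slightly more compact variant, as a sketch, pulls each field out with Series.apply and dict.get, so a missing key comes back as 'N/A' instead of raising a KeyError:

# sketch: same logic as above, written with apply
wdf['minPrice'] = wdf['primaryOffer'].apply(lambda o: o.get('minPrice', 'N/A'))
wdf['maxPrice'] = wdf['primaryOffer'].apply(lambda o: o.get('maxPrice', 'N/A'))
wdf['offerPrice'] = wdf['primaryOffer'].apply(
    lambda o: 'N/A' if 'showMinMaxPrice' in o else o.get('offerPrice', 'N/A'))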

Create Forecasts Looping over SKUs and Export to CSV using Facebook Prophet

I am new to Python so please bear with me.
I am trying to convert what I think may be a nested dictionary into a csv that I can export. Below is my code:
import pandas as pd
import os
from fbprophet import Prophet

# Read in file
df1 = pd.read_csv('File_Path.csv')

# Create loop to forecast multiple SKUs
def get_prediction(df):
    prediction = {}
    df1 = df.rename(columns={'Date': 'ds', 'qty_ordered': 'y', 'item_no': 'item'})
    list_items = df1.item.unique()
    for item in list_items:
        item_df = df1.loc[df1['item'] == item]
        # set the uncertainty interval to 95% (the Prophet default is 80%)
        my_model = Prophet(yearly_seasonality=True, seasonality_prior_scale=1.0)
        my_model.fit(item_df)
        future_dates = my_model.make_future_dataframe(periods=12, freq='M')
        forecast = my_model.predict(future_dates)
        prediction[item] = forecast
    return prediction

# Save predictions to dictionary
df2 = get_prediction(df1)

# Convert dictionary (this is the line I'm struggling with)
df3 = pd.DataFrame.from_dict(df3, index='columns')
So the last part of the code is where I am struggling: I need to convert the df2 dictionary to a dataframe (df3) so I can export it to a csv. But it looks as if it is a nested dictionary? Not sure if I need to update my function or not.
This is what a snippet of the dictionary looks like
I need to export it so it will look like this
Any help would be greatly appreciated!
The following code should help flatten df2 (a dictionary of dataframes, if I understand correctly).
def flatten(dict_of_df):
    # insert column 'item'
    for key, value in dict_of_df.items():
        value['item'] = key
    # return vertically concatenated dataframe with all the items
    return pd.concat(dict_of_df.values())
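As a usage sketch, the flattened frame can then be exported to csv; the file name here is just an example:

df3 = flatten(df2)
df3.to_csv('all_forecasts.csv', index=False)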

How do I make this function iterable? (getting IndexError)

I am fairly new to Python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    d = df
    df = pd.DataFrame(d)
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the positional index of that row
    iloc_ix = short.index.get_loc(ix)
    # take the +/-10 iloc rows (+11 because that is how slices work), i.e. 10 trading days before and after
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the resulting Series in a list.
I modified your function a bit: I removed the DataFrame creation so it is only done once, and I added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like day/month/year, so I parsed them with that format.
import pandas as pd
import datetime, csv

def get_data(df, issue_date, stock_ticker):
    '''Return a Series for the ticker centered on the issue date.'''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    try:
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the positional index of that row
        iloc_ix = short.index.get_loc(ix)
        # take the +/-10 iloc rows (+11 because that is how slices work), i.e. 10 trading days before and after
        short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data

df = pd.read_csv(datafile)  # datafile: path to the big csv
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")

results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        # dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
        date = datetime.date(year, month, day)
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic, especially since the date ranges are all different. You should probably ask a separate question about that; its MCVE should include just a few minimal Pandas Series with a couple of different date ranges and tickers.
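That said, as a minimal sketch of the "accumulate everything in one dataframe" part, the non-empty slices can be stacked vertically; each row keeps its Symbol and Date columns, so the blocks stay distinguishable (the output file name is just an example):

# sketch: stack the per-query slices, skipping the ones that came back None
combined = pd.concat([r for r in results if r is not None], ignore_index=True)
combined.to_csv('all_windows.csv', index=False)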

mask function doesn't get rid of unwanted data

I'm working on a data frame taken from Adafruit IO, and sadly some of my data is from a time when my project malfunctioned, so some of the values are just equal to NaN.
I tried to remove them with these lines:

onlyValidData = temp_data.mask(temp_data['value'] == 'NaN')
onlyValidData

This is data retrieved from an Adafruit IO feed, analyzed with pandas. I tried using the 'where' function too, but it didn't work.
My entire code is:
import pandas as pd

temp_data = pd.read_json('https://io.adafruit.com/api/(...)')
light_data = pd.read_json('https://io.adafruit.com/api/(...)')

temp_data['created_at'] = pd.to_datetime(temp_data['created_at'], infer_datetime_format=True)
temp_data = temp_data.set_index('created_at')

light_data['created_at'] = pd.to_datetime(light_data['created_at'], infer_datetime_format=True)
light_data = light_data.set_index('created_at')

tempVals = pd.Series(temp_data['value'])
lightVals = pd.Series(light_data['value'])

onlyValidData = temp_data.mask(temp_data['value'] == 'NaN')
onlyValidData
The output is all of my data for some reason, but it should be only the valid values.
Hey, I think the issue here is that you're looking for values equal to the string 'NaN', while actual NaN values aren't a string; more precisely, they aren't anything.
Try using:
onlyValidData = temp_data.mask(temp_data['value'].isnull())
Edit: to remove rows rather than marking all values in that row as NaN:
onlyValidData = temp_data.dropna()
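If the 'value' column arrived as text (a plausible assumption for feed data, not confirmed by the question), a literal string 'NaN' would survive both checks above. A hedged extra step is to coerce the column to numbers first, so unparseable entries become real NaN values that dropna can remove:

# sketch: coerce string values to numbers; anything unparseable becomes NaN
temp_data['value'] = pd.to_numeric(temp_data['value'], errors='coerce')
onlyValidData = temp_data.dropna(subset=['value'])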

Pandas Google Distance Matrix API - Pass coordinates into URL

I am working with the Google Distance Matrix API, where I want to feed coordinates from a dataframe into the API and return the duration and distance between the two points.
Here is my dataframe:
import pandas as pd
import simplejson
import urllib
import numpy as np
Record  orig_lat    orig_lng     dest_lat    dest_lng
1       40.7484405  -74.0073127  40.7115242  -74.0145492
2       40.7421218  -73.9878531  40.7727216  -73.9863531
First, I need to combine orig_lat & orig_lng and dest_lat & dest_lng into strings, which then get passed into the URL. So I've tried creating the variables orig_coord & dest_coord and passing them into the URL to return values:
orig_coord = df[['orig_lat','orig_lng']].apply(lambda x: '{},{}'.format(x[0], x[1]), axis=1)
dest_coord = df[['dest_lat','dest_lng']].apply(lambda x: '{},{}'.format(x[0], x[1]), axis=1)

for row in df.itertuples():
    url = "http://maps.googleapis.com/maps/api/distancematrix/json?origins={0}&destinations={1}&units=imperial&MYGOOGLEAPIKEY".format(orig_coord, dest_coord)
    result = simplejson.load(urllib.urlopen(url))
    df['driving_time_text'] = result['rows'][0]['elements'][0]['duration']['text']
But I get the following error: "TypeError: <lambda>() got an unexpected keyword argument 'axis'"
So my question is: how do I concatenate values from two columns into a string, then pass that string into a URL and output the result?
Thank you in advance!
Hmm, I am not sure how you constructed your data frame; maybe post those details? But if you can live with referencing tuple elements positionally, this worked for me:
import pandas as pd

data = [{'orig_lat': 40.748441, 'orig_lng': -74.007313, 'dest_lat': 40.711524, 'dest_lng': -74.014549},
        {'orig_lat': 40.742122, 'orig_lng': -73.987853, 'dest_lat': 40.772722, 'dest_lng': -73.986353}]
df = pd.DataFrame(data)

for row in df.itertuples():
    orig_coord = '{},{}'.format(row[1], row[2])
    dest_coord = '{},{}'.format(row[3], row[4])
    url = "http://maps.googleapis.com/maps/api/distancematrix/json?origins={0}&destinations={1}&units=imperial&MYGOOGLEAPIKEY".format(orig_coord, dest_coord)
    print(url)
produces
http://maps.googleapis.com/maps/api/distancematrix/json?origins=40.748441,-74.007313&destinations=40.711524,-74.014549&units=imperial&MYGOOGLEAPIKEY
http://maps.googleapis.com/maps/api/distancematrix/json?origins=40.742122,-73.987853&destinations=40.772722,-73.986353&units=imperial&MYGOOGLEAPIKEY
To update the data frame with the result, since row is a tuple and not writeable, you might want to keep track of the current index as you iterate. Maybe something like this:
data = [{'orig_lat': 40.748441, 'orig_lng': -74.007313, 'dest_lat': 40.711524, 'dest_lng': -74.014549, 'result': -1},
        {'orig_lat': 40.742122, 'orig_lng': -73.987853, 'dest_lat': 40.772722, 'dest_lng': -73.986353, 'result': -1}]
df = pd.DataFrame(data)

i_row = 0
for row in df.itertuples():
    orig_coord = '{},{}'.format(row[1], row[2])
    dest_coord = '{},{}'.format(row[3], row[4])
    url = "http://maps.googleapis.com/maps/api/distancematrix/json?origins={0}&destinations={1}&units=imperial&MYGOOGLEAPIKEY".format(orig_coord, dest_coord)
    # Do stuff to get your result
    df.loc[i_row, 'result'] = result
    i_row += 1  # Python has no ++ operator
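As a follow-up sketch, the "do stuff" step could fetch and parse the response. This assumes Python 3's urllib.request and mirrors the field lookup from the question; the MYGOOGLEAPIKEY placeholder still has to be replaced with a real key parameter:

import json
import urllib.request

# sketch: fetch the matrix response and pull out the duration text
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)
result = payload['rows'][0]['elements'][0]['duration']['text']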
