Python pandas: map dict keys to values

I have a CSV for input whose row values I'd like to join into a new field. The new field is a constructed URL, which will then be processed by the requests.post() method.
I am constructing my URL correctly, but my issue is with the data object that should be passed to requests. How can I have the correct values passed to their proper keys when my dictionary is unordered? If I need to use an ordered dict, how can I set it up properly with my current format?
Here is what I have:
import pandas as pd
import numpy as np
import requests

test_df = pd.read_csv('frame1.csv')
headers = {'content-type': 'application/x-www-form-urlencoded'}
test_df['FIRST_NAME'] = test_df['FIRST_NAME'].astype(str)
test_df['LAST_NAME'] = test_df['LAST_NAME'].astype(str)
test_df['ADDRESS_1'] = test_df['ADDRESS_1'].astype(str)
test_df['CITY'] = test_df['CITY'].astype(str)
test_df['req'] = ('site-url.com?' + '&FIRST_NAME=' + test_df['FIRST_NAME'] +
                  '&LAST_NAME=' + test_df['LAST_NAME'] +
                  '&ADDRESS_1=' + test_df['ADDRESS_1'] +
                  '&CITY=' + test_df['CITY'])
arr = test_df.values
d = {'FIRST_NAME': test_df['FIRST_NAME'], 'LAST_NAME': test_df['LAST_NAME'],
     'ADDRESS_1': test_df['ADDRESS_1'], 'CITY': test_df['CITY']}
test_df = pd.DataFrame(arr[0:, 0:], columns=d, dtype=str)  # np.str is deprecated; plain str works
data = test_df.to_dict()
data = {k: v for k, v in data.items()}  # a no-op: rebuilds the same dict
test_df['raw_result'] = test_df['req'].apply(lambda x: requests.post(x, headers=headers,
                                                                     data=data).content)
test_df.to_csv('frame1_result.csv')
I tried to map values to keys with a dict comprehension, but a key like FIRST_NAME could end up mapped to the values of an arbitrary field like test_df['CITY'].

Not sure if I understand the problem correctly. However, you can pass an argument to the to_dict function, e.g.
data = test_df.to_dict(orient='records')
which will give you output as follows: [{'FIRST_NAME': ..., 'LAST_NAME': ...}, {'FIRST_NAME': ..., 'LAST_NAME': ...}] (a list with the same length as test_df). This makes it easy to map each dictionary to its correct row.
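As a minimal sketch of how this could plug into the code above (site-url.com is the question's placeholder endpoint), each row's URL can be zipped with that row's own record, so every POST sends only the matching values:
records = test_df.to_dict(orient='records')

# Pair each constructed URL with the record built from the same row,
# so the keys line up with the values for that row.
results = []
for url, record in zip(test_df['req'], records):
    resp = requests.post(url, headers=headers, data=record)
    results.append(resp.content)
test_df['raw_result'] = results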

Related

Create Json file from Python dataframe with grouping on one col and making column name as key with unique values as a list inside the key

Create the pandas DataFrame
My data frame looks like this:
data = [[6, 1, "False", "var_1"], [6, 1, "False", "var_2"], [7, 1, "False", "var_3"]]
df = pd.DataFrame(data, columns=['CONSTRAINT_ID', 'CONSTRAINT_NODE_ID', 'PRODUCT_GRAIN', 'LEFT_SIDE_TYPE'])
Expected output JSON
I want to group by column CONSTRAINT_ID, and the keys should be natural numbers (the index). The LEFT_SIDE_TYPE column values should come as a list:
{
    "1": {"CONSTRAINT_NODE_ID": [1],
          "product_grain": false,
          "left_side_type": ["Variable_1", "Variable_2"]
    },
    "2": {"CONSTRAINT_NODE_ID": [2],
          "product_grain": false,
          "left_side_type": ["Variable_3"]
    }
}
It is likely not the most efficient solution; however, given a df in the format specified in your original question, the function below will return a str containing a valid JSON string with the desired structure and values.
It filters the df by CONSTRAINT_ID, iterating across each unique value and creating a JSON object with keys 1...n and the desired values from your original question in the response variable. This implementation uses set structures to store values during iteration to avoid duplicates, converting them to list instances before they are added to the response.
import json

def generate_response(df):
    response = dict()
    constraints = df['CONSTRAINT_ID'].unique()
    for i, c in enumerate(constraints):
        # Sets avoid duplicate values while accumulating across rows.
        temp = {'CONSTRAINT_NODE_ID': set(), 'PRODUCT_GRAIN': None, 'LEFT_SIDE_TYPE': set()}
        for _, row in df[df['CONSTRAINT_ID'] == c].iterrows():
            temp['CONSTRAINT_NODE_ID'].add(row['CONSTRAINT_NODE_ID'])
            temp['PRODUCT_GRAIN'] = row['PRODUCT_GRAIN']
            temp['LEFT_SIDE_TYPE'].add(row['LEFT_SIDE_TYPE'])
        # Sets are not JSON-serializable, so convert them to lists.
        temp['CONSTRAINT_NODE_ID'] = list(temp['CONSTRAINT_NODE_ID'])
        temp['LEFT_SIDE_TYPE'] = list(temp['LEFT_SIDE_TYPE'])
        response[str(i + 1)] = temp
    return json.dumps(response, indent=4)
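For example, with the sample frame from the question (a quick sketch; the key casing follows the df columns rather than the mixed casing in the expected output, and the ordering inside each list may vary because sets are unordered):
import pandas as pd

data = [[6, 1, "False", "var_1"], [6, 1, "False", "var_2"], [7, 1, "False", "var_3"]]
df = pd.DataFrame(data, columns=['CONSTRAINT_ID', 'CONSTRAINT_NODE_ID', 'PRODUCT_GRAIN', 'LEFT_SIDE_TYPE'])
print(generate_response(df))
# keys "1" and "2", e.g.
# {"1": {"CONSTRAINT_NODE_ID": [1], "PRODUCT_GRAIN": "False", "LEFT_SIDE_TYPE": ["var_1", "var_2"]}, ...}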

Python, how to create a table from JSON data - indexing

I am trying to create a table from JSON data. I have already used json.dumps on my data:
This is what I am trying to export to the table:
label3 = json.dumps({'class': CLASSES[idx],
                     "confidence": str(round(confidence * 100, 1)) + "%",
                     "startX": str(startX),
                     "startY": str(startY),
                     "EndX": str(endX),
                     "EndY": str(endY),
                     "Timestamp": now.strftime("%d/%m/%Y, %H:%M")})
I have tried with:
val1 = json.loads(label3)
df = pd.DataFrame(val1)
print(df.T)
The system gives me an error that I must pass an index.
And also with:
val = ast.literal_eval(label3)
val1 = json.loads(json.dumps(val))
print(val1)
val2 = val1["class"][0]["confidence"][0]["startX"][0]["startY"][0]["endX"][0]["endY"][0]["Timestamp"][0]
df = pd.DataFrame(data=val2, columns=["class", "confidence", "startX", "startY", "EndX", "EndY", "Timestamp"])
print(df)
When I try this, the error it gives is that string indices must be integers.
How can I create the index?
Thank you,
There are two ways we can tackle this issue.
1. Do as directed by the error and pass an index to the DataFrame constructor:
pd.DataFrame(val1, index=list(range(number_of_rows)))  # number_of_rows is 1 in your case
2. While dumping the data with json.dumps, dump a dictionary that maps each key to a list of values instead of to a single value. For example:
json.dumps({'class': [CLASSES[idx]], "confidence": ['some confidence']})
I have shortened your given example. Note that each value is passed as a list (even if there is only one value per key).
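A minimal end-to-end sketch of the first option, with placeholder values standing in for the detection variables from the question:
import json
import pandas as pd

# Placeholder values; in the question these come from a detection model.
label3 = json.dumps({'class': 'person', 'confidence': '97.1%',
                     'startX': '10', 'startY': '20',
                     'EndX': '110', 'EndY': '220',
                     'Timestamp': '01/01/2024, 12:00'})

val1 = json.loads(label3)           # back to a flat dict of scalars
df = pd.DataFrame(val1, index=[0])  # scalar values need an explicit index
print(df.T)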

Most efficient way of converting RESTful output to dataframe

I have output from a REST call that I've converted to JSON.
It's a highly nested collection of dicts and lists, but I'm eventually able to convert it to a dataframe as follows:
import pandas as pd
from requests import get

url = 'http://stats.oecd.org/SDMX-JSON/data/MEI_FIN/IR3TIB.GBR+USA.M/all'
params = {
    'startTime': '2008-06',
    'dimensionAtObservation': 'TimeDimension'
}
r = get(url, params=params)
x = r.json()
d = x['dataSets'][0]['series']
a = pd.DataFrame(d['0:0:0']['observations'])
b = pd.DataFrame(d['0:1:0']['observations'])
This works, absent some manipulation to make it easier to work with, and since there are multiple time series I can do a version of the same for each, but it goes without saying that it's kind of clunky.
Is there a better/cleaner way to do this?
The pandasdmx library makes this super simple:
import pandasdmx as sdmx

df = sdmx.Request('OECD').data(
    resource_id='MEI_FIN',
    key='IR3TIB.GBR+USA.M',
    params={'startTime': '2008-06', 'dimensionAtObservation': 'TimeDimension'},
).write()
Absent any responses, here's the solution I came up with. I added a list comprehension to get each series into a dataframe, and then a transpose, since this source has the series aligned across rows instead of down columns.
import pandas as pd
from requests import get

url = 'http://stats.oecd.org/SDMX-JSON/data/MEI_FIN/IR3TIB.GBR+USA.M/all'
params = {
    'startTime': '2008-06',
    'dimensionAtObservation': 'TimeDimension'
}
r = get(url, params=params)
x = r.json()
d = x['dataSets'][0]['series']
df = [pd.DataFrame(d[i]['observations']).loc[0] for i in d]
df = pd.DataFrame(df).T
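A small variant of the same idea (a sketch) that keeps the SDMX series keys as column labels instead of positional ones:
# One column per series, labeled by its key ('0:0:0', '0:1:0', ...).
df = pd.DataFrame({k: pd.DataFrame(v['observations']).loc[0] for k, v in d.items()})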

iterate over list of dicts to create different strings

I have a pandas file with 3 different columns that I turn into a dictionary with to_dict; the result is a list of dictionaries:
df = [
    {'HEADER1': 'col1-row1', 'HEADER2': 'col2-row1', 'HEADER3': 'col3-row1'},
    {'HEADER1': 'col1-row2', 'HEADER2': 'col2-row2', 'HEADER3': 'col3-row2'}
]
Now my problem is that I need the values of 'col2-rowX' and 'col3-rowX' to build a URL and use requests and bs4 to scrape the websites.
I need my result to be something like the following:
requests.get('http://www.website.com/' + row1_col2 + 'another-string' + row1_col3 + 'another-string')
And I need to do that for every dictionary in the list.
I have tried iterating over the dictionaries using for loops, something like:
import pandas as pd
import os

os.chdir('C://Users/myuser/Desktop')
# pd.DataFrame.from_csv is deprecated; read_csv with index_col=0 is the modern equivalent
df = pd.read_csv('C://Users/myuser/Downloads/export.csv', index_col=0)
# Remove 'Code' column
df = df.drop('Code', axis=1)
# Remove 'Code2' as index
df = df.reset_index()
# Rename columns for easier manipulation
df.columns = ['CB', 'FC', 'PO']
# Convert to dictionary for easy URL iteration and creation
df = df.to_dict('records')
for row in df:
    for key in row:
        print(key)
Your loops only ever iterate over the keys and print them, never using them to look anything up. Looking the necessary information up from each dictionary by key will allow you to build up your URLs. One possible example:
def get_urls(l_d):
    l = []
    for d in l_d:
        l.append('http://www.website.com/' + d['HEADER2'] + 'another-string' + d['HEADER3'] + 'another-string')
    return l

df = [{'HEADER1': 'col1-row1', 'HEADER2': 'col2-row1', 'HEADER3': 'col3-row1'},
      {'HEADER1': 'col1-row2', 'HEADER2': 'col2-row2', 'HEADER3': 'col3-row2'}]
print(get_urls(df))
# ['http://www.website.com/col2-row1another-stringcol3-row1another-string',
#  'http://www.website.com/col2-row2another-stringcol3-row2another-string']
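Building on that, a sketch of the scraping step itself (www.website.com is the question's placeholder); an f-string makes the construction a little easier to read:
import requests

for d in df:
    url = f"http://www.website.com/{d['HEADER2']}another-string{d['HEADER3']}another-string"
    response = requests.get(url)
    # response.text can then be parsed with bs4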

Searching items of large list in large python dictionary quickly

I am currently working to make a dictionary with a tuple of names as keys and a float as the value, of the form {(nameA, nameB): datavalue, (nameB, nameC): datavalue, ...}.
The value data comes from a matrix I have made into a pandas DataFrame, with the names as both the index and column labels. I have created an ordered list of the keys for my final dictionary, called keys, with the function createDictionaryKeys(). The issue I have is that not all the names from this list appear in my data matrix; I want to include in my final dictionary only the names that do appear.
How can I do this search while avoiding a slow linear for loop? I have also created a dictionary that has the name as key and a value of 1 if it should be included and 0 otherwise. It has the form {nameA: 1, nameB: 0, ...} and is called allow_dict. I was hoping to use this to do some sort of hash search.
def createDictionary(keynamefile, seperator, datamatrix, matrixsep):
    import pandas as pd
    keys = createDictionaryKeys(keynamefile, seperator)
    final_dict = {}
    data_df = pd.read_csv(open(datamatrix), sep=matrixsep)
    pd.set_option("display.max_rows", len(data_df))
    df_indices = list(data_df.index.values)
    df_cols = list(data_df.columns.values)[1:]
    for i in df_indices:
        data_df = data_df.rename(index={i: df_cols[i]})
    data_df = data_df.drop("Unnamed: 0", axis=1)
    allow_dict = descriminatePromoters(HARDCODEDFILENAME, SEP, THRESHOLD)
    #print ( item for item in df_cols if allow_dict[item] == 0 ).next()
    present = [x for x in keys if x[0] in df_cols and x[1] in df_cols]
    for i in present:
        final_dict[i] = final_df.loc[i[0], i[1]]
    return final_dict
Testing membership in Python sets is O(1), so simply:
present = [x for x in keys if x[0] in set(df_cols) and x[1] in set(df_cols)]
...should give you some speedup. Since you're iterating through in O(n) anyway (and have to, to construct your final_dict), something like:
colset = set(df_cols)
final_dict = {k: final_df.loc[k[0], k[1]]
              for k in keys
              if (k[0] in colset) and (k[1] in colset)}
would be nicer, I would think.
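A self-contained sketch of the idea with toy names, hoisting the set out of the comprehension so it is built only once:
import pandas as pd

final_df = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]],
                        index=['nameA', 'nameB'], columns=['nameA', 'nameB'])
keys = [('nameA', 'nameB'), ('nameB', 'nameC')]  # nameC is absent from the matrix

colset = set(final_df.columns)  # set membership tests are O(1)
final_dict = {k: final_df.loc[k[0], k[1]]
              for k in keys
              if k[0] in colset and k[1] in colset}
print(final_dict)  # {('nameA', 'nameB'): 0.2}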
