Turning dictionary key into an element of dataframe - python

My code can get the job done but I know it is not a good way to handle it.
The input is thisdict and the output is shown at the end.
Can you help to make it more efficient?
import pandas as pd

thisdict = {
    "A": {'v1': '3', 'v2': 5},
    "B": {'v1': '77', 'v2': 99},
    "ZZ": {'v1': '311', 'v2': 152}
}

output = pd.DataFrame()
for key, value in thisdict.items():
    # turn value into a df
    test2 = pd.DataFrame(value.items(), columns=['item', 'value'])
    test2['id'] = key
    # transpose
    test2 = test2.pivot(index='id', columns='item', values='value')
    # concat
    output = pd.concat([output, test2])
output

You can use:
output = pd.DataFrame.from_dict(thisdict, orient='index')
or
output = pd.DataFrame(thisdict).T
and if you wish, rename the index by:
output.index.rename('id', inplace=True)
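For reference, here is a minimal end-to-end sketch of the from_dict approach using the thisdict from the question (the printed frame is shown approximately as comments):

import pandas as pd

thisdict = {
    "A": {'v1': '3', 'v2': 5},
    "B": {'v1': '77', 'v2': 99},
    "ZZ": {'v1': '311', 'v2': 152}
}

# outer keys become the index, inner keys become the columns
output = pd.DataFrame.from_dict(thisdict, orient='index')
output.index.rename('id', inplace=True)
print(output)
#      v1   v2
# id
# A     3    5
# B    77   99
# ZZ  311  152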

Calling a list of df column names to a function based on dictionary keys

I would like to select columns from a pd.DataFrame, but only the ones that appear as keys of a dictionary. I have multiple Excel template files and their column names vary, which means certain column names need to be removed. For reproducibility I have attached a sample below.
import pandas as pd

filename = 'template'
data = [['Auto', '', '', '']]
df = pd.DataFrame(data, columns=['industry', 'System_Type__c', 'AccountType', 'email'])

valid = {'industry': ['Automotive'],
         'SME Vertical': ['Agriculture'],
         'System_Type__c': ['Access'],
         'AccountType': ['Commercial']}

# keep only the keys that are actual columns of df
valid = {k: v for k, v in valid.items() if k in df.columns.values}

errors = {}
errors[filename] = {}

df1 = df[['industry', 'System_Type__c', 'AccountType']]
mask = df1.apply(lambda c: c.isin(valid[c.name]))
df1.mask(mask | df1.eq(' ')).stack()
for err_i, (r, v) in enumerate(df1.mask(mask | df1.eq(' ')).stack().items()):
    errors[filename][err_i] = {"row": r[0],
                               "column": r[1],
                               "message": v + " is invalid check column " + r[1] + ' and replace with a standard value'}
I would like df1 to be built from a more dynamic list of columns. How would I replace this piece of code so it is more dynamic?
df1 = df[['industry', 'System_Type__c', 'AccountType', 'SME Vertical']]
# desired output would drop 'SME Vertical' since it is not a column of df
df1 = df[['industry', 'System_Type__c', 'AccountType']]
# list() on the dictionary returns its keys
# you then filter the DataFrame based on them and assign to df1
df1 = df[list(valid)]
df1

  industry System_Type__c AccountType
0     Auto
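As an alternative sketch (using the same df but the original, unfiltered valid dict from the question), DataFrame.filter can do the membership check for you, since it silently ignores requested columns that do not exist:

import pandas as pd

df = pd.DataFrame([['Auto', '', '', '']],
                  columns=['industry', 'System_Type__c', 'AccountType', 'email'])
valid = {'industry': ['Automotive'],
         'SME Vertical': ['Agriculture'],   # not a column of df
         'System_Type__c': ['Access'],
         'AccountType': ['Commercial']}

# filter() keeps only the requested columns that actually exist,
# so 'SME Vertical' is dropped without an explicit membership check
df1 = df.filter(items=list(valid))
print(df1.columns.tolist())
# ['industry', 'System_Type__c', 'AccountType']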

Change values in DataFrame - .iloc vs .loc

Hey, I wrote this code:
import pandas as pd

d1 = {"KEY": ["KEY1", "KEY2", "KEY3"], "value": ["A", "B", "C"]}
df1 = pd.DataFrame(d1)
df1["value 2"] = "nothing"

d2 = {"KEY": ["KEY2"], "value_alternative": ["D"]}
df2 = pd.DataFrame(d2)

for k in range(3):
    key = df1.iloc[k]["KEY"]
    print(key)
    if key in list(df2["KEY"]):
        df1.iloc[k]["value 2"] = df2.loc[df2["KEY"] == key, "value_alternative"].item()
    else:
        df1.iloc[k]["value 2"] = df1.iloc[k]["value"]
but unfortunately the values in df1["value 2"] haven't changed :( I rewrote it as follows:
import pandas as pd

d1 = {"KEY": ["KEY1", "KEY2", "KEY3"], "value": ["A", "B", "C"]}
df1 = pd.DataFrame(d1)
df1["value 2"] = "nothing"

d2 = {"KEY": ["KEY2"], "value_alternative": ["D"]}
df2 = pd.DataFrame(d2)

for k in range(3):
    key = df1.iloc[k]["KEY"]
    print(key)
    if key in list(df2["KEY"]):
        df1.loc[k, "value 2"] = df2.loc[df2["KEY"] == key, "value_alternative"].item()
    else:
        df1.loc[k, "value 2"] = df1.iloc[k]["value"]
and then everything works fine, but I don't understand why the previous method doesn't work. What is the easiest way to change a value in a dataframe in a loop?
First of all: don't use a for loop with dataframes unless you really, really have to.
Just use a boolean array to filter your dataframe with loc and assign your values that way.
You can do what you want with a simple merge.
df1 = df1.merge(df2, on='KEY', how='left').rename(columns={'value_alternative': 'value 2'})
df1.loc[df1['value 2'].isna(), 'value 2'] = df1['value']
The reason the iloc version does not work is that in pandas you can't set a value on a copy of a dataframe: chained indexing like df1.iloc[k]["value 2"] = ... may write to a temporary copy, which pandas allows for performance reasons. To write to the underlying data you need a single indexer that selects both the row and the column, e.g. df1.loc[k, "value 2"] = .... Also remember that loc and iloc do different things: loc looks at the labels of the index while iloc looks at the integer position.
For the merge approach to work you also have to delete the line
df1["value 2"] = "nothing"
from your program.
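As a sketch of the fully vectorized approach (no loop and no merge column renaming), you could also map the alternative values onto df1 and fall back to the original value where there is no match, assuming the same df1 and df2 as above (output shown approximately as comments):

import pandas as pd

df1 = pd.DataFrame({"KEY": ["KEY1", "KEY2", "KEY3"], "value": ["A", "B", "C"]})
df2 = pd.DataFrame({"KEY": ["KEY2"], "value_alternative": ["D"]})

# build a KEY -> value_alternative lookup, map it onto df1,
# and fill the gaps with df1's own "value" column
lookup = df2.set_index("KEY")["value_alternative"]
df1["value 2"] = df1["KEY"].map(lookup).fillna(df1["value"])

print(df1)
#     KEY value value 2
# 0  KEY1     A       A
# 1  KEY2     B       D
# 2  KEY3     C       C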

How to read this JSON file in Python?

I'm trying to read a JSON file like the one below in Python, to save only two of the values from each response part:
{
  "responseHeader": {
    "status": 0,
    "time": 2,
    "params": {
      "q": "query",
      "rows": "2",
      "wt": "json"
    }
  },
  "response": {
    "results": 2,
    "start": 0,
    "docs": [
      {
        "name": ["Peter"],
        "country": ["England"],
        "age": ["23"]
      },
      {
        "name": ["Harry"],
        "country": ["Wales"],
        "age": ["30"]
      }
    ]
  }
}
For example, I want to put the name and the age in a table. I already tried it this way (based on this topic), but it's not working for me.
import json
import pandas as pd
file = open("myfile.json")
data = json.loads(file)
columns = [dct['name', 'age'] for dct in data['response']]
df = pd.DataFrame(data['response'], columns=columns)
print(df)
I have also seen more solutions for reading a JSON file, but they were all for JSON files without other header values at the top, like responseHeader in this case. I don't know how to handle that. Can anyone help me out?
import json

with open("myfile.json") as f:
    columns = [(dic["name"], dic["age"]) for dic in json.load(f)["response"]["docs"]]
print(columns)
result:
[(['Peter'], ['23']), (['Harry'], ['30'])]
You can pass the list data["response"]["docs"] to pandas directly, as it's a recordset.
df = pd.DataFrame(data["response"]["docs"])
print(df)
>>>
      name    country   age
0  [Peter]  [England]  [23]
1  [Harry]    [Wales]  [30]
The data in your DataFrame will be bracketed though, as you can see. If you want to remove the brackets you can consider the following:
for column in df.columns:
    df.loc[:, column] = df.loc[:, column].str.get(0)
    if column == 'age':
        df.loc[:, column] = df.loc[:, column].astype(int)
import pandas as pd

sample = {"responseHeader": {"status": 0,
                             "time": 2,
                             "params": {"q": "query",
                                        "rows": "2",
                                        "wt": "json"}},
          "response": {"results": 2,
                       "start": 0,
                       "docs": [{"name": ["Peter"],
                                 "country": ["England"],
                                 "age": ["23"]},
                                {"name": ["Harry"],
                                 "country": ["Wales"],
                                 "age": ["30"]}]}}

# pull the first element out of each single-item list
data = [(x['name'][0], x['age'][0]) for x in sample['response']['docs']]
df = pd.DataFrame(data, columns=['name', 'age'])
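Putting the pieces together, here is a minimal end-to-end sketch (assuming the JSON is saved as myfile.json, as in the question) that reads the file and produces a clean name/age table, with the output shown approximately as comments:

import json
import pandas as pd

with open("myfile.json") as f:
    data = json.load(f)

# build the DataFrame from the record list, then unwrap the single-item lists
df = pd.DataFrame(data["response"]["docs"])[["name", "age"]]
df = df.apply(lambda col: col.str[0])
df["age"] = df["age"].astype(int)

print(df)
#     name  age
# 0  Peter   23
# 1  Harry   30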

Sort each dataframe in a dictionary of dataframes

Thanks to @Woody Pride's answer here: https://stackoverflow.com/a/19791302/5608428, I've got to 95% of what I want to achieve.
Which is, by the way, to create a dict of sub-dataframes from a large df.
All I need to do is sort each dataframe in the dictionary. It's such a small thing but I can't find an answer on here or Google.
import pandas as pd
import numpy as np
import itertools

def points(row):
    if row['Ob1'] > row['Ob2']:
        val = 2
    else:
        val = 1
    return val

# create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16), 'Ob2': np.random.rand(16)})

# create list of unique pairs
comboNames = list(itertools.combinations(data.Names.unique(), 2))

# create a data frame dictionary to store your data frames
DataFrameDict = {elem: pd.DataFrame for elem in comboNames}
for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]

# add test calculated column
for tbl in DataFrameDict:
    DataFrameDict[tbl]['Test'] = DataFrameDict[tbl].apply(points, axis=1)

#############################
# Checking test and sorts
#############################

# access df's to print head
for tbl in DataFrameDict:
    print(DataFrameDict[tbl].head())
    print()

# access df's to print summary
for tbl in DataFrameDict:
    print(str(tbl[0]) + " vs " + str(tbl[1]) + ": " + str(DataFrameDict[tbl]['Ob2'].sum()))
    print()

# trying to sort each df
for tbl in DataFrameDict:
    # Doesn't work
    DataFrameDict[tbl].sort_values(['Ob1'])
    # mistakenly deleted other attempts (facepalm)

for tbl in DataFrameDict:
    print(DataFrameDict[tbl].head())
    print()
The code runs but won't sort each df no matter what I try. I can access each df no problem for printing etc., but .sort_values() has no effect.
As an aside, creating the dfs with tuples of names as keys was/is kind of hacky. Is there a better way to do this?
Many thanks
Looks like you just need to assign the sorted DataFrame back into the dict; sort_values returns a new, sorted DataFrame by default rather than sorting in place, so calling it without assigning the result discards the sorted copy:
for tbl in DataFrameDict:
    DataFrameDict[tbl] = DataFrameDict[tbl].sort_values(['Ob1'])
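On the aside about tuple keys: they are a perfectly workable dict key for a pair of names, so there is not much to change, but here is a small sketch (rebuilding data and comboNames as in the question) that creates, filters and sorts each sub-frame in a single dict comprehension:

import itertools
import numpy as np
import pandas as pd

data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16), 'Ob2': np.random.rand(16)})
comboNames = list(itertools.combinations(data.Names.unique(), 2))

# build, filter and sort each pair's sub-frame in one pass, keyed by the pair tuple
DataFrameDict = {pair: data[data.Names.isin(pair)].sort_values('Ob1')
                 for pair in comboNames}

print(DataFrameDict[('Joe', 'John')].head())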

iterate over list of dicts to create different strings

I have a pandas DataFrame with 3 different columns that I turn into a dictionary with to_dict; the result is a list of dictionaries:
df = [
    {'HEADER1': 'col1-row1', 'HEADER2': 'col2-row1', 'HEADER3': 'col3-row1'},
    {'HEADER1': 'col1-row2', 'HEADER2': 'col2-row2', 'HEADER3': 'col3-row2'}
]
Now my problem is that I need the values of 'col2-rowX' and 'col3-rowX' to build a URL and use requests and bs4 to scrape the websites.
I need my result to be something like the following:
requests.get('http://www.website.com/' + row1-col2 + 'another-string' + row1-col3 + 'another-string')
And I need to do that for every dictionary in the list.
I have tried iterating over the dictionaries using for loops, something like:
import pandas as pd
import os

os.chdir('C://Users/myuser/Desktop')
df = pd.DataFrame.from_csv('C://Users/myuser/Downloads/export.csv')

# Remove 'Code' column
df = df.drop('Code', axis=1)
# Remove 'Code2' as index
df = df.reset_index()
# Rename columns for easier manipulation
df.columns = ['CB', 'FC', 'PO']
# Convert to dictionary for easy URL iteration and creation
df = df.to_dict('records')

for row in df:
    for key in row:
        print(key)
As written, your nested for loop only iterates over the keys of each dictionary and prints them; it never uses the values. Looking up the necessary information from each dictionary will allow you to build up your URLs. One possible example:
def get_urls(l_d):
    l = []
    for d in l_d:
        l.append('http://www.website.com/' + d['HEADER2'] + 'another-string' + d['HEADER3'] + 'another-string')
    return l

df = [{'HEADER1': 'col1-row1', 'HEADER2': 'col2-row1', 'HEADER3': 'col3-row1'},
      {'HEADER1': 'col1-row2', 'HEADER2': 'col2-row2', 'HEADER3': 'col3-row2'}]

print(get_urls(df))
>>> ['http://www.website.com/col2-row1another-stringcol3-row1another-string', 'http://www.website.com/col2-row2another-stringcol3-row2another-string']
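Since the question mentions requests and bs4, here is a small follow-up sketch (reusing get_urls and df from the snippet above; the website.com URLs are placeholders from the question, not real endpoints) of how the generated URLs could then be fetched and parsed:

import requests
from bs4 import BeautifulSoup

for url in get_urls(df):
    # fetch each generated URL and parse the HTML;
    # with the placeholder URLs this will not return real pages
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(url, soup.title)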
