I have a pandas file with 3 different columns that I turn into a dictionary with to_dict, the result is a list of dictionaries:
df = [
{'HEADER1': 'col1-row1', 'HEADER2: 'col2-row1', 'HEADER3': 'col3-row1'},
{'HEADER1': 'col1-row2', 'HEADER2: 'col2-row2', 'HEADER3': 'col3-row2'}
]
Now my problem is that I need the value of 'col2-rowX' and 'col3-rowX' to build an URL and use requests and bs4 to scrape the websties.
I need my result to be something like the following:
requests.get("'http://www.website.com/' + row1-col2 + 'another-string' + row1-col3 + 'another-string'")
And i need to do that for every dictionary in the list.
I have tried iterating over the dictionaries using for-loops.
something like:
import pandas as pd
import os
os.chdir('C://Users/myuser/Desktop')
df = pd.DataFrame.from_csv('C://Users/myuser/Downloads/export.csv')
#Remove 'Code' column
df = df.drop('Code', axis=1)
#Remove 'Code2' as index
df = df.reset_index()
#Rename columns for easier manipulation
df.columns = ['CB', 'FC', 'PO']
#Convert to dictionary for easy URL iteration and creation
df = df.to_dict('records')
for row in df:
for key in row:
print(key)
You only ever iterate twice, and short-circuit out of the nested for loop every time it is executed by having a return statement there. Looking up the necessary information from the dictionary will allow you to build up your url's. One possible example:
def get_urls(l_d):
l=[]
for d in l_d:
l.append('http://www.website.com/' + d['HEADER2'] + 'another-string' + d['HEADER3'] + 'another-string')
return l
df = [{'HEADER1': 'col1-row1', 'HEADER2': 'col2-row1', 'HEADER3': 'col3-row1'},{'HEADER1': 'col1-row2', 'HEADER2': 'col2-row2', 'HEADER3': 'col3-row2'}]
print get_urls(df)
>>> ['http://www.website.com/col2-row1another-stringcol3-row1another-string', 'http://www.website.com/col2-row2another-stringcol3-row2another-string']
Related
I would like to call a pd.dataframe object but only the objects that are the ones in the key of a dictionary. I have multiple excel template files and they column names vary causing for the need of removal of certain column names. For reproducible reason i attached a sample below.
import pandas as pd
filename='template'
data= [['Auto','','','']]
df= pd.DataFrame(data,columns=['industry','System_Type__c','AccountType','email'])
valid= {'industry': ['Automotive'],
'SME Vertical': ['Agriculture'],
'System_Type__c': ['Access'],
'AccountType': ['Commercial']}
valid={k:v for k, v in valid.items() if k in df.columns.values}
errors= {}
errors[filename]={}
df1= df[['industry','System_Type__c','AccountType']]
mask = df1.apply(lambda c: c.isin(valid[c.name]))
df1.mask(mask|df1.eq(' ')).stack()
for err_i, (r, v) in enumerate(df1.mask(mask|df1.eq(' ')).stack().iteritems()):
errors[filename][err_i] = {"row": r[0],
"column": r[1],
"message": v + " is invalid check column " + r[1] + ' and replace with a standard value'}
I would like df1 to be a variable to a more dynamic list of df.DataFrame objects
how would I replace this piece of code to be more dynamic?
df1= df[['industry','System_Type__c','AccountType', 'SME Vertical']]
#desired output would drop SME Vertical since it is not a df column
df1= df[['industry','System_Type__c','AccountType']]
# list of the dictionary returns the keys
# you then filter the DF based on it and assign to DF1
df1=df[list(valid)]
df1
industry System_Type__c AccountType
0 Auto
How can I build a function that creates these dataframes?:
buy_orders_1h = pd.DataFrame(
{'Date_buy': buy_orders_date_1h,
'Name_buy': buy_orders_name_1h
})
sell_orders_1h = pd.DataFrame(
{'Date_sell': sell_orders_date_1h,
'Name_sell': sell_orders_name_1h
})
I have 10 dataframes like this I create very manually and everytime I want to add a new column I would have to do it in all of them which is time consuming. If I can build a function I would only have to do it once.
The differences between the two above function are of course one is for buy signals the other is for sell signals.
I guess the inputs to the function should be:
_buy/_sell - for the Column name
buy_ / sell_ - for the Column input
I'm thinking input to the function could be something like:
def create_dfs(col, col_input,hour):
df = pd.DataFrame(
{'Date' + col : col_input + "_orders_date_" + hour,
'Name' + col : col_input + "_orders_name_" + hour
}
return df
buy_orders_1h = create_dfs("_buy", "buy_", "1h")
sell_orders_1h = create_dfs("_sell", "sell_", "1h")
A dataframe needs an index, so either you can manually pass an index, or enter your row values in list form:
def create_dfs(col, col_input, hour):
df = pd.DataFrame(
{'Date' + col: [col_input + "_orders_date_" + hour],
'Name' + col: [col_input + "_orders_name_" + hour]})
return df
buy_orders_1h = create_dfs("_buy", "buy_", "1h")
sell_orders_1h = create_dfs("_sell", "sell_", "1h")
Edit: Updated due to new information:
To call a global variable using a string, enter globals() before the string in the following manner:
'Date' + col: globals()[col_input + "_orders_date_" + hour]
Check the output please to see if this is what you want. You first create two dictionaries, then depending on the buy=True condition, it either appends to the buying_df or to the selling_df. I created two sample lists of dates and column names, and iteratively appended to the desired dataframes. After creating the dicts, then pandas.DataFrame is created. You do not need to create it iteratively, rather once in the end when your dates and names have been collected into a dict.
from collections import defaultdict
import pandas as pd
buying_df=defaultdict(list)
selling_df=defaultdict(list)
def add_column_to_df(date,name,buy=True):
if buy:
buying_df["Date_buy"].append(date)
buying_df["Name_buy"].append(name)
else:
selling_df["Date_sell"].append(date)
selling_df["Name_sell"].append(name)
dates=["1900","2000","2010"]
names=["Col_name1","Col_name2","Col_name3"]
for date, name in zip(dates,names):
add_column_to_df(date,name)
#print(buying_df)
df=pd.DataFrame(buying_df)
print(df)
Thanks to #Woody Pride's answer here: https://stackoverflow.com/a/19791302/5608428, I've got to 95% of what I want to achieve.
Which is, by the way, create a dict of sub dataframes from a large df.
All I need to do is sort each dataframe in the dictionary. It's such a small thing but I can't find an answer on here or Google.
import pandas as pd
import numpy as np
import itertools
def points(row):
if row['Ob1'] > row['Ob2']:
val = 2
else:
val = 1
return val
#create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] *4, \
'Ob1' : np.random.rand(16), 'Ob2' : np.random.rand(16)})
#create list of unique pairs
comboNames = list(itertools.combinations(data.Names.unique(), 2))
#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame for elem in comboNames}
for key in DataFrameDict.keys():
DataFrameDict[key] = data[:][data.Names.isin(key)]
#Add test calculated column
for tbl in DataFrameDict:
DataFrameDict[tbl]['Test'] = DataFrameDict[tbl].apply(points, axis=1)
#############################
#Checking test and sorts
##############################
#access df's to print head
for tbl in DataFrameDict:
print(DataFrameDict[tbl].head())
print()
#access df's to print summary
for tbl in DataFrameDict:
print(str(tbl[0])+" vs "+str(tbl[1])+": "+str(DataFrameDict[tbl]['Ob2'].sum()))
print()
#trying to sort each df
for tbl in DataFrameDict:
#Doesn't work
DataFrameDict[tbl].sort_values(['Ob1'])
#mistakenly deleted other attempts (facepalm)
for tbl in DataFrameDict:
print(DataFrameDict[tbl].head())
print()
The code runs but won't sort each df no matter what I try. I can access each df no problem for printing etc but no .sort_values()
As an aside, creating the df's with tuples for names(keys) was/is kind of hacky. Is there a better way to do this?
Many thanks
Looks like you just need to assign the sorted DataFrame back into the dict:
for tbl in DataFrameDict:
DataFrameDict[tbl] = DataFrameDict[tbl].sort_values(['Ob1'])
Hi I wrote some code that builds a default dictionary
def makedata(filename):
with open(filename, "r") as file:
for x in features:
previous = []
count = 0
for line in file:
var_name = x
regexp = re.compile(var_name + r'.*?([0-9.-]+)')
match = regexp.search(line)
if match and (match.group(1)) != previous:
previous = match.group(1)
count += 1
if count > wlength:
count = 1
target = str(str(count) + x)
dict.setdefault(target, []).append(match.group(1))
file.seek(0)
df = pd.DataFrame.from_dict(dict)
The dictionary looks good but when I try to convert to dataframe it is empty. I can't figure it out
dict:
{'1meanSignalLenght': ['0.5305184', '0.48961428', '0.47203177', '0.5177274'], '1amplCor': ['0.8780955002105448', '0.8634431017504487', '0.9381169983046714', '0.9407036427333355'], '1metr10.angle1': ['0.6439386643584522', '0.6555194964997434', '0.9512436169922103', '0.23789348400794422'], '1syncVar': ['0.1344131181025432', '0.08194580887223515', '0.15922251165913678', '0.28795644612520327'], '1linVelMagn': ['0.07062673289287498', '0.08792496681784517', '0.12603999663935528', '0.14791253129369603'], '1metr6.velSum': ['0.17850601560734558', '0.15855169971072014', '0.21396496345720045', '0.2739525279330513']}
df:
Empty DataFrame
Columns: []
Index: []
{}
I think part of your issue is that you are using the keyword 'dict', assuming it is a variable
make a dictionary in your function, call it something other than 'dict'. Have your function return that dictionary. Then when you make a dataframe use that return value. Right now, you are creating a data frame from an empty dictionary object.
df = pd.DataFrame(dict)
This should make a dataframe from the dictionary.
You can either pass a list of dicts simply using pd.DataFrame(list_of_dicts) (use pd.DataFrame([dict]) if your variable is not a list) or a dict of list using pd.DataFrame.from_dict(dict). In this last case dict should be something like dict = {a:[1,2,3], "b": ["a", "b", "c"], "c":...}.
see: Pandas Dataframe from dict with empty list value
I am currently working to make a dictionary with a tuple of names as keys and a float as the value of the form {(nameA, nameB) : datavalue, (nameB, nameC) : datavalue ,...}
The values data is from a matrix I have made into a pandas DataFrame with the names as both the index and column labels. I have created an ordered list of the keys for my final dictionary called keys with the function createDictionaryKeys(). The issue I have is that not all the names from this list appear in my data matrix. I want to only include the names do appear in the data matrix in my final dictionary.
How can I do this search avoiding the slow linear for loop? I created a dictionary that has the name as key and a value of 1 if it should be included and 0 otherwise as well. It has the form {nameA : 1, nameB: 0, ... } and is called allow_dict. I was hoping to use this to do some sort of hash search.
def createDictionary( keynamefile, seperator, datamatrix, matrixsep):
import pandas as pd
keys = createDictionaryKeys(keynamefile, seperator)
final_dict = {}
data_df = pd.read_csv(open(datamatrix), sep = matrixsep)
pd.set_option("display.max_rows", len(data_df))
df_indices = list(data_df.index.values)
df_cols = list(data_df.columns.values)[1:]
for i in df_indices:
data_df = data_df.rename(index = {i:df_cols[i]})
data_df = data_df.drop("Unnamed: 0", 1)
allow_dict = descriminatePromoters( HARDCODEDFILENAME, SEP, THRESHOLD )
#print ( item for item in df_cols if allow_dict[item] == 0 ).next()
present = [ x for x in keys if x[0] in df_cols and x[1] in df_cols]
for i in present:
final_dict[i] = final_df.loc[i[0],i[1]]
return final_dict
Testing existence in python sets is O(1), so simply:
present = [ x for x in keys if x[0] in set(df_cols) and x[1] in set(df_cols)]
...should give you some speed up. Since you're iterating through in O(n) anyway (and have to to construct your final_dict), something like:
colset = set(df_cols)
final_dict = {k: final_df.loc[k[0],k[1]]
for k in keys if (k[0] in colset)
and (k[1] in colset)}
Would be nice, I would think.