Hello, Python lovers.
I have run into an interesting issue that I have not been able to resolve due to my inexperience. I am constructing a dictionary in Python from a set of answers in a graph database, and the export to Excel is not coming out the way I expect. (I am running Python 3.)
When all is said and done, I receive the following example output in my Excel file (this is from column 0; every entry is a row):
ACTUAL EXCEL FORMAT:
0/{'RecordNo': 0}
1/{'Dept': 'DeptName'}
2/{'Option 1': 'Option1Value'}
3/{'Option 2': 'Option2Value'}
4/{'Question1': 'Answer1'}
5/{'Question2': 'Answer2'}
6/{'Question3': 'Answer3'}
etc..
Expected EXCEL format:
0/Dept, Option 1, Option 2, Question 1, Question 2, Question 3
1/DeptName, Option1Value, Option2Value, Answer1, Answer2, Answer3
The keys of the dictionary are supposed to be the headers, and the values the contents of every row, but for some reason each cell is written out as a key/value pair when I use the following output code:
EXCEL WRITER CODE:
ReportDF = pd.DataFrame.from_dict(DomainDict)
WriteMe = pd.ExcelWriter('Filname.xlsx')
ReportDF.to_excel(WriteMe, 'Sheet1')
try:
    WriteMe.save()
    print('Save completed')
except:
    print('Error in saving file')
To build the dictionary, I use the following code:
EDIT (Removed sub-addition of dictionary entries, as it is the same and will be streamlined into a function call once the primary works).
DICTIONARY PREP CODE:
for Dept in Depts:
    ABBR = Dept['dept.ABBR']
    #print('Department: ' + ABBR)
    Forests = getForestDomains(Quarter,ABBR)
    for Forest in Forests:
        DictEntryList = []
        DictEntryList.append({'RecordNo': DomainCount})
        DictEntryList.append({'Dept': ABBR})
        ForestName = Forest['d.DomainName']
        DictEntryList.append({'Forest ': ForestName})
        DictEntryList.append({'Domain': ''})
        AnswerEntryList = []
        QList = getApplicableQuestions(str(SA))
        for Question in QList:
            FAnswer = ''
            QDesc = Question['Question']
            AnswerResult = getAnswerOfQuestionForDomainForQuarter(QDesc, ForestName, Quarter)
            if AnswerResult:
                for A in AnswerResult:
                    if(str(A['Answer']) != 'None'):
                        if(isinstance(A, numbers.Number)):
                            FAnswer = str(int(A['Answer']))
                        else:
                            FAnswer = str(A['Answer'])
                    else:
                        FAnswer = 'Unknown'
            else:
                print('GOBBLEGOBBLE')
                FAnswer = 'Not recorded'
            AnswerEntryList.append({QDesc: FAnswer})
        for Entry in AnswerEntryList:
            DictEntryList.append(Entry)
        DomainDict[DomainCount] = DictEntryList
        DomainCount+= 1
print('Ready to export')
If anyone could assist me in getting my data to export into the proper format within excel, it would be greatly appreciated.
EDIT:
Print of the final dictionary to be exported to excel:
{0: [{'RecordNo': 0}, {'Dept': 'Clothing'}, {'Forest ': 'my.forest'}, {'Domain': 'my.domain'}, {'Question1': 'Answer1'}, {'Question2': 'Answer2'}, {'Question3': 'Answer3'}], 1: [{...}]}
The problem in writing to Excel comes from the fact that the values in the final dictionary are themselves lists of dictionaries, so you may want to take a closer look at how you're building the dictionary. In its current format, passing the final dictionary to pd.DataFrame.from_dict results in a DataFrame that looks like this:
# 0
# 0 {u'RecordNo': 0}
# 1 {u'Dept': u'Clothing'}
# 2 {u'Forest ': u'my.forest'}
# 3 {u'Domain': u'my.domain'}
# 4 {u'Question1': u'Answer1'}
# 5 {u'Question2': u'Answer2'}
# 6 {u'Question3': u'Answer3'}
So each value in the DataFrame is itself a dict. To fix this, you can flatten/merge the inner dictionaries in your final dict before passing it into a DataFrame (written here for Python 3, which the question uses):
modified_dict = {k: {key: value for d in v for key, value in d.items()} for k, v in final_dict.items()}
# {0: {'Domain': 'my.domain', 'RecordNo': 0, 'Dept': 'Clothing', 'Question1': 'Answer1', 'Question3': 'Answer3', 'Question2': 'Answer2', 'Forest ': 'my.forest'}}
Then, you can pass this dict to pd.DataFrame.from_dict with the additional argument orient='index' (so that the outer keys become rows and the keys of the inner dicts become columns) to get a DataFrame that looks like this:
ReportDF = pd.DataFrame.from_dict(modified_dict, orient='index')
# Domain RecordNo Dept Question1 Question3 Question2 Forest
# 0 my.domain 0 Clothing Answer1 Answer3 Answer2 my.forest
From there, you can write to Excel as you had indicated.
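Here is a minimal sketch of the flatten-and-load steps above, written for Python 3. The sample dictionary follows the shape printed in the question (the second record is made up), the names final_dict, modified_dict and report_df are only illustrative, and the Excel step is left as a comment because it assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd

# Sample data in the shape printed in the question: each value is a list of
# single-key dictionaries. The second record is invented for illustration.
final_dict = {
    0: [{'RecordNo': 0}, {'Dept': 'Clothing'}, {'Forest ': 'my.forest'},
        {'Domain': 'my.domain'}, {'Question1': 'Answer1'}],
    1: [{'RecordNo': 1}, {'Dept': 'Shoes'}, {'Forest ': 'other.forest'},
        {'Domain': ''}, {'Question1': 'Unknown'}],
}

# Merge every list of one-entry dicts into a single dict per record.
modified_dict = {
    record: {key: value for entry in entries for key, value in entry.items()}
    for record, entries in final_dict.items()
}

# orient='index' turns the outer keys into rows and the inner keys into columns.
report_df = pd.DataFrame.from_dict(modified_dict, orient='index')
print(report_df)

# Writing out would then be (assuming an Excel engine is available):
# report_df.to_excel('report.xlsx', sheet_name='Sheet1')
```

Each record ends up as one row, with the dictionary keys as column headers, which is the layout the question asks for.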
Edit: I can't test this without sample data, but from the look of it you can simplify your Dictionary Prep by building a dict instead of a list of dicts.
for Dept in Depts:
    ABBR = Dept['dept.ABBR']
    Forests = getForestDomains(Quarter,ABBR)
    for Forest in Forests:
        # ForestName is still needed for the answer lookup below
        ForestName = Forest['d.DomainName']
        DictEntry = {}
        DictEntry['RecordNo'] = DomainCount
        DictEntry['Dept'] = ABBR
        DictEntry['Forest '] = ForestName
        DictEntry['Domain'] = ''
        QList = getApplicableQuestions(str(SA))
        for Question in QList:
            # save yourself a line of code and make 'Not recorded' the default value
            FAnswer = 'Not recorded'
            QDesc = Question['Question']
            AnswerResult = getAnswerOfQuestionForDomainForQuarter(QDesc, ForestName, Quarter)
            if AnswerResult:
                for A in AnswerResult:
                    # don't convert None to string and then test for inequality to 'None';
                    # compare against None directly (a plain truth test would also reject 0)
                    if A['Answer'] is not None:
                        # check the answer value itself, not the whole record
                        if isinstance(A['Answer'], numbers.Number):
                            FAnswer = str(int(A['Answer']))
                        else:
                            FAnswer = str(A['Answer'])
                    else:
                        FAnswer = 'Unknown'
            else:
                print('GOBBLEGOBBLE')
            DictEntry[QDesc] = FAnswer
        DomainDict[DomainCount] = DictEntry
        DomainCount += 1
print('Ready to export')
I am constructing a script that uses regex to search through a document I wrote and pull out the name, address, and description of each business in it. From there I am trying to set up a DataFrame in pandas from a dictionary with 'Name', 'Address', and 'Description' as keys and the regex results as values. The columns are generated correctly, but the results all land in a single row instead of one row per business.
import re
import socket
import csv
import pandas as pd
#read the Guide's text file
file = r'myfile.txt'
fh = open(file, encoding="utf8")
excel = r'myCSV.csv'
f = open(excel, 'w')
addList = list()
nameAddList = list()
descList = list()
nameList = list()
fhRead = fh.readlines()
for lines in fhRead:
    addresses = re.findall(r'\(([^()]{2,}?)\)', lines) #returns the addresses found within the parenthesis
    name = re.findall(r'^(.+?)\(', lines) #Returns the name
    nameAdd = re.findall(r'^(.+?)-', lines) #Returns the name and addresses found within the parenthesis
    desc = re.findall(r'\-(.*)', lines) #Returns the description for each bar
    if len(addresses) < 1:
        continue
    else:
        addList.append(addresses)
    if len(nameAdd) < 1:
        continue
    else:
        nameAddList.append(nameAdd)
    if len(desc) < 1:
        continue
    else:
        descList.append(desc)
    if len(name) < 1:
        continue
    else:
        nameList.append(name)
nameStr = str(nameList)
addressStr = str(addList)
descStr = str(descList)
#I think what is happening here is that when I pass through the list, it reads the list as one single value
data = {'Name': [[nameStr]],
'Address': [[addressStr]],
'Description': [descStr]}
df = pd.DataFrame(data)
print(df[['Name', 'Address', 'Description']])
This is my current output:
Name ... Description
0 [['Business 1 '], ['Business 2 '], ['Business 3'... ... [[' Description 1...
[1 rows x 3 columns]
An example of how I would want the code to appear is similar to this code that I found on Geeks for Geeks:
# Define a dictionary containing employee data
data = {'Name': [['Jai'], ['Princi'], ['Gaurav'], ['Anuj']],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification', ]])
That block of code produces this output:
Name Qualification
0 [Jai] Msc
1 [Princi] MA
2 [Gaurav] MCA
3 [Anuj] Phd
How can I accomplish this output with a generated list as opposed to having to hand enter in the values like in the Geeks for Geeks code?
Thank you for your help!
You need to add a row to the DataFrame inside the loop. DataFrame.append was removed in pandas 2.0, so use pd.concat, like this:
df = pd.concat([df, pd.DataFrame([[Name, Age, Address, Qualification]], columns=df.columns)], ignore_index=True)
This adds the new row to the end of df.
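An alternative that avoids growing the DataFrame row by row is to collect one plain value per business and let pandas build the rows from equal-length lists. The following is only a rough sketch: it assumes every line looks like "Name (Address) - Description", and the sample lines and the combined pattern are made up for illustration rather than taken from the asker's file.

```python
import re

import pandas as pd

# Invented sample lines standing in for the contents of the text file.
lines = [
    "Business 1 (12 Main St) - Description 1",
    "Business 2 (34 Oak Ave) - Description 2",
    "A line without any address",
]

# One pattern with three groups: name, address in parentheses, description.
pattern = re.compile(r'^(.+?)\s*\(([^()]{2,}?)\)\s*-\s*(.*)$')

names, addresses, descriptions = [], [], []
for line in lines:
    match = pattern.match(line)
    if match is None:
        continue  # skip lines that do not have all three parts
    name, address, description = match.groups()
    names.append(name)
    addresses.append(address)
    descriptions.append(description)

# Equal-length lists of plain strings become one row per business.
df = pd.DataFrame({'Name': names, 'Address': addresses, 'Description': descriptions})
print(df[['Name', 'Address', 'Description']])
```

The key difference from the question's code is that the lists are passed as they are, not wrapped in str() or in an extra list, so each element becomes its own row.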
I am trying to compare two rows of data to one another which I have stored in a list.
for x in range(0, len_data_row):
    if company_data[0][0][x] == company_data[1][0][x]:
        print('MATCH 1: {} - {}'.format(x, company_data[0][0][x]))
        # do nothing
    if company_data[0][0][x] is None and company_data[1][0][x] is not None:
        print('MATCH 2: {} - {}'.format(x, company_data[1][0][x]))
        # update first company_id with data from 2nd
    if company_data[0][0][x] is not None and company_data[1][0][x] is None:
        print('MATCH 3: {} - {}'.format(x, company_data[0][0][x]))
        # update second company_id with data from 1st
Pseudocode of what I want to do:
If the data at index[x] of the list is not None for row 2 but is blank for row 1, then write the value of row 2 at index[x] into row 1 in my database.
The part I can't figure out is whether SQLAlchemy lets you specify which column is being updated by an "index". (I think in db-land "index" means something different from what I mean. What I mean is a list index, e.g., list[1].) I would also like to know whether you can dynamically specify which column is being updated by passing a variable to the update code. Here's what I'm looking to do (it doesn't work, of course):
def some_name(column_by_index, column_value):
    u = table_name.update().where(table_name.c.id==row_id).values(column_by_index=column_value)
    db.execute(u)
Thank you!
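One possible approach to the dynamic update described above, sketched under assumptions: values() accepts a dictionary, so the column name can be looked up by position from the table's column keys and passed as a dictionary key instead of a keyword argument. The company table, its columns, and the helper name update_column_by_index are hypothetical and exist only for this illustration; an in-memory SQLite database keeps the sketch self-contained.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()

# Hypothetical table used only for this illustration.
company = Table(
    "company", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("city", String),
)
metadata.create_all(engine)


def update_column_by_index(conn, row_id, column_index, column_value):
    """Update one column of one row, choosing the column by its position."""
    column_name = list(company.columns.keys())[column_index]
    stmt = (
        company.update()
        .where(company.c.id == row_id)
        .values({column_name: column_value})  # dict key instead of keyword
    )
    conn.execute(stmt)


with engine.begin() as conn:
    conn.execute(company.insert(), [
        {"id": 1, "name": "Acme", "city": None},
        {"id": 2, "name": "Acme", "city": "Berlin"},
    ])
    # Row 1 has no city, so copy the value found at position 2 of row 2.
    update_column_by_index(conn, row_id=1, column_index=2, column_value="Berlin")
    updated_city = conn.execute(
        company.select().where(company.c.id == 1)
    ).fetchone()[2]

print(updated_city)
```

The original attempt fails because values(column_by_index=column_value) looks for a column literally named column_by_index; passing a dictionary lets the variable's content be used as the column name.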
I have a really irritating problem in my script and no idea what's wrong. It happens when I filter my DataFrame and then add rows to a new one that I want to export to Excel.
The file exports as an empty DataFrame, and print also shows that "report" is empty, but when I print report.Name, report.Value, etc., I get the proper output with all the elements. I can also export a single column to Excel, but not the entire DataFrame, which looks empty. What can cause this strange behaviour?
So this is my script:
df = pd.read_excel('testfile2.xlsx')
report = pd.DataFrame(columns=['Type','Name','Value'])
for index, row in df.iterrows():
    if type(row[0]) == str:
        type_name = row[0].split(" ")
        if type_name[0] == 'const':
            selected_index = index
            report['Type'].loc[index] = type_name[1]
            report['Name'].loc[index] = type_name[2]
            report['Value'].loc[index] = row[1]
        else:
            for elements in type_name:
                report['Value'].loc[selected_index] += " " + elements
    elif type(row[0]) == float:
        df = df.drop(index=index)
print(report) # output - Empty DataFrame
print(report.Name) # output - over 500 elements
You are trying to manipulate a row that does not exist in the DataFrame yet, which leads to the behaviour you describe.
Doing what you did with a much simpler example, I get the same result:
report = pd.DataFrame(columns=['Type','Name','Value'])
report['Type'].loc[0] = "A"
report['Name'].loc[0] = "B"
report['Value'].loc[0] = "C"
print(report) #empty df
print(report.Name) # prints "B" in a series
Easy solution: Just add the whole row instead of the three single values:
report = pd.DataFrame(columns=['Type','Name','Value'])
report.loc[0] = ["A", "B", "C"]
or in your code:
report.loc[index] = [type_name[1], type_name[2], row[1]]
If you want to keep assigning the values one at a time, address the row and the column in a single .loc call, so that the row is created in the DataFrame on the first assignment instead of in a detached column:
report.loc[index, 'Type'] = type_name[1]
report.loc[index, 'Name'] = type_name[2]
report.loc[index, 'Value'] = row[1]
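Another option when many rows are added in a loop is to collect them in a plain list and build the DataFrame once at the end. This is only a sketch: source_rows is an invented stand-in for the rows read from the Excel file, and the continuation-line handling from the question is left out.

```python
import pandas as pd

# Made-up stand-in for (first cell, second cell) pairs read from the sheet.
source_rows = [
    ("const int counter", 1),
    ("const str label", "abc"),
    ("something else", 2),
]

records = []
for first_cell, second_cell in source_rows:
    parts = first_cell.split(" ")
    if parts[0] == "const":
        # One dict per output row; keys match the report columns.
        records.append({"Type": parts[1], "Name": parts[2], "Value": second_cell})

# Build the DataFrame in one step from the collected rows.
report = pd.DataFrame(records, columns=["Type", "Name", "Value"])
print(report)
```

Because every row exists before the DataFrame is created, print(report) and print(report.Name) show the same data.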
Hi, I wrote some code that builds a default dictionary:
def makedata(filename):
    with open(filename, "r") as file:
        for x in features:
            previous = []
            count = 0
            for line in file:
                var_name = x
                regexp = re.compile(var_name + r'.*?([0-9.-]+)')
                match = regexp.search(line)
                if match and (match.group(1)) != previous:
                    previous = match.group(1)
                    count += 1
                    if count > wlength:
                        count = 1
                    target = str(str(count) + x)
                    dict.setdefault(target, []).append(match.group(1))
            file.seek(0)
    df = pd.DataFrame.from_dict(dict)
The dictionary looks good, but when I try to convert it to a DataFrame, the DataFrame is empty. I can't figure out why.
dict:
{'1meanSignalLenght': ['0.5305184', '0.48961428', '0.47203177', '0.5177274'], '1amplCor': ['0.8780955002105448', '0.8634431017504487', '0.9381169983046714', '0.9407036427333355'], '1metr10.angle1': ['0.6439386643584522', '0.6555194964997434', '0.9512436169922103', '0.23789348400794422'], '1syncVar': ['0.1344131181025432', '0.08194580887223515', '0.15922251165913678', '0.28795644612520327'], '1linVelMagn': ['0.07062673289287498', '0.08792496681784517', '0.12603999663935528', '0.14791253129369603'], '1metr6.velSum': ['0.17850601560734558', '0.15855169971072014', '0.21396496345720045', '0.2739525279330513']}
df:
Empty DataFrame
Columns: []
Index: []
{}
I think part of your issue is that you are using the built-in name 'dict' as if it were a variable.
Make a dictionary in your function and call it something other than 'dict'. Have your function return that dictionary, then use the return value when you make the DataFrame. Right now, you are creating a DataFrame from an empty dictionary object.
data = makedata(filename)
df = pd.DataFrame(data)
This should make a DataFrame from the returned dictionary.
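The advice above can be sketched roughly as follows. This is a simplified stand-in for the asker's function, not a drop-in replacement: it takes an open file object and the feature list as arguments, drops the wlength windowing and the duplicate check, and uses invented sample text and feature names.

```python
import io
import re

import pandas as pd


def make_data(file, features):
    """Collect the first number after each feature name, line by line."""
    collected = {}  # a local dictionary with its own name, not the built-in dict
    for feature in features:
        file.seek(0)  # rewind so every feature scans the whole file
        pattern = re.compile(re.escape(feature) + r'.*?([0-9.-]+)')
        for line in file:
            match = pattern.search(line)
            if match:
                collected.setdefault(feature, []).append(match.group(1))
    return collected


# Invented sample input standing in for the real data file.
sample = io.StringIO(
    "amplCor: 0.87\nsyncVar: 0.13\namplCor: 0.86\nsyncVar: 0.08\n"
)
data = make_data(sample, ["amplCor", "syncVar"])
df = pd.DataFrame(data)
print(df)
```

Returning the dictionary makes it explicit which object the DataFrame is built from, so an empty result can no longer come from a different, unrelated dictionary.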
You can either pass a list of dicts using pd.DataFrame(list_of_dicts) (use pd.DataFrame([d]) if your variable d is a single dict and not a list), or a dict of lists using pd.DataFrame.from_dict(d). In this last case d should be something like d = {"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ...}.
see: Pandas Dataframe from dict with empty list value
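To make the two input shapes concrete, here is a small sketch with made-up data showing a dict of lists, a list of dicts, and a single dict wrapped in a list. All variable names are illustrative.

```python
import pandas as pd

# Dict of lists: each key is a column, each list holds that column's values.
dict_of_lists = {"a": [1, 2, 3], "b": ["x", "y", "z"]}
df_from_columns = pd.DataFrame.from_dict(dict_of_lists)

# List of dicts: each dict is one row.
list_of_dicts = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}, {"a": 3, "b": "z"}]
df_from_rows = pd.DataFrame(list_of_dicts)

# A single dict of scalars has to be wrapped in a list to become one row.
single_record = {"a": 1, "b": "x"}
df_single = pd.DataFrame([single_record])

print(df_from_columns)
print(df_from_rows)
print(df_single)
```

The first two calls describe the same table from different directions, column-wise and row-wise.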
Prerequisites
The dataset I'm working with is MovieLens 100k.
The Python packages I'm using are Surprise, io and pandas.
The goal is to test a recommendation system using KNN (+ K-Fold) with vector cosine and Pearson similarity, for both user-based CF and item-based CF.
Briefing
So far, I have coded both UBCF and IBCF as shown below.
Q1. IBCF generates data according to the input given to it; I need it to export a CSV file, since I need to find the predicted values.
Q2. UBCF requires each entry to be entered separately and doesn't work even with the code immediately below:
csvfile = 'pred_matrix.csv'
with open(csvfile, "w") as output:
    writer = csv.writer(output,lineterminator='\n')
    #algo.predict(user_id, item_id, estimated_ratings)
    for val in algo.predict(str(range(1,943)),range(1,1683),1):
        writer.writerow([val])
Clearly it throws an error about lists, as they cannot be comma-separated.
Q3. Getting precision and recall on the evaluated and recommended values.
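One way around the list error, sketched under assumptions: Surprise's predict works on a single raw user id and a single raw item id per call, so the ids have to be looped over pair by pair instead of being passed as ranges. To keep the sketch runnable without a trained model, fake_predict below is a hypothetical stand-in for algo.predict(uid, iid), and the small id ranges and constant estimate are invented; the namedtuple mirrors the fields of Surprise's Prediction result.

```python
import csv
import io
from collections import namedtuple
from itertools import product

# Stand-in with the same field names as Surprise's Prediction tuple.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def fake_predict(uid, iid):
    """Hypothetical replacement for algo.predict(uid, iid)."""
    return Prediction(uid, iid, None, 3.0, {})


# Raw ids are strings in MovieLens; tiny ranges keep the example short.
user_ids = [str(u) for u in range(1, 4)]
item_ids = [str(i) for i in range(1, 3)]

# Swap for open('pred_matrix.csv', 'w', newline='') to write a real file.
output = io.StringIO()
writer = csv.writer(output, lineterminator='\n')
writer.writerow(["user_id", "item_id", "estimate"])
for uid, iid in product(user_ids, item_ids):
    pred = fake_predict(uid, iid)  # one (user, item) pair per call
    writer.writerow([pred.uid, pred.iid, pred.est])

csv_lines = output.getvalue().splitlines()
print(csv_lines[:3])
```

With a trained model, replacing fake_predict(uid, iid) by algo.predict(uid, iid) and widening the id ranges would give one CSV row per user and item pair.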
CODE
STARTS WITH
if ip == 1:
    one = 'cosine'
else:
    one = 'pearson'
choice = raw_input("Filtering Method: \n1.User based \n2.Item based \n Choice:")
if choice == '1':
    user_based_cf(one)
elif choice == '2':
    item_based_cf(one)
else:
    sim_op={}
    exit(0)
UBCF:
def user_based_cf(co_pe):
    # INITIALIZE REQUIRED PARAMETERS
    path = '/home/mister-t/Projects/PycharmProjects/RecommendationSys/ml-100k/u.user'
    prnt = "USER"
    sim_op = {'name': co_pe, 'user_based': True}
    algo = KNNBasic(sim_options=sim_op)
    # RESPONSIBLE TO EXECUTE DATA SPLITS Mentioned in STEP 4
    perf = evaluate(algo, df, measures=['RMSE', 'MAE'])
    print_perf(perf)
    print type(perf)
    # START TRAINING
    trainset = df.build_full_trainset()
    # APPLYING ALGORITHM KNN Basic
    res = algo.train(trainset)
    print "\t\t >>>TRAINED SET<<<<\n\n", res
    # PEEKING PREDICTED VALUES
    search_key = raw_input("Enter User ID:")
    item_id = raw_input("Enter Item ID:")
    actual_rating = input("Enter actual Rating:")
    print algo.predict(str(search_key), item_id, actual_rating)
IBCF
def item_based_cf(co_pe):
    # INITIALIZE REQUIRED PARAMETERS
    path = '/location/ml-100k/u.item'
    prnt = "ITEM"
    sim_op = {'name': co_pe, 'user_based': False}
    algo = KNNBasic(sim_options=sim_op)
    # RESPONSIBLE TO EXECUTE DATA SPLITS = 2
    perf = evaluate(algo, df, measures=['RMSE', 'MAE'])
    print_perf(perf)
    print type(perf)
    # START TRAINING
    trainset = df.build_full_trainset()
    # APPLYING ALGORITHM KNN Basic
    res = algo.train(trainset)
    print "\t\t >>>TRAINED SET<<<<\n\n", res
    # Read the mappings raw id <-> movie name
    rid_to_name, name_to_rid = read_item_names(path)
    search_key = raw_input("ID:")
    print "ALGORITHM USED : ", one
    toy_story_raw_id = name_to_rid[search_key]
    toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
    # Retrieve inner ids of the nearest neighbors of Toy Story.
    k=5
    toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=k)
    # Convert inner ids of the neighbors into names.
    toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                           for inner_id in toy_story_neighbors)
    toy_story_neighbors = (rid_to_name[rid]
                           for rid in toy_story_neighbors)
    print 'The ', k,' nearest neighbors of ', search_key,' are:'
    for movie in toy_story_neighbors:
        print(movie)
Q1. IBCF Generates data as per input given to it, I need it to export a csv file since I need to find out the predicted values
The easiest way to dump anything to a CSV is to use the csv module!
import csv
res = [x, y, z, ....]
csvfile = "<path to output csv or txt>"
#Assuming res is a flat list
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in res:
        writer.writerow([val])
#Assuming res is a list of lists
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerows(res)