this code is supposed to take in a file with four index in each line and return the first and last indexs in a dictionary the the if-statement is fulfilled, and it works.The output might be: {1:[2,4,3], 3:[5,6,1]}
def value(filename):
f=open(filename,'r') *
bat_val=defaultdict(list)
for line in f:
four_vals = (line.split(','))
batch=four_vals[0]
x=float(four_vals[1])
y=float(four_vals[2])
circle = x**2 + y**2
if circle <= 1:
value = four_vals[3]
bat_val[batch].append(value.strip())
f.close()
return bat_val
print(value('sample2.txt'))
#Then I want to use the def value()-function in the function below to calculate the average for each key. If i got the output above i will now in this function get:
{1:3, 3:4}
def mean(file):
calc=value(open(file,'r') )
result={}
for bat,val in sorted(calc.items()):
mean = (sum(val))/len(val)
result[bat]=mean
return result
print(mean('sample4.txt'))
#but the function def value() says TypeError in line 12 (marked with *) and I dont understand why
There are two issues.
1- You need to append number to your dict's values since you'll use them to calculate mean
if circle <= 1:
value = float(four_vals[3].strip())
bat_val[batch].append(value)
if you don't want to do this, you can cast values to float before calculating mean in the mean function.
2- And as mentioned in the other answer, you should avoid opening file twice by replacing
calc=value(open(file,'r') )
with
calc=value(file)
you open the file twice
once in mean and again in value
try to pass the filename in mean:
def mean(file):
calc=value(file) # <<<<< this line changed
result={}
for bat,val in sorted(calc.items()):
mean = (sum(val))/len(val)
result[bat]=mean
return result
Related
This is a question I've had before: I have two arrays representing the inputs and corresponding outputs of a function. I need to find the input for a specific output that falls between data points. How do I do that?
For example:
import numpy as np
B = np.arange(0,10,1)
def fun(b):
return b*3/5
A = fun(B)
How to get the value of "B" for fun to return 3.75?
This technique uses linear interpolation to approximate.
I start with this function:
def interpABS(A,B,Aval):
if Aval>max(A) or Aval<min(A):
print('Error: Extrapolating beyond given data')
else:
if len(A)==len(B):
for i in np.arange(1,len(A),1):
ihi = i
ilo = i-1
if A[i]>Aval:
break
Alo = A[ilo]
Blo = B[ilo]
Ahi = A[ihi]
Bhi = B[ihi]
out = Blo + (Bhi-Blo)*(Aval-Alo)/(Ahi-Alo)
return out
else:
print('Error: inputs of different sizes')
Note: I'm kind of an amateur and don't know how to set up exceptions, so instead the error outputs are just print commands on a different path from the rest of the function. Those more experienced than I am may recommend improvements.
Use the output array from your function as A, and the corresponding input array as B, then input your target value as Aval. interpABS will return the an approximate input for your original function to get the target value
So, for our example above, interpABS(A,B,3.75) will return a value of 6.25
This can be useful even if Aval is a value of A to find the corresponding B value, since the math simplifies to Blo + 0. For example, changing Aval in the above example will give 5.0, which is part of the original input set B.
I have a code that I inform a folder, where it has n images that the code should return me the relative frequency histogram.
From there I have a function call:
for image in total_images:
histogram(image)
Where image is the current image that the code is working on and total_images is the total of images (n) it has in the previously informed folder.
And from there I call the histogram() function, sending as a parameter the current image that the code is working.
My histogram() function has the purpose of returning the histogram of the relative frequency of each image (rel_freq).
Although the returned values are correct, rel_freq should be a array 1x256 positions ranging from 0 to 255.
How can I transform the rel_freq variable into a 1x256 array? And each value stored in its corresponding position?
When I do len *rel_freq) it returns me 256, that's when I realized that it is not in the format I need...
Again, although the returned data is correct...
After that, I need to create an array store_all = len(total_images)x256 to save all rel_freq...
I need to save all rel_freq in an array to later save it and to an external file, such as .txt.
I'm thinking of creating another function to do this...
Something like that, but I do not know how to do it correctly, but I believe you will understand the logic...
def store_all_histograms(total_images):
n = len(total_images)
store_all = [n][256]
for i in range(0,n):
store_all[i] = rel_freq
I know the function store_all_histograms() is wrong, I just wrote it here to show more or less the way I'm thinking of doing... but again, I do not know how to do it properly... At this point, the error I get is:
store_all = [n][256]
IndexError: list index out of range
After all, I need the store_all variable to save all relative frequency histograms for example like this:
position: 0 ... 256
store_all = [
[..., ..., ...],
[..., ..., ...],
.
.
.
n
]
Now follow this block of code:
def histogram(path):
global rel_freq
#Part of the code that is not relevant to the question...
rel_freq = [(float(item) / total_size) * 100 if item else 0 for item in abs_freq]
def store_all_histograms(total_images):
n = len(total_images)
store_all = [n][256]
for i in range(0,n):
store_all[i] = rel_freq
#Part of the code that is not relevant to the question...
# Call the functions
for fn in total_images:
histogram(fn)
store_all_histograms(total_images)
I hope I have managed to be clear with the question.
Thanks in advance, if you need any additional information, you can ask me...
Return the result, don't use a global variable:
def histogram(path):
return [(float(item) / total_size) * 100 if item else 0 for item in abs_freq]
Create an empty list:
store_all = []
and append your results:
for fn in total_images:
store_all.append(histogram(fn))
Alternatively, use a list comprehension:
store_all = [histogram(fn) for fn in total_images]
for i in range(0,n):
store_all[i+1] = rel_freq
Try this perhaps? I'm a bit confused on the question though if I'm honest. Are you trying to shift the way you call the array with all the items by 1 so that instead of calling position 1 by list[0] you call it via list[1]?
So you want it to act like this?
>>list = [0,1,2,3,4]
>>list[1]
0
I'm trying to cross compare two outputs labeled "S" in compareDNA (calculating Hamming distance). Though, I cannot figure out how to call an integer from one def to another. I've tried returning the variable, but, I am unable to call it (in a different def) after returning it.
I'm attempting to see which output of "compareDNA(Udnalin, Mdnalin) and compareDNA(Udnalin, Hdnalin)" is higher, to determine which has a greater hamming distance.
How does one call an integer from one def to another?
import sys
def main():
var()
def var():
Mdna = open("mouseDNA.txt", "r")
Mdnalin = Mdna.readline()
print(Mdnalin)
Mdna.close
Hdna = open("humanDNA.txt", "r")
Hdnalin = Hdna.readline()
print(Hdnalin)
Hdna.close
Udna = open("unknownDNA.txt", "r")
Udnalin = Udna.readline()
print(Udnalin)
Udna.close
S = 0
S1 = 0
S2 = 0
print("Udnalin + Mdnalin")
compareDNA(Udnalin, Mdnalin)
S1 = S
print("Udnalin + Hdnalin")
compareDNA(Udnalin, Hdnalin)
def compareDNA(i, j):
diffs = 0
length = len(i)
for x in range(length):
if i[x] != j[x]:
diffs += 1
S = length - diffs / length
S = round(S, 2)
return S
# print("Mouse")
# print("Human")
# print("RATMA- *cough* undetermined")
main()
You probably want to assign the value returned by each call to compareDNA to a separate variable in your var function. Then you can do whatever you want with those values (what exactly you want to do is not clear from your question). Try something like this:
S1 = compareDNA(Udnalin, Mdnalin) # bind the return value from this call to S1
S2 = compareDNA(Udnalin, Hdnalin) # and this one to S2
# do something with S1 and S2 here!
If what you want to do is especially simple (e.g. comparing them to see which is larger), you could even use the return values directly in an expression, such as the condition in a if statement:
if compareDNA(Udnalin, Mdnalin) > S2 = compareDNA(Udnalin, Hdnalin):
print("Unknown DNA is closer to a Mouse")
else:
print("Unknown DNA is closer to a Human")
There's one further thing I'd like to point out, which is unrelated to the core of your question: You should use with statements to handle closing your files, rather than manually trying to close them. Your current code doesn't actually close the files correctly (you're missing the parentheses after .close in each case which are needed to make it a function call).
If you use a with statement instead, the files will be closed automatically at the end of the block (even if there is an exception):
with open("mouseDNA.txt", "r") as Mdna:
Mdnalin = Mdna.readline()
print(Mdnalin)
with open("humanDNA.txt", "r") as Hdna:
Hdnalin = Hdna.readline()
print(Hdnalin)
with open("unknownDNA.txt", "r") as Udna:
Udnalin = Udna.readline()
print(Udnalin)
I have some pandas groupby functions that write data to file, but for some reason I'm getting redundant data written to file. Here's the code:
This function gets applied to each item in the dataframe
def item_grouper(df):
# Get the frequency of each tag applied to the item
tag_counts = df['tag'].value_counts()
# Get the most frequent tag (or tags, assuming a tie)
max_tags = tag_counts[tag_counts==tag_counts.max()]
# Get the total nummber of annotations for the item
total_anno = len(df)
# Now, process each user who tagged the item
return df.groupby('uid').apply(user_grouper,total_anno,max_tags,tag_counts)
# This function gets applied to each user who tagged an item
def user_grouper(df,total_anno,max_tags,tag_counts):
# subtract user's annoations from total annoations for the item
total_anno = total_anno - len(df)
# calculate weight
weight = np.log10(total_anno)
# check if user has used (one of) the top tag(s), and adjust max_tag_count
if len(np.intersect1d(max_tags.index.values,df['iid']))>0:
max_tag_count = float(max_tags[0]-1)
else:
max_tag_count = float(max_tags[0])
# for each annotation...
for i,row in df.iterrows():
# calculate raw score
raw_score = (tag_counts[row['tag']]-1) / max_tag_count
# write to file
out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
return df
So, one grouping function groups the data by iid (item id), does some processing, and then groups each sub-dataframe by uid (user_id), does some calculation, and writes to an output file. Now, the output file should have exactly one line per row in the original dataframe, but it doesn't! I keep getting the same data written to file multiple times. For instance, if I run:
out = open('data/test','w')
df.head(1000).groupby('iid').apply(item_grouper)
out.close()
The output should have 1000 lines (the code only writes one line per row in the dataframe), but the result output file has 1,997 lines. Looking at the file shows the exact same lines written multiple (2-4) times, seemingly at random (i.e. not all lines are double-written). Any idea what I'm doing wrong here?
See the docs on apply. Pandas will call the function twice on the first group (to determine between a fast/slow code path), so the side effects of the function (IO) will happen twice for the first group.
Your best bet here is probably to iterate over the groups directly, like this:
for group_name, group_df in df.head(1000).groupby('iid'):
item_grouper(group_df)
I agree with chrisb's determination of the problem. As a cleaner way, consider having your user_grouper() function not save any values, but instead return these. With a structure as
def user_grouper(df, ...):
(...)
df['max_tag_count'] = some_calculation
return df
results = df.groupby(...).apply(user_grouper, ...)
for i,row in results.iterrows():
# calculate raw score
raw_score = (tag_counts[row['tag']]-1) / row['max_tag_count']
# write to file
out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
I have a simple, stupid Python problem. Given a graph, I'm trying to sample from a random variable whose distribution is the same as that of the degree distribution of the graph.
This seems like it should pretty straightforward. Yet somehow I am still managing to mess this up. My code looks like this:
import numpy as np
import scipy as sp
import graph_tool.all as gt
G = gt.random_graph(500, deg_sampler=lambda: np.random.poisson(1), directed=False)
deg = gt.vertex_hist(G,"total",float_count=False)
# Extract counts and values
count = list(deg[0])
value = list(deg[1])
# Generate vector of probabilities for each node
p = [float(x)/sum(count) for x in count]
# Load into a random variable for sampling
x = sp.stats.rv_discrete(values=(value,p))
print x.rvs(1)
However, upon running this it returns an error:
Traceback (most recent call last):
File "temp.py", line 16, in <module>
x = sp.stats.rv_discrete(values=(value,p))
File "/usr/lib/python2.7/dist-packages/scipy/stats/distributions.py", line 5637, in __init__
self.pk = take(ravel(self.pk),indx, 0)
File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 103, in take
return take(indices, axis, out, mode)
IndexError: index out of range for array
I'm not sure why this is. If in the code above I write instead:
x = sp.stats.rv_discrete(values=(range(len(count)),p))
Then the code runs fine, but it gives a weird result--clearly the way I've specified this distribution, a value of "0" ought to be most common. But this code gives "1" with high probability and never returns a "0," so something is getting shifted over somehow.
Can anyone clarify what is going on here? Any help would be greatly appreciated!
I believe the first argument for x.rvs() would be the loc arg. If you make loc=1 by calling x.rvs(1), you're adding 1 to all values.
Instead, you want
x.rvs(size=1)
As an aside, I'd recommend that you replace this:
# Extract counts and values
count = list(deg[0])
value = list(deg[1])
# Generate vector of probabilities for each node
p = [float(x)/sum(count) for x in count]
With:
count, value = deg # automatically unpacks along first axis
p = count.astype(float) / count.sum() # count is an array, so you can divide all elements at once