Opening a txt file, creating a list, and getting basic stats - Python

I am trying to open a text file called state_meet.txt; the info is formatted as
gymnastics_school,participant_name,all-around_points_earned
see example:
Lanier City Gymnastics,Ben W.,55.301
Lanier City Gymnastics,Alex W.,54.801
Lanier City Gymnastics,Sky T.,51.2
Lanier City Gymnastics,William G.,47.3
etc.
and create functions to get info such as:
The total count of gymnasts that participated in the state meet.
The first place score.
The last place score.
The score differential between the first and last place.
The average score for all gymnasts.
The median score. (The median is the score at the mid-point of a sorted list. If there is an even number of elements in the list, the median is the average of the 2 middle elements.)
The average of all scores above the median (not including the median).
The average of all scores below the median (not including the median).
The output should look like this:
Summary of Data:
Number of gymnasts: 103
First place score: 143.94
Here's the code I have so far:
with open('state_meet.txt','r') as f:
    for line in f:
        allt = []
        values = line.split()
        print(values[3])

#first
max_val = max(values[3])
int(max_val)
print(max_val)

#last
min_val = min(values[3])
int(min_val)
print(min_val)

#Mean
total = sum(input_list)
length = len(input_list)
for nums in [input_list]:
    mean_val = total / length
    float(mean_val)

#Median
sorted(input_list)
med_val = sorted(lst)
lstLen = len(lst)
index = (lstLen - 1) // 2
This is what I have so far, but my code is reading the value as W.,55.301 instead of 55.301 and giving me errors.

You have a comma-separated values (csv) file. Use the csv module.
import csv

data = []
with open("state_meet.txt") as f:
    reader = csv.DictReader(f, fieldnames=["school", "participant", "score"])
    for line in reader:
        line["score"] = float(line["score"])  # scores come in as strings
        data.append(line)

# first place
record = max(data, key=lambda d: d["score"])
best_score = record["score"]

# last place
record = min(data, key=lambda d: d["score"])
worst_score = record["score"]

# Mean score
mean = sum(d["score"] for d in data) / len(data)

# Median score
median = sorted(d["score"] for d in data)[(len(data) - 1) // 2]
csv.DictReader reads the lines of your csv file and automatically converts each one to a dictionary, keyed by whatever you like. This is perhaps easier to read than the collections.namedtuple suggestion in dokelung's answer, though namedtuple is equally valid. The key here is that we can keep the entire record around instead of throwing away everything but the score.
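For the "Summary of Data" output shown in the question, a minimal sketch using the values computed above (the remaining statistics would follow the same pattern) could be:

print("Summary of Data:")
print("Number of gymnasts:", len(data))
print("First place score:", best_score)
print("Last place score:", worst_score)
print("Score differential:", round(best_score - worst_score, 3))
print("Average score:", round(mean, 2))
print("Median score:", median)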

You should use split(',') instead of split().
Use values[2] to get the third item of the list values.
The list allt seems unused.
Also, because values is reassigned on every iteration, no matter how many lines there are in state_meet.txt it only ever holds the last line's data once the loop ends.
I guess these are the things you want to do:
import collections

names = ["gymnastics_school", "participant_name", "all_around_points_earned"]
Data = collections.namedtuple("Data", names)

data = []
with open('state_meet.txt','r') as f:
    for line in f:
        line = line.strip()
        items = line.split(',')
        items[2] = float(items[2])
        data.append(Data(*items))

# max value
max_one = max(data, key=lambda d:d.all_around_points_earned)
print(max_one.all_around_points_earned)

# min value
min_one = min(data, key=lambda d:d.all_around_points_earned)
print(min_one.all_around_points_earned)

# mean value
total = sum(d.all_around_points_earned for d in data)
mean_val = total/len(data)
print(mean_val)

# median value
sorted_data = sorted(data, key=lambda d:d.all_around_points_earned)
if len(data)%2==0:
    a = sorted_data[len(data)//2].all_around_points_earned
    b = sorted_data[len(data)//2-1].all_around_points_earned
    median_val = (a+b)/2
else:
    median_val = sorted_data[(len(data)-1)//2].all_around_points_earned
print(median_val)
Let me explain more:
First, I define a namedtuple type called "Data" with the item names (gymnastics_school, ...). Then I can use d = Data('school', 'name', '50.0') to create a namedtuple d, and we can easily fetch the item values by using . to access the attributes.
>>> names = ["gymnastics_school", "participant_name", "all_around_points_earned"]
>>> Data = collections.namedtuple("Data", names)
>>> d = Data('school', 'name', '50.0')
>>> d.gymnastics_school
'school'
>>> d.participant_name
'name'
>>> d.all_around_points_earned
'50.0'
Next, when we iterate over the lines of the file object, the string method strip removes surrounding blanks and newlines, which makes each line cleaner. Then split(',') splits the line on the specified delimiter ,.
Here, we convert with float because the third item in the split list items is a float (it is a string in the file). Finally, we use the namedtuple Data to create a record and append it to the list data.
Next, the builtin functions max and min help us find the max/min item. But since each element of data is a namedtuple, we use a lambda function to fetch the points and use that as the key when picking the max/min one.
Also, the function sum lets us compute the summation without a loop. Here, we have to extract the points, so we pass a generator, d.all_around_points_earned for d in data, to sum.
I get the median value by sorting data and taking the middle element. When the number of items is odd, we just pick the center one; if it is even, we pick the middle two and compute their mean.
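As a small illustration of the even-length case, using four of the scores from the question purely as an example:

scores = [55.301, 54.801, 51.2, 47.3]
sorted_scores = sorted(scores)          # [47.3, 51.2, 54.801, 55.301]
n = len(sorted_scores)
if n % 2 == 0:
    median = (sorted_scores[n//2 - 1] + sorted_scores[n//2]) / 2
else:
    median = sorted_scores[(n - 1)//2]
print(median)  # (51.2 + 54.801) / 2 = 53.0005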
Hope my answer can help you!

split() by default splits on whitespace. Did you mean
values = line.split(',')
to split on commas?
https://docs.python.org/2/library/stdtypes.html#str.split
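For example, with one line from the question's file:

line = "Lanier City Gymnastics,Ben W.,55.301\n"
print(line.split())     # ['Lanier', 'City', 'Gymnastics,Ben', 'W.,55.301']
print(line.split(','))  # ['Lanier City Gymnastics', 'Ben W.', '55.301\n']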

Related

Creating a variable for each line in a txt file

I am building a program that will eventually be able to construct finite automata and am in the early stages of reading in the txt file to set up my variables (states, alphabet, starting position, final positions, and rules, respectively). Here is the code I have thus far:
def read_dfa(dfa_filename):
    dfa = open(dfa_filename, 'r')
    count = 0
    for line in dfa:
        count += 1
        states, alphabet, initial, final, rules = (line.format(count, line.strip()))
The issue I am currently facing is that lines 1:4 will always provide the data I need, however, the rules variable could be just line 5, or it could be multiple lines (5:n). How do I make it so I can successfully store the lines I want to each appropriate variable?
This is the error I am faced with: 'ValueError: Too many values to unpack (expected 5)'
This is an example of what dfa_filename could be:
q1,q2 #states
a,b #alphabet
q1 #initial
q2 #final
q1,a,q2 #rules
q1,b,q1 #rules
q2,a,q2 #rules
q2,b,q2 #rules
You could use splitlines and assign the first 4 lines to states, alphabet, initial, and final. Then, assign the remaining lines to rules.
with open('textfile.txt') as input:
    data = input.read()

states, alphabet, initial, final = data.splitlines()[0:4]
rules = data.splitlines()[4:]
If I understood correctly, the first 4 lines are always in this order and then there are multiple rules, and the number of rules can change?
If so, try to unpack with the * operator, which allows unpacking multiple values into one variable:
a, *b = 1, 2, 3
results in a=1 and b=[2, 3]
So it should be something like
states, alphabet, initial, final, *rules = (line.format(count, line.strip()))
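Applied to the whole file at once (rather than per line), a minimal sketch might look like this, assuming the first four lines are always states, alphabet, initial, and final:

with open(dfa_filename) as dfa:
    states, alphabet, initial, final, *rules = [line.strip() for line in dfa]
# rules is now a list of the remaining lines, e.g. ['q1,a,q2 #rules', 'q1,b,q1 #rules', ...]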
Stay safe
You can use list() to read the file into a list of lines and then get the lines you need by index.
file_lines = list(open('textfile.txt'))
states, alphabet, initial, final = file_lines[0:4]
rules = file_lines[4:]
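If the trailing #... comments in the example file should not end up in the variables, a possible refinement (the helper name strip_comment is just illustrative) is:

def strip_comment(line):
    return line.split('#')[0].strip()

file_lines = [strip_comment(line) for line in open('textfile.txt')]
states, alphabet, initial, final = file_lines[0:4]
rules = file_lines[4:]
print(states)  # 'q1,q2'
print(rules)   # ['q1,a,q2', 'q1,b,q1', 'q2,a,q2', 'q2,b,q2']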

How to fuzzy match two lists in Python

I have two lists: ref_list and inp_list. How can one make use of FuzzyWuzzy to match the names in the input list against the reference list?
inp_list = pd.DataFrame(['ADAMS SEBASTIAN', 'HAIMBILI SEUN', 'MUTESI JOHN',
                         'SHEETEKELA MATT', 'MUTESI JOHN KUTALIKA',
                         'ADAMS SEBASTIAN HAUSIKU', 'PETERS WILSON',
                         'PETERS MARIO', 'SHEETEKELA MATT NICKY'],
                        columns=['Names'])

ref_list = pd.DataFrame(['ADAMS SEBASTIAN HAUSIKU', 'HAIMBILI MIKE', 'HAIMBILI SEUN',
                         'MUTESI JOHN KUTALIKA', 'PETERS WILSON MARIO',
                         'SHEETEKELA MATT NICKY MBILI'],
                        columns=['Names'])
After some research, I modified some code I found on the internet. The problem with this code is that, while it works very well on small samples, in my case inp_list and ref_list are 29k and 18k entries long respectively and it takes more than a day to run.
Below is the code; first, a helper function is defined.
from fuzzywuzzy import fuzz

def match_term(term, inp_list, min_score=0):
    # -1 score in case I don't get any matches
    max_score = -1
    # return empty for no match
    max_name = ''
    # iterate over all names in the other list
    for term2 in inp_list:
        # find the fuzzy match score
        score = fuzz.token_sort_ratio(term, term2)
        # checking if I am above my threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = term2
            max_score = score
    return (max_name, max_score)

# list of dicts for easy dataframe creation
dict_list = []
# iterating over the sales file
for name in inp_list:
    # use the defined function above to find the best match, also set the threshold to a chosen #
    match = match_term(name, ref_list, 94)
    # new dict for storing data
    dict_ = {}
    dict_.update({'passenger_name': name})
    dict_.update({'match_name': match[0]})
    dict_.update({'score': match[1]})
    dict_list.append(dict_)
Where can this code be improved so that it runs faster, and perhaps avoids evaluating items that have already been assessed?
You can try to vectorize the operations instead of evaluating the scores in a loop.
Make a df where the first column, ref, holds ref_list and the second column, inp, is each name in inp_list. Then call df.apply(lambda row: process.extractOne(row['inp'], row['ref']), axis=1). Finally you'll get the best matching name from ref_list and its score for each name in inp_list, as sketched below.
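A minimal sketch of that approach, assuming fuzzywuzzy is installed; the column names inp and ref and the short name lists are just illustrative:

import pandas as pd
from fuzzywuzzy import process

inp_names = ['ADAMS SEBASTIAN', 'PETERS MARIO', 'SHEETEKELA MATT']
ref_names = ['ADAMS SEBASTIAN HAUSIKU', 'PETERS WILSON MARIO',
             'SHEETEKELA MATT NICKY MBILI']

df = pd.DataFrame({'inp': inp_names})
df['ref'] = [ref_names] * len(df)  # every row carries the full reference list

# process.extractOne returns a (best_match, score) tuple for each input name
result = df.apply(lambda row: process.extractOne(row['inp'], row['ref']), axis=1)
df['match_name'] = [r[0] for r in result]
df['score'] = [r[1] for r in result]
print(df[['inp', 'match_name', 'score']])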
The measures you are using are computationally demanding for that many pairs of strings. As an alternative to fuzzywuzzy, you could try a library called string_grouper, which exploits a faster TF-IDF method and the cosine similarity measure to find similar words. As an example:
import random, string, time
import pandas as pd
from string_grouper import match_strings

alphabet = list(string.ascii_lowercase)
from_r, to_r = 0, len(alphabet)-1

random_strings_1 = ["".join(alphabet[random.randint(from_r, to_r)]
                    for i in range(6)) for j in range(5000)]
random_strings_2 = ["".join(alphabet[random.randint(from_r, to_r)]
                    for i in range(6)) for j in range(5000)]

series_1 = pd.Series(random_strings_1)
series_2 = pd.Series(random_strings_2)

t_1 = time.time()
matches = match_strings(series_1, series_2, min_similarity=0.6)
t_2 = time.time()

print(t_2 - t_1)
print(matches)
It takes less than one second to do 25,000,000 comparisons! For a more useful test of the library look here: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html where it is claimed that
"Using this approach made it possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop".
To tune your matching algorithm further look at the **kwargs arguments you can give to the match_strings function above.

Using split in Python on an Excel file to get two parameters

I'm getting really confused with all the information on here about using 'split' in Python. Basically I want to write code which opens a spreadsheet (with two columns in it); the function I write will use the first column as x's and the second column as y's and then plot them in the x-y plane.
I thought I would use line.splitlines to cut each line into (x, y) pairs, but I keep getting
'ValueError: need more than 1 value to unpack'
and I don't know what this means.
Below is what I've written so far (xdir is an initial condition for a different part of my question):
def plotMo(filename, xdir):
    infile = open(filename)
    data = []
    for line in infile:
        x,y = line.splitlines()
        x = float(x)
        y = float(y)
        data.append([x,y])
    infile.close()
    return data

plt.plot(x,y)
For example with
0 0.049976
0.01 0.049902
0.02 0.04978
0.03 0.049609
0.04 0.04939
0.05 0.049123
0.06 0.048807
I would want the first point in my plane to be (0, 0.049976) and the second point to be (0.01, 0.049902).
x,y = line.splitlines() tries to split the current line into several lines.
Since splitlines returns only 1 element, there's an error because python cannot find a value to assign to y.
What you want is x,y = line.split() which will split the line according to 1 or more spaces (like awk would do) if no parameter is specified.
However it depends on the format: if there are blank lines you'll get the "unpack" problem at some point, so to be safe and skip malformed lines, write:
items = line.split()
if len(items)==2: x,y = items
To sum it up, a more pythonic, shorter & safer way of writing your routine would be:
def plotMo(filename):
    with open(filename) as infile:
        data = []
        for line in infile:
            items = line.split()
            if len(items)==2:
                data.append([float(e) for e in items])
    return data
(maybe it could be condensed more, but that's good for starters)
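To actually get the plot the question asks for, a possible usage of the routine above (assuming matplotlib is available; the file name is illustrative) is:

import matplotlib.pyplot as plt

data = plotMo('datafile.txt')
xs = [point[0] for point in data]
ys = [point[1] for point in data]
plt.plot(xs, ys)
plt.show()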

Pandas groupby and file writing problems

I have some pandas groupby functions that write data to file, but for some reason I'm getting redundant data written to file. Here's the code:
# This function gets applied to each item in the dataframe
def item_grouper(df):
    # Get the frequency of each tag applied to the item
    tag_counts = df['tag'].value_counts()
    # Get the most frequent tag (or tags, assuming a tie)
    max_tags = tag_counts[tag_counts==tag_counts.max()]
    # Get the total number of annotations for the item
    total_anno = len(df)
    # Now, process each user who tagged the item
    return df.groupby('uid').apply(user_grouper,total_anno,max_tags,tag_counts)

# This function gets applied to each user who tagged an item
def user_grouper(df,total_anno,max_tags,tag_counts):
    # subtract user's annotations from total annotations for the item
    total_anno = total_anno - len(df)
    # calculate weight
    weight = np.log10(total_anno)
    # check if user has used (one of) the top tag(s), and adjust max_tag_count
    if len(np.intersect1d(max_tags.index.values,df['iid']))>0:
        max_tag_count = float(max_tags[0]-1)
    else:
        max_tag_count = float(max_tags[0])
    # for each annotation...
    for i,row in df.iterrows():
        # calculate raw score
        raw_score = (tag_counts[row['tag']]-1) / max_tag_count
        # write to file
        out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
    return df
So, one grouping function groups the data by iid (item id), does some processing, and then groups each sub-dataframe by uid (user_id), does some calculation, and writes to an output file. Now, the output file should have exactly one line per row in the original dataframe, but it doesn't! I keep getting the same data written to file multiple times. For instance, if I run:
out = open('data/test','w')
df.head(1000).groupby('iid').apply(item_grouper)
out.close()
The output should have 1000 lines (the code only writes one line per row in the dataframe), but the result output file has 1,997 lines. Looking at the file shows the exact same lines written multiple (2-4) times, seemingly at random (i.e. not all lines are double-written). Any idea what I'm doing wrong here?
See the docs on apply. Pandas will call the function twice on the first group (to determine between a fast/slow code path), so the side effects of the function (IO) will happen twice for the first group.
Your best bet here is probably to iterate over the groups directly, like this:
for group_name, group_df in df.head(1000).groupby('iid'):
    item_grouper(group_df)
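In that pattern, the file handling from the original snippet stays the same; out remains a module-level name, so the write calls inside user_grouper still see it:

out = open('data/test', 'w')
for group_name, group_df in df.head(1000).groupby('iid'):
    item_grouper(group_df)
out.close()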
I agree with chrisb's determination of the problem. As a cleaner way, consider having your user_grouper() function not write any values, but instead return them, with a structure like this:
def user_grouper(df, ...):
    (...)
    df['max_tag_count'] = some_calculation
    return df

results = df.groupby(...).apply(user_grouper, ...)
for i,row in results.iterrows():
    # calculate raw score
    raw_score = (tag_counts[row['tag']]-1) / row['max_tag_count']
    # write to file
    out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')

Divide an array into multiple (individual) arrays based on a bin size in Python

I am reading a file that contains values like this:
-0.68285 -6.919616
-0.7876 -14.521115
-0.64072 -43.428411
-0.05368 -11.561341
-0.43144 -34.768892
-0.23268 -10.793603
-0.22216 -50.341101
-0.41152 -90.083377
-0.01288 -84.265557
-0.3524 -24.253145
How do I split this into individual arrays based on the value in column 1 with a bin width of 0.1?
I want my output to be something like this:
array1=[[-0.05368, -11.561341],[-0.01288, -84.265557]]
array2=[[-0.23268, -10.79360] ,[-0.22216, -50.341101]]
array3=[[-0.3524, -24.253145]]
array4=[[-0.43144, -34.768892], [-0.41152, -90.083377]]
array5=[[-0.68285, -6.919616],[-0.64072, -43.428411]]
array6=[[-0.7876, -14.521115]]
Here's a simple solution using Python's round function and dictionary class:
lines = open('some_file.txt').readlines()
dictionary = {}
for line in lines:
    nums = line[:-1].split(' ')      # remove the newline and split the columns
    k = round(float(nums[0]), 1)     # round the first column to get the bucket
    if k not in dictionary:
        dictionary[k] = []           # add an empty bucket
    dictionary[k].append([float(nums[0]), float(nums[1])])  # add the numbers to the bucket
print(dictionary)
To get a particular bucket (like -0.3), just do:
x = dictionary[-0.3]
or
x = dictionary.get(-0.3, [])
if you just want an empty list returned for empty buckets.
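To get something close to the array1 ... array6 layout in the question, a possible follow-up (the variable names here are just illustrative) is:

arrays = [dictionary[k] for k in sorted(dictionary, reverse=True)]
for i, bucket in enumerate(arrays, 1):
    print('array{} = {}'.format(i, bucket))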
