Hi, I am trying to learn higher-order functions (HOFs) in Python. I understand their simple uses for reduce, map and filter. But here I need to create tuples of the stations the bikes came from and went to, with the number of events at each station as the second value. The commented-out code is this done with normal functions (I left it as a dictionary, but that's easy to convert to tuples).
But I've been racking my brain for a while and can't get it to work using HOFs. My idea right now is to somehow use map to go through the csvReader and add to the dictionary. For some reason I can't understand what to do here. Any help understanding how to use these functions properly would be appreciated.
import csv

#def stations(reader):
#    Stations = {}
#    for line in reader:
#        startstation = line['start_station_name']
#        endstation = line['end_station_name']
#        Stations[startstation] = Stations.get(startstation, 0) + 1
#        Stations[endstation] = Stations.get(endstation, 0) + 1
#    return Stations

Stations = {}

def line_list(x):
    l = x['start_station_name']
    l2 = x['end_station_name']
    Stations[l] = Stations.get(l, 0) + 1
    Stations[l2] = Stations.get(l2, 0) + 1
    return dict(l,l2)

with open('citibike.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    #for line in reader:
    output = list(map(line_list,reader))
    #print(listmap)
    #output1[:10]

print(output)
list(map(...)) creates a list of results, not a dictionary.
If you want to fill in a dictionary, you can use reduce(), using the dictionary as the accumulator.
import csv
from functools import reduce

def line_list(Stations, x):
    l = x['start_station_name']
    l2 = x['end_station_name']
    Stations[l] = Stations.get(l, 0) + 1
    Stations[l2] = Stations.get(l2, 0) + 1
    return Stations

with open('citibike.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    result = reduce(line_list, reader, {})

print(result)
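Since the original goal was a collection of (station, count) tuples, the accumulated dictionary converts directly; a minimal sketch:

# Turn the dict into (station, count) tuples, highest counts first.
station_counts = sorted(result.items(), key=lambda kv: kv[1], reverse=True)
print(station_counts[:10])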
Assume this is a sample of my data: [dataframe screenshot]
The entire dataframe is stored in a csv file (dataframe.csv) that is 40 GB, so I can't open all of it at once.
I am hoping to find the 25 most dominant names for all genders. My instinct is to create a for loop that runs through the file (because I can't open it all at once), and a Python dictionary that holds the counter for each name (which I will increment as I go through the data).
To be honest, I'm confused about where to even start with this (how to create the dictionary, since to_dict() does not appear to do what I'm looking for). And also, is this even a good solution? Is there a more efficient way someone can think of?
SUMMARY -- sorry if the question is a bit long:
The csv file storing the data is very big and I can't open it at once, but I'd like to find the top 25 dominant names in the data. Any ideas on what to do and how to do it?
I'd appreciate any help I can get! :)
Thanks for your interesting task! I've implemented a pure numpy + pandas solution. It uses a sorted array to keep names and counts, so the algorithm should be around O(n log n) complexity.
I didn't find any hash table in numpy; a hash table would definitely be faster (O(n)). Hence I used numpy's existing sorting/inserting routines.
I also used .read_csv() from pandas with the iterator = True, chunksize = 1 << 24 params; this allows reading the file in chunks and producing pandas dataframes of fixed size from each chunk.
Note! On the first runs (until the program is debugged), set limit_chunks (the number of chunks to process) in the code to a small value (like 5). This is to check that the whole program runs correctly on partial data.
The program needs the one-time command python -m pip install pandas numpy to install these 2 packages if you don't have them.
Progress is printed once in a while: total megabytes done plus speed.
The result will be printed to the console and saved to the fname_res file; all constants configuring the script are placed at the beginning of the script. The topk constant controls how many top names will be output to the file/console.
I'm interested in how fast my solution is. If it is too slow, maybe I'll devote some time to writing a nice HashTable class using pure numpy.
You can also try running the code online here.
import os, math, time, sys
# Needs: python -m pip install pandas numpy
import pandas as pd, numpy as np

fname = 'test.csv'
fname_res = 'test.res'
chunk_size = 1 << 24
limit_chunks = None # Number of chunks to process, set to None to process whole file
all_genders = ['Male', 'Female']
topk = 1000 # How many top names to output
progress_step = 1 << 23 # in bytes

fsize = os.path.getsize(fname)

#el_man = enlighten.get_manager() as el_man
#el_ctr = el_man.counter(color = 'green', total = math.ceil(fsize / 2 ** 20), unit = 'MiB', leave = False)

tables = {g : {
    'vals': np.full([1], chr(0x10FFFF), dtype = np.str_),
    'cnts': np.zeros([1], dtype = np.int64),
} for g in all_genders}

tb = time.time()

def Progress(
    done, total = min([fsize] + ([chunk_size * limit_chunks] if limit_chunks is not None else [])),
    cfg = {'progressed': 0, 'done': False},
):
    if not cfg['done'] and (done - cfg['progressed'] >= progress_step or done >= total):
        if done < total:
            while cfg['progressed'] + progress_step <= done:
                cfg['progressed'] += progress_step
        else:
            cfg['progressed'] = total
        sys.stdout.write(
            f'{str(round(cfg["progressed"] / 2 ** 20)).rjust(5)} MiB of ' +
            f'{str(round(total / 2 ** 20)).rjust(5)} MiB ' +
            f'speed {round(cfg["progressed"] / 2 ** 20 / (time.time() - tb), 4)} MiB/sec\n'
        )
        sys.stdout.flush()
        if done >= total:
            cfg['done'] = True

with open(fname, 'rb', buffering = 1 << 26) as f:
    for i, df in enumerate(pd.read_csv(f, iterator = True, chunksize = chunk_size)):
        if limit_chunks is not None and i >= limit_chunks:
            break
        if i == 0:
            name_col = df.columns.get_loc('First Name')
            gender_col = df.columns.get_loc('Gender')
        names = np.array(df.iloc[:, name_col]).astype('str')
        genders = np.array(df.iloc[:, gender_col]).astype('str')
        for g in all_genders:
            ctab = tables[g]
            gnames = names[genders == g]
            vals, cnts = np.unique(gnames, return_counts = True)
            if vals.size == 0:
                continue
            if ctab['vals'].dtype.itemsize < names.dtype.itemsize:
                ctab['vals'] = ctab['vals'].astype(names.dtype)
            poss = np.searchsorted(ctab['vals'], vals)
            exist = ctab['vals'][poss] == vals
            ctab['cnts'][poss[exist]] += cnts[exist]
            nexist = np.flatnonzero(exist == False)
            ctab['vals'] = np.insert(ctab['vals'], poss[nexist], vals[nexist])
            ctab['cnts'] = np.insert(ctab['cnts'], poss[nexist], cnts[nexist])
        Progress(f.tell())
    Progress(fsize)

with open(fname_res, 'w', encoding = 'utf-8') as f:
    for g in all_genders:
        f.write(f'{g}:\n\n')
        print(g, '\n')
        order = np.flip(np.argsort(tables[g]['cnts']))[:topk]
        snames, scnts = tables[g]['vals'][order], tables[g]['cnts'][order]
        if snames.size > 0:
            for n, c in zip(np.nditer(snames), np.nditer(scnts)):
                n, c = str(n), int(c)
                if c == 0:
                    continue
                f.write(f'{c} {n}\n')
                print(c, n.encode('ascii', 'replace').decode('ascii'))
        f.write(f'\n')
        print()
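The core trick in the chunk loop is the searchsorted merge: counts for names already in the sorted array are added in place, and unseen names are inserted at their sorted positions. A stripped-down illustration with hypothetical names, using the same max-character sentinel idea as the code above:

import numpy as np

vals = np.array(['Anna', 'Zoe', chr(0x10FFFF)])  # sorted names + sentinel keeps searchsorted in bounds
cnts = np.array([3, 1, 0], dtype=np.int64)

new_vals = np.array(['Anna', 'Mia'])             # np.unique output is already sorted
new_cnts = np.array([2, 5], dtype=np.int64)
poss = np.searchsorted(vals, new_vals)           # insertion points in the sorted array
exist = vals[poss] == new_vals                   # which names are already present
cnts[poss[exist]] += new_cnts[exist]             # 'Anna': 3 + 2
vals = np.insert(vals, poss[~exist], new_vals[~exist])  # 'Mia' slots in, order preserved
cnts = np.insert(cnts, poss[~exist], new_cnts[~exist])
print(list(zip(vals, cnts))[:-1])                # [('Anna', 5), ('Mia', 5), ('Zoe', 1)]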
import pandas as pd
df = pd.read_csv("sample_data.csv")
print(df['First Name'].value_counts())
The second line will convert your csv into a pandas dataframe and the third line will print the occurrences of each name.
https://dfrieds.com/data-analysis/value-counts-python-pandas.html
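Given the 40 GB file from the question, the same value_counts idea can be applied per chunk; a hedged sketch, assuming the column is named 'First Name' as in the screenshot:

import pandas as pd

# Accumulate per-chunk counts so the 40 GB file is never fully in memory.
totals = None
for chunk in pd.read_csv("dataframe.csv", usecols=["First Name"], chunksize=1_000_000):
    counts = chunk["First Name"].value_counts()
    totals = counts if totals is None else totals.add(counts, fill_value=0)

print(totals.sort_values(ascending=False).head(25))  # top 25 names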
This doesn't seem to be a case where pandas is really going to be an advantage. But if you're committed to going down that route, change the read_csv chunksize parameter, then filter out the useless columns.
Perhaps consider a different set of tooling, such as a database, or even vanilla Python using a generator to populate a dict in the form of name:count; a sketch of that follows.
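A minimal sketch of the vanilla-Python route, assuming the file is dataframe.csv with a 'First Name' column as in the question:

import csv
from collections import Counter

counts = Counter()
with open('dataframe.csv', newline='') as f:
    # DictReader streams rows lazily, so the 40 GB file is never loaded at once.
    for row in csv.DictReader(f):
        counts[row['First Name']] += 1

for name, n in counts.most_common(25):  # top 25 names overall
    print(name, n)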
I am new to Python, and am struggling with a task that I assume is an extremely simple one for an experienced programmer.
I am trying to create a list of lists of coordinates for different lines. For instance:
list = [ [(x,y), (x,y), (x,y)], [Line 2 Coordinates], ....]
I have the following code:
masterlist_x = list(range(-5,6))
oneline = []
data = []
numberoflines = list(range(2))
i = 1
for i in numberoflines:
    slope = randint(-5,5)
    y_int = randint(-10,10)
    for element in masterlist_x:
        oneline.append((element,slope * element + y_int))
    data.append(oneline)
The variable that should hold the coordinates of one line (oneline) ends up holding two lines: [output screenshot]
I know this is an issue with the outer looping mechanism, but I am not sure how to proceed.
Any and all help is much appreciated. Thank you very much!
@khuynh is right, you simply had the oneline = [] in the wrong place, so you put all the coords in one list.
Also, you have a couple of unnecessary things in your code:
you don't need to list() the range(), you can just iterate it directly with for
you also don't need to declare i before the for loop, the loop does that itself
that i is not actually used, which is fine. The Python convention for unused variables is _
Fixed version:
from random import randint

masterlist_x = range(-5,6)
data = []
numberoflines = range(2)

for _ in numberoflines:
    oneline = []
    slope = randint(-5,5)
    y_int = randint(-10,10)
    for element in masterlist_x:
        oneline.append((element,slope * element + y_int))
    data.append(oneline)

print(data)
It's also online where you can run it: https://repl.it/repls/GreedyRuralProduct
I suspect the whole thing could also be made with much less code, and in a simpler fashion, as a list comprehension ..
UPDATE: the inner loop is indeed very suitable for a list comprehension. Maybe the outer loop could be made into one as well, so that the whole thing becomes two nested list comprehensions, but I only got confused when I tried that (a sketch of one possibility follows the code below). But this is clear:
from random import randint

masterlist_x = range(-5,6)
data = []
numberoflines = range(2)

for _ in numberoflines:
    slope = randint(-5,5)
    y_int = randint(-10,10)
    oneline = [(element, slope * element + y_int)
               for element in masterlist_x]
    data.append(oneline)

print(data)
Again on repl.it too: https://repl.it/repls/SoupyIllustriousApplicationsoftware
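The nested-comprehension version mentioned in the update might look like this; a sketch, where an inner generator draws one (slope, y_int) pair per line:

from random import randint

masterlist_x = range(-5, 6)
# Outer comprehension: one list per line; inner: the coords of that line.
data = [
    [(x, slope * x + y_int) for x in masterlist_x]
    for slope, y_int in ((randint(-5, 5), randint(-10, 10)) for _ in range(2))
]
print(data)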
I am still new to Python but am using it for my linguistics research.
So I am doing some research into toponyms, and I got a list of input data from a topographic institution, which looks like the following:
Official_Name, tab, Dialect_Name, tab, Administrative_district, Topographic_district, Y_coordinates, X_coordinates, Longitude, Latitude.
So, I defined a class:
class MacroTop:
    def __init__(self, Official_Name, Dialect_Name, Adm_District, Topo_District, Y, X, Long, Lat):
        self.Official_Name = Official_Name
        self.Dialect_Name = Dialect_Name
        self.Adm_District = Adm_District
        self.Topo_District = Topo_District
        self.Y = Y
        self.X = X
        self.Long = Long
        self.Lat = Lat
So, with open(), I wanted to load my .txt file with the data and read it into the class using a loop, but it did not work.
The result I want is to be able to access a feature of the class, say Dialect_Name, and look through all the entries of that feature. I can do that just in the loop, but I wanted to define a class so I would be able to do more manipulation afterwards.
My loop:
with open("locLuxAll.txt", "r") as topo_list:
lines = topo_list.readlines()
for line in lines:
line = line.split('\t')
print(line)
print(line[0]) # This would access all the data that is characterized as Official_Name
I tried to make another loop:
for i in range(0-len(lines)):
    lines[i] = MacroTop(str(line[0]), str(line[1]), str(line[2]), str(line[3]), str(line[4]), str(line[5]), str(line[6]), str(line[7]))
But that did not seem to work.
This line fails:
for i in range(0-len(lines)):
You're trying to loop through a negative number (0 - len(lines); you probably meant range(0, len(lines)) with a comma), so the output will be an empty list:
In [11]: [i for i in range(-200)]
Out[11]: []
EDIT:
Your code seems unreadable to me: you have for i in range(...), but inside the loop body you index the line variable -- where is it from? First of all, I'd not write back to the lines list, as it comes from readlines. Create a new list for that, and you don't need the i variable; the lines will be kept in order anyway. Note that each line still has to be split on tabs first:
class_lines = []
for line in lines:
    line = line.strip().split('\t')  # split the raw line into its tab-separated fields
    class_lines.append(MacroTop(line[0], line[1], line[2], line[3], line[4], line[5], line[6], line[7]))
Or even with a list comprehension:
class_lines = [MacroTop(*line.strip().split('\t')) for line in lines]
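With class_lines built, the original goal (scanning one attribute across all entries) becomes a simple comprehension; a minimal sketch:

# Every Dialect_Name in file order.
dialect_names = [entry.Dialect_Name for entry in class_lines]
print(dialect_names[:10])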
I have the following case:
I need to get the time of a feature in a csv file and compare it with the time of the pictures taken by someone.
Then I need to find 2 (or fewer) matches: I will assign the first two pictures I find in a 2-minute interval from the time of the feature to that feature.
I managed to create two dictionaries with the details: feature_hours contains the id and time of each feature; photo_hours contains the photo_path and time of each photo.
sorted_feature and sorted_photo are two lists holding the sorted items of the two dictionaries.
The problem is that in the output csv file I only have 84 rows completed and some are blank. The feature csv file has 199 features. I think I incremented j too often, but I need a clear look from a pro, because I cannot figure it out.
Here is the code:
j=1
sheet1.write(0,71,"id")
sheet1.write(0,72,"feature_time")
sheet1.write(0,73,"Picture1")
sheet1.write(0,74,"Picture_time")
sheet1.write(0,75,"Picture2")
sheet1.write(0,76,"Picture_time")

def write_first_picture():
    sheet1.write(j,71,feature_time[0])
    sheet1.write(j,72,feature_time[1])
    sheet1.write(j,73,photo_time[0])
    sheet1.write(j,74,photo_time[1])

def write_second_picture():
    sheet1.write(j-1,75,photo_time[0])
    sheet1.write(j-1,76,photo_time[1])

def write_pictures():
    if i==1:
        write_first_picture()
    elif i==2:
        write_second_picture()

for feature_time in sorted_features:
    i=0
    for photo_time in sorted_photo:
        if i<2:
            if feature_time[1][0]==photo_time[1][0]:
                if feature_time[1][1]==photo_time[1][1]:
                    if feature_time[1][2]<photo_time[1][2]:
                        i=i+1
                        write_pictures()
                        j=j+1
                elif int(feature_time[1][1])+1==photo_time[1][1]:
                    i=i+1
                    write_pictures()
                    j=j+1
                elif int(feature_time[1][1])+2==photo_time[1][1]:
                    i=i+1
                    write_pictures()
                    j=j+1
            elif int(feature_time[1][0])+1==photo_time[1][0]:
                if feature_time[1][1]>=58:
                    if photo_time[1][1]<=02:
                        i = i+1
                        write_pictures()
                        j=j+1
Edit: Here are examples of the two lists:
Features list: [('-70', ('10', '27', '03')), ('-73', ('10', '29', '50'))]
Photo list: [('20160801_125133-1151969393.jpg', ('12', '52', '04')), ('20160801_125211342753906.jpg', ('12', '52', '16'))]
There is a csv module for Python to help load these files. You could sort the results to try to be more efficient and short-circuit your checks as well. I cannot really tell what the i and j variables are meant to represent, but I am pretty sure you can do something like the following:
import csv

def hmstoseconds(hhmmss):
    # 60 * 60 seconds in an hour, 60 seconds in a min, 1 second in a second
    # (the fields come in as strings, hence the int() conversion)
    return sum(int(x) * y for x, y in zip(hhmmss, (60*60, 60, 1)))

features = []
# features model looks like tuple(ID, (HH, MM, SS))
with open("features.csv") as f:
    reader = csv.reader(f)
    features = list(reader)

photos = []
# photos model looks like tuple(filename, (HH, MM, SS))
with open("photos.csv") as f:
    reader = csv.reader(f)
    photos = list(reader)

for feature in features:
    for photo in photos:
        # convert HH, MM, SS to seconds and find within 2 min (60s * 2)
        # .. todo:: instead of nested for loops, we could use filter()
        if abs(hmstoseconds(feature[1]) - hmstoseconds(photo[1])) <= (60 * 2):
            # the photo was taken within 2 min of the feature
            <here, write a photo>
In order to make this more maintainable/readable, you could also use namedtuples to better represent the data models:
import csv
from collections import namedtuple

# model definitions to help with readability/maintenance
# if the order of the indices changes or we add more fields, we just need to
# change them directly here instead of tracking the indexes everywhere
Feature = namedtuple("Feature", "id, date")
Photo = namedtuple("Photo", "file, date")

def hmstoseconds(hhmmss):
    # 60 * 60 seconds in an hour, 60 seconds in a min, 1 second in a second
    return sum(int(x) * y for x, y in zip(hhmmss, (60*60, 60, 1)))

def within_two_min(date1, date2):
    # convert HH, MM, SS to seconds for both dates and
    # return whether the absolute difference between them is within 2 min (60s * 2)
    return abs(hmstoseconds(date1) - hmstoseconds(date2)) <= 60 * 2

if __name__ == '__main__':
    # using main here means we avoid any nasty global variables
    # and only execute this code when this file is run directly
    features = []
    with open("features.csv") as f:
        reader = csv.reader(f)
        features = [Feature(*row) for row in reader]

    photos = []
    with open("photos.csv") as f:
        reader = csv.reader(f)
        photos = [Photo(*row) for row in reader]

    for feature in features:
        for photo in photos:
            # .. todo:: instead of nested for loops, we could use filter()
            if within_two_min(feature.date, photo.date):
                <here, write a photo>
Hopefully this gets you moving in the right direction. I don't fully understand what you were trying to do with i and j and the first/second "write_picture" stuff, but I hope you now better understand scope and access in Python.
I am trying to open a text file called state_meet.txt; the info is formatted as
gymnastics_school,participant_name,all-around_points_earned
see example:
Lanier City Gymnastics,Ben W.,55.301
Lanier City Gymnastics,Alex W.,54.801
Lanier City Gymnastics,Sky T.,51.2
Lanier City Gymnastics,William G.,47.3 etc..
and create functions to get info such as:
The total count of gymnasts that participated in the state meet.
The first place score.
The last place score.
The score differential between the first and last place.
The average score for all gymnasts.
The median score. (The median is the grade at the mid-point of a sorted list. If there is an even number of elements in the list, the median is the average of the 2 middle elements.)
The average of all scores above the median (not including the median).
The average of all scores below the median (not including the median).
The output should look as such
Summary of Data:
Number of gymnasts: 103
First place score: 143.94
Here's the code I have so far:
with open('state_meet.txt','r') as f:
    for line in f:
        allt = []
        values = line.split()
        print(values[3])

#first
max_val = max(values[3])
int(max_val)
print(max_val)

#last
min_val = min(values[3])
int(min_val)
print(min_val)

#Mean
total = sum(input_list)
length = len(input_list)
for nums in [input_list]:
    mean_val = total / length
float(mean_val)

#Median
sorted(input_list)
med_val = sorted(lst)
lstLen = len(lst)
index = (lstLen - 1) // 2
This is what I have so far, but my code is reading the third field as W.,55.301 instead of 55.301 and giving me errors.
You have a comma-separated values (csv) file. Use the csv module.
import csv

data = []
with open("state_meet.txt") as f:
    reader = csv.DictReader(f, fieldnames=["school", "participant", "score"])
    for line in reader:
        line["score"] = float(line["score"])  # scores arrive as strings
        data.append(line)

# first place
record = max(data, key=lambda d: d["score"])
best_score = record["score"]

# last place
record = min(data, key=lambda d: d["score"])
worst_score = record["score"]

# Mean score
mean = sum(d["score"] for d in data) / len(data)

# Median score
median = sorted(d["score"] for d in data)[(len(data) - 1) // 2]
csv.DictReader reads the lines of your csv file and automatically converts each one to a dictionary, keyed by whatever you like. This is perhaps easier to read than the collections.namedtuple suggestion in dokelung's answer, though namedtuple is equally valid. The key here is that we can keep the entire record around instead of throwing away everything but the score.
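From there, the question's "Summary of Data" block is just formatting; a minimal sketch using the names computed above:

print("Summary of Data:\n")
print("Number of gymnasts:", len(data))
print("First place score:", best_score)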
You should use split(',') instead of split().
Use values[2] to get the third item of the list values.
The list allt seems unused.
Note that no matter how many lines there are in state_meet.txt, values always ends up holding the last line's data.
I guess these are the things you want to do:
import collections

names = ["gymnastics_school", "participant_name", "all_around_points_earned"]
Data = collections.namedtuple("Data", names)

data = []
with open('state_meet.txt','r') as f:
    for line in f:
        line = line.strip()
        items = line.split(',')
        items[2] = float(items[2])
        data.append(Data(*items))

# max value
max_one = max(data, key=lambda d:d.all_around_points_earned)
print(max_one.all_around_points_earned)

# min value
min_one = min(data, key=lambda d:d.all_around_points_earned)
print(min_one.all_around_points_earned)

# mean value
total = sum(d.all_around_points_earned for d in data)
mean_val = total/len(data)
print(mean_val)

# median value
sorted_data = sorted(data, key=lambda d:d.all_around_points_earned)
if len(data)%2==0:
    a = sorted_data[len(data)//2].all_around_points_earned
    b = sorted_data[len(data)//2-1].all_around_points_earned
    median_val = (a+b)/2
else:
    median_val = sorted_data[(len(data)-1)//2].all_around_points_earned
print(median_val)
Let me explain more:
First, I define a namedtuple type called "Data" with the item names (gymnastics_school, ...). Then I can use d = Data('school', 'name', '50.0') to create a namedtuple d. We can easily fetch the item values by using . to access the attributes.
>>> names = ["gymnastics_school", "participant_name", "all_around_points_earned"]
>>> Data = collections.namedtuple("Data", names)
>>> d = Data('school', 'name', '50.0')
>>> d.gymnastics_school
'scholl'
>>> d.participant_name
'name'
>>> d.all_around_points_earned
'50.0'
Next, as we iterate the lines in the file object, we use the string method strip to remove blanks and newlines; it makes each line cleaner. Then split(',') splits the line on the delimiter ,.
Here, we convert with the function float because the third item in the split list items is a float (but it is a string in the file). Finally, we use the namedtuple Data to create an entry and append it to the list data.
Next, the builtin functions max and min help us find the max/min items. But since each thing in data is a namedtuple, we use a lambda function to fetch the points and pass it as the key for picking the max/min one.
Also, the function sum lets us compute the summation without a loop. Here, we have to extract the points to get their summation, so we pass a generator d.all_around_points_earned for d in data to sum.
I get the median value by sorting data and taking the middle one. When the number of data points is odd, we just pick the center number. But if it is even, we pick the middle "two" and compute their mean.
Hope my answer can help you!
split() by default splits on whitespace; did you mean
values = line.split(',')
to split on commas?
https://docs.python.org/2/library/stdtypes.html#str.split
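A quick illustration of the difference, using one sample line from the question:

line = "Lanier City Gymnastics,Ben W.,55.301"
print(line.split())     # ['Lanier', 'City', 'Gymnastics,Ben', 'W.,55.301']
print(line.split(','))  # ['Lanier City Gymnastics', 'Ben W.', '55.301']

This is exactly why values[3] held 'W.,55.301': whitespace splitting breaks the row in the wrong places.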