Run for loop on 2 variables from dataframe column

I need to run a for loop over two columns of a dataframe and return a dict. But when I use zip, the loop only sees part of the string it is iterating over.
import pandas as pd
def split(owner, cost):
    split_bill = {'ads': 0, 'qaweb': 0, 'ovt': 0, 'cs': 0, 'edu': 0, 'xms': 0, 'cc': 0}
    for owner_in, cost in zip(owner, cost):  # need to know what type of loop can work here
        split_bill[owner_in] += cost
        continue
    return split_bill
data = {
    "owner": ['ads', 'cs', 'edu'],
    "cost": [2.3, 4.30, 45]
}
df = pd.DataFrame(data)
df['metric'] = df.apply(lambda x: split(x.owner, {x.cost}), axis=1)
Expected output:
df['metric'] =
metric
{'ads': 2.3, 'qaweb': 0, 'ovt': 0, 'cs': 0, 'edu': 0, 'xms': 0, 'cc': 0}
{'ads': 2.3, 'qaweb': 0, 'ovt': 0, 'cs': 4.3, 'edu': 0, 'xms': 0, 'cc': 0}
{'ads': 2.3, 'qaweb': 0, 'ovt': 0, 'cs': 0, 'edu': 45, 'xms': 0, 'cc': 0}
Inside the for loop, owner_in only takes the value a, when it should take the full string ads.
Can you help with what type of loop could work?

zip combines several iterables into a sequence of tuples. The length of the result is determined by the shortest input.
In your example, owner is the string ads and cost is a set containing one float. In zip(owner, cost), the string is treated as a sequence of three characters, so the result has length 1, determined by the set with its single float value. That is why owner_in is only a.
I guess you may want to do df.groupby('owner')['cost'].apply(sum).
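To make the truncation concrete, here is a minimal sketch in plain Python (no pandas needed). The helper name split_row and the KEYS list are hypothetical, and the sketch assumes each row carries exactly one owner and one cost. It does not attempt to reproduce the cumulative-looking rows in the expected output.

```python
# zip pairs items positionally and stops at the shortest input.
# The string 'ads' is iterated character by character; the set holds one value.
pairs = list(zip('ads', {2.3}))
print(pairs)  # [('a', 2.3)]

KEYS = ['ads', 'qaweb', 'ovt', 'cs', 'edu', 'xms', 'cc']

def split_row(owner, cost):
    """Hypothetical per-row helper: owner is one key, cost is one number."""
    split_bill = dict.fromkeys(KEYS, 0)
    split_bill[owner] += cost  # a single (owner, cost) pair needs no loop
    return split_bill

rows = [('ads', 2.3), ('cs', 4.30), ('edu', 45)]
metrics = [split_row(owner, cost) for owner, cost in rows]
print(metrics[0])
```

With a dataframe this would presumably be called as df.apply(lambda x: split_row(x.owner, x.cost), axis=1), passing the cost itself rather than wrapping it in a set.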

Related

Convert complex list to flat list

I have a long list made up of numpy arrays and integers; below is an example:
[array([[2218.67288865]]), array([[1736.90215229]]), array([[1255.13141592]]), array([[773.36067956]]), array([[291.58994319]]), 0, 0, 0, 0, 0, 0, 0, 0, 0]
and I'd like to convert it to a regular list like so:
[2218.67288865, 1736.90215229, 1255.13141592, 773.36067956, 291.58994319, 0, 0, 0, 0, 0, 0, 0, 0, 0]
How can I do that efficiently?

You can use a generator for flattening the nested list:
def convert(obj):
    try:
        for item in obj:
            yield from convert(item)
    except TypeError:
        yield obj
result = list(convert(data))

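list(itertools.chain.from_iterable(itertools.chain.from_iterable(...))) should work for removing 2 levels of nesting: just add or remove copies of itertools.chain.from_iterable(...) as needed. Note that every element must be iterable at each level, so the bare integers in the example would need to be wrapped first.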
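As a rough sketch of the chained approach, the example below uses nested Python lists as a stand-in for the 1x1 numpy arrays, so it runs without numpy. The helper as_iterable is hypothetical; it wraps bare scalars so that itertools.chain.from_iterable does not fail on them.

```python
from itertools import chain

def as_iterable(x):
    """Hypothetical helper: leave lists/tuples alone, wrap scalars in a list."""
    return x if isinstance(x, (list, tuple)) else [x]

# Nested lists stand in for array([[...]]) entries; the zeros are bare ints.
nested = [[[2218.67288865]], [[1736.90215229]], 0, 0]

# First pass removes the outer level, second pass removes the inner level.
level1 = chain.from_iterable(as_iterable(x) for x in nested)
flat = list(chain.from_iterable(as_iterable(x) for x in level1))
print(flat)  # [2218.67288865, 1736.90215229, 0, 0]
```

Each additional level of nesting needs one more pass, which is why the recursive generator may be more convenient when the depth varies.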

Here the simplest seems to also be the fastest:
from timeit import timeit
from numpy import array
x = [array([[2218.67288865]]), array([[1736.90215229]]), array([[1255.13141592]]), array([[773.36067956]]), array([[291.58994319]]), 0, 0, 0, 0, 0, 0, 0, 0, 0]
[y if y.__class__==int else y.item(0) for y in x]
# [2218.67288865, 1736.90215229, 1255.13141592, 773.36067956, 291.58994319, 0, 0, 0, 0, 0, 0, 0, 0, 0]
timeit(lambda:[y if y.__class__==int else y.item(0) for y in x])
# 2.198630048893392

You can stick to numpy by using np.ravel:
np.hstack([np.ravel(i) for i in l]).tolist()
Output:
[2218.67288865,
1736.90215229,
1255.13141592,
773.36067956,
291.58994319,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0]

Numpy array adds extra list

I have a function that annotates some genomic variants with multiple items (details not important). For every variant, it stores all the information in a list. All variant lists are added to a list which ultimately looks something like this:
[['chr9', 11849076, 'chr9', 12028629, 'DEL', 0, 179553, 0, 0, '', '',
0, '', 0, 0, 13, 13], ['chr3', 5577129, 'chr3', 5708227, 'DUP', 0,
131098, 0, 0, '', '', 0, '', 0, 0, 13, 13],...]
This big list is returned by the annotator function and then I would like to convert it to a numpy array which goes fine:
annotated_tn = np.array(annotated_tn, dtype="object")
However, the result is not as expected:
array([list(['chr9', 11849076, 'chr9', 12028629, 'DEL', 0, 179553, 0, 0, '', '', 0, '', 0, 0, 13, 13]),
list(['chr3', 5577129, 'chr3', 5708227, 'DUP', 0, 131098, 0, 0, '', '', 0, '', 0, 0, 13, 13]),... ],dtype=object)
For some reason it adds an extra list() to all the lists in the array making them not indexable:
annotated_tn[:,1]
IndexError: too many indices for array
I believe the output should look like this:
array([['chr9', 11849076, 'chr9', 12028629, 'DEL', 0, 179553, 0, 0, '', '', 0, '', 0, 0, 13, 13], ['chr3', 5577129, 'chr3', 5708227, 'DUP', 0, 131098, 0, 0, '', '', 0, '', 0, 0, 13, 13],..], dtype=object)
Any idea what is happening here?

My best guess is that there's a row in your data that doesn't have the same number of columns as the other rows.
If they were all the same length, then you're right and your code should work. But as soon as one row has a different length, you get exactly the result you're seeing.
Since you only posted 2 rows of your data and both have 17 columns, I can't say this for sure, but I'm pretty sure this is your problem.
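One way to test that guess before converting to an array is to count the row lengths. The sketch below uses shortened stand-in rows (the real data has 17 columns), with the third row deliberately one item short; only the variable name annotated_tn comes from the question.

```python
from collections import Counter

# Stand-in rows; the third one is deliberately one item short.
annotated_tn = [
    ['chr9', 11849076, 'chr9', 12028629, 'DEL', 0, 179553],
    ['chr3', 5577129, 'chr3', 5708227, 'DUP', 0, 131098],
    ['chr1', 100, 'chr1', 200, 'DEL', 0],
]

# How many rows have each length?
length_counts = Counter(len(row) for row in annotated_tn)

# Treat the most common length as the expected one and list the outliers.
expected_len = length_counts.most_common(1)[0][0]
bad_rows = [(i, len(row)) for i, row in enumerate(annotated_tn)
            if len(row) != expected_len]

print(length_counts)
print(bad_rows)  # [(2, 6)]
```

If bad_rows comes back empty on the real data, the ragged-row explanation would not apply and the cause lies elsewhere.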

Why am I getting `list index out of range` error?

I wrote code to look up the synonyms of the words in a list, locations. Since a word can have multiple meanings, I used another list, meaning, to hold the index of the meaning I want for each word. I then calculate similarities between the words based on the synsets found and save them in a file.
from nltk.corpus import wordnet as wn
from textblob import Word
from textblob.wordnet import Synset
locations = ['access', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'land', 'layer', 'leisure', 'man', 'market', 'marketplace', 'height', 'name', 'natural', 'exit', 'way', 'park', 'parking', 'place', 'worship', 'playground', 'police', 'station', 'post', 'mail', 'power', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'tourism', 'unknown', 'vehicle', 'vending', 'machine', 'village', 'wall', 'waste', 'waterway'];
meaning = [0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 5, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 11, 0, 1, 0, 0, 3, 0, 4, 0, 0, 3, 4, 0, 0, 0, 10, 0, 9, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ncols = len(locations)
nrows = len(locations)
matrix = [[0] * ncols for i in range(nrows)]
for i in range(0,len(locations)):
    word1 = Word(locations[i])
    SS1 = word1.synsets[meaning[i]]
    for j in range(0,len(locations)):
        word2 = Word(locations[j])
        SS2 = word1.synsets[meaning[j]]
        matrix[i][j] = SS1.path_similarity(SS2)
f = open('Similarities.csv', 'w')
print(matrix, file=f)
But the code gives the following error:
SS2 = word1.synsets[meaning[j]]
IndexError: list index out of range
When I printed out the values of i and j, I found that it prints till i=0 and j=36. That means that when j=36, the error arises. The word in the list at index 36 is man, and the value at index 36 of meaning is 11.
So, why is this error occurring and how do I fix it?
EDIT: The mistake was in SS2 = word1.synsets[meaning[j]]. It should have been SS2 = word2.synsets[meaning[j]]. Sorry.

len(word1.synsets) returns 8 and type(word1.synsets) returns list, so it's a list with indexes 0 to 7.
Your list meaning contains 11 at index 36, so when your loop reaches word1.synsets[11] you get the index out of range error.
Like Jose said, 7 is the largest value from meaning that can be used to index word1.synsets here.
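The failure can be reproduced without nltk or textblob. In the sketch below, the synsets dict is stand-in data: 'access' gets 8 entries as reported above, while the 12 entries for 'man' are an assumption made only so that index 11 is valid for it. The helper pick_sense is hypothetical.

```python
# Stand-in for Word(w).synsets; the sizes are illustrative, not real WordNet data.
synsets = {
    "access": [f"access.{k}" for k in range(8)],   # indexes 0..7
    "man": [f"man.{k}" for k in range(12)],        # assumed large enough for index 11
}

def pick_sense(word, index):
    """Hypothetical helper: index into the sense list of the given word."""
    senses = synsets[word]
    if not 0 <= index < len(senses):
        raise IndexError(f"{word!r} has {len(senses)} senses, index {index} requested")
    return senses[index]

# Correct: the index meant for 'man' is applied to 'man'.
print(pick_sense("man", 11))

# The bug: the index meant for 'man' is applied to 'access' (word1 instead of word2).
try:
    pick_sense("access", 11)
except IndexError as exc:
    print("IndexError:", exc)
```

Checking the index against the length of the list actually being indexed, as pick_sense does, makes this kind of mix-up show up with a readable message.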

count objects created in django application in past X days, for each day

I have the following unsorted dict (dates are keys):
{"23-09-2014": 0, "11-10-2014": 0, "30-09-2014": 0, "26-09-2014": 0,
"03-10-2014": 0, "19-10-2014": 0, "15-10-2014": 0, "22-09-2014": 0,
"17-10-2014": 0, "29-09-2014": 0, "13-10-2014": 0, "16-10-2014": 0,
"12-10-2014": 0, "25-09-2014": 0, "14-10-2014": 0, "08-10-2014": 0,
"02-10-2014": 0, "09-10-2014": 0, "18-10-2014": 0, "24-09-2014": 0,
"28-09-2014": 0, "10-10-2014": 0, "21-10-2014": 0, "20-10-2014": 0,
"06-10-2014": 0, "04-10-2014": 0, "27-09-2014": 0, "05-10-2014": 0,
"01-10-2014": 0, "07-10-2014": 0}
I am trying to sort it from oldest to newest.
I've tried this code:
mydict = OrderedDict(sorted(mydict.items(), key=lambda t: t[0], reverse=True))
to sort it, and it almost worked. It produced a sorted dict, but it ignored the months:
{"01-10-2014": 0, "02-10-2014": 0, "03-10-2014": 0, "04-10-2014": 0,
"05-10-2014": 0, "06-10-2014": 0, "07-10-2014": 0, "08-10-2014": 0,
"09-10-2014": 0, "10-10-2014": 0, "11-10-2014": 0, "12-10-2014": 0,
"13-10-2014": 0, "14-10-2014": 0, "15-10-2014": 0, "16-10-2014": 0,
"17-10-2014": 0, "18-10-2014": 0, "19-10-2014": 0, "20-10-2014": 0,
"21-10-2014": 0, "22-09-2014": 0, "23-09-2014": 0, "24-09-2014": 0,
"25-09-2014": 0, "26-09-2014": 0, "27-09-2014": 0, "28-09-2014": 0,
"29-09-2014": 0, "30-09-2014": 0}
How can I fix this?
EDIT:
I need this to count objects created in django application in past X days, for each day.
event_chart = {}
date_list = [datetime.datetime.today() - datetime.timedelta(days=x) for x in range(0, 30)]
for date in date_list:
    event_chart[formats.date_format(date, "SHORT_DATE_FORMAT")] = Event.objects.filter(project=project_name, created=date).count()
event_chart = OrderedDict(sorted(event_chart.items(), key=lambda t: t[0]))
return HttpResponse(json.dumps(event_chart))

You can use the datetime module to parse the strings into actual dates and sort on those (drop reverse=True to get oldest first):
>>> from datetime import datetime
>>> sorted(mydict.items(), key=lambda t: datetime.strptime(t[0], '%d-%m-%Y'), reverse=True)
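A minimal sketch of the oldest-to-newest case, using a five-entry subset of the dict from the question. The helper name parse_key is hypothetical; the rest is standard library.

```python
from collections import OrderedDict
from datetime import datetime

# A small subset of the original dict, in arbitrary order.
mydict = {"23-09-2014": 0, "11-10-2014": 0, "30-09-2014": 0,
          "01-10-2014": 0, "22-09-2014": 0}

def parse_key(item):
    """Sort key: parse the 'dd-mm-YYYY' string key into a datetime."""
    return datetime.strptime(item[0], "%d-%m-%Y")

# No reverse=True, so the oldest date comes first.
ordered = OrderedDict(sorted(mydict.items(), key=parse_key))
print(list(ordered))
# ['22-09-2014', '23-09-2014', '30-09-2014', '01-10-2014', '11-10-2014']
```

Sorting on the raw strings compares the day digits first, which is why September dates ended up after October ones in the original attempt.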

If you want to create a JSON response in the format {"22-09-2014": 0, "23-09-2014": 0, "localized date": count_for_that_date}, so that the oldest dates appear first in the output, you could make event_chart an OrderedDict:
import datetime as DT
from collections import OrderedDict
event_chart = OrderedDict()
today = DT.date.today() # use DT.datetime.combine(date, DT.time()) if needed
for day in range(29, -1, -1): # last 30 days
    date = today - DT.timedelta(days=day)
    localized_date = formats.date_format(date, "SHORT_DATE_FORMAT")
    day_count = Event.objects.filter(project=name, created=date).count()
    event_chart[localized_date] = day_count
return HttpResponse(json.dumps(event_chart))

All instances of maximum

I have a function that gets a set of data from a server, sorts it and displays it:
def load_data(dateStr):
    data = get_date(dateStr).splitlines()
    result = []
    for c in data:
        a = c.split(',')
        time = a[0]
        temp = float(a[1])
        solar = float(a[2])
        kwH = a[3:]
        i = 0
        while i < len(kwH):  # convert each power reading to int
            kwH[i] = int(kwH[i])
            i = i+1
        result.append((time, temp, solar, tuple(kwH)))
    return result
This is what the function returns when you enter a particular date (only 3 entries out of a long list); the first item in each entry is the time, the second is the temperature.
>>> load_data('20-01-2014')
[('05:00', 19.9, 0.0, (0, 0, 0, 0, 0, 0, 0, 18, 34)), ('05:01', 19.9, 0.0, (0, 0, 0, 0, 0, 0, 0, 20, 26)), ('05:02', 19.9, 0.0, (0, 0, 0, 0, 0, 0, 0, 17, 35))
I need to write a function to find the maximum temperature of a date, and show all of the times in the day that the maximum occurred. Something like this:
>>> data = load_data('07-10-2011')
>>> max_temp(data)
(18.9, ['13:08', '13:09', '13:10'])
How would I go about this? Or can you point me to anywhere that might have answers?

This is one way to do it (this loops over the data twice):
>>> data = [('05:00', 19.9, 0.0, (0, 0, 0, 0, 0, 0, 0, 18, 34)), ('05:01', 19.9, 0.0, (0, 0, 0, 0, 0, 0, 0, 20, 26)), ('05:02', 19.9, 0.0, (0, 0, 0, 0, 0, 0, 0, 17, 35))]
>>> max_temp = max(data, key=lambda x: x[1])[1]
>>> max_temp
19.9
>>> result = [item for item in data if item[1] == max_temp]
>>> result
[('05:00', 19.9, 0.0, (0, 0, 0, 0, 0, 0, 0, 18, 34)), ('05:01', 19.9, 0.0, (0, 0, 0, 0, 0, 0, 0, 20, 26)), ('05:02', 19.9, 0.0, (0, 0, 0, 0, 0, 0, 0, 17, 35))]

The most efficient way to get all matching times for the maximum temperature is to simply loop over the values and track the maximum found so far:
def max_temp(data):
    maximum = float('-inf')
    times = []
    for entry in data:
        time, temp = entry[:2]
        if temp == maximum:
            times.append(time)
        elif temp > maximum:
            maximum = temp
            times = [time]
    return maximum, times
This loops over the data just once.
The convenient way (which is probably going to be close in performance anyway) is to use the max() function to find the maximum temperature first, then a list comprehension to return all times with that temperature:
def max_temp(data):
    maximum = max(data, key=lambda e: e[1])[1]
    return maximum, [e[0] for e in data if e[1] == maximum]
This loops twice over the data, but the max() loop is implemented mostly in C code.
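For comparison, the sketch below runs both approaches on a small made-up sample with a three-way tie. The function names max_temp_single_pass and max_temp_two_pass and the sample values are illustrative only. One difference that shows up is the empty-input case: the single-pass version returns (-inf, []) while max() raises ValueError.

```python
def max_temp_single_pass(data):
    """Track the running maximum and the times at which it occurred."""
    maximum = float('-inf')
    times = []
    for time, temp, *_ in data:
        if temp == maximum:
            times.append(time)
        elif temp > maximum:
            maximum, times = temp, [time]
    return maximum, times

def max_temp_two_pass(data):
    """Find the maximum first, then collect every time that matches it."""
    maximum = max(e[1] for e in data)
    return maximum, [e[0] for e in data if e[1] == maximum]

# Made-up readings: (time, temp, solar, power tuple).
sample = [
    ('13:07', 18.8, 0.0, ()),
    ('13:08', 18.9, 0.0, ()),
    ('13:09', 18.9, 0.0, ()),
    ('13:10', 18.9, 0.0, ()),
    ('13:11', 18.7, 0.0, ()),
]

print(max_temp_single_pass(sample))  # (18.9, ['13:08', '13:09', '13:10'])
print(max_temp_two_pass(sample))     # (18.9, ['13:08', '13:09', '13:10'])
print(max_temp_single_pass([]))      # (-inf, [])
```

Which empty-input behaviour is preferable depends on the caller; a day with no readings may be better reported as an error than as negative infinity.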

def max_temp(data):
    maxt = max([d[1] for d in data])
    return (maxt, [d[0] for d in data if d[1] == maxt])
