Pandas Mean and Merge Two DataFrames - python

I have two dataframes; I need to compute the column means of each and then merge the results on the original column names. An example is this:
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4],
    'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5],
    'petal_width': [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2]
})
df2 = pd.DataFrame({
    'sepal_length': [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4],
    'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5],
    'petal_width': [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5]
})
I get the means like this:
df_one = df.mean(axis=0).to_frame('Mean_One')
df_two = df2.mean(axis=0).to_frame('Mean_Two')
The question is how to merge these two dataframes (df_one and df_two), since there is no column containing the original measurement names (e.g., sepal_length, sepal_width, etc.). If there were, I could do this:
pd.merge(df_one, df_two, on='?')
Thanks for any help on this.

If I'm understanding correctly you're trying to join the two averaged dataframes to have a column of average measurements for each of the dataframes.
If that's the case then you can join using their indexes:
pd.merge(df_one, df_two, left_index=True, right_index=True)
Output:
              Mean_One  Mean_Two
sepal_length    4.9125    0.2375
sepal_width     3.3875    3.3875
petal_length    1.4500    1.4500
petal_width     0.2375    1.4500
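For completeness, since both result frames carry the same index labels (the original column names), `pd.concat` along the columns axis gives the same alignment; a minimal sketch using the question's data, with two columns shown for brevity:

```python
import pandas as pd

df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4],
})
df2 = pd.DataFrame({
    'sepal_length': [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4],
})
df_one = df.mean(axis=0).to_frame('Mean_One')
df_two = df2.mean(axis=0).to_frame('Mean_Two')

# concat along axis=1 aligns the two frames on their shared index labels
merged = pd.concat([df_one, df_two], axis=1)
print(merged)
```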

Related

I would like to create a new column based on conditions using .loc

I have the code:
to_test['averageRating'].unique()
array([5.8, 5.2, 5. , 6.5, 5.5, 7.3, 7.2, 4.2, 6.4, 7.1, 6.6, 5.4, 6.9,
6. , 6.1, 8.1, 6.3, 7.8, 3.9, 6.8, 6.2, 7.9, 7. , 4.9, 5.9, 7.5,
6.7, 8. , 5.7, 3.2, 4.8, 5.6, 7.4, 4.5, 3.6, 4.3, 3.4, 5.1, 4.4,
4.7, 7.7, 5.3, 4. , 8.4, 7.6, 3.3, 2.2, 3.7, 8.2, 4.1, 8.3, 1.7,
9. , 4.6, 8.5, 3.1, 3.8, 3.5, 1.9, 2.9, 2.8, 2.7, 9.2, 1.2, 2.1,
3. , 1.3, 1.1, 8.6, 2.5, 1. , 9.8, 8.7, 1.5, 9.3])
# create a list of our conditions
conditions = [(to_test.loc[(to_test['averageRating'] >= 0.0) & (to_test['averageRating'] <= 3.3)]),
              (to_test.loc[(to_test['averageRating'] >= 3.4) & (to_test['averageRating'] <= 6.6)]),
              (to_test.loc[(to_test['averageRating'] >= 6.7) & (to_test['averageRating'] <= 10)])]
# create a list of the values we want to assign for each condition
values = ['group1', 'group2', 'group3']
# create a new column and use np.select to assign values to it using our lists as arguments
to_test['group'] = np.select(conditions, values)
# display updated DataFrame
to_test.head()
but it's not working
This is a classic case for pd.cut. Sample code:
df = pd.DataFrame({'averageRating': np.random.uniform(0, 10, 100)})
df['group_using_cut'] = pd.cut(df['averageRating'],
                               [0, 3.3, 6.6, 10],
                               labels=['group1', 'group2', 'group3'])
If you want to use np.select, use the conditions without .loc:
Sample Code
conds = [
    (df['averageRating'] >= 0.0) & (df['averageRating'] <= 3.3),
    (df['averageRating'] >= 3.4) & (df['averageRating'] <= 6.6),
    (df['averageRating'] >= 6.7) & (df['averageRating'] <= 10),
]
df['group_using_select'] = np.select(conds, ['group1', 'group2', 'group3'])
Output of df.head():
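One more detail worth knowing: any row that matches none of the conditions gets np.select's default value (0 unless overridden), which is easy to miss when the bins have gaps. A small sketch using Series.between (inclusive on both ends) for the same bins, with an explicit default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'averageRating': [1.0, 3.3, 3.4, 6.6, 6.7, 10.0]})
conds = [
    df['averageRating'].between(0.0, 3.3),
    df['averageRating'].between(3.4, 6.6),
    df['averageRating'].between(6.7, 10),
]
# rows outside every bin would get 'other' instead of a silent 0
df['group'] = np.select(conds, ['group1', 'group2', 'group3'], default='other')
print(df['group'].tolist())  # ['group1', 'group1', 'group2', 'group2', 'group3', 'group3']
```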

Use itertools.groupby (or another neat, pythonic way) to group a list by the difference between consecutive numbers

I've read this question
But my question is a little different:
For example:
[0.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 5.0, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10.0]
should give me:
[[0.0], [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9], [5.0], [9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10.0]]
All the elements are floats. Within each sub-list, consecutive elements should differ by no more than 0.1.
I'm trying to solve it without third-party modules (not homework, just Python practice).
What I have tried: a lot of code with itertools.groupby, none of which worked. One of my attempts:
import itertools
lst = [0.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 5.0, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10]
res = []
for _, item in itertools.groupby(enumerate(lst), key=lambda index_num: index_num[0] - index_num[1]):
    print(list(item))  # Not expected. The solution I mentioned didn't work.
I want a neat, pythonic way to solve it. Any tricks are welcome; I just want to learn more skills. :)
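For what it's worth, a plain-loop sketch (no third-party modules; the function name is my own) that produces the expected output: walk the sorted list and start a new group whenever the gap to the previous element exceeds 0.1, with a small epsilon to absorb float noise:

```python
def group_consecutive(lst, tol=0.1, eps=1e-9):
    """Group a sorted list so consecutive elements of a group differ by at most tol."""
    groups = [[lst[0]]]
    for prev, cur in zip(lst, lst[1:]):
        if cur - prev <= tol + eps:   # still "consecutive": extend the current group
            groups[-1].append(cur)
        else:                         # gap too large: start a new group
            groups.append([cur])
    return groups

lst = [0.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 5.0,
       9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10.0]
print(group_consecutive(lst))
```

The epsilon matters because, e.g., 1.2 - 1.1 is slightly less than 0.1 in binary floats while other pairs land slightly above it.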

Sum combination of lists by element

I have a nested list, which can be of varying length (each sublist will always contain the same number of elements as the others):
list1=[[4.1,2.9,1.2,4.5,7.9,1.2],[0.7,1.1,2.0,0.4,1.8,2.2],[5.1,4.1,6.5,7.1,2.3,3.6]]
I can find every possible combination of sublists of length n using itertools:
n = 2
list(itertools.combinations(list1, n))
[([4.1, 2.9, 1.2, 4.5, 7.9, 1.2], [0.7, 1.1, 2.0, 0.4, 1.8, 2.2]),
([4.1, 2.9, 1.2, 4.5, 7.9, 1.2], [5.1, 4.1, 6.5, 7.1, 2.3, 3.6]),
([0.7, 1.1, 2.0, 0.4, 1.8, 2.2], [5.1, 4.1, 6.5, 7.1, 2.3, 3.6])]
I would like to sum the lists in each tuple element-wise. In this example, I would end up with:
[[4.8, 4.0, 3.2, 4.9, 9.7, 3.4],
 [9.2, 7.0, 7.7, 11.6, 10.2, 4.8],
 [5.8, 5.2, 8.5, 7.5, 4.1, 5.8]]
I have tried:
[sum(x) for x in itertools.combinations(list1, n)]
[sum(x) for x in zip(*itertools.combinations(list1, n))]
Both run into errors.
You can use zip for this:
>>> [tuple(map(sum, zip(*x))) for x in itertools.combinations(list1, n)]
[(4.8, 4.0, 3.2, 4.9, 9.700000000000001, 3.4000000000000004),
(9.2, 7.0, 7.7, 11.6, 10.2, 4.8),
(5.8, 5.199999999999999, 8.5, 7.5, 4.1, 5.800000000000001)]
Try this :
>>> list1=[[4.1,2.9,1.2,4.5,7.9,1.2],[0.7,1.1,2.0,0.4,1.8,2.2],[5.1,4.1,6.5,7.1,2.3,3.6]]
>>> from itertools import combinations as c
>>> list(list(map(sum, zip(*k))) for k in c(list1, 2))
[[4.8, 4.0, 3.2, 4.9, 9.700000000000001, 3.4000000000000004], [9.2, 7.0, 7.7, 11.6, 10.2, 4.8], [5.8, 5.199999999999999, 8.5, 7.5, 4.1, 5.800000000000001]]
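The trailing digits in those sums (9.700000000000001 and so on) are ordinary binary floating-point noise, not a bug in either answer; if clean one-decimal output matters, round each sum:

```python
from itertools import combinations

list1 = [[4.1, 2.9, 1.2, 4.5, 7.9, 1.2],
         [0.7, 1.1, 2.0, 0.4, 1.8, 2.2],
         [5.1, 4.1, 6.5, 7.1, 2.3, 3.6]]

# round each column-wise sum to one decimal place
result = [[round(sum(col), 1) for col in zip(*combo)]
          for combo in combinations(list1, 2)]
print(result)
# [[4.8, 4.0, 3.2, 4.9, 9.7, 3.4], [9.2, 7.0, 7.7, 11.6, 10.2, 4.8], [5.8, 5.2, 8.5, 7.5, 4.1, 5.8]]
```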

How can I convert a dict_keys list to integers

I am trying to find a way of converting a list within dict_keys() to an integer so I can use it as a trigger to send to another system. My code (below) imports a list of 100 words (a txt file with words each on a new line) which belong to 10 categories (e.g. the first 10 words belong to category 1, second 10 words belong to category 2 etc...).
Code:
from numpy.random import choice
from collections import defaultdict

number_of_elements = 10
words = open('file_location').read().split()
categories = defaultdict(list)
for i in range(len(words)):
    categories[i / number_of_elements].append(words[i])
category_labels = categories.keys()
category_labels
Output
dict_keys([0.0, 1.1, 2.0, 3.0, 4.9, 5.0, 0.5, 1.9, 8.0, 9.0, 1.3, 2.7, 3.9, 9.2, 9.4, 7.2, 4.2, 8.6, 5.1, 5.4, 3.3, 1.0, 6.6, 7.4, 7.7, 8.4, 5.8, 9.8, 0.7, 8.8, 2.1, 7.0, 6.4, 4.3, 0.1, 2.5, 3.8, 1.2, 6.9, 7.1, 5.6, 0.4, 5.3, 2.9, 7.3, 3.5, 9.5, 8.2, 2.8, 3.1, 0.9, 2.3, 8.1, 4.0, 6.3, 6.7, 4.5, 0.2, 1.7, 2.2, 8.9, 1.4, 7.6, 9.1, 7.8, 5.5, 4.8, 0.6, 3.2, 2.4, 6.5, 9.9, 9.6, 1.5, 6.0, 3.7, 4.7, 3.4, 5.9, 4.1, 1.6, 6.8, 9.3, 3.6, 8.5, 8.7, 0.3, 0.8, 7.5, 5.2, 2.6, 4.6, 5.7, 7.9, 6.1, 1.8, 8.3, 6.2, 9.7, 4.4])
What I need:
I would like the first number before the point (e.g. if it was 6.7, I just want the 6 as an int).
Thank you in advance for any help and/or advice!
Just convert your keys to integers using a list comprehension; note that there is no need to call .keys() here as iteration over the dictionary directly suffices:
[int(k) for k in categories]
You may want to bucket your values directly into integer categories rather than by floating point values:
categories = defaultdict(list)
for i, word in enumerate(words):
    categories[int(i / number_of_elements)].append(word)
I used enumerate() to pair words up with their index, rather than use range() plus indexing back into words.
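A small end-to-end sketch of that second approach, with a hypothetical in-line word list standing in for the file and floor division (//) producing the integer key directly:

```python
from collections import defaultdict

number_of_elements = 3
# hypothetical stand-in for open('file_location').read().split()
words = ['apple', 'pear', 'plum', 'oak', 'elm', 'ash', 'rose', 'iris', 'lily']

categories = defaultdict(list)
for i, word in enumerate(words):
    categories[i // number_of_elements].append(word)  # keys are already ints

print(sorted(categories))   # [0, 1, 2]
print(categories[1])        # ['oak', 'elm', 'ash']
```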

What is happening to my data as I retrieve it from SQLite?

Unfortunately I have some speed data saved as TEXT in SQLite. I want to select the max speed for every userid, and the TEXT storage is making that hard. When I output max(speed) on the list after retrieving it, I get a value that is not present in my database.
So, here is how I retrieve my data:
import sqlite3 as lite

def getMaxSpeed(databasepath, id):
    con = lite.connect(databasepath)
    speeds = []
    with con:
        cur = con.execute("SELECT speed FROM table WHERE userid = {userid}".format(userid=id))
        speed = [x[0] for x in cur]
        for i in range(0, len(speed)):
            speeds.append(float(speed[i]))
    return speeds
Sample (cropped) output before the for-loop:
[u'0.00', u'0.20', u'0.20', u'0.20', u'0.10', u'0.20', u'0.10', u'0.10', u'0.20', u'13.60', u'12.30', u'12.20', u'12.30', u'12.20', u'13.00', u'13.00', u'13.00', u'13.00', u'13.60', u'13.60', u'13.70', u'14.00', u'13.10', u'6.90', u'7.50', u'7.60', u'7.70', u'6.00', u'5.90', u'8.10', u'8.10', u'8.10', u'8.30', u'8.20', u'4.60', u'1.70', u'1.70', u'3.10', u'3.90', u'3.60', u'3.50', u'3.50', u'3.30', u'3.30', u'3.30', u'2.00', u'2.00', u'2.10', u'2.10', u'3.70', u'3.60', u'3.50', u'3.50', u'3.30', u'6.00', u'4.20', u'4.20', u'4.30', u'4.20', u'4.30', u'4.30', u'4.20', u'4.20', u'4.70', u'4.80', u'5.00', u'6.40', u'6.40', u'5.10', u'5.10', u'2.20', u'2.20', u'2.20', u'2.20', u'0.00', u'0.10', u'0.10', u'0.30', u'0.10', u'0.10', u'0.10', u'0.20', u'0.00', u'13.20', u'10.50', u'10.50']
Sample (cropped) output after the for loop:
[0.0, 0.2, 0.2, 0.2, 0.1, 0.2, 0.1, 0.1, 0.2, 13.6, 12.3, 12.2, 12.3, 12.2, 13.0, 13.0, 13.0, 13.0, 13.6, 13.6, 13.7, 14.0, 13.1, 6.9, 7.5, 7.6, 7.7, 6.0, 5.9, 8.1, 8.1, 8.1, 8.3, 8.2, 4.6, 1.7, 1.7, 3.1, 3.9, 3.6, 3.5, 3.5, 3.3, 3.3, 3.3, 2.0, 2.0, 2.1, 2.1, 3.7, 3.6, 3.5, 3.5, 3.3, 6.0, 4.2, 4.2, 4.3, 4.2, 4.3, 4.3, 4.2, 4.2, 4.7, 4.8, 5.0, 6.4, 6.4, 5.1, 5.1, 2.2, 2.2, 2.2, 2.2, 0.0, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1, 0.2, 0.0, 13.2, 10.5, 10.5]
Main method:
speed = []
for id in userid:
    speed = getMaxSpeed(dbpath, id)
    for x in speed:
        if x > 40:
            print x
            print id
            break
The last for loop is just some sanity checking for erroneous data. It prints a lot of different IDs where the speed equals 102.3, yet I expect my speeds to be between 0 and 30.
Sample output:
102.3
209407000
So, to check this is go back to my database and do:
SELECT distinct speed from table WHERE userid = 209407000
It outputs a lot of speeds (http://pastebin.com/bEe5tPkZ), none of them anywhere near 102.3. So something is happening in my retrieval method, but what? I'm having one of those days where everything just goes wrong.
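One thing worth checking (a self-contained sketch, not necessarily the whole story here): with the speeds stored as TEXT, any comparison SQLite performs on that column, including MAX, is lexicographic, so '9.90' sorts above '13.60'. Casting to REAL inside the query restores numeric ordering:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE speeds (userid INTEGER, speed TEXT)')
con.executemany('INSERT INTO speeds VALUES (?, ?)',
                [(1, '0.20'), (1, '13.60'), (1, '9.90')])

# MAX on a TEXT column compares strings: '9.90' > '13.60'
text_max, = con.execute('SELECT MAX(speed) FROM speeds').fetchone()
# Casting to REAL first gives the numeric maximum
real_max, = con.execute('SELECT MAX(CAST(speed AS REAL)) FROM speeds').fetchone()
print(text_max, real_max)  # 9.90 13.6
```

Parameterized queries (the `?` placeholders above) are also safer than string formatting for the userid filter.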
