TeamA TeamB TeamC
12 17 19
13 20 21
14 21 26
15 22 15
import numpy
import pandas as pd

# 'data' is the DataFrame shown above
difference = numpy.abs(data['TeamA'] - data['TeamB'])
teamC = data['TeamC']
df1 = pd.DataFrame(difference)
df1.columns = ['diff']
df2 = pd.DataFrame(teamC)
correlation = df1.corrwith(df2, axis=0)
I am looking to return the correlation between (the absolute points difference between Team A and Team B) and the number of points of Team C. However, my code is not returning a number. Any suggestions?
pandas is expecting a Series inside corrwith instead of a DataFrame (despite what the documentation says).
This makes sense, because just passing a DataFrame does not tell pandas which columns to use when generating the correlation score.
You should instead be doing:
df1.corrwith(df2["TeamC"])
Output:
diff 0.18221
dtype: float64
This answer is just an extension of this thread:
pandas.DataFrame corrwith() method
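For reference, since both quantities are single columns here, the same number can also be computed without building intermediate DataFrames, by correlating the two Series directly. A minimal sketch, assuming the data shown in the question:

import pandas as pd

data = pd.DataFrame({'TeamA': [12, 13, 14, 15],
                     'TeamB': [17, 20, 21, 22],
                     'TeamC': [19, 21, 26, 15]})

# Absolute points difference between Team A and Team B
difference = (data['TeamA'] - data['TeamB']).abs()

# Series.corr between two Series returns a single float (~0.18 here)
print(difference.corr(data['TeamC']))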
Say I have a df that looks like this:
name day var_A var_B
0 Pete Wed 4 5
1 Luck Thu 1 10
2 Pete Sun 10 10
And I want to sum var_A and var_B for every name/person and then divide that sum by the number of occurrences of that name/person to get the average.
Let's take Pete for example. Sum his variables (in this case, (4+10) + (5+10) = 29) and divide this sum by the number of occurrences of Pete in the df (29/2 = 14.5). The "day" column would be eliminated; there would be only one column for the name and another for the average.
Would look like this:
>>> df.method().method()
name avg
0 Pete 14.5
1 Luck 11.0
I've been trying to do this using groupby and other methods, but I eventually got stuck.
Any help would be appreciated.
I came up with
df.groupby('name')[['var_A', 'var_B']].apply(lambda g: g.stack().sum()/len(g)).rename('avg').reset_index()
which produces the correct result, but I'm not sure it's the most elegant way.
pandas' groupby is a lazy expression, and as such it is reusable:
# create group
group = df.drop(columns="day").groupby("name")
# compute output
group.sum().sum(axis=1) / group.size()
name
Luck 11.0
Pete 14.5
dtype: float64
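A roughly equivalent alternative, sketched here rather than taken from the thread, is to add the per-row total as a column and then average it within each name:

import pandas as pd

df = pd.DataFrame({'name': ['Pete', 'Luck', 'Pete'],
                   'day': ['Wed', 'Thu', 'Sun'],
                   'var_A': [4, 1, 10],
                   'var_B': [5, 10, 10]})

# Sum var_A and var_B per row, then average that total within each name
result = (df.assign(total=df['var_A'] + df['var_B'])
            .groupby('name', as_index=False)['total'].mean()
            .rename(columns={'total': 'avg'}))
print(result)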
I am using Python 3.4 in a Jupyter Notebook.
I am looking to select the max of each product type from the table below. I've found the groupby code written below, but I am struggling to figure out how to do the search so that it takes into account the max across all boxes (Box_1 and Box_2), all bottles, and so on.
Perhaps this is best described as some sort of fuzzy matching?
Ideally my output should give me the max in each category:
box_2 18
bottles_3 31
.
.
.
How should I do this?
import pandas as pd

data = {'Product':['Box_1','Bottles_1','Pen_1','Markers_1','Bottles_2','Pen_2','Markers_2','Bottles_3','Box_2','Markers_2','Markers_3','Pen_3'],
        'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1 = pd.DataFrame(data, columns=['Product','Sales'])
df1

df1.groupby(['Product'])['Sales'].max()
If I understand correctly, you first have to look at the category and then retrieve both the name of the product and the maximum value. Here is how to do that:
df1=pd.DataFrame(data, columns=['Product','Sales'])
df1['Category'] = df1.Product.str.split('_').str.get(0)
df1["rank"] = df1.groupby("Category")["Sales"].rank("dense", ascending=False)
df1[df1["rank"]==1.0][['Product','Sales']]
The rank function ranks the products within each category according to their Sales value. Then you keep only the rows that rank first and drop everything that ranks lower. That gives you the desired dataframe:
Product Sales
2 Pen_1 31
7 Bottles_3 31
8 Box_2 18
10 Markers_3 18
Here you go:
df1['Type'] = df1.Product.str.split('_').str.get(0)
df1.groupby(['Type'])['Sales'].max()
Out[1]:
Type
Bottles 31
Box 18
Markers 18
Pen 31
Name: Sales, dtype: int64
You can split the values by _, select the first part with the str[0] indexer, and pass that to groupby; then aggregate with idxmax to get the Product with the maximal Sales in each group:
df1 = df1.set_index('Product')
df2 = (df1.groupby(df1.index.str.split('_').str[0])['Sales']
.agg([('Product','idxmax'), ('Sales','max')])
.reset_index(drop=True))
print (df2)
Product Sales
0 Bottles_3 31
1 Box_2 18
2 Markers_3 18
3 Pen_1 31
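For reference, a similar result can be obtained without setting the index, by grouping on a derived category Series and using the idxmax labels to pull the winning rows back out of the frame. A sketch, not taken from any of the answers above:

import pandas as pd

data = {'Product':['Box_1','Bottles_1','Pen_1','Markers_1','Bottles_2','Pen_2','Markers_2','Bottles_3','Box_2','Markers_2','Markers_3','Pen_3'],
        'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1 = pd.DataFrame(data)

# Category is the part of the product name before the underscore
category = df1['Product'].str.split('_').str[0]

# idxmax gives the row label of the maximal Sales per category;
# loc then retrieves those full rows from the original frame
best = df1.loc[df1.groupby(category)['Sales'].idxmax(), ['Product', 'Sales']]
print(best)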
I'm facing a bit of an issue adding a new column to my pandas DataFrame. I have a DataFrame in which each row represents a record of location data and a timestamp. Those records belong to trips, so each row also contains a trip ID. Imagine the DataFrame looks something like this:
TripID Lat Lon time
0 42 53.55 9.99 74
1 42 53.58 9.99 78
3 42 53.60 9.98 79
6 12 52.01 10.04 64
7 12 52.34 10.05 69
Now I would like to delete the records of all trips that have less than a minimum amount of records to them. I figured I could simply get the number of records of each trip like so:
lengths = df['TripID'].value_counts()
Then my idea was to add an additional column to the DataFrame and fill it with the values from that Series corresponding to the trip ID of each record. I would then be able to get rid of all rows in which the value of the length column is too small.
However, I can't seem to find a way to get the length values into the correct rows. Would any one have an idea for that or even a better approach to the entire problem?
Thanks very much!
EDIT:
My desired output should look something like this:
TripID Lat Lon time length
0 42 53.55 9.99 74 3
1 42 53.58 9.99 78 3
3 42 53.60 9.98 79 3
6 12 52.01 10.04 64 2
7 12 52.34 10.05 69 2
If I understand correctly, to get the length of the trip, you'd want to get the difference between the maximum time and the minimum time for each trip. You can do that with a groupby statement.
# Groupby, get the minimum and maximum times, then reset the index
df_new = df.groupby('TripID').time.agg(['min', 'max']).reset_index()
df_new['length_of_trip'] = df_new['max'] - df_new['min']  # use the columns, not the .max/.min methods
df_new = df_new.loc[df_new.length_of_trip > 90] # to pick a random number
That'll get you all the rows with a trip length above the amount you need, including the trip IDs.
You can use groupby and transform to directly add the lengths column to the DataFrame, like so:
df["lengths"] = df[["TripID", "time"]].groupby("TripID").transform("count")
I managed to find an answer to my question that is quite a bit nicer than my original approach as well:
df = df.groupby('TripID').filter(lambda x: len(x) > 2)
This can be found in the pandas documentation. It gets rid of all groups that have 2 or fewer elements in them, or trips that are 2 records or shorter in my case.
I hope this will help someone else out as well.
I have a dataframe with two "categories" of information. One category is repeated across multiple rows, and the other is specific to each row.
It looks something like this:
City State Industry Pay Hours
15 10 1 20 40
15 10 2 30 25
20 10 1 25 30
20 10 2 50 80
I want it to look like:
City State Industry1Pay Industry1Hours Industry2Pay Industry2Hours
15 10 20 40 30 25
20 10 25 30 50 80
This is a simplified version because the full table is much too long to fit up there. There are 8 columns in place of city and state, and 2 additional columns to pay and hours. In addition, each row should contain 4 industries for now (it will be 5 once that data comes in).
I am really struggling with how to do this. The dataset is from a project conducted in Stata, so the columns are mostly floats and need to stay that way for when I send it in.
The closest I think I've gotten is
wage = wage.pivot_table(index='cityid', columns='Industry').rename_axis(None)
wage.columns = wage.columns.map('_'.join)
but I get an error because you can't join a float to a string, and I suspect that this will not work the way I'm hoping it will regardless.
So far I've looked at quite a few stackoverflow questions, as well as:
https://hackernoon.com/reshaping-data-in-python-fa27dda2ff77
http://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/
and two others I am unable to link to because I haven't used stackoverflow very much
I'm really struggling with this, and would appreciate any help, even a link to a good tutorial to wrap my head around this. It seems like a really simple task but for the life of me I can't figure out how to do it without just manually moving stuff around in Excel.
I apologize in advance if this is a duplicate - I looked around a lot but I might be missing something obvious because I'm not sure what this is called beyond reshaping.
Let's use set_index and unstack:
df['Industry'] = 'Industry'+df.Industry.astype(str)
df_out = df.set_index(['City','State','Industry']).unstack()
And flatten the multiindex columns with swaplevel, map, join:
df_out.columns = df_out.columns.swaplevel(1,0)
df_out.columns = df_out.columns.map(''.join)
Output:
Industry1Pay Industry2Pay Industry1Hours Industry2Hours
City State
15 10 20 30 40 25
20 10 25 50 30 80
If there are multiple rows for the same City, State, and Industry combination, use pivot_table with an aggregation:
df['Industry'] = 'Industry'+df.Industry.astype(str)
df_out = df.pivot_table(index=['City','State'],columns='Industry',values=['Pay','Hours'], aggfunc='sum')
df_out.columns = df_out.columns.swaplevel(1,0)
df_out.columns = df_out.columns.map(''.join)
df_out
Or use groupby:
df_out = df.groupby(['City','State','Industry'])[['Pay','Hours']].sum().unstack()
df_out.columns = df_out.columns.swaplevel(1,0)
df_out.columns = df_out.columns.map(''.join)
df_out
Output:
Industry1Pay Industry2Pay Industry1Hours Industry2Hours
City State
15 10 20 30 40 25
20 10 25 50 30 80
Here's how to use pivot_table:
In [38]: df.pivot_table(index=['City', 'State'],
columns='Industry',
values=['Pay', 'Hours'])
Out[38]:
Pay Hours
Industry 1 2 1 2
City State
15 10 20 30 40 25
20 10 25 50 30 80
To flatten the pivot and add column names.
In [94]: dff = df.pivot_table(index=['City', 'State'], columns='Industry',
values=['Pay', 'Hours'])
In [95]: cols = ['Industry%s%s' % x for x in zip(dff.columns.get_level_values(1),
dff.columns.get_level_values(0))]
In [96]: cols
Out[96]: ['Industry1Pay', 'Industry2Pay', 'Industry1Hours', 'Industry2Hours']
In [97]: dff.columns = cols
In [98]: dff.reset_index()
Out[98]:
City State Industry1Pay Industry2Pay Industry1Hours Industry2Hours
0 15 10 20 30 40 25
1 20 10 25 50 30 80
I originally have 3 columns: timestamp, response_time and type. What I need to do is find the mean of response_time wherever the timestamps are the same, so I grouped the rows by timestamp and applied the mean function to them. I got the following series, which is fine:
0 16.949689
1 17.274615
2 16.858884
3 17.025155
4 17.062008
5 16.846885
6 17.172994
7 17.025797
8 17.001974
9 16.924636
10 16.813300
11 17.152066
12 17.291899
13 16.946970
14 16.972884
15 16.871824
16 16.840024
17 17.227682
18 17.288211
19 17.370553
20 17.395759
21 17.449579
22 17.340357
23 17.137308
24 16.981012
25 16.946727
26 16.947073
27 16.830850
28 17.366538
29 17.054468
30 16.823983
31 17.115429
32 16.859003
33 16.919645
34 17.351895
35 16.930233
36 17.025194
37 16.824997
And I need to be able to plot column 1 vs column 2, but I am not able to extract them separately.
I obtained this column by doing groupby('timestamp') and then a mean() on that.
The problem I need to solve is how to extract each column of this series. Or is there a better way to calculate the mean of one column for all matching entries of another column?
ORIGINAL DATA :
1445544152817,SEND_MSG,123
1445544152817,SEND_MSG,123
1445544152829,SEND_MSG,135
1445544152829,SEND_MSG,135
1445544152830,SEND_MSG,135
1445544152830,GET_QUEUE,12
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152831,SEND_MSG,138
1445544152831,SEND_MSG,136
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152832,SEND_MSG,138
1445544152832,SEND_MSG,138
1445544152833,SEND_MSG,138
1445544152833,SEND_MSG,139
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152835,SEND_MSG,140
1445544152835,SEND_MSG,141
1445544152849,SEND_MSG,155
1445544152849,SEND_MSG,155
1445544152850,GET_QUEUE,21
1445544152850,GET_QUEUE,21
For each timestamp I want to find the average response_time and plot it. I did that successfully, as shown in the series above (the first block of data), but I cannot separate the timestamp and response_time columns anymore.
A Series always has just one column. The first column you see is the index. You can get it with your_series.index. If you want the timestamp to become a data column again, and not an index, you can use the as_index keyword in groupby:
df.groupby('timestamp', as_index = False).mean()
Or use your_series.reset_index().
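Putting it together with the original CSV, a minimal sketch that keeps timestamp as a regular column and then plots the two columns against each other (the file name and column names are assumed, since the raw data has no header):

import pandas as pd
import matplotlib.pyplot as plt

# Column names are assumed; the raw file has no header row
df = pd.read_csv('data.csv', names=['timestamp', 'type', 'response_time'])

# as_index=False keeps timestamp as a normal column instead of the index
means = df.groupby('timestamp', as_index=False)['response_time'].mean()

plt.plot(means['timestamp'], means['response_time'])
plt.xlabel('timestamp')
plt.ylabel('mean response_time')
plt.show()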
If it's a Series, you can directly use:
your_series.mean()
You can extract a column from a DataFrame by:
df['column_name']
Then you can apply mean() to the resulting Series:
df['column_name'].mean()