I would like to see how many times a URL is labelled with 1 and how many times it is labelled with 0.
My dataset is:
Label URL
0 0.0 www.nytimes.com
1 0.0 newatlas.com
2 1.0 www.facebook.com
3 1.0 www.facebook.com
4 0.0 issuu.com
... ... ...
3572 0.0 www.businessinsider.com
3573 0.0 finance.yahoo.com
3574 0.0 www.cnbc.com
3575 0.0 www.ndtv.com
3576 0.0 www.baystatehealth.org
I tried df.groupby("URL")["Label"].count(), but it does not return the expected output, which should look like this:
Label URL Freq
0 0.0 www.nytimes.com 1
0 1.0 www.nytimes.com 0
1 0.0 newatlas.com 1
1 1.0 newatlas.com 0
2 1.0 www.facebook.com 2
2 0.0 www.facebook.com 0
4 0.0 issuu.com 1
4 1.0 issuu.com 0
... ... ...
What fields should I consider in the group by to get something like the above df (expected output)?
You need unique combinations of URL and Label:
df.groupby(["URL", "Label"]).size()
(count() would return an empty frame here, since no columns are left to count once both URL and Label are used as group keys; size() counts the rows in each group.)
You can also do value_counts directly:
df.value_counts(["URL", "Label"])
Use agg (note this gives the number of distinct labels per URL, not the per-label counts):
df.groupby("URL").agg({'Label': lambda x: x.nunique()})
In the pandas.DataFrame.reset_index documentation here, the functionality of the 'level' parameter is described as: Only remove the given levels from the index. Removes all levels by default.
But when I used the code line below:
df.reset_index(level = 0)
it did nothing.
Below are the results without and with the level parameter.
>>df.reset_index(inplace = True)
index y proba y_cap
0 1664 1.0 0.899965 1
1 2099 1.0 0.899828 1
2 1028 1.0 0.899825 1
3 9592 1.0 0.899812 1
4 8324 1.0 0.899768 1
... ... ... ... ...
10095 8294 1.0 0.500081 1
10096 1630 1.0 0.500058 1
10097 7421 1.0 0.500058 1
10098 805 1.0 0.500047 1
10099 5012 1.0 0.500019 1
>>df.reset_index(level = 0, inplace = True)
index y proba y_cap
0 1664 1.0 0.899965 1
1 2099 1.0 0.899828 1
2 1028 1.0 0.899825 1
3 9592 1.0 0.899812 1
4 8324 1.0 0.899768 1
... ... ... ... ...
10095 8294 1.0 0.500081 1
10096 1630 1.0 0.500058 1
10097 7421 1.0 0.500058 1
10098 805 1.0 0.500047 1
10099 5012 1.0 0.500019 1
Also, if I run either of the code lines below a second time:
>>df.reset_index(inplace = True)
OR
>>df.reset_index(level = 0, inplace = True)
I get the output below, with a level_0 column added containing what look like random values:
level_0 index y proba y_cap
0 0 1664 1.0 1.0 1
1 3808 2280 1.0 1.0 1
2 3828 6394 1.0 1.0 1
3 3827 3410 1.0 1.0 1
4 3826 4992 1.0 1.0 1
... ... ... ... ... ...
10095 7193 5399 1.0 0.0 1
10096 7194 1801 1.0 0.0 1
10097 7195 3777 1.0 0.0 1
10098 7196 3314 1.0 0.0 1
10099 10099 5012 1.0 0.0 1
And if I run the code a third time, it raises the error below:
cannot insert level_0, already exists
Please help me understand the significance of the level parameter and when it is used. Why does re-running my code add the level_0 column?
The level parameter matters only for DataFrames with a MultiIndex: it moves just the specified index level(s) into columns and keeps the rest in the index. Your DataFrame has a single-level index, so level=0 refers to that one level and behaves exactly like the default. The level_0 column appears on the second run because reset_index names the new column after the index; your index is unnamed, so pandas tries the name index first, and since a column called index already exists from the first run, it falls back to level_0. On the third run both index and level_0 exist, hence the error cannot insert level_0, already exists.
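A minimal sketch of where level actually matters, assuming a small two-level MultiIndex:

import pandas as pd

# Two-level MultiIndex: 'outer' and 'inner'.
df = pd.DataFrame(
    {"y": [1.0, 1.0, 0.0, 0.0]},
    index=pd.MultiIndex.from_product([["a", "b"], [1, 2]],
                                     names=["outer", "inner"]),
)

# level=0 moves only 'outer' into a column; 'inner' stays in the index.
print(df.reset_index(level=0))

# The default moves every level into columns.
print(df.reset_index())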
I have to create a time series from column values to compute the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute a dataframe
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
In which there's a 0 if the customer has bought something in that month and a 1 otherwise. The column names indicate a time period, with the column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
Just use the apply function to iterate through the rows of the dataframe and do the manipulation, carrying a running recency along (note that the previous value must be the already-computed recency, not the original cell, or runs of three or more months break):
def apply_function(row):
    out = [row.iloc[0]]  # keep CustomerID as-is
    for item in row.iloc[1:]:
        # 0 if bought this month, else previous recency + 1
        # (the "-1" dummy column is always 0, so each run starts from 0)
        out.append(0 if item == 0 else out[-1] + 1)
    return out

new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # restore the original column names
Do you insist on using the column structure? It is common with time series to use rows, e.g., a dataframe with columns CustomerID, hasBoughtThisMonth. You can then easily add the Recency column using a pandas transform() (a sketch of that follows the result table below).
I cannot place comments yet, hence asking the question this way.
Edit: here is another way to go about it. I took two customers as an example, with some random numbers for whether or not they bought something in a month.
Basically, you put your table in long format and use a groupby + cumsum to get your result. Notice that I avoid your dummy column this way.
import pandas as pd
import numpy as np

np.random.seed(1)

# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})

# Make Recency column by finding contiguous groups of ones, and groupby
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = (df.groupby(by=['CustomerID', contiguous_groups],
                            as_index=False)['hasBoughtThisMonth']
                   .cumsum().reset_index(drop=True))
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
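As a follow-up, a sketch of the transform() idea mentioned above, reusing df and contiguous_groups from the snippet; transform('cumsum') returns the per-group running count aligned to the original rows, so it yields the same Recency column:

df['Recency'] = (df.groupby(['CustomerID', contiguous_groups])
                   ['hasBoughtThisMonth']
                   .transform('cumsum'))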
It would be easier if you first set CustomerID as the index and transpose your dataframe,
then apply your custom function.
I.e. something like (a small sketch follows):
df.set_index('CustomerID').T.apply(custom_func)
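A minimal runnable sketch of that approach, with a hypothetical recency helper and a tiny stand-in for the question's wide table:

import pandas as pd

# Tiny stand-in for the wide table from the question ('-1' is the dummy column).
df = pd.DataFrame({
    "CustomerID": [17850, 13047],
    -1: [0, 0], 0: [0.0, 0.0], 1: [1.0, 1.0], 2: [1.0, 0.0], 3: [1.0, 0.0],
})

def recency(col):
    # col is one customer's months, oldest first; carry a running count.
    out, prev = [], 0
    for v in col:
        prev = 0 if v == 0 else prev + 1
        out.append(prev)
    return pd.Series(out, index=col.index)

wide = df.set_index("CustomerID").T   # months become rows
print(wide.apply(recency).T)          # back to one row per customer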
I have a large dataframe of the form:
user_id time_interval A B C D E F G H ... Z
0 12166 2.0 3.0 1.0 1.0 1.0 3.0 1.0 1.0 1.0 ... 0.0
1 12167 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
2 12168 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
3 12169 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
4 12170 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
I would like to find, for each user_id, its closest neighbors within a 'radius' distance r, using the columns A-Z as coordinates. The output should look like this, for example for r=0.1:
user_id neighbors
12166 [12251,12345, ...]
12167 [12168, 12169,12170, ...]
... ...
I tried for-looping through the user_id list, but it takes ages.
I did something like this:
import scipy.spatial.distance

neighbors = []
for i in range(len(dataframe)):
    user_neighbors = [dataframe["user_id"][j] for j in range(i+1, len(dataframe))
                      if scipy.spatial.distance.euclidean(dataframe.values[i][2:],
                                                          dataframe.values[j][2:]) < 0.1]
    neighbors.append([dataframe["user_id"][i], user_neighbors])
and I have been waiting for hours.
Is there a pythonic way to improve this?
Here's how I've done it using the apply method.
The dummy data consist of columns A-D, with an added column for the neighbors:
print(df)
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 NaN
1 12167 0 1 4 3 3 NaN
2 12168 0 4 3 3 1 NaN
3 12169 0 2 2 3 2 NaN
4 12170 0 3 3 1 1 NaN
the custom function:
def func(row):
    r = 2.5  # the distance threshold
    # Euclidean distance from this row to every row over columns A-D;
    # keep the user_ids that fall within r.
    dists = ((df.iloc[:, 2:-1] - row.iloc[2:-1]) ** 2).sum(axis=1) ** 0.5
    out = df[dists.le(r)]['user_id'].to_list()
    out.remove(row['user_id'])  # a point is not its own neighbor
    df.loc[row.name, 'neighbors'] = str(out)

df.apply(func, axis=1)
the output:
print(df)
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 [12169, 12170]
1 12167 0 1 4 3 3 [12169]
2 12168 0 4 3 3 1 [12169, 12170]
3 12169 0 2 2 3 2 [12166, 12167, 12168]
4 12170 0 3 3 1 1 [12166, 12168]
Let me know if it outperforms the for-loop approach.
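If the brute-force apply is still too slow at your scale, here is a hedged sketch using scipy's cKDTree; the dummy frame below stands in for the question's dataframe, with the feature columns starting at position 2 as in the question:

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Stand-in data: 1000 users with 5 feature columns.
rng = np.random.default_rng(0)
dataframe = pd.DataFrame(rng.random((1000, 5)), columns=list("ABCDE"))
dataframe.insert(0, "time_interval", 0.0)
dataframe.insert(0, "user_id", np.arange(12166, 12166 + 1000))

r = 0.1
coords = dataframe.iloc[:, 2:].to_numpy()
tree = cKDTree(coords)
hits = tree.query_ball_point(coords, r=r)  # row indices within radius r

ids = dataframe["user_id"].to_numpy()
dataframe["neighbors"] = [[ids[j] for j in row if j != i]
                          for i, row in enumerate(hits)]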
How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
the JUNK column would be ignored when trying to determine the MEAN column
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove the unnecessary column, or iloc to filter it out:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print (df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
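If the junk columns are not always in a fixed position, selecting only the numeric columns is another option (a variation not in the original answer; compute it before adding MEAN itself):

df['MEAN'] = df.select_dtypes('number').mean(axis=1)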
I am trying to import the Semeion Handwritten Digit Data Set as a pandas DataFrame, but the first row is being taken as the column names.
df.head()
0.0000 0.0000.1 0.0000.2 0.0000.3 0.0000.4 0.0000.5 1.0000 1.0000.1 \
0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
1.0000.2 1.0000.3 ... 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
0 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
1 0.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
2 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
3 0.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
4 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
[5 rows x 266 columns]
Since the DataFrame has 266 columns, I am trying to assign numbers as column names using a lambda inside a generator expression, with the following code:
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data", delimiter = r"\s+",
names = (lambda x: x for x in range(0,266)) )
But I am getting weird column names, like:
>>> df.head(2)
<function <genexpr>.<lambda> at 0x04F4E588> \
0 0.0
1 0.0
<function <genexpr>.<lambda> at 0x04F4E618> \
0 0.0
1 0.0
<function <genexpr>.<lambda> at 0x04F4E660> \
0 0.0
1 0.0
If I remove the parentheses, the code throws a syntax error:
>>> df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data", delimiter = r"\s+",
names = lambda x: x for x in range(0,266) )
SyntaxError: invalid syntax
Can someone tell me:
1) How to get column names as numbers, from 0 to 265?
2) If I get a DataFrame with the first row taken as column names, how do I push it down and add new column names without losing that first row?
TIA
I think you need the parameter header=None, or names=range(266), to set default column names in read_csv. Your (lambda x: x for x in range(0,266)) is a generator that yields 266 lambda functions, so each column name ended up being a function object, whereas range(266) yields the numbers themselves:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data"
df = pd.read_csv(url, sep = r"\s+", header=None)
df = pd.read_csv(url, sep = r"\s+", names=range(266))
Also, you can build the list of names yourself and pass it in:
my_columns = list(range(266))
df = pd.read_csv(url, sep = r"\s+", names=my_columns)
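For part 2 of the question, a minimal sketch of the general pattern (note that pandas mangles duplicate header names like 0.0000.1, so for this particular dataset re-reading with header=None is the cleaner fix):

import pandas as pd

# Stand-in frame whose first data row was absorbed as the header.
df = pd.DataFrame([[0.0, 1.0], [1.0, 1.0]])
df.columns = ["0.5000", "1.0000"]  # hypothetical misread header values

# Push the header back down as the first row, then renumber the columns.
first_row = pd.DataFrame([[float(c) for c in df.columns]])
df.columns = range(df.shape[1])
df = pd.concat([first_row, df], ignore_index=True)
print(df)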