Grouping by classes - python

I would like to see how many times a url is labelled with 1 and how many times it is labelled with 0.
My dataset is
Label URL
0 0.0 www.nytimes.com
1 0.0 newatlas.com
2 1.0 www.facebook.com
3 1.0 www.facebook.com
4 0.0 issuu.com
... ... ...
3572 0.0 www.businessinsider.com
3573 0.0 finance.yahoo.com
3574 0.0 www.cnbc.com
3575 0.0 www.ndtv.com
3576 0.0 www.baystatehealth.org
I tried df.groupby("URL")["Label"].count(), but it does not return what I want. The expected output is:
Label URL Freq
0 0.0 www.nytimes.com 1
0 1.0 www.nytimes.com 0
1 0.0 newatlas.com 1
1 1.0 newatlas.com 0
2 1.0 www.facebook.com 2
2 0.0 www.facebook.com 0
4 0.0 issuu.com 1
4 1.0 issuu.com 0
... ... ...
What fields should I use in the groupby to get something like the above df (the expected output)?

You need the unique combinations of URL and Label. size() counts the rows in each group no matter what other columns exist:
df.groupby(["URL", "Label"]).size()

Alternatively, you can use value_counts:
df.value_counts(["URL", "Label"])

Use agg, with a dict mapping the column to the function (note the colon; a comma would make it a set literal):
df.groupby("URL").agg({'Label': lambda x: x.nunique()})

Related

What is the level parameter in df.reset_index() and what is it used for?

In the pandas.DataFrame.reset_index documentation, the functionality of the level parameter is described as: "Only remove the given levels from the index. Removes all levels by default."
But when I used the line below:
df.reset_index(level = 0)
it did nothing. Below are the results without and with the level parameter.
>>df.reset_index(inplace = True)
index y proba y_cap
0 1664 1.0 0.899965 1
1 2099 1.0 0.899828 1
2 1028 1.0 0.899825 1
3 9592 1.0 0.899812 1
4 8324 1.0 0.899768 1
... ... ... ... ...
10095 8294 1.0 0.500081 1
10096 1630 1.0 0.500058 1
10097 7421 1.0 0.500058 1
10098 805 1.0 0.500047 1
10099 5012 1.0 0.500019 1
>>df.reset_index(level = 0, inplace = True)
index y proba y_cap
0 1664 1.0 0.899965 1
1 2099 1.0 0.899828 1
2 1028 1.0 0.899825 1
3 9592 1.0 0.899812 1
4 8324 1.0 0.899768 1
... ... ... ... ...
10095 8294 1.0 0.500081 1
10096 1630 1.0 0.500058 1
10097 7421 1.0 0.500058 1
10098 805 1.0 0.500047 1
10099 5012 1.0 0.500019 1
Also, if I run either of the lines below a second time:
>>df.reset_index(inplace = True)
OR
>>df.reset_index(level = 0, inplace = True)
I get the output below, with a level_0 column added that contains seemingly random values.
level_0 index y proba y_cap
0 0 1664 1.0 1.0 1
1 3808 2280 1.0 1.0 1
2 3828 6394 1.0 1.0 1
3 3827 3410 1.0 1.0 1
4 3826 4992 1.0 1.0 1
... ... ... ... ... ...
10095 7193 5399 1.0 0.0 1
10096 7194 1801 1.0 0.0 1
10097 7195 3777 1.0 0.0 1
10098 7196 3314 1.0 0.0 1
10099 10099 5012 1.0 0.0 1
And if I run the code a third time, it raises the error below:
cannot insert level_0, already exists
Please help me understand the significance of the level parameter and when it is used. Why does re-running my code add the level_0 column?
The level parameter applies to DataFrames with a MultiIndex: it lets you move selected index levels into columns while leaving the others in place. Your DataFrame has a single-level index, so level = 0 selects the only level there is and behaves exactly like the default. Also note that reset_index never does "nothing": each call moves the current index into a new column named index, falling back to level_0 when an index column already exists. That is why the second run adds level_0, and the third fails because a column of that name now exists as well.
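A minimal sketch on made-up data showing where level matters:
import pandas as pd

# made-up frame with a two-level index
df = pd.DataFrame({"value": [1, 2, 3, 4]},
                  index=pd.MultiIndex.from_product([["a", "b"], [2020, 2021]],
                                                   names=["group", "year"]))

print(df.reset_index())         # moves both levels into columns
print(df.reset_index(level=0))  # moves only 'group'; 'year' stays as the index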

Pandas apply function to column taking the value of previous column

I have to create a timeseries using column values for computing the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute this dataframe:
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
In which there's a 0 if the customer has bought something in that month and a 1 otherwise. The column names indicate a time period, with the column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
Just use the apply function over the rows of the dataframe and carry the running value along (a one-pass list comprehension reading row[i-1] would use the original previous value rather than the updated one, so a loop is clearer):
def apply_function(row):
    out = [row.iloc[0]]  # keep CustomerID as-is
    for item in row.iloc[1:]:  # the dummy '-1' column is 0, so the recurrence starts at 0
        out.append(0 if item == 0 else out[-1] + 1)
    return out
new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # just to set previous column names
Do you insist on using the column structure? It is common with time series to use rows, e.g. a dataframe with columns CustomerID and hasBoughtThisMonth. You can then easily add the Recency column using a pandas transform().
I cannot yet post comments, hence answering the question in this way.
Edit: here is another way to go by. I took two customers as an example, and some random numbers of whether or not they bought something in a month.
Basically, you pivot your table, and use a groupby+cumsum to get your result. Notice that I avoid your dummy column in this way.
import pandas as pd
import numpy as np

np.random.seed(1)

# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})

# Make Recency column by finding contiguous groups of ones, and groupby
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = df.groupby(by=['CustomerID', contiguous_groups],
                           as_index=False)['hasBoughtThisMonth'].cumsum().reset_index(drop=True)
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
It would be easier if you first set CustomerID as the index and transpose your dataframe, then apply your custom function to each column, i.e. something like:
df.T.apply(custom_func)
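A sketch of this transpose idea on a made-up miniature of the wide frame (the function implements the R(t) recurrence from the question; the '-1' dummy column seeds it):
import pandas as pd

# made-up miniature of the wide frame from the question
df = pd.DataFrame({"CustomerID": [17850, 13047],
                   -1: [0, 0], 0: [0.0, 0.0], 1: [1.0, 1.0], 2: [1.0, 0.0]})
wide = df.set_index("CustomerID")

def recency(flags):
    # R(t) = 0 if bought that month (flag 0), R(t-1) + 1 otherwise
    out, prev = [], 0
    for flag in flags:
        prev = 0 if flag == 0 else prev + 1
        out.append(prev)
    return pd.Series(out, index=flags.index)

result = wide.T.apply(recency).T  # one column per customer after .T
print(result)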

Find nearest neighbors

I have a large dataframe of the form:
user_id time_interval A B C D E F G H ... Z
0 12166 2.0 3.0 1.0 1.0 1.0 3.0 1.0 1.0 1.0 ... 0.0
1 12167 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
2 12168 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
3 12169 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
4 12170 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
I would like to find, for each user_id, using the columns A-Z as coordinates, the closest neighbors within a radius r. The output should look like this, for example for r=0.1:
user_id neighbors
12166 [12251,12345, ...]
12167 [12168, 12169,12170, ...]
... ...
I tried looping over the user_id list, but it takes ages. I did something like this:
import scipy.spatial.distance

neighbors = []
for i in range(len(dataframe)):
    user_neighbors = [dataframe["user_id"][j] for j in range(i+1, len(dataframe))
                      if scipy.spatial.distance.euclidean(dataframe.values[i][2:],
                                                          dataframe.values[j][2:]) < 0.1]
    neighbors.append([dataframe["user_id"][i], user_neighbors])
and I have been waiting for hours.
Is there a pythonic way to improve this?
Here's how I've done it using the apply method.
The dummy data consist of columns A-D, with an added column for the neighbors:
print(df)
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 NaN
1 12167 0 1 4 3 3 NaN
2 12168 0 4 3 3 1 NaN
3 12169 0 2 2 3 2 NaN
4 12170 0 3 3 1 1 NaN
the custom function:
def func(row):
    r = 2.5  # the threshold
    out = df[(((df.iloc[:, 2:-1] - row[2:-1])**2).sum(axis=1)**0.5).le(r)]['user_id'].to_list()
    out.remove(row['user_id'])
    df.loc[row.name, ['neighbors']] = str(out)

df.apply(func, axis=1)
the output:
print(df):
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 [12169, 12170]
1 12167 0 1 4 3 3 [12169]
2 12168 0 4 3 3 1 [12169, 12170]
3 12169 0 2 2 3 2 [12166, 12167, 12168]
4 12170 0 3 3 1 1 [12166, 12168]
Let me know if it outperforms the for-loop approach.
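At this scale the usual fix is a spatial index rather than pairwise distances. A hedged sketch with SciPy's cKDTree, assuming the coordinate columns start at position 2 as in the question's dataframe:
import pandas as pd
from scipy.spatial import cKDTree

coords = dataframe.iloc[:, 2:].to_numpy()  # columns A-Z as coordinates
tree = cKDTree(coords)

# for every point, the indices of all points within radius r (itself included)
idx_lists = tree.query_ball_point(coords, r=0.1)

user_ids = dataframe["user_id"].to_numpy()
neighbors = [[user_ids[j] for j in idx if j != i]  # drop the point itself
             for i, idx in enumerate(idx_lists)]
result = pd.DataFrame({"user_id": user_ids, "neighbors": neighbors})
Building and querying the tree is roughly O(n log n), versus the quadratic pairwise loop.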

new Pandas Dataframe column calculated from other column values

How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
the JUNK column would be ignored when trying to determine the MEAN column
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove, or iloc to filter out, the unnecessary columns:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print (df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
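If the non-numeric columns vary in number or position, a variant that simply keeps every numeric column (select_dtypes is standard pandas; the frame is the question's):
# mean over numeric columns only; JUNK (object dtype) is excluded automatically
df['MEAN'] = df.select_dtypes('number').mean(axis=1)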

Python Pandas - adding column names using For statement

I am trying to import the Semeion Handwritten Digit Data Set as a pandas DataFrame, but the first row is being taken as column names.
df.head()
0.0000 0.0000.1 0.0000.2 0.0000.3 0.0000.4 0.0000.5 1.0000 1.0000.1 \
0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
1.0000.2 1.0000.3 ... 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
0 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
1 0.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
2 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
3 0.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
4 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
[5 rows x 266 columns]
Since the DataFrame has 266 columns, I am trying to assign numbers as column names using a lambda and a for loop, with the following code:
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data",
                 delimiter = r"\s+", names = (lambda x: x for x in range(0,266)))
But I am getting weird column names, like:
>>> df.head(2)
<function <genexpr>.<lambda> at 0x04F4E588> \
0 0.0
1 0.0
<function <genexpr>.<lambda> at 0x04F4E618> \
0 0.0
1 0.0
<function <genexpr>.<lambda> at 0x04F4E660> \
0 0.0
1 0.0
If I remove the parentheses, the code throws a syntax error:
>>> df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data", delimiter = r"\s+",
names = lambda x: x for x in range(0,266) )
SyntaxError: invalid syntax
Can someone tell me:
1) How to get column names as numbers, from 0 to 265?
2) If I get a DataFrame with the first row as column names, how do I push it down and add new column names without losing the first row?
TIA
I think you need the parameter header=None, or names=range(266), to set default column names in read_csv:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data"
df = pd.read_csv(url, sep = r"\s+", header=None)
df = pd.read_csv(url, sep = r"\s+", names=range(266))
Also you can build the list of names yourself:
my_columns = list(range(266))
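For the second part of the question, a possible sketch on a made-up frame; it assumes the consumed header strings are still convertible back to the original values (with read_csv's deduplicated names like 0.0000.1 they would not be, which is why re-reading with header=None is the better fix):
import pandas as pd

# made-up frame whose first data row was consumed as the header
df = pd.DataFrame({"0.5": [0.1, 0.2], "1.5": [0.3, 0.4]})

# turn the header back into a row, prepend it, then renumber the columns
header_row = pd.DataFrame([df.columns.astype(float)], columns=df.columns)
df = pd.concat([header_row, df], ignore_index=True)
df.columns = range(df.shape[1])
print(df)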
