Pandas framework: determining the count of column values - python

I have a TSV file with data as shown below:
UserID ItemID
100 1
200 1
300 2
400 3
500 2
600 4
700 4
800 5
...
...
N X
I am new to the pandas framework and I want to know how I can get the count of each ItemID across all users for the above dataset. For example, if ItemID 1 appears only twice in the TSV file, I need to get the count 2, and so on. An example would be very helpful to get me going. I appreciate your help in advance!

As mentioned by @EdChum, value_counts can be used on the column "ItemID". It returns a Series indexed by "ItemID" whose values are the counts of each "ItemID".
counter = df["ItemID"].value_counts()  # df is your dataframe
print(counter[1])  # prints how many times ItemID 1 occurred
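For completeness, a minimal runnable version (the filename data.tsv is an assumption; the sample above is whitespace-separated):
import pandas as pd

df = pd.read_csv('data.tsv', sep=r'\s+')  # hypothetical filename
counter = df['ItemID'].value_counts()
print(counter[1])          # count for ItemID 1
print(counter.to_dict())   # counts for every ItemID at once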

Here are two methods:
In [14]:
import io
import pandas as pd

# set up data; note I have put UserID 100 three times
temp="""UserID ItemID
100 1
100 1
100 2
400 3
500 2
600 4
700 4
800 5"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
# count the occurrences of each user
df.groupby('UserID').count()
Out[14]:
ItemID
UserID
100 3
400 1
500 1
600 1
700 1
800 1
In [15]:
# count the occurrences of each unique ItemID value
df['ItemID'].value_counts()
Out[15]:
4 2
2 2
1 2
5 1
3 1
dtype: int64

Related

Calculating average in array under different conditions using pandas [duplicate]

I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
a b result
0 1 10 100
1 1 20 200
2 1 20 300
3 2 10 400
4 2 20 500
5 2 20 600
and want to create a new column that is the average result for the corresponding values of 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a b
1 10 100
20 250
2 10 400
20 550
Name: result, dtype: int64
but cannot figure out how to turn that into a new column in the original DataFrame. The final result should look like this:
>>> df
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
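For comparison, a merge-based sketch (my addition, not part of the original answer) that builds the same column by joining the group means back on:
means = (df.groupby(['a', 'b'], as_index=False)['result']
           .mean()
           .rename(columns={'result': 'avg_result'}))
df = df.merge(means, on=['a', 'b'], how='left')
transform is usually the better fit here, since it aligns on the original index without a join.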
Since the previous answer (https://stackoverflow.com/a/33445035/6504287) is pandas-based, I'm adding a PySpark-based solution below. In PySpark it is better to go with a Window function, as in the below snippet:
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The above code refers to the example used in the previously provided solution (https://stackoverflow.com/a/33445035/6504287).
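For reference, a self-contained sketch (the SparkSession setup and the recreated sample data are my assumptions; ext_data_df is not defined in the snippet above):
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
# recreate the pandas example data from the question
ext_data_df = spark.createDataFrame(
    [(1, 10, 100), (1, 20, 200), (1, 20, 300),
     (2, 10, 400), (2, 20, 500), (2, 20, 600)],
    ['a', 'b', 'result'])
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()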

Create a new pandas column that uses an existing columns to fill previous rows and group by based on multiple conditions

I have the below dataset:
myid id_1 Date group new_id
100 1 1-Jan-2020 A
100 2 3-Jan-2020 A
100 3 4-Jan-2020 A 101
100 4 15-Jan-2020 A
100 5 20-Feb-2020 A
200 6 3-Jan-2020 B
200 7 8-Feb-2020 B
200 8 9-Feb-2020 B 102
200 9 25-Mar-2020 B
200 9 26-Jan-2020 B
I wanted to create a column named ns.
The column "ns" needs to be created using myid, Date and new_id:
If the difference from the previous date is greater than 30 days and the row belongs to the same myid, the value should be incremented; otherwise it should retain the same value.
If new_id is not null, the row shares the same value as the previous row, and the next row gets an increment.
For every myid the value starts from 1.
Expected output :
myid id_1 Date group new_id ns
100 1 1-Jan-2020 A 1
100 2 3-Jan-2020 A 1
100 3 4-Jan-2020 A 101 1
100 4 15-Jan-2020 A 2
100 5 20-Feb-2020 A 3
200 6 3-Jan-2020 B 1
200 7 8-Feb-2020 B 2
200 8 9-Feb-2020 B 102 2
200 9 25-Mar-2020 B 3
200 9 26-Mar-2020 B 4
I have used df.groupby('CMID')['Date'].diff() and df.groupby('CMID')['PlanID'].bfill() with np.where to create multiple dummy columns in order to achieve this, and I am still working on it. Please let me know if there is a better way to go about this.
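A minimal sketch of the two rules as stated (my assumption, using the column names from the sample; the new_id rule is read as: the row after a non-null new_id gets an increment):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%Y')
# rule 1: more than 30 days since the previous row of the same myid
gap_flag = df.groupby('myid')['Date'].diff().dt.days.gt(30)
# rule 2: the row following a non-null new_id within the same myid
newid_flag = df.groupby('myid')['new_id'].shift().notna()
# every flagged row bumps the counter; the count restarts at 1 per myid
df['ns'] = (gap_flag | newid_flag).groupby(df['myid']).cumsum() + 1
This reproduces the expected ns values for myid 100; the rules may need refining where they interact.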

Is there any efficient way to filter out cluster data simultaneously in a large pandas dataframe?

I have a large pandas dataframe which looks like this:
DF:
ID setID Weight
PG_002456788.1 1 100
UG_004678935.1 2 110
UG_012975895.1 2 150
PG_023788904.1 3 200
UR_073542247.1 3 200
UR_099876678.2 3 264
PR_066120875.1 4 400
PR_098759678.1 4 600
UR_096677888.2 4 750
PG_012667994.1 5 800
PG_077555239.1 5 800
I would like to filter rows down to one representative per setID, chosen in the below order of priority based on the ID prefix:
Preference 1: ID starting with PG_
Preference 2: ID starting with UG_
Preference 3: ID starting with PR_
Preference 4: ID starting with UR_
Within the same prefix, the next priority is to choose the row with the highest Weight in each setID cluster.
Desired output:
ID setID Weight
PG_002456788.1 1 100
UG_012975895.1 2 150
PG_023788904.1 3 200
PR_098759678.1 4 600
PG_012667994.1 5 800
Also, I would like to print separately any rows that share both the same ID prefix and the same Weight, if there are any. For example:
ID setID Weight
PG_012667994.1 5 800
PG_077555239.1 5 800
IIUC you can define a pd.Categorical dummy column holding the initial substring of ID, and use it together with Weight to order the dataframe; sorting a Categorical respects the declared category order, so PG sorts ahead of UG, PR and UR. Then group by setID and take the first row:
df['ID_init'] = pd.Categorical(df.ID.str.split('_', n=1).str[0],
                               categories=['PG', 'UG', 'PR', 'UR'],
                               ordered=True)
(df.sort_values(by=['ID_init', 'Weight'], ascending=[True, False])
   .groupby('setID')
   .head(1)
   .sort_values('setID')
   .drop(columns='ID_init'))
ID setID Weight
0 PG_002456788.1 1 100
2 UG_012975895.1 2 150
3 PG_023788904.1 3 200
7 PR_098759678.1 4 600
9 PG_012667994.1 5 800
For the first part: create a new column called code from the ID. Then sort the data frame by code and Weight, group by setID, and take the first entry.
df['code'] = df['ID'].str[:2].replace({'PG': 1, 'UG': 2, 'PR': 3, 'UR': 4})
df2 = df.sort_values(['code', 'Weight'], ascending=[True, False]).groupby('setID').first()
df2 = df2.reset_index().drop('code', axis=1)
Output
setID ID Weight
0 1 PG_002456788.1 100
1 2 UG_012975895.1 150
2 3 PG_023788904.1 200
3 4 PR_098759678.1 600
4 5 PG_012667994.1 800
The second part, listing every row whose (setID, ID prefix) pair occurs more than once:
df3 = df.join(df.groupby(['setID', 'code']).count()['ID'],
              on=['setID', 'code'], rsuffix='_Count')
df3[df3['ID_Count'] > 1].drop(['code', 'ID_Count'], axis=1)
Output:
ID setID Weight
1 UG_004678935.1 2 110
2 UG_012975895.1 2 150
4 UR_073542247.1 3 200
5 UR_099876678.2 3 264
6 PR_066120875.1 4 400
7 PR_098759678.1 4 600
9 PG_012667994.1 5 800
10 PG_077555239.1 5 800
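If only rows that also tie on Weight are wanted (as in the question's example of the two PG_* rows with weight 800), one variation (my assumption, reusing the code column from above) is to mark duplicates across all three keys:
dupes = df[df.duplicated(subset=['setID', 'code', 'Weight'], keep=False)]
print(dupes.drop(columns='code'))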
