Pandas framework: determining the count of column values - python

I have a TSV file with data as shown below:
UserID ItemID
100 1
200 1
300 2
400 3
500 2
600 4
700 4
800 5
...
...
N X
I am new to the pandas framework and I want to know how I can get the count of each ItemID across all users for the above dataset. For example, if ItemID 1 appears only twice in the TSV file, I need to get the count 2, and so on. An example would be very helpful to get me going. I appreciate your help in advance!

As mentioned by @EdChum, value_counts can be used on the column "ItemID". It returns a Series indexed by "ItemID" whose values are the counts of each "ItemID".
counter = df["ItemID"].value_counts()  # df is your dataframe
print(counter[1])  # prints how many times ItemID 1 occurred
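For completeness, a minimal runnable version (the filename data.tsv is an assumption; the sample above is whitespace-separated):
import pandas as pd

df = pd.read_csv('data.tsv', sep=r'\s+')  # hypothetical filename
counter = df['ItemID'].value_counts()
print(counter[1])          # count for ItemID 1
print(counter.to_dict())   # counts for every ItemID at once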

Here are two methods:
In [14]:
import io
import pandas as pd

# set up data; note I have put UserID 100 three times
temp="""UserID ItemID
100 1
100 1
100 2
400 3
500 2
600 4
700 4
800 5"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
# count the occurrences of each user
df.groupby('UserID').count()
Out[14]:
ItemID
UserID
100 3
400 1
500 1
600 1
700 1
800 1
In [15]:
# count the occurrences of each unique ItemID value
df['ItemID'].value_counts()
Out[15]:
4 2
2 2
1 2
5 1
3 1
dtype: int64

Related

Calculating average in array under different conditions using pandas [duplicate]

I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
a b result
0 1 10 100
1 1 20 200
2 1 20 300
3 2 10 400
4 2 20 500
5 2 20 600
and want to create a new column that is the average result for the corresponding values of 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a b
1 10 100
20 250
2 10 400
20 550
Name: result, dtype: int64
but cannot figure out how to turn that into a new column in the original DataFrame. The final result should look like this:
>>> df
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
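For comparison, a merge-based sketch (my addition, not part of the original answer) that builds the same column by joining the group means back on:
means = (df.groupby(['a', 'b'], as_index=False)['result']
           .mean()
           .rename(columns={'result': 'avg_result'}))
df = df.merge(means, on=['a', 'b'], how='left')
transform is usually the better fit here, since it aligns on the original index without a join.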
Since the previous answer (https://stackoverflow.com/a/33445035/6504287) is pandas-based, I'm adding a PySpark-based solution below. In PySpark it is better to go with a Window function, as in the below snippet:
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The above code refers to the example used in the previously provided solution (https://stackoverflow.com/a/33445035/6504287).
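For reference, a self-contained sketch (the SparkSession setup and the recreated sample data are my assumptions; ext_data_df is not defined in the snippet above):
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
# recreate the pandas example data from the question
ext_data_df = spark.createDataFrame(
    [(1, 10, 100), (1, 20, 200), (1, 20, 300),
     (2, 10, 400), (2, 20, 500), (2, 20, 600)],
    ['a', 'b', 'result'])
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()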

Create a new pandas column that uses an existing columns to fill previous rows and group by based on multiple conditions

I have the below dataset:
myid id_1 Date group new_id
100 1 1-Jan-2020 A
100 2 3-Jan-2020 A
100 3 4-Jan-2020 A 101
100 4 15-Jan-2020 A
100 5 20-Feb-2020 A
200 6 3-Jan-2020 B
200 7 8-Feb-2020 B
200 8 9-Feb-2020 B 102
200 9 25-Mar-2020 B
200 9 26-Jan-2020 B
I wanted to create a column named ns.
The column "ns" needs to be created using myid, Date and new_id:
If the difference from the previous date is greater than 30 days and the row belongs to the same myid, the value should be incremented; otherwise it should retain the same value.
If new_id is not null, the row shares the same value as the previous row, and the next row gets an increment.
For every myid the value starts from 1.
Expected output :
myid id_1 Date group new_id ns
100 1 1-Jan-2020 A 1
100 2 3-Jan-2020 A 1
100 3 4-Jan-2020 A 101 1
100 4 15-Jan-2020 A 2
100 5 20-Feb-2020 A 3
200 6 3-Jan-2020 B 1
200 7 8-Feb-2020 B 2
200 8 9-Feb-2020 B 102 2
200 9 25-Mar-2020 B 3
200 9 26-Mar-2020 B 4
I have used df.groupby('CMID')['Date'].diff() and df.groupby('CMID')['PlanID'].bfill() with np.where to create multiple dummy columns in order to achieve this, and I am still working on it. Please let me know if there is a better way to go about this.
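A minimal sketch of the two rules as stated (my assumption, using the column names from the sample; the new_id rule is read as: the row after a non-null new_id gets an increment):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%Y')
# rule 1: more than 30 days since the previous row of the same myid
gap_flag = df.groupby('myid')['Date'].diff().dt.days.gt(30)
# rule 2: the row following a non-null new_id within the same myid
newid_flag = df.groupby('myid')['new_id'].shift().notna()
# every flagged row bumps the counter; the count restarts at 1 per myid
df['ns'] = (gap_flag | newid_flag).groupby(df['myid']).cumsum() + 1
This reproduces the expected ns values for myid 100; the rules may need refining where they interact.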

Is there any efficient way to filter out cluster data simultaneously in a large pandas dataframe?

I have a large pandas dataframe which looks like this:
DF:
ID setID Weight
PG_002456788.1 1 100
UG_004678935.1 2 110
UG_012975895.1 2 150
PG_023788904.1 3 200
UR_073542247.1 3 200
UR_099876678.2 3 264
PR_066120875.1 4 400
PR_098759678.1 4 600
UR_096677888.2 4 750
PG_012667994.1 5 800
PG_077555239.1 5 800
I would like to filter rows down to one representative per setID, chosen in the below order of priority based on the ID prefix:
Preference 1: ID starting with PG_
Preference 2: ID starting with UG_
Preference 3: ID starting with PR_
Preference 4: ID starting with UR_
Within the same prefix, the next priority is to choose the row with the highest Weight in each setID cluster.
Desired output:
ID setID Weight
PG_002456788.1 1 100
UG_012975895.1 2 150
PG_023788904.1 3 200
PR_098759678.1 4 600
PG_012667994.1 5 800
Also, I would like to print separately any rows that share both the same ID prefix and the same Weight, if there are any. For example:
ID setID Weight
PG_012667994.1 5 800
PG_077555239.1 5 800
IIUC you can define a pd.Categorical dummy column holding the initial substring of ID, and use it together with Weight to order the dataframe; sorting a Categorical respects the declared category order, so PG sorts ahead of UG, PR and UR. Then group by setID and take the first row:
df['ID_init'] = pd.Categorical(df.ID.str.split('_', n=1).str[0],
                               categories=['PG', 'UG', 'PR', 'UR'],
                               ordered=True)
(df.sort_values(by=['ID_init', 'Weight'], ascending=[True, False])
   .groupby('setID')
   .head(1)
   .sort_values('setID')
   .drop(columns='ID_init'))
ID setID Weight
0 PG_002456788.1 1 100
2 UG_012975895.1 2 150
3 PG_023788904.1 3 200
7 PR_098759678.1 4 600
9 PG_012667994.1 5 800
For the first part: create a new column called code from the ID. Then sort the data frame by code and Weight, group by setID, and take the first entry.
df['code'] = df['ID'].str[:2].replace({'PG': 1, 'UG': 2, 'PR': 3, 'UR': 4})
df2 = df.sort_values(['code', 'Weight'], ascending=[True, False]).groupby('setID').first()
df2 = df2.reset_index().drop('code', axis=1)
Output
setID ID Weight
0 1 PG_002456788.1 100
1 2 UG_012975895.1 150
2 3 PG_023788904.1 200
3 4 PR_098759678.1 600
4 5 PG_012667994.1 800
The second part, listing every row whose (setID, ID prefix) pair occurs more than once:
df3 = df.join(df.groupby(['setID', 'code']).count()['ID'],
              on=['setID', 'code'], rsuffix='_Count')
df3[df3['ID_Count'] > 1].drop(['code', 'ID_Count'], axis=1)
Output:
ID setID Weight
1 UG_004678935.1 2 110
2 UG_012975895.1 2 150
4 UR_073542247.1 3 200
5 UR_099876678.2 3 264
6 PR_066120875.1 4 400
7 PR_098759678.1 4 600
9 PG_012667994.1 5 800
10 PG_077555239.1 5 800
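If only rows that also tie on Weight are wanted (as in the question's example of the two PG_* rows with weight 800), one variation (my assumption, reusing the code column from above) is to mark duplicates across all three keys:
dupes = df[df.duplicated(subset=['setID', 'code', 'Weight'], keep=False)]
print(dupes.drop(columns='code'))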
