Pandas - Count occurrences in a column - python

I have a file with 19 columns of mixed dtypes. One of the columns contains elements separated by spaces. For example:
Col1     Col2
adress1  x
adress2  a b
adress3  x c
adress4  a x d
What I want to do is go over Col2, find out how many times each element occurs, and put the result in a new column alongside the corresponding value in Col1.
Note: the above columns have already been loaded into a DataFrame.
I have this, which gets me partway there, but not the result I ultimately want:
new_df = pd.DataFrame(old_df.Col2.str.split(' ').tolist(), index=old_df.Col1).stack()
How do I put the results in a new column (replacing Col2) and also keep the remaining columns?
Something like:
Col1     Col2  Col3
adress1  x     something
adress2  a     something1
adress2  b     something1
adress3  x     NaN
adress3  c     NaN
Also, how do I calculate the number of occurrences of each item in Col2?

We can split first, then explode:
s=df.assign(Col2=df.Col2.str.split()).explode('Col2')
s=s.groupby(['Col1','Col2']).size().to_frame('count').reset_index()
Out[48]:
Col1 Col2 count
0 adress1 x 1
1 adress2 a 1
2 adress2 b 1
3 adress3 c 1
4 adress3 x 1
5 adress4 a 1
6 adress4 d 1
7 adress4 x 1
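The rebuild above drops the remaining columns. A minimal sketch (with assumed toy data, since the real file has 19 columns; Col3 stands in for the rest) that keeps the other columns through the explode and maps an overall occurrence count back onto each row:

```python
import pandas as pd

# Toy frame standing in for the 19-column original; Col3 is a placeholder
# for the remaining columns.
df = pd.DataFrame({
    "Col1": ["adress1", "adress2", "adress3", "adress4"],
    "Col2": ["x", "a b", "x c", "a x d"],
    "Col3": ["something", "something1", None, "something2"],
})

# Explode Col2 so each element gets its own row; every other column is kept.
s = df.assign(Col2=df["Col2"].str.split()).explode("Col2")

# Count how often each element occurs across the whole column, then map
# that frequency back onto each exploded row.
s["count"] = s["Col2"].map(s["Col2"].value_counts())
print(s)
```

Here `value_counts` gives the overall frequency of each element (x appears 3 times), rather than the per-(Col1, Col2) count in the answer above; pick whichever matches the intent.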

Related

Repeat a particular row in pandas n times and increase a count column for every repeat

I have a dataframe
df_in = pd.DataFrame([["A","X",5,4,1],["B","Y",3,3,1],["C","Y",4,7,4]], columns=['col1', 'col2', 'col3', 'col4','col5'])
I want to repeat a row n times, and the count should also increase from the number present in col4.
Ex: I want to repeat the B row 3 times, and the count in col4 should increase from the current value, like 3, 4 and 5. Similarly, repeat the C row 2 times and increase the count in col4 from its current value.
Expected Output:
df_Out = pd.DataFrame([["A","X",5,4,1],["B","Y",3,3,1],["B","Y",3,4,1],["B","Y",3,5,1],["C","Y",4,7,4],["C","Y",4,8,4]], columns=['col1', 'col2', 'col3', 'col4','col5'])
How to do it?
Create a dictionary for the number of repeats, map it with Series.map (defaulting to 1 where there is no match), then use Index.repeat with DataFrame.loc to append the rows, and finally add a counter from GroupBy.cumcount to col4:
d = {'B': 3, 'C': 2}
# map repeat counts (default 1) and cast to int, since Index.repeat needs integers
df = df_in.loc[df_in.index.repeat(df_in['col1'].map(d).fillna(1).astype(int))]
df['col4'] += df.groupby(level=0).cumcount()
df = df.reset_index(drop=True)
print(df)
col1 col2 col3 col4 col5
0 A X 5 4 1
1 B Y 3 3 1
2 B Y 3 4 1
3 B Y 3 5 1
4 C Y 4 7 4
5 C Y 4 8 4
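A minimal sketch isolating the counter step: after Index.repeat, the repeated rows keep their original index label, so grouping by the index (level=0) and taking cumcount yields 0, 1, 2, ... within each repeated block, which are exactly the offsets added to col4:

```python
import pandas as pd

df_in = pd.DataFrame([["A", "X", 5, 4, 1], ["B", "Y", 3, 3, 1], ["C", "Y", 4, 7, 4]],
                     columns=["col1", "col2", "col3", "col4", "col5"])
d = {"B": 3, "C": 2}

reps = df_in["col1"].map(d).fillna(1).astype(int)  # repeat counts: 1, 3, 2
df = df_in.loc[df_in.index.repeat(reps)]           # index becomes 0, 1, 1, 1, 2, 2

# Per-index-label running counter: 0 for the first copy, 1 for the second, ...
offsets = df.groupby(level=0).cumcount()
print(offsets.tolist())  # → [0, 0, 1, 2, 0, 1]
```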

Is there a way to create n number of dataframes from columns in a dataframe? Python

Hope you're fine.
I'm working with some dataframes that look like this:
df:
     Col1  Col2  Col3  ...  Coln
Row1    A     7     2          n
Row2    B     5    10          n
Row3    C     3     5          n
As you can see, it has n columns. I'm trying to make n dataframes pairing the column "Col1" with each one of the others, so that I can then refer to each dataframe or apply a function to all of them. It would look something like this:
df1:
Col1 Col2
Row1 A 7
Row2 B 5
Row3 C 3
df2:
Col1 Col3
Row1 A 2
Row2 B 10
Row3 C 5
...
dfn:
Col1 Coln
Row1 A n
Row2 B n
Row3 C n
I know that I can manually use .iloc[:,n] but that's not practical for n columns.
So, I have tried this way with dictionaries:
columns_list = df.columns.values.tolist()
d = {}
for name in columns_list:
    for i in range(1, len(df.columns)+1):
        d[name] = pd.DataFrame(data=(df1["Col1"], df.iloc[:, i]), columns=["XYZ", "ABC"])
Bad news: doesn't work.
I have also tried with a function:
df_base = pd.DataFrame(data = df.iloc[:,0])
def particion(df):
    for i in range(1, len(df.columns)+1):
        df["df_" + str(i)] = df_base.join(df.iloc[:, i])
Bad news again: doesn't work.
I have done my research but couldn't find anyone dealing with this specific problem.
Does someone have an idea of what I can do?
So you want to start by creating a list of your variable names; this can be done with a list comprehension. As an example, with n=5:
n = 5
variable_names = [f"df{i}" for i in range(1,n+1)]
print(variable_names) # Output: ['df1', 'df2', 'df3', 'df4', 'df5']
From here you can create your list of column names and a constant for the first column name:
FIRST_COLUMN_NAME = list(df.columns)[0]
column_names = list(df.columns)[1:]
Then you can make use of globals() and zip() to iterate through and create the variables:
for variable, column_name in zip(variable_names, column_names):
    globals()[variable] = df[[FIRST_COLUMN_NAME, column_name]]
Using a test dataframe:
   col1  col2  col3  col4  col5  col6
0     1     2     3     4     5     6
1     2     3     4     5     6     7
I received the following outputs:
>>> print(df1)
col1 col2
0 1 2
1 2 3
>>> print(df2)
col1 col3
0 1 3
1 2 4
>>> print(df3)
col1 col4
0 1 4
1 2 5
and so on.
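If the variables don't strictly need to be globals, a dict keeps all the frames in one named container, which is usually easier to loop over or apply a function to. A sketch of that alternative (not from the answer above, and with assumed toy data):

```python
import pandas as pd

# Toy frame with one key column plus three value columns.
df = pd.DataFrame({"col1": [1, 2], "col2": [2, 3], "col3": [3, 4], "col4": [4, 5]})

first = df.columns[0]
# One two-column frame per remaining column, keyed "df1", "df2", ...
frames = {f"df{i}": df[[first, col]]
          for i, col in enumerate(df.columns[1:], start=1)}

print(frames["df2"])  # col1 paired with col3
```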

Replace values in pandas based on aggregation and condition

I have a dataframe like this:
I want to replace values in col1 with a specific value (e.g. "b"). I need to count the records of each group based on col1 and col2. For example, the count of col1 = a, col2 = t is 3, and the count of col1 = a, col2 = u is 1.
If the count is greater than 2, then replace the value of col1 with 'b'. For this example, I want to replace all "a" values with "b" where col2 = t.
I tried the code below, but it did not change all of the "a" values meeting this condition.
import pandas as pd
df = pd.read_excel('c:/test.xlsx')
df.loc[df[(df['col1'] == 'a') & (df['col2'] == 't')].agg("count")["ID"] >2, 'col1'] = 'b'
I want this result:
You can use numpy.where and check whether all your conditions are satisfied. If yes, replace the values in col1 with 'b'; otherwise leave the values as is:
import numpy as np
df['col1'] = np.where((df['col1'] == 'a') &
                      (df['col2'] == 't') &
                      (df.groupby('col1')['ID'].transform('count') > 2),
                      'b', df['col1'])
prints:
ID col1 col2
0 1 b t
1 2 b t
2 3 b t
3 4 a u
4 5 b t
5 6 b t
6 7 b u
7 8 c t
8 9 c u
9 10 c w
Using transform('count') checks whether the grouped (by col1) ID column has more than 2 values.
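Since the question's sample data was only shown as an image, here is a runnable sketch with the input reconstructed from the printed output above:

```python
import numpy as np
import pandas as pd

# Reconstructed sample data (the original question showed it only as an image).
df = pd.DataFrame({
    "ID":   range(1, 11),
    "col1": list("aaaaaabccc"),
    "col2": list("tttuttutuw"),
})

# Replace 'a' with 'b' where col2 is 't' and the col1 group has more than 2 rows.
df["col1"] = np.where(
    (df["col1"] == "a")
    & (df["col2"] == "t")
    & (df.groupby("col1")["ID"].transform("count") > 2),
    "b",
    df["col1"],
)
print(df)
```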

Pandas Groupby mean and first of multiple columns

My Pandas df looks like the following, and I want to apply groupby and then calculate the mean of some columns and take the first value of others:
index  col1  col2  col3  col4  col5  col6
0      a     c     1     2     f     5
1      a     c     1     2     f     7
2      a     d     1     2     g     9
3      b     d     6     2     g     4
4      b     e     1     2     g     8
5      b     e     1     2     g     2
I tried something like this:
df.groupby(['col1','col5']).agg({['col6','col3']:'mean',['col4','col2']:'first'})
Expected output:
col1 col5 col6 col3 col4 col2
a f 6 1 2 c
a g 9 1 2 d
b g 4 3 2 e
but it seems a list is not an option here. In my real dataset I have hundreds of columns of different natures, so I can't pass them individually. Any thoughts on passing them as lists?
If you have lists depending on the aggregation, you can do:
l_mean = ['col6','col3']
l_first = ['col4','col2']
df.groupby(['col1','col5']).agg({**{col: 'mean' for col in l_mean},
                                 **{col: 'first' for col in l_first}})
The **{} notation unpacks a dictionary; writing {**{}, **{}} builds one dictionary from two (it could be more than two), like a union of dictionaries. And {col: 'mean' for col in l_mean} is a dictionary comprehension that maps each column in the list to 'mean'.
Or using concat:
gr = df.groupby(['col1','col5'])
pd.concat([gr[l_mean].mean(),
           gr[l_first].first()],
          axis=1)
and reset_index afterwards to get the expected output.
Another option is named aggregation:
(
df.groupby(['col1','col5'])
.agg(col6=('col6', 'mean'),
col3=('col3', 'mean'),
col4=('col4', 'first'),
col2=('col2', 'first'))
)
This is an extension of #Ben.T's solution, just wrapping it in a function and passing it via the pipe method:
# set the lists: list1, list2
def fil(grp, list1, list2):
    A = grp.mean().filter(list1)
    B = grp.first().filter(list2)
    C = A.join(B)
    return C
grp1 = ['col6','col3']
grp2 = ['col4','col2']
m = df.groupby(['col1','col5']).pipe(fil,grp1,grp2)
m
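For "hundreds of columns", the two lists themselves can be built automatically from the dtypes. A sketch, assuming (hypothetically) that numeric columns should be averaged and everything else should take its first value:

```python
import pandas as pd

# Same sample data as the question.
df = pd.DataFrame({"col1": list("aaabbb"), "col2": list("ccddee"),
                   "col3": [1, 1, 1, 6, 1, 1], "col4": [2] * 6,
                   "col5": list("ffgggg"), "col6": [5, 7, 9, 4, 8, 2]})

keys = ["col1", "col5"]
# Build the agg dict from dtypes: numeric -> mean, everything else -> first.
agg = {c: ("mean" if pd.api.types.is_numeric_dtype(df[c]) else "first")
       for c in df.columns.difference(keys)}
out = df.groupby(keys).agg(agg).reset_index()
print(out)
```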

How to switch n columns to rows of a r rows pandas dataframe (n*r rows in the final dataframe)?

Let's take this dataframe :
pd.DataFrame(dict(Col1=["a","c"],Col2=["b","d"],Col3=[1,3],Col4=[2,4]))
Col1 Col2 Col3 Col4
0 a b 1 2
1 c d 3 4
I would like to have one row per value in column Col1 and column Col2 (n=2 and r=2, so the expected dataframe has 2*2 = 4 rows).
Expected result :
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
How could I do this, please?
Pandas melt does the job here; the rest just has to do with repositioning and renaming the columns appropriately.
Use pandas melt to transform the dataframe, with Col3 and Col4 as the id variables. melt typically converts from wide to long.
Next step - reindex the columns, with variable and value as lead columns.
Finally, rename the columns appropriately.
(df.melt(id_vars=['Col3','Col4'])
.reindex(['variable','value','Col3','Col4'],axis=1)
.rename({'variable':'Ind','value':'Value'},axis=1)
)
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
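As a side note, melt's var_name/value_name parameters can absorb the rename step, leaving only the column reorder; a sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(Col1=["a", "c"], Col2=["b", "d"], Col3=[1, 3], Col4=[2, 4]))

# Name the variable/value columns directly in melt, then reorder.
out = (df.melt(id_vars=["Col3", "Col4"], var_name="Ind", value_name="Value")
         [["Ind", "Value", "Col3", "Col4"]])
print(out)
```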
