Appending a list to dataframe and adding count - python

I have a pandas data frame and a list -
d={'abc':[0,2,4,5,2,2],'bec':[0,5,6,4,0,2],'def':[7,6,0,1,1,2],'rtr':[5,6,7,2,0,3],'rwr':[5,6,7,1,0,5],'xx':[4,5,6,7,8,7]}
X=pd.DataFrame(d)
abc bec def rtr rwr xx
0 0 0 7 5 5 4
1 2 5 6 6 6 5
2 4 6 0 7 7 6
3 5 4 1 2 1 7
4 2 0 1 0 0 8
5 2 2 2 3 5 7
l=[ 'bec','def','cef','ghd','rtr','fgh','ewr']
Now I want to append the list to data frame in the following way-
For each row in dataframe- We count the number of non zero elements in it(lets say it is 3 in the first row
We take 50% of 3= 1.5 (rounded to 1) and we append those many elements from the list l to the row(starting from the beginning). For the first row it is 'bec', since 'bec' is already present in the
row we increase its count by 1.
If the element from list is not present in the dataframe we append it at the end.
Dry run-
for row 1(index 1)- no of non zero elements is 6. So 50% of it is 3. So we take the first 3 elements from list['bec','def','cef']. 'bec' is already present so its count increases by 1 and it becomes(2,2)=6.
Similarly 'def' is present so it becomes(2,3)=7. 'cef' isn't present in the dataframe so we add it and make the count as 1.
The final output looks like this-
abc bec def rtr rwr xx cef
0 0 1 8 5 5 4 0
1 2 6 7 6 6 5 1
2 4 7 1 7 7 6 0
3 5 5 2 2 1 7 1
4 2 1 1 0 0 8 0
5 2 1 1 3 5 7 1

We can use ne + sum along axis=1 to count the non zero values in each row, followed by floordiv with 2 to consider only 50% of these counts, next create a list of record with the help of dict.fromkeys method inside a list comprehension, now create a dataframe lets say y from these records and add it with X to get the desired result
y = pd.DataFrame(dict.fromkeys(l[:i], 1)
for i in X.ne(0).sum(1).floordiv(2).astype(int))
X.add(y.fillna(0), fill_value=0).astype(int)
abc bec cef def rtr rwr xx
0 0 1 0 8 5 5 4
1 2 6 1 7 6 6 5
2 4 7 0 1 7 7 6
3 5 5 1 2 2 1 7
4 2 1 0 1 0 0 8
5 2 3 1 3 3 5 7

Related

Pandas: Get rows with consecutive column values

I need to go through a large pd and select consecutive rows with similar values in a column. i.e. in the pd below and selecting column x: I want to specify consecutive values in column x? Say if I want consecutive values of 3 and 5 only
col row x y
1 1 1 1
5 7 3 0
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
5 8 6 2
3 7 6 0
The results output would be:
col row x y consecutive-count
6 3 3 8 1
9 2 3 4 1
5 3 3 9 1
5 5 5 1 2
3 7 5 2 2
I tried
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
But that includes the consecutive 6 that I don't want.
I also tried:
df.query( 'x in [3,5]')
That prints every row where x has 3 or 5.
IIUC use masks for boolean indexing. Check for 3 or 5, and use a cummax and reverse cummax to ensure having the order:
m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
out = df[(m1|m2)&(m1.cummax()&m2[::-1].cummax())]
Output:
col row x y
2 6 3 3 8
3 9 2 3 4
4 5 3 3 9
6 5 5 5 1
7 3 7 5 2
you can create a group column for consecutive values, and filter by the group count and value of x:
# create unique ids for consecutive groups, then get group length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")
# filter main df:
df2 = df[(df.x.isin([3,5])) & (group_len > 1)]
# add new group num col
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
output:
col row x y consecutive-count
3 6 3 3 8 1
4 9 2 3 4 1
5 5 3 3 9 1
7 5 5 5 1 2
8 3 7 5 2 2

Add column with group-by length

I have a pandas dataframe with several columns. I want to add a new column containing the number of values for which two values are the same.
For example, suppose I have the following dataframe:
x y
0 1 5
1 2 7
2 3 2
3 7 3
4 2 7
5 6 5
6 5 3
7 2 7
8 2 2
I want to add a third column that contains the number of values for which both x and y are the same. The desired output here would be
x y frequency
0 1 5 1
1 2 7 3
2 3 2 1
3 7 3 1
4 2 7 3
5 6 5 1
6 5 3 1
7 2 7 3
8 2 2 1
For instance, all rows with (x, y) = (2, 7) equal three because (2, 7) appears three times in the dataframe.
One way to get the output is to create a "hash" (i.e., df['hash'] = df['x'].astype(str) + ',' + df['y'].astype(str) followed by df['frequency'] = df['hash'].map(collections.Counter(df['hash'))), but can we do this directly with group-by? The frequency column is exactly equal to the entry's group in df.groupby(['x', 'y']).
Thanks
IIUC this will work for you:
df['frequency'] = df.groupby(['x','y'])['y'].transform('size')
Output:
x y frequency
0 1 5 1
1 2 7 3
2 3 2 1
3 7 3 1
4 2 7 3
5 6 5 1
6 5 3 1
7 2 7 3
8 2 2 1

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove groups from dataframe that has difference of value bigger than one from the last group. Here is an example:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 8 <- here number of group if I groupby by Value is larger than
7 8 the last groups number by 6, so I want to remove this
8 3 group from dataframe
9 3
Expected result:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Edit:
jezrael solution is great, but in my case it is possible that there will be dubplicate group values:
Value
0 1
1 1
2 1
3 3
4 3
5 3
6 1
7 1
Sorry if I was not clear about this.
First remove duplicates for unique rows, then compare difference with shifted values and last filter by boolean indexing:
s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]
df = df[~df['Value'].isin(v)]
print (df)
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
Maybe:
df2 = df.drop_duplicates()
print(df[df['Value'].isin(df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist())])
Output:
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
We can check if the difference is less than or equal to 5 or NaN. After we check if we have duplicates and keep those rows:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3

Reset count of group after pandas dataframe concatenation [duplicate]

This question already has answers here:
Pandas recalculate index after a concatenation
(3 answers)
Closed 3 years ago.
I am doing data processing and I have a problem figuring out how to reset groups counter after concatenating pandas dataframes. Here is an example below to illustrate my problem:
For example I have two dataframes:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
Counter Value
0 1 8
1 1 10
2 2 2
3 2 4
4 2 10
after concatenation I get:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
0 1 8
1 1 10
2 2 2
3 2 4
4 2 10
and I want to reset counter and make it sequential and make counter values to be by one digit bigger than the last group of counters.
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
0 3 8
1 3 10
2 4 2
3 4 4
4 4 10
I was trying to shift all dataframe by one upwards and compare shifted values with original and if original one is bigger that the shifted one, add original value to all values below it. But this solution is not always working due to noisy and inconsistent raw data.
You can just add the maximum value in the Counter column in the first dataframe to the second before concatenating:
df2.Counter += df1.Counter.max()
pd.concat([df1, df2], ignore_index=True)
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
5 3 8
6 3 10
7 4 2
8 4 4
9 4 10
Or another way using shift():
df=pd.concat([df1,df2])
df=df.assign(Counter_1=df.Counter.ne(df.Counter.shift()).cumsum())
#for same col df=df.assign(Counter=df.Counter.ne(df.Counter.shift()).cumsum())
Counter Value Counter_1
0 1 3 1
1 1 4 1
2 1 2 1
3 2 4 2
4 2 10 2
0 1 8 3
1 1 10 3
2 2 2 4
3 2 4 4
4 2 10 4

Binning all values with pandas.cut

I have a dataframe that looks like the following:
index value
1 21.046091
2 52.400000
3 14.082153
4 1.859942
5 1.859942
6 2.331143
7 9.060000
8 0.789265
9 12967.7
The last value is much higher than the rest. I'm trying to bin all the values into 5 bins using pd.cut:
pd.cut(df['value'], 5, labels = [1,2,3,4,5])
But it only ends up returning the groups 1 and 5.
index value group
0 0.410000 1
1 21.046091 1
2 52.400000 1
3 14.082153 1
4 1.859942 1
5 1.859942 1
6 2.331143 1
7 9.060000 1
8 0.789265 1
9 12967.7 5
The higher value is clearly throwing it, but is there a way to ensure that all five bins are represented in the dataframe without getting rid of outlying values?
You could use qcut:
pd.qcut(df['value'],5,labels=[1,2,3,4,5])
Output:
index
1 4
2 5
3 4
4 1
5 1
6 2
7 3
8 1
9 5
Name: value, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
print(df.assign(group = pd.qcut(df['value'],5,labels=[1,2,3,4,5])))
value group
index
1 21.046091 4
2 52.400000 5
3 14.082153 4
4 1.859942 1
5 1.859942 1
6 2.331143 2
7 9.060000 3
8 0.789265 1
9 12967.700000 5

Categories

Resources