Find columns with only one non-zero value in pandas - python

Let me begin by noting that this question is very close to this question about getting non-zero values for each column in a pandas dataframe, but in addition to getting the values, I would also like to know the rows from which they were drawn. (And, ultimately, I would like to be able to re-use the code to find columns in which a non-zero value occurred some number x of times.)
What I have is a dataframe with a count of words for a given year of documents:
|Year / Term | word1 | word2 | word3 | ... | wordn |
|------------|-------|-------|-------|-----|-------|
| 2001 | 23 | 0 | 0 | | 0 |
| 2002 | 0 | 0 | 12 | | 0 |
| 2003 | 0 | 42 | 34 | | 0 |
| year(n) | 0 | 0 | 0 | | 45 |
So for word1 I would like to get both 23 and 2001 -- this could be as a tuple or as a dictionary. (It doesn't matter so long as I can work through the data.) And ultimately, I would like very much to be able to discover that word3 enjoyed a two-year span of usage.
FTR, the dataframe has only 16 rows, but it has a lot, a lot of columns. If there's an answer to this question already available, revealing the weakness of my search fu, I will take the scorn as my just due.

In your case, melt then groupby:
df.melt('Year / Term').loc[lambda x : x['value'] != 0].groupby('variable')['value'].apply(tuple)
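If you also need the year each count came from, here is a minimal sketch building on the same melt (not from the original answer; column names follow the example table):
long = df.melt('Year / Term', var_name='word', value_name='count')
nonzero = long[long['count'] != 0]

# one list of (year, count) pairs per word
pairs = nonzero.groupby('word')[['Year / Term', 'count']].apply(
    lambda g: list(zip(g['Year / Term'], g['count'])))

# words whose non-zero count occurs in exactly one year
single_year = nonzero['word'].value_counts().loc[lambda s: s == 1].index

Changing the == 1 threshold covers the "occurred some number x of times" case.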


Remove rows from dataframe if it has partial match with other rows for specific columns

I want to remove rows in a dataframe which have partial overlaps in their start and end character indices.
Details:
I have two sentences and I have extracted some entities from them and organized them in a dataframe.
Sentences:
| id | sentence |
| --- | --- |
| 1 | Today is a very sunny day and sun is shining |
| 2 | I bought the red balloon and playing with it |
My dataframe with the extracted entities looks like this:
| id | data | start_char_index | end_char_index | token_position |
| ---| -------------- | ---------------- | -------------- | -------------- |
| 1 | very sunny day | 11 | 26 | [4,5,6] |
| 1 | shining | 37 | 45 | [10] |
| 1 | sunny | 16 | 21 | [5] |
| 2 | the red balloon | 9 | 25 | [3,4,5] |
| 2 | playing | 29 | 37 | [7] |
| 2 | red | 13 | 16 | [4] |
P.S. Here, token_position is the index of the specific token in the sentence (starting from 1).
Now, for id 1, we see that 'very sunny day' and 'sunny' partially overlap (their start and end character indices and token positions both overlap).
The same holds for id 2, where 'the red balloon' and 'red' overlap on red. I want to remove the rows 'sunny' and 'red', which are the smaller of the overlapping pairs in the two ids.
I thought about grouping the rows by id and then removing those records by storing the start and end character indices (or token positions) in a dictionary, but with a lot of data rows and a lot of ids that would be very slow.
Also, I read about IntervalTree, but I could not get it to work efficiently for partial overlaps.
So could you please suggest some solution for this?
The final output dataframe should look like this:
| id | data | start_char_index | end_char_index | token_position |
| ---| -------------- | ---------------- | -------------- | -------------- |
| 1 | very sunny day | 11 | 26 | [4,5,6] |
| 1 | shining | 37 | 45 | [10] |
| 2 | the red balloon | 9 | 25 | [3,4,5] |
| 2 | playing | 29 | 37 | [7] |
Thanks for the help in advance :)
Apart from Mortz's answer, I also tried pandas IntervalArray and overlaps, which was faster for me. Putting it here for anyone else who might find it useful (credits: https://stackoverflow.com/a/69336914/15941713):
import pandas as pd

def drop_subspan_duplicates(df):
    # build an IntervalArray from the start/end columns of the group
    idx1 = pd.arrays.IntervalArray.from_arrays(df['start'],
                                               df['end'],
                                               closed='both')
    # for each row, record the index of the first row whose interval overlaps it
    df['wrd_id'] = df.apply(lambda x: df.index[idx1.overlaps(pd.Interval(x['start'], x['end'], closed='both'))][0], axis=1)
    # rows that share the same first overlapping row are sub-spans of it; keep only the first
    df = df.drop_duplicates(['wrd_id'], keep='first')
    df.drop(['wrd_id'], axis=1, inplace=True)
    return df

output = data.groupby('id').apply(drop_subspan_duplicates)
One can also refer to this answer for tackling the issue if one wishes to avoid dataframe operations
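For completeness, here is a plain-Python sketch of the same idea that drops a span whenever it is contained in a longer span of the same id (column names follow the question; this is not from either answer, and it is quadratic per id, so it only suits small groups):
def drop_contained(rows):
    # rows: list of dicts, each with 'start_char_index' and 'end_char_index'
    keep = []
    for r in rows:
        contained = any(o is not r
                        and o['start_char_index'] <= r['start_char_index']
                        and r['end_char_index'] <= o['end_char_index']
                        for o in rows)
        if not contained:
            keep.append(r)
    return keep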
I am not sure a DataFrame is the best structure to solve this problem - but here is one approach
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1], 'start': [11, 20, 16], 'end': [18, 35, 17]})
# First we construct a range of numbers from the start and end index
df.loc[:, 'range'] = df.apply(lambda x: list(range(x['start'], x['end'])), axis=1)
# Next, we "cumulate" these ranges and measure the number of unique elements in the cumulative range at each row
df['range_size'] = df['range'].cumsum().apply(lambda x: len(set(x)))
# Finally we check if every row adds anything to the cumulative range - if a new row adds nothing, then we can drop that row
df['range_size_shifted'] = df['range'].cumsum().apply(lambda x: len(set(x))).shift(1)
df['drop'] = df.apply(lambda x: False if pd.isna(x['range_size_shifted']) else not int(x['range_size'] - x['range_size_shifted']), axis=1)
print(df[['id', 'start', 'end', 'drop']])
# id start end drop
#0 1 11 18 False
#1 1 20 35 False
#2 1 16 17 True
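To actually remove the flagged rows afterwards, a small follow-up sketch (not part of the original answer):
# keep only the rows that contributed new character positions
result = df.loc[~df['drop'], ['id', 'start', 'end']]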
If you want to do this for each group separately -
for key, group in df.groupby('id'):
    group.loc[:, 'range'] = group.apply(lambda x: list(range(x['start'], x['end'])), axis=1)
    group['range_size'] = group['range'].cumsum().apply(lambda x: len(set(x)))
    group['range_size_shifted'] = group['range'].cumsum().apply(lambda x: len(set(x))).shift(1)
    group['drop'] = group.apply(lambda x: False if pd.isna(x['range_size_shifted']) else not int(x['range_size'] - x['range_size_shifted']), axis=1)
    print(group)

Python Pandas datasets - Having integer values in a new column by making a dictionary

I am trying to output integer values (labels/classes) in a new column, based on the labels of another column in my dataset. So far I have done it by creating a new column (with a numerical column heading) for each class, filled with boolean values, and then using those columns to build the new class column with numerical values. But I was trying to do it with a dictionary, which I think is a cleaner and faster way.
If I run a code like this:
x = df['Item_Type'].value_counts()
item_type_mapping = {}
item_list = x.index
for i in range(0, len(item_list)):
    item_type_mapping[item_list[i]] = i
It generates the dictionary, but then if I run:
df['Item_Type']=df['Item_Type'].map(lambda x:item_type_mapping[x])
or
df['New_column']=[item_type_mapping[item] for item in data.Item_Type]
It raises a KeyError: None.
Does anybody know why this occurs? I find it strange, since the dictionary has been created and I can see it in my variables.
Thanks
Edit 1
@Fourier: simply, I have this column:
| Item_type|
| -------- |
| Nino |
| Nino |
| Nino |
| Pasquale |
| Franco |
| Franco |
and then I need the same column or a new one to display:
| Item_type| New_column |
| -------- | ---------- |
| Nino | 1 |
| Nino | 1 |
| Nino | 1 |
| Pasquale | 2 |
| Franco | 3 |
| Franco | 3 |
Your code works for me, but what you're trying to do is already provided by pandas as Categorical data.
df = pd.DataFrame({'Item_Type': list('abca')})
df['New_column'] = df.Item_Type.astype('category').cat.codes
Result:
Item_Type New_column
0 a 0
1 b 1
2 c 2
3 a 0
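If the 1-based, order-of-appearance numbering shown in the question matters, pd.factorize is an alternative (a sketch, not part of the original answer):
# codes follow order of first appearance, matching the question's Nino -> 1, Pasquale -> 2, Franco -> 3
df['New_column'] = pd.factorize(df['Item_Type'])[0] + 1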

Performant alternative to constructing a dataframe by applying repeated pivots

I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
|-----------|--------|---|---|---|
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a new dataframe based on results computed from this one. As a simple example, the new dataframe should contain one row per sample with the averages across its runs:
| sample_id | avg_x | avg_y | avg_z |
|-----------|-------|-------|-------|
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index=index, columns='run_id', values=['x', 'y', 'z'], aggfunc='mean')
    # Add some other features. Involves creating more columns than existed in the initial df_samples dataframe
    pivots.append(pivot)
# create the new dataframe
pd.concat(pivots)
So my first question is: if I want to create a new dataframe that consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot call instead of calling it iteratively? And if there is, is it more performant?
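For reference, the simple averaging example above can be written as a single grouped call rather than a loop (a minimal sketch; column names follow the example and df_samples is the dataframe from the loop above):
# average x, y, z across runs for every sample in one pass
avg = (df_samples.groupby('sample_id')[['x', 'y', 'z']]
       .mean()
       .add_prefix('avg_')
       .reset_index())

Grouping once over the whole frame avoids the repeated boolean filtering inside the loop.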
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe also grows in dimension, i.e. it might look like:
| s_id | avg_x | avg_y | avg_z | new_feature_1 | new_feature_2 |
|------|-------|-------|-------|---------------|---------------|
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
Aside: I am looking for a good resource on working with large pandas dataframes, performantly constructing new ones, and performing queries. I am almost always able to get the result I want using pandas, but my implementations are often inefficient and akin to how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory I do not know about dataframes and tables, etc. A recommendation for a resource would be good. Note that this is just additional helpful information; a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without a recommendation for a resource.

Pandas - applying groupings and counts to multiple columns in order to generate/change a dataframe

I'm sure what I'm trying to do is fairly simple for those with better knowledge of PD, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date | Value |
+---------+------------+-------+
| 1 | 01/01/2016 | a |
+---------+------------+-------+
| 2 | 01/01/2016 | b |
+---------+------------+-------+
| 3 | 01/01/2016 | c |
+---------+------------+-------+
...etc, into:
+------------+---------------------+---------+---------+---------+
| Date | #of triggers | count a | count b | count c |
+------------+---------------------+---------+---------+---------+
| 01/01/2016 | 3 | 1 | 1 | 1 |
+------------+---------------------+---------+---------+---------+
| 02/01/2016 | 5 | 2 | 1 | 2 |
+------------+---------------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea of how to achieve this.
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DF is loaded from an SQL Alchemy query object, and then I want to manipulate it to get the result I described above. How would one do this?
Thanks
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack; the parameter sort=False is also helpful:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print(df1)
Value       Total  a  b  c
Date
01/01/2016      3  1  1  1
The difference between size and count is:
size counts NaN values, count does not.
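A tiny illustration of that difference (a sketch with made-up data, not from the original answer):
import pandas as pd
import numpy as np

tmp = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, np.nan, 2]})
print(tmp.groupby('g')['v'].size())   # x -> 2, y -> 1 (the NaN is counted)
print(tmp.groupby('g')['v'].count())  # x -> 1, y -> 1 (the NaN is ignored)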

Python/Pandas: Conditional summing of columns

In python I have a pandas data frame similar to the one below:
|            | AUG12 |    |    |    | UNDERLYING | VOL |
|------------|-------|----|----|----|------------|-----|
|            | 45    | 49 | 50 | 55 |            |     |
| 2012-11-14 | 1     | 1  | 2  | 3  | 49         | ?   |
| ...        | ...   | ...| ...| ...| ...        | ... |
The task is: for each row, find the column names under AUG12 that are greater than UNDERLYING (49), sum the corresponding values (2 + 3), and put the result into VOL (5). How can I accomplish this in Python? Many thanks in advance!
You could use the DataFrame.apply function:
def conditional_sum(row):
    underlying = row['UNDERLYING'].iloc[0]  # extra indexing is required due to the multi-level index in the column names
    strikes = row['AUG12']
    # keep only the values whose column label (strike) is greater than the underlying, then sum them
    return strikes[strikes.index > underlying].sum()

df.apply(conditional_sum, axis=1)
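A vectorized alternative that avoids apply (a minimal sketch, assuming the strike labels under AUG12 are numeric; not part of the original answer):
strikes = df['AUG12']                      # sub-frame with the strike columns (45, 49, 50, 55)
underlying = df['UNDERLYING'].iloc[:, 0]   # the underlying price per row

# boolean mask: True where the column label (strike) exceeds the row's underlying
mask = strikes.columns.to_numpy()[None, :] > underlying.to_numpy()[:, None]

# zero out the masked-off strikes and sum across each row, giving the VOL values
vol = (strikes.to_numpy() * mask).sum(axis=1)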
