Group by specific token in Pandas Dataframe - python

So I have my DataFrame, which is formatted as below:
Sentiments.head()
   Sentiment                                              Tweet
0          0  [corona, updat, govern, vow, pay, wage, staff,...
1          0  [open, today, til, PM, takeaway, beer, need, s...
2          0  [that, call, corona, viru, coronaviru, london,...
3          1  [that, th, person, know, bought, corona, dog, ...
4          1  [hhmmm, colodia, drifu, nigeria, believ, coron...
I need to group the tweets using the tokens 'govern' and 'Johnson'. I have tried the code below:
grouped_df = Sentiments.groupby('Tweet')
grouped_df.get_group('govern')
However, I get an error:
TypeError: unhashable type: 'list'
Both of the columns were built from lists, so is it possible to group by specific tokens, or do I need to change the datatypes?
Thanks in advance!

Return the DataFrame rows which contain both the words 'govern' and 'Johnson'.
This can be done using set arithmetic in the following way; consider this simple example of getting records where both 'bb' and 'cc' are present:
import pandas as pd
def has_bb_cc(x):
    return set(['bb', 'cc']).issubset(x)
df = pd.DataFrame({'col1':['a','b','c'],'col2':[['aa','bb','cc'],['bb','cc','dd'],['cc','dd','ee']]})
bb_cc_df = df[df.col2.apply(has_bb_cc)]
print(bb_cc_df)
output:
  col1          col2
0    a  [aa, bb, cc]
1    b  [bb, cc, dd]
Explanation: I define a function that uses set arithmetic to check whether both 'bb' and 'cc' are present, then apply it to the column of lists. This produces a pandas.Series of True/False values, which I use as a boolean mask to extract the matching records from df.
As a side note, I would call this filtering rather than grouping.
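Applied to the original question, the same idea might look like the sketch below (assuming the Tweet column holds lists of tokens, as in the head() output above):
def has_tokens(tokens):
    # True if the tweet's token list contains both 'govern' and 'Johnson'
    return set(['govern', 'Johnson']).issubset(tokens)

filtered_df = Sentiments[Sentiments['Tweet'].apply(has_tokens)]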

Python transform data long to wide

I'm looking to transform some data in Python.
Originally, in column 1 there are various identifiers (A to E in this example) associated with towns in column 2. There is a separate row for each identifier and town association. There can be any number of identifier to town associations.
I'd like to end up with ONE row per identifier and with all the associated towns going horizontally separated by commas.
I tried going from long to wide but am having difficulty doing the above; I'd appreciate any suggestions.
Thank you
One way to do it is using groupby. For example, you can group by col1 and apply a function that returns the unique values for each group (i.e. each code).
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'col1': 'A A A A B B C C C D E E E E E'.split(' '),
    'col2': ['Accrington', 'Acle', 'Suffolk', 'Hampshire', 'Lincolnshire',
             'Derbyshire', 'Aldershot', 'Alford', 'Cumbria', 'Hampshire', 'Bath',
             'Alston', 'Greater Manchester', 'Northumberland', 'Cumbria'],
})
def get_towns(town_list):
    return ', '.join(np.unique(town_list))
df.groupby('col1')['col2'].apply(get_towns)
And the result is:
col1
A Accrington, Acle, Hampshire, Suffolk
B Derbyshire, Lincolnshire
C Aldershot, Alford, Cumbria
D Hampshire
E Alston, Bath, Cumbria, Greater Manchester, Nor...
Name: col2, dtype: object
Note: the last line also contains Cumbria, differently from your expected results, as this value appears with the code E as well. I guess that was a typo in your question.
Another option is to use .groupby with aggregate because, conceptually, this is not a pivoting operation but, well, an aggregation (concatenation) of values. This solution is quite similar to Luca Clissa's answer, but it uses the pandas API instead of numpy.
>>> df.groupby("col1").col2.agg(list)
col1
A [Accrington, Acle, Suffolk, Hampshire]
B [Lincolnshire, Derbyshire]
C [Aldershot, Alford, Cumbria]
D [Hampshire]
E [Bath, Alston, Greater Manchester, Northumberl...
Name: col2, dtype: object
That gives you cells of lists; if you need strings, add a .str.join(", "):
>>> df.groupby("col1").col2.agg(list).str.join(", ")
col1
A Accrington, Acle, Suffolk, Hampshire
B Lincolnshire, Derbyshire
C Aldershot, Alford, Cumbria
D Hampshire
E Bath, Alston, Greater Manchester, Northumberla...
Name: col2, dtype: object
If you want col1 as a normal column instead of an index, add a .reset_index() at the end.
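For example, a quick sketch continuing from the snippet above:
>>> result = df.groupby("col1").col2.agg(list).str.join(", ").reset_index()
This returns a regular two-column DataFrame in which col1 is an ordinary column rather than the index.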

Filter Pandas Series by dictionary values

I have a pandas Series words which looks like:
0 a
1 calculated
2 titration
3 curve
4 for
5 oxalic
6 acid
7 be
8 show
9 at
Name: word, dtype: object
I also have a Series occurances which looks like:
a 278
show 179
curve 2
Name: index, dtype: object
I want to filter words using occurances in such a way that a word is filtered out if it is not in occurances or its value is less than 100.
In the given example I would like to get:
0 a
8 show
Name: word, dtype: object
isin only checks existence, and when I tried to use apply/map or the [] operator I got an error:
Series objects are mutable and cannot be hashed
I can also work with a solution based on DataFrames.
I think you would need to first filter the specific words you want from your occurances Series, and then use its index as the value for .isin():
output = words[words.isin(occurances[occurances > 100].index)]
Try this:
words[words.apply(lambda x: x in occurances and occurances[x] >= 100)]
The isin method works, but it generates a boolean mask that you should use as an index:
>>> # reproduce the example
>>> import pandas as pd
>>> words = pd.Series(['a','calculated','titration','curve','for','oxalic','acid','be','show','at'])
>>> occurances = pd.Series([278, 179, 2], index=['a', 'show', 'curve'])
>>> # apply the filter
>>> words[words.isin(occurances[occurances > 100].index)]
0 a
8 show
dtype: object
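Since the question mentions that a DataFrame-style solution is also acceptable, here is one more sketch (assuming words and occurances as above): map each word to its count and keep those at or above the threshold. Words absent from occurances map to NaN, and a comparison against NaN is False, so they are dropped automatically.
counts = words.map(occurances)  # NaN where the word has no entry in occurances
filtered = words[counts >= 100]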

Remove partial duplicate row using column value

I'm trying to clean data that contains a lot of partial duplicates, keeping only the first row of data when the key in col A is duplicated.
A B C D
0 foo bar lor ips
1 foo bar
2 test do kin ret
3 test do
4 er ed ln pr
expected output after cleaning
A B C D
0 foo bar lor ips
1 test do kin ret
2 er ed ln pr
I have been looking at methods such as drop_duplicates or even groupby, but they don't really help in my case: the duplicates are partial, since some rows contain empty data and only share values in col A and col B.
Grouping partially works, but it doesn't return the transformed data; it just filters through.
I'm very new to pandas and pointers are appreciated. I could probably do it outside pandas, but I'm thinking there might be a better way to do it.
Edit: sorry, I just noticed a mistake I made in the provided example ("test" had become "tes").
In your case, what counts as a partial duplicate? Please provide a more complicated example. In the above example, instead of col A duplication you could try col B.
The expected output can be obtained with the following snippet:
print (df.drop_duplicates(subset=['B']))
Note: the suggested solution only works for the above sample; it won't work when rows have different col A values but the same col B value.
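If the key really is col A, a sketch that deduplicates on that column directly (assuming the first row per key is the one to keep, as in the expected output):
# keep only the first occurrence of each value in column A
print(df.drop_duplicates(subset=['A'], keep='first'))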

pandas - count overall elements after converting one column to a list of strings

I have a CSV file and read it into a pandas DataFrame using:
df = pd.read_csv('my.csv')
My data looks like the following:
choice userid
A\nB\nC 111111
A\nC 222222
B 333333
From this DataFrame, I would like to achieve my goals by two steps:
(1) split the values in the choice column by '\n'
(2) count how many As, Bs and Cs in my CSV file.
I've tried:
target = df['choice'].str.split('\n')
target.value_counts()
But got the error:
TypeError: unhashable type: 'list'
Could anyone tell me how I can achieve my goal? Thank you for your help!
Either of the following should do:
df.choice.str.split(r"[\\n]+", expand=True).stack().value_counts()
or
df.choice.str.split(r"[\\n]+").apply(pd.Series).stack().value_counts()
Both should return:
C 2
A 2
B 2
dtype: int64
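As a side note, on pandas 0.25 or later the same counts might be obtained with Series.explode, which flattens the lists produced by the split (a sketch, assuming the separator is a real newline character as in the question's attempt):
df['choice'].str.split('\n').explode().value_counts()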

Filtering Dataframe Using Headers From Other Dataframes in Python

I am trying to filter a dataframe based on the columns I have previously obtained from filtering the dataframe below.
AA BB CC DD EE FF GG
0 1 1 0 1 0 0
The dataframe is coming from a file where the data in each row is either a 0 or a 1 and will change based on the file that is loaded in. I have used the following code to filter this dataframe so that my output consists of only columns with a value of 1 in them.
with open('Factors.txt') as b:
    IncludedFactors = pd.read_table(b, sep=',')
print(IncludedFactors)
InterestingFactors = IncludedFactors.drop(IncludedFactors.columns[~IncludedFactors.iloc[0].astype(bool)], axis=1)
print(InterestingFactors)
output:
BB CC EE
1 1 1
I then need to filter a bigger dataframe that has many headers; however, I only need the ID, X position and Y position columns, plus the headers of the InterestingFactors dataframe.
Below is the code I have tried, however the output still only consists of 3 headers instead of the 6 I need.
headers = InterestingFactors.columns.values
print(headers)
PivotTable = InfoTable.filter(items=['ID', 'Postion_X', 'Position_Y', 'headers'])
print(PivotTable)
Any help on how to do this correctly is greatly appreciated!
Here's one way you can do this:
headers = InterestingFactors.columns.append(pd.Index(['ID','Postion_X','Position_Y']))
PivotTable = InfoTable.loc[:, headers]
This combines the columns you're looking for from InterestingFactors with the 3 columns you mention above. This Index is passed to .loc[].
This also works:
headers = InterestingFactors.columns
PivotTable = InfoTable.loc[:, (pd.Index(['ID','Postion_X','Position_Y']) | headers)]
The datatypes for comparison (I believe) must be the same. Converting your list of 3 standard columns to pd.Index will allow you to use | within .loc[].
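As a side note, in recent pandas releases the | operator between Index objects no longer performs a set union (set-operator semantics were deprecated in favour of Index.union), so an equivalent sketch that should keep working is:
headers = InterestingFactors.columns
PivotTable = InfoTable.loc[:, pd.Index(['ID', 'Postion_X', 'Position_Y']).union(headers)]
Note that union sorts the combined labels by default, so the column order may differ from the original.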
