I have a CSV file with the header text|business_id.
I want to group all the texts related to one business. I used:
review_data = review_data.groupby(['business_id'])['text'].apply("".join)
The review_data is like:
                                                text             business_id
0  mr hoagi institut walk doe seem like throwback...  5UmKMjUEUNdYWqANhGckJw
1  excel food superb custom servic miss mario mac...  5UmKMjUEUNdYWqANhGckJw
2  yes place littl date open weekend staff alway ...  5UmKMjUEUNdYWqANhGckJw
I get this error: TypeError: sequence item 131: expected string, float found
These are lines 130 to 132 of the file:
130 use order fair often past 2 year food get progress wors everi time order doesnt help owner alway regist rude everi time final decid im done dont think feel let inconveni order food restaur let alon one food isnt even good also insid dirti heck deliv food bmw cant buy scrub brush found golden dragon collier squar 100 time better|SQ0j7bgSTazkVQlF5AnqyQ
131 popular denni|wqu7ILomIOPSduRwoWp4AQ
132 want smth quick late night would say denni|wqu7ILomIOPSduRwoWp4AQ
I think you need to filter out the null values with boolean indexing before the groupby:
print(review_data)
text business_id
0 mr hoagi 5UmKMjUEUNdYWqANhGckJw
1 excel food 5UmKMjUEUNdYWqANhGckJw
2 NaN 5UmKMjUEUNdYWqANhGckJw
3 yes place 5UmKMjUEUNdYWqANhGckJw
review_data = review_data[review_data['text'].notnull()]
print(review_data)
text business_id
0 mr hoagi 5UmKMjUEUNdYWqANhGckJw
1 excel food 5UmKMjUEUNdYWqANhGckJw
3 yes place 5UmKMjUEUNdYWqANhGckJw
review_data = review_data.groupby(['business_id'])['text'].apply(' '.join)
print(review_data)
business_id
5UmKMjUEUNdYWqANhGckJw mr hoagi excel food yes place
Name: text, dtype: object
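Equivalently, you can drop the null rows with dropna before grouping; a short sketch on the same data:
review_data = review_data.dropna(subset=['text'])
review_data = review_data.groupby('business_id')['text'].apply(' '.join)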
For a school project I am attempting to count how many times specific words are mentioned in Reddit titles and comments; more specifically, stock ticker mentions. Currently the dataframe looks like this (where type is a string, either title or comment):
body score id created subreddit type mentions
3860 There are much better industrials stocks than ... 1 NaN 2021-03-13 20:32:08+00:00 stocks comment {GE}
3776 I guy I work with told me about PENN about 9 m... 1 NaN 2021-03-13 20:29:30+00:00 investing comment {PENN}
4122 [mp4 link](https://preview.redd.it/ieae3z7suum... 2 NaN 2021-03-13 20:28:43+00:00 StockMarket comment {KB}
2219 If you cant decide, then just buy $GME options 1 NaN 2021-03-13 20:28:12+00:00 wallstreetbets comment {GME}
2229 This sub the most wholesome fucking thing in t... 2 NaN 2021-03-13 20:27:57+00:00 wallstreetbets comment {GME}
Where the mentions column contains a set of the tickers mentioned in the body (there could be several). What I want to do is count the number of unique mentions on a per-subreddit, per-type (comment or title) basis. The result I am looking for would be similar to this:
ticker subreddit type count
GME wallstreetbets comment 5
GME wallstreetbets title 4
GME investing comment 3
GME investing title 2
Repeated for all unique tickers mentioned.
I had used counters to figure this out using dataframes specific to each instance (i.e. one dataframe for wallstreetbets comments, one dataframe for wallstreetbets titles), but I could not figure out how to make it work when confined to a single dataframe.
Sounds like a simple groupby should do it:
df.groupby(['mentions','subreddit','type']).count()
produces
body score id created
mentions subreddit type
{GE} stocks comment 1 1 0 1
{GME} wallstreetbets comment 2 2 0 2
{KB} StockMarket comment 1 1 0 1
{PENN} investing comment 1 1 0 1
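Note that this groups on the whole mentions set, so a body that mentions several tickers is counted once under the combined set rather than once per ticker. To get the per-ticker counts shown in the desired output, one sketch (assuming pandas >= 0.25, where DataFrame.explode is available) is to expand each set into one row per ticker first:
counts = (df.explode('mentions')
            .groupby(['mentions', 'subreddit', 'type'])
            .size()                # rows per (ticker, subreddit, type)
            .rename('count')
            .reset_index())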
I have a dataset that looks like this, where Count represents the number of times they worked:
Title Name Count
Coach Bob 4
teacher sam 5
driver mark 8
Coach tina 10
teacher kate 3
driver frank 2
I want to create a table, which I think will have to be a pivot, that sorts the name and title by the number of times worked, so for example the output would look like this:
coach teacher driver
tina 10 sam 5 mark 8
bob 4 kate 3 frank 2
I am familiar with general pivot table code but I think I'm going to need something a little more comprehensive. I tried:
DF_PIV = pd.pivot_table(DF, values=['count'], index=['title', 'Name'], columns=['title'],
                        aggfunc=np.max)
I get an error, ValueError: Grouper for 'view_title' not 1-dimensional, but I do not even think I am on the right track here.
You can try:
(df.set_index(['Title', df.groupby('Title').cumcount()])  # number rows within each Title
   .unstack(0)                       # lift Title into the columns
   .astype(str)                      # so Count can be string-joined with Name
   .T
   .groupby(level=1).agg(' '.join)   # merge the Name and Count cells per Title
   .T)
Output:
Title Coach driver teacher
0 Bob 4 mark 8 sam 5
1 tina 10 frank 2 kate 3
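If the columns should also be ordered by Count, as in the desired output above, a variant sketch is to sort before numbering the rows (entry and rank are just helper column names introduced here; other names are from the sample data):
out = (df.sort_values('Count', ascending=False)
         .assign(entry=lambda d: d['Name'] + ' ' + d['Count'].astype(str),  # e.g. "tina 10"
                 rank=lambda d: d.groupby('Title').cumcount())              # row position per Title
         .pivot(index='rank', columns='Title', values='entry'))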
I have a dataframe, DF:
Name Stage Description
Sri 1 Sri is one of the good singer in this two
2 Thanks for reading
Ram 1 Ram is one of the good cricket player
ganesh 1 good driver
and a list,
my_list=["one"]
I tried mask = df["Description"].str.contains('|'.join(my_list), na=False), but it gives:
output_DF.
Name Stage Description
Sri 1 Sri is one of the good singer in this two
Ram 1 Ram is one of the good cricket player
My desired output is,
desired_DF,
Name Stage Description
Sri 1 Sri is one of the good singer in this two
2 Thanks for reading
Ram 1 Ram is one of the good cricket player
It has to consider the Stage column; I want all the rows associated with a matching description.
I think you need:
print (df)
Name Stage Description
0 Sri 1 Sri is one of the good singer in this two
1 2 Thanks for reading
2 Ram 1 Ram is one of the good cricket player
3 ganesh 1 good driver
# replace empty or whitespace-only names with the previous value
df['Name'] = df['Name'].mask(df['Name'].str.strip() == '').ffill()
print (df)
Name Stage Description
0 Sri 1 Sri is one of the good singer in this two
1 Sri 2 Thanks for reading
2 Ram 1 Ram is one of the good cricket player
3 ganesh 1 good driver
# get all names whose Description matches the condition
my_list = ["one"]
names = df.loc[df["Description"].str.contains("|".join(my_list), na=False), 'Name']
print (names)
0 Sri
2 Ram
Name: Name, dtype: object
# select all rows whose Name is in names
df = df[df['Name'].isin(names)]
print (df)
Name Stage Description
0 Sri 1 Sri is one of the good singer in this two
1 Sri 2 Thanks for reading
2 Ram 1 Ram is one of the good cricket player
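For reference, the last two steps can be collapsed into a single boolean mask with transform, which broadcasts each group's result back to all of its rows (a sketch, assuming Name has already been forward-filled as above):
mask = df.groupby('Name')['Description'].transform(
    lambda s: s.str.contains('|'.join(my_list), na=False).any())  # True for every row of a matching name
df = df[mask]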
It looks to be finding "one" in the Description fields of the dataframe and returning the matching rows.
If you want the "Thanks for reading" row as well, you will have to add a list element for a second match,
e.g. 'Thanks', so something like my_list=["one", "Thanks"].
In Python I would like to combine the text from different rows into one line per group, based on the value of the number column. So:
Harry went to School 100
Mary sold goods 50
Sick man
using the provided information below:
number text
1 Harry
1 Went
1 to
1 School
1 100
2 Mary
2 sold
2 goods
2 50
3 Sick
3 Man
for i in range(len(df['number']) - 1):
    if df['number'][i+1] == df['number'][i]:
        pass  # append text (e.g. Harry went to School 100)
    else:
        pass  # start a new row (Mary sold goods 50)
You can use groupby:
for name, group in df.groupby('number'):
    print(' '.join(group['text']))
Result:
Harry Went to School 100
Mary sold goods 50
Sick Man
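The same join can also be written without an explicit loop, as a sketch that returns a Series indexed by number:
result = df.groupby('number')['text'].agg(' '.join)
print(result)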
I am trying to get a daily status count from the following DataFrame (it's a subset; the real data set is ~14k jobs with overlapping dates, with only one status at any given time within a job):
Job Status User
Date / Time
1/24/2011 10:58:04 1 A Ted
1/24/2011 10:59:20 1 C Bill
2/11/2011 6:53:14 1 A Ted
2/11/2011 6:53:23 1 B Max
2/15/2011 9:43:13 1 C Bill
2/21/2011 15:24:42 1 F Jim
3/2/2011 15:55:22 1 G Phil Jr.
3/4/2011 14:57:45 1 H Ted
3/7/2011 14:11:02 1 I Jim
3/9/2011 9:57:34 1 J Tim
8/18/2014 11:59:35 2 A Ted
8/18/2014 13:56:21 2 F Bill
5/21/2015 9:30:30 2 G Jim
6/5/2015 13:17:54 2 H Jim
6/5/2015 14:40:38 2 I Ted
6/9/2015 10:39:15 2 J Tom
1/16/2015 7:45:58 3 A Phil Jr.
1/16/2015 7:48:23 3 C Jim
3/6/2015 14:09:42 3 A Bill
3/11/2015 11:16:04 3 K Jim
My initial thought (from the following link) was to groupby the job column, fill in the missing dates for each group and then ffill the statuses down.
Pandas reindex dates in Groupby
I was able to make this work, kind of: if two statuses occurred on the same date, one would not be included in the output, and consequently some statuses were missing.
I then found the following, which supposedly handles the duplicate issue, but I am unable to get it to work with my data.
Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe
Am I on the right path in thinking that filling in the missing dates and then forward-filling the statuses down is the correct way to ultimately capture daily counts of individual statuses? Is there another method that might make better use of pandas features that I'm missing?
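For what it's worth, here is a minimal sketch of that fill-then-ffill idea, assuming the "Date / Time" index is already a DatetimeIndex and using the column names from the sample (here .last() resolves same-day duplicates by keeping the end-of-day status):
daily = (df.groupby('Job')['Status']
           .resample('D').last()           # one row per job per calendar day
           .groupby(level='Job').ffill())  # carry each job's status forward

# daily count of each status across all jobs
counts = (daily.reset_index()
               .groupby(['Date / Time', 'Status'])
               .size())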