Pandas Naming Group by aggregated column - python

I'm trying to summarize a dataset of top 50 novels sold.
I want to create a table of authors and the numbers of book they have written.
I used the following code:
df.Author.value_counts().sort_values(ascending = False)
how can I name the column that lists the value count for each author?

You can check the snippet below. value_counts already returns the counts sorted in descending order; rename_axis names the index and reset_index(name=...) names the counts column:
top_50 = df.Author.value_counts().rename_axis('Author').reset_index(name='BookCount').head(50)
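A minimal, self-contained sketch of this approach; the sample data and the column name BookCount are invented for illustration:

```python
import pandas as pd

# Made-up data mirroring the question's Author column.
df = pd.DataFrame({"Author": ["A", "A", "B", "C", "A", "B"]})

# value_counts returns a Series indexed by author; rename_axis names the
# index, and reset_index(name=...) names the column holding the counts.
counts = (
    df["Author"]
    .value_counts()
    .rename_axis("Author")
    .reset_index(name="BookCount")
)
print(counts)
```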

Related

How do I check how many rows exist with the same value in one column in pandas?

I have a pandas dataset of about 200k articles containing a column called Category. How do I display all the different types of articles, and also count the number of rows in which a certain category, for example "Entertainment", appears in the Category column?
To get the distinct Category values:
df['Category'].unique()
And the following counts the rows for the category Entertainment using contains (note that str.contains also matches substrings; for an exact match use df['Category'].eq('Entertainment').sum()):
len(df[df['Category'].str.contains('Entertainment')])
Use Series.value_counts; the unique Category values then appear in the index, and you can select the count for one category with Series.loc:
s = df['Category'].value_counts()
print (s.index.tolist())
print (s.loc['Entertainment'])
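A runnable sketch of the value_counts approach, with invented category data:

```python
import pandas as pd

# Invented sample data for the Category column.
df = pd.DataFrame({"Category": ["Entertainment", "Sports", "Entertainment", "Politics"]})

# One value_counts call gives both the unique categories (as the index)
# and the per-category row counts (as the values).
s = df["Category"].value_counts()
print(s.index.tolist())        # unique categories
print(s.loc["Entertainment"])  # number of rows for one category
```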

How to create a list of every item purchased by client X?

How can I create a list of every item purchased by the client ID that I specify?
This question is part of the recommendation system I am building.
My dataframe has 3 columns ['ClientID'],['Products'],['Ratings']
(A 'Rating' other than NaN represents a purchased product.)
So far I have written this code:
First, I created a pivot table with ['ClientID'] on the vertical index and ['Products'] on the horizontal index and ['Ratings'] as the values.
piv = datacf1.pivot_table(index=['ClientID'], columns=['Products'], values=['Ratings'])
# Replace missing ratings (NaN) with 0
piv.fillna(0, inplace=True)
piv = piv.T
# Drop all columns containing only zeros, i.e. users who did not rate/purchase
piv = piv.loc[:, (piv != 0).any(axis=0)]
The following code uses the pivot table to get the list of purchased items by client X
#Create a list of every products traded by user X
purchased = piv.T[piv.loc[5039595.0,:]>0].index.tolist()
purchased
5039595 is the ['ClientID'] for which I want to create my list of items purchased and I would like to apply a different ['ClientID'] on demand.
I get an error when running the code to create the list.
Why I believe the 'create a list' code gives me an error:
I believe this code reads the vertical index as ['0','1','2','3',...], so it expects to find the label '5039595'; however, as said previously, the vertical index holds the ['ClientID'] values, which are arbitrary.
Here is a snapshot of the vertical index ['ClientID'] of the pivot table:
How can I fix my code to look for the Client X that I want to create the list for?
Or is there another way to do it? Perhaps with my original dataset with 3 columns ['ClientID'],['Products'],['Ratings']
I think the best way to do this is to select all the rows containing purchases for a given customer then take all the unique product values from those rows. It would look something like this:
desired_rows = (data["ClientID"].isin([client_id]) & data["Ratings"].notnull())
product_list = data.loc[desired_rows, "Products"].unique().tolist()
print(product_list)
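A minimal sketch of this mask-and-unique approach; the ClientID, Products, and Ratings values below are made up:

```python
import pandas as pd

# Invented data with the question's three columns.
data = pd.DataFrame({
    "ClientID": [5039595, 5039595, 111, 5039595],
    "Products": ["apples", "pears", "apples", "bread"],
    "Ratings":  [4.0, None, 5.0, 3.0],
})

client_id = 5039595

# Keep rows for this client where a rating exists (NaN means no purchase),
# then collect the unique product names from those rows.
desired_rows = data["ClientID"].isin([client_id]) & data["Ratings"].notnull()
product_list = data.loc[desired_rows, "Products"].unique().tolist()
print(product_list)
```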

Pandas groupby count, then update with a new string value and save to the original column

I have a Pandas DataFrame with about 30_000 records and would like to find all the records in a specific column whose combined count is less than 10. The DataFrame contains clinical trial data, and the column I need to filter and update holds the diseases for each trial. Some diseases appear in numerous clinical trials, so I need to first find all the diseases that appear fewer than 10 times, and then change those values to a new string called 'other'. All this needs to be written back to the same column.
This is the code that I have come up with but JupyterLab seems to freeze when I try to run it.
df_diseases = df.groupby(['Diseases']).filter(lambda x: x['Diseases'].count() < 10).apply(lambda x: x.replace(x,'other'))
You can use groupby().transform():
s = df.groupby('Diseases')['Diseases'].transform('count')
df.loc[s < 10, 'Diseases'] = 'other'
Or you can use value_counts and map (this one needs import numpy as np; note the condition is >= 10, so that only diseases appearing fewer than 10 times become 'other'):
s = df['Diseases'].value_counts()
df['Diseases'] = np.where(df['Diseases'].map(s) >= 10, df['Diseases'], 'other')
The answer to your question may be found here (look for Pedro M Duarte's answer): Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
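For reference, a small self-contained sketch of the transform approach, with a toy threshold of 3 instead of 10 and invented disease names:

```python
import pandas as pd

# Invented data: "flu" appears 3 times, "rare-x" only once.
df = pd.DataFrame({"Diseases": ["flu", "flu", "flu", "rare-x"]})

# transform('count') broadcasts each group's size back onto every row,
# so the mask lines up with the original DataFrame.
s = df.groupby("Diseases")["Diseases"].transform("count")
df.loc[s < 3, "Diseases"] = "other"   # threshold 3 here; 10 in the question
print(df["Diseases"].tolist())
```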

How do I append review text and review ratings to a list

I am writing a program which analyses online reviews and based on the ratings, stores the review into review_text and the corresponding rating into review_label as either positive(4 & 5 stars) or negative(1, 2 & 3 stars).
I tried the following code to add the review text and review label of each review, without any success.
rev = ['review_text', 'review_label']
for file in restaurant_urls:
    url_rev = file
    html_r_r = requests.get(url_rev).text
    doc_rest = html_r_r
    soup_restaurant_content = BeautifulSoup(doc_rest, 'html.parser')
    star_text = soup_restaurant_content.find('img').get('alt')
    if star_text in ['1-star', '2-star', '3-star']:
        rev['review_label'].append('Negative')
    elif star_text in ['4-star', '5-star']:
        rev['review_label'].append('Positive')
    else:
        print('check')
    rev['review_text'].append(soup_restaurant_content.find('p', 'text').get_text())
I want the reviews to be stored in the list rev with the review text stored in column review_text and the review label (whether positive or negative) under review_label. It would look something like
'review_text' 'review_label'
review_1 positive
review_2 negative
I think you are misunderstanding how lists work because lists do not have columns. In your case, rev is a list with two items, and you can add new items to the list (e.g. rev.append('review_user') will result in rev looking like this: ['review_text', 'review_label', 'review_user']). However, you cannot add an item to an item in the list (which it seems you are trying to do with rev['review_label'].append('Negative')).
In this specific case, I think the best solution is to have two separate lists, one for the review texts and one for the review labels, and append the respective items accordingly:
review_text = []
review_label = []
...
review_text.append(SOMETEXT)
review_label.append(SOMELABEL)
If you then want to have the data in a data frame, you can use pandas like so:
import pandas as pd
pd.DataFrame({"review_text": review_text, "review_label": review_label})
This should give you what you want. Note that review_text and review_label have to have the same length (which they should, in your case).
I hope this helps! Comment if you have any questions.
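A compact sketch of the two-list approach, with invented star ratings and texts standing in for the scraped pages:

```python
import pandas as pd

# Made-up ratings and texts in place of the BeautifulSoup results.
star_texts = ["5-star", "2-star", "4-star"]
texts = ["review_1", "review_2", "review_3"]

review_text = []
review_label = []
for star, text in zip(star_texts, texts):
    # 4 and 5 stars count as positive; everything else as negative.
    label = "Positive" if star in ("4-star", "5-star") else "Negative"
    review_text.append(text)
    review_label.append(label)

# Two equal-length lists become two columns of one DataFrame.
result = pd.DataFrame({"review_text": review_text, "review_label": review_label})
print(result)
```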

How to create a new python DataFrame with multiple columns of differing row lengths?

I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I sorted the dataset alphabetically by country name and created an alphabetical list of the individual countries. new_data.tail() shows Zimbabwe listed last, across 80336 rows, which confirms the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (country names), with the specific 'trust' data in the rows under each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[(new_data['country'] == country)]
    df[country] = pink.trust
Output here
As you can see, the data does not get included for the columns after the first. I believe this is because the number of rows of 'trust' data varies by country. While the first column has 1000 rows, some countries have as many as 2500 data points, and as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have uses this exact same data structure for its template data, which is why I'm attempting to put it in a DataFrame. Plus, I can't do it, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.combine_first(df)
which ensures that your index is always the union of all added columns.
I think in this case pd.pivot(columns='var', values='val') will work for you, especially since you already have a DataFrame. This function turns the values of a particular column into column names. You could see the documentation for additional info. I hope that helps.
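One way to sketch the pivot idea when countries have differing row counts is to add a per-country row counter with groupby().cumcount() and use it as the pivot index; the data below is invented:

```python
import pandas as pd

# Invented long-format data: Albania has 3 trust values, Zimbabwe has 2.
new_data = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania", "Zimbabwe", "Zimbabwe"],
    "trust":   [0.1, 0.2, 0.3, 0.9, 0.8],
})

# cumcount gives each value a positional index within its country, so
# countries of different lengths can sit side by side; shorter columns
# are padded with NaN.
wide = (
    new_data.assign(row=new_data.groupby("country").cumcount())
            .pivot(index="row", columns="country", values="trust")
)
print(wide)
```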
