Use groupby to construct a dataframe with value counts of another column - python

I have a dataframe with two feature columns: startneighborhood and hour.
hour can take any value from 1-24, i.e., [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
startneighborhood can be 37 different neighborhood options.
I want to count the observations in each hour for every neighborhood, using "hour" as the index.
So my matrix would be 24 rows x 37 columns, with the 1-24 hours as my index and the 37 neighborhoods as the column names.
How can I use Pandas to perform this computation? I'm a bit lost on the fastest way.
I've constructed the dataframe, with the index and the neighborhood names as the column names. I now just need to add the values.

I'm a little bit confused by the question, but I think what you want to do is a crosstab:
import pandas as pd
df = <...>  # construct your dataframe
table = pd.crosstab(index=df.hour, columns=df.startneighborhood)
This will give you a 24x37 table where each element is the count of the number of occurrences of that combination of hour and startneighborhood.
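For illustration, a minimal runnable version of that crosstab on toy data (the neighborhood names here are made up); the reindex step pads out hours that never occur, so the result always has all 24 rows:

```python
import pandas as pd

# Toy data using the question's column names
df = pd.DataFrame({
    "hour": [1, 1, 2, 2, 2, 3],
    "startneighborhood": ["Mission", "Mission", "Mission", "Soma", "Soma", "Mission"],
})

# Counts of each (hour, startneighborhood) combination
table = pd.crosstab(index=df.hour, columns=df.startneighborhood)

# Force all 24 hour rows, even for hours with no observations
table = table.reindex(range(1, 25), fill_value=0)
```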

Related

Combine two Pandas Data Frames where "Dates" matches

I have two Pandas DataFrames with one column in common, namely "Dates". I need to merge them where "Dates" match. pd.merge() does what I expect, but it drops the rows whose dates don't match; I want to keep those values too.
Ex: I have historical 1-minute data for a stock and an indicator calculated on 5-minute data, i.e. one new calculated value for every 5 rows of the 1-minute DataFrame.
I know the Series.dt.floor method might be useful here, but I couldn't figure it out.
I concatenated the respective "Dates" to the calculated indicator Series so that I could merge where the column matches. I obtained the right result but with missing values. I need continuity of the 1-minute values, i.e. the same indicator must hold for the next 5 entries before the second indicator value's turn to be merged.
df1.merge(df2, left_on='Dates', right_on='Dates')
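A sketch of the Series.dt.floor idea mentioned above, on toy data: floor each 1-minute timestamp down to its 5-minute bar, then left-merge so every 1-minute row keeps (and repeats) that bar's indicator value. The column name "Indicator" and the assumption that the 5-minute stamps mark the start of each bar are mine:

```python
import pandas as pd

# 1-minute price data (toy)
df1 = pd.DataFrame({
    "Dates": pd.date_range("2023-01-02 09:30", periods=10, freq="min"),
    "Price": range(10),
})

# 5-minute indicator (toy), assumed stamped at the start of each 5-min bar
df2 = pd.DataFrame({
    "Dates": pd.date_range("2023-01-02 09:30", periods=2, freq="5min"),
    "Indicator": [100.0, 200.0],
})

# Floor each 1-minute timestamp to its 5-minute bar and merge on that;
# how='left' keeps every 1-minute row, repeating the indicator 5 times
df1["Bar"] = df1["Dates"].dt.floor("5min")
merged = (df1.merge(df2.rename(columns={"Dates": "Bar"}), on="Bar", how="left")
             .drop(columns="Bar"))
```

If both frames are sorted by "Dates", pd.merge_asof(df1, df2, on="Dates", direction="backward") gives the same carry-forward behaviour without the helper column.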

Create Matrix from a DataFrame

I have a dataframe with several columns, including Department and ICA. I need to create a matrix where the rows are the departments and the columns are the ICA values (they range from Bad to Acceptable to Good).
So position (r, c) would be a number showing how many observations of that ICA value were recorded for each department.
For example, if Amazonas is row 1 and Acceptable is column 3, position (1, 3) would be the number of Acceptable observations for Amazonas.
Thanks!
You can get values from your DataFrame using integer-based indexing with the DataFrame.iloc method. This seems to do what you need.
For example, if df is your DataFrame, then df.iloc[0, 2] will give you the value at the first row and third column.
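The iloc tip above covers reading individual cells; for actually building the count matrix, pd.crosstab is one option. A sketch with made-up data, using the column names from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["Amazonas", "Amazonas", "Antioquia", "Amazonas"],
    "ICA": ["Good", "Acceptable", "Bad", "Acceptable"],
})

# Rows = departments, columns = ICA categories, cells = observation counts
matrix = pd.crosstab(df["Department"], df["ICA"])

# Keep the bad-acceptable-good ordering from the question
matrix = matrix.reindex(columns=["Bad", "Acceptable", "Good"], fill_value=0)
```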

How can I obtain the top n groups in pandas?

I have a pandas dataframe. The final column in the dataframe is the max value of the RelAb column for each unique group (in this case, a species assignment) in the dataframe as obtained by:
df_melted['Max'] = df_melted.groupby('Species')['RelAb'].transform('max')
As you can see, the max value is repeated in all rows of the group. Each group contains a large number of rows. I have the df sorted by max values, with about 100 rows per max value. My goal is to obtain the top 20 groups based on the max value (i.e. a df with 100 x 20 = 2000 rows). I do not want to drop individual rows from groups in the dataframe; I want to drop entire groups.
I am pasting a subset of the dataframe where the max for a group changes from one "Max" value to the next:
My feeling is that I need to convert the max so that the one value represents the entire group and then sort based on that column, perhaps as such?
For context, the reason I am doing this is because I am planning to make a stacked barchart with the most abundant species in the table for each sample. Right now, there are just way too many species, so it makes the stacked bar chart uninformative.
One way to do it:
aux = (df_melted.groupby('Species')['RelAb']
           .max()
           .nlargest(20, keep='all')
           .to_list())
top20 = df_melted.loc[df_melted['Max'].isin(aux), :].copy()
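On toy data (shrunk to the top 2 groups instead of 20, with made-up species and values), the approach behaves like this:

```python
import pandas as pd

df_melted = pd.DataFrame({
    "Species": ["a", "a", "b", "b", "c", "c"],
    "RelAb":   [1.0, 5.0, 2.0, 9.0, 3.0, 7.0],
})
df_melted["Max"] = df_melted.groupby("Species")["RelAb"].transform("max")

# Per-group maxima, keep the 2 largest (keep='all' retains ties),
# then keep every row whose group max is in that list
aux = (df_melted.groupby("Species")["RelAb"]
           .max()
           .nlargest(2, keep="all")
           .to_list())
top2 = df_melted.loc[df_melted["Max"].isin(aux), :].copy()
```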

how to add specific two columns and get new column as a total using pandas library?

I'm trying to add two columns, display their total in a new column, and compute the following as well:
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
and I'm trying to create a data frame called d2 that only contains the rows of d that don't have any missing (NaN) values.
I have implemented the following code
import pandas as pd
new_val = pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)  # it's not showing the top lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me the total as the word "total".
When you used new_val['total'] = 'total' you basically told Pandas that you want a column in your DataFrame called total where every value is the string 'total'.
What you need to fix is the value assignment. For this I can give you a quick-and-dirty solution that will hopefully make a more appealing solution clearer to you.
You can iterate through your DataFrame and add the two columns to get the variable for the third.
You can iterate through your DataFrame and add the columns to get the value for the total. Note that chained indexing like new_val.iloc[i]['total'] = ... writes to a temporary copy and is silently lost, so use .loc instead:
for i, row in new_val.iterrows():
    new_val.loc[i, 'total'] = row['Jan'] + row['Feb'] + row['Mar']
This iterates through your entire data set, so if your data set is large this is not the best option.
As mentioned by @Cavenfish, new_val['total'] = 'total' creates a column total where the value of every cell is the string total.
You should rather use new_val['total'] = new_val['Jan']+new_val['Feb']+new_val['Mar']
For treatment of NA values you can use the mask new_val.isna(), which gives a boolean for every cell of your array indicating whether it is NA. You can then apply any logic on top of it. For your example, counting the NAs per row works:
new_val.isna().sum(axis=1) == 0
With your four columns Jan, Feb, Mar, total, this returns True only for rows that contain no NA at all, so you can use the mask to select the complete rows (equivalent to new_val.dropna()), or to assign a default value to new_val['total'] whenever a row has an NA in one of its columns.
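Putting the pieces together: a vectorized total plus the aggregates and d2 asked for in the question, on made-up sales figures (the Jan/Feb/Mar column names come from the question):

```python
import pandas as pd
import numpy as np

d = pd.DataFrame({
    "Jan": [10000, 12000, np.nan],
    "Feb": [62000, 88200, 5000],
    "Mar": [35000, 15000, 20000],
})

# Vectorized row-wise total of the three month columns
d["total"] = d["Jan"] + d["Feb"] + d["Mar"]

jan_sum  = d["Jan"].sum()    # total sales in January
feb_min  = d["Feb"].min()    # minimum sale in February
mar_mean = d["Mar"].mean()   # average (mean) sales in March

# d2: only the rows with no missing values in any column
d2 = d.dropna()
```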

What is pandas syntax for lookup based on existing columns + row values?

I'm trying to recreate a bit of a convoluted scenario, but I will do my best to explain it:
Create a pandas df1 with two columns: 'Date' and 'Price' - done
I add two new columns: 'rollmax' and 'rollmin', where 'rollmax' is an 8-day rolling maximum and 'rollmin' is an 8-day rolling minimum. - done
Now I need to create another column 'rollmax_date' that gets populated through a lookup rule: for row n, go to the column 'Price', look through the values for the last 8 days and find the maximum, then take the value of the corresponding 'Date' column and put it in 'rollmax_date'.
The same logic applies for 'rollmin_date', but we look for the date of the rolling minimum instead of the rolling maximum.
Now I need to find the previous 8 days max and min for the same rolling window of 8 days that I have already found.
I did the first two and tried the third one, but I'm getting wrong results.
The code below gives me dates only on the rows where df["Price"] equals df['rollmax'], but it doesn't bring all the corresponding dates from 'Date' into 'rollmax_date':
df['rollmax_date'] = df.loc[(df["Price"] == df.rollmax), 'Date']
This is an image with steps for recreating the lookup
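One hedged sketch of the lookup, assuming a default integer index and toy prices: rolling(...).apply with idxmax/idxmin returns the row label where each window's extreme sits, and those labels can then be mapped back to the 'Date' column:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "Price": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
})

window = 8
df["rollmax"] = df["Price"].rolling(window).max()
df["rollmin"] = df["Price"].rolling(window).min()

# idxmax/idxmin on each window give the row label where the extreme
# occurred; rolling(...).apply returns them as floats (NaN while the
# window is still filling up)
max_pos = df["Price"].rolling(window).apply(lambda w: w.idxmax(), raw=False)
min_pos = df["Price"].rolling(window).apply(lambda w: w.idxmin(), raw=False)

# Map those row labels back to the Date column
valid = max_pos.dropna().astype(int)
df.loc[valid.index, "rollmax_date"] = df.loc[valid.values, "Date"].to_numpy()
valid = min_pos.dropna().astype(int)
df.loc[valid.index, "rollmin_date"] = df.loc[valid.values, "Date"].to_numpy()
```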
