Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
I would like to edit a Pandas dataframe, and you can obtain the dataset from here.
Sample_dataset
As you can see, each "area" has several "category" values, and each "category" has a different "price". I want to unify the "category" within each "area", using the value from the last (bottom) row of that "area". In other words, some values of "category" will change as follows.
Before:
area:A, category:1, price:500
After:
area:A, category:2, price:500
I know that it's possible to aggregate this dataframe with a pivot table as follows. But in this case, I cannot unify and display the values of "category".
pd.pivot_table(df, values="price", index=["area",], aggfunc='sum')
I would appreciate any ideas for unifying the category values.
You can try this, although it may not be the best option.
After using the code you mentioned:
df_new = pd.pivot_table(df, values="price", index=["area",], aggfunc='sum')
I have created a function that finds the last category for each area (where df is the original DataFrame):
def find_category(area, list_categories):
    # Append the category found in the last row of the given area
    list_categories.append(df[df['area'] == area].iloc[-1].category)
Then a for loop finds the last category for each area and stores it in a new category column. You can reorder the columns afterwards if you want:
list_categories = []
for area in df_new.index:
    find_category(area, list_categories)
df_new['category'] = list_categories
df_new = df_new[['category', 'price']]
The output would be:
      category  price
area
A            2    900
B            1    350
C            4    800
D            1    500
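As a possible alternative to the loop, the same result can probably be obtained in one step with groupby and named aggregation. The sketch below uses hypothetical sample data (the original dataset is not reproduced here), chosen only so that the totals match the output above; it also assumes the rows of each area are already in the desired order, so that "last" means the bottom row.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "area": ["A", "A", "B", "B", "C", "C", "D"],
    "category": [1, 2, 3, 1, 2, 4, 1],
    "price": [500, 400, 150, 200, 300, 500, 500],
})

# One row per area: the last category seen and the summed price
result = df.groupby("area").agg(
    category=("category", "last"),
    price=("price", "sum"),
)
print(result)

# Keep every original row, but overwrite category with the area's last category
unified = df.assign(category=df.groupby("area")["category"].transform("last"))
print(unified)
```

The transform variant may be closer to the before/after example in the question, since it keeps each price row and only changes the category.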
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I have following data
I'd like to select the biggest value in the 2nd column for each value in the 1st column.
For value 1 in the 1st column, the selected value should be 5.
The 1st column is time (for example: 06:54:11)
I can use MATLAB, Python, Excel, or Bash.
Using Python, you can load your file (assuming it's an Excel file) into a pandas DataFrame, group by the first column, and find the max value in the second column:
import pandas as pd
df = pd.read_excel('your_data.xlsx')
output = df.groupby('column1')['column2'].max()
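To see the grouping without an Excel file, here is a minimal sketch on an in-memory DataFrame. The column names and values are hypothetical, since the original data is not shown; they are chosen so that value 1 in the first column maps to 5, as in the question.

```python
import pandas as pd

# Hypothetical data: names and values are illustrative only
df = pd.DataFrame({
    "column1": [1, 1, 1, 2, 2],
    "column2": [3, 5, 2, 7, 4],
})

# Largest column2 value for each distinct column1 value
max_per_group = df.groupby("column1")["column2"].max()
print(max_per_group)

# idxmax returns the row label of each group's maximum,
# so .loc can retrieve the complete rows
rows_with_max = df.loc[df.groupby("column1")["column2"].idxmax()]
print(rows_with_max)
```

The idxmax variant can be useful when the other columns of the winning row are needed as well.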
Using MATLAB, you can get the maximum with the built-in "max" function.
Try using [M,I] = max(data)
and replace data with your matrix name.
M returns the maximum of each column, so in your case M(2) is the maximum of the second column. With the index (I) you can retrieve the corresponding time from the first column.
time = data(I(2),1)
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I have 3 pandas DataFrames as follows:
1. abc, xyz
2. colour, type, pattern
3. length, breadth, height, area
I want to combine the DataFrames so that the result looks like this:
abc  colour   length
     type     breadth
     pattern  height
              area
xyz  colour   length
     type     breadth
     pattern  height
              area
I also want to export the end result to an Excel sheet, so how do I do that without making it look messy?
For each row of the first DataFrame, concatenate that row column-wise with the second and third DataFrames, then concatenate the resulting blocks row-wise:
import pandas as pd

df1 = pd.DataFrame([['abc'], ['xyz']], columns=['col1'])
df2 = pd.DataFrame([['colour'],
                    ['type'],
                    ['pattern']], columns=['col2'])
df3 = pd.DataFrame([['length'],
                    ['breadth'],
                    ['height'],
                    ['area']], columns=['col3'])
# For each row of df1, place it beside df2 and df3, then stack the blocks
pd.concat([pd.concat([df1[lambda x: x.index == i].reset_index(), df2, df3], axis=1) for i in range(len(df1))]).drop('index', axis=1)
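For the Excel part of the question, one possible approach is to blank out the missing cells before exporting. The following is only a sketch: it rebuilds the combined frame with hypothetical column names (col1, col2, col3), replaces NaN with empty strings, and shows the export call as a comment because DataFrame.to_excel needs an Excel writer engine such as openpyxl to be installed. The file name is a placeholder.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["abc", "xyz"]})
df2 = pd.DataFrame({"col2": ["colour", "type", "pattern"]})
df3 = pd.DataFrame({"col3": ["length", "breadth", "height", "area"]})

# One block per row of df1: that row beside the full df2 and df3
blocks = [
    pd.concat([df1.iloc[[i]].reset_index(drop=True), df2, df3], axis=1)
    for i in range(len(df1))
]

# Stack the blocks and show empty cells instead of NaN
combined = pd.concat(blocks, ignore_index=True).fillna("")
print(combined)

# Export without the row index (requires an Excel engine such as openpyxl):
# combined.to_excel("combined.xlsx", index=False)
```

Writing with index=False keeps the numeric row labels out of the sheet, which tends to make the result look tidier.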
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
Date,hrs,Count,Status
2018-01-02,4,15,SFZ
2018-01-03,5,16,ACZ
2018-01-04,3,14,SFZ
2018-01-05,5,15,SFZ
2018-01-06,5,18,ACZ
This is a fraction of the data I've been working on. The actual data is in the same format, with around 1000 entries for each date. I take start_date and end_date as inputs from the user. In this case, consider:
start_date:2018-01-02
end_date:2018-01-06
So I have to display the totals of hrs and Count within the selected date range in the output. I also want to do it using an @app.callback in Dash (Plotly). Can someone help please?
Use Series.between to build a boolean mask for the date range, filter the rows and select the two columns with DataFrame.loc, then sum:
totals = df.loc[df['Date'].between('2018-01-02', '2018-01-06'), ['hrs', 'Count']].sum()
print(totals)
hrs      22
Count    78
dtype: int64
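Since the dates come from user input, it may help to wrap the filter in a small function that could later be called from a Dash callback. This is a sketch only: the function name totals_in_range is hypothetical, the Dash wiring is left out, and the Date column is parsed as datetime so the comparison is by date and not by string.

```python
import io
import pandas as pd

csv_text = """Date,hrs,Count,Status
2018-01-02,4,15,SFZ
2018-01-03,5,16,ACZ
2018-01-04,3,14,SFZ
2018-01-05,5,15,SFZ
2018-01-06,5,18,ACZ
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])


def totals_in_range(frame, start_date, end_date):
    """Sum hrs and Count for rows whose Date lies in [start_date, end_date]."""
    # Series.between includes both endpoints by default
    mask = frame["Date"].between(pd.Timestamp(start_date), pd.Timestamp(end_date))
    return frame.loc[mask, ["hrs", "Count"]].sum()


totals = totals_in_range(df, "2018-01-02", "2018-01-06")
print(totals)
```

Keeping the original df untouched and returning a new Series means the function can be called repeatedly with different date ranges.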
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a dataframe with 'genre' as a column. In this column, each entry has several values. For example, a movie 'Harry Potter' could have fantasy,adventure in the genre column. As I am doing a data analysis and exploration, I have no idea how to represent this column with multiple values to show any relationships between movies and/or genre.
I have thought of using graph analysis to show the relationship, but what other approaches could I consider?
sample data
You can use str.get_dummies to create a new indicator column for each genre:
import pandas as pd

df = pd.DataFrame({'Movies': ['Harry Potter', 'Toy Story'],
                   'Genres': ['fantasy,adventure',
                              'adventure,animation,children,comedy,fantasy']})
#print (df)
# Split each Genres string on commas and build one 0/1 column per genre
df = df.set_index('Movies')['Genres'].str.get_dummies(',')
print (df)
              adventure  animation  children  comedy  fantasy
Movies
Harry Potter          1          0         0       0        1
Toy Story             1          1         1       1        1
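Once the indicator columns exist, a couple of simple summaries may already reveal relationships between genres. The sketch below is one possible follow-up, not part of the original answer: it counts movies per genre and builds a genre co-occurrence matrix by multiplying the transposed indicator table with itself. The variable names are illustrative.

```python
import pandas as pd

movies = pd.DataFrame({
    "Movies": ["Harry Potter", "Toy Story"],
    "Genres": ["fantasy,adventure",
               "adventure,animation,children,comedy,fantasy"],
})
dummies = movies.set_index("Movies")["Genres"].str.get_dummies(",")

# Number of movies tagged with each genre
genre_counts = dummies.sum()
print(genre_counts)

# Entry (g1, g2) counts the movies tagged with both g1 and g2;
# the diagonal repeats the per-genre counts
co_occurrence = dummies.T.dot(dummies)
print(co_occurrence)
```

A matrix like this can be fed to a heatmap or used as edge weights if the graph analysis idea is pursued later.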
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have a dataframe with 3 columns (department, sales, region), and I want to write a method to display all rows that are from the least common region. Then I need to write another method to count the frequency of the departments that are represented in the least common region. No idea how to do this.
Functions would be unnecessary: pandas already has implementations to accomplish what you want! Suppose I had the following CSV file, test.csv...
department,sales,region
sales,26,midwest
finance,45,midwest
tech,69,west
finance,43,east
hr,20,east
sales,34,east
If I'm understanding you correctly, I would obtain a DataFrame representing the least common region like so:
import pandas as pd
df = pd.read_csv('test.csv')
counts = df['region'].value_counts()
least_common = counts[counts == counts.min()].index[0]
least_common_df = df.loc[df['region'] == least_common]
least_common_df is now:
  department  sales region
2       tech     69   west
As for obtaining the department frequency for the least common region, I'll leave that up to you. (I've already shown you how to get the frequency for region.)
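For completeness, here is a sketch of how the remaining step might look, using the same sample rows read from an in-memory string instead of test.csv. It assumes a single least common region; with ties, idxmin simply returns the first one it encounters. The variable name department_counts is illustrative.

```python
import io
import pandas as pd

csv_text = """department,sales,region
sales,26,midwest
finance,45,midwest
tech,69,west
finance,43,east
hr,20,east
sales,34,east
"""
df = pd.read_csv(io.StringIO(csv_text))

# Region that appears in the fewest rows
counts = df["region"].value_counts()
least_common = counts.idxmin()

# All rows from that region
least_common_df = df[df["region"] == least_common]
print(least_common_df)

# How often each department appears within that region
department_counts = least_common_df["department"].value_counts()
print(department_counts)
```

The second value_counts call mirrors the first one, just applied to the department column of the filtered rows.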