This is supposed to be a simple IF statement that updates a column based on a condition, but it is not working.
Here is my code:
df["Category"].fillna("999", inplace=True)
for index, row in df.iterrows():
    if str(row['Category']).strip() == "11":
        print(str(row['Category']).strip())
        df["Category_Description"] = "Agriculture, Forestry, Fishing and Hunting"
    elif str(row['Category']).strip() == "21":
        df["Category_Description"] = "Mining, Quarrying, and Oil and Gas Extraction"
The print statement
print(str(row['Category']).strip())
is working fine, but the updates to the Category_Description column are not.
The input data has the following codes
Category Count of Records
48 17845
42 2024
99 1582
23 1058
54 1032
56 990
32 916
33 874
44 695
11 630
53 421
81 395
31 353
49 336
21 171
45 171
52 116
71 108
61 77
51 64
62 54
72 51
92 36
55 35
22 14
The update resulted in
Agriculture, Forestry, Fishing and Hunting 41183
Here is a small sample of the dataset and code on repl.it
https://repl.it/#RamprasadRengan/SimpleIF#main.py
When I run the code above with this data I still see the same issue.
What am I missing here?
You are performing a row operation but applying a dataframe-wide change in the "IF" statement. This will apply the value to all the records.
Try something like:
def get_category_for_record(rec):
    if str(rec['Category']).strip() == "11":
        return "Agriculture, Forestry, Fishing and Hunting"
    elif str(rec['Category']).strip() == "21":
        return "Mining, Quarrying, and Oil and Gas Extraction"
    return ""

df["category"] = df.apply(get_category_for_record, axis=1)
I think you want to add a column to the dataframe that maps category to a longer description. As mentioned in the comments, assignment to a column affects the entire column. But if you use a list, each row in the column gets the corresponding value.
So use a dictionary to map name to description, build a list, and assign it.
import pandas as pd

category_map = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining, Quarrying, and Oil and Gas Extraction"}

df = pd.DataFrame([["48", 17845],
                   [" 11 ", 88888],
                   ["12", 33333],
                   ["21", 999]],
                  columns=["category", "count of records"])

# cleanup category and add description
df["category"] = df["category"].str.strip()
df["Category_Description"] = [category_map.get(cat, "")
                              for cat in df["category"]]

# alternately....
#df.insert(2, "Category_Description",
#          [category_map.get(cat, "") for cat in df["category"]])

print(df)
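As a possible simplification (not in the original answer): pandas' built-in Series.map accepts a dictionary directly, so the list comprehension isn't strictly needed. Unmapped codes become NaN, which you can fill with an empty string. A minimal sketch with assumed sample values:

```python
import pandas as pd

category_map = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining, Quarrying, and Oil and Gas Extraction"}

df = pd.DataFrame({"category": ["48", " 11 ", "21"]})

# strip whitespace, map code -> description, blank out unknown codes
df["Category_Description"] = (df["category"].str.strip()
                              .map(category_map)
                              .fillna(""))
print(df)
```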
Related
I have three different DataFrames (df2019, df2020, and df2021) and they all have the same columns (here are a few) with some overlapping 'BrandID':
BrandID StockedOutDays Profit SalesQuantity
243 01-02760 120 516452.76 64476
138 01-01737 96 603900.0 80520
166 01-02018 125 306796.8 52896
141 01-01770 109 297258.6 39372
965 02-35464 128 214039.2 24240
385 01-03857 92 326255.16 30954
242 01-02757 73 393866.4 67908
What I'm trying to do is add the value from one column for a specific BrandID from each of the 3 DataFrames. In my specific case, I'd like to add the value of 'SalesQuantity' for 'BrandID' = 01-02757 from df2019, df2020, and df2021, and get a line I can run to see a single number.
I've searched around and tried a bunch of different things, but am stuck. Please help, thank you!
EDIT *** I'm looking for something like this I think, I just don't know how to sum them all together:
df2021.set_index('BrandID',inplace=True)
df2020.set_index('BrandID',inplace=True)
df2019.set_index('BrandID',inplace=True)
(df2021.loc['01-02757']['SalesQuantity'] + df2020.loc['01-02757']['SalesQuantity']
 + df2019.loc['01-02757']['SalesQuantity'])
import pandas as pd

df2019 = pd.DataFrame([{"BrandID": "01-02760", "StockedOutDays": 120, "Profit": 516452.76, "SalesQuantity": 64476},
                       {"BrandID": "01-01737", "StockedOutDays": 96, "Profit": 603900.0, "SalesQuantity": 80520}])
df2020 = pd.DataFrame([{"BrandID": "01-02760", "StockedOutDays": 123, "Profit": 76481.76, "SalesQuantity": 2457},
                       {"BrandID": "01-01737", "StockedOutDays": 27, "Profit": 203014.0, "SalesQuantity": 15648}])

df2019["year"] = 2019
df2020["year"] = 2020

# stack the yearly frames (pd.concat; DataFrame.append is deprecated)
df = pd.concat([df2019, df2020])
df_sum = df.groupby("BrandID").agg("sum").drop("year", axis=1)
print(df)
print(df_sum)
df:
BrandID StockedOutDays Profit SalesQuantity year
0 01-02760 120 516452.76 64476 2019
1 01-01737 96 603900.00 80520 2019
0 01-02760 123 76481.76 2457 2020
1 01-01737 27 203014.00 15648 2020
df_sum:
StockedOutDays Profit SalesQuantity
BrandID
01-01737 123 806914.00 96168
01-02760 243 592934.52 66933
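If you only want the single number the question asks for (total SalesQuantity for one BrandID), the same concat-and-groupby approach reduces to one lookup. A sketch using the sample values above, trimmed to the relevant columns:

```python
import pandas as pd

df2019 = pd.DataFrame([{"BrandID": "01-02760", "SalesQuantity": 64476},
                       {"BrandID": "01-01737", "SalesQuantity": 80520}])
df2020 = pd.DataFrame([{"BrandID": "01-02760", "SalesQuantity": 2457},
                       {"BrandID": "01-01737", "SalesQuantity": 15648}])

# stack the yearly frames, then sum SalesQuantity per brand
df = pd.concat([df2019, df2020])
total = df.groupby("BrandID")["SalesQuantity"].sum().loc["01-02760"]
print(total)  # 64476 + 2457 = 66933
```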
I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The method that comes closest to mind is calculating the total number of users, working out 20% of this, sorting the dataframe with sort_values(), and then using head() or nlargest(), but I'd like to know if there is a simpler, more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have a dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
Only 2 top users generate 23.3% of total revenue.
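A possible final step (not in the original answer): once the cumulative share column exists, selecting the rows that make up the top 20% of revenue is a single filter. A self-contained sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [234, 2873, 827, 12, 8942, 498, 2345, 239, 4985, 947],
                   "revenue": [21, 20, 23, 23, 28, 22, 20, 24, 21, 25]})

# sort by revenue and compute each row's cumulative share of the total
df = df.sort_values(by="revenue", ascending=False)
df["%revenue_cum"] = df["revenue"].cumsum() / df["revenue"].sum()

# keep users whose cumulative share stays within the top 20% of revenue
top = df[df["%revenue_cum"] <= 0.20]
print(top)
```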
This looks like a case for df.quantile. Per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
A case example from your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
df.quantile([0.8, 1], interpolation='nearest')
This would print the top 2 rows in value:
user_id revenue
0.8 2873 489
1.0 8942 28934
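Building on the quantile idea, one way to get the actual rows rather than just the quantile values is to compute the 80th-percentile threshold and filter against it (a sketch using the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})

# rows whose revenue is at or above the 80th percentile
threshold = df['revenue'].quantile(0.8)
top_users = df[df['revenue'] >= threshold]
print(top_users)
```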
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100*df[f'{col}_cs']/df[col].sum()
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp']-n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)
I am new to Python and I have a question. I have an exported .csv with values, and I want to sum each row's values and add a Total column.
I've tried the following but it doesn't work:
import pandas as pd
wine = pd.read_csv('testelek.csv', 'rb', delimiter=';')
wine['Total'] = [wine[row].sum(axis=1) for row in wine]
I want to make my DataFrame like this.
101 102 103 104 .... Total
__________________________________________________________________________
0 80 84 86 78 .... 328
1 78 76 77 79 .... 310
2 79 81 88 83 .... 331
3 70 85 89 84 .... 328
4 78 84 88 85 .... 335
You can bypass the need for the list comprehension and just use the axis=1 parameter to get what you want.
wine['Total'] = wine.sum(axis=1)
A nice way to do this is by using .apply().
Suppose that you want to create a new column named Total by adding the values per row for columns named 101, 102, and 103 you can try the following:
wine['Total'] = wine.apply(lambda row: sum([row['101'], row['102'], row['103']]), axis=1)
Still relatively new to working in Python and am having some issues.
I currently have a small program that takes csv files, merges them, puts them into a data frame, and then converts to Excel.
What I want to do is match the values of the 'Team' and 'Abrev' dataframe columns based on the prefix of their values, and then replace the 'Team' column contents with the 'Abrev' column contents.
Team Games Points Abrev
Arsenal 38 87 ARS
Liverpool 38 80 LIV
Manchester 38 82 MAN
Newcastle 38 73 NEW
I would like it to eventually look like the following:
Team Games Points
ARS 38 87
LIV 38 80
MAN 38 82
NEW 38 73
So what I'm thinking is that I need a for loop to iterate through the rows of the dataframe, and then a way to compare the contents against the prefix in the Abrev column. If the first three letters match, then replace, but I don't know how to go about it because I am trying not to hard-code it.
Can someone help or point me in the right direction?
You can use the apply operation to get the desired output:
df = pd.read_csv('input.csv')
df['Team'] = df.apply(lambda row: row['Team'] if row['Team'][:3].upper() != row['Abrev']
                      else row['Abrev'], axis=1)
df.drop('Abrev', axis=1, inplace=True)
This gives you:
Team Games Points
ARS 38 87
LIV 38 80
MAN 38 82
NEW 38 73
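As a vectorized alternative to the row-wise apply (same logic: replace Team with Abrev only where the three-letter prefix matches), numpy.where avoids iterating at all. A self-contained sketch with the question's sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Team": ["Arsenal", "Liverpool", "Manchester", "Newcastle"],
                   "Games": [38, 38, 38, 38],
                   "Points": [87, 80, 82, 73],
                   "Abrev": ["ARS", "LIV", "MAN", "NEW"]})

# replace Team with Abrev wherever the 3-letter prefix matches
match = df["Team"].str[:3].str.upper() == df["Abrev"]
df["Team"] = np.where(match, df["Abrev"], df["Team"])
df = df.drop(columns=["Abrev"])
print(df)
```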
pandas is what you are looking for:
import pandas as pd

df = pd.read_csv('input.csv')
df['Team'] = df['Abrev']
df.drop('Abrev', axis=1, inplace=True)
df.to_excel('output.xls')
I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, median).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A)F(B)F(C)F(D)F(E)F(F)F(G)F(H)F(I)F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, after applying f(x) to each cell per row. The function foo tries to map element x to f(x) by doing an OLS regression of the min, median and max of the row to some coordinates. This is done each period.
I'm aware that I can iterate over the rows and then, for each row, apply the function to each element. Where I am struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    # avoid shadowing the built-in min/max
    row_min = row.min()
    row_max = row.max()
    row_mean = row.mean()
    # apply function to row
    new_row = row.apply(lambda x: foo(x, row_min, row_max, row_mean))
    # add this to df_output
help!
My current thinking is to build up the new df row by row. I'm trying to do that but I'm getting a lot of multi-index columns, etc. Any pointers would be great.
thanks so much... merry xmas to you all.
Consider calculating row aggregates with DataFrame methods and then passing the series values in a DataFrame.apply() across columns:
cols = list('ABCDEFGHIJ')

# ROW-WISE AGGREGATES (computed from the data columns only, so the
# aggregate columns themselves don't skew later calculations)
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)

# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[cols].apply(lambda col: foo(col,
                                        df['row_min'],
                                        df['row_max'],
                                        df['row_mean']))
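If foo really needs to be applied element by element with each row's own statistics, a nested row-wise apply also works. The sketch below uses a placeholder foo (simple min-max scaling, just to make it runnable; the real function would do the OLS mapping) and a small assumed input:

```python
import pandas as pd

df_input = pd.DataFrame([[60, 48, 26], [88, 52, 24]],
                        columns=list("ABC"),
                        index=["2011-01-01", "2011-01-02"])

def foo(x, row_min, row_max, row_mean):
    # placeholder for the real OLS-based mapping
    return (x - row_min) / (row_max - row_min)

# outer apply walks rows; inner apply maps foo over each element,
# passing that row's aggregates
df_output = df_input.apply(
    lambda row: row.apply(lambda x: foo(x, row.min(), row.max(), row.mean())),
    axis=1)
print(df_output)
```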