I am new to Python and I have a question. I have an exported .csv file with values, and I want to sum each row's total value and add a Total column holding it.
I've tried the following, but it doesn't work.
import pandas as pd
wine = pd.read_csv('testelek.csv', 'rb', delimiter=';')
wine['Total'] = [wine[row].sum(axis=1) for row in wine]
I want to make my DataFrame like this.
101 102 103 104 .... Total
__________________________________________________________________________
0 80 84 86 78 .... 328
1 78 76 77 79 .... 310
2 79 81 88 83 .... 331
3 70 85 89 84 .... 328
4 78 84 88 85 .... 335
You can bypass the need for the list comprehension and just use the axis=1 parameter to get what you want.
wine['Total'] = wine.sum(axis=1)
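As an aside, the read_csv call in the question passes 'rb' as a second positional argument, which pandas interprets as the separator and which conflicts with delimiter=';'. A minimal sketch of the whole flow, assuming a semicolon-delimited file as in the question:
import pandas as pd

# read the semicolon-delimited file; no file-mode argument is needed
wine = pd.read_csv('testelek.csv', delimiter=';')

# row-wise total across the numeric columns
wine['Total'] = wine.sum(axis=1)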
A nice way to do this is by using .apply().
Suppose that you want to create a new column named Total by adding the values per row for columns named 101, 102, and 103; you can try the following:
wine['Total'] = wine.apply(lambda row: sum([row['101'], row['102'], row['103']]), axis=1)
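Note that .apply with axis=1 calls a Python function once per row. For large frames, selecting the columns and summing is vectorized, typically much faster, and gives the same result:
wine['Total'] = wine[['101', '102', '103']].sum(axis=1)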
I'm trying to normalize a pandas DataFrame by row, and there's a column with string values that is causing me a lot of trouble. Does anyone have a neat way to make this work?
For example:
system Fluency Terminology No-error Accuracy Locale convention Other
19 hyp.metricsystem2 111 28 219 98 0 133
18 hyp.metricsystem1 97 22 242 84 0 137
22 hyp.metricsystem5 107 11 246 85 0 127
17 hyp.eTranslation 49 30 262 80 0 143
20 hyp.metricsystem3 86 23 263 89 0 118
21 hyp.metricsystem4 74 17 274 70 0 111
I am trying to normalize each row across the columns from Fluency through Other over the row total. In other words, divide each integer column entry by its row's total (Fluency[0]/total_row[0], Terminology[0]/total_row[0], ...).
I tried using this command, but it gives me an error because I have a column of strings:
bad_models.div(bad_models.sum(axis=1), axis=0)
Any help would be greatly appreciated...
Use select_dtypes to select only the numeric columns, normalize those, and write them back:
# numeric columns only; the string 'system' column is left untouched
subset = bad_models.select_dtypes('number')
bad_models[subset.columns] = subset.div(subset.sum(axis=1), axis=0)
print(bad_models)
# Output
system Fluency Terminology No-error Accuracy Locale convention Other
19 hyp.metricsystem2 0.211832 0.21374 0.145418 0.193676 0 0.172952
18 hyp.metricsystem1 0.185115 0.167939 0.160691 0.166008 0 0.178153
22 hyp.metricsystem5 0.204198 0.083969 0.163347 0.167984 0 0.16515
17 hyp.eTranslation 0.093511 0.229008 0.173971 0.158103 0 0.185956
20 hyp.metricsystem3 0.164122 0.175573 0.174635 0.175889 0 0.153446
21 hyp.metricsystem4 0.141221 0.129771 0.181939 0.13834 0 0.144343
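An equivalent approach, shown here as a sketch, is to move the string system column into the index before dividing, so that every remaining column is numeric:
# move the non-numeric 'system' column into the index, then divide row-wise
normalized = bad_models.set_index('system')
normalized = normalized.div(normalized.sum(axis=1), axis=0)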
I have a dataframe like the following:
Label  Indicator  Value1  Value2
A      77         50      50
A      776        60      70
A      771        70      40
A      7          80      50
A      7775       90      40
B      776        100     40
B      771        41      50
B      775        54      40
B      7775       55      50
What I want is an output like this:
Label  aggregation1          aggregation2
A      aggregation1_A_value  aggregation2_A_value
B      aggregation1_B_value  aggregation2_B_value
The way I want to aggregate the values is the following (example):
aggregation1 = Value1 of indicators starting with 77 (but not 776), minus Value2 of indicators 776 and 775.
What I am doing now is this: I split the Indicator into several columns to get a new DataFrame:
Label  Indicator0  Indicator1  Indicator2  ...
A      7           77          77          ...
A      7           77          776         ...
A      7           77          771         ...
...    ...         ...         ...         ...
B      7           77          777         ...
aggregation1_A = df.query("Label == 'A' and Indicator1 in ['77'] and Indicator2 not in ['776']")["Value1"].sum()
aggregation1_A -= df.query("Label == 'A' and Indicator2 in ['776', '775']")["Value2"].sum()
My issue is that I have more than 70,000 different labels and about 20 aggregations to run.
The DataFrame is about 500 MB.
I am wondering if there is a better way. I had a look at pandas UDFs and at applying a custom aggregation function, but I haven't succeeded so far.
Thank you for your help.
You can use get_dummies to replace the step where you split your Indicator into separate columns. Then you can use those boolean indicator values to carry out your aggregations:
dummies = pd.get_dummies(df, columns=['Indicator'])

def agg_1(df):
    # add Value1 for rows carrying indicator 77, 771, or 7775,
    # subtract Value2 for rows carrying 775 or 776
    ret = df.apply(lambda x: x['Value1'] * x[['Indicator_77', 'Indicator_771', 'Indicator_7775']], axis=1).sum().sum()
    ret -= df.apply(lambda x: x['Value2'] * x[['Indicator_775', 'Indicator_776']], axis=1).sum().sum()
    return ret

dummies.groupby('Label').apply(agg_1)
The lambda functions are just multiplying the values by whether or not the relevant indicators are set in that row. The sum().sum() flattens the result of that multiplication into a scalar.
For your other aggregations, define functions like agg_1 the same way and run one groupby().apply() per function, combining the per-label results afterwards (for example with pd.concat).
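Given the data size (70,000 labels, ~500 MB), a fully vectorized sketch may also be worth trying, since it avoids .apply entirely. This assumes the original df with Indicator stored as strings, and implements the aggregation1 rule exactly as stated in the question:
# boolean masks for the add and subtract parts of aggregation1
add_mask = df['Indicator'].str.startswith('77') & (df['Indicator'] != '776')
sub_mask = df['Indicator'].isin(['776', '775'])

# per-row contribution, then one grouped sum over all labels at once
df['agg1_part'] = df['Value1'] * add_mask - df['Value2'] * sub_mask
aggregation1 = df.groupby('Label')['agg1_part'].sum()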
This is supposed to be a simple if statement that updates a column based on a condition, but it is not working.
Here is my code:
df["Category"].fillna("999", inplace = True)
for index, row in df.iterrows():
if (str(row['Category']).strip()=="11"):
print(str(row['Category']).strip())
df["Category_Description"] = "Agriculture, Forestry, Fishing and Hunting"
elif (str(row['Category']).strip()=="21"):
df["Category_Description"] = "Mining, Quarrying, and Oil and Gas Extraction"
The print statement
print(str(row['Category']).strip())
is working fine, but the updates to the Category_Description column are not.
The input data has the following codes
Category Count of Records
48 17845
42 2024
99 1582
23 1058
54 1032
56 990
32 916
33 874
44 695
11 630
53 421
81 395
31 353
49 336
21 171
45 171
52 116
71 108
61 77
51 64
62 54
72 51
92 36
55 35
22 14
The update resulted in
Agriculture, Forestry, Fishing and Hunting 41183
Here is a small sample of the dataset and code on repl.it
https://repl.it/#RamprasadRengan/SimpleIF#main.py
When I run the code above with this data I still see the same issue.
What am I missing here?
You are iterating row by row, but the assignment inside the if statement targets the whole DataFrame column, so each assignment applies the value to every record. Try something like:
def get_category_for_record(row):
    if str(row['Category']).strip() == "11":
        return "Agriculture, Forestry, Fishing and Hunting"
    elif str(row['Category']).strip() == "21":
        return "Mining, Quarrying, and Oil and Gas Extraction"
    return ""

df["Category_Description"] = df.apply(get_category_for_record, axis=1)
I think you want to add a column to the dataframe that maps category to a longer description. As mentioned in the comments, assignment to a column affects the entire column. But if you use a list, each row in the column gets the corresponding value.
So use a dictionary to map name to description, build a list, and assign it.
import pandas as pd

category_map = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining, Quarrying, and Oil and Gas Extraction"}

df = pd.DataFrame([["48", 17845],
                   [" 11 ", 88888],
                   ["12", 33333],
                   ["21", 999]],
                  columns=["category", "count of records"])

# clean up category and add description
df["category"] = df["category"].str.strip()
df["Category_Description"] = [category_map.get(cat, "")
                              for cat in df["category"]]

# alternately....
# df.insert(2, "Category_Description",
#           [category_map.get(cat, "") for cat in df["category"]])

print(df)
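Since this is a plain value-to-description lookup, Series.map does the same job without building the list by hand, reusing the category_map above:
df["Category_Description"] = df["category"].map(category_map).fillna("")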
I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, mean).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A)F(B)F(C)F(D)F(E)F(F)F(G)F(H)F(I)F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, by applying f(x) to each cell per row. The function foo maps an element x to f(x) by doing an OLS regression of the row's min, mean, and max onto some coordinates; this is done each period.
I know that I need to iterate over the rows and then, for each row, apply the function to each element. Where I'm struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    # use names that don't shadow the built-in min/max
    row_min = row.min()
    row_max = row.max()
    row_mean = row.mean()
    # apply the function to each element of the row
    new_row = row.apply(lambda x: foo(x, row_min, row_max, row_mean))
    # add this to df_output
Help!
My current thinking is to build up the new DataFrame row by row. I'm trying to do that, but I'm getting a lot of MultiIndex columns, etc. Any pointers would be great.
Thanks so much... Merry Xmas to you all.
Consider calculating the row aggregates with DataFrame methods and then passing the resulting Series into a DataFrame.apply() across columns:
# ROW-WISE AGGREGATES (computed from the data columns only, so the
# helper columns added here do not skew the later aggregates)
data_cols = list('ABCDEFGHIJ')
df['row_min'] = df[data_cols].min(axis=1)
df['row_max'] = df[data_cols].max(axis=1)
df['row_mean'] = df[data_cols].mean(axis=1)

# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[data_cols].apply(lambda col: foo(col,
                                             df['row_min'],
                                             df['row_max'],
                                             df['row_mean']))
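Note that this passes whole columns (Series) into foo, so foo must be written with vectorized operations. As a purely hypothetical stand-in for the OLS-based mapping in the question, a vectorized foo could look like:
# hypothetical example only: rescale each value within its row's range
def foo(col, row_min, row_max, row_mean):
    return (col - row_min) / (row_max - row_min) * 100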
Suppose I have the following dataframe. If there is a null in any of the four columns Participation, Homework, Test, or Presentation, then I want to remove that row. How do I achieve this in pandas?
Name Participation Homework Test Presentation Attendance
Andrew 92 Null 85 95 88
John 95 88 98 Null 90
Carrie 82 99 96 89 92
Simone 100 91 88 99 90
Here, I would want to remove everyone except Carrie and Simone from the dataframe.
I found df = df[pd.notnull(df['column_name'])] on Stack Overflow, which I think may help, but is there any way I can do this for all four columns at once (as a subset) instead of each column individually?
Thanks!
You can skip the replace used in the answers below if you use ne:
df[df.ne('Null').all(axis=1)]
Name Participation Homework Test Presentation Attendance
2 Carrie 82 99 96 89 92
3 Simone 100 91 88 99 90
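Note that ne('Null') works here because the missing entries are the literal string 'Null'. If they were real NaN values instead, dropna with a subset would answer the four-columns part of the question directly:
df.dropna(subset=['Participation', 'Homework', 'Test', 'Presentation'])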
As preparation, let's replace the string 'Null' with np.nan first.
Now, let's try this using notnull and all with axis=1:
import numpy as np

df[df.replace('Null', np.nan).notnull().all(axis=1)]
Output:
Name Participation Homework Test Presentation Attendance
2 Carrie 82 99 96 89 92
3 Simone 100 91 88 99 90
Or using isnull, any, and ~:
df[~df.replace('Null', np.nan).isnull().any(axis=1)]
replace + dropna
df.replace({'Null':np.nan}).dropna()
Output:
Name Participation Homework Test Presentation Attendance
2 Carrie 82 99 96 89 92
3 Simone 100 91 88 99 90