I have a dataframe where one column has a list of zipcodes and the other has property values corresponding to the zipcode. I want to sum up the property values in each row according to the appropriate zipcode.
So, for example:
zip value
2210 $5,000
2130 $3,000
2210 $2,100
2345 $1,000
I would then add up the values
$5,000 + $2,100 = $7,100
and get the total property value for zipcode 2210 as $7,100.
Any help in this regard will be appreciated.
You need:
df
zip value
0 2210 5000
1 2130 3000
2 2210 2100
3 2345 1000
df2 = df.groupby(['zip'])['value'].sum()
df2
zip
2130 3000
2210 7100
2345 1000
Name: value, dtype: int64
You can read more about groupby in the pandas documentation.
Also, you will need to remove the $ sign (and the thousands separator) from the column values. For that, you can use something along the lines of the following while reading the dataframe initially:
df = pd.read_csv('zip_value.csv', header=0, names=headers, converters={'value': lambda x: float(x.replace('$', '').replace(',', ''))})
Edit: Changed the code according to comment.
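If the DataFrame has already been loaded with the $ strings in place, here is a minimal sketch of the same cleanup after the fact (column names assumed from the example above):
import pandas as pd

df = pd.DataFrame({'zip': [2210, 2130, 2210, 2345],
                   'value': ['$5,000', '$3,000', '$2,100', '$1,000']})

# strip the currency symbol and thousands separator, then convert to a number
df['value'] = (df['value']
               .str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float))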
To reset the index after groupby use:
df2 = df.groupby(['zip'])['value'].sum().reset_index()
Then, to remove rows with a particular zip value, say 2135, you need:
df3 = df2[df2['zip']!= 2135]
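Putting the pieces together, a short sketch of the whole flow (column names are taken from the example; 2135 and 2345 are just illustrative zips to exclude):
# group, sum, and keep 'zip' as a regular column in one step
totals = df.groupby('zip', as_index=False)['value'].sum()

# drop a single zip
totals = totals[totals['zip'] != 2135]

# or drop several zips at once
totals = totals[~totals['zip'].isin([2135, 2345])]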
I have a dataframe like as shown below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to do the below
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1) # I also tried this, but how do I do it for only two columns (and not all columns)?
But if I wish to compute the average for the past 12 months, it would not be elegant to write it this way. Is there a better or more efficient way to write this, so that I can just key in the number of columns to look back and it computes the average based on that input?
I expect my output to be like as below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1247.5 1240
2 5678 3425 3255 2345 2800 3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 \
0 1 1234 1231 1256 1239
1 2 5678 3425 3255 2345
revenue_mean_2m revenue_mean_4m
0 1247.5 1240.00
1 2800.0 3675.75
If the column order is not guaranteed, sort them with natural sorting:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
You could also try something like this:
n_months = 4 # you could also do this in a loop for all months, range(1, 12)
df[f'revenue_mean_{n_months}m'] = df.iloc[:, -n_months:].mean(axis=1)
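For the loop mentioned in the comment, a rough sketch that adds one mean column per lookback window (the window sizes are just examples; it assumes the revenue columns are already in chronological order, e.g. after the natural sort shown above):
# work from the revenue columns only, so previously added mean columns are never included
revenue_cols = df.filter(regex=r'revenue_m\d+').columns

for n in (2, 4, 12):
    if n <= len(revenue_cols):
        df[f'revenue_mean_{n}m'] = df[revenue_cols[-n:]].mean(axis=1)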
I have a data frame called df:
Date Sales
01/01/2020 812
02/01/2020 981
03/01/2020 923
04/01/2020 1033
05/01/2020 988
... ...
How can I get the first occurrence of 7 consecutive days with sales above 1000?
This is what I am doing to find the rows where sales is above 1000:
In [221]: df.loc[df["sales"] >= 1000]
Out [221]:
Date Sales
04/01/2020 1033
08/01/2020 1008
09/01/2020 1091
17/01/2020 1080
18/01/2020 1121
19/01/2020 1098
... ...
You can assign a unique identifier to each run of consecutive days, group by it, and take the first rows per group (after first filtering for values above 1000):
df = df.query('Sales > 1000').copy()
df['grp_date'] = df.Date.diff().dt.days.fillna(1).ne(1).cumsum()
df.groupby('grp_date').head(7).reset_index(drop=True)
where you can change the value of the head parameter to take the first n rows from each run of consecutive days.
Note: you may need to use pd.to_datetime(df.Date, format='%d/%m/%Y') to convert dates from strings to pandas datetime, and sort them.
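A self-contained sketch of the same idea, extended so that only the first run of at least 7 qualifying days is returned (the toy data and the 1000 threshold are assumptions for illustration):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Date': pd.date_range('2020-01-01', periods=60, freq='D'),
                   'Sales': rng.integers(900, 1200, size=60)})

high = df[df['Sales'] >= 1000].copy()
# new group id every time the gap between qualifying days is more than 1 day
high['grp_date'] = high['Date'].diff().dt.days.fillna(1).ne(1).cumsum()

# keep only runs of at least 7 consecutive qualifying days, then take the earliest one
runs = high.groupby('grp_date').filter(lambda g: len(g) >= 7)
if not runs.empty:
    first_run = runs[runs['grp_date'] == runs['grp_date'].iloc[0]].head(7)
    print(first_run)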
Couldn't you just sort by date and grab head 7?
df = df.sort_values('Date')
df.loc[df["sales"] >= 1000].head(7)
If you need the original, make a copy first.
I have df with cols:
date Account invoice category sales
12-01-2019 123 123 exhaust 2200
13-01-2019 124 124 tyres 1300
15-01-2019 234 125 windscreen 4500
16-01-2019 123 134 gearbox 6000
I have grouped by account and summed sales:
dfres = df.groupby(['Account']).agg({'sales': np.sum})
I received:
sales
account
123 8200
124 3300
I now want to retrieve the original df filtered by my grouped results, i.e. a reduced dataset that only retains, for example, the accounts in the top 5% of sales. With my attempt below I still end up with the same number of rows as the original. How can I remove the unwanted accounts?
I've tried:
index_list = dfres.index.tolist()
newdf = df[df.Account.isin(index_list)]
Many thanks
If you want to keep the remaining columns, you'll need to tell pandas how to show the rest of the columns once grouped. For example, if you want to keep the information in invoice, category, and date as a list of whatever invoices/categories/dates make up that Account sum, then:
dfres = df.groupby(['Account']).agg({'sales': np.sum, 'invoice':list, 'category':list, 'date':list})
You could then reset the index to turn it back into a flat dataframe:
dfres.reset_index()
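If the end goal is to keep only the top accounts by total sales and then reduce the original rows to those accounts, here is a hedged sketch (the 5% cutoff comes from the question; column names from the example):
# total sales per account
totals = df.groupby('Account')['sales'].sum()

# accounts whose total is in the top 5%
top_accounts = totals[totals >= totals.quantile(0.95)].index

# reduced dataset containing only rows for those accounts
newdf = df[df['Account'].isin(top_accounts)]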
I have the following csv file that I converted to a DataFrame:
apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180
I want to be able to iterate over the rows, and if the value of a cell in the internetbill column is not a number, delete that whole row. So in this example, the 401,4,120,nan,340 row would be eliminated from the DataFrame.
I thought that something like this would work, but to no avail, and I'm stuck:
df.drop[df['internetbill'] == "nan"]
If you are using pd.read_csv, then that nan will be imported as np.nan. If so, you need dropna:
df.dropna(subset=['internetbill'])
apartment floor gasbill internetbill powerbill
1 409 4 190 50.0 140
2 410 4 155 45.0 180
If those are strings for whatever reason, you could do one of two things:
replace
df.replace({'internetbill': {'nan': np.nan}}).dropna(subset=['internetbill'])
to_numeric
df.assign(
internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])
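A minimal end-to-end sketch combining those pieces, assuming the csv shown above is saved as bills.csv (the filename is only an assumption):
import pandas as pd

df = pd.read_csv('bills.csv')

# coerce anything non-numeric (including a literal 'nan' string) to NaN, then drop those rows
df['internetbill'] = pd.to_numeric(df['internetbill'], errors='coerce')
df = df.dropna(subset=['internetbill'])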
I have a data series df:
primary
Buy 484
Sell 429
Blanks 130
FX Spot 108
Income 77
FX Forward 2
I am trying to create a dataframe with 2 columns.
The first column's values should be the index of df.
The second column should have the values of primary in df.
By using:
filter_df=pd.DataFrame({'contents':df.index, 'values':df.values})
I get:
Exception: Data must be 1-dimensional
Use reset_index with rename_axis for new column name:
filter_df = df.rename_axis('content').reset_index()
Another solution with rename:
filter_df = df.reset_index().rename(columns={'index':'content'})
For a DataFrame built with the constructor, you need df['primary'] to select the column:
filter_df=pd.DataFrame({'contents':df.index, 'values':df['primary'].values})
print (filter_df)
contents values
0 Buy 484
1 Sell 429
2 Blanks 130
3 FX Spot 108
4 Income 77
5 FX Forward 2
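A small self-contained sketch reproducing the first solution on the data above (df is rebuilt here as a one-column DataFrame, which is also why df.values is 2-dimensional and the constructor raised 'Data must be 1-dimensional'):
import pandas as pd

df = pd.DataFrame({'primary': [484, 429, 130, 108, 77, 2]},
                  index=['Buy', 'Sell', 'Blanks', 'FX Spot', 'Income', 'FX Forward'])

# name the index, then promote it to a regular column
filter_df = df.rename_axis('content').reset_index()
print(filter_df)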