I have this CSV data (an example):
I have 5,000 zip codes (along with other columns), but only 34 of them are unique. I have to take each zip code and hit another API to get the median income, but how can I fill in the median income column for the other rows that share a duplicate zip code?
N.B.: I didn't find anything related to my case.
You want to use transform, which returns a Series or DataFrame with the same index as the original object, filled with the transformed values.
You will need to write a function that takes a zip code and returns the median value. See this example:
import pandas as pd

def get_med(zip_code):
    # This would be your GET call to the API.
    # Here `zip_code` is a Series (one group), so use `.iloc[0]`
    # to get the value of the group.
    return zip_code.iloc[0] * 100

df = pd.DataFrame({"zip": [1, 2, 3, 1, 1]})
df["med_income"] = df.groupby("zip")["zip"].transform(get_med)

#    zip  med_income
# 0    1         100
# 1    2         200
# 2    3         300
# 3    1         100
# 4    1         100
Alternatively, you could generate all the median values in a dict and then map that back onto the DataFrame. Here get_median is assumed to be a plain function that takes a single zip code (your API call) and returns its median income:
medians = {zip_code: get_median(zip_code) for zip_code in df["zip"].unique()}
df["med_income"] = df["zip"].map(medians)
I believe you're looking for pandas map. So let's suppose the output of this second API is a dictionary (or that you can build one from its responses):
# Get unique zip codes to use as input to the API
zip_codes = df['Zip'].unique()
# Let's suppose you get an output like this
zip_dict = {46234: 1500, 46250: 2000, 46280: 1200} # and so on...
So, you can map the zip code to the Median Income like this:
df['Median Income'] = df['Zip'].map(zip_dict)
where df is your dataframe.
From what I understood, you want to get the unique zip code values? If so, you can use
df.yourColumn.unique()
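For example, on a toy frame (the column name Zip here is just an assumption):
import pandas as pd
df = pd.DataFrame({"Zip": [46234, 46250, 46234, 46280, 46250]})
print(df["Zip"].unique())  # array([46234, 46250, 46280])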
I have a dataset (df) like this:
Card Number Amount
0 102******38 22.0
1 102******56 76.0
2 102******38 25.6
and it's loaded using
import pandas as pd
df = pd.read_csv("file.csv")
And I would like to calculate something like:
df["Zscore"] = df["Amount"] - AVERAGE(all X in df["Amount"] that have the same "Number" in df["Card Number"])
My intuition is something like:
import numpy as np
import statistics as st
df["Zscore"] = df["Amount"] - st.mean(np.where(condition, df["Amount"], 0))
But I can't figure out how to express my condition.
After some research, I found a solution using Verticapy:
import verticapy.stats as st
df["Zscore"] = (df["Amount"] - st.mean(df["Amount"])._over(["CardNumber"]))
But that means converting my code to Verticapy, and I would like another way to do it, because I have never used Verticapy and don't really want to at the moment.
So do I need to use "np.where()", and if so, is it possible to formulate my condition?
Or do I need to change the way I attack the problem?
First, you need to calculate the mean value per card number. Let's do that by grouping on the card number, taking the average amount, and calling it 'card_mean':
mean_values = df.groupby('Card Number')['Amount']\
                .mean()\
                .reset_index()\
                .rename(columns={'Amount': 'card_mean'})
Then you merge that mean value back into the original DataFrame as a new column, matched on 'Card Number' for each row of your original df:
df = pd.merge(df, mean_values, how='left', on='Card Number')
This gives you a combined df that now contains both the 'Amount' column (which you loaded) and the 'card_mean' per card number (which we just calculated by averaging in step 1).
Now you can do your magic with both, i.e. subtract them, average over the difference, etc. For example:
df['z_score'] = df['Amount'] - df['card_mean']
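For completeness, here is a sketch of the same result without the merge, using groupby-transform (same column names as above, made-up values): transform broadcasts the per-card mean back onto every row, aligned on df's index.
import pandas as pd

df = pd.DataFrame({
    "Card Number": ["102******38", "102******56", "102******38"],
    "Amount": [22.0, 76.0, 25.6],
})

# per-card mean, broadcast to every row of that card
df["card_mean"] = df.groupby("Card Number")["Amount"].transform("mean")
df["z_score"] = df["Amount"] - df["card_mean"]
print(df)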
Here is an example of the data I am dealing with:
This example is a shortened version of each Run; here the runs are about 4 rows long, while in a typical data set they are anywhere between 50 and 100 rows long. There are also 44 different runs.
So my goal is to get the average of the last 4 rows of a given column in Stage 2. Right now I am achieving that, but it computes the average under these conditions for the whole spreadsheet. I want to get these average values for each and every 'Run'.
df["Run"] = pd.DataFrame({
"Run": ["Run1.1", "Run1.2", "Run1.3", "Run2.1", "Run2.2", "Run2.3", "Run3.1", "Run3.2", "Run3.3", "Run4.1",
"Run4.2", "Run4.3", "Run5.1", "Run5.2", "Run5.3", "Run6.1", "Run6.2", "Run6.3", "Run7.1", "Run7.2",
"Run7.3", "Run8.1", "Run8.2", "Run8.3", "Run9.1", "Run9.2", "Run9.3", "Run10.1", "Run10.2", "Run10.3",
"Run11.1", "Run11.2", "Run11.3"],
})
av = df.loc[df['Stage'].eq(2),'Vout'].groupby("Run").tail(4).mean()
print(av)
I want to be able to get these averages for a given column in Stage 2, for each and every 'Run'. As you can see, before each data set there is a corresponding 'Run', e.g. the second data set has 'Run1.2' before it.
Also, in each file I am dealing with, the number of rows per Run is different/not always the same.
So it is important to note that this is not achievable with np.repeat, since with each new sheet of data the runs can be any length, not necessarily the same as in the example above.
Expected output:
Run1.1 1841 (example value)
Run1.2 1703 (example value)
Run1.3 1390 (example value)
... and so on
Any help would be greatly appreciated.
What does your pandas df look like after you import the CSV?
I would say you can just groupby on the run column, like so:
import pandas as pd
df = pd.DataFrame({
"run": ["run1.1", "run1.2", "run1.1", "run1.2"],
"data": [1, 2, 3, 4],
})
df.groupby("run").agg({"data": ["sum"]}).head()
Out[4]:
data
sum
run
run1.1 4
run1.2 6
This will do the trick:
av = df.loc[df["Stage"].eq(2)]
av = av.groupby("Run").tail(4).groupby("Run")["Vout"].mean()
Now df.groupby("a").tail(n) will return dataframe with only last n elements for each value of a. Then the second groupby will just aggregate these and return average per group.
I have a dataframe with the columns "OfferID", "SiteID" and "CategoryID", which together represent an online ad on a website. I then want to add a new column called "NPS" for the net promoter score. The values should be assigned randomly between 1 and 10, but wherever the OfferID, the SiteID and the CategoryID are the same, they need to have the same NPS value. I thought of using a dictionary where the NPS is the key and the combinations of IDs are the values, but I haven't found a good way to do this.
Are there any recommendations?
Thanks in advance.
Alina
The easiest would be to first remove all duplicates; you can do this using:
uniques = df[['OfferID', 'SiteID', 'CategoryID']].drop_duplicates(keep="first")
Afterwards, you can do something like this (note that the random values are not necessarily unique):
import random
uniques['NPS'] = [random.randint(1, 10) for _ in uniques.index]
And then:
df = df.merge(uniques, on=['OfferID', 'SiteID', 'CategoryID'], how='left')
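A small end-to-end sketch of those three steps on made-up data (the ID values and the 1-10 range are assumptions based on the question):
import random
import pandas as pd

df = pd.DataFrame({
    "OfferID":    [1, 1, 2, 2],
    "SiteID":     [10, 10, 20, 20],
    "CategoryID": [5, 5, 6, 6],
})

uniques = df[['OfferID', 'SiteID', 'CategoryID']].drop_duplicates(keep="first")
uniques['NPS'] = [random.randint(1, 10) for _ in uniques.index]
df = df.merge(uniques, on=['OfferID', 'SiteID', 'CategoryID'], how='left')
print(df)  # rows with identical ID triples share the same NPS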
I have a data frame in which one column 'F' has values from 0 to 100 and a second column 'E' has values from 0 to 500. I want to create a matrix of how frequently values fall within ranges of both 'F' and 'E'. For example, I want to know the frequency of rows with 'F' in the range 20 to 30 and 'E' in the range 400 to 500.
What I expect to have is the following matrix:
matrix of ranges
I have tried to group the ranges using pd.cut() and groupby(), but I don't know how to combine the results.
I really appreciate your help in creating the matrix with pandas.
You can use the cut function to create the bin "tag/name" for each column.
After that, you can pivot the data frame.
df['rows'] = pd.cut(df['F'], 5)
df['cols'] = pd.cut(df['E'], 5)
df = df.groupby(['rows', 'cols']).agg('count').reset_index()  # 'count' gives frequencies; swap in the agg func you need
df = df.pivot(columns='cols', index='rows')
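Since the goal is a matrix of frequencies, a shorter alternative is pd.crosstab, which counts the bin combinations directly (the 5 bins per axis and the toy data below are arbitrary choices):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"F": rng.uniform(0, 100, 200), "E": rng.uniform(0, 500, 200)})

freq = pd.crosstab(pd.cut(df["F"], 5), pd.cut(df["E"], 5))
print(freq)  # rows: F bins, columns: E bins, values: counts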
So this is the way I found to create the matrix, which was obviously inspired by @usher's answer. I know it's more convoluted, but I wanted to share it. Thanks again @usher.
E = df.E
F = df.F
bins_E = pd.cut(E, bins=int((max(E) - min(E)) / 100))  # bins must be an integer
bins_F = pd.cut(F, bins=int((max(F) - min(F)) / 10))
bins_EF = bins_E.to_frame().join(bins_F)
freq_EF = bins_EF.groupby(['E', 'F']).size().reset_index(name="counts")
Mat_FE = freq_EF.pivot(columns='E', index='F')
TL;DR - I want to mimic the behaviour of functions such as DataFrameGroupBy.std()
I have a DataFrame which I group.
I want to take one row to represent each group, and then add extra statistics regarding these groups to the resulting DataFrame (such as the mean and std of these groups)
Here's an example of what I mean:
df = pandas.DataFrame({"Amount": [numpy.nan,0,numpy.nan,0,0,100,200,50,0,numpy.nan,numpy.nan,100,200,100,0],
"Id": [0,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
"Date": pandas.to_datetime(["2011-11-02","NA","2011-11-03","2011-11-04",
"2011-11-05","NA","2011-11-04","2011-11-04",
"2011-11-06","2011-11-06","2011-11-06","2011-11-06",
"2011-11-08","2011-11-08","2011-11-08"],errors='coerce')})
g = df.groupby("Id")
f = g.first()
f["std"] = g.Amount.std()
Now, this works, but let's say I want a special std that ignores 0 and considers each unique value only once:
def get_unique_std(group):
    vals = group.unique()
    vals = vals[vals > 0]
    return vals.std() if vals.shape[0] > 1 else 0
If I use
f["std"] = g.Amount.transform(get_unique_std)
I only get zeros... (Also for any other function such as max etc.)
But if I do it like this:
std = g.Amount.transform(get_unique_std)
I get the correct result, only it's no longer grouped... I guess I could calculate all of these as columns of the original DataFrame (in this case df) before taking the representing row of each group:
df["std"] = g.Amount.transform(get_unique_std)
# regroup again the modified df
g = df.groupby("Id")
f = g.first()
But that would just be a waste of memory space since many rows corresponding to the same group would get the same value, and I'd also have to group df twice - once for calculating these statistics, and a second time to get the representing row...
So, as stated in the beginning, I wonder how I can mimic the behaviour of DataFrameGroupBy.std().
I think you may be looking for DataFrameGroupBy.agg()
You can pass your custom function like this and get a grouped result:
g.Amount.agg(get_unique_std)
You can also compute several statistics at once with named aggregation, where each keyword becomes a column (the older dict-renaming syntax is no longer supported in recent pandas):
g.Amount.agg(my_std=get_unique_std, numpy_std=numpy.std)
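Continuing with the df, g and get_unique_std defined in the question, here is a sketch of how this plugs into the representing-row pattern: agg returns one value per Id, so its index lines up with g.first() (unlike transform, which is aligned on the original row index):
f = g.first()                            # one representing row per Id
f["std"] = g.Amount.agg(get_unique_std)  # indexed by Id, so it aligns with f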