As the title suggests, I am attempting to assign quantile ranks to row values in my dataframe. I want to group by date and assign each column's rank based on the values that exist on that date (note the code below uses q=10, so these are deciles rather than quartiles). I am not sure whether my current method works correctly; I am using a loop to do this for each column in my dataframe.
import os
import pandas as pd

# This function reads my dataframe and removes the Year, Month, and Day columns,
# because the 'Concat' column (ddmmyyyy) is used as my key for grouping.
def readCSV():
    directory = 'Data'
    file = 'Copy of Student_Data_9120.csv'
    data = pd.read_csv(os.path.join(directory, file))
    data.drop(columns=['Year', 'Month', 'Day'], inplace=True)
    return data
def getDecile(data):
    # Find the columns in my dataframe.
    test_list = data.columns.values.tolist()
    # 'Concat' is my primary key (dd/mm/yyyy), which I am using to group by;
    # the columns in remove_list don't need deciles.
    remove_list = ['acc_date', 'permno', 'Portfolio_Formation_date', 'SUE', 'FYearEnd', 'Concat']
    keyCols = filter(lambda i: i not in remove_list, test_list)
    # For each remaining column, group by date, and on each date assign each
    # row's decile to the new column '<Column>_Decile'.
    for column in keyCols:
        name = column + '_Decile'
        data[name] = data.groupby(['Concat'])[column].transform(
            lambda x: pd.qcut(x.rank(method='first'), q=10, labels=range(1, 11)))
    return data
def printToCSV(quartileData):
    file = 'Data_With_Quartiles.csv'
    quartileData.to_csv(file)
data = readCSV()
quartileData = getDecile(data)
printToCSV(quartileData)
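As a quick sanity check of the groupby/transform/qcut pattern above, here is a minimal sketch on made-up data (the 'score' column and the dates are hypothetical):

import pandas as pd

toy = pd.DataFrame({
    'Concat': ['01012020'] * 10 + ['02012020'] * 10,
    'score': list(range(10)) * 2,  # hypothetical value column
})
toy['score_Decile'] = toy.groupby('Concat')['score'].transform(
    lambda x: pd.qcut(x.rank(method='first'), q=10, labels=range(1, 11)))
print(toy.groupby('Concat')['score_Decile'].value_counts())
# With 10 rows per date, each date gets exactly one row in each decile 1..10.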
In Python pandas I need to create a new column (number of encounters in the last year) in the dataframe below, with the following logic: for each row, count the encounteruniqueid values that the row's customeruniqueid had in the prior 365 days, and write that count to a new column called number_of_unique_encouters_in_last_year. Can you help?
Column 1: encounteruniqueid
Column 2: datetimeofencounter
Column 3: customeruniqueid
Column 4: Numberofencounters_in_last_yr
sample dataframe
Something like this might work:
def occurrences(df, cust_id, date):
    # Filter the dataframe to this customer's rows within the prior 365 days.
    x = df[(df["customeruniqueid"] == cust_id) &
           (df["datetimeofencounter"].between(date - pd.offsets.Day(365), date))]
    # Count the encounters (they should all be unique, so no need for nunique).
    return x["encounteruniqueid"].count()

# Apply the function to each row of your dataframe.
your_df.apply(lambda x: occurrences(your_df, x["customeruniqueid"], x["datetimeofencounter"]), axis=1)
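A quick check on a tiny made-up frame (ids and dates invented for illustration):

import pandas as pd

your_df = pd.DataFrame({
    "encounteruniqueid": [1, 2, 3, 4],
    "customeruniqueid": ["a", "a", "a", "b"],
    "datetimeofencounter": pd.to_datetime(
        ["2020-01-01", "2020-06-01", "2021-07-01", "2020-03-01"]),
})
print(occurrences(your_df, "a", pd.Timestamp("2020-06-01")))
# -> 2: the 2020-01-01 and 2020-06-01 encounters fall within the prior 365 days.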
I have a data frame like the following:
I need to fill in values where the row label matches the column's title, which adds one extra column per label. Could you please suggest a solution?
Here is a solution:
import pandas as pd

# Build a sample dataframe.
df = pd.DataFrame({"index": ["B", "D", "C", "A"]}).groupby(["index"]).count()
df["value"] = None

# Function to fill the matching column: 1 where the row label matches the
# column title, otherwise an empty string.
def match(index, column):
    if index == column:
        return 1
    else:
        return ""

# Create one matching column per index label and fill it with the right value.
for index in df.index.array:
    df[index] = df.apply(lambda row: match(row.name, index), axis=1)

# Print the resulting dataframe.
print(df)
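Run as-is, the printed frame should look roughly like this (1 on the diagonal, empty strings elsewhere):

      value  A  B  C  D
index
A      None  1
B      None     1
C      None        1
D      None           1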
AttributeError: 'NoneType' object has no attribute 'transpose'
I have been trying to extract cells as dictionaries from a pandas dataframe and join the result onto the existing data. For example, I have a CSV file with two columns, id and device_type; each cell in the device_type column contains dictionary data. I am trying to split it out and add it to the original data, doing something like the below.
import json
import pandas

df = pandas.read_csv('D:\\1. Work\\csv.csv', header=0)
sf = df.head(12)
sf['visitor_home_cbgs'].fillna("{}", inplace=True).transpose()  # raises the AttributeError
-- csv file sample
ID,device_type
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5}
f5d639a64a88496099,{}
-- desired output
id,device_type,ttype,tvalue
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5},"060379800281",11
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5},"061110053031",5
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5},"060372062002",5
f5d639a64a88496099,{},NIL,NIL
Avoid inplace=True:

sf['visitor_home_cbgs'].fillna("{}").transpose()

When you pass inplace=True, the method modifies the dataframe in place and returns None, which is why chaining .transpose() onto it raises the AttributeError. If you want to use inplace=True, split it into two statements:

sf['visitor_home_cbgs'].fillna("{}", inplace=True)
sf.transpose()
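A minimal illustration of that behaviour:

import pandas as pd

s = pd.Series([1.0, None])
print(s.fillna(0))                # returns a new Series
print(s.fillna(0, inplace=True))  # prints None: in-place methods return None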
To create rows from column values, one solution is to iterate through the dataframe rows and build a new dataframe with the desired columns and values.
import json
import pandas as pd

def extract_JSON(row):
    df2 = pd.DataFrame(columns=['ID', 'device_type', 'ttype', 'tvalue'])
    # Parse the dictionary stored in the device_type cell.
    parsed = json.loads(row['device_type'])
    for key in parsed:
        df2.loc[len(df2)] = [row['ID'], row['device_type'], key, parsed[key]]
    # Keep a placeholder row when the dictionary is empty.
    if df2.empty:
        df2.loc[0] = [row['ID'], row['device_type'], '', '']
    return df2

df3 = pd.DataFrame(columns=['ID', 'device_type', 'ttype', 'tvalue'])
for _, row in df.iterrows():
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    df3 = pd.concat([df3, extract_JSON(row)], ignore_index=True)
df3
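Growing a dataframe row by row is slow on larger inputs. If performance matters, a sketch of an alternative under the same column assumptions is to collect plain dicts first and build the frame once:

import json
import pandas as pd

rows = []
for _, row in df.iterrows():
    parsed = json.loads(row['device_type'])
    items = parsed.items() if parsed else [('', '')]  # placeholder for empty dicts
    for key, value in items:
        rows.append({'ID': row['ID'], 'device_type': row['device_type'],
                     'ttype': key, 'tvalue': value})
df3 = pd.DataFrame(rows)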
I am trying to remove the column and the row that show the index of the values, and to replace the first value of my column with 'kw'. I have tried .drop without success.
def main():
    df_old_m, df_new_m = open_file_with_data_m(file_old_m, file_new_m)
    # Combine the two dataframes.
    df = df_old_m.append(df_new_m)
    son = df["son"]
    gson = df["gson"]
    # Add son.
    df_old_m = df["seeds"].append(son)
    # Add gson.
    df_old_m = df_old_m.append(gson)
    # Delete repeated values.
    df_old_m = df_old_m.drop_duplicates()
    # Add 'kw' as the header of the list.
    df_old_m.loc[-1] = 'kw'  # adding a row
    df_old_m.index = df_old_m.index + 1  # shifting index
    df_old_m.sort_index(inplace=True)
This gives me .xlsx output
If kw is the column you want to be your new index, you can do this:
df.set_index('kw', inplace=True)
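If the goal is simply to keep the index out of the exported file, note that to_excel and to_csv also accept index=False (the filename below is hypothetical):

df.to_excel('output.xlsx', index=False)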
I'm initializing a DataFrame:
columns = ['Thing','Time']
df_new = pd.DataFrame(columns=columns)
and then writing values to it like this:
counter = 0  # row index into df_new
for t in df.Thing.unique():
    df_temp = df[df['Thing'] == t]  # filtering the df
    df_new.loc[counter, 'Thing'] = t  # writing the filter value to df_new
    df_new.loc[counter, 'Time'] = df_temp['delta'].sum(axis=0)  # summing and adding that value to df_new
    counter += 1  # increment the row index
Is there a better way to add new values to the dataframe each time without explicitly incrementing the row index with 'counter'?
If I'm interpreting this correctly, I think this can be done in one line:
newDf = df.groupby('Thing')['delta'].sum().reset_index()
By grouping by 'Thing', you have the various "t-filters" from your for-loop. We then apply a sum() to 'delta', but only within the various "t-filtered" groups. At this point, the dataframe has the various values of "t" as the indices, and the sums of the "t-filtered deltas" as a corresponding column. To get to your desired output, we then bump the "t's" into their own column via reset_index().
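A tiny made-up example of the one-liner:

import pandas as pd

df = pd.DataFrame({'Thing': ['a', 'b', 'a'], 'delta': [1, 2, 3]})
newDf = df.groupby('Thing')['delta'].sum().reset_index()
print(newDf)
#   Thing  delta
# 0     a      4
# 1     b      2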