I have a DataFrame with 45 columns and 11k rows. The DataFrame consists of players, with columns showing their name, player_id, rating, height, etc. Suppose you have the name of a player in the DataFrame, or their ID, and you want to access that player's entire row. You want to see all the information for that individual, but you only have one unique identifier.
I tried using df.loc[[id_number]], but that only looks up the index of the DataFrame, which does not correspond to player_id.
Hopefully I explained it well enough. If you have any questions, please post them below.
df.loc[df['column_name'] == some_value]
Related to https://stackoverflow.com/a/17071908/17487637
You can try applying a mask:
df[df.playerId == id_number]
Assuming playerId is the name of the column containing the player ids.
As far as I have understood, you want to query a dataframe based on one unique identifier. Let's suppose you only have player name.
df[df.player_name==playername]
Here playername is the variable where you will store your desired player name.
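To make the answers above concrete, here is a minimal sketch on a made-up three-row frame (the `players` data and the id 202 are assumptions for illustration, not the asker's data). It shows the boolean-mask lookup, plus an alternative that makes player_id the index so that .loc can look up by id directly:

```python
import pandas as pd

# Hypothetical stand-in for the 45-column players frame.
players = pd.DataFrame({
    "player_id": [101, 202, 303],
    "name": ["Ann", "Bob", "Cy"],
    "rating": [88, 75, 91],
})

# Boolean mask: keep the rows whose player_id equals the value we hold.
row_by_mask = players.loc[players["player_id"] == 202]

# Alternative: make player_id the index, then .loc looks up by id directly.
by_id = players.set_index("player_id")
row_by_index = by_id.loc[202]  # a Series holding that player's row

print(row_by_mask)
print(row_by_index)
```

The mask version always returns a DataFrame (possibly with zero or several rows), while the indexed version returns a single row as a Series when the id is unique and raises a KeyError when it is missing.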
I use this in my code to find what I need in an Excel file. Maybe it will be helpful for you.
search = input('\t\t Find row: ')
xlsxfile = pd.read_excel('your_filename.xlsx', engine='openpyxl')
df = xlsxfile[xlsxfile['name_of_your_column' ].str.contains(search , na=False)]
dx = (df[['name', 'player_id','height']])
print(dx)
You will need a third-party module such as openpyxl for this.
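If you want to try the same substring search without an Excel file, here is a small sketch on an in-memory frame. The data and the search term are made up; case=False and regex=False are my additions so the search ignores case and treats the input as literal text:

```python
import pandas as pd

# Hypothetical data; the None simulates a missing name.
frame = pd.DataFrame({
    "name": ["Lionel Messi", None, "Luka Modric"],
    "player_id": [1, 2, 3],
    "height": [170, 180, 172],
})

search = "mess"  # stands in for the value typed at the input() prompt

# na=False makes rows with a missing name count as "no match"
# instead of producing NaN in the mask.
mask = frame["name"].str.contains(search, case=False, na=False, regex=False)
result = frame.loc[mask, ["name", "player_id", "height"]]
print(result)
```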
I want to assign the data in equal proportion between various people in python automatically. The names should appear in the first column 'Name' automatically.
Please be specific about what you mean by 'automatically'. If I have understood the result you want correctly, it should look like this:
df['Name'] = ["Person 1","Person 2","Person 3",...]
The 'Name' column must have the same number of rows as the other columns.
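If 'equal proportion' means each person should get roughly the same number of rows, one possible reading is a round-robin assignment. This is only a sketch under that assumption; the `tasks` frame and the `people` list are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical data: seven rows to share between three people.
tasks = pd.DataFrame({"task": [f"T{i}" for i in range(7)]})
people = ["Person 1", "Person 2", "Person 3"]

# np.resize repeats the list cyclically until it has len(tasks) entries,
# so the column length always matches the frame.
tasks["Name"] = np.resize(people, len(tasks))

counts = tasks["Name"].value_counts()
print(tasks)
print(counts)
```

With this approach the row counts per person differ by at most one.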
Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match and instead had to be done by their names. An example match of the name column from two databases to merge as one is the following
long_name    name
L. Messi     Lionel Andrés Messi Cuccittini
As part of the validation process for an 18,000-row database, I want to check the two date-of-birth columns in the merged DataFrame, df, ensuring that the columns match as in the example below
dob           birth_date
1987-06-24    1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged. However, if the two date columns don't match, I want the data in this value column to be changed to null. This is something I can do quite easily in Excel with a date-difference calculation, but I'm unsure how to do it in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx 10% of total database) and allow me to quickly fix those.
Ideally I also need to work on the matching algorithm to ensure a perfect date match. However, my current algorithm works quite well in its current state and the project is nearly complete. I'd be happy to hear any advice on this, if it is something you know about.
Many thanks in advance!
IIUC:
Please try np.where.
It works as follows:
np.where(condition, x, y)
where condition = df['birth_date'] != df['dob'],
x = np.nan and
y = the existing df['value']
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
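Here is a small runnable sketch of the np.where approach on made-up data (the three rows and the name `merged` are assumptions for illustration). The condition passed to np.where is the boolean comparison of the two date columns:

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame; the second row has mismatched dates.
merged = pd.DataFrame({
    "dob": pd.to_datetime(["1987-06-24", "1990-01-01", "1985-02-05"]),
    "birth_date": pd.to_datetime(["1987-06-24", "1991-01-01", "1985-02-05"]),
    "value": [100.0, 50.0, 80.0],
})

# True where the two dates disagree.
mismatch = merged["birth_date"] != merged["dob"]

# Where mismatch is True write NaN, otherwise keep the existing value.
merged["value"] = np.where(mismatch, np.nan, merged["value"])
print(merged)
```

Note that a missing date (NaT) compares as not equal to everything, including another NaT, so rows with a missing date in either column are also set to NaN. The same result can be had with merged['value'].mask(mismatch).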
I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I sorted the dataset alphabetically by country name and created an alphabetical list of the individual countries. new_data.tail() shows Zimbabwe listed last; there are 80336 rows, hence the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (one per country name), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[(new_data['country'] == country)]
    df[country] = pink.trust
Output here
As you can see, the data does not get included for the rest of the columns after the first. I believe this is because the number of rows of 'trust' data varies by country. While the first column has 1000 rows, some have as many as 2500 data points and others as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have uses this exact data structure for its template data, so that is why I'm attempting to put it in a DataFrame. Plus, I can't do it, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the DataFrame index gets extended. Instead of
df[country] = pink.trust
you should use
df = df.combine_first(pink.trust.rename(country).to_frame())
which ensures that the index is always the union of the indexes of all added columns.
I think df.pivot(columns='var', values='val') will work for you in this case, especially since you already have a DataFrame ('var' and 'val' stand for your own column names). This method turns the values of one column into column names. See the documentation for more information. I hope that helps.
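As a different route from the combine_first and pivot suggestions above, here is a sketch using pd.concat. The columns after the first come out empty in the question because column assignment aligns on the index: after sorting and reset_index, each country occupies its own range of row labels, so later countries share no labels with the first one. Resetting the index per country makes every column start at row 0. The six-row `new_data` below is made up for illustration:

```python
import pandas as pd

# Hypothetical data: three countries with 3, 2 and 1 rows of 'trust'.
new_data = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C"],
    "trust": [1, 2, 3, 4, 5, 6],
})

# One Series per country, each renumbered from 0 so the columns line up.
columns = {
    country: group["trust"].reset_index(drop=True)
    for country, group in new_data.groupby("country")
}

# Side-by-side concat; shorter columns are padded with NaN at the bottom.
wide = pd.concat(columns, axis=1)
print(wide)
```

The resulting frame has as many rows as the longest country, with NaN padding the shorter columns.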
I have a table, shown below:
Customers buy items on different dates. Each customer has a different number. Each item has a different ID.
For each ID, I want a separate column telling me whether it is the first, second, third, etc. item for the given customer.
I was trying:
df['item_order'] = np.where(df['Customer']==df['Customer'].shift(),
df.item_order.shift()+1, 0)
But this gives only 0 for the first item and 1 for the second, third, etc.
You can try something like the below code using pandas
df[['ID','Customer','Date']].groupby(['ID','Customer']).agg('count')
Let me know if this is the output that you are expecting
Thanks to everybody for the help; the solution is the rank method.
You can find the solution to my issue below:
df['rank'] = df.sort_values('Customer').groupby('Customer').Date.rank(method='first')
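For anyone who wants to see the rank approach run, here is a minimal sketch on made-up orders (the `orders` data is an assumption). It also shows cumcount as an alternative that should give the same numbering once the rows are sorted by date within each customer:

```python
import pandas as pd

# Hypothetical orders: customer 1 bought three items, customer 2 bought two.
orders = pd.DataFrame({
    "ID": [10, 11, 12, 13, 14],
    "Customer": [1, 2, 1, 1, 2],
    "Date": pd.to_datetime(
        ["2021-01-05", "2021-01-02", "2021-01-01", "2021-01-09", "2021-01-03"]
    ),
})

# Rank the dates within each customer; method='first' breaks ties by row order.
orders["item_order"] = (
    orders.groupby("Customer")["Date"].rank(method="first").astype(int)
)

# Alternative: sort by date, then number the rows within each customer.
# The result is aligned back to the original rows by index.
ordered = orders.sort_values(["Customer", "Date"])
orders["item_order_alt"] = ordered.groupby("Customer").cumcount() + 1

print(orders)
```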
I have a pandas DataFrame with columns patient_id, patient_sex, patient_dob (and other less relevant columns). Rows can have duplicate patient_ids, as each patient may have more than one entry in the data for multiple medical procedures. I discovered, however, that a great many of the patient_ids are overloaded, i.e. more than one patient has been assigned to the same id (evidenced by many instances of a single patient_id being associated with multiple sexes and multiple days of birth).
To refactor the ids so that each patient has a unique one, my plan was to group the data not only by patient_id, but by patient_sex and patient_dob as well. I figure this must be sufficient to separate the data into individual patients (and if two patients with the same sex and dob just happened to be assigned the same id, then so be it).
Here is the code I currently use:
# I just use first() here as a way to aggregate the groups into a DataFrame.
# Bonus points if you have a better solution!
indv_patients = patients.groupby(['patient_id', 'patient_sex', 'patient_dob']).first()
# Create unique ids
new_patient_id = 'new_patient_id'
for index, row in indv_patients.iterrows():
    # index is a tuple of the three column values, so this should get me a unique
    # patient id for each patient
    indv_patients.loc[index, new_patient_id] = str(hash(index))
# Merge new ids into original patients frame
patients_with_new_ids = patients.merge(indv_patients, left_on=['patient_id', 'patient_sex', 'patient_dob'], right_index=True)
# Remove byproduct columns, and original id column
drop_columns = [col for col in patients_with_new_ids.columns if col not in patients.columns or col == new_patient_id]
drop_columns.append('patient_id')
patients_with_new_ids = patients_with_new_ids.drop(columns=drop_columns)
patients = patients_with_new_ids.rename(columns={new_patient_id : 'patient_id'})
The problem is that with over 7 million patients, this solution is far too slow, the biggest bottleneck being the for-loop. So my question is: is there a better way to fix these overloaded ids? (The actual id doesn't matter, so long as it's unique for each patient.)
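I don't know what the values in the columns are, but have you tried something like this?
patients['new_patient_id'] = patients['patient_id'].astype(str) + '_' + patients['patient_sex'].astype(str) + '_' + patients['patient_dob'].astype(str)
This creates a new column from the three values without a row-by-row apply (the astype(str) calls keep the concatenation from failing when the columns have different types), and you can then use groupby with new_patient_id.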
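Since the actual id value doesn't matter, another option is to let groupby number the groups itself with ngroup, which avoids both the loop and the merge. This is a sketch on made-up data (the four-row `patients` frame is an assumption), not a tested solution for the 7-million-row case:

```python
import pandas as pd

# Hypothetical data: patient_id 1 is overloaded (two different people).
patients = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "patient_sex": ["F", "M", "F", "M"],
    "patient_dob": ["1980-01-01", "1975-05-05", "1980-01-01", "1990-09-09"],
    "procedure": ["a", "b", "c", "d"],
})

keys = ["patient_id", "patient_sex", "patient_dob"]

# ngroup assigns one integer per distinct (id, sex, dob) combination,
# written back to every row of the original frame.
patients["new_patient_id"] = patients.groupby(keys, sort=False).ngroup()
print(patients)
```

Rows with a missing sex or date of birth are not grouped by default, so they would need separate handling (for example, passing dropna=False to groupby in pandas 1.1 or later).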