I want to assign the data in equal proportion between various people in python automatically. The names should appear in the first column 'Name' automatically.
Please be specific with what you mean by 'automatically'. If I don't get the result you wanted wrong then it should be like this:
df['Name'] = ["Person 1","Person 2","Person 3",...]
The length of 'Name' column rows should be the same with the other.
Related
I have this dataframe and I want to add a column to it with the total of distinct SalesOrderId for a given CustomerId
So, with I am trying to do there would be a new column with the value 3 for all this rows.
How can I do it?
I am trying this way but I get an error
data['TotalOrders'] = data.groupby([['CustomerID','SalesOrderID']]).size().reset_index(name='count')
Try using transform:
data['TotalOrders'] = df.groupby('CustomerID')['SalesOrderID'].transform('nunique')
This will give you one entry for each entry in the group. (thanks #Rodalm)
I have a Dataframe with 45 columns and 11k rows. This dataframe consist of players. Columns displaying their name, player_id, rating, height etc. Pretend you have the name of a player in the dataframe, or their ID, and you want to access the entire row of that player. You want to see all the information of that individual, but you only have one unique identifyer.
I tried using df.loc[[id_number]], but that only takes me to the index of the dataframe, which does not correspond to player_id.
Hopefully I explained it well enough. If you have any questions, please post them below.
df.loc[df['column_name'] == some_value]
Related to https://stackoverflow.com/a/17071908/17487637
You can try applying a mask:
df[df.playerId == id_number]
Assuming playerId is the name of the column containing the player ids.
As far as I have understood, you want to query a dataframe based on one unique identifier. Let's suppose you only have player name.
df[df.player_name==playername]
Here playername is the variable where you will store your desired player name.
I use this in my code to find what i need in excel file. Mayby will be helpful for you.
search = input('\t\t Find row: ')
xlsxfile = pd.read_excel('your_filename.xlsx', engine='openpyxl')
df = xlsxfile[xlsxfile['name_of_your_column' ].str.contains(search , na=False)]
dx = (df[['name', 'player_id','height']])
print(dx)
You will need some third party module for this like openpyxl.
I am studying pandas, bokeh etc. to get started with Data Vizualisation. Right now I am practising with a giant table containing different birds. There are plenty of columns; two of those columns are "SCIENTIFIC NAME" and another one is "OBSERVATION COUNT".
I want to extract those two columns.
I did
df2 = df[["SCIENTIFIC NAME" , "OBSERVATION COUNT"]]
but the problem then is, that every entry is inside the table (since sometimes there are multiple entries/rows due to other columns of the same SCIENTIFIC NAME, but the OBSERVATION COUNT is always the same for the scientific name)
How can I get those two sectors but with the unique values, so every scientific name once, with the corresonding observation count.
EDIT: I just realized that sometimes the same scientific names have different observation counts due to another column. Is there a way to extract every first unique item from a column
IIUC, You can use drop_duplicates:
df2 = df[["SCIENTIFIC NAME" , "OBSERVATION COUNT"]].drop_duplicates()
To get counts:
df2 = df.groupby(["SCIENTIFIC NAME" , "OBSERVATION COUNT"])["SCIENTIFIC NAME"].count()
I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order, by country name, and created an alphabetical list of the individual countries. new_data.tail() Although Zimbabwe is listed last, there are 80336 rows, hence the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is create a new DataFrame, with 76 cols (country names), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
pink = new_data.loc[(new_data['country'] == country)]
df[country] = pink.trust
Output here
As you can see, the data does not get included for the rest of the columns after the first. I believe this is due to the fact that the number of rows of 'trust' data for each country varies. While the first column has 1000 rows, there are some with as many as 2500 data points, and as little as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have utilizes this same exact data structure for the template data, so that it why I'm attempting to put it in a dataframe. Plus, I can't do it, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.combine_first(df)
which ensures that your index is always union of all added columns.
I think in this case pd.pivot(columns = 'var', values = 'val') , will work for you, especially when you already have dataframe. This function will transfer values from particular column into column names. You could see the documentation for additional info. I hope that helps.
I have excel data file with thousands of rows and columns.
I am using python and have started using pandas dataframes to analyze data.
What I want to do in column D is to calculate annual change for values in column C for each year for each ID.
I can use excel to do this – if the org ID is same are that in the prior row, calculate annual change (leaving the cells highlighted in blue because that’s the first period for that particular ID). I don’t know how to do this using python. Can anyone help?
Assuming the dataframe is already sorted
df.groupby(‘ID’).Cash.pct_change()
However, you can speed things up with the assumption things are sorted. Because it’s not necessary to group in order to calculate percentage change from one row to next
df.Cash.pct_change().mask(
df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you’ll need to assign to a column or create a new dataframe with the new column
df[‘AnnChange’] = df.groupby(‘ID’).Cash.pct_change()