Finding a primary key - Python

I have a big dataset with 4.5 million rows and 150 columns. I want to create a table in my database, and I want to create an index for it.
There isn't a column with IDs, and I would like to know if there is an easy way to find a column, or a combination of columns, that could be unique, so I can base my index on it.
I am using Python and pandas.
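One way to search for a candidate key is to compare each column's number of distinct values against the row count; a minimal sketch, assuming the data is already loaded into a DataFrame named df (the 20-column cap for the pair search is just an illustrative cutoff):
import pandas as pd
from itertools import combinations

n = len(df)  # 4.5 million rows

# single columns whose values are all distinct are candidate keys
singles = [c for c in df.columns if df[c].nunique(dropna=False) == n]

# if none qualify, try pairs, starting from the most selective columns
# (checking every pair of 150 columns over 4.5M rows would be slow)
if not singles:
    ranked = sorted(df.columns, key=lambda c: df[c].nunique(dropna=False), reverse=True)
    for a, b in combinations(ranked[:20], 2):
        if not df.duplicated(subset=[a, b]).any():
            print('candidate key:', a, b)
            break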

Related

Pandas dataframe - How to count the number of distinct rows for a given ID

I have this dataframe and I want to add a column to it with the total number of distinct SalesOrderID values for a given CustomerId.
So, with what I am trying to do, there would be a new column with the value 3 for all these rows.
How can I do it?
I am trying this way but I get an error:
data['TotalOrders'] = data.groupby([['CustomerID','SalesOrderID']]).size().reset_index(name='count')
Try using transform:
data['TotalOrders'] = data.groupby('CustomerID')['SalesOrderID'].transform('nunique')
Unlike an ordinary groupby aggregation, transform returns one value for each row in the group, so it can be assigned directly to the new column. (Thanks @Rodalm.)
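For illustration, a tiny runnable example with toy data (not the question's actual data):
import pandas as pd

data = pd.DataFrame({'CustomerID': [1, 1, 1, 2],
                     'SalesOrderID': [10, 11, 12, 20]})
data['TotalOrders'] = data.groupby('CustomerID')['SalesOrderID'].transform('nunique')
# every row of CustomerID 1 gets 3; the row for CustomerID 2 gets 1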

How could I create a column with matching values from different datasets with different lengths

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are 5 Regions in total, and every Region consists of some number of ZipCodes. I would like to use the two different datasets to create the new column.
I tried some code already, but I failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets: one of them has 1518 rows x 3 columns and the other one has 46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset, with the Postcode and Regio columns, i.e. the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset, where the Regio column is missing, as you can see. I would like to add a new column to the df2 dataset which contains the corresponding Regio.
I hope someone can help me out.
Kind regards.
I believe you need to map the ZipCode from dataframe 2 to the Regio column from the first dataframe, assuming Postcode and ZipCode are the same thing.
First create a dictionary from df1, then translate the ZipCode values through that dictionary and assign the result to the new column:
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2.ZipCode.replace(zip_dict)
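A quick end-to-end illustration with made-up postcodes (note that replace leaves unmatched ZipCodes unchanged; use .map instead if you would rather get NaN for them):
import pandas as pd

df1 = pd.DataFrame({'Postcode': [1011, 1012, 2011],
                    'Regio': ['North', 'North', 'South']})
df2 = pd.DataFrame({'ZipCode': [1012, 2011, 1011]})

zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2.ZipCode.replace(zip_dict)
# df2 now has a Regio column: North, South, North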

Extract unique value with multiple columns from DataFrame

I have a dataframe from which I want to extract values from two columns, where the criterion is the unique values of one of the columns. In the image below, I want to extract the unique values of 'education' along with the corresponding values from 'education-num'. I can easily extract the unique values with df['education'].unique(), but I am stuck on extracting the corresponding 'education-num'.
[image of the dataframe]
(Originally the task was to compute the population of people with an education of Bachelors, Masters, or Doctorate, and I assume this would be easier by comparing 'education-num' rather than using logical operators on strings. But if there is any way to do it directly from 'education', that would also be helpful.
Edit: it turns out DataFrame.isin helps to select rows by a list of strings, as in the solution given here.)
P.S. Stack Overflow didn't allow me to post the image directly and posted a link to it instead... 😒
Select the subset of columns and call DataFrame.drop_duplicates:
df1 = df[['education', 'education-num']].drop_duplicates()
If you need the population counts, use:
df2 = df.groupby(['education', 'education-num']).size().reset_index(name='count')
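And for the original task mentioned in the question, DataFrame.isin avoids chained logical operators on strings; a sketch assuming the column holds exactly these labels:
mask = df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
population = mask.sum()  # number of rows with one of the three degrees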

Take values from one column and create new column from it

I have a big database that has one column called "Measurement" and one column called "Data", which contains the data for those different measurements. For example, in Measurement you can find height, weight, and various index values, and in Data you will find the value for that measurement.
I would like to organize this database so that each unique measurement type gets its own column; for example, I'll have columns named weight, height, etc., holding the values they got from the "Data" column.
Until now I have used this approach, which creates many little databases with my relevant data:
df_NDVI = df[(df['Measurement'] == 'NDVI') & (df['Data'] != 'Corrupt')]
df_VPP_kg = df[(df['Measurement'] == 'WEIGHT')]
But as you can see, it is not efficient, and it creates many databases instead of one with those columns.
My end goal: to take each unique value from the "Measurement" column and create a new column for it with the correct data from the "Data" column.
Try this:
df["obs"] = df.groupby("Measurement")["Measurement"].cumcount()
df.pivot(index="obs", columns="Measurement", values="Data")
You will get one column for each unique value of Measurement, and the Data values will be ordered down each column in order of observation.
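A small worked example with toy data:
import pandas as pd

df = pd.DataFrame({'Measurement': ['WEIGHT', 'HEIGHT', 'WEIGHT', 'HEIGHT'],
                   'Data': [70, 180, 72, 175]})
df['obs'] = df.groupby('Measurement')['Measurement'].cumcount()
wide = df.pivot(index='obs', columns='Measurement', values='Data')
# wide has one HEIGHT column (180, 175) and one WEIGHT column (70, 72)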

How to calculate based on multiple conditions using Python data frames?

I have an Excel data file with thousands of rows and columns.
I am using Python and have started using pandas dataframes to analyze the data.
What I want to do in column D is calculate the annual change for the values in column C, for each year and for each ID.
I can do this in Excel: if the org ID is the same as that in the prior row, calculate the annual change (leaving the cells highlighted in blue empty, because that is the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted:
df.groupby('ID').Cash.pct_change()
However, you can speed things up if you rely on the assumption that things are sorted, because it isn't necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
These produce the column values you are looking for. To add the column, assign the result to the dataframe or create a new dataframe with the new column:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
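For example, with toy data (column names taken from the answer):
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B'],
                   'Cash': [100.0, 110.0, 121.0, 50.0, 55.0]})
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
# the first row of each ID is NaN (the "blue" cells); the rest show 0.10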
