I have the following dataframe, which contains 2 rows:
index  name     food   color  number  year  hobby  music
0      Lorenzo  pasta  blue   5       1995  art    jazz
1      Lorenzo  pasta  blue   3       1995  art    jazz
I want to write code that can tell me which column is the one that distinguishes between these two rows.
For example, in this dataframe, the column "number" is the one that distinguishes between the two rows.
Until now I have done this very simply, by going over column after column with iloc and looking at the values:
duplicates.iloc[:,3]
>>>
0    blue
1    blue
It's important to take into account that:
This should run in a for loop, since each time I check a newly generated dataframe.
There may be more than 2 rows that I need to check.
There may be more than 1 column that can distinguish between the rows.
I thought the way to check such a thing would be to take one column at a time, get the unique values and check whether they are all equal to each other, similar to this:
import numpy as np

for n in np.arange(0, len(df.columns)):
    tmp = df.iloc[:, n]
and then I thought to compare whether all the values in the temporary dataframe are equal to each other, but here I got stuck, because sometimes I have many rows to compare.
My end goal: to be able to check, inside a for loop, which column has different values in each row of the temporary dataframe, and hence can help to distinguish between the rows.
You can apply the duplicated method on all columns:
s = df.apply(pd.Series.duplicated).any()
s[~s].index
Output: Index(['number'], dtype='object')
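For completeness, a minimal runnable sketch of that idea (the dataframe below just mirrors the example from the question):

import pandas as pd

df = pd.DataFrame({
    "name": ["Lorenzo", "Lorenzo"],
    "food": ["pasta", "pasta"],
    "color": ["blue", "blue"],
    "number": [5, 3],
    "year": [1995, 1995],
    "hobby": ["art", "art"],
    "music": ["jazz", "jazz"],
})

# A column where some value repeats an earlier one carries no distinguishing
# information; keep only the columns in which every value is unique.
s = df.apply(pd.Series.duplicated).any()
print(s[~s].index)  # Index(['number'], dtype='object')

This also covers more than 2 rows: a column only qualifies if it contains no duplicated values at all.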
I would like to group some strings in the column called 'tipology' and insert them in a plotly bar chart. The problem is that from the new table created with groupby I can't extract the x and y to use in the graph:
tipol1 = df.groupby(['tipology']).nunique()
tipol1
the output gives me tipology as index, and the grouping based on how many times the values repeat:
          number  data
tipology
one            2   113
two           33    33
three         12    88
four          44   888
five          11    66
In the number column (in which I have other values) it gives me the correct grouping of the tipology column.
In the data column it also gives me values (I think it is grouping the dates, but not in the correct format).
I also found:
tipol=df.groupby(['tipology']).nunique()
tipol2 = tipol[['number']]
tipol2
to take only the number column,
but no luck: I would need the tipology column (not as index) and the column with the tipology group counts, so that I can get the x and y axes to pass to plotly!
One last try I made (making a big mess):
tipol=df.groupby(['tipology'],as_index=False).nunique()
tipol2 = tipol[['number']]
fig = go.Figure(data=[
    go.Bar(name='test', x=df['tipology'], y=tipol2)
])
fig.update_layout(barmode='stack')
fig.show()
Any suggestions?
Thanks!
UPDATE
I would have too much code to give a full example; it would be difficult for me and it would waste your time too. Basically I would need a groupby with the addition of a column that shows the group count, e.g.:
tipology   Date
home       10/01/18
home       11/01/18
garden     12/01/18
garden     12/01/18
garden     13/01/18
bathroom   13/01/18
bedroom    14/01/18
bedroom    15/01/18
kitchen    16/01/18
kitchen    16/01/18
kitchen    17/01/18
I wish this would happen:
by deleting the Date column and inserting a value column in the dataframe that holds the count:
tipology   value
home       2
garden     3
bathroom   1
bedroom    2
kitchen    3
Then (I'm working with Jupyter Notebook),
leaving the Date column and adding the corresponding values to the value column based on their grouping:
tipology   Date      value
home       10/01/18  1
home       11/01/18  1
garden     12/01/18  2
garden     12/01/18  2
garden     13/01/18  1
bathroom   13/01/18  1
bedroom    14/01/18  1
bedroom    15/01/18  1
kitchen    16/01/18  2
kitchen    16/01/18  2
kitchen    17/01/18  1
I need the columns available to assign to the x and y axes to pass to a graph, so none of the columns should be the index.
By default the groupby method returns a dataframe where the fields you group on end up in the index. You can adjust this behaviour by setting as_index=False in the groupby; then tipology will still be a regular column in the returned dataframe:
tipol1 = df.groupby('tipology', as_index=False).nunique()
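A minimal sketch of the full flow, assuming a plain row count (size) is the aggregation actually wanted for the tables in the update, rather than nunique:

import pandas as pd
import plotly.graph_objects as go

df = pd.DataFrame({
    "tipology": ["home", "home", "garden", "garden", "garden", "bathroom",
                 "bedroom", "bedroom", "kitchen", "kitchen", "kitchen"],
    "Date": ["10/01/18", "11/01/18", "12/01/18", "12/01/18", "13/01/18",
             "13/01/18", "14/01/18", "15/01/18", "16/01/18", "16/01/18",
             "17/01/18"],
})

# First desired table: one row per tipology with its count, no index column
counts = df.groupby("tipology", as_index=False).size().rename(columns={"size": "value"})

# Second desired table: a per-row count of each (tipology, Date) pair
df["value"] = df.groupby(["tipology", "Date"])["Date"].transform("size")

# Both axes are now plain columns that plotly can consume directly
fig = go.Figure(data=[go.Bar(name="test", x=counts["tipology"], y=counts["value"])])
fig.update_layout(barmode="stack")
fig.show()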
I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as a reference to where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and the resulting values have dtype float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby
moy = base.sort_values().tail(1)  # base is a Series, so no column label is needed
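If only the single row with the highest mean is needed straight from the original dataframe, idxmax is a direct alternative; a small self-contained sketch with stand-in data:

import numpy as np
import pandas as pd

# toy stand-in for the large scan dataframe
df = pd.DataFrame(np.random.rand(6, 4))

base = df.mean(axis=1)               # row-wise means, a Series
best_row = df.loc[[base.idxmax()]]   # the full row with the highest mean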
It looks as though your data is a string or a single column with a space between your two numbers. I suggest splitting the column into two and/or using something similar to the below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)
I am using Python pandas to groupby a column called "Trace". For each trace, there is a "Value" column with two peaks that I am trying to transfer to a different dataframe. The first problem is that when I use groupby, it doesn't keep the rest of the data from the row of the value I want to select. For example, if a pandas dataframe had 6 columns, then I want to preserve all six columns after I use groupby. The second problem is that the two maximums I want are not the two greatest values in the column, but rather "peaks" in the dataset. For example, the attached image shows the two peaks whose values I want. I want the greatest value from each of the two peaks to be exported to a new dataframe, together with the row values from the other columns of the previous dataframe.
In the following code, I want to groupby the "Trace" column and pick the two peaks in the "Value" column, while still preserving the "Sample" column after choosing the peaks. The peaks I want to choose would be 52 and 21 for Trace 1 and 61 and 23 for Trace 2.
d = {"Trace": [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2], "Sample": [1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12], "Value": [1,2,3,7,52,33,11,4,2,21,10,3,3,7,15,61,37,16,6,3,11,23,4]}
Any suggestions? I have been using .groupby("Trace") and .nlargest().
The choice of the "peak" confuses me; unless you hardcode the Trace values, I don't think you will get far.
On a more sensible note, for someone searching here, I will post the solution for getting groupby + nlargest (keeping all the fields while you are at it):
df.groupby(['Trace']).apply(lambda x: x.nlargest(2, columns=['Value']))
Output

          Trace  Sample  Value
Trace
1     4       1       5     52
      5       1       6     33
2     15      2       4     61
      16      2       5     37
Here, if you are looking for the two largest values in the Value column grouped by Trace, this should be an elegant solution.
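If true local peaks (52 and 21 for Trace 1, 61 and 23 for Trace 2) are wanted rather than the two global maxima, scipy's find_peaks can be combined with the same groupby; a sketch, assuming scipy is installed:

import pandas as pd
from scipy.signal import find_peaks

df = pd.DataFrame(d)  # d as defined in the question

def two_peaks(group):
    # indices of local maxima within this trace's Value series
    idx, _ = find_peaks(group["Value"].to_numpy())
    # keep the two highest peaks, preserving every column of the group
    return group.iloc[idx].nlargest(2, "Value")

peaks = df.groupby("Trace", group_keys=False).apply(two_peaks)

With the sample data this picks 52 and 21 for Trace 1 and 61 and 23 for Trace 2, with the Sample column intact.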
This question is related to posts on creating a new column by mapping/lookup using a dictionary (Adding a new pandas column with mapped value from a dictionary and pandas - add new column to dataframe from dictionary). However, what if I want to create multiple new columns with dictionary values?
For argument's sake, let's say I have the following df:
country
0 bolivia
1 canada
2 ghana
And in a different dataframe, I have country mappings:
country country_id category color
0 canada 11 north red
1 bolivia 12 central blue
2 ghana 13 south green
I've been using pd.merge to merge the mapping dataframe to my df, using country and another index as keys. It basically does the job and gives me my desired output:
country country_id category color
0 bolivia 12 central blue
1 canada 11 north red
2 ghana 13 south green
But, lately, I've been wanting to experiment with using dictionaries. I suppose a related question is how one decides between pd.merge and dictionaries to accomplish this task.
For one-off columns that I'll map, I'll create a new column by mapping to a dictionary:
entity_dict = dict(zip(entity['country'], entity['country_id']))
df['country_id'] = df['country'].map(entity_dict)
It seems impractical to define a function that takes in different dictionaries and creates each new column separately (e.g., dict(zip(key, value1)), dict(zip(key, value2))). I'm stuck on how to proceed in creating multiple columns at the same time. I started over, and tried turning the country mapping worksheet into a dictionary:
entity_dict = entity.set_index('country').T.to_dict('list')
and then from there, converting the dict values to columns:
entity_mapping = pd.DataFrame.from_dict(entity_dict, orient = 'index')
entity_mapping.columns = ['col1', 'col2', 'col3']
And I've been stuck going around in circles for the past few days. Any help/feedback would be appreciated!
OK, after tackling this for a while... I guess pd.merge makes the most sense.
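That said, if the dictionary route is still of interest, the nested dict produced by to_dict() can drive map for several columns at once; a sketch using the example frames from the question:

import pandas as pd

df = pd.DataFrame({"country": ["bolivia", "canada", "ghana"]})
entity = pd.DataFrame({
    "country":    ["canada", "bolivia", "ghana"],
    "country_id": [11, 12, 13],
    "category":   ["north", "central", "south"],
    "color":      ["red", "blue", "green"],
})

# to_dict() on the indexed frame yields one {country: value} dict per column
mappings = entity.set_index("country").to_dict()

for col, mapping in mappings.items():
    df[col] = df["country"].map(mapping)

# df now has country_id, category and color, matching the merge result above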