I have a fairly simple question but could not find the answer somehow.
My Pandas dataframe looks like this:
        0      1      2       3      ....
fruit   apple  apple  banana  apple  ....
county  ....   ....   ....    ....   ....
Basically I want to count the different fruit types and plot them in a bar plot, with the X axis being the categories and the Y axis being the number of occurrences.
I tried df["fruit"].value_counts() with .plot, but apparently I always get a KeyError, as it doesn't seem to be a valid row key?
Thanks.
Dataframes follow a tabular format where the convention is to have features as columns and entries as rows. So you need to transpose your dataframe. After that df["fruit"] will give what you expect.
I believe that fruit is in your dataframe index. If so, use:
df.loc['fruit'].value_counts().plot.bar()
Fairly straightforward: transpose will turn your rows into columns, so now you can do
df.T["fruit"].value_counts()
I have the following dataframe, which contains 2 rows:
index  name     food   color  number  year  hobby  music
0      Lorenzo  pasta  blue   5       1995  art    jazz
1      Lorenzo  pasta  blue   3       1995  art    jazz
I want to write code that will be able to tell me which column is the one that can distinguish between these two rows.
For example, in this dataframe, the column "number" is the one that distinguishes between the two rows.
Until now I have done this very simply by just going over column after column using iloc and looking at the values.
duplicates.iloc[:,3]
>>>
0 blue
1 blue
It's important to take into account that:
This should be inside a for loop, since each time I check it on a newly generated dataframe.
There may be more than 2 rows which I need to check.
There may be more than 1 column that can distinguish between the rows.
I thought that the way to check such a thing would be something like: take one column at a time, get the unique values and check if they are equal to each other, similar to this:
for n in np.arange(0, len(df.columns)):
    tmp = df.iloc[:, n]
and then I thought to compare whether all the values in the temporary dataframe are equal to each other, but here I got stuck, because sometimes I have many rows.
My end goal: to be able to check, inside a for loop, which column has different values in each row of the temporary dataframe and hence can help to distinguish between the rows.
You can apply the duplicated method on all columns:
s = df.apply(pd.Series.duplicated).any()
s[~s].index
Output: ['number']
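For reference, a minimal self-contained sketch that rebuilds the two-row frame from the question by hand (so the frame itself is an illustration, not the asker's actual data) and applies the same idea:

import pandas as pd

df = pd.DataFrame({
    "name":   ["Lorenzo", "Lorenzo"],
    "food":   ["pasta", "pasta"],
    "color":  ["blue", "blue"],
    "number": [5, 3],
    "year":   [1995, 1995],
    "hobby":  ["art", "art"],
    "music":  ["jazz", "jazz"],
})

# a column whose values repeat cannot distinguish the rows,
# so keep only the columns where no value is a duplicate
s = df.apply(pd.Series.duplicated).any()
print(s[~s].index.tolist())  # ['number']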
Problem
Hey there! I'm having some trouble trying to split one column of my dataframe into two (or even more) new columns. I think this is down to the fact that the dataframe I'm working with comes from a really big CSV file, almost 10 GB worth of space. Once it is loaded into a pandas dataframe, it is represented by ~60 million rows and 5 columns.
Example
Initially, the dataframe looks something like this:
In [1]: df
Out[1]:
               category  other_col
0            animal.cat          5
1            animal.dog          3
2  clothes.shirt.sports          6
3           shoes.laces          1
4                  None          0
I want to first remove the rows of the df for which the category is not defined (i.e., the last one), and then split the category column into three new columns based on where the dot appears: one for the main category, one for the first subcategory and another one for the last subcategory (if that actually exists). Finally, I want to merge the whole dataframe back together.
In other words, this is what I want to obtain:
In [2]: df_after
Out[2]:
   other_col main_cat sub_category_1 sub_category_2
0          5   animal            cat           None
1          3   animal            dog           None
2          6  clothes          shirt         sports
3          1    shoes          laces           None
My approach
My approach for this was the following:
df = df[df['category'].notnull()]
df_wt_cat = df.drop(columns=['category'])
df_cat_subcat = df['category'].str.split('.', expand=True).rename(columns={0: 'main_cat', 1: 'sub_category_1', 2: 'sub_category_2', 3: 'sub_category_3'})
df_after = pd.concat([df_wt_cat, df_cat_subcat], axis=1)
which seems to work just fine with small datasets, but it eats up too much memory when applied to a dataframe that big, and the Jupyter kernel just dies.
I've tried to read the dataframe in chunks, but I'm not really sure how I should proceed after that; I've obviously tried searching for this kind of problem here on Stack Overflow, but I didn't manage to find anything useful.
Any help is appreciated!
The split and join methods do the job:
results = df['category'].str.split(".", expand=True)
df_after = df.join(results)
After doing that you can freely filter the resulting dataframe.
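As for the memory problem, one option the question already hints at is to process the CSV in chunks and keep only the already-split result. A rough sketch, where the file name and chunk size are placeholders rather than anything from the original post:

import pandas as pd

processed = []
# read the big CSV piece by piece instead of all at once
for chunk in pd.read_csv("big_file.csv", chunksize=1_000_000):
    chunk = chunk[chunk["category"].notnull()]            # drop undefined categories
    cats = chunk["category"].str.split(".", expand=True)
    names = ["main_cat", "sub_category_1", "sub_category_2", "sub_category_3"]
    cats.columns = names[: cats.shape[1]]
    processed.append(chunk.drop(columns=["category"]).join(cats))

df_after = pd.concat(processed, ignore_index=True)
# if even the processed result is too large, each piece could be written to
# disk (e.g. to_csv in append mode) instead of being concatenated in memory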
I have a pandas DataFrame
    ID Unique_Countries
0  123          [Japan]
1  124            [nan]
2  125      [US,Brazil]
...
I got the Unique_Countries column by aggregating over unique countries from each ID group. There were many IDs with only 'NaN' values in the original country column. They are now displayed as what you see in row 1. I would like to filter on these but can't seem to. When I type
df.Unique_Countries[1]
I get
array([nan], dtype=object)
I have tried several methods, including isnull() and isnan(), but it gets messed up because it is a numpy array.
If the NaN in your cell is not necessarily in the 1st position, try using explode and groupby.all:
df[df.Unique_Countries.explode().notna().groupby(level=0).all()]
OR
df[df.Unique_Countries.explode().notna().all(level=0)]
Let's try
df.Unique_Countries.str[0].isna() #'nan' is True
df.Unique_Countries.str[0].notna() #'nan' is False
To pick only the rows without NaN, just use the mask above:
df[df.Unique_Countries.str[0].notna()]
I believe that the answers based on the string method contains would fail if a country contains the substring "nan" in it.
In my opinion the solution should be this:
df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)
This code drops NaN from your dataframe and returns the dataset in its original form.
I am not sure from your question whether you want to drop the NaNs or to know the IDs of the records which have NaN in the Unique_Countries column; for the latter you can use something similar:
long_ss = df.set_index('ID').squeeze().explode()
long_ss[long_ss.isna()]
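As a quick illustration, here the example frame from the question is rebuilt by hand (so the exact values are assumptions) and the explode-based check is applied to it:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [123, 124, 125],
    "Unique_Countries": [["Japan"], [np.nan], ["US", "Brazil"]],
})

long_ss = df.set_index("ID")["Unique_Countries"].explode()
print(long_ss[long_ss.isna()])   # only ID 124 shows up, since its list holds only NaN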
So, what I am trying to do is complete the NaN values of a DataFrame with the correct values that are to be found in a second dataframe. It would be something like this:
df={"Name":["Lennon","Mercury","Jagger"],"Band":["The Beatles", "Queen", NaN]}
df2={"Name":["Jagger"],"Band":["The Rolling Stones"]}
So, I have this command to know which rows have at least one NaN:
inds = list(pd.isnull(dfinal).any(1).nonzero()[0].astype(int))
I thought this would be useful for something like a for loop (I didn't succeed there).
And then I tried this:
result=df.join(dfinal, on=["Name"])
But it gives me the following error
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
I checked, and both "Name" Series are string values, so I am unable to solve this.
Keep in mind there are more columns, and the likely result would be that if a row has one NaN, it will have something like 7 NaNs.
Is there a way to solve this?
Thanks in advance!
Map and Fillna()
We can fill the missing values in your target df with the corresponding values from the second dataframe, matching on the Name column.
df["Band"] = df["Band"].fillna(df["Name"].map(df2.set_index("Name")["Band"]))
print(df)
Name Band
0 Lennon The Beatles
1 Mercury Queen
2 Jagger The Rolling Stones
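Since the question mentions that a row with one NaN will usually have several, it may help to know that the same alignment idea extends to all columns at once; a hedged sketch, assuming Name is unique in both frames:

# fillna with a DataFrame aligns on both the index (Name) and the column labels,
# so every missing cell that df2 knows about gets filled in one go
df_filled = (
    df.set_index("Name")
      .fillna(df2.set_index("Name"))
      .reset_index()
)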
I have a series of names each related to an ID.
In pandas I then combined these names so each ID would just have a combination as opposed to many individual names.
Then I created a count to see how many times these combinations would appear.
For example I wanted people who ate apples and oranges.
**Combination**      **Count**
Apples, Oranges      2
Apples               1
Oranges              1
However, my specific data set is far too large and I have many elements with a count of 1. I am trying to combine these into an "Other" group to display in a bar chart using seaborn; however, all the names overlap due to the sheer volume of data. I want to merge roughly the last 500 rows of my data set into "Other" (as the combination name), with its count being the sum of all those counts.
In this example it would be like this:
**Combination**      **Count**
Apples, Oranges      2
Other                2
I have tried using groupby, but lacking experience in pandas I am unsure of how to write this syntactically. Any help would be appreciated.
Assuming you have done import numpy as np, you can use np.where() to generate a new column which uses 'Other' if the Count is 1, or the existing Combination otherwise. Then we can .groupby and sum to find totals on 'New Combination'. Assuming your frame is called df:
df['New Combination'] = np.where(df['Count'] == 1, 'Other', df['Combination'])
totals = df.groupby('New Combination').agg({'Count': 'sum'})
This gives you:
                 Count
New Combination
Apples, Oranges      2
Other                2
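If you really do want the cutoff to be "the last 500 rows" rather than "count equals 1", a hedged variant (assuming df is already sorted by Count in descending order) could look like this:

import pandas as pd

# keep everything except the last 500 rows and collapse those into 'Other'
keep = df.iloc[:-500]
other_total = df.iloc[-500:]['Count'].sum()
result = pd.concat(
    [keep, pd.DataFrame({'Combination': ['Other'], 'Count': [other_total]})],
    ignore_index=True,
)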