First time posting here - have decided to try and learn how to use python whilst on Covid-19 forced holidays.
I'm trying to summarise some data from a pretty simple database and have been using the value_counts function.
Rather than running it on every column individually, I'd like to loop it over each one and return a summary table. I can do this using df.apply(pd.value_counts), but I can't work out how to pass parameters to value_counts, as I want dropna=False.
Basic example of data I have:
# Import libraries
import pandas as pd
import numpy as np
# create list of winners and runnerup
data = [['john', 'barry'], ['john','barry'], [np.nan,'barry'], ['barry','john'],['john',np.nan],['linda','frank']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['winner', 'runnerup'])
# print dataframe.
df
How I was doing the value counts for each column:
#Who won the most?
df['winner'].value_counts(dropna=False)
Output:
john 3
linda 1
barry 1
NaN 1
Name: winner, dtype: int64
How can I pass dropna=False when using the apply function? I like the table it outputs below, but I want NaN to appear in the list.
#value counts table
df.apply(pd.value_counts)
winner runnerup
barry 1.0 3.0
frank NaN 1.0
john 3.0 1.0
linda 1.0 NaN
#value that is missing from list
#NaN 1.0 1.0
Any help would be appreciated!!
You can use df.apply, like this:
df.apply(pd.value_counts, dropna=False)
In pandas apply, if the function takes a single argument, you simply do:
.apply(func_name)
The argument passed to the function is the column (for DataFrame.apply) or the individual value (for Series.apply).
This works the same way for pandas built-in functions and user-defined functions (UDFs).
For a UDF that takes additional positional arguments:
.apply(func_name, args=(arg1, arg2, arg3, ...))
Extra keyword arguments, such as dropna=False above, are forwarded to the function directly.
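Putting it together, here is a minimal sketch using the sample data from the question, showing the keyword argument reaching value_counts for every column:
import pandas as pd
import numpy as np

data = [['john', 'barry'], ['john', 'barry'], [np.nan, 'barry'],
        ['barry', 'john'], ['john', np.nan], ['linda', 'frank']]
df = pd.DataFrame(data, columns=['winner', 'runnerup'])

# dropna=False is forwarded by apply to value_counts, so NaN gets its own row
summary = df.apply(pd.value_counts, dropna=False)
print(summary)
Newer pandas versions may warn that the top-level pd.value_counts is deprecated; df.apply(lambda s: s.value_counts(dropna=False)) is an equivalent alternative.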
Related
I have a pandas DataFrame
ID Unique_Countries
0 123 [Japan]
1 124 [nan]
2 125 [US,Brazil]
.
.
.
I got the Unique_Countries column by aggregating over unique countries from each ID group. There were many IDs with only 'NaN' values in the original country column. They are now displayed as what you see in row 1. I would like to filter on these but can't seem to. When I type
df.Unique_Countries[1]
I get
array([nan], dtype=object)
I have tried several methods, including isnull() and isnan(), but it gets messed up because it is a numpy array.
If the NaN is not necessarily in the first position of the cell, try explode together with groupby(level=0).all():
df[df.Unique_Countries.explode().notna().groupby(level=0).all()]
OR
df[df.Unique_Countries.explode().notna().all(level=0)]
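As a rough, self-contained sketch on toy data shaped like the question (the ID values and country lists are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [123, 124, 125],
                   'Unique_Countries': [['Japan'], [np.nan], ['US', 'Brazil']]})

exploded = df['Unique_Countries'].explode()      # one row per country, original index repeats
mask = exploded.notna().groupby(level=0).all()   # True only where every entry is non-NaN
print(df[mask])                                  # keeps IDs 123 and 125
Note that the .all(level=0) form in the second variant is deprecated in recent pandas releases, so the groupby(level=0).all() variant is the safer choice.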
Let's try
df.Unique_Countries.str[0].isna() #'nan' is True
df.Unique_Countries.str[0].notna() #'nan' is False
To pick only the rows whose first entry is not NaN, just use the mask above:
df[df.Unique_Countries.str[0].notna()]
I believe that answers based on the string method contains would fail if a country contains the substring 'nan' in its name.
In my opinion the solution should be this:
df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)
This code drops nan from your dataframe and returns the dataset in the original form.
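A quick sketch of what that produces on the same made-up toy data as above; note that an ID whose only value was NaN (124 here) drops out of the result entirely:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [123, 124, 125],
                   'Unique_Countries': [['Japan'], [np.nan], ['US', 'Brazil']]})

clean = df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)
print(clean)   # ID 123 -> [Japan], ID 125 -> [US, Brazil]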
I am not sure from your question whether you want to drop the NaNs or to know the IDs of the records that have NaN in the Unique_Countries column. For the latter, you can use something like this:
long_ss = df.set_index('ID').squeeze().explode()
long_ss[long_ss.isna()]
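On the same made-up toy data, the exploded Series is indexed by ID, so the NaN filter reports which IDs had only missing countries:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [123, 124, 125],
                   'Unique_Countries': [['Japan'], [np.nan], ['US', 'Brazil']]})

long_ss = df.set_index('ID').squeeze().explode()
print(long_ss[long_ss.isna()])   # shows ID 124, whose only entry was NaN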
When using pandas groupby functions and manipulating the output after the groupby, I've noticed that some functions behave differently in terms of what is returned as the index and how this can be manipulated.
Say we have a dataframe with the following information:
Name Type ID
0 Book1 ebook 1
1 Book2 paper 2
2 Book3 paper 3
3 Book1 ebook 1
4 Book2 paper 2
if we do
df.groupby(["Name", "Type"]).sum()
we get a DataFrame:
ID
Name Type
Book1 ebook 2
Book2 paper 4
Book3 paper 3
which contains a MultiIndex with the columns used in the groupby:
MultiIndex([('Book1', 'ebook'),
('Book2', 'paper'),
('Book3', 'paper')],
names=['Name', 'Type'])
and one column called ID.
but if I apply a size() function, the result is a Series:
Name Type
Book1 ebook 2
Book2 paper 2
Book3 paper 1
dtype: int64
And finally, if I do a pct_change(), we get only the resulting DataFrame column:
ID
0 NaN
1 NaN
2 NaN
3 0.0
4 0.0
TL;DR: I want to know why some functions return a Series whilst others return a DataFrame, as this confused me when dealing with different operations on the same DataFrame.
From the documentation for size:
Returns: Series, the number of rows in each group.
For sum, since you did not select a column to aggregate, it returns a DataFrame with the groupby keys as the index; selecting the column first returns a Series instead:
df.groupby(["Name", "Type"])['ID'].sum() # return Series
Functions like diff and pct_change are not aggregations; they return values with the same index as the original DataFrame. count, mean and sum are aggregations, so they return values indexed by the groupby keys.
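A small sketch built from the book data in the question, contrasting the two kinds of return (the difference is in the index, not just the container):
import pandas as pd

df = pd.DataFrame({'Name': ['Book1', 'Book2', 'Book3', 'Book1', 'Book2'],
                   'Type': ['ebook', 'paper', 'paper', 'ebook', 'paper'],
                   'ID': [1, 2, 3, 1, 2]})

# Aggregation: one value per group, indexed by the group keys
print(df.groupby(['Name', 'Type'])['ID'].sum())

# Non-aggregating function: one value per original row, indexed like df
print(df.groupby(['Name', 'Type'])['ID'].pct_change())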
The outputs are different because the aggregations are different, and those are what mostly control what is returned. Think of the array equivalent. The data are the same, but one "aggregation" returns a single scalar value while the other returns an array the same size as the input:
import numpy as np
np.array([1,2,3]).sum()
#6
np.array([1,2,3]).cumsum()
#array([1, 3, 6], dtype=int32)
The same thing goes for aggregations of a DataFrameGroupBy object. All the first part of the groupby does is create a mapping from the DataFrame to the groups. Since this doesn't really do anything there's no reason why the same groupby with a different operation needs to return the same type of output (see above).
gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...
The other important part here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return.
gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>
So what happens when you aggregate?
With a DataFrameGroupBy when you choose an aggregation (like sum) that collapses to a single value per group the return will be a DataFrame where the indices are the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns and had there been another numeric column it would have aggregated that too, necessitating the DataFrame output.
gp.sum()
# ID
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
On the other hand if you use a SeriesGroupBy object (select a single column with []) then you'll get a Series back, again with the index of unique group keys.
df.groupby(["Name", "Type"])['ID'].sum()
|------- SeriesGroupBy ----------|
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
#Name: ID, dtype: int64
For aggregations that return arrays (like cumsum, pct_change), a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series. But the index is no longer the unique group keys. This is because that would make little sense; typically you'd want to do a calculation within the group and then assign the result back to the original DataFrame. As a result the return is indexed like the original DataFrame you provided for aggregation. This makes creating these columns very simple, as pandas handles all of the alignment:
df['ID_pct_change'] = gp.pct_change()
# Name Type ID ID_pct_change
#0 Book1 ebook 1 NaN
#1 Book2 paper 2 NaN
#2 Book3 paper 3 NaN
#3 Book1 ebook 1 0.0 # Calculated from row 0 and aligned.
#4 Book2 paper 2 0.0
But what about size? That one is a bit weird. The size of a group is a scalar. It doesn't matter how many columns the group has or whether values in those columns are missing, so sending it a DataFrameGroupBy or SeriesGroupBy object is irrelevant. As a result pandas will always return a Series. Again being a group level aggregation that returns a scalar it makes sense to have the return indexed by the unique group keys.
gp.size()
#Name Type
#Book1 ebook 2
#Book2 paper 2
#Book3 paper 1
#dtype: int64
Finally, for completeness: though aggregations like sum return a single scalar value per group, it can often be useful to bring those values back to every row of that group in the original DataFrame. However, the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides the ability to transform these aggregations. Since the intent here is to bring the values back to the original DataFrame, the resulting Series/DataFrame is indexed like the original input:
gp.transform('sum')
# ID
#0 2 # Row 0 is Book1 ebook which has a group sum of 2
#1 4
#2 3
#3 2 # Row 3 is also Book1 ebook which has a group sum of 2
#4 4
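As a usage note, a common pattern is to attach the transformed values as a new column; the sketch below rebuilds the question's frame, and the column name group_ID_sum is just an illustrative choice:
import pandas as pd

df = pd.DataFrame({'Name': ['Book1', 'Book2', 'Book3', 'Book1', 'Book2'],
                   'Type': ['ebook', 'paper', 'paper', 'ebook', 'paper'],
                   'ID': [1, 2, 3, 1, 2]})

# transform broadcasts each group's sum back onto that group's original rows
df['group_ID_sum'] = df.groupby(['Name', 'Type'])['ID'].transform('sum')
print(df)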
Suppose I have a dataframe like this:
Height Speed
0 4.0 39.0
1 7.8 24.0
2 8.9 80.5
3 4.2 60.0
Then, through some feature extraction, I get this:
0 39.0
1 24.0
2 80.5
3 60.0
However, I want it to be a dataframe where the column index is still there. How would you get the following?
Speed
0 39.0
1 24.0
2 80.5
3 60.0
I am looking for an answer that compares the original with the new column and determines that the new column must be named Speed. In other words, it shouldn't just rename the new column 'Speed'.
Here is the feature extraction: Let X be the original dataframe and X1 be the returned array that lacks a column name.
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
X1 = rfecv.fit_transform(X, y)
Thanks
EDIT:
To address the comments I am receiving, I will clarify the ambiguity. I believe that the feature extraction method above takes a dataframe or a series/array and returns an array. I am passing in a dataframe, which contains the column labels and the data, but it returns an array that lacks column names. Another caveat is that this has to work generically: I cannot explicitly name my columns, because the columns will change as my program runs. It could return two arrays, four arrays, and so on. I am looking for a method that will compare the original dataframe to the array(s) given after the feature extraction, realize that the new array is a "subset" of the original dataframe, and then mark it with the original column name(s). Let me know your thoughts on that! Sorry guys, and thank you for your help.
RFECV, after being fit, has an attribute called support_, which is a boolean mask of selected features. You can obtain the names of the chosen features by doing:
selected_cols = original_df.columns[rfecv.support_]
Easy peasy!
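A minimal sketch of how you might rebuild a labeled DataFrame from that, assuming X is the original DataFrame and rfecv has already been fit as in the question (X1_df is just an illustrative name):
import pandas as pd

selected_cols = X.columns[rfecv.support_]                       # names of the kept features
X1_df = pd.DataFrame(X1, columns=selected_cols, index=X.index)  # relabel the reduced array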
I have data from csv:
time, meas, meas2
15:10, 10, 0.3
15:22, 12, 0.4
15:30, 4
So each row can contain a different number of values, up to the number of columns in the first row.
I am writing a simple stats app. For one graph I need, for example, the sum of the data in the column named meas; for a second graph I would like to filter the data by time.
Is there any ready-made class or object that makes it easy to get data by column or by row, depending on what I need?
Or do I just need to keep the data in rows and calculate the input for the first graph on the fly?
You are looking for the pandas library. The docs can be found here https://pandas.pydata.org/pandas-docs/stable/
You can run pip install pandas to install it.
The DataFrame is the basic pandas object that you work with. You can read your data in like this:
>>> import pandas as pd
>>> df = pd.read_csv(file_name)
>>> df
time meas meas2
0 15:10 10 0.3
1 15:22 12 0.4
2 15:30 4 NaN
>>> df['meas'].sum()
26
At this point time will be string values. To convert them to time objects you could do this (there may be a better way):
>>> df['time'] = [x.time() for x in pd.to_datetime(df['time'])]
Now to filter on time... Let's say you want everything after line 1.
>>> time1 = df['time'][1]
>>> df['time'] > time1
0 False
1 False
2 True
Name: time, dtype: bool
You can use the boolean expression to filter your DataFrame like this:
>>> df[df['time'] > time1]
time meas meas2
2 15:30:00 4 NaN
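Putting the steps above into one self-contained sketch: io.StringIO stands in for your real CSV file, and skipinitialspace (an addition here) handles the spaces after the commas in the sample data.
import io
import pandas as pd

csv_text = """time, meas, meas2
15:10, 10, 0.3
15:22, 12, 0.4
15:30, 4
"""
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(df['meas'].sum())              # 26

df['time'] = pd.to_datetime(df['time']).dt.time
time1 = df['time'][1]                # 15:22
print(df[df['time'] > time1])        # only the 15:30 row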
Your question is a little confusing, but it sounds like a Pandas DataFrame would be helpful. You can read csv files right into them.
import pandas as pd
df = pd.read_csv('your_csv_file.csv')
Of course you may need to get familiar with pandas for this to be useful.
import pandas as pd
import numpy as np
titanic = pd.read_csv("C:\\Users\\Shailesh.Rana\\Downloads\\train.csv")
title = []   # to extract titles out of names
for i in range(len(titanic)):
    title.append(titanic.loc[:, "Name"].iloc[i].split(" ")[1])   # index 1 is the title
titanic.iloc[(np.array(title) == "Master.") & (np.array(titanic.Age.isnull()))].loc[:, "Age"] = 3.5
#values with master title and NAN as age
The last line doesn't make a change to the original dataset. In fact, if I run this line again, it still shows a series with 4 NaN values.
Use str.split with str[1] to select the second element of each split name.
Also, converting to a numpy array is not necessary, and the chained iloc[...].loc[...] assignment should be replaced by a single loc call: chained indexing assigns to a temporary copy, which is why the original DataFrame never changes.
titanic = pd.DataFrame({'Name':['John Master.','Joe','Mary Master.'],
'Age':[10,20,np.nan]})
titanic.loc[(titanic.Name.str.split().str[1]=="Master.") &(titanic.Age.isnull()) ,"Age"]=3.5
print (titanic)
Age Name
0 10.0 John Master.
1 20.0 Joe
2 3.5 Mary Master.
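To make the copy-versus-view point concrete, here is a minimal sketch on the same toy frame; the chained form triggers pandas' SettingWithCopy behaviour and leaves titanic untouched, while the single loc call modifies it in place:
import numpy as np
import pandas as pd

titanic = pd.DataFrame({'Name': ['John Master.', 'Joe', 'Mary Master.'],
                        'Age': [10, 20, np.nan]})

mask = (titanic.Name.str.split().str[1] == "Master.") & titanic.Age.isnull()

titanic[mask].loc[:, "Age"] = 3.5        # writes to a copy; titanic is unchanged
print(titanic.Age.isnull().sum())        # still 1

titanic.loc[mask, "Age"] = 3.5           # writes to titanic itself
print(titanic.Age.isnull().sum())        # 0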