I am passing a column petrol['tax'] of a DataFrame to a function using .apply, expecting it to return the 1st quartile. I am using the code below, but it throws the error 'float' object has no attribute 'quantile'.
def Quartile(petrol_attrib):
    return petrol_attrib.quantile(.25)

petrol['tax'].apply(Quartile)
I need help implementing this correctly.
df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
                  columns=['a', 'b'])
Now you can use the quantile function in pandas directly on the Series; make sure you pass a number between 0 and 1. Here is an example:
df.a.quantile(0.5)
2.5
You can also call quantile on the whole DataFrame to get the quartile of every column at once.
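To see why the original code fails: Series.apply passes each element (a plain float) to the function, so there is no .quantile method on it. A minimal sketch, re-creating the example frame above:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 10, 100, 100]})

# Series.apply hands each scalar element to the function, which is why
# petrol['tax'].apply(Quartile) raises "'float' object has no attribute 'quantile'".
# Call quantile on the Series (or on the whole frame) directly instead:
q1_a = df['a'].quantile(0.25)   # first quartile of column 'a'
q1_all = df.quantile(0.25)      # first quartile of every column, as a Series
print(q1_a)
print(q1_all)
```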
When grouping a Pandas DataFrame, when should I use transform and when should I use aggregate? How do they differ in practice, and which one do you consider more important?
Consider the DataFrame df:
df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))
groupby followed by an aggregation is the standard use; mean reduces each group to a single row:
df.groupby('A').mean()
Maybe you want these values broadcast across the whole group, returning something with the same index as what you started with. Use transform:
df.groupby('A').transform('mean')
df.set_index('A').groupby(level='A').transform('mean')
agg is used when you have specific things you want to run for different columns, or more than one thing to run on the same column.
df.groupby('A').agg(['mean', 'std'])
df.groupby('A').agg(dict(B='sum', C=['mean', 'prod']))
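As a quick sanity check on the difference, using the same df: aggregate collapses to one row per group, while transform keeps the original index.

```python
import pandas as pd

df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))

agg_result = df.groupby('A').mean()            # one row per group: index 'a', 'b'
tf_result = df.groupby('A').transform('mean')  # same length and index as df
print(agg_result.shape)
print(tf_result.shape)
```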
Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is the file which consists of 29229472 rows × 20 columns.
There are multiple rows with the same ID inside the column machine_ID with different values in other columns.
Columns:
'start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
'job_ID', 'task_index', 'machine_ID', 'mean_CPU_usage_rate',
'canonical_memory_usage', 'assigned_memory_usage',
'unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage',
'maximum_memory_usage', 'mean_disk_I/O_time', 'mean_local_disk_space_used',
'maximum_CPU_usage', 'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
The output is displayed using pd.option_context, as it makes the content easier to visualise.
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index]) # not in-place
df.set_index(['machine_ID', df.index], inplace=True) # in-place
For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:
The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[
    ['start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
     'job_ID', 'task_index', 'mean_CPU_usage_rate', 'canonical_memory_usage',
     'assigned_memory_usage', 'unmapped_page_cache_memory_usage',
     'total_page_cache_memory_usage', 'maximum_memory_usage',
     'mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage',
     'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
     'memory_accesses_per_instruction_(MAI)', 'sample_portion',
     'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()
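On a small toy frame (with a hypothetical subset of the columns), the same agg(list) pattern collapses all rows sharing a machine_ID into one record:

```python
import pandas as pd

toy = pd.DataFrame({'machine_ID': [1, 1, 2],
                    'mean_CPU_usage_rate': [0.1, 0.2, 0.3]})

# All rows with the same machine_ID are gathered into a single list-valued cell.
collapsed = toy.groupby('machine_ID')[['mean_CPU_usage_rate']].agg(list).reset_index()
print(collapsed)
```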
I have this DataFrame in Python, and I want to multiply every row of it by this single row in the DataFrame below, as a vector.
Some things I have tried from googling: df.mul, df.apply. But they seem to align the two frames by index instead of broadcasting the single row across all rows.
Example data:
df = pd.DataFrame({'x':[1,2,3], 'y':[1,2,3]})
v1 = pd.DataFrame({'x':[2], 'y':[3]})
Multiply DataFrame with row:
df.multiply(np.array(v1), axis='columns')
If the use case needs the columns matched by name:
Example:
df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
coeffs_df = pd.DataFrame([[10, 9]], columns=['y', 'x'])
You need to convert the single-row DataFrame (coeffs_df) to a Series first, then perform the multiplication:
df.multiply(coeffs_df.iloc[0], axis='columns')
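A small sketch showing that iloc[0] makes the multiplication align by column name rather than by position (values picked arbitrarily):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
coeffs_df = pd.DataFrame([[10, 9]], columns=['y', 'x'])  # note: reversed column order

# coeffs_df.iloc[0] is a Series indexed by column name, so 'x' is scaled by 9
# and 'y' by 10, regardless of the column order in coeffs_df.
result = df.multiply(coeffs_df.iloc[0], axis='columns')
print(result)
```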
How can I get the index or column of a DataFrame as a NumPy array or Python list?
To get a NumPy array, you should use the values attribute:
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
A B
a 1 4
b 2 5
c 3 6
In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)
This accesses how the data is already stored, so there isn't any need for a conversion.
Note: This attribute is also available for many other pandas objects.
In [3]: df['A'].values
Out[3]: array([1, 2, 3])
To get the index as a list, call tolist:
In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']
And similarly, for columns.
You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for a Series.
pandas >= 0.24
Deprecate your usage of .values in favour of these methods!
From v0.24.0 onwards, we will have two brand spanking new, preferred methods for obtaining NumPy arrays from Index, Series, and DataFrame objects: they are to_numpy(), and .array. Regarding usage, the docs mention:
We haven’t removed or deprecated Series.values or
DataFrame.values, but we highly recommend using .array or
.to_numpy() instead.
See this section of the v0.24.0 release notes for more information.
to_numpy() Method
df.index.to_numpy()
# array(['a', 'b'], dtype=object)
df['A'].to_numpy()
# array([1, 4])
By default, a view is returned. Any modifications made will affect the original.
v = df.index.to_numpy()
v[0] = -1
df
A B
-1 1 2
b 4 5
If you need a copy instead, use to_numpy(copy=True):
v = df.index.to_numpy(copy=True)
v[-1] = -123
df
A B
a 1 2
b 4 5
Note that this function also works for DataFrames (while .array does not).
array Attribute
This attribute returns an ExtensionArray object that backs the Index/Series.
pd.__version__
# '0.24.0rc1'
# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df
A B
a 1 2
b 4 5
df.index.array
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object
df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64
From here, it is possible to get a list using list:
list(df.index.array)
# ['a', 'b']
list(df['A'].array)
# [1, 4]
or, just directly call .tolist():
df.index.tolist()
# ['a', 'b']
df['A'].tolist()
# [1, 4]
Regarding what is returned, the docs mention,
For Series and Indexes backed by normal NumPy arrays, Series.array
will return a new arrays.PandasArray, which is a thin (no-copy)
wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially
useful on its own, but it does provide the same interface as any
extension array defined in pandas or by a third-party library.
So, to summarise, .array will return either
The existing ExtensionArray backing the Index/Series, or
If there is a NumPy array backing the series, a new ExtensionArray object is created as a thin wrapper over the underlying array.
Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the
actual array, some transformation of it, or one of pandas custom
arrays (like Categorical). For example, with PeriodIndex, .values
generates a new ndarray of period objects each time. [...]
These two functions aim to improve the consistency of the API, which is a major step in the right direction.
Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.
If you are dealing with a multi-index DataFrame, you may be interested in extracting only the values of one named level of the MultiIndex. You can do this as
df.index.get_level_values('name_sub_index')
and of course name_sub_index must be an element of the FrozenList df.index.names
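A minimal sketch (level names chosen for illustration):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1)],
                                names=['name_sub_index', 'num'])
df = pd.DataFrame({'v': [10, 20, 30]}, index=idx)

# Pull out just the values of the named level, one entry per row:
level_vals = df.index.get_level_values('name_sub_index')
print(level_vals.tolist())
```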
Since pandas v0.13 you can also use get_values:
df.index.get_values()
(Note that get_values was later deprecated in favour of to_numpy().)
A more recent way to do this is to use the .to_numpy() function.
If I have a dataframe with a column 'price', I can convert it as follows:
priceArray = df['price'].to_numpy()
You can also pass the data type, such as float or object, as an argument of the function
I converted the pandas DataFrame column to a list and then used the basic list.index(). Something like this:
dd = list(zone[0])  # where zone[0] is some specific column of the table
idx = dd.index(filename[i])
You now have your index value as idx.
Below is a simple way to convert a DataFrame column into a NumPy array.
df = pd.DataFrame(somedict)
ytrain = df['label']
ytrain_numpy = np.array([x for x in ytrain])
ytrain_numpy is a NumPy array.
I tried with .to_numpy(), but it gave me the below error:
TypeError: no supported conversion for types: (dtype('O'),) while doing Binary Relevance classification using LinearSVC.
.to_numpy() was converting the DataFrame into a NumPy array, but the inner elements' data type was list, which caused the above error.
This question didn't have a satisfactory answer, so I'm asking it again.
Suppose I have the following Pandas DataFrame:
df1 = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'values': [1, 1, 2, 2]})
I group by the first column 'group':
g1 = df1.groupby('group')
I've now created a "DataFrameGroupBy". Then I extract the first column from the GroupBy object:
g1_1st_column = g1['group']
The type of g1_1st_column is "pandas.core.groupby.SeriesGroupBy". Notice it's not a "DataFrameGroupBy" anymore.
My question is, how can I convert the SeriesGroupBy object back to a DataFrame object? I tried using the .to_frame() method, and got the following error:
g1_1st_column = g1['group'].to_frame()
AttributeError: Cannot access callable attribute 'to_frame' of 'SeriesGroupBy' objects, try using the 'apply' method.
How would I use the apply method, or some other method, to convert to a DataFrame?
Manish Saraswat kindly answered my question in the comments.
g1['group'].apply(pd.DataFrame)
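A minimal sketch of that suggestion, using df1 from above. Note this is one workaround; the exact shape of the result (e.g. its index levels) may vary across pandas versions:

```python
import pandas as pd

df1 = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'values': [1, 1, 2, 2]})
g1 = df1.groupby('group')

# pd.DataFrame is applied to each group's Series, and the per-group frames
# are concatenated back into a single DataFrame:
result = g1['group'].apply(pd.DataFrame)
print(type(result))
```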