I am getting familiar with Pandas and I want to learn the logic with a few simple examples.
Let us say I have the following pandas DataFrame object:
import pandas as pd
d = {'year':pd.Series([2014,2014,2014,2014], index=['a','b','c','d']),
'dico':pd.Series(['A','A','A','B'], index=['a','b','c','d']),
'mybool':pd.Series([True,False,True,True], index=['a','b','c','d']),
'values':pd.Series([10.1,1.2,9.5,4.2], index=['a','b','c','d'])}
df = pd.DataFrame(d)
Basic Question.
How do I take a column as a list?
I.e., d['year']
would return
[2013,2014,2014,2014]
Question 0
How do I take rows 'a' and 'b' and columns 'year' and 'values' as a new DataFrame?
If I try:
d[['a','b'],['year','values']]
it doesn't work.
Question 1.
How would I aggregate (sum/average) the values column by the year, and dico columns, for example. I.e., such that different years/dico combinations would not be added, but basically mybool would be removed from the list.
I.e., after aggregation (this case average) I should get:
tipo values year
A 10.1 2013
A (9.5+1.2)/2 2014
B 4.2 2014
If I try the groupby function it seems to output some odd new DataFrame structure with bool in it, and all possible years/dico combinations - my objective is rather to have that simpler new sliced and smaller dataframe I showed above.
Question 2. How do I filter by a condition?
I.e., I want to filter out all rows where mybool is False.
It'd return:
tipo values year mybool
A 10.1 2013 True
A 9.5 2014 True
B 4.2 2014 True
I've tried the pandas tutorial but I still get some odd behavior, so asking directly seems to be a better idea.
Thanks!
Values from a Series (call .tolist() on the Series if you need a plain Python list):
df['year'].values #returns a NumPy array
loc lets you subset a DataFrame by index labels:
df.loc[['a','b'],['year','values']]
Group by lets you aggregate over columns:
df.groupby(['year','dico'],as_index=False).mean() #don't have 2013 in your df
Filtering by a column value:
df[df['mybool']==True]
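Putting the four pieces together, here is a minimal runnable sketch built on the DataFrame from the question. It makes two small assumptions beyond the answer above: .tolist() is used to get a real Python list, and only the values column is averaged so that mybool stays out of the result.

```python
import pandas as pd

idx = ['a', 'b', 'c', 'd']
df = pd.DataFrame({
    'year': pd.Series([2014, 2014, 2014, 2014], index=idx),
    'dico': pd.Series(['A', 'A', 'A', 'B'], index=idx),
    'mybool': pd.Series([True, False, True, True], index=idx),
    'values': pd.Series([10.1, 1.2, 9.5, 4.2], index=idx),
})

# Basic question: a column as a plain Python list
year_list = df['year'].tolist()

# Question 0: rows 'a' and 'b', columns 'year' and 'values'
subset = df.loc[['a', 'b'], ['year', 'values']]

# Question 1: average of 'values' per (year, dico) pair, keys kept as columns
means = df.groupby(['year', 'dico'], as_index=False)['values'].mean()

# Question 2: keep only the rows where mybool is True
only_true = df[df['mybool']]

print(year_list)
print(subset)
print(means)
print(only_true)
```

Since mybool is already boolean, it can be used as the mask directly; comparing it with True gives the same rows.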
I am attempting to use pandas to create a new df based on a set of conditions that compare the rows of the original df with one another. I am new to pandas and feel comfortable comparing two dfs with each other and doing basic column comparisons, but for some reason the row-by-row comparison is stumping me. My specific conditions and problem are found below:
Cosine_i_ start_time fid_ Shape_Area
0 0.820108 2022-08-31T10:48:34Z emit20220831t104834_o24307_s000 0.067763
1 0.962301 2022-08-27T12:25:06Z emit20220827t122506_o23908_s000 0.067763
2 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.404882
3 0.788322 2023-01-29T13:23:39Z emit20230129t132339_o02909_s000 0.404882
4 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.108256
^^Above is my original df that I will be working with.
Goal: I am hoping to create a new df that contains only the FIDs that meet the following conditions: If the shape area is equal, the cosi values have a difference greater than 0.1, and the start time has a difference greater than 5 days. This is going to be applied to a large dataset, the df displayed is just a small sample one I made to help write the code.
For example: Rows 2 & 3 have the same shape area, so then looking at the cosi values, they have a difference in values greater than 0.1, and lastly they have a difference in their start times that is greater than 5 days. They meet all set conditions, so I would then like to take the FID values for BOTH of these rows and append it to a new df.
So essentially I want to compare every row with the other rows and that's where I am having trouble.
I am looking for as much guidance as possible on how to set this up as I am very very new to coding and am hoping to get a tutorial of some sort!
Thanks in advance.
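Group by Shape_Area and keep only the groups that meet the required conditions (groups with a single row are dropped). This compares the last two rows of each group, so it assumes each Shape_Area appears at most twice, and it needs start_time to be a datetime column:
df['start_time'] = pd.to_datetime(df['start_time'])
fids = df.groupby('Shape_Area').filter(lambda x: x.index.size > 1
and x['Cosine_i_'].diff().abs().values[-1] > 0.1
and x['start_time'].diff().abs().dt.days.values[-1] > 5,
dropna=True)['fid_'].tolist()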
print(fids)
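If a Shape_Area can appear more than twice, comparing every pair of rows within a group may be closer to what the question asks. The sketch below is one possible way to do that with a self-merge; it is not the method from the answer above. The data is made up for illustration (short hypothetical fid_ values and areas), and the _a/_b suffixes are my own naming.

```python
import pandas as pd

# Hypothetical sample data, reusing the column names from the question
df = pd.DataFrame({
    'Cosine_i_': [0.80, 0.95, 0.81, 0.79, 0.60],
    'start_time': ['2022-08-01T10:00:00Z', '2022-08-20T10:00:00Z',
                   '2022-08-19T15:39:39Z', '2023-01-29T13:23:39Z',
                   '2022-08-02T10:00:00Z'],
    'fid_': ['f0', 'f1', 'f2', 'f3', 'f4'],
    'Shape_Area': [0.5, 0.5, 0.4, 0.4, 0.5],
})
df['start_time'] = pd.to_datetime(df['start_time'])

# Pair every row with every other row that has the same Shape_Area
left = df.reset_index()
pairs = left.merge(left, on='Shape_Area', suffixes=('_a', '_b'))
pairs = pairs[pairs['index_a'] < pairs['index_b']]  # keep each pair once

# Apply the two remaining conditions to each pair
cos_gap = (pairs['Cosine_i__a'] - pairs['Cosine_i__b']).abs()
time_gap = (pairs['start_time_a'] - pairs['start_time_b']).abs()
hits = pairs[(cos_gap > 0.1) & (time_gap > pd.Timedelta(days=5))]

# Collect the FIDs from both sides of every matching pair
fids = sorted(set(hits['fid__a']) | set(hits['fid__b']))
print(fids)
```

Note that the self-merge produces one row per pair inside each group, so it can get large if many rows share the same Shape_Area.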
I have a pandas DataFrame
ID Unique_Countries
0 123 [Japan]
1 124 [nan]
2 125 [US,Brazil]
.
.
.
I got the Unique_Countries column by aggregating over unique countries from each ID group. There were many IDs with only 'NaN' values in the original country column. They are now displayed as what you see in row 1. I would like to filter on these but can't seem to. When I type
df.Unique_Countries[1]
I get
array([nan], dtype=object)
I have tried several methods including
isnull() and
isnan()
but it gets messed up because it is a numpy array.
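If the NaN might not be in the first position of a cell, try using explode with groupby(...).all():
df[df.Unique_Countries.explode().notna().groupby(level=0).all()]
OR (only on pandas versions before 2.0, where all still accepts the level argument)
df[df.Unique_Countries.explode().notna().all(level=0)]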
Let's try
df.Unique_Countries.str[0].isna() # True where the first element is NaN
df.Unique_Countries.str[0].notna() # False where the first element is NaN
To keep only the rows whose first element is not NaN, use the mask above:
df[df.Unique_Countries.str[0].notna()]
I believe that the answers based on string method contains would fail if a country contains the substring nan in it.
In my opinion the solution should be this:
df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)
This code drops nan from your dataframe and returns the dataset in the original form.
I am not sure from your question whether you want to drop the NaN values or find the IDs of the records that have NaN in the Unique_Countries column. For the latter, you can use something similar:
long_ss = df.set_index('ID').squeeze().explode()
long_ss[long_ss.isna()]
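For reference, here is a small self-contained sketch of the explode-based filtering on data shaped like the question's. One assumption: the cells are plain Python lists here to keep the setup simple (explode handles NumPy arrays in the same way), and the names has_nan, clean and only_nan are just illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame shaped like the one in the question (lists instead of arrays)
df = pd.DataFrame({
    'ID': [123, 124, 125],
    'Unique_Countries': [['Japan'], [np.nan], ['US', 'Brazil']],
})

# One row per element; the original row label is repeated in the index
exploded = df['Unique_Countries'].explode()

# True for every original row that contains at least one NaN
has_nan = exploded.isna().groupby(level=0).any()

clean = df[~has_nan]     # rows without any NaN
only_nan = df[has_nan]   # rows that contain a NaN

print(clean)
print(only_nan)
```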
I have a dataframe that has Date as its index. The dataframe has stock market related data, so the dates are not continuous. If I want to move, let's say, 120 rows up in the dataframe, how do I do that? For example:
If I want to get the data starting from 120 trading days before the start of 2018, how do I do that with the slice below?
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df[df.columns[df.columns.get_loc('2018-01-01'):df.columns.get_loc('2019-12-31')]]
Get the location of both columns in the column array and slice between them to get the desired range.
UPDATE:
Based on your requirement, make a few small modifications to the above.
Yearly Indexing
>>> df[df.columns[(df.columns.get_loc('2018')).start:(df.columns.get_loc('2019')).stop]]
Above, df.columns.get_loc('2018') returns a slice object covering all of 2018; its .start attribute gives the position of the first element of 2018, and similarly the .stop attribute of the slice for '2019' gives the position just past the last element of 2019.
Monthly Indexing
Now consider you want data for First 6 months of 2018 (without knowing what is the first day), the same can be done using:
>>> df[df.columns[(df.columns.get_loc('2018-01')).start:(df.columns.get_loc('2018-06')).stop]]
As you can see above we have indexed the first 6 months of 2018 using the same logic.
Assuming you are using pandas, the dataframe is sorted by date, the dates are in a column named dates, and the index is the default integer index, a very simple way would be:
initial_date = '2018-01-01'
initial_date_index = df.loc[df['dates']==initial_date].index[0]
offset=120
start_index = initial_date_index-offset
new_df = df.loc[start_index:]
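Since the question says the dates are the index, a positional variant may fit better. This is only a sketch under assumptions: the index is a sorted DatetimeIndex, and the frame below is made-up data (business days with a hypothetical close column) standing in for the real prices.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the stock data: business days only, so dates have gaps
dates = pd.bdate_range('2017-01-02', '2019-12-31')
df = pd.DataFrame({'close': np.arange(len(dates), dtype=float)}, index=dates)

offset = 120

# Position of the first row on or after 2018-01-01
first_pos = df.index.searchsorted(pd.Timestamp('2018-01-01'))
# Step back 120 rows, without running off the top of the frame
start_pos = max(first_pos - offset, 0)
# Position just past the last row on or before 2019-12-31
end_pos = df.index.searchsorted(pd.Timestamp('2019-12-31'), side='right')

window = df.iloc[start_pos:end_pos]
print(window.index[0], window.index[-1])
```

searchsorted works even when 2018-01-01 itself is not a trading day, because it returns the position where that date would be inserted rather than requiring an exact match.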
When using pandas groupby functions and manipulating the output after the groupby, I've noticed that some functions behave differently in terms of what is returned as the index and how this can be manipulated.
Say we have a dataframe with the following information:
Name Type ID
0 Book1 ebook 1
1 Book2 paper 2
2 Book3 paper 3
3 Book1 ebook 1
4 Book2 paper 2
if we do
df.groupby(["Name", "Type"]).sum()
we get a DataFrame:
ID
Name Type
Book1 ebook 2
Book2 paper 4
Book3 paper 3
which contains a MultiIndex with the columns used in the groupby:
MultiIndex([('Book1', 'ebook'),
('Book2', 'paper'),
('Book3', 'paper')],
names=['Name', 'Type'])
and one column called ID.
but if I apply a size() function, the result is a Series:
Name Type
Book1 ebook 2
Book2 paper 2
Book3 paper 1
dtype: int64
And at last, if I do a pct_change(), we get only the resulting DataFrame column:
ID
0 NaN
1 NaN
2 NaN
3 0.0
4 0.0
TL;DR. I want to know why some functions return a Series whilst some others a DataFrame as this made me confused when dealing with different operations within the same DataFrame.
From the documentation
Size:
Returns
Series
Number of rows in each group.
For sum, since you did not select a column before summing, it returns a DataFrame with the groupby keys as the index rather than as columns. Selecting a single column returns a Series instead:
df.groupby(["Name", "Type"])['ID'].sum() # return Series
Functions like diff and pct_change are not aggregations; they return values with the same index as the original dataframe. count, mean and sum are aggregations, so they return one value per group with the groupby keys as the index.
The outputs are different because the aggregations are different, and those are what mostly control what is returned. Think of the array equivalent. The data are the same but one "aggregation" returns a single scalar value, the other returns an array the same size as the input
import numpy as np
np.array([1,2,3]).sum()
#6
np.array([1,2,3]).cumsum()
#array([1, 3, 6], dtype=int32)
The same thing goes for aggregations of a DataFrameGroupBy object. All the first part of the groupby does is create a mapping from the DataFrame to the groups. Since this doesn't really do anything there's no reason why the same groupby with a different operation needs to return the same type of output (see above).
gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...
The other important part here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return.
gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>
So what happens when you aggregate?
With a DataFrameGroupBy when you choose an aggregation (like sum) that collapses to a single value per group the return will be a DataFrame where the indices are the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns and had there been another numeric column it would have aggregated that too, necessitating the DataFrame output.
gp.sum()
# ID
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
On the other hand if you use a SeriesGroupBy object (select a single column with []) then you'll get a Series back, again with the index of unique group keys.
df.groupby(["Name", "Type"])['ID'].sum()
#|------- SeriesGroupBy ----------|
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
#Name: ID, dtype: int64
For aggregations that return arrays (like cumsum, pct_change) a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series. But the index is no longer the unique group keys. This is because that would make little sense; typically you'd want to do a calculation within the group and then assign the result back to the original DataFrame. As a result the return is indexed like the original DataFrame you provided for aggregation. This makes creating these columns very simple as pandas handles all of the alignment
df['ID_pct_change'] = gp.pct_change()
# Name Type ID ID_pct_change
#0 Book1 ebook 1 NaN
#1 Book2 paper 2 NaN
#2 Book3 paper 3 NaN
#3 Book1 ebook 1 0.0 # Calculated from row 0 and aligned.
#4 Book2 paper 2 0.0
But what about size? That one is a bit weird. The size of a group is a scalar. It doesn't matter how many columns the group has or whether values in those columns are missing, so sending it a DataFrameGroupBy or SeriesGroupBy object is irrelevant. As a result pandas will always return a Series. Again being a group level aggregation that returns a scalar it makes sense to have the return indexed by the unique group keys.
gp.size()
#Name Type
#Book1 ebook 2
#Book2 paper 2
#Book3 paper 1
#dtype: int64
Finally, for completeness: although aggregations like sum return a single scalar value per group, it can often be useful to bring those values back to every row for that group in the original DataFrame. However, the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides the ability to transform these aggregations. Since the intent here is to bring the result back to the original DataFrame, the Series/DataFrame is indexed like the original input
gp.transform('sum')
# ID
#0 2 # Row 0 is Book1 ebook which has a group sum of 2
#1 4
#2 3
#3 2 # Row 3 is also Book1 ebook which has a group sum of 2
#4 4
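To see all of these cases side by side, here is a short sketch that rebuilds the example DataFrame and prints the type and length of each result. The variable names (summed_df, sizes, and so on) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Book1', 'Book2', 'Book3', 'Book1', 'Book2'],
    'Type': ['ebook', 'paper', 'paper', 'ebook', 'paper'],
    'ID': [1, 2, 3, 1, 2],
})
gp = df.groupby(['Name', 'Type'])

summed_df = gp.sum()               # DataFrameGroupBy + reduction -> DataFrame
summed_s = gp['ID'].sum()          # SeriesGroupBy + reduction -> Series
sizes = gp.size()                  # size -> Series here
pct = gp.pct_change()              # same length as the input, original index
broadcast = gp.transform('sum')    # group sums repeated on every input row

for label, out in [('sum', summed_df), ("['ID'].sum", summed_s),
                   ('size', sizes), ('pct_change', pct),
                   ("transform('sum')", broadcast)]:
    print(label, type(out).__name__, len(out))
```

The reductions come back with one row per group (three here), while pct_change and transform come back with one row per input row (five here).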
I'm trying to add columns together and display their total in a new column, and also compute the following:
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
and trying to create a data frame called d2 that only contains rows of data in d that don't have any missing (NaN) values
I have implemented the following code
import pandas as pd
new_val= pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)# it's not showing top file lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me a total as a word.
When you used new_val['total'] = 'total', you told pandas that you want a column called total in your DataFrame where every value is the string total.
What you want to fix is the assignment. For this I can give you a quick and dirty solution that will hopefully make a cleaner solution clearer to you.
You can iterate through your DataFrame and add the month columns row by row to get the value for the new column.
for i, row in new_val.iterrows():
    new_val.loc[i, 'total'] = row['Jan'] + row['Feb'] + row['Mar']
Note that the write goes through .loc[i, 'total']: chained indexing such as new_val.iloc[i]['total'] = ... assigns to a temporary copy and leaves new_val unchanged. This also requires iterating through your entire data set, so if your data set is large this is not the best option.
As mentioned by @Cavenfish, new_val['total'] = 'total' creates a column total where the value of every cell is the string total.
You should instead use new_val['total'] = new_val['Jan']+new_val['Feb']+new_val['Mar']
To handle NA values you can use the mask new_val.isna(), which gives a boolean for every cell indicating whether it is NA. You can then apply any logic on top of it. For your example, the below should work:
new_val.isna().sum(axis=1)==0
This counts the NA cells in each row, so it returns False for any row that contains an NA and True for complete rows. You can use this mask to select the complete rows, or use its inverse on new_val['total'] to assign a default value where an NA is encountered in one of the columns of a row.
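Here is a minimal sketch covering everything the question lists. The frame is a small stand-in for the CSV: the first two rows reuse the Jan/Feb/Mar figures from the sample, while the third row (with a missing Jan value) and the account values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the CSV; the third row is hypothetical and has a missing value
d = pd.DataFrame({
    'account': ['0118', '0119', '0120'],
    'Jan': [10000, 12000, np.nan],
    'Feb': [62000, 88200, 50000],
    'Mar': [35000, 15000, 20000],
})

# Row-wise total; a row with a missing month gets NaN as its total
d['total'] = d['Jan'] + d['Feb'] + d['Mar']

jan_total = d['Jan'].sum()    # sum skips NaN by default
feb_min = d['Feb'].min()
mar_mean = d['Mar'].mean()

# Only the rows that have no missing values at all
d2 = d.dropna()

print(d)
print(jan_total, feb_min, mar_mean)
print(d2)
```

dropna() with no arguments drops every row that has at least one NaN, which is the same selection as d[d.isna().sum(axis=1) == 0].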