I'm working with the kaggle New York City Airbnb Open Data which is available here:
https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
The data contains a column 'neighbourhood_group', consisting of the five boroughs of NYC, and a column 'neighbourhood', consisting of the neighbourhoods within each neighbourhood group.
I have created a subset of the Manhattan neighbourhood with the following code:
airbnb_manhattan = airbnb[airbnb['neighbourhood_group'] == 'Manhattan']
I would like to create further subsets of this dataframe by neighbourhood. However, there are 32 neighbourhoods, so I'd like to automate the process.
This is the code that I tried:
manhattan_neighbourhoods = list(airbnb_manhattan['neighbourhood'].unique())
neighbourhoods = pd.DataFrame()
for n in manhattan_neighbourhoods:
    neighbourhoods[n] = pd.DataFrame(affordable_manhattan[affordable_manhattan['neighbourhood'] == manhattan_neighbourhoods[n]])
Which produces the following error message:
TypeError: list indices must be integers or slices, not str
Thanks.
You should not copy into new dfs unless strictly necessary; try to do your analysis with the full df as much as possible. Use .groupby, as in:
by_neigh = airbnb.groupby('neighbourhood_group')
Then use .agg, .apply, or .transform as needed. Or as a last resort you can iterate with
for neigh, rows in by_neigh:
    ...  # neigh is the group key, rows is the sub-DataFrame
Or get just one group with
by_neigh.get_group('Manhattan')
The advantage of all this is that the underlying data is not copied until absolutely necessary; pandas can often view the same array through different filters and slices as needed.
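For example, a minimal sketch against this dataset (assuming the Kaggle CSV is named AB_NYC_2019.csv, as distributed, and that its price column is numeric):

import pandas as pd

airbnb = pd.read_csv('AB_NYC_2019.csv')  # assumed filename; adjust to yours

by_neigh = airbnb.groupby('neighbourhood_group')

# aggregate without materialising per-borough copies
print(by_neigh['price'].median())

# iterate only if you really need the per-group sub-frames
for neigh, rows in by_neigh:
    print(neigh, len(rows))

# or pull a single group on demand
manhattan = by_neigh.get_group('Manhattan')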
Read more in the pandas groupby docs.
Good evening! I'm using pandas in a Jupyter Notebook. I have a huge dataframe representing the full history of posts of 26 channels in a messenger. It has a column "dialog_id" which represents the dialog in which the message was sent (so there can be only 26 unique values in the column, but there are more than 700k rows, and the df is sorted by time, not id, so it is kinda chaotic). I have to split this dataframe into two different ones (one will contain the full history of 13 channels, and the other will contain the history of the remaining 13 channels). I know the ids by which I have to split; they are random as well. For example, one is -1001232032465 and the other is -1001153765346.
The question is: how do I do this most elegantly and adequately?
I know I can do it somehow with df.loc[], but I don't want to write something like 13 rows of df.loc[]. I've tried to use logical operators for this, like:
df1.loc[(df["dialog_id"] == '-1001708255880') & (df["dialog_id"] == '-1001645788710')]
but it doesn't work. I suppose I'm using them wrong. I expect a solution with any method that creates a new df with the use of logical operators. In verbal expression, I think it should sound like "put the row in a new df if the dialog_id is x, or the dialog_id is y, or the dialog_id is z, etc.". Please help me!
The easiest way seems to be just setting up a query.
df = pd.DataFrame(dict(col_id=[1,2,3,4,], other=[5,6,7,8,]))
channel_groupA = [1,2]
channel_groupB = [3,4]
df_groupA = df.query(f'col_id == {channel_groupA}')
df_groupB = df.query(f'col_id == {channel_groupB}')
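As an aside, the & in your original attempt can never match: a single dialog_id cannot equal two different values at once, so you would need | between the conditions. isin expresses the same idea without chaining; an equivalent boolean-mask version of the query above, on the same toy frame:

df_groupA = df[df['col_id'].isin(channel_groupA)]
df_groupB = df[df['col_id'].isin(channel_groupB)]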
I have a pandas data frame in which the values of one of its columns look like this:
print(VCF['INFO'].iloc[0])
Results (sorry, I can't copy and paste this data as I am working from a cluster without an internet connection)
I need to create new columns named END, SVTYPE and SVLEN, with their info as the values of those columns. Following the example, this would be:
END        SVTYPE  SVLEN
224015456  DEL     -223224913
The rest of the info contained in the INFO column I do not need so far.
The information contained in this column is huge, but as far as I can read there are no more something=value pairs, as you can see in the picture.
Simply use .str.extract:
extracted = VCF['INFO'].str.extract(r'END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
Output:
>>> extracted
END SVTYPE SVLEN
0 224015456 DEL -223224913
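If you then want these as real columns of the original frame, a possible follow-up (note that .str.extract always returns strings, so END and SVLEN still need a numeric conversion; VCF is the frame from the question):

VCF = VCF.join(extracted)  # attach END, SVTYPE, SVLEN as new columns
VCF[['END', 'SVLEN']] = VCF[['END', 'SVLEN']].apply(pd.to_numeric)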
This is my first post to the coding community, so I hope I get the right level of detail in my request for help!
Background info:
I want to repeat (loop) a command on a df using a variable that contains a list of options. While the list 'amenity_options' contains a few specific items (let's say only four amenities, as in the example below), the df is a large data frame with many other items. My goal is to run the operation below for each item in 'amenity_options' until the end of the list.
amenity_options = ['cafe','bar','cinema','casino'] # a list with multiple options
df = df[df['amenity'] == amenity_options] # my attempt to select the first value in the list (e.g. cafe) out of a dataframe that contains such a column name
df.to_excel('{}_amenity.xlsx'.format('amenity')) # wish to save the result (e.g. cafe_amenity) as a separate file
Desired result: I wish to loop steps one and two for each and every item in the list (e.g. cafe, bar, cinema, ...), so that I will have separate Excel files in the end. Any thoughts?
What @Rakesh suggested is correct; you probably just need one more step.
df = df[df['amenity'].isin(amenity_options)]
for key, g in df.groupby('amenity'):
    g.to_excel('{}_amenity.xlsx'.format(key))
After you call groupby() on your df, you will get four groups, so you can loop over them directly.
key is the group key (cafe, bar, etc.), and g is the sub-dataframe filtered to that key.
Seems like you just need a simple for loop:
for amenity in amenity_options:
    df[df['amenity'] == amenity].to_excel(f"{amenity}_amenity.xlsx")
I have the following problem and had an idea to solve it, but it didn't work:
I have data on DAX Call and Put Options for every trading day in a month. After transforming and some calculations I have the following DataFrame:
DaxOpt. The goal is now to get rid of every row (either Call or Put Option) which does not have a matching pair. By pair I mean a Call and a Put Option with the same 'EXERCISE_PRICE' and 'TAU', where 'TAU' is the time to maturity in years. The red boxes in the picture are examples of pairs. So the result would be either a DataFrame with only the pairs, or two DataFrames of Call and Put Options where the rows are the respective pairs.
My idea was to create two new DataFrames, one containing only the Call Options and the other the Put Options, sort them by 'TAU' and 'EXERCISE_PRICE', and work my way through with the pandas isin function, in order to get rid of the Call or Put Options which do not have a matching pair.
DaxOptCall = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'C']
DaxOptPut = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'P']
The problem is that DaxOptCall and DaxOptPut have different dimensions, so the isin function is not directly applicable. I am trying to find the most efficient way, since the data I am using now is just a fraction of the real data.
Would appreciate any help or idea.
See if this works for you:
Once you have separated your df into two dfs by the CALL/PUT flag, convert the column(s) that uniquely identify your pairs into index columns:
# Assuming your unique columns are TAU and EXERCISE_PRICE
df_call = df_call.set_index(["EXERCISE_PRICE", "TAU"])
df_put = df_put.set_index(["EXERCISE_PRICE", "TAU"])
Next, take the intersection of the indexes, which returns a pandas MultiIndex object:
mtx = df_call.index.intersection(df_put.index)
Then use the mtx object to extract the common rows from the dfs:
df_call.loc[mtx]
df_put.loc[mtx]
You can concatenate these if you want them in the same df, and reset the index to turn EXERCISE_PRICE and TAU back into ordinary columns.
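A sketch of that last step:

import pandas as pd

# stack the matched call/put rows back together and restore
# EXERCISE_PRICE and TAU as ordinary columns
pairs = pd.concat([df_call.loc[mtx], df_put.loc[mtx]]).reset_index()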
I have a pandas dataframe in which one of the columns contains address information and I want to slice the address to only provide the zipcode and put this into a new column. For example a typical address looks like this:
609 Lizeth Streets Bolton MA 01740 US.
To grab the zip I tried:
split_zip = lambda x: str(x).split()[-2]
df['Zipcode'] = df['Address'].apply(split_zip)
Doing that I get an
'IndexError: list index out of range'
Sidenote: when I don't specify an index, it puts the split list in the column just as I would expect (i.e. [609, Lizeth, Streets, Bolton, MA, 01740, US]). I can see that the zip is in the [-2] position and I just don't know why it won't grab it. Additionally, trying to grab the [1] index throws the same error. The only index that seems to work is [-1], which grabs 'US'.
I'm fairly new to python and working with data in pandas so any help would be much appreciated!
Here's a way you can try:
df['Zipcode'] = df['Address'].str.split(' ').str[-2]
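This also explains the IndexError: with apply, a missing address becomes str(nan).split() == ['nan'], a one-element list, so [-2] and [1] raise while [-1] still works. The .str accessor instead returns NaN for out-of-range positions. A quick check on toy data (assuming NaNs are the culprit in your column):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Address': ['609 Lizeth Streets Bolton MA 01740 US', np.nan]})
print(df['Address'].str.split(' ').str[-2])
# 0    01740
# 1      NaN
# Name: Address, dtype: object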