Access different values in one data frame column? - python

df is a DataFrame loaded from a CSV file that contains various stats.
player_name,player_id,season,season_type,team
Giannis Antetokounmpo,antetgi01,2020,PO,MIL
I have tried this:
print(df.loc[(df["team"] == "LAL") & (df["team"] == "LAC") & (df["season_type"] == "
I am trying to access the "team" column and filter elements that also meet the "season_type" requirement, however there is no output.
What works currently:
print(df.loc[(df["team"] == "LAL") & (df["season_type"] == "PO")])
When I do this I get the correct output, but only for one specific team.
My question is: how can I do this for multiple teams?

Good question, this should work for you:
team_list = ["LAL", "LAC"]
df = df[df.team.isin(team_list) & (df.season_type == 'PO')]
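For context on why the first attempt prints nothing: a single row's team can never equal both "LAL" and "LAC", so joining those two comparisons with & is False for every row. Below is a minimal sketch of the isin approach on made-up rows (the sample data, including the "RS" value, is hypothetical; only the column names come from the question). Each comparison is parenthesized because & binds more tightly than ==.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
df = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C", "Player D"],
    "team": ["LAL", "LAC", "MIL", "LAL"],
    "season_type": ["PO", "PO", "PO", "RS"],
})

team_list = ["LAL", "LAC"]

# A row can never be on both teams at once, so this mask is all False.
both_teams = (df["team"] == "LAL") & (df["team"] == "LAC")

# isin() tests membership in the list, i.e. team is "LAL" OR "LAC".
playoff_rows = df[df["team"].isin(team_list) & (df["season_type"] == "PO")]
print(playoff_rows)
```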

Related

Trouble applying a secondary filter (referencing a csv) to a pandas df

I need help fixing a filter I’m trying to implement in pandas. Here is what I'm working with:
import pandas as pd
discogs_df = pd.read_pickle("/Users/USER/file.pkl")
filtered_df = discogs_df[(discogs_df['Format1'] == "Vinyl") &
((discogs_df['Country'].str.contains("US")) |
(discogs_df['Country'].str.contains("Canada"))) &
(discogs_df['Style'].str.contains("Soul"))]
USER_df = pd.read_csv("/Users/USER/file.csv")
USER_df['Style'] = USER_df['Style'].astype(str).replace("", "No Style")
USER_df['Style'].fillna("No Style", inplace=True)
soul_df = USER_df[USER_df['Style'].str.contains("Soul")]
top4years = soul_df['Year'].value_counts().head(4).index
filtered_df = filtered_df[(filtered_df['Year'].isin(top4years)) |
(filtered_df['Year'].isnull()) |
(filtered_df['Year'] == 0)]
soul_df = soul_df['Style'].str.split(', ').apply(pd.Series).stack().str.strip().to_frame().reset_index(drop=True)
soul_df.columns = ['Style']
top6styles = soul_df[soul_df['Style'] != 'No Style']['Style'].value_counts().head(6).index
top201labels = discogs_df['Label'].value_counts().head(201).index
filtered_df_labels = filtered_df[~filtered_df['Label'].isin(top201labels)]
top334artists = discogs_df['Artist'].value_counts().head(334).index
filtered_df_artists = filtered_df_labels[~filtered_df_labels['Artist'].isin(top334artists)].copy()
def style_filter(df, style_list):
    df = df[df['Style'].apply(lambda x: any(s in x for s in style_list))]
    return df
filtered_df_artists = style_filter(filtered_df_artists, top6styles)
print(filtered_df_artists[["ReleaseID", "Style"]].head(16))
I have a large .pkl (about 16 million rows, 8 columns). I’m filtering that for:
column Format1 contains “Vinyl” and,
column Country contains “US” OR “Canada” and,
column Style contains “Soul” and,
column Year references a local .csv that has the same format as the .pkl. It counts the top 4 most common unique values in the .csv Year column when the .csv Style column contains “Soul”
Then I remove a number of common labels and artists from the results. There are no issues with these, but my problem arises when I try to add a secondary filter on the Style column. I’m trying to reference the same .csv and count the top 6 most common unique style values where the .csv Style contains “Soul”. But I can never output anything where only this secondary Style filter's style values are present. In the desired output, "Soul" is always present. If there are styles in addition to "Soul", those can ONLY be values present in top6styles.
Is it possible that the code logic within the function:
"def style_filter(df, style_list): df = df[df['Style'].apply(lambda x: any(s in x for s in style_list))] return df"
is causing the error with the filter for the "Style" column in the .pkl file when trying to count the top 6 most common unique values? The top6styles values are: Index(['Soul', 'Funk', 'Boogie', 'Disco', 'UK Street Soul', 'Gospel'], dtype='object') but, like I said, my results all contain "Soul" along with all sorts of style values that are not in top6styles. Please, someone let me know what I'm doing wrong. I feel like I'm going insane. I'm happy to supply examples of any df if that would be helpful. Thanks in advance!
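One possible reading of the symptom: any(s in x for s in style_list) keeps a row as soon as a single listed style appears in it, and since every row already contains "Soul", the filter removes nothing. Below is a hedged sketch of a stricter filter, assuming Style holds comma-separated values (as the split(', ') call in the question suggests). The rows are made up and style_subset_filter is a hypothetical helper name, not part of the original code.

```python
import pandas as pd

def style_subset_filter(df, style_list):
    """Keep rows that include "Soul" and whose every style is in style_list."""
    allowed = set(style_list)

    def is_allowed(style_string):
        # Split the comma-separated string into individual style names.
        styles = {s.strip() for s in style_string.split(",")}
        # "styles <= allowed" is a subset test: no style outside the list.
        return "Soul" in styles and styles <= allowed

    return df[df["Style"].apply(is_allowed)]

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "ReleaseID": [1, 2, 3],
    "Style": ["Soul", "Soul, Funk", "Soul, Jazz-Funk"],
})
top6styles = ["Soul", "Funk", "Boogie", "Disco", "UK Street Soul", "Gospel"]

kept = style_subset_filter(sample, top6styles)
print(kept)
```

The difference from the original function is the switch from "at least one listed style appears somewhere in the string" to "every individual style is in the list".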

comparing two columns of a row in python dataframe

I know that one can compare a whole column of a dataframe against a value and collect all rows that contain that value with:
values = parsedData[parsedData['column'] == valueToCompare]
But is there a possibility to make a list out of all rows, by comparing two columns with values like:
values = parsedData[parsedData['column01'] == valueToCompare01 and parsedData['column02'] == valueToCompare02]
Thank you!
It is completely possible, but not with and: Python's and tries to reduce each Series to a single True/False, which pandas rejects with a ValueError. Use the element-wise & operator instead. Note that the ( ) around each comparison are required, because & binds more tightly than ==:
values = parsedData[(parsedData['column01'] == valueToCompare01) & (parsedData['column02'] == valueToCompare02)]
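To illustrate the difference, here is a small sketch on made-up data (the column contents and comparison values are hypothetical). It catches the error raised by and, then builds the same filter with &.

```python
import pandas as pd

# Made-up data; the names follow the question.
parsedData = pd.DataFrame({
    "column01": [1, 1, 2],
    "column02": ["a", "b", "a"],
})
valueToCompare01 = 1
valueToCompare02 = "a"

try:
    # `and` needs one True/False per operand, but each operand is a whole Series.
    parsedData[parsedData["column01"] == valueToCompare01 and parsedData["column02"] == valueToCompare02]
    and_failed = False
except ValueError:
    and_failed = True

# `&` combines the two boolean masks element by element.
values = parsedData[(parsedData["column01"] == valueToCompare01)
                    & (parsedData["column02"] == valueToCompare02)]
print(and_failed, len(values))
```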

How can I create new dataframes using for looping and the method query to filter my dataframe that already exist?

I want to create new dataframes using the query method and a for loop, but when I try to do this,
the following error appears: UndefinedVariableError: name 'i' is not defined.
I tried to do this using this code:
for sigla in sigla_estados:
    nome_estado_df = 'dataset_' + sigla
    for i in range(28):
        nome_estado_df = consumo_alimentar.query("UF == @lista_estados[i]")
My list (lista_estados) has 27 items, so I tried to go through all of them using range.
I can't figure out what the problem is; I am a beginner.
From your code, I assume you want to create multiple dataframes, each containing the rows of consumo_alimentar that belong to one specific state (column UF holding a name that matches one of the state names in lista_estados).
I also assume that you have a list (sigla_estados) containing the abbreviations of the states in lista_estados, that it has the same length as lista_estados, and that it is ordered so that sigla_estados[x] is the abbreviation of lista_estados[x] for all x.
If my assumptions are right, this code could work:
nome_estado_df = {}  # maps each abbreviation to its filtered dataframe
for i in range(len(lista_estados)):
    estado = lista_estados[i]
    sigla = sigla_estados[i]
    mask = consumo_alimentar['UF'] == estado
    nome_estado_df[sigla] = consumo_alimentar[mask]
With that code you'll get a collection of data frames keyed by abbreviation, which I think is more or less what you want. If you want to use the query method, this should also work:
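nome_estado_df = {}  # maps each abbreviation to its filtered dataframe
for i in range(len(lista_estados)):
    estado = lista_estados[i]
    sigla = sigla_estados[i]
    # @estado refers to the local variable estado inside the query string
    query_str = "UF == @estado"
    nome_estado_df[sigla] = consumo_alimentar.query(query_str)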
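A self-contained sketch of the same idea on made-up data (the state names, abbreviations, rows, and the valor column are hypothetical). It assumes nome_estado_df is a dict keyed by abbreviation and pairs the two lists with zip instead of indexing, which also avoids running past the end of a 27-item list.

```python
import pandas as pd

# Made-up data; only the variable names and the UF column come from the question.
consumo_alimentar = pd.DataFrame({
    "UF": ["Bahia", "Acre", "Bahia"],
    "valor": [10, 20, 30],
})
lista_estados = ["Acre", "Bahia"]
sigla_estados = ["AC", "BA"]

nome_estado_df = {}  # maps abbreviation -> filtered DataFrame
for estado, sigla in zip(lista_estados, sigla_estados):
    # @estado refers to the Python variable estado inside the query string.
    nome_estado_df[sigla] = consumo_alimentar.query("UF == @estado")

print(nome_estado_df["BA"])
```

Each state's rows are then available by key, for example nome_estado_df["AC"].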

python pandas assignment of missing value as an copy

I'm trying to fill missing values with the mean value of a group of products in my dataset (eventually I want to iterate over each category and fill in the missing data):
df.loc[df.iCode == 160610,'oPrice'].fillna(value=df[df.iCode == 160610].oPrice.mean(), inplace=True)
It's not working (maybe the selection is being treated as a copy).
Thanks
df.loc[(df.iCode == 160610) & (df.oPrice.isna()),'oPrice'] = df.loc[df.iCode == 160610].oPrice.mean()
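Since the question mentions eventually doing this for every category, here is a hedged sketch that fills each group's missing oPrice with that group's own mean in one step, using groupby(...).transform("mean"). The data is made up; only the column names iCode and oPrice come from the question.

```python
import numpy as np
import pandas as pd

# Made-up data; only the column names come from the question.
df = pd.DataFrame({
    "iCode": [160610, 160610, 160610, 200, 200],
    "oPrice": [10.0, np.nan, 20.0, 5.0, np.nan],
})

# For every row, the mean oPrice of that row's iCode group (NaN values are skipped).
group_mean = df.groupby("iCode")["oPrice"].transform("mean")

# Assign the result back to the column instead of calling
# fillna(inplace=True) on a selection, which may act on a temporary copy.
df["oPrice"] = df["oPrice"].fillna(group_mean)
print(df)
```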

Pandas - Selecting multiple dataframe criteria

I have a DataFrame with multiple columns and I need to set the criteria to access specific values from two different columns. I'm able to do it successfully on one column as shown here:
status_filter = df[df['STATUS'] == 'Complete']
But I'm struggling to specify values from two columns. I've tried something like this but get errors:
status_filter = df[df['STATUS'] == 'Complete' and df['READY TO INVOICE'] == 'No']
It may be a simple answer, but any help is appreciated.
Your code has two very small errors: 1) each criterion needs its own parentheses, and 2) you need to use the ampersand (&) instead of and between your criteria:
status_filter = df[(df['STATUS'] == 'Complete') & (df['READY TO INVOICE'] == 'No')]
status_filter = df.loc[(df['STATUS'] == 'Complete') & (df['READY TO INVOICE'] == 'No')]
You're welcome.
You can use:
status_filter = df[(df['STATUS'] == 'Complete') & (df['READY TO INVOICE'] == 'No')]
