This is a common question, but I can't find an answer that fits this particular scenario.
I have a DataFrame with a Genre column of strings, e.g. "Drama, Western", plus one-hot encoded columns for each genre: for "Drama, Western" there is a 1 in both the Drama and Western columns, but where the genre is just Western there is a 1 in the Western column and a 0 in Drama.
I want a filtered DataFrame containing only rows whose genre is Western and nothing else. I'm trying to oversample a minority class for a model, but I don't want to increase the other genre counts as a byproduct.
There are multiple such rows, so I can't use the index, and there are 24 genres, so I can't write a condition like df[(df['Western']==1) & (df['Drama']==0)] without having to account for all of them.
Index | Genre           | Drama | Western | Action | genre 4
0     | Drama, Western  | 1     | 1       | 0      | 0
1     | Western         | 0     | 1       | 0      | 0
3     | Action, Western | 0     | 1       | 1      | 0
If I understand your question correctly, you want those rows where only 'Western' is 1, i.e. the genre is only Western, nothing else.
Why do you have to use the encoded columns then? Just use the original 'Genre' column where the data is in string format. No need to overcomplicate things.
new_df = df[df['Genre']=='Western']
Make a column_list of the genre columns, like column_list = ['Western', 'Drama', 'Action', ...], and take the row-wise sum: if the sum equals 1 and the 'Western' column equals 1, then Western is the only genre. This should return the Index of each row where only 'Western' is 1:
column_list = ['Western', 'Drama', 'Action', ...]
df.loc[(df[column_list].sum(axis=1) == 1) & (df['Western'] == 1), 'Index']
If you haven't got the Genre column (so every remaining column is a one-hot genre flag), you could do
df[
    (df['Western'] == 1)
    & (df[df.columns.difference(['Western'])] == 0).all(axis=1)
]
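Since the end goal here is oversampling, a minimal sketch of how the filtered frame might feed into that (the example data, sample size, and random_state are illustrative assumptions, not from the question):
import pandas as pd

# Illustrative frame matching the question's layout (assumed data).
df = pd.DataFrame({
    'Genre':   ['Drama, Western', 'Western', 'Action, Western'],
    'Drama':   [1, 0, 0],
    'Western': [1, 1, 1],
    'Action':  [0, 0, 1],
})

genre_cols = ['Drama', 'Western', 'Action']  # extend to all 24 genres

# Rows where Western is the only genre flagged.
western_only = df[(df['Western'] == 1) & (df[genre_cols].sum(axis=1) == 1)]

# Hypothetical oversampling step: duplicate the Western-only rows with replacement.
oversampled = pd.concat(
    [df, western_only.sample(n=2, replace=True, random_state=0)],
    ignore_index=True,
)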
The dataframe looks like:
name    education    education_2   education_3
name_1  NaN          some college  NaN
name_2  NaN          NaN           graduate degree
name_3  high school  NaN           NaN
I just want to keep one education column. I tried a conditional statement comparing the columns to each other, but got nothing but errors. I also looked into a merge-based solution, in vain. Does anyone know how to deal with this using Python or pandas? Thank you in advance. The desired output:
name education
name_1 some college
name_2 graduate degree
name_3 high school
One day I hope pandas will have better functions for string-typed rows, rather than the limited support for columns currently available:
df['education'] = (df.filter(like='education')  # filter to only the education columns
                     .T                         # transpose: each original row becomes a column
                     .convert_dtypes()          # convert to pandas nullable dtypes
                     .max())                    # max of each column = its lone non-null value
df = df[['name', 'education']]
print(df)
Output:
name education
0 name_1 some college
1 name_2 graduate degree
2 name_3 high school
Looping this wouldn't be too hard, e.g.:
cols = ['education', 'age', 'income']
for col in cols:
    df[col] = df.filter(like=col).bfill(axis=1)[col]
df = df[['name'] + cols]
You can use df.fillna to do so.
df['combine'] = df[['education','education_2','education_3']].fillna('').sum(axis=1)
df
name education education_2 education_3 combine
0 name_1 NaN some college NaN some college
1 name_2 NaN NaN graduate degree graduate degree
2 name_3 high school NaN NaN high school
If you have a lot of columns to combine, you can try this.
df['combine'] = df[df.columns[1:]].fillna('').sum(axis=1)
Use bfill to fill the empty (NaN) values:
df.bfill(axis=1).drop(columns=['education_2','education_3'])
name education
0 name_1 some college
1 name_2 graduate degree
2 name_3 high school
If there are other columns in between, choose the columns to which you apply bfill. In essence, if you have multiple education columns to consolidate under a single column, apply bfill only to those columns; subsequently, you can drop the columns from which you back-filled.
df[['education','education_2','education_3']].bfill(axis=1).drop(columns=['education_2','education_3'])
I am reading tabular data from an email into a pandas dataframe.
There is no guarantee that the column names will be in the first row. Sometimes the data comes in the following format (the actual column names are [ID, Name, Year]):
dummy1 dummy2 dummy3
test_column1 test_column2 test_column3
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Sometimes the column names come in the first row as expected.
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Once I read the HTML table from the email, how do I remove the initial rows that don't contain the column names? In the first case I would need to remove the first two rows plus the column-name row (after promoting it to the header); in the second case I wouldn't have to remove anything.
Also, the column names can come in any sequence.
Basically, I want to do the following:
1. Check whether one of the rows in the dataframe contains the column names.
2. Remove the rows above it.
if "ID" in row:
    remove the rows above
How can I achieve this?
You can first get the index of the row holding the valid column names and then filter and set accordingly.
df = pd.read_csv("d.csv",sep='\s+', header=None)
col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item() # get columns index
df.columns = df.iloc[col_index].to_numpy() # set valid columns
df = df.iloc[col_index + 1 :] # filter data
df
ID Name Year
3 1 John Sophomore
4 2 Lisa Junior
5 3 Ed Senior
or
If you want to set ID as the index:
df = df.iloc[col_index + 1 :].set_index('ID')
df
Name Year
ID
1 John Sophomore
2 Lisa Junior
3 Ed Senior
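Note that (df == ["ID","Name","Year"]).all(1) assumes the header row has the names in exactly that order. Since the question says the column names can come in any sequence, here is a hedged, order-independent variant of the same idea using isin:
expected = ["ID", "Name", "Year"]

# Row(s) containing every expected column name, in any order.
mask = df.isin(expected).sum(axis=1) == len(expected)
col_index = df.index[mask][0]   # first such row

df.columns = df.iloc[col_index].to_numpy()  # set valid columns
df = df.iloc[col_index + 1:]                # keep only the data rows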
Ugly but effective quick try:
id_name = df.columns[0]
df_clean = df[(df[id_name] == 'ID') | df[id_name].astype(str).str.isnumeric()]  # keep the header row plus numeric-ID rows
I have a df that looks like this:
index life_stage
1 Early Childhood
2 Birth
3 Infancy
...
The life_stage column is not ordered correctly and I cannot rely on alphabetical order.
The correct order would be
Birth
Infancy
Early Childhood
Is it possible to sort the life stage column according to an order that I specify in Pandas?
Let's convert "life_stage" into an ordered categorical column using pd.Categorical:
df['life_stage'] = pd.Categorical(
    df['life_stage'],
    categories=['Birth', 'Infancy', 'Early Childhood'],
    ordered=True
)
Note the order in which I specify the categories to pd.Categorical. Now, call sort_values using life_stage:
df.sort_values(by=['life_stage'])
index life_stage
2 1 Birth
1 2 Infancy
0 3 Early Childhood
For reference, sorting "life_stage" alphabetically gets you
index life_stage
2 1 Birth
0 3 Early Childhood # wrong!
1 2 Infancy
IIUC, you want pd.Categorical with order:
s = pd.Categorical(['Infancy', 'Birth', 'Early Childhood'],
                   categories=['Birth', 'Infancy', 'Early Childhood'],
                   ordered=True)
s.sort_values()
Output:
[Birth, Infancy, Early Childhood]
Categories (3, object): [Birth < Infancy < Early Childhood]
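If you'd rather not change the column's dtype at all, a sketch of an alternative using the key argument of sort_values (an assumption here: pandas >= 1.1, where key was added):
order = ['Birth', 'Infancy', 'Early Childhood']
rank = {stage: i for i, stage in enumerate(order)}

# Sort by each stage's position in the specified order, leaving the dtype untouched.
df.sort_values(by='life_stage', key=lambda s: s.map(rank))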
I have two datasets, say df1 and df:
df1
df1 = pd.DataFrame({'ids': [101,102,103],'vals': ['apple','java','python']})
ids vals
0 101 apple
1 102 java
2 103 python
df
df = pd.DataFrame({'TEXT_DATA': [u'apple a day keeps doctor away', u'apple tree in my farm', u'python is not new language', u'Learn python programming', u'java is second language']})
TEXT_DATA
0 apple a day keeps doctor away
1 apple tree in my farm
2 python is not new language
3 Learn python programming
4 java is second language
What I want to do is update the column values based on the filtered data and map each match to a new column, such that my output is:
TEXT_DATA NEW_COLUMN
0 apple a day keeps doctor away 101
1 apple tree in my farm 101
2 python is not new language 103
3 Learn python programming 103
4 java is second language 102
I tried matching using
df[df['TEXT_DATA'].str.contains("apple")]
Is there any way I can do this?
You could do something like this:
my_words = {'python': 103, 'apple': 101, 'java': 102}
for word in my_words.keys():
    df.loc[df['TEXT_DATA'].str.contains(word, na=False), 'NEW_COLUMN'] = my_words[word]
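Rather than hard-coding the mapping, you could presumably build it straight from df1:
# Same mapping, derived from df1 instead of typed by hand.
my_words = dict(zip(df1['vals'], df1['ids']))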
First, you need to extract the values in df1['vals']. Then, create a new column and add the extraction result to the new column. And finally, merge both dataframes.
extr = '|'.join(x for x in df1['vals'])
df['vals'] = df['TEXT_DATA'].str.extract('('+ extr + ')', expand=False)
newdf = pd.merge(df, df1, on='vals', how='left')
To select just the relevant fields in the result, index with the column names:
newdf[['TEXT_DATA','ids']]
You could use a cartesian product of both dataframes and then select the relevant rows and columns.
tmp = df.assign(key=1).merge(df1.assign(key=1), on='key').drop(columns='key')
resul = tmp.loc[tmp.apply(func=lambda x: x.vals in x.TEXT_DATA, axis=1)] \
           .drop(columns='vals').reset_index(drop=True)
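On newer pandas (an assumption: version >= 1.2, which added how='cross'), the dummy key column isn't needed to get the same cartesian product:
tmp = df.merge(df1, how='cross')  # cartesian product without the helper key
resul = tmp.loc[tmp.apply(func=lambda x: x.vals in x.TEXT_DATA, axis=1)] \
           .drop(columns='vals').reset_index(drop=True)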
nocity.head()
user_id business_id stars
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5
withcity.head()
business_id city
0 YDf95gJZaq05wvo7hTQbbQ Richmond Heights
1 mLwM-h2YhXl2NCgdS84_Bw Charlotte
2 v2WhjAB3PIBA8J8VxG3wEg Toronto
3 CVtCbSB1zUcUWg-9TNGTuQ Scottsdale
4 duHFBe87uNSXImQmvBh87Q Phoenix
The nocity dataframe has business_id values (they may repeat, since it also holds the rating each user_id gave for each business_id).
The withcity dataframe has the city associated with each business_id
The result I want is:
This is going to be super hard to word:
I want to look up the city associated with each business_id from the withcity dataframe and create a new column in nocity called cityname, which now has the city name associated with that business_id
Why I gave up trying and came here
I know this can be performed with some sort of join operation, but I don't understand which one exactly. I looked them up online and got a little confused about what would happen if some business_id wasn't available in both dataframes when performing the join.
For example:
withcity has some business_id with some city value, and when performing whichever join is appropriate with nocity, it does not find that particular business_id.
So I came here for help.
What other alternative did I try?
area_dict = dict(zip(withcity.business_id, withcity.city))
emptylist = []
for rows in nocity['business_id']:
    for key, value in area_dict.items():
        if key == rows:
            emptylist.append(value)
I created a dictionary holding the business_id and city from the withcity dataframe and performed a matching comparison against the nocity dataframe.
But my method will probably take a lot of time, since there are 4.7 million records to be exact.
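As an aside, that nested loop can be collapsed into a single vectorized lookup with Series.map, which also leaves NaN where a business_id has no match rather than silently skipping it. A sketch, reusing the area_dict built above:
# Vectorized equivalent of the nested loop, using the same area_dict.
nocity['cityname'] = nocity['business_id'].map(area_dict)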
IIUC merge
nocity.merge(withcity,on='business_id',how='left')
Out[855]:
user_id business_id stars city
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5 NaN
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3 NaN
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3 NaN
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4 NaN
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5 NaN
In general, whenever you have a situation like this, you want to avoid loops and iteration and instead perform a merge. Afterwards, you massage the data to fit your needs. For example, Wen's solution is the most apt way to do this.
However, there are a few things I would add. Take the two dfs from the question, calling the first and second ones nocity and withcity respectively.
You want to do:
nocity.merge(withcity, on='business_id', how='left')
However, if you end up getting NaN values, as Wen did above, check the datatypes of your keys.
That is, if the business_id field in nocity were int (for some reason) while the business_id field in withcity were str, then pandas would have trouble matching the keys and you would get NaN values instead of the desired city name.
To check, you would do:
#for all datatypes in the nocity df
print(nocity.dtypes)
#or just for the field's dtypes
print(nocity.business_id.dtypes)
Then, if they differ, you would convert them to a common datatype like str:
#example conversion of pandas column (series) to different datatype
nocity.business_id = nocity.business_id.astype(str)
withcity.business_id = withcity.business_id.astype(str)
#then perform merge as usual
nocity = nocity.merge(withcity, on='business_id', how='left')
Hope this helps. Also, don't forget to rename 'city' to 'cityname' if that is what you prefer:
nocity = nocity.rename(columns={'city': 'cityname'})