Search for index value in a dataframe - python

I am looking to set different conditions depending on the index value.
I have the following index values:
country
Uk
Us
Es
In
It
Ge
Ho
where country is an index in my dataframe.
I would need to do the following
if index value is equal to 'Uk' then do something;
if index value is equal to 'Us' then do something else;
and so on.
I have tried as follows
if df.index.isin(['Us']) or df.isin(['Uk']):
    stop_words = stopwords.words('english')
if df.index.isin(['Es']):
    stop_words = stopwords.words('spanish')
but it is the wrong approach. I am not familiar with indices in pandas dataframes as I have always used columns.
Help and suggestions are appreciated.

You can assign based on the index of your dataframe using .loc[]:
df.loc['Us', 'stop_words'] = 'english'
df.loc['Uk', 'stop_words'] = 'english'
df.loc['Es', 'stop_words'] = 'spanish'
This example will create a new column stop_words with english or spanish depending on the index.
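A minimal sketch of the same idea, assuming a frame indexed by country with the codes from the question: instead of branching with if/else per value, map every index value to a stopword language in one step with Index.map.

```python
import pandas as pd

# Hypothetical frame indexed by country codes, mirroring the question's setup
df = pd.DataFrame({'text': ['hello', 'world', 'hola']},
                  index=pd.Index(['Uk', 'Us', 'Es'], name='country'))

# Map each index value to a stopword language; unmapped codes become NaN
lang_map = {'Uk': 'english', 'Us': 'english', 'Es': 'spanish'}
df['stop_words'] = df.index.map(lang_map)
print(df)
```

The stop_words column can then be passed to stopwords.words row by row.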

Related

How to clean dataframe column filled with names using Python?

I have the following dataframe:
df = pd.DataFrame(columns=['Name'])
df['Name'] = ['Aadam', 'adam', 'AdAm', 'adammm', 'Adam.', 'Bethh', 'beth.', 'beht', 'Beeth', 'Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame(columns=['Cleaned Names'])
ref['Cleaned Names'] = ['adam', 'beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst = ['adam', 'beth']
out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# Finally:
df['Name corrected'] = df['Name corrected'].ffill()
# but in certain conditions ffill() can give you wrong values
Explanation:
lst = ['adam', 'beth']
# created a list of words
out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
# Check whether the 'Name' column contains each word in the list, one at a time. str.contains gives a boolean Series, and map turns True into that list element and False into NaN. Concatenating the resulting Series on axis=1 produces a DataFrame.
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# Backward-fill values along axis=1 and take the first column
# Finally:
df['Name corrected'] = df['Name corrected'].ffill()
# Forward-fill the missing values
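Since the question mentions fuzzy matching, here is a minimal sketch of that route using only the standard-library difflib; the 0.6 cutoff is an assumption you would tune for your data.

```python
import difflib
import pandas as pd

df = pd.DataFrame({'Name': ['Aadam', 'adam', 'AdAm', 'adammm', 'Adam.',
                            'Bethh', 'beth.', 'beht', 'Beeth', 'Beth']})
ref = ['adam', 'beth']  # the reference table of cleaned names

def closest(name, choices=ref):
    # Lowercase first, then take the best fuzzy match above the cutoff;
    # fall back to the raw name when nothing is close enough
    matches = difflib.get_close_matches(name.lower(), choices, n=1, cutoff=0.6)
    return matches[0] if matches else name

df['Name Corrected'] = df['Name'].apply(closest)
print(df)
```

Unlike the substring approach, this also catches transpositions like 'beht', so no ffill pass is needed.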

Extracting top-N occurrences in a grouped dataframe using pandas

I've been trying to find out the top-3 highest-frequency restaurant names under each type of restaurant.
The columns are:
rest_type - Column for the type of restaurant
name - Column for the name of the restaurant
url - Column used for counting occurrences
This was the code that ended up working for me after some searching:
df_1 = df.groupby(['rest_type', 'name']).agg('count')
datas = (df_1.groupby(['rest_type'], as_index=False)
             .apply(lambda x: x.sort_values(by='url', ascending=False).head(3))['url']
             .reset_index()
             .rename(columns={'url': 'count'}))
The final output was as follows:
I had a few questions pertaining to the above code:
How are we able to groupby using rest_type again for the datas variable after grouping by it earlier? Should it not give a missing-column error? The second groupby operation is a bit confusing to me.
What does the first generated column level_0 signify? I tried the code with as_index=True and it created an index and a column pertaining to rest_type, so I couldn't reset the index. Output below:
Thank you
You can use groupby a second time because rest_type is present in the index, which groupby recognizes.
level_0 comes from the reset_index command because your index is unnamed.
That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:
import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20),  # looks like this is a unique identifier
                   })

def tops(s, n=3):
    return s.value_counts().sort_values(ascending=False).head(n)

df.groupby('rest_type')['name'].apply(tops, n=3)
edit: here is an alternative to format the result as a dataframe with informative column names
(df.groupby('rest_type')
.apply(lambda x: x['name'].value_counts().nlargest(3))
.reset_index().rename(columns={'name': 'counts', 'level_1': 'name'})
)
I have a similar case where the above query only partially works. In my case the cooccurrence value always comes out as 1.
Here is my input data frame.
And my query is below:
top_five_family_cooccurence_df = (
    common_top25_cooccurance1_df.groupby('family')
    .apply(lambda x: x['related_family'].value_counts().nlargest(5))
    .reset_index()
    .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'})
)
I am getting the result below, where the cooccurrence is always 1.
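A likely explanation, assuming each family/related_family pair occupies a single row in your frame: value_counts() counts rows, so unique pairs always yield 1. If the frame already carries a precomputed count column (the column names below are assumptions), you would rank by that column instead, for example:

```python
import pandas as pd

# Hypothetical input where each family/related_family pair sits on one row
# with a precomputed cooccurrence count (names assumed, not from the question)
df = pd.DataFrame({'family': ['A', 'A', 'B'],
                   'related_family': ['X', 'Y', 'X'],
                   'cooccurrence': [10, 3, 7]})

# value_counts() would count rows (always 1 here); rank by the existing
# count column instead: sort once, then keep the top rows per family
top = (df.sort_values('cooccurrence', ascending=False)
         .groupby('family').head(5))
print(top)
```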

How to replace a string that is a part of a dataframe with a list in pandas?

I am a beginner at coding, and since this is a very simple question, I know there must be answers out there. However, I've searched for about half an hour, typing countless queries into Google, and it has all flown over my head.
Let's say I have a dataframe with columns "Name" and "Hobbies" and 2 people, so 2 rows. Currently, I have the hobbies as strings in the form "hobby1, hobby2". I would like to change this into ["hobby1", "hobby2"]:
hobbies_as_string = df.iloc[0, 2]
hobbies_as_list = hobbies_as_string.split(',')
df.iloc[0, -2] = hobbies_as_list
However, this fails with an error: ValueError: Must have equal len keys and value when setting with an iterable. I don't understand why, if I get hobbies_as_string as a copy, I'm able to assign the hobbies column as a list no problem. I'm also able to assign df.iloc[0, -2] a string, such as "Hey", and that works fine. I guess it has to do with the ValueError. Why won't pandas let me assign it as a list?
Thank you very much for your help and explanation.
Use the "at" method to replace a value with a list:
import pandas as pd

# create a dataframe
df = pd.DataFrame(data={'Name': ['Stinky', 'Lou'],
                        'Hobbies': ['Shooting Sports', 'Poker']})
# replace Lou's hobby of poker with a list of degen hobbies with the at method
df.at[1, 'Hobbies'] = ['Poker', 'Ponies', 'Dice']
Are you looking to apply a split row-wise to each value into a list?
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Kate'],
                   'Hobbies': ["Hobby1, Hobby2", "Hobby2, Hobby3"]})
df['Hobbies'] = df['Hobbies'].apply(lambda x: x.split(','))
df
Or, if you are not a big lambda user, you can do str.split() on the entire column, which is easier:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Kate'],
                   'Hobbies': ["Hobby1, Hobby2", "Hobby2, Hobby3"]})
df['Hobbies'] = df['Hobbies'].str.split(",")
df
Output:
Name Hobbies
0 John [Hobby1, Hobby2]
1 Kate [Hobby2, Hobby3]
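One caveat with the plain split above: splitting "Hobby1, Hobby2" on "," leaves a leading space on the second element (' Hobby2'), which the printed output hides. A small sketch that splits on the comma plus any surrounding whitespace instead:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Kate'],
                   'Hobbies': ["Hobby1, Hobby2", "Hobby2, Hobby3"]})
# split on the comma together with any surrounding whitespace so the
# resulting list elements carry no stray leading spaces
df['Hobbies'] = df['Hobbies'].str.split(r'\s*,\s*')
print(df['Hobbies'][0])
```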
Another way of doing it:
df = pd.DataFrame({'hobbiesStrings': ['"hobby1, hobby2"']})
df
Replace comma-plus-whitespace with "," and put the hobbiesStrings values in a list:
x = df.hobbiesStrings.str.replace(r'\,\s+', '","', regex=True).values.tolist()
x
Here I use a regular expression: basically I am replacing a comma \, followed by whitespace \s+ with ",".
Rewrite the column using df.assign:
df = df.assign(hobbies_stringsnes=[x])
Chained together:
df = df.assign(hobbies_stringsnes=[df.hobbiesStrings.str.replace(r'\,\s+', '","', regex=True).values.tolist()])
df
Output

how to get range of index of pandas dataframe

What is the most efficient way to get the range of indices whose corresponding column content satisfies a condition, like rows starting with a <body> tag and ending with a </body> tag?
For example, the data frame looks like this:
I want to get the row index 1-3
Can anyone suggest the most pythonic way to achieve this?
import pandas as pd

df = pd.DataFrame([['This is also a interesting topic', 2],
                   ['<body> the valley of flowers ...', 1],
                   ['found in the hilly terrain', 5],
                   ['we must preserve it </body>', 6]],
                  columns=['description', 'count'])
print(df.head())
What condition are you looking to satisfy?
import pandas as pd

df = pd.DataFrame([['This is also a interesting topic', 2],
                   ['<body> the valley of flowers ...', 1],
                   ['found in the hilly terrain', 5],
                   ['we must preserve it </body>', 6]],
                  columns=['description', 'count'])
print(df)
print(len(df[df['count'] != 2].index))
Here, df['count'] != 2 subsets the df, and len(df.index) returns the length of the index.
Updated; note that I used str.contains(), rather than explicitly looking for starting or ending strings.
df2 = df[(df.description.str.contains('<body>') | (df.description.str.contains('</body>')))]
print(df2)
print(len(df2.index))
help from: Check if string is in a pandas dataframe
You can also find the indices of the start and end rows, then take the rows between them to get all the content in between:
start_index = df[df['description'].str.contains("<body>") == True].index[0]
end_index = df[df['description'].str.contains("</body>") == True].index[0]
print(df["description"][start_index:end_index + 1].sum())
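As an alternative sketch, the same inclusive row range can be selected with cumulative boolean masks, avoiding any hard-coded positions:

```python
import pandas as pd

df = pd.DataFrame([['This is also a interesting topic', 2],
                   ['<body> the valley of flowers ...', 1],
                   ['found in the hilly terrain', 5],
                   ['we must preserve it </body>', 6]],
                  columns=['description', 'count'])

# The mask turns on at the row containing <body> and off after the row
# containing </body>, selecting everything in between inclusively
started = df['description'].str.contains('<body>').cumsum() > 0
ended = df['description'].str.contains('</body>').cumsum().shift(fill_value=0) > 0
between = df[started & ~ended]
print(between.index.tolist())
```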

creating a dataframe from a variable length text string

I am new to numpy and pandas. I am trying to add the words and their indexes to a dataframe. The text string can be of variable length.
text = word_tokenize('this string can be of variable length')
df2 = pd.DataFrame({'index': np.array([]), 'word': np.array([])})
for i in text:
    for i, row in df2.iterrows():
        word_val = text[i]
        index_val = text.index(i)
        df2.set_value(i, 'word', word_val)
        df2.set_value(i, 'index', index_val)
print df2
To create a DataFrame from each word of your string (which can be of any length), you can directly use:
df2 = pd.DataFrame(text, columns=['word'])
Your nltk word_tokenize call gives you a list of words, which provides the column data, and by default pandas takes care of the index.
Just pass the list directly into the DataFrame method:
pd.DataFrame(['i', 'am', 'a', 'fellow'], columns=['word'])
word
0 i
1 am
2 a
3 fellow
I'm not sure you want to name a column 'index'; in this case its values will be the same as the index of the DataFrame itself. Also, it's not good practice to name a column 'index', as you won't be able to access it with the df.column_name syntax, and your code could be confusing to other people.
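A small sketch of that pitfall: with a column literally named 'index', attribute access still resolves to the row axis, so the column is only reachable by key.

```python
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2], 'word': ['this', 'string', 'can']})

# Attribute access resolves to the DataFrame's row axis, not the column,
# so df.index and df['index'] refer to different things
print(type(df.index))        # the axis object (a RangeIndex here)
print(df['index'].tolist())  # the column's values, fetched by key
```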
