List Index Out of Range error - pandas - python

I have two data frames. df1 looks like -
MovieName Actors
lights out Maria Bello
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis
df2 looks like -
ActorName Gender
Tom male
Emily female
Christopher male
I want to add two columns to df1, 'female_actors' and 'male_actors', which contain the count of female and male actors in that particular movie, respectively. Whether an actor is male or female is determined from df2.
Here is what I am doing -
def func(actors, gender):
    actors = [act.split()[0] for act in actors.split('*')]
    n_gender = df2.Gender[df2.Gender==gender][df2.ActorName.isin(actors)].count()
    return n_gender
df1['male_actors'] = df1.Actors.apply(lambda x: func(x, 'male'))
df1['female_actors'] = df1.Actors.apply(lambda x: func(x, 'female'))
This code gives me a list index out of range error.
Please note that -
If a particular name isn't present in gender.csv, don't count it in the total.
If there is just one actor in a movie and that actor isn't present in gender.csv, then its count should be zero.
Result should be -
MovieName Actors male_actors female_actors
lights out Maria Bello 0 0
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis 2 1
Feel free to suggest some other approach.

How about this?
df1['Male'] = df1.Actors.apply(lambda x: len(pd.concat([df2[(df2.ActorName == name.split()[0]) & (df2.Gender == 'male')] for name in x.split('*')])))
df1['Female'] = df1.Actors.apply(lambda x: len(pd.concat([df2[(df2.ActorName == name.split()[0]) & (df2.Gender == 'female')] for name in x.split('*')])))

Using str and join:
d2 = df2.set_index('ActorName')
d1 = df1.set_index('MovieName')
Method 1: split
d1.join(d1.Actors.str.split('*', expand=True).stack() \
        .str.split(expand=True)[0].map(d2.Gender) \
        .groupby(level='MovieName') \
        .value_counts().unstack()).fillna(0).reset_index()
Method 2: extractall
d1.join(d1.Actors.str.extractall(r'((?P<first>[^*]+)\s+(?P<last>[^*]+))') \
        ['first'].map(d2.Gender).groupby(level='MovieName') \
        .value_counts().unstack()).fillna(0).reset_index()
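Since the question invites other approaches, here is a minimal sketch (assuming df1 and df2 are loaded exactly as shown above) that builds a first-name-to-gender dictionary and counts with a plain function; names missing from df2 simply don't count:
gender_map = dict(zip(df2.ActorName, df2.Gender))   # {'Tom': 'male', 'Emily': 'female', ...}

def count_gender(actors, gender):
    # take the first token of each '*'-separated full name, skip blanks,
    # and count only names that appear in the mapping with the requested gender
    firsts = [a.split()[0] for a in actors.split('*') if a.strip()]
    return sum(gender_map.get(f) == gender for f in firsts)

df1['male_actors'] = df1.Actors.apply(lambda x: count_gender(x, 'male'))
df1['female_actors'] = df1.Actors.apply(lambda x: count_gender(x, 'female'))
With the sample data this gives 0/0 for lights out and 2 male, 1 female for legend.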

Related

How to filter and sort specific csv using python

Please help me with a Python script to filter the CSV below.
Below is an example of the CSV dump for which I have done the initial filtration.
Last_name  Gender  Name      Phone      city
Ford       Male    Tom       123        NY
Rich       Male    Robert    21312      LA
Ford       Female  Jessica   123123     NY
Ford       Male    John      3412       NY
Rich       Other   Linda     12312      LA
Ford       Other   James     4321       NY
Smith      Male    David     123123     TX
Rich       Female  Mary      98689      LA
Rich       Female  Jennifer  86860      LA
Ford       Male    Richard   12123      NY
Smith      Other   Daniel    897097     TX
Ford       Other   Lisa      123123123  NY
import re

def gather_info(L_name):
    dump_filename = "~/Documents/name_report.csv"
    LN = []
    with open(dump_filename, "r") as FH:
        for var in FH.readlines():
            if L_name in var:
                final = var.split(",")
                print(final[1], final[2], final[3])
    return LN

if __name__ == "__main__":
    L_name = input("Enter the Last name: ")
    la_name = gather_info(L_name)
With this, I am able to filter by last name. For example, if I choose L_name as Ford, then my output is:
Gender  Name     Phone
Male    Tom      123
Female  Jessica  123123
Male    John     3412
Other   James    4321
Male    Richard  12123
Other   Lisa     22412
I need help extending the script so that it selects each gender and its list of values, performs other functions on them, and then moves on to the next gender and does the same. For example, it first selects the gender Male [Tom, John] and performs the functions, then selects the next gender Female [Jessica] and performs the same functions, and then selects the gender Other [James, Lisa] and performs the same functions.
I would recommend using the pandas module, which allows for easy filtering and grouping of data:
import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv('name_reports.csv')
    L_name = input("Enter the last name: ")
    by_last_name = data[data['Last_name'] == L_name]
    groups = by_last_name.groupby(['Gender'])
    for group_name, group_data in groups:
        print(group_name)
        print(group_data)
Breaking this down into its pieces, the first part is:
data = pd.read_csv('name_reports.csv')
This reads the data from the CSV and places it into a DataFrame.
Second, we have:
by_last_name = data[data['Last_name'] == L_name]
This filters the DataFrame down to the rows whose Last_name equals L_name.
Next, we group the data:
groups = by_last_name.groupby(['Gender'])
This groups the filtered DataFrame by gender.
Then we iterate over the groups. Each iteration yields a tuple with the group name and the DataFrame associated with that group:
for group_name, group_data in groups:
    print(group_name)
    print(group_data)
This loop just prints out the data. To access fields from it, you can use the iterrows function:
for index, row in group_data.iterrows():
    print(row['city'])
    print(row['Phone'])
    print(row['Name'])
And then you can use those for whatever function you want. I would recommend reading the pandas documentation, since depending on the function you plan to use there may be a better way to do it with the library. Here is the link to the library: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
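As a sketch of the "perform other functions per gender" part, you could hand each group to your own function; process_group here is a hypothetical placeholder for whatever work you actually need to do:
def process_group(gender, group_data):
    # hypothetical placeholder: replace the body with your own logic
    names = group_data['Name'].tolist()
    print(gender, names)

for group_name, group_data in groups:
    process_group(group_name, group_data)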
Since you cannot use the pandas module, a method using only the csv module would look like this:
import csv

def has_last_name(row, last_name):
    return row['Last_name'] == last_name

def has_gender(row, current_gender):
    return row['Gender'] == current_gender

if __name__ == '__main__':
    data = None
    genders = ['Male', 'Female', 'Other']
    with open('name_reports.csv') as csvfile:
        data = list(csv.DictReader(csvfile, delimiter=','))
    L_name = input('Enter the Last name: ')
    get_by_last_name = lambda row: has_last_name(row, L_name)
    filtered_by_last_name = list(filter(get_by_last_name, data))
    for gender in genders:
        get_by_gender = lambda row: has_gender(row, gender)
        filtered_by_gender = list(filter(get_by_gender, filtered_by_last_name))
        print(filtered_by_gender)
The important part is the filter built-in function. It takes a function that accepts an item from an iterable and returns a bool. filter takes this function and an iterable and returns an iterator over the items for which the function returns true. The other important part is csv.DictReader, which reads each row of your CSV file as a dictionary, allowing you to access fields by key instead of by index.
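A quick illustration of those two pieces, sketched with a tiny inline CSV instead of the real file:
import csv
import io

sample = io.StringIO("Last_name,Gender,Name\nFord,Male,Tom\nRich,Female,Mary\n")
rows = list(csv.DictReader(sample))                        # each row becomes a dict keyed by the header
fords = filter(lambda r: r['Last_name'] == 'Ford', rows)   # lazy iterator of matching rows
print(list(fords))  # [{'Last_name': 'Ford', 'Gender': 'Male', 'Name': 'Tom'}]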

How to select specific data from the DataFrame after using value_counts()?

I used Python to read a file which contains babies' names, genders and birth years. Now I want to find the names which are used by both boys and girls. I used value_counts() to get the number of appearances of each name, but now I don't know how to extract just those names from all the names.
Here is my code:
import pandas as pd

def names_both(year):
    names = []
    path = 'babynames/yob%d.txt' % year
    columns = ['name', 'sex', 'birth']
    frame = pd.read_csv(path, names=columns)
    frame = frame['name'].value_counts()
    print(frame)
    """if len(names) != 0:
        print(names)
    else:
        print('None')"""
The frame now is like this:
Lou 2
Willie 2
Erie 2
Cora 2
..
Perry 1
Coy 1
Adolphus 1
Ula 1
Emily 1
Name: name, Length: 1889, dtype: int64
Here is the csv:
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
Annie,F,1258
Clara,F,1226
Ella,F,1156
Florence,F,1063
...
Thanks for helping!
Here is a way to count the names given to both girls and boys:
common_girl_and_boys_names = (
    # work name by name
    frame.groupby('name')
    # count the number of sexes recorded for the name and keep those given to both sexes;
    # this boolean ends up in a column called 0
    .apply(lambda x: len(x['sex'].unique()) == 2)
    # the names are now in the index; reset it in order to get the names back as a column
    .reset_index()
    # keep only the names whose column 0 holds True
    .loc[lambda x: x[0], 'name']
)
final_df = (
    # keep only the names common to boys and girls (the series built above)
    frame.loc[frame['name'].isin(common_girl_and_boys_names), :]
    # sex is now useless
    .drop(['sex'], axis='columns')
    # work name by name and sum the number of births
    .groupby('name')
    .sum()
)
You can put those lines after the read_csv call. I hope it is what you want.
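For reference, a more compact variant of the same idea, sketched under the assumption that frame still holds the raw name/sex/birth columns from read_csv (i.e., before it is overwritten by value_counts()):
# names recorded with two distinct sexes
both = frame.groupby('name')['sex'].nunique()
common_names = both[both == 2].index

# total births for those shared names
final_df = (frame[frame['name'].isin(common_names)]
            .groupby('name')['birth'].sum())
print(final_df)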

Removing index from returned data

I have a data frame (df) which looks like:
first_name surname location identifier
0 Fred Smith London FredSmith
1 Jane Jones Bristol JaneJones
I am trying to query a particular field and return it to a variable value using:
value = df.loc[df['identifier'] == query_identifier ,'location']
so where query_identifier is equal to FredSmith, value is returned as:
0 London
How can I remove the 0 so I just have:
London
Try this statement:
value = df.loc[df['identifier'] == "FredSmith" ,'location'].values[0]
This will help you.
If there are multiple values for the same identifier, then:
value = df.loc[df['identifier'] == "FredSmith" ,'location'].values
for df_values in value:
    print(df_values)
This is just an enhancement.
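Two equivalent ways to pull the scalar out, sketched under the assumption that the identifier matches exactly one row:
# positional access to the first matching value
value = df.loc[df['identifier'] == "FredSmith", 'location'].iloc[0]

# .item() returns the scalar and raises if there is not exactly one match
value = df.loc[df['identifier'] == "FredSmith", 'location'].item()
print(value)  # London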

Take list as column values in pandas dataframe

I have a dataframe as below:
Card_x Country Age Code Card_y
S INDIA Adult Garments S,E,D,G,M,A
S INDIA Adult Grocery D,S,G,A,M,E
I have a list as below:
lis1 = ['S', 'D', 'G', 'E', 'M', 'A']
Now I want my dataframe to be as below:
Explanation: group by Card_x, Country, Age and get the lis1 values as "Card_y".
Card_x Country Age Card_y
S INDIA Adult S,D,G,E,M,A
Can I be helped?
Note: the logic for calculating lis1 is below:
lis1 = []
for i in range(len(t)):
    l = df.Card_y.iloc[i].split(',')
    lis1.append(l)
sorted(lis1[0], key=lambda elem: sum(sublist.index(elem) for sublist in lis1) / len(lis1))
Basically, lis1 gets the rank of each Card_y entry for the different "Code" values, takes the average rank, and recomputes the ranking with the smallest average first.
E.g.: S is in 1st rank for Code - Garments and 2nd rank for Code - Grocery, so its average is (1+2)/2 = 1.5.
D is 3rd rank for Code - Garments and 1st rank for Code - Grocery, so its average is (3+1)/2 = 2.
Now, based on the averages, I get the ranked list ordered by smallest average.
So it will be S,D,G,E,M,A.
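A minimal standalone sketch of that averaging, using the two Code rankings from the example (0-based positions; the resulting order is the same as with 1-based ranks):
ranks = [['S', 'E', 'D', 'G', 'M', 'A'],   # ranking for Code - Garments
         ['D', 'S', 'G', 'A', 'M', 'E']]   # ranking for Code - Grocery

# average position of each card across the ranked lists
avg_rank = {c: sum(r.index(c) for r in ranks) / len(ranks) for c in ranks[0]}
ordered = sorted(ranks[0], key=avg_rank.get)
print(','.join(ordered))  # S,D,G,E,M,A (ties keep the order of the first list)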
Try:
df_out = df.groupby(['Card_x','Country','Age'])['Card_y'].apply(lambda x: x.str.split(',', expand=True)
                                                                           .rename(columns=lambda x: x+1)
                                                                           .stack().reset_index(level=1))
df_out = df_out.groupby(['Card_x','Country','Age',0])['level_1'].mean().sort_values().reset_index(level=-1)
df_out.groupby(['Card_x','Country','Age'])[0].agg(','.join).rename('Card_y').reset_index()
Output:
Card_x Country Age Card_y
0 S INDIA Adult S,D,G,E,A,M

Compile one DataFrame from a loop sequence of smaller DataFrames

I am looping through a list of 103 FourSquare URLs to find "Coffee Shops."
I can create a DataFrame for each URL and print each DataFrame as I loop through the list (sample output at bottom).
I cannot figure out how to append the DataFrame for each URL into a single DataFrame as I loop through the list. My goal is to compile a single DataFrame from the DataFrames I am printing.
import requests
from pandas import json_normalize  # in older pandas versions: from pandas.io.json import json_normalize

# URLs is the predefined list of 103 FourSquare API URLs mentioned above
x = 0
while x < 103:
    results = requests.get(URLs[x]).json()

    def get_category_type(row):
        try:
            categories_list = row['categories']
        except:
            categories_list = row['venue.categories']
        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']

    venues = results['response']['groups'][0]['items']
    nearby_venues = json_normalize(venues)  # flatten JSON

    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues = nearby_venues.loc[:, filtered_columns]

    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

    dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
    print(x, '!!!', dfven, '\n')
    x = x + 1
Here is some output (I do get complete results):
0 !!! name categories lat lng
5 Tim Hortons Coffee Shop 43.80200 -79.198169
8 Tim Hortons / Esso Coffee Shop 43.80166 -79.199133
1 !!! Empty DataFrame
Columns: [name, categories, lat, lng]
Index: []
2 !!! name categories lat lng
5 Starbucks Coffee Shop 43.770367 -79.186313
18 Tim Hortons Coffee Shop 43.769591 -79.187081
3 !!! name categories lat lng
0 Starbucks Coffee Shop 43.770037 -79.221156
4 Country Style Coffee Shop 43.773716 -79.207027
I apologize if this is bad form or a breach of etiquette, but I solved my problem and figured I should post. Perhaps making the effort to state the problem for Stack Overflow helped me solve it?
First I learned how to ignore empty DataFrames:
dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
if not dfven.empty:
Once I added this code, my printed output was a clean series of identically formatted data frames, so appending them into one data frame was easy. I created a data frame at the beginning of my code (merge = pd.DataFrame()) and then added this line where I was printing:
merge = merge.append(dfven)
Now my output is perfect.
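As a side note, DataFrame.append has been removed in pandas 2.0; a version-proof sketch of the same pattern collects the per-URL frames in a list and concatenates once at the end:
frames = []                      # before the loop
# ... inside the loop, instead of merge.append(dfven):
if not dfven.empty:
    frames.append(dfven)
# ... after the loop:
merge = pd.concat(frames, ignore_index=True)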
