Compile one DataFrame from a loop sequence of smaller DataFrames - python

I am looping through a list of 103 FourSquare URLs to find "Coffee Shops."
I can create a DataFrame for each URL and print each DataFrame as I loop through the list (sample output at bottom).
I cannot figure out how to append the DataFrame for each URL into a single DataFrame as I loop through the list. My goal is to compile a single DataFrame from the DataFrames I am printing.
x = 0
while x < 103:
    results = requests.get(URLs[x]).json()

    def get_category_type(row):
        try:
            categories_list = row['categories']
        except:
            categories_list = row['venue.categories']
        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']

    venues = results['response']['groups'][0]['items']
    nearby_venues = json_normalize(venues)  # flatten JSON
    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues = nearby_venues.loc[:, filtered_columns]
    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
    print(x, '!!!', dfven, '\n')
    x = x + 1
Here is some output (I do get complete results):
0 !!! name categories lat lng
5 Tim Hortons Coffee Shop 43.80200 -79.198169
8 Tim Hortons / Esso Coffee Shop 43.80166 -79.199133

1 !!! Empty DataFrame
Columns: [name, categories, lat, lng]
Index: []

2 !!! name categories lat lng
5 Starbucks Coffee Shop 43.770367 -79.186313
18 Tim Hortons Coffee Shop 43.769591 -79.187081

3 !!! name categories lat lng
0 Starbucks Coffee Shop 43.770037 -79.221156
4 Country Style Coffee Shop 43.773716 -79.207027

I apologize if this is bad form or a breach of etiquette, but I solved my problem and figured I should post. Perhaps making the effort to state the problem for Stack Overflow helped me solve it?
First I learned how to ignore empty DataFrames:
dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
if not dfven.empty:
Once I added this check, my printed output became a clean series of identically formatted DataFrames, so appending them into one was easy. I created an empty DataFrame at the beginning of my code (merge = pd.DataFrame()) and then added this line where I was printing:
merge = merge.append(dfven)
Now my output is perfect.
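One note for readers on newer pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0. The same pattern can be written by collecting the per-URL frames in a list and concatenating once; a minimal sketch, assuming dfven is built exactly as in the loop above:

import pandas as pd

frames = []  # collect each non-empty filtered DataFrame as the loop runs
for x in range(103):
    # ... build dfven for URLs[x] as in the loop above ...
    if not dfven.empty:
        frames.append(dfven)

merge = pd.concat(frames, ignore_index=True)  # a single concat is also faster than repeated appends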

Related

Iterate values over a new column

I have a webpage whose values I would like to extract and store in separate columns. Furthermore, I want to extract the movie title and insert it as a new column, repeated across all the rows that were collected from that title's page.
For example (expected output):
Location Name Latitude Longitude \
0 1117 Broadway (Gil's Music Shop) 47.252495 -122.439644
1 2715 North Junett St (Kat and Bianca's House) 47.272591 -122.474480
2 Aurora Bridge 47.646713 -122.347435
3 Buckaroo Tavern (closed) 47.657841 -122.350327
movie
0 10-things-i-hate-about-you-locations-250
1 10-things-i-hate-about-you-locations-250
2 10-things-i-hate-about-you-locations-250
3 10-things-i-hate-about-you-locations-250
.
.
.
What I have tried:
url = ['https://www.latlong.net/location/10-cloverfield-lane-locations-553',
       'https://www.latlong.net/location/10-things-i-hate-about-you-locations-250',
       'https://www.latlong.net/location/12-angry-men-locations-818']

url_test = []
for i in range(0, len(test), 1):
    df = pd.read_html(test[i])[0]
    df['movie'] = test[i].split('/')[-1]
However, this gives only the output:
Location Name Latitude Longitude \
0 New York City 40.742298 -73.982559
1 New York County Courthouse 40.714310 -74.001930
movie
0 12-angry-men-locations-818
1 12-angry-men-locations-818
This is missing the rest of the results.
I get the feeling it's because the data is split in the pandas dataframe, so I have tried merging before appending the columns using:
url_test = []
for i in range(0, len(test), 1):
    df = pd.read_html(test[i])[0]
    df = pd.merge(df, how='inner')
    df['movie'] = test[i].split('/')[-1]
But I get the following error:
TypeError: merge() missing 1 required positional argument: 'right'
Try collecting each frame in a list inside the loop and concatenating once at the end (the original loop rebuilt df on every pass, so only the last page survived):
test = ['https://www.latlong.net/location/10-cloverfield-lane-locations-553',
        'https://www.latlong.net/location/10-things-i-hate-about-you-locations-250',
        'https://www.latlong.net/location/12-angry-men-locations-818']

url_test = []
for i in range(0, len(test), 1):
    df = pd.read_html(test[i])[0]
    df['movie'] = test[i].split('/')[-1]
    url_test.append(df)

final_df = pd.concat(url_test, ignore_index=True)
print(final_df)

convert scraped list to pandas dataframe using columns and index

The process scrapes data from each URL (looping over all the given links) and looks like:
for url in urls:
    page = requests.get(url)
    # fetch and process the page here, acquiring one car's info per page
    print(car.name)
    print(car_table)
and the output:
BMW
['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner','']
FORD
['color','','weight','','height','','width','','serial','','owner','']
HONDA
['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner','']
At the end, how can I get a dataframe like the one below, considering that I don't know the number of car fields (columns) or the number of cars (index) in advance, but want the df defined with them as columns and index?
print(car_df)
-----|color|weight|height|width|serial|owner
BMW |red 50 120 200
FORD |
HONDA|blue 60 160 OMEGA
any help appreciated :)
This approach is to create a list of dicts as we iterate through the urls, and then after the loop convert it to a DataFrame. I'm assuming that car_table is always a column name followed by its value, over and over again.
import pandas as pd
import numpy as np

# Creating lists from your output instead of requesting from the url since you didn't share that
car_names = ['BMW', 'FORD', 'HONDA']
car_tables = [
    ['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner',''],
    ['color','','weight','','height','','width','','serial','','owner',''],
    ['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner',''],
]
urls = range(len(car_names))

all_car_data = []
for url in urls:
    car_name = car_names[url]    # using car_name instead of car.name for this example
    car_table = car_tables[url]  # again, you get this value some other way

    car_data = {'name': car_name}
    columns = car_table[::2]   # starting from 0, skip every other entry to just get the columns
    values = car_table[1::2]   # starting from 1, skip every other entry to just get the values
    # Zip the columns together with the values, then iterate and update the dict
    for col, val in zip(columns, values):
        car_data[col] = val
    # Add the dict to a list to keep track of all the cars
    all_car_data.append(car_data)

# Convert to a dataframe
df = pd.DataFrame(all_car_data)
# df = df.replace({'': np.nan})  # you can use this if you want to replace the '' with NaNs
df
Output:
    name color weight height width serial owner
0    BMW   red   50kg  120cm 200cm
1   FORD
2  HONDA  blue   60kg        160cm  OMEGA
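If you also want the car names as the row index, matching the desired layout in the question, one extra line on the df built above should do it:

df = df.set_index('name')  # car names become the index instead of a column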

My headers are in the first column of my txt file. I want to create a Pandas DF

Sample data from text file
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email = michael.toole@123.ie
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo@sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee@email.com
[User]
Wondering if someone could help me; you can see my sample dataset above. What I would like to do (please tell me if there is a more efficient way) is to loop through the first column and, wherever one of the unique IDs occurs (e.g. first_name, last_name, role, etc.), append the value in the corresponding row to that ID's list, doing this for each unique ID so that I'm left with the below.
I have read about multi-indexing, and I'm not sure if that might be a better solution, but I couldn't get it to work (I'm quite new to Python).
# Define a list of selected persons
selectedList = textFile
# Define a list of searching person
searchList = ['uid']
# Define an empty list
foundList = []
# Iterate each element from the selected list
for index, sList in enumerate(textFile):
    # Match the element with the element of searchList
    if sList in searchList:
        # Store the value in foundList if the match is found
        foundList.append(selectedList[index])
You have a text file where each record starts with a [User] line and the data lines have a key=value format. I know of no module able to handle that automatically, but it is easy to parse by hand. Code could be:
import pandas as pd

with open('file.txt') as fd:
    data = []  # a list of records
    for line in fd:
        line = line.strip()  # strip end of line
        if line == '[User]':  # new record
            row = {}  # row will be a key: value dict
            data.append(row)
        else:
            k, v = line.split('=', 1)  # split on the = character
            row[k] = v

df = pd.DataFrame(data)  # list of key: value dicts => dataframe
With the sample data shown, we get:
employeeNo last_name first_name language email department role email Location
0 123 Toole Michael english michael.toole@123.ie Marketing Marketing Lead NaN NaN
1 456 Ronaldo Juan Spanish NaN Data Science Team Lead juan.ronaldo@sms.ie Spain
2 998 Lee Damian english NaN NaN NaN damian.lee@email.com NaN
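Note the duplicated email column in that output: one record in the sample writes 'email = ...' with spaces around the = sign, so the raw keys 'email ' and 'email' are treated as distinct. Stripping whitespace after the split avoids this; a small adjustment to the loop body above:

k, v = line.split('=', 1)   # split on the = character
row[k.strip()] = v.strip()  # normalize keys/values so 'email ' and 'email' collapse into one column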
I'm sure there is a more optimal way to do this, but one approach is to get a unique list of row names, extract each group in a loop, and combine the groups into a new dataframe, finally updating it with the desired column names.
import pandas as pd
import numpy as np
import io

data = '''
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email=michael.toole@123.ie
department=Marketing
role="Marketing Lead"
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo@sms.ie
department="Data Science"
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee@email.com
[User]
'''

df = pd.read_csv(io.StringIO(data), sep='=', comment='[', header=None)
new_cols = df[0].unique()
new_df = pd.DataFrame()
for col in new_cols:
    tmp = df[df[0] == col]
    tmp.reset_index(inplace=True)
    new_df = pd.concat([new_df, tmp[1]], axis=1)
new_df.columns = new_cols
new_df['User'] = None
new_df = new_df[['User', 'employeeNo', 'last_name', 'first_name', 'language', 'email', 'department', 'role', 'Location']]
new_df
User employeeNo last_name first_name language email department role Location
0 None 123 Toole Michael english michael.toole@123.ie Marketing Marketing Lead Spain
1 None 456 Ronaldo Juan Spanish juan.ronaldo@sms.ie Data Science Team Lead NaN
2 None 998 Lee Damian english damian.lee@email.com NaN NaN NaN
Rewrite, after testing showed that the previous version offset values between records:
import pandas as pd

# Revised from previous answer - ensures key value pairs are contained to the same
# record - previous version assumed the first record had all the expected keys -
# inadvertently assigned (Location) value of second record to the first record
# which did not have a Location key
# This version should perform better - only dealing with one single df
# - and using pandas own pivot() function

textFile = 'file.txt'
filter = '[User]'

# Decoration - enabling a check and balance - how many users are we processing?
textFileOpened = open(textFile, 'r')
initialRead = textFileOpened.read()
userCount = initialRead.count(filter)  # sample has 4 [User] entries - but only three actual unique records
print('User Count {}'.format(userCount))

# Create sets so able to manipulate and interrogate
allData = []
oneRow = []
userSeq = 0

# Iterate through file - assign record key and [userSeq] key to each pair
with open(textFile, 'r') as fp:
    for fileLineSeq, line in enumerate(fp):
        if filter in str(line):
            userSeq = userSeq + 1  # ensures each key value pair is grouped
        oneRow = [fileLineSeq, userSeq, line]
        allData.append(oneRow)

df = pd.DataFrame(allData)
df.columns = ['FileRow', 'UserSeq', 'KeyValue']  # rename columns

userSeparators = df[df['KeyValue'] == str(filter + '\n')].index  # locate [User] records
df.drop(userSeparators, inplace=True)  # remove [User] records

df = df.replace(' = ', '=', regex=True)  # input data dirty - cleaning up
df = df.replace('\n', '', regex=True)    # remove the newlines appended during the list generation
# print(df)  # test as necessary here

# split KeyValue column into two
df[['Key', 'Value']] = df.KeyValue.str.split('=', expand=True)

# very powerful function - convert to table
df = df.pivot(index='UserSeq', columns='Key', values='Value')
print(df)
Results
User Count 4
Key Location department email employeeNo first_name language last_name role
UserSeq
1 NaN Marketing michael.toole@123.ie 123 Michael english Toole Marketing Lead
2 Spain Data Science juan.ronaldo@sms.ie 456 Juan Spanish Ronaldo Team Lead
3 NaN NaN damian.lee@email.com 998 Damian english Lee NaN
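If the leftover axis labels (Key across the columns, UserSeq down the index) are unwanted, a final tidy-up along these lines should work on the pivoted frame:

df = df.rename_axis(None, axis=1).reset_index(drop=True)  # drop the 'Key'/'UserSeq' labels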

How to append a list of elements into a single feature of a dataframe?

I have two dataframes: a df of actors with a feature that is a list of movie identifier numbers for films they've worked on, and a df of movies whose identifier numbers will show up in an actor's list if the actor was in that movie.
I've attempted to iterate through the movies dataframe, which does produce results but is too slow.
It seems like iterating through the list of movies from the actors dataframe would result in less looping, but I've been unable to save the results.
Here is the actors dataframe:
print(actors[['primaryName', 'knownForTitles']].head())
primaryName knownForTitles
0 Rowan Atkinson tt0109831,tt0118689,tt0110357,tt0274166
1 Bill Paxton tt0112384,tt0117998,tt0264616,tt0090605
2 Juliette Binoche tt1219827,tt0108394,tt0116209,tt0241303
3 Linda Fiorentino tt0110308,tt0119654,tt0088680,tt0120655
4 Richard Linklater tt0243017,tt1065073,tt2209418,tt0405296
And the movies dataframe:
print(movies[['tconst', 'primaryTitle']].head())
tconst primaryTitle
0 tt0001604 The Fatal Wedding
1 tt0002467 Romani, the Brigand
2 tt0003037 Fantomas: The Man in Black
3 tt0003593 Across America by Motor Car
4 tt0003830 Detective Craig's Coup
As you can see, the movies['tconst'] identifier shows up in a list in the actors dataframe.
My very slow iteration through the movie dataframe is as follows:
def add_cast(movie_df, actor_df):
    results = movie_df.copy()
    length = len(results)
    # create an empty feature
    results['cast'] = ""
    # iterate through the movie identifiers
    for index, value in results['tconst'].iteritems():
        # create a new dataframe containing all the cast associated with the movie id
        cast = actor_df[actor_df['knownForTitles'].str.contains(value)]
        # check to see if the 'primaryName' list is empty
        if len(list(cast['primaryName'].values)) != 0:
            # set the new movie 'cast' feature equal to a list of the cast names
            results.loc[index]['cast'] = list(cast['primaryName'].values)
        # logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
        # delete cast df to free up memory
        del cast
    return results
This generates some results but is not fast enough to be useful. One observation: by creating a new dataframe of all the actors who have the movie identifier in their knownForTitles, that list of names can be put into a single feature of the movies dataframe.
In my attempt to loop through the actors dataframe below, by contrast, I don't seem to be able to append items into the movies dataframe:
def actors_loop(movie_df, actor_df):
    results = movie_df.copy()
    length = len(actor_df)
    # create an empty feature
    results['cast'] = ""
    # iterate through all actors
    for index, value in actor_df['knownForTitles'].iteritems():
        # skip empties
        if str(value) == r"\N":
            logging.warning(f'skipping: {index} with a value of {value}')
            continue
        # generate a list of movies that this actor has been in
        cinemetography = [x.strip() for x in value.split(',')]
        # iterate through every movie the actor has been in
        for movie in cinemetography:
            # pull out the movie info if it exists
            movie_info = results[results['tconst'] == movie]
            # continue if empty
            if len(movie_info) == 0:
                continue
            # set the cast variable equal to the actor name
            results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
            # delete the df to save space ?maybe
            del movie_info
        # logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
    return results
So if I run the above code, I get a very fast result, but the 'cast' field remains empty.
I figured out the problem I was having with the actors_loop(movie_df, actor_df) function. The problem is that
results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
assigns the value to a copy of the results dataframe, so the original is never updated. It would be better to use the df.set_value() method or the df.at[] method.
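For example, that line can be rewritten as a single .at write, which operates on the frame itself rather than on a temporary copy (a sketch; note that .set_value() has since been deprecated and removed from pandas, so .at[] is the safer choice):

# label of the (first) row whose tconst matches this movie
idx = results.index[results['tconst'] == movie][0]
# single-label assignment writes in place instead of to a copy
results.at[idx, 'cast'] = actor_df['primaryName'].loc[index]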
I also figured out a much faster solution to the problem: rather than iterating through two dataframes in nested loops, it is better to iterate once. So I created a list of tuples:
def actor_tuples(actor_df):
    tuples = []
    for index, value in actor_df['knownForTitles'].iteritems():
        cinemetography = [x.strip() for x in value.split(',')]
        for movie in cinemetography:
            tuples.append((actor_df['primaryName'].loc[index], movie))
    return tuples
This created a list of tuples of the following form:
[('Fred Astaire', 'tt0043044'),
('Lauren Bacall', 'tt0117057')]
I then created a mapping of movie identifier numbers to index points (from the movie dataframe), which took this form:
{'tt0000009': 0,
 'tt0000147': 1,
 'tt0000335': 2,
 'tt0000502': 3,
 'tt0000574': 4,
 'tt0000615': 5,
 'tt0000630': 6,
 'tt0000675': 7,
 'tt0000676': 8,
 'tt0000679': 9}
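Building that mapping wasn't shown; a dict comprehension over the movie frame's tconst column would produce it, assuming the movies dataframe has a default RangeIndex:

movie_dict = {tconst: i for i, tconst in enumerate(movies['tconst'])}  # movie id -> row position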
I then used the function below to iterate through the actor tuples, using each movie identifier as the key into the movie dictionary; this returned the correct movie index, which I used to add the actor's name to the target dataframe:
def add_cast(movie_df, Atuples, Mtuples):
    results_df = movie_df.copy()
    results_df['cast'] = ''
    counter = 0
    total = len(Atuples)
    for tup in Atuples:
        # this passes the movie ID into the movie_dict that will return an index
        try:
            movie_index = Mtuples[tup[1]]
            if results_df.at[movie_index, 'cast'] == '':
                results_df.at[movie_index, 'cast'] += tup[0]
            else:
                results_df.at[movie_index, 'cast'] += ',' + tup[0]
        except KeyError:
            pass
        # logging
        counter += 1
        if counter % 1000000 == 0:
            logging.warning(f'Index {counter} out of {total}, {counter/total}% finished')
    return results_df
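Wiring the pieces together might look like this (a sketch, with movie_dict built as in the comprehension above):

atuples = actor_tuples(actors)  # ~16.5 million (name, movie id) tuples
movies_with_cast = add_cast(movies, atuples, movie_dict)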
It ran in 10 minutes (making 2 sets of tuples, then the adding function) for 16.5 million actor tuples. The results are below:
0 tt0000009 Miss Jerry 1894 Romance
1 tt0000147 The Corbett-Fitzsimmons Fight 1897 Documentary,News,Sport
2 tt0000335 Soldiers of the Cross 1900 Biography,Drama
3 tt0000502 Bohemios 1905 \N
4 tt0000574 The Story of the Kelly Gang 1906 Biography,Crime,Drama
cast
0 Blanche Bayliss,Alexander Black,William Courte...
1 Bob Fitzsimmons,Enoch J. Rector,John L. Sulliv...
2 Herbert Booth,Joseph Perry,Orrie Perry,Reg Per...
3 Antonio del Pozo,El Mochuelo,Guillermo PerrĂ­n,...
4 Bella Cola,Sam Crewes,W.A. Gibson,Millard John...
Thank you stack overflow!

How to add entries in Pandas DataFrame?

Basically I have census data of US that I have read in Pandas from a csv file.
Now I have to write a function that finds counties in a specific manner (not gonna explain that because that's not what the question is about) from the table I have gotten from csv file and return those counties.
MY TRY:
What I did: I created lists named for the columns the function has to return, then applied the specific condition in a for loop, using an if-statement to read the entries of all required columns into their respective lists. Next I created a new DataFrame and wanted to read the entries from the lists into it. I tried the same for loop to accomplish this, but in vain; tried making Series out of those lists and passing them as parameters to the DataFrame, still in vain; made DataFrames out of those lists and tried concatenating them with the append() function, but still in vain. Any help would be appreciated.
CODE:
#idxl = list()
#st = list()
#cty = list()
idx2 = 0
cty_reg = pd.DataFrame(columns=('STNAME', 'CTYNAME'))
for idx in range(census_df['CTYNAME'].count()):
    if ((census_df.iloc[idx]['REGION'] == 1 or census_df.iloc[idx]['REGION'] == 2)
            and (census_df.iloc[idx]['POPESTIMATE2015'] > census_df.iloc[idx]['POPESTIMATE2014'])
            and census_df.loc[idx]['CTYNAME'].startswith('Washington')):
        #idxl.append(census_df.index[idx])
        #st.append(census_df.iloc[idx]['STNAME'])
        #cty.append(census_df.iloc[idx]['CTYNAME'])
        cty_reg.index[idx2] = census_df.index[idx]
        cty_reg.iloc[idxl2]['STNAME'] = census_df.iloc[idx]['STNAME']
        cty_reg.iloc[idxl2]['CTYNAME'] = census_df.iloc[idx]['CTYNAME']
        idx2 = idx2 + 1
cty_reg
SAMPLE TABLE:
REGION STNAME CTYNAME
0 2 "Wisconsin" "Washington County"
1 2 "Alabama" "Washington County"
2 1 "Texas" "Atauga County"
3 0 "California" "Washington County"
SAMPLE OUTPUT:
STNAME CTYNAME
0 Wisconsin Washington County
1 Alabama Washington County
I am sorry for my limited knowledge of US states and counties; I just put random state and county names in the sample table to show what I want to get out of it. Thanks for the help in advance.
There are some missing columns in the source DF posted in the OP. However, reading the loop, I don't think the loop is required at all. There are 3 filters required - on REGION, POPESTIMATE2015, and CTYNAME. If I have understood the logic in the OP, then this should be feasible without the loop.
Option 1 - original answer
print df.loc[
    (df.REGION.isin([1, 2])) & \
    (df.POPESTIMATE2015 > df.POPESTIMATE2014) & \
    (df.CTYNAME.str.startswith('Washington')), \
    ['REGION', 'STNAME', 'CTYNAME']]
Option 2 - using and with pd.eval
q = pd.eval("(df.REGION.isin([1,2])) and \
            (df.POPESTIMATE2015 > df.POPESTIMATE2014) and \
            (df.CTYNAME.str.startswith('Washington'))", \
            engine='python')
print df.loc[q, ['REGION', 'STNAME', 'CTYNAME']]
Option 3 - using and with df.query
regions_list = [1, 2]
dfq = df.query("(REGION==@regions_list) and \
               (POPESTIMATE2015 > POPESTIMATE2014) and \
               (CTYNAME.str.startswith('Washington'))", \
               engine='python')
print dfq[['REGION', 'STNAME', 'CTYNAME']]
If I'm reading the logic in your code right, you want to select rows according to the following conditions:
REGION should be 1 or 2
POPESTIMATE2015 > POPESTIMATE2014
CTYNAME needs to start with "Washington"
In general, Pandas makes it easy to select rows based on conditions without having to iterate over the dataframe:
df = census_df[
    ((census_df.REGION == 1) | (census_df.REGION == 2)) & \
    (census_df.POPESTIMATE2015 > census_df.POPESTIMATE2014) & \
    (census_df.CTYNAME.str.startswith('Washington'))
]
Assuming you're selecting rows that satisfy some criterion, let's just say select(row) returns True if the row is selected and False if not. I won't infer what the criterion is because you specifically said it was not important.
And then you wanted the STNAME and CTYNAME of that row.
So here's what you would do:
your_new_df = census_df[census_df.apply(select, axis=1)] \
    .apply(lambda x: x[['STNAME', 'CTYNAME']], axis=1)
This is the one-liner that will get you what you want, provided you write the select function that picks the rows.
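For completeness, select just needs to accept a row and return a boolean; a hypothetical example matching the conditions discussed earlier:

def select(row):
    # hypothetical criteria - substitute whatever your actual rule is
    return row['REGION'] in (1, 2) and row['POPESTIMATE2015'] > row['POPESTIMATE2014']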
