I am trying to carry out an operation on a Pandas DataFrame for my capstone project on the Yelp Dataset Challenge. I have found a way to do it using loops, but given the large dataset I am working with, it takes a long time. (I tried running it for 24 hours, and it still was not complete.)
Is there a more efficient way to do this in Pandas without looping?
Note: business.categories (business is a DataFrame) provides the list of categories a business is in, stored as a string (e.g. "[restaurant, entertainment, bar, nightlife]"). It is written in the format of a list but saved as a string.
# Creates a new DataFrame with businesses as rows and category tags as columns, holding 1 or 0 depending on whether the business is in that category
categories_list = []
# Replaces missing values with the string '[]'. This prevents null errors later in the code.
business.categories = business.categories.fillna('[]')
# Builds a single master list of all categories. Goes through each business's list of categories and adds any unseen values to categories_list
for x in range(len(business)):
    # business.categories stores each value as a string (even though it's formatted just like a list), so this converts it to a list
    categories = eval(str(business.categories[x]))
    # Looks at each category, adding it to categories_list if it's not already there
    for category in categories:
        if category not in categories_list:
            categories_list.append(category)
# Makes the list of categories (and business_id) the columns of the new DataFrame
categories_df = pd.DataFrame(columns = ['business_id'] + categories_list, index = business.index)
# Loops through determining whether or not each business has each category, storing this as a 1 or 0 for that category respectively.
for x in range(len(business)):
    for y in range(len(categories_list)):
        cat = categories_list[y]
        if cat in eval(business.categories[x]):
            categories_df[cat][x] = 1
        else:
            categories_df[cat][x] = 0
# Imports the original business_ids into the new DataFrame. This allows me to cross-reference this DataFrame with my other datasets for analysis
categories_df.business_id = business.business_id
categories_df
Given that the data is stored as list-like strings, I don't think you can avoid looping over the data frame (either explicitly or implicitly, using str methods) at Python speeds. (This seems like an unfortunate way of storing the data. Can it be avoided upstream?) However, I have some ideas for improving the approach. Since you know the resulting index ahead of time, you could immediately start building the DataFrame without knowing all the categories in advance, something like:
categories_df = pd.DataFrame(index=business.index)
for ix, categories in business.categories.items():
    for cat in eval(categories):
        categories_df.loc[ix, cat] = 1
        # if cat is not already in the columns this will add it in, with null values in the other rows
categories_df.fillna(0, inplace=True)
If you know some or all of the categories in advance then adding them as columns initially before the loop should help as well.
Also, you could try doing categories[1:-1].split(', ') instead of eval(categories). A quick test tells me it should be around 15 times faster.
To ensure the same result, you should do
for ix, categories in business.categories.items():
    for cat in categories[1:-1].split(','):
        if cat.strip():  # skips the empty string produced by '[]'
            categories_df.loc[ix, cat.strip()] = 1
to be on the safe side, as you won't know how much white space there might be around the commas. Avoiding much of the nested looping and in statements should speed your programme up considerably.
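As a minimal sketch of the parsing step on its own, the helper below (parse_categories is a hypothetical name, not part of pandas or of the code above) slices off the brackets, splits on commas, and strips whitespace. It assumes category names contain no commas or brackets themselves.

```python
def parse_categories(raw):
    """Split a list-like string such as '[bar, nightlife]' into clean tags."""
    inner = raw.strip()[1:-1]  # drop the surrounding brackets
    # Strip whitespace around each tag and skip empty pieces (e.g. from '[]')
    return [tag.strip() for tag in inner.split(',') if tag.strip()]


parsed = parse_categories('[restaurant,entertainment ,  bar, nightlife]')
empty = parse_categories('[]')
print(parsed)
print(empty)
```

Skipping empty pieces matters here because the question fills missing values with '[]', which would otherwise produce a column named with the empty string.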
I'm not exactly sure what you ultimately want to do, but consider the dataframe business:
business = pd.DataFrame(dict(
    categories=['[cat, dog]', '[bird, cat]', '[dog, bird]']
))
You can convert these strings to lists with
business.categories.str.strip('[]').str.split(', ')
Or even go straight to indicator columns with str.get_dummies:
business.categories.str.strip('[]').str.get_dummies(', ')
   bird  cat  dog
0     0    1    1
1     1    1    0
2     1    0    1
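To connect this back to the question, here is a small sketch that also handles missing values and keeps business_id next to the indicator columns. The toy data and the business_id values are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Toy stand-in for the Yelp business table (values are invented)
business = pd.DataFrame({
    'business_id': ['a1', 'b2', 'c3'],
    'categories': ['[cat, dog]', None, '[dog, bird]'],
})

# Missing values become '[]', which yields an all-zero row after stripping
dummies = (
    business['categories']
    .fillna('[]')
    .str.strip('[]')
    .str.get_dummies(sep=', ')
)

# Put business_id alongside the 0/1 columns for cross-referencing
categories_df = pd.concat([business[['business_id']], dummies], axis=1)
print(categories_df)
```

This assumes the tags are always separated by a comma followed by a single space; if the spacing varies, the strings would need normalising first.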
Related
I have a bunch of keywords stored in a 620x2 pandas dataframe seen below. I think I need to treat each entry as its own set, where semicolons separate elements. So, we end up with 1240 sets. Then I'd like to be able to search how many times keywords of my choosing appear together. For example, I'd like to figure out how many times 'computation theory' and 'critical infrastructure' appear together as a subset in these sets, in any order. Is there any straightforward way I can do this?
Use .loc to find if the keywords appear together.
Do this after you have split the data into 1240 sets. I don't understand whether you want to make new columns or just want to keep the columns as is.
# create a filter for keyword 1
filter_keyword_1 = (df['column_name'].str.contains('critical infrastructure'))
# create a filter for keyword 2
filter_keyword_2 = (df['column_name'].str.contains('computation theory'))
# you can create more filters with the same construction as above.
# To check the number of times both the keywords appear
len(df.loc[filter_keyword_1 & filter_keyword_2])
# To see the dataframe
subset_df = df.loc[filter_keyword_1 & filter_keyword_2]
.loc selects the conditional data frame. You can use subset_df=df[df['column_name'].str.contains('string')] if you have only one condition.
Do the column split or any other processing before you make the filters, or run the filters again after processing.
Not sure if this is considered straightforward, but it works. keyword_list is the list of paired keywords you want to search.
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)
# counts the cells (across both columns) whose set contains every keyword in keyword_list
df.apply(lambda x : x.apply(lambda y : all([kw in y for kw in keyword_list]))).sum().sum()
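For a self-contained illustration of the set-based idea, the sketch below uses invented keyword data and Python's set.issubset in place of the nested lambdas. The column names follow the answer above; everything else is an assumption.

```python
import pandas as pd

# Invented sample: each cell is a semicolon-separated keyword string
df = pd.DataFrame({
    'Author Keywords': ['computation theory; critical infrastructure',
                        'graph theory', None],
    'Index Keywords': ['critical infrastructure;computation theory;security',
                       'computation theory', 'critical infrastructure'],
})
wanted = {'computation theory', 'critical infrastructure'}


def to_set(cell):
    """Turn 'a; b;c' into {'a', 'b', 'c'}, ignoring empty pieces."""
    return {kw.strip() for kw in cell.split(';') if kw.strip()}


# One set per cell, stacked into a single Series (2 columns x 3 rows = 6 sets)
keyword_sets = df.fillna('').stack().map(to_set)

# Count the cells whose set contains every wanted keyword, in any order
together = int(keyword_sets.map(wanted.issubset).sum())
print(together)
```

Unlike str.contains, this matches whole keywords only, so 'computation theory' would not match a cell containing just 'computation theory applications'.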
I have a dataframe, df, shown below. Each row is a story and each column is a word that appears in the corpus of stories. A 0 means the word is absent in the story while a 1 means the word is present.
I want to find which words are present in each story (i.e. col val == 1). How can I go about finding this (preferably without for-loops)?
Thanks!
Assuming you are just trying to look at one story, you can filter for the story (let's say story 34972) and transpose the dataframe with:
df_34972 = df[df.index == 34972].T
and then you can send the values equal to 1 to a list:
# after transposing, the single column is labelled 34972 and the words are the index
[*df_34972[df_34972[34972] == 1].index]
If you are trying to do this for all stories, then you can do this, but it will be a slightly different technique. From the link that SammyWemmy provided, you can melt() the dataframe and filter for 1 values for each story. From there you could .groupby('story_column') which is 'index' (after using reset_index()) in the example below:
df = df.reset_index().melt(id_vars='index')
df = df[df['value'] == 1]  # melt() names the value column 'value' by default
df.groupby('index')['variable'].apply(list)
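The melt-and-group steps can be sketched end to end on a tiny invented story-by-word frame. The story ids and words are made up, and the index is assumed to be unnamed so that reset_index() produces a column called 'index'.

```python
import pandas as pd

# Invented example: rows are stories, columns are words, 1 means present
df = pd.DataFrame(
    {'cat': [1, 0], 'dog': [0, 1], 'fish': [1, 1]},
    index=[34972, 34973],
)

# Long format: one row per (story, word) pair with its 0/1 value
long_df = df.reset_index().melt(id_vars='index')

# Keep only the words that are present, then collect them per story
present = long_df[long_df['value'] == 1]
words_per_story = present.groupby('index')['variable'].apply(list)
print(words_per_story)
```

The result is a Series indexed by story id whose values are lists of words, so words_per_story[34972] gives the words of that one story.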
This is my first post to the coding community, so I hope I get the right level of detail in my request for help!
Background info:
I want to repeat (loop) a command on a df using a variable that contains a list of options. While 'amenity_options' contains a simple list of specific items (say, only the four amenities in the example below), the df is a large data frame with many other items. My goal is to run the operations below for each item in 'amenity_options' until the end of the list.
amenity_options = ['cafe','bar','cinema','casino'] # this is a series type with multiple options
df = df[df['amenity'] == amenity_options] # this is my attempt to select the first value in the series (e.g. cafe) from a dataframe that contains such a column name.
df.to_excel('{}_amenity.xlsx'.format('amenity')) # I wish to save the result (e.g. cafe_amenity) as a separate file.
Desired result: I wish to loop steps one and two for each and every item available in the list (e.g. cafe, bar, cinema...), so that I will have separate Excel files in the end. Any thoughts?
What @Rakesh suggested is correct; you probably just need one more step.
df = df[df['amenity'].isin(amenity_options)]
for key, g in df.groupby('amenity'):
    g.to_excel('{}_amenity.xlsx'.format(key))
After you call groupby() on your df, you will get 4 groups that you can loop over directly.
The key is the group key (cafe, bar, etc.) and g is the sub-dataframe filtered to that key.
Seems like you just need a simple for loop:
for amenity in amenity_options:
    df[df['amenity'] == amenity].to_excel(f"{amenity}_amenity.xlsx")
I'm about to start some Python Data analysis unlike anything I've done before. I'm currently studying numpy, but so far it doesn't give me insight on how to do this.
I'm using python 2.7.14 Anaconda with cx_Oracle to Query complex records.
Each record will be a unique individual with a column for Employee ID, Relationship Tuples (Relationship Type Code paired with Department number, may contain multiple), Account Flags (Flag strings, may contain multiple). (3 columns total)
so one record might be:
[(123456), (135:2345678, 212:4354670, 198:9876545), (Flag1, Flag2, Flag3)]
I need to develop a python script that will take these records and create various counts.
The example record would be counted in at least 9 different counts
How many with relationship: 135
How many with relationship: 212
How many with relationship: 198
How many in Department: 2345678
How many in Department: 4354670
How many in Department: 9876545
How many with Flag: Flag1
How many with Flag: Flag2
How many with Flag: Flag3
The other tricky part of this is that I can't pre-define the relationship codes, departments, or flags. What I'm counting has to be determined by the data retrieved from the query.
Once I understand how to do that, hopefully the next step to also get how many relationship X has Flag y, etc., will be intuitive.
I know this is a lot to ask about, but if someone could just point me in the right direction so I can research or try some tutorials, that would be very helpful. Thank you!
At a minimum, you need to structure this data to analyze it well. You can do that in your database engine or in Python (I will do it in Python, using pandas as SNygard suggested).
First, I create some fake data (based on what you provided):
import pandas as pd
import numpy as np
from ast import literal_eval
data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
        [12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]
df = pd.DataFrame(data,columns=['id','relationships','flags'])
df = df.set_index('id')
df
This returns a dataframe like this:
raw_pandas_dataframe
In order to summarize or count by columns, we need to improve our data structure, in some way that we can apply group by operations with department, relationships or flags.
We will convert our relationships and flags columns from string type to a python list of strings. So, the flags column will be a python list of flags, and the relationships column will be a python list of relations.
# strip the surrounding parentheses, then split on ', ' so no item keeps a leading space
df['relationships'] = df['relationships'].str.strip('()')
df['relationships'] = df['relationships'].str.split(', ')
df['flags'] = df['flags'].str.strip('()')
df['flags'] = df['flags'].str.split(', ')
df
The result is:
dataframe_1
With our relationships column converted to lists, we can create a new dataframe with as many columns as the longest of those lists has relations.
rel = pd.DataFrame(df['relationships'].values.tolist(), index=df.index)
After that we need to stack our columns while preserving the index, so we will use a pandas MultiIndex: the id and the relation column number (0, 1, 2, 3).
relations = rel.stack()
relations.index.names = ['id','relation_number']
relations
We get: dataframe_2
At this point we have all of our relations in rows, but we still can't group by relation type. So we will split our relations data into two columns, relation_type and department, using ':' as the separator.
clear_relations = relations.str.split(':')
clear_relations = pd.DataFrame(clear_relations.values.tolist(), index=clear_relations.index,columns=['relation_type','department'])
clear_relations
The result is
dataframe_3_clear_relations
Our relations are ready to analyze, but our flags structure is still not usable. So we will convert the flag lists to columns and then stack them.
flags = pd.DataFrame(df['flags'].values.tolist(), index=df.index)
flags = flags.stack()
flags.index.names = ['id','flag_number']
The result is dataframe_4_clear_flags
Voilà! It's all ready to analyze.
So, for example, how many relations of each type do we have, and which type is the most common:
clear_relations.groupby('relation_type').agg('count')['department'].sort_values(ascending=False)
We get: group_by_relation_type
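The remaining counts from the question (departments, flags, and "how many with relationship X have flag Y") can be sketched in the same spirit. The following is a self-contained variation on the steps above, not the author's exact code: to_long is a hypothetical helper, and the cross-tabulation via merge and pd.crosstab is an added assumption about what the final step could look like.

```python
import pandas as pd

data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
        [12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]
df = pd.DataFrame(data, columns=['id', 'relationships', 'flags']).set_index('id')


def to_long(column, name):
    """Hypothetical helper: one row per (id, item) from a '(a, b, c)' string column."""
    lists = column.str.strip('()').str.split(', ')
    wide = pd.DataFrame(lists.tolist(), index=column.index)
    # stack() moves items into rows; dropna() removes padding from shorter lists
    long = wide.stack().dropna().rename(name)
    return long.reset_index(level=1, drop=True).reset_index()


rel = to_long(df['relationships'], 'relationship')
rel[['relation_type', 'department']] = rel['relationship'].str.split(':', expand=True)
flags = to_long(df['flags'], 'flag')

# Simple counts; nothing is predefined, the categories come from the data
relation_counts = rel['relation_type'].value_counts()
department_counts = rel['department'].value_counts()
flag_counts = flags['flag'].value_counts()

# Pair every relation of an id with every flag of that id, then cross-tabulate
pairs = rel.merge(flags, on='id')
relation_by_flag = pd.crosstab(pairs['relation_type'], pairs['flag'])
print(relation_by_flag)
```

Each cell of relation_by_flag counts the individuals having both that relationship type and that flag, assuming no individual repeats a relationship type or a flag.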
All code: Github project
If you're willing to consider other packages, take a look at pandas which is built on top of numpy. You can read sql statements directly into a dataframe, then filter.
For example,
import pandas
sql = '''SELECT * FROM <table> WHERE <condition>'''
df = pandas.read_sql(sql, <connection>)
# Your output might look like the following:
#        0                                        1                      2
# 0  12346  (135:2345678, 212:4354670, 198:9876545)  (Flag1, Flag2, Flag3)
# 1  12345  (136:2343678, 212:4354670, 198:9876545)  (Flag1, Flag2, Flag4)
# Format your records into rows
# This part will take some work, and really depends on how your data is formatted
# Do you have repeated values? Are the records always the same size?
# Select only the rows where relationship == 125
rel_125 = df[df['Relationship'] == 125]
The pandas formatting is more in depth than fits in a Q&A, but some good resources are here: 10 Minutes to Pandas.
You can also filter the rows directly, though it may not be the most efficient. For example, the following query selects only the rows where a relationship starts with '212'.
df[df['Relationship'].apply(lambda x: any(y.startswith('212') for y in x))]
I am building an inventory of vegetation types for a given location. Data is passed to me as a CSV and I want a way to automatically re-classify items in one column into broader classes that I provide. I can already read the data in with pandas, do a bit of housekeeping and then write out the data frame to a new file.
However, given I am provided with a column such that:
species = ['maple', 'oak', 'holly', 'sawgrass', 'cat tails'...... 'birch']
I would like to be able to automatically reclassify these into broad categories using another list like:
VegClass = ['Tree', 'Bush', 'Grass']
The only way I know to do this would be to iterate through the species list, in a manner similar to:
out = []
for name in species:
    if name == 'oak':
        out.append('Tree')
but this would require that I write a lot of code if the species list becomes very large and I don't imagine it would be very efficient with large datasets.
Is there a more direct way of doing this? I understand that I would need to list all the species manually (in separate classes) e.g.:
TreeSpecies = ['oak'....'birch']
GrassSpecies = ['Sawgrass....']
but I would only have to do this once to build a dictionary of species names. I'm expecting more data, so I may have to add an additional species name or two in future, but this would not be too time intensive if I could process a lot of the data quickly.
You need to create a dict of classifier mappings for your different items, for instance,
classifier = {'oak': 'Tree',
              'maple': 'Tree',
              'holly': 'Tree',
              'sawgrass': 'Grass',
              'cat tails': 'Bush',
              'birch': 'Tree'}
Then getting a column of groups is as simple as calling map on your column.
>>> df.species.map(classifier)
0     Tree
1     Tree
2     Tree
3    Grass
4     Bush
5     Tree
Name: species, dtype: object
so you can set a new column with
df['classification'] = df.species.map(classifier)
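Since the question mentions keeping one list of species per class, here is a sketch of turning such lists into the mapping dict and applying it. The class lists, the lower-casing step, and the 'Unclassified' fallback label are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

# Per-class species lists, as the asker proposed to maintain (contents invented)
classes = {
    'Tree': ['oak', 'maple', 'birch'],
    'Grass': ['sawgrass'],
    'Bush': ['holly'],
}

# Invert to species -> class so it can be passed to Series.map
classifier = {name: veg_class
              for veg_class, names in classes.items()
              for name in names}

df = pd.DataFrame({'species': ['maple', 'Oak', 'holly', 'sawgrass', 'cat tails']})

# Lower-case before mapping; species missing from the dict map to NaN,
# which is replaced here by an explicit placeholder label
df['classification'] = (
    df['species'].str.lower().map(classifier).fillna('Unclassified')
)
print(df)
```

Flagging unmapped species explicitly makes it easy to spot which names still need adding to the lists when new data arrives.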
You need a dictionary like
VegClass = {'oak': 'Tree', 'sawgrass': 'Grass'}
df['class'] = df['species'].map(VegClass)
I don't know if I follow you, but you will have to create some sort of associative list, in the form
plant | type
oak | tree
sawgrass | grass
kkk | bush
...
Just create a hash table and get the type from the hash table.
You may read the table from an external file so it is not hardcoded in your program.
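As a rough sketch of reading the lookup table from a file rather than hard-coding it, the example below uses an in-memory CSV in place of a real file. The column names plant and type come from the table above; the file contents and everything else are assumptions.

```python
import io

import pandas as pd

# Stand-in for an external lookup file; in practice this would be a path
lookup_file = io.StringIO("plant,type\noak,tree\nsawgrass,grass\n")
lookup = pd.read_csv(lookup_file)

# Build the hash table (a plain dict) from the two columns
classifier = dict(zip(lookup['plant'], lookup['type']))

species = pd.Series(['oak', 'sawgrass', 'oak'])
veg_types = species.map(classifier)
print(veg_types.tolist())
```

Adding a new species then only means adding a line to the lookup file, with no change to the program.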