Pandas - KeyError - Dropping rows by index in a nested loop - python

I have a Pandas DataFrame named df. I am attempting to use a nested for loop to iterate through each row (tuple) of the DataFrame and, at each iteration, compare that row with all other rows in the frame. During the comparison step, I am using Python's difflib.SequenceMatcher().ratio() and dropping rows that have a high similarity (ratio > 0.8).
Problem:
Unfortunately, I am getting a KeyError after the first outer-loop iteration.
I suspect that by dropping rows I am invalidating the outer loop's indexer, or that the inner loop is invalidating its own indexer by attempting to access an element that no longer exists (it has been dropped).
Here is the code:
import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher

# Note: this file, 'tweetsR.json', was originally csv, but has been translated to json.
with open("twitter data/tweetsR.json", "r") as read_file:
    data = json.load(read_file)  # Load the source data set, esport tweets.

df = pd.DataFrame(data)  # Load data into a pandas (pd) DataFrame for pandas utilities.
df = df.drop_duplicates(['text'], keep='first')  # Drop tweets with identical text content. Note: these tweets are likely reposts/retweets, etc.
df = df.reset_index(drop=True)  # Adjust the index to reflect dropping of duplicates.
def duplicates(df):
    for ind in df.index:
        a = df['text'][ind]
        for indd in df.index:
            if indd != 26747:  # Trying to prevent an overstep KeyError here
                b = df['text'][indd+1]
                if similar(a, b) >= 0.80:
                    df.drop((indd+1), inplace=True)
        print(str(ind) + "Completed")  # Debugging statement, tells us which iterations have completed

duplicates(df)
Error Output:
Can anyone help me understand this and/or fix it?

One solution, which was mentioned by @KazuyaHatta, is itertools.combinations(). Although, at least the way I've used it (there may be another way), it's O(n^2). So, in this case, with ~27,000 rows, that's nearly 357,714,378 combinations to iterate over (too long).
Here is the code:
# Create a set of the dropped tuples and run this code on bizon overnight.
import itertools

def duplicates(df):
    # Find out how to improve the speed of this
    excludes = set()
    combos = itertools.combinations(df.index, 2)
    for combo in combos:
        if f'{combo[0]}, {combo[1]}' not in excludes:
            if similar(df['text'][combo[0]], df['text'][combo[1]]) > 0.8:
                excludes.add(f'{combo[0]}, {combo[1]}')
                excludes.add(f'{combo[1]}, {combo[0]}')
                print("Dropped: " + str(combo))
                print(len(excludes))

duplicates(df)
My next step, which @KazuyaHatta described, is to attempt the dropping-by-mask method (sketched below).
Note: I unfortunately won't be able to post a sample of the dataset.
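For reference, here is a rough sketch of the dropping-by-mask idea, assuming the same similar() helper and 'text' column as above. It is still O(n^2) comparisons, but it never mutates the frame while iterating, so there are no stale indices to trip over:

import itertools

def near_duplicate_mask(df, threshold=0.80):
    keep = pd.Series(True, index=df.index)  # True = row survives
    for i, j in itertools.combinations(df.index, 2):
        if keep[i] and keep[j]:  # skip pairs already ruled out
            if similar(df.at[i, 'text'], df.at[j, 'text']) >= threshold:
                keep[j] = False  # flag the later row as a near-duplicate
    return keep

df = df[near_duplicate_mask(df)]  # drop all flagged rows in a single step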

Related

Fastest way to generate new rows and append them to DataFrame

I want to change the target value in the data set within a certain interval. When doing it with 500 rows it takes about 1.5 seconds, but I have around 100,000 rows. Most of the execution time is spent in this process. I want to speed this up.
What is the fastest and most efficient way to append rows to a DataFrame?
I tried the solution in this link and tried to create a dictionary, but I couldn't get it to work.
Here is the code, which takes around 1.5 seconds for 500 rows.
def add_new(df, base, interval):
    df_appended = pd.DataFrame()
    np.random.seed(5)
    s = np.random.normal(base, interval/3, 4)
    s = np.append(s, base)
    for i in range(0, 5):
        df_new = df
        df_new["DeltaG"] = s[i]
        df_appended = df_appended.append(df_new)
    return df_appended
DataFrames in pandas are contiguous pieces of memory, so appending or concatenating DataFrames is very inefficient - these operations create a new DataFrame and copy all of the data from the old ones.
Basic Python structures such as lists and dicts are not: when you append a new element, Python just stores a pointer to it.
So my advice: do all your data processing on lists or dicts and convert them to a DataFrame at the end.
Another option is to create a preallocated DataFrame of the final size and just change the values in it using .iloc, but that only works if you know the final size of the resulting DataFrame.
Good examples with code: Add one row to pandas DataFrame
If you need more code examples - let me know.
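For example, here is a minimal sketch of the "build in plain Python, concatenate once" advice applied to the add_new() function from the question, assuming the same base/interval logic and DeltaG column:

import numpy as np
import pandas as pd

def add_new_fast(df, base, interval):
    np.random.seed(5)
    s = np.append(np.random.normal(base, interval / 3, 4), base)
    frames = []  # a plain list is cheap to append to
    for value in s:
        df_new = df.copy()
        df_new["DeltaG"] = value
        frames.append(df_new)
    return pd.concat(frames, ignore_index=True)  # one concatenation at the end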
Here is the dict-based version I ended up with:
def add_new(df1, base, interval, has_interval):
    dictionary = {}
    if has_interval == 0:
        for i in range(0, 5):
            dictionary[i] = (df1.copy())
    elif has_interval == 1:
        np.random.seed(5)
        s = np.random.normal(base, interval/3, 4)
        s = np.append(s, base)
        for i in range(0, 5):
            df_new = df1
            df_new[4] = s[i]
            dictionary[i] = (df_new.copy())
    return dictionary
It works. It takes around 10 seconds for the whole data set. Thanks for your answers.

Python For loop with data (csv)

I have this data:
http://prntscr.com/gojey0
It keeps going on downward.
How do I find the top 20 most common platforms using Python code?
I'm really lost. I thought of maybe going through the list in a for loop and counting each one? That seems wrong, though.
Use pandas: http://pandas.pydata.org/
something like:
import pandas as pd
df = pd.read_csv("your_csv_file.csv")
top_platforms = df.nlargest(20, "Score")["Platform"]
A dictionary would be a good choice for collecting this information:
Initialize an empty dict.
For each row in the csv file:
Get the platform column.
If that platform is not already in the dict, create it with a count of one.
Otherwise if it is already in the dict, increment its count by one.
When you're done, sort the dict by the count value and print the top 20 entries.
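A minimal sketch of that dict-based counting, assuming the file is called DATA.csv and has a Platform column (both names are placeholders for your data):

import csv

counts = {}
with open('DATA.csv', newline='') as f:
    for row in csv.DictReader(f):
        platform = row['Platform']
        counts[platform] = counts.get(platform, 0) + 1  # create with 1, or increment

top_20 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:20]
print(top_20)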
I would use pandas to read in csv files
import pandas as pd
from collections import Counter
df = pd.read_csv('DATA.csv') # read the csv file into a dataframe *df*
# create counter object containing dictionary
# invoke the pandas groupby and count methods
d = Counter(dict(df.groupby(['Platform'])['Platform'].count()))
d will be a counter object "containing" a dictionary of the form {<platform>:<number of counts in dataset>}
You can get the top k most common platforms as follows:
k = 20
d.most_common(k)
>>> [('<platform1>', count1),
('<platform2>', count2),
('<platform3>', count3),
('<platform4>', count4),
....
Hope that helps. In the future it would help to show the head (first few lines) of your data, the code you have tried so far, or even which data wrangling tool you're using!
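For completeness, pandas can also do the counting directly with value_counts, again assuming the same Platform column:

import pandas as pd

df = pd.read_csv('DATA.csv')
print(df['Platform'].value_counts().head(20))  # top 20 most common platforms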

Pandas efficient Multiindex getting and setting

Below is a snapshot of my data structure in pandas
I build the below structure in a for loop
I am using sortlevel to lexsort the dataframe
df.sortlevel(inplace=True)
1) I need an efficient way to get and set specific rows as shown below. This is the expression I am using and it is not efficient.
a) Will I be able to set the values of the rows using assignment?
df.loc[idx['AAA', slice(None),'LLL']].iloc[:,0:n]
df.loc[idx['AAA', slice(None),'LLL']].iloc[:,0:n] = another_df
2) How do I efficiently sum the columns for a result like the one below?
df.loc[idx['AAA', slice(None),'LLL']].iloc[:,0:n].sum(axis=1)
I am looking for an efficient way to slice the dataframe.
Thanks
Thanks for letting me know the right way to post questions for Pandas. In any case, below are my findings regarding this problem. MultiIndex is certainly powerful from the standpoint of organizing data and exporting to csv or excel; however, accessing and selecting data has been challenging.
Best practices for initializing a MultiIndex:
I found it easier to preallocate the index rather than create it on the fly. Creating the index on the fly is not efficient and you will be faced with a lexsort warning.
Sort the data frame index once the data frame is initialized.
When accessing, do not leave the row or column identifier empty; use : instead.
counter = 0
for site_name in site_s:
    no_of_progs = len(site_s[site_name])
    prog_name_in_sites = site_s[site_name].keys()
    prog_level_cols = ['A', 'B', 'C']
    prog_level_cols = ['A', 'C']
    site_level_cols = ['A PLAN', 'A TOTAL', 'A UP', 'A DOWN', 'A AVAILABLE']
    if counter == 0:
        pd_index_col = pd.MultiIndex.from_product([[site_name], prog_name_in_sites, prog_level_cols],
                                                  names=['SITE', 'PROGRAM', 'TYPE'])
    else:
        pd_index_col = pd_index_col.append(pd.MultiIndex.from_product([[site_name], prog_name_in_sites, prog_level_cols],
                                                                      names=['SITE', 'PROGRAM', 'TYPE']))
    if no_of_progs > 1:
        pd_index_col = pd_index_col.append(pd.MultiIndex.from_product([[site_name], ['LINES'], site_level_cols],
                                                                      names=['SITE', 'PROGRAM', 'TYPE']))
    counter = counter + 1

df_A_site_level = pd.DataFrame(0, columns=arr_wk_num_wkly, index=pd_index_col, dtype=np.float64)
df_A_site_level.sort_index(inplace=True)
For setting and getting below are the two methods I recommend
df.iloc - if you know the positional index of rows and/or columns
df.loc - if you want to access data based on labels
Accessing using loc - Use the below to set or get cell/row values
idx = pd.IndexSlice
df_A_site_level.loc[idx[site_name, :, 'C'], df_A_site_level.columns[0:no]]
Accessing using iloc - Use the below to set or get cell/row values
df_A_site_level.iloc[no_1:no_2, no_3:no_4]
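Putting that together, here is a short sketch of 1) and 2) from the question expressed as single .loc calls. The 'AAA'/'LLL' labels and the column count n are placeholders from the question, and another_df is assumed to have a matching shape:

idx = pd.IndexSlice
cols = df.columns[0:n]  # first n columns

# get: all programs for site 'AAA' with TYPE 'LLL'
block = df.loc[idx['AAA', :, 'LLL'], cols]

# set: one .loc call (no chained indexing), so the assignment writes into df
df.loc[idx['AAA', :, 'LLL'], cols] = another_df.values

# sum across the selected columns, one total per row
row_totals = df.loc[idx['AAA', :, 'LLL'], cols].sum(axis=1)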

What is a proper idiom in pandas for creating DataFrames from the output of an apply function on a df?

Edit --- I've made some progress and discovered the drop_duplicates method in pandas, which saves me from some custom duplicate-removal functions I had created.
This changes the question in a couple of ways, because it changes my initial requirements.
One of the operations I need to conduct is grabbing the latest feed entries --- the feed URLs exist in a column of a DataFrame. Once I've done the apply, I get feed objects back:
import pandas as pd
import feedparser
import datetime
df_check_feeds = pd.DataFrame({'account_name':['NYTimes', 'WashPo'],'feed_url':['http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml', 'http://feeds.washingtonpost.com/rss/homepage'], 'last_update':['2015-12-28 23:50:40', '2015-12-28 23:50:40']})
df_check_feeds["feeds_results"] = pd.DataFrame(df_check_feeds.feed_url.apply(lambda feed_url: feedparser.parse(feed_url)))
df_check_feeds["entries"] = df_check_feeds.feeds_results.apply(lambda x: x.entries)
So now I'm stuck with the feed entries in the "entries" column. I'd like to create the two new data frames (one per account) in one apply call and concatenate them immediately.
I've expressed the equivalent in a for loop:
frames_list = []
for index in df_check_feeds.index:
    df_temp = pd.DataFrame(df_check_feeds.entries[index])
    df_temp['account_name'] = df_check_feeds.loc[index, 'account_name']
    # some error checking on the info here
    frames_list.append(df_temp)

df_total_results = pd.concat(frames_list)
df_total_results
I realize I could do this in a for loop (and indeed have written that above), but I feel there is a better, more succinct, idiomatic pandas way of writing this.
A more compact way could be:
df_total_results = df_check_feeds.groupby('account_name').apply(lambda x: pd.DataFrame(x['entries'].iloc[0]))
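If you'd rather avoid the groupby, an equivalent sketch (using the same df_check_feeds as above) builds one frame per row and concatenates once:

# one DataFrame per list of entries, tagged with its account_name, concatenated once
df_total_results = pd.concat(
    [pd.DataFrame(entries).assign(account_name=name)
     for name, entries in zip(df_check_feeds.account_name, df_check_feeds.entries)],
    ignore_index=True)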

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use pandas to do some analysis on messaging data and am running into a few problems trying to prep the data. It comes from a database I don't have control of, so I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database. Be careful if you get the whole test data base, may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp and set it as the index, I'm no longer able to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there as a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here I need to format the DATA columns, and I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)

datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires I know key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding/misusing Pandas as I'm still figuring it out.
I'm looking to solve these issues:
1) I can't save to csv after I set the timestamp (as timedelta64) as the index.
2) How do I apply a function to a set of columns without the SettingWithCopyWarning when reformatting the DATA columns?
3) How do I grab the rows for each group without having to use get_group(), since I don't know the keys ahead of time?
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates='timestamp', index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql, which is deprecated, I think.
The SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
As for the groupby, you could introduce a column of tuples which would be the group keys:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
I'm not sure if it's all exactly what you want but let me know and I can edit.
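As for 3): a groupby object is iterable, so you don't need the keys up front at all. A minimal sketch, using the datacolumns grouping from the question:

groups = datacolumns.groupby(['address', 'subaddress', 'rx_or_tx'])
for key, group_df in groups:  # key is the (address, subaddress, rx_or_tx) tuple
    # do analysis on group_df here
    print(key, len(group_df))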
