I've checked other posts and haven't found a solution to my problem. I'm getting the error in the subject after the code runs fine for a while.
I'm simply trying to add rows to a holder dataframe, appending only rows that aren't similar to previously appended rows. You'll see that Friend is checked against 'Target' and Target against 'Friend' in the query.
It iterates 71 times before giving me the error. cur is the iterator variable, which is set outside this section of code. Here's the code:
same = df[(df['Source']==cur) & (df['StratDiff']==0)]
holder = pd.DataFrame(index=['pbp'], columns=['Source', 'Target', 'Friend', 'SS', 'TS', 'FS'])
holder.iloc[0:0]
i = 1
for index, row in same.iterrows():
    Target = row['Target']
    stratcur = row['SourceStrategy']
    strattar = row['TargetStrategy']
    sametarget = df[(df['Source']==Target)]
    samejoin = pd.merge(same, sametarget, how='inner', left_on=['Target'],
                        right_on=['Target'])
    for index, row in samejoin.iterrows():
        Friend = row['Target']
        stratfriend = row['TargetStrategy_x']
        #print(cur, Friend, Target)
        temp = holder[holder[(holder['Source']==cur) &
                             (holder['Target']==Friend) & (holder['Friend']==Target)]]
        if temp.isnull().values.any():
            holder.loc[i] = [cur, Target, Friend, stratcur, strattar, stratfriend]
            print(i, cur)
            i = i + 1
I just want to update everyone: I was able to solve this. It took a while, but the problem was in the line where I query holder; the expression was too complex. I simplified it into multiple, simpler queries, and it works fine now.
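For anyone hitting the same thing, note that the failing line indexes holder twice (holder[holder[...]]), which hands an already-filtered DataFrame back to [] as an indexer. A minimal sketch of the kind of simplification described, using the same variable names as above (a reconstruction, not the exact fix):

# Filter step by step instead of nesting one boolean-filtered frame
# inside another indexing call.
by_source = holder[holder['Source'] == cur]
by_target = by_source[by_source['Target'] == Friend]
temp = by_target[by_target['Friend'] == Target]
if temp.empty:  # nothing similar appended yet
    holder.loc[i] = [cur, Target, Friend, stratcur, strattar, stratfriend]
    i = i + 1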
I am currently working on finding the position (row index, column index) of the maximum cell in each column of a dataframe.
There are a lot of similar dataframes, so I made a function like the one below.
def FindPosition_max(series_max, txt_name):
    # series_max: only needed to get the number of columns in time_history (index starts from 1).
    time_history = pd.read_csv(txt_name, skiprows=len(series_max)+7, header=None,
                               usecols=[i for i in range(1, len(series_max)+1)])[:-2]
    col_index = series_max.index
    row_index = []
    for current_col_index in series_max.index:
        row_index.append(time_history.loc[:, current_col_index].idxmax())
    return row_index, col_index.tolist()
This works well but takes too much time to run over many dataframes. I read that .apply() is much faster than a for loop, so I tried this:
def FindPosition_max(series_max, txt_name):
    time_history = pd.read_csv(txt_name, skiprows=len(series_max)+7, header=None,
                               usecols=[i for i in range(1, len(series_max)+1)])[:-2]
    col_index = series_max.index
    row_index = pd.Series(series_max.index).apply(lambda x: time_history.loc[:, x].idxmax())
    return row_index, series_max.index.tolist()
And the error comes out like this:
File "C:\Users\hwlee\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 844, in _list_of_series_to_arrays
indexer = indexer_cache[id(index)] = index.get_indexer(columns)
AttributeError: 'builtin_function_or_method' object has no attribute 'get_indexer'
I tried to find what causes this error, but it never goes away. Also, when I test the code inside the function separately, it works well.
Could anyone help me solve this problem? Thank you!
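One hedged alternative, assuming time_history's column labels match series_max.index: DataFrame.idxmax already returns the row label of each column's maximum, so neither the loop nor apply is needed. A sketch:

def FindPosition_max(series_max, txt_name):
    time_history = pd.read_csv(txt_name, skiprows=len(series_max) + 7, header=None,
                               usecols=[i for i in range(1, len(series_max) + 1)])[:-2]
    col_index = series_max.index
    # idxmax() computes the index label of the maximum for every column at once.
    row_index = time_history[col_index].idxmax()
    return row_index.tolist(), col_index.tolist()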
I really hope you can help.
I am working on the code below, which finds the closest schools to a particular property.
Everything is working fine except the final step: where I set the field 'Primary' in the raw_data dataframe, it always ends up NaN, even though, if I step through, the variable Primary_Name does get populated.
Any idea why this might be?
for index, row in raw_data.iterrows():
    start = (row['lat'], row['long'])
    for index, row in schools.iterrows():
        schoolloc = (row['Latitude'], row['Longitude'])
        schools.loc[index, 'distance'] = geopy.distance.geodesic(start, schoolloc).km
    schools.dropna()
    primary = schools.where(schools['PhaseOfEducation (name)'] == 'Primary')
    Nearest_Primary = primary[primary['distance'] == min(primary['distance'])]
    Primary_Name = Nearest_Primary.iloc[0]['EstablishmentName']
    raw_data.loc[index, 'primary'] = Primary_Name
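One likely cause: the inner loop reuses the names index and row, so after it finishes, index refers to the last row of schools, not the current property, and the write to raw_data.loc[index, 'primary'] lands on the wrong label (note also that the column written is lowercase 'primary'). A sketch with distinct loop variables, untested against the real data:

for prop_idx, prop in raw_data.iterrows():
    start = (prop['lat'], prop['long'])
    for school_idx, school in schools.iterrows():
        schoolloc = (school['Latitude'], school['Longitude'])
        schools.loc[school_idx, 'distance'] = geopy.distance.geodesic(start, schoolloc).km
    primary = schools[schools['PhaseOfEducation (name)'] == 'Primary'].dropna(subset=['distance'])
    # idxmin() gives the label of the smallest distance directly.
    raw_data.loc[prop_idx, 'primary'] = primary.loc[primary['distance'].idxmin(), 'EstablishmentName']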
I'm trying to run a script (which calls the Google Search Console API) over a table of keywords and dates, in order to check whether there was an improvement in keyword performance (SEO) after each date.
Since I'm really clueless, I'm guessing and trying things, but the Jupyter notebook isn't responding, so I can't even tell if I'm wrong...
The repository I took this code from was made by Josh Carty:
https://github.com/joshcarty/google-searchconsole
I've already read the input table with pd.read_csv (it consists of two columns, 'keyword' and 'date') and turned the columns into two separate lists (or maybe it's better to use a dictionary or something else?): KW_list and Date_list.
I tried:
for i in KW_list and j in Date_list:
    account = searchconsole.authenticate(client_config='client_secrets.json',
                                         credentials='credentials.json')
    webproperty = account['https://www.example.com/']
    report = webproperty.query.range(j, days=-30).filter('query', i, 'contains').get()
    report2 = webproperty.query.range(j, days=30).filter('query', i, 'contains').get()
    df = pd.DataFrame(report)
    df2 = pd.DataFrame(report2)
df
I expect to see a data frame of all the different keywords (keyword1 with stats1, keyword2 with stats2 below it, etc., with no overwriting) for the dates 30 days before the date in the neighboring cell of the input file,
or at least some response from the Jupyter notebook so I know what is going on.
Try using the zip function to combine the lists into a list of tuples. That way, each date stays paired with its corresponding keyword.
account = searchconsole.authenticate(client_config='client_secrets.json', credentials='credentials.json')
webproperty = account['https://www.example.com/']
df1 = None
df2 = None
first = True
for (keyword, date) in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    if first:
        df1 = pd.DataFrame(report)
        df2 = pd.DataFrame(report2)
        first = False
    else:
        df1 = df1.append(pd.DataFrame(report))
        df2 = df2.append(pd.DataFrame(report2))
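Note that DataFrame.append was removed in pandas 2.0. On current versions, a hedged equivalent is to collect the per-keyword frames in lists and concatenate once at the end, which also avoids repeatedly copying the growing frames:

before_frames = []
after_frames = []
for keyword, date in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    before_frames.append(pd.DataFrame(report))
    after_frames.append(pd.DataFrame(report2))
# One concat replaces all the incremental appends.
df1 = pd.concat(before_frames, ignore_index=True)
df2 = pd.concat(after_frames, ignore_index=True)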
I have a huge set of data, something like 100k lines, and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things I've tried:
# this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)
# this works for my small-scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
# the error being "unterminated character set at position 1343770"
# I went to check what was on that line and it returned this:
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
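(On the "unterminated character set" error itself: that message usually means the regex pattern contains an unescaped [, and the position refers to a spot inside the joined pattern string, not a row of the dataframe. A hedged guess at a fix is to escape each term before joining:

import re
# Escape regex metacharacters (like "[") in each term before building the pattern.
pattern = '|'.join(re.escape(term) for term in df2['z'].astype(str))
test = df[~df.tweet.str.contains(pattern)]
)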
Is ['#c, #d, #e, #f'] one string, or a list like this: ['#c', '#d', '#e', '#f']?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up doing only 1 delete
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
Speed comparison (for 10,000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time()-st)
43.99799990653992
For me, your code works if I make several adjustments.
First, you're missing the last row when using range(df.tweet.size); either increase the range or, more robustly (in case your index isn't a contiguous increasing range), use df.tweet.index.
Second, you never actually apply your drops; use inplace=True for that.
Third, you have #d inside a single string: '#c, #d, #e, #f' is not a list of tags, and you have to change it to a list for the membership test to work.
With those changes, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # if we already dropped this line, we no longer need to check the remaining tags
This produces the desired result. Be aware that it is potentially suboptimal because nothing is vectorized.
EDIT:
you can turn each string into a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: list(chain(*map(lambda lelem: lelem.split(","), l))))
This applies a function to each row (assuming each row contains a list with one or more elements): split each element (which should be a string) on commas into a new list, and "flatten" all the lists in the row (if there are multiple) into one.
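A small usage sketch; note that splitting on ',' leaves a leading space on every tag after the first, so stripping may be needed:

from itertools import chain
import pandas as pd

df = pd.DataFrame({'tweet': [['#a'], ['#c, #d, #e, #f']]})
# Split each string on commas, flatten, and strip stray spaces.
df.tweet = df.tweet.apply(
    lambda l: [t.strip() for t in chain(*(e.split(',') for e in l))])
print(df.tweet.tolist())  # [['#a'], ['#c', '#d', '#e', '#f']]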
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind, and after it works, try to improve your code (fewer for-loop iterations; use tricks like collecting the indices and then dropping them all at once).
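One hedged sketch of such an improvement: build a boolean mask with set.isdisjoint instead of dropping row by row:

screen = set(df2.z)
# Keep only rows whose tag list shares no element with screen.
df = df[df.tweet.apply(screen.isdisjoint)]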
I have 2 data frames; title and section are two columns among others.
I need to check whether the combination of a particular title and section from one data frame is present in the second.
E.g., data frame torange has columns title, s_low, s_high and others; usc has columns title and section.
If torange contains the following row
title s_low s_high
1 1 17
the code needs to check in usc if a row
title section
1 1
and a row
title section
1 17
exist, and create a new table containing the title, section, and the rest of the columns in torange, with the range between s_low and s_high expanded using usc.
I've written the code below, but somehow it doesn't work and stops after just a few iterations. I suspect something is wrong with the counter i, maybe a syntax error. It may also be something to do with how any() is used.
import MySQLdb as db
from pandas import DataFrame
from pandas.io.sql import frame_query

cnxn = db.connect('127.0.0.1', 'xxxxx', 'xxxxx', 'xxxxxxx', charset='utf8', use_unicode=True)
torange = frame_query("SELECT title, s_low, s_high, post, pre, noy, rnum from torange", cnxn)
usc = frame_query("SELECT title, section from usc", cnxn)

i = 0
for row in torange:
    t = torange.title[i]
    s_low = torange.s_low[i]
    s_high = torange.s_high[i]
    for row in usc:
        if (any(usc.title == t) & any(usc.section == s_low)):
            print 't', t, 's_low', s_low, 'i', i
        if (any(usc.title == t) & any(usc.section == s_high)):
            print 't', t, 's_high', s_high, 'i', i
            print i, '*******************match************************'
    i = i + 1
(Please ignore the print statements. This is part of a bigger task I'm doing, and the prints are just checks to see what's happening.)
Any help in this regard would be greatly appreciated.
Your whole checking and iteration setup is messed up: you iterate row over usc, yet your any() conditions check all of usc, not row. Moreover, row is the loop variable of both loops. Here is a cleaner starting point:
for index, row in torange.iterrows():
    t = row['title']
    s_low = row['s_low']
    s_high = row['s_high']
    uscrow = usc[(usc.title == t) & (usc.section == s_low)]
    # uscrow now contains all the rows in usc that fulfill your condition.
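From there, a hedged sketch of the expansion step the question describes (assuming section is numeric and both endpoints must exist in usc; untested):

expanded = []
for index, row in torange.iterrows():
    t, s_low, s_high = row['title'], row['s_low'], row['s_high']
    sections = usc.loc[usc.title == t, 'section']
    # Only expand when both endpoints are present for this title.
    if (sections == s_low).any() and (sections == s_high).any():
        for section in sections[sections.between(s_low, s_high)]:
            expanded.append({'title': t, 'section': section, 'post': row['post'],
                             'pre': row['pre'], 'noy': row['noy'], 'rnum': row['rnum']})
result = DataFrame(expanded)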