Potentially faulty or weird behavior for pandas.Series.isin - python

I have 2 tables in my database (visits, events).
visits has a primary key visit_id,
events has a column visit_id, which is sort of a foreign key into visits. (An events row can belong to 0 or 1 visits.)
What I want to do: filter out of the events table all the visit_id values that don't appear in the visits table. A simple task.
I have the data for each of those tables stored in pandas.DataFrame, respectively df_visits and df_events
I do the following operation:
len(set(df_visits.visit_id) - set(df_events.visit_id)) and I get a result of 1670, which matches what I expect.
But when I do
filter_real_v = df_events.visit_id.isin(set(df_visits.visit_id))
filter_real_v.value_counts() # I get only True values
filter_real_v = df_events.visit_id.isin(df_visits.visit_id)
filter_real_v.value_counts() # I get only True values
Even weirder, when I use
pd.DataFrame(df_events.visit_id).isin(real_visits).visit_id.value_counts() # I get all False values except 8 that are True
pd.DataFrame(df_events.visit_id).isin(set(real_visits)).visit_id.value_counts() #I get all True values
What is going on here? And how can I define a filter for which visit_id exists in events but not in visits?
Please find in this link, the df_events and df_visits csv files to reproduce this error (comma separated index,visit_id)
EDIT : Add snippet for minimal reproducible code:
Download the files in the link and save them to a file_path_events and file_path_visits of your choosing.
Execute the code below:
import pandas as pd
events = pd.read_csv("df_events.csv")
events.set_index('index',inplace=True)
visits = pd.read_csv("df_visits.csv")
visits.set_index('index',inplace=True)
correct_delta = len(set(visits.visit_id) - set(events.visit_id))
print(correct_delta) #1670
filter_real_v = events.visit_id.isin(set(visits.visit_id))
bad_delta = filter_real_v.value_counts()
print(bad_delta[True]) #702680
Best regards

Everything is behaving correctly; you're just misinterpreting the set operation "-"
len(set(df_visits.visit_id) - set(df_events.visit_id))
Will return the values of df_visits.visit_id not in df_events.visit_id. Note: If values of df_events.visit_id are not in df_visits.visit_id they will not be represented here. This is how sets work.
For example:
set([1,2,3,9]) - set([9,10,11])
Output:
{1, 2, 3}
Notice how 10 and 11 do not show up in the answer; no element that appears only in the second set ever will. The difference only removes the second set's values from the first set.
With isin() you are effectively doing:
visits['visit_id'].isin(df_events['visit_id'].values).value_counts()
True 56071
False 1670
# Note 1670 is the exact same you got in your set operation
and not:
df_events['visit_id'].isin(visits['visit_id'].values).value_counts()
True 702680
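To directly answer the last part of the question (which visit_ids exist in events but not in visits), negate isin with ~. A minimal sketch with made-up toy data:

```python
import pandas as pd

# Toy frames standing in for df_visits / df_events (hypothetical values)
df_visits = pd.DataFrame({'visit_id': [1, 2, 3, 9]})
df_events = pd.DataFrame({'visit_id': [1, 2, 10, 11]})

# visit_ids present in events but NOT in visits: negate the isin mask
orphan_mask = ~df_events['visit_id'].isin(df_visits['visit_id'])
orphans = df_events.loc[orphan_mask, 'visit_id']  # 10 and 11
```

Flipping which Series calls isin flips the direction of the test, which is exactly the asymmetry shown above.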


Efficient way to loop through GroupBy DataFrame

Since my last post lacked information, here is an example of my df (the important columns):
deviceID: unique ID for the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the record was created.
deviceID mileage position_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by calculating the vehicle's speed from the timestamps and the mileage and comparing it to the vehicle's max speed (which is 80 km/h). The result should then be written back into the original dataset.
What I've done so far is the following:
maxSpeedKMH = 80  # max plausible speed of a vehicle

df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'valid'] = 1
    # iterate through each row in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        # calculate speed and mark the row valid if it is plausible
        if (group.mileage.values[i]/timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'valid'] = 1

df_ori.valid.value_counts()
It definitely works the way I want it to; however, it performs very poorly. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
# create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value within each device
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# The operation above produces NaN for the first value in each group;
# fill 'valid' with 1 there, matching the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops (typically row-by-row iteration), which can be insanely slow.
Since I can't get the context of your code, please double check the logic and make sure it works as desired.
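For reference, here is a self-contained sketch of this vectorized approach run on the five sample rows from the question (assuming maxSpeedKMH = 80, as stated there):

```python
import pandas as pd

maxSpeedKMH = 80  # max plausible speed, per the question

# The five sample rows from the question
df_ori = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079, 1600701501,
                                   1600702102, 1600702701],
})

df_ori = df_ori.sort_values('position_timestamp_measure')
# seconds since the previous message of the same device (NaN for the first)
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
df_ori['valid'] = 0
# the first point per device cannot be validated, so mark it valid
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
# implied speed in km/h; NaN rows compare as False and are left alone
speed = df_ori['mileage'] / (df_ori['timeGoneSec'] / 3600)
df_ori.loc[speed < maxSpeedKMH, 'valid'] = 1
```

On this sample every row ends up valid, since the implied speeds are all far below 80 km/h.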

How to remove 'giambja01' from DataFrame?

I have a data-science-related project. I need to remove a certain name from my DataFrame. Here is what I attempted:
delete_row_1 = batsal[batsal["playerID"]=='giambja01'].index
remaining_players = batsal.drop(delete_row_1)
To test whether this worked I wrote this and got False:
'giambja01' in remaining_players['playerID']
False
It seems to have worked, and yet when I run the following code I get this:
remaining_players['playerID']
10836 giambja01
13287 heltoto01
2446 berkmla01
11336 gonzalu01
8271 drewjd01
25101 pujolal01
17276 lawtoma02
82 abreubo01
5395 catalfr01
10852 giambje01
22174 nevinph01
20635 mientdo01
6275 coninje01
11545 gracema01
20173 mclemma01
23005 ordonma01
24596 pierrju01
22418 nixontr01
5903 clarkto02
30281 sweenmi01
20688 millake01
18086 loducpa01
11810 grievbe01
3145 boonebr01
29869 stewash01
33183 whitero02
32039 vidrojo01
Name: playerID, dtype: object
I am attaching a sample DataFrame:
batsal = pd.DataFrame({'playerID':['giambja01' , 'damonjo01' , 'saenzol01'],'Sex':['M','M','M']})
Please let me know what I did wrong.
The issue is that 'giambja01' in remaining_players['playerID'] tests membership against the Series index labels, not its values, so it returns False whether or not the row was dropped. drop itself already removes rows (index labels) by default, so your drop call was fine. To check the values instead, you should try:
remaining_players['playerID'].isin(['giambja01']).any()
Try this, specifying the index:
remaining_players = batsal.drop(index=delete_row_1)
Find the documentation on the function here.
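A minimal sketch of the index-vs-values distinction, using the sample frame from the question:

```python
import pandas as pd

batsal = pd.DataFrame({'playerID': ['giambja01', 'damonjo01', 'saenzol01'],
                       'Sex': ['M', 'M', 'M']})

delete_row_1 = batsal[batsal['playerID'] == 'giambja01'].index
remaining_players = batsal.drop(delete_row_1)

# `in` on a Series tests the INDEX labels, not the values -- so this is
# False even on the original frame, before anything was dropped
in_index = 'giambja01' in batsal['playerID']
# testing the VALUES requires isin/eq
in_values = batsal['playerID'].isin(['giambja01']).any()           # True
still_there = remaining_players['playerID'].eq('giambja01').any()  # False
```

So the original False was not evidence that the drop worked; the value-based check is what actually confirms it.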

Is pandas and numpy any good for manipulation of non numeric data?

I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data and then manipulate it with the numpy toolsets. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good, except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I'd call a more non-mathematical/non-numerical purpose looks like barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enhanced data to excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has more than WorkCode, because there are different versions (as different records)
    # Only want the last one.
    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.
    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)  # note: pass the variable, not the string 'filepath'
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully for
manipulating non numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas data frame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, and how they could be altered/manipulated/ etc. And then, how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
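For anyone following the same path, the lookup-and-enrich step above can be collapsed into a single merge. This is a hypothetical reconstruction, not the author's actual final code, using column names from the question's samples (Id, Version, WorkCode) and made-up values:

```python
import pandas as pd

# Source data (stand-in for the GIS export) and the CSV lookup, which can
# hold several Versions per Id -- we only want the latest one
source = pd.DataFrame({'WorkCode': ['AT-W34319', 'AT-W67713'],
                       'WorkName': ['Second building', 'Left of the red office tower']})
lookup = pd.DataFrame({'Id': ['AT-W34319', 'AT-W34319', 'AT-W67713'],
                       'Version': [1.0, 2.0, 1.1],
                       'Principal': ['Acem 1', 'Acem 1', 'Quester Q']})

# keep only the highest Version per Id, then left-join onto the source;
# unmatched source rows simply get NaN in the lookup columns
latest = lookup.sort_values('Version').drop_duplicates('Id', keep='last')
enriched = source.merge(latest, left_on='WorkCode', right_on='Id', how='left')
```

A merge keeps the headers throughout, which sidesteps the header-loss problem that started the whole detour through ndarrays.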
Cheers

Python: calculate values in column only in rows with specific value in other column

I am desperately trying to solve this issue:
I have a csv with information on well core data with different columns, among them one column with IDs and two with X and Y coordinates. I was told now by the data supplier that some of the well cores (= rows) have wrong Y coordinates - the value should be, e.g., -1400 instead of 1400.
I am now trying to write a script to automatically change all the Y-values in the affected rows (well cores) (by *-1), but nothing has worked:
ges = pd.read_csv(r"C:\A....csv")
bk = [26740001, 26740002, 26740003]  # List of IDs that should be changed
for x in bk:
    for line in ges:
        np.where(ges.query('ID== {}'.format(x)), ges.Y=ges.Y*-1, ges['Y'])
I have also tried it like this:
for line in ges:
    if ges.ID.values == bk:
        ges.Y = ges.Y*-1
    else:
        pass
or like this:
ges.loc[(ges.ID == bk), 'Y']=*-1
or:
ges.loc[ges['ID'].isin(bk), ges['Y']] = *-1
or:
ges.loc[ges['ID'].isin(bk), ges['Y']] = ges['Y']*-1
I am very grateful for every help!
edit:
I am sorry, this is my first post. To make it clearer, my data looks like this:
Now I was informed that the Y-values of ID 2, 3 and 6 are wrong and should be negative values. So my desired output is the following:
ID X Y other column other column
1 3459 1245 information information
2 4541 -1256 information information
3 2378 -2353 information information
4 6947 874 information information
5 2349 2351 information information
6 2347 -746 information information
I hope it is clear now. Thanks.
Try the following:
ids = [26740001, 26740002, 26740003]
for number_id in ids:
    idx = ges['ID'] == number_id
    ges.loc[idx, 'Y'] *= -1
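Since the IDs are already in a list, the loop can also be collapsed into one vectorized isin mask. A sketch on data shaped like the question's table (the values are made up):

```python
import pandas as pd

# Sample data shaped like the question's desired-output table
ges = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'X': [3459, 4541, 2378, 6947, 2349, 2347],
                    'Y': [1245, 1256, 2353, 874, 2351, 746]})
bk = [2, 3, 6]  # IDs whose Y coordinate must be negated

# isin builds one boolean mask for every affected row at once - no loop
ges.loc[ges['ID'].isin(bk), 'Y'] *= -1
```

This touches each row once regardless of how many IDs are in bk, so it also scales better than looping over the list.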

Filling in missing value based on values in both preceding and succeeding rows

I have a dataset analogous to the one below where for a website I have the number of views every month for two years (2001-2002). However, due to the way the data was gathered, I only have information for a website if it had > 0 views. So, I am trying to fill in the number of views for months where that is not the case: i.e., cases where the website was online but had no views.
Unfortunately, I have no information for when the website was first published, so I assume that it was introduced the first time there are non-zero values for a month. I also assume the website was taken down if there are consecutive months with np.nan values at the end of 2002.
So, currently, the Views column has np.nan values for both months where views are zero, and the website was simply not online.
I want to make sure that months with zero views have 0 in the Views column, such that the below data frame,
Website ,Month,Year ,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,
1,March,2002,3.0
1,April,2002,
1,May,2002,
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,
2,December,2001,1.0
2,January,2002,
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,
2,October,2002,
2,November,2002,2.0
2,December,2002,5.0
looks like this:
Website ,Month,Year ,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,0.0
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,0.0
1,March,2002,3.0
1,April,2002,0.0
1,May,2002,0.0
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,0.0
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,0.0
2,December,2001,1.0
2,January,2002,0.0
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,0.0
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,0.0
2,October,2002,0.0
2,November,2002,2.0
2,December,2002,5.0
In other words, if all preceding months for that website show np.nan values, and the current value is np.nan, it should remain that way. Similarly, if all following months show np.nan, the column should remain np.nan as well. However, if at least one preceding month is not np.nan the value should change to 0, etc.
The tricky part is that my dataset has about 4,000,000 rows, and I need a fairly efficient way to do this.
Does anyone have any suggestions?
Here's my approach
# s counts the non-null views so far
s = df['Views'].notnull().groupby(df['Website']).cumsum()
# fill the null only where s > 0
df['Views'] = np.where(df['Views'].isna() & s.gt(0), 0, df['Views'])
# equivalent
# df.loc[df['Views'].isna() & s.gt(0), 'Views'] = 0
I followed Quang Hoang's response and used the below code, which worked perfectly:
#Same as Quang Hoang's answer:
s = df['Views'].notnull().groupby(df['Website']).cumsum()
#Count the non-null views so far but starting with the last observations
b = df['Views'].notnull()[::-1].groupby(df['Website']).cumsum()
# fill the null only where s > 0 and b > 0
df['Views'] = np.where(df['Views'].isna() & (s.gt(0) & b.gt(0)), 0, df['Views'])
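A quick self-contained check of the combined forward/backward logic on a toy frame (one website, with NaN at the head, in the middle, and at the tail):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Website': [1, 1, 1, 1, 1],
                   'Views': [np.nan, 3.0, np.nan, 2.0, np.nan]})

# non-null views seen so far, scanning forward and backward per website
s = df['Views'].notnull().groupby(df['Website']).cumsum()
b = df['Views'].notnull()[::-1].groupby(df['Website']).cumsum()
# only a NaN with non-null neighbours on BOTH sides becomes 0
df['Views'] = np.where(df['Views'].isna() & s.gt(0) & b.gt(0), 0, df['Views'])
```

The head and tail NaNs survive (no non-null value before/after them), while the middle NaN becomes 0, which is exactly the rule described in the question.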
