Python Pandas to match rows with overlapping coordinates

I am a python newbie, trying to figure out a problem using pandas.
I have two .csv files that I have imported as pandas dataframes.
One of these files has columns for an ID number and Start and End coordinates:
ID  Start  End
1   45     99
3   27     29
6   13     23
19  11     44
My second file has columns for a code, and start and end coordinates as well:
Code   Start  End
ss13d  67     100
dfv45  55     100
aal33  101    222
mm0ww  24     28
I want to find start and end coordinates that overlap between both of these files in no particular order, so that the result would look something like this:
ID  Start  End  Code   Start  End
1   45     99   ss13d  67     100
1   45     99   dfv45  55     100
3   27     29   mm0ww  24     28
I have tried using pandas.merge(), but from what I understand the dataframes need to have columns in common. In this case those would be my Start columns, but I can't merge on them since they are the ones being compared.
For now I have at least figured out the logic for locating overlaps:
df = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
c = (df['Start'] <= df2['Start']) & (df['End'] >= df2['Start']) | (df['Start'] <= df2['End']) & (df['End'] >= df2['End'])
but I haven't had any luck getting anything to work.
Could someone point me in the right direction? Neither concat nor merge seems to work for me in this situation, I think.

So to start out, you should probably rename your columns so that you can tell which belongs to which dataframe; it'll make things easier when comparing them later.
df1 = df1.rename(columns={'Start': 'Start_1', 'End': 'End_1'})
df2 = df2.rename(columns={'Start': 'Start_2', 'End': 'End_2'})
Next, if you want to merge two dataframes, but don't have any column in common, you can simply create one:
df1["key"] = 0
df2["key"] = 0
Then you can merge on that column and drop it again:
joined_df = pd.merge(df1, df2, on='key').drop(columns=['key'])
Finally, you can filter for the overlapping rows. Two ranges overlap when each one starts before the other one ends:
joined_df[(joined_df["Start_1"] <= joined_df["End_2"]) & (joined_df["Start_2"] <= joined_df["End_1"])]
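Putting the steps together, here is a minimal end-to-end sketch using the sample values from the question (the overlap test used here is the general two-interval condition, which also catches the case where one range fully contains the other):
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({'ID': [1, 3, 6, 19],
                    'Start': [45, 27, 13, 11],
                    'End': [99, 29, 23, 44]})
df2 = pd.DataFrame({'Code': ['ss13d', 'dfv45', 'aal33', 'mm0ww'],
                    'Start': [67, 55, 101, 24],
                    'End': [100, 100, 222, 28]})

# Rename so the two coordinate pairs stay distinguishable after the merge
df1 = df1.rename(columns={'Start': 'Start_1', 'End': 'End_1'})
df2 = df2.rename(columns={'Start': 'Start_2', 'End': 'End_2'})

# Cross join via a constant key column, then drop the helper column again
df1['key'] = 0
df2['key'] = 0
joined_df = pd.merge(df1, df2, on='key').drop(columns=['key'])

# Two ranges overlap when each one starts before the other one ends
overlap = (joined_df['Start_1'] <= joined_df['End_2']) & (joined_df['Start_2'] <= joined_df['End_1'])
print(joined_df[overlap])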
(Just a tip: use & and | to combine filters, and always put parentheses around the individual conditions.)
Hope this helps and good luck with pandas!

Related

Extracting discontinuous set of rows in pandas

I have a pandas dataframe containing 100 rows. It looks something like this
id_number  Name    Age  Salary
00001      Alice   50   6.2234
00002      John    29   9.1
.
.
.
00098      Susan   36   11.58
00099      Remy    50   3.7
00100      Walter  50   5.52
From this dataframe, I want to extract the rows corresponding to individuals whose ID numbers do NOT lie between 11 and 20. I want rows 0 to 9, and 20 to 99.
df.iloc allows extracting a continuous set of rows, like 20 to 99, but not 0 to 9 and 20 to 99 in the same go.
I also tried df[(df['id_number'] >= 20) & (df['id_number'] < 10)] but that returns an empty dataframe.
Is there a straightforward way to do this, that does not require doing two separate extractions and their concatenation?
Dropping a sliced index is what you need. In this case we slice labels 11 through 20.
Data:
import numpy as np
import pandas as pd

df = pd.Series(np.arange(1, 101))
Drop the slice:
df.drop(df.loc[11:20].index, inplace=True)
df
This seemed to work (suggested by #FarhoodET):
df[(df['id_number'] >= 20) | (df['id_number'] < 10)]
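For what it's worth, a slightly more readable variant of the same boolean indexing is Series.between, which is inclusive of both endpoints, so negating it keeps everything outside 11-20. A small sketch with made-up data:
import pandas as pd

# Hypothetical frame mirroring the question's id_number column
df = pd.DataFrame({'id_number': range(1, 101)})

# Keep rows whose id_number does NOT lie between 11 and 20 (inclusive)
result = df[~df['id_number'].between(11, 20)]
print(result)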

Merge levels for categorical levels in pandas

I am wondering how to merge levels of a categorical variable in Python?
I have the following dataset:
dataset['Reason'].value_counts().head(5)
Reason  Count
0       339
7       125
11      124
3       82
0       65
Now, I want to merge the first and last levels shown (both are the value '0', one of them with trailing whitespace), so that the output looks like:
dataset['Reason'].value_counts().head(5)
Reason  Count
0       404
7       125
11      124
3       82
2       52
To get to the reason, I had to split a string, which might have led to the various levels in the Reason column.
I have tried to use loc, but I am wondering whether there is a smarter way to do it:
dataset.loc[dataset['Reason'] == '0' , ['Reason']] = 'On request'
dataset.loc[dataset['Reason'] == '0 ' , ['Reason']] = 'On request'
Thanks, Michael.
As #anky_91 mentioned, use Series.str.strip if all values are strings:
dataset['Reason'].str.strip().value_counts().head(5)
If some values are numeric, first cast them to strings with Series.astype:
dataset['Reason'].astype(str).str.strip().value_counts().head(5)
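For illustration, a small self-contained sketch (the sample values below are made up; only the 'On request' label comes from the question's loc attempt):
import pandas as pd

# The same level '0' appears with and without trailing whitespace,
# which is what produces the duplicate counts
dataset = pd.DataFrame({'Reason': ['0', '0 ', '7', '0 ', '7', '3']})

# Strip the whitespace once, then optionally map the code to a readable label
dataset['Reason'] = dataset['Reason'].astype(str).str.strip()
dataset['Reason'] = dataset['Reason'].replace({'0': 'On request'})

print(dataset['Reason'].value_counts())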

Sort or remove specific value from pandas dataframe

I am trying to organize this data in Python before doing some analysis; it shows a step count at each timestamp.
One of the purposes is to calculate the step difference over some period (e.g. per minute, per hour). However, as can be seen, the step count sometimes shows a higher value in between lower values (at 10:48:46), which makes computing the step difference complicated. Also note that the count restarts from 0 after 65535 (I asked how to handle that rollover here: Panda dataframe conditional change, and it worked well on nicely sorted values).
I know it may be unsolvable because I can't easily remove the unwanted rows or sort the column by value, but hopefully someone has an idea how to solve this?
IIUC, is this what you want:
# simple setup
import pandas as pd

df = pd.DataFrame({'stepcount': [33, 32, 41, 45, 67, 76, 64, 65, 69, 70, 75, 76, 76, 76, 76]})
df[df['stepcount'] >= df['stepcount'].cummax()]
Output:
stepcount
0 33
2 41
3 45
4 67
5 76
11 76
12 76
13 76
14 76
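Since the stated goal is the step difference per period, one possible follow-up is to take differences of the cleaned counts. A sketch building on the df from the setup above (with a real timestamp column you would likely group by or resample to the desired period first):
# Keep only readings that are at least as large as the running maximum,
# dropping the spurious dips, then take consecutive differences
clean = df[df['stepcount'] >= df['stepcount'].cummax()]
step_diff = clean['stepcount'].diff().fillna(0)
print(step_diff)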

Add value from series index to row of equal value in Pandas DataFrame

I'm facing bit of an issue adding a new column to my Pandas DataFrame: I have a DataFrame in which each row represents a record of location data and a timestamp. Those records belong to trips, so each row also contains a trip id. Imagine the DataFrame looks kind of like this:
   TripID  Lat    Lon    time
0  42      53.55  9.99   74
1  42      53.58  9.99   78
3  42      53.60  9.98   79
6  12      52.01  10.04  64
7  12      52.34  10.05  69
Now I would like to delete the records of all trips that have less than a minimum amount of records to them. I figured I could simply get the number of records of each trip like so:
lengths = df['TripID'].value_counts()
Then my idea was to add an additional column to the DataFrame and fill it with the values from that Series corresponding to the trip id of each record. I would then be able to get rid of all rows in which the value of the length column is too small.
However, I can't seem to find a way to get the length values into the correct rows. Would any one have an idea for that or even a better approach to the entire problem?
Thanks very much!
EDIT:
My desired output should look something like this:
   TripID  Lat    Lon    time  length
0  42      53.55  9.99   74    3
1  42      53.58  9.99   78    3
3  42      53.60  9.98   79    3
6  12      52.01  10.04  64    2
7  12      52.34  10.05  69    2
If I understand correctly, to get the length of the trip, you'd want to get the difference between the maximum time and the minimum time for each trip. You can do that with a groupby statement.
# Group by trip, get the minimum and maximum times, then reset the index
df_new = df.groupby('TripID').time.agg(['min', 'max']).reset_index()
df_new['length_of_trip'] = df_new['max'] - df_new['min']
df_new = df_new.loc[df_new.length_of_trip > 90]  # 90 here is just a random cutoff
That'll get you all the rows with a trip length above the amount you need, including the trip IDs.
You can use groupby and transform to directly add the lengths column to the DataFrame, like so:
df["lengths"] = df[["TripID", "time"]].groupby("TripID").transform("count")
I managed to find an answer to my question that is quite a bit nicer than my original approach as well:
df = df.groupby('TripID').filter(lambda x: len(x) > 2)
This can be found in the Pandas documentation. It gets rid of all groups that have 2 or fewer elements in them, or trips that are 2 records or shorter in my case.
I hope this will help someone else out as well.

how to speed up dataframe analysis

I'm looping through a DataFrame of 200k rows. It's doing what I want but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames so I wonder if I'm doing this in a very inefficient way. It's quite simple, here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. the logic I'm employing to sum this number up is:
For each row, get these rows from the dataframe:
those that match on the group id, and...
those that have a beginning date within the last 3 years of this row's start date, and...
those that have an ending date before this row's beginning date.
Sum up those rows' GAP numbers, add this row's own GAP number, and append the result to a list that becomes the new column.
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
import pandas as pd
from dateutil import parser

df = pd.DataFrame(
    columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
    data=[['09', '185', parser.parse('2008-08-13'), parser.parse('2009-07-01'), parser.parse('2005-08-13'), 44],
          ['10', '185', parser.parse('2009-08-04'), parser.parse('2010-01-18'), parser.parse('2006-08-04'), 35],
          ['11', '185', parser.parse('2010-01-18'), parser.parse('2011-01-18'), parser.parse('2007-01-18'), 0],
          ['12', '185', parser.parse('2014-09-04'), parser.parse('2015-09-04'), parser.parse('2011-09-04'), 0]])
and here's what I wrote at the top of the script, which may help:
The purpose of this script is to extract gap counts over the
last 3 year period. It uses gaps.sql as its source extract. This query
returns a DataFrame that looks like this:
ID_NBR  GROUP_ID  BEG_DATE    END_DATE    THREE_YEAR_AGO  GAP
09      185       2008-08-13  2009-07-01  2005-08-13      44
10      185       2009-08-04  2010-01-18  2006-08-04      35
11      185       2010-01-18  2011-01-18  2007-01-18      0
12      185       2014-09-04  2015-09-04  2011-09-04      0
The python code then looks back at the previous 3 years (those
previous rows that have the same GROUP_ID, whose beginning dates
come after the current row's THREE_YEAR_AGO, and whose end dates come
before the current row's beginning date). Those rows' GAP values are
added up and a new column called GAP_THREE is made. What remains is this:
ID_NBR  GROUP_ID  BEG_DATE    END_DATE    THREE_YEAR_AGO  GAP  GAP_THREE
09      185       2008-08-13  2009-07-01  2005-08-13      44   44
10      185       2009-08-04  2010-01-18  2006-08-04      35   79
11      185       2010-01-18  2011-01-18  2007-01-18      0    79
12      185       2014-09-04  2015-09-04  2011-09-04      0    0
You'll notice that row id_nbr 11 has a value of 79 for the last 3 years, but id_nbr 12 has 0, because the last gap (35) was in 2009, which is more than 3 years before 12's beginning date of 2014.
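One common way to vectorize this kind of windowed sum is a self-merge on GROUP_ID followed by a filter and a groupby. This is only a sketch against the constructed df above (it assumes ID_NBR uniquely identifies rows, and the merge creates every within-group pair of rows, so very large groups could still be expensive):
# Pair every row with every other row in the same GROUP_ID
pairs = df.merge(df, on='GROUP_ID', suffixes=('', '_prev'))

# Keep only the "previous" rows that fall inside the current row's 3-year window
mask = ((pairs['BEG_DATE_prev'] >= pairs['THREE_YEAR_AGO']) &
        (pairs['END_DATE_prev'] <= pairs['BEG_DATE']))

# Sum the matching GAP values per original row, then add the row's own GAP
summed = pairs[mask].groupby('ID_NBR')['GAP_prev'].sum()
df['GAP_THREE'] = df['ID_NBR'].map(summed).fillna(0) + df['GAP']
print(df)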
