Sort or remove specific value from pandas dataframe - python

I'm trying to organize this data in Python before doing some analysis; for example, it shows a step count at each timestamp.
One of the goals is to calculate the step difference over some period (e.g. per minute, per hour). However, as you can see, the step count sometimes shows a higher value in between lower values (at 10:48:46), which makes computing the step difference complicated. Also note that the count restarts at 0 after 65535 (I asked here how to handle that: Panda dataframe conditional change, and it worked well on nicely sorted values).
It may be unsolvable because I can't easily remove the unwanted rows or sort the column by value, but hopefully someone has an idea how to solve this?

IIUC, do you want:
# simple setup
import pandas as pd
df = pd.DataFrame({'stepcount': [33, 32, 41, 45, 67, 76, 64, 65, 69, 70, 75, 76, 76, 76, 76]})
# keep rows at or above the running maximum, i.e. drop the dips
df[df['stepcount'] >= df['stepcount'].cummax()]
Output:
stepcount
0 33
2 41
3 45
4 67
5 76
11 76
12 76
13 76
14 76
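As a follow-up to the original goal (step differences per period with the counter restarting after 65535), here is a minimal sketch of how the rollover could be handled once the dips are filtered out. The timestamps and the modulus of 65536 (i.e. a counter with values 0–65535) are assumptions for illustration, not taken from the question:
import pandas as pd

# hypothetical cleaned data: timestamps plus a step counter that rolls over after 65535
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2021-01-01 10:47', '2021-01-01 10:48',
                                 '2021-01-01 10:49', '2021-01-01 10:50']),
    'stepcount': [65530, 65534, 3, 10],
})

# raw difference between consecutive readings
diff = df['stepcount'].diff()

# a negative difference means the counter wrapped around; add the counter
# range (65536 values, 0..65535) to recover the true increment
df['step_diff'] = diff.where(diff >= 0, diff + 65536)
print(df)
From here the differences can be aggregated per minute or per hour, e.g. with a resample on the timestamp column.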

Related

Python Pandas to match rows with overlapping coordinates

I am a Python newbie trying to figure out a problem using pandas.
I have two .csv files that I have imported as pandas dataframes.
One of these files has rows with an ID number and Start and End coordinates:
ID Start End
1 45 99
3 27 29
6 13 23
19 11 44
my second file has a columns for a code, and start and end coordinates as well:
Code Start End
ss13d 67 100
dfv45 55 100
aal33 101 222
mm0ww 24 28
I want to find start and end coordinates that overlap between both of these files in no particular order, so that the result would look something like this:
ID Start End Code Start End
1 45 99 ss13d 67 100
1 45 99 dfv45 55 100
3 27 29 mm0ww 24 28
I have tried using pandas.merge(), but from what I understand the dataframes need to have columns in common. In this case those would be my Start columns, but I can't merge on them since they are the very columns being compared.
For now I have at least figured out the logic behind how I would locate overlaps:
df = pd.read_csv (r'file1.csv')
df2 = pd.read_csv ('file2.csv')
c= (df['Start'] <= df2['Start']) & (df['End'] >= df2['Start']) | (df['Start'] <= df2['End']) & (df['End'] >= df2['End'])
but I haven't had any luck getting anything to work.
Could someone point me in the right direction? Neither concat nor merge seems to work for me in this situation.
To start out, you should probably rename your columns so you can tell which belongs to which dataframe; it'll make things easier when comparing them later.
df1 = df1.rename(columns={'Start': 'Start_1', 'End': 'End_1'})
df2 = df2.rename(columns={'Start': 'Start_2', 'End': 'End_2'})
Next, if you want to merge two dataframes that don't have any column in common, you can simply create one:
df1["key"] = 0
df2["key"] = 0
Then you can merge on that column and drop it again:
joined_df = pd.merge(df1, df2, on='key').drop(columns=['key'])
Finally, you can filter your columns based on overlap for example like this:
joined_df[(joined_df["Start_2"] > joined_df["Start_1"]) & (joined_df["Start_2"] < joined_df["End_1"])]
(Just a tip: use & and | as binary operators to combine filters, and always put parentheses around your boolean expressions.)
Hope this helps and good luck with pandas!
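Putting the pieces together, here is a minimal end-to-end sketch using the sample data from the question instead of the CSV files. Note that the overlap test below is the symmetric one (each interval starts before the other ends), which also catches intervals fully contained in the other, so it is slightly broader than the filter shown above:
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 3, 6, 19],
                    'Start': [45, 27, 13, 11],
                    'End': [99, 29, 23, 44]})
df2 = pd.DataFrame({'Code': ['ss13d', 'dfv45', 'aal33', 'mm0ww'],
                    'Start': [67, 55, 101, 24],
                    'End': [100, 100, 222, 28]})

# rename so the coordinate columns stay distinguishable after the merge
df1 = df1.rename(columns={'Start': 'Start_1', 'End': 'End_1'})
df2 = df2.rename(columns={'Start': 'Start_2', 'End': 'End_2'})

# cross join via a constant key, then drop the helper column
df1['key'] = 0
df2['key'] = 0
joined = pd.merge(df1, df2, on='key').drop(columns=['key'])

# two intervals overlap when each one starts before the other ends
overlap = joined[(joined['Start_1'] <= joined['End_2']) &
                 (joined['Start_2'] <= joined['End_1'])]
print(overlap)
On the sample data this returns the ID 1 / ss13d, ID 1 / dfv45 and ID 3 / mm0ww pairs from the desired output, plus ID 19 / mm0ww, since 24–28 does in fact overlap 11–44.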

Opening a .txt for pandas with no delimiter and number of values of different size

TL;DR: How to load .txt data with no delimiter into a dataframe where each value array has a different length and is date dependent.
I've got a fairly big data set saved in a .txt file with no delimiter in the following format:
id DateTime 4 84 464 8 64 874 5 854 652 1854 51 84 521 [. . .] 98 id DateTime 45 5 5 456 46 4 86 45 6 48 6 42 84 5 42 84 32 8 6 486 4 253 8 [. . .]
id and DateTime are numbers as well, but I've written them as strings here for readability.
The number of values between one id DateTime combination and the next is variable, and not all series start/end on the same date.
Right now I use .read_csv with delimiter=" ", which results in a three-column DataFrame with id, DateTime and Value stacked upon each other:
id DateTime Value
10 01.01 78
10 02.01 781
10 03.01 45
[:]
220 05.03 47
220 06.03 8
220 07.03 12
[:]
Then I create a dictionary entry for each id with the respective DateTime and Values using dict[id] = df["Value"][df["id"] == id], resulting in a dictionary keyed by id.
Sadly, .from_dict() doesn't work here because the value lists have different lengths. To work around this I create an np.zeros() array bigger than the biggest of the value arrays from the dictionary and save the values for each id in a new np.array based on their DateTime. Those new arrays are then combined into a new DataFrame, which ends up with a lot of rows populated with zeros.
Desired output is:
A DataFrame with each column representing an id and its values.
The first column should be the overall timeframe of the data set, basically min(DateTime) to max(DateTime).
Rows in a column where no values exist should be NaN.
This seems like a lot of hassle for something so simple in structure (see the original format). Besides that, it's quite slow. There must be a way to store the data in a DataFrame indexed by DateTime, leaving unpopulated areas as NaN.
What would be a more optimal solution for my issue?
From what I understand, this should work:
for id in df.id.unique():
    df[str(id)] = df.id.where(df.id == id)
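For the desired wide layout (one column per id, indexed by DateTime, NaN where an id has no value), a pivot of the stacked three-column DataFrame is another option worth considering. Below is a sketch with made-up sample values in the shape described above; if an id can have duplicate DateTime entries, pivot_table with an aggregation function would be needed instead:
import pandas as pd

# stacked long-format data as produced by read_csv (sample values)
df = pd.DataFrame({
    'id':       [10, 10, 10, 220, 220, 220],
    'DateTime': ['01.01', '02.01', '03.01', '05.03', '06.03', '07.03'],
    'Value':    [78, 781, 45, 47, 8, 12],
})

# one column per id, indexed by DateTime; missing (DateTime, id) combinations become NaN
wide = df.pivot(index='DateTime', columns='id', values='Value')
print(wide)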

Add value from series index to row of equal value in Pandas DataFrame

I'm facing a bit of an issue adding a new column to my Pandas DataFrame: I have a DataFrame in which each row represents a record of location data and a timestamp. Those records belong to trips, so each row also contains a trip ID. Imagine the DataFrame looks something like this:
TripID Lat Lon time
0 42 53.55 9.99 74
1 42 53.58 9.99 78
3 42 53.60 9.98 79
6 12 52.01 10.04 64
7 12 52.34 10.05 69
Now I would like to delete the records of all trips that have fewer than a minimum number of records.
lengths = df['TripID'].value_counts()
Then my idea was to add an additional column to the DataFrame and fill it with the values from that Series corresponding to the trip ID of each record. I would then be able to get rid of all rows in which the value of the length column is too small.
However, I can't seem to find a way to get the length values into the correct rows. Would any one have an idea for that or even a better approach to the entire problem?
Thanks very much!
EDIT:
My desired output should look something like this:
TripID Lat Lon time length
0 42 53.55 9.99 74 3
1 42 53.58 9.99 78 3
3 42 53.60 9.98 79 3
6 12 52.01 10.04 64 2
7 12 52.34 10.05 69 2
If I understand correctly, to get the length of the trip, you'd want to get the difference between the maximum time and the minimum time for each trip. You can do that with a groupby statement.
# Groupby, get the minimum and maximum times, then reset the index
df_new = df.groupby('TripID').time.agg(['min', 'max']).reset_index()
df_new['length_of_trip'] = df_new['max'] - df_new['min']
df_new = df_new.loc[df_new.length_of_trip > 90] # to pick a random number
That'll get you all the rows with a trip length above the amount you need, including the trip IDs.
You can use groupby and transform to directly add the lengths column to the DataFrame, like so:
df["lengths"] = df[["TripID", "time"]].groupby("TripID").transform("count")
I managed to find an answer to my question that is quite a bit nicer than my original approach as well:
df = df.groupby('TripID').filter(lambda x: len(x) > 2)
This can be found in the Pandas documentation. It gets rid of all groups that have 2 or fewer elements in them, or trips that are 2 records or shorter in my case.
I hope this will help someone else out as well.

how to speed up dataframe analysis

I'm looping through a DataFrame of 200k rows. It's doing what I want but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames so I wonder if I'm doing this in a very inefficient way. It's quite simple, here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. The logic I'm employing to sum this number up is:
for each row, get those rows from the dataframe:
that match on the group id, and...
that have a beginning date within the last 3 years of this row's start date, and...
that have an ending date before this row's beginning date.
Then sum up those rows' GAP numbers, add this row's GAP number, and append the result to a list.
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
import pandas as pd
from dateutil import parser

df = pd.DataFrame(columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
                  data=[['09', '185', parser.parse('2008-08-13'), parser.parse('2009-07-01'), parser.parse('2005-08-13'), 44],
                        ['10', '185', parser.parse('2009-08-04'), parser.parse('2010-01-18'), parser.parse('2006-08-04'), 35],
                        ['11', '185', parser.parse('2010-01-18'), parser.parse('2011-01-18'), parser.parse('2007-01-18'), 0],
                        ['12', '185', parser.parse('2014-09-04'), parser.parse('2015-09-04'), parser.parse('2011-09-04'), 0]])
and here's what I wrote at the top of the script, which may help:
The purpose of this script is to extract gap counts over the
last 3-year period. It uses gaps.sql as its source extract. This query
returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The Python code then looks back at the previous 3 years (those
previous rows that have the same GROUP_ID but whose effective dates
come after their own THREE_YEAR_AGO and whose end dates come before
their own beginning date). Those rows are added up and a new column
called GAP_THREE is made. What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that row ID_NBR 11 has a value of 79 for the last 3 years, but ID_NBR 12 has 0, because the last gap was 35 in 2009, which is more than 3 years before 12's beginning date of 2014.
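Since no answer is recorded here, one possible direction (a sketch under the assumption that ID_NBR uniquely identifies a row, and with the caveat that the self-merge can grow large if GROUP_ID groups are big) is to replace the row-by-row filtering with a self-merge on GROUP_ID, a vectorized date filter, and a groupby sum:
import pandas as pd
from dateutil import parser

df = pd.DataFrame(columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
                  data=[['09', '185', parser.parse('2008-08-13'), parser.parse('2009-07-01'), parser.parse('2005-08-13'), 44],
                        ['10', '185', parser.parse('2009-08-04'), parser.parse('2010-01-18'), parser.parse('2006-08-04'), 35],
                        ['11', '185', parser.parse('2010-01-18'), parser.parse('2011-01-18'), parser.parse('2007-01-18'), 0],
                        ['12', '185', parser.parse('2014-09-04'), parser.parse('2015-09-04'), parser.parse('2011-09-04'), 0]])

# pair every row with every other row in the same GROUP_ID
pairs = df.merge(df, on='GROUP_ID', suffixes=('', '_other'))

# keep only the "other" rows that fall inside this row's 3-year window
in_window = pairs[(pairs['BEG_DATE_other'] >= pairs['THREE_YEAR_AGO']) &
                  (pairs['END_DATE_other'] <= pairs['BEG_DATE'])]

# sum the matching GAPs per original row, then add the row's own GAP
gap_three = in_window.groupby('ID_NBR')['GAP_other'].sum()
df['GAP_THREE'] = df['ID_NBR'].map(gap_three).fillna(0) + df['GAP']
print(df)
On the sample data above this reproduces the GAP_THREE column of 44, 79, 79, 0 shown in the expected output.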

unexpected behavior when combining two dataframes in pandas

This may be a bug, but it may also be a subtlety of pandas that I'm missing. I'm combining two dataframes and the result's index isn't sorted. What's weird is that I've never seen a single instance of combine_first that failed to maintain the index sorted before.
>>> a1
X Y
DateTime
2012-11-06 16:00:11.477563 8 80
2012-11-06 16:00:11.477563 8 63
>>> a2
X Y
DateTime
2012-11-06 15:11:09.006507 1 37
2012-11-06 15:11:09.006507 1 36
>>> a1.combine_first(a2)
X Y
DateTime
2012-11-06 16:00:11.477563 8 80
2012-11-06 16:00:11.477563 8 63
2012-11-06 15:11:09.006507 1 37
2012-11-06 15:11:09.006507 1 36
>>> a2.combine_first(a1)
X Y
DateTime
2012-11-06 16:00:11.477563 8 80
2012-11-06 16:00:11.477563 8 63
2012-11-06 15:11:09.006507 1 37
2012-11-06 15:11:09.006507 1 36
I can reproduce, so I'm happy to take suggestions. Guesses as to what's going on are most welcome.
The combine_first function uses index.union to combine and sort the indexes. The index.union docstring states that it only sorts if possible, so combine_first is not necessarily going to return sorted results by design.
For non-monotonic indexes, the index.union tries to sort, but returns unsorted results if there is an exception. I don't know if this is a bug or not, but index.union does not even attempt to sort monotonic indexes like the datetime index in your example.
I've opened an issue on GitHub, but I guess you should do a2.combine_first(a1).sort_index() for any datetime indexes for now.
Update: This bug is now fixed on GitHub
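For illustration, a minimal sketch of that workaround, reconstructing frames shaped like the console output above (the duplicate timestamps from the original example are replaced with distinct ones here to keep the sketch simple; on pandas versions with the fix, combine_first already returns a sorted index and sort_index is effectively a no-op):
import pandas as pd

idx1 = pd.DatetimeIndex(['2012-11-06 16:00:11.477563', '2012-11-06 16:00:12'], name='DateTime')
idx2 = pd.DatetimeIndex(['2012-11-06 15:11:09.006507', '2012-11-06 15:11:10'], name='DateTime')

a1 = pd.DataFrame({'X': [8, 8], 'Y': [80, 63]}, index=idx1)
a2 = pd.DataFrame({'X': [1, 1], 'Y': [37, 36]}, index=idx2)

# combine the two frames, then force the DatetimeIndex into sorted order
combined = a2.combine_first(a1).sort_index()
print(combined)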
Do you actually mean to use .append()?
Try:
a2.append(a1)
combine_first is not actually an append operation. See http://pandas.pydata.org/pandas-docs/dev/basics.html?highlight=combine_first#combining-overlapping-data-sets:
A problem occasionally arising is the combination of two similar data
sets where values in one are preferred over the other. An example
would be two data series representing a particular economic indicator
where one is considered to be of “higher quality”. However, the lower
quality series might extend further back in history or have more
complete data coverage. As such, we would like to combine two
DataFrame objects where missing values in one DataFrame are
conditionally filled with like-labeled values from the other
DataFrame.
while append is described at http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.append.html?highlight=append:
Append columns of other to end of this frame’s columns and index,
returning a new object. Columns not in this frame are added as new
columns.
