unexpected behavior when combining two dataframes in pandas - python

This may be a bug, but it may also be a subtlety of pandas that I'm missing. I'm combining two dataframes and the result's index isn't sorted. What's weird is that I've never before seen an instance of combine_first that failed to keep the index sorted.
>>> a1
                             X   Y
DateTime
2012-11-06 16:00:11.477563   8  80
2012-11-06 16:00:11.477563   8  63
>>> a2
                             X   Y
DateTime
2012-11-06 15:11:09.006507   1  37
2012-11-06 15:11:09.006507   1  36
>>> a1.combine_first(a2)
                             X   Y
DateTime
2012-11-06 16:00:11.477563   8  80
2012-11-06 16:00:11.477563   8  63
2012-11-06 15:11:09.006507   1  37
2012-11-06 15:11:09.006507   1  36
>>> a2.combine_first(a1)
                             X   Y
DateTime
2012-11-06 16:00:11.477563   8  80
2012-11-06 16:00:11.477563   8  63
2012-11-06 15:11:09.006507   1  37
2012-11-06 15:11:09.006507   1  36
I can reproduce, so I'm happy to take suggestions. Guesses as to what's going on are most welcome.

The combine_first function uses index.union to combine and sort the indexes. The index.union docstring states that it only sorts if possible, so combine_first is not necessarily going to return sorted results by design.
For non-monotonic indexes, index.union tries to sort, but returns unsorted results if an exception occurs. I don't know if this is a bug or not, but index.union does not even attempt to sort monotonic indexes like the datetime index in your example.
I've opened an issue on GitHub, but for now I suggest doing a2.combine_first(a1).sort_index() for any datetime indexes.
Update: This bug is now fixed on GitHub.
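A minimal sketch of the workaround, using hypothetical data (the frame names a1/a2 match the question, but the values and timestamps here are made up):
import pandas as pd

a1 = pd.DataFrame({'X': [8, 8], 'Y': [80, 63]},
                  index=pd.to_datetime(['2012-11-06 16:00:11', '2012-11-06 16:00:12']))
a2 = pd.DataFrame({'X': [1, 1], 'Y': [37, 36]},
                  index=pd.to_datetime(['2012-11-06 15:11:09', '2012-11-06 15:11:10']))

# on affected versions combine_first can return the union of the indexes unsorted,
# so sort explicitly afterwards
combined = a2.combine_first(a1).sort_index()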

Do you actually mean to use .append()?
Try:
a2.append(a1)
combine_first is not actually an append operation. See http://pandas.pydata.org/pandas-docs/dev/basics.html?highlight=combine_first#combining-overlapping-data-sets:
A problem occasionally arising is the combination of two similar data
sets where values in one are preferred over the other. An example
would be two data series representing a particular economic indicator
where one is considered to be of “higher quality”. However, the lower
quality series might extend further back in history or have more
complete data coverage. As such, we would like to combine two
DataFrame objects where missing values in one DataFrame are
conditionally filled with like-labeled values from the other
DataFrame.
while append (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.append.html?highlight=append) is documented as:
Append columns of other to end of this frame’s columns and index,
returning a new object. Columns not in this frame are added as new
columns.
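A quick sketch of the difference on hypothetical overlapping frames (the names df1/df2 and the index labels are made up):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'X': [1.0, np.nan]}, index=['a', 'b'])
df2 = pd.DataFrame({'X': [2.0, 3.0]}, index=['b', 'c'])

appended = df1.append(df2)          # stacks all rows: index a, b, b, c
combined = df1.combine_first(df2)   # one row per label; df1's NaN at 'b' is filled from df2
# note: DataFrame.append was removed in later pandas versions; pd.concat([df1, df2]) is the replacement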

Related

Opening a .txt for pandas with no delimiter and number of values of different size

TLDR: How to load .txt data with no delimiter into a DataFrame where each value array has a different length and is date dependent.
I've got a fairly big data set saved in a .txt file with no delimiter in the following format:
id DateTime 4 84 464 8 64 874 5 854 652 1854 51 84 521 [. . .] 98 id DateTime 45 5 5 456 46 4 86 45 6 48 6 42 84 5 42 84 32 8 6 486 4 253 8 [. . .]
id and DateTime are numbers as well, but I've written them as strings here for readability.
The length between the first id DateTime combination and the next is variable and not all values start/end on the same date.
Right now what I do is use .read_csv with delimiter=" ", which results in a three-column DataFrame with id, DateTime and Values all stacked upon each other:
id DateTime Value
10 01.01 78
10 02.01 781
10 03.01 45
[:]
220 05.03 47
220 06.03 8
220 07.03 12
[:]
Then I create a dictionary for each id with the respective DateTime and their Values with dict[id]= df["Value"][df["id"]==id] resulting in a dictionary with keys as id.
Sadly, using .from_dict() doesn't work here because each value list has a different length. To solve this I create an np.zeros() array that is bigger than the biggest of the value arrays from the dictionary and save the values for each id inside a new np.array based on their DateTime. Those new arrays are then combined into a new DataFrame, resulting in a lot of rows populated with zeros.
Desired output is:
A DataFrame with each column representing an id and its values.
The first column as the overall timeframe of the data set, basically min(DateTime) to max(DateTime).
Rows in a column where no values exist should be NaN
This seems like a lot of hassle for something so simple in structure (see the original format). Besides that, it's quite slow. There must be a way to store the data inside a DataFrame based on the DateTime, leaving unpopulated areas as NaN.
What is a (if possible) more optimal solution for my issue?
From what I understand, this should work:
for i in df.id.unique():
    # put each id's values in its own column; rows belonging to other ids become NaN
    df[str(i)] = df['Value'].where(df['id'] == i)
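Alternatively, a pivot-based sketch that produces the desired wide layout more directly, assuming the parsed frame has the columns id, DateTime and Value as shown above:
# assuming df has columns ['id', 'DateTime', 'Value'] as produced by read_csv above
wide = df.pivot_table(index='DateTime', columns='id', values='Value', aggfunc='first')
# DateTimes where an id has no reading come out as NaN automatically
wide = wide.sort_index()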

Pandas: Join or merge multiple dataframes on a column where column values are repeating

I have three dataframes with row counts more than 71K. Below are the samples.
df_1 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001],'Col_A':[45,56,78,33]})
df_2 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887],'Col_B':[35,46,78,33,66]})
df_3 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887,1223],'Col_C':[5,14,8,13,16,8]})
Edit
As suggested, below is my desired output:
df_final
Device_ID Col_A Col_B Col_C
1001 45 35 5
1034 56 46 14
1223 78 78 8
1001 33 33 13
1887 NaN 66 16
1223 NaN NaN 8
While using pd.merge() or df_1.set_index('Device_ID').join([df_2.set_index('Device_ID'), df_3.set_index('Device_ID')], on='Device_ID'), it takes a very long time. One reason is the repeating values of Device_ID.
I am aware of the reduce method, but my suspicion is that it may lead to the same situation.
Is there any better and efficient way?
To get your desired outcome, you can use this:
result = pd.concat([df_1.drop('Device_ID', axis=1),df_2.drop('Device_ID',axis=1),df_3],axis=1).set_index('Device_ID')
If you don't want to use Device_ID as the index, you can remove the set_index part of the code. Also, note that because of the presence of NaNs in some columns (Col_A and Col_B) of the final dataframe, pandas will cast the non-missing values to floats, as NaN can't be stored in an integer array (unless you are on pandas 0.24 or later, which adds a nullable integer dtype).
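For reference, a runnable sketch of that solution using the sample frames from the question:
import pandas as pd

df_1 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001], 'Col_A': [45, 56, 78, 33]})
df_2 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001, 1887], 'Col_B': [35, 46, 78, 33, 66]})
df_3 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001, 1887, 1223], 'Col_C': [5, 14, 8, 13, 16, 8]})

# concat aligns the frames by row position (the default RangeIndex),
# so df_3, the longest frame, supplies Device_ID for every row
result = pd.concat([df_1.drop('Device_ID', axis=1),
                    df_2.drop('Device_ID', axis=1),
                    df_3], axis=1).set_index('Device_ID')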

Sort or remove specific value from pandas dataframe

I'm trying to organize this data before doing some analysis in Python; it shows the step count recorded at each timestamp.
One of the purposes is to calculate the step difference over some period (e.g. per minute, per hour). However, as can be seen, the step count sometimes shows a higher value in between lower values (at 10:48:46), which makes computing the step difference complicated. Also, the count restarts at 0 after 65535 (I asked how to make it readable after 65535 here: Panda dataframe conditional change, and that worked well on nicely sorted values).
I know it may be unsolvable because I can't easily remove the unwanted rows or sort the column by value, but hopefully someone has an idea to solve this?
IIUC, do you want:
#simple setup
df = pd.DataFrame({'stepcount':[33,32,41,45,67,76,64,65,69,70,75,76,76,76,76]})
df[df['stepcount'] >= df['stepcount'].cummax()]
Output:
stepcount
0 33
2 41
3 45
4 67
5 76
11 76
12 76
13 76
14 76
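As a follow-up for the stated goal of computing step differences, a sketch building on the filtered result (the column name step_diff is just for illustration):
# drop the rows that fall below the running maximum, then take row-to-row differences
clean = df[df['stepcount'] >= df['stepcount'].cummax()].copy()
clean['step_diff'] = clean['stepcount'].diff()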

Add value from series index to row of equal value in Pandas DataFrame

I'm facing a bit of an issue adding a new column to my Pandas DataFrame: I have a DataFrame in which each row represents a record of location data and a timestamp. Those records belong to trips, so each row also contains a trip id. Imagine the DataFrame looks something like this:
TripID Lat Lon time
0 42 53.55 9.99 74
1 42 53.58 9.99 78
3 42 53.60 9.98 79
6 12 52.01 10.04 64
7 12 52.34 10.05 69
Now I would like to delete the records of all trips that have less than a minimum amount of records to them. I figured I could simply get the number of records of each trip like so:
lengths = df['TripID'].value_counts()
Then my idea was to add an additional column to the DataFrame and fill it with the values from that Series corresponding to the trip id of each record. I would then be able to get rid of all rows in which the value of the length column is too small.
However, I can't seem to find a way to get the length values into the correct rows. Would any one have an idea for that or even a better approach to the entire problem?
Thanks very much!
EDIT:
My desired output should look something like this:
TripID Lat Lon time length
0 42 53.55 9.99 74 3
1 42 53.58 9.99 78 3
3 42 53.60 9.98 79 3
6 12 52.01 10.04 64 2
7 12 52.34 10.05 69 2
If I understand correctly, to get the length of the trip, you'd want to get the difference between the maximum time and the minimum time for each trip. You can do that with a groupby statement.
# Groupby, get the minimum and maximum times, then reset the index
df_new = df.groupby('TripID').time.agg(['min', 'max']).reset_index()
df_new['length_of_trip'] = df_new['max'] - df_new['min']
df_new = df_new.loc[df_new.length_of_trip > 90] # to pick a random number
That'll get you all the rows with a trip length above the amount you need, including the trip IDs.
You can use groupby and transform to directly add the lengths column to the DataFrame, like so:
df["lengths"] = df[["TripID", "time"]].groupby("TripID").transform("count")
I managed to find an answer to my question that is quite a bit nicer than my original approach as well:
df = df.groupby('TripID').filter(lambda x: len(x) > 2)
This can be found in the Pandas documentation. It gets rid of all groups that have 2 or fewer elements in them, i.e. trips that are 2 records or shorter in my case.
I hope this will help someone else out as well.

iloc Pandas Slice Difficulties

I've updated the below information to be a little clearer as per the comments:
I have the following dataframe df (it has 38 columns this is only the last few):
Col # 33 34 35 36 37 38
id 09.2018 10.2018 11.2018 12.2018 LTx LTx2
123 0.505 0.505 0.505 0.505 33 35
223 2.462 2.464 0.0 30.0 33 36
323 1.231 1.231 1.231 1.231 33 35
423 0.859 0.855 0.850 0.847 33 36
I am trying to create a new column which is the sum of a slice using iloc, so for example for row 123 it would look like the following:
df['LTx3'] = (df.iloc[:, 33:35]).sum(axis=1)
This is perfect obviously for 123 but not for 223. I had assumed this would work:
df['LTx3'] = (df.iloc[:, 'LTx':'LTx2']).sum(axis=1)
But I consistently get the same error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [LTx] of <class 'str'>
I had been trying some variations of this, such as the one below, but unfortunately they also haven't led to a working solution:
df['LTx3'] = (df.iloc[:, df.columns.get_loc('LTx'):df.columns.get_loc('LTx2')]).sum(axis=1)
Basically, columns LTx and LTx2 are made up of integers but vary row to row. I want to use these integers as the bounds for the slice; I'm not quite certain how I should do this.
If anyone could help lead me to a solution it would be fantastic!
Cheers
I'd recommend reading up on .loc, .iloc slicing in pandas:
https://pandas.pydata.org/pandas-docs/stable/indexing.html
.loc selects based on name(s). .iloc selects based on index (numerical) position.
You can also subset based on column names. Note also that depending on how you create your dataframe, you may have numbers cast as strings.
To get the row corresponding to 223:
df3 = df[df['Col'] == '223']
To get the columns corresponding to the names 33, 34, and 35:
df3 = df[df['Col'] == '223'].loc[:, '33':'35']
If you want to select rows wherein any column contains a given string, I found this solution: Most concise way to select rows where any column contains a string in Pandas dataframe?
df[df.apply(lambda row: row.astype(str).str.contains('LTx2').any(), axis=1)]
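For the original goal of using the per-row integers in LTx and LTx2 as slice bounds, a row-wise sketch (assuming, as in the example, that those integers are valid column positions):
# for each row, sum the values between the column positions stored in that row's LTx and LTx2
df['LTx3'] = df.apply(lambda row: row.iloc[int(row['LTx']):int(row['LTx2'])].sum(), axis=1)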
