Merge levels of a categorical variable in pandas - python

I am wondering how to merge levels of a categorical variable in Python.
I have the following dataset:
dataset['Reason'].value_counts().head(5)
Reason Count
0 339
7 125
11 124
3 82
0 65
Now I want to merge the first and last levels shown (both are variants of '0'), so that the output looks like:
dataset['Reason'].value_counts().head(5)
Reason Count
0 404
7 125
11 124
3 82
2 52
In order to get the reason, I had to split a string, which might have introduced the extra levels (e.g. trailing whitespace) in the Reason column.
I have tried using loc, but I am wondering whether there is a smarter way to do it:
dataset.loc[dataset['Reason'] == '0' , ['Reason']] = 'On request'
dataset.loc[dataset['Reason'] == '0 ' , ['Reason']] = 'On request'
Thanks, Michael.

As @anky_91 mentioned, use Series.str.strip if all values are strings:
dataset['Reason'].str.strip().value_counts().head(5)
If some values are numeric, first cast to strings with Series.astype:
dataset['Reason'].astype(str).str.strip().value_counts().head(5)
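If you also want to keep the cleaned values in the dataframe (and optionally relabel the merged '0' level, as in the loc attempts above), you can assign the stripped column back; this is just a sketch, and 'On request' is simply the label used in the question:
dataset['Reason'] = dataset['Reason'].astype(str).str.strip()
# optional: give the merged level a readable label
dataset['Reason'] = dataset['Reason'].replace({'0': 'On request'})
dataset['Reason'].value_counts().head(5)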

Related

Python Pandas to match rows with overlapping coordinates

I am a python newbie, trying to figure out a problem using pandas.
I have two .csv files that I have imported as pandas dataframes.
One of these files has columns for an ID number and Start and End coordinates:
ID Start End
1 45 99
3 27 29
6 13 23
19 11 44
My second file has columns for a code and start and end coordinates as well:
Code Start End
ss13d 67 100
dfv45 55 100
aal33 101 222
mm0ww 24 28
I want to find start and end coordinates that overlap between both of these files in no particular order, so that the result would look something like this:
ID Start End Code Start End
1 45 99 ss13d 67 100
1 45 99 dfv45 55 100
3 27 29 mm0ww 24 28
I have tried using pandas.merge(), but from what I understand the dataframes need to have columns in common. In this case those are my Start and End columns, but I can't merge on them since they are the ones being compared.
For now I have at least figured out the logic for locating overlaps:
import pandas as pd

df = pd.read_csv(r'file1.csv')
df2 = pd.read_csv('file2.csv')
c = (df['Start'] <= df2['Start']) & (df['End'] >= df2['Start']) | (df['Start'] <= df2['End']) & (df['End'] >= df2['End'])
but I haven't had any luck getting anything to work.
Could someone point me in the right direction? Neither concat nor merge seems to work for me in this situation, I think.
To start out, you should probably rename your columns so that you can tell which belongs to which dataframe; it will make things easier when comparing them later.
df1 = df1.rename(columns={'Start': 'Start_1', 'End': 'End_1'})
df2 = df2.rename(columns={'Start': 'Start_2', 'End': 'End_2'})
Next, if you want to merge two dataframes but don't have any column in common, you can simply create one:
df1["key"] = 0
df2["key"] = 0
Then you can merge on that column and drop it again:
joined_df = pd.merge(df1, df2, on='key').drop(columns=['key'])
Finally, you can filter your rows based on overlap, for example like this:
joined_df[(joined_df["Start_2"] > joined_df["Start_1"]) & (joined_df["Start_2"] < joined_df["End_1"])]
(Just a tip: use & and | as binary operators to combine filters, and always put parentheses around your boolean expressions.)
Hope this helps and good luck with pandas!
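Putting those pieces together on the sample data from the question, a minimal end-to-end sketch might look like this. It uses the standard interval-overlap test (Start_1 <= End_2 and Start_2 <= End_1), a slightly more compact condition than the one in the question:
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 3, 6, 19],
                    'Start': [45, 27, 13, 11],
                    'End': [99, 29, 23, 44]})
df2 = pd.DataFrame({'Code': ['ss13d', 'dfv45', 'aal33', 'mm0ww'],
                    'Start': [67, 55, 101, 24],
                    'End': [100, 100, 222, 28]})

# Rename so the two coordinate pairs stay distinguishable after the merge
df1 = df1.rename(columns={'Start': 'Start_1', 'End': 'End_1'})
df2 = df2.rename(columns={'Start': 'Start_2', 'End': 'End_2'})

# Cross join via a constant key, then keep only the overlapping pairs
df1['key'] = 0
df2['key'] = 0
joined = pd.merge(df1, df2, on='key').drop(columns=['key'])
overlaps = joined[(joined['Start_1'] <= joined['End_2']) &
                  (joined['Start_2'] <= joined['End_1'])]
print(overlaps)
Note that this also flags ID 19 against mm0ww, since 24-28 lies entirely inside 11-44; the expected output in the question appears to have omitted that pair.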

Sort or remove specific value from pandas dataframe

I am trying to organize this data (a step count recorded at each timestamp) before doing some analysis in Python.
One of the purposes is to calculate the step difference over some period (e.g. per minute, per hour). However, as can be seen, the step count sometimes shows a higher value in between lower values (at 10:48:46), which makes computing the step difference complicated. To be noted, the count restarts at 0 after 65535 (I asked here how to handle values after 65535: Panda dataframe conditional change, and that worked well on nicely sorted values).
I know it may be unsolvable because I can't easily remove the unwanted rows or sort the column by value, but hopefully someone has an idea how to solve this?
IIUC, do you want something like this:
# simple setup
import pandas as pd

df = pd.DataFrame({'stepcount': [33, 32, 41, 45, 67, 76, 64, 65, 69, 70, 75, 76, 76, 76, 76]})
df[df['stepcount'] >= df['stepcount'].cummax()]
Output:
stepcount
0 33
2 41
3 45
4 67
5 76
11 76
12 76
13 76
14 76
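If the next step is the per-period step difference mentioned in the question, one option is to take diff on the filtered frame; just a sketch on the toy data above:
clean = df[df['stepcount'] >= df['stepcount'].cummax()].copy()
clean['step_diff'] = clean['stepcount'].diff()  # NaN for the first row
print(clean)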

Add value from series index to row of equal value in Pandas DataFrame

I'm facing a bit of an issue adding a new column to my Pandas DataFrame: I have a DataFrame in which each row represents a record of location data and a timestamp. Those records belong to trips, so each row also contains a trip id. Imagine the DataFrame looks kind of like this:
TripID Lat Lon time
0 42 53.55 9.99 74
1 42 53.58 9.99 78
3 42 53.60 9.98 79
6 12 52.01 10.04 64
7 12 52.34 10.05 69
Now I would like to delete the records of all trips that have fewer than a minimum number of records. I figured I could simply get the number of records for each trip like so:
lengths = df['TripID'].value_counts()
Then my idea was to add an additional column to the DataFrame and fill it with the values from that Series corresponding to the trip id of each record. I would then be able to get rid of all rows in which the value of the length column is too small.
However, I can't seem to find a way to get the length values into the correct rows. Would anyone have an idea for that, or even a better approach to the entire problem?
Thanks very much!
EDIT:
My desired output should look something like this:
TripID Lat Lon time length
0 42 53.55 9.99 74 3
1 42 53.58 9.99 78 3
3 42 53.60 9.98 79 3
6 12 52.01 10.04 64 2
7 12 52.34 10.05 69 2
If I understand correctly, to get the length of the trip, you'd want to get the difference between the maximum time and the minimum time for each trip. You can do that with a groupby statement.
# Groupby, get the minimum and maximum times, then reset the index
df_new = df.groupby('TripID').time.agg(['min', 'max']).reset_index()
df_new['length_of_trip'] = df_new['max'] - df_new['min']  # bracket access; .max/.min would refer to the methods
df_new = df_new.loc[df_new.length_of_trip > 90] # to pick a random number
That'll get you all the rows with a trip length above the amount you need, including the trip IDs.
You can use groupby and transform to directly add the lengths column to the DataFrame, like so:
df["lengths"] = df[["TripID", "time"]].groupby("TripID").transform("count")
I managed to find an answer to my question that is quite a bit nicer than my original approach as well:
df = df.groupby('TripID').filter(lambda x: len(x) > 2)
This can be found in the Pandas documentation. It gets rid of all groups that have 2 or fewer elements in them, i.e. trips that are 2 records or shorter in my case.
I hope this will help someone else out as well.
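For completeness, the mapping idea from the question also works; a minimal sketch, assuming df and the lengths Series as defined above:
lengths = df['TripID'].value_counts()
# map each row's TripID to its record count, then keep trips with at least 3 records
df['length'] = df['TripID'].map(lengths)
df = df[df['length'] >= 3]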

iloc Pandas Slice Difficulties

I've updated the information below to be a little clearer, as per the comments:
I have the following dataframe df (it has 38 columns; these are only the last few):
Col # 33 34 35 36 37 38
id 09.2018 10.2018 11.2018 12.2018 LTx LTx2
123 0.505 0.505 0.505 0.505 33 35
223 2.462 2.464 0.0 30.0 33 36
323 1.231 1.231 1.231 1.231 33 35
423 0.859 0.855 0.850 0.847 33 36
I am trying to create a new column which is the sum of a slice using iloc, so for the row with id 123 it would look like the following:
df['LTx3'] = (df.iloc[:, 33:35]).sum(axis=1)
This is perfect obviously for 123 but not for 223. I had assumed this would work:
df['LTx3'] = (df.iloc[:, 'LTx':'LTx2']).sum(axis=1)
But I consistently get the same error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [LTx] of <class 'str'>
I have been trying some variations, such as the one below, but unfortunately none have led to a working solution:
df['LTx3'] = (df.iloc[:, df.columns.get_loc('LTx'):df.columns.get_loc('LTx2')]).sum(axis=1)
Basically, columns LTx and LTx2 are made up of integers that vary row to row. I want to use these integers as the bounds for the slice; I'm not quite certain how I should do this.
If anyone could help lead me to a solution it would be fantastic!
Cheers
I'd recommend reading up on .loc, .iloc slicing in pandas:
https://pandas.pydata.org/pandas-docs/stable/indexing.html
.loc selects based on name(s). .iloc selects based on index (numerical) position.
You can also subset based on column names. Note also that depending on how you create your dataframe, you may have numbers cast as strings.
To get the row corresponding to 223:
df3 = df[df['Col'] == '223']
To get the columns corresponding to the names 33, 34, and 35:
df3 = df[df['Col'] == '223'].loc[:, '33':'35']
If you want to select rows wherein any column contains a given string, I found this solution: Most concise way to select rows where any column contains a string in Pandas dataframe?
df[df.apply(lambda row: row.astype(str).str.contains('LTx2').any(), axis=1)]
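To address the original question of using per-row integer bounds, one option is a row-wise apply; this is just a sketch, assuming the integers in LTx and LTx2 line up with the dataframe's iloc column positions, as in the question's own example:
# for each row, sum the values between the positions stored in LTx and LTx2
df['LTx3'] = df.apply(
    lambda row: row.iloc[int(row['LTx']):int(row['LTx2'])].sum(),
    axis=1)
This is slower than a vectorised slice, but it handles slice bounds that differ from row to row.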

How can I drop the first level from a multi level column by chaining?

How can I drop the first level from a multi level column?
For a data frame:
tmp.head(1000).groupby(['user_id', 'aisle_id']).agg({'aisle_id': ['count']})
giving
aisle_id
count
user_id aisle_id
382 38 1
84 2
115 1
3107 43 1
3321 37 1
69 2
I want to drop the aisle_id in my columns. How can I do this by chaining commands without having to start another statement?
Change your groupby statement.
tmp.head(1000).groupby(['user_id', 'aisle_id'])['aisle_id'].agg(['count'])
You can quickly access the first level of the columns MultiIndex with the dot operator, similar to how you'd access columns with a single-level index.
Just add .aisle_id at the end, or equivalently ['aisle_id']:
tmp.head(1000).groupby(['user_id', 'aisle_id']).agg({'aisle_id': ['count']}) \
.aisle_id
count
user_id aisle_id
382 38 1
84 2
115 1
3107 43 1
3321 37 1
69 2
Response to Comment
@displayname df.aisle_id and df.xs('aisle_id', axis=1) are equivalent. What I mean to point out is that it will access all columns whose first level is aisle_id. If you were to aggregate the way you had, this will work identically to what ScottBoston has suggested. The difference is that if you wanted to store the results of an aggregation over more than just one column in a variable, those results are preserved and you can still access just aisle_id with df.aisle_id. The advantage of ScottBoston's solution is that when more than one column is available, we limit the calculation to just aisle_id.
Another option: transpose, use reset_index at level 0 with drop set to True, then transpose back:
tmp.head(1000).groupby(['user_id', 'aisle_id']) \
.agg({'aisle_id': ['count']}).T.reset_index(level=0, drop=True).T
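As another chained option, newer pandas versions (0.24+) have DataFrame.droplevel, which removes a column level directly; a minimal sketch:
(tmp.head(1000)
    .groupby(['user_id', 'aisle_id'])
    .agg({'aisle_id': ['count']})
    .droplevel(0, axis=1))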
