How to combine groupby and sort in pandas - python

I am trying to get one result per 'name' with all of the latest data, unless the column is blank. In R I would have used group_by, sorted by timestamp and selected the latest value for each column. I tried to do that here and got very confused. Can someone explain how to do this in Python? In the example below my goal is:
  col2  date                 name
1    4  2018-03-27 15:55:29  bil    # latest timestamp with the latest non-blank col2 value
Here's my code so far:
import pandas as pd

d = {'name': ['bil', 'bil', 'bil'],
     'date': ['2018-02-27 14:55:29', '2018-03-27 15:55:29', '2018-02-28 19:55:29'],
     'col2': [3, '', 4]}
df2 = pd.DataFrame(data=d)
print(df2)
grouped = df2.groupby(['name']).sum().reset_index()
print(grouped)
sortedvals = grouped.sort_values(['date'], ascending=False)
print(sortedvals)

Here's one way:
df3 = df2[df2['col2'] != ''].sort_values('date', ascending=False).drop_duplicates('name')
# col2 date name
# 2 4 2018-02-28 19:55:29 bil
However, the dataframe you provided and output you desire seem to be inconsistent.
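For the goal as actually described (the latest timestamp, but the latest non-blank value per column), one possible sketch is to treat blanks as NaN, sort by date, and use groupby().last(), which takes the last non-NaN value in each column per group:

```python
import numpy as np
import pandas as pd

d = {'name': ['bil', 'bil', 'bil'],
     'date': ['2018-02-27 14:55:29', '2018-03-27 15:55:29', '2018-02-28 19:55:29'],
     'col2': [3, '', 4]}
df2 = pd.DataFrame(data=d)

# Blanks become NaN so last() skips them; sorting by date
# ascending makes "last" mean "latest".
result = (df2.replace('', np.nan)
             .sort_values('date')
             .groupby('name', as_index=False)
             .last())
print(result)
```

This yields the single bil row with the 2018-03-27 timestamp and col2 = 4, matching the desired output above.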

Related

Assign counts from .count() to a dataframe + column names - pandas python

Hoping someone can help me here - I believe I am close to the solution.
I have a dataframe on which I am using .count() to return a series of all column names of my dataframe and each of their respective non-NaN value counts.
Example dataframe:
feature_1  feature_2
1          1
2          NaN
3          2
4          NaN
5          3
Example result for .count() here would output a series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe, with the column names "Feature" and "Count". To have the expected output look like this:
Feature    Count
feature_1  5
feature_2  3
I am using .to_frame() to push the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However, I receive this error message - "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements" - as though it is not recognising the actual column names (Feature) as a column with values.
How can I get it to recognise both the Feature and Count columns so that I can name them?
Use Series.reset_index instead of Series.to_frame for a 2-column DataFrame - the first column comes from the index, the second from the values of the Series:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
Another solution uses the name parameter together with Series.rename_axis, or DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
This happens because your new dataframe has only one column (the original column names become the Series index, which to_frame() turns into the DataFrame index). In order to assign a 2-element list to df.columns you have to reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']
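Putting the fix together with the example data above gives a self-contained, runnable sketch:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'feature_1': [1, 2, 3, 4, 5],
                     'feature_2': [1, np.nan, 2, np.nan, 3]})

# count() returns a Series whose index holds the column names
counts = data.count()

# reset_index turns that index into a real column, so both
# 'Feature' and 'Count' have somewhere to go
df = counts.rename_axis('Feature').reset_index(name='Count')
print(df)
```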

pandas possible bug with groupby and resample

I am a newbie in pandas and am seeking advice: is this a possible bug?
I have a dataframe with a non-unique datetime index. col1 is a group variable, col2 holds values.
I want to resample the hourly values to years, grouping by the group variable. I do this with this command:
df_resample = df.groupby('col1').resample('Y').mean()
This works fine and creates a multi-index of col1 and the datetime index, where col1 is now NOT a column in the dataframe.
However, if I change mean() to max() this is not the case. Then col1 is part of the multi-index, but the column is still present in the dataframe.
Isn't this a bug?
Sorry, but I don't know how to present dummy data as a dataframe in this post.
Edit:
code example:
from datetime import datetime, timedelta
import pandas as pd
data = {'category': ['A', 'B', 'C'],
        'value_hour': [1, 2, 3]}
days = pd.date_range(datetime.now(), datetime.now() + timedelta(2), freq='D')
df = pd.DataFrame(data, index=days)
df_mean = df.groupby('category').resample('Y').mean()
df_max = df.groupby('category').resample('Y').max()
print(df_mean, df_max)
                     value_hour
category
A        2021-12-31         1.0
B        2021-12-31         2.0
C        2021-12-31         3.0
                    category  value_hour
category
A        2021-12-31        A           1
B        2021-12-31        B           2
C        2021-12-31        C           3
Trying to drop the category column from df_max gives a KeyError:
df_max.drop('category')
File "C:\Users\mav\Anaconda3\envs\EWDpy\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'category'
Concerning the KeyError: the problem is that you are trying to drop a "category" row instead of the column.
When using drop to remove columns you should add axis=1, as in the following code:
df_max.drop('category', axis=1)
axis=1 indicates you are operating on the columns.
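An alternative that sidesteps the duplicated column entirely is to select the value columns before resampling, so the grouping key never appears as a data column. A sketch built on the question's dummy data:

```python
from datetime import datetime, timedelta
import pandas as pd

data = {'category': ['A', 'B', 'C'],
        'value_hour': [1, 2, 3]}
days = pd.date_range(datetime.now(), datetime.now() + timedelta(2), freq='D')
df = pd.DataFrame(data, index=days)

# Selecting the columns to aggregate keeps 'category' out of the
# result for max() as well, so no drop is needed afterwards
df_max = df.groupby('category')[['value_hour']].resample('Y').max()
```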

Pandas creating new columns based on conditions

I have two dataframes df1 and df2. df1 is a dataframe with various columns and df2 is a dataframe with only one column col2, which is a list of words.
It is obviously wrong, but my code so far is: df1["col_new"] = df1[df1["col1"]].str.contains(df2["col2"])
Basically, I want to create a new column called col_new in df1 that has copied values from col2 in df2 if the values are partial matches to values in col1 in df1.
For example, if col2 = "apple" and col1 = "im.apple3", then I want to copy or assign the value "apple" to col_new and so on.
Another question I have is how to find the index/position of the second uppercase letter in a string in col1 in df1.
I found a similar question on here and wrote this code: df["sec_upper"] = df["col1"].apply(lambda x: re.search("[A-Z]+{2}",x).span())[1] but I get an error saying "multiple repeat at position 6".
Can someone please help me out? Thank you in advance!
EDIT2: First problem solved. Can anyone please help me with the second problem?
EDIT1:
Example dataframes:
df1
col1
im.apple3
Cookiemm
Hi_World123
df2
col2
apple
cookie
world
candy
soda
Expected output:
col1         new_col  sec_upper
im.apple3    apple    NaN
Cookiemm     cookie   NaN
Hi_World123  world    4
Try this:
df1['new_col'] = df1['col1'].str.lower().str.extract(f"({'|'.join(df2['col2'])})")
Output:
col1 new_col
0 im.apple3 apple
1 Cookiemm cookie
2 Hi_World123 world
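For the second problem, the "multiple repeat" error comes from the pattern itself: [A-Z]+{2} stacks two quantifiers, which is invalid regex. A regex isn't really needed here; a plain scan over the string works. A sketch, assuming the 1-based position implied by the expected output of 4 for Hi_World123:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': ['im.apple3', 'Cookiemm', 'Hi_World123']})

def second_upper(s):
    # 1-based positions of all uppercase letters in s
    positions = [i for i, ch in enumerate(s, start=1) if ch.isupper()]
    # NaN when there are fewer than two uppercase letters
    return positions[1] if len(positions) > 1 else np.nan

df1['sec_upper'] = df1['col1'].apply(second_upper)
```

For 'Hi_World123' the uppercase letters sit at positions 1 ('H') and 4 ('W'), so sec_upper is 4; the other two rows get NaN.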

Pandas - Splitting text using a delimiter

Given below is the view of my Dataframe
Id,user_id
1,glen-max
2,tom-moody
I am trying to split the values in user_id column and store it in a new column.
I am able to split the user_id using the below code.
z = z['user_id'].str.split('-', n=1, expand=True)
I would like this split column to be part of my original Dataframe.
Given below is the expected format of the Dataframe
Id,user_id,col1,col2
1,glen-max,glen,max
2,tom-moody,tom,moody
Could anyone help me make it part of the original Dataframe? Thanks.
General solution, if multiple - are possible:
df = z.join(z['user_id'].str.split('-', n=1, expand=True).add_prefix('col'))
print (df)
Id user_id col0 col1
0 1 glen-max glen max
1 2 tom-moody tom moody
If there is always at most one -, it is possible to use:
z[['col1','col2']] = z['user_id'].str.split('-', n=1, expand=True)
print (z)
Id user_id col1 col2
0 1 glen-max glen max
1 2 tom-moody tom moody
Using str.split
Ex:
import pandas as pd
df = pd.read_csv(filename, sep=",")
df[["col1","col2"]] = df['user_id'].str.split('-', n=1, expand=True)
print(df)
Output:
Id user_id col1 col2
0 1 glen-max glen max
1 2 tom-moody tom moody

Changing values in a dataframe column based off a different column (python)

Col1 Col2
0 APT UB0
1 AK0 UUP
2 IL2 PB2
3 OIU U5B
4 K29 AAA
My data frame looks similar to the above data. I'm trying to change the values in Col1 if the corresponding values in Col2 have the letter "B" in it. If the value in Col2 has "B", then I want to add "-B" to the end of the value in Col1.
Ultimately I want Col1 to look like this:
Col1
0 APT-B
1 AK0
2 IL2-B
.. ...
I have an idea of how to approach it... but I'm somewhat confused because I know my code is incorrect. In addition there are NaN values in my actual data for Col1, which will definitely give an error when I try val += "-B", since it's not possible to add a string and a float.
for value in dataframe['Col2']:
    if "Z" in value:
        for val in dataframe['Col1']:
            val += "-B"
Does anyone know how to fix/solve this?
Rather than using a loop, lets use pandas directly:
import pandas as pd
df = pd.DataFrame({'Col1': ['APT', 'AK0', 'IL2', 'OIU', 'K29'], 'Col2': ['UB0', 'UUP', 'PB2', 'U5B', 'AAA']})
df.loc[df.Col2.str.contains('B'), 'Col1'] += '-B'
print(df)
Output:
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
You have too many "for" loops in your code. You just need to iterate over the rows once, and for any row satisfying your condition you make the change.
for idx, row in df.iterrows():
    if 'B' in row['Col2']:
        df.loc[idx, 'Col1'] = str(df.loc[idx, 'Col1']) + '-B'
Edit: I used str to convert the previous value in Col1 to a string before appending, since you said you sometimes have non-string values there. If this doesn't work for you, please post your test data and results.
You can use a lambda expression. If 'B' is in Col2, then '-B' gets appended to Col1. The end result is assigned back to Col1.
df['Col1'] = df.apply(lambda x: x.Col1 + ('-B' if 'B' in x.Col2 else ''), axis=1)
>>> df
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
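Since the question mentions NaN values in the real Col1, here is a NaN-safe variant of the vectorised answer. A sketch with a hypothetical NaN inserted into the sample data: na=False makes str.contains treat missing Col2 entries as non-matches, and the notna() mask leaves NaN in Col1 untouched instead of raising:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['APT', np.nan, 'IL2', 'OIU', 'K29'],
                   'Col2': ['UB0', 'UUP', 'PB2', 'U5B', 'AAA']})

# na=False: missing Col2 values count as "no B";
# notna(): skip rows where Col1 has nothing to append to
mask = df['Col2'].str.contains('B', na=False) & df['Col1'].notna()
df.loc[mask, 'Col1'] += '-B'
```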
