Separate adjacent values in a column to split them [Python - Pandas]

Separate adjacent values in a column to split them [Python - Pandas] - python

in my dataframe i have a column [date_time/full_company_name] that contains a date, time and a company name. I want to split the column in order to have 2 columns, one with date and time and one with the company name - the issue is, that they are directly adjacent e.g.
[2011-11-19 12:22:10Anderson-Henderson]
So my initial idea of using the following code:
split = df[['date_time', 'full_company_name']] = df['date_time/full_company_name'].str.split('/', n=1, expand=True)
returned 2 columns but one with all the information and the second one without values.
How can I insert a '/' between date and company name in my initial dataframe to make use of this kind of splitting? or is there an easier way overall?

You can also do this by string slicing:
so firstly use astype() method and strip() method:
df['date_time/full_company_name']=df['date_time/full_company_name'].astype(str).str.strip('[]')
Finally assign columns:
df['date_time']=df['date_time/full_company_name'].str[1:20]
df['full_company_name']=df['date_time/full_company_name'].str[20:-1]
Now if you print df you will get:
date_time/full_company_name date_time full_company_name
0 '2011-11-19 12:22:10Anderson-Henderson' 2011-11-19 12:22:10 Anderson-Henderson
1 '2011-11-19 12:22:10Anderson-Henderson' 2011-11-19 12:22:10 Anderson-Henderson

I hope you find a better solution but until you do, I've come up with one that works.
Split by ":" into multiple columns, then take the seconds from the name column and shift that to the date column.
df[['date', 'hour', 'name']] = df["col"].str.split(':', expand=True)
df['date'] = df['date'] + ":" + df['hour'] + ":" + df['name'].str[:2]
df['name'] = df['name'].str[2:]
Output:
col date name
2011-11-19 12:22:10Anderson-Henderson 2011-11-19 12:22:10 Anderson-Henderson

Related

combine dataframe column with similar name. and concate values with separated by ',' (comma)

input file contains the product and its price on a particular date
product 05-Oct-2020 07-Oct-2020 09-Nov-2020 13-Nov-2020
A 66.2 69.5 72.95 76.55
B 368.7 382.8 384.7 386.8
output file should, combine all the days of month in one column and concatenate values with separated with comma (,)
product Oct-2020 Nov-2020
A 66.2, 69.5 72.95, 76.55
B 368.7, 382.8 384.7, 386.8
i tried to change column name with date format , from '1-jan-2020' to 'jan-2020'
with
keys = [dt.strptime(key, "%d-%b-%Y").strftime("%B-%Y") for key in data.keys()]
and after df transpose we can use groupby.
like there is option to group by and sum the values as :-
df.groupby().sum()
is there something that can join values (string operation) with separate them with comma.
click here to get sample data
any direction is appreciated.

The trick is to use Grouper on the colums:
inp = pd.read_excel("Stackoverflow sample.xlsx")
df = inp.set_index("Product")
df.columns = pd.to_datetime(df.columns)
out = (
df
.T
.groupby(pd.Grouper(level=0, freq="MS"))
.agg(lambda xs: ", ".join(map(str, filter(pd.notnull, xs))))
.T
)
Using the provided sample this yields the following 5 first rows for out:
If you want to convert to a particular date formatting do
out.columns = out.columns.strftime("%b-%Y")
which results in

Python Pandas DataFrame - How to sum values in 1 column based on partial match in another column (date type)?

I have encountered some issues while processing my dataset using Pandas DataFrame.
Here is my dataset:
My data types are displayed below:
My dataset is derived from:
MY_DATASET = pd.read_excel(EXCEL_FILE_PATH, index_col = None, na_values = ['NA'], usecols = "A, D")
I would like to sum all values in the "NUMBER OF PEOPLE" column for each month in the "DATE" column. For example, all values in "NUMBER OF PEOPLE" column would be added as long as the value in the "DATE" column was "2020-01", "2020-02" ...
However, I am stuck since I am unsure how to use the .groupby on partial match.
After 1) is completed, I am also trying to convert the values in the "DATE" column from YYYY-MM-DD to YYYY-MMM, like 2020-Jan.
However, I am unsure if there is such a format.
Does anyone know how to resolve these issues?
Many thanks!

Check
s = df['NUMBER OF PEOPLE'].groupby(pd.to_datetime(df['DATE'])).dt.strftime('%Y-%b')).sum()

You can get an abbeviated month name using strftime('%b') but the month name will be all in lowercase:
df['group_time'] = df.date.apply(lambda x: x.strftime('%Y-%B'))
If you need the first letter of the month in uppercase, you could do something like this:
df.group_date = df.group_date.apply(lambda x: f'{x[0:5]}{x[5].upper()}{x[6:]}'
# or in one step:
df['group_date']= df.date.apply(lambda x: x.strftime('%Y-%B')).apply(lambda x: f'{x[0:5]}
...: {x[5].upper()}{x[6:]}')
Now you just need to .groupby and .sum():
result = df['NUMBER OF PEOPLE'].groupby(df.group_date).sum()

I did some tinkering around and found that this worked for me as well:
Cheers all

Date concatenating in new column in dataframe

I have dataframe with column date with type datetime64[ns].
When I try to create new column day with format MM-DD based on date column only first method works from below. Why second method doesn't work in pandas?
df['day'] = df['date'].dt.strftime('%m-%d')
df['day2'] = str(df['date'].dt.month) + '-' + str(df['date'].dt.day)
Result for one row:
day 01-04
day2 0 1\n1 1\n2 1\n3 1\n4 ...
Types of columns
day object
day2 object

Problem of solution is if use str with df['date'].dt.month it return Series, correct way is use Series.astype:
df['day2'] = df['date'].dt.month.astype(str) + '-' + df['date'].dt.day.astype(str)

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id' :[1,2,3],'date1':
['12/31/2007','11/25/2009','10/06/2005'],'val1':
[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'val2':[1,3,5],'date3':
['12/31/2027','11/25/2029','10/06/2025'],'val3':[7,9,11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with sample data as shown below, it doesn't work with my real data which has more than 200 columns. Instead of person_id, my real data has subject_ID which is values like DC0001,DC0002 etc. Does "I" always have to be numeric? Instead it adds the stub values as new columns in my dataset and has zero rows
This is how my real columns looks like
My real data might contains NA's as well. So do I have to fill them with default values for wide_to_long to work?
Can you please help as to what can be the issue? Or any other approach to achieve the same result is also helpful.

Try adding additional argument in the function which allows the strings suffix.
pd.long_to_wide(.......................,suffix='\w+')

The issue is with your column names, the numbers used to convert from wide to long need to be at the end of your column names or you need to specify a suffix to groupby. I think the easiest solution is to create a function that accepts regex and the dataframe.
import pandas as pd
import re
def change_names(df, regex):
# Select one of three column groups
old_cols = df.filter(regex = regex).columns
# Create list of new column names
new_cols = []
for col in old_cols:
# Get the stubname of the original column
stub = ''.join(re.split(r'\d', col))
# Get the time point
num = re.findall(r'\d+', col) # returns a list like ['1']
# Make new column name
new_col = stub + num[0]
new_cols.append(new_col)
# Create dictionary mapping old column names to new column names
dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
# Rename columns
df.rename(columns = dd, inplace = True)
return df
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
# Change date columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
person_id hdate1 tval1 hdate2 tval2 hdate3 tval3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11

This is quite late to answer this question. But putting the solution here in case someone else find it useful
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
## You can use m13op22 solution to rename your columns with numeric part at the
## end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
'h2date': 'hdate2', 't2val': 'tval2',
'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion, (in this example 'hdate', 'tval') as
## stubnames. The mistake you were doing was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)

Row wise operations in pandas dataframe based on dates (sorting issue)

This question has two parts:
1) Is there a better way to do this?
2) If NO to #1, how can I fix my date issue?
I have a dataframe as follows
GROUP DATE VALUE DELTA
A 12/20/2015 2.5 ??
A 11/30/2015 25
A 1/31/2016 8.3
B etc etc
B etc etc
C etc etc
C etc etc
This is a representation, there are close to 100 rows for each group (each row representing a unique date).
For each letter in GROUP, I want to find the change in value between successive dates. So for example for GROUP A I want the change between 11/30/2015 and 12/20/2015, which is -22.5. Currently I am doing the following:
df['DATE'] = pd.to_datetime(df['DATE'],infer_datetime_format=True)
df.sort_values('DATE',ascending=True)
df_out = []
for GROUP in df.GROUP.unique():
x = df[df.GROUP == GROUP]
x['VALUESHIFT'] = x['VALUE'].shift(+1)
x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
df_out.append(x)
df_out = pd.concat(df_out)
The challenge I am running into is the dates are not sorted correctly. So when the shift takes place and I calculate the delta it is not really the delta between successive dates.
Is this the right approach to handle? If so how can I fix my date issue? I have reviewed/tried the following to no avail:
Applying datetime format in pandas for sorting
how to make a pandas dataframe column into a datetime object showing just the date to correctly sort
doing calculations in pandas dataframe based on trailing row
Pandas - Split dataframe into multiple dataframes based on dates?

Answering my own question. This works:
df['DATE'] = pd.to_datetime(df['DATE'],infer_datetime_format=True)
df_out = []
for ID in df.GROUP.unique():
x = df[df.GROUP == ID]
x.sort_values('DATE',ascending=True, inplace=True)
x['VALUESHIFT'] = x['VALUE'].shift(+1)
x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
df_out.append(x)
df_out = pd.concat(df_out)
1) Added inplace=True to sort value.
2) Added the sort within the for loop.
3) Changed by loop from using GROUP to ID since it is also the name of a column name, which I imagine is considered sloppy?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Separate adjacent values in a column to split them [Python - Pandas] - python

Related

combine dataframe column with similar name. and concate values with separated by ',' (comma)

Python Pandas DataFrame - How to sum values in 1 column based on partial match in another column (date type)?

Date concatenating in new column in dataframe

Wide to long returns empty output - Python dataframe

Row wise operations in pandas dataframe based on dates (sorting issue)

Categories

Resources