How to use wide_to_long (Pandas) - python

I have this code, which I thought would reformat the dataframe so that the columns sharing the same base name (the duplicates) are separated from the unique columns.
# Function that splits dataframe into two separate dataframes, one with all unique
# columns and one with all duplicates
def sub_dataframes(dataframe):
    # Extract common prefix -> remove trailing digits
    columns = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
    # Split columns
    unq_cols = columns[columns == 1].index
    # All columns from dataframe that are not in unq_cols
    dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]
    return dataframe[unq_cols], dataframe[dup_cols]

unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(unq_df.columns.tolist(), dup_df.columns.tolist()))
Output:
Unique columns:
['total_tracks', 'popularity']
Duplicate columns:
['t_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 't_energy1', 't_energy2',
't_key0', 't_key1', 't_key2', 't_speech0', 't_speech1', 't_speech2', 't_acous0', 't_acous1', 't_acous2',
't_ins0', 't_ins1', 't_ins2', 't_live0', 't_live1', 't_live2', 't_val0', 't_val1', 't_val2', 't_tempo0',
't_tempo1', 't_tempo2']
Then I tried to use wide_to_long to combine columns with the same name:
cols = unq_df.columns.tolist()
temp = (pd.wide_to_long(dataset.reset_index(),
                        stubnames=['t_dur', 't_dance', 't_energy', 't_key', 't_mode',
                                   't_speech', 't_acous', 't_ins', 't_live', 't_val',
                                   't_tempo'],
                        i=['index'] + cols, j='temp', sep='t_')
        .reset_index()
        .groupby(cols, as_index=False)
        .mean())
temp
The result was a dataframe with "Nothing to show". I tried to look at this question, but still couldn't make it work. What am I doing wrong here? How do I fix this?
EDIT
Here is an example of how I've done it "by hand", but I am trying to do it more efficiently using the built-in pandas functions.
The desired output is the dataframe that is shown last.
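For context on the arguments, here is a minimal, hedged sketch of wide_to_long on a toy frame shaped like the one above (hypothetical values, not the real dataset). Note sep='' rather than sep='t_': the digit suffix follows the stub directly, and the 't_' prefix is already part of each stubname, so a sep of 't_' matches no columns and yields an empty result.

import pandas as pd

# Toy frame mimicking the column layout above (hypothetical values)
df = pd.DataFrame({
    'total_tracks': [10, 12],
    'popularity': [45, 67],
    't_dur0': [200, 210], 't_dur1': [180, 190], 't_dur2': [220, 230],
    't_key0': [1, 2], 't_key1': [3, 4], 't_key2': [5, 6],
})

long_df = pd.wide_to_long(
    df.reset_index(),              # wide_to_long needs a unique id column
    stubnames=['t_dur', 't_key'],  # prefixes shared by the duplicated columns
    i='index',                     # the id column
    j='track',                     # name for the captured digit suffix
    sep='',                        # nothing sits between stub and suffix
)

# Average the suffixed columns back into one value per original row
print(long_df.groupby(level='index')[['t_dur', 't_key']].mean())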

Related

Reformatting a dataframe to access it for sort after concatenating two series

I've joined or concatenated two series into a dataframe. However, one of the issues I'm now facing is that I have no column headings on the actual data that would help me do a sort.
hist_a = pd.crosstab(category_a, category, normalize=True)
hist_b = pd.crosstab(category_b, category, normalize=True)
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index])
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index])
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
The data looks like the following:
0 1
category
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
and I'd like to do a sort, but there are no proper column headings
df_plots = df_plots.sort_values(by=['0?'])
But the dataframe seems to be in two parts. How could I better structure the dataframe to have 'proper' columns, such as '0' or 'plot a', rather than being indexable only by an integer, which seems hard to work with? Something like:
category plot a plot b
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
Just rename the columns of the dataframe, for example:
df = pd.DataFrame({0:[1,23]})
df = df.rename(columns={0:'new name'})
If you have a lot of columns, you can rename all of them at once:
df = pd.DataFrame({0:[1,23]})
rename_dict = {key: f'Col {key}' for key in df.keys() }
df = df.rename(columns=rename_dict)
You can also define the series with the name, so you avoid changing the name afterwards:
counts_a = pd.Series(np.diag(hist_a), index=hist_a.index, name='counts_a')
counts_b = pd.Series(np.diag(hist_b), index=hist_b.index, name='counts_b')
(Note index=hist_a.index, not [hist_a.index]: wrapping the index in a list builds a one-level MultiIndex, which is part of why the frame looks like it is "in two parts".)
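Putting that together, a minimal runnable sketch of the named-Series route, with toy stand-in data since the original crosstab inputs aren't shown:

import numpy as np
import pandas as pd

# Toy stand-ins for the real categorical series (hypothetical data)
category = pd.Series([0, 1, 1, 0])
category_a = pd.Series([0, 1, 0, 0])
category_b = pd.Series([0, 1, 1, 1])

hist_a = pd.crosstab(category_a, category, normalize=True)
hist_b = pd.crosstab(category_b, category, normalize=True)

counts_a = pd.Series(np.diag(hist_a), index=hist_a.index, name='plot a')
counts_b = pd.Series(np.diag(hist_b), index=hist_b.index, name='plot b')

df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
print(df_plots.sort_values(by='plot a'))  # sorting by a real column name now works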

How to delete string value from a Dataframe with no column or row names

I have this csv file "flagged_dates.csv" which contains string values. Some of them are dates and the others have the value zero. I want to get rid of the zeroes, but I am struggling to find a solution. I thought of using something like str.rstrip, but I need column names, which I don't have. Can you propose anything? Thank you in advance :)
Here's an example of the dataframe:
flagged_dates = pd.read_csv('/content/drive/MyDrive/shared/data/flag_raster.csv')
print(flagged_dates.iloc[:10, :10].to_csv(index=False)) #The entire dataframe contains 100 rows and columns
Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0,2019-10-10 21:59:17.074007,2015-10-13 00:55:55.544607,2017-05-24 06:00:15.959202,0,2016-12-07 09:01:04.729686,0,0,2019-05-29 11:16:44.130063
1,0,0,0,2019-02-21 07:15:12.114444,2017-04-29 17:44:49.584567,2017-06-28 11:26:30.686198,2019-03-25 10:18:28.397219,2019-05-01 01:27:21.282324,0
2,0,0,2016-09-22 04:08:48.025359,0,0,2016-09-24 17:35:26.833975,0,0,0
3,0,0,0,2015-07-12 21:13:44.182608,0,0,2017-10-03 22:17:52.257038,2019-01-04 08:20:07.684796,0
4,0,0,0,0,0,2016-03-04 10:12:21.341795,0,0,0
5,2016-08-23 09:22:21.965876,2018-05-01 09:12:19.511673,2017-12-12 07:00:04.313859,0,0,2016-10-23 15:30:11.193316,2016-03-01 01:22:12.548658,2015-06-14 21:36:41.142947,2018-08-19 23:37:49.534862
6,0,0,2019-01-29 16:23:27.070208,0,0,0,2016-08-08 01:13:21.147689,0,0
7,0,0,2017-12-04 22:51:46.265644,0,0,2016-05-13 05:11:55.885217,0,0,0
8,0,0,2018-03-16 03:45:21.555053,0,0,0,0,2019-12-04 04:04:20.600046,0
9,0,0,0,0,2018-01-10 08:28:51.902587,0,0,0,2015-05-05 06:25:43.124125
If you are only interested in the dates, you could do the following to get a list of the dates, excluding the 0's:
import pandas as pd
df1 = pd.read_csv('myCsv.csv', header=None)  # header=None: the file has no header row, so don't consume the first row as column names
my_list = df1.values.flatten()
my_list = my_list[my_list!='0']
print(my_list) # my_list is a numpy.ndarray
Input myCsv.csv
0,2019-10-10 21:59:17.074007,2015-10-13 00:55:55.544607,2017-05-24 06:00:15.959202,0,2016-12-07 09:01:04.729686,0,0,2019-05-29 11:16:44.130063
0,0,0,2019-02-21 07:15:12.114444,2017-04-29 17:44:49.584567,2017-06-28 11:26:30.686198,2019-03-25 10:18:28.397219,2019-05-01 01:27:21.282324,0
0,0,2016-09-22 04:08:48.025359,0,0,2016-09-24 17:35:26.833975,0,0,0
0,0,0,2015-07-12 21:13:44.182608,0,0,2017-10-03 22:17:52.257038,2019-01-04 08:20:07.684796,0
0,0,0,0,0,2016-03-04 10:12:21.341795,0,0,0
2016-08-23 09:22:21.965876,2018-05-01 09:12:19.511673,2017-12-12 07:00:04.313859,0,0,2016-10-23 15:30:11.193316,2016-03-01 01:22:12.548658,2015-06-14 21:36:41.142947,2018-08-19 23:37:49.534862
0,0,2019-01-29 16:23:27.070208,0,0,0,2016-08-08 01:13:21.147689,0,0
0,0,2017-12-04 22:51:46.265644,0,0,2016-05-13 05:11:55.885217,0,0,0
0,0,2018-03-16 03:45:21.555053,0,0,0,0,2019-12-04 04:04:20.600046,0
0,0,0,0,2018-01-10 08:28:51.902587,0,0,0,2015-05-05 06:25:43.124125
Output
['2019-10-10 21:59:17.074007' '2015-10-13 00:55:55.544607'
 '2017-05-24 06:00:15.959202' '2016-12-07 09:01:04.729686'
 '2019-05-29 11:16:44.130063' '2019-02-21 07:15:12.114444'
 '2017-04-29 17:44:49.584567' '2017-06-28 11:26:30.686198'
 '2019-03-25 10:18:28.397219' '2019-05-01 01:27:21.282324'
 '2016-09-22 04:08:48.025359' '2016-09-24 17:35:26.833975'
 '2015-07-12 21:13:44.182608' '2017-10-03 22:17:52.257038'
 '2019-01-04 08:20:07.684796' '2016-03-04 10:12:21.341795'
 '2016-08-23 09:22:21.965876' '2018-05-01 09:12:19.511673'
 '2017-12-12 07:00:04.313859' '2016-10-23 15:30:11.193316'
 '2016-03-01 01:22:12.548658' '2015-06-14 21:36:41.142947'
 '2018-08-19 23:37:49.534862' '2019-01-29 16:23:27.070208'
 '2016-08-08 01:13:21.147689' '2017-12-04 22:51:46.265644'
 '2016-05-13 05:11:55.885217' '2018-03-16 03:45:21.555053'
 '2019-12-04 04:04:20.600046' '2018-01-10 08:28:51.902587'
 '2015-05-05 06:25:43.124125']
In case you don't have column names, you can simply rename your columns (they will be renamed in the order you state in the list).
For a 4 column dataframe:
df.columns = ['col1', 'col2', 'col3', 'col4']
I bet at some point you will have to deal with them, so it's good practice to start the data wrangling with your column-name issue solved.
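If you'd rather keep the grid shape instead of flattening it to a list, a sketch of an alternative (same assumed CSV, no header row) that just blanks out the zeros:

import numpy as np
import pandas as pd

# dtype=str keeps every cell a string; header=None because the file has no header row
df1 = pd.read_csv('myCsv.csv', header=None, dtype=str)

# Replace the string '0' with NaN, keeping the DataFrame's shape intact
dates_only = df1.replace('0', np.nan)
print(dates_only.head())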

convert pandas series (with strings) to python list

It's probably a silly thing, but I can't seem to correctly convert a pandas Series, originally read from an Excel sheet, to a list.
dfCI is created by importing data from an excel sheet and looks like this:
tab var val
MsrData sortfield DetailID
MsrData strow 4
MsrData inputneeded "MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided","BiMonthlyTest"
# get list of cols for which input is needed
cols = dfCI[((dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'))]['val'].values.tolist()
print(cols)
>> ['"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"']
# replace null text with text
invalid = 'Input Needed'
for col in cols:
    dfMSR[col] = np.where((dfMSR[col].isnull()), invalid, dfMSR[col])
However, the extra set of (single) quotes added when I converted cols from a Series to a list makes all the columns a single value, so that
col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
The desired output for cols is
cols = ["MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"]
What am I doing wrong?
Once you've got col, you can convert it to your expected output:
In [1109]: col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
In [1114]: cols = [i.strip() for i in col.replace('"', '').split(',')]
In [1115]: cols
Out[1115]: ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Another possible solution that comes to mind given the structure of cols is:
list(eval(cols[0])) # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Although this is valid, it's less safe and I would go with list-comprehension as #MayankPorwal suggested.
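If eval is a concern, ast.literal_eval is a safer middle ground, since it only parses Python literals; a small sketch on the same string:

import ast

col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
cols = list(ast.literal_eval(col))  # the comma-separated literals parse as a tuple
print(cols)  # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']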

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: Replace "&FolderCTID", delete all string after
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched Google and found similar solutions, but none of them worked.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows in columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
first, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reapply the output to your original df.
cols_to_slice = ['ColumnA','ColumnB','ColumnC','ColumnD']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1,expand=True)[0].unstack(1)  # [0] keeps the part before string1
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1,expand=True)[0].unstack(1)
You should first get the index of the string using str.find:
indexes = len(string1) + df_MasterData[i].str.find(string1)
# this keeps string1 itself at the end of each result;
# if you don't want string1 in the result, drop the offset:
indexes = df_MasterData[i].str.find(string1)
Now slice each value up to its index (.str[:n] takes a single integer, not a Series, so slice row by row):
df_MasterData[i] = [s[:k] for s, k in zip(df_MasterData[i], indexes)]
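Another option worth sketching (assuming the same df_MasterData and column names): since the task is "cut at a fixed marker", a single element-wise regex replace can handle all four columns without a loop:

import re

string1 = "&FolderCTID"
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']

# Drop the marker and everything after it in every cell of the four columns
df_MasterData[cols] = df_MasterData[cols].replace(
    to_replace=re.escape(string1) + '.*', value='', regex=True
)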

How to compare two str values dataframe python pandas

I am trying to compare two different values in a dataframe, but I wasn't able to make use of the questions/answers I've found.
import pandas as pd
# from datetime import timedelta
"""
read csv file
clean date column
convert date str to datetime
sort for equity options
replace date str column with datetime column
"""
trade_reader = pd.read_csv('TastyTrades.csv')
trade_reader['Date'] = trade_reader['Date'].replace({'T': ' ', '-0500': ''}, regex=True)
date_converter = pd.to_datetime(trade_reader['Date'], format="%Y-%m-%d %H:%M:%S")
options_frame = trade_reader.loc[(trade_reader['Instrument Type'] == 'Equity Option')]
clean_frame = options_frame.replace(to_replace=['Date'], value='date_converter')
# Separate opening transaction from closing transactions, combine frames
opens = clean_frame[clean_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_OPEN'])]
closes = clean_frame[clean_frame['Action'].isin(['BUY_TO_CLOSE', 'SELL_TO_CLOSE'])]
open_close_set = set(opens['Symbol']) & set(closes['Symbol'])
open_close_frame = clean_frame[clean_frame['Symbol'].isin(open_close_set)]
'''
convert Value to float
sort for trade readability
write
'''
ocf_float = open_close_frame['Value'].astype(float)
ocf_sorted = open_close_frame.sort_values(by=['Date', 'Call or Put'], ascending=True)
# for readability, revert back to ocf_sorted below
ocf_list = ocf_sorted.drop(
['Type', 'Instrument Type', 'Description', 'Quantity', 'Average Price', 'Commissions', 'Fees', 'Multiplier'], axis=1
)
ocf_list.reset_index(drop=True, inplace=True)
ocf_list['Strategy'] = ''
# ocf_list.to_csv('Sorted.csv')
# create strategy list
debit_single = []
debit_vertical = []
debit_calendar = []
credit_vertical = []
iron_condor = []
# shift columns
ocf_list['Symbol Shift'] = ocf_list['Underlying Symbol'].shift(1)
ocf_list['Symbol Check'] = ocf_list['Underlying Symbol'] == ocf_list['Symbol Shift']
# compare symbols, append depending on criteria met
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
print(type(ocf_list['Underlying Symbol']))
ocf_list.to_csv('Sorted.csv')
print(debit_vertical)
# delta = timedelta(seconds=10)
The error I get is:
line 51, in <module>
if row['Symbol Check'][-1] is row['Underlying Symbol'][-1]:
TypeError: string indices must be integers
I am trying to compare the newly created shifted column to the original and, if they are the same, append the row to a list. Is there a way to compare two string values at all in Python? I've tried checking whether Symbol Check is True, and it still returns an error about string indices having to be integers. .iterrows() didn't work either.
Here, you will actually iterate through the columns of your DataFrame, not the rows:
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
You can use one of the methods iterrows or itertuples to iterate through the rows: iterrows yields each row as an (index, Series) pair, which you can still index by column name, while itertuples yields namedtuples, which are accessed by attribute instead.
Second, you should use == instead of is since you are probably comparing values, not identities.
Lastly, I would skip iterating over the rows entirely, as pandas is made for selecting rows based on a condition. You should be able to replace the aforementioned code with this:
debit_vertical = ocf_list[ocf_list['Symbol Shift'] == ocf_list['Underlying Symbol']].values.tolist()
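As a quick illustration of that vectorized pattern (toy, hypothetical symbols rather than the asker's CSV):

import pandas as pd

ocf_list = pd.DataFrame({'Underlying Symbol': ['AAPL', 'AAPL', 'MSFT', 'TSLA', 'TSLA']})
ocf_list['Symbol Shift'] = ocf_list['Underlying Symbol'].shift(1)

# Keep rows whose symbol matches the previous row's symbol
debit_vertical = ocf_list[ocf_list['Symbol Shift'] == ocf_list['Underlying Symbol']].values.tolist()
print(debit_vertical)  # [['AAPL', 'AAPL'], ['TSLA', 'TSLA']]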
