How to equalize two dataframes? - Python

Hello, I have two DFs (rateQualityOut and subsetOut):
"rateQualityOut" is an empty DF that I created to store the temporary output "subsetOut". The idea is that all outputs (once the loop is finished) should be stored in that DF.
rateQualityOut[['pID', 'carry_dt','position', 'product_type' ,'positionLength']].loc[currLength:currLength+addLength,:]
pID carry_dt position product_type positionLength
0 NaN NaT NaN NaN NaN
1 NaN NaT NaN NaN NaN
2 NaN NaT NaN NaN NaN
3 NaN NaT NaN NaN NaN
4 NaN NaT NaN NaN NaN
5 NaN NaT NaN NaN NaN
and another DF, subsetOut, which has the temporary output:
subsetOut[['pID', 'carry_dt','position', 'product_type' ,'positionLength']]
pID carry_dt position product_type positionLength
2739 1 2018-11-01 CITI_52299G66_201210 Physical 5
2738 1 2018-11-02 CITI_52299G66_201210 Physical 5
2737 1 2018-11-05 CITI_52299G66_201210 Physical 5
2736 1 2018-11-06 CITI_52299G66_201210 Physical 5
2735 1 2018-11-07 CITI_52299G66_201210 Physical 5
I am looking to store the temporary output "subsetOut" into "rateQualityOut". What I have done in the past is simply this:
rateQualityOut.loc[currLength:currLength+addLength,:] = subsetOut
However, it is not working as planned: the output shows that the NaNs are not populated as expected.
pID carry_dt position product_type positionLength
0 NaN NaT NaN NaN NaN
1 NaN NaT NaN NaN NaN
2 NaN NaT NaN NaN NaN
3 NaN NaT NaN NaN NaN
4 NaN NaT NaN NaN NaN
5 NaN NaT NaN NaN NaN
Can I please have some suggestions? Thank you so much.

Typically it is easier and faster not to put subsetOut into rateQualityOut with each iteration. Instead, you could store the subsets in a list and concatenate them at the end:
import pandas as pd

rateQualityOut = []  # collect each subset in a plain list
for i in someIterator:
    # do something here that produces subsetOut
    rateQualityOut.append(subsetOut)
rateQualityOut = pd.concat(rateQualityOut)
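For reference, the original assignment leaves NaNs because .loc assignment with a DataFrame on the right-hand side aligns on index labels: subsetOut is indexed 2739-2735, which never matches the target labels 0-5, so every aligned value comes back missing. A minimal sketch of a direct fix, assuming the slice selects exactly as many rows as subsetOut has:
rateQualityOut.loc[currLength:currLength + addLength, :] = subsetOut.to_numpy()  # raw values, no index alignment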

Related

How to pull any cells from a table/dataframe into a column if they contain a specific string?

I am using Python in Colab and I am trying to find something that will allow me to move any cells from a subset of a data frame into a new/different column in the same data frame, OR sort the cells of the dataframe into the correct columns.
The original column in the CSV held comma-separated entries such as Motorbike (km)(500),Car (km)(500) (screenshot omitted), and using
Users[['Motorbike', 'Car', 'Bus', 'Train', 'Tram', 'Taxi']] = Users['What distance did you travel in the last month by:'].str.split(',', expand=True)
I was able to split the column into 6 new series (screenshot omitted).
However, now I would like all the cells with 'Motorbike' in the Motorbike column, all the cells with 'Car' in the Car column, and so on, without overwriting any other cells; OR, if this cannot be done, to just assign any occurrences of Motorbike, Car etc. into the new columns 'Motorbike1', 'Car1' etc. that I have added to the dataframe (screenshot of the new columns omitted). Can anyone help please?
I have tried copying the cells from the original columns into the new columns and then removing the values that are not, say, 'Car'. However, repeating this for the next original column into the same new column overwrites the previous values.
There are no repeats of any mode of transport in any row, i.e. there is at most one occurrence of each mode of transport per row.
You can use a regex to extract the xxx (yyy)(yyy) parts, then reshape:
out = (df['col_name']
       .str.extractall(r'([^,]+) (\([^,]*\))')  # group 0: the mode, group 1: the '(km)(value)' part
       .set_index(0, append=True)[1]            # index by mode, keep the value column
       .droplevel('match')
       .unstack(0)                              # pivot the modes into columns
)
Output:
Bus Car Motorbike Taxi Train Tram
0 NaN NaN NaN (km)(20) NaN NaN
1 NaN (km)(500) (km)(500) NaN NaN NaN
2 NaN (km)(1000) NaN NaN NaN NaN
3 NaN (km)(100) NaN NaN (km)(20) NaN
4 (km)(150) NaN NaN (km)(25) (km)(700) NaN
5 (km)(40) (km)(0) (km)(0) NaN NaN NaN
6 NaN (km)(300) NaN (km)(100) NaN NaN
7 NaN (km)(300) NaN NaN NaN NaN
8 NaN NaN NaN NaN (km)(80) (km)(300)
9 (km)(50) (km)(700) NaN NaN NaN (km)(50)
If you only need the numbers, you can change the regex:
(df['col_name'].str.extractall(r'([^,]+)\s+\(km\)\((\d+)\)')  # capture the mode and the bare number
   .set_index(0, append=True)[1]
   .droplevel('match')
   .unstack(0).rename_axis(columns=None)
)
Output:
Bus Car Motorbike Taxi Train Tram
0 NaN NaN NaN 20 NaN NaN
1 NaN 500 500 NaN NaN NaN
2 NaN 1000 NaN NaN NaN NaN
3 NaN 100 NaN NaN 20 NaN
4 150 NaN NaN 25 700 NaN
5 40 0 0 NaN NaN NaN
6 NaN 300 NaN 100 NaN NaN
7 NaN 300 NaN NaN NaN NaN
8 NaN NaN NaN NaN 80 300
9 50 700 NaN NaN NaN 50
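Note that the values captured by str.extractall are still strings; assuming the second expression above was assigned to out, a small follow-up cast gives numeric dtypes:
out = out.apply(pd.to_numeric)  # convert the extracted strings to numbers; NaN stays NaN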
Use a list comprehension with split to build dictionaries, then pass them to the DataFrame constructor:
L = [dict([y.split() for y in x.split(',')])  # e.g. 'Taxi (km)(20)' -> {'Taxi': '(km)(20)'}
     for x in df['What distance did you travel in the last month by:']]
df = pd.DataFrame(L)
print(df)
Taxi Motorbike Car Train Bus Tram
0 (km)(20) NaN NaN NaN NaN NaN
1 NaN (km)(500) (km)(500) NaN NaN NaN
2 NaN NaN (km)(1000) NaN NaN NaN
3 NaN NaN (km)(100) (km)(20) NaN NaN
4 (km)(25) NaN NaN (km)(700) (km)(150) NaN
5 NaN (km)(0) (km)(0) NaN (km)(40) NaN
6 (km)(100) NaN (km)(300) NaN NaN NaN
7 NaN NaN (km)(300) NaN NaN NaN
8 NaN NaN NaN (km)(80) NaN (km)(300)
9 NaN NaN (km)(700) NaN (km)(50) (km)(50)
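One caveat: the comprehension assumes every row of the column is a non-empty string. If the column can contain missing values, a guarded variant (a sketch) avoids an AttributeError on float NaN:
L = [dict(y.split() for y in x.split(',')) if isinstance(x, str) else {}
     for x in df['What distance did you travel in the last month by:']]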

Copying a column from one pandas dataframe to another

I have the following code where I try to copy the EXPIRATION column from the recent dataframe to the EXPIRATION column in the destination dataframe:
recent = pd.read_excel(r'Y:\Attachments' + '\\' + '962021.xlsx')
print('HERE\n',recent)
print('HERE2\n', recent['EXPIRATION'])
destination= pd.read_excel(r'Y:\Attachments' + '\\' + 'Book1.xlsx')
print('HERE3\n', destination)
destination['EXPIRATION']= recent['EXPIRATION']
print('HERE4\n', destination)
The problem is that destination has less rows than recent so some of the lower rows in the EXPIRATION column from recent do not end up in the destination dataframe. I want all the EXPIRATION values from recent to be in the destination dataframe, even if all the other values are NaN.
Example Output:
HERE
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 0 21 6/9/2021 B 08/06/2021 BNP FP E C 12 12 CONDORI 12 9:23:40 ETF NASDAQ
1 1 22 6/9/2021 B 16/06/2021 BNP FP E P 12 12 GOLD/SILVER 12 10:9:19 ETF NASDAQ
2 2 23 6/9/2021 B 16/06/2021 TEST P 12 12 CONDORI 21 10:32:12 EQT TEST
3 3 24 6/9/2021 B 22/06/2021 TEST P 12 12 GOLD/SILVER 12 10:35:5 EQT NASDAQ
4 4 0 6/9/2021 B 26/06/2021 TEST P 12 12 GOLD/SILVER 12 10:37:11 ETF FTSE100
HERE2
0 08/06/2021
1 16/06/2021
2 16/06/2021
3 22/06/2021
4 26/06/2021
Name: EXPIRATION, dtype: object
HERE3
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
HERE4
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 NaN NaN NaN NaN 08/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN 16/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 16/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Joining is generally the best approach, but I see that you have no id column apart from the native pandas index, and there are only NaNs in destination, so if you are sure that ordering is not a problem you can just use:
>>> destination = pd.concat([destination.drop(columns='EXPIRATION'), recent['EXPIRATION']], axis=1)[destination.columns]
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION ...
0 NaN NaN NaN NaN 08/06/2021 ...
1 NaN NaN NaN NaN 16/06/2021 ...
2 NaN NaN NaN NaN 16/06/2021 ...
3 NaN NaN NaN NaN 22/06/2021 ...
4 NaN NaN NaN NaN 26/06/2021 ...
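A sketch of an equivalent, more explicit approach, assuming the row ordering really does line up: pad destination to recent's length with reindex, then assign the column.
destination = destination.reindex(recent.index)   # add NaN rows so both frames have the same length
destination['EXPIRATION'] = recent['EXPIRATION']  # assignment aligns on the shared 0..4 index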

How to select a table based on its header from an Excel spreadsheet that contains multiple tables?

I am trying to interact with a spreadsheet and I have imported it using:
InitialImportedData = pd.read_excel(WorkbookLocation, SheetName)
The problem is that the spreadsheet I am importing from contains multiple tables, and I only want to use one of them. Is there a way to remove all the rows and columns before a specific value?
The table I am looking for has the header Premium. How do I get the table I want as a dataframe, rather than all of them with loads of NaNs scattered through my frame?
Is there a way to locate a string in a dataframe and slice based on it? It is the only cell labelled Premium.
edit: I was able to find the location of the start of my table using the loop below; perhaps useful for people who want to slice up dataframes that they didn't read in through Excel.
# scan every column for the cell containing 'Premium'
for x in range(InitialImportedData.shape[1]):
    try:
        row = list(InitialImportedData.iloc[:, x]).index('Premium')
        print(row, x)  # row and column position of the table's corner
    except ValueError:
        pass
By converting to a list I was able to see where the value sat. I have not worked out how to slice my data correctly at the end.
I can use:
InitialImportedData.iloc[20:,4:]
to create a dataset which starts in the corner I need (it happens to be at 20, 4), but I have not found a way to slice the end of the table so it doesn't bring in extra information from the worksheet.
I have included an example dataset below:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 \
0 NaN Table 1 NaN NaN NaN
1 NaN Header1 Header2 NaN NaN
2 NaN 9.88496 2.29552 NaN NaN
3 NaN 7.36861 2.6275 NaN NaN
4 NaN 5.34938 8.37391 NaN NaN
5 NaN 8.77608 3.70626 NaN NaN
6 NaN 7.37828 2.62692 NaN NaN
7 NaN 6.82297 9.59347 NaN NaN
8 NaN 7.6804 7.38528 NaN NaN
9 NaN 2.07633 3.76247 NaN NaN
10 NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN
19 NaN NaN Premium NaN NaN
20 NaN NaN FinalHeader1 FinalHeader2 FinalHeader3
21 NaN NaN 0.679507 8.95 5.87512
22 NaN NaN 6.22637 6.54385 4.70131
23 NaN NaN 8.84881 6.74557 3.31503
24 NaN NaN 0.506901 5.36873 2.42905
25 NaN NaN 3.91448 0.542635 8.0885
26 NaN NaN 5.4045 9.08379 2.35789
27 NaN NaN 4.26343 1.37477 0.719881
28 NaN NaN 3.03682 9.62835 1.56601
Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN Table 2 NaN NaN NaN
7 NaN NewHeader1 NewHeader2 NewHeader3 NewHeader4
8 NaN 1.2035 2.13923 9.59979 4.90745
9 NaN 0.273928 9.84469 3.62225 1.07671
10 NaN 3.67524 9.82434 0.366233 7.9009
11 NaN 2.16405 2.66321 9.08495 8.29695
12 NaN 6.77611 7.90381 5.13672 3.26688
13 NaN 1.95482 1.95997 3.40453 0.702198
14 NaN 6.39919 5.24728 4.16757 6.06336
15 NaN 2.34901 9.35103 2.72374 7.39052
16 NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN
20 NaN NaN NaN NaN NaN
21 NaN NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN
24 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN
27 NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN
That is completely possible. Below is some of my own code where I have done this; combo1x takes the heading "Name" from the sheet "Reference". Hope this helps!
filelog = pd.read_excel(desktop, sheet_name=None, na_filter=False)  # sheet_name=None reads every sheet into a dict
combo1 = Combobox(frame3, state='readonly')  # Combobox from tkinter.ttk
combo1x = list(filelog['Reference']['Name'])
EDIT: One way you could get all the numbers for just "Premium" would be to take the max row and work backwards with a while statement (a sketch, assuming an openpyxl workbook and that the values sit in column 3):
ash = logbook["Approvals"]
row = ash.max_row
mylist = []
while ash.cell(row=row, column=3).value != 'FinalHeader1':  # stop at the header row
    mylist.append(ash.cell(row=row, column=3).value)
    row -= 1
I ended up solving my problem by writing a function as follows:
# This function searches for a table within a dataframe and cuts out the section
# starting at the specified header; the header must be in the top left, and there
# must be nothing below or to the right of the table
def CutOutTable(WhereWeAreSearching, WhatWeAreSearchingFor):
    for x in range(WhereWeAreSearching.shape[1]):
        try:
            row = list(WhereWeAreSearching.iloc[:, x]).index(WhatWeAreSearchingFor)
            # slice away everything above and to the left of the header cell...
            SlicedVersion = WhereWeAreSearching.iloc[row:, x:]
            # ...then drop the all-NaN columns to the right of the table
            return SlicedVersion.dropna(axis=1, how='all')
        except ValueError:
            pass
It looks for the position in the dataframe that contains the phrase you are searching for, cuts away everything above and to the left of it, and then removes the all-NaN columns to its right, giving you the whole table, provided the table is the bottom-rightmost item in your Excel worksheet.
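A hedged usage sketch against the sample data above, where the row under 'Premium' holds the real column names:
premium = CutOutTable(InitialImportedData, 'Premium')
premium.columns = premium.iloc[1]                   # promote FinalHeader1..FinalHeader3 to headers
premium = premium.iloc[2:].reset_index(drop=True)   # keep only the data rows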

Python Pandas Datetime and dataframe indexing issue

I have a datetime issue where I am trying to match up a dataframe with dates as index values.
For example, I have dr, which is an array of numpy.datetime64:
dr = [numpy.datetime64('2014-10-31T00:00:00.000000000'),
numpy.datetime64('2014-11-30T00:00:00.000000000'),
numpy.datetime64('2014-12-31T00:00:00.000000000'),
numpy.datetime64('2015-01-31T00:00:00.000000000'),
numpy.datetime64('2015-02-28T00:00:00.000000000'),
numpy.datetime64('2015-03-31T00:00:00.000000000')]
Then I have a dataframe returndf with dates as index values:
print(returndf)
1 2 3 4 5 6 7 8 9 10
10/31/2014 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11/30/2014 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Please ignore the missing values
Whenever I try to match a date in dr against the returndf index, using the following code for just one month, returndf.loc[str(dr[1])],
I get an error
KeyError: 'the label [2014-11-30T00:00:00.000000000] is not in the [index]'
I would appreciate it if someone could help me convert numpy.datetime64('2014-10-31T00:00:00.000000000') into 10/31/2014 so that I can match it to the dataframe index values.
Thank you,
Your index for returndf is not a DatetimeIndex. Make it so:
returndf = returndf.set_index(pd.to_datetime(returndf.index))
Your dr is a list of NumPy datetime64 objects. That bothers me; convert it as well:
dr = pd.to_datetime(dr)
Your sample data clearly shows that the index of returndf does not include all the items in dr. In that case, use reindex:
returndf.reindex(dr)
1 2 3 4 5 6 7 8 9 10
2014-10-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-11-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-12-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-02-28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-03-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
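With both sides converted to real datetimes, the original label lookup also works, e.g.:
returndf.loc[dr[1]]  # selects the 2014-11-30 row without a KeyError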

Python Groupby omitting columns

I have a dataframe, dg, that looks like this:
thing1 thing2 thing3 thing4 thing5 thing6 thing7 ID
NAN 1 NAN NAN NAN NAN NAN 222
NAN NAN 3 NAN NAN NAN NAN 222
NAN NAN NAN 2 NAN NAN NAN 222
3 NAN NAN NAN NAN NAN 3 222
NAN NAN NAN NAN NAN NAN NAN 222
NAN NAN NAN NAN 4 NAN NAN 222
NAN NAN NAN NAN NAN 4 NAN 222
NAN 3 NAN 2 NAN NAN NAN 555
NAN NAN 3 NAN NAN NAN NAN 555
NAN NAN NAN NAN NAN NAN NAN 555
when I do a groupby like this:
dg = dg.groupby('ID').max().reset_index()
it produces the following output, omitting two columns, like this:
ID  thing2  thing3  thing4  thing5  thing7
222      1       3       2       4       3
555      3       3       2     NaN     NaN
The dataframe follows that pattern, but I don't know why two columns (thing1 and thing6) are being omitted.
The NAN values are np.nan.
I found out I had string "N/A" values in the midst of my np.nan values. The lesson is that strings mixed with numbers can make columns disappear when doing groupby aggregations: the columns without any "N/A" strings didn't disappear, and once I replaced the "N/A" strings with np.nan, no columns disappeared when I did the groupby.
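A minimal sketch of the fix with hypothetical data (a column that mixes strings and numbers either vanishes from the groupby result on older pandas or raises a TypeError on recent versions):
import numpy as np
import pandas as pd

dg = pd.DataFrame({'ID': [222, 222], 'thing1': [3, 'N/A'], 'thing2': [1, np.nan]})
dg = dg.replace('N/A', np.nan)               # remove the stray string
dg['thing1'] = pd.to_numeric(dg['thing1'])   # ensure a numeric dtype
print(dg.groupby('ID').max().reset_index())  # thing1 now survives the groupby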
