Used code and file: https://github.com/CaioEuzebio/Python-DataScience-MachineLearning/tree/master/SalesLogistics
I am working on an analysis using pandas. Basically, I need to group the orders by the number of products they contain and by which products those are.
Example: I have order 1 and order 2, and both contain product A and product B. Using the product list and the product quantity as a key, I want to create a pivot that indexes this combination of products and returns the orders that contain the same products.
The general objective of the analysis is to obtain a dataframe as follows:
dfFinal
listProds            Ordens  NumProds
[prod1,prod2,prod3]  1       3
                     2
                     3
[prod1,prod3,prod5]  7       3
                     15
                     25
[prod5]              8       1
                     3
So far the code looks like this.
Setting the 'Ordem' column as the index so that the first pivot can be made.
df1.index=df1['Ordem']
df3 = df1.assign(col=df1.groupby(level=0).Produto.cumcount()).pivot(columns='col', values='Produto')
With this pivot I get the dataframe below.
df3 =
col 0 1 2 3 4 5 6 7 8 9 ... 54 55 56 57 58 59 60 61 62 63
Ordem
10911KD YIZ12FF-A YIZ12FF-A YIIE2FF-A YIR72FF-A YIR72FF-A YIR72FF-A NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
124636 HYY32ZY-A HYY32ZY-A NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1719KD5 YI742FF-A YI742FF-A YI742FF-A YI742FF-A NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22215KD YI762FF-A YI762FF-A YI762FF-A YI762FF-A YI762FF-A YI762FF-A YI6E2FF-A YI6E2FF-A YI6E2FF-A NaN ... NaN NaN NaN NaN NaN
When I finish running the code, NaN values appear, and I need to remove them from the rows so that they don't influence the analysis I'm doing.
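One possible way to avoid the NaN padding altogether is to skip the wide pivot and collapse each order into a key built from its distinct products, then group the orders by that key. The sketch below assumes df1 has the 'Ordem' and 'Produto' columns used above; the sample rows are made up, and treating repeated products within an order as a single product (and product codes containing no commas) are assumptions, not something the question states.

```python
import pandas as pd

# Hypothetical sample data with the same column names as df1 above.
df1 = pd.DataFrame({
    "Ordem":   ["1", "1", "2", "2", "2", "7", "8"],
    "Produto": ["A", "B", "B", "A", "A", "C", "C"],
})

# One row per order: a comma-joined key of its distinct products, sorted so
# that the product order inside an order does not matter.
# Assumption: product codes contain no commas.
prods_per_order = (
    df1.groupby("Ordem")["Produto"]
       .agg(lambda s: ",".join(sorted(s.unique())))
       .rename("key")
       .reset_index()
)

# Collect the orders that share exactly the same product key.
dfFinal = (
    prods_per_order.groupby("key")["Ordem"]
                   .apply(list)
                   .rename("Ordens")
                   .reset_index()
)
dfFinal["listProds"] = dfFinal["key"].str.split(",")
dfFinal["NumProds"] = dfFinal["listProds"].str.len()
dfFinal = dfFinal[["listProds", "Ordens", "NumProds"]]
print(dfFinal)
```

Because no wide table is built, there are no NaN cells to clean up afterwards. If the quantity of each product should also be part of the key, the lambda would need to keep duplicates instead of calling unique().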
I am using Python in Colab and I am trying to find a way to move cells from a subset of a data frame into a different column of the same data frame, or to sort the cells of the data frame into the correct columns.
The original column in the CSV looked like this:
and using
Users[['Motorbike', 'Car', 'Bus', 'Train', 'Tram', 'Taxi']] = Users['What distance did you travel in the last month by:'].str.split(',', expand=True)
I was able to split the column into 6 new series to give this
However, I would now like all the cells containing 'Motorbike' to be in the Motorbike column, all the cells with 'Car' in the Car column and so on, without overwriting any other cells. If this cannot be done, I would like to assign any occurrences of Motorbike, Car etc. to the new columns 'Motorbike1', 'Car1' etc. that I have added to the dataframe as shown below. Can anyone help please?
(image: new columns)
I have tried copying the cells from the original columns to the new columns and then removing the values that do not contain, say, 'Car'. However, when I repeat this for the next original column into the same new column, the earlier values are overwritten.
There are no repeats of any mode of transport in any row, i.e. each mode of transport occurs at most once per row.
You can use a regex to extract the xxx (yyy)(yyy) parts, then reshape:
out = (df['col_name']
.str.extractall(r'([^,]+) (\([^,]*\))')
.set_index(0, append=True)[1]
.droplevel('match')
.unstack(0)
)
output:
Bus Car Motorbike Taxi Train Tram
0 NaN NaN NaN (km)(20) NaN NaN
1 NaN (km)(500) (km)(500) NaN NaN NaN
2 NaN (km)(1000) NaN NaN NaN NaN
3 NaN (km)(100) NaN NaN (km)(20) NaN
4 (km)(150) NaN NaN (km)(25) (km)(700) NaN
5 (km)(40) (km)(0) (km)(0) NaN NaN NaN
6 NaN (km)(300) NaN (km)(100) NaN NaN
7 NaN (km)(300) NaN NaN NaN NaN
8 NaN NaN NaN NaN (km)(80) (km)(300)
9 (km)(50) (km)(700) NaN NaN NaN (km)(50)
If you only need the numbers, you can change the regex:
(df['col_name'].str.extractall(r'([^,]+)\s+\(km\)\((\d+)\)')
.set_index(0, append=True)[1]
.droplevel('match')
.unstack(0).rename_axis(columns=None)
)
Output:
Bus Car Motorbike Taxi Train Tram
0 NaN NaN NaN 20 NaN NaN
1 NaN 500 500 NaN NaN NaN
2 NaN 1000 NaN NaN NaN NaN
3 NaN 100 NaN NaN 20 NaN
4 150 NaN NaN 25 700 NaN
5 40 0 0 NaN NaN NaN
6 NaN 300 NaN 100 NaN NaN
7 NaN 300 NaN NaN NaN NaN
8 NaN NaN NaN NaN 80 300
9 50 700 NaN NaN NaN 50
Use list comprehension with split for dictionaries, then pass to DataFrame constructor:
L = [dict([y.split() for y in x.split(',')])
for x in df['What distance did you travel in the last month by:']]
df = pd.DataFrame(L)
print (df)
Taxi Motorbike Car Train Bus Tram
0 (km)(20) NaN NaN NaN NaN NaN
1 NaN (km)(500) (km)(500) NaN NaN NaN
2 NaN NaN (km)(1000) NaN NaN NaN
3 NaN NaN (km)(100) (km)(20) NaN NaN
4 (km)(25) NaN NaN (km)(700) (km)(150) NaN
5 NaN (km)(0) (km)(0) NaN (km)(40) NaN
6 (km)(100) NaN (km)(300) NaN NaN NaN
7 NaN NaN (km)(300) NaN NaN NaN
8 NaN NaN NaN (km)(80) NaN (km)(300)
9 NaN NaN (km)(700) NaN (km)(50) (km)(50)
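For readers who want to try the split-and-dict approach end to end, here is a minimal self-contained sketch. The three sample rows are invented, and it assumes entries are separated by a bare comma (no space) and that each entry has the form "Mode (km)(distance)". The final step, which keeps only the number, is an optional addition.

```python
import pandas as pd

col = "What distance did you travel in the last month by:"

# Hypothetical sample rows in the "Mode (km)(distance)" format shown above.
Users = pd.DataFrame({col: [
    "Taxi (km)(20)",
    "Motorbike (km)(500),Car (km)(500)",
    "Bus (km)(150),Taxi (km)(25),Train (km)(700)",
]})

# Each row becomes a dict such as {"Taxi": "(km)(20)"}; the DataFrame
# constructor then aligns the dict keys into columns and fills gaps with NaN.
records = [dict(item.split() for item in row.split(",")) for row in Users[col]]
wide = pd.DataFrame(records, index=Users.index)

# Optional: keep only the distance as a number (NaN where a mode is absent).
km = wide.apply(
    lambda s: pd.to_numeric(s.str.extract(r"\((\d+)\)$", expand=False))
)
print(km)
```

Since every value lands in the column named after its own key, nothing is overwritten regardless of the order in which the modes appear in a row.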
Given a dataframe with row and column multiindex, how would you copy a row index "object" and manipulate a specific index value on a chosen level? Ultimately I would like to add a new row to the dataframe with this manipulated index.
Taking this dataframe df as an example:
col_index = pd.MultiIndex.from_product([['A','B'], [1,2,3,4]], names=['cInd1', 'cInd2'])
row_index = pd.MultiIndex.from_arrays([['2010','2011','2009'],['a','r','t'],[45,34,35]], names=["rInd1", "rInd2", 'rInd3'])
df = pd.DataFrame(data=None, index=row_index, columns=col_index)
df
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
I would like to take the index of the first row, manipulate the "rInd2" value and use this index to insert another row.
Pseudo code would be something like this:
#Get Index
idx = df.index[0]
#Manipulate Value
idx[1] = "L" #or idx["rInd2"]
#Make new row with new index
df.loc[idx, slice(None)] = None
The desired output would look like this:
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
2010 L 45 NaN NaN NaN NaN NaN NaN NaN NaN
What would be the most efficient way to achieve this?
Is there a way to do the same procedure with column index?
Thanks
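A minimal sketch of one way to do this, not necessarily the most efficient: entries of a MultiIndex are immutable tuples, so the pseudo code's in-place assignment cannot work as written. Instead, the tuple is copied into a list, edited, and appended to the index before reindexing. The names key, new_index and df_new are introduced here for illustration only.

```python
import pandas as pd

col_index = pd.MultiIndex.from_product(
    [["A", "B"], [1, 2, 3, 4]], names=["cInd1", "cInd2"]
)
row_index = pd.MultiIndex.from_arrays(
    [["2010", "2011", "2009"], ["a", "r", "t"], [45, 34, 35]],
    names=["rInd1", "rInd2", "rInd3"],
)
df = pd.DataFrame(data=None, index=row_index, columns=col_index)

# Copy the first row's index tuple into a list so that it can be edited.
key = list(df.index[0])
# Look up the position of the level by name, then change its value.
key[df.index.names.index("rInd2")] = "L"

# Append the new key to the existing row index and reindex;
# the newly created row is filled with NaN.
new_index = df.index.append(
    pd.MultiIndex.from_tuples([tuple(key)], names=df.index.names)
)
df_new = df.reindex(new_index)
print(df_new)
```

The same pattern should carry over to the column index by building the new key from df.columns and calling df.reindex(columns=...) instead.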
So I have a dataframe with NaN values, and I transform each row of that dataframe into a list, which is then added to another list.
Index 1 2 3 4 5 6 7 8 9 10 ... 71 72 73 74 75 76 77 78 79 80
orderid
20000765 624380 nan nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
20000766 624380 nan nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
20000768 1305984 1305985 1305983 1306021 nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
records = []
for i in range(0, 60550):
    records.append([str(dfpivot.values[i,j]) for j in range(0, 10)])
However, a lot of rows contain NaN values which I want to delete from the list, before I put it in the list of lists. Where do I need to insert that code and how do I do this?
I thought that this code would do the trick, but I guess it only looks at the direct elements of the 'list of lists' (the inner lists), not at the values inside them:
records = [x for x in records if str(x) != 'nan']
I'm new to Python, so I'm still figuring out the basics.
One way is to take advantage of the fact that stack removes NaNs to generate the nested list:
df.stack().groupby(level=0).apply(list).values.tolist()
# [[624380.0], [624380.0], [1305984.0, 1305985.0, 1305983.0, 1306021.0]]
If you want to keep rows that contain NaNs and only drop the columns that are entirely NaN, you can do it like this:
In [5457]: df.T.dropna(how='all').T
Out[5457]:
Index 1 2 3 4
0 20000765.000 624380.000 nan nan nan
1 20000766.000 624380.000 nan nan nan
2 20000768.000 1305984.000 1305985.000 1305983.000 1306021.000
If you don't want any columns that contain NaNs, you can drop them like this:
In [5458]: df.T.dropna().T
Out[5458]:
Index 1
0 20000765.000 624380.000
1 20000766.000 624380.000
2 20000768.000 1305984.000
To create the array:
In [5464]: df.T.apply(lambda x: x.dropna().tolist()).tolist()
Out[5464]:
[[20000765.0, 624380.0],
[20000766.0, 624380.0],
[20000768.0, 1305984.0, 1305985.0, 1305983.0, 1306021.0]]
or
df.T[1:].apply(lambda x: x.dropna().tolist()).tolist()
Out[5471]: [[624380.0], [624380.0], [1305984.0, 1305985.0, 1305983.0, 1306021.0]]
depending on how you want the array
One way to do this would be with a nested list comprehension:
[[j for j in i if not pd.isna(j)] for i in dfpivot.values]
EDIT
it looks like you want strings - in which case,
[[str(j) for j in i if not pd.isna(j)] for i in dfpivot.values]
I am trying to interact with a spreadsheet and I have imported it using:
InitialImportedData = pd.read_excel(WorkbookLocation, SheetName)
The problem is that the spreadsheet I am importing from contains multiple tables, and I only want to use one of them. Is there a way to remove all the rows and columns before a specific value?
The table I am looking for has a header Premium. How do I get the table I want as a dataframe, rather than all of them with loads of NaNs scattered in my frame?
Is there a way to locate a string in a dataframe and slice it based on that? It is the only one labelled Premium.
edit
I was able to find the location of the start of my table using:
I solved this in a different way, perhaps useful for people who want to slice up dataframes that they didn't read in through excel.
for x in range(InitialImportedData.shape[1]):
    column = list(InitialImportedData.iloc[:, x])
    try:
        # print the (row, column) position of the cell holding 'Premium'
        print(column.index('Premium'), x)
    except ValueError:
        # 'Premium' is not in this column
        pass
By converting to a list I was able to look where the value sat. I have not worked out how to slice my data correctly at the end.
I can use:
InitialImportedData.iloc[20:,4:]
to create a dataset which starts in the corner I need (it happens to be at 20,4), but I have not found a way to slice the end of the table so that it doesn't bring in extra information from the worksheet.
I have included an example dataset below:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 \
0 NaN Table 1 NaN NaN NaN
1 NaN Header1 Header2 NaN NaN
2 NaN 9.88496 2.29552 NaN NaN
3 NaN 7.36861 2.6275 NaN NaN
4 NaN 5.34938 8.37391 NaN NaN
5 NaN 8.77608 3.70626 NaN NaN
6 NaN 7.37828 2.62692 NaN NaN
7 NaN 6.82297 9.59347 NaN NaN
8 NaN 7.6804 7.38528 NaN NaN
9 NaN 2.07633 3.76247 NaN NaN
10 NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN
19 NaN NaN Premium NaN NaN
20 NaN NaN FinalHeader1 FinalHeader2 FinalHeader3
21 NaN NaN 0.679507 8.95 5.87512
22 NaN NaN 6.22637 6.54385 4.70131
23 NaN NaN 8.84881 6.74557 3.31503
24 NaN NaN 0.506901 5.36873 2.42905
25 NaN NaN 3.91448 0.542635 8.0885
26 NaN NaN 5.4045 9.08379 2.35789
27 NaN NaN 4.26343 1.37477 0.719881
28 NaN NaN 3.03682 9.62835 1.56601
Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN Table 2 NaN NaN NaN
7 NaN NewHeader1 NewHeader2 NewHeader3 NewHeader4
8 NaN 1.2035 2.13923 9.59979 4.90745
9 NaN 0.273928 9.84469 3.62225 1.07671
10 NaN 3.67524 9.82434 0.366233 7.9009
11 NaN 2.16405 2.66321 9.08495 8.29695
12 NaN 6.77611 7.90381 5.13672 3.26688
13 NaN 1.95482 1.95997 3.40453 0.702198
14 NaN 6.39919 5.24728 4.16757 6.06336
15 NaN 2.34901 9.35103 2.72374 7.39052
16 NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN
20 NaN NaN NaN NaN NaN
21 NaN NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN
24 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN
27 NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN
That is completely possible. Below is some of my own code that I have done this with. Combo1x is taking the heading "Name" in the sheet "Reference". Hope this helps!
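# sheet_name=None loads every sheet into a dict keyed by sheet name
# (older pandas versions spelled this parameter sheetname)
filelog = pd.read_excel(desktop, sheet_name=None, na_filter=False)
combo1 = Combobox(frame3, state='readonly')
combo1x = list(filelog['Reference']['Name'])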
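EDIT: One way you could get all the numbers for just "Premium" would be to take the max row and work backwards with a while statement.
ash = logbook["Approvals"]
rows = ash.max_row
mylist = []
# Assumption: logbook is an openpyxl workbook, the table sits in column 1
# and its header cell reads "FinalHeader1"; adjust both to your sheet.
while rows >= 1 and ash.cell(row=rows, column=1).value != "FinalHeader1":
    mylist.append(ash.cell(row=rows, column=1).value)
    rows -= 1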
I ended up solving my problem by writing a function as follows:
# This function searches for a table within a dataframe and cuts out the section that starts at the specified header.
# The header must be in the top left of the table, and there must be nothing below or to the right of the table.
def CutOutTable(WhereWeAreSearching, WhatWeAreSearchingFor):
    for x in range(WhereWeAreSearching.shape[1]):
        try:
            # (row, column) position of the header; list.index raises ValueError if it is absent
            WhereToCut = list(WhereWeAreSearching.iloc[:, x]).index(WhatWeAreSearchingFor), x
            SlicedVersionOfWhereWeAreSearching = WhereWeAreSearching.iloc[WhereToCut[0]:, WhereToCut[1]:]
            return SlicedVersionOfWhereWeAreSearching.dropna(axis=1, how='all')
        except ValueError:
            pass
It looks for the position in the dataframe that contains the phrase you are looking for, cuts away everything above and to the left of it, and then removes the all-NaN columns to its right, leaving the whole table. This only works if the table is the bottom-rightmost item in the Excel worksheet.
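The bottom-right restriction could be relaxed by also trimming the lower and right edges of the block. The following is only a sketch under two assumptions: the first column of the table has no gaps, and at least one empty cell or column separates the table from anything below or to the right of it. The function name cut_out_table and the sample sheet are hypothetical.

```python
import numpy as np
import pandas as pd

def cut_out_table(frame, header):
    """Return the block that starts at the first cell equal to `header`
    and stops at the first gap in its first column and at the first
    fully empty column to its right."""
    hits = np.argwhere((frame == header).to_numpy())
    if len(hits) == 0:
        raise ValueError(f"{header!r} not found")
    top, left = hits[0]
    block = frame.iloc[top:, left:]

    # Rows: stop at the first NaN in the table's first column.
    gap = block.iloc[:, 0].isna().to_numpy()
    n_rows = gap.argmax() if gap.any() else len(block)
    block = block.iloc[:n_rows]

    # Columns: stop at the first column that is empty within those rows.
    empty = block.isna().all(axis=0).to_numpy()
    n_cols = empty.argmax() if empty.any() else block.shape[1]
    return block.iloc[:, :n_cols]

# Hypothetical sheet: one table top-left, "Premium" in the middle, stray text below.
data = np.full((8, 6), np.nan, dtype=object)
data[0, 0] = "Table 1"
data[1, 0:2] = ["Header1", "Header2"]
data[2, 0:2] = [1.0, 2.0]
data[1, 3] = "Premium"
data[2, 3:5] = ["FinalHeader1", "FinalHeader2"]
data[3, 3:5] = [5.0, 6.0]
data[4, 3:5] = [7.0, 8.0]
data[6, 3] = "Other"
sheet = pd.DataFrame(data)

table = cut_out_table(sheet, "Premium")
print(table)
```

The returned block still contains the title row and the header row; promoting the header row to column names is left to the caller.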
I have got the following dataframe, in which each column contains a set of values, and each index is only used once. However, I would like to get a completely filled dataframe. In order to do that I need to select, from each column, an X amount of values, in which X is the length of the column with the least non-nan values (in this case column '1.0').
>>> stat_df_iws
iws_w -2.0 -1.0 0.0 1.0
0 0.363567 NaN NaN NaN
1 0.183698 NaN NaN NaN
2 NaN -0.337931 NaN NaN
3 -0.231770 NaN NaN NaN
4 NaN 0.544836 NaN NaN
5 NaN -0.377620 NaN NaN
6 NaN NaN -0.428396 NaN
7 NaN NaN -0.443317 NaN
8 NaN -0.268033 NaN NaN
9 NaN 0.246714 NaN NaN
10 NaN NaN -0.503887 NaN
11 NaN NaN NaN -0.298935
12 NaN -0.252775 NaN NaN
13 NaN -0.447757 NaN NaN
14 -0.650598 NaN NaN NaN
15 -0.660542 NaN NaN NaN
16 NaN -0.952041 NaN NaN
17 -0.667356 NaN NaN NaN
18 -0.920873 NaN NaN NaN
19 NaN -0.537657 NaN NaN
20 NaN NaN -0.525121 NaN
21 NaN NaN NaN -0.619755
22 NaN -0.652138 NaN NaN
23 NaN -0.924181 NaN NaN
24 NaN -0.665720 NaN NaN
25 NaN NaN -0.336841 NaN
26 -0.428931 NaN NaN NaN
27 NaN -0.348248 NaN NaN
28 NaN 0.781024 NaN NaN
29 0.110727 NaN NaN NaN
... ... ... ... ...
I've achieved this with the following code, but it is not a very pythonic way of solving this.
def get_non_null_from_pivot(df):
    # length of the column with the fewest non-NaN values
    lngth = min(len(col.dropna()) for ind, col in df.items())
    df = pd.concat([df.loc[:, -2.0].dropna().head(lngth).reset_index(drop=True),
                    df.loc[:, -1.0].dropna().head(lngth).reset_index(drop=True),
                    df.loc[:, 0.0].dropna().head(lngth).reset_index(drop=True),
                    df.loc[:, 1.0].dropna().head(lngth).reset_index(drop=True)],
                   axis=1)
    return df
Is there a simpler way to achieve the same goal, so that I can more automatically repeat this step for other dataframes? Preferably without for-loops, for efficiency reasons.
I've made the function a little shorter by looping through the columns, and it seems to work perfectly.
def get_non_null_from_pivot_short(df):
    # length of the column with the fewest non-NaN values
    lngth = min(len(col.dropna()) for ind, col in df.items())
    df = pd.concat([df.loc[:, col].dropna().head(lngth).reset_index(drop=True) for col in df],
                   axis=1)
    return df
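The same result can probably be reached without an explicit concat, since DataFrame.count already gives the number of non-NaN values per column and DataFrame.apply can compact each column in turn. This is a sketch with made-up sample data; the helper name compact_columns is hypothetical.

```python
import numpy as np
import pandas as pd

def compact_columns(df):
    # Number of values to keep: size of the column with the fewest non-NaN values.
    n = int(df.count().min())
    # For every column, drop NaNs, keep the first n values and renumber the rows.
    return df.apply(lambda s: s.dropna().head(n).reset_index(drop=True))

# Hypothetical frame in which each row index is used by exactly one column.
stat_df = pd.DataFrame(np.nan, index=range(7), columns=[-2.0, -1.0, 0.0])
stat_df.loc[[0, 1, 3], -2.0] = [0.1, 0.2, 0.3]
stat_df.loc[[2, 4], -1.0] = [0.4, 0.5]
stat_df.loc[[5, 6], 0.0] = [0.6, 0.7]

out = compact_columns(stat_df)
print(out)
```

apply still iterates over the columns internally, so this is mainly shorter rather than faster; with only a handful of columns the cost of that loop should be negligible.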