Sorting column label pandas pivot table - python

Another pandas sort question (tried all current SO ones and didn't find any solutions).
I have a pandas pivot_table like so:
rows = ['Tool_Location']
cols = ['shift_date','Part_Number', 'shift']
pt = df.pivot_table(index='Tool_Location', values='Number_of_parts', dropna=False, fill_value ='0', columns=cols, aggfunc='count')
produces:
Shift date 10/19
Part_number 40001
shift first second third
tool_loc
T01 0 1 0
T02 2 1 0
I'd like to switch the order of shift labels so it is third first second
EDIT:
Getting closer to a solution but not seeing it.
Using:
col_list = pt.columns.tolist()
output:
[('10/20/16', 'first'), ('10/20/16', 'second'), ('10/20/16', 'third'), ('10/21/16', 'first'), ('10/21/16', 'second'), ('10/21/16', 'third')]
Anyone know how to dynamically reorder the items so its:
[('10/20/16', 'third'), ('10/20/16', 'first'), ('10/20/16', 'second'), ('10/21/16', 'first'), ('10/21/16', 'second'), ('10/21/16', 'second')]
Because then we could reorder the columns by using pt = pt[col_list]

df.pivot_table produces a dataframe. What if you do something like this after your lines:
pt = pt[["third","first","second"]]

You can use lambda and dict to sort your tuples:
pt = pt[sorted(df.columns.tolist(), key=lambda tup: list(your_dict.keys())[list(your_dict.values()).index(tup[1])])]
To do this, you have also declare a dict with needed order:
your_dict = {1: 'first', 3: 'third', 2: 'second'}

Related

Get column names with corresponding index in python pandas

I have this dataframe df where
>>> df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
'Cost':[10000, 5000, 15000, 2000],
'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
'Age':['20', '10', '13', '17']})
I want to determine the column index with the corresponding name. I tried it with this:
>>> list(df.columns)
But the solution above only returns the column names without index numbers.
How can I code it so that it would return the column names and the corresponding index for that column? Like This:
0 Date
1 Event
2 Cost
3 Name
4 Age
Simpliest is add pd.Series constructor:
pd.Series(list(df.columns))
Or convert columns to Series and create default index:
df.columns.to_series().reset_index(drop=True)
Or:
df.columns.to_series(index=False)
You can use loop like this:
myList = list(df.columns)
index = 0
for value in myList:
print(index, value)
index += 1
A nice short way to get a dictionary:
d = dict(enumerate(df))
output: {0: 'Date', 1: 'Event', 2: 'Cost', 3: 'Name', 4: 'Age'}
For a Series, pd.Series(list(df)) is sufficient as iteration occurs directly on the column names
In addition to using enumerate, this also can get a numbers in order using zip, as follows:
import pandas as pd
df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
'Cost':[10000, 5000, 15000, 2000],
'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
'Age':['20', '10', '13', '17']})
result = list(zip([i for i in range(len(df.columns))], df.columns.values,))
for r in result:
print(r)
#(0, 'Date')
#(1, 'Event')
#(2, 'Cost')
#(3, 'Name')
#(4, 'Age')

pandas groupby filter nunique

So far an example of what I have is here:
df = pd.DataFrame({"barcode": [1,2,2,3,3,4, 4, 4], "date": ['today', 'today', 'tomorrow', 'tomorrow', 'tomorrow', 'yesterday', 'yesterday' ,'yesterday'], "info": [40,20,10,15,17,19, 21, 23]})
gb= df.groupby(['date'])
gb.filter(lambda x: x['barcode'].nunique!=1)
which returns:
Empty DataFrame
Columns: [barcode, date, info]
Index: []
Only "yesterday" should remain after I filter this because there are 2 distinct barcodes in the group "today", and 2 distinct barcodes in the group "tomorrow". What is going on here? and in the example the column to filter on is sorted but does it need to be?
I will recommend
gb= df.groupby(['date'])
df = df[gb['barcode'].transform('nunqiue').eq(1)]
nunique is a method, not a property. Fix:
gb.filter(lambda x: x['barcode'].nunique() ==1)

Finding conditioned consecutive values in a pandas DataFrame

I have a pandas dataframe with multiple rows and columns filled with types and values. All are strings. I want to write a function that conditions:
1) which type I search (column 1)
2) a first value (column 2)
3) a second, consecutive value (in the next row of column 2)
I manage to write a function that searches one value of one type as below, but how do I add the second type? I think it might be with help of df.shift(axis=0), but I do not know how to combine that command with a conditional search.
import pandas as pd
d = {'type': ['wordclass', 'wordclass', 'wordclass', 'wordclass', 'wordclass', 'wordclass',
'english', 'english', 'english', 'english', 'english', 'english'],
'values': ['dem', 'noun', 'cop', 'det', 'dem', 'noun', 'this', 'tree', 'is', 'a', 'good', 'tree']}
df = pd.DataFrame(data=d)
print(df)
tiername = 'wordclass'
v1 = 'dem'
v2 = 'noun'
def search_single_tier(tiername, v1):
searchoutput = df[df['type'].str.contains(tiername) & df['values'].str.match(v1)]
return searchoutput
x = search_single_tier(tiername, v1)
print(x)```
You don't need to create a function for doing this. Instead, try this:
In [422]: tiername = 'wordclass'
## This equates `type` columns to `tiername`.
## `.iloc[0:2]` gets the first 2 rows for the matched condition
In [423]: df[df.type.eq(tiername)].iloc[0:2]
Out[423]:
type values
0 wordclass dem
1 wordclass noun
After Op's comment:
Find all consecutive rows like this:
tiername = 'wordclass'
v1 = 'dem'
In [455]: ix_list = df[df.type.eq(tiername) & df['values'].eq(v1)].index.tolist()
In [464]: pd.concat([df.iloc[ix_list[0]: ix_list[0]+2], df.iloc[ix_list[1]: ix_list[1]+2]])
Out[464]:
type values
0 wordclass dem
1 wordclass noun
4 wordclass dem
5 wordclass noun

Adding a column to a panda dataframe with values from columns from another dataframe, depending on a key from a dictionary

I have the following two dataframes:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['01/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
(perhaps it's clearer in the screenshots here: https://imgur.com/a/YNrWpR2)
The df2 is much larger than shown here - it contains columns for 100 companies. So for example, for the 10th company, the column names are: ReturnOnAssets.10, etc.
I have created a dictionary which maps the company names to the column names:
stocks = {'Microsoft':'','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7'}
and so on.
Now, what I am trying to achieve is adding a column "ReturnOnAssets" from d2 to d1, but for a specific company and for a specific date. So looking at df1, the first tweet (i.e. "text") contains a keyword "Amazon" and it was posted on 04/28/2017. I now need to go to df2 to the relevant column name for Amazon (i.e. "ReturnOnAssets.2") and fetch the value for the specified date.
So what I expect looks like this:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon', **'10.5'**], ["blala Amazon", '04/28/2017', 'Amazon', 'x'], ["blabla Netflix', '06/28/2017', 'Netflix', 'x']], columns=['text', 'date', 'keyword', 'ReturnOnAssets'])
By x I mean values which where not included in the example df1 and df2.
I am fairly new to pandas and I can't wrap my head around it. I tried:
keyword = df1['keyword']
txt = 'ReturnOnAssets.'+ stocks[keyword]
df1['ReturnOnAssets'] = df2[txt]
But I don't know how to fetch the relevant date, and also this gives me an error: "Series' objects are mutable, thus they cannot be hashed", which probably comes from the fact that I cannot just add a whole column of keywords to the text string.
I don't know how to achieve the operation I need to do, so I would appreciate help.
It can probably be shortened and you can add if statements to deal with when there are missing values.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([["blala Amazon", '05/28/2017', 'Amazon'], ["blala Facebook", '04/28/2017', 'Facebook'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'dates', 'keyword'])
df1
df2 = pd.DataFrame([['06/28/2017', '3.4', '10.2'], ['05/28/2017', '3.7', '10.5'], ['04/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAsset.1', 'ReturnOnAsset.2'])
#creating myself a bigger df2 to cover all the way to netflix
for i in range (9):
df2[('ReturnOnAsset.' + str(i))]=np.random.randint(1, 1000, df1.shape[0])
stocks = {'Microsoft':'0','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7', 'Netflix': '8'}
#new col where to store values
df1['ReturnOnAsset']=np.nan
for index, row in df1.iterrows():
colname=('ReturnOnAsset.' + stocks[row['keyword']] )
df1['ReturnOnAsset'][index]=df2.loc[df2['dates'] ==row['dates'] , colname]
Next time please give us a correct test data, I modified your dates and dictionary for match the first and second column (netflix and amazon values).
This code will work if and only if all dates from df1 are in df2 (Note that in df1 the column name is date and in df2 the column name is dates)
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '02/30/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['04/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Microsoft':'','Apple' :'5', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Netflix':'1',
'JPMorgan' :'6', 'Alphabet': '7'}
df1["ReturnOnAssets"]= [ df2["ReturnOnAssets." + stocks[ df1[ "keyword" ][ index ] ] ][ df2.index[ df2["dates"] == df1["date"][index] ][0] ] for index in range(len(df1)) ]
df1

Change order of list of lists according to another list

I have a bunch of CSV-files where first line is the column name, and now I want to change the order according to another list.
Example:
[
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233'],
...
]
The above order differs slightly between the files, but the same column-names are always available.
So the I want the columns to be re-arranged as:
['index','date','name','position']
I can solve it by comparing the first row, making an index for each column, then re-map each row into a new list of lists using a for-loop.
And while it works, it feels so ugly even my blind old aunt would yell at me if she saw it.
Someone on IRC told me to look at on map() and operator but I'm just not experienced enough to puzzle those together. :/
Thanks.
Plain Python
You could use zip to transpose your data:
data = [
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233']
]
columns = list(zip(*data))
print(columns)
# [('date', '2003-02-04', '2003-02-04'), ('index', '23445', '23446'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
It becomes much easier to modify the columns order now.
To calculate the needed permutation, you can use:
old = data[0]
new = ['index','date','name','position']
mapping = {i:new.index(v) for i,v in enumerate(old)}
# {0: 1, 1: 0, 2: 2, 3: 3}
You can apply the permutation to the columns:
columns = [columns[mapping[i]] for i in range(len(columns))]
# [('index', '23445', '23446'), ('date', '2003-02-04', '2003-02-04'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
and transpose them back:
list(zip(*columns))
# [('index', 'date', 'name', 'position'), ('23445', '2003-02-04', 'Steiner, James', '98886'), ('23446', '2003-02-04', 'Holm, Derek', '2233')]
With Pandas
For this kind of tasks, you should use pandas.
It can parse CSVs, reorder columns, sort them and keep an index.
If you have already imported data, you could use these methods to import the columns, use the first row as header and set index column as index.
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0]).set_index('index')
df then becomes:
date name position
index
23445 2003-02-04 Steiner, James 98886
23446 2003-02-04 Holm, Derek 2233
You can avoid those steps by importing the CSV correctly with pandas.read_csv. You'd need usecols=['index','date','name','position'] to get the correct order directly.
Simple and stupid:
LIST = [
['date', 'index', 'name', 'position'],
['2003-02-04', '23445', 'Steiner, James', '98886'],
['2003-02-04', '23446', 'Holm, Derek', '2233'],
]
NEW_HEADER = ['index', 'date', 'name', 'position']
def swap(lists, new_header):
mapping = {}
for lst in lists:
if not mapping:
mapping = {
old_pos: new_pos
for new_pos, new_field in enumerate(new_header)
for old_pos, old_field in enumerate(lst)
if new_field == old_field}
yield [item for _, item in sorted(
[(mapping[index], item) for index, item in enumerate(lst)])]
if __name__ == '__main__':
print(LIST)
print(list(swap(LIST, NEW_HEADER)))
To rearrange your data, you can use a dictionary:
import csv
s = [
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233'],
]
new_data = [{a:b for a, b in zip(s[0], i)} for i in s[1:]]
final_data = [[b[c] for c in ['index','date','name','position']] for b in new_data]
write = csv.writer(open('filename.csv'))
write.writerows(final_data)

Categories

Resources