Select columns based on row values of another dataset - Python

I have two dataframes, dfA and dfB. I want to select several columns of dfA based on the rows in dfB. This is what dfA looks like:
index  abandoned  dismiss  yes  train  tram   go
0            0.5      9.1  1.4    2.5   2.5  5.6
1            2.4      3.2  1.8    4.9   9.3  3.2
2            1.5      5.7  3.9    2.1   1.1  0.9
and this is what dfB looks like:
index  keywords
0      abandoned
1      wanted
2      goes
3      train
4      bold
5      go
6      images
7      links
I want dfC to look like this:
index  abandoned  train   go
0            0.5    2.5  5.6
1            2.4    4.9  3.2
2            1.5    2.1  0.9
This was my attempt, but it gave me an empty dataframe:
dfC = dfB[~dfB["keywords"].isin(dfA)]
Can anyone help me? Thank you.

Use DataFrame.loc and filter the column names with Index.isin:
dfC = dfA.loc[:, dfA.columns.isin(dfB['keywords'])]
Or filter the columns with Index.intersection:
dfC = dfA[dfA.columns.intersection(dfB['keywords'])]
print(dfC)
abandoned train go
index
0 0.5 2.5 5.6
1 2.4 4.9 3.2
2 1.5 2.1 0.9
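For completeness, here is a self-contained sketch reproducing the frames from the question (the explicit frame construction is my own, added for illustration; the selection line is the answer's first approach):
import pandas as pd

dfA = pd.DataFrame({'abandoned': [0.5, 2.4, 1.5], 'dismiss': [9.1, 3.2, 5.7],
                    'yes': [1.4, 1.8, 3.9], 'train': [2.5, 4.9, 2.1],
                    'tram': [2.5, 9.3, 1.1], 'go': [5.6, 3.2, 0.9]})
dfB = pd.DataFrame({'keywords': ['abandoned', 'wanted', 'goes', 'train',
                                 'bold', 'go', 'images', 'links']})

# Keep only the dfA columns whose names appear in dfB['keywords']
dfC = dfA.loc[:, dfA.columns.isin(dfB['keywords'])]
print(dfC)
#    abandoned  train   go
# 0        0.5    2.5  5.6
# 1        2.4    4.9  3.2
# 2        1.5    2.1  0.9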

Related

Find if there are two columns with different names but identical values using pandas

I have a table with 30 columns, mostly numerical, and 500k rows. I would like to check whether there are two columns in this table that have the same values in all rows.
For example, I have this table:
   num1  num2  num3  num4
0   5.1   2.3     7   5.1
1   2.2   4.4   3.1   2.2
2   3.7  11.1   5.9   3.7
3   4.2   1.5   0.3   4.2
In this case I would like to drop column "num4" because it is identical to column "num1".
So far I have only found ways to check whether values match in places or whether columns share a name, but not whether two differently named columns are completely identical.
My end goal: get rid of duplicated columns (by values, not by name).
Try duplicated on the transposed frame:
out = df.loc[:,~df.T.duplicated()]
Out[397]:
num1 num2 num3
0 5.1 2.3 7.0
1 2.2 4.4 3.1
2 3.7 11.1 5.9
3 4.2 1.5 0.3
Or
out = df.T.drop_duplicates().T
Out[399]:
num1 num2 num3
0 5.1 2.3 7.0
1 2.2 4.4 3.1
2 3.7 11.1 5.9
3 4.2 1.5 0.3
The following works quite well.
import pandas as pd
data = {'ID':['1234564567', '5678995545'],'Amount': [59.99, 19.99],
'ID2':['1234564567', '5678995545']}
data = pd.DataFrame(data)
data.T.drop_duplicates().T
ID Amount
0 1234564567 59.99
1 5678995545 19.99
Basically you need to transpose first, call drop_duplicates, and transpose back.
Cheers
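One hedged caveat: on a frame with 500k rows, df.T materialises a full transposed copy, and mixed dtypes get upcast to object along the way. A sketch of an alternative (my own suggestion, not from the answers above) that compares columns pairwise with Series.equals instead of transposing:
import pandas as pd
from itertools import combinations

def drop_value_duplicate_columns(df):
    # Drop columns whose values duplicate an earlier column, without transposing
    to_drop = set()
    for a, b in combinations(df.columns, 2):
        if b in to_drop:
            continue
        # Series.equals also treats NaN == NaN, unlike ==
        if df[a].equals(df[b]):
            to_drop.add(b)
    return df.drop(columns=list(to_drop))

out = drop_value_duplicate_columns(df)
With 30 columns this is at most a few hundred column comparisons, each a cheap vectorised check.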

filter based on partial string for multiple columns headers

Consider the following DataFrame:
            AAA3  ABB3  DAT4  DEE3  ABB4  AAA4  DAA3  EAV3  DAC4  DEE4
01/01/2020     1   1.1   1.5   1.2  1.32   1.2   1.2     1   0.9   0.5
02/01/2020     1   1.1   1.5   1.2  1.32   1.2   1.2     1   0.9   0.5
03/01/2020     1   1.1   1.5   1.2  1.32   1.2   1.2     1   0.9   0.5
04/01/2020     1   1.1   1.5   1.2  1.32   1.2   1.2     1   0.9   0.5
The values are not important, so I am giving all columns the same value.
What I want to do is check whether the alphabetic part of a column header matches that of another header, and if it does, remove the column whose name ends in 4, keeping only the one ending in 3.
For example:
There is an AAA3, as well as an AAA4. I want to drop the AAA4 column, leaving only AAA3.
Note that there is a column named DAC4, but there isn't a DAC3. So I want to keep my DAC4 column.
I couldn't solve my problem with the following question:
Select by partial string from a pandas DataFrame
Create a mask marking duplicates of the alphabetic part of each header. Create another mask where the last character is 3. Finally, slice using these masks:
m = df.columns.str.extract(r'(^[A-Za-z]+)').duplicated(keep=False)
m1 = df.columns.str.endswith('3')
df_final = df.loc[:,(~m | m1).values]
Out[146]:
            AAA3  ABB3  DAT4  DEE3  DAA3  EAV3  DAC4
01/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
02/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
03/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
04/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
Step 1: Get a dictionary of similar columns:
from collections import defaultdict

d = defaultdict(list)
for entry in df.columns:
    d[entry[:-1]].append(entry)
d
defaultdict(list,
{'AAA': ['AAA3', 'AAA4'],
'ABB': ['ABB3', 'ABB4'],
'DAT': ['DAT4'],
'DEE': ['DEE3', 'DEE4'],
'DAA': ['DAA3'],
'EAV': ['EAV3'],
'DAC': ['DAC4']})
Step 2: Get columns that end with 4:
from itertools import chain

cols_to_drop = list(chain.from_iterable([[ent for ent in value
                                           if ent.endswith("4")]
                                          for key, value in d.items()
                                          if len(value) > 1]))
cols_to_drop
['AAA4', 'ABB4', 'DEE4']
Step 3: Drop the columns:
df.drop(columns=cols_to_drop)
            AAA3  ABB3  DAT4  DEE3  DAA3  EAV3  DAC4
01/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
02/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
03/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
04/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
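As a compact alternative sketch (my own condensation, not from either answer): collect the stems that have a "...3" variant and keep a "...4" column only when no such sibling exists:
import pandas as pd

# Illustrative frame with the same column layout as the question
df = pd.DataFrame(1.0, index=['01/01/2020', '02/01/2020'],
                  columns=['AAA3', 'ABB3', 'DAT4', 'DEE3', 'ABB4',
                           'AAA4', 'DAA3', 'EAV3', 'DAC4', 'DEE4'])

stems_with_3 = {c[:-1] for c in df.columns if c.endswith('3')}
keep = [c for c in df.columns
        if not (c.endswith('4') and c[:-1] in stems_with_3)]
df_final = df[keep]
print(list(df_final.columns))
# ['AAA3', 'ABB3', 'DAT4', 'DEE3', 'DAA3', 'EAV3', 'DAC4']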

Convert frequency table to raw data in Pandas

I have a sensor. For some reason, the sensor likes to record data like this:
>df
obs count
-0.3 3
0.9 2
1.4 5
i.e. it first records observations and makes a count table out of them. What I would like to do is convert this df into a series of raw observations. For example, I would like to end up with: [-0.3, -0.3, -0.3, 0.9, 0.9, 1.4, 1.4, ...]
A similar question was asked for Excel.
If your dataframe structure is like this one (or similar):
obs count
0 -0.3 3
1 0.9 2
2 1.4 5
This is an option, using numpy.repeat:
import numpy as np
import pandas as pd

times = df['count']
df2 = pd.DataFrame({'obs': np.repeat(df['obs'].values, times)})
print(df2)
obs
0 -0.3
1 -0.3
2 -0.3
3 0.9
4 0.9
5 1.4
6 1.4
7 1.4
8 1.4
9 1.4
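As a hedged alternative sketch (not part of the answer above), the same expansion can be done in pandas directly by repeating the Series or the index by the counts:
# Repeat each observation by its count, then reset the index
raw = df['obs'].repeat(df['count']).reset_index(drop=True)

# Or repeat whole rows and keep only 'obs'
raw2 = df.loc[df.index.repeat(df['count']), 'obs'].reset_index(drop=True)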

Exporting with pd.to_csv to include both colname and rowname [duplicate]

Initial Problem
When I run the following in ipython
import numpy as np
import pandas as pd
df = pd.DataFrame(np.round(9*np.random.rand(4,4), decimals=1))
df.index.name = 'x'
df.columns.name = 'y'
df.to_csv('output.csv')
df
it outputs the following result:
y      0    1    2    3
x
0    7.6  7.4  0.3  7.5
1    5.6  0.0  1.5  5.9
2    7.1  2.1  0.0  0.9
3    3.7  6.6  3.3  8.4
However, when I open output.csv the "y" is removed:
x      0    1    2    3
0    7.6  7.4  0.3  7.5
1    5.6    0  1.5  5.9
2    7.1  2.1    0  0.9
3    3.7  6.6  3.3  8.4
How do I make it so that the df.columns.name is retained when I output the dataframe to csv?
Crude workaround
My current crude workaround is the following:
df.to_csv('output.csv', index_label = 'x|y')
Which results in output.csv reading:
x|y    0    1    2    3
0    7.6  7.4  0.3  7.5
1    5.6    0  1.5  5.9
2    7.1  2.1    0  0.9
3    3.7  6.6  3.3  8.4
Something better would be great! Thanks for your help (in advance).
Context
This is what I am working on: https://github.com/SimonBiggs/Electron-Cutout-Factors
This is an example table: https://github.com/SimonBiggs/Electron-Cutout-Factors/blob/master/output/20140807_173714/06app06eng/interpolation-table.csv
You can pass a list to name the columns, then you can specify the index name when you are writing to csv:
df.columns = ['column_name1', 'column_name2', 'column_name3', 'column_name4']
df.to_csv('/path/to/file.csv', index_label='Index_name')
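As a hedged follow-up sketch: if you stick with the question's 'x|y' index_label workaround, both axis names can be restored when reading the file back (this assumes the '|' separator chosen above):
import pandas as pd

df.to_csv('output.csv', index_label='x|y')

df_back = pd.read_csv('output.csv', index_col=0)
# Split the combined label back into the two axis names
index_name, columns_name = df_back.index.name.split('|')
df_back.index.name = index_name      # 'x'
df_back.columns.name = columns_name  # 'y'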
How about this? It's slightly different but hopefully usable, since it fits the CSV paradigm:
>>> df.columns = ['y{}'.format(name) for name in df.columns]
>>> df.to_csv('output.csv')
>>> print open('output.csv').read()
x,y0,y1,y2,y3
0,3.5,1.5,1.6,0.3
1,7.0,4.7,6.5,5.2
2,6.6,7.6,3.2,5.5
3,4.0,2.8,7.1,7.8

Conditional column arithmetic in pandas dataframe

I have a pandas dataframe with the following structure:
import numpy as np
import pandas as pd
myData = pd.DataFrame({'x': [1.2,2.4,5.3,2.3,4.1], 'y': [6.7,7.5,8.1,5.3,8.3], 'condition':[1,1,np.nan,np.nan,1],'calculation': [np.nan]*5})
print myData
   calculation  condition    x    y
0          NaN          1  1.2  6.7
1          NaN          1  2.4  7.5
2          NaN        NaN  5.3  8.1
3          NaN        NaN  2.3  5.3
4          NaN          1  4.1  8.3
I want to enter a value in the 'calculation' column based on the values in 'x' and 'y' (e.g. x/y), but only in those cells where the 'condition' column contains NaN (np.isnan(myData['condition'])). The final dataframe should look like this:
   calculation  condition    x    y
0          NaN          1  1.2  6.7
1          NaN          1  2.4  7.5
2        0.654        NaN  5.3  8.1
3        0.434        NaN  2.3  5.3
4          NaN          1  4.1  8.3
I'm happy with the idea of stepping through each row in turn with a 'for' loop and using 'if' statements to make the calculations, but the actual dataframe I have is very large and I want to do the calculations in an array-based way. Is this possible? I guess I could calculate the value for all rows and then delete the ones I don't want, but this seems like a lot of wasted effort (the NaNs are quite rare in the dataframe), and in some cases where 'condition' equals 1 the calculation cannot be made due to division by zero.
Thanks in advance.
Use where and pass your condition to it; this keeps the calculated values only in the rows that meet the condition:
In [117]:
myData['calculation'] = (myData['x']/myData['y']).where(myData['condition'].isnull())
myData
Out[117]:
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 0.654321 NaN 5.3 8.1
3 0.433962 NaN 2.3 5.3
4 NaN 1 4.1 8.3
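If the division genuinely cannot be evaluated on the excluded rows (the division-by-zero concern raised in the question), a hedged alternative sketch is to assign through a boolean mask so x/y is only computed for the rows that need it:
# Compute x/y only on the rows where 'condition' is NaN
mask = myData['condition'].isnull()
myData.loc[mask, 'calculation'] = myData.loc[mask, 'x'] / myData.loc[mask, 'y']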
EdChum's answer worked well for me! Still, I wanted to extend this thread as I think it will be useful to other people.
Let's assume your dataframe is
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0 5.3 8.1
3 0 2.3 5.3
4 1 4.1 8.3
and you would like to update 0s in column c with associated x/y.
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0.65 5.3 8.1
3 0.43 2.3 5.3
4 1 4.1 8.3
You can do
myData['c'] = (myData['x']/myData['y']).where(cond=myData['c']==0, other=myData['c'])
or
myData['c'].where(cond=myData['c'] != 0, other=myData['x']/myData['y'], inplace=True)
In both cases, where 'cond' is not satisfied, 'other' is used instead. In the second snippet the inplace flag also works nicely (as it would in the first snippet).
I found these solutions in the pandas documentation for "where" and for "indexing".
These kinds of operations are exactly what I need most of the time. I am new to pandas and it took me a while to find this useful thread. Could anyone recommend comprehensive tutorials for practicing these types of operations? I need to filter, group by, or slice a dataframe, then apply different functions or operations to each group or slice, separately or all at once, and keep it all in place. Cheers!
