My data contain columns with empty cells that are read by pandas as nan.
I want to create a dictionary of lists from this data. However, some of the lists contain nan and I want to remove those values.
If I use dropna(), as in data.dropna().to_dict(orient='list'), it removes every row that contains at least one nan, so I lose data.
Col1 Col2 Col3
a    x    r
b    y    v
c         x
          z
data = pd.read_csv(sys.argv[2], sep = ',')
dict = data.to_dict(orient='list')
Current output:
dict = {Col1: ['a','b','c',nan], Col2: ['x', 'y',nan,nan], Col3: ['r', 'v', 'x', 'z']}
Desired output:
dict = {Col1: ['a','b','c'], Col2: ['x', 'y'], Col3: ['r', 'v', 'x', 'z']}
My goal: get a dictionary of lists, with nan removed from each list.
Not sure exactly what format you're expecting, but you can use a list comprehension and itertuples to do this.
First create some data.
import pandas as pd
import numpy as np
data = pd.DataFrame.from_dict({'Col1': (1, 2, 3), 'Col2': (4, 5, 6), 'Col3': (7, 8, np.nan)})
print(data)
Giving a data frame of:
Col1 Col2 Col3
0 1 4 7.0
1 2 5 8.0
2 3 6 NaN
And then we create the dictionary using the iterator.
dict_1 = {x[0]: [y for y in x[1:] if not pd.isna(y)] for x in data.itertuples(index=True) }
print(dict_1)
>>>{0: [1, 4, 7.0], 1: [2, 5, 8.0], 2: [3, 6]}
To do the same for the columns is even easier:
dict_2 = {data[column].name: [y for y in data[column] if not pd.isna(y)] for column in data}
print(dict_2)
>>>{'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7.0, 8.0]}
I am not sure if I understand your question correctly, but if what you want is to replace the nan with a value so as not to lose your data, then what you are looking for is the pandas.DataFrame.fillna function. You mentioned the original value is an empty cell, so you can fill the nan values with data.fillna(''), which replaces them with an empty string.
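For instance, a minimal sketch on a hypothetical frame that mirrors the sample data (not the OP's actual file):
import pandas as pd
import numpy as np

data = pd.DataFrame({'Col1': ['a', 'b', 'c', np.nan],
                     'Col2': ['x', 'y', np.nan, np.nan],
                     'Col3': ['r', 'v', 'x', 'z']})

filled = data.fillna('')   # nan becomes an empty string, so no rows are lost
print(filled.to_dict(orient='list'))
# {'Col1': ['a', 'b', 'c', ''], 'Col2': ['x', 'y', '', ''], 'Col3': ['r', 'v', 'x', 'z']}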
EDIT: After you provided the desired output, the answer to your question changes a bit. What you'll need is a dict comprehension combined with a list comprehension to build that dictionary, looping over the columns and filtering out nan. I see that Andrew already provided the code to do this in his answer, so have a look there.
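For completeness, a short sketch of that pattern applied to the hypothetical frame from the fillna sketch above, which yields the desired output:
result = {col: [v for v in data[col] if not pd.isna(v)] for col in data}
print(result)
# {'Col1': ['a', 'b', 'c'], 'Col2': ['x', 'y'], 'Col3': ['r', 'v', 'x', 'z']}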
Related
Suppose I have a dataframe where some columns contain a zero value as one of their elements (or potentially more than one zero). I don't specifically want to retrieve these columns or discard them (I know how to do that); I just want to locate them. For instance: if there are zeros somewhere in the 4th, 6th and 23rd columns, I want a list with the output [4, 6, 23].
You could iterate over the columns, checking whether 0 occurs in each column's values:
[i for i, c in enumerate(df.columns) if 0 in df[c].values]
Use any() for the fastest, vectorized approach.
For instance,
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [0, 100, 200],
                   'col3': ['a', 'b', 'c']})
Then,
>>> s = df.eq(0).any()
col1 False
col2 True
col3 False
dtype: bool
From here, it's easy to get the indexes. For example,
>>> s[s].index.tolist()
['col2']
Many ways to retrieve the indexes from a pd.Series of booleans.
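For instance, to get the positional locations the question asks for (e.g. [4, 6, 23]) rather than the names, one possible sketch reusing the s from above:
import numpy as np

positions = np.flatnonzero(s).tolist()   # positions of the True entries
print(positions)                         # [1] -> 'col2' sits at position 1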
Here is an approach that leverages a couple of lambda functions:
d = {'a': np.random.randint(10, size=100),
     'b': np.random.randint(1, 10, size=100),
     'c': np.random.randint(10, size=100),
     'd': np.random.randint(1, 10, size=100)}
df = pd.DataFrame(d)
df.apply(lambda x: (x==0).any()).reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Another idea based on @rafaelc's slick answer (but returning the relative locations of the columns instead of the column names):
df.eq(0).any().reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Or with the column names instead of locations:
df.apply(lambda x: (x==0).any())[lambda x: x].index.to_list()
['a', 'c']
I have a pandas DataFrame with 4 columns and I want to create a new DataFrame that only has three of the columns. This question is similar to: Extracting specific columns from a data frame but for pandas not R. The following code does not work, raises an error, and is certainly not the pandasnic way to do it.
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new = pd.DataFrame(zip(old.A, old.C, old.D)) # raises TypeError: data argument can't be an iterator
What is the pandasnic way to do it?
There is a way of doing this and it actually looks similar to R
new = old[['A', 'C', 'D']].copy()
Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all you'll probably want to use .copy() to avoid a SettingWithCopyWarning.
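For illustration, a small sketch using the old frame from the question:
new = old[['A', 'C', 'D']].copy()
new['A'] = new['A'] * 2       # modifying the copy raises no SettingWithCopyWarning
print(old['A'].tolist())      # [4, 5] -- the original frame is untouched
print(new['A'].tolist())      # [8, 10]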
An alternative method is to use filter which will create a copy by default:
new = old.filter(['A', 'C', 'D'], axis=1)
Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using a drop (this will also create a copy by default):
new = old.drop('B', axis=1)
The easiest way is:
new = old[['A','C','D']]
Another simpler way seems to be:
new = pd.DataFrame([old.A, old.B, old.C]).transpose()
where old.column_name will give you a series.
Make a list of all the column-series you want to retain and pass it to the DataFrame constructor. We need to do a transpose to adjust the shape.
In [14]:pd.DataFrame([old.A, old.B, old.C]).transpose()
Out[14]:
A B C
0 4 10 100
1 5 20 50
Select columns by index:
# selected column indexes: 1, 6, 7
new = old.iloc[:, [1, 6, 7]].copy()
As far as I can tell, you don't necessarily need to specify the axis when using the filter function.
new = old.filter(['A','B','D'])
returns the same dataframe as
new = old.filter(['A','B','D'], axis=1)
Generic functional form
def select_columns(data_frame, column_names):
    new_frame = data_frame.loc[:, column_names]
    return new_frame
Specific for your problem above
selected_columns = ['A', 'C', 'D']
new = select_columns(old, selected_columns)
As an alternative:
new = pd.DataFrame().assign(A=old['A'], C=old['C'], D=old['D'])
If you want to have a new data frame then:
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new = old[['A', 'C', 'D']]
You can drop columns in the index:
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})
df[df.columns.drop(['B', 'C'])]
or
df.loc[:, df.columns.drop(['B', 'C'])]
Output:
A D
0 1 4
1 1 4
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})
new = df.filter(['A','B','D'], axis=1)
I'm new on Stack Overflow and have switched from R to Python. I'm trying to do something that's probably not too difficult, and while I can do it by butchering, I am wondering what the most pythonic way to do it is. I am trying to divide certain values in a column (E where F=a) by values further down in the column (E where F=b), using column D as a lookup:
import pandas as pd
df = pd.DataFrame({'D':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1], 'E':[10,20,30,40,50,100, 250, 250, 360, 567, 400],'F':['a', 'a', 'a', 'a', 'a', 'b','b', 'b', 'b', 'b', 'c']})
print(df)
out = pd.DataFrame({'D': [1, 2, 3, 4, 5], 'a/b': [0.1, 0.08, 0.12, 0.1111, 0.0881]})
print(out)
Can anyone help write this nicely?
I'm not entirely sure what you mean by "using column D as a lookup", since there is no need for such a lookup in the example you provided.
However, the quick and dirty way to achieve the output you did provide is
output = pd.DataFrame({'a/b': df[df['F'] == 'a']['E'].values / df[df['F'] == 'b']['E'].values})
output['D'] = df['D']
which gives
a/b D
0 0.100000 1
1 0.080000 2
2 0.120000 3
3 0.111111 4
4 0.088183 5
You can do the lookup with .loc on the dataframe, as df.loc[rows, columns], where rows and columns are the conditions that must be True:
import numpy as np
# unique values from column D; converting the set to a list fixes an order we can reuse below
idx = list(set(df['D']))
# A is an array of values with 'F'=a
A = np.array([df.loc[(df['F']=='a') & (df['D']==i),'E'].values[0] for i in idx])
# B is an array of values with 'F'=b
B = np.array([df.loc[(df['F']=='b') & (df['D']==i),'E'].values[0] for i in idx])
# now divide and assemble the new dataframe of ratios
out = pd.DataFrame(np.vstack([A/B,idx]).T, columns = ['a/b','D'])
Instead of using numpy.vstack, you can use:
out = pd.DataFrame([A/B, idx]).T
out.columns = ['a/b', 'D']
with the same result. (I tried to do it in a single line for no reason whatsoever.)
Got it:
df = df.set_index('D')
out = df.loc[(df['F'] == 'a'), 'E'] / df.loc[(df['F'] == 'b'), 'E']
out = out.reset_index()
Thanks for your thoughts - I got inspired.
Question
I want to find all unique, unordered combinations of the MultiIndex in which element 'A' is present and sum these rows.
Data
I have the following dataframe:
df = pd.DataFrame({'col1': [2, 4, 6, 3], 'col2': [8, -2, 5, 3]},
                  index=pd.MultiIndex.from_tuples([('A', 'B'), ('B', 'A'), ('A', 'C'), ('C', 'A')]))
df.index.names = ['from', 'to']
Output:
         col1  col2
from to
A    B      2     8
B    A      4    -2
A    C      6     5
C    A      3     3
Expected Output
         col1  col2
from to
A    B      6     6
     C      9     8
The first row of the expected output is the sum of the first two input rows. They are added because the order of the two elements can be swapped so that they are identical: sorted(['A', 'B']) == sorted(['B', 'A']). Likewise, the second row of the output is the sum of the third and fourth rows.
Ugly, unreadable code
My solution builds two new columns, 'group1' and 'group2', which are ordered for the sole purpose of grouping the data. Yes, I could refactor my two functions into one lambda function to have fewer lines of code. However, I don't mind a few extra lines of code to use apply(). My problem is that the solution is almost unintelligible to any reader (this includes me in a few weeks).
The code produces the expected output, but ... (to quote Raymond Hettinger) There Must Be a Better Way!
def switch_cols(col1: str, col2: str, search_word):
    if col2 == search_word:
        return [col2, col1]
    return [col1, col2]

def apply_switch_cols(df):
    return switch_cols(df['from'], df['to'], search_word='A')
df['group1'] = ''
df['group2'] = ''
df[['group1', 'group2']] = df.reset_index().apply(apply_switch_cols, axis=1).to_list()
df = df.groupby(['group1', 'group2']).sum()
df.index.names = ['from', 'to']
Here is one way:
df = pd.DataFrame({'col1': [2, 4, 6, 3], 'col2': [8, -2, 5, 3]},
                  index=pd.MultiIndex.from_tuples([('A', 'B'), ('B', 'A'), ('A', 'C'), ('C', 'A')]))
df.index.names = ['from', 'to']
df = df.reset_index()
df.index = pd.MultiIndex.from_tuples(df[['from','to']].apply(lambda x: sorted(x), axis=1))
df.groupby(level=[0,1]).sum()
Output:
     col1  col2
A B     6     6
  C     9     8
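A variation on the same idea, as a minimal sketch: sort each index tuple in place instead of resetting the index (the frame is rebuilt here so the snippet stands on its own):
import pandas as pd

df = pd.DataFrame({'col1': [2, 4, 6, 3], 'col2': [8, -2, 5, 3]},
                  index=pd.MultiIndex.from_tuples([('A', 'B'), ('B', 'A'), ('A', 'C'), ('C', 'A')],
                                                  names=['from', 'to']))

# sort each index tuple so ('B', 'A') and ('A', 'B') collapse onto the same key, then sum
df.index = pd.MultiIndex.from_tuples([tuple(sorted(t)) for t in df.index],
                                     names=['from', 'to'])
print(df.groupby(level=['from', 'to']).sum())
#          col1  col2
# from to
# A    B      6     6
#      C      9     8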
Using pandas, I'm trying to find the most recent overlapping occurrence of each value in column A that also happens to be in column B (though not necessarily occurring in the same row); this is to be done for all rows in column A.
I've accomplished something close with an n^2 solution (by creating a list of each column and iterating through with a nested for-loop), but I would like to use something faster if possible, as this needs to be applied to a table with tens of thousands of entries. (A vectorized solution would be ideal, but I am mostly looking for the "right" way to do this.)
df['idx'] = range(0, len(df.index))
A = list(df['r_A'])
B = list(df['r_B'])
A_B_Dict = {}
for i in range(0, len(B)):
    for j in range(0, len(A)):
        if B[i] == A[j]:
            A_search = df.loc[df['r_A'] == A[j]].index
            A_B_Dict[B[i]] = A_search
Given some df like so:
data = [[1, 'A', 'A'],
        [2, 'B', 'D'],
        [3, 'C', 'B'],
        [4, 'D', 'D']]
df = pd.DataFrame(data, columns=['idx', 'A', 'B'])
It should give back something like:
A_B_Dict = {'A': 1, 'B': 3, 'C': None, 'D': 4}
That is, the most recent occurrence (or all occurrences, for that matter) of each value from column A that also appears in column B is stored as the value in A_B_Dict, where the key is the original value observed in column A.
IIUC
d = dict(zip(df.B, df.idx))   # map each value in B to the idx of its last occurrence
dict(zip(df.A, df.A.map(d)))  # look up every value of A in that map
{'A': 1.0, 'B': 3.0, 'C': nan, 'D': 4.0}
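If you would rather keep integer values and get None for misses, as in the desired output, a minimal variation of the same mapping idea (using the sample data above):
mapping = dict(zip(df.B, df.idx))           # later rows overwrite earlier ones, so the most recent idx wins
result = {a: mapping.get(a) for a in df.A}  # .get returns None when a value from A never appears in B
print(result)
# {'A': 1, 'B': 3, 'C': None, 'D': 4}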