I have a dataset that looks as follows:
Col1 Col2 Col3 Count
A B 1 50
A B 1 50
A C 20 1
A D 17 2
A E 5 70
A E 15 20
Suppose it is called data. I basically do data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum(), which should give me this:
Col1 Col2 Col3 Count
A B 1 100
A C 20 1
A D 17 2
A E 5 70
A E 15 20
However, this returns an empty dataset, which does have the columns I want but no rows. The only caveat is that the by parameter is calculated dynamically rather than being fixed (that's because the columns might change, although Count will always be there).
Any ideas on why this could be failing, and how to fix it?
EDIT: Further searching revealed that pandas' groupby drops rows that have NULL in any of the grouping columns. This is a problem for me because every single column might be NULL. Hence, the actual question is: is there any reasonable way to deal with NULLs and still use groupby?
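For reference, here is a minimal sketch of that behaviour (the values are made up for illustration): with the default settings, a row whose grouping key is NaN simply disappears from the grouped result.
import pandas as pd
data = pd.DataFrame({'Col1': ['A', 'A', None],
                     'Col2': ['B', 'B', 'C'],
                     'Col3': [1, 1, 20],
                     'Count': [50, 50, 1]})
# the row with Col1 == None is silently dropped from the output
data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum()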
Would love to be corrected here, but I'm not sure there is a clean way to handle missing data. As you noted, pandas will just exclude rows from groupby that contain NaN values in the grouping columns.
You could fill the NaN values with something beyond the range of your data:
data = pd.read_csv("c:/Users/simon/Desktop/data.csv")
data.fillna(-999, inplace=True)
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum()
This is messy because it won't add those values to the correct group for the summation, but there's no real way to group by something that's missing.
Another method might be to fill each column separately with some missing value that is appropriate for that variable.
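Another option, assuming pandas 1.1 or newer: groupby accepts a dropna parameter, so rows with NaN keys can be kept as groups of their own without filling in sentinel values.
import pandas as pd
data = pd.read_csv("c:/Users/simon/Desktop/data.csv")
# dropna=False keeps rows whose grouping keys are NaN (requires pandas >= 1.1)
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False, dropna=False).sum()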
I am working with a large dataset with a review column made up of comma-separated strings, for example "A,B,C", "A,B*,B", etc.
A small example:
import pandas as pd
df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})
df2 = df["review"].str.split(",",expand = True)
df.join(df2)
I want to split that column up into separate columns for each letter, then add those columns into the original data frame. I used df2 = df["review"].str.split(",",expand = True) and df.join(df2) to do that.
However, when I use df["A"].unique(), there are entries that should not be in that column: I only want 'A' to appear there, but B and C show up as well. Also, B and B* are not being split into two separate columns.
My dataset is quite large, so I don't know how to properly illustrate this problem. I have tried to provide a small-scale example, but everything seems to work correctly in it.
I have looked through the original column with df['review'].unique() and all entries were entered correctly (no missing commas or anything like that), so I am wondering whether there is something wrong with my approach that keeps it from working correctly across all datasets, or whether there is something wrong with my dataset.
Does anyone have any suggestions as to how I should troubleshoot?
when i use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there
IIUC, you wanted to create dummy variables instead?
df2 = df.join(df['review'].str.get_dummies(sep=',').pipe(lambda x: x*[*x]).replace('',float('nan')))
Output:
cat1 review A B B* C D
0 1 A,B,C A B NaN C NaN
1 2 A,B*,B,C A B B* C NaN
2 3 A,C A NaN NaN C NaN
3 4 A,B,C,D A B NaN C D
4 5 A,B,C,A,B A B NaN C NaN
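For readability, the same one-liner can be unpacked step by step; a sketch using the small df above:
import pandas as pd
df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})
dummies = df['review'].str.get_dummies(sep=',')    # one 0/1 indicator column per label
labels = dummies * list(dummies.columns)           # 1 -> column name, 0 -> ''
df2 = df.join(labels.replace('', float('nan')))    # blank cells become NaN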
I am trying to get a value situated on the third column from a pandas dataframe by knowing the values of interest on the first two columns, which point me to the right value to fish out. I do not know the row index, just the values I need to look for on the first two columns. The combination of values from the first two columns is unique, so I do not expect to get a subset of the dataframe, but only a row. I do not have column names and I would like to avoid using them.
Consider the dataframe df:
a 1 bla
b 2 tra
b 3 foo
b 1 bar
c 3 cra
I would like to get tra from the second row, based on the b and 2 combination that I know beforehand. I've tried subsetting with
df = df.loc['b', :]
which returns all the rows with b in the first column (provided I've read the data with index_col=0), but I am not able to pass multiple conditions to it without crashing or without knowing the index of the row of interest. I tried both df.loc and df.iloc.
In other words, ideally I would like to get tra without even using row indexes, by doing something like:
df[(df[,0] == 'b' & df[,1] == 2)][2]
Any suggestions? It is probably something simple, but I have the tendency to use the same syntax as in R, which apparently is not compatible.
Thank you in advance
As @anky suggested, a way to do this without knowing the column names or the row index of the value of interest is to read the file into a pandas DataFrame using the first two columns as a MultiIndex.
For the provided example, knowing the column indexes at least, that would be:
df = pd.read_csv(path, sep='\t', index_col=[0, 1])
then, you can use:
df = df.iloc[df.index.get_loc(("b", 2)):]
df.iloc[0]
to get the value of interest.
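A slightly more direct variant of the same idea, assuming the file really was read with index_col=[0, 1] as above, is to look the pair up by label instead of by position:
df.loc[("b", 2)].iloc[0]   # looks up the (b, 2) key in the MultiIndex and returns 'tra'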
Thanks again @anky for your help.
I'd probably use DataFrame.query for that:
import pandas as pd
df = pd.DataFrame(index=['a', 'b', 'b', 'b', 'c'], data={"col1": [1, 2, 3, 1, 3], "col2": ['bla', 'tra', 'foo', 'bar', 'cra']})
df
col1 col2
a 1 bla
b 2 tra
b 3 foo
b 1 bar
c 3 cra
df.query('index == "b" and col1 == 2')
col1 col2
b 2 tra
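To pull out just the string rather than a one-row frame (the b/2 combination is unique, as stated), you can then take the single matching cell:
df.query('index == "b" and col1 == 2')['col2'].iloc[0]   # 'tra'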
I have a df that looks something like this:
Col1 Col2 Col3 ColN
0 0 2 1
10 5 0 8
0 0 0 12
Trying to get, for each row, a sum/mean of all the times a value has not been zero (and then add it as a 'Sum' column), to have this output:
Col1 Col2 Col3 ColN Sum
0 0 2 1 2
10 5 0 8 3
0 0 0 12 1
In the df, I'm recording number of times an event has occurred. I'm trying to get the average number of occurrences or frequency (or I guess, the number of times a value in a row has been not 0).
Is there some way to apply this dataframe-wide? I have about 2000 rows, and have been hacking away trying to use Counter but have managed to get the number of times something has been observed only for 1 row :(
Or maybe I should convert all non-zero numbers to a dummy variable, but then I still don't know how to select and sum?
As yatu suggested,
df.ne(0).sum(1)
does the job. (Note: when I use it to do df['Sum'] = df.ne(0).sum(1), I get a warning message, but I don't really understand the implications)
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Actually, several rows still end up with zeroes in the new column (not sure why), so I also remove any rows with zeroes after this (this is all very ugly, but I have no better idea...)
df = df[(df[['Sum']] != 0).all(axis=1)]
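About the SettingWithCopyWarning above: it usually means df was itself created by slicing another DataFrame, so pandas cannot tell whether the assignment should propagate back to the original. A minimal sketch of the usual workaround, assuming the counts frame from the question:
df = df.copy()                    # work on an explicit copy, not a view of another frame
df['Sum'] = df.ne(0).sum(axis=1)  # count of non-zero values per row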
Is it possible to reorder pandas.DataFrame rows (based on an index) inplace?
I am looking for something like this:
df.reorder(idx, axis=0, inplace=True)
Where idx is not equal to but is the same type as df.index; it contains the same elements, but in another order. The output is df reordered according to the new idx.
I have not found anything in the documentation and I failed to make reindex_axis work, which made me hope it was possible because:
A new object is produced unless the new index is equivalent to the
current one and copy=False
I might have misunderstood what equivalent index means in this context.
Try using the reindex function (note that this is not inplace):
>>> import pandas as pd
>>> df = pd.DataFrame({'col1':[1,2,3],'col2':['test','hi','hello']})
>>> df
col1 col2
0 1 test
1 2 hi
2 3 hello
>>> df = df.reindex([2,0,1])
>>> df
col1 col2
2 3 hello
0 1 test
1 2 hi
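If the point of inplace is just to avoid keeping a second object around, assigning the result back (as above) is usually enough; a small sketch, assuming idx holds exactly the existing labels in a new order:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['test', 'hi', 'hello']})
idx = [2, 0, 1]        # same labels as df.index, different order
df = df.reindex(idx)   # rebinds the name; the old ordering is discarded
# for a label-based index the equivalent is: df = df.loc[idx]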
If I have a simple table, such as:
index location col1 col2 col3 col4
1 a TRUE yes 1 4
2 a FALSE null 2 6
3 b TRUE null 6 3
4 b TRUE no 3 4
5 b FALSE yes 4 6
6 c TRUE no 57 8
7 d FALSE null 74 9
If I wanted to aggregate the duplicate records in location, i.e. the two a's or the three b's, I have been using basic groupby functions. This works well for simple tables.
However, is it possible to expand this functionality to allow rules per column when aggregating? For example, in col1, if TRUE is present it would trump any FALSE value; in col3 it would sum the values; whereas in col4 it would calculate the average. Is it possible to define these rules per column and then apply them when using groupby?
I have searched online, but not found anything that seems to cover this, however I may be barking up the wrong tree.
Thanks.
Use groupby and agg
funcs = dict(
col1=dict(Trump=lambda x: x.any()),
col3='sum',
col4=dict(Avg='mean')
)
df.groupby('location').agg(funcs)
When using agg on a groupby object with multiple columns, you can pass a dict that defines which functions to apply to which column.
In this top-level dictionary (funcs), the keys are existing column names and the values define the functions to apply to each.
For example:
agg({'col1': lambda x: x.any(), 'col2': 'sum'})
Says to use any() on col1 and sum on col2. If col1 or col2 did not exist in the dataframe, this would fail.
Further, we didn't have to live with the default column names that come from this aggregation. I'll run the mini example above to illustrate.
df.groupby('location').agg({'col1': lambda x: x.any(), 'col3': 'sum'})
There isn't much description of what we've done. We can describe the functions as we'd like if we pass a dictionary as the function instead, with the key being the description and the value being the function. I'll use the same example but expand it with a better description.
df.groupby('location').agg(
{'col1': {'All I need is one True': lambda x: x.any()},
'col3': {'SUMMATION': 'sum'}})
Armed with that information, hopefully my solution makes perfect sense.
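One caveat worth noting: the nested-dict form of agg used above for renaming was deprecated in pandas 0.20 and removed in pandas 1.0, where it raises a SpecificationError. A roughly equivalent sketch using named aggregation (pandas 0.25+), with the column and result names taken from the funcs example:
df.groupby('location').agg(
    Trump=('col1', 'any'),   # True wins if any value in the group is True
    col3=('col3', 'sum'),    # per-group total
    Avg=('col4', 'mean'),    # per-group average
)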