I have a dataframe df of thousands of items where the value of the column "group" repeats from two to ten times. The dataframe has seven columns; one of them is named "url" and another "flag". All of them are strings.
I would like to use Pandas to traverse these groups. For each group I would like to find the longest item in the "url" column and store a "0" or "1" in the "flag" column that corresponds to that item. I have tried the following, but I cannot make it work. I would like to 1) get rid of the loop below, and 2) be able to compare all items in the group through df.apply(...):
all_groups = df["group"].drop_duplicates.tolist()
for item in all_groups:
df[df["group"]==item].apply(lambda x: Here I would like to compare the items within one group)
Can apply() and lambda be used in this context? Any faster way to implement this?
Thank you!
Using groupby() and .transform() you could do something like:
df['flag'] = df.groupby('group')['url'].transform(lambda x: x.str.len() == x.str.len().max())
This provides boolean values for df['flag']. If you need them as 0/1, just add .astype(int) to the end.
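For example, the full 0/1 assignment would look something like:
df['flag'] = df.groupby('group')['url'].transform(lambda x: x.str.len() == x.str.len().max()).astype(int)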
Unless you write code and find it's running slowly don't sweat optimizing it. In the words of Donald Knuth "Premature optimization is the root of all evil."
If you want to use apply and lambda (as mentioned in the question):
df = pd.DataFrame({'url': ['abc', 'de', 'fghi', 'jkl', 'm'], 'group': list('aaabb'), 'flag': 0})
Looks like:
flag group url
0 0 a abc
1 0 a de
2 0 a fghi
3 0 b jkl
4 0 b m
Then figure out which elements should have their flag variable set.
indices = df.groupby('group')['url'].apply(lambda s: s.str.len().idxmax())
df.loc[indices, 'flag'] = 1
Note this only flags the first url with maximal length in each group. You can compare the url lengths to the maximum if you want different behavior (see the sketch after the output below).
So df now looks like:
flag group url
0 0 a abc
1 0 a de
2 1 a fghi
3 1 b jkl
4 0 b m
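For completeness, a sketch of the tie-friendly variant (flag every url whose length equals its group's maximum, in the spirit of the transform answer above):
max_len = df.groupby('group')['url'].transform(lambda s: s.str.len().max())
df['flag'] = (df['url'].str.len() == max_len).astype(int)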
I have a large DataFrame with 100 million records, and I am trying to optimize the run time by using numpy.
Sample data:
import numpy as np
import pandas as pd

dat = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'item': ['beauty', 'beauty', 'shoe', 'shoe', 'handbag'],
                    'mylist': [['beauty', 'something'], ['shoe', 'something', 'else'],
                               ['shoe', 'else', 'some'], ['else'], ['some', 'thing', 'else']]})
dat
ID item mylist
0 1 beauty [beauty, something]
1 2 beauty [shoe, something, else]
2 3 shoe [shoe, else, some]
3 4 shoe [else]
4 5 handbag [some, thing, else]
I am trying to filter the rows where the string in the item column exists in the mylist column, using:
dat[np.where(dat['item'].isin(dat['mylist']), True, False)]
But I am not getting the expected output; all of the above values come back as False.
I could get the required results using:
dat[dat.apply(lambda row : row['item'] in row['mylist'], axis = 1)]
ID item mylist
0 1 beauty [beauty, something]
2 3 shoe [shoe, else, some]
But as numpy operations are faster, I am trying to use np.where. Could someone please let me know how to fix the code?
You can't easily vectorize with a Series of lists, but you can use a list comprehension, which should be a bit faster than apply:
out = dat.loc[[item in l for item,l in zip(dat['item'], dat['mylist'])]]
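As an aside, the reason the np.where/isin attempt returns all False is most likely that Series.isin checks each item string for membership among the values of dat['mylist'], i.e. it compares each string against the whole list objects rather than looking inside each row's list, so nothing ever matches:
# Compares 'beauty', 'shoe', ... against the list objects themselves, never their contents,
# which is why the attempt in the question came back all False.
dat['item'].isin(dat['mylist'])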
A vectorized solution would be:
out = dat.loc[dat.explode('mylist').eval('item == mylist').groupby(level=0).any()]
# or
out = dat.explode('mylist').query('item == mylist').groupby(level=0).first()
# or, if you are sure that there is at most 1 match
out = dat.explode('mylist').query('item == mylist')
But the explode step might be a bottleneck. You should try it with your real data.
output:
ID item mylist
0 1 beauty [beauty, something]
2 3 shoe [shoe, else, some]
timing
I ran a quick test on 100k rows (using df = pd.concat([dat]*20000, ignore_index=True))
the list comprehension is the fastest (~20ms)
explode approaches are between 60-90ms (explode itself requiring 40ms)
apply is by far the slowest (almost 600ms)
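For reference, a rough sketch of how such a timing can be reproduced (assuming dat from above; absolute numbers will vary with hardware and pandas version):
import timeit

df = pd.concat([dat] * 20000, ignore_index=True)  # ~100k rows

t_listcomp = timeit.timeit(
    lambda: df.loc[[item in l for item, l in zip(df['item'], df['mylist'])]], number=10)
t_apply = timeit.timeit(
    lambda: df[df.apply(lambda row: row['item'] in row['mylist'], axis=1)], number=10)

print(f'list comprehension: {t_listcomp / 10:.3f}s per run')
print(f'apply:              {t_apply / 10:.3f}s per run')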
I'm having difficulties counting the number of elements in a list within a DataFrame's column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:
import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})
print(df)
ID NETWORK
0 1 []
1 2 [OPE, GSR, REP]
2 3 [MER]
Even though one might think that the list in the row where ID = 1 is empty, it's not. It actually contains the empty string ([""]), which took me a long time to figure out.
So whatever standard method I try to use to calculate the number of elements within each list, I get a wrong value of 1 for the lists that are supposed to be empty:
df["COUNT"] = df["NETWORK"].str.len()
print(df)
ID NETWORK COUNT
0 1 [] 1
1 2 [OPE, GSR, REP] 3
2 3 [MER] 1
I searched and tried a lot of things before posting here but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file nor modify the way I'm importing it.
You just need to write a custom apply function that ignores the empty string '':
df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))
Another way:
df['NETWORK'].apply(lambda x: len([y for y in x if y]))
Using apply is probably more straightforward. Alternatively: explode, filter, then group by the original index and count.
_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()
This yields:
   ID          NETWORK  count
0   1               []    NaN
1   2  [OPE, GSR, REP]    3.0
2   3            [MER]    1.0
Fill NA with zeroes if needed.
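For example, something like:
df['count'] = df['count'].fillna(0).astype(int)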
df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))
Use a lambda function on each row and in the lambda function return the length of the array
So I struggled to even come up with a title for this question. Not sure I can edit the question title, but I would be happy to do so once there is clarity.
I have a data set from an experiment where each row is a point in time for a specific group. [Edited based on better approach to generate data by Daniela Vera below]
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': np.random.randn(30),
                   'time': [1,2,3,4,5,6,7,8,9,10] * 3,
                   'grp': ['c','c','c','a','a','b','b','c','c','c'] * 3})
df.head(11)
x1 time grp
0 0.533131 1 c
1 1.486672 2 c
2 1.560158 3 c
3 -1.076457 4 a
4 -1.835047 5 a
5 -0.374595 6 b
6 -1.301875 7 b
7 -0.533907 8 c
8 0.052951 9 c
9 -0.257982 10 c
10 -0.442044 1 c
In the dataset, some groups only start to have values after time 5; in this case, group b. However, in the dataset I am working with there are up to 5,000 groups rather than just the 3 groups in this example.
I would like to identify every group whose values only appear after time 5, and drop those groups from the overall dataframe.
I have come up with a solution that works, but I feel like it is very clunky, and wondered if there was something cleaner.
# First I split the data into before and after the time of interest
after = df[df['time'] > 5].copy()
before = df[df['time'] < 5].copy()
# Then I merge the two dataframes and use the indicator to find out which groups only appear after time 5.
missing = pd.merge(after, before, on='grp', how='outer', indicator=True)
# Then I use groupby and nunique to identify the groups that only appear after time 5 and save the result
something = missing[missing['_merge'] == 'left_only'].groupby('grp').nunique()
# I extract the list of group ids
something = something.index
# I go back to my main dataframe and make the group id the index
df = df.set_index('grp')
# I then apply .drop with the array of group ids
df = df.drop(something)
df = df.reset_index()
Like I said, super clunky. But I just couldn't figure out an alternative. Please let me know if anything isn't clear and I'll happily edit with more details.
I am not sure if I get it, but let's say you have this data:
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
In this case, group "b" only has data for times 6 and 7, which are above time 5. You can use this process to get a dictionary with the times at which each group has at least one data point, and also a list called "keep" with the groups that have at least one data point before time 5.
list_groups = ["a", "b", "c"]
times_per_group = {}
keep = []
for group in list_groups:
    times_per_group[group] = list(df[df.grp == group].time.unique())
    condition = any([i < 5 for i in times_per_group[group]])
    if condition:
        keep.append(group)
Finally, you just keep the groups present in the list "keep":
df = df[df.grp.isin(keep)]
Let me know if I understood your question!
Of course you can simplify the process; the dictionary is just there to check, and you don't actually need the whole code. If this result is what you're looking for, you can just do:
keep = [group for group in list_groups if any([i<5 for i in list(df[df.grp == group].time.unique())])]
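For reference, a more compact pandas-only sketch with the same keep condition (a group survives if it has at least one observation before time 5) would use a groupby transform:
df = df[df.groupby('grp')['time'].transform('min') < 5]
This avoids building the intermediate list of groups entirely.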
I have a DataFrame with thousands of rows. Its structure is as below:
A B C D
0 q 20 'f'
1 q 14 'd'
2 o 20 'a'
I want to compare the A column of the current row and the next row. If those values are equal, I want to add the B value of the row with the lower B value to the D column of the row with the greater B value, and then remove the row whose B value was moved. It's like a swap process.
A B C D
0 q 20 'f' 14
1 o 20 'a'
I have thousands of rows, and the iloc, loc, and at methods are slow. At the very least I would like to use the DataFrame apply method. I tried some code samples, but they didn't work.
I want to do something as below:
DataFrame.apply(lambda row: self.compare(row, next(row)), axis=1)
I have a compare method, but I couldn't pass the next row to it. How can I pass the next row to the method? I am also open to faster pandas solutions.
Best not to do that with apply as it will be slow; you can look at using shift, e.g.
df['A_shift'] = df['A'].shift(1)
df['Is_Same'] = 0
df.loc[df.A_shift == df.A, 'Is_Same'] = 1
Gets a bit more complicated if you're doing the shift within groups, but still possible.
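For instance, a sketch of the grouped variant (assuming a hypothetical 'group' column): groupby().shift() keeps the comparison from crossing group boundaries.
df['A_shift'] = df.groupby('group')['A'].shift(1)  # 'group' is a hypothetical grouping column
df['Is_Same'] = (df['A'] == df['A_shift']).astype(int)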
I have a dataframe like the following, where everything is formatted as a string:
df
property value count
0 propAb True 10
1 propAA False 10
2 propAB blah 10
3 propBb 3 8
4 propBA 4 7
5 propCa 100 4
I am trying to find a way to filter the dataframe by applying a series of regex-style rules to both the property and value columns together.
For example, some sample rules may be like the following:
"if property starts with 'propA' and value is not 'True', drop the row".
Another rule may be something more mathematical, like:
"if property starts with 'propB' and value < 4, drop the row".
Is there a way to accomplish something like this without having to iterate over all rows each time for every rule I want to apply?
You still have to apply each rule (how else?), but let pandas handle the rows. Also, instead of removing the rows that you do not like, keep the rows that you do. Here's an example of how the first two rules can be applied:
rule1 = df.property.str.startswith('propA') & (df.value != 'True')
df = df[~rule1] # Keep everything that does NOT match
rule2 = df.property.str.startswith('propB') & (df.value < 4)
df = df[~rule2] # Keep everything that does NOT match
By the way, the second rule will not work because value is not a numeric column.
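If you do need the numeric rule, one option (a sketch) is to convert value on the fly with pd.to_numeric; coercing non-numeric strings to NaN means those rows simply never match the rule:
value_num = pd.to_numeric(df.value, errors='coerce')  # 'True', 'blah', ... become NaN
rule2 = df.property.str.startswith('propB') & (value_num < 4)
df = df[~rule2]  # NaN < 4 is False, so non-numeric rows are never dropped by this rule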
For the first one:
df = df.drop(df[(df.property.str.startswith('propA')) & (df.value != 'True')].index)
and the other one:
df = df.drop(df[(df.property.str.startswith('propB')) & (pd.to_numeric(df.value, errors='coerce') < 4)].index)