Concatenate Using Lambda And Conditions - python

I am trying to use lambda and map to create a new column within my dataframe. Essentially the new column will take column A if a criterion is met and column B if the criterion is not met. Please see my code below.
df['LS'] = df['Long'].map(lambda y: df.Currency if y > 0 else df.StartDate)
However, when I do this the function assigns the entire column to each item in my new column.
In English: I am going through each item y in column "Long". If the item is > 0 then take the yth value in column "Currency". Otherwise take the yth value in column "StartDate".
Iteration is extremely slow in running the above. Are there any other options?
Thanks!
James

Just do
df['LS']=np.where(df.Long>0,df.Currency,df.StartDate)
which is the proper vectorized approach.
df.Long.map applies the lambda to each element, but the lambda actually returns df.Currency or df.StartDate, which are entire Series, so every cell in the new column ends up holding a whole column.
Another approach is to consider:
df.apply(lambda row: row[1] if row[0] > 0 else row[2], axis=1)
will also work with df.columns = Index(['Long', 'Currency', 'StartDate', ...])
but it is not a vectorized approach, so it is slow (about 200x slower for 1000 rows in this case).

You can do the same using where:
df['LS'] = df['Currency'].where(df['Long']>0,df['StartDate'])
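A minimal, self-contained sketch of the two vectorized options above, using a tiny made-up frame (column names follow the question; the data is hypothetical):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Long": [5, -3, 2],
    "Currency": ["USD", "EUR", "GBP"],
    "StartDate": ["2020-01-01", "2020-02-01", "2020-03-01"],
})

# np.where evaluates the condition over the whole column at once
df["LS"] = np.where(df.Long > 0, df.Currency, df.StartDate)

# Equivalent pandas-native form
df["LS"] = df["Currency"].where(df["Long"] > 0, df["StartDate"])
print(df)  # LS is USD, 2020-02-01, GBP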

Pandas map, check if any values in a list are inside another

I have the following list
x = [1,2,3]
And the following df
Sample df
pd.DataFrame({'UserId':[1,1,1,2,2,2,3,3,3,4,4,4],'Origins':[1,2,3,2,2,3,7,8,9,10,11,12]})
Let's say I want to return the UserIds whose grouped Origins contain any of the values in the list.
Wanted result
pd.Series({'UserId':[1,2]})
What would be the best approach to do this? Maybe a groupby with a lambda, but I am having a little trouble formulating the condition.
df['UserId'][df['Origins'].isin(x)].drop_duplicates()
I had considered using unique(), but that returns a NumPy array. Since you wanted a Series, I went with drop_duplicates().
IIUC, OP wants the UserIds whose Origins contain any of the numbers in the list x. If that is the case, the following, using pandas.Series.isin and pandas.unique, will do the work
df_new = df[df['Origins'].isin(x)]['UserId'].unique()
[Out]:
[1 2]
Assuming one wants a series, one can convert the dataframe to a series as follows
df_new = pd.Series(df_new)
[Out]:
0    1
1    2
dtype: int64
If one wants to return a Series and do it all in one step, instead of pandas.unique one can use pandas.DataFrame.drop_duplicates (see Steven Rumbaliski's answer).
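For completeness, a small runnable sketch of the isin-based approach on the sample frame from the question, returning the result as a Series in one step:
import pandas as pd

x = [1, 2, 3]
df = pd.DataFrame({'UserId': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   'Origins': [1, 2, 3, 2, 2, 3, 7, 8, 9, 10, 11, 12]})

# Keep rows whose Origins value is in x, then deduplicate the matching UserIds
result = df.loc[df['Origins'].isin(x), 'UserId'].drop_duplicates()
print(result.tolist())  # [1, 2]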

How to apply a function pairwise on rows in a series?

I want something like this:
df.groupby("A")["B"].diff()
But instead of diff(), I want to be able to compute whether the two rows are different or identical, and return 1 if the current row is different from the previous one and 0 if it is identical.
Moreover, I really would like to use a custom function instead of diff(), so that I can do general pairwise row operations.
I tried using .rolling(2) and .apply() at different places, but I just can not get it to work.
Edit:
Each row in the dataset is a packet.
The first row in the dataset is the first recorded packet, and the last row is the last recorded packet, i.e., they are ordered by time.
One of the features (columns) is called "ID", and several packets have the same ID.
Another column is called "data", its values are 64 bit binary values (strings), i.e., 001011010011001.....10010 (length 64).
I want to create two new features (columns):
Compare the "data" field of the current packet with the data field of the previous packet with the same ID, and compute:
If they are different (1 or 0)
How different (a figure between 0 and 1)
Hi, I think it is best if you forgo the groupby and use shift instead:
equal_index = (df == df.shift(1))[X].all(axis=1)
where X is a list of the columns you want to be identical. Then you can create your own grouper by
my_grouper = (~equal_index).cumsum()
and use it together with agg to aggregate with whatever function you wish
df.groupby(my_grouper).agg({'B':f})
Use DataFrameGroupBy.shift and compare for inequality with Series.ne:
df["dc"] = df.groupby("ID")["data"].shift().ne(df['data']).astype(int)
EDIT: for correlation between 2 Series use:
df["dc"] = df['data'].corr(df.groupby("ID")["data"].shift())
Ok, I solved it myself with
def create_dc(df: pd.DataFrame):
    dc = df.groupby("ID")["data"].apply(lambda x: x != x.shift(1)).astype(int)
    dc.fillna(1, inplace=True)
    df["dc"] = dc
this does what I want.
Thank you @Arnau for inspiring me to use .shift()!
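Putting the shift-based idea together on a tiny made-up frame (8-character strings instead of 64 for brevity; frac_diff is a hypothetical helper for the "how different" figure asked about in the question):
import pandas as pd

df = pd.DataFrame({
    "ID":   [1, 1, 2, 1, 2],
    "data": ["00001111", "00001111", "10101010", "11110000", "10101011"],
})

# Previous packet's data within the same ID (NaN for the first packet of each ID)
prev = df.groupby("ID")["data"].shift()

# 1 if the data differs from the previous packet with the same ID, else 0
# (the first packet of each ID compares against NaN and is flagged as 1)
df["dc"] = prev.ne(df["data"]).astype(int)

# Fraction of positions that differ (NaN where there is no previous packet)
def frac_diff(cur, prv):
    if pd.isna(prv):
        return float("nan")
    return sum(a != b for a, b in zip(cur, prv)) / len(cur)

df["diff_frac"] = [frac_diff(c, p) for c, p in zip(df["data"], prev)]
print(df)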

Pandas conditional row values based on another column

[Picture of the dataframe]
Hi! I've been trying to figure out how I could calculate wallet balances of ERC-20 tokens, but can't get this to work. The idea is simple: when the "Status" column's row value is "Sending", the value should be negative, and when it is "Receiving", it should be positive. Lastly I would use groupby and calculate sums by token symbol. The problem is, I can't get the conditional statement to work. What would be a way to do this? I've tried making loop iterations but they don't seem to work.
Assuming that df is the dataframe you presented, it's enough to select proper slice and multiply values by -1:
df.loc[df['Status'] == 'Sending', 'Value'] *= -1
And then grouping:
df = df.groupby(['Symbol']).sum().reset_index()
Looping in pandas is not a good idea: you can perform operations in a more optimal, vectorised manner, so try to avoid it.
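A minimal sketch of the whole flow with made-up data matching the described columns (Status, Symbol, Value):
import pandas as pd

df = pd.DataFrame({
    "Status": ["Sending", "Receiving", "Receiving", "Sending"],
    "Symbol": ["DAI", "DAI", "USDC", "USDC"],
    "Value":  [10.0, 25.0, 40.0, 15.0],
})

# Negate the value of outgoing transfers, then sum per token
df.loc[df["Status"] == "Sending", "Value"] *= -1
balances = df.groupby("Symbol")["Value"].sum().reset_index()
print(balances)  # DAI 15.0, USDC 25.0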

PySpark - an efficient way to find DataFrame columns with more than 1 distinct value

I need an efficient way to list and drop unary columns in a Spark DataFrame (I use the PySpark API). I define a unary column as one which has at most one distinct value and for the purpose of the definition, I count null as a value as well. That means that a column with one distinct non-null value in some rows and null in other rows is not a unary column.
Based on the answers to this question I managed to write an efficient way to obtain a list of null columns (which are a subset of my unary columns) and drop them as follows:
counts = df.summary("count").collect()[0].asDict()
null_cols = [c for c in counts.keys() if counts[c] == '0']
df2 = df.drop(*null_cols)
Based on my very limited understanding of the inner workings of Spark this is fast because the method summary manipulates the entire data frame simultaneously (I have roughly 300 columns in my initial DataFrame). Unfortunately, I cannot find a similar way to deal with the second type of unary columns - ones which have no null values but are lit(something).
What I currently have is this (using the df2 I obtain from the code snippet above):
prox_counts = (df2.agg(*(F.approx_count_distinct(F.col(c)).alias(c)
                         for c in df2.columns))
                  .collect()[0]
                  .asDict())
poss_unarcols = [k for k in prox_counts.keys() if prox_counts[k] < 3]
unar_cols = [c for c in poss_unarcols if df2.select(c).distinct().count() < 2]
Essentially, I first find columns which could be unary in a fast but approximate way and then look at the "candidates" in more detail and more slowly.
What I don't like about it is that a) even with the approximate pre-selection it is still fairly slow, taking over a minute to run even though at this point I only have roughly 70 columns (and about 6 million rows), and b) I use approx_count_distinct with the magic constant 3 (approx_count_distinct does not count null, hence 3 instead of 2). Since I'm not exactly sure how approx_count_distinct works internally, I am a little worried that 3 is not a particularly good constant, since the function might estimate the number of distinct (non-null) values as, say, 5 when it really is 1, and so maybe a higher constant is needed to guarantee nothing is missed in the candidate list poss_unarcols.
Is there a smarter way to do this, ideally so that I don't even have to drop the null columns separately and can do it all in one fell swoop (although that step is actually quite fast, so it's not that big an issue)?
I suggest that you have a look at the following function
pyspark.sql.functions.collect_set(col)
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe
It returns all the values in col with duplicate elements eliminated. Then you can check the length of the result (whether it equals one). I do wonder about performance, but I think it will definitely beat distinct().count(). Let's have a look on Monday :)
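A rough sketch of that suggestion under the question's definition of unary (unverified; df stands for the original DataFrame, and since collect_set drops nulls, an extra term counts null as a distinct value):
from pyspark.sql import functions as F

# One aggregate expression per column: number of distinct non-null values
# plus 1 if the column contains any null at all
exprs = [
    (F.size(F.collect_set(F.col(c)))
     + F.max(F.when(F.col(c).isNull(), 1).otherwise(0))).alias(c)
    for c in df.columns
]
distinct_counts = df.agg(*exprs).collect()[0].asDict()

# Unary columns have at most one distinct value (null included)
unar_cols = [c for c, n in distinct_counts.items() if n <= 1]
df2 = df.drop(*unar_cols)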
You can do df.na.fill("some non existing value").summary() and then drop the relevant columns from the original dataframe
So far the best solution I found is this (it is faster than the other proposed answers, although not ideal, see below):
rows = df.count()
nullcounts = df.summary("count").collect()[0].asDict()
del nullcounts['summary']
nullcounts = {key: (rows - int(value)) for (key, value) in nullcounts.items()}
# a list for columns with just null values
null_cols = []
# a list for columns with no null values
full_cols = []
for key, value in nullcounts.items():
    if value == rows:
        null_cols.append(key)
    elif value == 0:
        full_cols.append(key)
df = df.drop(*null_cols)
# only columns in full_cols can be unary
# all other remaining columns have at least 1 null and 1 non-null value
try:
    unarcounts = (df.agg(*(F.countDistinct(F.col(c)).alias(c) for c in full_cols))
                    .collect()[0]
                    .asDict())
    unar_cols = [key for key in unarcounts.keys() if unarcounts[key] == 1]
except AssertionError:
    unar_cols = []
df = df.drop(*unar_cols)
This works reasonably fast, mostly because I don't have too many "full columns", i.e. columns which contain no null rows, and I only go through all rows of these, using the fast summary("count") method to classify as many columns as I can.
Going through all rows of a column seems incredibly wasteful to me, since once two distinct values are found, I don't really care what's in the rest of the column. I don't think this can be solved in PySpark though (but I am a beginner); this seems to require a UDF, and PySpark UDFs are so slow that it is not likely to be faster than using countDistinct(). Still, as long as there are many columns with no null rows in a dataframe, this method will be pretty slow (and I am not sure how much one can trust approx_count_distinct() to differentiate between one or two distinct values in a column).
As far as I can tell, it beats the collect_set() approach, and filling the null values is actually not necessary, as I realized (see the comments in the code).
I tried your solution, and it was too slow in my situation, so I simply grabbed the first row of the data frame and checked for duplicates. This turned out to be far more performant. I'm sure there's a better way, but I don't know what it is!
first_row = df.limit(1).collect()[0]
drop_cols = [
    key for key, value in df.select(
        [
            sqlf.count(
                sqlf.when(sqlf.col(column) != first_row[column], column)
            ).alias(column)
            for column in df.columns
        ]
    ).collect()[0].asDict().items()
    if value == 0
]
df = df.drop(*drop_cols)

How can I check if a Pandas dataframe's index is sorted

I have a vanilla pandas dataframe with an index. I need to check if the index is sorted. Preferably without sorting it again.
e.g. I can test an index to see if it is unique with index.is_unique; is there a similar way to test whether it is sorted?
How about:
df.index.is_monotonic (in newer pandas versions, use df.index.is_monotonic_increasing)
Just for the sake of completeness, this would be the procedure to check whether the dataframe index is monotonic increasing and also unique, and, if not, make it so:
if not (df.index.is_monotonic_increasing and df.index.is_unique):
    df.reset_index(inplace=True, drop=True)
NOTE: df.index.is_monotonic_increasing returns True even if there are repeated indices, so it has to be complemented with df.index.is_unique.
API References
Index.is_monotonic_increasing
Index.is_unique
DataFrame.reset_index
If sorting is allowed, try
all(df.sort_index().index == df.index)
If not, try
all(a <= b for a, b in zip(df.index, df.index[1:]))
The first one is more readable while the second one has smaller time complexity.
EDIT
Adding another method I've just found. Similar to the second one, but the comparison is vectorized:
all(df.index[:-1] <= df.index[1:])
For values rather than the index (note that DataFrame.sort() has since been replaced by sort_values()/sort_index()):
df.equals(df.sort())
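A quick sketch of the index checks on a throwaway frame (hypothetical data):
import pandas as pd

df = pd.DataFrame({"val": [10, 20, 30]}, index=[2, 0, 1])

print(df.index.is_monotonic_increasing)               # False: 2, 0, 1 is not sorted
print(df.sort_index().index.is_monotonic_increasing)  # True after sorting
print(df.index.is_unique)                             # True: no repeated labels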
