All, I have an analytical csv file with 190 columns and 902 rows. I need to recode values in several columns (18 to be exact) from it's current 1-5 Likert scaling to 0-4 Likert scaling.
I've tried using replace:
df.replace({'Job_Performance1': {1:0, 2:1, 3:2, 4:3, 5:4}}, inplace=True)
But that throws a Value Error: "Replacement not allowed with overlapping keys and values"
I can use map:
df['job_perf1'] = df.Job_Performance1.map({1:0, 2:1, 3:2, 4:3, 5:4})
But, I know there has to be a more efficient way to accomplish this since this use case is standard in statistical analysis and statistical software e.g. SPSS
I've reviewed multiple questions on StackOverFlow but none of them quite fit my use case.
e.g. Pandas - replacing column values, pandas replace multiple values one column, Python pandas: replace values multiple columns matching multiple columns from another dataframe
Suggestions?
You can simply subtract a scalar value from your column which is in effect what you're doing here:
df['job_perf1'] = df['job_perf1'] - 1
Also as you need to do this on 18 cols, then I'd construct a list of the 18 column names and just subtract 1 from all of them at once:
df[col_list] = df[col_list] - 1
No need for a mapping. This can be done as a vector addition, since effectively, what you're doing, is subtracting 1 from each value. This works elegantly:
df['job_perf1'] = df['Job_Performance1'] - numpy.ones(len(df['Job_Performance1']))
Or, without numpy:
df['job_perf1'] = df['Job_Performance1'] - [1] * len(df['Job_Performance1'])
Related
I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method but have also read in other posts that the apply method can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where statements but received a warning that my DataFrame was fragmented and that I should try to use concat. So I am not sure which approach overall is most efficient or recommended.
Update - after receiving a comment and an answer pointing me towards use of np.where, I decided to test the approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. First approach was to use the concat + df.loc replace as shown in this post. Second approach was to use np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so I suppose now my decision comes down to faster np.where with potential fragmentation vs. slower use of concat without risk of fragmentation. Any further insight on this final update is appreciated.
df = pd.DataFrame({'age': [120, 4000],
'weight': [505.31, 29.01],
'sample_id': [999999999999, 555555555555]},
index=['rock1', 'rock2'])
#step 1: efficiently create starting custom columns using concat
df = pd.concat(
[
df,
(df["age"] > 1000).rename("custom1").astype(int),
(df["weight"] < 100).rename("custom2").astype(float),
],
axis=1,
)
#step2: assign final values to custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = (df['sample_id'])
df.loc[df.custom2 == 1, 'custom2'] = (df['weight'] / 2)
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is using numpy where:
import numpy as np
df['custom1'] = np.where(df.age.gt(1000), df.sample_id, 0)
df['custom2'] = np.where(df.weight.lt(100), df.weight / 2, 0)
I have 2 dataframes that are similar in terms of the data they show (areas, regions, etc), and I am interested in one particular variable, the Area variable.
For the 2 dataframes, namely a and b , I have checked what areas each have by using a.Area.unique() and b.Area.unique(), and also checked the number in each by using nunique().
However, they do not have the same amount of variables, and I need to identify what areas are missing / additional from either dataframe. How can I check the 2 dataframes against each other to identify the difference?
I hope this makes sense, thank you in advance!
We could help better if you could update your question with some sample data, but one potential answer is using merge method:
new_df = pd.merge(a, b, on='Area', how='outer')
This new_df gives you a union of dataframe based on Area column between two dataframes.
Another solution could be:
a.loc[~a['Area'].isin(b['Area']), 'Area']
This gives you areas in a which are not in b['Area'] series.
You can convert it to set and use set operations for checking
a_set=set(a.Area.unique())
b_set=set(b.Area.unique())
list(a_set-b_set) # list of areas in a but not in b
list(b_set-a_set) # list of areas in b but not in a
list(b_set&a_set) # list of areas in both a and b >> intersection
list(b_set|a_set) # list of all areas >> union
I'm trying to downsample Dask dataframes by any x number of rows.
For instance, if I was using datetimes as an index, I could just use:
df = df.resample('1h').ohlc()
But I don't want to resample by datetimes, I want to resample by a fixed number of rows...something like:
df = df.resample(rows=100).ohlc()
I did a bunch of searching and found these three old SO pages:
This one suggests:
df.groupby(np.arange(len(df))//x), where x = the number of rows.
pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1)), but I have trouble understanding this one.
pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0), but I also have trouble understanding this one.
This one suggests df.groupby(np.arange(len(df))//x) again.
This one suggests df_sub = df.rolling(x).mean()[::x], but it says it's wasteful, and doesn't seem optimized for Dask.
The best, fastest option seems to be df.groupby(np.arange(len(df))//x), and it works fine in Pandas. However, when I try it in Dask, I get: ValueError: Grouper and axis must be same length
How do I resample by # of rows using Dask?
I have dataframes with:
A standard index (e.g. 1,2,3,4,5...,n)
Datetime values I could potentially use as an index (although I don't necessarily want to)
Non-standard lengths (i.e. Some of them have an even number of rows, and some have an odd number).
I need an efficient way to list and drop unary columns in a Spark DataFrame (I use the PySpark API). I define a unary column as one which has at most one distinct value and for the purpose of the definition, I count null as a value as well. That means that a column with one distinct non-null value in some rows and null in other rows is not a unary column.
Based on the answers to this question I managed to write an efficient way to obtain a list of null columns (which are a subset of my unary columns) and drop them as follows:
counts = df.summary("count").collect()[0].asDict()
null_cols = [c for c in counts.keys() if counts[c] == '0']
df2 = df.drop(*null_cols)
Based on my very limited understanding of the inner workings of Spark this is fast because the method summary manipulates the entire data frame simultaneously (I have roughly 300 columns in my initial DataFrame). Unfortunately, I cannot find a similar way to deal with the second type of unary columns - ones which have no null values but are lit(something).
What I currently have is this (using the df2 I obtain from the code snippet above):
prox_counts = (df2.agg(*(F.approx_count_distinct(F.col(c)).alias(c)
for c in df2.columns
)
)
.collect()[0].asDict()
)
poss_unarcols = [k for k in prox_counts.keys() if prox_counts[k] < 3]
unar_cols = [c for c in poss_unarcols if df2.select(c).distinct().count() < 2]
Essentially, I first find columns which could be unary in a fast but approximate way and then look at the "candidates" in more detail and more slowly.
What I don't like about it is that a) even with the approximative pre-selection it is still fairly slow, taking over a minute to run even though at this point I only have roughly 70 columns (and about 6 million rows) and b) I use the approx_count_distinct with the magical constant 3 (approx_count_distinct does not count null, hence 3 instead of 2). Since I'm not exactly sure how the approx_count_distinct works internally I am a little worried that 3 is not a particularly good constant since the function might estimate the number of distinct (non-null) values as say 5 when it really is 1 and so maybe a higher constant is needed to guarantee nothing is missing in the candidate list poss_unarcols.
Is there a smarter way to do this, ideally so that I don't even have to drop the null columns separately and do it all in one fell swoop (although that is actually quite fast and so that big a big issue)?
I suggest that you have a look at the following function
pyspark.sql.functions.collect_set(col)
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe
It shall return all the values in col with multiplicated elements eliminated. Then you can check for the length of result (whether it equals one). I would be wondering about performance but I think it will beat distinct().count() definitely. Lets have a look on Monday :)
you can df.na.fill("some non exisitng value").summary() and then drop the relevant columns from the original dataframe
So far the best solution I found is this (it is faster than the other proposed answers, although not ideal, see below):
rows = df.count()
nullcounts = df.summary("count").collect()[0].asDict()
del nullcounts['summary']
nullcounts = {key: (rows-int(value)) for (key, value) in nullcounts.items()}
# a list for columns with just null values
null_cols = []
# a list for columns with no null values
full_cols = []
for key, value in nullcounts.items():
if value == rows:
null_cols.append(key)
elif value == 0:
full_cols.append(key)
df = df.drop(*null_cols)
# only columns in full_cols can be unary
# all other remaining columns have at least 1 null and 1 non-null value
try:
unarcounts = (df.agg(*(F.countDistinct(F.col(c)).alias(c) for c in full_cols))
.collect()[0]
.asDict()
)
unar_cols = [key for key in unarcounts.keys() if unarcounts[key] == 1]
except AssertionError:
unar_cols = []
df = df.drop(*unar_cols)
This works reasonably fast, mostly because I don't have too many "full columns", i.e. columns which contain no null rows and I only go through all rows of these, using the fast summary("count") method to clasify as many columns as I can.
Going through all rows of a column seems incredibly wasteful to me, since once two distinct values are found, I don't really care what's in the rest of the column. I don't think this can be solved in pySpark though (but I am a beginner), this seems to require a UDF and pySpark UDFs are so slow that it is not likely to be faster than using countDistinct(). Still, as long as there are many columns with no null rows in a dataframe, this method will be pretty slow (and I am not sure how much one can trust approx_count_distinct() to differentiate between one or two distinct values in a column)
As far as I can say it beats the collect_set() approach and filling the null values is actually not necessary as I realized (see the comments in the code).
I tried your solution, and it was too slow in my situation, so I simply grabbed the first row of the data frame and checked for duplicates. This turned out to be far more performant. I'm sure there's a better way, but I don't know what it is!
first_row = df.limit(1).collect()[0]
drop_cols = [
key for key, value in df.select(
[
sqlf.count(
sqlf.when(sqlf.col(column) != first_row[column], column)
).alias(column)
for column in df.columns
]
).collect()[0].asDict().items()
if value == 0
]
df = df.drop(*[drop_cols])
I have two 2D numpy arrays shaped:
(19133L, 12L)
(248L, 6L)
In each case, the first 3 fields form an identifier.
I want to reduce the larger matrix so that it only contains rows with identifiers that also exist in the second matrix. So the shape should be (248L, 12L). How can I do this?
I would then like to sort it so that the arrays are indexed by the first value, second value and third value so that (3 3 4) comes after (3 3 5) etc. Is there a multi field sort function?
Edit:
I have tried pandas:
df1 = DataFrame(arr1.astype(str))
df2 = DataFrame(arr2.astype(str))
df1.set_index([0,1,2])
df2.set_index([0,1,2])
out = merge(df1,df2,how="inner")
print(out.shape)
But this results in (0,13) shape
Use pandas.
pandas.set_index() allows multiple keys. So set the index to the first three columns (use drop=False, inplace=True) to avoid needlessly mutating or copying your dataframe.
Then, merge(...how='inner') to intersect your dataframes.
In general, numpy runs out of steam very quickly for arbitrary dataframe manipulations; your default thing should be to try pandas. Also much more performant.