I have 2 dataframes that are similar in terms of the data they show (areas, regions, etc.), and I am interested in one particular variable, the Area variable.
For the 2 dataframes, namely a and b, I have checked which areas each one contains using a.Area.unique() and b.Area.unique(), and counted them with nunique().
However, they do not contain the same number of unique areas, and I need to identify which areas are missing from, or additional to, either dataframe. How can I check the two dataframes against each other to find the difference?
I hope this makes sense, thank you in advance!
We could help better if you updated your question with some sample data, but one potential answer is to use the merge method:
new_df = pd.merge(a, b, on='Area', how='outer')
This new_df gives you the union of the two dataframes based on the Area column.
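If the goal is specifically to see which side each area comes from, merge also accepts an indicator argument that tags every row as 'left_only', 'right_only', or 'both'; a minimal sketch (the drop_duplicates calls are only there to compare the Area values themselves):
import pandas as pd

# Outer merge on the Area values only, keeping the merge indicator column.
merged = pd.merge(a[['Area']].drop_duplicates(),
                  b[['Area']].drop_duplicates(),
                  on='Area', how='outer', indicator=True)

only_in_a = merged.loc[merged['_merge'] == 'left_only', 'Area']
only_in_b = merged.loc[merged['_merge'] == 'right_only', 'Area']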
Another solution could be:
a.loc[~a['Area'].isin(b['Area']), 'Area']
This gives you the areas in a that are not present in the b['Area'] series.
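The check in the other direction is symmetric, in case you also need the areas that exist only in b:
b.loc[~b['Area'].isin(a['Area']), 'Area']   # areas in b which are not in a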
You can convert the unique values to sets and use set operations to check the differences:
a_set = set(a.Area.unique())
b_set = set(b.Area.unique())
list(a_set - b_set)  # list of areas in a but not in b
list(b_set - a_set)  # list of areas in b but not in a
list(b_set & a_set)  # list of areas in both a and b (intersection)
list(b_set | a_set)  # list of all areas (union)
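As a quick illustration with two hypothetical frames (made-up area names, just to show the shape of the output):
import pandas as pd

a = pd.DataFrame({'Area': ['North', 'South', 'East']})
b = pd.DataFrame({'Area': ['South', 'East', 'West']})

list(set(a.Area.unique()) - set(b.Area.unique()))   # ['North']  -> only in a
list(set(b.Area.unique()) - set(a.Area.unique()))   # ['West']   -> only in b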
I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method but have also read in other posts that the apply method can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where assignments but received a warning that my DataFrame was fragmented and that I should use concat instead. So I am not sure which approach is most efficient or recommended overall.
Update - after receiving a comment and an answer pointing me towards use of np.where, I decided to test the approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. First approach was to use the concat + df.loc replace as shown in this post. Second approach was to use np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so I suppose now my decision comes down to faster np.where with potential fragmentation vs. slower use of concat without risk of fragmentation. Any further insight on this final update is appreciated.
import pandas as pd

df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])
#step 1: efficiently create starting custom columns using concat
df = pd.concat(
[
df,
(df["age"] > 1000).rename("custom1").astype(int),
(df["weight"] < 100).rename("custom2").astype(float),
],
axis=1,
)
#step 2: assign final values to custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = df['sample_id']
df.loc[df.custom2 == 1, 'custom2'] = df['weight'] / 2
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is with numpy.where:
import numpy as np
df['custom1'] = np.where(df.age.gt(1000), df.sample_id, 0)
df['custom2'] = np.where(df.weight.lt(100), df.weight / 2, 0)
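If the fragmentation warning from adding many columns one by one is a concern, one hedge is to compute all of the np.where results first and attach them in a single concat; a sketch along those lines (column names as in the question):
import numpy as np
import pandas as pd

# Build every new column up front, then attach them in one operation so the
# DataFrame's internal blocks are not reallocated once per new column.
new_cols = pd.DataFrame({
    'custom1': np.where(df['age'].gt(1000), df['sample_id'], 0),
    'custom2': np.where(df['weight'].lt(100), df['weight'] / 2, 0),
}, index=df.index)

df = pd.concat([df, new_cols], axis=1)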
I have a Pandas DataFrame like:
COURSE  BIB#  COURSE 1  COURSE 2  STRAIGHT-GLIDING     MEAN  PRESTASJON
1          2    20.220    22.535             19.91  21.3775    1.073707
0          1    21.235    23.345             20.69  22.2900    1.077332
This is from a pilot and the DataFrame may be much longer when we perform the real experiment. Now that I have calculated the performance for each BIB#, I want to allocate them into two different groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I am not sure exactly what I am looking for. I have looked in the documentation for the random module in Python, but that is not quite it. I have seen some questions/posts pointing to a scikit-learn stratification function, but I don't know if that is a good choice. Alternatively, is there a way to create a loop that accomplishes this? I appreciate your help.
Here is a figure to illustrate what I want to accomplish.
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want values for your groups:
df['group'] = np.where(df['PRESTASJON'] > threshold, 'A', 'B')
Here, 'A' will be assigned to the 'group' column if PRESTASJON exceeds the threshold, otherwise 'B'.
UPDATE: Per OP's update on the post, if you want to split them alternately into two groups:
#sort your dataframe based on the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
#create a new column with default value 'A' and assign every other row (odd positions) to 'B'
df1['group'] = 'A'
df1.iloc[1::2,-1] = 'B'
Are you splitting the dataframe into alternating rows? If so, you can do:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i, d in df1.groupby(np.arange(len(df1)) % 2):
    print(f'group {i}')
    print(d)
Another way without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) % 2
group1 = df1.loc[mask==0]
group2 = df1.loc[mask==1]
I want to extract the values from two different columns of a pandas dataframe and put them in a list with no duplicate values.
I have tried the following:
arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
column1 column2
1 adr1 adr2
2 adr1 adr2
3 adr3 adr4
4 adr4 adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows?
@ALollz gave a correct answer; I'll extend from there. To convert it into a list as expected, just use list(np.unique(df.values)).
You can use just np.unique(df) (maybe this is the shortest version).
Formally, the first parameter of np.unique should be an array_like object,
but as I checked, you can also pass a DataFrame directly.
Of course, if you want a plain list rather than an ndarray, write
np.unique(df).tolist().
Edit following your comment
If you want the list unique but in the order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
Operation order:
reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what its name says.
And the last step: tolist converts the result to a plain list.
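A shorter alternative that also preserves order of appearance (pd.unique keeps first occurrences and does not sort) would be:
# ravel() flattens the two columns row by row, so values are seen in reading
# order; pd.unique then keeps the first occurrence of each without sorting.
pd.unique(df[['column1', 'column2']].values.ravel()).tolist()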
I'm using sklearn's pairwise distance function, which saved my life when computing a huge matrix, but the problem I'm having is that I lose my indices.
Specifically, I initially have a huge dataframe of 17000 x 300, which I break down into 4 different dataframes based on some class condition.
The 4 separate dataframes keep the original indices, but after I run the pairwise distance function on one of them, it gives me back a 2D array with the correct values, but the indices have been reset to start from 0.
How do I keep or recover the original indices?
from sklearn.metrics import pairwise as pair

distance1 = pair.pairwise_distances(df1, metric='euclidean')
You can create a DataFrame with matching indices using the DataFrame constructor with the index parameter:
pd.DataFrame(distance1, index=df1.index)
Furthermore, if you would like to concatenate it horizontally to your existing DataFrame, you can use
pd.concat((df1, pd.DataFrame(distance1, index=df1.index)), axis=1)
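Since pairwise_distances on a single frame returns a square matrix whose rows and columns both correspond to the rows of df1, you can also label both axes with the original index; a small sketch:
# Reuse df1.index on both axes so rows and columns can be looked up by the
# original labels, e.g. dist_df.loc[label_a, label_b] (placeholder labels).
dist_df = pd.DataFrame(distance1, index=df1.index, columns=df1.index)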
All, I have an analytical CSV file with 190 columns and 902 rows. I need to recode values in several columns (18 to be exact) from their current 1-5 Likert scaling to 0-4 Likert scaling.
I've tried using replace:
df.replace({'Job_Performance1': {1:0, 2:1, 3:2, 4:3, 5:4}}, inplace=True)
But that throws a ValueError: "Replacement not allowed with overlapping keys and values"
I can use map:
df['job_perf1'] = df.Job_Performance1.map({1:0, 2:1, 3:2, 4:3, 5:4})
But I know there has to be a more efficient way to accomplish this, since this use case is standard in statistical analysis and statistical software, e.g. SPSS.
I've reviewed multiple questions on Stack Overflow but none of them quite fit my use case.
e.g. Pandas - replacing column values, pandas replace multiple values one column, Python pandas: replace values multiple columns matching multiple columns from another dataframe
Suggestions?
You can simply subtract a scalar value from your column, which is in effect what you're doing here:
df['job_perf1'] = df['Job_Performance1'] - 1
Also, as you need to do this on 18 columns, I'd construct a list of the 18 column names and just subtract 1 from all of them at once:
df[col_list] = df[col_list] - 1
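For example, assuming the 18 Likert columns can be selected by a shared name prefix (the 'Job_Performance' prefix here is just a guess based on the question), the whole recode is one vectorized subtraction:
# Hypothetical: gather the Likert-scaled columns by prefix, then shift all of
# them from the 1-5 scale to the 0-4 scale in one step.
col_list = [c for c in df.columns if c.startswith('Job_Performance')]
df[col_list] = df[col_list] - 1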
No need for a mapping. This can be done as a vectorized subtraction, since effectively what you're doing is subtracting 1 from each value. This works elegantly:
import numpy
df['job_perf1'] = df['Job_Performance1'] - numpy.ones(len(df['Job_Performance1']))
Or, without numpy:
df['job_perf1'] = df['Job_Performance1'] - [1] * len(df['Job_Performance1'])