Equality in python dataframe

I'm performing some operations on a df of 4000 columns and 17520 rows. I have to repeat these operations 100 times, each time with 5 different randomly selected columns from the df. I am using the following loop:
for i in range(0, 100):
    rand_cols = np.random.permutation(df.columns)[0:5]
    df2 = df[rand_cols]
    df2.loc[:, :] *= 2
My question is the following:
Does the operation on df2, which holds the 5 random columns of df, affect those columns in the original df?
Thanks

No, it doesn't. As Valentino suggested in the comments, if you try it with a dummy DataFrame, you can see that the original doesn't change:
df = pd.DataFrame({'c': range(50)})
df2 = df.loc[df['c'] % 2 == 0, :]
df2 *= 10
If you look at df, you can see it didn't change. The reason is that this kind of selection returns a copy of the data rather than a reference back into df, so modifying df2 leaves df untouched.
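If the doubled values are actually meant to end up in the original df, one option (a minimal sketch with dummy data; the shape and column names here are made up) is to assign through df itself rather than through the copy:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 8)), columns=list('abcdefgh'))
original = df.copy()

rand_cols = np.random.permutation(df.columns)[0:5]

# Selecting columns gives a copy, so doubling df2 does not touch df.
df2 = df[rand_cols] * 2
print(df.equals(original))   # True: df is unchanged

# To double those columns in the original frame, assign through df directly.
df.loc[:, rand_cols] *= 2
print(df.equals(original))   # False: df has changed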


Pandas: How to Squash Multiple Rows into One Row with More Columns

I'm looking for a way to convert 5 rows in a pandas dataframe into one row with 5 times the amount of columns (so I have the same information, just squashed into one row). Let me explain:
I'm working with hockey game statistics. Currently, there are 5 rows representing the same game in different situations, each with 111 columns. I want to convert these 5 rows into one row (so that one game is represented by one row) but keep the information contained in the different situations. In other words, I want to convert 5 rows, each with 111 columns into one row with 554 columns (554=111*5 minus one since we're joining on gameId).
Here is my DF head:
So, as an example, we can see the first 5 rows have gameId = 2008020001, but each have a different situation (i.e. other, all, 5on5, 4on5, and 5on4). I'd like these 5 rows to be converted into one row with gameId = 2008020001, and with columns labelled according to their situation.
For example, I want columns for all unblockedShotAttemptsAgainst, 5on5 unblockedShotAttemptsAgainst, 5on4 unblockedShotAttemptsAgainst, 4on5 unblockedShotAttemptsAgainst, and other unblockedShotAttemptsAgainst (and the same for every other stat).
Any info would be greatly appreciated. It's also worth mentioning that my dataset is fairly large (177990 rows), so an efficient solution is desired. The resulting dataframe should have one-fifth the rows and 5 times the columns. Thanks in advance!
---- What I've Tried Already ----
I tried to do this using df.apply() and some nested for loops, but it got very ugly very quickly and was incredibly slow. I think pandas has a better way of doing this, but I'm not sure how.
Looking at other SO answers, I initially thought it might have something to do with df.pivot() or df.groupby(), but I couldn't figure it out. Thanks again!
It sounds like what you are looking for is pd.get_dummies()
cols = df.columns

# get dummies for the situation column
df1 = pd.get_dummies(df, columns=['situation'])

# drop all the original columns, keeping only the new dummy columns
# (errors='ignore' because 'situation' itself no longer exists after get_dummies)
df1 = df1.drop(columns=cols, errors='ignore')

# add the dummy columns back onto the original df
df = pd.concat([df, df1], axis=1)

# collapse duplicate rows
df = df.groupby(cols.tolist()).first()
For the last line you can also use df.drop_duplicates(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
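Since the question already mentions df.pivot(), a pivot-based sketch may be closer to the one-row-per-game shape being asked for. This assumes columns literally named gameId and situation, with the remaining columns being per-situation stats (the sample values below are made up):

import pandas as pd

# tiny stand-in for the real 177990-row frame
df = pd.DataFrame({
    'gameId': [2008020001] * 5,
    'situation': ['other', 'all', '5on5', '4on5', '5on4'],
    'unblockedShotAttemptsAgainst': [3, 55, 40, 7, 5],
})

# one row per gameId, one column per (stat, situation) pair
wide = df.pivot(index='gameId', columns='situation')

# flatten the MultiIndex columns into "situation stat" labels
wide.columns = [f'{situation} {stat}' for stat, situation in wide.columns]
wide = wide.reset_index()
print(wide.head())

This should scale to all 111 stat columns without an explicit loop, but treat it as a sketch rather than a drop-in replacement for the get_dummies approach above.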

How to merge a big dataframe with small dataframe?

I have a big dataframe with 100 rows and the structure [qtr_dates<datetime.date>, sales<float>], and a small dataframe with the same structure but fewer than 100 rows. I want to merge these two dfs such that the merged df has all the rows from the small df, with the remaining rows taken from the big df.
Right now I am doing this
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='outer')
But this is creating a df with duplicate qtr_dates.
Use concat and remove the duplicates with DataFrame.drop_duplicates:
pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
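A quick check with made-up values (the dates and sales figures here are just placeholders) shows why small_df goes first: drop_duplicates keeps the first occurrence, so overlapping quarters take their value from the small frame:

import pandas as pd

big_df = pd.DataFrame({'qtr_dates': pd.to_datetime(['2020-03-31', '2020-06-30', '2020-09-30']),
                       'sales': [100.0, 200.0, 300.0]})
small_df = pd.DataFrame({'qtr_dates': pd.to_datetime(['2020-06-30']),
                         'sales': [250.0]})

# small_df first: its rows win wherever qtr_dates overlaps with big_df
merged = (pd.concat([small_df, big_df], ignore_index=True)
            .drop_duplicates(subset=['qtr_dates'])
            .sort_values('qtr_dates'))
print(merged)   # 2020-06-30 keeps the small_df value of 250.0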
If I understand correctly, you want everything from the bigger dataframe, but if that date is present in the smaller data frame you would want it replaced by the relevant value from the smaller one?
Hence I think you want to do this:
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='left', indicator=True)
df = df[df['_merge'] != 'both'].drop(columns='_merge')
df_out = pd.concat([df, small_df], ignore_index=True)
This would remove any rows from the big_df which exist in the small_df in the 2nd step, before then adding the small_df rows by concatenating rather than merging.
If you had more column names that weren't involved with the join you'd have to do some column renaming/dropping though I think.
Hope that's right.
Maybe try join instead of merge:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
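A rough sketch of the join route (the column names and values here are assumptions that just mirror the structure described in the question) sets qtr_dates as the index first and then prefers the small frame's value wherever one exists:

import pandas as pd

big_df = pd.DataFrame({'qtr_dates': pd.to_datetime(['2020-03-31', '2020-06-30']),
                       'sales': [100.0, 200.0]})
small_df = pd.DataFrame({'qtr_dates': pd.to_datetime(['2020-06-30']),
                         'sales': [250.0]})

joined = (big_df.set_index('qtr_dates')
                .join(small_df.set_index('qtr_dates'), how='outer', rsuffix='_small'))

# prefer the small frame's sales wherever it provided a value
joined['sales'] = joined['sales_small'].fillna(joined['sales'])
result = joined.drop(columns='sales_small').reset_index()
print(result)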

Efficiently split pandas dataframe based on combinations of one column values

Let's say I have a dataframe with one column that has 3 unique values:
import pandas as pd
df = pd.DataFrame(['a', 'b', 'c'], columns=['string'])
df
I want to split this dataframe into smaller dataframes such that each one contains 2 unique values. In the above case I need 3 dataframes, 3C2 (nCr) = 3: df1 - [a b], df2 - [a c], df3 - [b c]. My current implementation is below.
import itertools

for i in itertools.combinations(df.string.values, 2):
    print(df[df.string.isin(i)], '\n')
I am looking for something like groupby in pandas, because subsetting the data inside the loop is time-consuming. In one sample case I have 609 unique values and the loop takes around 3 minutes to complete. So I am looking for an optimized way to perform the same operation, as the number of unique values may shoot up into the thousands in real scenarios.
It will be slow because you're creating 370k dataframes. If each of them is only supposed to hold two values, does it really need to be a dataframe?
df = pd.DataFrame({'x': range(100)})
df['key'] = 1

# cross join: pair every row with every other row, then turn each pair into a dict
records = df.merge(df, on='key').drop('key', axis=1).to_dict('records')

# building a pandas object per record is the expensive part
[pd.Series(x) for x in records]
You will see that records is computed quite quickly, but it then takes a few minutes to generate all of these Series objects.
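Since the question asks for something groupby-like, one possible middle ground (a sketch only; it still builds one frame per pair, so the object-creation cost described above doesn't go away) is to split the frame once with groupby and then just concatenate the pre-built pieces for each combination:

import itertools
import pandas as pd

df = pd.DataFrame(['a', 'b', 'c'], columns=['string'])

# split the frame once so the loop never filters the full frame again
groups = {key: sub for key, sub in df.groupby('string')}

pairs = {
    (x, y): pd.concat([groups[x], groups[y]], ignore_index=True)
    for x, y in itertools.combinations(groups, 2)
}
print(pairs[('a', 'b')])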

Compare columns from two different dataframes pandas

I am querying AD for a list of machines. I filter this list with pandas by last logon date. When I am done with this data, I have one column in a dataframe.
I have another report with a list of machines on which a product we use is installed. I clean this data and am left with the devices I want to compare against the AD data, which is again just one column in a dataframe.
I have also tried comparing list to list, but I am not sure of the best method. I tried a merge, but my guess is that it compares DF1 row 1 to DF2 row 1.
DF1 = comp1,comp2,comp3,comp5
DF2 = comp1,comp2,comp3
How would I check each row in DF1 to see whether its value exists anywhere in DF2, returning true or false? I am trying to figure out which machines in DF1 don't exist in DF2.
DataFrame.isin
This is a simple check to see whether one set of values is contained in another. You can do this in a multitude of ways; this is probably one of the simplest.
I'm providing some dummy data here, but please check out How to make good reproducible pandas examples.
import pandas as pd

machines = ['A', 'B', 'C']
machines_to_check = ['A', 'B']

df = pd.DataFrame({'AD': machines})
df2 = pd.DataFrame({'AD': machines_to_check})
Now, if we want to find the machines that exist in df but not in df2, we can use ~, which inverts the .isin result.
non_matching_machines = df.loc[~df['AD'].isin(df2['AD'])]
print(non_matching_machines)
AD
2 C

Setting a pandas index or transposing

I imported a table with 30 columns of data and pandas automatically generated an index for the rows from 0-232. I went to make a new dataframe with only 5 of the columns, using the below code:
df = pd.DataFrame(data=[data['Age'], data['FG'], data['FGA'], data['3P'], data['3PA']])
When I viewed the df the rows and columns had been transposed, so that the index made 232 columns and there were 5 rows. How can I set the index vertically, or transpose the dataframe?
The correct approach is actually much simpler. You just need to pull out the columns simultaneously with a list of column names:
df = data[['Age', 'FG', 'FGA', '3P', '3PA']]
Paul's response is the preferred way to perform this operation. But as you suggest, you could alternatively transpose the DataFrame after reading it in:
df = df.T
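For reference, a small sketch of why the original constructor call came out transposed (the column names follow the question; the values are made up): passing a list of Series to pd.DataFrame treats each Series as a row.

import pandas as pd

data = pd.DataFrame({'Age': [22, 31], 'FG': [5, 7], 'FGA': [11, 15],
                     '3P': [1, 2], '3PA': [4, 6]})

# each Series becomes a row, so this comes out as 5 rows x 2 columns
transposed = pd.DataFrame(data=[data['Age'], data['FG'], data['FGA'],
                                data['3P'], data['3PA']])

# either transpose it back...
df = transposed.T

# ...or select the columns directly and avoid the issue entirely
df = data[['Age', 'FG', 'FGA', '3P', '3PA']]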
