I have a DataFrame ('main') that has about 300 columns. I created a smaller DataFrame ('public') from it and have been working with that.
I now want to delete the columns contained within 'public' from the larger DataFrame ('main').
I've tried the following instructions:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html
Python Pandas - Deleting multiple series from a data frame in one command
without any success, along with various other statements that have been unsuccessful.
The columns that make up 'public' are not consecutive - i.e. they are taken from various points in the larger DataFrame 'main'. All of the columns have the same Index. [Not sure if this is important, but 'public' was created using the 'join' function].
Yes, I'm being lazy - I don't want to have to type out the names of every column! I'm hoping there's a way to use the DataFrame 'public' in a statement that will allow deletion of these columns en masse. If anyone has any suggestions and/or guidance I'd be most grateful.
(Have Python 2.7 and am using Pandas, numpy, math, pylab etc.)
Thanks in advance.
Ignore my question - Murphy's Law prevails and I've just solved it.
I was using the statement from the stackoverflow question linked above:
df.drop(df.columns[1:], axis=1)
and this was not working. I have instead used
df = df.drop(df2, axis=1)
and this worked (df = main, df2 = public). Simple really once you don't overthink it.
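For anyone reading later, here is a slightly more explicit spelling of the same idea as a self-contained sketch (the toy column names are made up for illustration):

import pandas as pd

main = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})
public = main[['b', 'd']]

# Passing a DataFrame to drop() works because iterating over a DataFrame
# yields its column labels; drop(columns=...) says the same thing explicitly.
main = main.drop(columns=public.columns)
print(main.columns.tolist())  # ['a', 'c']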
Related
I have two dataframes and have a code to extract some data from one of the dataframes and add to the other dataframe:
import pandas as pd

sales = pd.read_excel("data.xlsx", sheet_name='sales', header=0)
born = pd.read_excel("data.xlsx", sheet_name='born', header=0)

bornuni = born.number.unique()
for babies in bornuni:
    # all birth records for this baby id
    dataframe = born[born["id"] == babies]
    for i, r in sales.iterrows():
        if r["number"] == babies:
            sales.loc[i, 'ini_weight'] = dataframe["weight"].iloc[0]
            sales.loc[i, 'ini_date'] = dataframe["date of birth"].iloc[0]
        else:
            pass
This is pretty inefficient with bigger data sets, so I want to parallelize this code, but I don't have a clue how to do it. Any help would be great. Here is a link to a mock dataset.
So before worrying about parallelizing, I can't help but notice that you're using lots of for loops to deal with the dataframes. Dataframes are pretty fast when you use their vectorized capabilities.
I see a lot of inefficient use of pandas here, so maybe we first fix that and then worry about throwing more CPU cores at it.
It seems to me you want to accomplish the following:
For each unique baby id number in the born dataframe, you want to update the ini_weight and ini_date fields of the corresponding entry in the sales dataframe.
There's a good chance that you can use some dataframe merging / joining to help you with that, as well as using the pivot table functionality:
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
I strongly suggest you take a look at those, try using the ideas from these articles, and then reframe your question in terms of these operations, because as you correctly notice, looping over all the rows repeatedly to find the row with some matching index is very inefficient.
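To make that concrete, here is a rough sketch of what the loops could collapse into with a single merge (untested against your real data; the column names 'number', 'weight' and 'date of birth' are taken from your snippet):

# One row per baby id, keeping the first birth record for each,
# mirroring the .iloc[0] lookups in the original loop.
first_born = (
    born.drop_duplicates(subset='number', keep='first')
        .loc[:, ['number', 'weight', 'date of birth']]
        .rename(columns={'weight': 'ini_weight', 'date of birth': 'ini_date'})
)

# A left merge attaches ini_weight / ini_date to every matching sales
# row in one vectorized operation, replacing both loops.
sales = sales.merge(first_born, on='number', how='left')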
I have a hard time formulating this problem in abstract terms, so I will mostly try to explain it with examples.
I have 2 pandas dataframes (I get them from a sqlite DB).
First DF ('captures'): one row per capture, keyed by an id column.
Second DF ('images'): one row per image, with image (a file path), capture_id and camera_id columns.
So the thing is: There are several images per "capture". I would like to add the images to the capture df as columns, so that each capture has 9 image columns, each with a path. There are always 9 images per capture.
I solved it in pandas with what I know in the following way:
cam_idxs = sorted(list(range(9)) * 2)
for cam_idx in cam_idxs:
    sub_df = images.loc[images["camera_id"] == cam_idx]
    captures = captures.merge(sub_df[["image", "capture_id"]],
                              left_on="id", right_on="capture_id")
I imagine, though, that there must be a better way. People probably stumble into this problem fairly often when getting data from a SQL database.
Since I am getting the data into pandas from a sql database, I am also open to SQL commands that get me this result. I'd also be grateful if someone could tell me what this kind of operation is called; I did not find a good way to google for it, which is why I am asking here. Excuse me if this question was already asked somewhere, I did not find anything with my search terms.
So the question at the end is: Is there a better way to do this, especially a more efficient way to do this?
What you are looking for is the pivot table.
You just need to create a column containing the index of each image within its capture_id, which you will use as the columns of the pivot table.
For example this could be :
images['column_pivot'] = list(range(1, 10)) * (len(images) // 9)
In your case 'column_pivot' would be [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,...,7,8,9] (i.e. cycling from 1 to 9). This assumes the rows of images are grouped by capture_id, nine at a time.
Then you pivot:
pd.pivot_table(images, columns='column_pivot', index='capture_id', values='image', aggfunc='first')
The aggfunc='first' matters here: pivot_table aggregates with the mean by default, which does not work on string values such as paths. This will give the expected result.
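A tiny self-contained illustration of the idea (toy paths, three cameras instead of nine to keep it short; since camera_id already numbers the images within each capture, it can serve as the pivot column directly):

import pandas as pd

images = pd.DataFrame({
    'capture_id': [1, 1, 1, 2, 2, 2],
    'camera_id': [0, 1, 2, 0, 1, 2],
    'image': ['c1_cam0.png', 'c1_cam1.png', 'c1_cam2.png',
              'c2_cam0.png', 'c2_cam1.png', 'c2_cam2.png'],
})

# pivot() works here because each (capture_id, camera_id) pair is unique.
wide = images.pivot(index='capture_id', columns='camera_id', values='image')
print(wide)  # one row per capture, one image column per camera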
This problem has been solved (I think). Excel was the problem and not python after all. The below code works for my needs and doesn't seem to be dropping rows.
Rows highlighted in yellow are the rows I want to select in DF1. The selection should be made based on the values in column_2 of DF1 that match the values of column_1 of DF2.
Here was my preferred solution using the Pandas package in python after a lot of trial and error/searching:
NEW_MATCHED_DF1 = DF1.loc[DF1['column_2'].isin(DF2['column_1'])]
The problem I was seeing is that when I compared my results to what happens in excel when I do the same thing, I got almost double the results, and I thought that my python technique was dropping duplicates. Of course, it was possible that I was doing something wrong in excel, or that excel was incorrect for some other reason, but this is something I have verified in the past, and I am much more familiar with excel, so I suspected it was more likely that I was doing something wrong in python. EXCEL IS THE PROBLEM AFTER ALL!! :/
Ultimately, I would like to use python to select any and all rows in DF1 where column_2 of DF1 matches column_1 of DF2. Excel is absurdly slow and I would like to move away from using excel for manipulating large dataframes.
I appreciate any help or direction. I really haven't been able to figure out if my code is in fact dropping duplicates and/or if there is another solution that I can be confident won't do this.
Try this using np.where:
import numpy as np
list_df2 = df2['column_1'].unique().tolist()  # unique keys from df2
df1['matching_rows'] = np.where(df1['column_2'].isin(list_df2), 'Match', 'No Match')
And then create a new dataframe with the matches:
matched_df = df1[df1['matching_rows']=='Match']
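If you don't need the 'Match'/'No Match' labels afterwards, a boolean mask does the same thing in one step (same assumed column names as above):

matched_df = df1[df1['column_2'].isin(df2['column_1'])]

Note that isin() keeps every matching row of df1, duplicates included, so nothing is silently dropped.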
I currently have a Python script which converts two pandas DataFrames to tables in a SQLite database in memory, before reading and running SQL code on the tables. I would like the script to be more "Pythonic", merging and/or joining the DataFrames directly, but am having a difficult time finding Python code examples for the equivalent of SELECTing specific, and not all, elements from both tables, along with FROM, WHERE and ORDER BY clauses. I am fairly new to Python and am the Guinea Pig of my department, so if I can get this working, it will become a template for MANY more scripts from my partners in my work group. Actual element names have been changed due to proprietary information, but the structure is the same. Thanks in advance for the help!
SELECT
dfE.Element05 AS [Alt Element05],
dfE.Element03 AS [Alt Element03],
dfE.Element04 AS [Alt Element04],
dfN.Element03,
dfN.Element04,
dfN.Element08,
dfN.Element09,
dfN.Element10,
dfN.Element17,
dfN.Element18,
dfN.Element19,
dfN.Element20,
dfN.Element23,
dfN.Element26,
dfN.Element13
FROM dfE INNER JOIN dfN ON (dfE.Element17 = dfN.Element17) AND (dfE.Element20 = dfN.Element20)
WHERE (((dfN.Element03)<>dfE.Element03))
GROUP BY
dfE.Element05,
dfE.Element03,
dfE.Element04,
dfN.Element03,
dfN.Element04,
dfN.Element08,
dfN.Element09,
dfN.Element10,
dfN.Element17,
dfN.Element18,
dfN.Element19,
dfN.Element20,
dfN.Element23,
dfN.Element26,
dfN.Element13
ORDER BY
dfE.Element03,
dfN.Element03,
dfN.Element08
I would start by copying the DataFrames that you want to join and selecting the specific columns there. I have included "Element17" & "Element20" from dfE because you need them when joining.
Ex.
df1 = dfE[['Element05','Element03','Element04','Element17','Element20']].copy()
In order to rename the columns use the following:
df1.rename(columns={'Element05':'Alt Element05','Element03':'Alt Element03','Element04':'Alt Element04'},inplace=True)
Once you have the other df set up (let's name it df2) you would use pd.merge() to join them as you would in SQL. *When using pd.merge with on=, the columns you are joining on have to have the same name in both frames or it won't work. (Let's say df1['A'] shares the same data as df2['B'] and you want to join the DataFrames. You would have to rename the column on one of the DataFrames so that it matches the column name of the df you are joining to.)
Ex.
df3 = pd.merge(df1, df2, how='inner', on=['Element17','Element20'])
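As an aside, pd.merge can also join differently-named columns directly via left_on/right_on, which skips the rename (using the hypothetical A/B columns from the note above):

df3 = pd.merge(df1, df2, how='inner', left_on='A', right_on='B')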
For the Where I would do the following (Element03 from dfN keeps its original name after the merge, while dfE's copy was renamed to 'Alt Element03'):
df3 = df3[df3['Alt Element03'] != df3['Element03']]
For Order By you could use .sort_values() (the older .sort() has been removed from pandas), but I'm not comfortable with giving you advice on how to use it as I haven't used it much; see the sketch below.
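For what it's worth, here is a minimal sketch of the remaining clauses, assuming the df3 built above. Since the query's GROUP BY lists every selected column and uses no aggregate functions, it acts like SELECT DISTINCT, which pandas spells drop_duplicates():

# GROUP BY with no aggregates == SELECT DISTINCT
df3 = df3.drop_duplicates()
# ORDER BY dfE.Element03, dfN.Element03, dfN.Element08
df3 = df3.sort_values(['Alt Element03', 'Element03', 'Element08'])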
I hope this helps! Let me know if you have questions.
*** This might not be the best way to do things. I apologize beforehand if I'm leading you to develop bad habits. I'm new to python as well!
When converting a Pandas dataframe to an H2O frame using the h2o.H2OFrame() function, an error is occurring.
Additional rows are being created in the H2O frame. When I looked into this, it appears the new rows are duplicates of other rows. Depending on the data size the number of duplicate rows added varies, but it is typically around 2-10.
Code:
train_h2o = h2o.H2OFrame(python_obj=train_df_complete)
print(train_df_complete.shape[0])
print(train_h2o.nrow)
Output:
3871998
3872000
As you can see here, 2 additional rows have been added. When studied closer, there are now 2 rows per user for 2 of the users, i.e. 2 rows have been duplicated.
This appears to be a major bug, does anyone have experience of this problem and is there a way to fix it?
Thanks
I had the same issue. Assuming your original data does not contain legitimate duplicates, just identify the index of the duplicate rows in the dataframe and remove them. Unfortunately, the h2o frame has limited functionality.
temp_df = train_h2o.as_data_frame()  # pandas copy, used only to locate the duplicated rows
train_h2o = train_h2o.drop(list(temp_df[temp_df.duplicated()].index), axis=0)  # axis=0 drops rows by index
In case your dataset can contain other duplicate rows that do not come from this H2O bug, the proposed solution will drop those rows as well. If you want to make sure that you remove only the additional rows added by H2O, this solution might help you out:
import numpy as np
import h2o

temp_df = train_df_complete.copy()
temp_df['__temp_id__'] = np.arange(len(temp_df))  # unique id per original row
train_h2o = h2o.H2OFrame(temp_df)
# any repeated id must be a duplicate introduced by the conversion
train_h2o = train_h2o.drop_duplicates(columns=['__temp_id__'], keep='first')
train_h2o = train_h2o.drop('__temp_id__', axis=1)
What I'm doing here is creating a temporary column that I then use as an ID in order to drop only the duplicates that have been generated by H2OFrame. Once the duplicates have been removed I drop the temporary column. It might not be the most elegant way, but it works.
I had the same issue with a specific dataset.
Resetting the index on the base data frame worked for me.
import h2o
train_df_complete = train_df_complete.reset_index(drop=True)  # drop=True keeps the old index from being added as a column
train_h2o = h2o.H2OFrame(train_df_complete)
I am using h2o 3.30.1.3.
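Whichever workaround you use, a quick sanity check after the conversion catches this bug early (assumes an h2o cluster is already running via h2o.init()):

assert train_h2o.nrow == train_df_complete.shape[0], 'H2OFrame row count differs from the pandas source'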