I am currently using Python 2.7. I have three columns in an Excel document, each containing different integer values. The number of values can vary, ranging from ten to thousands. Basically, what I am looking to do is scan through column one and check whether each value appears in columns two and three. Similarly, I will then do the same with column two, checking whether any of its values appear in columns one and three, and so on.
My thinking is to read the contents of each column into its own list, then iterate over list 1 (column 1) and, for each value, check whether it exists in list 2 (column 2).
My question is: is this the most efficient way of running this comparison? As said, the same number should appear in each of the three columns (possibly more than once), and I'm looking to identify those numbers which appear in all three columns.
Thanks
What about using set intersection?
set(column_1_vals) & set(column_2_vals) & set(column_3_vals)
That will give you those values which appear in all three columns.
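As a minimal sketch (the three lists here are made-up values standing in for whatever you read out of the Excel sheet):

```python
# Hypothetical column contents; in practice these would be read
# from the spreadsheet (e.g. with xlrd or openpyxl).
column_1_vals = [1, 2, 3, 4, 5]
column_2_vals = [3, 4, 5, 6]
column_3_vals = [5, 4, 9]

# Intersecting the three sets keeps only values present in every column.
common = set(column_1_vals) & set(column_2_vals) & set(column_3_vals)
print(common)  # -> {4, 5}
```

Converting to sets also collapses duplicates, which is fine here since you only care whether a number appears in each column, not how often.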
I have a dataframe containing duplicates, which are flagged by a specific variable. The df looks like this:
(screenshot of the dataframe omitted)
The idea is that each row to keep and its duplicates are stacked in batches (a pair, or more if there are many duplicates) and identified by the "duplicate" column. For each batch, I would like to keep one row based on a single condition: it has to be the row with the smallest number of empty cells. For Alice, for instance, it should be the second row (and not the one flagged "keep").
The difficulty also lies in the fact that I cannot group by the "name", "lastname" or "phone" columns, because they are not always filled (the duplicates are computed on these three concatenated columns by an ML algorithm).
Unlike the already-posted questions I've seen (how do I remove rows with duplicate values of columns in pandas data frame?), here the condition for selecting the row to keep is not fixed (like keeping the first or last row within the batch of duplicates) but depends on how complete the rows in each batch are.
How can I parse the dataframe according to this "duplicate" column and, within each batch, extract the row I want?
I tried to assign a unique label to each batch, in order to iterate over these labels, but it failed.
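One way to get those per-batch labels without iterating is a cumulative sum over the "keep" flag. A minimal sketch, assuming each batch starts with the row flagged "keep" and the column names are as in the question (the data itself is made up):

```python
import pandas as pd

# Hypothetical data: two batches, each a "keep" row followed by its duplicate.
df = pd.DataFrame({
    'name':      ['Alice', 'Alice', 'Bob',  None],
    'lastname':  [None,    'Smith', None,   'Jones'],
    'phone':     [None,    '123',   '456',  None],
    'duplicate': ['keep',  'dup',   'keep', 'dup'],
})

# A new batch begins at every "keep" row; a cumulative sum of that
# boolean gives each batch a unique integer label to group on.
batch = (df['duplicate'] == 'keep').cumsum()

# Within each batch, pick the row with the fewest empty cells.
keep_idx = df.isna().sum(axis=1).groupby(batch).idxmin()
best = df.loc[keep_idx]
```

Here `best` keeps row 1 for the Alice batch (it has no empty cells, unlike the row flagged "keep") and row 2 for the Bob batch, matching the condition described. Ties within a batch fall back to the first row with the minimum count.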
I have a dataframe which is similar to this
d1 = pd.DataFrame({'name': ['xyz', 'abc', 'dfg'],
                   'age': [15, 34, 22],
                   'sex': ['s1', 's2', 's3'],
                   'w-1(6)': [96, 66, 74],
                   'w-2(5)': [55, 86, 99],
                   'w-3(4)': [11, 66, 44]})
Note that in my original DataFrame the week columns are generated dynamically, i.e. the columns w-1(6), w-2(5) and w-3(4) change every week. I want to sort the three week columns in descending order of their values.
But the column names cannot be hard-coded, as they change every week.
Is there any possible way to achieve this?
Edit: the numbers might not always be present for all three weeks. If W-1 has no data, that column won't be in the dataset at all, leaving only two week columns instead of three.
You can use the column positions instead of the names. Since everything after the first three fixed columns is a week column, slicing d1.columns also handles a varying number of weeks:
d1.sort_values(by=list(d1.columns[3:]), ascending=False)
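A quick sketch using the sample frame from the question (the `[3:]` slice assumes the first three columns are always name, age and sex, which is how the example is laid out):

```python
import pandas as pd

d1 = pd.DataFrame({'name': ['xyz', 'abc', 'dfg'],
                   'age': [15, 34, 22],
                   'sex': ['s1', 's2', 's3'],
                   'w-1(6)': [96, 66, 74],
                   'w-2(5)': [55, 86, 99],
                   'w-3(4)': [11, 66, 44]})

# All columns after the first three fixed ones are week columns,
# however many there happen to be this week.
week_cols = list(d1.columns[3:])
result = d1.sort_values(by=week_cols, ascending=False)
print(result.index.tolist())  # -> [0, 2, 1]
```

Because `week_cols` is built from whatever columns are present, the same line works whether there are two week columns or three.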
Suppose I have a dataframe with an index column filled with strings. Now, suppose I have very similar but somewhat different strings that I want to use to look up rows within the dataframe. How would I do this since they aren't identical? My guess would be to simply choose the row with the lowest distance between the two strings, but I'm not really sure how I could do that efficiently.
For example, if my dataframe is:
and I want to lookup "Lord of the rings", I should get the 2nd row. How would I do this in pandas?
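One sketch of the "lowest distance" idea uses `difflib` from the standard library, which scores string similarity with `SequenceMatcher`. The dataframe and its index values here are made up, since the original example table wasn't included:

```python
import difflib
import pandas as pd

# Hypothetical dataframe indexed by title strings.
df = pd.DataFrame(
    {'year': [1937, 1954, 1997]},
    index=['The Hobbit', 'The Lord of the Rings', 'Harry Potter'],
)

def fuzzy_loc(df, query, cutoff=0.4):
    """Return the row whose index label is most similar to `query`."""
    matches = difflib.get_close_matches(query, list(df.index), n=1, cutoff=cutoff)
    if not matches:
        raise KeyError(query)
    return df.loc[matches[0]]

row = fuzzy_loc(df, 'Lord of the rings')
```

This scans the whole index per lookup, so it is O(n) per query; for thousands of lookups against a large index, a dedicated fuzzy-matching library with indexing support would scale better.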
I have excel data file with thousands of rows and columns.
I am using python and have started using pandas dataframes to analyze data.
What I want to do in column D is to calculate annual change for values in column C for each year for each ID.
I can use Excel to do this: if the org ID is the same as that in the prior row, calculate the annual change (leaving the cells highlighted in blue blank, because that's the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted:
df.groupby('ID').Cash.pct_change()
However, if things are sorted you can speed this up, because it isn't necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
df.ID != df.ID.shift()
)
These produce the column values you are looking for. To add the column, assign it to the dataframe or create a new dataframe with the new column:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
I have a pandas dataframe column, as shown in the figure below. Only two values, Increase and Decrease, occur randomly in the column. Is there a way to process that data?
For this particular problem, I want to find the first occurrence of two consecutive "Increase" values AFTER at least one occurrence of two consecutive "Decrease" values (there may be more consecutive occurrences; two is the minimum).
As an example, if the series is (I for "Increase", D for "Decrease"): "I,I,I,I,D,I,I,D,I,D,I,D,D,D,D,I,D,I,D,D,I,I,I,I", it should return the index of row 21 (the third-last I in the series). Assume the example series is a vertical pandas column rather than a horizontal one, with indexing starting at 0, so the first I is row 0.
In my actual data, it should return 2009q4, which is the index of that particular row.
If somebody can show me a way to do common tasks on this type of data, such as counting consecutive occurrences of a given value, detecting a value change, or getting the value at a particular position after a value change (which may not be required for this problem, but could be useful for future ones), I would be really grateful.
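One sketch using shift-based comparisons, applied to the example series above (with a plain 0-based index standing in for the real 2009q4-style index):

```python
import pandas as pd

# The example series from the question (I = Increase, D = Decrease).
s = pd.Series(list("IIIIDIIDIDIDDDDIDIDDIIII"))

# True at any position that completes a pair of consecutive equal values.
second_inc = (s == 'I') & (s.shift() == 'I')
second_dec = (s == 'D') & (s.shift() == 'D')

# Index where the first "D,D" pair completes, then the first "I,I"
# pair completing after it.
first_dd = second_dec.idxmax()
candidates = second_inc[second_inc].index
result = candidates[candidates > first_dd][0]
print(result)  # -> 21

# Bonus: length of the current run at every position. Comparing each
# value to its predecessor marks run starts; cumsum labels the runs;
# cumcount numbers positions within each run.
run_len = s.groupby((s != s.shift()).cumsum()).cumcount() + 1
```

`first_dd` lands on index 12 (the second D of the first D,D pair), and the first I,I pair after that completes at index 21, matching the expected answer. The `run_len` trick also answers the general "count consecutive occurrences" question: for instance it reads 4 at index 14, the end of the D,D,D,D run.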