I want to check how many values are lower than 2500
1) Using .count()
df[df.price<2500]["price"].count()
2) Using .value_counts()
df[df.price<2500]["price"].value_counts()
This is the code view.
The first one returns 27540 and the second 2050. Which one is the correct count?
Definitely not 2050; look at your histogram.
The value_counts method assigns only one row per distinct value and records how many times that value occurs. So there are 2050 different prices below 2500, but once you count the duplicates there are far more rows, which is what .count() reports.
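A minimal sketch of the difference with made-up prices (the column name price and the 2500 cutoff are taken from the question):

import pandas as pd

# Toy data: four rows are below 2500, but only two distinct prices are.
df = pd.DataFrame({"price": [100, 100, 100, 200, 5000]})

cheap = df[df.price < 2500]["price"]

print(cheap.count())               # 4 -> number of rows below 2500
print(cheap.value_counts())        # one row per distinct price, with its frequency
print(len(cheap.value_counts()))   # 2 -> number of distinct prices below 2500
print(cheap.value_counts().sum())  # 4 -> summing the frequencies recovers count()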
Related
I have a big dataset, with 10,000 or so rows, as a pandas DataFrame with columns ['Date', 'TAMETR'].
The float values under 'TAMETR' increase and decrease over time.
I wish to loop through the 'TAMETR' column and check if there are consecutive instances where values are greater than let's say 1. Ultimately I'd like to get the average duration length and the distribution of the instances.
I've played around a little with what is written here: How to count consecutive ordered values on pandas data frame
I doubt I fully understand the code, and I can't make it work. I don't understand how to tweak it for greater than or less than (< / >).
The preferred output would be a dataframe, or array, with all the instances (greater than 1).
I can calculate the average and plot the distribution.
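A minimal pandas sketch of one way to get the run lengths, using made-up TAMETR values and a threshold of 1 (the column names are from the question; only the data is invented):

import pandas as pd

# Made-up data; the real frame has 'Date' and 'TAMETR' columns.
df = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=12, freq="D"),
    "TAMETR": [0.5, 1.2, 1.4, 0.8, 1.1, 1.3, 1.6, 0.9, 0.7, 1.5, 1.2, 0.4],
})

above = df["TAMETR"] > 1                    # flip to < 1 for the "lower than" case
run_id = (above != above.shift()).cumsum()  # new id every time the condition flips
run_lengths = above.groupby(run_id).sum()   # within a run, True values count its length
run_lengths = run_lengths[run_lengths > 0]  # keep only the runs where TAMETR > 1

print(run_lengths)                          # distribution of run durations (in rows)
print(run_lengths.mean())                   # average duration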
I have a large DF with 3m rows and 16 columns. I have been trying to find duplicates based on certain columns only, so I want to subset the data to rows that have exactly the same values in those 6 columns. I want to keep all rows of each duplicate group.
pp19952017[pp19952017.duplicated(subset=['Postcode', 'Property Type','Street','Town/City', 'District', 'County'],keep=False)]
Edit:
Here is an example of what is most likely a single property, but it won't show up as a duplicate because not every column is the same and a few cells differ. I want a list of duplicates so I can see how the same properties have increased in price.
15, ARMINGER ROAD, LONDON, HAMMERSMITH AND FULHAM, W12 7BA, GREATER LONDON
and
15, ARMINGER ROAD, LONDON, LONDON, HAMMERSMITH AND FULHAM, W12 7BA, GREATER LONDON
Unfortunately, this gives me nearly every line. I've checked manually and there aren't that many duplicates, so I'm a bit stuck as to how to find them.
As this data runs from 1995 to the present day, the way it was recorded changed over time, so I can only attempt to use this subset of columns to find the duplicates.
Thanks in advance for any help.
Solution:
I think I found a way to do it: I concatenated the relevant columns containing the repeating data and used that new concatenated column as the key to check for duplication. It is a workaround, but it does what I want.
duplicated is working as intended. The problem is with the usage: you are passing keep=False, which marks every record in a duplicate group as a duplicate, so all of them are returned.
E.g. with (Adam, 30, NY), (Adam, 35, NY), (Adam, 30, MA) and duplicates judged on name & state, keep=False returns 2 rows, because both NY rows are marked as duplicates.
If you pass keep='first' or keep='last', the first or last occurrence is kept (not marked as a duplicate), so only the remaining 1 row is returned.
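A small sketch of that example in pandas (the names, ages and states are the ones from the answer above):

import pandas as pd

df = pd.DataFrame({
    "name":  ["Adam", "Adam", "Adam"],
    "age":   [30, 35, 30],
    "state": ["NY", "NY", "MA"],
})

# keep=False marks every member of a duplicate group -> both NY rows come back.
print(df[df.duplicated(subset=["name", "state"], keep=False)])

# keep='first' leaves the first occurrence unmarked -> only the second NY row comes back.
print(df[df.duplicated(subset=["name", "state"], keep="first")])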
Probably a naive question, but I am new to this:
I have a column with 100,000 entries holding dates from Jan 1, 2018 to August 1, 2019 (with repeated entries). I want to create a new column in which a number, let's say 3500, is divided up in such a way that sum(new_column) for any particular day is less than or equal to 3500.
For example, let's say 01-01-2018 has 40 entries in the dataset; then 3500 is to be distributed randomly between those 40 entries in such a way that their total is less than or equal to 3500, and this needs to be done for all the dates in the dataset.
Can anyone advise me as to how to achieve that?
EDIT : The excel file is Here
Thanks
My answer is not the best, but it may work for you. Because you have 100,000 entries it will probably slow down performance, so use it and then paste the values: the solution uses the RANDBETWEEN function, which keeps recalculating every time you make a change in a cell.
So I made a test dataset like this:
First column ID would be the dates, and second column would be random numbers.
And the bottom right corner shows the totals, so as you can see, the totals for each ID sum up to 3500.
The formula I've used is:
=IF(COUNTIF($A$2:$A$7;A2)=1;3500;IF(COUNTIF($A$2:A2;A2)=COUNTIF($A$2:$A$7;A2);3500-SUMIF($A$1:A1;A2;$B$1:B1);IF(COUNTIF($A$2:A2;A2)=1;RANDBETWEEN(1;3500);RANDBETWEEN(1;3500-SUMIF($A$1:A1;A2;$B$1:B1)))))
And it works pretty well. Just pressing F9 to recalculate the worksheet gives new random numbers, but they still sum up to 3500 every time.
Hope you can adapt this to your needs.
UPDATE: You need to know that my solution will always force the numbers to sum up to exactly 3500; in no case will the sum of all values be less than 3500. You'll need to adapt that part. As I said, not my best answer...
UPDATE 2: Uploaded a sample file to my Gdrive in case you want to check how it works. https://drive.google.com/open?id=1ivW2b0b05WV32HxcLc11gP2JWvdYTa84
You will need 2 columns: one to count the number of dates and one for the values.
Formula in B2 is =COUNTIF($A$2:$A$51,A2)
Formula in C2 is =RANDBETWEEN(1,3500/B2)
Column B gives the count of repetitions for each date
Column C gives a random number whose sum per date will be at most 3500
The range in the formula in column B is $A$2:$A$51, which you can change according to your data
EDIT
For each date in your list you can apply a formula like below
The formula in D2 is =SUMIF(B:B,B2,C:C)
For the difference from 3500 for each unique date, you can use a pivot and apply a formula to the sum of each date, like below
Formula in J2 is =3500-I2
Sorry - a little late to the party but this looked like a fun challenge!
The simplest way I could think of is to add a RAND() column (then hard-code it, if required) and then another column that calculates the 3500 split per date, based on the RAND() column.
Here's the function:
=ROUNDDOWN(3500*B2/SUMIF($A$2:$A$100000,A2,$B$2:$B$100000),0)
Illustrated here:
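If the data lives in pandas rather than Excel, a minimal sketch of the same proportional-split idea might look like this (the Date column name, the 3500 budget and the toy dates are assumptions based on the question):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up frame; in the real data each date repeats many times across 100,000 rows.
df = pd.DataFrame({"Date": ["01-01-2018"] * 4 + ["02-01-2018"] * 3})

df["rand"] = rng.random(len(df))
# Same idea as the ROUNDDOWN formula: each row gets a share of 3500 proportional
# to its random draw within its date; flooring keeps every daily total <= 3500.
df["split"] = np.floor(3500 * df["rand"] / df.groupby("Date")["rand"].transform("sum"))

print(df.groupby("Date")["split"].sum())    # every daily total is at most 3500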
I have a pandas dataframe column as shown in the figure below. Only two values, Increase and Decrease, occur randomly in the column. Is there a way to process that data?
For this particular problem, I want to get the first (2 CONSECUTIVE) occurrence of the word Increase AFTER at least one (2 CONSECUTIVE) occurrences (maybe more, 2 is the minimum) of the word Decrease.
As an example, if the series is (I for "Increase", D for "Decrease"): "I,I,I,I,D,I,I,D,I,D,I,D,D,D,D,I,D,I,D,D,I,I,I,I", it should return the index of row 21 (the third-last I in the series). Assume that the example series is in a pandas column, meaning the series is vertical, not horizontal, and that indexing starts at 0, so the first I counts as row 0.
For this particular example, it should return 2009q4, which is the index of that particular row.
If somebody can show me a way to do common tasks for this type of data, such as counting the number of consecutive occurrences of a given value, detecting a value change, or getting the value at a particular position after a value change (which may not all be required for this problem, but could be useful for future ones), I shall be really grateful.
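A minimal sketch of one way to do it, using run labels built with cumsum (the I/D string is the example series from the question; on the real data hits[0] would be a date label such as 2009q4 rather than 21):

import pandas as pd

# The example series from the question (I = Increase, D = Decrease).
s = pd.Series(list("IIIIDIIDIDIDDDDIDIDDIIII")).map({"I": "Increase", "D": "Decrease"})

run_id = (s != s.shift()).cumsum()              # label each run of identical values
run_len = s.groupby(run_id).transform("size")   # length of the run each row belongs to

# Has a Decrease run of length >= 2 been seen at or before this row?
seen_double_decrease = ((s == "Decrease") & (run_len >= 2)).cummax()

# Is this row the 2nd element of an Increase run of length >= 2?
is_second_increase = (s == "Increase") & (run_len >= 2) & (s.groupby(run_id).cumcount() == 1)

hits = s.index[seen_double_decrease.shift(fill_value=False) & is_second_increase]
print(hits[0])   # -> 21 for the example series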
I have a two-column dataframe named limitData where the first column is CcyPair and the second is trade notional:
CcyPair,TradeNotional
USDCAD,1000000
USDCAD,7600
USDCAD,40000
GBPUSD,100000
GBPUSD,345000
etc
with a large number of CcyPairs and many TradeNotionals per CcyPair. From here I generate summary statistics as follows:
limitDataStats = limitData.groupby(['CcyPair']).describe()
This is easy enough. However, I would like to add a column to limitDataStats that contains, for each CcyPair, the count of TradeNotional values greater than that pair's 75% value produced by .describe(). I've searched a great deal and tried a number of variations but can't figure it out. I think it should be something along the lines of the below (I thought I could reference the index of the groupby as mentioned here, but that gives me the actual integer index here):
limitData.groupby(['CcyPair'])['AbsBaseTrade'].apply(lambda x: x[x > limitDataStats.loc[x.index , '75%']].count())
Any ideas? Thanks, Colin
You can compare each value with its group's 75th percentile and count how many are greater than or equal to it (.sum() is used because ge() returns a boolean series):
limitData.groupby('CcyPair')['AbsBaseTrade'].apply(
    lambda x: x.ge(x.quantile(.75)).sum())
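A quick self-contained check with the sample rows from the question (the question's own attempt calls the notional column AbsBaseTrade, while the sample data calls it TradeNotional; the latter name is used here to match the sample):

import pandas as pd

limitData = pd.DataFrame({
    "CcyPair":       ["USDCAD", "USDCAD", "USDCAD", "GBPUSD", "GBPUSD"],
    "TradeNotional": [1000000, 7600, 40000, 100000, 345000],
})

# Per currency pair: how many trades are at or above that pair's 75th percentile.
counts = limitData.groupby("CcyPair")["TradeNotional"].apply(
    lambda x: x.ge(x.quantile(.75)).sum())
print(counts)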