I have a Pandas dataframe with values which should lie between, say, 11-100. However, sometimes I'll have values between 1-10, and this is because the person who was entering that row used a convention that the value in question should be multiplied by 10. So what I'd like to do is run a Pandas command which will fix those particular rows by multiplying their value by 10.
I can reference the values in question by doing something like
my_dataframe[my_dataframe['column_name']<10]
and I could set them all to a particular value, like 50, like so
my_dataframe[my_dataframe['column_name']<10] = 50
but how do I set them to a value which is 10* the value of that particular row?
I think you can use:
my_dataframe[my_dataframe['column_name']<10] *= 10
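For reference, a minimal self-contained sketch (the column name and data here are made up) that uses .loc to target only that column, so any other columns in the matched rows are left untouched:

import pandas as pd

# Hypothetical data: values below 10 were entered at 1/10 of the intended scale.
my_dataframe = pd.DataFrame({'column_name': [55, 7, 92, 3, 40]})

# Multiply only the too-small values in that one column by 10.
mask = my_dataframe['column_name'] < 10
my_dataframe.loc[mask, 'column_name'] *= 10

print(my_dataframe['column_name'].tolist())   # [55, 70, 92, 30, 40]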
I have a lot of data calculated and stored in a dataframe. A small example from the data table was attached as an image.
First I calculate all the values in the third column. After that I want to change every value that is bigger than 2. Is there a function with which I can find all the values bigger than 2 and replace them with another value?
I can only find functions that replace a specific known value in a dataframe, but I can't determine all the values and their locations in the column up front.
The function I tried: df.loc[df['Zelfconsumptie'] > 2, 'Zelfconsumptie'] = 2
To find all values in a given column 'Zelfconsumptie' in df that are greater than 2 and set those values to 2, use this (the .loc form targets just that column, so the rest of each row is left untouched):
df.loc[df['Zelfconsumptie'] > 2, 'Zelfconsumptie'] = 2
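For reference, a small self-contained demo of that line (the frame below is made up; 'Zelfconsumptie' stands in for the calculated column), with clip(upper=2) noted as an equivalent one-liner:

import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame({'maand': ['jan', 'feb', 'mrt', 'apr'],
                   'Zelfconsumptie': [1.4, 2.7, 3.1, 0.9]})

# Cap every value above 2 at 2, touching only that column.
df.loc[df['Zelfconsumptie'] > 2, 'Zelfconsumptie'] = 2
print(df['Zelfconsumptie'].tolist())   # [1.4, 2.0, 2.0, 0.9]

# Equivalent one-liner:
# df['Zelfconsumptie'] = df['Zelfconsumptie'].clip(upper=2)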
I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)
I tried df['city'].value_counts()/len(df) to get their frequencies. It returned a table. The first few values were 0.12,.4,.4,.3.....
I'm a complete beginner so I'm not sure how to use this information to assign everything in, say, the last 10 percentile to 'other'.
I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies on this.
Let's go through the logic of expected actions:
Count frequencies for every city
Calculate the 10th-percentile cutoff of those frequencies
Find the cities with frequencies at or below that cutoff
Change them to 'other'
You started in the right direction. To get frequencies for every city:
city_freq = (df['city'].value_counts())/df.shape[0]
We want to find the bottom 10%. We use pandas' quantile to do it:
bottom_decile = city_freq.quantile(q=0.1)
Now bottom_decile is a float: the frequency cutoff that separates the bottom 10% of cities from the rest. Cities with a frequency at or below that cutoff:
less_freq_cities = city_freq[city_freq <= bottom_decile]
less_freq_cities will hold the entries for those cities. If you want to change their value in 'df' to "other", target the "city" column with .loc:
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
complete code:
city_freq = (df['city'].value_counts())/df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
This is how you replace the bottom 10% (or whatever share you want; just change the q param in quantile) with a value of your choice.
EDIT:
As suggested in a comment, to get normalized frequencies it's better to use
city_freq = df['city'].value_counts(normalize=True) instead of dividing by the shape. But we don't actually need normalized frequencies: pandas' quantile will work even if they are not normalized, so we can use
city_freq = df['city'].value_counts() and it will still work.
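Here is a compact, self-contained run of the same steps on made-up data (the city names and counts are invented purely for illustration), ending with the get_dummies step mentioned in the question:

import pandas as pd

# Hypothetical data: 'a' and 'b' are common, the remaining cities appear once each.
df = pd.DataFrame({'city': ['a'] * 5 + ['b'] * 4 + ['c', 'd', 'e', 'f', 'g']})

city_freq = df['city'].value_counts(normalize=True)
bottom_decile = city_freq.quantile(q=0.1)

# Cities whose frequency is at or below the 10th-percentile cutoff...
less_freq_cities = city_freq[city_freq <= bottom_decile]
# ...get collapsed into a single 'other' category in the 'city' column.
df.loc[df['city'].isin(less_freq_cities.index), 'city'] = 'other'

print(df['city'].value_counts())       # only 'a', 'b' and 'other' remain
dummies = pd.get_dummies(df['city'])   # now only a handful of dummy columns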
After I calculate the average of quantities within 5 rows for each row in a PySpark dataframe, using a window partitioned over a group of columns
from pyspark.sql import functions as F
from pyspark.sql import Window
prep_df = ...
window = Window.partitionBy([F.col(x) for x in group_list]).rowsBetween(Window.currentRow, Window.currentRow + 4)
consecutive_df = prep_df.withColumn('aveg', F.avg(prep_df['quantity']).over(window))
I am then trying to group by the same columns and select the maximum of those average values, like this:
grouped_consecutive_df = consecutive_df.groupBy(group_column_list).agg(F.max(consecutive_df['aveg']).alias('aveg'))
However, when I debug, I see that the calculated maximum values are wrong. For specific instances, I saw that the retrieved max values are not even present in the 'aveg' column.
I'd like to ask whether I am taking the wrong approach or missing something trivial. Any comments are appreciated.
I was able to solve this with a workaround: before aggregating, I mapped the maximum of the quantity averages into another new column, and then selected one of the rows in each group.
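A sketch of what that workaround might look like. The ordering column 'week' is an assumption (any column that defines the 5-row order will do), and prep_df and group_list are reused from the question:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Rolling 5-row window per group (ordering column assumed) and an
# unbounded window over the whole group.
rolling_w = (Window.partitionBy(*group_list)
                   .orderBy('week')
                   .rowsBetween(Window.currentRow, Window.currentRow + 4))
group_w = Window.partitionBy(*group_list)

consecutive_df = prep_df.withColumn('aveg', F.avg('quantity').over(rolling_w))

# Attach the per-group maximum of the rolling average as a new column,
# then keep a single row per group instead of calling groupBy().agg().
with_max = consecutive_df.withColumn('max_aveg', F.max('aveg').over(group_w))
grouped_consecutive_df = (with_max
                          .select(*group_list, F.col('max_aveg').alias('aveg'))
                          .dropDuplicates(group_list))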
I'm new to Pandas.
I've got a dataframe where I want to group by user and then find, for each row, the lowest value in their speed column up until that date.
So I can't just use df.groupby(['user'])['speed'].transform('min'), as this would give the min of all values, not just from the current row back to the first.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly. The problem boils down to the following: you need to select the range of data you want to work with (rows for the date range, columns for user/speed).
That would look something like x = df.loc["2-4-2018":"2-4-2019", ['users', 'speed']]
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.
I haven't played around with DataFrames for a while, but what you're looking for is how to slice DataFrames.
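As a separate sketch (not part of the answer above, and with made-up column names): if the rows are sorted by date within each user, a running minimum per user can be computed with groupby plus cummin:

import pandas as pd

# Hypothetical data: one row per user per date.
df = pd.DataFrame({
    'user':  ['a', 'a', 'a', 'b', 'b'],
    'date':  pd.to_datetime(['2018-04-01', '2018-04-02', '2018-04-03',
                             '2018-04-01', '2018-04-02']),
    'speed': [30, 25, 40, 50, 45],
})

# Sort by date within each user, then take the running minimum seen so far.
df = df.sort_values(['user', 'date'])
df['min_speed_so_far'] = df.groupby('user')['speed'].cummin()

print(df['min_speed_so_far'].tolist())   # [30, 25, 25, 50, 45]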
I have a pandas dataframe column in which only two values, Increase and Decrease, occur in random order. Is there a way to process that data?
For this particular problem, I want to get the first occurrence of 2 consecutive Increase values after at least one run of 2 or more consecutive Decrease values (the run may be longer; 2 is the minimum).
As an example, if the series is (I for "Increase", D for "Decrease"): "I,I,I,I,D,I,I,D,I,D,I,D,D,D,D,I,D,I,D,D,I,I,I,I", it should return the index of row 21 (the third-last I in the series). Assume the example series is a vertical pandas column and that indexing starts at 0, so the first I is row 0.
In my actual data the index consists of quarter labels, so there it should return 2009q4, the index label of that particular row.
If somebody can show me a way to do common tasks on this type of data, such as counting the number of consecutive occurrences of a given value, detecting a value change, or getting the value at a particular position after a value change (which may not all be required for this problem, but can be useful for future problems), I shall be really grateful.
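One possible sketch for the specific question above (the series below mirrors the I/D example; the idiom of comparing a column to its shifted self is the reusable part for detecting runs and value changes):

import pandas as pd

# Toy series reproducing the example, with a plain 0..23 index.
s = pd.Series(list("IIIIDIIDIDIDDDDIDIDDIIII")).map(
    {"I": "Increase", "D": "Decrease"})

# A row "completes" a consecutive pair when it equals the previous row.
pair_end = s.eq(s.shift())

# Integer position of each row, so this also works with a non-integer index.
pos = pd.Series(range(len(s)), index=s.index)

# Position where the first Decrease,Decrease pair is completed.
first_dd = pos[pair_end & s.eq("Decrease")].iloc[0]

# First Increase,Increase pair completed after that point.
hits = pair_end & s.eq("Increase") & (pos > first_dd)
print(s.index[pos[hits].iloc[0]])   # 21 here; a label like 2009q4 in real data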