So I am kind of stuck here; my data is something like this:
df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 4, 3, 2, 1],
                   'Y': [6, 7, 8, 9, 10, 9, 8, 7, 6],
                   'Z': [11, 12, 13, 14, 15, 14, 13, 12, 11]})
I'd like to write code to set the values of rows 6 to 9 of column 'Z' to NaN.
The best I've come up with is:
df.replace({'Z': { 6: np.NaN, 7: np.NaN }})
but that replaces by value rather than by row position, so matching values get changed wherever they occur.
I am confused about how to change the values of particular rows in a column when that column contains duplicate values.
You can use the loc indexer for your dataframe. I've used rows 6 to 8 because df doesn't have a row 9:
df.loc[range(6, 9), 'Z'] = pd.NA
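A runnable sketch of that idea; I've used np.nan here instead of pd.NA so the integer column simply upcasts to float64:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 4, 3, 2, 1],
                   'Y': [6, 7, 8, 9, 10, 9, 8, 7, 6],
                   'Z': [11, 12, 13, 14, 15, 14, 13, 12, 11]})

# assign by index label; with the default RangeIndex, labels 6, 7, 8
# are the last three rows
df.loc[range(6, 9), 'Z'] = np.nan
```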
you could use:
df.Z[6:9] = np.NaN
though note this is chained assignment, which can trigger a SettingWithCopyWarning; df.loc[6:8, 'Z'] = np.NaN is the safer form.
I think you should use .iloc for this.
First of all, the index is zero-based, so there is no row 9.
To change the values of rows 6 to 8 in column 'Z' to pd.NA you could do something like this:
df.iloc[6:9, 2:] = pd.NA
I'm assuming pandas >= 1.0, which introduced NA values.
I have a dataframe consisting of float64 values. I have to divide each value by a hundred, except for the values in the row with index 388. For that I wrote the following code.
Dataset
Preprocessing:
df = pd.read_csv('state_cpi.csv')
d = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
df['Month']=df['Name'].map(d)
r = {'Rural':1, 'Urban':2, 'Rural+Urban':3}
df['Region_code']=df['Sector'].map(r)
df['Himachal Pradesh'] = df['Himachal Pradesh'].str.replace('--','NaN')
df['Himachal Pradesh'] = df['Himachal Pradesh'].astype('float64')
Extracting the data of interest:
data = df.iloc[:,3:-2]
Applying the division on the data dataframe
data[:,:388] = (data[:,:388] / 100).round(2)
data[:,389:] = (data[:,389:] / 100).round(2)
It returned a dataframe where the data of row 388 was also divided by 100.
As an example, I give the created dataframe. All indices except 10 are copied into the aaa list. Those index labels are then used when selecting rows, and 1 is added to each element. The row with index 10 remains unchanged.
df = pd.DataFrame({'a': [1, 23, 4, 5, 7, 7, 8, 10, 9],
                   'b': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
                  index=[1, 2, 5, 7, 8, 9, 10, 11, 12])
aaa = df[df.index != 10].index
df.loc[aaa, :] = df.loc[aaa, :] + 1
In your case, the code will be as follows:
aaa = data[data.index != 388].index
data.loc[aaa, :] = (data.loc[aaa, :] / 100).round(2)
display(df)
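A trimmed, self-contained version of that pattern (the index labels and values here are invented so the effect is easy to check):

```python
import pandas as pd

data = pd.DataFrame({'cpi': [110.0, 220.0, 330.0]}, index=[387, 388, 389])

aaa = data[data.index != 388].index          # every label except 388
data.loc[aaa, :] = (data.loc[aaa, :] / 100).round(2)

# rows 387 and 389 are divided; row 388 keeps its original value
print(data)
```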
display(prices)
I have 2 dataframes; I want to replace the month numbers in dataframe 1 with the 'DA HB West' value for that month. The match also has to use the same Cheat code as the df.
I feel like this is really easy to do but I keep getting an error.
The error reads "ValueError: Can only compare identically-labeled Series objects"
With a sample of your data:
df2 = pd.DataFrame({"Month": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                    "DA HB West": np.random.random(12),
                    "Year": [2019]*12,
                    "Cheat": ["2019PeakWE"]*12})
df = pd.DataFrame({"Month1": [7, 7, 7, 9, 11],
                   "Month2": [8, 8, 8, 10, 12],
                   "Month3": [9.0, 9.0, 9.0, 11.0, np.nan],
                   "Cheat4": ["2019PeakWE"]*5})
df.columns = df.columns.str[:-1]
Fill the NaN values so that casting the values to integers doesn't raise an error:
df.fillna(0, inplace=True)
Map all but the last column:
d = {}
for i, j in df.groupby("Cheat"):
    mapping = df2[df2["Cheat"] == i].set_index("Month")["DA HB West"].to_dict()
    d[i] = j
    d[i].iloc[:, :-1] = j.iloc[:, :-1].astype(int).apply(lambda x: x.map(mapping))
This creates a dictionary of all the different Cheats.
You can then append them all together, if you need to.
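The heart of the approach is the per-Cheat mapping dict built from df2; here is a trimmed sketch with made-up 'DA HB West' values so the result is checkable:

```python
import pandas as pd

df2 = pd.DataFrame({"Month": [7, 8, 9],
                    "DA HB West": [70.0, 80.0, 90.0],
                    "Cheat": ["2019PeakWE"] * 3})
df = pd.DataFrame({"Month": [7, 9, 8],
                   "Cheat": ["2019PeakWE"] * 3})

# month number -> price, restricted to rows with the matching Cheat
mapping = (df2[df2["Cheat"] == "2019PeakWE"]
           .set_index("Month")["DA HB West"]
           .to_dict())
df["Month"] = df["Month"].map(mapping)
```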
I have a data frame with a column 'score'. It contains scores from 1 to 10. I want to create a new column "color" which gives the column color according to the score.
For example, if the score is 1, the value of color should be "#75968f"; if the score is 2, it should be "#a5bab7". I.e. we need colors ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#f1d4Af", "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"] for scores [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] respectively.
Is it possible to do this without using a loop?
Let me know in case you have a problem understanding the question.
Use Series.map with a dictionary generated by zipping both lists; or, if the scores are simply the range from 1 to the length of the colors list, you can use enumerate:
df = pd.DataFrame({'score':[2,4,6,3,8]})
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#f1d4Af",
          "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df['new'] = df['score'].map(dict(zip(scores, colors)))
df['new1'] = df['score'].map(dict(enumerate(colors, 1)))
print(df)
score new new1
0 2 #a5bab7 #a5bab7
1 4 #e2e2e2 #e2e2e2
2 6 #dfccce #dfccce
3 3 #c9d9d3 #c9d9d3
4 8 #cc7878 #cc7878
I have an array/list/pandas Series:
np.arange(15)
Out[11]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
What I want is:
[[0,1,2,3,4,5],
[1,2,3,4,5,6],
[2,3,4,5,6,7],
...
[10,11,12,13,14]]
That is, recurrently slide a window down this column to build a matrix of overlapping rows.
The reason is that I am doing feature engineering on a column of temperature data. I want to use the last 5 data points as features and the next one as the target.
What's the most efficient way to do that? My data is large.
If the array is formatted like this :
arr = np.array([1,2,3,4,5,6,7,8,....])
You could try it like this (np.array rather than the deprecated np.matrix, and without the extra bracket nesting, which would otherwise produce a 3-D result):
recurr_transpose = np.array([arr[i:i+5] for i in range(len(arr) - 4)])
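For large arrays, NumPy's sliding_window_view (available since NumPy 1.20) builds the same matrix as a zero-copy view and avoids the Python-level loop; a window of 5 matches the comprehension above:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(15)
windows = sliding_window_view(a, 5)   # shape (11, 5), no data copied

# each row is one 5-element window, shifted by one step:
# windows[0]  -> [0, 1, 2, 3, 4]
# windows[-1] -> [10, 11, 12, 13, 14]
```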
The dataframe looks like this:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
5, 3713.110107421875, 2012-01-07T03:14:03.937Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
7, 3713.89892578125, 2012-01-07T03:14:13.968Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
10, 3714.64990234375, 2012-01-07T03:14:24.015Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
12, 3714.64990234375, 2012-01-07T03:14:29.031Z
At some rows there are entries whose timestamps differ only by milliseconds; I want to drop those and keep only the rows whose timestamps differ at the second level. There are also runs of rows that share the same value across both the millisecond-different and second-different rows, like rows 9 to 12, so I can't use a.loc[a.shift() != a].
The desired output would be:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
Try:
df.groupby(pd.to_datetime(df[2]).astype('datetime64[s]')).head(1)
I hope it's self-explanatory.
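A small check of the idea with three of the rows above; dt.floor('s') is an equivalent, version-stable way to truncate the timestamps to whole seconds before grouping:

```python
import pandas as pd

df = pd.DataFrame({2: ['2012-01-07T03:14:24.000Z',
                       '2012-01-07T03:14:24.015Z',
                       '2012-01-07T03:14:29.000Z']})

# rows that differ only in milliseconds fall into the same
# one-second group; head(1) keeps the first row of each group
out = df.groupby(pd.to_datetime(df[2]).dt.floor('s')).head(1)
```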
You can use the script below. I didn't have your dataframe's column names, so I invented the columns ['x', 'date_time']:
df = pd.DataFrame([
    (3710.968017578125, pd.to_datetime('2012-01-07T03:13:43.859Z')),
    (3710.968017578125, pd.to_datetime('2012-01-07T03:13:48.890Z')),
    (3712.472900390625, pd.to_datetime('2012-01-07T03:13:53.906Z')),
    (3712.472900390625, pd.to_datetime('2012-01-07T03:13:58.921Z')),
    (3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.900Z')),
    (3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.937Z')),
    (3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.900Z')),
    (3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.968Z')),
    (3713.89892578125, pd.to_datetime('2012-01-07T03:14:19.000Z')),
    (3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.000Z')),
    (3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.015Z')),
    (3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.000Z')),
    (3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.031Z'))],
    columns=['x', 'date_time'])
Create a column 'time_diff' with the difference between the datetime of the current row and the previous row within each 'x' group, keep only the rows where that difference is either NaT or more than 1 second, then drop the temporary column:
df['time_diff'] = df.groupby('x')['date_time'].diff()
df = df[(df['time_diff'].isnull()) | (df['time_diff'].map(lambda x: x.seconds > 1))]
df = df.drop(['time_diff'], axis=1)
df