Equivalent R and Python with a DataFrame [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 2 years ago.
I'm stuck trying to find the Python equivalent of some R code.
Code in R
library(datasets)
data <- airquality
data2 <- data[data$Ozone < 63,]
I downloaded the airquality file and used the pd.read_csv() function to load the .csv file into Python. But I don't know how to obtain the equivalent of this line: data[data$Ozone < 63,].

data2 = data.loc[data["Ozone"] < 63,:]
This should do the trick.
data["Ozone"] < 63 returns a boolean mask (a Series of True/False values) marking the rows where the condition holds
data.loc[mask, :] returns a copy of the DataFrame data restricted to those rows, for all columns (:)
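For completeness, a minimal end-to-end sketch, assuming the airquality data was exported to a file named airquality.csv (the filename is hypothetical):
import pandas as pd

data = pd.read_csv("airquality.csv")  # hypothetical path to the exported data

# Boolean mask: True where Ozone < 63. One difference from R: NaN < 63 is
# False in pandas, so rows with missing Ozone are dropped, while R's
# data[data$Ozone < 63,] keeps them as all-NA rows.
data2 = data.loc[data["Ozone"] < 63, :]
print(data2.head())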

Related

Python math operation on column [duplicate]

This question already has answers here:
Convert pandas.Series from dtype object to float, and errors to nans
(3 answers)
Closed 3 years ago.
Data from JSON is in a df and I am trying to output it to a CSV.
I am trying to multiply a DataFrame column by a fixed value and am having issues with how the data is displayed.
I have used the following, but the data is still not displayed the way I want:
df_entry['Hours'] = df_entry['Hours'].multiply(2)
df_entry['Hours'] = df_entry['Hours'] * 2
Input
ID,name,hrs
100,AB,37.5
Expected
ID,name,hrs
100,AB,75.0
What I am getting
ID,name,hrs
100,AB,37.537.5
That happens because the dtype of the column is object (strings), and multiplying a string by 2 repeats it, which is why '37.5' becomes '37.537.5'. You need to convert the column to float before multiplying.
df_entry['Hours'] = df_entry['Hours'].astype(float) * 2
You can use the apply function.
df_entry['Hours'] = df_entry['Hours'].apply(lambda x: float(x) * 2)
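If some entries might not parse as numbers at all, pd.to_numeric with errors='coerce' turns them into NaN instead of raising, which is what the linked duplicate covers. A self-contained sketch that reproduces the symptom (the sample is rebuilt from the rows above, using the column name Hours from the code rather than hrs from the header):
import pandas as pd
from io import StringIO

csv = "ID,name,Hours\n100,AB,37.5\n"
df_entry = pd.read_csv(StringIO(csv), dtype=str)   # force string dtype to reproduce the bug

print((df_entry['Hours'] * 2).iloc[0])             # '37.537.5' -- string repetition
df_entry['Hours'] = pd.to_numeric(df_entry['Hours'], errors='coerce') * 2
print(df_entry['Hours'].iloc[0])                   # 75.0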

How to select multiple columns and rows from dataframe under condition? [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I would like to choose rows and columns under a condition, for example:
0 is camera, 1 is video.
When the column == 1, return the data for video;
else return the data for photo.
The purpose is to get separate data based on video and photo.
The code is shown below. I guess the problem is with .loc[i, :], because when I change i to 0, it grabs the first row successfully. But I don't know why i doesn't work.
for i in range(len(dataset)):
    if dataset['status_type_num'][i] == 1:
        video_data = dataset[['num_reactions', 'num_comments', 'num_shares', 'num_likes', 'num_loves']].loc[i, :]
        print(video_data)
I expect the output to be the data from the 5 columns ('num_reactions', 'num_comments', 'num_shares', 'num_likes', 'num_loves') for video.
Thank you.
Subset the dataset with a boolean condition.
Example:
df_camera = dataset[dataset['status_type_num'] == 0]
df_video = dataset[dataset['status_type_num'] == 1]
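To also restrict the result to the five columns the question asks about, combine the row condition and the column list in a single .loc call. A sketch on a toy stand-in for the question's dataset (values invented):
import pandas as pd

# Toy stand-in for the question's dataset (invented values)
dataset = pd.DataFrame({
    'status_type_num': [1, 0, 1],
    'num_reactions':   [10, 20, 30],
    'num_comments':    [1, 2, 3],
    'num_shares':      [4, 5, 6],
    'num_likes':       [7, 8, 9],
    'num_loves':       [0, 1, 2],
})
cols = ['num_reactions', 'num_comments', 'num_shares', 'num_likes', 'num_loves']

# Boolean indexing replaces the row-by-row loop entirely
video_data = dataset.loc[dataset['status_type_num'] == 1, cols]
camera_data = dataset.loc[dataset['status_type_num'] == 0, cols]
print(video_data)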

How to get the second largest value in Pandas Python [duplicate]

This question already has answers here:
Get first and second highest values in pandas columns
(7 answers)
Closed 4 years ago.
This is my code:
maxData = all_data.groupby(['Id'])[features].agg('max')
all_data = pd.merge(all_data, maxData.reset_index(), suffixes=["", "_max"], how='left', on=['Id'])
Now, instead of getting the max value, how can I fetch the second-largest value in the above code (grouped by Id)?
Try using nlargest:
maxData = all_data.groupby(['Id'])[features].apply(lambda x: x.nlargest(2).iloc[-1]).reset_index(drop=True)
Note the positional .iloc[-1]: plain [1] on the Series returned by nlargest is a label lookup and will usually raise a KeyError.
You can use the nth method just after sorting the values:
maxData = all_data.sort_values(features, ascending=False).groupby(['Id']).nth(1)
Prefer this over apply where possible, as apply tends to be slower than the built-in groupby methods.
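A small self-contained illustration of the sort-then-nth approach (toy data; the column name value stands in for one of the question's feature columns):
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 1, 2, 2],
                   'value': [10, 30, 20, 5, 7]})

# Sort descending, then take the second row (position 1) within each group
second_max = (df.sort_values('value', ascending=False)
                .groupby('Id')['value']
                .nth(1))
print(second_max)   # Id 1 -> 20, Id 2 -> 5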

Python: len() of unknown numpy array column length [duplicate]

This question already has answers here:
Counting the number of non-NaN elements in a numpy ndarray in Python
(5 answers)
Closed 4 years ago.
I'm currently trying to learn Python and NumPy. The task is to determine the length of the individual columns of an imported CSV file.
So far I have:
import numpy as np
data = np.loadtxt("assignment5_data.csv", delimiter=',')
print(data.shape)
Which returns:
(62, 2)
Is there a way to iterate through each column and count the values that are not NaN?
If I understand correctly, and you are trying to get the number of non-NaN values in each column, use:
np.sum(~np.isnan(data),axis=0)
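A quick demonstration on a small invented array, shaped like a miniature version of the question's 62x2 file:
import numpy as np

data = np.array([[1.0, np.nan],
                 [2.0, 3.0],
                 [np.nan, 4.0]])

# ~np.isnan(data) is True where a value is present; summing down axis=0
# counts the non-NaN entries in each column
print(np.sum(~np.isnan(data), axis=0))   # [2 2]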

Performance of Pandas string contains for column [duplicate]

This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I have a DataFrame of 83k rows with a column "Text" of text that I have to search for ~200 masks. Is there a way to pass a column to .str.contains()?
I'm able to do it like this:
import time

start = time.time()
counts = [a["Text"].str.contains(m).sum() for m in b["mask"].values]
print(time.time() - start)
But it's taking 34.013s. Is there a faster way?
Edit:
b["mask"] looks like:
'PR347856|P5478'
'BS7623|B5763'
and I want the count of occurrences for each mask, so I can't join them.
Edit:
a["Text"] contains strings roughly three sentences long.
Maybe you can vectorize the containment operation.
text_contains = a['Text'].str.contains
b['mask'].map(lambda m: text_contains(m).sum())
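A toy end-to-end run of that approach (data invented for illustration). Note that str.contains treats each mask as a regex by default, so the '|' inside a mask already acts as an OR between its two IDs:
import pandas as pd

a = pd.DataFrame({'Text': ['ref PR347856 ok', 'see B5763 and PR347856', 'nothing here']})
b = pd.DataFrame({'mask': ['PR347856|P5478', 'BS7623|B5763']})

text_contains = a['Text'].str.contains        # bind the method once, outside the loop
counts = b['mask'].map(lambda m: text_contains(m).sum())
print(counts.tolist())   # [2, 1]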
