I have the following Excel sheet (blank cells in col2):

col1 col2
1    a
2
3    b
4

I want to print the column 1 value if the column 2 value is not null. The output should be [1, 3].
This is the script I created, but it doesn't work:
import xlrd
import pandas as pd
filename='test.xlsx'
dataframe = pd.read_excel(filename)
frame = dataframe.loc[dataframe["col2"] !=" "]
df = frame.iloc[:, 0]
ndarray = df.to_numpy()
print(ndarray)
You can first filter down to the non-NA rows and then take the values of the column you want:
dataframe[dataframe['col2'].notna()]['col1'].values
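As a minimal, self-contained sketch of that one-liner (the inline data here stands in for the Excel sheet and is assumed from the question):
import pandas as pd

# Inline data standing in for the Excel sheet; None is read as NaN
dataframe = pd.DataFrame({"col1": [1, 2, 3, 4],
                          "col2": ["a", None, "b", None]})
print(dataframe[dataframe["col2"].notna()]["col1"].values)  # [1 3]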
If you print the dataframe, you will see that the empty cells are NaN:
   col1 col2
0     1    a
1     2  NaN
2     3    b
3     4  NaN
So, you need to use the notna() method to filter those rows out.
Here is your fixed code:
import pandas as pd

filename = 'test.xlsx'
dataframe = pd.read_excel(filename)  # note: xlrd is not needed for .xlsx files
# Keep only the rows where col2 is not missing (empty cells are read as NaN)
frame = dataframe.loc[dataframe["col2"].notna()]
df = frame.iloc[:, 0]  # first column
ndarray = df.to_numpy()
print(ndarray)
I have dataframe df:

   0
0  a
1  b
2  c
3  d
4  e
O/P should be:

   a  b  c  d  e
0
1
2
3
4
5
I want the column containing (a, b, c, d, e) to become the header of my dataframe.
Could anyone help?
If your dataframe is a pandas DataFrame named df, try solving it with pandas: first convert the initial df content to a list, then create a new dataframe, defining its columns with that list.
import pandas as pd

cols = df[0].tolist()  # df[0] is the first column; avoid shadowing the built-in list
dfSolved = pd.DataFrame([], columns=cols)
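For instance, with a single-column frame like the one in the question, this produces the expected header (a quick sketch to illustrate, not part of the original answer):
import pandas as pd

df = pd.DataFrame(['a', 'b', 'c', 'd', 'e'])
cols = df[0].tolist()
dfSolved = pd.DataFrame([], columns=cols)
print(dfSolved.columns.tolist())  # ['a', 'b', 'c', 'd', 'e']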
You may provide more details, like the index and values of the expected output and the operation you want to perform, so that we can give a solution specific to your case.
Here is the solution:
import pandas as pd
import io
import numpy as np
data_string = """ columns_name
0 a
1 b
2 c
3 d
4 e
"""
df = pd.read_csv(io.StringIO(data_string), sep=r'\s+')  # raw string for the regex separator

# Solution: use the column's values as the new frame's headers
df_result = pd.DataFrame(data=[[np.nan] * 5],
                         columns=df['columns_name'].tolist())
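Printing df_result then shows a single all-NaN row under the new headers:
print(df_result)
#     a   b   c   d   e
# 0 NaN NaN NaN NaN NaN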
I have the following dataframe:
df = pd.DataFrame([['A', 1],['B', 2],['C', 3]], columns=['index', 'result'])
  index  result
0     A       1
1     B       2
2     C       3
I would like to create a new column, for example by multiplying the column 'result' by two, and I am curious whether there is a way to do it in pandas the way PySpark does it.
In PySpark:
df = df \
    .withColumn("result_multiplied", F.col("result") * 2)
I don't like having to write the name of the dataframe every time I perform an operation, as is done in pandas:
In pandas:
df['result_multiplied'] = df['result']*2
Use DataFrame.assign:
df = df.assign(result_multiplied=df['result'] * 2)
Or, if the column result is itself modified earlier in the chain, a lambda function is necessary so the newly computed values of result are used:
df = df.assign(result_multiplied=lambda x: x['result'] * 2)
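This lambda form also addresses the original complaint about repeating the dataframe name: assign calls can be chained PySpark-style. A small sketch (the second column name here is made up for illustration):
df = (df
      .assign(result_multiplied=lambda x: x['result'] * 2)
      .assign(result_times_four=lambda x: x['result_multiplied'] * 2))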
A sample to see the difference: result_multiplied is computed from the original df['result'], while result_multiplied1 uses the column after mul(2) has been applied:
df = df.mul(2).assign(result_multiplied=df['result'] * 2,
                      result_multiplied1=lambda x: x['result'] * 2)
print (df)
  index  result  result_multiplied  result_multiplied1
0    AA       2                  2                   4
1    BB       4                  4                   8
2    CC       6                  6                  12
I have a dataset imported from a CSV file to a dataframe in Python. I want to remove some specific rows from this dataframe and append them to an empty dataframe. So far I have tried to remove rows 0 and 1 from the "big" dataframe called df and put them into dff using this code:
dff = pd.DataFrame()  # Create empty dataframe
for x in range(0, 2):
    dff = dff.append(df.iloc[x])  # Append the first 2 rows of df to dff
    # How to remove the appended rows from df?
This seems to work; however, the columns are flipped: e.g., if df has the order A, B, C, then dff gets the order C, B, A. Other than that the data is correct. Also, how do I remove a specific row from a dataframe?
If your goal is just to remove the first two rows into another dataframe, you don't need to use a loop, just slice:
import pandas as pd
df = pd.DataFrame({"col1": [1,2,3,4,5,6], "col2": [11,22,33,44,55,66]})
dff = df.iloc[:2]
df = df.iloc[2:]
Will give you:
dff
Out[6]:
col1 col2
0 1 11
1 2 22
df
Out[8]:
col1 col2
2 3 33
3 4 44
4 5 55
5 6 66
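One caveat, hedged: an iloc slice may be a view of the original frame, so if you plan to modify dff or df independently afterwards, taking an explicit copy avoids a SettingWithCopyWarning:
dff = df.iloc[:2].copy()  # independent copy of the first two rows
df = df.iloc[2:].copy()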
If your list of desired rows is more complex than just the first two, per your example, a more generic method could be:
dff = df.iloc[[1, 3, 5]]  # your list of row positions
df = df[~df.index.isin(dff.index)]  # boolean masks go with [] or .loc rather than .iloc
This means that even if the index column isn't sequential integers, any rows that you used to populate dff will be removed from df.
I managed to solve it by doing:
dff = df.iloc[:0]  # zero-row slice: keeps the columns, drops all data
This copies the column structure of df (the column titles, e.g. A, B, C) into dff without copying any rows; append then works as it should, and any row, e.g. row 1150, can be appended to dff and removed from df using:
dff = dff.append(df.iloc[1150])
df = df.drop(df.index[1150])
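A note if you are on a recent pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0, so the same pattern there would use pd.concat:
dff = pd.concat([dff, df.iloc[[1150]]])  # double brackets keep a one-row DataFrame
df = df.drop(df.index[1150])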
I have a dataframe which contains 36 columns and 1,600,000 rows. The data contains "XNA" values, so when I try to count missing values using df.isnull().sum(), the XNA values are not counted. To count them I have to replace the XNA values with NaN. How can I do that?
Just do:
import pandas as pd
import numpy as np

df = pd.DataFrame([[0, 1, 2], ["test", "XNA", "test2"]]).T
df.columns = ["col1", "col2"]
# replace() returns a new Series, so assign the result back
df["col2"] = df["col2"].replace("XNA", np.nan)
to replace all "XNA" values in col2 with missing values in the numpy/pandas format.
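Since the question mentions 36 columns, you probably want to replace across the whole frame rather than one column; a small extension of the same idea:
df = df.replace("XNA", np.nan)
print(df.isnull().sum())  # the former XNA cells now count as null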
I am trying to fillna in a specific column of the dataframe with the mean of the non-null values of the same type (based on the value in another column of the dataframe).
Here is the code to reproduce my issue:
import numpy as np
import pandas as pd

df = pd.DataFrame()
# Create the DataFrame with a column of floats
# and a column of labels (str)
np.random.seed(seed=6)
df['col0'] = np.random.randn(100)
lett = ['a', 'b', 'c', 'd']
df['col1'] = np.random.choice(lett, 100)
# Set some of the floats to NaN for the test.
toz = np.random.randint(0, 100, 25)
df.loc[toz, 'col0'] = np.nan
df[df['col0'].isnull() == False].count()
# Create a DF with the mean for each label.
w_series = df.loc[(~df['col0'].isnull())].groupby('col1').mean()
          col0
col1
a     0.057199
b     0.363899
c    -0.068074
d     0.251979
#This dataframe has our label (a,b,c,d) as the index. Doesn't seem
#to work when I try to df.fillna(w_series). So I try to reindex such
#that the labels (a,b,c,d) become a column again.
#
#For some reason I cannot just do a set_index and expect the
#old index to become column. So I append the new index and
#then reset it.
w_series['col2'] = list(range(w_series.size))
w_frame = w_series.set_index('col2',append=True)
w_frame.reset_index('col1',inplace=True)
#I try fillna() with the new dataframe.
df.fillna(w_frame)
Still no luck:
col0 col1
0 0.057199 b
1 0.729004 a
2 0.217821 d
3 0.251979 c
4 -2.486781 a
5 0.913252 b
6 NaN a
7 NaN b
What am I doing wrong?
How do I fillna the dataframe with the averages of specific rows that match the missing information?
Does the size of the dataframe being filled (df) and the filler dataframe (w_frame) have to match?
Thank you
fillna is based on the index, so you need the same index on your target dataframe and on the dataframe you fill from:
df.set_index('col1')['col0'].fillna(w_frame.set_index('col1').col0).reset_index()
# I only show the first 11 rows
Out[74]:
col1 col0
0 b 0.363899
1 a 0.729004
2 d 0.217821
3 c -0.068074
4 a -2.486781
5 b 0.913252
6 a 0.057199
7 b 0.363899
8 c -0.068074
9 b -0.429894
10 a 2.631281
My way to fillna:
df['col0'] = df.groupby("col1")['col0'].transform(lambda x: x.fillna(x.mean()))  # fill col0, not col1
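A quick check of that one-liner on a tiny frame (column names from the question; the data is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({"col0": [1.0, np.nan, 3.0, np.nan],
                   "col1": ["a", "a", "b", "b"]})
df["col0"] = df.groupby("col1")["col0"].transform(lambda x: x.fillna(x.mean()))
print(df)  # NaN in group a becomes 1.0, NaN in group b becomes 3.0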