While reading a text file into a pandas DataFrame, what should I do to exclude the first column?
Code I am currently using:
dframe_main = pd.read_table('/Users/ankit/Desktop/input.txt', sep=',')
Would it suffice to just delete the column after you've read it in? This is functionally the same as excluding the first column from the read. Here's a toy example:
import numpy as np
import pandas as pd
data = np.array([[1, 2, 3, 4, 5], [2, 2, 2, 2, 2], [3, 3, 3, 3, 3], [4, 4, 3, 4, 4], [5, 2, 3, 4, 5]])
columns = ["one", "two", "three", "four", "five"]
dframe_main = pd.DataFrame(data=data, columns=columns)
print("All columns:")
print(dframe_main)
del dframe_main[dframe_main.columns[0]]  # get rid of the first column
print("All columns except the first:")
print(dframe_main)
Output is:
All columns:
one two three four five
0 1 2 3 4 5
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 3 4 4
4 5 2 3 4 5
All columns except the first:
two three four five
0 2 3 4 5
1 2 2 2 2
2 3 3 3 3
3 4 3 4 4
4 2 3 4 5
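If you prefer not to mutate the frame with del, an equivalent spelling (a small variation on the example above, not from the original answer) uses drop:
# Drop the first column by position; drop() returns a new DataFrame.
dframe_main = dframe_main.drop(columns=dframe_main.columns[0])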
I would recommend using the usecols parameter:
usecols : array-like, default None Return a subset of the columns.
Results in much faster parsing time and lower memory usage.
Assuming that your file has 5 columns:
In [32]: list(range(5))[1:]
Out[32]: [1, 2, 3, 4]
dframe_main = pd.read_table('/Users/ankit/Desktop/input.txt', sep=',', usecols=list(range(5))[1:])
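If the number of columns is not known in advance, a minimal sketch (not part of the original answer) is to read only the header first and build usecols from it; read_csv with sep=',' is equivalent to the read_table call above:
import pandas as pd

path = '/Users/ankit/Desktop/input.txt'  # path taken from the question
# Read only the header row to learn how many columns the file has,
# then re-read the data while skipping the first column.
header = pd.read_csv(path, sep=',', nrows=0)
dframe_main = pd.read_csv(path, sep=',', usecols=list(range(1, len(header.columns))))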
I tried to create a data frame df using the below code:
import numpy as np
import pandas as pd
index = [0, 1, 2, 3, 4, 5]
s = pd.Series([1, 2, 3, 4, 5, 6], index=index)
t = pd.Series([2, 4, 6, 8, 10, 12], index=index)
df = pd.DataFrame(s, columns=["MUL1"])
df["MUL2"] = t
print(df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
While trying to create the same data frame using the below syntax, I am getting a weird output.
df = pd.DataFrame([s, t], columns=["MUL1", "MUL2"])
print(df)
MUL1 MUL2
0 NaN NaN
1 NaN NaN
Please explain why NaN is being displayed in the dataframe when both Series are non-empty, and why only two rows are displayed and not the rest.
Also, please provide the correct way to create the same data frame as shown above using the columns argument of the pandas DataFrame constructor.
One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])
If you remove the columns argument, you get:
df = pd.DataFrame([s,t])
print (df)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 2 4 6 8 10 12
Then define columns; if a column does not exist in the data, you get a NaN column:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
0 MUL2
0 1.0 NaN
1 2.0 NaN
It is better to use a dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
And if you need to change the column order, add the columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
MUL2 MUL1
0 2 1
1 4 2
2 6 3
3 8 4
4 10 5
5 12 6
More information is in the DataFrame documentation.
Another solution uses concat, so the DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
A pandas.DataFrame takes in the parameter data, which can be of type ndarray, iterable, dict, or DataFrame.
If you pass in a list, it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# output 1:
Col1 Col2 Col3
0 1 2 3
1 2 4 6
You are getting NaN because each Series in the list becomes a row, so the frame has only two rows, and the columns argument is matched against the Series' index ([0,1,2,3,4,5]); MUL1 and MUL2 are not among those labels, so the columns are filled with NaN.
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
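A short sketch completing that idea with the same a and b lists (the column names here are only illustrative):
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [2, 4, 6]
# Transpose so each original list becomes a column instead of a row.
data = np.array([a, b]).transpose()
df = pd.DataFrame(data, columns=["Col1", "Col2"])
print(df)
#    Col1  Col2
# 0     1     2
# 1     2     4
# 2     3     6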
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
Col1 Col2
0 1 2
1 2 4
2 3 6
I'm working on a project where my original dataframe is:
A B C label
0 1 2 2 NaN
1 2 4 5 7
2 3 6 5 NaN
3 4 8 7 NaN
4 5 10 3 8
5 6 12 4 8
But, I have an array with new labels for certain points (for that I only used columns A and B) in the original dataframe. Something like this:
X_labeled = [[2, 4], [3,6]]
y_labeled = [5,9]
My goal is to add the new labels to the original dataframe. I know that the combination of A and B is unique. What is the fastest way to assign the new label to the correct row?
This is my try:
y_labeled = np.array(y_labeled).astype('float64')
current_position = 0
for point in X_labeled:
    row = df.loc[(df['A'] == point[0]) & (df['B'] == point[1])]
    df.loc[row.index, 'label'] = y_labeled[current_position]
    current_position += 1
Wanted output (rows with index 1 and 2 are changed):
A B C label
0 1 2 2 NaN
1 2 4 5 5
2 3 6 5 9
3 4 8 7 NaN
4 5 10 3 8
5 6 12 4 8
For small datasets this may be okay, but I'm currently using it for datasets with more than 25,000 labels. Is there a faster way?
Also, in some cases I use all columns except the column 'label'. That dataframe consists of 64 columns, so my method cannot be used there. Does anyone have an idea how to improve this?
Thanks in advance
The best solution is to turn your arrays into a dataframe and use df.update():
new = pd.DataFrame(X_labeled, columns=['A', 'B'])
new['label'] = y_labeled
new = new.set_index(['A', 'B'])
df = df.set_index(['A', 'B'])
df.update(new)
df = df.reset_index()
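Put together with the sample data from the question, the whole flow would look roughly like this (a sketch; the label column is assumed to hold NaN for unlabeled rows):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [2, 4, 6, 8, 10, 12],
                   'C': [2, 5, 5, 7, 3, 4],
                   'label': [np.nan, 7, np.nan, np.nan, 8, 8]})
X_labeled = [[2, 4], [3, 6]]
y_labeled = [5, 9]

new = pd.DataFrame(X_labeled, columns=['A', 'B'])
new['label'] = y_labeled
new = new.set_index(['A', 'B'])

df = df.set_index(['A', 'B'])
df.update(new)          # overwrite label for the matching (A, B) rows
df = df.reset_index()
print(df)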
Here's a numpy-based approach aimed at performance. To vectorize this, we want a way to check membership of the rows of X_labeled in columns A and B. What we can do is view these two columns as 1D arrays (based on this answer) and then use np.in1d to index the dataframe and assign the values in y_labeled:
import numpy as np
X_labeled = [[2, 4], [3,6]]
y_labeled = [5,9]
a = df.values[:,:2].astype(int) #indexing on A and B
def view_as_1d(a):
    a = np.ascontiguousarray(a)
    return a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[-1])))
ix = np.in1d(view_as_1d(a), view_as_1d(X_labeled))
df.loc[ix, 'label'] = y_labeled
print(df)
A B C label
0 1 2 2 NaN
1 2 4 5 5
2 3 6 5 9
3 4 8 7 NaN
4 5 10 3 8
5 6 12 4 8
The question was originally asked here as a comment but could not get a proper answer as the question was marked as a duplicate.
For a given pandas.DataFrame, let us say
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
How can we select rows from a list, based on values in a column ('A', for instance)?
For instance
# from
list_of_values = [3,4,6]
# we would like, as a result
# A B
# 2 3 3
# 3 4 5
# 1 6 2
Using isin as mentioned here is not satisfactory as it does not keep order from the input list of 'A' values.
How can the abovementioned goal be achieved?
One way to overcome this is to make the 'A' column an index and use loc on the newly generated pandas.DataFrame. Finally, the subsampled dataframe's index can be reset.
Here is how:
ret = df.set_index('A').loc[list_of_values].reset_index()
# ret is
# A B
# 0 3 3
# 1 4 5
# 2 6 2
Note that the drawback of this method is that the original indexing has been lost in the process.
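If the original index matters, one workaround (a sketch, not part of the original answer) is to stash it in a column before re-indexing and restore it afterwards:
# 'orig_index' is an illustrative column name for the saved index.
ret = (df.reset_index()
         .rename(columns={'index': 'orig_index'})
         .set_index('A')
         .loc[list_of_values]
         .reset_index()
         .set_index('orig_index')
         .rename_axis(None))
# ret is
#    A  B
# 2  3  3
# 3  4  5
# 1  6  2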
More on pandas indexing: What is the point of indexing in pandas?
Use merge with a helper DataFrame created from the list, with the column name of the matched column:
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5]})
list_of_values = [3,6,4]
df1 = pd.DataFrame({'A':list_of_values}).merge(df)
print (df1)
A B
0 3 3
1 6 2
2 4 5
For a more general solution:
df = pd.DataFrame({'A' : [5,6,5,3,4,4,6,5], 'B':range(8)})
print (df)
A B
0 5 0
1 6 1
2 5 2
3 3 3
4 4 4
5 4 5
6 6 6
7 5 7
list_of_values = [6,4,3,7,7,4]
#create df from list
list_df = pd.DataFrame({'A':list_of_values})
print (list_df)
A
0 6
1 4
2 3
3 7
4 7
5 4
#column for original index values
df1 = df.reset_index()
#helper column to count duplicate values
df1['g'] = df1.groupby('A').cumcount()
list_df['g'] = list_df.groupby('A').cumcount()
#merge together, create index from column and remove g column
df = list_df.merge(df1).set_index('index').rename_axis(None).drop('g', axis=1)
print (df)
A B
1 6 1
4 4 4
3 3 3
5 4 5
1] Generic approach for list_of_values.
In [936]: dff = df[df.A.isin(list_of_values)]
In [937]: dff.reindex(dff.A.map({x: i for i, x in enumerate(list_of_values)}).sort_values().index)
Out[937]:
A B
2 3 3
3 4 5
1 6 2
2] If list_of_values is sorted, you can use:
In [926]: df[df.A.isin(list_of_values)].sort_values(by='A')
Out[926]:
A B
2 3 3
3 4 5
1 6 2
The following code:
import pandas as pd
from io import StringIO
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")
pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False)
produces this output:
Skipping line 4: expected 3 fields, saw 4
a b c
0 1 2 3
1 4 5 6
2 1 2 5
3 3 4 5
That is, the third data row is rejected because it contains four (and not the expected three) values. This CSV data file is considered to be malformed.
What if I instead wanted a different behavior, i.e. not skipping lines that have more fields than expected, but keeping their values in a larger dataframe?
In the given example this would be the behavior ('UNK' is just an example, might be any other string):
a b c UNK
0 1 2 3 nan
1 4 5 6 nan
2 6 7 8 9
3 1 2 5 nan
4 3 4 5 nan
This is just an example in which there is only one additional value; what about an arbitrary (and a priori unknown) number of extra fields? Is this achievable in some way through pandas read_csv?
Please note: I can do this by using csv.reader; I am just trying to switch to pandas now.
Any help/hints are appreciated.
Looks like you need the names argument when reading a CSV:
import pandas as pd
from io import StringIO
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")
df = pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False, names = ["a", "b", "c", "UNK"])
print(df)
Output:
a b c UNK
0 a b c NaN
1 1 2 3 NaN
2 4 5 6 NaN
3 6 7 8 9.0
4 1 2 5 NaN
5 3 4 5 NaN
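Note that the file's own header line ends up as a data row (index 0) above. If that is unwanted, passing header=0 together with names should tell pandas to discard the file header and use the supplied names instead (a small variation, not part of the original answer):
data.seek(0)  # rewind the StringIO buffer; it was consumed by the read above
df = pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False,
                 names=["a", "b", "c", "UNK"], header=0)
print(df)
#    a  b  c  UNK
# 0  1  2  3  NaN
# 1  4  5  6  NaN
# 2  6  7  8  9.0
# 3  1  2  5  NaN
# 4  3  4  5  NaN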
Supposing that Afile.csv contains:
a,b,c#Incomplete Header
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5,,8
The following function yields a DataFrame containing all fields:
def readRawValuesFromCSV(file1, separator=',', commentMark='#'):
    df = pd.DataFrame()
    with open(file1, 'r') as f:
        for line in f:
            b = line.strip().split(commentMark)
            if len(b[0]) > 0:
                lineList = tuple(b[0].strip().split(separator))
                df = pd.concat([df, pd.DataFrame([lineList])], ignore_index=True)
    return df
You can test it with this code:
file1 = 'Afile.csv'
# Read all values of a (maybe malformed) CSV file
df = readRawValuesFromCSV (file1, ',', '#')
That yields:
df
0 1 2 3 4
0 a b c NaN NaN
1 1 2 3 NaN NaN
2 4 5 6 NaN NaN
3 6 7 8 9 NaN
4 1 2 5 NaN NaN
5 3 4 5 8
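Since the first parsed row is the (incomplete) header from the file, a possible post-processing step (a sketch, with illustrative placeholder names for the unnamed extra columns) is to promote it to the column labels:
# Use the first row as the header; name the extra, unnamed columns UNK3, UNK4, ...
raw_header = df.iloc[0].tolist()
df.columns = [h if isinstance(h, str) else 'UNK%d' % i
              for i, h in enumerate(raw_header)]
df = df.iloc[1:].reset_index(drop=True)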
I am indebted to herrfz for his answer in
Handling Variable Number of Columns with Pandas - Python. The present question might be a generalization of that one.
I have 2 worksheets in Excel. They both contain 3 columns: a, b and c. I need to delete any row in worksheet1 if the data items for columns a, b, c are the same between the two worksheets. How would I do this using the pandas Python library?
import pandas as pd
ws1 = pd.read_excel('pathname/worksheet1.xlsx')
ws2 = pd.read_excel('pathname/worksheet2.xlsx')
Basically worksheet1 looks something like this (dummy numbers; assume they're different in the actual data):
a b c d e f
1 2 3 4 4 4
1 2 3 4 4 4
1 2 3 4 4 4
1 2 3 4 4 4
1 2 3 4 4 4
worksheet2 looks something like this:
a b f d e c
1 2 4 4 4 3
1 2 4 4 4 3
1 2 4 4 4 3
1 2 4 4 4 3
1 2 4 4 4 3
I have to check columns a,b and c in worksheet1 and if the same data shows up in worksheet2, I would delete that row in worksheet1.
For example, in worksheet1 the values 1,2 and 3 are returned for columns a,b and c. I need to check if 1,2 and 3 show up in columns a,b and c in worksheet2 (located differently). If they do show up in worksheet2, I need to delete the row in worksheet1 with the values 1,2 and 3.
Try this (assuming that worksheet1 and worksheet2 are two separate Excel files):
df1 = pd.read_excel('/path/to/file_name1.xlsx')
df2 = pd.read_excel('/path/to/file_name2.xlsx')
df1 = df1[~df1.email.isin(df2.email)]
The third line of code removes those rows from df1 which are found in df2 (assuming that the column name is email in both DataFrames).
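Since the question asks to match on all three columns a, b and c rather than a single one, a possible extension (a sketch, not part of the original answer) is a left merge with indicator=True, keeping only the rows of worksheet1 that find no match in worksheet2:
import pandas as pd

ws1 = pd.read_excel('pathname/worksheet1.xlsx')
ws2 = pd.read_excel('pathname/worksheet2.xlsx')

# Merge on the three key columns; '_merge' tells which side each row came from.
merged = ws1.merge(ws2[['a', 'b', 'c']].drop_duplicates(),
                   on=['a', 'b', 'c'], how='left', indicator=True)
ws1 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')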