The following code:
import pandas as pd
from io import StringIO
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")
pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False)
produces this output:
Skipping line 4: expected 3 fields, saw 4
a b c
0 1 2 3
1 4 5 6
2 1 2 5
3 3 4 5
That is, the third data row is rejected because it contains four values instead of the expected three. This CSV file is considered malformed.
What if I wanted a different behavior instead, i.e. not skipping the lines that have more fields than expected, but keeping their values in a larger DataFrame?
In the given example this would be the desired result ('UNK' is just an example; it might be any other string):
a b c UNK
0 1 2 3 nan
1 4 5 6 nan
2 6 7 8 9
3 1 2 5 nan
4 3 4 5 nan
This is just an example in which there is only one additional value; what about an arbitrary (and a priori unknown) number of extra fields? Can this be obtained in some way through pandas read_csv?
Please note: I can already do this with csv.reader; I am just trying to switch to pandas now.
Any help/hints are appreciated.
Looks like you need the names argument when reading the CSV:
import pandas as pd
from io import StringIO
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")
df = pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False, names = ["a", "b", "c", "UNK"])
print(df)
Output:
a b c UNK
0 a b c NaN
1 1 2 3 NaN
2 4 5 6 NaN
3 6 7 8 9.0
4 1 2 5 NaN
5 3 4 5 NaN
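Note that because names is passed without header=0 (or skiprows=1), the original header line is read as an ordinary data row, which is why 'a b c' shows up as row 0 above. The names list also has to be at least as long as the widest row, which you said is not known a priori. A small sketch of one way around both issues, scanning the data once to find the maximum field count and generating placeholder names (UNK_1, UNK_2, ... are just hypothetical names) before calling read_csv:

import pandas as pd
from io import StringIO

raw = """a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5"""

# First pass: find the widest row so we know how many columns to allocate
max_fields = max(len(line.split(',')) for line in raw.splitlines())

# Real header plus placeholder names for any extra fields
header = raw.splitlines()[0].split(',')
names = header + [f'UNK_{i}' for i in range(1, max_fields - len(header) + 1)]

# Second pass: read everything, skipping the header line we already consumed
df = pd.read_csv(StringIO(raw), skiprows=1, names=names)
print(df)

For a file on disk the same two-pass idea applies, with the first pass reading the file line by line.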
Supposing that Afile.csv contains:
a,b,c#Incomplete Header
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5,,8
The following function yields a DataFrame containing all fields:
import pandas as pd

def readRawValuesFromCSV(file1, separator=',', commentMark='#'):
    """Read every field of a (possibly malformed) CSV file into a DataFrame."""
    df = pd.DataFrame()
    with open(file1, 'r') as f:
        for line in f:
            # Keep only the part of the line before any comment mark
            b = line.strip().split(commentMark)
            if len(b[0]) > 0:
                lineList = tuple(b[0].strip().split(separator))
                df = pd.concat([df, pd.DataFrame([lineList])], ignore_index=True)
    return df
You can test it with this code:
file1 = 'Afile.csv'
# Read all values of a (maybe malformed) CSV file
df = readRawValuesFromCSV(file1, ',', '#')
That yields:
df
   0  1  2    3    4
0  a  b  c  NaN  NaN
1  1  2  3  NaN  NaN
2  4  5  6  NaN  NaN
3  6  7  8    9  NaN
4  1  2  5  NaN  NaN
5  3  4  5         8
I am indebted to herrfz for his answer in
Handling Variable Number of Columns with Pandas - Python. The present question might be a generalization of that one.
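As a side note, calling pd.concat once per line grows the DataFrame repeatedly and gets slow on large files; collecting the rows in a plain list and building the DataFrame once at the end is usually faster. A sketch of that variant of the same function (same intended behavior, not tested against every malformed input):

import pandas as pd

def readRawValuesFromCSV(file1, separator=',', commentMark='#'):
    rows = []
    with open(file1, 'r') as f:
        for line in f:
            # Keep only the part of the line before any comment mark
            b = line.strip().split(commentMark)
            if len(b[0]) > 0:
                rows.append(b[0].strip().split(separator))
    # Build the DataFrame in one go; shorter rows are padded with NaN
    return pd.DataFrame(rows)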
Related
I have this code that can merge multiple CSV files by row, with a header already present in each file:
import csv
import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
similarity = ['(add)']
for csv_file in sorted(interesting_files):
    if any(pattern in csv_file for pattern in similarity):
        df_list.append(pd.read_csv(csv_file))  # pass header=None here if the files had no header
df = pd.concat(df_list, ignore_index=True, axis=0)
df.to_csv("_Combined_.csv", index=False)
But the index column of the output just shows numbers in order; I want to set the index to a different category label for each row. How can I do it?
Current CSV output:
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 1 2 3
I have tried creating a list of index labels and merging them together, but as I suspected, the result looks like this:
a b c d
mel
bou
rne
0 1 2 3 4
1 5 6 7 8
2 9 1 2 3
Here is my expected output:
a b c d
mel 1 2 3 4
bou 5 6 7 8
rne 9 1 2 3
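One straightforward approach, assuming the category labels are available as a list in the same order as the rows (the labels below are just the ones from the example), is to assign them as the index before writing the combined CSV:

import pandas as pd

# Stand-in for the concatenated frame built by the code above
df = pd.DataFrame({'a': [1, 5, 9], 'b': [2, 6, 1], 'c': [3, 7, 2], 'd': [4, 8, 3]})
labels = ['mel', 'bou', 'rne']   # assumed: one label per row, in row order

df.index = labels                         # replace the default 0, 1, 2 index
df.to_csv("_Combined_.csv", index=True)   # keep the labels in the output file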
I am trying to add a column to a dataset based on a dictionary applied to one of its existing columns. But after trying the code below, I am getting NaN in the new column even though no values are missing from the column the dictionary is keyed on.
Code:
import pandas as pd
df = pd.read_csv('test.csv')
val_dict = {'1':'8','2':'5','3':'3','4':'2'}
df['val2'] = df['val'].map(val_dict)
df
The output I am getting is
val val2
Based on your df, I assume the column val contains integer values, but the dictionary you presented above has str keys.
So change the dict keys from str to int (i.e. val_dict = {1: '8', 2: '5', 3: '3', 4: '2'}).
E.g. 1 (str keys, shows the problem):
df = pd.DataFrame({'val' : [1,2,2,1,2,3,3,4]})
val_dict = {'1':'8','2':'5','3':'3','4':'2'}
df['val_2'] = df['val'].map(val_dict)
print(df)
val val_2
0 1 NaN
1 2 NaN
2 2 NaN
3 1 NaN
4 2 NaN
5 3 NaN
6 3 NaN
7 4 NaN
E.g. 2 (corrected dict, expected results):
df = pd.DataFrame({'val' : [1,2,2,1,2,3,3,4]})
val_dict = {1:'8',2:'5',3:'3',4:'2'}
df['val_2'] = df['val'].map(val_dict)
print(df)
val val_2
0 1 8
1 2 5
2 2 5
3 1 8
4 2 5
5 3 3
6 3 3
7 4 2
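Alternatively, if you would rather leave the dictionary untouched, you can cast the column to str before mapping; a minimal sketch of that variant:

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 2, 1, 2, 3, 3, 4]})
val_dict = {'1': '8', '2': '5', '3': '3', '4': '2'}

# Cast the integer column to str so it matches the str keys of the dict
df['val_2'] = df['val'].astype(str).map(val_dict)
print(df)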
This sounds a bit weird, but I think that's exactly what I needed now:
I have several pandas DataFrames that contain columns of float numbers, for example:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
Now I want to add a column with a value in only one row, equal to the average of column 'a', which in this case is 3.0. The new DataFrame should look like this:
a b c average
0 0 1 2 3.0
1 3 4 5
2 6 7 8
And all the rows below are empty.
I've tried things like df['average'] = np.mean(df['a']) but that gives me a whole column of 3.0. Any help will be appreciated.
Assign a Series; this is cleaner:
df['average'] = pd.Series(df['a'].mean(), index=df.index[[0]])
Or, even better, assign with loc:
df.loc[df.index[0], 'average'] = df['a'].mean().item()
Filling the NaNs is straightforward; you can do:
df['average'] = df['average'].fillna('')
df
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
You can do something like:
df['average'] = [np.mean(df['a'])]+['']*(len(df)-1)
Here is a full example:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[(0,1,2), (3,4,5), (6,7,8)],
columns=['a', 'b', 'c'])
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df['average'] = ''
df.loc[0, 'average'] = df['a'].mean()  # use .loc to avoid chained-assignment warnings
print(df)
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
I have 5 different data frames that I'd like to output to a single text file one after the other.
Because of my specific purpose, I do not want a delimiter.
What is the fastest way to do this?
Example:
Below are the 5 dataframes; a space indicates a new column.
1st df AAA 1 2 3 4 5 6
2nd BBB 1 2 3 4 5 6 7 8 9 10
3rd CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
4th DDD 1 2 3 4 5 6 2 3 4 5
5th EEE 1 2 3 4 5 6 7 8 9 10 1 2 2
I'd like to convert the above into the following in a single text file:
AAA123456
BBB12345678910
CCC12345667122334512
CCC12345667122334512
DDD1234562345
EEE12345678910122
Notice that the column separators are just removed, but the rows are preserved as new lines.
I've tried googling around, but to_csv seems to require a delimiter, and I also came across a few solutions using "with open" and "write", but those seem to require iterating through every row in the dataframe.
Appreciate any ideas!
Thanks,
You can combine the dataframes with pd.concat.
import pandas as pd
df1 = pd.DataFrame({'AAA': [1, 2, 3, 4, 5, 6]}).T
df2 = pd.DataFrame({'BBB': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}).T
df_all = pd.concat([df1, df2])
Output (which you can format as you see necessary):
0 1 2 3 4 5 6 7 8 9
AAA 1 2 3 4 5 6 NaN NaN NaN NaN
BBB 1 2 3 4 5 6 7.0 8.0 9.0 10.0
Write to CSV [edit: with a creative delimiter]:
df_all.to_csv('df_all.csv', sep = ',')
Read the CSV back in and remove the commas:
with open('df_all.csv', mode='r') as file_in:
    with open('df_all_no_spaces.txt', mode='w') as file_out:
        text = file_in.read()
        text = text.replace(',', '')
        file_out.write(text)
There's gotta be a more elegant way to do that last bit, but this works. Perhaps for good reason, pandas doesn't support exporting to CSV with no delimiter.
Edit: you can write to CSV with commas and then remove them. :)
To gather several data frames into a single text file, do:
whole_curpos = ''
# read every dataframe
for df in dataframe_list:
    # concatenate all of the columns into a single text column
    cols = list(df.columns)
    df['whole_text'] = df[cols[0]].astype(str)
    for col in cols[1:]:
        df['whole_text'] = df['whole_text'] + df[col].astype(str)
    # append each row's text on its own line
    for row in range(df.shape[0]):
        whole_curpos = whole_curpos + df['whole_text'].iloc[row]
        whole_curpos = whole_curpos + '\n'
# whole_curpos can then be written to the output file with a plain open()/write()
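For reference, a more compact variant of the same idea, assuming the DataFrames are shaped like the transposed frames in the first answer (row label such as AAA in the index, integer values in the cells):

import pandas as pd

df1 = pd.DataFrame({'AAA': [1, 2, 3, 4, 5, 6]}).T
df2 = pd.DataFrame({'BBB': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}).T

with open('combined.txt', 'w') as out:
    for df in [df1, df2]:
        for label, row in df.iterrows():
            # concatenate the row label and every value with no delimiter
            out.write(str(label) + ''.join(row.astype(str)) + '\n')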
While reading a text file into a pandas DataFrame, what should I do to exclude the first column and read in the rest?
Code I am currently using:
dframe_main = pd.read_table('/Users/ankit/Desktop/input.txt', sep=',')
Would it suffice to just delete the column after you've read it in? This is functionally the same as excluding the first column from the read. Here's a toy example:
import numpy as np
import pandas as pd
data = np.array([[1,2,3,4,5], [2,2,2,2,2], [3,3,3,3,3], [4,4,3,4,4], [7,2,3,4,5]])
columns = ["one", "two", "three", "four", "five"]
dframe_main = pd.DataFrame(data=data, columns=columns)
print "All columns:"
print dframe_main
del dframe_main[dframe_main.columns[0]] # get rid of the first column
print "All columns except the first:"
print dframe_main
Output is:
All columns:
one two three four five
0 1 2 3 4 5
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 3 4 4
4 7 2 3 4 5
All columns except the first:
two three four five
0 2 3 4 5
1 2 2 2 2
2 3 3 3 3
3 4 3 4 4
4 2 3 4 5
I would recommend using the usecols parameter:
usecols : array-like, default None. Return a subset of the columns.
Results in much faster parsing time and lower memory usage.
Assuming that your file has 5 columns:
In [32]: list(range(5))[1:]
Out[32]: [1, 2, 3, 4]
dframe_main = pd.read_table('/Users/ankit/Desktop/input.txt', sep=',', usecols=list(range(5))[1:])
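If the number of columns is not known in advance, one option (a sketch, assuming the same comma-separated file) is to read only the header first and then reuse every column name except the first:

import pandas as pd

path = '/Users/ankit/Desktop/input.txt'

# Read only the header row to discover the column names
all_cols = pd.read_csv(path, sep=',', nrows=0).columns

# Re-read the file, keeping every column except the first
dframe_main = pd.read_csv(path, sep=',', usecols=all_cols[1:])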