Issue reading a partial-header CSV using pandas.read_csv - python

I'm trying to read a CSV file using pandas.read_csv when the file's header is only partially filled in, i.e., only some columns have names and the rest are empty.
When I read the data frame and take its values with .iloc, the columns without header names are missing from the result.
The reason some columns have no names is that the number of columns varies, and I did not assign a name to each one.
Here's an example of the code, input file, and output:
import pandas

dataframe = pandas.read_csv('filename.csv', sep=',', header=0)
dataframe = dataframe.iloc[::]
dataset = dataframe.values[:, 0:]
Input file:
A B C
3 5 0 1 2 3
3 5 4 5 6 7
3 5 8 9 10 11
3 5 12 13 14 15
Dataset output:
dataset = [[1, 2, 3], [5, 6, 7], [9, 10, 11], [13, 14, 15]]
How can I get the dataframe to use the entire array (without the header)?

I think you need .values to get back a NumPy ndarray.
import pandas as pd
from io import StringIO

csv_file = StringIO("""A B C
3 5 0 1 2 3
3 5 4 5 6 7
3 5 8 9 10 11
3 5 12 13 14 15""")

df = pd.read_csv(csv_file, sep=r'\s+', engine='python')
df.values
Output:
array([[ 1, 2, 3],
[ 5, 6, 7],
[ 9, 10, 11],
[13, 14, 15]])
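What's happening here: when the header row has fewer fields than the data rows, read_csv promotes the extra leading columns to the index, which is why .values keeps only the named columns. A quick way to see this on the df built above (the exact repr varies by pandas version):
# the three unnamed leading columns were promoted to a MultiIndex
print(df.index)
# MultiIndex([(3, 5,  0),
#             (3, 5,  4),
#             (3, 5,  8),
#             (3, 5, 12)])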

Why not skiprows=1 when you load the csv file?
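A minimal sketch of that suggestion, reusing the StringIO sample from the answer above: with header=None plus skiprows=1, pandas numbers the columns itself, no leading columns get promoted to the index, and .values returns the full 6-column array.
import pandas as pd
from io import StringIO

csv_file = StringIO("""A B C
3 5 0 1 2 3
3 5 4 5 6 7
3 5 8 9 10 11
3 5 12 13 14 15""")

# skip the partial header entirely and let pandas number the columns 0..5
df = pd.read_csv(csv_file, sep=r'\s+', engine='python', header=None, skiprows=1)
print(df.values)
# array([[ 3,  5,  0,  1,  2,  3],
#        [ 3,  5,  4,  5,  6,  7],
#        [ 3,  5,  8,  9, 10, 11],
#        [ 3,  5, 12, 13, 14, 15]])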

Related

How to convert values like '2+3' in a Python Pandas column to its aggregated value

I have a column in a DataFrame named fatalities in which a few of the values look like this:
data['fatalities'] = [1, 4, , 10, 1+8, 5, 2+9, , 16, 4+5]
I want values like '1+8' and '2+9' to be converted to their aggregated value, i.e.,
data['fatalities'] = [1, 4, , 10, 9, 5, 11, , 16, 9]
I'm not sure how to write code to perform this aggregation on a DataFrame column in pandas. When I tried the code below, it threw an error.
def addition(col):
    col = col.split('+')
    col = int(col[0]) + int(col[1])
    return col

data['fatalities'] = [addition(row) for row in data['fatalities']]
Error:
IndexError: list index out of range
Use pandas.eval, which is different from pure Python eval. (Your addition helper fails because values without '+' split into a single-element list, so col[1] raises IndexError.)
data['fatalities'] = pd.eval(data['fatalities'])
print(data)
fatalities
0 1
1 4
2 10
3 9
4 5
5 11
6 16
7 9
But this works only up to 100 rows, because of a bug:
AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'
The solution then is:
data['fatalities'] = data['fatalities'].apply(pd.eval)
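A self-contained sketch of that per-element approach, using a hypothetical frame that mirrors the question's column (the empty slots are left out here, since pd.eval needs a parsable string):
import pandas as pd

data = pd.DataFrame({'fatalities': ['1', '4', '10', '1+8', '5', '2+9', '16', '4+5']})
# pd.eval runs once per element: '4' -> 4, '1+8' -> 9, '2+9' -> 11, ...
data['fatalities'] = data['fatalities'].apply(pd.eval)
print(data)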
Use .map with .astype(str) to force conversion if you have mixed data types:
df['fatalities'] = df['fatalities'].astype(str).map(eval)
print(df)
fatalities
0 1
1 4
2 10
3 9
4 5
5 11
6 16
7 9

Frequency table with disorganized data using pandas

I have an Excel file with different numbers and I open it using pandas.
When I read and then print the xlsx file, I get something like this:
5 7 7
0 6 16 5
1 10 12 15
2 1 5 6
3 5 6 18
. . . .
. . . .
n . . n
All I need is to distribute them into different intervals according to their frequencies.
My code is:
import pandas as pd
excel_archive = pd.read_excel("file name")
print(excel_archive)
I think the Excel file has no header, so first add header=None to read_excel, and then use DataFrame.stack with Series.value_counts:
excel_archive = pd.read_excel("file name", header=None)
s = excel_archive.stack().value_counts()
print(s)
5 4
6 3
7 2
15 1
12 1
10 1
18 1
1 1
16 1
dtype: int64
Your question is not very clear, but if you just have to count the number of occurrences you can try something like this:
import numpy as np
import pandas as pd

# generate a dataframe
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 4], [7, 8, 9], [1, 5, 2], [7, 9, 9]]),
                  columns=['a', 'b', 'c'])
# flatten the array
df_flat = df.stack().reset_index(drop=True)
# count the number of occurrences
df_flat.groupby(df_flat).size()
This is the input:
a b c
0 1 2 3
1 4 5 4
2 7 8 9
3 1 5 2
4 7 9 9
And this is the output:
1 2
2 2
3 1
4 2
5 2
7 2
8 1
9 3
If instead you want to divide the values into predefined intervals, you can use pd.cut together with groupby:
# define intervals
intervals = pd.IntervalIndex.from_arrays([0, 3, 6], [3, 6, 9], closed='right')
# cut and groupby
df_flat.groupby(pd.cut(df_flat, intervals)).size()
and the result would be:
(0, 3] 5
(3, 6] 4
(6, 9] 6
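If evenly spaced intervals are enough, a shorter variant (assuming the same df_flat as above) lets pd.cut compute the bin edges itself:
# three equal-width bins derived from the data's min and max
df_flat.groupby(pd.cut(df_flat, bins=3)).size()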

How to convert pandas series with numpy array values to dataframe

I have a pandas Series whose values are numpy arrays. For simplicity, say:
series = pd.Series([np.array([1,2,3,4]), np.array([5,6,7,8]), np.array([9,10,11,12])], index=['file1', 'file2', 'file3'])
file1 [1, 2, 3, 4]
file2 [5, 6, 7, 8]
file3 [9, 10, 11, 12]
How can I expand it to a dataframe of the form df_concatenated:
0 1 2 3
file1 1 2 3 4
file2 5 6 7 8
file3 9 10 11 12
A wider version of the same problem: the series is actually obtained from a different dataframe of the form:
DataFrame:
0 1
file slide
file1 1 1 2
2 3 4
file2 1 5 6
2 7 8
file3 1 9 10
2 11 12
by grouping on the 'file' index level and concatenating the columns:
def concat_sublevel(data):
    return np.concatenate(data.values)

series = data.groupby(level=[0]).apply(concat_sublevel)
Maybe somebody sees a better way to get from the dataframe data to df_concatenated.
Caveat: the slide sub-index can have a different number of values for different file values. In that case I need to repeat one of the rows to get the same dimensions in all resulting rows.
You can try using pandas DataFrame.from_records:
pd.DataFrame.from_records(series.values, index=series.index)
Out:
0 1 2 3
file1 1 2 3 4
file2 5 6 7 8
file3 9 10 11 12
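Putting the two steps together, a minimal end-to-end sketch; the MultiIndex here is a hypothetical reconstruction of the question's wider frame:
import numpy as np
import pandas as pd

# rebuild the question's wider frame with a file/slide MultiIndex
idx = pd.MultiIndex.from_product([['file1', 'file2', 'file3'], [1, 2]],
                                 names=['file', 'slide'])
data = pd.DataFrame(np.arange(1, 13).reshape(6, 2), index=idx)

# concatenate each file's rows into one flat array, then expand to columns
series = data.groupby(level=0).apply(lambda g: np.concatenate(g.values))
df_concatenated = pd.DataFrame.from_records(series.values, index=series.index)
print(df_concatenated)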

Write multiple dataframes to a single text file without any delimiters

I have 5 different data frames that I'd like to output to a single text file one after the other.
Because of my specific purpose, I do not want a delimiter.
What is the fastest way to do this?
Example:
Below are 5 dataframes; a space indicates a new column.
1st df AAA 1 2 3 4 5 6
2nd BBB 1 2 3 4 5 6 7 8 9 10
3rd CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
4th DDD 1 2 3 4 5 6 2 3 4 5
5th EEE 1 2 3 4 5 6 7 8 9 10 1 2 2
I'd like to convert the above to the below in a single text file:
AAA123456
BBB12345678910
CCC12345667122334512
CCC12345667122334512
DDD1234562345
EEE12345678910122
Notice that the column separators are simply removed, while rows are preserved as new lines.
I've tried googling around, but to_csv seems to require a delimiter. I also came across a few solutions using "with open" and "write", but those seem to require iterating through every row of the dataframe.
Appreciate any ideas!
Thanks,
You can combine the dataframes with pd.concat.
import pandas as pd
df1 = pd.DataFrame({'AAA': [1, 2, 3, 4, 5, 6]}).T
df2 = pd.DataFrame({'BBB': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}).T
df_all = pd.concat([df1, df2])
Output (which you can format as you see necessary):
0 1 2 3 4 5 6 7 8 9
AAA 1 2 3 4 5 6 NaN NaN NaN NaN
BBB 1 2 3 4 5 6 7.0 8.0 9.0 10.0
Write to CSV [edit: with a creative delimiter]:
df_all.to_csv('df_all.csv', sep=',')
Import the CSV and remove spaces:
with open('df_all.csv', mode='r') as file_in:
    with open('df_all_no_spaces.txt', mode='w') as file_out:
        text = file_in.read()
        text = text.replace(',', '')
        file_out.write(text)
There's gotta be a more elegant way to do that last bit, but this works. Perhaps for good reason, pandas doesn't support exporting to CSV with no delimiter.
Edit: you can write to CSV with commas and then remove them. :)
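If you'd rather skip the intermediate CSV (and the float artifacts that pd.concat's NaN padding introduces), a minimal sketch that writes each frame's rows directly as delimiter-free lines, reusing df1 and df2 from above:
with open('df_all_no_spaces.txt', mode='w') as file_out:
    for df in (df1, df2):
        for label, row in df.iterrows():
            # row label first, then the cells joined with no separator
            file_out.write(str(label) + ''.join(row.astype(str)) + '\n')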
To gather several dataframes into a single text file, you could do:
whole_corpus = ''
# read every dataframe
for df in dataframe_list:
    # gather all the columns into a single text column
    df['whole_text'] = df[col0].astype(str) + df[col1].astype(str) + ... + df[coln].astype(str)
    for row in range(df.shape[0]):
        whole_corpus = whole_corpus + df['whole_text'].iloc[row]
        whole_corpus = whole_corpus + '\n'
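To finish the job, the accumulated string still has to be written to disk; assuming the whole_corpus built above:
with open('output.txt', mode='w') as f:
    f.write(whole_corpus)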

Load columns x1, x2, x3, ... with headers 0, 1, ...

I'm new to Python, and having an extremely frustrating problem. I need to load columns 1-12 of a csv file (so not the 0th column), but I need to skip the file's own header and overwrite it with 0,1,...,11.
I need to use pandas.read_csv() for this.
Basically, my csv is:
"a", "b", "c", ..., "l"
1, 2, 3, ..., 12
1, 2, 3, ..., 12
and I want to load it as a dataframe such that
dataframe[0] = 2,2,2,..
dataframe[1] = 3,3,3..
ergo skipping the first column and making the dataframe columns start at index 0.
I've tried setting usecols=[1,2,3...], but then the column labels are 1,2,3,....
Any help would be appreciated.
You can use header=<int> to skip the header lines, usecols=range(1,12) to grab the last 11 columns, and names=range(11) to name those 11 columns from 0 to 10.
Here is a fake dataset:
This is the header. Header header header.
And the second header line.
a,b,c,d,e,f,g,h,i,j,k,l
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
Using the code:
> df = pd.read_csv('data_file.csv', usecols=range(1,12), names=range(11), header=2)
> df
# returns:
0 1 2 3 4 5 6 7 8 9 10
0 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12
2 2 3 4 5 6 7 8 9 10 11 12
> df[0]
# returns:
0 2
1 2
2 2
