Load columns x1, x2, x3, ... with headers 0, 1, ... - python

I'm new to Python and having an extremely frustrating problem. I need to load columns 1-12 of a CSV file (so not the 0th column), but I need to skip the file's header and overwrite it with 0, 1, ..., 11.
I need to use pandas.read_csv() for this.
Basically, my CSV is:
"a", "b", "c", ..., "l"
1, 2, 3, ..., 12
1, 2, 3, ..., 12
and I want to load it as a dataframe such that
dataframe[0] = 2,2,2,..
dataframe[1] = 3,3,3..
ergo skipping the first column and making the dataframe's column labels start at 0.
I've tried setting usecols=[1,2,3,...], but then the labels are 1, 2, 3, ....
Any help would be appreciated.

You can use header=&lt;n&gt; to mark which line holds the (old) header so everything above it is skipped, usecols=range(1,12) to grab the last 11 columns, and names=range(11) to label those 11 columns 0 through 10.
Here is a fake dataset:
This is the header. Header header header.
And the second header line.
a,b,c,d,e,f,g,h,i,j,k,l
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
Using the code:
> df = pd.read_csv('data_file.csv', usecols=range(1,12), names=range(11), header=2)
> df
# returns:
0 1 2 3 4 5 6 7 8 9 10
0 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12
2 2 3 4 5 6 7 8 9 10 11 12
> df[0]
# returns:
0 2
1 2
2 2

Related

remove unnamed columns pandas dataframe

I'm a student and have a problem that I can't figure out how to solve. I have CSV data like this:
"","","","","","","","","",""
"","report","","","","","","","",""
"","bla1","bla2","","","","bla3","","",""
"","bla4","bla5","","","","","bla6","",""
"","bla6","bla7","bla8","","1","2","3","4","5"
"","bla9","bla10","bla11","","6","7","8","9","10"
"","bla12","bla13","bla14","","11","12","13","14","15"
"","","","","","","","","",""
Code for reading a CSV like this:
SMT = pd.read_csv('file.csv', usecols=(5,6,7,8), skiprows=(1,2,3), nrows=3)
SMT.fillna(0, inplace=True)
SMT prints out:
Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8
0 1 2 3 4
1 6 7 8 9
2 11 12 13 14
Expected output:
1 2 3 4
6 7 8 9
11 12 13 14
I already tried skiprows=(0,1,2,3), but then it looks like this:
1 2 3 4
0 6 7 8 9
1 11 12 13 14
2 0 0 0 0
I already tried putting index=False, i.e. SMT = pd.read_csv('file.csv', index=False, usecols=(5,6,7,8), skiprows=(1,2,3), nrows=3), and also index_col=0/None/False, but none of that works. The last thing I tried was:
df1 = SMT.loc[:, ~SMT.columns.str.contains('^Unnamed')]
and I got
Empty DataFrame
Columns: []
Index: [0, 1, 2]
I just want to get rid of Unnamed: 5 ~ Unnamed: 8. What is the correct way to get rid of this Unnamed thing?
The "unnamed" just says, that pandas does not know how to name the columns. So these are just names. You could set the names like this in the read_csv
pd.read_csv("test.csv", usecols=(5,6,7,8), skiprows=3, nrows=3, header=0, names=["c1", "c2", "c3", "c4"])
Output:
c1 c2 c3 c4
0 1 2 3 4
1 6 7 8 9
2 11 12 13 14
You have to set header=0 so that pandas treats the first row after the skipped ones as the header row (which names then replaces). Alternatively, you can set skiprows=4 so that no header row is left at all, as sketched below.
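For completeness, a minimal sketch of that skiprows=4 variant (same test.csv as above); since no header row remains, header=None is passed explicitly:
import pandas as pd

# skip all four leading junk rows, then read the three data rows
df = pd.read_csv("test.csv", usecols=(5, 6, 7, 8), skiprows=4, nrows=3,
                 header=None, names=["c1", "c2", "c3", "c4"])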
Just assign new column names:
df = pd.read_csv('temp.csv', usecols=[5,6,7,8], skiprows=[1,2,3], nrows=3)
df.columns = range(1, 1+len(df.columns))

How do I convert my 2D numpy array to a pandas dataframe with given categories?

I have an array called 'values' which features 2 columns of mean reaction time data from 10 individuals. The first column refers to data collected for a single individual in condition A, the second for that same individual in condition B:
array([[451.75 , 488.55555556],
[552.44444444, 590.40740741],
[629.875 , 637.62962963],
[454.66666667, 421.88888889],
[637.16666667, 539.94444444],
[538.83333333, 516.33333333],
[463.83333333, 448.83333333],
[429.2962963 , 497.16666667],
[524.66666667, 458.83333333]])
I would like to plot these data using seaborn, to display the mean values and connected single values for each individual across the two conditions. What is the simplest way to convert the array 'values' into a 3 column DataFrame, whereby one column features all the values, another features a label distinguishing that value as condition A or condition B, and a final column which provides a number for each individual (i.e., 1-10)? For example, as follows:
Value Condition Individual
451.75 A 1
488.56 B 1
552.44 A 2
...etc
melt
You can do that using pd.melt:
pd.DataFrame(data, columns=['A','B']).reset_index().melt(id_vars = 'index')\
.rename(columns={'index':'Individual'})
Individual variable value
0 0 A 451.750000
1 1 A 552.444444
2 2 A 629.875000
3 3 A 454.666667
4 4 A 637.166667
5 5 A 538.833333
6 6 A 463.833333
7 7 A 429.296296
8 8 A 524.666667
9 0 B 488.555556
10 1 B 590.407407
11 2 B 637.629630
12 3 B 421.888889
13 4 B 539.944444
14 5 B 516.333333
15 6 B 448.833333
16 7 B 497.166667
17 8 B 458.833333
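If you want the exact column names and 1-based numbering asked for in the question, a small variation of the same melt idea should do it (a sketch; data again stands for the 2-column array, which the question calls values):
import pandas as pd

# data: the 2-column array from the question (called 'values' there)
df = (pd.DataFrame(data, columns=['A', 'B'])
        .reset_index()
        .melt(id_vars='index', var_name='Condition', value_name='Value')
        .rename(columns={'index': 'Individual'}))
df['Individual'] += 1  # number individuals from 1 rather than 0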
This should work
import pandas as pd
import numpy as np
np_array = np.array([[451.75 , 488.55555556],
[552.44444444, 590.40740741],
[629.875 , 637.62962963],
[454.66666667, 421.88888889],
[637.16666667, 539.94444444],
[538.83333333, 516.33333333],
[463.83333333, 448.83333333],
[429.2962963 , 497.16666667],
[524.66666667, 458.83333333]])
pd_df = pd.DataFrame(np_array, columns=["A", "B"])
num_individuals = len(pd_df.index)
pd_df = pd_df.melt()
pd_df["INDIVIDUAL"] = [(i)%(num_individuals) + 1 for i in pd_df.index]
pd_df
variable value INDIVIDUAL
0 A 451.750000 1
1 A 552.444444 2
2 A 629.875000 3
3 A 454.666667 4
4 A 637.166667 5
5 A 538.833333 6
6 A 463.833333 7
7 A 429.296296 8
8 A 524.666667 9
9 B 488.555556 1
10 B 590.407407 2
11 B 637.629630 3
12 B 421.888889 4
13 B 539.944444 5
14 B 516.333333 6
15 B 448.833333 7
16 B 497.166667 8
17 B 458.833333 9

Write multiple dataframes to a single text file without any delimiters

I have 5 different data frames that I'd like to output to a single text file one after the other.
Because of my specific purpose, I do not want a delimiter.
What is the fastest way to do this?
Example:
Below are 5 dataframes. Space indicates new column.
1st df AAA 1 2 3 4 5 6
2nd BBB 1 2 3 4 5 6 7 8 9 10
3rd CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
4th DDD 1 2 3 4 5 6 2 3 4 5
5th EEE 1 2 3 4 5 6 7 8 9 10 1 2 2
I'd like convert the above to below in a single text file:
AAA123456
BBB12345678910
CCC12345667122334512
CCC12345667122334512
DDD1234562345
EEE12345678910122
Notice that columns are just removed but rows are preserved with new lines.
I've tried googling around, but to_csv seems to require a delimiter, and the solutions I came across using "with open" and "write" seem to require iterating through every row in the dataframe.
Appreciate any ideas!
Thanks,
You can combine the dataframes with pd.concat.
import pandas as pd
df1 = pd.DataFrame({'AAA': [1, 2, 3, 4, 5, 6]}).T
df2 = pd.DataFrame({'BBB': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}).T
df_all = pd.concat([df1, df2])
Output (which you can format as you see necessary):
0 1 2 3 4 5 6 7 8 9
AAA 1 2 3 4 5 6 NaN NaN NaN NaN
BBB 1 2 3 4 5 6 7.0 8.0 9.0 10.0
Write to CSV [edit: with a creative delimiter]:
df_all.to_csv('df_all.csv', sep = ',')
Import the CSV and remove spaces:
with open('df_all.csv', mode = 'r') as file_in:
    with open('df_all_no_spaces.txt', mode = 'w') as file_out:
        text = file_in.read()
        text = text.replace(',', '')
        file_out.write(text)
There's gotta be a more elegant way to do that last bit, but this works. Perhaps for good reason, pandas doesn't support exporting to CSV with no delimiter.
Edit: you can write to CSV with commas and then remove them. :)
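A slightly tidier variant of that last bit (still just a sketch, and it assumes all values are whole numbers so the float padding that concat introduces can be cast back to int): build each line directly from df_all and skip the intermediate CSV, with df_all_no_delim.txt as an example output name.
with open('df_all_no_delim.txt', 'w') as f:
    for name, row in df_all.iterrows():
        # drop the NaN padding, cast back to int, and concatenate with no delimiter
        f.write(name + ''.join(str(int(v)) for v in row.dropna()) + '\n')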
To gather several dataframes into a single text file, do:
whole_curpos = ''
# read every dataframe
for df in dataframe_list:
    # gather all the columns into a single string column
    df['whole_text'] = df.astype(str).apply(''.join, axis=1)
    for row in range(df.shape[0]):
        whole_curpos = whole_curpos + df['whole_text'].iloc[row]
        whole_curpos = whole_curpos + '\n'
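The loop above only builds the text in memory; you would then write it out once at the end, e.g. (output.txt is just an example name):
with open('output.txt', 'w') as f:
    f.write(whole_curpos)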

Issue with reading partial header CSV using pandas.read_csv

I'm trying to read a CSV file using pandas.read_csv when the file's header is not full, i.e., only some columns have names and others are empty.
When reading the data frame using .iloc I only get the columns for which the header does not have any names.
The reason some columns do not have names is that the number of columns is variable and I did not assign a name to each one.
Here's an example of the code, input file, and output:
dataframe = pandas.read_csv('filename.csv', sep = ",", header = 0)
dataframe = dataframe.iloc[::]
dataset = dataframe.values[:,0:]
input file
A B C
3 5 0 1 2 3
3 5 4 5 6 7
3 5 8 9 10 11
3 5 12 13 14 15
dataset output
dataset = [[1,2,3][5,6,7][9,10,11][13,14,15]]
How can I get dataframe to use the entire array (without the header)?
I think you need .values to get back a numpy ndarray.
import pandas as pd
from io import StringIO
csv_file = StringIO("""A B C
3 5 0 1 2 3
3 5 4 5 6 7
3 5 8 9 10 11
3 5 12 13 14 15""")
df = pd.read_csv(csv_file,sep='\s',engine='python')
df.values
Output:
array([[ 1, 2, 3],
[ 5, 6, 7],
[ 9, 10, 11],
[13, 14, 15]])
Why not skiprows=1 when you load the csv file?
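If you go that route, a minimal sketch (assuming the same whitespace-separated file as the question; header=None keeps the first data row from being consumed as a header):
import pandas as pd

df = pd.read_csv('filename.csv', sep=r'\s+', skiprows=1, header=None)
dataset = df.values  # all six columns, header row dropped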

Automating slicing procedures using pandas

I am currently using pandas and Python to handle many of the repetitive tasks I need done for my master's thesis. At this point, I have written some code (with help from Stack Overflow) that, based on some event dates in one file, finds a start and end date to use as a date range in another file. These dates are then located and appended to an empty list, which I can then output to Excel. However, using the code below I get a dataframe with 5 columns and 400,000+ rows (which is basically what I want), but not laid out the way I want when outputting to Excel. Below is my code:
end_date = pd.DataFrame(data=(df_sample['Date']-pd.DateOffset(days=2)))
start_date = pd.DataFrame(data=(df_sample['Date']-pd.offsets.BDay(n=252)))
merged_dates = pd.merge(end_date,start_date,left_index=True,right_index=True)
ff_factors = []
for index, row in merged_dates.iterrows():
    time_range = (df['Date'] > row['Date_y']) & (df['Date'] <= row['Date_x'])
    df_factor = df.loc[time_range]
    ff_factors.append(df_factor)
appended_data = pd.concat(ff_factors, axis=0)
I need the data to be 5 columns and 250 rows (the columns are variable identifiers) side by side, so that when outputting it to Excel I have, for example, columns A-D with 250 rows for each column. This then needs to be repeated for columns E-H and so on. Using iloc, I can locate the first 250 observations with appended_data.iloc[0:250], with both 5 columns and 250 rows, and then output them to Excel.
Is there any way for me to automate the process, so that after selecting the first 250 rows and outputting them to Excel, it selects the next 250 and outputs them next to the first 250, and so on?
I hope the above is precise and clear, else I'm happy to elaborate!
EDIT:
When outputting to Excel I currently get 5 columns and 407,764 rows. What I need is for this total sample to be split up as follows: the first five columns with their first 250 rows come first; when I do the next split using iloc[250:500], those next 250 rows need to be placed to the right of the initial five columns, and so on.
You can do this with a combination of np.reshape, which can be made to behave as desired on individual columns (and should be much faster than a loop through the rows), and pd.concat, which joins the dataframes it makes back together:
def reshape_appended(df, target_rows, pad=4):
    df = df.copy()  # don't modify in-place
    # the line below adds strings '0000', ..., '0004' to the column names;
    # this ensures sorting the columns preserves the order
    df.columns = [str(i).zfill(pad) + df.columns[i] for i in range(len(df.columns))]
    # target number of new columns per column in df
    target_cols = len(df.index) // target_rows
    last_group = pd.DataFrame()
    # the conditional below fires if there will be leftover rows - % is mod
    if len(df.index) % target_rows != 0:
        last_group = df.iloc[-(len(df.index) % target_rows):].reset_index(drop=True)
        df = df.iloc[:-(len(df.index) % target_rows)]  # keep rows that divide nicely
    # this is a large list comprehension, elaborated on below
    groups = [pd.DataFrame(df[col].values.reshape((target_rows, target_cols),
                                                  order='F'),
                           columns=[str(i).zfill(pad) + col for i in range(target_cols)])
              for col in df.columns]
    if not last_group.empty:  # if there are leftover rows, add them back
        last_group.columns = [pad * '9' + col for col in last_group.columns]
        groups.append(last_group)
    out = pd.concat(groups, axis=1).sort_index(axis=1)
    out.columns = out.columns.str[2 * pad:]  # remove the extra characters in the column names
    return out
last_group takes care of any rows that don't divide evenly into sets of 250. The playing around with column names enforces proper sorting order.
df[col].values.reshape((target_rows, target_cols), order='F')
Reshapes the values in the column col of df into the shape specified by the tuple (target_rows, target_cols), with the ordering Fortran uses, indicated by F.
columns=[str(i).zfill(pad)+col for i in range(target_cols)]
is just giving names to these columns, with an eye to establishing the proper ordering afterward.
Ex:
df = pd.DataFrame(np.random.randint(0, 10, (23, 3)), columns=list('abc'))
reshape_appended(df, 5)
Out[160]:
a b c a b c a b c a b c a b c
0 8 3 0 4 1 9 5 4 7 2 3 4 5.0 7.0 2.0
1 1 6 1 3 5 1 1 6 0 5 9 4 6.0 0.0 1.0
2 3 1 3 4 3 8 9 3 9 8 7 8 7.0 3.0 2.0
3 4 0 1 5 5 6 6 4 4 0 0 3 NaN NaN NaN
4 9 7 3 5 7 4 6 5 8 9 5 5 NaN NaN NaN
df
Out[161]:
a b c
0 8 3 0
1 1 6 1
2 3 1 3
3 4 0 1
4 9 7 3
5 4 1 9
6 3 5 1
7 4 3 8
8 5 5 6
9 5 7 4
10 5 4 7
11 1 6 0
12 9 3 9
13 6 4 4
14 6 5 8
15 2 3 4
16 5 9 4
17 8 7 8
18 0 0 3
19 9 5 5
20 5 7 2
21 6 0 1
22 7 3 2
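To actually get the reshaped frame into Excel, presumably something along these lines would follow (output.xlsx is just an example path; to_excel needs an Excel writer such as openpyxl installed):
reshape_appended(appended_data, 250).to_excel('output.xlsx', index=False)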
My best guess at solving the problem would be to loop until the counter exceeds the length, so:
i = 250  # counter
j = 0    # left limit
for x in range(len(appended_data)):
    appended_data.iloc[j:i]  # take this slice and write it out here
    i += 250
    if i > len(appended_data):
        appended_data.iloc[j:len(appended_data)]
        break
    else:
        j = i
