I have 5 different data frames that I'd like to output to a single text file one after the other.
Because of my specific purpose, I do not want a delimiter.
What is the fastest way to do this?
Example:
Below are 5 dataframes. Space indicates new column.
1st df AAA 1 2 3 4 5 6
2nd BBB 1 2 3 4 5 6 7 8 9 10
3rd CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
4th DDD 1 2 3 4 5 6 2 3 4 5
5th EEE 1 2 3 4 5 6 7 8 9 10 1 2 2
I'd like to convert the above to the following in a single text file:
AAA123456
BBB12345678910
CCC12345667122334512
CCC12345667122334512
DDD1234562345
EEE12345678910122
Notice that columns are just removed but rows are preserved with new lines.
I've tried googling around, but to_csv seems to require a delimiter. I also came across a few solutions using "with open" and "write", but those seem to require iterating through every row in the dataframe.
Appreciate any ideas!
Thanks,
You can combine the dataframes with pd.concat.
import pandas as pd
df1 = pd.DataFrame({'AAA': [1, 2, 3, 4, 5, 6]}).T
df2 = pd.DataFrame({'BBB': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}).T
df_all = pd.concat([df1, df2])
Output (which you can format as you see necessary):
0 1 2 3 4 5 6 7 8 9
AAA 1 2 3 4 5 6 NaN NaN NaN NaN
BBB 1 2 3 4 5 6 7.0 8.0 9.0 10.0
Write to CSV [edit: with a creative delimiter]:
# header=False keeps the column labels 0..9 out of the output file
df_all.to_csv('df_all.csv', sep=',', header=False)
Read the CSV back in and remove the commas:
with open('df_all.csv', mode='r') as file_in:
    with open('df_all_no_spaces.txt', mode='w') as file_out:
        text = file_in.read()
        text = text.replace(',', '')
        file_out.write(text)
There's gotta be a more elegant way to do that last bit, but this works. Perhaps for good reason, pandas doesn't support exporting to CSV with no delimiter.
Edit: you can write to CSV with commas and then remove them. :)
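As the concatenated output above shows, pd.concat pads the shorter row with NaN and turns some values into floats (7.0, 8.0, ...), and those would end up in the text file. A minimal sketch that avoids the CSV round trip by joining each row's values directly is given below. The helper name frame_to_lines is hypothetical, and the sketch assumes the row label (AAA, BBB) lives in the index, as in the example frames above.

```python
import pandas as pd

# Example frames shaped like the ones in the answer above (one labelled row each).
df1 = pd.DataFrame({'AAA': [1, 2, 3, 4, 5, 6]}).T
df2 = pd.DataFrame({'BBB': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}).T


def frame_to_lines(df):
    """Return one string per row: the index label followed by the row's values."""
    return [
        str(label) + ''.join(str(value) for value in row)
        for label, row in zip(df.index, df.to_numpy())
    ]


# Handle each frame separately so that no NaN padding is introduced.
lines = []
for df in (df1, df2):
    lines.extend(frame_to_lines(df))

text = '\n'.join(lines) + '\n'
print(text, end='')
```

The resulting string can then be written with a single file.write(text) call inside a "with open" block.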
To gather several dataframes into a single block of text, do:
whole_curpos = ''
# read every dataframe
for df in dataframe_list:
    # gather all the columns into a single string column
    df['whole_text'] = df.astype(str).agg(''.join, axis=1)
    # append each row, followed by a newline
    for row in range(df.shape[0]):
        whole_curpos = whole_curpos + df['whole_text'].iloc[row]
        whole_curpos = whole_curpos + '\n'
I have an Excel file with different numbers and I open it using pandas.
When I read and then print the xlsx file, I get something like this:
5 7 7
0 6 16 5
1 10 12 15
2 1 5 6
3 5 6 18
. . . .
. . . .
n . . n
All I need is to distribute them into different intervals according to their frequencies.
My code is:
import pandas as pd
excel_archive = pd.read_excel("file name")
print(excel_archive)
I think the Excel file has no header, so first add header=None to read_excel and then use DataFrame.stack with Series.value_counts:
excel_archive = pd.read_excel("file name", header=None)
s = excel_archive.stack().value_counts()
print (s)
5 4
6 3
7 2
15 1
12 1
10 1
18 1
1 1
16 1
dtype: int64
Your question is not very clear, but if you just have to count the number of occurrences of each value you can try something like this:
import numpy as np
import pandas as pd
# generate a dataframe
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 4], [7, 8, 9], [1, 5, 2], [7, 9, 9]]), columns=['a', 'b', 'c'])
# flatten the dataframe into a single Series
df_flat = df.stack().reset_index(drop=True)
# count the number of occurrences of each value
df_flat.groupby(df_flat).size()
This is the input:
a b c
0 1 2 3
1 4 5 4
2 7 8 9
3 1 5 2
4 7 9 9
And this is the output:
1 2
2 2
3 1
4 2
5 2
7 2
8 1
9 3
If you want instead to divide in some predefined intervals you can use pd.cut together with groupby:
#define intervals
intervals = pd.IntervalIndex.from_arrays([0,3,6],[3,6,9],closed='right')
#cut and groupby
df_flat.groupby(pd.cut(df_flat,intervals)).size()
and the result would be:
(0, 3] 5
(3, 6] 4
(6, 9] 6
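A more compact variant of the interval count is sketched below, assuming the same example data. It passes plain bin edges to pd.cut (right-closed by default) and counts with value_counts instead of groupby; the variable names binned and counts are only illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as in the answer above.
df = pd.DataFrame(
    np.array([[1, 2, 3], [4, 5, 4], [7, 8, 9], [1, 5, 2], [7, 9, 9]]),
    columns=['a', 'b', 'c'],
)
df_flat = df.stack().reset_index(drop=True)

# Bin edges 0, 3, 6, 9 give the right-closed intervals (0, 3], (3, 6], (6, 9].
binned = pd.cut(df_flat, bins=[0, 3, 6, 9])

# Count the values per interval and keep the intervals in ascending order.
counts = binned.value_counts().sort_index()
print(counts)
```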
The following code:
import pandas as pd
from io import StringIO  # Python 3; in Python 2 this was "from StringIO import StringIO"
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")
# pandas >= 1.3; older versions used warn_bad_lines=True, error_bad_lines=False
pd.read_csv(data, on_bad_lines='warn')
produces this output:
Skipping line 4: expected 3 fields, saw 4
a b c
0 1 2 3
1 4 5 6
2 1 2 5
3 3 4 5
That is, the third data row (line 4 of the file) is rejected because it contains four values instead of the expected three. This CSV file is considered malformed.
What if I wanted a different behavior instead, i.e. not skipping lines that have more fields than expected, but keeping their values by using a larger dataframe?
In the given example this would be the behavior ('UNK' is just an example, might be any other string):
a b c UNK
0 1 2 3 nan
1 4 5 6 nan
2 6 7 8 9
3 1 2 5 nan
4 3 4 5 nan
This is just an example in which there is only one additional value, what about an arbitrary (and a priori unknown) number of fields? Is this obtainable by some way through pandas read_csv?
Please note: I can do this by using csv.reader, I am just trying to switch now to pandas.
Any help/hints is appreciated.
Looks like you need the names argument when reading the CSV:
import pandas as pd
from io import StringIO  # Python 3; in Python 2 this was "from StringIO import StringIO"
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")
# pandas >= 1.3; older versions used warn_bad_lines=True, error_bad_lines=False
df = pd.read_csv(data, on_bad_lines='warn', names=["a", "b", "c", "UNK"])
print(df)
Output:
a b c UNK
0 a b c NaN
1 1 2 3 NaN
2 4 5 6 NaN
3 6 7 8 9.0
4 1 2 5 NaN
5 3 4 5 NaN
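For an a priori unknown number of extra fields, one possible approach is a quick first pass to find the widest row, followed by read_csv with enough column names. The sketch below assumes the data fits in memory as a string; the UNK0, UNK1, ... names and the variables raw, max_fields and names are made up for illustration.

```python
import csv
from io import StringIO

import pandas as pd

raw = """a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5"""

# First pass: find the largest number of fields on any line.
max_fields = max(len(row) for row in csv.reader(StringIO(raw)))

# Keep the real header names and invent placeholder names for the extra columns.
header = raw.splitlines()[0].split(',')
names = header + [f'UNK{i}' for i in range(max_fields - len(header))]

# Second pass: skip the original header line and let pandas pad short rows with NaN.
df = pd.read_csv(StringIO(raw), names=names, skiprows=1)
print(df)
```

Unlike the output above, the header line is skipped here, so a, b and c are not read in as a data row.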
Supposing that Afile.csv contains :
a,b,c#Incomplete Header
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5,,8
The following function yields a DataFrame containing all fields:
import pandas as pd

def readRawValuesFromCSV(file1, separator=',', commentMark='#'):
    df = pd.DataFrame()
    with open(file1, 'r') as f:
        for line in f:
            # keep only the part of the line before the comment mark
            b = line.strip().split(commentMark)
            if len(b[0]) > 0:
                lineList = tuple(b[0].strip().split(separator))
                # append the row; shorter rows are padded with NaN
                df = pd.concat([df, pd.DataFrame([lineList])], ignore_index=True)
    return df
You can test it with this code:
file1 = 'Afile.csv'
# Read all values of a (maybe malformed) CSV file
df = readRawValuesFromCSV (file1, ',', '#')
That yields:
df
   0  1  2    3    4
0  a  b  c  NaN  NaN
1  1  2  3  NaN  NaN
2  4  5  6  NaN  NaN
3  6  7  8    9  NaN
4  1  2  5  NaN  NaN
5  3  4  5         8
I am indebted to herrfz for his answer in
"Handling Variable Number of Columns with Pandas - Python". The present question might be a generalization of that one.
I am currently using pandas and Python to handle many of the repetitive tasks I need done for my master's thesis. At this point, I have written some code (with help from Stack Overflow) that, based on some event dates in one file, finds a start and end date to use as a date range in another file. The matching dates are then located and appended to an initially empty list, which I can then output to Excel. However, with the code below I get a dataframe with 5 columns and 400,000+ rows (which is basically the data I want), but it is not laid out the way I want it in Excel. Below is my code:
end_date = pd.DataFrame(data=(df_sample['Date']-pd.DateOffset(days=2)))
start_date = pd.DataFrame(data=(df_sample['Date']-pd.offsets.BDay(n=252)))
merged_dates = pd.merge(end_date,start_date,left_index=True,right_index=True)
ff_factors = []
for index, row in merged_dates.iterrows():
    time_range = (df['Date'] > row['Date_y']) & (df['Date'] <= row['Date_x'])
    df_factor = df.loc[time_range]
    ff_factors.append(df_factor)
appended_data = pd.concat(ff_factors, axis=0)
I need the data as blocks of 5 columns and 250 rows (the columns are variable identifiers) placed side by side, so that when outputting to Excel I have, for example, columns A-D with 250 rows each. This then needs to be repeated for columns E-H and so on. Using iloc, I can select the first 250 observations with appended_data.iloc[0:250], which gives 5 columns and 250 rows, and then output that to Excel.
Is there any way for me to automate the process, so that after selecting the first 250 rows and outputting them to Excel, it selects the next 250 and outputs them next to the first 250, and so on?
I hope the above is precise and clear, else I'm happy to elaborate!
EDIT:
The above picture illustrates what I get when outputting to Excel: 5 columns and 407,764 rows. What I need is to get this split up in the following way:
The second picture illustrates how I need the total sample to be split up. The first five columns and the corresponding 250 rows need to look like the second picture. When I do the next split using iloc[250:500], I get the next 250 rows, which need to be added after the initial five columns, and so on.
You can do this with a combination of np.reshape, which can be made to behave as desired on individual columns, and which should be much faster than a loop through the rows, and pd.concat, to join the dataframes it makes back together:
def reshape_appended(df, target_rows, pad=4):
    df = df.copy() # don't modify in-place
    # below line adds strings, '0000',...,'0004' to the column names
    # this ensures sorting the columns preserves the order
    df.columns = [str(i).zfill(pad)+df.columns[i] for i in range(len(df.columns))]
    #target number of new columns per column in df
    target_cols = len(df.index)//target_rows
    last_group = pd.DataFrame()
    # below conditional fires if there will be leftover rows - % is mod
    if len(df.index)%target_rows != 0:
        last_group = df.iloc[-(len(df.index)%target_rows):].reset_index(drop=True)
        df = df.iloc[:-(len(df.index)%target_rows)] # keep rows that divide nicely
    #this is a large list comprehension, that I'll elaborate on below
    groups = [pd.DataFrame(df[col].values.reshape((target_rows, target_cols),
                                                  order='F'),
                           columns=[str(i).zfill(pad)+col for i in range(target_cols)])
              for col in df.columns]
    if not last_group.empty: # if there are leftover rows, add them back
        last_group.columns = [pad*'9'+col for col in last_group.columns]
        groups.append(last_group)
    out = pd.concat(groups, axis=1).sort_index(axis=1)
    out.columns = out.columns.str[2*pad:] # remove the extra characters in the column names
    return out
last_group takes care of any rows that don't divide evenly into sets of 250. The playing around with column names enforces proper sorting order.
df[col].values.reshape((target_rows, target_cols), order='F')
Reshapes the values in the column col of df into the shape specified by the tuple (target_rows, target_cols), with the ordering Fortran uses, indicated by F.
columns=[str(i).zfill(pad)+col for i in range(target_cols)]
is just giving names to these columns, with an eye to establishing proper ordering afterward.
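To see what order='F' does in that reshape call, here is a small illustration on a made-up array of six values (the names values, by_columns and by_rows are only for this sketch). With Fortran order the values fill the result column by column, which is what keeps each consecutive block of rows together in its own column.

```python
import numpy as np

values = np.arange(6)  # array([0, 1, 2, 3, 4, 5])

# Fortran order: fill the first column top to bottom, then the second column.
by_columns = values.reshape((3, 2), order='F')

# Default C order, for comparison: fill the first row left to right, then the next row.
by_rows = values.reshape((3, 2))

print(by_columns)
print(by_rows)
```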
Ex:
df = pd.DataFrame(np.random.randint(0, 10, (23, 3)), columns=list('abc'))
reshape_appended(df, 5)
Out[160]:
a b c a b c a b c a b c a b c
0 8 3 0 4 1 9 5 4 7 2 3 4 5.0 7.0 2.0
1 1 6 1 3 5 1 1 6 0 5 9 4 6.0 0.0 1.0
2 3 1 3 4 3 8 9 3 9 8 7 8 7.0 3.0 2.0
3 4 0 1 5 5 6 6 4 4 0 0 3 NaN NaN NaN
4 9 7 3 5 7 4 6 5 8 9 5 5 NaN NaN NaN
df
Out[161]:
a b c
0 8 3 0
1 1 6 1
2 3 1 3
3 4 0 1
4 9 7 3
5 4 1 9
6 3 5 1
7 4 3 8
8 5 5 6
9 5 7 4
10 5 4 7
11 1 6 0
12 9 3 9
13 6 4 4
14 6 5 8
15 2 3 4
16 5 9 4
17 8 7 8
18 0 0 3
19 9 5 5
20 5 7 2
21 6 0 1
22 7 3 2
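My best guess at solving the problem would be to loop over the dataframe in steps of 250 rows until the end is reached, so
chunk = 250  # rows per block
chunks = []
for start in range(0, len(appended_data), chunk):
    # iloc clips the end index, so the last block may be shorter than 250 rows
    block = appended_data.iloc[start:start + chunk]
    chunks.append(block.reset_index(drop=True))
# place the blocks side by side
result = pd.concat(chunks, axis=1)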
I'm new to Python and having an extremely frustrating problem. I need to load columns 1-12 of a CSV file (so not the 0th column), but I need to skip the file's header and overwrite it with "0,1,..,11".
I need to use pandas.read_csv() for this.
basically, my csv is:
"a", "b", "c", ..., "l"
1, 2, 3, ..., 12
1, 2, 3, ..., 12
and I want to load it as a dataframe such that
dataframe[0] = 2,2,2,..
dataframe[1] = 3,3,3..
ergo skipping the first column, and making the dataframe start with index 0.
I've tried setting usecols = [1,2,3..], but then the indexes are 1,2,3,.. .
Any help would be appreciated.
You can use header=(int) to remove the header lines, usecols=range(1,12) to grab the last 11 columns, and names=range(11) to name the 11 columns from 0 to 10.
Here is a fake dataset:
This is the header. Header header header.
And the second header line.
a,b,c,d,e,f,g,h,i,j,k,l
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
Using the code:
> df = pd.read_csv('data_file.csv', usecols=range(1,12), names=range(11), header=2)
> df
# returns:
0 1 2 3 4 5 6 7 8 9 10
0 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12
2 2 3 4 5 6 7 8 9 10 11 12
> df[0]
# returns:
0 2
1 2
2 2
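An alternative sketch, in case combining names with usecols behaves differently in your pandas version: read the wanted columns without a header and relabel them afterwards. The in-memory data below is a shortened stand-in for the real file (one header line instead of three, which is why skiprows=1 is used), and the variable name raw is only illustrative.

```python
from io import StringIO

import pandas as pd

# Stand-in for the CSV file: one header line followed by two data rows.
raw = StringIO(
    "a,b,c,d,e,f,g,h,i,j,k,l\n"
    "1,2,3,4,5,6,7,8,9,10,11,12\n"
    "1,2,3,4,5,6,7,8,9,10,11,12\n"
)

# Skip the header line, treat the rest as data, and keep columns 1..11 only.
df = pd.read_csv(raw, header=None, skiprows=1, usecols=range(1, 12))

# The kept columns are still labelled 1..11, so relabel them to start at 0.
df.columns = range(df.shape[1])
print(df[0])
```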
I am curious if there is a way to display my columns of data right beside each other, instead of 7 columns followed by the remaining columns below.
This is the output
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg \
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
3 1 2 3 4 5 6 7
hhhhhhhh iiiiiiii
1 8 9
2 8 9
3 8 9
Press any key to continue . . .
This is the code
import pandas as pd
df = pd.DataFrame(index=['1','2','3'])
df['aaaaaaaa'] = 1
df['bbbbbbbb'] = 2
df['cccccccc'] = 3
df['dddddddd'] = 4
df['eeeeeeee'] = 5
df['ffffffff'] = 6
df['gggggggg'] = 7
df['hhhhhhhh'] = 8
df['iiiiiiii'] = 9
print (df.head())
Were you trying to combine two different CSV files? If so, you need to append one dataframe to the other. Note that DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
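For the display question itself (columns wrapping onto a second block), two options that may help are sketched below: DataFrame.to_string, which renders the whole frame without wrapping, and a wider display.width setting. The width of 200 is an arbitrary assumption; pick whatever fits your console.

```python
import pandas as pd

# Rebuild the frame from the question: nine columns holding the constants 1..9.
df = pd.DataFrame(index=['1', '2', '3'])
for value, name in enumerate([letter * 8 for letter in 'abcdefghi'], start=1):
    df[name] = value

# Option 1: to_string() renders every column on one line per row.
wide_text = df.to_string()
print(wide_text)

# Option 2: temporarily widen the display so print(df) does not wrap.
with pd.option_context('display.width', 200, 'display.max_columns', None):
    print(df)
```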