I have this code:
import pandas as pd
import os

ext = '.tsv'
for files in os.listdir(os.getcwd()):
    if files.endswith(ext):
        x = pd.read_table(files, sep='\t', usecols=['#Chrom', 'Pos', 'RawScore', 'PHRED'])
        x.drop_duplicates(subset='Pos', keep=False, inplace=True)
        data_frame = x.head()
        print(data_frame)
#Chrom Pos RawScore PHRED
77171 6 167709702 7.852318 39.0
19180 6 31124849 7.623789 38.0
15823 6 29407955 6.982213 37.0
19182 6 31125257 6.817868 36.0
19974 6 31544591 6.201438 35.0
#Chrom Pos RawScore PHRED
52445 9 139634495 6.950686 36.0
46470 9 125391241 5.477094 34.0
49866 9 134385435 4.841222 33.0
48642 9 131475583 4.357986 31.0
40099 9 113233652 4.284035 31.0
#Chrom Pos RawScore PHRED
7050 13 32972626 6.472542 36.0
32416 13 100518634 5.405765 33.0
10834 13 42465713 4.406294 32.0
9963 13 39422624 4.374808 31.0
22993 13 76395620 4.193058 29.4
As you can imagine, I get multiple dataframes with the same column names but from different chromosomes.
How can I write these multiple dataframes to different CSV files?
You can save your DataFrames to .csv using pandas.DataFrame.to_csv (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).
More specifically, in your case you can do this:
for files in os.listdir(os.getcwd()):
    if files.endswith(ext):
        x = pd.read_table(files, sep='\t', usecols=['#Chrom', 'Pos', 'RawScore', 'PHRED'])
        x.drop_duplicates(subset='Pos', keep=False, inplace=True)
        x.to_csv(f'Chrom{x.iloc[0,0]}.csv')
Here, x.iloc[0,0] takes the first element of the first column, which is the #Chrom value. You can also set the name manually. Note that this method will not work if two different DataFrames share the same #Chrom value, since the second file would overwrite the first; in that case you have to choose the CSV file names yourself.
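If several input files can share a #Chrom value, a minimal sketch (assuming one output CSV per input .tsv, with the name derived from the input file so nothing gets overwritten) could look like this:
import os
import pandas as pd

for files in os.listdir(os.getcwd()):
    if files.endswith('.tsv'):
        x = pd.read_table(files, sep='\t', usecols=['#Chrom', 'Pos', 'RawScore', 'PHRED'])
        x.drop_duplicates(subset='Pos', keep=False, inplace=True)
        # name the output after the input file instead of the chromosome,
        # so two files covering the same chromosome stay separate
        base = os.path.splitext(files)[0]
        x.to_csv(f'{base}.csv', index=False)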
I have a load of ASC files to extract data from. The issue I am having is that some of the columns have empty rows where there is no data; when I load these files into a df, it puts all the data into the first columns and just pads the end with NaNs, like this:
a | b  | c
1 | 2  | nan
when I want it to be:
a | b  | c
1 | nan| 2
Where there is no data I want to preserve the space. Part of my code sets the separator to any run of two or more whitespace characters, so that headers containing a single space stay together; I think this is causing the issue, but I am not sure how to fix it. I've tried using astropy.io to open the files and determine the delimiter, but I get the error that the number of columns doesn't match the data columns.
Here's an image of the general look of the files, so you can see the lack of character delimiters and the empty columns.
starting_words = ['Core no.', 'Core No.', 'Core']
data = []
file_paths = []

for file in filepaths:
    with open(file) as f:
        for i, l in enumerate(f):
            if l.startswith(tuple(starting_words)):
                df = pd.read_csv(file, sep=r'\s{2,}', engine='python', skiprows=i)
                file_paths.append(file.stem + file.suffix)
                df.insert(0, 'Filepath', file)
                data += [df]
                break
This is the script I've used to open the files and keep the header words together. I never got the astropy approach to run: I either get the "columns don't match" error, or it could not determine the file format. Also, this code has the skiprows part because the files all have random notes at the top that I don't want in my dataframe.
Your data looks well behaved, so you could try pandas.read_fwf to read the files as fixed-width formatted lines. If the inference read_fwf makes is not good enough for you, you can describe the extents of the fixed-width fields of each line manually with the colspecs parameter.
Sample
Core no. Depth Depth Perm Porosity Saturations Oil
ft m mD % %
1 5516.0 1681.277 40.0 1.0
2 5527.0 1684.630 39.0 16.0
3 5566.0 1696.517 508 37.0 4.0
5571.0 1698.041 105 33.0 8.0
6 5693.0 1735.226 44.0 16.0
5702.0 1737.970 4320 35.0 31.0
9 5686.0 1733.093 2420 33.0 26.0
df = pd.read_fwf('sample.txt', skiprows=2, header=None)
df.columns=['Core no.', 'Depth ft', 'Depth m' , 'Perm mD', 'Porosity%', 'Saturations Oil%']
print(df)
Output from df
Core no. Depth ft Depth m Perm mD Porosity% Saturations Oil%
0 1.0 5516.0 1681.277 NaN 40.0 1.0
1 2.0 5527.0 1684.630 NaN 39.0 16.0
2 3.0 5566.0 1696.517 508.0 37.0 4.0
3 NaN 5571.0 1698.041 105.0 33.0 8.0
4 6.0 5693.0 1735.226 NaN 44.0 16.0
5 NaN 5702.0 1737.970 4320.0 35.0 31.0
6 9.0 5686.0 1733.093 2420.0 33.0 26.0
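If the inferred column breaks are off, colspecs can be given explicitly. A sketch, where the character extents below are illustrative and would need to be measured against the real files:
import pandas as pd

# half-open (start, end) character positions of each field
colspecs = [(0, 8), (8, 16), (16, 26), (26, 32), (32, 42), (42, 60)]
df = pd.read_fwf('sample.txt', colspecs=colspecs, skiprows=2, header=None)
df.columns = ['Core no.', 'Depth ft', 'Depth m', 'Perm mD', 'Porosity%', 'Saturations Oil%']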
I have a CSV file with 4 columns and n lines.
I want the 4th column's value to move onto its own next line each time.
For example:
[LN],[cb],[I], [LS]
to
[LN],[cb],[I]
[LS]
that is, if my file is:
[LN1],[cb1],[I1], [LS1]
[LN2],[cb2],[I2], [LS2]
[LN3],[cb2],[I3], [LS3]
[LN4],[cb4],[I4], [LS4]
the output file will look like
[LN1],[cb1],[I1]
[LS1]
[LN2],[cb2],[I2]
[LS2]
[LN3],[cb2],[I3]
[LS3]
[LN4],[cb4],[I4]
[LS4]
Test file:
101 Xavier Mexico City 41 88.0
102 Ann Toronto 28 79.0
103 Jana Prague 33 81.0
104 Yi Shanghai 34 80.0
105 Robin Manchester 38 68.0
Output required:
101 Xavier Mexico City 41
88.0
102 Ann Toronto 28
79.0
103 Jana Prague 33
81.0
104 Yi Shanghai 34
80.0
105 Robin Manchester 38
68.0
Split the dataframe into 2 dataframes, one with the first 3 columns and the other with the last column. Add a helper column to both so you can order them afterwards. Now combine them again and order them first by index (which is identical for entries that were previously in the same row) and then by the helper column.
This answer is untested:
from io import StringIO
import pandas as pd

s = """col1,col2,col3,col4
101 Xavier,Mexico City,41,88.0
102 Ann,Toronto,28,79.0
103 Jana,Prague,33,81.0
104 Yi,Shanghai,34,80.0
105 Robin,Manchester,38,68.0"""

df = pd.read_csv(StringIO(s), sep=',')
df1 = df[['col1', 'col2', 'col3']].copy()
df2 = df[['col4']].rename(columns={'col4': 'col1'}).copy()
df1['ranking'] = 1
df2['ranking'] = 2
# DataFrame.append was removed in pandas 2.0; concat does the same job
df_out = pd.concat([df1, df2])
df_out = df_out.rename_axis('index_name').sort_values(by=['index_name', 'ranking'], ascending=[True, True])
df_out = df_out.drop(['ranking'], axis=1)
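To get from df_out to the requested output file, a hedged final step (the path is illustrative; note that the NaN filler in col2/col3 of the moved rows will appear as empty fields) would be:
df_out.to_csv('output.csv', index=False, header=False)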
Another solution to this is to convert the table to a list, then rearrange the list to reconstruct the table.
import pandas as pd
df = pd.read_csv(r"test_file.csv")
df_list = df.values.tolist()
new_list = []
for x in df_list:
    # pop() removes the last element from the list and returns it;
    # this also modifies the original row list in place
    last_x = x.pop()
    # append the shortened row, then the removed value as its own row
    new_list.append(x)
    new_list.append([last_x])

# Use the new list to create the new table...
new_df = pd.DataFrame(new_list)
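If the goal is the ragged output file from the question, writing new_list with the standard csv module avoids the NaN padding a DataFrame would introduce for the short rows; a sketch with an illustrative output path:
import csv

with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(new_list)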
I want to convert the HTML to CSV using pandas functions.
This is part of what I read into the dataframe df:
0 1
0 sequence 2
1 trainNo K805
2 trainNumber K805
3 departStation 鹰潭
4 departStationPy yingtan
5 arriveStation 南昌
6 arriveStationPy nanchang
7 departDate 2020-05-24
8 departTime 03:55
9 arriveDate 2020-05-24
10 arriveTime 05:44
11 isStartStation False
12 isEndStation False
13 runTime 1小时49分钟
14 preSaleTime NaN
15 takeDays 0
16 isBookable True
17 seatList seatNamepriceorderPriceinventoryisBookablebutt...
18 curSeatIndex 0
seatName price orderPrice inventory isBookable buttonDisplayName buttonType
0 硬座 23.5 23.5 99 True NaN 0
1 硬卧 69.5 69.5 99 True NaN 0
2 软卧 104.5 104.5 4 True NaN 0
0 1
0 departDate 2020-05-23
1 departStationList NaN
2 endStationList NaN
3 departStationFilterMap NaN
4 endStationFilterMap NaN
5 departCityName 上海
6 arriveCityName 南昌
7 gtMinPrice NaN
My code is like this:
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    df.to_csv(".\other.csv", index=True, encoding='utf-8-sig')
To preserve the characters in the CSV I need to use utf-8-sig encoding, but I don't know how to use the % format symbol to give each output file its own name.
,0,1
0,departDate,2020-05-23
1,departStationList,
2,endStationList,
3,departStationFilterMap,
4,endStationFilterMap,
5,departCityName,上海
6,arriveCityName,南昌
7,gtMinPrice,
This is what I got in the CSV file; only the last part is preserved.
The dataframe is correct, but the CSV needs correction. Can you show me how to produce the correct output?
You're saving each dataframe to the same file, so each one gets overwritten until only the last remains.
Note the addition of the f-string to change the save file name, e.g. f".\other_{i}.csv".
Each dataframe is a different shape, so they won't all fit together properly in a single table.
To CSV
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    df.to_csv(f".\other_{i}.csv", index=True, encoding='utf-8-sig')
To Excel
with pd.ExcelWriter('output.xlsx', mode='w') as writer:
    for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
        # to_excel needs no encoding argument; xlsx files are always Unicode
        df.to_excel(writer, sheet_name=f'Sheet{i}')
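If you would rather keep everything in a single file despite the differing shapes, a hedged variant is to append each table to one CSV through a shared file handle (each table keeps its own header row):
with open('other.csv', 'w', encoding='utf-8-sig', newline='') as f:
    for df in pd.read_html(html, encoding='utf-8'):
        df.to_csv(f, index=True)
        f.write('\n')  # blank line between tables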
I am starting with a DataFrame that looks like this:
id tof
0 43.0 1999991.0
1 43.0 2095230.0
2 43.0 4123105.0
3 43.0 5560423.0
4 46.0 2098996.0
5 46.0 2114971.0
6 46.0 4130033.0
7 46.0 4355096.0
8 82.0 2055207.0
9 82.0 2093996.0
10 82.0 4193587.0
11 90.0 2059360.0
12 90.0 2083762.0
13 90.0 2648235.0
14 90.0 4212177.0
15 103.0 1993306.0
.
.
.
and ultimately my goal is to create a very long two-dimensional array that contains all combinations of items with the same id, like this (for rows with id 103):
[(1993306.0, 2105441.0), (1993306.0, 3972679.0), (1993306.0, 3992558.0), (1993306.0, 4009044.0), (2105441.0, 3972679.0), (2105441.0, 3992558.0), (2105441.0, 4009044.0), (3972679.0, 3992558.0), (3972679.0, 4009044.0), (3992558.0, 4009044.0),...]
except changing all the tuples to arrays so that I could transpose the array after iterating over all id numbers.
Naturally, itertools came to mind, and my first thought was to do something with df.groupby('id') so that it would apply itertools internally to every group with the same id, but I would guess that this would take absolutely forever with the million-line data files I have.
Is there a vectorized way to do this?
IIUC (if I understand correctly):
from itertools import combinations
pd.DataFrame([
[k, c0, c1] for k, tof in df.groupby('id').tof
for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
id tof0 tof1
0 43.0 1999991.0 2095230.0
1 43.0 1999991.0 4123105.0
2 43.0 1999991.0 5560423.0
3 43.0 2095230.0 4123105.0
4 43.0 2095230.0 5560423.0
5 43.0 4123105.0 5560423.0
6 46.0 2098996.0 2114971.0
7 46.0 2098996.0 4130033.0
8 46.0 2098996.0 4355096.0
9 46.0 2114971.0 4130033.0
10 46.0 2114971.0 4355096.0
11 46.0 4130033.0 4355096.0
12 82.0 2055207.0 2093996.0
13 82.0 2055207.0 4193587.0
14 82.0 2093996.0 4193587.0
15 90.0 2059360.0 2083762.0
16 90.0 2059360.0 2648235.0
17 90.0 2059360.0 4212177.0
18 90.0 2083762.0 2648235.0
19 90.0 2083762.0 4212177.0
20 90.0 2648235.0 4212177.0
Explanation
This is a list comprehension that returns a list of lists, wrapped in a DataFrame constructor. Look up comprehensions to understand this better.
from itertools import combinations
pd.DataFrame([
# name series of tof values
# ↓ ↓
[k, c0, c1] for k, tof in df.groupby('id').tof
# items from combinations
# first second
# ↓ ↓
for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
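A fully vectorized alternative is a self-merge on id; this sketch assumes tof values are unique within each id, so the inequality filter keeps each unordered pair exactly once:
pairs = df.merge(df, on='id', suffixes=('0', '1'))
pairs = pairs[pairs['tof0'] < pairs['tof1']].reset_index(drop=True)
# pairs now has columns id, tof0, tof1, matching the output above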
If you want all ordered pairs, including self-pairs, itertools.product works directly on the values:
from itertools import product

x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(product(x, x))
If you'd prefer that elements don't repeat, you can use combinations instead:
from itertools import combinations
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(combinations(x,2))
Groupby does work:
def get_product(x):
    return pd.MultiIndex.from_product((x.tof, x.tof)).values

for i, g in df.groupby('id'):
    print(i, get_product(g))
Output:
43.0 [(1999991.0, 1999991.0) (1999991.0, 2095230.0) (1999991.0, 4123105.0)
(1999991.0, 5560423.0) (2095230.0, 1999991.0) (2095230.0, 2095230.0)
(2095230.0, 4123105.0) (2095230.0, 5560423.0) (4123105.0, 1999991.0)
(4123105.0, 2095230.0) (4123105.0, 4123105.0) (4123105.0, 5560423.0)
(5560423.0, 1999991.0) (5560423.0, 2095230.0) (5560423.0, 4123105.0)
(5560423.0, 5560423.0)]
46.0 [(2098996.0, 2098996.0) (2098996.0, 2114971.0) (2098996.0, 4130033.0)
(2098996.0, 4355096.0) (2114971.0, 2098996.0) (2114971.0, 2114971.0)
(2114971.0, 4130033.0) (2114971.0, 4355096.0) (4130033.0, 2098996.0)
(4130033.0, 2114971.0) (4130033.0, 4130033.0) (4130033.0, 4355096.0)
(4355096.0, 2098996.0) (4355096.0, 2114971.0) (4355096.0, 4130033.0)
(4355096.0, 4355096.0)]
82.0 [(2055207.0, 2055207.0) (2055207.0, 2093996.0) (2055207.0, 4193587.0)
(2093996.0, 2055207.0) (2093996.0, 2093996.0) (2093996.0, 4193587.0)
(4193587.0, 2055207.0) (4193587.0, 2093996.0) (4193587.0, 4193587.0)]
90.0 [(2059360.0, 2059360.0) (2059360.0, 2083762.0) (2059360.0, 2648235.0)
(2059360.0, 4212177.0) (2083762.0, 2059360.0) (2083762.0, 2083762.0)
(2083762.0, 2648235.0) (2083762.0, 4212177.0) (2648235.0, 2059360.0)
(2648235.0, 2083762.0) (2648235.0, 2648235.0) (2648235.0, 4212177.0)
(4212177.0, 2059360.0) (4212177.0, 2083762.0) (4212177.0, 2648235.0)
(4212177.0, 4212177.0)]
103.0 [(1993306.0, 1993306.0)]
I have a pandas DataFrame like this:
X Y Z Value
0 18 55 1 70
1 18 55 2 67
2 18 57 2 75
3 18 58 1 35
4 19 54 2 70
I want to write this data to a text file that looks like this:
18 55 1 70
18 55 2 67
18 57 2 75
18 58 1 35
19 54 2 70
I have tried something like
f = open(writePath, 'a')
f.writelines(['\n', str(data['X']), ' ', str(data['Y']), ' ', str(data['Z']), ' ', str(data['Value'])])
f.close()
It's not correct. How do I do this?
You can just use np.savetxt, accessing the underlying NumPy array through the DataFrame's .values attribute:
import numpy as np

np.savetxt(r'c:\data\np.txt', df.values, fmt='%d')
yields:
18 55 1 70
18 55 2 67
18 57 2 75
18 58 1 35
19 54 2 70
or to_csv:
df.to_csv(r'c:\data\pandas.txt', header=None, index=None, sep=' ', mode='a')
Note for np.savetxt you'd have to pass a filehandle that has been created with append mode.
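For example, a minimal sketch of that append-mode variant:
import numpy as np

with open(r'c:\data\np.txt', 'a') as f:
    np.savetxt(f, df.values, fmt='%d')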
The native way to do this is to use df.to_string() :
with open(writePath, 'a') as f:
    dfAsString = df.to_string(header=False, index=False)
    f.write(dfAsString)
This will output the following:
18 55 1 70
18 55 2 67
18 57 2 75
18 58 1 35
19 54 2 70
This method also lets you easily choose which columns to print with the columns attribute, lets you keep the column and index labels if you wish, and has other attributes for spacing etc.
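For instance, to print only some of the columns from the question's DataFrame (a small sketch reusing writePath from above):
with open(writePath, 'a') as f:
    f.write(df.to_string(columns=['X', 'Y', 'Value'], header=False, index=False))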
You can use pandas.DataFrame.to_csv(), setting both index and header to False:
print(df.to_csv(sep=' ', index=False, header=False))
18 55 1 70
18 55 2 67
18 57 2 75
18 58 1 35
19 54 2 70
pandas.DataFrame.to_csv can write to a file directly, for more info you can refer to the docs linked above.
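For example, writing straight to the target path from the question:
df.to_csv(writePath, sep=' ', index=False, header=False, mode='a')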
Late to the party, but try this:
base_filename = 'Values.txt'
with open(os.path.join(WorkingFolder, base_filename), 'w') as outfile:
    # neatly allocate all columns and rows to a .txt file
    df.to_string(outfile)
@AHegde - to get tab-delimited output, use the separator sep='\t'.
For df.to_csv:
df.to_csv(r'c:\data\pandas.txt', header=None, index=None, sep='\t', mode='a')
For np.savetxt:
np.savetxt(r'c:\data\np.txt', df.values, fmt='%d', delimiter='\t')
A way to get Excel data into a text file in tab-delimited form. This needs pandas as well as xlrd.
import pandas as pd
import xlrd
import os

# Note: xlrd 2.0+ no longer reads .xlsx files, so this requires xlrd < 2.0
Path = r"C:\downloads"
wb = pd.ExcelFile(Path + "\\input.xlsx", engine=None)
sheet2 = pd.read_excel(wb, sheet_name="Sheet1")
Excel_Filter = sheet2[sheet2['Name'] == 'Test']
Excel_Filter.to_excel(Path + "\\output.xlsx", index=None)
wb2 = xlrd.open_workbook(Path + "\\output.xlsx")
df = wb2.sheet_by_name("Sheet1")
x = df.nrows
y = df.ncols

for i in range(0, x):
    for j in range(0, y):
        A = str(df.cell_value(i, j))
        f = open(Path + "\\emails.txt", "a")
        f.write(A + "\t")
        f.close()
    f = open(Path + "\\emails.txt", "a")
    f.write("\n")
    f.close()

os.remove(Path + "\\output.xlsx")
print(Excel_Filter)
We need to first generate the xlsx file with the filtered data and then convert that information into a text file.
Depending on requirements, we can use \n and \t in the loops to shape the data written to the text file.
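For reference, a pandas-only sketch of the same filter-and-export, with no xlrd and no intermediate workbook (assumes openpyxl is installed for .xlsx reading):
import pandas as pd

sheet = pd.read_excel(r"C:\downloads\input.xlsx", sheet_name="Sheet1")
filtered = sheet[sheet['Name'] == 'Test']
filtered.to_csv(r"C:\downloads\emails.txt", sep='\t', index=False)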
I used a slightly modified version:
with open(file_name, 'w', encoding='utf-8') as f:
    for rec_index, rec in df.iterrows():
        f.write(rec['<field>'] + '\n')
I had to write the contents of a dataframe field (that was delimited) as a text file.
If you have a DataFrame that is the output of the pandas compare method, it looks like this when printed:
grossRevenue netRevenue defaultCost
self other self other self other
2098 150.0 160.0 NaN NaN NaN NaN
2110 1400.0 400.0 NaN NaN NaN NaN
2127 NaN NaN NaN NaN 0.0 909.0
2137 NaN NaN 0.000000 8.900000e+01 NaN NaN
2150 NaN NaN 0.000000 8.888889e+07 NaN NaN
2162 NaN NaN 1815.000039 1.815000e+03 NaN NaN
I was looking to persist the whole dataframe into a text file exactly as it is displayed above. Using pandas' to_csv or numpy's savetxt does not achieve this goal. I used plain old print to log the same into a text file:
with open('file1.txt', mode='w') as file_object:
    print(data_frame, file=file_object)
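If you prefer an explicit write over print, DataFrame.to_string handles the MultiIndex columns of a compare result as well:
with open('file1.txt', mode='w') as file_object:
    file_object.write(data_frame.to_string())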