I am trying to read a text file which has split lines randomly generated at column 28th from a third party.
When I conver to csv it is fine but, when I feed the files to Athena, it is not able to read because of split.
Is there a way to fine the CR here and put it back as other lines are?
Thanks,
SM
This is a code snippet :
import pandas as pd
add_columns = ["col1", "col2", "col3"...."col59"]
res = pd.read_csv("file_name.txt", names= add_columns, sep=',\s+', delimiter=',', encoding="utf-8", skipinitialspace=True)
df = pd.DataFrame(res)
df.to_csv('final_name.csv', index = None)
file_name.txt
99,999,00499013,X701,,,5669,5669,1232,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1232,LXA,,<<line is split on column 28>>
2,5669,,,,68,,,1,,,,,,,,,,,,71,
99,999,00499017,X701,,,5669,5669,1160,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1160,LXA,,1,5669,,,,,,,1,,,,,,,,,,,,71,
99,999,00499019,X701,,,5669,5669,1284,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1284,LXA,,2,5669,,,,66,,,1,,,,,,,,,,,,71,
I have tried str.split but, no luck.
If you are able to convert it successfully to CSV using pandas, you can try to save it as a CSV to feed into Athena.
Hi i'm trying to convert .dat file to .csv file.
But I have a problem with it.
I have a file .dat which looks like(column name)
region GPS name ID stop1 stop2 stopname1 stopname2 time1 time2 stopgps1 stopgps2
it delimiter is a tab.
so I want to convert dat file to csv file.
but the data keeps coming out in one column.
i try to that, using next code
import pandas as pd
with open('file.dat', 'r') as f:
df = pd.DataFrame([l.rstrip() for l in f.read().split()])
and
with open('file.dat', 'r') as input_file:
lines = input_file.readlines()
newLines = []
for line in lines:
newLine = line.strip('\t').split()
newLines.append(newLine)
with open('file.csv', 'w') as output_file:
file_writer = csv.writer(output_file)
file_writer.writerows(newLines)
But all the data is being expressed in one column.
(i want to express 15 column, 80,000 row, but it look 1 column, 1,200,000 row)
I want to convert this into a csv file with the original data structure.
Where is a mistake?
Please help me... It's my first time dealing with data in Python.
If you're already using pandas, you can just use pd.read_csv() with another delimiter:
df = pd.read_csv("file.dat", sep="\t")
df.to_csv("file.csv")
See also the documentation for read_csv and to_csv
Like the Title says, I am trying to read an online data file that is in .tbl format. Here is the link to the data: https://irsa.ipac.caltech.edu/data/COSMOS/tables/morphology/cosmos_morph_cassata_1.1.tbl
I tried the following code
cosmos= pd.read_table('https://irsa.ipac.caltech.edu/data/COSMOS/tables/morphology/cosmos_morph_cassata_1.1.tbl')
Running this didn't give me any errors however when I wrote print (cosmos.column), it didn't give me a list of individuals columns instead python put everything together and gave me the output that looks like:
Index(['| ID| RA| DEC| MAG_AUTO_ACS| R_PETRO| R_HALF| CONC_PETRO| ASYMMETRY| GINI| M20| Axial Ratio| AUTOCLASS| CLASSWEIGHT|'], dtype='object').
My main goal is to print the columns of that table individually and then print cosmos['RA']. Anyone know how this can be done?
Your file has four header rows and different delimiters in header (|) and data (whitespace). You can read the data by using skiprows argument of read_table.
import requests
import pandas as pd
filename = 'cosmos_morph_cassata_1.1.tbl'
url = 'https://irsa.ipac.caltech.edu/data/COSMOS/tables/morphology/' + filename
n_header = 4
## Download large file to disc, so we can reuse it...
table_file = requests.get(url)
open(filename, 'wb').write(table_file.content)
## Skip the first 4 header rows and use whitespace as delimiter
cosmos = pd.read_table(filename, skiprows=n_header, header=None, delim_whitespace=True)
## create header from first line of file
with open(filename) as f:
header_line = f.readline()
## trim whitespaces and split by '|'
header_columns = header_line.replace(' ', '').split('|')[1:-1]
cosmos.columns = header_columns
i have huge csv and i tried to filter data using with open.
I know i can use FINDSTR on command line but i would like to use python to create a new file filtered or i would like to create a pandas dataframe as output.
here is my code:
outfile = open('my_file2.csv', 'a')
with open('my_file1.csv', 'r') as f:
for lines in f:
if '31/10/2018' in lines:
print(lines)
outfile.write(lines)
The problem is that the output file generated is = input file and there is no filter(and the size of file is the same)
Thanks to all
The problem with your code is the indentation of the last line. It should be within the if-statement, so only lines that contain '31/10/2018' get written.
outfile = open('my_file2.csv', 'a')
with open('my_file1.csv', 'r') as f:
for lines in f:
if '31/10/2018' in lines:
print(lines)
outfile.write(lines)
To filter using Pandas and creating a DataFrame, do something along the lines of:
import pandas as pd
import datetime
# I assume here that the date is in a seperate column, named 'Date'
df = pd.read_csv('my_file1.csv', parse_dates=['Date'])
# Filter on October 31st 2018
df_filter = df[df['Date'].dt.date == datetime.date(2018, 10, 31)]
# Output to csv
df_filter.to_csv('my_file2.csv', index=False)
(For very large csv's, look at the pd.read_csv() argument 'chunksize')
To use with open(....) as f:, you could do something like:
import pandas as pd
filtered_list = []
with open('my_file1.csv', 'r') as f:
for lines in f:
if '31/10/2018' in lines:
print(lines)
# Split line by comma into list
line_data = lines.split(',')
filtered_list.append(line_data)
# Convert to dataframe and export as csv
df = pd.DataFrame(filtered_list)
df_filter.to_csv('my_file2.csv', index=False)
I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv with whitespace delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with locto get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. You can make a dictionary from file 1 and read the IDs from File 2. The IDS from file 2 can be checked in the dictionary and only the matching ones can be written to your output file. Something like this could work :
with open('data.csv','r') as f:
lines = f.readlines()
#Skip the CSV Header
lines = lines[1:]
table = {l.split()[0]:l.split()[1] for l in lines if len(l.strip()) != 0}
with open('id.csv','r') as f:
lines = f.readlines()
#Skip the CSV Header
lines = lines[1:]
matchedIDs = [(l.strip(),table[l.strip()]) for l in line if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
I'm also new to python programming. So the code that I used below might not be the most efficient. The situation I assumed is that find ids in data.csv also in id.csv, there might be some ids in data.csv not in id.csv and vise versa.
import pandas as pd
data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')
data.ID = data['ID']
id2.ID = idd['IDs']
d=[]
for row in data.ID:
d.append(row)
f=[]
for row in id2.ID:
f.append(row)
g=[]
for i in d:
if i in f:
g.append(i)
data = pd.read_csv('data.csv',index_col='ID')
new_data = data.loc[g,:]
new_data.to_csv('new_data.csv')
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')