Convert a text file with a particular format into dataframe

I am new to Pandas and thus I wanted to know if I can convert my text file with a particular format into a Pandas data frame. Below is my text file format
"FACT"|"FSYM"|"POSITION"|"INDIRECT_OPTIONS"|"REPORT"|"SOURCE"|"COMMENTS"|
"ABCX"|"VVG1"|2800000|760000|2022-11-03|"A"|"INCLUDES CAR"|0
I want to load this into a pandas DataFrame with the same columns and values, split on the | sign. That is, my DataFrame columns should be FACT, FSYM, POSITION, and so on.
I am trying the code below, but it does not give me the desired output.
def convert_factset_file_to_dataframe(test_case_name, file_name):
    data = pd.read_csv("{}/output/Float_Ingestion_files/{}/{}.txt".format(str(parentDir), test_case_name, file_name), sep=',')
    print(data)
It prints the following, with nothing changed except an added index:
"FACT"|"FSYM"|"POSITION"|"INDIRECT_OPTIONS"|"REPORT"|"SOURCE"|"COMMENTS"|
0 "ABCX"|"VVG1"|2800000|760000|2022-11-03|"A"|"INCLUDES CAR"|0
Is there any other way of converting my text file format to a DataFrame besides reading it as a CSV? Or do I need to change something in my code?

You can use the argument sep (as stated in Thomas' comment).
data = pd.read_csv(filepath, sep="|")
For more information, see the documentation.
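As a minimal sketch of that suggestion, the snippet below parses the two sample lines from the question with sep="|". The in-memory io.StringIO buffer stands in for the real file and is only an assumption for illustration. Note that the header line ends with a trailing |, so pandas sees one extra, automatically named column, which happens to line up with the extra 0 at the end of the data row.

```python
import io

import pandas as pd

# The two sample lines from the question; StringIO stands in for the real file.
sample = (
    '"FACT"|"FSYM"|"POSITION"|"INDIRECT_OPTIONS"|"REPORT"|"SOURCE"|"COMMENTS"|\n'
    '"ABCX"|"VVG1"|2800000|760000|2022-11-03|"A"|"INCLUDES CAR"|0\n'
)

# sep="|" splits on the pipe; the default quotechar strips the double quotes.
df = pd.read_csv(io.StringIO(sample), sep="|")

print(df.columns.tolist())
print(df)
```

If the trailing column is not wanted, it can be dropped afterwards, for example with df.iloc[:, :-1].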

df.to_csv(file_name, sep='\t')
To use a specific encoding (e.g. 'utf-8') use the encoding argument:
df.to_csv(file_name, sep='\t', encoding='utf-8')

I think you have a typo and should call
data = pd.read_csv(
    "{}/output/Float_Ingestion_files/{}/{}.txt".format(
        str(parentDir), test_case_name, file_name
    ),
    sep="|",  # <<<<<<<<< don't choose the comma here, choose `|`
)
That is, just change the argument for the separator to be the | sign

Related

How to create a new csv from a csv that separated cell

I wrote a function to convert a CSV file.
The goal is to take a CSV file like:
,features,corr_dropped,var_dropped,uv_dropped
0,AghEnt,False,False,False
and convert it to another CSV file:
    features  corr_dropped  var_dropped  uv_dropped
0   AghEnt    False         False        False
I wrote a function for that, but it does not work: the output is the same as the input file.
Function:
def convert_file():
    input_file = "../input.csv"
    output_file = os.path.splitext(input_file)[0] + "_converted.csv"
    df = pd.read_table(input_file, sep=',')
    df.to_csv(output_file, index=False, header=True, sep=',')

you could use
df = pd.read_csv(input_file)
This works with your data, although it makes little difference. The only change is that the empty header cell before the first delimiter becomes a column named Unnamed: 0.
Is that what you wanted? I am still not entirely sure what you are trying to achieve, since you import a CSV and export the same data as a CSV without doing anything with it. The output example you showed is just a formatted version of your initial data, and formatting is not something CSV can do.

Bug on read csv format , assertin the header in jupyter notebook [duplicate]

I am importing a .csv file in Python with pandas.
Here is the format of the .csv file:
a1;b1;c1;d1;e1;...
a2;b2;c2;d2;e2;...
.....
Here is how I read it:
from pandas import *
csv_path = "C:...."
data = read_csv(csv_path)
Now when I print the data I get this:
0 a1;b1;c1;d1;e1;...
1 a2;b2;c2;d2;e2;...
And so on. So I need help reading the file and splitting the values into columns on the semicolon character ;.

read_csv takes a sep param, in your case just pass sep=';' like so:
data = read_csv(csv_path, sep=';')
The reason it failed in your case is that the default value is ',' so it scrunched up all the columns as a single column entry.

In response to Morris' question above:
"Is there a way to programatically tell if a CSV is separated by , or ; ?"
This will tell you:
import pandas as pd
df_comma = pd.read_csv(your_csv_file_path, nrows=1, sep=",")
df_semi = pd.read_csv(your_csv_file_path, nrows=1, sep=";")
if df_comma.shape[1] > df_semi.shape[1]:
    print("comma delimited")
else:
    print("semicolon delimited")
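An alternative sketch, not taken from the answer above, is to let Python's built-in csv.Sniffer guess the delimiter from a sample of the text. The helper name detect_delimiter and the sample string are hypothetical. The sniffer can raise csv.Error when it cannot decide, so the result should be treated as a guess rather than a guarantee.

```python
import csv
import io

import pandas as pd


def detect_delimiter(sample: str, candidates: str = ",;") -> str:
    """Guess which of the candidate characters separates the fields."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Hypothetical sample in the same shape as the file from the question.
text = "a1;b1;c1\na2;b2;c2\na3;b3;c3\n"
sep = detect_delimiter(text)

# Feed the detected separator to read_csv; the sample has no header row.
data = pd.read_csv(io.StringIO(text), sep=sep, header=None)
print(repr(sep), data.shape)
```

pandas can also do this detection itself when called with sep=None and engine="python", which uses the same sniffer internally.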

Pandas error reading csv with double quotes

I've read all related topics - like this, this and this - but couldn't get a solution to work.
I have an input csv file like this:
ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:
ItemId Content
-------- -------------------------------------------------------------------------------
i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
With the following code (Python 3.9):
df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')
As far as I understand, the commas inside the dictionary column and inside quotation marks are being treated as regular separators, so it raises the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
Is it possible to produce desired result? Thanks.

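The problem is that the commas in the Content column are interpreted as separators. You can solve this by using pd.read_fwf to manually set the character positions on which to split. This assumes that every ItemId is exactly 8 characters wide and that each Content value fits within the given column range: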
df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])
Result:
   ItemId    Content
0  i0000008  {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

I don't think you'll be able to read it normally with pandas because it has the delimiter used multiple times for a single value; however, reading it with python and doing some processing, you should be able to convert it to pandas dataframe:
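import pandas as pd
def splitValues(x):
    # Split on the first comma only, so commas inside Content are preserved.
    index = x.find(',')
    return x[:index], x[index+1:].strip()
# Open the file in a with block so it is closed once the rows are read.
with open('file.csv') as data:
    columns = next(data).strip().split(',')
    df = pd.DataFrame(columns=columns, data=[splitValues(row) for row in data])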
OUTPUT:
ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
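Once the rows are split on the first comma, the Content strings in the question's sample look like valid JSON, so they could be parsed into dictionaries. The following is a minimal sketch under that assumption. The io.StringIO buffer stands in for test.csv, and parsing with json.loads is an optional extra step that the question did not ask for.

```python
import io
import json

import pandas as pd

# Stand-in for test.csv, using the sample rows from the question.
raw = io.StringIO(
    'ItemId,Content\n'
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n'
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n'
)

header = next(raw).strip().split(",")

# maxsplit=1 splits on the first comma only, so commas in Content survive.
rows = [line.rstrip("\n").split(",", 1) for line in raw if line.strip()]
df = pd.DataFrame(rows, columns=header)

# Optional: turn each JSON string into a Python dict.
df["Content"] = df["Content"].apply(json.loads)
print(df.loc[1, "Content"]["Title"])
```

If the Content column ever contains text that is not valid JSON, json.loads raises an error, so this step only suits files that follow the sample's format.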

Turning Panda Column into text file seperated by line break

I would like to create a txt file, where every line is a so called "ticker symbol" (=symbol for a stock). As a first step, I downloaded all the tickers I want via a wikipedia api:
import pandas as pd
import wikipedia as wp
html1 = wp.page("List of S&P 500 companies").html().encode("UTF-8")
df = pd.read_html(html1,header =0)[0]
df = df.drop(['SEC filings','CIK', 'Headquarters Location', 'Date first added', 'Founded'], axis = 1)
df.columns = df.columns.str.replace('Symbol', 'Ticker')
Secondly, I would like to create a txt file as mentioned above with all the ticker names from the "Ticker" column of df. To do so, I probably have to do something similar to:
f = open("tickertest.txt","w+")
f.write("MMM\nABT\n...etc.")
f.close()
Now my problem: Does anybody know how it is possible to bring my Ticker column from df into one big string where between every ticker there is a \n or every ticker is on a new line?

You can use to_csv for this.
df.to_csv("test.txt", columns=["Ticker"], header=False, index=False)
This provides flexibility to include other columns, column names, and index values at some future point (should you need to do some sleuthing, or in case your boss asks for more information). You can even change the separator. For example, a simple modification would be:
df.to_csv("test.txt", columns=["Ticker", "Symbol",], header=True, index=True, sep="\t")
I think the benefit of this method over jfaccioni's answer is flexibility and ease of adaptability. It also gets you away from explicitly opening a file. However, if you still want to open a file explicitly, consider using "with", which automatically closes the file when you leave the indented block, e.g.
with open("test.txt", "w") as fid:
    fid.write("MMM\nABT\n...etc.")

This should do the trick:
'\n'.join(df['Ticker'].astype(str).values)
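To connect this with the file-writing step from the question, here is a small sketch that joins the column and writes the result to disk. The three-row DataFrame is a hypothetical stand-in for the df built from the Wikipedia table, and the output location in the temporary directory is likewise just for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the df built from the Wikipedia table.
df = pd.DataFrame({"Ticker": ["MMM", "ABT", "ABBV"]})

# One ticker per line, joined into a single string.
text = "\n".join(df["Ticker"].astype(str))

# Write the string out; the with block closes the file automatically.
out_path = os.path.join(tempfile.gettempdir(), "tickertest.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(text + "\n")

print(text)
```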

Store (df.info) method output in DataFrame or CSV

I have a giant DataFrame (df) whose dimensions are (42,--- x 135). I'm running df.info on it, but the output is unreadable. I'm wondering if there is any way to dump it into a DataFrame or CSV? I think it has something to do with:
buf : writable buffer, defaults to sys.stdout
Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer
if you need to further process the output.
But when I add (buf = buffer), the output is just each word of the output followed by a new line, which is very hard to read or work with. My goal is to better understand which columns are in the DataFrame and to be able to sort them by type.

You need to open a file then pass the file handle to df.info:
with open('info_output.txt','w') as file_out:
    df.info(buf=file_out)

You could try avoiding pandas.DataFrame.info() and instead build the information that you need as a pandas.DataFrame:
import pandas as pd
def get_info(df: pd.DataFrame):
    # One row per column of df, with its dtype and a few summary values.
    info = df.dtypes.to_frame('dtypes')
    info['non_null'] = df.count()
    info['unique_values'] = df.apply(lambda srs: len(srs.unique()))
    info['first_row'] = df.iloc[0]
    info['last_row'] = df.iloc[-1]
    return info
And write it to CSV with get_info(df).to_csv('info_output.csv').
The memory usage information may also be useful, so you could do:
df.memory_usage().sum()
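Since the stated goal is to sort the columns by type, a summary frame like the one above can be sorted or grouped on the dtype. The sketch below assumes a small made-up DataFrame; the column names and the by_type dictionary are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Made-up data standing in for the large DataFrame from the question.
df = pd.DataFrame(
    {
        "price": [1.0, 2.5],
        "rooms": [3, 4],
        "city": ["A", "B"],
        "area": [50.0, 70.5],
    }
)

# One row per column: its dtype (as a string) and its non-null count.
summary = pd.DataFrame(
    {"dtype": df.dtypes.astype(str), "non_null": df.count()}
)

# Sort so that columns of the same type sit next to each other.
summary = summary.sort_values("dtype", kind="stable")

# Map each dtype name to the list of columns that have it.
by_type = {dtype: list(group.index) for dtype, group in summary.groupby("dtype")}

print(summary)
print(by_type)
```

The sorted summary can then be saved with summary.to_csv('info_output.csv') in the same way as the get_info result.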

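import io
import pandas as pd
df = pd.read_csv('/content/house_price.csv')
# Capture the df.info() text in memory instead of printing it.
buffer = io.StringIO()
df.info(buf=buffer)
s = buffer.getvalue()
# Keep only the per-column rows between the dashed header rule and the "dtypes" footer.
with open("df_info.csv", "w", encoding="utf-8") as f:
    f.write(s.split(" ----- ")[1].split("dtypes")[0])
# Raw string so that \s is passed to the regex engine unchanged.
di = pd.read_csv('df_info.csv', sep=r"\s+", header=None)
di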

Just to build on mechanical_meat's and Adam Safi's combined solution, the following code will convert the info output into a dataframe with no manual intervention:
with open('info_output.txt','w') as file_out:
    df.info(buf=file_out)
info_output_df = pd.read_csv('info_output.txt', sep=r"\s+", header=None, index_col=0, engine='python', skiprows=5, skipfooter=2)
Note that according to the docs, the 'skipfooter' option is only compatible with the python engine.
