How to determine the shape of a .tsv file through Python - python

I have a .tsv file that looks like this: [image: .tsv file structure shown in MS Excel]
I want to determine its shape through Python. How can I do that?
I wrote this code:
import pandas as pd
df = pd.read_csv("path/to/file.tsv")
df.shape
and it outputs
(13596, 1)
But clearly this shape conflicts with the image that I provided. What am I doing wrong?

You need to specify how the data is delimited when using pd.read_csv (unless it is comma-separated):
df = pd.read_csv('path/to/file.tsv', sep='\t')
This should load the data correctly.
See: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Edit: looking at your data, you should also specify header=None because you don't have a header row. Ideally, also supply a list of column names using the names parameter of pd.read_csv.
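Putting the three suggestions together, a minimal self-contained sketch (the file name, data, and column names here are invented for illustration):

```python
import pandas as pd

# Write a small headerless, tab-separated file for demonstration.
with open("example.tsv", "w") as f:
    f.write("1\talpha\t3.5\n")
    f.write("2\tbeta\t4.5\n")

# sep='\t' tells pandas the file is tab-delimited; header=None says the
# first row is data, not column names; names supplies our own labels.
df = pd.read_csv("example.tsv", sep="\t", header=None,
                 names=["id", "label", "value"])
print(df.shape)  # (2, 3)
```

Without sep='\t', each whole line would be parsed as a single comma-separated field, which is exactly why the question's shape came out as (13596, 1).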

The issue is that you are missing the separator argument.
import pandas as pd
df = pd.read_csv("data/test.txt")
print(df.shape)
Output: (2, 1)
import pandas as pd
df = pd.read_csv("data/test.txt", sep='\t')
print(df.shape)
Output: (2, 3)
So please add sep='\t' to your read_csv call.
Also, if your file has a header row, you can pass header=0:
pd.read_csv("data/test.txt", sep='\t', header=0)
Please let me know if this helps.

Related

Usecols do not match columns, columns expected but not found csv issue

My code for getting all column values from the Excel file is given below:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
col_list = ["storeName"]
df = pd.read_csv('/home/preety/Downloads/my_store.xlsx',usecols=col_list)
print("Column headings:")
print(df['storeName'])
The error I am getting:
File "/var/www/html/fulfilment-admin/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1232, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['CategoryName']
My Excel file is given below: [image not included]
What I exactly want is to get all store_code values in a list, but when I try, it returns the error above. I don't know what I am doing wrong here; can anyone please help me with this? Thanks in advance.
To avoid this kind of error, specify the separator argument sep= as shown here: https://stackoverflow.com/a/55514024/12385909
For those who have trouble with usecols= in the read_excel() function - you need to specify Excel column letters there, e.g. usecols="A:E" or usecols="A,C,E:F".
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
col_list = ["storeName"]
df = pd.read_csv('/home/preety/Downloads/my_store.xlsx',usecols=col_list)
print("Column headings:")
print(df['StoreName'])
The title of your column contains a capital "S", thus pandas is unable to locate "storeName" because it doesn't exist.
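The case-sensitivity point is easy to reproduce with a small CSV (the file name and column names below are invented for illustration):

```python
import pandas as pd

# Create a file whose header uses a lowercase leading "s".
with open("stores.csv", "w") as f:
    f.write("storeName,storeCode\n")
    f.write("Acme,A1\n")

# usecols matches column names exactly, including case.
df = pd.read_csv("stores.csv", usecols=["storeName"])
print(df["storeName"].tolist())  # ['Acme']

# Asking for "StoreName" (capital S) raises the same kind of error as
# in the question: "Usecols do not match columns, columns expected but
# not found: ['StoreName']"
try:
    pd.read_csv("stores.csv", usecols=["StoreName"])
except ValueError as e:
    print("mismatch:", e)
```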

How to read excel data only after a string is found but without using skiprows

I want to read the data after the string "Executed Trade". I want to do that dynamically, not using skiprows. I know openpyxl can be an option, but I am still struggling to do so. Could you please help me with this? I have many files like the one shown in the image.
Try:
import pandas as pd
# change the Excel filename and the two mentions of 'col1' to whatever the column is
df = pd.read_excel('dictatorem.xlsx')
df = df.iloc[df.col1[df.col1 == 'Executed Trades'].index.tolist()[0]+1:]
df.columns = df.iloc[0]
df = df[1:]
df = df.reset_index(drop=True)
print(df)
Example input/output: [images not included]
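The same slice-and-promote pattern can be sketched on an in-memory frame (the column name, marker string, and sample data below are assumptions standing in for the screenshot):

```python
import pandas as pd

# A frame that mimics a sheet with a junk row above a marker row,
# then a header row, then the real data.
raw = pd.DataFrame({
    "col1": ["report", "Executed Trades", "symbol", "AAPL", "MSFT"],
    "col2": [None,      None,             "qty",    "10",   "20"],
})

# Find the first row whose first column equals the marker, then keep
# everything after it.
start = raw.col1[raw.col1 == "Executed Trades"].index.tolist()[0] + 1
df = raw.iloc[start:].copy()

# Promote the first remaining row to the header and drop it from the data.
df.columns = df.iloc[0]
df = df[1:].reset_index(drop=True)
print(df)
```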

Converting comma-delimited CSV to tab-delimited CSV in pandas

I am using Python. I have a CSV file whose values are separated by tabs.
I applied a rule to each of its rows and created a new CSV file; the resulting file is comma-separated. I want this new CSV to be tab-separated as well. How can I do it?
I understand using sep='\t' can work, but where do I apply it?
I applied the following code, but it didn't work either:
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', header=None)
df_norm = df.apply(lambda x: np.where(x > 0, x/x.max(), np.where(x < 0, -x/x.min(), x)), axis=1)
df_norm.to_csv("file.csv", sep="\t")
Have you tried this?
pd.read_csv('file.csv', sep='\t')
I found the issue: the rule had changed the dtype to "object", because of which I was unable to perform any further operations. I followed Remove dtype at the end of numpy array, and converted my DataFrame to a list, which solved the issue.
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', header=None)
df_norm = df.apply(lambda x: np.where(x > 0, x/x.max(), np.where(x < 0, -x/x.min(), x)), axis=1)
df_norm = df_norm.tolist()
df_norm = np.squeeze(np.asarray(df_norm))
np.savetxt('result.csv', df_norm, delimiter=",")
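To answer the "where do I apply it?" part directly: sep='\t' goes on to_csv, since that is the call that writes the output. A round trip that keeps the result tab-separated (file names and data invented; the transformation is a placeholder for the normalisation rule):

```python
import pandas as pd

# A small tab-separated input file.
with open("data.tsv", "w") as f:
    f.write("1\t-2\t3\n4\t-5\t6\n")

df = pd.read_csv("data.tsv", sep="\t", header=None)

# Apply some transformation (identity here, standing in for the rule
# in the question).
df_norm = df * 1

# sep="\t" on to_csv is what makes the *output* tab-delimited;
# index=False and header=False avoid writing extra rows/columns.
df_norm.to_csv("result.tsv", sep="\t", index=False, header=False)
```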

How to import all fields from xls as strings into a Pandas dataframe?

I am trying to import a file from xlsx into a Python Pandas dataframe. I would like to prevent fields/columns being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting.
So for an Excel sheet with 100 columns, I would do the following, using a dict comprehension with range(99):
import pandas as pd
filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(99)}
df = pd.read_excel(filename, sheetname=0, converters=fields)
These import files do have a varying number of columns all the time, and I am looking to handle this differently than changing the range manually all the time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i : str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel()
import pandas as pd
filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution is:
1. read in one row of data just to get the column names and number of columns
2. create the dictionary automatically where each column has a string type
3. re-read the full data using the dictionary created at step 2
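The three steps above can be sketched like this (shown with read_csv so the demo is self-contained; pd.read_excel accepts the same nrows and converters parameters, and the file name and data here are invented):

```python
import pandas as pd

# A demo file where one column would otherwise lose its leading zeros.
with open("codes.csv", "w") as f:
    f.write("zip,count\n00501,3\n10001,7\n")

# Step 1: read zero data rows (header only) to discover the columns.
cols = pd.read_csv("codes.csv", nrows=0).columns

# Step 2: build a converters dict mapping every column to str.
converters = {c: str for c in cols}

# Step 3: re-read the full file with that dict.
df = pd.read_csv("codes.csv", converters=converters)
print(df["zip"].tolist())  # ['00501', '10001']
```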

How to specify column names while reading an Excel file using Pandas?

I read an Excel Sheet into a pandas DataFrame this way:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")
The first cell's value of each column is selected as the column name for the DataFrame. I want to specify my own column names. How do I do this?
This thread is 5 years old and outdated now, but still shows up on the top of the list from a generic search. So I am adding this note. Pandas now (v0.22) has a keyword to specify column names at parsing Excel files. Use:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet 1", header=None, names=['A', 'B', 'C'])
If header=None is not set, pandas seems to consider the first row as the header and delete it during parsing. If there is indeed a header but you don't want to use it, you have two choices: either (1) use the names kwarg only; or (2) use names together with header=None and skiprows=1. I personally prefer the second option, since it clearly notes that the input file is not in the format I want, and that I am doing something to work around it.
I think setting them afterwards is the only way in this case, so if you have for example four columns in your DataFrame:
df.columns = ['W','X','Y','Z']
If you know in advance what the headers in the Excel file are, it's probably better to rename them; this would rename W into A, and so on:
df.rename(columns={'W': 'A', 'X': 'B'})  # etc.
As Ram said, this post comes up at the top of search results and may be useful to some...
In pandas 0.24.2 (may be earlier as well), read_excel itself has the capability of ignoring the source headers and giving your own col names and few other good controls:
DID = pd.read_excel(file1, sheet_name=0, header=None, usecols=[0, 1, 6], names=['A', 'ID', 'B'], dtype={2:str}, skiprows=10)
# for example....
# usecols => read only specific col indexes
# dtype => specifying the data types
# skiprows => skip number of rows from the top.
Call .parse with the header=None keyword argument:
df = xl.parse("Sheet1", header=None)
In case the Excel sheet only contains the data without headers:
df = pd.read_excel("the excel file", header=None, names=["A", "B", "C"])
In case the Excel sheet already contains header names, then use skiprows to skip the line:
df = pd.read_excel("the excel file", header=None, names=["A", "B", "C"], skiprows=1)
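These keyword arguments behave the same in read_csv, which makes the second case easy to verify without an Excel file (file name and data invented for illustration):

```python
import pandas as pd

# A file that *does* contain a header row we want to replace.
with open("sheet.csv", "w") as f:
    f.write("x,y,z\n1,2,3\n4,5,6\n")

# header=None + names + skiprows=1: ignore the existing header entirely
# and substitute our own column names.
df = pd.read_csv("sheet.csv", header=None, names=["A", "B", "C"], skiprows=1)
print(df)
```

Dropping skiprows=1 here would pull the old header row ("x", "y", "z") in as a data row, which is the mistake the answer warns about.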
