I have a .tsv file that looks like this (screenshot: the .tsv file structure opened in MS Excel).
I want to determine its shape through PyTorch. How can I do that?
I wrote this code:
import pandas as pd
df = pd.read_csv("path/to/.tsv")
df.shape
and it outputs:
(13596, 1)
But clearly the shape conflicts with the image that I provided. What am I doing wrong?
You need to specify how the data is delimited when using pd.read_csv (unless it is comma-separated):
df = pd.read_csv("path/to/.tsv", sep='\t')
Should load the data correctly.
See: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Edit: looking at your data, you should also specify header=None, because you don't have a header row. Ideally, also supply a list of column names using the names parameter of pd.read_csv.
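A minimal sketch putting those pieces together (the demo file contents and column names here are invented; substitute your own):

```python
import pandas as pd

# Build a tiny headerless, tab-separated demo file (stand-in for your .tsv).
with open("demo.tsv", "w") as f:
    f.write("1\tfoo\t3.5\n2\tbar\t7.1\n")

df = pd.read_csv(
    "demo.tsv",
    sep="\t",                        # the file is tab-delimited
    header=None,                     # there is no header row
    names=["id", "label", "value"],  # supply your own column names
)
print(df.shape)  # (2, 3)
```

With the correct separator, each tab-delimited field becomes its own column instead of the whole line landing in one column.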
The issue is that you are missing the separator argument.
import pandas as pd
df = pd.read_csv("data/test.txt")
print(df.shape)
Output: (2, 1)
import pandas as pd
df = pd.read_csv("data/test.txt", sep='\t')
print(df.shape)
Output: (2, 3)
So please add sep='\t' to your read_csv call.
Also, if you have a header row, you can pass header=0:
pd.read_csv("data/test.txt", sep='\t', header=0)
Please let me know if this helps.
My code for getting all column values from the Excel file is given below:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
col_list = ["storeName"]
df = pd.read_csv('/home/preety/Downloads/my_store.xlsx',usecols=col_list)
print("Column headings:")
print(df['storeName'])
The error I am getting:
File "/var/www/html/fulfilment-admin/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1232, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['CategoryName']
My Excel sheet is shown below (screenshot):
What I exactly want is to get all store_code values in a list, but when I try to get them it returns the error above. I don't know what I am doing wrong here; can anyone please help me with this? Thanks in advance.
To avoid this kind of error, specify the separator argument sep=, as shown here: https://stackoverflow.com/a/55514024/12385909
For those who have trouble with usecols= in the read_excel() function: you can specify Excel column letters here, e.g. usecols="A:E" or usecols="A,C,E:F".
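A small sketch of the letter form (the demo workbook and its column names are invented for illustration; writing/reading .xlsx requires openpyxl):

```python
import pandas as pd

# Build a small demo workbook -- a stand-in for my_store.xlsx.
pd.DataFrame(
    {"StoreName": ["a", "b"], "StoreCode": [1, 2], "City": ["x", "y"]}
).to_excel("demo.xlsx", index=False)

# With read_excel, usecols may be Excel column letters ("A:B", "A,C")
# as well as a list of column names.
df = pd.read_excel("demo.xlsx", usecols="A:B")
print(df.columns.tolist())  # ['StoreName', 'StoreCode']
```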
import pandas as pd

col_list = ["StoreName"]  # note the capital "S" -- must match the sheet exactly
df = pd.read_excel('/home/preety/Downloads/my_store.xlsx', usecols=col_list)  # read_excel, since the file is .xlsx
print("Column headings:")
print(df['StoreName'])
The title of your column contains a capital "S"; pandas is unable to locate "storeName" because that column doesn't exist.
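A quick way to spot this kind of mismatch is to print the exact labels pandas parsed before indexing; a toy example:

```python
import pandas as pd

df = pd.DataFrame({"StoreName": ["a", "b"], "StoreCode": [1, 2]})

# Column labels are matched case-sensitively, so "storeName" and
# "StoreName" are different keys.
print(df.columns.tolist())        # ['StoreName', 'StoreCode']
print("storeName" in df.columns)  # False
print("StoreName" in df.columns)  # True
```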
I want to read the data after the string "Executed Trades", and I want to do that dynamically, not using skiprows. I know openpyxl can be an option, but I am still struggling to do so. Could you please help me with this, because I have many files like the one shown in the image.
Try:
import pandas as pd
# change the Excel filename and the two mentions of 'col1' to whatever your column is
df = pd.read_excel('dictatorem.xlsx')
df = df.iloc[df.col1[df.col1 == 'Executed Trades'].index.tolist()[0]+1:]
df.columns = df.iloc[0]
df = df[1:]
df = df.reset_index(drop=True)
print(df)
Example input/output:
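The same slicing idea can be tried on a small in-memory frame (the column names col1/col2 and the values below are stand-ins for the real sheet):

```python
import pandas as pd

# Stand-in for the sheet: a marker row "Executed Trades", then a header
# row, then the data we actually want.
df = pd.DataFrame({"col1": ["junk", "Executed Trades", "price", "10", "20"],
                   "col2": [None, None, "qty", "1", "2"]})

# Find the row after the marker, promote it to the header, keep the rest.
start = df.col1[df.col1 == "Executed Trades"].index.tolist()[0] + 1
df = df.iloc[start:]
df.columns = df.iloc[0]
df = df[1:].reset_index(drop=True)
print(df)  # two data rows with columns 'price' and 'qty'
```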
I am using Python. I have a CSV file whose values are separated by tabs. I applied a rule to each of its rows and created a new CSV file; the resulting file is comma-separated, but I want the new CSV to be tab-separated as well. How can I do it?
I understand that using sep='\t' can work, but where do I apply it?
I applied the following code, but it didn't work either:
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv', header=None)
df_norm= df.apply(lambda x:np.where(x>0,x/x.max(),np.where(x<0,-x/x.min(),x)),axis=1)
df_norm.to_csv("file.csv", sep="\t")
Have you tried this?
pd.read_csv('file.csv', sep='\t')
I found the issue: the rule had changed the dtype to "object", because of which I was unable to perform any further operations. I followed Remove dtype at the end of numpy array and converted my data frame to a list, which solved the issue.
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv', header=None)
df_norm = df.apply(lambda x: np.where(x > 0, x / x.max(), np.where(x < 0, -x / x.min(), x)), axis=1)
df_norm = df_norm.tolist()
df_norm = np.squeeze(np.asarray(df_norm))
np.savetxt('result.csv', df_norm, delimiter='\t')  # '\t' so the output is tab-separated, as intended
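For what it's worth, a sketch of an alternative that sidesteps the object-dtype problem entirely by applying the rule on the underlying NumPy array (the demo file and its values are invented):

```python
import numpy as np
import pandas as pd

# Tab-separated demo input -- a stand-in for data.csv.
with open("demo.tsv", "w") as f:
    f.write("1\t-2\t4\n-8\t2\t4\n")

df = pd.read_csv("demo.tsv", sep="\t", header=None)

# Same per-row rule, but computed on a plain float matrix, so the result
# never becomes an object column holding arrays.
a = df.to_numpy(dtype=float)
row_max = a.max(axis=1, keepdims=True)
row_min = a.min(axis=1, keepdims=True)
norm = np.where(a > 0, a / row_max, np.where(a < 0, -a / row_min, a))

# Tab-separated output, matching the input format.
pd.DataFrame(norm).to_csv("result.tsv", sep="\t", header=False, index=False)
```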
I am trying to import a file from xlsx into a Python Pandas dataframe. I would like to prevent fields/columns being interpreted as integers and thus losing leading zeros or other desired heterogenous formatting.
So for an Excel sheet with 100 columns, I would do the following, using a dict comprehension with range(100).
import pandas as pd

filename = r'C:\DemoFile.xlsx'  # raw string so the backslash is literal
fields = {col: str for col in range(100)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files do have a varying number of columns all the time, and I am looking to handle this differently than changing the range manually all the time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i : str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel()
import pandas as pd
filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution is:
read in one row of data just to get the column names and number of columns
create the dictionary automatically where each columns has a string type
re-read the full data using the dictionary created at step 2.
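The three steps above can be sketched as follows, assuming a recent pandas where read_excel accepts nrows and dtype (the demo workbook is invented, and writing/reading .xlsx requires openpyxl):

```python
import pandas as pd

# Demo workbook with a leading-zero value -- a stand-in for the real file.
pd.DataFrame({"zip": ["00501", "02134"], "qty": [1, 2]}).to_excel(
    "demo.xlsx", index=False
)

# Step 1: read only the header row to learn the column names.
header = pd.read_excel("demo.xlsx", nrows=0)

# Step 2: build a dtype mapping that forces every column to str.
dtypes = {col: str for col in header.columns}

# Step 3: re-read the full sheet with that mapping.
df = pd.read_excel("demo.xlsx", dtype=dtypes)
print(df["zip"].tolist())  # ['00501', '02134'] -- leading zero preserved
```

This adapts automatically to however many columns the file happens to have.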
I read an Excel Sheet into a pandas DataFrame this way:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")
The first cell's value of each column is selected as the column name for the DataFrame. I want to specify my own column names. How do I do this?
This thread is 5 years old and outdated now, but it still shows up at the top of the results from a generic search, so I am adding this note. Pandas now (v0.22) has a keyword to specify column names when parsing Excel files. Use:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet 1", header=None, names=['A', 'B', 'C'])
If header=None is not set, pandas seems to treat the first row as the header and drop it during parsing. If there is indeed a header but you don't want to use it, you have two choices: either (1) use the names kwarg only; or (2) use names together with header=None and skiprows=1. I personally prefer the second option, since it clearly notes that the input file is not in the format I want and that I am doing something to work around it.
I think setting them afterwards is the only way in this case, so if you have for example four columns in your DataFrame:
df.columns = ['W','X','Y','Z']
If you know in advance what the headers in the Excel file are, it's probably better to rename them. This would rename W into A, and so on:
df.rename(columns={'W': 'A', 'X': 'B'})  # and so on for the remaining columns
As Ram said, this post comes up at the top and may be useful to some...
In pandas 0.24.2 (maybe earlier as well), read_excel itself can ignore the source headers, apply your own column names, and provides a few other useful controls:
DID = pd.read_excel(file1, sheet_name=0, header=None, usecols=[0, 1, 6], names=['A', 'ID', 'B'], dtype={2:str}, skiprows=10)
# for example....
# usecols => read only specific col indexes
# dtype => specifying the data types
# skiprows => skip number of rows from the top.
Call .parse with the header=None keyword argument:
df = xl.parse("Sheet1", header=None)
In case the Excel sheet only contains the data, without headers:
df=pd.read_excel("the excel file",header=None,names=["A","B","C"])
In case the Excel sheet already contains header names, use skiprows to skip the header line:
df=pd.read_excel("the excel file",header=None,names=["A","B","C"],skiprows=1)