I am new to Python and I am trying to read a CSV file using pandas, but I have a problem with the file itself.
Some string values end with a comma, which creates an unwanted extra column at the end of the row, as shown:
This is the raw CSV:
For example, on line 14, the green string value ends with a comma and creates a new column, which then gives me a parsing error when I run this:
import pandas as pd
pd.read_csv("data.csv")
ParserError: Error tokenizing data. C error: Expected 6 fields in line 8, saw 7
Is there a way I can clean this up and merge the last two columns?
You can use np.where to replace APP with the last column where APP is missing, then drop the last column.
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
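# Where APP is missing, take the value from the last column instead
df['APP'] = np.where(df['APP'].isna(), df.iloc[:, -1], df['APP'])
# Drop the now-redundant last column
df = df.iloc[:, :-1]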
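If read_csv itself raises the ParserError before any column can be repaired, one option is to pre-process the rows with the standard csv module. The following is only a minimal sketch: the sample data and the helper name read_csv_folding_extra_fields are made up for illustration, and it assumes that any surplus fields in a row belong to the last column.

```python
import csv
import io

import pandas as pd


def read_csv_folding_extra_fields(source):
    """Read CSV rows from a file-like object, folding surplus fields into the last column.

    Hypothetical helper: assumes the first row is the header and that any
    extra fields in a row belong to the last column.
    """
    reader = csv.reader(source)
    header = next(reader)
    width = len(header)
    rows = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) > width:
            # Keep the non-empty overflow pieces; a trailing comma yields "".
            tail = [field for field in row[width - 1:] if field != ""]
            row = row[:width - 1] + [",".join(tail)]
        elif len(row) < width:
            # Pad short rows so every row has the same number of fields.
            row = row + [""] * (width - len(row))
        rows.append(row)
    return pd.DataFrame(rows, columns=header)


# Made-up sample: the second data row ends with a stray comma.
sample = "ID,NAME,APP\n1,apple,red\n2,lime,green,\n"
df = read_csv_folding_extra_fields(io.StringIO(sample))
print(df)
```

With a real file, pass open("data.csv", newline="") instead of the StringIO object. Every value comes back as a string, so numeric columns would still need converting, for example with pd.to_numeric.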
Related
I am trying to create a list from a CSV. The CSV contains a two-dimensional table (540 rows and 8 columns), and I would like to create a list containing the values of a specific column, column 4.
I tried list(df.columns.values)[4]. It returns the name of the column, but I'm trying to get the values from the rows in column 4 and make them a list.
import pandas as pd
import urllib
#This is the empty list
company_name = []
#Uploading CSV file
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')
#Extracting list of all companies name from column "Name of Stock"
companies_column=list(df.columns.values)[4] #This returns the name of the column.
companies_column = list(df.iloc[:,4].values)
So for this you can just add the following line after the code you've posted (while companies_column still holds the column name):
company_name = df[companies_column].tolist()
This gets the data in the companies column as a pandas Series (essentially a Series is just a fancy list) and then converts it to a regular Python list.
Or, if you were to start from scratch, you can also just use these lines:
import pandas as pd
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')
company_name = df[df.columns[4]].tolist()
Another option: if this is the only thing you need to do with your CSV file, you can also get away with just using the csv module that comes with Python instead of installing pandas, using this approach.
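As a rough illustration of the csv-module route, here is a minimal sketch. The in-memory sample and every column name except "Name of Stock" are made up, and it assumes the wanted column sits at zero-based position 4.

```python
import csv
import io

# Made-up stand-in for the file; with a real file use open(path, newline="").
sample = io.StringIO(
    "Date,Ticker,Sector,Exchange,Name of Stock,Price\n"
    "2020-01-02,AAA,Tech,NYSE,Alpha Corp,10.5\n"
    "2020-01-03,BBB,Energy,NYSE,Beta Ltd,7.2\n"
)

reader = csv.reader(sample)
header = next(reader)   # the first row holds the column names
column_index = 4        # zero-based position of the wanted column

# Collect that column's value from every non-empty data row.
company_name = [row[column_index] for row in reader if row]

print(header[column_index])
print(company_name)
```

This avoids the pandas dependency, at the cost of getting every value back as a plain string.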
If you want to learn more about how to get data out of your pandas DataFrame (the df variable in your code), you might find this blog post helpful.
I think you can try this to get all the values of a specific column:
companies_column = df["column name"]
Replace "column name" with the name of the column whose values you want to access.
I am using this code, but instead of a new file with just the required rows, I'm getting an empty .csv with just the header.
import pandas as pd
df = pd.read_csv("E:/Mac&cheese.csv")
newdf = df[df["fruit"]=="watermelon"+"*"]
newdf.to_csv("E:/Mac&cheese(2).csv",index=False)
I believe the problem is in how you select the rows containing the word "watermelon". Instead of:
newdf = df[df["fruit"]=="watermelon"+"*"]
Try:
newdf = df[df["fruit"].str.contains("watermelon")]
In your example, pandas is looking for cells whose value is exactly the string "watermelon*"; the asterisk is treated as a literal character, not a wildcard.
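The difference between the two filters can be seen on a small made-up frame. This is only a sketch: the data is invented, and na=False is added on the assumption that the column may contain missing values, which would otherwise break the boolean mask.

```python
import pandas as pd

# Made-up data; the "fruit" column name comes from the question.
df = pd.DataFrame({
    "fruit": ["watermelon", "seedless watermelon", "apple", None],
    "qty": [3, 1, 5, 2],
})

# Equality compares against the literal string "watermelon*": no row matches.
exact = df[df["fruit"] == "watermelon" + "*"]

# Substring match anywhere in the cell; missing values count as no match.
contains = df[df["fruit"].str.contains("watermelon", na=False)]

# Prefix match, closest to what a trailing "*" wildcard usually means.
starts = df[df["fruit"].str.startswith("watermelon", na=False)]

print(len(exact), len(contains), len(starts))
```

Note that str.contains treats its pattern as a regular expression by default; passing regex=False makes it a plain substring test, which matters if the search text contains characters such as "*" or ".".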
You are missing the underscore in pd.read_csv on the first call. Also, it looks like the file location is incorrect: the // is missing from the path.
My code for getting all the values of a column from Excel is given below:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
col_list = ["storeName"]
df = pd.read_csv('/home/preety/Downloads/my_store.xlsx',usecols=col_list)
print("Column headings:")
print(df['storeName'])
The error I am getting:
File "/var/www/html/fulfilment-admin/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1232, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['CategoryName']
My Excel sheet is given below:
What I want is all the store_code values in a list, but when I try to get them I receive the error above. I don't know what I am doing wrong. Can anyone please help me with this? Thanks in advance.
To avoid this kind of error, specify the separator argument sep= as shown here: https://stackoverflow.com/a/55514024/12385909
For those who have trouble with usecols= in the read_excel() function: you can specify Excel column letters here, e.g. usecols="A:E" or usecols="A,C,E:F".
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
col_list = ["StoreName"]  # must match the header exactly, including case
df = pd.read_excel('/home/preety/Downloads/my_store.xlsx', usecols=col_list)  # .xlsx needs read_excel, not read_csv
print("Column headings:")
print(df['StoreName'])
The title of your column contains a capital "S", thus pandas is unable to locate "storeName" because it doesn't exist.
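One way to guard against this kind of case mismatch is to inspect the header first and then select columns case-insensitively. The sketch below is an illustration under assumptions: the sample data and the StoreCode and City columns are made up, and it uses read_csv on in-memory text so it runs without an Excel engine. read_excel also accepts nrows and a callable for usecols, so the same idea should carry over.

```python
import io

import pandas as pd

# Made-up sample standing in for the spreadsheet contents.
sample = "StoreName,StoreCode,City\nNorth Mart,N001,Leeds\nSouth Mart,S002,Bath\n"

# Step 1: read only the header to see the exact column names.
header = pd.read_csv(io.StringIO(sample), nrows=0)
print(list(header.columns))

# Step 2: pass a callable to usecols so the match ignores case and stray spaces.
wanted = {"storename"}
df = pd.read_csv(
    io.StringIO(sample),
    usecols=lambda name: name.strip().lower() in wanted,
)

# The column keeps its original spelling from the file.
store_names = df["StoreName"].tolist()
print(store_names)
```

Printing the header first is often enough on its own, since it shows the exact spelling that usecols has to match.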
I am fetching data from a database, which returns it in the CSV format below:
DS,SID,SID_T,E_DATE,S_DATA
TECH,312,TID,2021-01-03,"{""idx"":""msc"",""cid"":""f323d3"",""iname"":""master_in_science"",""mcap"":21.33,""sg"":[{""upt"":true,""dwt"":true,""high_low"":false}]}"
TECH,343,TID,2021-01-03,"{""idx"":""bsc"",""cid"":""k33d3"",""iname"":""bachelor_in_science"",""mcap"":81.33,""sg"":[{""upt"":false,""dwt"":true,""high_low"":false}]}"
TECH,554,TID,2021-01-03,"{""idx"":""ba"",""cid"":""3d3f32"",""iname"":""bachelor_in_art"",""mcap"":67.83,""sg"":[{""upt"":true,""dwt"":false,""high_low"":false}]}"
TECH,323,TID,2021-01-03,"{""idx"":""ma"",""cid"":""m23k66"",""iname"":""master_in_art"",""mcap"":97.13,""sg"":[{""upt"":true,""dwt"":true,""high_low"":true}]}"
The dataframe looks like this:
I want to split the S_DATA column into multiple columns,
so that the output looks like this:
What I have tried:
I tried to convert the dataframe to JSON and then normalize the JSON using pandas.json_normalize,
but I was unable to do so.
Also, the S_DATA.sg values, i.e. ""sg"":[{""upt"":true,""dwt"":true,""high_low"":false}], are causing trouble during the conversion.
Try this:
import pandas as pd
import json
df = pd.read_csv('data.csv')
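# Parse each JSON string into a dict
df['S_DATA'] = df['S_DATA'].apply(json.loads)
# Expand the dicts into columns and join them to the remaining columns
df = pd.concat([df.drop(columns='S_DATA'), pd.json_normalize(df['S_DATA'].tolist())], axis=1)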
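The approach above leaves sg as a column of lists. If the nested sg entries should also become columns, json_normalize can unpack them through record_path and meta. This is a sketch that assumes every sg list holds exactly one object, as in the sample rows; with more than one entry per row the row counts would no longer line up for the final concat.

```python
import io
import json

import pandas as pd

# Two rows copied from the sample data in the question.
raw = '''DS,SID,SID_T,E_DATE,S_DATA
TECH,312,TID,2021-01-03,"{""idx"":""msc"",""cid"":""f323d3"",""iname"":""master_in_science"",""mcap"":21.33,""sg"":[{""upt"":true,""dwt"":true,""high_low"":false}]}"
TECH,343,TID,2021-01-03,"{""idx"":""bsc"",""cid"":""k33d3"",""iname"":""bachelor_in_science"",""mcap"":81.33,""sg"":[{""upt"":false,""dwt"":true,""high_low"":false}]}"
'''

df = pd.read_csv(io.StringIO(raw))

# Parse the JSON strings into plain dicts.
records = df["S_DATA"].apply(json.loads).tolist()

# One output row per element of "sg"; the top-level keys are carried along as meta.
flat = pd.json_normalize(
    records,
    record_path="sg",
    meta=["idx", "cid", "iname", "mcap"],
    record_prefix="sg.",
)

# Assumes one "sg" element per record, so flat lines up row-for-row with df.
result = pd.concat([df.drop(columns="S_DATA"), flat], axis=1)
print(result.columns.tolist())
```

The record_prefix argument only keeps the nested names distinguishable (sg.upt, sg.dwt, sg.high_low); it can be dropped if plain names are preferred.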
I am trying to import a file from xlsx into a pandas dataframe. I would like to prevent fields/columns from being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting.
So for an Excel sheet with 100 columns, I would do the following using a dict comprehension with range(100).
import pandas as pd
filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(100)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files have a varying number of columns, and I am looking for a way to handle this other than changing the range manually every time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i : str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel()
import pandas as pd
filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution is:
1. Read in one row of data just to get the column names and the number of columns.
2. Create the dictionary automatically, where each column has a string type.
3. Re-read the full data using the dictionary created in step 2.
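The steps above can be sketched as follows. This is an illustration under assumptions: the sample data is made up, and read_csv on in-memory text is used so the snippet runs without an Excel engine. read_excel accepts the same nrows and converters arguments, so swapping the reader should be the only change needed for an xlsx file.

```python
import io

import pandas as pd

# Made-up sample with leading zeros that would be lost under type inference.
sample = "code,zip,amount\n007,02134,10\n042,10001,25\n"


def open_source():
    """Return a fresh file-like object each time (hypothetical stand-in for a path)."""
    return io.StringIO(sample)


# Step 1: read only the header row to learn the column names.
header = pd.read_csv(open_source(), nrows=0)

# Step 2: build the converters dict automatically, one str entry per column.
converters = {name: str for name in header.columns}

# Step 3: re-read the full data with every column kept as text.
df = pd.read_csv(open_source(), converters=converters)
print(df)
```

Because the dictionary is built from the header, it adapts to however many columns a given file has.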