I am trying to clean a dataset in pandas. The information is stored in a CSV file and is imported using:
tester = pd.read_csv('date.csv')
Every column contains a '?' where the value is missing. For example, the age column contains 9 question marks (?).
I am trying to set all the question marks to NaN. I have tried:
tester = pd.read_csv('date.csv', na_values=["?"])
tester['age'].replace("?", np.NaN)
tester.replace('?', np.NaN)
for col in tester:
    print(tester[col].value_counts(dropna=False))
This still returns 0 for age, when I know there are 9 question marks. I assume the check is failing because the value is never seen as '?'.
I have looked at the CSV file in Notepad and there is no space etc. around the character.
Is there any way of forcing this so that it is recognised?
sample data:
read_csv has a na_values parameter; see the pandas documentation.
df = pd.read_csv('date.csv', na_values='?')
You are very close:
# It looks like the file has spaces after the commas, so use `sep`
tester = pd.read_csv('date.csv', sep=', ', engine='python')
tester['age'] = tester['age'].replace('?', np.nan)
There seems to be a problem with the data somewhere, so for debugging:
pd.read_csv('file', error_bad_lines=False)
tester = tester[~(tester == '?').any(axis=1)]
OR
pd.read_csv('file', sep='delimiter', header=None)
OR
pd.read_csv('file',header=None,sep=', ')
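For completeness, here is a minimal sketch that combines the two suggestions above (the file name, separator and 'age' column are taken from the question; adjust them to your data):
import pandas as pd

# Treat '?' as missing while reading; handle the comma+space separator if present.
tester = pd.read_csv('date.csv', sep=', ', engine='python', na_values=['?'])

# Verify that the '?' entries in 'age' are now NaN.
print(tester['age'].isna().sum())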
It's probably a silly thing, but I can't seem to correctly convert a pandas series, originally obtained from an Excel sheet, to a list.
dfCI is created by importing data from an Excel sheet and looks like this:
tab var val
MsrData sortfield DetailID
MsrData strow 4
MsrData inputneeded "MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided","BiMonthlyTest"
# get list of cols for which input is needed
cols = dfCI[((dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'))]['val'].values.tolist()
print(cols)
>> ['"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"']
# replace null text with text
invalid = 'Input Needed'
for col in cols:
dfMSR[col] = np.where((dfMSR[col].isnull()), invalid, dfMSR[col])
However, the second set of (single) quotes added when I converted cols from a series to a list makes all the columns a single value, so that
col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
The desired output for cols is
cols = ["MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"]
What am I doing wrong?
Once you've got col, you can convert it to your expected output:
In [1109]: col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
In [1114]: cols = [i.strip() for i in col.replace('"', '').split(',')]
In [1115]: cols
Out[1115]: ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Another possible solution that comes to mind given the structure of cols is:
list(eval(cols[0])) # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Although this is valid, it's less safe, and I would go with the list comprehension that @MayankPorwal suggested.
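If you want to skip the intermediate col entirely, a small sketch of pulling the list straight out of dfCI (same frame and column names as in the question):
# Take the single matching cell, strip the quotes, and split on commas.
raw = dfCI.loc[(dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'), 'val'].iloc[0]
cols = [name.strip() for name in raw.replace('"', '').split(',')]
# cols -> ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']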
I just started with programming and have some trouble correctly importing my CSV file.
To import it I use the following code:
import csv
import pandas as pd

data_fundamentals = open(path_fundamentals, newline='')
reader_fundamentals = csv.reader(data_fundamentals)
header_fundamentals = next(reader_fundamentals)
fundamentals = [row for row in reader_fundamentals]
Then convert it into a DataFrame:
df_fundamentals = pd.DataFrame(fundamentals, columns= header_fundamentals)
Here comes my first problem: Out of the CSV file "fundamentals" I just need certain columns for my DataFrame. I started by inserting them all by hand, which of course is not very efficient. Do you have an easier way?
df_kennzahlen.insert(1, 'Fiscal Year' , df_fundamentals['fyear'])
df_kennzahlen.insert(2, 'Current Assets' , df_fundamentals['act'])
df_kennzahlen.insert(3, 'Net Income/Loss' , df_fundamentals['ni'])
df_kennzahlen.insert(4, 'Total Liabilities' , df_fundamentals['lt'])
df_kennzahlen.insert(5, 'Long-Term Debt' , df_fundamentals['dltp'])
df_kennzahlen.insert(6, 'Cash' , df_fundamentals['ch'])
df_kennzahlen.insert(7, 'Total Assets' , df_fundamentals['at'])
df_kennzahlen.insert(8, 'Trade Payables' , df_fundamentals['ap'])
df_kennzahlen.insert(9, 'R&D-Expenses' , df_fundamentals['xrd'])
df_kennzahlen.insert(10, 'Sales' , df_fundamentals['sale'])
The values in the DataFrame are numbers, but have the string data-type. To convert them I use the following code:
df_kennzahlen['Net Income/Loss'] = pd.to_numeric(df_kennzahlen['Net Income/Loss'], downcast='integer')
df_kennzahlen['Total Liabilities'] = pd.to_numeric(df_kennzahlen['Total Liabilities'], downcast='integer')
df_kennzahlen['Long-Term Debt'] = pd.to_numeric(df_kennzahlen['Long-Term Debt'], downcast='integer')
df_kennzahlen['Cash'] = pd.to_numeric(df_kennzahlen['Cash'], downcast='integer')
df_kennzahlen['Total Assets'] = pd.to_numeric(df_kennzahlen['Total Assets'], downcast='integer')
df_kennzahlen['Trade Payables'] = pd.to_numeric(df_kennzahlen['Trade Payables'], downcast='integer')
df_kennzahlen['R&D-Expenses'] = pd.to_numeric(df_kennzahlen['R&D-Expenses'], downcast='integer')
df_kennzahlen['Sales'] = pd.to_numeric(df_kennzahlen['Sales'], downcast='integer')
Again I have the same problem: it is not very efficient, and the values in the DataFrame are not converted correctly. For example, 4680 is displayed as 0.4680 and 3235300 is shown as 323.530. Do you have any ideas how I can make the code more efficient and get the correct values in the DataFrame?
You can pass the columns that you need as a list via the usecols parameter:
import pandas as pd
df=pd.read_csv(filename,header=0,usecols=['a','b'],converters={'a': str, 'b': str})
With the pd.read_csv function, you can specify exactly how to read your CSV file. In particular, you can select columns (usecols param), parse date columns (parse_dates param), change the default separator (sep=';', for instance), and change the decimal and thousands separators (decimal=',', thousands='.', for instance). These last two options are especially useful when working with non-default CSVs.
Please refer to the docs for the full list of parameters.
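For the fundamentals file above, that could look roughly like the sketch below. The column names come from the question; the thousands/decimal separators are assumptions you would need to check against your actual file:
import pandas as pd

cols = ['fyear', 'act', 'ni', 'lt', 'dltp', 'ch', 'at', 'ap', 'xrd', 'sale']
df_fundamentals = pd.read_csv(
    path_fundamentals,   # same path variable as in the question
    usecols=cols,
    thousands='.',       # assumed thousands separator
    decimal=',',         # assumed decimal separator
)
# Convert all selected columns to numeric in one go instead of one to_numeric call per column.
df_fundamentals[cols] = df_fundamentals[cols].apply(pd.to_numeric, errors='coerce')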
I'm working on a data frame taken from Adafruit IO, and sadly some of my data is from a time when my project malfunctioned, so some of the values are just NaN.
I tried to remove it with these lines of code:
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
This is data retrieved from an Adafruit IO feed, being analyzed by pandas. I tried using the 'where' function too, but it didn't work.
My entire code is:
import pandas as pd
temp_data = pd.read_json('https://io.adafruit.com/api/(...)')
light_data = pd.read_json('https://io.adafruit.com/api/(...)')
temp_data['created_at'] = pd.to_datetime(temp_data['created_at'], infer_datetime_format=True)
temp_data = temp_data.set_index('created_at')
light_data['created_at'] = pd.to_datetime(light_data['created_at'], infer_datetime_format=True)
light_data = light_data.set_index('created_at')
tempVals = pd.Series(temp_data['value'])
lightVals = pd.Series(light_data['value'])
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
The output is all of my data for some reason, but it should be only the valid values.
I think the issue here is that you're looking for values equal to the string 'NaN', while actual NaN values aren't a string, or more specifically aren't anything.
Try using:
onlyValidData = temp_data.mask(temp_data['value'].isnull())
Edit: to remove rows rather than marking all values in that row as NaN:
onlyValidData = temp_data.dropna()
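To see the difference between the two on a toy frame (not the Adafruit data), a quick sketch:
import numpy as np
import pandas as pd

toy = pd.DataFrame({'value': [21.5, np.nan, 22.0]})
print(toy.mask(toy['value'].isnull()))  # keeps all three rows, the bad one stays NaN
print(toy.dropna())                     # keeps only the two valid rows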
I have the following code,
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]] = p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]].astype(int)
It has always worked until the CSV file doesn't have enough coverage (of all weekdays). For example, with the following .csv file:
DOW,Hour,Changes
4Wed,01,237
3Tue,07,2533
1Sun,01,240
3Tue,12,4407
1Sun,09,2204
1Sun,01,240
1Sun,01,241
1Sun,01,241
3Tue,11,662
4Wed,01,4
2Mon,18,4737
1Sun,15,240
2Mon,02,4
6Fri,01,1
1Sun,01,240
2Mon,19,2300
2Mon,19,2532
I'll get the following error:
KeyError: "['5Thu' '7Sat'] not in index"
It seems to have a very easy fix, but I'm just too new to Python to know how to fix it.
Use reindex to get all columns you need. It'll preserve the ones that are already there and put in empty columns otherwise.
p = p.reindex(columns=['1Sun', '2Mon', '3Tue', '4Wed', '5Thu', '6Fri', '7Sat'])
So, your entire code example should look like this:
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
p = p.reindex(columns=columns)
p[columns] = p[columns].astype(int)
I had a very similar issue. I got the same error because the CSV contained spaces in the header. My CSV contained a header "Gender " and I had it listed as:
[['Gender']]
If it's easy enough for you to access your CSV, you can use the Excel formula TRIM() to clip any spaces from the cells.
Or remove them like this:
df.columns = df.columns.to_series().apply(lambda x: x.strip())
Please try this to clean and format your column names:
df.columns = (df.columns.str.strip().str.upper()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))
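As a toy illustration of what that does to a header like "Gender " (with a trailing space):
import pandas as pd

df = pd.DataFrame({'Gender ': ['M', 'F']})
df.columns = (df.columns.str.strip().str.upper()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))
print(df.columns.tolist())  # ['GENDER'] -- the trailing space is gone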
I had the same issue.
During initial development I used a .csv file (comma as separator) that I had modified a bit before saving it.
After saving, the commas had become semicolons.
On Windows this depends on the "Regional and Language Options" customize screen, where you find a "List separator" setting. This is the character Windows applications expect to be the CSV separator.
When testing from a brand new file I encountered that issue.
I removed the 'sep' argument from the read_csv call.
before:
df1 = pd.read_csv('myfile.csv', sep=',');
after:
df1 = pd.read_csv('myfile.csv');
That way, the issue disappeared.
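If the separator keeps changing between machines, you can also let pandas sniff it; a small sketch (sep=None requires the python engine):
import pandas as pd

# The csv.Sniffer-based detection figures out whether ',' or ';' is used.
df1 = pd.read_csv('myfile.csv', sep=None, engine='python')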
I have a large data file and I need to delete rows that have certain keywords.
Here is an example of the file I'm using:
User Name DN
MB31212 CN=MB31212,CN=Users,DC=prod,DC=trovp,DC=net
MB23423 CN=MB23423 ,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
MB23424 CN=MB23424 ,CN=Users,DC=prod,DC=trovp,DC=net
MB23423 CN=MB23423,OU=DNA,DC=prod,DC=trovp,DC=net
MB23234 CN=MB23234 ,OU=DNA,DC=prod,DC=trovp,DC=net
This is how I import the file:
import pandas as pd
df = pd.read_csv('sample.csv', sep=',', encoding='latin1')
How can I:
1. Delete all rows that contain 'OU=DNA' in the DN column, for example?
2. Delete the first attribute 'CN=x' in the DN column without deleting the rest of the data in the column?
I would like to get something like what is posted below, with the two rows that contained 'OU=DNA' deleted and the 'CN=x' removed from every row:
User Name DN
MB31212 CN=Users,DC=prod,DC=trovp,DC=net
MB23423 OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
MB23424 CN=Users,DC=prod,DC=trovp,DC=net
You can try this two-step filtering to match your logic. Use the str.contains method to filter out rows with OU=DNA, and use the str.replace method with a regular expression to trim the leading CN=x:
newDf = df.loc[~df.DN.str.contains("OU=DNA")].copy()
newDf.DN = newDf.DN.str.replace("^CN=[^,]*,", "", regex=True)
newDf
UserName DN
0 MB31212 CN=Users,DC=prod,DC=trovp,DC=net
1 MB23423 OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
2 MB23424 CN=Users,DC=prod,DC=trovp,DC=net
A little breakdown of the regular expression: ^ stands for the beginning of the string, which is followed by CN=, and [^,]*, captures everything up to and including the first comma.
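For illustration only, the same pattern applied to one of the sample strings with Python's re module:
import re

dn = "CN=MB31212,CN=Users,DC=prod,DC=trovp,DC=net"
print(re.sub(r"^CN=[^,]*,", "", dn))  # -> CN=Users,DC=prod,DC=trovp,DC=net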
To read the file sample you gave I used:
df = pd.read_csv('sample.csv', sep=' ', encoding='latin1', engine="python")
and then:
df = df.drop(df[df.DN.str.contains("OU=DNA")].index)
df.DN = df.DN.str.replace(r'CN=MB[0-9]{5}\s*,', '', regex=True)
df
gave the desired result:
User Name DN
0 MB31212 CN=Users,DC=prod,DC=trovp,DC=net
1 MB23423 OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
2 MB23424 CN=Users,DC=prod,DC=trovp,DC=net