I am using the ALL.zip file located here. My goal is to create a pandas DataFrame with it. However, if I run
data = pd.read_csv('foo.csv')
the column names do not match up. The first column has no name, the second column is labeled with the first column's name, and the last column is a Series of NaN. So I tried
colnames=[list of colnames]
data = pd.read_csv('foo.csv', names=colnames, header=False)
which gave me the exact same thing, so I ran
data = pd.read_csv('foo.csv', names=colnames)
which lined the colnames up perfectly, but put the CSV's own column names (the first line of the file) in as the first row of data. So I ran
data=data[1:]
which did the trick.
So I found a workaround without solving the actual problem. I looked at the read_csv documentation and found it a bit overwhelming, and I could not figure out a way to fix this problem using pd.read_csv alone.
What was the fundamental problem (I am assuming it is either user error or a problem with the file)? Is there a way to fix it with one of the commands from the read_csv?
Here are the first two rows from the csv file:
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
C00458844,"P60006723","Rubio, Marco","HEFFERNAN, MICHAEL","APO","AE","090960009","INFORMATION REQUESTED PER BEST EFFORTS","INFORMATION REQUESTED PER BEST EFFORTS",210,27-JUN-15,"","","","SA17A","1015697","SA17.796904","P2016",
It's not the columns that you're having a problem with, it's the index.
import pandas as pd
df = pd.read_csv('P00000001-ALL.csv', index_col=False, low_memory=False)
print(df.head(1))
cmte_id cand_id cand_nm contbr_nm contbr_city \
0 C00458844 P60006723 Rubio, Marco HEFFERNAN, MICHAEL APO
contbr_st contbr_zip contbr_employer \
0 AE 090960009 INFORMATION REQUESTED PER BEST EFFORTS
contbr_occupation contb_receipt_amt contb_receipt_dt \
0 INFORMATION REQUESTED PER BEST EFFORTS 210 27-JUN-15
receipt_desc memo_cd memo_text form_tp file_num tran_id election_tp
0 NaN NaN NaN SA17A 1015697 SA17.796904 P2016
The low_memory=False is there because column 6 has a mixed datatype.
The problem comes from every line in the file except the first terminating in a comma (the separator character). Since the header row has one field fewer than the data rows, pandas assumes the first column of data is meant to be the index.
Try
data = pd.read_csv('P00000001-ALL.csv', index_col=False)
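To see why index_col=False matters here, a minimal reproduction with made-up two-column data (io.StringIO just stands in for a file):

import io
import pandas as pd

# The header names two columns, but every data row ends in a trailing
# comma, i.e. has three fields:
csv = "a,b\n1,2,\n3,4,\n"

# Without index_col=False, pandas takes the extra field to mean the
# first column is the index:
print(pd.read_csv(io.StringIO(csv)))
#    a   b
# 1  2 NaN
# 3  4 NaN

# index_col=False treats the first column as data and ignores the
# trailing empty field:
print(pd.read_csv(io.StringIO(csv), index_col=False))
#    a  b
# 0  1  2
# 1  3  4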
I am working with a single data frame in pandas. The error (stated in the title) does not occur when I perform the following on a subset of this data frame (6 rows, some with NaN), and it does exactly what I need: all the NaN in the 'Season' column get filled in properly.
Before:
Code:
s = df.set_index('Description')['Season'].dropna()
df['Season'] = df['Season'].fillna(df['Description'].map(s))
After:
Great! This is what I want to happen, one column at a time. I'll worry about the other columns later.
But when I try the same code on the entire data frame, which is over 5000 rows long, I get the error stated in the title, and I am unable to pinpoint it to a specific row or rows.
What I tried:
I removed all non-ASCII characters, and these special characters: ', ", and #, from the strings in the 'Description' column, whose values can run to 50 characters and include non-ASCII text as well as the three characters I removed.
df['Description'] = df['Description'].str.encode('ascii', 'ignore').str.decode('ascii')
df['Description'] = df['Description'].str.replace('"', '')
df['Description'] = df['Description'].str.replace("'", "")
df['Description'] = df['Description'].str.replace('#', '')
But the above did not help, and I still get the error. Does anyone have additional troubleshooting tips, or know what I am failing to look for? Or, ideally, a solution?
The code for the subset data frame and the main data frame is isolated, so I am not mixing up 'df' and 's' or using them interchangeably. I wish that were the problem.
Recall the subset data frame above where the code worked perfectly. Through process of elimination I discovered that when the subset data frame has one extra row (a total of 8 rows), the code still works as expected. But once the 9th row is entered, I get the error, and I can't figure out why.
Then the code:
s = df.set_index('Description')['Season'].dropna()
df['Season'] = df['Season'].fillna(df['Description'].map(s))
And the data frame is updated as expected:
But when the 9th row is added, the code above does not work:
I discovered how to solve the problem by adding .drop_duplicates('Description'), modifying:
s = df.set_index('Description')['Season'].dropna()
to
s = df.drop_duplicates('Description').set_index('Description')['Season'].dropna()
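My best guess at the cause (the question's error title is not shown, and the exact message varies by pandas version): Series.map needs s to have a uniquely valued index, and a duplicated 'Description' breaks that. A minimal sketch with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Description': ['rain', 'snow', 'rain', 'snow'],  # duplicate descriptions
    'Season': ['fall', 'winter', 'fall', np.nan],
})

s = df.set_index('Description')['Season'].dropna()
# df['Description'].map(s)  # raises, because s now has a duplicate index

# Dropping duplicate descriptions first makes every lookup unambiguous:
s = df.drop_duplicates('Description').set_index('Description')['Season'].dropna()
df['Season'] = df['Season'].fillna(df['Description'].map(s))
print(df)  # the NaN in the last row is filled with 'winter'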
This is essentially the main part of the code concerning the problem:
import pandas

df1 = pandas.read_csv('~/Dropbox/data/anand/surface/scan/7/relaxed/dos/DOSCAR', skiprows=5, names=['energy','su','sd','pu','pd','du','dd'], delimiter=r'\s+', engine='python')
# Drop the lines matching enmin, but keep the 1st one:
# that is why I change its value to 1, so the filter below does not drop it
df1.iat[0,1]=1
df=df1[df1.su != enmin ]
df=df1[df1.energy != enmin ]
df1=None
df.to_csv(r'~/df', header=None, index=None, sep=' ',mode='a')
df=df.reset_index(drop=True)
and then I just access some values of df through df.iat[row, col].
Actually, line number 6000 corresponds to the line that I wanted to keep, which should be located at the beginning of the file. It is as if the order were misplaced/inverted.
I have also tried opening the file with vim, but I see the same issue, and I am sure the dataframe view in Python is the correct one. Please note that gedit and other text editors start at line number 1, while the dataframe index starts at zero.
The mistake comes from the fact that the second dropping operation does not take the first one into account, since you are building a completely new dataframe from df1 again, and not from the previous df.
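A sketch of the fix under that diagnosis, keeping df1 and enmin as defined in the question:

# Filter the second time from df, the result of the first filter:
df = df1[df1.su != enmin]
df = df[df.energy != enmin]  # note: df, not df1

# Or, equivalently, apply both conditions in a single boolean mask:
df = df1[(df1.su != enmin) & (df1.energy != enmin)]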
I'm trying to read multiple files whose names start with 'site_', for example site_1 and site_a.
Each file has data like:
Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php
I need two columns in my pandas df: Login_id and Web.
I am facing an error when I try to read records like record 2.
df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)
I am facing the following error :
ValueError: Columns must be same length as key.
Please let me know where I am making a serious mistake, and any good approach to solving the problem. Thanks.
Solution 1: use split with argument n=1 and expand=True.
result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']
That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
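For instance, a runnable sketch on the two sample rows from the question:

import pandas as pd

df = pd.DataFrame({'Login_id, Web': ['1,http://www.x1.com',
                                     '2,http://www.x1.com,as.php']})
result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']
# Replace the combined column with the two new ones:
df = pd.concat([df.drop(columns=['Login_id, Web']), result], axis=1)
print(df)
#   Login_id                       Web
# 0        1         http://www.x1.com
# 1        2  http://www.x1.com,as.php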
EDIT: Solution 2: there is a nicer regex-based solution which uses a pandas function:
result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)
This splits the field and uses the names of the matching groups to create columns with their content. The output is:
Login_id URL
0 1 http://www.x1.com
1 2 http://www.x1.com,as.php
Solution 3: conventional version with a regex:
You could do something customized, e.g. with a regex:
import re
sp_re = re.compile('([^,]*),(.*)')
aux_series = df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id'] = aux_series.str[0]
df['URL'] = aux_series.str[1]
The result on your example data is:
Login_id, Web Login_id URL
0 1,http://www.x1.com 1 http://www.x1.com
1 2,http://www.x1.com,as.php 2 http://www.x1.com,as.php
Now you could drop the column 'Login_id, Web'.
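For instance, assuming you kept the original column alongside the new ones:

df = df.drop(columns=['Login_id, Web'])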
I have a large pandas dataframe on which I am running groupby operations.
CHROM POS Data01 Data02 ......
1 ....................
1 ...................
2 ..................
2 ............
scaf_9 .............
scaf_9 ............
So, I am doing:
my_data_grouped = my_data.groupby('CHROM')
for chr_, data in my_data_grouped:
    # do something with chr_
    # write something from this chr_ group's data
Everything is fine with small data and with data where there is no string-type CHROM, i.e. no scaf_9. But with very large data that includes scaf_9, I am getting two groups for CHROM 2. There is no error message and it is not affecting the computation, but when I write the data out by group, the value 2 comes out as two groups (split unequally).
It is becoming very hard for me to trace the origin of this problem, since there is no error message and it works well with small data. My only assumptions are:
Is there a certain limit on the number of lines in the total dataframe vs. the grouped dataframe that the pandas module can handle? What is the fix to this problem?
Could it be that among all the 2 values, most are treated as integer objects while some (in the later part, close to scaf_9) are treated as string objects? Is this possible?
Sorry, I am only making assumptions here; it is becoming impossible for me to find the origin of the problem.
Post Edit:
I have also tried running sort_by(['CHROM']) before the groupby, but the problem still persists.
Any possible fix to the issue?
Thanks,
In my opinion there is a data problem, most likely some whitespace, so pandas processes the values as separate groups.
The solution should be to remove trailing whitespace first:
df.index = df.index.astype(str).str.strip()
You can also check the unique string values of the index:
a = df.index[df.index.map(type) == str].unique().tolist()
If the first column is not the index:
df['CHROM'] = df['CHROM'].astype(str).str.strip()
a = df.loc[df['CHROM'].map(type) == str, 'CHROM'].unique().tolist()
EDIT:
The final solution was simpler: casting to str, like:
df['CHROM'] = df['CHROM'].astype(str)
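A minimal sketch of what I suspect was happening, with made-up data: if 'CHROM' holds a mix of the integer 2 and the string '2', groupby keeps them apart, and the two groups print identically:

import pandas as pd

df = pd.DataFrame({'CHROM': [2, '2', 'scaf_9'], 'POS': [10, 20, 30]})
print(df.groupby('CHROM', sort=False).size())
# CHROM
# 2         1   <- the integer 2
# 2         1   <- the string '2', a separate group
# scaf_9    1

# Casting to str merges the two:
df['CHROM'] = df['CHROM'].astype(str)
print(df.groupby('CHROM', sort=False).size())
# CHROM
# 2         2
# scaf_9    1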
(I'm new to Python, sorry for any mistakes I make; I hope you can understand me.)
I have searched for a method to insert a row into a pandas DataFrame in Python, and I found this:
add one row in a pandas.DataFrame
I have used the code provided in fred's accepted answer on that topic, but the code overwrites my row.
My code (inserts a row with the value -1 for each column under certain conditions):
df.loc[i+1] = [-1 for n in range(len(df.columns))]
How can I make the code insert a row WITHOUT overwriting the existing one?
For example, if I have a 50-row DataFrame, insert a row at position 25 (just as an example) and end up with a 51-row DataFrame (row 25 being the extra NEW row).
I hope you can understand my question. (Sorry for any English mistakes.)
Thank you.
UPDATE:
Someone suggested this:
Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
I tried it, but it did not work. In fact, it does nothing; it does not add any row.
line = pd.DataFrame({[-1 for n in range(len(df.columns))]}, index=[i+1])
df = pd.concat([ df.ix[:i], line, df.ix[i+1:]]).reset_index(drop=True)
Any other ideas?
With a small DataFrame, if you want to insert a line at position 1:
df = pd.DataFrame({'a':[0,1],'b':[2,3]})
df1 = pd.DataFrame([4,5]).T
pd.concat([df[:1], df1.rename(columns=dict(zip(df1.columns,df.columns))), df[1:]])
#Out[46]:
# a b
#0 0 2
#0 4 5
#1 1 3
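A more general sketch (my own helper, not from the linked answer): wrap the concat in a function and renumber the index with reset_index:

import pandas as pd

def insert_row(df, pos, row_values):
    # Build a one-row frame and splice it in at the given position.
    line = pd.DataFrame([row_values], columns=df.columns)
    return pd.concat([df.iloc[:pos], line, df.iloc[pos:]]).reset_index(drop=True)

df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
df = insert_row(df, 1, [-1] * len(df.columns))
print(df)
#    a  b
# 0  0  2
# 1 -1 -1
# 2  1  3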