i have a problem with excel. I downloaded a csv that has data not separated by comma or other and are not divided by row or single cell. Here is a screenshot. How can i divide this data in python or R to manage them separated one by one?
Thanks in advance
If the number of columns is the same on all lines, this should work:
import pandas as pd
df = pd.read_csv('my_full_filepath.csv', sep ='|')
print(df.info())
Otherwise, you can try, as a first approach,
df = pd.read_csv('my_full_filepath.csv', sep ='|', error_bad_lines = False)
print(df.info())
If there are data dropped, you might have to try opening your dataframe with an argument passing the maximal number of columns of your file.
This can be done, for example by :
n = int('number of my columns')
df = pd.read_csv('my_full_file_path.csv', sep ='|', names = range(n))
Related
I have an Excel file with numbers (integers) in some rows of the first column (A) and text in all rows of the second column (B):
I want to clean this text, that is I want to remove tags like < b r > (without spaces). My current approach doesn't seem to work:
file_name = "F:\Project\comments_all_sorted.xlsx"
import pandas as pd
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.replace('<br>', '')
clean_df.to_excel('output.xlsx')
What this code does (which I don't want it to do) is it adds running numbers in the first column (A), replacing also the few numbers that were already there, and it adds a first row with '1' in second column of this row (cell 1B):
I'm sure there's an easy way to solve my problem and I'm just not trained enough to see it.
Thanks!
Try this:
df['column_name'] = df['column_name'].str.replace(r'<br>', '')
The index in the output file can be turned off with index=False in the df.to_excel function, i.e,
clean_df.to_excel('output.xlsx', index=False)
As far as I'm aware, you can't use .replace on an entire dataframe. You need to explicitly call out the column. In this case, I just iterate through all columns in case there are more than just the one column.
To get rid of the first column with the sequential numbers (that's the index of the dataframe), add the parameter index=False. The number 1 on the top is the column name. To get rid of that, use header=False
import pandas as pd
file_name = "F:\Project\comments_all_sorted.xlsx"
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.copy()
for col in clean_df.columns:
clean_df[col] = df[col].str.replace('<br>', '')
clean_df.to_excel('output.xlsx', index=False, header=False)
I am using this code. but instead of new with just the required rows, I'm getting an empty .csv with just the header.
import pandas as pd
df = pd.read_csv("E:/Mac&cheese.csv")
newdf = df[df["fruit"]=="watermelon"+"*"]
newdf.to_csv("E:/Mac&cheese(2).csv",index=False)
I believe the problem is in how you select the rows containing the word "watermelon". Instead of:
newdf = df[df["fruit"]=="watermelon"+"*"]
Try:
newdf = df[df["fruit"].str.contains("watermelon")]
In your example, pandas is literally looking for cells containing the word "watermelon*".
missing the underscore in pd.read_csv on first call, also it looks like the actual location is incorrect. missing the // in the file location.
I am using python I have a CSV file which had values separated by tab,
I applied a rule to each of its row and created a new csv file, the resulting dataframe is comma separated , I want this new csv to be tab separated as well. How can I do it ?
I understand using sep = '\t' can work but where do I apply it ?
I applied the following code but it didn't work either
df = pd.read_csv('data.csv', header=None)
df_norm= df.apply(lambda x:np.where(x>0,x/x.max(),np.where(x<0,-x/x.min(),x)),axis=1)
df_norm.to_csv("file.csv", sep="\t")
Have you tried, this ?
pd.read_csv('file.csv', sep='\t')
I found the issue, the rule had changed the type to "object', because of which I was unable to perform any further operations. I followed Remove dtype at the end of numpy array, and converted my data frame to a list which solved the issue.
df = pd.read_csv('data.csv', header=None)
df_norm= df.apply(lambda x:np.where(x>0,x/x.max(),np.where(x<0,-x/x.min(),x)),axis=1)
df_norm=df_norm.tolist()
df_norm = np.squeeze(np.asarray(df_norm))
np.savetxt('result.csv', df_norm, delimiter=",")
I have a csv file with a wrong first row data. The names of labels are in the row number 2. So when I am storing this file to the DataFrame the names of labels are incorrect. And correct names become values of the row 0. Is there any function similar to reset_index() but for columns? PS I can not change csv file. Here is an image for better understanding. DataFrame with wrong labels
Hello let's suppose you csv file is data.csv :
Try this code:
import pandas as pd
#reading the csv file
df = pd.read_csv('data.csv')
#changing the headers name to integers
df.columns = range(df.shape[1])
#saving the data in another csv file
df.to_csv('data_without_header.csv',header=None,index=False)
#reading the new csv file
new_df = pd.read_csv('data_without_header.csv')
#plotting the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass in the "header" argument with the value of the correct row, for example if the proper column names are in row 2:
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this will erase the previous rows from the DataFrame. If you still want to keep them, you can do the following thing:
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[2, :] # replace columns with values in row 2
Cheers.
def Text2Col(df_File):
for i in range(0,len(df_File)):
with open(df_File.iloc[i]['Input']) as inf:
with open(df_File.iloc[i]['Output'], 'w') as outf:
i=0
for line in inf:
i=i+1
if i==2 or i==3:
continue
outf.write(','.join(line.split(';')))
Above code is used to convert a csv file from text to column.
This code makes all values string ( because split() ) which is problematic for me.
I tried using map function but cant make it.
Is there any other way in which I can do this.
My input file has 5 columns, the first column is string, the second is int and the rest are float.
I think it required some modification in last statement
outf.write(','.join(line.split(';')))
Please let me know if any other input is required.
Ok, trying to help here. If this doesn't work, please specify in your question, what you're missing or what else needs to be done:
Use pandas to read in a csv file:
import pandas as pd
df = pd.read_csv('your_file.csv')
If you have a header on the first row, then use:
import pandas as pd
df = pd.read_csv('your_file.csv', header=0)
If you have a tab delimiter instead of a comma delimiter, then use:
import pandas as pd
df = pd.read_csv('your_file.csv', header=0, sep='\t')
Thank you !
Following Code worked:
def Text2Col(df_File):
for i in range(0,len(df_File)):
df = pd.read_csv(df_File.iloc[i]['Input'],sep=';')
df = df[df.index != 0]
df= df[df.index != 1]
df.to_csv(df_File.iloc[i]['Output'])
File_List="File_List.csv"
df_File=pd.read_csv(File_List)
Text2Col(df_File)
Input files are kept in same folder with same name as mentioned in File_List.xls
Output files will be created in same folder with separated in column. I deleted row 0 and 1 for my use. One can skip or add depending upon his requirement.
In above code df_file is dataframe contain two column list, first column is input file name and second column is output file name.