I have a txt file that looks like this:
1000 lewis hamilton 36
1001 sebastian vettel 34
1002 lando norris 21
I want to convert them to an Excel (.xlsx) file.
I tried the solution here, but it gave me a blank Excel file and an error when trying to open it.
There are more than one million lines, and each line contains around 10 columns.
And one last thing: I am not 100% sure the columns are tab-delimited. Some columns look like they have more space between them than others, but when I press backspace once the values stick to each other, so I guess they are.
You can use pandas' read_csv to read your txt file and then save it as an Excel file with .to_excel:
df = pd.read_csv('your_file.txt', delim_whitespace=True)  # delim_whitespace is deprecated in pandas 2.2+; sep=r'\s+' is the modern equivalent
df.to_excel('your_file.xlsx', index=False)
Here is some documentation:
pandas.read_csv : https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
.to_excel : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
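One caveat given your file size: an .xlsx worksheet holds at most 1,048,576 rows, so dumping more than a million lines onto a single sheet can fail or produce a file Excel refuses to open (possibly related to the blank file you saw). Here is a minimal sketch that reads in chunks and spreads them across sheets, reusing the same hypothetical file names as above:
import pandas as pd

MAX_ROWS = 1_048_576 - 1  # Excel's per-sheet row limit, minus one for the header row

with pd.ExcelWriter('your_file.xlsx') as writer:
    reader = pd.read_csv('your_file.txt', sep=r'\s+', header=None, chunksize=MAX_ROWS)
    for i, chunk in enumerate(reader):
        # write each chunk to its own sheet so no sheet exceeds the limit
        chunk.to_excel(writer, sheet_name=f'Sheet{i + 1}', index=False)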
If you're not sure how the fields are separated, you can use the regex '\s+' to split on any run of whitespace.
import pandas as pd
df = pd.read_csv('f1.txt', sep=r"\s+", header=None)
# you might need: pip install openpyxl
df.to_excel('f1.xlsx', sheet_name='Sheet1')
Example of randomly separated fields (f1.txt):
1000 lewis hamilton 2 36
1001 sebastian vettel 8 34
1002 lando norris 6 21
If you have some lines with more columns than the first one, causing:
ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 6
You can ignore those by using:
df = pd.read_csv('f1.txt', sep=r"\s+", header=None, error_bad_lines=False)
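Note: error_bad_lines was removed in pandas 2.0; on current versions the equivalent is:
df = pd.read_csv('f1.txt', sep=r"\s+", header=None, on_bad_lines='skip')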
This is an example of data:
1000 lewis hamilton 2 36
1001 sebastian vettel 8 34
1002 lando norris 6 21
1003 charles leclerc 1 3
1004 carlos sainz ferrari 2 2
The last line will be ignored:
b'Skipping line 5: expected 5 fields, saw 6\n'
From my dataframe I create a new dataframe in which the values of the "Select activity" column are comma-separated lists, which I split and transform into new rows. But there is one value, "Nothing, just walking", that I need to leave unchanged. How can I do this, please?
The original dataframe looks like this:
Name Age Select activity Profession
0 Ann 25 Cycling, Running Saleswoman
1 Mark 30 Nothing, just walking Manager
2 John 41 Cycling, Running, Swimming Accountant
My code looks like this:
df_new = df.loc[:, ['Name', 'Age']]
df_new['Activity'] = df['Select activity'].str.split(', ')
df_new = df_new.explode('Activity').reset_index(drop=True)
I get this result:
Name Age Activity
0 Ann 25 Cycling
1 Ann 25 Running
2 Mark 30 Nothing
3 Mark 30 just walking
4 John 41 Cycling
5 John 41 Running
6 John 41 Swimming
So that the value "Nothing, just walking" is not split into two values, I added the following line:
if df['Select activity'].isin(['Nothing, just walking']) is False:
But it throws an error.
Your if line fails because .isin returns a boolean Series, not a single True or False, so it can't be used directly in an if. Instead, let's look ahead after the comma for a capital letter, and only then split. So instead of ", " the splitter becomes ", (?=[A-Z])":
df_new = df.loc[:, ["Name", "Age"]]
df_new["Activity"] = df["Select activity"].str.split(", (?=[A-Z])")
df_new = df_new.explode("Activity", ignore_index=True)
I only changed the splitter, passed ignore_index=True to explode instead of resetting the index afterwards (and switched to double quotes)
to get
>>> df_new
Name Age Activity
0 Ann 25 Cycling
1 Ann 25 Running
2 Mark 30 Nothing, just walking
3 John 41 Cycling
4 John 41 Running
5 John 41 Swimming
Or all in one go, as usual:
df_new = (df.loc[:, ["Name", "Age"]]
.assign(Activity=df["Select activity"].str.split(", (?=[A-Z])"))
.explode("Activity", ignore_index=True))
I have a program that takes a URL as input, and checks it against a df that I'm reading from csv:
Name ID Date URL
0 Faye 111 12/31/16 https://www.url1.com
1 Faye 111 3/31/17 https://www.url2.com
2 Mike 222 3/31/17 https://www.url3.com
3 Mike 222 6/30/18 https://www.url4.com
4 Mike 222 9/30/18 https://www.url5.com
5 Jim 333 9/30/18 https://www.url6.com
If the URL doesn't exist in the df, the program executes some code, and then adds a new row with the URL to the df; else it moves on to another URL.
The program works fine if I just run it, stop it, and restart it. But if I delete an existing row (e.g., [1]) directly from the csv file in Excel to reprocess the data for that one URL, the new data gets tacked onto the end of row [5], starting at column [4] of the df:
Name ID Date URL
0 Faye 111 12/31/16 https://www.url1.com
2 Mike 222 3/31/17 https://www.url3.com
3 Mike 222 6/30/18 https://www.url4.com
4 Mike 222 9/30/18 https://www.url5.com
5 Jim 333 9/30/18 https://www.url6.com Faye 111 3/31/17 https://www.url2.com
rather than adding at row [6] as a new row, which happens when I remove the row in a plain text editor (rather than Excel):
Name ID Date URL
... ... ... ...
5 Jim 333 9/30/18 https://www.url6.com
6 Faye 111 3/31/17 https://www.url2.com
I'm adding the data to the existing csv via df.to_csv('~/file.csv', mode='a', header=False, index=False), so can anyone identify what I'm doing wrong?
The likely culprit is that Excel saves the csv without a newline after the last row, so anything you append with mode='a' just continues on that line. Instead of appending another (1-line) dataframe as another chunk of the CSV file:
df.to_csv('~/file.csv', mode='a', header=False, index=False)
append it first to your dataframe:
df = df.append(your_1_row_df, ignore_index=True)
and only then write it into your ~/file.csv, this time with the (default) mode 'w', effectively rewriting it:
df.to_csv('~/file.csv', index=False)  # keep the default header=True so the header row survives the rewrite
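Note: DataFrame.append was removed in pandas 2.0; on current versions build the combined frame with pd.concat instead:
df = pd.concat([df, your_1_row_df], ignore_index=True)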
So I had a dataframe and had to do some cleansing to minimize the duplicates. To do that, I created a dataframe with only 8 of the original 40 columns. Now I need two more columns from the original dataframe for further analysis, but they would have messed with the desired outcome had I used them in my previous analysis. Does anyone have an idea how to "extract" these columns based on the new "clean" dataframe I have?
You can merge the new "clean" dataframe with the other two variables by using the indexes. Let me use a practical example. Suppose the "initial" dataframe, called "df", is:
df
name year reports location
0 Jason 2012 4 Cochice
1 Molly 2012 24 Pima
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
4 Amy 2014 3 Yuma
while the "clean" dataframe is:
d1
year location
0 2012 Cochice
2 2013 Santa Cruz
3 2014 Maricopa
The remaining columns are saved in dataframe "d2" (d2 = df[['name','reports']]):
d2
name reports
0 Jason 4
1 Molly 24
2 Tina 31
3 Jake 2
4 Amy 3
By using an inner join on the indexes, d1.merge(d2, how='inner', left_index=True, right_index=True), you get the following result:
name year reports location
0 Jason 2012 4 Cochice
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
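If d1 still carries the original row labels of df (it does in this example), you can also skip the merge and select the extra columns directly with the clean dataframe's index; a sketch under that assumption:
d2 = df.loc[d1.index, ['name', 'reports']]
result = d1.join(d2)  # same rows as d1, now with the two extra columns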
You can make a new dataframe with the specified columns:
import pandas as pd
# If your columns are named a, b, c, d, etc.
df1 = df[['a','b']]
# This will extract columns 0 and 1 by position
# (remember that pandas numbers columns from zero, and the end of the slice is exclusive!)
df2 = df.iloc[:,0:2]
If you could provide a sample piece of data, that'd make it easier for us to help you.
I have a .csv file whose rows have varying numbers of columns.
import pandas as pd
df = pd.read_csv(infile, header=None)
returns the
ParserError: Error tokenizing data. C error: Expected 6 fields in line 8, saw 8
error. I know I can use the
names=my_cols
option in the read_csv call, but surely there has to be something more 'pythonic' than that?? Also, this is not a duplicate question, since
error_bad_lines=False
causes lines to be skipped (which is not desired). The .csv looks like:
Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George
OK, somewhat inspired by this related question: Pandas variable numbers of columns to binary matrix
So read in the csv but override the separator to a tab so it doesn't try to split the names:
In[7]:
import pandas as pd
import io
t="""Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George"""
df = pd.read_csv(io.StringIO(t), sep='\t', header=None)
df
Out[7]:
0
0 Anne,Beth,Caroline,Ernie,Frank,Hannah
1 Beth,Caroline,David,Ernie
2 Caroline,Hannah
3 David,,Anne,Beth,Caroline,Ernie
4 Ernie,Anne,Beth,Frank,George
5 Frank,Anne,Caroline,Hannah
6 George,
7 Hannah,Anne,Beth,Caroline,David,Ernie,Frank,Ge...
We can now use str.split with expand=True to expand the names into their own columns:
In[8]:
df[0].str.split(',', expand=True)
Out[8]:
0 1 2 3 4 5 6 7
0 Anne Beth Caroline Ernie Frank Hannah None None
1 Beth Caroline David Ernie None None None None
2 Caroline Hannah None None None None None None
3 David Anne Beth Caroline Ernie None None
4 Ernie Anne Beth Frank George None None None
5 Frank Anne Caroline Hannah None None None None
6 George None None None None None None
7 Hannah Anne Beth Caroline David Ernie Frank George
So just to be clear, modify your read_csv line to this:
df = pd.read_csv(infile, header=None, sep='\t')
and then do the str.split as above.
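Putting it together for your file (a sketch, reusing the infile name from your question):
df = pd.read_csv(infile, header=None, sep='\t')  # one raw line per row
df = df[0].str.split(',', expand=True)           # expand the names into their own columns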
One can do some manipulation with the csv before using pandas:
import pandas as pd

# load data into a list
with open('new_data.txt', 'r') as fil:
    data = fil.readlines()
# remove line breaks from string entries
data = [x.replace('\r\n', '') for x in data]
data = [x.replace('\n', '') for x in data]
# find the number of commas in the widest row
total_cols = max([x.count(',') for x in data])
# pad each line with as many ',' as it needs to reach that width
new_data = [x + ',' * (total_cols - x.count(',')) for x in data]
# save data
with open('save_data.txt', 'w') as outp:
    outp.write('\n'.join(new_data))
# read it in as you did
pd.read_csv('save_data.txt', header=None)
This is some rough Python, but it should work. I'll clean it up when I have time.
Or use the other answer; it's neat as it is.
I have a CSV file which has two sets of data on the same sheet. I did my research, and the closest I could find is the question linked below. The issue is that these are not tables but two separate data sets, divided by a number of rows. I want to save each data set as a separate CSV. Is this possible in Python? Please provide your kind assistance.
Python CSV module: How can I account for multiple tables within the same file?
First Set:
Presented_By: Source: City:
Chris Realtor Knoxville
John Engineer Lantana
Wade Doctor Birmingham
Second Set:
DriveBy 15
BillBoard 45
Social Media 85
My source is an Excel file which I convert into a CSV file:
import pandas as pd
data_xls = pd.read_excel(r'T:\DataDump\Matthews\REPORT 11.13.16.xlsm', 'InfoCenterTracker', index_col=None)
data_xls.to_csv('your_csv.csv', encoding='utf-8')
second_set = pd.read_csv('your_csv.csv', skiprows=[10,11,12,13,14,15,16,17,18,19,20,21,22,23])
Use skiprows in pandas' read_csv:
$ cat d.dat
Presented_By: Source: City:
Chris Realtor Knoxville
John Engineer Lantana
Wade Doctor Birmingham
DriveBy 15
BillBoard 45
Social Media 85
In [1]: import pandas as pd
In [2]: pd.read_csv('d.dat',skiprows=[0,1,2,3])
Out[2]:
DriveBy 15
0 BillBoard 45
1 Social Media 85
In [3]: pd.read_csv('d.dat',skiprows=[4,5,6])
Out[3]:
Presented_By: Source: City:
0 Chris Realtor Knoxv...
1 John Engineer Lantana
2 Wade Doctor Birmi...
You can detect which rows to skip by searching for where the csv has 2 entries instead of 3 (note that the variable below shadows Python's built-in breakpoint; harmless in a throwaway session, but worth renaming in real code):
In [25]: for n, line in enumerate(open('d.dat','r').readlines()):
...: if len(line.split()) !=3:
...: breakpoint = n
...:
In [26]: pd.read_csv('d.dat',skiprows=range(breakpoint-1))
Out[26]:
DriveBy 15
0 BillBoard 45
1 Social Media 85
In [27]: pd.read_csv('d.dat',skiprows=range(breakpoint-1, n+1))
Out[27]:
Presented_By: Source: City:
0 Chris Realtor Knoxv...
1 John Engineer Lantana
2 Wade Doctor Birmi...
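To actually save each set as its own CSV, as the question asks, here is a minimal sketch along the same field-count lines, assuming the d.dat layout above and hypothetical output names. Note that a multi-word entry like "Social Media" also splits into 3 fields, so the sketch takes the first mismatching line as the boundary:
with open('d.dat') as f:
    lines = f.readlines()

# the first line whose field count differs from the header's marks the start of the second set
ncols = len(lines[0].split())
boundary = next(i for i, line in enumerate(lines) if len(line.split()) != ncols)

with open('first_set.csv', 'w') as f:
    f.writelines(lines[:boundary])
with open('second_set.csv', 'w') as f:
    f.writelines(lines[boundary:])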