I have a CSV file.
Each line contains data separated by a comma; each row ends with a
newline character. So in my file, the first entry on each line
is a year and the second entry is the title of a film.
For example:
1990, Tie Me Up! Tie Me Down!
So, it just has a bunch of years, and then movie titles.
My question is, how do I print the movie title IF it was made before 1990?
So something like:
2015, Spongebob movie
wouldn't print.
So far I have:
f = open("filmdata.csv","r",encoding="utf-8")
for line in f:
    entries = line.split(",")
    if entries in f < 1990 :
        print(line)
f.close()
But nothing is printing out, and it doesn't give an error or anything. I'm trying to keep it in this format.
Although the previous answers will work for your simple case, it is a much better idea to use Python's csv module when working with a CSV file. It is more extensible, handles more complicated cases (like quoted fields and data containing commas), and is very easy to use.
The following code snippet should work in your case:
import csv

# encoding belongs to open(), not csv.reader(); newline='' is what the csv
# docs recommend when opening a file for the reader
with open('filmdata.csv', 'r', encoding='utf-8', newline='') as csvfile:
    data = csv.reader(csvfile, skipinitialspace=True)  # drops the space after each comma
    for year, movie in data:
        if int(year) < 1990:
            print(year, movie)
You can of course modify the print to use any format. If you want to separate by commas, this will work:
print('{}, {}'.format(year, movie))
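To see why the csv module pays off, here is a small sketch with a made-up row whose quoted title contains a comma; csv.reader keeps the title intact, while a plain line.split(",") would cut it in two:

import csv
import io

# a made-up row whose quoted title contains a comma
sample = io.StringIO('1966,"The Good, the Bad and the Ugly"\n')
for year, movie in csv.reader(sample):
    print(year, movie)  # -> 1966 The Good, the Bad and the Ugly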
Try this: you need to index the first field of each line and convert it to an integer before comparing.
f = open("filmdata.csv","r",encoding="utf-8")
for line in f:
    entries = line.split(",")
    if int(entries[0]) < 1990:
        print(line)
f.close()
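If a blank line or a header row sneaks into the file, int() will raise a ValueError. A slightly more defensive sketch of the same approach (assuming every real data row starts with a year):

with open("filmdata.csv", "r", encoding="utf-8") as f:
    for line in f:
        entries = line.split(",")
        try:
            year = int(entries[0])
        except ValueError:
            continue  # skip blank lines or a header row
        if year < 1990:
            print(line, end="")  # line already ends with a newline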
Have you tried looking into using Pandas? I mocked up a movie.csv file with 2 columns (year, title).
import pandas as pd
movies = pd.read_csv('movie.csv', sep=',', names=["year", "title"])
Output of the movies DataFrame:
year title
0 1990 Movie1
1 1991 Movie2
2 1992 Movie3
3 1993 Movie4
4 1994 Movie5
5 1995 Movie6
6 1996 Movie7
7 1997 Movie8
8 1999 Movie9
9 2000 Movie10
Let's say we'd like to see all the movies where the year is over 1994:
movies[movies['year'] > 1994]
year title
5 1995 Movie6
6 1996 Movie7
7 1997 Movie8
8 1999 Movie9
9 2000 Movie10
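Applied to your actual question (printing the titles of films made before 1990), a minimal sketch, assuming filmdata.csv has no header row as in your example:

import pandas as pd

# no header row in the file, so supply the column names ourselves
movies = pd.read_csv('filmdata.csv', names=['year', 'title'], skipinitialspace=True)
for title in movies.loc[movies['year'] < 1990, 'title']:
    print(title)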
So I have a text file containing 22 lines, and each line holds three kinds of fields:
economy name
unique economy code following the World Bank standard (3 uppercase letters)
trade-to-GDP ratios from year 1990 to year 2019 (30 years, 30 data points); for example, 0.3216 means the trade-to-GDP
ratio for Australia in 1990 was 32.16%
The code I have used to import this file and open/read it is:
def Input(filename):
    f = open(filename, 'r')
    lines = f.readlines()
    lines = [l.strip() for l in lines]
    f.close()
    return lines
Once I have done that, I have to write code with for-loops that creates a list variable named result. It should contain 22 tuples, and each tuple contains four elements:
economy name,
World Bank economy code,
average trade-to-gdp ratio for this economy from 1990 to 2004,
average trade-to-gdp ratio for this economy from 2005 to 2019.
Coming out like
('Australia', 'AUS', '0.378', '0.423')
So far the code I have written looks like this:
def result:
    name, age, height, weight = zip(*[l.split() for l in text_file.readlines()])
I am having trouble starting this and knowing how to grapple with the multiple years required, and how to output all the economies with their corresponding ratios. Here is the table of all the data I have in the text file.
I would suggest using Pandas for this.
You can simply do:
import pandas as pd

df = pd.read_csv('filename.csv')
for index, row in df.iterrows():
    pass  # do something with each row here

In the for loop you can use row['columnName'] to get the data, for example row['code'] or row['1999'].
This approach will make it a lot easier for you to carry out operations and process the data.
Also, to answer your own approach: you can iterate over the lines and extract the data using indices.
Try the below code:
def Input(filename):
    f = open(filename, 'r')
    lines = f.readlines()
    lines = [l.strip().split() for l in lines]
    f.close()
    return lines

lines = Input('filename.txt')  # your data file
for line in lines[1:]:
    total = sum([float(x) for x in line[2:17]])   # sum of values from 1990 to 2004
    total2 = sum([float(x) for x in line[17:]])   # sum of values from 2005 to 2019
    val = (line[0], line[1], total, total2)       # this gives you the tuple
You can continue this approach, appending the tuple inside the for loop; divide each sum by the number of data points (15) to turn it into the average your assignment asks for.
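Putting it together, a minimal sketch that builds the result list of 22 tuples with the two averages (build_result and the filename are hypothetical; this assumes whitespace-separated lines of name, code, then 30 ratios, with no header row and no spaces inside economy names):

def build_result(filename):
    result = []
    with open(filename) as f:
        for line in f:
            fields = line.strip().split()
            name, code = fields[0], fields[1]
            ratios = [float(x) for x in fields[2:32]]  # 30 values, 1990-2019
            avg_early = sum(ratios[:15]) / 15          # average 1990-2004
            avg_late = sum(ratios[15:]) / 15           # average 2005-2019
            result.append((name, code, round(avg_early, 3), round(avg_late, 3)))
    return result

result = build_result('trade_to_gdp.txt')
print(result[0])  # e.g. ('Australia', 'AUS', 0.378, 0.423)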
This is what I'm trying to accomplish with my code: I have a current csv file with tennis player names, and I want to add new players to it once they show in the rankings. My script goes through the rankings and creates an array, then imports the names from the csv file. It is supposed to see which names are not in the latter, and then extract online data for those names. Then, I just want the new rows to be appended at the end of that old CSV file. My issue is that the new row is being indexed with the Player's name rather than following the index of the old file. Any ideas why that's happening? Also why is an unnamed column being added?
def get_all_players():
    # imports names of players currently in the atp rankings
    current_atp_ranking = check_atp_rankings()
    current_player_list = current_atp_ranking['Player']
    # clean up names in case of white spaces
    for i in range(0, len(current_player_list)):
        current_player_list[i] = current_player_list[i].strip()
    # reads the main file and makes a dataframe out of it
    current_file = 'ATP_stats_new.csv'
    df = pd.read_csv(current_file)
    # gets all the names within the main file to see which current ones aren't there
    names_on_file = list(df['Player'])
    # cleans up in case of any white spaces
    for i in range(0, len(names_on_file)):
        names_on_file[i] = names_on_file[i].strip()
    # Removing Nadal for testing purposes
    names_on_file.remove("Rafael Nadal")
    # creating a list of players in current_player_list but not in names_on_file
    new_player_list = [x for x in current_player_list if x not in names_on_file]
    # loop through new_player_list
    for player in new_player_list:
        # delay to avoid stopping
        time.sleep(2)
        # finding the player's atp link for profile based on their name
        atp_link = current_atp_ranking.loc[current_atp_ranking['Player'] == player, 'ATP_Link']
        atp_link = atp_link.iloc[0]
        # make a basic dictionary with just the player's name and link
        player_dict = [{'Name': player, 'ATP_Link': atp_link}]
        # enter the new dictionary into the existing main file
        df.append(player_dict, ignore_index=True)
    # print dataframe to see how it looks before exporting
    print(df)
    # export dataframe into current file
    df.to_csv(current_file)
This is what the file looks like at first:
Unnamed: 0 Player ... Coach Turned_Pro
0 0 Novak Djokovic ... NaN NaN
1 1 Rafael Nadal ... Carlos Moya, Francisco Roig 2001.0
2 2 Roger Federer ... Ivan Ljubicic, Severin Luthi 1998.0
3 3 Daniil Medvedev ... NaN NaN
4 4 Dominic Thiem ... NaN NaN
... ... ... ... ... ...
1976 1976 Brian Bencic ... NaN NaN
1977 1977 Boruch Skierkier ... NaN NaN
1978 1978 Majed Kilani ... NaN NaN
1979 1979 Quentin Gueydan ... NaN NaN
1980 1980 Preston Brown ... NaN NaN
And this is what the new row looks like:
1977 1977.0 ... NaN
1978 1978.0 ... NaN
1979 1979.0 ... NaN
1980 1980.0 ... NaN
Rafael Nadal NaN ... 2001
There are critical parts of your code missing that are necessary to answer the question precisely. Two thoughts based on what you posted:
Importing Your CSV File
Your previous csv file was probably saved with the index, so the first csv column contains the dataframe index. Make sure the csv file contents do not include it. When you save, do the following:
file.to_csv('file.csv', index=False)
When you load the file like this:
pandas.read_csv('file.csv')
it will automatically assign the index numbers and there won't be a duplicate column.
Misordering of Columns
Not sure what info, and in what order, atp_link is pulling in. From what you provided it looks like it is returning two columns: "Coach" and "Turned_Pro".
I would recommend creating a list (not a dict) for each new player you want to add after you pull the info from atp_link. So if you are adding Nadal, his info list would look like this:
info_list = ['Rafael Nadal', '','2001']
Then you append the list to the dataframe like this:
df.loc[len(df),:] = info_list
Hope this helps.
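Putting the two together, a minimal sketch (the values in info_list are placeholders; the list must have exactly one entry per dataframe column, in column order):

import pandas as pd

df = pd.read_csv('ATP_stats_new.csv')  # assumes the file was saved with index=False

# placeholder values; supply one entry per dataframe column, in column order
info_list = ['Rafael Nadal', 'Carlos Moya, Francisco Roig', 2001.0]
df.loc[len(df), :] = info_list         # appended under the next integer index

df.to_csv('ATP_stats_new.csv', index=False)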
I am trying to read the messages from a CSV file, but under the class label pandas can't read the data the same way it appears in the CSV dataset.
messages = pandas.read_csv('bitcoin_reddit.csv', delimiter='\t',
                           names=["title", "class"])
print(messages)
Under the class label, pandas can only read NaN.
This is the content of my CSV file:
title,url,timestamp,class
"It's official! 1 Bitcoin = $10,000 USD",https://v.redd.it/e7io27rdgt001,29/11/2017 17:25,0
The last 3 months in 47 seconds.,https://v.redd.it/typ8fdslz3e01,4/2/2018 18:42,0
It's over 9000!!!,https://i.imgur.com/jyoZGyW.gifv,26/11/2017 20:55,1
Everyone who's trading BTC right now,http://cdn.mutually.com/wp-content/uploads/2017/06/08-19.jpg,7/1/2018 12:38,1
I hope James is doing well,https://i.redd.it/h4ngqma643101.jpg,1/12/2017 1:50,1
Weeeeeeee!,https://i.redd.it/iwl7vz69cea01.gif,17/1/2018 1:13,0
Bitcoin.. The King,https://i.redd.it/4tl0oustqed01.jpg,1/2/2018 5:46,1
Nothing can increase by that much and still be a good investment.,https://i.imgur.com/oWePY7q.jpg,14/12/2017 0:02,1
"This is why I want bitcoin to hit $10,000",https://i.redd.it/fhzsxgcv9nyz.jpg,18/11/2017 18:25,1
Bitcoin Doesn't Give a Fuck.,https://v.redd.it/ty2y74gawug01,18/2/2018 15:19,-1
Working Hard or Hardly Working?,https://i.redd.it/c2o6204tvc301.jpg,12/12/2017 12:49,1
The separator in your csv file is a comma, not a tab. And since , is the default, there is no need to define it.
However, names= defines custom names for the columns. Your header already provides these names, so passing the column names you are interested in to usecols is all you need then:
>>> pd.read_csv(file, usecols=['title', 'class'])
title class
0 It's official! 1 Bitcoin = $10,000 USD 0
1 The last 3 months in 47 seconds. 0
2 It's over 9000!!! 1
3 Everyone who's trading BTC right now 1
4 I hope James is doing well 1
5 Weeeeeeee! 0
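If you did want different column names, pandas lets you combine names= with header=0, which reads the existing header row and replaces it instead of treating it as data. A sketch with arbitrary new names:

import pandas as pd

# header=0 consumes the file's header row; names= then relabels the columns
df = pd.read_csv('bitcoin_reddit.csv', header=0,
                 names=['post_title', 'link', 'posted_at', 'label'])
print(df[['post_title', 'label']].head())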
I have the following table on a website which I am extracting with BeautifulSoup.
This is the url (I have also attached a picture).
Ideally I would like to have each company in one row in the csv, but I am getting it spread across different rows. Please see the picture attached.
I would like to have it laid out as in field "D", but I am getting it in A1, A2, A3...
This is the code I am using to extract:
def _writeInCSV(text):
    print "Writing in CSV File"
    with open('sara.csv', 'wb') as csvfile:
        #spamwriter = csv.writer(csvfile, delimiter='\t', quotechar='\n', quoting=csv.QUOTE_MINIMAL)
        spamwriter = csv.writer(csvfile, delimiter='\t', quotechar="\n")
        for item in text:
            spamwriter.writerow([item])
read_list = []
initial_list = []
url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
#gdata_even = soup.find_all("td", {"class": "ms-rteTableEvenRow-3"})
gdata_even = soup.find_all("td", {"class": "ms-rteTable-default"})
for item in gdata_even:
    print item.text.encode("utf-8")
    initial_list.append(item.text.encode("utf-8"))
    print ""
_writeInCSV(initial_list)
Can someone help, please?
Here is the idea:
read the header cells from the table
read all the other rows from the table
zip all the data row cells with headers producing a list of dictionaries
use csv.DictWriter() to dump to csv
Implementation:
import csv
from pprint import pprint
from bs4 import BeautifulSoup
import requests
url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
rows = soup.select("table.ms-rteTable-default tr")
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("td")]
data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")]))
        for row in rows[1:]]

# see what the data looks like at this point
pprint(data)

with open('sara.csv', 'wb') as csvfile:
    spamwriter = csv.DictWriter(csvfile, headers, delimiter='\t', quotechar="\n")
    for row in data:
        spamwriter.writerow(row)
Since @alecxe has already provided an amazing answer, here's another take using the pandas library.
import pandas as pd
url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
tables = pd.read_html(url)
tb1 = tables[0] # Get the first table.
tb1.columns = tb1.iloc[0] # Assign the first row as header.
tb1 = tb1.iloc[1:] # Drop the first row.
tb1.reset_index(drop=True, inplace=True) # Reset the index.
print tb1.head() # Print first 5 rows.
# tb1.to_csv("table1.csv") # Export to CSV file.
Result:
0 Company Dividend Bonus Closure of Register \
0 Nigerian Breweries Plc N3.50 Nil 5th - 11th March 2015
1 Forte Oil Plc N2.50 1 for 5 1st – 7th April 2015
2 Nestle Nigeria N17.50 Nil 27th April 2015
3 Greif Nigeria Plc 60 kobo Nil 25th - 27th March 2015
4 Guaranty Bank Plc N1.50 (final) Nil 17th March 2015
0 AGM Date Payment Date
0 13th May 2015 14th May 2015
1 15th April 2015 22nd April 2015
2 11th May 2015 12th May 2015
3 28th April 2015 5th May 2015
4 31st March 2015 31st March 2015
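On newer pandas versions the three manual header steps above can be collapsed by letting read_html promote the first row itself; a sketch:

import pandas as pd

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
tb1 = pd.read_html(url, header=0)[0]  # header=0 uses the first table row as column names
tb1.to_csv("table1.csv", index=False)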
I am trying to import a weirdly formatted text file into a pandas DataFrame. Two example lines are below:
LOADED LANE 1 MAT. TYPE= 2 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.474 LOADEFFECT 5075. LMAX= 3643. COV= .13
LOADED LANE 1 MAT. TYPE= 3 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.515 LOADEFFECT10009. LMAX= 9732. COV= .08
First I tried the following:
df = pd.read_csv('beta.txt', header=None, delim_whitespace=True, usecols=[2,5,7,9,11,13,15,17,19])
This seemed to work fine, but it got messed up when it hit the second example line above, where there is no whitespace after the LOADEFFECT string. I got a result like:
632 1 2 1 200 10 3.474 5075. 3643. 0.13
633 1 3 1 200 10 3.515 LMAX= COV= NaN
Then I decided to use a regular expression to define my delimiters. After much trial and error (I am no expert in regex), I managed to get close with the following line:
df = pd.read_csv('beta.txt', header=None, sep='/s +|LOADED LANE|MAT. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=', engine='python')
This almost works, but creates a NaN column for some reason at the very beginning:
632 NaN 1 2 1 200 10 3.474 5075 3643 0.13
633 NaN 1 3 1 200 10 3.515 10009 9732 0.08
At this point I think I can just delete that first column and get away with it. However, I wonder what the correct way would be to set up the regex so that it parses this text file correctly in one shot. Any ideas? Other than that, I am sure there is a smarter way to parse this file; I would be glad to hear your recommendations.
Thanks!
import re
import pandas as pd
import csv

csvfile = open("parsing.txt")  # open the text file
reader = csv.reader(csvfile)
new_list = []
for line in reader:
    for i in line:
        # pull every integer or decimal (including bare ones like .13) out of the field
        new_list.append(re.findall(r'(\d*\.\d+|\d+)', i))
csvfile.close()
table = pd.DataFrame(new_list)
table  # output will be a pandas DataFrame with the numeric values
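As for the leading NaN column in your read_csv attempt: every line begins with a match for one of your separators ("LOADED LANE"), so the split always yields an empty first field, which pandas fills with NaN. One option is to let only the field labels act as separators (this also handles the missing-whitespace case like LOADEFFECT10009.) and drop the empty column afterwards; a sketch with hypothetical column names:

import pandas as pd

# only the labels separate the fields; the first split field is always empty
# because every line starts with "LOADED LANE", so we drop it after reading
df = pd.read_csv('beta.txt', header=None, engine='python',
                 sep=r'LOADED LANE|MAT\. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=')
df = df.iloc[:, 1:]  # drop the empty leading column
df.columns = ['lane', 'mat_type', 'leffect', 'span', 'space',
              'beta', 'loadeffect', 'lmax', 'cov']  # hypothetical names
print(df.head())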