I am trying to read the messages from a CSV dataset, but pandas can't read the values under the class label correctly.
import pandas
messages = pandas.read_csv('bitcoin_reddit.csv', delimiter='\t',
                           names=["title", "class"])
print(messages)
Under the class label, pandas can only read NaN.
Here is the content of my CSV file:
title,url,timestamp,class
"It's official! 1 Bitcoin = $10,000 USD",https://v.redd.it/e7io27rdgt001,29/11/2017 17:25,0
The last 3 months in 47 seconds.,https://v.redd.it/typ8fdslz3e01,4/2/2018 18:42,0
It's over 9000!!!,https://i.imgur.com/jyoZGyW.gifv,26/11/2017 20:55,1
Everyone who's trading BTC right now,http://cdn.mutually.com/wp-content/uploads/2017/06/08-19.jpg,7/1/2018 12:38,1
I hope James is doing well,https://i.redd.it/h4ngqma643101.jpg,1/12/2017 1:50,1
Weeeeeeee!,https://i.redd.it/iwl7vz69cea01.gif,17/1/2018 1:13,0
Bitcoin.. The King,https://i.redd.it/4tl0oustqed01.jpg,1/2/2018 5:46,1
Nothing can increase by that much and still be a good investment.,https://i.imgur.com/oWePY7q.jpg,14/12/2017 0:02,1
"This is why I want bitcoin to hit $10,000",https://i.redd.it/fhzsxgcv9nyz.jpg,18/11/2017 18:25,1
Bitcoin Doesn't Give a Fuck.,https://v.redd.it/ty2y74gawug01,18/2/2018 15:19,-1
Working Hard or Hardly Working?,https://i.redd.it/c2o6204tvc301.jpg,12/12/2017 12:49,1
The separator in your CSV file is a comma, not a tab. And since ',' is the default, there is no need to define it.
However, names= defines custom names for the columns. Your header already provides these names, so passing the column names you are interested in to usecols is all you need:
>>> pd.read_csv(file, usecols=['title', 'class'])
title class
0 It's official! 1 Bitcoin = $10,000 USD 0
1 The last 3 months in 47 seconds. 0
2 It's over 9000!!! 1
3 Everyone who's trading BTC right now 1
4 I hope James is doing well 1
5 Weeeeeeee! 0
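If you did want to supply your own column names, here is a sketch of how names= is meant to be combined with a file that already has a header: passing header=0 tells pandas to consume the existing header row instead of reading it as data.

import pandas as pd

# header=0 consumes the existing header line; names= then relabels the
# four columns, and usecols keeps only the two we care about.
messages = pd.read_csv('bitcoin_reddit.csv', header=0,
                       names=['title', 'url', 'timestamp', 'class'],
                       usecols=['title', 'class'])
print(messages.head())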
I'm cleaning a CSV file with pandas, mainly removing special characters such as '/', '#', etc. The file has 7 columns (none of which are dates).
In some of the columns there are values such as '11/6/1980', which are not dates.
I've noticed that directly after reading the CSV file,
df = pd.read_csv('report16.csv', encoding='ANSI')
this data becomes '11/6/80', and after cleaning it becomes '11 6 80' (it's the same result in the output file). So wherever the data has '/', it's being interpreted as a date, and the first 2 digits of the year are being dropped.
Data        Expected result   Actual Result
11/6/1980   11 6 1980         11 6 80
12/8/1983   12 8 1983         12 8 83
Both of the above rows are wrong in the Actual Result column: I'm losing 2 digits towards the end.
The data looks like this:

Org Name   Code        Code copy
ABC        11/6/1980   11/6/1980
DEF        12/8/1983   12/8/1983
GH         11/5/1987   11/5/1987

And as raw CSV text:
OrgName, Code, Code copy
ABC, 11/6/1980, 11/6/1980
DEF, 12/8/1983, 12/8/1983
GH, 11/5/1987, 11/5/1987
KC, 9000494, 9000494
It's worth mentioning that the column contains other data such as '900490', plain strings, etc., but in those instances there aren't any problems.
What can be done to prevent this conversion?
Not an answer, but comments do not allow well-presented code and data.
Here is what I call a minimal reproducible example:
Content of the sample.csv file:
Data,Expected result,Actual Result
11/6/1980,11 6 1980,11 6 80
12/8/1983,12 8 1983,12 8 83
Code:
import pandas as pd

df = pd.read_csv('sample.csv')
print(df)
s = df['Data'].str.replace('/', ' ')
print((df['Expected result'] == s).all())
It gives:
Data Expected result Actual Result
0 11/6/1980 11 6 1980 11 6 80
1 12/8/1983 12 8 1983 12 8 83
True
This proves that read_csv has correctly read the file and has not changed anything.
PLEASE SHOW THE CONTENT OF YOUR CSV FILE AS TEXT, along with enough code to reproduce your problem.
How about trying a string operation? First select the column that you would like to modify, then replace "/" or "#" with whitespace: column.str.replace("/", " "). I hope this works!
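For instance, a sketch applying that to the 'Code' column from the question (the column name is taken from the sample data above; adjust it to your file):

import pandas as pd

df = pd.read_csv('report16.csv')
# '[/#]' is a regex character class: replace either '/' or '#' with a space.
df['Code'] = df['Code'].astype(str).str.replace(r'[/#]', ' ', regex=True)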
The behavior of converting dates is not strictly a Python issue; you are using pandas' read_csv.
Try explicitly declaring a separator. If sep is not declared, pandas guesses it.
df = pd.read_csv('report16.csv', encoding='ANSI', sep=',')
When reading a CSV with a list of games in a single column, the game in the first/top row is displayed out of order, like so:
Fatal Labyrinth™
0 Beat Hazard
1 Dino D-Day
2 Anomaly: Warzone Earth
3 Project Zomboid
4 Avernum: Escape From the Pit
..with the code being:
my_data = pd.read_csv(r'C:\Users\kosta\list.csv', encoding='utf-16', delimiter='=')
print(my_data)
Fatal Labyrinth is, I suppose, not indexed. Adding 'index_col=0' lists each game, like so:
Empty DataFrame
Columns: []
Index: [Beat Hazard, Dino D-Day, more games etc...]
But this does not help, as the endgame here is to count each game and determine the most common. When doing:
from collections import Counter

counts = Counter(my_data)
dictTime = dict(counts.most_common(3))
for key in dictTime:
    print(key)
..all I'm getting back is:
Fatal Labyrinth™
Thank you :)
You need to add the names= parameter when you read the CSV file.
my_data = pd.read_csv('test.csv', delimiter='=', names=['Game_Name']) # Game_Name is given as column name
print(my_data)
Game_Name
0 Fatal Labyrinth™
1 Beat Hazard
2 Dino D-Day
3 Anomaly: Warzone Earth
4 Project Zomboid
5 Avernum: Escape From the Pit
Also, value_counts() can be used on the dataframe to find the frequency of each value.
(my_data.Game_Name.value_counts(ascending=False)).head(3) # Top three most frequent value
Project Zomboid 1
Anomaly: Warzone Earth 1
Beat Hazard 1
Name: Game_Name, dtype: int64
In case you need to get the top game name by its frequency:
(my_data.Game_Name.value_counts()).head(1).index[0]
'Project Zomboid'
I am a decent SAS programmer, but I am quite new to Python. Now, I have been given Twitter feeds, each saved into very large flat files, with headers in row #1 and a data structure like the one below:
CREATED_AT||||ID||||TEXT||||IN_REPLY_TO_USER_ID||||NAME||||SCREEN_NAME||||DESCRIPTION||||FOLLOWERS_COUNT||||TIME_ZONE||||QUOTE_COUNT||||REPLY_COUNT||||RETWEET_COUNT||||FAVORITE_COUNT
Tue Nov 14 12:33:00 +0000 2017||||930413253766791168||||ICYMI: Football clubs join the craft beer revolution! A good read|||| ||||BAB||||BABBrewers||||Monthly homebrew meet-up at 1000 Trades, Jewellery Quarter. First Tuesday of the month. All welcome, even if you've never brewed before.||||95|||| ||||0||||0||||0||||0
Tue Nov 14 12:34:00 +0000 2017||||930413253766821456||||I'm up for it|||| ||||Misty||||MistyGrl||||You CAN DO it!||||45|||| ||||0||||0||||0||||0
I guess it's like that because any sort of characters can be found in a Twitter feed, but a quadruple pipe is unlikely enough.
I know some people use JSON for that, but I've got these files as such, and lots of them. I could use SAS to easily transform these files, but I prefer to "go pythonic" this time.
Now, I cannot seem to find a way to make Python (2.7) understand that the quadruple pipe is the actual separator. The output from the code below:
import pandas as pd
with open('C:/Users/myname.mysurname/Desktop/my_twitter_flow_1.txt') as theInFile:
inTbl = pd.read_table(theInFile, engine='python', sep='||||', header=1)
print inTbl.head()
seems to suggest that Python does not see the distinct fields as distinct but simply brings in each of the first 5 rows, up to the line-feed character, ignoring the |||| separator.
Basically, I am getting an output like the one I wrote above to show you the data structure.
Any hints?
Using just the data in your question:
>>> df = pd.read_csv('rio.txt', sep='\|{4}', skip_blank_lines=True, engine='python')
>>> df
CREATED_AT ID \
0 Tue Nov 14 12:33:00 +0000 2017 930413253766791168
1 Tue Nov 14 12:34:00 +0000 2017 930413253766821456
TEXT IN_REPLY_TO_USER_ID \
0 ICYMI: Football clubs join the craft beer revo...
1 I'm up for it
NAME SCREEN_NAME DESCRIPTION \
0 BAB BABBrewers Monthly homebrew meet-up at 1000 Trades, Jewel...
1 Misty MistyGrl You CAN DO it!
FOLLOWERS_COUNT TIME_ZONE QUOTE_COUNT REPLY_COUNT RETWEET_COUNT \
0 95 0 0 0
1 45 0 0 0
FAVORITE_COUNT
0 0
1 0
Notice the sep parameter. When it's more than one character long and not equal to '\s+', it's interpreted as a regular expression. But the '|' character has special meaning in a regex, hence it must be escaped with the '\' character. I could simply have written sep='\|\|\|\|'; however, I've used an abbreviation.
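If you'd rather not escape by hand, re.escape can build the same pattern for you (a sketch, assuming the same file):

import re
import pandas as pd

# re.escape('||||') produces '\|\|\|\|', i.e. the longhand spelling above.
df = pd.read_csv('rio.txt', sep=re.escape('||||'), engine='python')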
I am converting some of my web-scraping code from R to Python (I can't get geckodriver to work with R, but it works with Python). Anyway, I am trying to understand how to parse and read HTML tables with Python. As quick background, here is my R code:
doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")
WebElem <- readHTMLTable(doc, stringsAsFactors = FALSE)[[7]]
I would parse the HTML page to the doc object. Then I would start with doc[[1]], and move through higher numbers until I saw the data I wanted. In this case I got to doc[[7]] and saw the data I wanted. I then would read that HTML table and assign it to the WebElem object. Eventually I would turn this into a dataframe and play with it.
So what I am doing in Python is this:
from bs4 import BeautifulSoup

html = driver.page_source
doc = BeautifulSoup(html, 'html.parser')
Then I started to play with doc.get_text but I don't really know how to get just the data I want to see. The data I want to see is like a 10x10 matrix. When I used R, I would just use doc[[7]] and that matrix would almost be in a perfect structure for me to convert it to a dataframe. However, I just can't seem to do that with Python. Any advice would be much appreciated.
UPDATE:
I have been able to get the data I want using Python--I followed this blog for creating a dataframe with python: Python Web-Scraping. Here is the website that we are scraping in that blog: Most Popular Dog Breeds. In that blog post, you have to work your way through the elements, create a dict, loop through each row of the table and store the data in each column, and then you are able to create a dataframe.
With R, the only code I had to write was:
doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")
df <- as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE))
With just that, I have a pretty nice dataframe that only needs its column names and data types adjusted. It looks like this with just that code:
NULL.V1 NULL.V2 NULL.V3 NULL.V4
1 BREED 2015 2014 2013
2 Retrievers (Labrador) 1 1 1
3 German Shepherd Dogs 2 2 2
4 Retrievers (Golden) 3 3 3
5 Bulldogs 4 4 5
6 Beagles 5 5 4
7 French Bulldogs 6 9 11
8 Yorkshire Terriers 7 6 6
9 Poodles 8 7 8
10 Rottweilers 9 10 9
Is there not something available in Python to make this a bit simpler, or is it just simpler in R because R is more built for dataframes (at least that's how it seems to me, but I could be wrong)?
OK, after some hefty digging around, I feel like I came to a good solution, matching that of R. If you are looking at the HTML provided in the link above, Dog Breeds, and have the web driver running for that link, you can run the following code:
tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')
df = pd.read_html(tbl)
Then you are looking at a pretty nice dataframe after only a couple of lines of code:
In [145]: df
Out[145]:
[ 0 1 2 3
0 BREED 2015 2014 2013.0
1 Retrievers (Labrador) 1 1 1.0
2 German Shepherd Dogs 2 2 2.0
3 Retrievers (Golden) 3 3 3.0
4 Bulldogs 4 4 5.0
5 Beagles 5 5 4.0
I feel like this is much easier than working through the tags, creating a dict, and looping through each row of data as the blog suggests. It might not be the most correct way of doing things (I'm new to Python), but it gets the job done quickly. I hope this helps out some fellow web-scrapers.
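One caveat worth noting: pd.read_html returns a list of DataFrames, one per table it finds, so to work with the table itself you index into the list:

df = pd.read_html(tbl)[0]  # read_html returns a list; take the first table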
tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')
df = pd.read_html(tbl)
It worked pretty well.
First, read Selenium with Python; it will give you a basic idea of how Selenium works with Python.
Then, if you want to locate elements in Python, there are two ways:
Use the Selenium API; you can refer to Locating Elements.
Use BeautifulSoup; there is a nice document you can read: BeautifulSoup Documentation.
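To make that concrete, here is a minimal sketch of the BeautifulSoup route (assuming html already holds driver.page_source, as in the question):

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')         # every <table> element on the page
df = pd.read_html(str(tables[0]))[0]    # hand the markup to pandas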
I have the following two dataframes:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df=pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values. I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the dfs on catcode, state, year. I made a test earlier wherein a df1.catcode with only one value per cell was matched with the values in another df.catcode that had more than one value per cell and it worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode, but additionally, if anyone has ever done a merge of this sort before, any 'caveats' learned through experience would be appreciated. My merge code looks like this:
mplmerge=pd.merge(df1,df, on=(['catcode', 'state', 'year']), how='left' )
I think this can be done with a regex method; I'm looking at the documentation now.
Cleaning catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode',axis = 1, inplace = True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
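Putting it together, a sketch of the final merge with the cleaned df:

# Per the note above: reset df1's index first, then merge on the shared keys.
mplmerge = pd.merge(df1.reset_index(drop=True), df,
                    on=['catcode', 'state', 'year'], how='left')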