This is a simple question for which answer are surprisingly difficult to find online. Here's the situation:
>>> A
[('hey', 'you', 4), ('hey', 'not you', 5), ('not hey', 'you', 2), ('not hey', 'not you', 6)]
>>> A_p = pandas.DataFrame(A)
>>> A_p
0 1 2
0 hey you 4
1 hey not you 5
2 not hey you 2
3 not hey not you 6
>>> B_p = A_p.pivot(0,1,2)
>>> B_p
1 not you you
0
hey 5 4
not hey 6 2
This isn't quite what's suggested in the documentation for pivot -- there, it shows results without the 1 and 0 in the upper-left-hand corner. And that's what I'm looking for, a DataFrame object that prints as
not you you
hey 5 4
not hey 6 2
The problem is that the normal behavior results in a csv file whose first line is
0,not you,you
when I really want
not you, you
When the normal csv file (with the preceding "0,") reads into R, it doesn't properly set the column and row names from the frame object, resulting in painful manual manipulation to get it in the right format. Is there a way to get pivot to give me a DataFrame object without that additional information in the upper-left corner?
Well, you have:
In [17]: B_p.to_csv(sys.stdout)
0,not you,you
hey,5.0,4.0
not hey,6.0,2.0
In [18]: B_p.to_csv(sys.stdout, index=False)
not you,you
5.0,4.0
6.0,2.0
But I assume you want the row names. Setting the index name to None (B_p.index.name = None) gives a leading comma:
In [20]: B_p.to_csv(sys.stdout)
,not you,you
hey,5.0,4.0
not hey,6.0,2.0
This roughly matches (ignoring quoted strings) what R writes in write.csv when row.names=TRUE:
"","a","b"
"foo",0.720538259472741,-0.848304940318957
"bar",-0.64266667412325,-0.442441171401282
"baz",-0.419181615269841,-0.658545964124229
"qux",0.881124313748992,0.36383198969179
"bar2",-1.35613767310069,-0.124014006180608
Any of these help?
EDIT: Added the index_label=False option today which does what you want:
In [2]: df
Out[2]:
A B
one 1 4
two 2 5
three 3 6
In [3]: df.to_csv('foo.csv', index_
index_exp index_label= index_name=
In [3]: df.to_csv('foo.csv', index_name=False)
In [4]:
11:24 ~/code/pandas (master)$ R
R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
[Previously saved workspace restored]
re> read.csv('foo.csv')
A B
one 1 4
two 2 5
three 3 6
Related
Here is my problem (I'm working on python) :
I have a Dataframe with columns: Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
And I have a list of words that contains 300 skills:
keywords = ["C++","Data Analytics","python","R", ............ "Django"]
I need to match those keywords with each of the jobs descriptions and obtain a new dataframe saying if is true or false that C++ is in job description[0]...job description[1], job description[2] and so on.
My new dataframe will be:
columns : ['job_title', 'company', 'description', "C++", "Data Analytics",
....... "Django"]
Where each column of keywords said true or false if it match(is found) or not on the job description.
There might be another ways to structure the dataframe (I'm listening suggestions).
Hope I'm clear with my question. I try regex but I can't make it iterate trough each row, I try with a loop using "fnmatch" library and I can't make it work. The best approach so far was:
df["microservice"]= df.description.str.contains("microservice")
df["cloud-based architecture"] = df.description.str.contains("cloud-based architecture")
df["service oriented architecture"] = df.description.str.contains("service oriented architecture")
However, First I could not manage to make it loop trough each rows of description column, so i have to input 300 times the code with each word (it doesn't make sense). Second, trough this way, I have problems with few words such as "R" because it find the letter R in each description, so it will pull true in each of them.
Iterate over list of keywords and extract each column from the description one:
for name in keywords:
df[name] = df['description'].apply(lambda x: True if name in x else False)
EDIT:
That doesn't solve the problem with R. To do so you could add a space to make sure it's isolated so the code would be:
for name in keywords:
df[name] = df['description'].apply(lambda x: True if ' '+str(name)+' ' in x else False)
But that's really ugly and not optimised. Regular expression should do the trick but I have to look back into it: found it! [ ]*+[str(name)]+[.?!] is better! (and more appropriate)
One way is to build a regex string to identify any keyword in your string... this example is case insensitive and will find any substring matches - not just whole words...
import pandas as pd
import re
keywords = ['python', 'C++', 'admin', 'Developer']
rx = '(?i)(?P<keywords>{})'.format('|'.join(re.escape(kw) for kw in keywords))
Then with a sample DF of:
df = pd.DataFrame({
'job_description': ['C++ developer', 'traffic warden', 'Python developer', 'linux admin', 'cat herder']
})
You can find all keywords for the relevant column...
matches = df['job_description'].str.extractall(rx)
Which gives:
keyword
match
0 0 C++
1 developer
2 0 Python
1 developer
3 0 admin
Then you want to get a list of "dummies" and take the max (so you always get a 1 where a word was found) using:
dummies = pd.get_dummies(matches).max(level=0)
Which gives:
keyword_C++ keyword_Python keyword_admin keyword_developer
0 1 0 0 1
2 0 1 0 1
3 0 0 1 0
You then left join that back to your original DF:
result = df.join(dummies, how='left')
And the result is:
job_description keyword_C++ keyword_Python keyword_admin keyword_developer
0 C++ developer 1.0 0.0 0.0 1.0
1 traffic warden NaN NaN NaN NaN
2 Python developer 0.0 1.0 0.0 1.0
3 linux admin 0.0 0.0 1.0 0.0
4 cat herder NaN NaN NaN NaN
skill = "C++", or any of the others
frame = an instance of
Index(['job_title', 'company', 'job_label', 'description'],
dtype='object')
jobs = a list/np.array of frames, which is probably your input
A naive implementation could look a bit like this:
for skill in keywords:
for frame in jobs:
if skill in frame["description"]: # or more exact matching, but this is what's in the question
# exists
But you need to put more work into what output structure you are going to use. Just having an output array of 300 columns most of which just contain a False isn't going to be a good plan. I've never worked with Panda's myself, but if it were normal numpy arrays (which panda's DataFrames are under the hood), I would add a column "skills" that can enumerate them.
You can leverage .apply() like so (#Jacco van Dorp made a solid suggestion of storing all of the found skills inside a single column, which I agree is likely the best approach to your problem):
df = pd.DataFrame([['Engineer','Firm','AERO1','Work with python and Django'],
['IT','Dell','ITD4','Work with Django and R'],
['Office Assistant','Dental','OAD3','Coordinate schedules'],
['QA Engineer','Factory','QA2','Work with R and python'],
['Mechanic','Autobody','AERO1','Love the movie Django']],
columns=['job_title','company','job_label','description'])
Which yields:
job_title company job_label description
0 Engineer Firm AERO1 Work with python and Django
1 IT Dell ITD4 Work with Django and R
2 Office Assistant Dental OAD3 Coordinate schedules
3 QA Engineer Factory QA2 Work with R and python
4 Mechanic Autobody AERO1 Love the movie Django
Then define your skill set and your list comprehension to pass to .apply():
skills = ['python','R','Django']
df['skills'] = df.apply(lambda x: [i for i in skills if i in x['description'].split()], axis=1)
Which yields this column:
skills
0 [python, Django]
1 [R, Django]
2 []
3 [python, R]
4 [Django]
If you are still interested in having individual columns for each skill, I can edit my answer to provide that as well.
I have a csv file that has a primary_id field and a version field and it looks like this:
ful_id version xs at_grade date
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 1 12 no 20170206
034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 2 123 no 20170206
edit this is what the actual data looks like plus add 106 more columns of data and 20,000 records
The larger version number is the latest version of that record.I am having a difficult time thinking of the logic to get the latest record based on version and dumping that into a dictionary.I am pulling the info from the csv into a blank list but If anyone could give me some guidance on some of the logic moving forward, I would appreciate it
import csv
from collections import defaultdict
reader = csv.DictReader(open('rpm_inv.csv', 'rb'))
allData = list(reader)
dict_list = []
for line in allData:
dict_list.append(line)
pprint.pprint(dict_list)
I'm not exactly sure how you want your output to look like, but this might point you at least in the right direction, as long as you're not opposed to pandas.
import pandas as pd
df = pd.read_csv('rpm_inv.csv', header=True)
by_version = df.groupby('Version')
latest = by_version.max()
# To put it into a dictionary of {version:ID}
{v:row['ID'] for v, row in latest.iterrows()}
There's no need for anything fancy.
defaultdict is included in Python's standard library. It's an improved dictionary. I've used it here because it obviates the need to initialise entries in a dictionary. This means that I can write, for instance, result[id] = max(result[id], version). If no entry exists for id then defaultdict creates one and puts version in it (because it's obvious that this will be the maximum).
I read through the lines in the input file, one at a time, discarding end-lines and blanks, splitting on the commas, and then use map to apply the int function to each string produced.
I ignore the first line in the file simply be reading it and assigning its contents to a variable that I have arbitrarily called ignore.
Finally, just to make the results more intelligible, I sort the keys in the dictionary, and present the contents of it in order.
>>> from collections import defaultdict
>>> result = defaultdict(int)
>>> with open('to_dict.txt') as input:
... ignore = input.readline()
... for line in input:
... id, version = map(int, line.strip().replace(' ', '').split(','))
... result[id] = max(result[id], version)
...
>>> ids = list(result.keys())
>>> ids.sort()
>>> for id in ids:
... id, result[id]
...
(3, 1)
(11, 3)
(20, 2)
(400, 2)
EDIT: With that much data it becomes a different question, in my estimation, better processed with pandas.
I've put the df.groupby(['ful_id']).version.idxmax() bit in to demonstrate what I've done. I group on ful_id, then ask for the maximum value of version and the index of the maximum value, all in one step using idxmax. Although pandas displays this as a two-column table the result is actually a list of integers that I can use to select rows from the dataframe.
That's what I do with df.iloc[df.groupby(['ful_id']).version.idxmax(),:]. Here the df.groupby(['ful_id']).version.idxmax() part identifies the rows, and the : part identifies the columns, namely all of them.
Thanks for an interesting question!
>>> import pandas as pd
>>> df = pd.read_csv('different.csv', sep='\s+')
>>> df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
1 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 1 12 no 20170206
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
4 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 2 123 no 20170206
>>> df.groupby(['ful_id']).version.idxmax()
ful_id
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 0
00dc5fec-ddb8-45fa-9c86-77e09ff590a9 3
034c1a6c-4f1c-aa36-a70d-f2245f7rr342 2
Name: version, dtype: int64
>>> new_df = df.iloc[df.groupby(['ful_id']).version.idxmax(),:]
>>> new_df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
I have a pandas df.
Say I have a column "activity" which can be "fun" or "work" and I want to convert it to an integer.
What I do is:
df["activity_id"] = 1*(df["activity"]=="fun") + 2*(df["activity"]=="work")
This works, since I do not know how to put an if/else in there (and if you have 10 activities it can get complicated).
However, say I now have the opposite problem, and I want to convert from an id to a string, I cannot use this trick anymore because I cannot multiply a string with a Boolean. How do I do it? Is there a way to use if/else?
You can create a dictionary with id as the key and the string as the value and then use the map series method to convert the integer to a string.
my_map = {1:'fun', 2:'work'}
df['activity']= df.activity_id.map(my_map)
You could instead convert your activity column to categorical dtype:
df['activity'] = pd.Categorical(df['activity'])
Then you would automatically have access to integer labels for the values via df['activity'].cat.codes.
import pandas as pd
df = pd.DataFrame({'activity':['fun','work','fun']})
df['activity'] = pd.Categorical(df['activity'])
print(df['activity'].cat.codes)
0 0
1 1
2 0
dtype: int8
Meanwhile the string values can still be accessed as before while saving memory:
print(df)
still yields
activity
0 fun
1 work
2 fun
You could also use a dictionary and list comprehension to recalculate values for an entire column. This makes it easy to define the reverse mapping as well:
>>> import pandas as pd
>>> forward_map = {'fun': 1, 'work': 2}
>>> reverse_map = {v: k for k, v in forward_map.iteritems()}
>>> df = pd.DataFrame(
{'activity': ['work', 'work', 'fun', 'fun', 'work'],
'detail': ['reports', 'coding', 'hiking', 'games', 'secret games']})
>>> df
activity detail
0 work reports
1 work coding
2 fun hiking
3 fun games
4 work secret games
>>> df['activity'] = [forward_map[i] for i in df['activity']]
>>> df
activity detail
0 2 reports
1 2 coding
2 1 hiking
3 1 games
4 2 secret games
>>> df['activity'] = [reverse_map[i] for i in df['activity']]
>>> df
activity detail
0 work reports
1 work coding
2 fun hiking
3 fun games
4 work secret games
I am converting some of my web-scraping code from R to Python (I can't get geckodriver to work with R, but it's working with Python). Anyways, I am trying to understand how to parse and read HTML tables with Python. Quick background, here is my code for R:
doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")
WebElem <- readHTMLTable(doc, stringsAsFactors = FALSE)[[7]]
I would parse the HTML page to the doc object. Then I would start with doc[[1]], and move through higher numbers until I saw the data I wanted. In this case I got to doc[[7]] and saw the data I wanted. I then would read that HTML table and assign it to the WebElem object. Eventually I would turn this into a dataframe and play with it.
So what I am doing in Python is this:
html = None
doc = None
html = driver.page_source
doc = BeautifulSoup(html)
Then I started to play with doc.get_text but I don't really know how to get just the data I want to see. The data I want to see is like a 10x10 matrix. When I used R, I would just use doc[[7]] and that matrix would almost be in a perfect structure for me to convert it to a dataframe. However, I just can't seem to do that with Python. Any advice would be much appreciated.
UPDATE:
I have been able to get the data I want using Python--I followed this blog for creating a dataframe with python: Python Web-Scraping. Here is the website that we are scraping in that blog: Most Popular Dog Breeds. In that blog post, you have to work your way through the elements, create a dict, loop through each row of the table and store the data in each column, and then you are able to create a dataframe.
With R, the only code I had to write was:
doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")
df <- as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE)
With just that, I have a pretty nice dataframe that I only need to adjust the column names and data types--it looks like this with just that code:
NULL.V1 NULL.V2 NULL.V3 NULL.V4
1 BREED 2015 2014 2013
2 Retrievers (Labrador) 1 1 1
3 German Shepherd Dogs 2 2 2
4 Retrievers (Golden) 3 3 3
5 Bulldogs 4 4 5
6 Beagles 5 5 4
7 French Bulldogs 6 9 11
8 Yorkshire Terriers 7 6 6
9 Poodles 8 7 8
10 Rottweilers 9 10 9
Is there not something available in Python to make this a bit simpler, or is this just simpler in R because R is more built for dataframes(at least that's how it seems to me, but I could be wrong)?
Ok, after some hefty digging around I feel like I came to good solution--matching that of R. If you are looking at the HTML provided in the link above, Dog Breeds, and you have the web driver running for that link you can run the following code:
tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')
df = pd.read_html(tbl)
Then you are looking a pretty nice dataframe after only a couple lines of code:
In [145]: df
Out[145]:
[ 0 1 2 3
0 BREED 2015 2014 2013.0
1 Retrievers (Labrador) 1 1 1.0
2 German Shepherd Dogs 2 2 2.0
3 Retrievers (Golden) 3 3 3.0
4 Bulldogs 4 4 5.0
5 Beagles 5 5 4.0
I feel like this is much easier than working through the tags, creating a dict, and looping through each row of data as the blog suggests. It might not be the most correct way of doing things, I'm new to Python, but it gets the job done quickly. I hope this helps out some fellow web-scrapers.
tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')
df = pd.read_html(tbl)
it Worked pretty well.
First, read Selenium with Python, you will get basic idea of how Selenium work with Python.
Than, if you want to locate element in Python, there are tow ways:
Use Selenium API, you can refer Locating Elements
Use BeautifulSoup, there is nice Document you can read
BeautifulSoupDocumentation
I am very new with python and am struggling to figure out how to output my data to a file.
The following section of my script works great, however I would like to be able to output the printed data to a text file.
x = np.linspace(141, 144.5, 500)
print x
y = np.linspace(-38.53, -38.53, 500)
z = np.linspace(0, 0, 500)
gz = tesseroid.gz(x,y,z,model)
print gz
As an addendum to that; when it prints in my terminal it prints the data as follows...
(x)
1 2 3 4
5 ... 500
(gz)
1 2 3 4
5 ... 500
...however I would love it to output the data in a single column like this:
1
2
3
4
5
...
8
It would be even better if they could both be in the same file...
1 1
2 2
3 3
4 4
5 5
...
500 500
...however this is not necessary, as I could easily manually manipulate them from there.
Many thanks in advance, and apologies if it is a very simple question; it is just something that I haven't been able to figure out with my limited experience.
Edit: Please note that it does not necessarily have to use the numpy savetext function. If there is an easier way to perform this task then I am perfectly willing to use it instead.
As you title says, you can use np.savetxt() to save in one column:
np.savetxt('x.txt', x.ravel())
np.savetxt('gz.txt', gz.ravel())
or to save in two columns:
np.savetxt('x_gz.txt', np.vstack((x.ravel(), gz.ravel())).T)