I am a rookie. Collectively, I have about two weeks of experience with any sort of computer code.
I created a dictionary with some baseball player names, their batting order, and the position they play. I'm trying to get it to print out in columns with "order", "name", and "position" as headings, with the order numbers, positions, and names under those. Think spreadsheet kind of layout (stackoverflow won't let me format the way I want to in here).
order  name        position
1      A Baddoo    LF
2      J Schoop    1B
3      R Grossman  DH
I have tried 257 times to get this thing to work. I've consulted Google, a Python book, and other sources, to no avail.
Here is the working code:
for order in lineup["order"]:
    order -= 1
    position = lineup["position"][order]
    name = lineup["name"][order]
    print(order + 1, name, position)
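If you also want the order/name/position headings aligned in columns, a minimal sketch using f-string width specifiers (assuming the same lineup dict of parallel lists as in the question):
print(f"{'order':<6}{'name':<12}{'position'}")
for order in lineup["order"]:
    i = order - 1
    print(f"{order:<6}{lineup['name'][i]:<12}{lineup['position'][i]}")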
Your dictionary consists of 3 lists - if you want entry 0, you have to access entry 0 in each list separately.
This would be an easier way to use a dictionary:
players = {1: ["LF", "A Baddoo"],
           2: ["1B", "J Schoop"]}

for player in players:
    print(players[player])
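From there, one possible way to get the spreadsheet-style rows (note that the position comes first in each list):
print("order", "name", "position")
for order, (position, name) in players.items():
    print(order, name, position)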
Hope this helps
Pandas is good for creating tables like the one you describe. You can convert a dictionary into a dataframe/table. The only catch here is that the lists all need to be the same length, which in this case they are not: the order, position, and name columns each have 9 items, while pitcher has only 1. So what we can do is pull out the pitcher key using pop. This leaves you with a dictionary containing just those 3 column names mentioned above, plus a separate list with your 1 pitcher item.
import pandas as pd

lineup = {'order': [1, 2, 3, 4, 5, 6, 7, 8, 9],
          'position': ['LF', '1B', 'RF', 'DH', '3B', 'SS', '2B', 'C', 'CF'],
          'name': ['A Baddoo', 'J Schoop', 'R Grossman', 'M Cabrera', 'J Candelario', 'H Castro', 'W Castro', 'D Garneau', 'D Hill'],
          'pitcher': ['Matt Manning']}

pitching = lineup.pop('pitcher')
starting_lineup = pd.DataFrame(lineup)
print("Today's Detroit Tigers starting lineup is: \n", starting_lineup.to_string(index=False), "\nPitcher: ", pitching[0])
Output:
Today's Detroit Tigers starting lineup is:
order position name
1 LF A Baddoo
2 1B J Schoop
3 RF R Grossman
4 DH M Cabrera
5 3B J Candelario
6 SS H Castro
7 2B W Castro
8 C D Garneau
9 CF D Hill
Pitcher: Matt Manning
user_id  partner_name  order_sequence
2        Star Bucks    1
2        KFC           2
2        MCD           3
6        Coffee Store  1
6        MCD           2
9        KFC           1
I am trying to figure out which two-restaurant combinations occur the most often. For instance, the user with user_id 2 went to Star Bucks, KFC, and MCD, so I want a two-dimensional array that has [[Star Bucks, KFC], [KFC, MCD]].
However, each time the user_id changes (for instance, between rows 3 and 4 of the table), the code should skip adding that combination.
Also, if a user has only one entry in the table (for instance, user_id 9), then this user should not be added to the list because they did not visit two or more restaurants.
The final result I am expecting for this table is:
[[Star Bucks, KFC], [KFC, MCD], [Coffee Store, MCD]]
I have written the following lines of code but so far I am unable to crack it.
Requesting help!
arr1 = []
arr2 = []
for idx, x in enumerate(df['order_sequence']):
    if x != 1:
        arr1.append(df['partner_name'][idx])
        arr1.append(df['partner_name'][idx+1])
        arr2.append(arr1)
You could try to use .groupby() and zip():
res = [
    pair
    for _, sdf in df.groupby("user_id")
    for pair in zip(sdf["partner_name"], sdf["partner_name"].iloc[1:])
]
Result for the sample dataframe:
[('Star Bucks', 'KFC'), ('KFC', 'MCD'), ('Coffee Store', 'MCD')]
Or try
res = (
    df.groupby("user_id")["partner_name"].agg(list)
      .map(lambda l: list(zip(l, l[1:])))
      .sum()
)
with the same result.
You might have to sort the dataframe first:
df = df.sort_values(["user_id", "order_sequence"])
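Since the end goal is the most frequent combination, a possible follow-up sketch using collections.Counter on the res list from either variant:
from collections import Counter

pair_counts = Counter(res)
print(pair_counts.most_common(1))  # the most frequent restaurant pair and its count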
When reading a CSV with a list of games in a single column, the game in the first/top row is displayed out of order, like so:
Fatal Labyrinth™
0 Beat Hazard
1 Dino D-Day
2 Anomaly: Warzone Earth
3 Project Zomboid
4 Avernum: Escape From the Pit
...with the code being:
my_data = pd.read_csv(r'C:\Users\kosta\list.csv', encoding='utf-16', delimiter='=')
print(my_data)
Fatal Labyrinth is, I suppose, not indexed. Adding 'index_col=0' lists each game, like so:
Empty DataFrame
Columns: []
Index: [Beat Hazard, Dino D-Day, more games etc...]
But this does not help, as the endgame here is to count each game and determine the most common. When doing:
counts = Counter(my_data)
dictTime = dict(counts.most_common(3))
for key in dictTime:
    print(key)
...all I'm getting back is:
Fatal Labyrinth™
Thank you :)
You need to add the names= parameter when you read the CSV file. Without it, the first row (Fatal Labyrinth™) is treated as the header row, which is why it appears out of order and why iterating over the dataframe (as Counter does) only yields that one column name.
my_data = pd.read_csv('test.csv', delimiter='=', names=['Game_Name']) # Game_Name is given as column name
print(my_data)
Game_Name
0 Fatal Labyrinth™
1 Beat Hazard
2 Dino D-Day
3 Anomaly: Warzone Earth
4 Project Zomboid
5 Avernum: Escape From the Pit
Also, value_counts() can be used on the dataframe to find the frequency of each value.
(my_data.Game_Name.value_counts(ascending=False)).head(3)  # top three most frequent values
Project Zomboid 1
Anomaly: Warzone Earth 1
Beat Hazard 1
Name: Game_Name, dtype: int64
In case you need to get the top game name by its frequency:
(my_data.Game_Name.value_counts()).head(1).index[0]
'Project Zomboid'
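If you'd rather keep your original Counter approach, count the column instead of the whole dataframe - a sketch, assuming the same my_data:
from collections import Counter

counts = Counter(my_data['Game_Name'])
for game, n in counts.most_common(3):
    print(game, n)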
I have a question regarding text file handling. My text file prints as one column. The column has data scattered throughout the rows and visually looks great and somewhat uniform; however, it is still just one column. Ultimately, I'd like to append the row where a keyword is found to the end of the previous row, until the data is one long row. Then I'll use str.split() to cut sections up into columns as I need.
In Excel (code below, top) I took this same text file, removed headers, aligned left, and searched for keywords. When found, Excel has a nice feature called offset where you can place or append the cell value basically anywhere using offset(x, y).Value from the active-cell start position. Once done, I would delete the row. This allowed me to get the data into a tabular column format that I could work with.
What I Need:
The Python code below cycles down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next line should append that row to the end of the previous row. This is where my problem is: I cannot find a way to get the active row number into a variable so I can use it in place of the word [index] for the active row, or [index-1] for the previous row.
Excel code of the similar task:
Do
    Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
    If Not Rng Is Nothing Then
        Rng.Offset(-1, 2).Value = Rng.Value
        Rng.Value = ""
    End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd
from pandas import DataFrame, Series

file = {'Test': ['Last Name: Nobody', 'First Name: Tommy',
                 'Address: 1234 West Juniper St.',
                 'Fav Toy', 'Notes', 'Time Slot']}
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
    if line.startswith('Address:'):
        df.loc[[index-1], :].values = df.loc[index-1].values + ' ' + df.loc[index].values
        # the line above does not work: `index` is never set inside this loop
    else:
        pass

# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values  # copies row 2 onto the end of row 1,
#                                                            # but works with static row numbers only
# df.drop([2,0], inplace=True)  # deletes rows from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the whole series-vectorization approach, but I'm still stuck on the loops that I'm semi-familiar with. If there is a way to achieve this, please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,
Use Series.shift on Test, then use Series.str.startswith to create a boolean mask, then use boolean indexing with that mask to update the values in the Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
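If, as in your Excel macro, you then want to drop the now-redundant Address rows, a possible follow-up (my addition, assuming the same df):
df = df[~df['Test'].str.startswith('Address')].reset_index(drop=True)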
Here is my problem (I'm working in Python):
I have a Dataframe with columns: Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
And I have a list of words that contains 300 skills:
keywords = ["C++","Data Analytics","python","R", ............ "Django"]
I need to match those keywords against each of the job descriptions and obtain a new dataframe saying whether it is true or false that C++ is in job description[0], job description[1], job description[2], and so on.
My new dataframe will be:
columns: ['job_title', 'company', 'description', "C++", "Data Analytics", ....... "Django"]
where each keyword column says true or false depending on whether it matches (is found in) the job description.
There might be other ways to structure the dataframe (I'm open to suggestions).
Hope I'm clear with my question. I tried regex but I couldn't make it iterate through each row, and I tried a loop using the "fnmatch" library but couldn't make it work. The best approach so far was:
df["microservice"]= df.description.str.contains("microservice")
df["cloud-based architecture"] = df.description.str.contains("cloud-based architecture")
df["service oriented architecture"] = df.description.str.contains("service oriented architecture")
However, first I could not manage to make it loop through each row of the description column, so I would have to write that line 300 times, once per keyword (which doesn't make sense). Second, this way I have problems with a few words such as "R", because it finds the letter R inside every description, so it pulls True for each of them.
Iterate over the list of keywords and build each column from the description one:
for name in keywords:
    df[name] = df['description'].apply(lambda x: True if name in x else False)
EDIT:
That doesn't solve the problem with R. To work around it you could pad the keyword with spaces to make sure it's isolated, so the code would be:
for name in keywords:
    df[name] = df['description'].apply(lambda x: True if ' ' + str(name) + ' ' in x else False)
But that's really ugly and not optimised, and it also misses a keyword at the very start or end of a description. A regular expression with word boundaries is more appropriate; see the sketch below.
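A sketch of that regex idea, using lookarounds rather than \b because keywords like "C++" end in non-word characters (this assumes the same df and keywords as above):
import re

for name in keywords:
    # match the keyword only when it is not embedded in a longer word
    pattern = r'(?<!\w)' + re.escape(name) + r'(?!\w)'
    df[name] = df['description'].str.contains(pattern)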
One way is to build a regex string to identify any keyword in your string... this example is case insensitive and will find any substring matches - not just whole words...
import pandas as pd
import re
keywords = ['python', 'C++', 'admin', 'Developer']
rx = '(?i)(?P<keyword>{})'.format('|'.join(re.escape(kw) for kw in keywords))
Then with a sample DF of:
df = pd.DataFrame({
    'job_description': ['C++ developer', 'traffic warden', 'Python developer', 'linux admin', 'cat herder']
})
You can find all keywords for the relevant column...
matches = df['job_description'].str.extractall(rx)
Which gives:
        keyword
  match
0 0         C++
  1   developer
2 0      Python
  1   developer
3 0       admin
Then you want to get a list of "dummies" and take the max (so you always get a 1 where a word was found) using:
dummies = pd.get_dummies(matches).max(level=0)
Which gives:
keyword_C++ keyword_Python keyword_admin keyword_developer
0 1 0 0 1
2 0 1 0 1
3 0 0 1 0
You then left join that back to your original DF:
result = df.join(dummies, how='left')
And the result is:
job_description keyword_C++ keyword_Python keyword_admin keyword_developer
0 C++ developer 1.0 0.0 0.0 1.0
1 traffic warden NaN NaN NaN NaN
2 Python developer 0.0 1.0 0.0 1.0
3 linux admin 0.0 0.0 1.0 0.0
4 cat herder NaN NaN NaN NaN
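If you prefer 0s to NaN for the rows with no matches, a possible tweak (my own suggestion, not part of the original answer):
result = df.join(dummies, how='left').fillna(0).astype({col: int for col in dummies.columns})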
skill = "C++", or any of the others
frame = an instance of
Index(['job_title', 'company', 'job_label', 'description'],
dtype='object')
jobs = a list/np.array of frames, which is probably your input
A naive implementation could look a bit like this:
for skill in keywords:
    for frame in jobs:
        if skill in frame["description"]:  # or more exact matching, but this is what's in the question
            pass  # the skill exists in this job's description
But you need to put more work into what output structure you are going to use. Just having an output array of 300 columns, most of which contain a False, isn't going to be a good plan. I've never worked with pandas myself, but if it were normal numpy arrays (which pandas DataFrames are under the hood), I would add a column "skills" that can enumerate them.
You can leverage .apply() like so (@Jacco van Dorp made a solid suggestion of storing all of the found skills inside a single column, which I agree is likely the best approach to your problem):
df = pd.DataFrame([['Engineer', 'Firm', 'AERO1', 'Work with python and Django'],
                   ['IT', 'Dell', 'ITD4', 'Work with Django and R'],
                   ['Office Assistant', 'Dental', 'OAD3', 'Coordinate schedules'],
                   ['QA Engineer', 'Factory', 'QA2', 'Work with R and python'],
                   ['Mechanic', 'Autobody', 'AERO1', 'Love the movie Django']],
                  columns=['job_title', 'company', 'job_label', 'description'])
Which yields:
job_title company job_label description
0 Engineer Firm AERO1 Work with python and Django
1 IT Dell ITD4 Work with Django and R
2 Office Assistant Dental OAD3 Coordinate schedules
3 QA Engineer Factory QA2 Work with R and python
4 Mechanic Autobody AERO1 Love the movie Django
Then define your skill set and your list comprehension to pass to .apply():
skills = ['python','R','Django']
df['skills'] = df.apply(lambda x: [i for i in skills if i in x['description'].split()], axis=1)
Which yields this column:
skills
0 [python, Django]
1 [R, Django]
2 []
3 [python, R]
4 [Django]
If you are still interested in having individual columns for each skill, I can edit my answer to provide that as well.
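For reference, a hedged sketch of that expansion, deriving one boolean column per skill from the skills list column (assuming the df and skills defined above):
skill_flags = pd.DataFrame({s: df['skills'].apply(lambda lst: s in lst) for s in skills})
df = df.join(skill_flags)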
Hi, I'm learning data science and am trying to make a big data company list from a list of companies in various industries.
I have a list of row numbers for the big data companies, named comp_rows.
Now I'm trying to make a new dataframe with the filtered companies based on those row numbers. Here I need to add rows to an existing dataframe, but I got an error. Could someone help?
My dataframe looks like this:
company_url company tag_line product data
0 https://angel.co/billguard BillGuard The fastest smartest way to track your spendin... BillGuard is a personal finance security app t... New York City · Financial Services · Security ...
1 https://angel.co/tradesparq Tradesparq The world's largest social network for global ... Tradesparq is Alibaba.com meets LinkedIn. Trad... Shanghai · B2B · Marketplaces · Big Data · Soc...
2 https://angel.co/sidewalk Sidewalk Hoovers (D&B) for the social era Sidewalk helps companies close more sales to s... New York City · Lead Generation · Big Data · S...
3 https://angel.co/pangia Pangia The Internet of Things Platform: Big data mana... We collect and manage data from sensors embedd... San Francisco · SaaS · Clean Technology · Big ...
4 https://angel.co/thinknum Thinknum Financial Data Analysis Thinknum is a powerful web platform to value c... New York City · Enterprise Software · Financia...
My code is below:
bigdata_comp = DataFrame(data=None, columns=['company_url', 'company', 'tag_line', 'product', 'data'])

for count, item in enumerate(data.iterrows()):
    for number in comp_rows:
        if int(count) == int(number):
            bigdata_comp.append(item)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-234-1e4ea9bd9faa> in <module>()
4 for number in comp_rows:
5 if int(count) == int(number):
----> 6 bigdata_comp.append(item)
7
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.pyc in append(self, other, ignore_index, verify_integrity)
3814 from pandas.tools.merge import concat
3815 if isinstance(other, (list, tuple)):
-> 3816 to_concat = [self] + other
3817 else:
3818 to_concat = [self, other]
TypeError: can only concatenate list (not "tuple") to list
It seems you are trying to filter an existing dataframe based on indices (which are stored in your variable called comp_rows). You can do this without using loops by using loc, as shown below:
In [1161]: df1.head()
Out[1161]:
A B C D
a 1.935094 -0.160579 -0.173458 0.433267
b 1.669632 -1.130893 -1.210353 0.822138
c 0.494622 1.014013 0.215655 1.045139
d -0.628889 0.223170 -0.616019 -0.264982
e -0.823133 0.385790 -0.654533 0.582255
We will get the rows with indices 'a','b' and 'c', for all columns:
In [1162]: df1.loc[['a','b','c'],:]
Out[1162]:
A B C D
a 1.935094 -0.160579 -0.173458 0.433267
b 1.669632 -1.130893 -1.210353 0.822138
c 0.494622 1.014013 0.215655 1.045139
You can read more about it in the pandas indexing documentation.
About your code:
1.
You do not need to iterate through a list to see if an item is present in it:
Use the in operator. For example -
In [1199]: 1 in [1,2,3,4,5]
Out[1199]: True
so, instead of
for number in comp_rows:
    if int(count) == int(number):
do this:
if count in comp_rows:
2.
pandas append does not happen in place. You have to store the result in another variable; see the DataFrame.append documentation.
3.
Appending one row at a time is a slow way to do what you want.
Instead, save each row that you want to add into a list of lists, make a dataframe of it, and append it to the target dataframe in one go. Something like this:
temp = []
for count, item in enumerate(df1.loc[['a','b','c'],:].iterrows()):
    # if count in comp_rows:
    temp.append(list(item[1]))
In [1233]: temp
Out[1233]:
[[1.9350940285526077,
-0.16057932637141861,
-0.17345827000000605,
0.43326722021644282],
[1.66963201034217,
-1.1308932586268696,
-1.2103527446031515,
0.82213753819050794],
[0.49462218161377397,
1.0140133740187862,
0.2156547595968879,
1.0451391564351897]]
In [1236]: df2 = df1.append(pd.DataFrame(temp, columns=['A','B','C','D']))
In [1237]: df2
Out[1237]:
A B C D
a 1.935094 -0.160579 -0.173458 0.433267
b 1.669632 -1.130893 -1.210353 0.822138
c 0.494622 1.014013 0.215655 1.045139
d -0.628889 0.223170 -0.616019 -0.264982
e -0.823133 0.385790 -0.654533 0.582255
f -0.872135 2.938475 -0.099367 -1.472519
0 1.935094 -0.160579 -0.173458 0.433267
1 1.669632 -1.130893 -1.210353 0.822138
2 0.494622 1.014013 0.215655 1.045139
Replace the following line:
for count, item in enumerate(data.iterrows()):
with
for count, (index, item) in enumerate(data.iterrows()):
or even simply
for count, item in data.iterrows():
since iterrows already yields (index, row) pairs, the last form makes count the row's index and item the row itself, which is equivalent when the dataframe has a default integer index.
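That said, the whole filtering loop can be replaced with positional indexing in one line - a sketch, assuming comp_rows holds integer row positions of data:
bigdata_comp = data.iloc[comp_rows].reset_index(drop=True)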