Python - First line reading CSV not in order

Python - First line reading CSV not in order - python

When reading a CSV with a list games in a single column, the game in the first/top row is displayed out of order, like so:
Fatal Labyrinth™
0 Beat Hazard
1 Dino D-Day
2 Anomaly: Warzone Earth
3 Project Zomboid
4 Avernum: Escape From the Pit
..with the code being:
my_data = pd.read_csv(r'C:\Users\kosta\list.csv', encoding='utf-16', delimiter='=')
print(my_data)
Fatal Labyrinth is, I suppose, not indexed. Adding 'index_col=0' lists each game, like so:
Empty DataFrame
Columns: []
Index: [Beat Hazard, Dino D-Day, more games etc...]
But this does not help, as the endgame here is to count each game and determine the most common, but when doing:
counts = Counter(my_data)
dictTime = dict(counts.most_common(3))
for key in dictTime:
print(key)
..all I'm getting back is:
Fatal Labyrinth™
Thank you :)

Need to add "names=" parameter when you read the CSV file.
my_data = pd.read_csv('test.csv', delimiter='=', names=['Game_Name']) # Game_Name is given as column name
print(my_data)
Game_Name
0 Fatal Labyrinth™
1 Beat Hazard
2 Dino D-Day
3 Anomaly: Warzone Earth
4 Project Zomboid
5 Avernum: Escape From the Pit
Also value_counts() can be used on the dataframe to find the frequency of the value.
(my_data.Game_Name.value_counts(ascending=False)).head(3) # Top three most frequent value
Project Zomboid 1
Anomaly: Warzone Earth 1
Beat Hazard 1
Name: Game_Name, dtype: int64
In case, you need to get the top game name by its frequency,
(my_data.Game_Name.value_counts()).head(1).index[0]
'Project Zomboid'

Related

How do I print out the Phone number from a csv with padded 0 using pandas?

So I have a CSV file with the following content:
Person,Phone
One,08001111111
Two,08002222222
Three,08003333333
When I used the following code:
import pandas as pd
df = pd.read_csv('test_stuff.csv')
print(df)
It prints out:
Person Phone
0 One 8001111111
1 Two 8002222222
2 Three 8003333333
It removed the starting 0 from the Phone column. I then tried to add the phone as string in the csv file, like so:
Person,Phone
One,'08001111111'
Two,'08002222222'
Three,'08003333333'
However, the result is now this:
Person Phone
0 One '08001111111'
1 Two '08002222222'
2 Three '08003333333'
What can I do to resolve this? I am hoping for a result like this:
Person Phone
0 One 08001111111
1 Two 08002222222
2 Three 08003333333
Thanks in advance.

Don't try to add the zeros back, just don't delete them in the first place by telling pandas.read_csv that you column is a string:
pd.read_csv('test_stuff.csv', dtype={'Phone': 'str'})
output:
Person Phone
0 One 08001111111
1 Two 08002222222
2 Three 08003333333

pandas row manipulation - If startwith keyword found - append row to end of previous row

I have a question regarding text file handling. My text file prints as one column. The column has data scattered throughout the rows and visually looks great & somewhat uniform however, still just one column. Ultimately, I'd like to append the row where the keyword is found to the end of the top previous row until data is one long row. Then I'll use str.split() to cut up sections into columns as I need.
In Excel (code below-Top) I took this same text file and removed headers, aligned left, and performed searches for keywords. When found, Excel has a nice feature called offset where you can place or append the cell value basically anywhere using this offset(x,y).value from the active-cell start position. Once done, I would delete the row. This allowed my to get the data into a tabular column format that I could work with.
What I Need:
The below Python code will cycle down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next line should append the row to the end of the previous row. This is where my problem is. I can not find a way to get the active row number into a variable so I can use in place of the word [index] for the active row. Or [index-1] for the previous row.
Excel Code of similar task
Do
Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
If Not Rng Is Nothing Then
Rng.Offset(-1, 2).Value = Rng.Value
Rng.Value = ""
End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd
from pandas import DataFrame, Series
file = {'Test': ['Last Name: Nobody','First Name: Tommy','Address: 1234 West Juniper St.','Fav
Toy', 'Notes','Time Slot' ] }
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
if line.startswith('Address:'):
df.loc[[index-1],:].values = df.loc[index-1].values + ' ' + df.loc[index].values
Line above does not work with index statement
else:
pass
# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values # copies row 2 at the end of row 1,
# works with static row numbers only
# df.drop([2,0], inplace=True) # Deletes row from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the entire series vectorization approach but still stuck trying loops that I'm semi familiar with. If there is a way to achieve this please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,

Use Series.shift on Test then use Series.str.startswith to create a boolean mask, then use boolean indexing with this mask to update the values in Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot

Pandas.read_csv issue

I trying to read the message from database, but under the class label can't really read same as CSV dataset.
messages = pandas.read_csv('bitcoin_reddit.csv', delimiter='\t',
names=["title","class"])
print (messages)
Under the class label the pandas only can read as NaN
The version of my CSV file
title,url,timestamp,class
"It's official! 1 Bitcoin = $10,000 USD",https://v.redd.it/e7io27rdgt001,29/11/2017 17:25,0
The last 3 months in 47 seconds.,https://v.redd.it/typ8fdslz3e01,4/2/2018 18:42,0
It's over 9000!!!,https://i.imgur.com/jyoZGyW.gifv,26/11/2017 20:55,1
Everyone who's trading BTC right now,http://cdn.mutually.com/wp-content/uploads/2017/06/08-19.jpg,7/1/2018 12:38,1
I hope James is doing well,https://i.redd.it/h4ngqma643101.jpg,1/12/2017 1:50,1
Weeeeeeee!,https://i.redd.it/iwl7vz69cea01.gif,17/1/2018 1:13,0
Bitcoin.. The King,https://i.redd.it/4tl0oustqed01.jpg,1/2/2018 5:46,1
Nothing can increase by that much and still be a good investment.,https://i.imgur.com/oWePY7q.jpg,14/12/2017 0:02,1
"This is why I want bitcoin to hit $10,000",https://i.redd.it/fhzsxgcv9nyz.jpg,18/11/2017 18:25,1
Bitcoin Doesn't Give a Fuck.,https://v.redd.it/ty2y74gawug01,18/2/2018 15:19,-1
Working Hard or Hardly Working?,https://i.redd.it/c2o6204tvc301.jpg,12/12/2017 12:49,1

The separator in your csv file is a comma, not a tab. And since , is the default, there is no need to define it.
However, names= defines custom names for the columns. Your header already provides these names, so passing the column names you are interested in to usecols is all you need then:
>>> pd.read_csv(file, usecols=['title', 'class'])
title class
0 It's official! 1 Bitcoin = $10,000 USD 0
1 The last 3 months in 47 seconds. 0
2 It's over 9000!!! 1
3 Everyone who's trading BTC right now 1
4 I hope James is doing well 1
5 Weeeeeeee! 0

Best way to match list of words with a list of job descriptions python

Here is my problem (I'm working on python) :
I have a Dataframe with columns: Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
And I have a list of words that contains 300 skills:
keywords = ["C++","Data Analytics","python","R", ............ "Django"]
I need to match those keywords with each of the jobs descriptions and obtain a new dataframe saying if is true or false that C++ is in job description[0]...job description[1], job description[2] and so on.
My new dataframe will be:
columns : ['job_title', 'company', 'description', "C++", "Data Analytics",
....... "Django"]
Where each column of keywords said true or false if it match(is found) or not on the job description.
There might be another ways to structure the dataframe (I'm listening suggestions).
Hope I'm clear with my question. I try regex but I can't make it iterate trough each row, I try with a loop using "fnmatch" library and I can't make it work. The best approach so far was:
df["microservice"]= df.description.str.contains("microservice")
df["cloud-based architecture"] = df.description.str.contains("cloud-based architecture")
df["service oriented architecture"] = df.description.str.contains("service oriented architecture")
However, First I could not manage to make it loop trough each rows of description column, so i have to input 300 times the code with each word (it doesn't make sense). Second, trough this way, I have problems with few words such as "R" because it find the letter R in each description, so it will pull true in each of them.

Iterate over list of keywords and extract each column from the description one:
for name in keywords:
df[name] = df['description'].apply(lambda x: True if name in x else False)
EDIT:
That doesn't solve the problem with R. To do so you could add a space to make sure it's isolated so the code would be:
for name in keywords:
df[name] = df['description'].apply(lambda x: True if ' '+str(name)+' ' in x else False)
But that's really ugly and not optimised. Regular expression should do the trick but I have to look back into it: found it! [ ]*+[str(name)]+[.?!] is better! (and more appropriate)

One way is to build a regex string to identify any keyword in your string... this example is case insensitive and will find any substring matches - not just whole words...
import pandas as pd
import re
keywords = ['python', 'C++', 'admin', 'Developer']
rx = '(?i)(?P<keywords>{})'.format('|'.join(re.escape(kw) for kw in keywords))
Then with a sample DF of:
df = pd.DataFrame({
'job_description': ['C++ developer', 'traffic warden', 'Python developer', 'linux admin', 'cat herder']
})
You can find all keywords for the relevant column...
matches = df['job_description'].str.extractall(rx)
Which gives:
keyword
match
0 0 C++
1 developer
2 0 Python
1 developer
3 0 admin
Then you want to get a list of "dummies" and take the max (so you always get a 1 where a word was found) using:
dummies = pd.get_dummies(matches).max(level=0)
Which gives:
keyword_C++ keyword_Python keyword_admin keyword_developer
0 1 0 0 1
2 0 1 0 1
3 0 0 1 0
You then left join that back to your original DF:
result = df.join(dummies, how='left')
And the result is:
job_description keyword_C++ keyword_Python keyword_admin keyword_developer
0 C++ developer 1.0 0.0 0.0 1.0
1 traffic warden NaN NaN NaN NaN
2 Python developer 0.0 1.0 0.0 1.0
3 linux admin 0.0 0.0 1.0 0.0
4 cat herder NaN NaN NaN NaN

skill = "C++", or any of the others
frame = an instance of
Index(['job_title', 'company', 'job_label', 'description'],
dtype='object')
jobs = a list/np.array of frames, which is probably your input
A naive implementation could look a bit like this:
for skill in keywords:
for frame in jobs:
if skill in frame["description"]: # or more exact matching, but this is what's in the question
# exists
But you need to put more work into what output structure you are going to use. Just having an output array of 300 columns most of which just contain a False isn't going to be a good plan. I've never worked with Panda's myself, but if it were normal numpy arrays (which panda's DataFrames are under the hood), I would add a column "skills" that can enumerate them.

You can leverage .apply() like so (#Jacco van Dorp made a solid suggestion of storing all of the found skills inside a single column, which I agree is likely the best approach to your problem):
df = pd.DataFrame([['Engineer','Firm','AERO1','Work with python and Django'],
['IT','Dell','ITD4','Work with Django and R'],
['Office Assistant','Dental','OAD3','Coordinate schedules'],
['QA Engineer','Factory','QA2','Work with R and python'],
['Mechanic','Autobody','AERO1','Love the movie Django']],
columns=['job_title','company','job_label','description'])
Which yields:
job_title company job_label description
0 Engineer Firm AERO1 Work with python and Django
1 IT Dell ITD4 Work with Django and R
2 Office Assistant Dental OAD3 Coordinate schedules
3 QA Engineer Factory QA2 Work with R and python
4 Mechanic Autobody AERO1 Love the movie Django
Then define your skill set and your list comprehension to pass to .apply():
skills = ['python','R','Django']
df['skills'] = df.apply(lambda x: [i for i in skills if i in x['description'].split()], axis=1)
Which yields this column:
skills
0 [python, Django]
1 [R, Django]
2 []
3 [python, R]
4 [Django]
If you are still interested in having individual columns for each skill, I can edit my answer to provide that as well.

Python - average of unique values

I have a CSV file that looks like this:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
...
0101,41.0
0102,39.9
0103,44.6
0104,42.0
0105,43.0
0106,42.4
It's a list of temperatures for specific dates. It contains data for several years so the same dates occur multiple times. I would like to average the temperature so that I get a new table where each date is only occurring once and has the average temperature for that date in the second column.
I know that Stack Overflow requires you to include what you've attempted, but I really don't know how to do this and couldn't find any other answers on this.
I hope someone can help. Any help is much appreciated.

You can use pandas, and run the groupby command, when df is your data frame:
df.groupby('DATE').mean()
Here is some toy example to depict the behaviour
import pandas as pd
df=pd.DataFrame({"a":[1,2,3,1,2,3],"b":[1,2,3,4,5,6]})
df.groupby('a').mean()
Will result in
a b
1 2.5
2 3.5
3 4.5
When the original dataframe was
a b
0 1 1
1 2 2
2 3 3
3 1 4
4 2 5
5 3 6

If you can use the defaultdict pacakge from collections, makes this type of thing pretty easy.
Assuming your list is in the same directory as the python script and it looks like this:
list.csv:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
Here is the code I used to print out the averages.
#test.py
#usage: python test.py list.csv
import sys
from collections import defaultdict
#Open a file who is listed in the command line in the second position
with open(sys.argv[1]) as File:
#Skip the first line of the file, if its just "data,value"
File.next()
#Create a dictionary of lists
ourDict = defaultdict(list)
#parse the file, line by line
for each in File:
# Split the file, by a comma,
#or whatever separates them (Comma Seperated Values = CSV)
each = each.split(',')
# now each[0] is a year, and each[1] is a value.
# We use each[0] as the key, and append vallues to the list
ourDict[each[0]].append(float(each[1]))
print "Date\tValue"
for key,value in ourDict.items():
# Average is the sum of the value of all members of the list
# divided by the list's length
print key,'\t',sum(value)/len(value)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - First line reading CSV not in order - python

Related

How do I print out the Phone number from a csv with padded 0 using pandas?

pandas row manipulation - If startwith keyword found - append row to end of previous row

Pandas.read_csv issue

Best way to match list of words with a list of job descriptions python

Python - average of unique values

Categories

Resources