Return five most frequent words - Python Forum (https://python-forum.io), Forum: Python Coding (https://python-forum.io/forum-7.html), Forum: Homework (https://python-forum.io/forum-9.html), Thread: /thread-41249.html
Return five most frequent words - Elisabet - Dec-06-2023

Hello, I'm pretty new to coding and have started attending an online course on programming techniques. I'm stuck on one part of an assignment where I have to return the 5 most frequent words in a text file, disregarding any stop words. I'm not allowed to use any modules. My function only returns the index numbers and not the actual words. Does anyone have any tips on what I can do?

    # Returns 5 most frequent words in a text
    def important_words(an_index, stop_words):
        mydict = index_text(an_index)  # takes the result from index_text function
        all_words = []  # Create an empty list to put all words from an_index in

        # Remove stop_words from text
        for item in stop_words:
            if item in an_index:
                del an_index[item]

        # Combine all the words into a single list
        for key in mydict:
            all_words.extend(mydict[key])

        # Count occurrences of each word
        word_counter = {}  # This dictionary stores words as keys and their respective numbers as values
        for word in all_words:
            if word not in stop_words:  # If the word is a stop-word, it's ignored and not included in word_counter
                if word in word_counter:
                    word_counter[word] += 1  # If the word has already been inserted into the dictionary, it adds to the count by 1
                else:
                    word_counter[word] = 1  # If the word isn't already in word_counter, it will be added

        sorted_words = sorted(word_counter.items(), key=lambda x: x[1], reverse=True)
        # Sorts the tuples; x[1] takes the second element of each tuple (the values).
        # The order is reversed, starting with the largest values and ending with the smallest.

        top_words = [word[0] for word in sorted_words[:5]]

        return top_words  # Returns the five most frequent words in the text

RE: Return five most frequent words - rob101 - Dec-07-2023

Possibly, you're over-engineering this. I would start with something basic, like this:

    text = """This is a text object. This object has many words.
    I will now count the words and return most frequent ones, in ascending order."""

    word_list = [word for word in text.split()]
    word_list.sort()

    for word in word_list:
        print(word)

... and then figure out how to count the occurrences of the words in that list.

Note that you'll need to figure out how to distinguish between "object" and "object." in the above example, because the period means that the two are not the same. There may be a flaw in my crude example which makes it unworkable, but it's a starting point and (like I say) where I would start from.

For the so-called "stop words", simply iterate over the word list and remove them. That way, they'll not form part of the counting process.

RE: Return five most frequent words - DPaul - Dec-07-2023

Hi,
As Rob stated, you have to get rid of the punctuation. And take upper and lower case into account. Also assuming numbers count as "words". Max 4 lines of code to do this.
Paul

RE: Return five most frequent words - deanhystad - Dec-07-2023

The question has to do with why the function returns a list of numbers instead of a list of words. We can assume that the original word list has already been processed, so there is no punctuation and capitalization issues are resolved before calling the function.

You did not include the code for the index_text() function, but I think the mistake may be there. Looking at your code, logic dictates index_text returns an index: word dictionary. When I write the function so it returns a word: index dictionary, your code returns a list of index numbers. So I wrote the function like this:

    text = "i will now count the words and return the most frequent words in ascending order".split()
    an_index = list(range(len(text)))

    def index_text(index):
        return {text[i]: i for i in index}

This uncovers other errors. For instance, this code does nothing:

    for item in stop_words:
        if item in an_index:
            del an_index[item]

By the time this code runs, your function is done using an_index.

This code is a problem.
Read about list.extend() and list.append(). You are using the wrong one.

    # Combine all the words into a single list
    for key in mydict:
        all_words.extend(mydict[key])

extend(mydict[key]) treats mydict[key] as a sequence of letters that are appended to all_words one by one. You want to append(mydict[key]). If you read about dictionaries you would see there is a method that returns all the dictionary values, so you don't need this loop at all.

Reading about dictionaries would also uncover a cleaner way to do this:

    if word in word_counter:
        word_counter[word] += 1  # If the word has already been inserted into the dictionary, it adds to the count by 1
    else:
        word_counter[word] = 1  # If the word isn't already in word_counter, it will be added

if statements in Python often indicate you are doing something wrong. There is no error in this code, it is just longer than it needs to be.

RE: Return five most frequent words - Elisabet - Dec-11-2023

Thank you for the help. I realized that the course I'm taking is not very beginner-friendly, and that this task was too much for me at the moment. I'm grateful that you guys tried to explain this for me anyway!

RE: Return five most frequent words - karimali - Jan-18-2025

I can help you with this. The good news is that it looks like you're almost there, but there are a couple of issues in your code that need fixing.

First, the issue you're encountering is likely related to how you are manipulating the an_index variable and its structure. In the loop where you're removing stop words from an_index, you're deleting items from an_index directly, but an_index is passed as an argument, and it's not clear how it's structured. It might not be a dictionary, or it could be that the keys you're attempting to delete don't match the structure you expect.
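Because the thread never shows what index_text() returns, it may help to see how list.extend() and list.append() behave on differently shaped values, and how a dictionary's values can be collected without a loop. This is only a sketch: the sample values and variable names below are made up for illustration and are not taken from the original assignment.

```python
# Hypothetical values; the real index_text() is not shown in the thread.
as_string = "cat"
as_list = ["cat", "dog"]

extended_string = []
extended_string.extend(as_string)   # iterates the string: one item per letter

appended_string = []
appended_string.append(as_string)   # adds the string as a single item

extended_list = []
extended_list.extend(as_list)       # adds each word in the list

print(extended_string)  # ['c', 'a', 't']
print(appended_string)  # ['cat']
print(extended_list)    # ['cat', 'dog']

# Assumed index: word dictionary; values() returns all the words directly.
mydict = {0: "cat", 1: "dog", 2: "cat"}
all_words = list(mydict.values())
print(all_words)        # ['cat', 'dog', 'cat']
```

So extend() is the right call when each value is a list of words, and append() (or values()) when each value is a single word.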
Here's a revised version of your function:

    def important_words(an_index, stop_words):
        mydict = index_text(an_index)  # assumes this returns a dictionary {index: [words]}
        all_words = []  # To collect all words from the index

        # Remove stop words from the text
        for key in mydict:
            filtered_words = [word for word in mydict[key] if word not in stop_words]
            all_words.extend(filtered_words)  # Add the filtered words to the list

        # Count occurrences of each word
        word_counter = {}
        for word in all_words:
            word_counter[word] = word_counter.get(word, 0) + 1  # Count the word

        # Sort words by frequency in descending order
        sorted_words = sorted(word_counter.items(), key=lambda x: x[1], reverse=True)

        # Get the top 5 frequent words
        top_words = [word[0] for word in sorted_words[:5]]

        return top_words

RE: Return five most frequent words - Pedroski55 - Jan-23-2025

Funny, in school I never liked homework! Just out of interest, I tried it like this.

    path2text = '/home/pedro/temp/Frankenstein_Letter_1.txt'
    with open(path2text) as f:
        words = f.read()

    # make everything capital letters or The and the will be 2 different words
    words = words.upper()

    # find punctuation and numbers
    unwanted = []
    for w in words:
        if not w.isalpha() and w not in unwanted:
            unwanted.append(w)

    # unwanted looks like: ['_', ' ', '.', ',', '\n', '1', '7', '—', '?', ';', '’', '-', '!', ':', "'"]
    len(unwanted)  # 15
    words_list = words.split()
    len(words_list)  # 1221

    # strip every unwanted character from each word
    for i in range(len(words_list)):
        for u in unwanted:
            if u in words_list[i]:
                words_list[i] = words_list[i].replace(u, '')

    words_set = set(words_list)
    len(words_set)  # 586

    # a dictionary to hold the count of each word
    words_dict = {w: 0 for w in words_set}

    # loop through words_list and increase the count for each dictionary key
    for word in words_list:
        words_dict[word] += 1

    # make a list of (word, count) tuples
    tups = [(key, words_dict[key]) for key in words_dict.keys()]

    # sort tups by tup[1], the count, reversed so the highest count comes first
    tups.sort(key=lambda tup: tup[1], reverse=True)

    # show the results
    for i in range(5):
        print(tups[i])

If you were allowed to use a regex, you could get words_list more easily. Words like The and the will still count as different words, unless you change everything to uppercase or lowercase first.

I put some other words in my English text like: 'Ödipus', 'Müttern', 'über', 'Vätern', 'dächten', 'wäre', 'naïve' just to see how the regex coped with them. No problems!

    import re

    # allow for other characters than a-zA-Z
    e = re.compile(r'\b[A-Za-züäÜÄÖöï]+\b')
    words_list = e.findall(words)
    # carry on from here as above, but no need to strip numbers and punctuation
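
Pulling the thread's suggestions together (lowercasing, getting rid of punctuation, skipping stop words, counting with a dictionary), a module-free version might look like the sketch below. The function name top_words, its parameters and the sample sentence are made up for illustration; it works on raw text rather than on the an_index structure from the original question, and it is not the course's expected solution.

```python
def top_words(text, stop_words, n=5):
    """Return up to n of the most frequent words in text, ignoring stop words."""
    # Lowercase so "The" and "the" count as the same word, and turn every
    # non-letter character into a space so punctuation and digits separate words.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())

    counts = {}
    for word in cleaned.split():
        if word not in stop_words:
            counts[word] = counts.get(word, 0) + 1

    # Highest count first; sorted() is stable, so ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


sample = "The cat saw the dog. The dog saw the cat, and the cat ran!"
print(top_words(sample, {"the", "and"}))  # ['cat', 'saw', 'dog', 'ran']
```

One consequence of replacing non-letters with spaces is that a word such as "don't" is split into "don" and "t", and numbers are dropped entirely. Whether that is acceptable depends on how the assignment defines a word.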